Document-specific gazetteers for named entity recognition
First Claim
1. An entity recognition method comprising:
providing a named entity recognition model which has been trained on features extracted from training samples tagged with document-level entity tags, each training sample comprising at least one text sequence;
receiving a text document to be labeled, the text document being tagged with at least one document-level entity tag;
generating a document-specific gazetteer based on the at least one document-level entity tag, the document-specific gazetteer including a set of entries, one entry for each of a set of entity names;
for a text sequence of the text document, extracting features for tokens of the text sequence, the features including document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries, the document-specific features comprising at least 12 document-specific features;
predicting entity labels for tokens in the document text sequence with the named entity recognition model, based on the extracted features, and
wherein at least one of the generating, extracting, and predicting is performed with a processor.
Abstract
A method for entity recognition employs document-level entity tags which correspond to mentions appearing in the document, without specifying their locations. A named entity recognition model is trained on features extracted from text samples tagged with document-level entity tags. A text document to be labeled is received, the text document being tagged with at least one document-level entity tag. A document-specific gazetteer is generated, based on the at least one document-level entity tag. The gazetteer includes a set of entries, one entry for each of a set of entity names. For a text sequence of the document, features for tokens of the text sequence are extracted. The features include document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries. Entity labels are predicted for the tokens in the text sequence with the named entity recognition model, based on the extracted features.
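As an informal illustration of the gazetteer step described in the abstract, the sketch below builds a per-document gazetteer from document-level tags and finds the tokens that match part of an entry. The `EntityTag` class, the function names, the lower-casing, and the whitespace tokenization are all assumptions made for this example; none of them is taken from the patent text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EntityTag:
    """Hypothetical document-level tag: an entity name with no token position."""
    name: str
    entity_type: Optional[str] = None


def build_gazetteer(tags):
    """Return one entry per distinct entity name, keyed by its lower-cased tokens."""
    gazetteer = {}
    for tag in tags:
        tokens = tuple(tag.name.lower().split())
        if tokens and tokens not in gazetteer:
            gazetteer[tokens] = tag.entity_type
    return gazetteer


def tokens_matching_gazetteer(sequence, gazetteer):
    """Indices of tokens that match at least part of some gazetteer entity name."""
    vocabulary = {tok for entry in gazetteer for tok in entry}
    return [i for i, tok in enumerate(sequence) if tok.lower() in vocabulary]


# The tags say which entities occur in the document, not where they occur.
tags = [EntityTag("Barack Obama", "PERSON"), EntityTag("Hawaii", "LOCATION")]
gazetteer = build_gazetteer(tags)
sentence = "Obama was born in Hawaii".split()
matched = tokens_matching_gazetteer(sentence, gazetteer)
```

Here `matched` holds the positions of "Obama" and "Hawaii". A partial mention such as "Obama" still matches because the lookup works on individual tokens of each entity name.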
20 Claims
1. An entity recognition method comprising:
providing a named entity recognition model which has been trained on features extracted from training samples tagged with document-level entity tags, each training sample comprising at least one text sequence;
receiving a text document to be labeled, the text document being tagged with at least one document-level entity tag;
generating a document-specific gazetteer based on the at least one document-level entity tag, the document-specific gazetteer including a set of entries, one entry for each of a set of entity names;
for a text sequence of the text document, extracting features for tokens of the text sequence, the features including document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries, the document-specific features comprising at least 12 document-specific features;
predicting entity labels for tokens in the document text sequence with the named entity recognition model, based on the extracted features, and
wherein at least one of the generating, extracting, and predicting is performed with a processor.
Dependent claims: 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18
3. An entity recognition method comprising:
training a named entity recognition model on features extracted from training samples tagged with document-level entity tags, each training sample comprising at least one text sequence, wherein the training comprises:
receiving annotated training samples, each training sample being tagged with at least one document-level entity tag having a mention in at least one of the text sequences of the training sample, each text sequence of the training sample being annotated with token-level entity labels;
for each training sample, generating a document-specific gazetteer based on the at least one document-level entity tag of the annotated training sample, the document-specific gazetteer including a set of entity names;
using the document-specific gazetteer, extracting features for tokens of each text sequence in the training sample, the features including document-specific features; and
training the named entity recognition model with the extracted features and the token-level entity labels for each training sequence;
receiving a text document to be labeled, the text document being tagged with at least one document-level entity tag;
generating a document-specific gazetteer based on the at least one document-level entity tag, the document-specific gazetteer including a set of entries, one entry for each of a set of entity names;
for a text sequence of the text document, extracting features for tokens of the text sequence, the features including document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries;
predicting entity labels for tokens in the document text sequence with the named entity recognition model, based on the extracted features, and
wherein at least one of the generating, extracting, and predicting is performed with a processor.
15. An entity recognition method, comprising:
providing a named entity recognition model which has been trained on features extracted from training samples tagged with document-level entity tags, each training sample comprising at least one text sequence;
receiving a text document to be labeled, the text document being tagged with at least one document-level entity tag, the at least one document-level entity tag for the text document to be labeled having at least one mention in the text document which refers to that entity, and the at least one document-level entity tag not being aligned to a specific token or specific sequence of tokens in the text document;
providing a named entity recognition model which has been trained on features extracted from training samples tagged with document-level entity tags, each training sample comprising at least one text sequence;
receiving a text document to be labeled, the text document being tagged with at least one document-level entity tag;
generating a document-specific gazetteer based on the at least one document-level entity tag, the document-specific gazetteer including a set of entries, one entry for each of a set of entity names;
for a text sequence of the text document, extracting features for tokens of the text sequence, the features including document-specific features for tokens matching at least a part of the entity name of one of the gazetteer entries;
predicting entity labels for tokens in the document text sequence with the named entity recognition model, based on the extracted features, and
wherein at least one of the generating, extracting, and predicting is performed with a processor.
19. An entity recognition system comprising:
memory which stores a named entity recognition model which has been trained on features extracted from text sequences tagged with document-level entity tags;
a gazetteer generator which generates a document-specific gazetteer for an input text document to be labeled with named entities, the text document being tagged with at least one document-level entity tag, the document-specific gazetteer including an entry based on each of the at least one document-level entity tag, the gazetteer entry including an entity name and optionally an entity type selected from a predefined set of entity types;
a feature extraction component which, for a text sequence of the text document, extracts features for tokens of the text sequence, the features including document-specific features for tokens matching one of the gazetteer entries, the document-specific features include features selected from the group consisting of:
a feature indicating whether a token matches an initial token of a gazetteer entity name of at least two tokens;
a feature indicating whether a token matches an intermediate token of a gazetteer entity name of at least three tokens;
a feature indicating whether a token matches a final token of a gazetteer entity name of at least two tokens; and
a feature indicating whether a token matches a unigram gazetteer entity name;
a recognition component which predicts entity labels for at least some of the tokens in the text sequence with the named entity recognition model, based on the extracted features, and
a processor, in communication with the memory, which implements the gazetteer generator, feature extraction component and recognition component.
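The four positional features recited in claim 19 can be pictured with a short sketch. It is a reading of the claim language under stated assumptions rather than the patented implementation: the function name, the feature keys, the case-insensitive comparison, and the whitespace tokenization of entity names are all hypothetical.

```python
def document_specific_features(token, gazetteer_names):
    """Compute four binary features for one token against a document's entity names."""
    tok = token.lower()
    feats = {"initial": False, "intermediate": False, "final": False, "unigram": False}
    for name in gazetteer_names:
        parts = [p.lower() for p in name.split()]
        if len(parts) == 1:
            # Single-token entity name: only the unigram feature can fire.
            feats["unigram"] |= parts[0] == tok
            continue
        # Names of two or more tokens have an initial and a final token.
        feats["initial"] |= parts[0] == tok
        feats["final"] |= parts[-1] == tok
        # parts[1:-1] is empty unless the name has at least three tokens.
        feats["intermediate"] |= tok in parts[1:-1]
    return feats


names = ["Bank of America", "Xerox"]
of_feats = document_specific_features("of", names)
xerox_feats = document_specific_features("Xerox", names)
```

With these names, "of" activates only the intermediate feature and "Xerox" only the unigram feature. Several features can be active for one token when a document's gazetteer contains overlapping names.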
20. A method for training a named entity recognition system comprising:
receiving a collection of training samples, each training sample including at least one annotated training sequence, each training sequence comprising a sequence of tokens, each training sample being tagged with at least one document-level entity tag which includes an entity name that corresponds to a mention in the sample without being aligned with the mention, each of the training sequences being annotated with token-level entity labels;
for each training sample, generating a document-specific gazetteer based on the at least one document-level entity tag of the annotated training sample, the document-specific gazetteer including a set of entries, each entry including a respective entity name;
using the document-specific gazetteer, extracting features for tokens of the annotated training sequences, the features including document-specific features, the document-specific features being selected from the group consisting of:
a feature indicating whether a token matches an initial token of a gazetteer entity name of at least two tokens,
a feature indicating whether a token matches an intermediate token of a gazetteer entity name of at least three tokens,
a feature indicating whether a token matches a final token of a gazetteer entity name of at least two tokens, and
a feature indicating whether a token matches a unigram gazetteer entity name; and
training a named entity recognition model with the extracted features and the token-level entity labels for each training sequence,
wherein the generating, extracting and training are performed with a processor.
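To make the training loop of claim 20 concrete, here is a minimal sketch in which a plain per-token perceptron stands in for the named entity recognition model. The claim does not name a learning algorithm, so the perceptron, the B/I/L/U/O label scheme, the feature names, and every function name below are assumptions chosen only to keep the example short.

```python
from collections import defaultdict


def gazetteer_features(token, names):
    """Positional match features against the sample's own gazetteer."""
    tok = token.lower()
    feats = set()
    for name in names:
        parts = name.lower().split()
        if len(parts) == 1:
            if parts[0] == tok:
                feats.add("gaz_unigram")
            continue
        if parts[0] == tok:
            feats.add("gaz_initial")
        if parts[-1] == tok:
            feats.add("gaz_final")
        if tok in parts[1:-1]:
            feats.add("gaz_intermediate")
    return feats or {"gaz_none"}


def train(samples, epochs=5):
    """samples: list of (entity_names, sequences); a sequence is a list of (token, label)."""
    weights = defaultdict(float)  # (feature, label) -> weight
    labels = sorted({lab for _, seqs in samples for seq in seqs for _, lab in seq})
    for _ in range(epochs):
        for names, seqs in samples:  # names act as the per-sample gazetteer
            for seq in seqs:
                for token, gold in seq:
                    feats = gazetteer_features(token, names)
                    guess = max(labels, key=lambda l: sum(weights[(f, l)] for f in feats))
                    if guess != gold:  # standard perceptron update on mistakes
                        for f in feats:
                            weights[(f, gold)] += 1.0
                            weights[(f, guess)] -= 1.0
    return weights, labels


def predict(token, names, weights, labels):
    """Label one token of a new document using that document's own entity names."""
    feats = gazetteer_features(token, names)
    return max(labels, key=lambda l: sum(weights.get((f, l), 0.0) for f in feats))


training = [(
    ["Bank of America", "Paris"],
    [[("Bank", "B"), ("of", "I"), ("America", "L"),
      ("opened", "O"), ("in", "O"), ("Paris", "U")]],
)]
weights, labels = train(training)
new_names = ["University of Oxford", "Berlin"]
predicted = [predict(t, new_names, weights, labels) for t in ["of", "Oxford", "Berlin", "visited"]]
```

Because the features describe a token's position within a gazetteer entry rather than the entry's identity, the weights learned from one document carry over to a new document whose entity names were never seen in training.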
Specification