SYSTEM AND METHOD FOR EXTRACTING ENTITIES OF INTEREST FROM TEXT USING N-GRAM MODELS
Abstract
A document (or multiple documents) is analyzed to identify entities of interest within that document. This is accomplished by constructing n-gram or bi-gram models that correspond to different kinds of text entities, such as chemistry-related words and generic English words. The models can be constructed from training text selected to reflect a particular kind of text entity. The document is tokenized, and the tokens are run against the models to determine, for each token, which kind of text entity is most likely to be associated with that token. The entities of interest in the document can then be annotated accordingly.
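The flow described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: it assumes character-level bi-grams with add-one smoothing, and the class `BigramModel`, the function `annotate`, and the two tiny training lists are hypothetical names and data invented for the example.

```python
import math
import re
from collections import Counter


class BigramModel:
    """Character-level bi-gram model with add-one smoothing (assumed design)."""

    def __init__(self, name, training_terms):
        self.name = name
        self.pair_counts = Counter()     # counts of (previous char, current char)
        self.context_counts = Counter()  # counts of the previous char alone
        self.alphabet = {"^", "$"}       # "^" marks term start, "$" marks term end
        for term in training_terms:
            chars = ["^"] + list(term.lower()) + ["$"]
            self.alphabet.update(chars)
            for prev, cur in zip(chars, chars[1:]):
                self.pair_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, token):
        """Log-probability of the token's character sequence under this model."""
        chars = ["^"] + list(token.lower()) + ["$"]
        vocab = len(self.alphabet) + 1  # one extra slot for unseen characters
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            numerator = self.pair_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + vocab
            total += math.log(numerator / denominator)
        return total


def annotate(text, models, names_of_interest):
    """Tokenize text, pick the most likely model per token, keep tokens of interest."""
    annotations = []
    for token in re.findall(r"[A-Za-z]+", text):
        best = max(models, key=lambda model: model.log_prob(token))
        if best.name in names_of_interest:
            annotations.append((token, best.name))
    return annotations


# Hypothetical toy training sets; a real system would train on far more text.
chemical = BigramModel("chemical", [
    "benzene", "toluene", "ethanol", "methanol", "propanol",
    "acetone", "hexane", "butanol", "chloride", "sulfate",
])
english = BigramModel("english", [
    "the", "and", "was", "with", "water", "mixed", "then",
    "added", "stirred", "solution", "for", "hours", "in",
])

result = annotate("Ethanol was mixed with toluene.", [chemical, english], {"chemical"})
```

Each annotation pairs a token with the name of the model that scored it highest, which corresponds to annotating entities by the name of the model of interest. With training sets this small the margins between models can be narrow, so the sketch should be read as a demonstration of the control flow rather than of realistic accuracy.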
27 Claims
1. A method of using at least two n-gram models, at least one of which is based on a training set of entities of interest and at least one of which is based on a training set of entities not of interest, the method comprising:
tokenizing a document to produce a string of tokens corresponding to terms within the document;
for each token, evaluating the token against the n-gram models to determine which model is most likely to be associated with the token;
identifying tokens corresponding to at least one n-gram model that is of interest; and
annotating the identified entities by at least one name for said at least one n-gram model.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A method for use with tokens corresponding to terms within a document, comprising:
evaluating each token against at least 2 different Markov models to determine respective relative probabilities that the token corresponds to the Markov models;
for each token, comparing the relative probabilities with each other to determine which Markov model is more likely to be associated with the token; and
identifying tokens most likely to correspond to a particular one of the Markov models, so that terms of interest within the document are identified.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
21. A method, comprising:
creating respective bi-gram language models for i) entities of interest (“M_INT”), and ii) entities that are not of interest (“M_NOT_INT”);
parsing unstructured text of a document into a collection C of phrases;
for each phrase in C, calculating i) the probability that the phrase is associated with the model M_INT and ii) the probability that the phrase is associated with the model M_NOT_INT; and
determining whether each phrase is an entity of interest by comparing the calculated probabilities.
Dependent claims: 22, 23
24. A computer program product comprising a computer useable medium that includes computer usable program code tangibly embodied thereon for use with tokens corresponding to terms within a document, the product including:
code for evaluating each token against at least 2 different Markov models to determine respective relative probabilities that the token corresponds to the Markov models;
code that, for each token, compares the relative probabilities with each other to determine which Markov model is more likely to be associated with the token; and
code for identifying tokens most likely to correspond to a particular one of the Markov models, so that terms of interest within the document are identified.
Dependent claims: 25, 26, 27
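The comparison of relative probabilities recited in claims 11 and 24 can be illustrated as follows. This is a sketch of one possible reading, not the claimed implementation: it assumes each Markov model has already produced a log-likelihood for the token, that the models have equal prior weight, and that "relative probability" means the likelihoods normalized to sum to 1. The function names and the example scores are hypothetical.

```python
import math


def relative_probabilities(log_likelihoods):
    """Normalize per-model log-likelihoods into probabilities that sum to 1.

    Subtracting the largest log-likelihood before exponentiating avoids
    floating-point underflow for long tokens (the log-sum-exp trick).
    """
    peak = max(log_likelihoods.values())
    weights = {name: math.exp(score - peak) for name, score in log_likelihoods.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def most_likely_model(log_likelihoods):
    """Return the name of the model with the highest relative probability."""
    relative = relative_probabilities(log_likelihoods)
    return max(relative, key=relative.get), relative


# Hypothetical log-likelihoods for a single token under two Markov models.
scores = {"chemical": -16.1, "english": -22.7}
best_name, relative = most_likely_model(scores)
```

Because normalization preserves ordering, the most likely model is the same one that has the highest raw log-likelihood. The normalized values are mainly useful when a confidence threshold is applied before a token is identified as a term of interest.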
Specification