SYSTEM AND METHOD FOR EXTRACTING ENTITIES OF INTEREST FROM TEXT USING N-GRAM MODELS

US 20080040298A1
Filed: 05/31/2006
Published: 02/14/2008
Est. Priority Date: 05/31/2006
Status: Active Grant

Abstract

A document (or multiple documents) is analyzed to identify entities of interest within that document. This is accomplished by constructing n-gram or bi-gram models that correspond to different kinds of text entities, such as chemistry-related words and generic English words. The models can be constructed from training text selected to reflect a particular kind of text entity. The document is tokenized, and the tokens are run against the models to determine, for each token, which kind of text entity is most likely to be associated with that token. The entities of interest in the document can then be annotated accordingly.

27 Claims

1. A method of using at least two n-gram models, at least one of which is based on a training set of entities of interest and at least one of which is based on a training set of entities not of interest, the method comprising:
- tokenizing a document to produce a string of tokens corresponding to terms within the document;
  
  for each token, evaluating the token against the n-gram models to determine which model is most likely to be associated with the token;
  
  identifying tokens corresponding to at least one n-gram model that is of interest; and
  
  annotating the identified entities by at least one name for said at least one n-gram model.
- Dependent claims (2-10):
  - 2. The method of claim 1, comprising annotating a group of adjacent tokens, in order to generate a maximal entity of interest that includes more than one word.
  - 3. The method of claim 1, wherein the n-gram model of interest is directed to chemical entities.
  - 4. The method of claim 1, wherein said evaluating comprises:
    - calculating a relative probability that a given token has been generated by a model of interest;
      
      calculating a relative probability that the given token has been generated by a model that is not of interest;
      
      comparing the calculated relative probabilities; and
      
      associating each token with the model that yields the greater relative probability.
  - 5. The method of claim 4, wherein a Markov model is used to determine the relative probabilities.
  - 6. The method of claim 4, wherein a count matrix is used to determine the relative probabilities.
  - 7. The method of claim 1, wherein said at least two n-gram models include models directed to different languages.
  - 8. The method of claim 1, wherein the terms within the document include terms of a chemical nature.
  - 9. The method of claim 8, wherein all the terms of a chemical nature within the document are identified.
  - 10. The method of claim 1, wherein the method is implemented by at least one computer.
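
The steps of claims 1, 2, 4 and 6 can be illustrated with a minimal sketch. It is not the implementation described in the patent: it assumes character-level bi-grams stored in a count matrix, add-one smoothing, and a simple regular-expression tokenizer. The class `CharBigramModel`, the function `annotate`, and the tiny training lists are hypothetical names and data introduced only for this example.

```python
import math
import re
from collections import defaultdict


class CharBigramModel:
    """Character bi-gram model backed by a count matrix (assumed design)."""

    def __init__(self, name, training_terms):
        self.name = name
        # counts[prev][nxt] = how often character nxt followed character prev
        self.counts = defaultdict(lambda: defaultdict(int))
        self.alphabet = {"^", "$"}  # "^" marks term start, "$" marks term end
        for term in training_terms:
            chars = ["^"] + list(term.lower()) + ["$"]
            self.alphabet.update(chars)
            for prev, nxt in zip(chars, chars[1:]):
                self.counts[prev][nxt] += 1

    def log_prob(self, token):
        """Log-probability that this model generated the token (add-one smoothing)."""
        chars = ["^"] + list(token.lower()) + ["$"]
        vocab = len(self.alphabet) + 1  # one extra slot for unseen characters
        total = 0.0
        for prev, nxt in zip(chars, chars[1:]):
            row = self.counts.get(prev, {})
            total += math.log((row.get(nxt, 0) + 1) / (sum(row.values()) + vocab))
        return total


def tokenize(text):
    """Assumed tokenizer: runs of letters, digits and hyphens."""
    return re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", text)


def annotate(text, models, of_interest):
    """Return (entity, model_name) pairs, merging adjacent tokens with the same label.

    Simplification: punctuation between tokens is ignored when merging.
    """
    entities = []
    prev_label = None
    for token in tokenize(text):
        best = max(models, key=lambda m: m.log_prob(token))
        if best.name in of_interest:
            if prev_label == best.name:
                merged, _ = entities[-1]
                entities[-1] = (merged + " " + token, best.name)
            else:
                entities.append((token, best.name))
            prev_label = best.name
        else:
            prev_label = None
    return entities


# Toy training sets, far too small for real use.
CHEMICAL_TERMS = ["benzene", "toluene", "ethanol", "methanol", "sodium", "chloride"]
ENGLISH_TERMS = ["the", "was", "and", "with", "mixed", "then", "sample", "heated"]

models = [
    CharBigramModel("chemical", CHEMICAL_TERMS),
    CharBigramModel("english", ENGLISH_TERMS),
]
print(annotate("The sample was mixed with sodium chloride then heated",
               models, {"chemical"}))
```

Because both models score the same token, the comparison in claim 4 reduces to comparing two log-probabilities over the same number of character transitions. Merging adjacent tokens that share a label is one possible reading of the "maximal entity" in claim 2.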

11. A method for use with tokens corresponding to terms within a document, comprising:
- evaluating each token against at least 2 different Markov models to determine respective relative probabilities that the token corresponds to the Markov models;
  
  for each token, comparing the relative probabilities with each other to determine which Markov model is more likely to be associated with the token; and
  
  identifying tokens most likely to correspond to a particular one of the Markov models, so that terms of interest within the document are identified.
- Dependent claims (12-20):
  - 12. The method of claim 11, further comprising generating the tokens corresponding to the terms within the document.
  - 13. The method of claim 11, further comprising adding tags to at least some of the terms within the document.
  - 14. The method of claim 13, wherein tags are added to all terms corresponding to said particular one of the Markov models.
  - 15. The method of claim 11, wherein said at least two Markov models correspond to respective n-gram models, at least one of which is based on a training set of entities of interest and at least one of which is based on a training set of entities not of interest.
  - 16. The method of claim 15, wherein the training set of interest is directed to chemical terms.
  - 17. The method of claim 11, comprising:
    - evaluating each token against at least 3 different Markov models to determine respective relative probabilities that the token corresponds to the Markov models, at least 2 of the Markov models being directed to terms that are of interest, and at least one of the Markov models being directed to terms that are not of interest; and
      
      identifying tokens associated with said at least 2 of the Markov models directed to terms of interest.
  - 18. The method of claim 17, further comprising annotating the document in view of the identified tokens.
  - 19. The method of claim 11, further comprising generating a file that includes annotation information associated with said identified tokens.
  - 20. The method of claim 11, wherein the method is implemented by at least one computer.
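
Claims 11, 17 and 19 can be sketched in the same spirit: score each token against three Markov models, turn the scores into relative probabilities, and emit annotation records for the tokens whose most likely model is one of interest. Everything here is an assumption made for illustration: the first-order character Markov models, the add-one smoothing, equal prior weight for each model, the hypothetical "gene" model, the toy training lists, and the JSON record layout.

```python
import json
import math
from collections import Counter


def train_markov(terms):
    """First-order character Markov model; returns a function token -> log-likelihood."""
    pair_counts, prev_counts, symbols = Counter(), Counter(), {"^", "$"}
    for term in terms:
        chars = ["^", *term.lower(), "$"]
        symbols.update(chars)
        for prev, nxt in zip(chars, chars[1:]):
            pair_counts[prev, nxt] += 1
            prev_counts[prev] += 1
    vocab = len(symbols) + 1  # one extra slot for unseen characters

    def log_likelihood(token):
        chars = ["^", *token.lower(), "$"]
        return sum(
            math.log((pair_counts[p, n] + 1) / (prev_counts[p] + vocab))
            for p, n in zip(chars, chars[1:])
        )

    return log_likelihood


def relative_probabilities(token, models):
    """Normalise per-model likelihoods so they sum to 1 (equal priors assumed)."""
    logs = {name: model(token) for name, model in models.items()}
    peak = max(logs.values())  # subtract the maximum to avoid underflow
    weights = {name: math.exp(value - peak) for name, value in logs.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def annotation_records(tokens, models, of_interest):
    """One record per token whose most probable model is a model of interest."""
    records = []
    for position, token in enumerate(tokens):
        probs = relative_probabilities(token, models)
        best = max(probs, key=probs.get)
        if best in of_interest:
            records.append({"position": position, "token": token, "tag": best})
    return records


# Toy training sets; two models of interest and one that is not of interest.
models = {
    "chemical": train_markov(["benzene", "toluene", "ethanol", "methanol", "sodium", "chloride"]),
    "gene": train_markov(["brca1", "brca2", "tp53", "kras", "egfr", "myc"]),
    "english": train_markov(["the", "was", "and", "with", "mixed", "then", "sample", "heated"]),
}
records = annotation_records(["the", "sample", "sodium", "kras"], models, {"chemical", "gene"})
print(json.dumps(records, indent=2))  # could be written to an annotation file instead
```

Keeping the annotation output separate from the document text is one way to read claim 19; inserting tags directly into the document, as in claims 13 and 14, would work from the same records.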

21. A method, comprising:
- creating respective bi-gram language models for i) entities of interest (“M_INT”), and ii) entities that are not of interest (“M_NOT_INT”);
  
  parsing unstructured text of a document into a collection C of phrases;
  
  for each phrase in C, calculating i) the probability that the phrase is associated with the model M_INT and ii) the probability that the phrase is associated with the model M_NOT_INT; and
  
  determining whether each phrase is an entity of interest by comparing the calculated probabilities.
- Dependent claims (22-23):
  - 22. The method of claim 21, further comprising annotating the entities of interest in the document.
  - 23. The method of claim 21, wherein the method is implemented by at least one computer.
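
The comparison in claim 21 can be written out under the standard bi-gram factorization. This is a sketch, not text from the specification: it assumes each phrase s is scored as a sequence of characters c_1 ... c_n with an added start marker c_0, and that the two models are given equal prior weight.

```latex
% Bi-gram (first-order Markov) probability of a phrase s = c_1 c_2 ... c_n
% under a model M, where c_0 is an assumed start-of-phrase marker.
P(s \mid M) = \prod_{i=1}^{n} P(c_i \mid c_{i-1}, M)

% Decision rule for each phrase s in the collection C (equal priors assumed):
s \text{ is an entity of interest} \iff
\sum_{i=1}^{n} \log P(c_i \mid c_{i-1}, M_{\mathrm{INT}})
\;>\;
\sum_{i=1}^{n} \log P(c_i \mid c_{i-1}, M_{\mathrm{NOT\_INT}})
```

Working with sums of logarithms rather than products avoids numerical underflow for long phrases and does not change which model wins.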

24. A computer program product comprising a computer useable medium that includes computer usable program code tangibly embodied thereon for use with tokens corresponding to terms within a document, the product including:
- code for evaluating each token against at least 2 different Markov models to determine respective relative probabilities that the token corresponds to the Markov models;
  
  code that, for each token, compares the relative probabilities with each other to determine which Markov model is more likely to be associated with the token; and
  
  code for identifying tokens most likely to correspond to a particular one of the Markov models, so that terms of interest within the document are identified.
- Dependent claims (25-27):
  - 25. The computer program product of claim 24, wherein said at least two Markov models correspond to respective n-gram models, at least one of which is based on a training set of entities of interest and at least one of which is based on a training set of entities not of interest.
  - 26. The computer program product of claim 25, wherein the training set of interest is directed to chemical terms.
  - 27. The computer program product of claim 26, further comprising code for annotating the document in view of the identified tokens.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Rhodes, James J.; Kanungo, Tapas

Granted Patent: US 7,493,293 B2
Current US Class: 706/12
CPC Class Codes: G06F 40/295 Named entity recognition
