Methods and systems relating to information extraction

US 20060253274A1
Filed: 04/24/2006
Published: 11/09/2006
Est. Priority Date: 05/05/2005
Status: Active Grant

5 Assignments

0 Petitions

Abstract

The invention relates to information extraction systems having discriminative models which utilize hierarchical cluster trees and active learning to enhance training.

117 Citations

30 Claims

1. A method of training an information extraction system comprising:
- employing a first corpus of annotated text;
  
  automatedly extracting information from a second corpus of unannotated text;
  
  automatedly populating a discriminative information extraction model based on the information extracted from the second corpus and the first corpus of annotated text;
  
  automatedly identifying from at least one of the first corpus, the second corpus, and a third corpus one or more word strings including words having an ambiguous relationship and providing the one or more word strings to a trainer for annotation; and
  
  automatedly updating the discriminative information extraction model based on annotations to the one or more word strings provided by the trainer.
- Dependent claims (2, 3, 4):
  - 2. The method of claim 1, wherein automatedly extracting information from the second corpus comprises parsing the second corpus of text based on relative positions of words in the second corpus and generating a hierarchical cluster tree indicative thereof.
  - 3. The method of claim 1, wherein the information automatically extracted from the second corpus comprises information indicative of relative word positions in sentences in the second corpus.
  - 4. The method of claim 3, wherein the information automatically extracted from the second corpus comprises a hierarchical cluster tree.
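
The training loop recited in claim 1 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the patent: the `Perceptron` class stands in for the discriminative model, `featurize` and its `cluster_of` mapping stand in for information extracted from the unannotated corpus, and `ask_trainer` is a hypothetical callback representing the human trainer. Ambiguity is approximated as a score close to the decision boundary.

```python
class Perceptron:
    """Tiny binary perceptron over sparse string features (illustrative only)."""

    def __init__(self):
        self.w = {}

    def score(self, feats):
        return sum(self.w.get(f, 0.0) for f in feats)

    def update(self, feats, label):
        # label is +1 or -1; weights change only when the example is misclassified
        if label * self.score(feats) <= 0:
            for f in feats:
                self.w[f] = self.w.get(f, 0.0) + label


def featurize(word, cluster_of):
    """Combine the word with a cluster id derived from unannotated text (assumed)."""
    return ("w=" + word, "c=" + cluster_of.get(word, "?"))


def train(model, labeled, epochs=5):
    """Populate the model from (features, label) pairs."""
    for _ in range(epochs):
        for feats, label in labeled:
            model.update(feats, label)


def most_ambiguous(model, pool, k=1):
    """Return the k unlabeled items whose score is closest to zero."""
    return sorted(pool, key=lambda feats: abs(model.score(feats)))[:k]


def active_learning_round(model, labeled, pool, ask_trainer, k=1):
    """Query the trainer on ambiguous items, then update the model."""
    queries = most_ambiguous(model, pool, k)
    for feats in queries:
        labeled.append((feats, ask_trainer(feats)))
        pool.remove(feats)
    train(model, labeled)
    return queries
```

A caller would first run `train` on the annotated corpus, then call `active_learning_round` repeatedly until the pool is exhausted or the annotation budget is spent.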

5. A storage medium including computer readable instructions for carrying out a method of training an information extraction system comprising:
- employing a first corpus of annotated text;
  
  automatedly extracting information from a second corpus of unannotated text;
  
  automatedly populating a discriminative information extraction model based on the information extracted from the second corpus and the first corpus of annotated text;
  
  automatedly identifying from at least one of the first corpus, the second corpus, and a third corpus one or more word strings including words having an ambiguous relationship and providing the one or more word strings to a trainer for annotation; and
  
  automatedly updating the discriminative information extraction model based on annotations to the one or more word strings provided by the trainer.
- Dependent claims (6, 7, 8):
  - 6. The storage medium of claim 5, wherein automatedly extracting information from the second corpus comprises parsing the second corpus of text based on relative positions of words in the second corpus and generating a hierarchical cluster tree indicative thereof.
  - 7. The storage medium of claim 5, wherein the information automatically extracted from the second corpus comprises information indicative of relative word positions in sentences in the second corpus.
  - 8. The storage medium of claim 7, wherein the information automatically extracted from the second corpus comprises a hierarchical cluster tree.

9. A method of training an information extraction system comprising:
- employing a first corpus of annotated text;
  
  automatedly parsing a second corpus of text based on relative positions of words in the second corpus and generating a hierarchical cluster tree indicative thereof;
  
  automatedly populating a discriminative information extraction model based on the hierarchical cluster tree and the first corpus of annotated text;
  
  automatedly identifying from at least one of the first corpus, the second corpus, and a third corpus one or more word strings including words having an ambiguous relationship and providing the one or more word strings to a trainer for annotation; and
  
  automatedly updating the discriminative information extraction model based on annotations to the one or more word strings provided by the trainer.
- Dependent claims (10, 11, 12, 13, 14, 15, 16, 17, 18):
  - 10. The method of claim 9, wherein generating the hierarchical cluster tree comprises calculating word bigram statistics corresponding to the occurrences of word bigrams in the second corpus of text.
  - 11. The method of claim 10, wherein generating the hierarchical cluster tree comprises aggregating words into clusters based on the bigram occurrence statistics.
  - 12. The method of claim 9, comprising iteratively clustering words and word clusters such that the hierarchical cluster tree forms a binary tree.
  - 13. The method of claim 9, wherein populating the discriminative information extraction model comprises characterizing a word based in part on a cluster to which the word belongs.
  - 14. The method of claim 9, wherein populating the discriminative information extraction model comprises characterizing a word based in part on a plurality of clusters to which the word belongs, each cluster corresponding to a different level in the hierarchical cluster tree.
  - 15. The method of claim 9, wherein training the discriminative information extraction model comprises calculating word feature statistics based in part on the annotations in the first corpus of annotated text.
  - 16. The method of claim 15, wherein automatedly updating the discriminative information extraction model comprises updating the word feature statistics based in part on the annotations provided by the trainer to the one or more word strings.
  - 17. The method of claim 9, wherein the one or more word strings are identified for trainer annotation based at least in part on a confidence measure indicating a level of confidence that the discriminative information extraction model correctly extracts information from the identified one or more word strings.
  - 18. The method of claim 9, wherein the one or more word strings are identified for trainer annotation based at least in part on a rarity of a feature found in the one or more word strings in relation to other features identified in the corpora of text used to populate the discriminative information extraction model.
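
Claims 10 through 12 describe counting word bigrams, aggregating words into clusters from those counts, and merging iteratively until a binary tree results. The sketch below is one simplified way to do that, under assumptions that are not in the claims: words are merged greedily by cosine similarity of their left and right neighbor counts, which is a stand-in for whatever merge criterion the patent's specification uses. All function names are hypothetical, and `build_tree` expects at least one word.

```python
from collections import Counter
from itertools import combinations
import math


def bigram_counts(sentences):
    """Count adjacent word pairs in tokenized sentences."""
    counts = Counter()
    for sent in sentences:
        for left, right in zip(sent, sent[1:]):
            counts[(left, right)] += 1
    return counts


def context_vectors(counts):
    """Describe each word by the words seen to its left (L) and right (R)."""
    vecs = {}
    for (left, right), n in counts.items():
        vecs.setdefault(left, Counter())[("R", right)] += n
        vecs.setdefault(right, Counter())[("L", left)] += n
    return vecs


def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_tree(vecs):
    """Greedily merge the two most similar clusters until one binary tree remains."""
    # Each cluster is (tree, vector); a tree is a word or a (left, right) pair.
    clusters = [(w, Counter(v)) for w, v in sorted(vecs.items())]
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]))
        (ti, vi), (tj, vj) = clusters[i], clusters[j]
        merged = ((ti, tj), vi + vj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


def leaf_paths(tree, prefix=""):
    """Map each word to its root-to-leaf bit string (0 = left, 1 = right)."""
    if isinstance(tree, str):
        return {tree: prefix or "0"}
    left, right = tree
    paths = leaf_paths(left, prefix + "0")
    paths.update(leaf_paths(right, prefix + "1"))
    return paths
```

The pairwise search makes each merge quadratic in the number of clusters, so this form is only practical for small vocabularies. Words that occur in similar contexts end up as siblings or near neighbors in the tree, and therefore share long bit-string prefixes.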

19. A storage medium including computer readable instructions for carrying out a method of training an information extraction system comprising:
- employing a first corpus of annotated text;
  
  automatedly parsing a second corpus of text based on relative positions of words in the second corpus and generating a hierarchical cluster tree indicative thereof;
  
  automatedly populating a discriminative information extraction model based on the hierarchical cluster tree and the first corpus of annotated text;
  
  automatedly identifying from at least one of the first corpus, the second corpus, and a third corpus one or more word strings including words having an ambiguous relationship and providing the one or more word strings to a trainer for annotation; and
  
  automatedly updating the discriminative information extraction model based on annotations to the one or more word strings provided by the trainer.
- Dependent claims (20, 21, 22, 23, 24, 25, 26, 27, 28):
  - 20. The storage medium of claim 19, wherein generating the hierarchical cluster tree comprises calculating word bigram statistics corresponding to the occurrences of word bigrams in the second corpus of text.
  - 21. The storage medium of claim 20, wherein generating the hierarchical cluster tree comprises aggregating words into clusters based on the bigram occurrence statistics.
  - 22. The storage medium of claim 19, wherein the method comprises iteratively clustering words and word clusters such that the hierarchical cluster tree forms a binary tree.
  - 23. The storage medium of claim 19, wherein populating the discriminative information extraction model comprises characterizing a word based in part on a cluster to which the word belongs.
  - 24. The storage medium of claim 19, wherein populating the discriminative information extraction model comprises characterizing a word based in part on a plurality of clusters to which the word belongs, each cluster corresponding to a different level in the hierarchical cluster tree.
  - 25. The storage medium of claim 19, wherein training the discriminative information extraction model comprises calculating word feature statistics based in part on the annotations in the first corpus of annotated text.
  - 26. The storage medium of claim 25, wherein automatedly updating the discriminative information extraction model comprises updating the word feature statistics based in part on the annotations provided by the trainer to the one or more word strings.
  - 27. The storage medium of claim 19, wherein the one or more word strings are identified for trainer annotation based at least in part on a confidence measure indicating a level of confidence that the discriminative information extraction model correctly extracts information from the identified one or more word strings.
  - 28. The storage medium of claim 19, wherein the one or more word strings are identified for trainer annotation based at least in part on a rarity of a feature found in the one or more word strings in relation to other features identified in corpora of text used to populate the information extraction model.
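
The two selection criteria in claims 17, 18, 27 and 28 (a confidence measure and feature rarity) can be sketched as follows. The claims do not fix a formula, so both measures here are assumptions: confidence is taken as the margin between the two most probable labels, and a feature is called rare when its count in the training corpora is at or below a threshold. The function names, the thresholds, and the shape of the inputs are hypothetical, and `confidence` expects a non-empty probability mapping.

```python
from collections import Counter


def confidence(label_probs):
    """Margin between the best and second-best label probabilities."""
    top = sorted(label_probs.values(), reverse=True)
    return top[0] - (top[1] if len(top) > 1 else 0.0)


def select_low_confidence(candidates, threshold=0.2):
    """candidates: (word_string, {label: probability}) pairs."""
    return [s for s, probs in candidates if confidence(probs) < threshold]


def feature_counts(feature_sets):
    """Count in how many training examples each feature appears."""
    counts = Counter()
    for feats in feature_sets:
        counts.update(set(feats))
    return counts


def select_rare(candidates, counts, max_count=1):
    """candidates: (word_string, features) pairs; keep those with a rare feature."""
    return [s for s, feats in candidates
            if any(counts[f] <= max_count for f in feats)]
```

Strings returned by either selector would be the ones shown to the trainer for annotation. A real system would probably combine both signals rather than apply them separately.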

29. A method of training an information extraction system comprising:
- receiving a first corpus of annotated text;
  
  receiving a hierarchical cluster tree indicative of the relative positions of words in a second corpus of text;
  
  automatedly populating a discriminative information extraction model based on the hierarchical cluster tree and the first corpus of annotated text;
  
  automatedly identifying from at least one of the first corpus, the second corpus, and a third corpus one or more word strings including words having an ambiguous relationship and providing the one or more word strings to a trainer for annotation; and
  
  automatedly updating the discriminative information extraction model based on annotations to the one or more word strings provided by the trainer.
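
When a cluster tree is received rather than built, as in claim 29, one plausible way to characterize a word by clusters at several tree levels (the idea in claims 14 and 24) is to use prefixes of its root-to-leaf bit string as features. This is a minimal sketch under that assumption. The `paths` mapping, the prefix lengths, and the feature naming are hypothetical and not taken from the patent.

```python
def cluster_features(word, paths, prefix_lengths=(2, 4, 6)):
    """Return the word identity plus one feature per cluster-tree level.

    paths maps a lowercased word to its bit-string path in the cluster tree.
    A short prefix names a coarse cluster near the root; a long prefix names
    a fine-grained cluster near the leaf.
    """
    key = word.lower()
    feats = ["w=" + key]
    path = paths.get(key)
    if path is not None:
        for n in prefix_lengths:
            feats.append(f"c{n}={path[:n]}")
    return feats
```

Two words that were never seen together in annotated text can still share coarse cluster features, which lets the model generalize from one to the other.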

30. A storage medium including computer readable instructions for carrying out a method of training an information extraction system comprising:
- employing a first corpus of annotated text;
  
  receiving a hierarchical cluster tree indicative of the relative positions of words in a second corpus of text;
  
  automatedly populating a discriminative information extraction model based on the hierarchical cluster tree and the first corpus of annotated text;
  
  automatedly identifying from at least one of the first corpus, the second corpus, and a third corpus one or more word strings including words having an ambiguous relationship and providing the one or more word strings to a trainer for annotation; and
  
  automatedly updating the discriminative information extraction model based on annotations to the one or more word strings provided by the trainer.

Current Assignee: Cxense ASA
Original Assignee: BBN Technologies (Rtx Corporation)
Inventors: Miller, Scott
Granted Patent: US 8,280,719 B2
Field of Search
US Class Current: 704/9
CPC Class Codes:
- G06F 40/211 Syntactic parsing, e.g. bas...
- G06F 40/295 Named entity recognition
