Methods and systems relating to information extraction

US 8,280,719 B2
Filed: 04/24/2006
Issued: 10/02/2012
Est. Priority Date: 05/05/2005
Status: Expired due to Fees

Abstract

The invention relates to information extraction systems having discriminative models which utilize hierarchical cluster trees and active learning to enhance training.

10 Claims

1. A method of training an information extraction system comprising:
- employing a first corpus of annotated text;
  
  automatically, using a first computer, extracting information from a second corpus of unannotated text, automatically extracting information from the second corpus comprising parsing the second corpus of text based on relative positions of words in the second corpus and generating a hierarchical cluster tree indicative thereof, the cluster tree containing word groups as hierarchical groups, which have decreasingly similar usage statistics as the groups increase in size, wherein the information automatically extracted from the second corpus comprises information indicative of relative word positions in sentences in the second corpus;
  
  automatically, using a second computer, populating a discriminative information extraction model based on the information extracted from the second corpus of unannotated text and information extracted from the first corpus of annotated text;
  
  automatically, using a third computer, identifying from at least one of the first corpus, the second corpus, and a third corpus one or more word strings including words having an ambiguous relationship and providing the one or more word strings to a trainer for annotation or to an information extraction system previously trained with sufficient information to accurately annotate the one or more word strings; and
  
  automatically, using a fourth computer, updating the discriminative information extraction model based on annotations to the one or more word strings provided by the trainer or by the information system previously trained.
- Dependent claims:
  - 2. The method of claim 1, wherein the first computer, second computer, third computer, and fourth computer are a single computer.
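
The identify, annotate, and update steps of claim 1 can be illustrated with a minimal active-learning sketch. It is not the patented implementation: the claim does not say how "ambiguous" word strings are detected, so this example assumes a least-confidence criterion. The names `least_confident`, `active_learning_round`, `label_probs`, `annotate`, and `update` are hypothetical, introduced only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def least_confident(
    sentences: Sequence[str],
    label_probs: Callable[[str], Dict[str, float]],
    budget: int,
) -> List[str]:
    """Return up to `budget` sentences whose best label has the lowest probability.

    Assumption: `label_probs` stands in for the discriminative model and
    returns a label -> probability mapping for a word string.
    """
    scored = [
        (max(label_probs(s).values()), i, s) for i, s in enumerate(sentences)
    ]
    scored.sort()  # lowest top-label probability first; index breaks ties
    return [s for _, _, s in scored[:budget]]


def active_learning_round(
    pool: Sequence[str],
    label_probs: Callable[[str], Dict[str, float]],
    annotate: Callable[[str], str],
    update: Callable[[List[Tuple[str, str]]], None],
    budget: int,
) -> List[str]:
    """One round: pick ambiguous strings, collect annotations, update the model.

    `annotate` may be a human trainer or a previously trained system;
    `update` is whatever retraining step the model supports.
    """
    picked = least_confident(pool, label_probs, budget)
    update([(s, annotate(s)) for s in picked])
    return picked


# Toy stand-in for model output (hypothetical numbers).
toy_probs = {
    "Paris Hilton arrived": {"PERSON": 0.55, "LOCATION": 0.45},
    "IBM hired staff": {"ORG": 0.95, "PERSON": 0.05},
    "Washington said no": {"PERSON": 0.50, "LOCATION": 0.50},
}
collected: List[Tuple[str, str]] = []
picked = active_learning_round(
    list(toy_probs), toy_probs.__getitem__, lambda s: "PERSON", collected.extend, 2
)
```

In this toy run the two strings with the least decisive label distributions are sent for annotation, while the confidently labeled one is skipped.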

3. A method of training an information extraction system comprising:
- employing a first corpus of annotated text;
  
  automatically, using a first computer, parsing a plurality of sentences in a second corpus of text based on relative positions of words in the second corpus and generating a hierarchical cluster tree indicative thereof, generating the hierarchical cluster tree comprising calculating word bigram occurrence statistics corresponding to the occurrences of word bigrams in the second corpus of text and aggregating words into clusters based on the bigram occurrence statistics;
  
  automatically, using a second computer, populating a discriminative information extraction model based on the hierarchical cluster tree and the first corpus of annotated text;
  
  automatically, using a third computer, identifying from at least one of the first corpus, the second corpus, and a third corpus one or more word strings including words having an ambiguous relationship and providing the one or more word strings to a trainer for annotation or to an information extraction system previously trained with sufficient information to accurately annotate the one or more word strings;
  
  automatically, using a fourth computer, updating the discriminative information extraction model based on annotations to the one or more word strings provided by the trainer or by the information system previously trained;
  
  iteratively clustering words and word clusters such that the hierarchical cluster tree forms a binary tree.
- Dependent claims:
  - 4. The method of claim 3, wherein populating the discriminative information extraction model comprises characterizing a word based in part on a cluster to which the word belongs.
  - 5. The method of claim 3, wherein populating the discriminative information extraction model comprises characterizing a word based in part on a plurality of clusters to which the word belongs, each cluster corresponding to a different level in the hierarchical cluster tree.
  - 6. The method of claim 3, wherein training the discriminative information extraction model comprises calculating word feature statistics based in part on the annotations in the first corpus of annotated text.
  - 7. The method of claim 6, wherein automatically updating the discriminative information extraction model comprises updating the word feature statistics based in part on the annotations provided by the trainer to the one or more word strings.
  - 8. The method of claim 3, wherein the one or more word strings are identified for trainer annotation based at least in part on a confidence measure indicating a level of confidence that the discriminative information extraction model correctly extracts information from the identified one or more word strings.
  - 9. The method of claim 3, wherein the one or more word strings are identified for trainer annotation based at least in part on a rarity of a feature found in the one or more word strings in relation to other features identified in the corpora of text used to populate the discriminative information extraction model.
  - 10. The method of claim 3, wherein the first computer, second computer, third computer, and fourth computer are a single computer.
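
Claims 3 to 5 describe building a binary cluster tree from bigram statistics and characterizing words by their clusters at several tree levels. The sketch below is one simplified reading under stated assumptions. The claims do not name a similarity measure, so cosine similarity over left and right neighbor counts is assumed here. The function names and the `levels` parameter are hypothetical, and the greedy all-pairs merge is only practical for toy vocabularies.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def bigram_contexts(sentences):
    """Map each word to counts of its left and right neighbors (bigram statistics)."""
    ctx = {}
    for sent in sentences:
        toks = ["<s>"] + sent.lower().split() + ["</s>"]
        for left, right in zip(toks, toks[1:]):
            if left != "<s>":
                ctx.setdefault(left, Counter())[("R", right)] += 1
            if right != "</s>":
                ctx.setdefault(right, Counter())[("L", left)] += 1
    return ctx


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_binary_tree(ctx):
    """Repeatedly merge the two most similar clusters until one binary tree remains.

    Expects a non-empty vocabulary. Leaves are words; internal nodes are pairs.
    """
    clusters = sorted(ctx.items())  # list of (subtree, context vector)
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]),
        )
        (tree_a, vec_a), (tree_b, vec_b) = clusters[i], clusters[j]
        merged = ((tree_a, tree_b), vec_a + vec_b)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


def paths(tree, prefix=""):
    """Give each word the bit string of left (0) and right (1) turns from the root."""
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    return {**paths(left, prefix + "0"), **paths(right, prefix + "1")}


def cluster_features(word, word_paths, levels=(1, 2, 4)):
    """One feature per tree level: the path prefix of that length (cf. claim 5)."""
    path = word_paths.get(word.lower(), "")
    return [f"cluster{n}={path[:n]}" for n in levels if len(path) >= n]


corpus = ["the cat sat", "the dog sat", "a cat ran", "a dog ran"]
word_paths = paths(build_binary_tree(bigram_contexts(corpus)))
features = cluster_features("cat", word_paths)
```

In this toy corpus "cat" and "dog" occur in identical contexts, so they end up as siblings in the tree and share every path prefix except the last bit. Short prefixes act as coarse clusters and longer prefixes as finer ones.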

Current Assignee: Cxense ASA
Original Assignee: Ramp Holdings Incorporated (Clean Harbors Incorporated)
Inventors: Miller, Scott
Primary Examiner(s): He, Jialong
Application Number: US11/411,206
Publication Number: US 20060253274A1
Time in Patent Office: 2,353 Days
Field of Search: 704/1-10, 707/736-738
US Class Current: 704/9
CPC Class Codes:
G06F 40/211 Syntactic parsing, e.g. bas...
G06F 40/295 Named entity recognition
