Methods and systems relating to information extraction
First Claim
1. A method of training an information extraction system comprising:
employing a first corpus of annotated text;
automatically, using a first computer, extracting information from a second corpus of unannotated text, automatically extracting information from the second corpus comprising parsing the second corpus of text based on relative positions of words in the second corpus and generating a hierarchical cluster tree indicative thereof, the cluster tree containing word groups as hierarchical groups, which have decreasingly similar usage statistics as the groups increase in size, wherein the information automatically extracted from the second corpus comprises information indicative of relative word positions in sentences in the second corpus;
automatically, using a second computer, populating a discriminative information extraction model based on the information extracted from the second corpus of unannotated text and information extracted from the first corpus of annotated text;
automatically, using a third computer, identifying from at least one of the first corpus, the second corpus, and a third corpus one or more word strings including words having an ambiguous relationship and providing the one or more word strings to a trainer for annotation or to an information extraction system previously trained with sufficient information to accurately annotate the one or more word strings; and
automatically, using a fourth computer, updating the discriminative information extraction model based on annotations to the one or more word strings provided by the trainer or by the information system previously trained.
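The identifying step of claim 1 (finding word strings whose words have an ambiguous relationship and routing them to a trainer) resembles what is commonly called uncertainty sampling in active learning. The following is a minimal sketch under stated assumptions, not the claimed implementation: the claim does not say how ambiguity is measured, so the margin between the two most probable labels is used here as a stand-in, and the names `margin`, `select_ambiguous`, `toy_predict`, and the probability table are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

LabelProbs = Dict[str, float]


def margin(probs: LabelProbs) -> float:
    """Gap between the two most probable labels (assumed ambiguity measure).

    A small gap means the model can barely choose between two labels.
    """
    top = sorted(probs.values(), reverse=True)
    return top[0] - top[1] if len(top) > 1 else 1.0


def select_ambiguous(
    strings: Sequence[str],
    predict: Callable[[str], LabelProbs],
    k: int,
) -> List[str]:
    """Return the k word strings with the smallest margin, most ambiguous first."""
    return sorted(strings, key=lambda s: margin(predict(s)))[:k]


# Hypothetical model output, invented purely for illustration.
TOY_TABLE: Dict[str, LabelProbs] = {
    "Washington said": {"PERSON": 0.51, "LOCATION": 0.49},
    "in Paris": {"PERSON": 0.05, "LOCATION": 0.95},
    "Jordan visited": {"PERSON": 0.60, "LOCATION": 0.40},
}


def toy_predict(s: str) -> LabelProbs:
    """Stand-in for a trained discriminative model's label distribution."""
    return TOY_TABLE[s]


# The two strings the model is least sure about would go to the trainer.
to_annotate = select_ambiguous(list(TOY_TABLE), toy_predict, k=2)
```

In a full loop, the annotations returned for `to_annotate` would be added to the annotated corpus and the model updated, corresponding to the final step of the claim.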
Abstract
The invention relates to information extraction systems having discriminative models which utilize hierarchical cluster trees and active learning to enhance training.
10 Claims
3. A method of training an information extraction system comprising:
employing a first corpus of annotated text;
automatically, using a first computer, parsing a plurality of sentences in a second corpus of text based on relative positions of words in the second corpus and generating a hierarchical cluster tree indicative thereof, generating the hierarchical cluster tree comprising calculating word bigram occurrence statistics corresponding to the occurrences of word bigrams in the second corpus of text and aggregating words into clusters based on the bigram occurrence statistics;
automatically, using a second computer, populating a discriminative information extraction model based on the hierarchical cluster tree and the first corpus of annotated text;
automatically, using a third computer, identifying from at least one of the first corpus, the second corpus, and a third corpus one or more word strings including words having an ambiguous relationship and providing the one or more word strings to a trainer for annotation or to an information extraction system previously trained with sufficient information to accurately annotate the one or more word strings;
automatically, using a fourth computer, updating the discriminative information extraction model based on annotations to the one or more word strings provided by the trainer or by the information system previously trained;
iteratively clustering words and word clusters such that the hierarchical cluster tree forms a binary tree.
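The clustering steps of claim 3 (counting word bigrams, aggregating words by those statistics, and merging iteratively until the tree is binary) can be illustrated with a small sketch. It assumes a simple stand-in criterion: each word is represented by its left and right bigram neighbor counts, and the two clusters with the highest cosine similarity are merged at each step. The claim does not specify the merge criterion, so this is not presented as the claimed method, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def bigram_counts(sentences):
    """Count adjacent word pairs (bigrams) across tokenized sentences."""
    counts = Counter()
    for sent in sentences:
        for left, right in zip(sent, sent[1:]):
            counts[(left, right)] += 1
    return counts


def context_vectors(counts):
    """Describe each word by the words seen to its left ("L") and right ("R")."""
    vecs = defaultdict(Counter)
    for (left, right), n in counts.items():
        vecs[left][("R", right)] += n
        vecs[right][("L", left)] += n
    return vecs


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_binary_tree(sentences):
    """Greedily merge the two most similar clusters until one tree remains.

    A leaf is a word (str); an internal node is a (left, right) tuple,
    so every merge adds exactly one binary node.
    """
    vecs = context_vectors(bigram_counts(sentences))
    if not vecs:
        raise ValueError("need at least one bigram to build a tree")
    clusters = [(word, Counter(vec)) for word, vec in sorted(vecs.items())]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]),
        )
        (tree_i, vec_i), (tree_j, vec_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), vec_i + vec_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


def leaves(tree):
    """List the words under a tree node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


# Toy unannotated corpus, invented for illustration.
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["a", "cat", "ran"],
    ["a", "dog", "ran"],
]
toy_tree = build_binary_tree(toy_corpus)
```

Because the most similar clusters are merged first, small groups near the leaves share usage closely and larger groups higher in the tree are less alike. In the toy corpus, "cat" and "dog" occur in identical contexts and end up as siblings.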
Specification