Learning and using generalized string patterns for information extraction
Abstract
The present invention relates to extracting information from an information source. During extraction, strings in the information source are accessed. These strings in the information source are matched with generalized extraction patterns that include words and wildcards. The wildcards denote that at least one word in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern.
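The wildcard matching described in the abstract can be illustrated with a minimal sketch. It assumes a pattern is a list of literal words and wildcard markers, that a wildcard must skip at least one word, and that each wildcard carries an upper bound on skipped words (the bound comes from the claims, not the abstract). The names `Wildcard` and `matches` are hypothetical and do not appear in the patent.

```python
from typing import List, Union


class Wildcard:
    """Hypothetical marker: skip at least 1 and at most max_skip words."""

    def __init__(self, max_skip: int) -> None:
        if max_skip < 1:
            raise ValueError("max_skip must be at least 1")
        self.max_skip = max_skip


Element = Union[str, Wildcard]


def matches(pattern: List[Element], words: List[str]) -> bool:
    """Return True if the whole word list matches the whole pattern."""
    if not pattern:
        # An empty pattern matches only an empty word list.
        return not words
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, Wildcard):
        # Try skipping 1, 2, ..., max_skip words in the string.
        for k in range(1, head.max_skip + 1):
            if k <= len(words) and matches(rest, words[k:]):
                return True
        return False
    # A literal word must equal the next word in the string.
    return bool(words) and words[0] == head and matches(rest, words[1:])


# Illustrative pattern (an assumption, not taken from the patent text).
pattern = ["is", "a", Wildcard(2), "product", "of"]
one_skipped = matches(pattern, "is a new product of".split())
none_skipped = matches(pattern, "is a product of".split())
three_skipped = matches(pattern, "is a very brand new product of".split())
```

Under these assumptions the first string matches because one word is skipped, the second fails because the wildcard cannot match zero words, and the third fails because three skipped words exceed the bound of two.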
16 Claims
1. A computer-implemented method of extracting information from an information source comprising a plurality of documents, comprising:
generating generalized extraction patterns, wherein the generalized extraction patterns express elements of consecutive patterns containing a wildcard, wherein the consecutive patterns specify a number of words in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern;
accessing strings of text in the information source;
comparing the strings of text in the information source to the generalized extraction patterns and identifying a plurality of strings in the information source that match at least one generalized extraction pattern, the generalized extraction patterns including related elements pertaining to a subject, at least one word and at least one wildcard, wherein the at least one word and at least one wildcard are positioned between the related elements and wherein the at least one wildcard denotes that at least one word and up to the specified number of words in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern;
extracting a first set of related elements of text pertaining to the subject from a first string of the plurality of strings based on the related elements pertaining to the subject in the at least one generalized extraction pattern, the first string being associated with a first document in the plurality of documents;
extracting a second set of related elements of text pertaining to the subject from a second string of the plurality of strings based on the related elements in the at least one generalized extraction pattern, the second string being associated with a second document in the plurality of documents, wherein at least one of the related elements of text in the first set of related elements is different from each of the related elements of text in the second set of related elements of text;
and outputting the first set of related elements and the second set of related elements.
- View Dependent Claims (2, 3, 7, 8, 9, 10, 11)
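The steps of claim 1 can be sketched end to end under some loudly stated assumptions: a related element ("slot") is approximated by a single capitalized token, which stands in for whatever entity recognition an actual system would use; the pattern is compiled to a regular expression; and the documents, sentences, and names (`Slot`, `Skip`, `compile_pattern`, `extract`) are all hypothetical. This is one possible reading of the claim, not the patented implementation.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Slot:
    """Hypothetical related element to extract, e.g. a company name."""
    name: str


@dataclass(frozen=True)
class Skip:
    """Hypothetical wildcard: skip between 1 and max_words words."""
    max_words: int


Element = Union[str, Slot, Skip]


def compile_pattern(elements: List[Element]):
    """Translate a generalized pattern into a compiled regular expression."""
    parts = []
    for el in elements:
        if isinstance(el, Slot):
            # Assumption: a slot is one capitalized token.
            parts.append(rf"(?P<{el.name}>[A-Z]\w*)")
        elif isinstance(el, Skip):
            # One mandatory word plus up to max_words - 1 further words.
            parts.append(rf"\S+(?: \S+){{0,{el.max_words - 1}}}")
        else:
            parts.append(re.escape(el))
    return re.compile(r"\b" + " ".join(parts) + r"\b")


def extract(documents: Dict[str, List[str]],
            elements: List[Element]) -> List[Tuple[str, Dict[str, str]]]:
    """Return (document id, extracted slots) for every matching string."""
    regex = compile_pattern(elements)
    results = []
    for doc_id, strings in documents.items():
        for text in strings:
            match = regex.search(text)
            if match:
                results.append((doc_id, match.groupdict()))
    return results


# Made-up documents; a word and a wildcard sit between the two slots.
documents = {
    "doc1": ["Acme released the new Rocket today"],
    "doc2": ["Globex released its Widget yesterday",
             "Initech released Stapler"],
}
pattern = [Slot("company"), "released", Skip(2), Slot("product")]
extracted = extract(documents, pattern)
```

In this sketch the same pattern yields a different set of related elements from each document, while the third sentence is rejected because no word is available for the wildcard to skip.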
4. A computer-readable storage medium for extracting information from an information source comprising a plurality of documents, comprising:
a data structure including a set of generalized extraction patterns, wherein the generalized extraction patterns express elements of consecutive patterns containing a wildcard, wherein the consecutive patterns specify a number of words in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern, further, including related elements pertaining to a subject, at least one word and at least one wildcard, wherein the at least one word and at least one wildcard are positioned between the related elements and wherein the at least one wildcard denotes that the at least one word and up to the specified number of words in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern;
and an extraction module using the set of generalized extraction patterns to match a first string and a second string in the information source with one of the generalized extraction patterns, the first string associated with a first document in the plurality of documents and the second string associated with a second document in the plurality of documents, extract a first set of related elements of text pertaining to the subject from the first string based on the related elements in said one of the generalized extraction patterns and a second set of related elements of text pertaining to the subject from the second string based on the related elements in said one of the generalized extraction patterns, wherein at least one of the related elements of text in the first set of related elements is different from each of the related elements of text in the second set of related elements of text, and output the first of related elements and the second set of related elements.
- View Dependent Claims (5, 6, 12, 13, 14, 15, 16)
Specification