Learning and using generalized string patterns for information extraction

US 7,299,228 B2
Filed: 12/11/2003
Issued: 11/20/2007
Est. Priority Date: 12/11/2003
Status: Expired due to Fees

Abstract

The present invention relates to extracting information from an information source. During extraction, strings in the information source are accessed. These strings in the information source are matched with generalized extraction patterns that include words and wildcards. The wildcards denote that at least one word in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern.
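
The wildcard behaviour described in the abstract can be illustrated with a minimal sketch. The function name `matches`, the `*` wildcard symbol, and the optional `max_skip` bound are assumptions made for illustration; the abstract itself only says that a wildcard lets at least one word be skipped.

```python
def matches(pattern, words, max_skip=None):
    """Return True if the whole word list matches the pattern token list.

    A "*" token (hypothetical notation) skips at least one word and,
    if max_skip is given, at most max_skip words.
    """
    if not pattern:
        return not words
    head, rest = pattern[0], pattern[1:]
    if head == "*":
        limit = len(words) if max_skip is None else min(max_skip, len(words))
        # Try skipping 1, 2, ... words and match the remainder of the pattern.
        return any(matches(rest, words[k:], max_skip) for k in range(1, limit + 1))
    # A literal word must equal the next word of the string.
    return bool(words) and words[0] == head and matches(rest, words[1:], max_skip)


pattern = "released * version of".split()
print(matches(pattern, "released a new version of".split()))  # wildcard skips "a new"
print(matches(pattern, "released version of".split()))        # nothing to skip
```

In this sketch the second string does not match, because the wildcard must consume at least one word.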

16 Claims

1. A computer-implemented method of extracting information from an information source comprising a plurality of documents, comprising:
- generating generalized extraction patterns, wherein the generalized extraction patterns express elements of consecutive patterns containing a wildcard, wherein the consecutive patterns specify a number of words in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern;
  
  accessing strings of text in the information source;
  
  comparing the strings of text in the information source to the generalized extraction patterns and identifying a plurality of strings in the information source that match at least one generalized extraction pattern, the generalized extraction patterns including related elements pertaining to a subject, at least one word and at least one wildcard, wherein the at least one word and at least one wildcard are positioned between the related elements and wherein the at least one wildcard denotes that at least one word and up to the specified number of words in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern;
  
  extracting a first set of related elements of text pertaining to the subject from a first string of the plurality of strings based on the related elements pertaining to the subject in the at least one generalized extraction pattern, the first string being associated with a first document in the plurality of documents;
  
  extracting a second set of related elements of text pertaining to the subject from a second string of the plurality of strings based on the related elements in the at least one generalized extraction pattern, the second string being associated with a second document in the plurality of documents, wherein at least one of the related elements of text in the first set of related elements is different from each of the related elements of text in the second set of related elements of text;
  
  and outputting the first set of related elements and the second set of related elements.
- Dependent claims (2, 3, 7, 8, 9, 10, 11):
  - 2. The computer-implemented method of claim 1 and further comprising processing the first related set of elements and the second set of related elements to analyze data in the information source.
  - 3. The computer-implemented method of claim 1 wherein for at least one of the corresponding elements in each of the generalized extraction patterns, there is at least one word positioned between said at least one of the corresponding elements and the wildcards.
  - 7. The computer-implemented method of claim 1 wherein each of the elements of the first set of related elements of text are different from each of the elements of the second set of related elements of text.
  - 8. The computer-implemented method of claim 1 wherein the corresponding related set of elements refer to general elements pertaining to the subject and the first set of related elements and the second set of related elements refer to specific text associated with the general elements.
  - 9. The computer-implemented method of claim 8 wherein the corresponding related set of general elements include at least one of a company/product pair, a book title/author pair, an inventor/invention information pair and a question/answer pair.
  - 10. The computer-implemented method of claim 9 wherein the first set of related elements and the second set of related elements refer to at least one of a specific company, a specific product, a specific book title, a specific author, a specific inventor, a specific invention, a specific question and a specific answer.
  - 11. The computer-implemented method of claim 1 wherein the plurality of documents include at least one of a collection of documents, news articles and a collection of customer feedback.
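
The steps of claim 1 (generate patterns, access strings, compare, extract from strings in different documents, output) can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the pattern notation (`<company>` and `<product>` as one-word slots, `*` as the wildcard), the helper names `compile_pattern` and `extract`, and the sample documents are all hypothetical.

```python
import re


def compile_pattern(pattern, max_skip):
    """Translate a hypothetical pattern string into a regular expression.

    <name> is a one-word slot for a related element; "*" skips at least
    one word and at most max_skip words.
    """
    parts = []
    for token in pattern.split():
        if token == "*":
            parts.append(r"\S+(?:\s+\S+){0,%d}" % (max_skip - 1))
        elif token.startswith("<") and token.endswith(">"):
            parts.append(r"(?P<%s>\S+)" % token[1:-1])
        else:
            parts.append(re.escape(token))
    # Anchor the match on whitespace boundaries so slots hold whole words.
    return re.compile(r"(?<!\S)" + r"\s+".join(parts) + r"(?!\S)")


def extract(documents, patterns, max_skip=2):
    """Return (document id, extracted elements) for every matching string."""
    compiled = [compile_pattern(p, max_skip) for p in patterns]
    results = []
    for doc_id, text in documents.items():
        for regex in compiled:
            for m in regex.finditer(text):
                results.append((doc_id, m.groupdict()))
    return results


documents = {
    "doc1": "Acme released a new version of Rocket last week",
    "doc2": "Globex released another version of Widget today",
    "doc3": "Initech released version of Stapler",  # no word for "*" to skip
}
results = extract(documents, ["<company> released * version of <product>"])
for doc_id, elements in results:
    print(doc_id, elements)
```

Here one pattern yields two different company/product pairs from two different documents, while `doc3` is rejected because the wildcard requires at least one skipped word.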

4. A computer-readable storage medium for extracting information from an information source comprising a plurality of documents, comprising:
- a data structure including a set of generalized extraction patterns, wherein the generalized extraction patterns express elements of consecutive patterns containing a wildcard, wherein the consecutive patterns specify a number of words in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern, further, including related elements pertaining to a subject, at least one word and at least one wildcard, wherein the at least one word and at least one wildcard are positioned between the related elements and wherein the at least one wildcard denotes that the at least one word and up to the specified number of words in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern; and
  
  an extraction module using the set of generalized extraction patterns to match a first string and a second string in the information source with one of the generalized extraction patterns, the first string associated with a first document in the plurality of documents and the second string associated with a second document in the plurality of documents, extract a first set of related elements of text pertaining to the subject from the first string based on the related elements in said one of the generalized extraction patterns and a second set of related elements of text pertaining to the subject from the second string based on the related elements in said one of the generalized extraction patterns, wherein at least one of the related elements of text in the first set of related elements is different from each of the related elements of text in the second set of related elements of text, and output the first of related elements and the second set of related elements.
- Dependent claims (5, 6, 12, 13, 14, 15, 16):
  - 5. The computer-readable storage medium of claim 4 and further comprising a module adapted to process the first set of related elements of text and the second set of related elements of text.
  - 6. The computer-readable storage medium of claim 4 wherein for the generalized extraction patterns there is at least one word positioned between at least one of the elements and the indication.
  - 12. The computer-readable storage medium of claim 4 wherein each of the elements of the first set of related elements of text are different from each of the elements of the second set of related elements of text.
  - 13. The computer-readable storage medium of claim 4 wherein the corresponding related set of elements refer to general elements pertaining to the subject and the first set of related elements and the second set of related elements refer to specific text associated with the general elements.
  - 14. The computer-readable storage medium of claim 13 wherein the corresponding related set of general elements include at least one of a company/product pair, a book title/author pair, and inventor/invention pair and a question/answer pair.
  - 15. The computer-readable storage medium of claim 14 wherein the first set of related elements and the second set of related elements refer to at least one of a specific company, a specific product, a specific book title, a specific author, a specific inventor, a specific invention, a specific question and a specific answer.
  - 16. The computer-readable storage medium of claim 4 wherein the plurality of documents include at least one of a collection of documents, news articles and a collection of customer feedback.
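
Claim 4 pairs a data structure holding the generalized extraction patterns with an extraction module that applies one pattern to two strings. A minimal sketch of that arrangement is shown below; the class names `GeneralizedPattern` and `ExtractionModule`, the token notation (`<title>`, `<author>`, `*`), the one-word slots, and the sample strings are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneralizedPattern:
    tokens: tuple   # e.g. ("<title>", "was", "*", "written", "by", "<author>")
    max_skip: int   # upper bound on the words a "*" may skip

    def _bind(self, words, ti, wi):
        """Match tokens[ti:] against words starting at wi; return slot bindings or None."""
        if ti == len(self.tokens):
            return {}
        tok = self.tokens[ti]
        if tok == "*":
            # Skip at least one word and at most max_skip words.
            for k in range(1, self.max_skip + 1):
                if wi + k > len(words):
                    break
                found = self._bind(words, ti + 1, wi + k)
                if found is not None:
                    return found
            return None
        if wi >= len(words):
            return None
        if tok.startswith("<") and tok.endswith(">"):
            rest = self._bind(words, ti + 1, wi + 1)
            return None if rest is None else {tok[1:-1]: words[wi], **rest}
        if words[wi] != tok:
            return None
        return self._bind(words, ti + 1, wi + 1)

    def match(self, text):
        """Try the pattern at every word position of the string."""
        words = text.split()
        for start in range(len(words)):
            found = self._bind(words, 0, start)
            if found is not None:
                return found
        return None


@dataclass
class ExtractionModule:
    patterns: list

    def extract(self, first, second):
        """Return the element sets from both strings for the first pattern matching both."""
        for pattern in self.patterns:
            a, b = pattern.match(first), pattern.match(second)
            if a is not None and b is not None:
                return a, b
        return None


module = ExtractionModule([
    GeneralizedPattern(("<title>", "was", "*", "written", "by", "<author>"), max_skip=2),
])
pair = module.extract(
    "Dune was originally written by Herbert",
    "Critics say Emma was first anonymously written by Austen",
)
print(pair)
```

Keeping the skip bound on the pattern object, rather than as a global setting, is one possible design choice; it lets each stored pattern carry its own limit.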

Current Assignee: ServiceNow Incorporated
Original Assignee: Microsoft Corporation
Inventors: Li, Hang; Cao, Yunbo
Primary Examiner(s): Ly; Cheyne D.
Application Number: US10/733,541
Publication Number: US 20050131896A1
Time in Patent Office: 1,440 Days
Field of Search: 707/1, 707/6, 707/10, 709/310, 709/100
US Class Current: 1/1
CPC Class Codes:
G06F 16/30 of unstructured textual dat...
Y10S 707/99936 Pattern matching access
