Learning and using generalized string patterns for information extraction

  • US 7,299,228 B2
  • Filed: 12/11/2003
  • Issued: 11/20/2007
  • Est. Priority Date: 12/11/2003
  • Status: Expired due to Fees
First Claim

1. A computer-implemented method of extracting information from an information source comprising a plurality of documents, comprising:

    generating generalized extraction patterns, wherein the generalized extraction patterns express elements of consecutive patterns containing a wildcard, wherein the consecutive patterns specify a number of words in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern;

    accessing strings of text in the information source;

    comparing the strings of text in the information source to the generalized extraction patterns and identifying a plurality of strings in the information source that match at least one generalized extraction pattern, the generalized extraction patterns including related elements pertaining to a subject, at least one word and at least one wildcard, wherein the at least one word and at least one wildcard are positioned between the related elements and wherein the at least one wildcard denotes that at least one word and up to the specified number of words in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern;

    extracting a first set of related elements of text pertaining to the subject from a first string of the plurality of strings based on the related elements pertaining to the subject in the at least one generalized extraction pattern, the first string being associated with a first document in the plurality of documents;

    extracting a second set of related elements of text pertaining to the subject from a second string of the plurality of strings based on the related elements in the at least one generalized extraction pattern, the second string being associated with a second document in the plurality of documents, wherein at least one of the related elements of text in the first set of related elements is different from each of the related elements of text in the second set of related elements of text;

    and outputting the first set of related elements and the second set of related elements.
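
The matching and extraction steps recited above can be illustrated with a minimal sketch. It is not the patented implementation. It assumes a hypothetical pattern notation in which `<name>` marks a related element to extract, `*` marks a wildcard that skips at least one and at most `max_skip` words, and every other token is a literal word. For simplicity each related element is a single word, and the company and product names are invented.

```python
import re

def compile_pattern(tokens, max_skip):
    """Turn a token list into a regex (hypothetical notation, see lead-in)."""
    if max_skip < 1:
        raise ValueError("max_skip must be at least 1")
    parts = []
    for tok in tokens:
        if tok.startswith("<") and tok.endswith(">"):
            # Related element: capture one word under the slot name.
            parts.append(r"(?P<%s>\w+)" % tok[1:-1])
        elif tok == "*":
            # Wildcard: skip between 1 and max_skip words.
            parts.append(r"\w+(?:\s+\w+){0,%d}" % (max_skip - 1))
        else:
            # Literal word that must appear as written.
            parts.append(re.escape(tok))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b")

def extract(documents, tokens, max_skip):
    """Return (doc_id, related_elements) for every string that matches."""
    regex = compile_pattern(tokens, max_skip)
    results = []
    for doc_id, text in documents.items():
        match = regex.search(text)
        if match:
            results.append((doc_id, match.groupdict()))
    return results

# Invented example strings, one per document.
documents = {
    "doc1": "Acme announced the release of Rocket",
    "doc2": "Globex announced immediate general availability of Widget",
    "doc3": "Initech announced of Gadget",  # nothing to skip
    "doc4": "Hooli announced the long awaited public release of Phone",  # 5 words
}
pattern = ["<company>", "announced", "*", "of", "<product>"]
found = extract(documents, pattern, max_skip=3)
for doc_id, elements in found:
    print(doc_id, elements)
```

With `max_skip=3`, the strings in `doc1` and `doc2` match and yield different sets of related elements from different documents. `doc3` fails because the wildcard must skip at least one word, and `doc4` fails because five words sit between the literal words.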

  • 5 Assignments