Attribute-value extraction from structured documents

  • US 8,645,391 B1
  • Filed: 07/02/2009
  • Issued: 02/04/2014
  • Est. Priority Date: 07/03/2008
  • Status: Active Grant
First Claim

1. A computer-implemented method, comprising:

    obtaining an initial attribute whitelist, the initial attribute whitelist including one or more initial attributes;

    processing a first collection of documents, wherein each of the documents has content to be displayed and an underlying structure that defines how the content is to be displayed, to identify a plurality of pairings of candidate attributes with candidate values in the documents, wherein each candidate attribute and each candidate value is content found in the content to be displayed;

    grouping the candidate attributes into a plurality of groups according to both a particular document in the first collection in which each candidate attribute was identified and the underlying structure in the particular document in the first collection in which each candidate attribute was identified;

    calculating a score for each unique attribute in the candidate attributes, where the score reflects a number of groups containing both the unique attribute and an attribute on the initial attribute whitelist;

    generating an expanded attribute whitelist, the expanded attribute whitelist including the initial attributes and each unique attribute having a respective score that satisfies a threshold; and

    using the expanded attribute whitelist to identify valid pairings of candidate attributes with candidate values.
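The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, only one plausible reading under stated assumptions: the candidate pairings have already been extracted and arrive as hypothetical `(doc_id, structure_id, attribute, value)` tuples; a group is keyed by the document together with the structural element (for example, one table) in which the attribute was found; an attribute earns credit for a group only when a different whitelisted attribute is also in that group; and "satisfies a threshold" is taken to mean a score greater than or equal to the threshold. All function names and sample data are invented for illustration.

```python
from collections import defaultdict


def group_attributes(pairings):
    """Group candidate attributes by (document, structural element)."""
    groups = defaultdict(set)
    for doc_id, structure_id, attribute, _value in pairings:
        groups[(doc_id, structure_id)].add(attribute)
    return groups


def score_attributes(groups, initial_whitelist):
    """Count, per attribute, the groups it shares with a whitelisted attribute.

    Assumption: the co-occurring whitelisted attribute must differ from the
    attribute being scored.
    """
    scores = defaultdict(int)
    for attrs in groups.values():
        for attr in attrs:
            if (attrs - {attr}) & initial_whitelist:
                scores[attr] += 1
    return scores


def expand_whitelist(pairings, initial_whitelist, threshold):
    """Return the initial attributes plus every attribute scoring >= threshold."""
    groups = group_attributes(pairings)
    scores = score_attributes(groups, initial_whitelist)
    promoted = {attr for attr, score in scores.items() if score >= threshold}
    return set(initial_whitelist) | promoted


def valid_pairings(pairings, whitelist):
    """Keep only the pairings whose attribute is on the whitelist."""
    return [p for p in pairings if p[2] in whitelist]


# Hypothetical sample data: (doc_id, structure_id, attribute, value).
pairings = [
    ("doc1", "table0", "height", "6 ft"),
    ("doc1", "table0", "weight", "180 lb"),
    ("doc2", "table0", "height", "5 ft"),
    ("doc2", "table0", "weight", "150 lb"),
    ("doc2", "table1", "click here", "more"),
    ("doc3", "list0", "height", "7 ft"),
    ("doc3", "list0", "born", "1970"),
]
initial = {"height"}

expanded = expand_whitelist(pairings, initial, threshold=2)
valid = valid_pairings(pairings, expanded)
print(sorted(expanded))  # ['height', 'weight']
```

In this toy data, "weight" appears alongside the seed attribute "height" in two groups and is promoted, "born" co-occurs only once and stays out, and "click here" never shares a group with a whitelisted attribute, so its pairing is filtered out.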

  • 2 Assignments