Attribute-value extraction from structured documents
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for attribute-value extraction from structured documents. In one aspect, a method includes obtaining an initial attribute whitelist, extracting candidate attributes from a first collection of documents, and grouping the candidate attributes. The method further includes calculating a score for each unique attribute in the candidate attributes, generating an expanded attribute whitelist including the initial attributes and each unique attribute having a score that satisfies a threshold, and using the expanded attribute whitelist to identify valid attribute-value pairs. In another aspect, a method includes extracting candidate attribute-value pairs from a collection of documents and identifying one or more features for each candidate attribute-value pair. The method further includes filtering out invalid attribute-value pairs.
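The whitelist-expansion aspect summarized above can be illustrated with a minimal Python sketch. It assumes that the score is the raw count of groups in which an attribute co-occurs with any attribute on the initial whitelist, and that "satisfies a threshold" means greater than or equal to it; neither detail is fixed by the abstract. The names `expand_whitelist`, `valid_pairs`, and the sample data are hypothetical.

```python
from collections import defaultdict


def expand_whitelist(groups, initial_whitelist, threshold):
    """Return (expanded whitelist, scores) for a list of attribute groups.

    groups: iterable of collections of candidate attributes, one per
            (document, structure) group.
    Assumption: score = number of groups that contain the attribute and
    at least one attribute from the initial whitelist.
    """
    whitelist = set(initial_whitelist)
    scores = defaultdict(int)
    for group in groups:
        attrs = set(group)  # count each unique attribute once per group
        if attrs & whitelist:
            for attr in attrs:
                scores[attr] += 1
    # Assumption: "satisfies a threshold" is read as score >= threshold.
    expanded = whitelist | {a for a, s in scores.items() if s >= threshold}
    return expanded, dict(scores)


def valid_pairs(pairs, whitelist):
    """Keep only the pairings whose attribute is on the whitelist."""
    return [(attr, value) for attr, value in pairs if attr in whitelist]


# Hypothetical groups of candidate attributes.
groups = [
    {"weight", "color", "battery life"},
    {"weight", "battery life", "price"},
    {"color", "share this page"},
    {"contact us", "share this page"},
]
expanded, scores = expand_whitelist(groups, {"weight", "color"}, threshold=2)
print(sorted(expanded))  # ['battery life', 'color', 'weight']
```

In this sample, "battery life" appears in two groups alongside a whitelisted attribute and is admitted, while "price" and "share this page" each score 1 and are left out.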
27 Claims
1. A computer-implemented method, comprising:
obtaining an initial attribute whitelist, the initial attribute whitelist including one or more initial attributes;
processing a first collection of documents, wherein each of the documents has content to be displayed and an underlying structure that defines how the content is to be displayed, to identify a plurality of pairings of candidate attributes with candidate values in the documents, wherein each candidate attribute and each candidate value is content found in the content to be displayed;
grouping the candidate attributes into a plurality of groups according to both a particular document in the first collection in which each candidate attribute was identified and the underlying structure in the particular document in the first collection in which each candidate attribute was identified;
calculating a score for each unique attribute in the candidate attributes, where the score reflects a number of groups containing both the unique attribute and an attribute on the initial attribute whitelist;
generating an expanded attribute whitelist, the expanded attribute whitelist including the initial attributes and each unique attribute having a respective score that satisfies a threshold; and
using the expanded attribute whitelist to identify valid pairings of candidate attributes with candidate values.
Dependent claims: 2, 3, 4, 5
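As a rough illustration of the processing and grouping steps of claim 1, the sketch below assumes the documents are HTML, that the "underlying structure" is a `<table>` element, and that a two-cell row is a candidate attribute-value pairing. These are illustrative assumptions only; the claim does not limit the structure to tables. The class `TablePairParser`, the function `extract_groups`, and the sample document are hypothetical. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TablePairParser(HTMLParser):
    """Collect (attribute, value) pairs from two-cell rows, per table."""

    def __init__(self):
        super().__init__()
        self.tables = []    # one list of pairs per <table> seen
        self._cells = None  # cell texts of the row being read
        self._text = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr":
            self._cells = []
        elif tag in ("td", "th"):
            self._text = []

    def handle_data(self, data):
        # Only displayed text inside a cell is kept.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            if self._cells is not None:
                self._cells.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._cells is not None:
            # Assumption: a row with exactly two non-empty cells is a pairing.
            if len(self._cells) == 2 and all(self._cells) and self.tables:
                self.tables[-1].append(tuple(self._cells))
            self._cells = None


def extract_groups(documents):
    """Map (document id, table index) to the pairings found in that table."""
    groups = {}
    for doc_id, html in documents.items():
        parser = TablePairParser()
        parser.feed(html)
        parser.close()
        for index, pairs in enumerate(parser.tables):
            if pairs:
                groups[(doc_id, index)] = pairs
    return groups


# Hypothetical document: a specification table followed by a navigation table.
documents = {
    "camera.html": (
        "<table><tr><th>Weight</th><td>450 g</td></tr>"
        "<tr><td>Color</td><td>Black</td></tr></table>"
        "<table><tr><td>Home</td><td>About</td><td>Contact</td></tr></table>"
    )
}
groups = extract_groups(documents)
print(groups)  # {('camera.html', 0): [('Weight', '450 g'), ('Color', 'Black')]}
```

Keying each group by both the document and the structure keeps attributes from unrelated parts of the same page apart, which is what lets co-occurrence with whitelisted attributes carry signal in the scoring step.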
6. A computer storage device having a computer program stored thereon, the computer program comprising instructions that when executed cause one or more computers to perform actions comprising:
obtaining an initial attribute whitelist, the initial attribute whitelist including one or more initial attributes;
processing a first collection of documents, wherein each of the documents has content to be displayed and an underlying structure that defines how the content is to be displayed, to identify a plurality of pairings of candidate attributes with candidate values in the documents, wherein each candidate attribute and each candidate value is content found in the content to be displayed;
grouping the candidate attributes into a plurality of groups according to both a particular document in the first collection in which each candidate attribute was identified and the underlying structure in the particular document in the first collection in which each candidate attribute was identified;
calculating a score for each unique attribute in the candidate attributes, where the score reflects a number of groups containing both the unique attribute and an attribute on the initial attribute whitelist;
generating an expanded attribute whitelist, the expanded attribute whitelist including the initial attributes and each unique attribute having a respective score that satisfies a threshold; and
using the expanded attribute whitelist to identify valid pairings of candidate attributes with candidate values.
Dependent claims: 7, 8, 9, 10
11. A system comprising:
one or more computers configured to perform operations comprising:
obtaining an initial attribute whitelist, the initial attribute whitelist including one or more initial attributes;
processing a first collection of documents, wherein each of the documents has content to be displayed and an underlying structure that defines how the content is to be displayed, to identify a plurality of pairings of candidate attributes with candidate values in the documents, wherein each candidate attribute and each candidate value is content found in the content to be displayed;
grouping the candidate attributes into a plurality of groups according to both a particular document in the first collection in which each candidate attribute was identified and the underlying structure in the particular document in the first collection in which each candidate attribute was identified;
calculating a score for each unique attribute in the candidate attributes, where the score reflects a number of groups containing both the unique attribute and an attribute on the initial attribute whitelist;
generating an expanded attribute whitelist, the expanded attribute whitelist including the initial attributes and each unique attribute having a respective score that satisfies a threshold; and
using the expanded attribute whitelist to identify valid pairings of candidate attributes with candidate values.
Dependent claims: 12, 13, 14, 15
16. A computer-implemented method, comprising:
obtaining an initial attribute whitelist, the initial attribute whitelist including one or more initial attributes;
processing a first collection of structured documents to identify structures in the structured documents;
identifying a plurality of candidate attributes in the first collection of structured documents based on the identified structures in the structured documents in the first collection;
grouping the candidate attributes into a plurality of groups according to both a particular structured document in the first collection from which each candidate attribute was extracted and a structure in the particular structured document in the first collection;
calculating a score for each unique attribute in the candidate attributes, where the score reflects a number of groups containing the unique attribute and an attribute on the initial attribute whitelist, wherein the score for each unique attribute is calculated according to;
Dependent claims: 17, 18, 19
20. A computer storage device having a computer program stored thereon, the computer program comprising instructions that when executed cause one or more computers to perform actions comprising:
obtaining an initial attribute whitelist, the initial attribute whitelist including one or more initial attributes;
processing a first collection of structured documents to identify structures in the structured documents;
identifying a plurality of candidate attributes in the first collection of structured documents based on the identified structures in the structured documents in the first collection;
grouping the candidate attributes into a plurality of groups according to both a particular structured document in the first collection from which each candidate attribute was extracted and a structure in the particular structured document;
calculating a score for each unique attribute in the candidate attributes, where the score reflects a number of groups containing the unique attribute and an attribute on the initial attribute whitelist, wherein the score for each unique attribute is calculated according to;
Dependent claims: 21, 22, 23
24. A system comprising:
one or more computers configured to perform operations comprising:
obtaining an initial attribute whitelist, the initial attribute whitelist including one or more initial attributes;
processing a first collection of structured documents to identify structures in the structured documents;
identifying a plurality of candidate attributes in the first collection of structured documents based on the identified structures in the structured documents in the first collection;
grouping the candidate attributes into a plurality of groups according to both a particular structured document in the first collection from which each candidate attribute was extracted and a structure in that structured document in the first collection;
calculating a score for each unique attribute in the candidate attributes, where the score reflects a number of groups containing the unique attribute and an attribute on the initial attribute whitelist;
wherein the score for each unique attribute is calculated according to;
Dependent claims: 25, 26, 27
Specification