Identifying product references in user-generated content
First Claim
1. A method for product extraction, the method comprising:
- receiving, by a computer system, a document;
- identifying, by the computer system, a product type for the document according to content of the document;
- extracting, by the computer system, product attributes and attribute values from the document;
- retrieving, by the computer system, an attribute set corresponding to the product type from a database;
- identifying, by the computer system, a first set of products that have at least the product attributes and the attribute values of the document that are included in the attribute set, the first set of products being nodes in a hierarchical taxonomy;
- filtering, by the computer system, the first set of products by:
  - identifying a common ancestor node in the hierarchical taxonomy having all of the first set of products as descendants;
  - identifying immediate child nodes of the common ancestor node;
  - identifying a majority child node having a major portion of the first set of products as descendants; and
  - identifying a second set of products including a portion of the first set of products that are descendants of the majority child node and excluding those products of the first set of products that are not descendants of the majority child node;
- selecting, by the computer system, an inferred product for the document from the second set of products;
- wherein:
  - identifying the second set of products comprises:
    - calculating a score for each product in the first set of products; and
    - selecting the second set of products based at least in part on the calculated scores for the first set of products;
  - selecting the second set of products comprises:
    - removing products from the first set of products if application of a blacklist rule to the document so indicates; and
  - selecting the inferred product comprises:
    - selecting the inferred product as specified by a whitelist rule if application of the whitelist rule to the document so indicates; and
  - at least one of the blacklist rule and the whitelist rule take as an input a list of keywords from the document.
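The taxonomy filtering step of claim 1 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `PARENT` child-to-parent map, the node names, and the reading of "major portion" as the child branch holding the most candidates (ties go to the branch seen first).

```python
from collections import Counter

# Hypothetical taxonomy as a child -> parent map; None marks the root.
PARENT = {
    "electronics": None,
    "phones": "electronics",
    "cameras": "electronics",
    "phone_a": "phones",
    "phone_b": "phones",
    "camera_x": "cameras",
}

def path_to_root(node, parent):
    """Return the path [root, ..., node] for a taxonomy node."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

def common_ancestor(ancestor_paths):
    """Deepest node shared by every ancestor path, or None if there is none."""
    ancestor = None
    for level in zip(*ancestor_paths):
        if len(set(level)) != 1:
            break
        ancestor = level[0]
    return ancestor

def majority_child_filter(products, parent):
    """Keep only products under the most populated child of the common ancestor."""
    paths = {p: path_to_root(p, parent) for p in products}
    # Drop each product from its own path so the ancestor is a proper ancestor.
    ancestor = common_ancestor([path[:-1] for path in paths.values()])
    if ancestor is None:
        return list(products)
    # For each product, the immediate child of the ancestor that leads to it.
    branch = {p: path[path.index(ancestor) + 1] for p, path in paths.items()}
    # Assumption: "major portion" is read as the largest branch.
    majority_child, _ = Counter(branch.values()).most_common(1)[0]
    return [p for p in products if branch[p] == majority_child]

first_set = ["phone_a", "phone_b", "camera_x"]
second_set = majority_child_filter(first_set, PARENT)
print(second_set)
```

In this toy taxonomy the common ancestor is `electronics`, its child `phones` covers two of the three candidates, and `camera_x` is excluded from the second set.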
Abstract
Systems and methods are disclosed herein for extracting products referenced in a document. A document is analyzed to identify a product type that is referenced in the document. Attributes are extracted from the document. A set of candidate products corresponding to the extracted attributes is identified. A score is calculated for the candidate products, and the products are further selected or filtered based on the score, whitelist rules, and blacklist rules in order to identify one or more inferred products referenced by the document. The whitelist and blacklist rules may take as inputs a domain, a user identifier, and keywords included in the document. A set of sufficient attributes may be identified for each product type. Selection of a candidate product may be based at least in part on the document including all of the attributes in the set of sufficient attributes.
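As a rough illustration of how whitelist and blacklist rules might combine with scores, the sketch below models a rule as a set of optional conditions on the domain, user identifier, and keywords named in the abstract. The `Document` and `Rule` classes, the `infer_product` function, the precedence of whitelist rules over scores, and all sample values are hypothetical choices made for this example, not details taken from the patent.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Document:
    domain: str
    user_id: str
    keywords: frozenset

@dataclass(frozen=True)
class Rule:
    product: str                      # product the rule blocks or forces
    keywords: frozenset = frozenset() # all must appear in the document
    domain: Optional[str] = None      # None means "any domain"
    user_id: Optional[str] = None     # None means "any user"

    def matches(self, doc: Document) -> bool:
        """True when every condition the rule sets is met by the document."""
        return (
            self.keywords <= doc.keywords
            and self.domain in (None, doc.domain)
            and self.user_id in (None, doc.user_id)
        )

def infer_product(doc, scores, blacklist, whitelist):
    """Pick one product from scored candidates, honoring the rules.

    Assumption: a matching whitelist rule decides the result outright;
    otherwise blacklisted candidates are dropped and the top score wins.
    """
    for rule in whitelist:
        if rule.matches(doc):
            return rule.product
    blocked = {rule.product for rule in blacklist if rule.matches(doc)}
    remaining = {p: s for p, s in scores.items() if p not in blocked}
    return max(remaining, key=remaining.get) if remaining else None

doc = Document("forum.example", "u1", frozenset({"phone", "case"}))
scores = {"phone_a": 0.9, "phone_b": 0.5}
blacklist = [Rule(product="phone_a", keywords=frozenset({"case"}))]
print(infer_product(doc, scores, blacklist, whitelist=[]))
```

Here the keyword "case" triggers the blacklist rule against `phone_a`, so the lower-scored `phone_b` is returned. A rule with no conditions set matches every document, which a real rule store would probably want to reject.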
14 Claims
-
8. A system comprising:
-
- one or more processors, the one or more processors embodied as one or more processing devices; and
- one or more non-transitory storage modules storing executable and operational data effective to cause the one or more processors to:
  - receive a document;
  - identify a product type for the document according to content of the document;
  - extract product attributes and attribute values from the document;
  - retrieve an attribute set corresponding to the product type from a database;
  - identify a first set of products that have at least the product attributes and the attribute values of the document that are included in the attribute set, the first set of products being nodes in a hierarchical taxonomy;
  - filter the first set of products by:
    - identifying a common ancestor node in the hierarchical taxonomy having all of the first set of products as descendants;
    - identifying immediate child nodes of the common ancestor node;
    - identifying a majority child node having a major portion of the first set of products as descendants; and
    - identifying a second set of products including a portion of the first set of products that are descendants of the majority child node and excluding those products of the first set of products that are not descendants of the majority child node;
  - select an inferred product for the document from the second set of products;
- wherein:
  - the executable and operational data are further effective to cause the one or more processors to identify the second set of products by:
    - calculating a score for each product in the first set of products; and
    - selecting the second set of products based at least in part on the calculated scores for the first set of products;
  - the executable and operational data are further effective to cause the one or more processors to select the second set of products by:
    - removing products from the first set of products if application of a blacklist rule to the document so indicates; and
  - wherein selecting the inferred product comprises selecting the inferred product as specified by a whitelist rule if application of the whitelist rule to the document so indicates; and
  - at least one of the blacklist rule and the whitelist rule take as an input a list of keywords from the document.
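Both independent claims recite identifying a first set of products that have at least those document attributes and values that are included in the attribute set for the product type. A minimal sketch of that matching step follows. The `identify_first_set` function, the in-memory `CATALOG`, and the attribute names are hypothetical, and exact equality of values is an assumed matching criterion.

```python
# Hypothetical catalog: product name -> attribute/value pairs.
CATALOG = {
    "phone_a": {"brand": "acme", "storage": "64GB", "color": "black"},
    "phone_b": {"brand": "acme", "storage": "128GB", "color": "black"},
    "phone_c": {"brand": "zenith", "storage": "64GB", "color": "white"},
}

def identify_first_set(doc_attrs, attribute_set, catalog):
    """Return products carrying every document attribute in the attribute set.

    Document attributes outside the attribute set for the product type are
    ignored, so they can neither match nor disqualify a product.
    """
    required = {k: v for k, v in doc_attrs.items() if k in attribute_set}
    return [
        name
        for name, attrs in catalog.items()
        if all(attrs.get(k) == v for k, v in required.items())
    ]

# "mood" is extracted from the document but is not in the attribute set.
doc_attrs = {"brand": "acme", "color": "black", "mood": "happy"}
attribute_set = {"brand", "storage", "color"}
print(identify_first_set(doc_attrs, attribute_set, CATALOG))
```

With these sample values the document constrains only `brand` and `color`, so both `phone_a` and `phone_b` remain candidates. If no extracted attribute falls inside the attribute set, every product in the catalog matches, which is one reason a later filtering step is useful.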
-
Specification