Extraction of attributes and values from natural language documents
Abstract
One or more classification algorithms are applied to at least one natural language document in order to extract both attributes and values of a given product. Supervised classification algorithms, semi-supervised classification algorithms, unsupervised classification algorithms or combinations of such classification algorithms may be employed for this purpose. The at least one natural language document may be obtained via a public communication network. Two or more attributes (or two or more values) thus identified may be merged to form one or more attribute phrases or value phrases. Once attributes and values have been extracted in this manner, association or linking operations may be performed to establish attribute-value pairs that are descriptive of the product. In a presently preferred embodiment, an (unsupervised) algorithm is used to generate seed attributes and values which can then support a supervised or semi-supervised classification algorithm.
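The merging step described above can be illustrated with a minimal sketch. It assumes tokens have already been labeled as attributes or values, and it uses the Dice coefficient over adjacent-token counts as the correlation measure. That measure, the `ATTR`/`VALUE` label names, and the function names are illustrative assumptions, not taken from the patent.

```python
from collections import Counter

def dice_scores(sentences):
    """Correlation for each adjacent token pair, computed as the Dice
    coefficient 2*c(a,b) / (c(a) + c(b)). The choice of measure is an
    assumption made for this sketch."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return {(a, b): 2 * n / (unigrams[a] + unigrams[b])
            for (a, b), n in bigrams.items()}

def merge_labeled(tokens, labels, scores, threshold):
    """Merge adjacent tokens that share an ATTR or VALUE label and whose
    correlation is above the threshold, forming attribute or value phrases."""
    phrases = []  # list of (token_list, label)
    for tok, lab in zip(tokens, labels):
        if phrases:
            prev_tokens, prev_lab = phrases[-1]
            same_kind = lab == prev_lab and lab in ("ATTR", "VALUE")
            if same_kind and scores.get((prev_tokens[-1], tok), 0.0) > threshold:
                prev_tokens.append(tok)
                continue
        phrases.append(([tok], lab))
    return [(" ".join(toks), lab) for toks, lab in phrases]

# Hypothetical toy corpus and labels, for illustration only.
corpus = [
    ["stainless", "steel", "blade"],
    ["stainless", "steel", "handle"],
    ["steel", "blade"],
]
scores = dice_scores(corpus)
merged = merge_labeled(["stainless", "steel", "blade"],
                       ["VALUE", "VALUE", "ATTR"], scores, threshold=0.6)
print(merged)
```

With a higher threshold the two value tokens would stay separate, so the threshold controls how readily single-word labels are combined into phrases.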
34 Claims
1. A method for extracting at least one attribute and at least one value for a product based on at least one natural language document comprising unlabeled data, the method comprising:
labeling, by a computer, at least a portion of the unlabeled data as the at least one attribute for the product via at least one classification algorithm operating upon the at least one natural language document;
labeling at least another portion of the unlabeled data as the at least one value for the product via the at least one classification algorithm operating upon the at least one natural language document;
for at least two attributes of the at least one attribute, calculating correlation values between each of the at least two attributes;
for at least two values of the at least one value, calculating correlation values between each of the at least two values;
merging attributes of the at least two attributes having correlation values above a correlation threshold;
merging values of the at least two values having correlation values above the correlation threshold; and
storing the at least one attribute and the at least one value.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. An apparatus for extracting at least one attribute and at least one value for a product based on at least one natural language document comprising unlabeled data, comprising:
a classification algorithm module, implemented by a processor executing instructions stored by a storage device, operative to label at least a portion of the unlabeled data as the at least one attribute via at least one classification algorithm operating upon the at least one natural language document and to label at least another portion of the unlabeled data as the at least one value for the product via the at least one classification algorithm operating upon the at least one natural language document;
a linking module, implemented by the processor executing instructions stored by the storage device, operative to calculate correlation values between each of at least two attributes of the at least one attribute, calculate correlation values between each of at least two values of the at least one value, merge attributes of the at least two attributes having correlation values above a correlation threshold and merge values of the at least two values having correlation values above the correlation threshold; and
a machine readable store operative to store the at least one attribute and the at least one value.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A non-transitory computer-readable medium having stored thereon executable instructions that, when executed, cause a computer to:
label at least a portion of unlabeled data in at least one natural language document as at least one attribute for a product via at least one classification algorithm operating upon the at least one natural language document;
label at least another portion of the unlabeled data as at least one value for the product via the at least one classification algorithm operating upon at least one natural language document;
for at least two attributes of the at least one attribute, calculate correlation values between each of the at least two attributes;
for at least two values of the at least one value, calculate correlation values between each of the at least two values;
merge attributes of the at least two attributes having correlation values above a correlation threshold;
merge values of the at least two values having correlation values above the correlation threshold; and
store the at least one attribute and the at least one value.
Dependent claims: 18, 19, 20, 21, 22, 23, 24
25. A method for identifying at least one attribute and at least one value of a product based on at least one natural language document, the method comprising:
identifying, by a computer, a first set of attributes and a first set of values of the product via a supervised algorithm as applied to the at least one natural language document;
identifying a second set of attributes and a second set of values of the product via a semi-supervised algorithm as applied to the at least one natural language document based at least in part upon the first set of attributes and the first set of values;
providing the first set of attributes and the second set of attributes as the at least one attribute;
providing the first set of values and the second set of values as the at least one value;
for at least two attributes of the at least one attribute, calculating correlation values between each of the at least two attributes;
for at least two values of the at least one value, calculating correlation values between each of the at least two values;
merging attributes of the at least two attributes having correlation values above a correlation threshold;
merging values of the at least two values having correlation values above the correlation threshold; and
storing the at least one attribute and the at least one value.
Dependent claims: 26, 27
28. An apparatus for identifying at least one attribute and at least one value of a product based on at least one natural language document, comprising:
a supervised classification algorithm module, implemented by a processor executing instructions stored by a storage device, operative to identify a first set of attributes and a first set of values of the product based on the at least one natural language document;
a semi-supervised classification algorithm module, implemented by the processor executing instructions stored by the storage device, operative to identify a second set of attributes and a second set of values of the product based at least in part upon the at least one natural language document and the first set of attributes and the first set of values, wherein the first set of attributes and the second set of attributes constitute the at least one attribute and the first set of values and the second set of values constitute the at least one value;
a linking module, implemented by the processor executing instructions stored by the storage device, operative to calculate correlation values between each of at least two attributes of the at least one attribute, calculate correlation values between each of at least two values of the at least one value, merge attributes of the at least two attributes having correlation values above a correlation threshold and merge values of the at least two values having correlation values above the correlation threshold; and
a machine readable store operative to store the at least one attribute, and to store the at least one value.
Dependent claims: 29, 30
31. A non-transitory computer-readable medium having stored thereon executable instructions that, when executed, cause the computer to:
identify a first set of attributes and a first set of values of a product via a supervised classification algorithm as applied to at least one natural language document;
identify a second set of attributes and a second set of values of the product via a semi-supervised classification algorithm as applied to the at least one natural language document based at least in part upon the first set of attributes and the first set of values;
provide the first set of attributes and the second set of attributes as at least one attribute;
provide the first set of values and the second set of values as at least one value;
for at least two attributes of the at least one attribute, calculate correlation values between each of the at least two attributes;
for at least two values of the at least one value, calculate correlation values between each of the at least two values;
merge attributes of the at least two attributes having correlation values above a correlation threshold;
merge values of the at least two values having correlation values above the correlation threshold; and
store the at least one attribute and the at least one value.
Dependent claims: 32, 33
34. A method for extracting at least one attribute and at least one value for a product based on at least one natural language document comprising unlabeled data, the method comprising:
labeling, by a computer, the unlabeled data as seed attributes and corresponding seed values via an unsupervised classification algorithm as applied to the at least one natural language document;
labeling the unlabeled data as the at least one attribute for the product via at least one of a supervised classification algorithm and a semi-supervised classification algorithm operating upon the at least one natural language document and based on the seed attributes and the corresponding seed values;
labeling the unlabeled data as the at least one value for the product via at least one of the supervised classification algorithm and the semi-supervised classification algorithm operating upon the at least one natural language document and based on the seed attributes and the corresponding seed values;
for at least two attributes of the at least one attribute, calculating correlation values between each of the at least two attributes;
for at least two values of the at least one value, calculating correlation values between each of the at least two values;
merging attributes of the at least two attributes having correlation values above a correlation threshold;
merging values of the at least two values having correlation values above the correlation threshold; and
storing the at least one attribute and the at least one value.
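The seed-then-expand flow recited in claim 34 can be sketched roughly as follows. This is only an illustration under stated assumptions: the input is reduced to hypothetical two-word (modifier, head) phrases, the unsupervised step is a simple count of distinct modifiers per head word, and a label-propagation rule stands in for the semi-supervised classification algorithm. None of these specifics come from the claims.

```python
from collections import defaultdict

def seed_pairs(phrases, min_modifiers=2):
    """Unsupervised seed step (illustrative heuristic, an assumption):
    a head word seen with at least `min_modifiers` distinct modifiers is a
    seed attribute, and its modifiers become the corresponding seed values."""
    modifiers = defaultdict(set)
    for modifier, head in phrases:
        modifiers[head].add(modifier)
    attrs = {h for h, mods in modifiers.items() if len(mods) >= min_modifiers}
    values = {m for h in attrs for m in modifiers[h]} - attrs
    return attrs, values

def bootstrap(phrases, attrs, values, rounds=5):
    """Stand-in for the semi-supervised step: grow the labeled sets by
    propagating labels through phrases where one side is already labeled."""
    attrs, values = set(attrs), set(values)
    for _ in range(rounds):
        # A head modified by a known value is taken as a new attribute.
        new_attrs = {h for m, h in phrases if m in values and h not in values} - attrs
        attrs |= new_attrs
        # A modifier of a known attribute is taken as a new value.
        new_values = {m for m, h in phrases if h in attrs and m not in attrs} - values
        values |= new_values
        if not new_attrs and not new_values:
            break  # nothing new was labeled in this round
    return attrs, values

# Hypothetical phrases extracted from product descriptions.
phrases = [
    ("leather", "upper"), ("synthetic", "upper"), ("mesh", "upper"),
    ("leather", "lining"), ("textile", "lining"),
    ("textile", "insole"), ("rubber", "sole"),
]
seed_attrs, seed_values = seed_pairs(phrases, min_modifiers=3)
all_attrs, all_values = bootstrap(phrases, seed_attrs, seed_values)
print(sorted(all_attrs), sorted(all_values))
```

In this toy run, "rubber" and "sole" stay unlabeled because no phrase connects them to a seed. That shows how the expanded labels depend on the quality and coverage of the seeds.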
Specification