Extraction of attributes and values from natural language documents

US 7,996,440 B2
Filed: 04/30/2007
Issued: 08/09/2011
Est. Priority Date: 06/05/2006
Status: Active Grant


Abstract

One or more classification algorithms are applied to at least one natural language document in order to extract both attributes and values of a given product. Supervised classification algorithms, semi-supervised classification algorithms, unsupervised classification algorithms or combinations of such classification algorithms may be employed for this purpose. The at least one natural language document may be obtained via a public communication network. Two or more attributes (or two or more values) thus identified may be merged to form one or more attribute phrases or value phrases. Once attributes and values have been extracted in this manner, association or linking operations may be performed to establish attribute-value pairs that are descriptive of the product. In a presently preferred embodiment, an (unsupervised) algorithm is used to generate seed attributes and values which can then support a supervised or semi-supervised classification algorithm.
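
The association step mentioned above can be illustrated with a minimal sketch. The abstract does not say how attributes are linked to values, so the criterion used here (pair each value with the nearest attribute token in the same phrase) is an assumption, and the function name `link_pairs` and the label strings are hypothetical.

```python
def link_pairs(labeled_tokens):
    """Pair each value with the closest attribute in the same phrase.

    Assumed selection criterion: smallest token distance; on a tie the
    earlier attribute wins. The source does not prescribe this rule.
    """
    attrs = [(i, tok) for i, (tok, lab) in enumerate(labeled_tokens)
             if lab == "attribute"]
    pairs = []
    for i, (tok, lab) in enumerate(labeled_tokens):
        if lab == "value" and attrs:
            _, attr = min(attrs, key=lambda a: abs(a[0] - i))
            pairs.append((attr, tok))
    return pairs


# Hypothetical output of the labeling step for one product phrase.
labeled = [
    ("leather", "value"),
    ("upper", "attribute"),
    ("with", "other"),
    ("rubber", "value"),
    ("sole", "attribute"),
]
pairs = link_pairs(labeled)
```

Here `pairs` holds `("upper", "leather")` and `("sole", "rubber")`. A real system would likely use richer selection criteria, such as syntactic dependencies or co-occurrence statistics.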

34 Claims

1. A method for extracting at least one attribute and at least one value for a product based on at least one natural language document comprising unlabeled data, the method comprising:
- labeling, by a computer, at least a portion of the unlabeled data as the at least one attribute for the product via at least one classification algorithm operating upon the at least one natural language document;
  
  labeling at least another portion of the unlabeled data as the at least one value for the product via the at least one classification algorithm operating upon the at least one natural language document;
  
  for at least two attributes of the at least one attribute, calculating correlation values between each of the at least two attributes;
  
  for at least two values of the at least one value, calculating correlation values between each of the at least two values;
  
  merging attributes of the at least two attributes having correlation values above a correlation threshold;
  
  merging values of the at least two values having correlation values above the correlation threshold; and
  
  storing the at least one attribute and the at least one value.
- - 2. The method of claim 1, wherein the at least one classification algorithm comprises a supervised classification algorithm.
  - 3. The method of claim 1, wherein the at least one classification algorithm comprises a semi-supervised classification algorithm.
  - 4. The method of claim 1, wherein the at least one classification algorithm comprises an unsupervised classification algorithm.
  - 5. The method of claim 1, further comprising:
    - obtaining the at least one natural language document via a public communication network.
  - 6. The method of claim 1, further comprising:
    - associating one of the at least one attribute with one of the at least one value to provide an attribute-value pair; and
      
      storing the attribute-value pair.
  - 7. The method of claim 6, further comprising associating the attribute with the value based on selection criteria to provide the attribute-value pair.
  - 8. The method of claim 1, further comprising:
    - generating seed attributes and corresponding seed values, wherein generating the seed attributes and the seed values further comprises identifying the seed attributes and the corresponding seed values via an unsupervised classification algorithm as applied to the at least one natural language document; and
      
      wherein labeling the at least one attribute and the at least one value further comprises labeling the at least one attribute and the at least one value via the at least one classification algorithm operating upon the at least one natural language document based on the seed attributes and the corresponding seed values.
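
The correlation and merging steps of claim 1 can be sketched as follows. The claim does not define how a correlation value is computed, so the Dice coefficient over phrase co-occurrence is an assumption here, as are the function names and the 0.8 threshold. Terms whose pairwise correlation exceeds the threshold are grouped with a small union-find.

```python
from itertools import combinations


def dice_correlation(phrases, a, b):
    """Dice coefficient: 2 * |phrases with both| / (|with a| + |with b|)."""
    with_a = sum(1 for p in phrases if a in p)
    with_b = sum(1 for p in phrases if b in p)
    both = sum(1 for p in phrases if a in p and b in p)
    return 2 * both / (with_a + with_b) if with_a + with_b else 0.0


def merge_above_threshold(terms, phrases, threshold):
    """Group terms whose pairwise correlation is above the threshold."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the group representative (path halving).
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if dice_correlation(phrases, a, b) > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    return sorted(sorted(g) for g in groups.values())


# Each phrase is the set of tokens it contains (toy data).
phrases = [
    {"stainless", "steel", "blade"},
    {"stainless", "steel", "frame"},
    {"carbon", "blade"},
]
values = ["stainless", "steel", "carbon"]
merged = merge_above_threshold(values, phrases, threshold=0.8)
```

With this toy data, "stainless" and "steel" always co-occur and are merged into one group, while "carbon" stays on its own. The same function applies unchanged to a list of attributes.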

9. An apparatus for extracting at least one attribute and at least one value for a product based on at least one natural language document comprising unlabeled data, comprising:
- a classification algorithm module, implemented by a processor executing instructions stored by a storage device, operative to label at least a portion of the unlabeled data as the at least one attribute via at least one classification algorithm operating upon the at least one natural language document and to label at least another portion of the unlabeled data as the at least one value for the product via the at least one classification algorithm operating upon the at least one natural language document;
  
  a linking module, implemented by the processor executing instructions stored by the storage device, operative to calculate correlation values between each of at least two attributes of the at least one attribute, calculate correlation values between each of at least two values of the at least one value, merge attributes of the at least two attributes having correlation values above a correlation threshold and merge values of the at least two values having correlation values above the correlation threshold; and
  
  a machine readable store operative to store the at least one attribute and the at least one value.
- - 10. The apparatus of claim 9, wherein the classification algorithm module comprises a supervised classification algorithm.
  - 11. The apparatus of claim 9, wherein the classification algorithm module comprises a semi-supervised classification algorithm.
  - 12. The apparatus of claim 9, wherein the classification algorithm module comprises an unsupervised classification algorithm.
  - 13. The apparatus of claim 9, further comprising:
    - a network interface operative to obtain the at least one natural language document via a public communication network.
  - 14. The apparatus of claim 9, wherein the linking module is further operative to associate one of the at least one attribute with one of the at least one value to provide an attribute-value pair, wherein the machine readable store is further operative to store the attribute-value pair.
  - 15. The apparatus of claim 14, wherein the linking module is further operative to associate the attribute with the value based on selection criteria to provide the attribute-value pair.
  - 16. The apparatus of claim 9, further comprising:
    - a seed generation module operative to provide seed attributes and corresponding seed values, and further operative to identify the seed attributes and the corresponding seed values via an unsupervised classification algorithm as applied to the at least one natural language document, wherein the classification algorithm module is further operative to label the at least one attribute and the at least one value via the at least one classification algorithm operating upon the at least one natural language document based on the seed attributes and the corresponding seed values.

17. A non-transitory computer-readable medium having stored thereon executable instructions that, when executed, cause a computer to:
- label at least a portion of unlabeled data in at least one natural language document as at least one attribute for a product via at least one classification algorithm operating upon the at least one natural language document;
  
  label at least another portion of the unlabeled data as at least one value for the product via the at least one classification algorithm operating upon at least one natural language document;
  
  for at least two attributes of the at least one attribute, calculate correlation values between each of the at least two attributes;
  
  for at least two values of the at least one value, calculate correlation values between each of the at least two values;
  
  merge attributes of the at least two attributes having correlation values above a correlation threshold;
  
  merge values of the at least two values having correlation values above the correlation threshold; and
  
  store the at least one attribute and the at least one value.
- - 18. The non-transitory, computer-readable medium of claim 17, wherein the executable instructions that, when executed, cause the computer to identify the at least one attribute and the at least one value via the at least one classification algorithm further cause the computer to identify the at least one attribute and the at least one value via a supervised classification algorithm.
  - 19. The non-transitory, computer-readable medium of claim 17, wherein the executable instructions that, when executed, cause the computer to identify the at least one attribute and the at least one value via the at least one classification algorithm further cause the computer to identify the at least one attribute and the at least one value via a semi-supervised classification algorithm.
  - 20. The non-transitory, computer-readable medium of claim 17, wherein the executable instructions that, when executed, cause the computer to identify the at least one attribute and the at least one value via the at least one classification algorithm further cause the computer to identify the at least one attribute and the at least one value via an unsupervised classification algorithm.
  - 21. The non-transitory, computer-readable medium of claim 17, further comprising executable instructions that, when executed, cause the computer to:
    - obtain the at least one natural language document via a public communication network.
  - 22. The non-transitory, computer-readable medium of claim 17, further comprising executable instructions that, when executed, cause the computer to:
    - associate one of the at least one attribute with one of the at least one value to provide an attribute-value pair; and
      
      store the attribute-value pair.
  - 23. The non-transitory, computer-readable medium of claim 22, wherein the executable instructions that, when executed, cause the computer to associate the attribute with the value further cause the computer to associate the attribute with the value based on selection criteria to provide the attribute-value pair.
  - 24. The non-transitory, computer-readable medium of claim 17, further comprising executable instructions that, when executed, cause the computer to:
    - generate seed attributes and corresponding seed values, wherein the executable instructions that, when executed, cause the computer to generate the seed attributes and the seed values further cause the computer to identify the seed attributes and the corresponding seed values via an unsupervised classification algorithm as applied to the at least one natural language document; and
      
      wherein the executable instructions that, when executed, cause the computer to label the at least one attribute and the at least one value further cause the computer to label the at least one attribute and the at least one value via the classification algorithm operating upon the at least one natural language document based on the seed attributes and the corresponding seed values.

25. A method for identifying at least one attribute and at least one value of a product based on at least one natural language document, the method comprising:
- identifying, by a computer, a first set of attributes and a first set of values of the product via a supervised algorithm as applied to the at least one natural language document;
  
  identifying a second set of attributes and a second set of values of the product via a semi-supervised algorithm as applied to the at least one natural language document based at least in part upon the first set of attributes and the first set of values;
  
  providing the first set of attributes and the second set of attributes as the at least one attribute;
  
  providing the first set of values and the second set of values as the at least one value;
  
  for at least two attributes of the at least one attribute, calculating correlation values between each of the at least two attributes;
  
  for at least two values of the at least one value, calculating correlation values between each of the at least two values;
  
  merging attributes of the at least two attributes having correlation values above a correlation threshold;
  
  merging values of the at least two values having correlation values above the correlation threshold; and
  
  storing the at least one attribute and the at least one value.
- - 26. The method of claim 25, further comprising:
    - providing seed attributes and corresponding seed values identified by an unsupervised classification algorithm as applied to the at least one natural language document, wherein the supervised algorithm identifies the first set of attributes and the first set of values based on the seed attributes and the corresponding seed values.
  - 27. The method of claim 26, further comprising:
    - obtaining the at least one natural language document via a public communication network.
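
The two-stage identification in claim 25 can be sketched as below: a supervised step yields a first set of attributes and values, and a semi-supervised step extends it to a second set. Both stages are heavily simplified assumptions. The "supervised" stage is a majority-label lookup learned from labeled tokens, and the "semi-supervised" stage is a bootstrapping loop that labels a word as a value when it directly precedes a known attribute, and as an attribute when it directly follows a known value. The claim does not prescribe either technique, and all names are hypothetical.

```python
from collections import Counter


def train_supervised(labeled_tokens):
    """Trivial supervised stage: keep the most frequent label per token."""
    counts = {}
    for token, label in labeled_tokens:
        counts.setdefault(token, Counter())[label] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}


def bootstrap(known, phrases, max_rounds=10):
    """Semi-supervised stage: grow the label set from adjacent-word context."""
    labels = dict(known)
    for _ in range(max_rounds):
        added = {}
        for phrase in phrases:
            tokens = phrase.lower().split()
            for left, right in zip(tokens, tokens[1:]):
                if labels.get(right) == "attribute" and left not in labels:
                    added[left] = "value"
                elif labels.get(left) == "value" and right not in labels:
                    added[right] = "attribute"
        if not added:
            break  # nothing new was labeled in this round
        labels.update(added)
    return labels


# First set: from labeled training data (toy example).
first = train_supervised([("leather", "value"), ("upper", "attribute")])
# Second set: found in unlabeled phrases, based on the first set.
unlabeled = ["synthetic upper", "synthetic lining", "mesh lining"]
combined = bootstrap(first, unlabeled)
second = {tok: lab for tok, lab in combined.items() if tok not in first}
```

In this toy run the second set contains "synthetic" and "mesh" as values and "lining" as an attribute. The combined dictionary corresponds to providing the first and second sets together as the attributes and values to be correlated, merged, and stored.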

28. An apparatus for identifying at least one attribute and at least one value of a product based on at least one natural language document, comprising:
- a supervised classification algorithm module, implemented by a processor executing instructions stored by a storage device, operative to identify a first set of attributes and a first set of values of the product based on the at least one natural language document;
  
  a semi-supervised classification algorithm module, implemented by the processor executing instructions stored by the storage device, operative to identify a second set of attributes and a second set of values of the product based at least in part upon the at least one natural language document and the first set of attributes and the first set of values, wherein the first set of attributes and the second set of attributes constitute the at least one attribute and the first set of values and the second set of values constitute the at least one value;
  
  a linking module, implemented by the processor executing instructions stored by the storage device, operative to calculate correlation values between each of at least two attributes of the at least one attribute, calculate correlation values between each of at least two values of the at least one value, merge attributes of the at least two attributes having correlation values above a correlation threshold and merge values of the at least two values having correlation values above the correlation threshold; and
  
  a machine readable store operative to store the at least one attribute, and to store the at least one value.
- - 29. The apparatus of claim 28, further comprising:
    - an unsupervised classification algorithm module operative to identify seed attributes and corresponding seed values based on the at least one natural language document, wherein the supervised classification algorithm module is further operative to identify the first set of attributes and the first set of values based on the seed attributes and the corresponding seed values.
  - 30. The apparatus of claim 28, further comprising:
    - a network interface operative to obtain the at least one natural language document via a public communication network.

31. A non-transitory computer-readable medium having stored thereon executable instructions that, when executed, cause the computer to:
- identify a first set of attributes and a first set of values of a product via a supervised classification algorithm as applied to at least one natural language document;
  
  identify a second set of attributes and a second set of values of the product via a semi-supervised classification algorithm as applied to the at least one natural language document based at least in part upon the first set of attributes and the first set of values;
  
  provide the first set of attributes and the second set of attributes as at least one attribute;
  
  provide the first set of values and the second set of values as at least one value;
  
  for at least two attributes of the at least one attribute, calculate correlation values between each of the at least two attributes;
  
  for at least two values of the at least one value, calculate correlation values between each of the at least two values;
  
  merge attributes of the at least two attributes having correlation values above a correlation threshold;
  
  merge values of the at least two values having correlation values above the correlation threshold; and
  
  store the at least one attribute and the at least one value.
- - 32. The computer-readable medium of claim 31, further comprising executable instructions that, when executed, cause the computer to:
    - provide seed attributes and corresponding seed values identified by an unsupervised classification algorithm as applied to the at least one natural language document, wherein the executable instructions that, when executed, cause the computer to identify the first set of attributes and the first set of values further cause the computer to identify the first set of attributes and the first set of values via the supervised classification algorithm based on the seed attributes and the corresponding seed values.
  - 33. The computer-readable medium of claim 31, further comprising executable instructions that, when executed, cause the computer to:
    - obtain the at least one natural language document via a public communication network.

34. A method for extracting at least one attribute and at least one value for a product based on at least one natural language document comprising unlabeled data, the method comprising:
- labeling, by a computer, the unlabeled data as seed attributes and corresponding seed values via an unsupervised classification algorithm as applied to the at least one natural language document;
  
  labeling the unlabeled data as the at least one attribute for the product via at least one of a supervised classification algorithm and a semi-supervised classification algorithm operating upon the at least one natural language document and based on the seed attributes and the corresponding seed values;
  
  labeling the unlabeled data as the at least one value for the product via at least one of the supervised classification algorithm and the semi-supervised classification algorithm operating upon the at least one natural language document and based on the seed attributes and the corresponding seed values;
  
  for at least two attributes of the at least one attribute, calculating correlation values between each of the at least two attributes;
  
  for at least two values of the at least one value, calculating correlation values between each of the at least two values;
  
  merging attributes of the at least two attributes having correlation values above a correlation threshold;
  
  merging values of the at least two values having correlation values above the correlation threshold; and
  
  storing the at least one attribute and the at least one value.
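
The unsupervised seed-labeling step that opens claim 34 might look like the following sketch. The claim does not specify the unsupervised algorithm, so the heuristic used here is an assumption: a word that is directly preceded by several distinct words is treated as a seed attribute, and those preceding words as its seed values. The function name `generate_seeds` and the `min_distinct_values` parameter are hypothetical.

```python
from collections import defaultdict


def generate_seeds(phrases, min_distinct_values=2):
    """Return {seed attribute: sorted seed values} from unlabeled phrases."""
    preceding = defaultdict(set)
    for phrase in phrases:
        tokens = phrase.lower().split()
        # Record which words appear immediately before each word.
        for left, right in zip(tokens, tokens[1:]):
            preceding[right].add(left)
    return {
        attr: sorted(vals)
        for attr, vals in preceding.items()
        if len(vals) >= min_distinct_values
    }


# Unlabeled product phrases (toy data).
phrases = [
    "leather upper",
    "synthetic upper",
    "rubber sole",
    "foam sole",
    "lightweight",
]
seeds = generate_seeds(phrases)
```

For this input, `seeds` maps "upper" to "leather" and "synthetic", and "sole" to "foam" and "rubber". Such seeds would then serve as the starting labels for the supervised or semi-supervised classifier named in the later steps of the claim.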

Current Assignee
Accenture Global Services Limited (Accenture PLC)
Original Assignee
Accenture Global Services Limited (Accenture PLC)
Inventors
Probst, Katharina; Ghani, Rayid; Fano, Andrew E.; Liu, Yan; Krema, Marko
Primary Examiner(s)
Timblin; Robert
Assistant Examiner(s)
Arjomandi; Noosha

Application Number

US11/742,215
Publication Number

US 20070282892A1
Time in Patent Office

1,562 Days
Field of Search

None
US Class Current

707/803
CPC Class Codes

G06F 40/20 Natural language analysis s...

G06F 40/258 Heading extraction; Automat...
