Detection of attributes in unstructured data

US 20070299855A1
Filed: 05/31/2007
Published: 12/27/2007
Est. Priority Date: 06/21/2006
Status: Active Grant

Abstract

A method for processing information includes receiving a set of records, which include a plurality of fields containing data regarding respective items, and selecting a field that occurs in all of the records and contains multiple terms in each of the records. At least first and second terms that occur among the terms in the selected field in the records are identified, such that the records are partitioned into at least first and second respective subsets by occurrences of the at least first and second terms in the selected field. Responsively to partitioning of the records by the occurrences, it is determined that the at least first and second terms correspond to at least first and second different values of an attribute of the items. The data are classified according to the values of the attribute.
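
As a rough illustration of the partitioning idea in the abstract, the following Python sketch builds, for each term in a selected field, the subset of records that contain it, checks whether a candidate group of terms splits the records into disjoint subsets, and labels each record accordingly. The record layout, the field name `description`, the attribute name `color`, and the helper names `term_subsets` and `classify` are assumptions made for this example and do not come from the patent.

```python
from collections import defaultdict


def term_subsets(records, field):
    """Map each term in the selected field to the set of record indices containing it."""
    subsets = defaultdict(set)
    for idx, record in enumerate(records):
        for term in record[field].lower().split():
            subsets[term].add(idx)
    return subsets


def classify(records, field, values, attribute):
    """Label each record with the single attribute value found in its field, else None."""
    labelled_records = []
    for record in records:
        terms = set(record[field].lower().split())
        hits = [value for value in values if value in terms]
        labelled = dict(record)
        labelled[attribute] = hits[0] if len(hits) == 1 else None
        labelled_records.append(labelled)
    return labelled_records


# Hypothetical records with one unstructured, multi-term field.
records = [
    {"description": "cotton shirt red large"},
    {"description": "cotton shirt blue small"},
    {"description": "wool shirt red small"},
    {"description": "wool shirt green large"},
]

subsets = term_subsets(records, "description")
candidates = ["red", "blue", "green"]

# The candidate terms partition the records when no record contains two of them.
disjoint = all(
    subsets[a].isdisjoint(subsets[b])
    for i, a in enumerate(candidates)
    for b in candidates[i + 1:]
)

classified = classify(records, "description", candidates, "color")
```

In this toy data the three candidate terms never co-occur in a record, which is the kind of evidence the abstract describes for treating them as different values of one attribute.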

29 Citations

20 Claims

1. A method for processing information, comprising:
- receiving a set of records, which comprise a plurality of fields containing data regarding respective items;
- selecting a field that occurs in all of the records and contains multiple terms in each of the records;
- identifying at least first and second terms that occur among the terms in the selected field in the records, such that the records are partitioned into at least first and second respective subsets by occurrences of the at least first and second terms in the selected field;
- determining, responsively to partitioning of the records by the occurrences, that the at least first and second terms correspond to at least first and second different values of an attribute of the items; and
- classifying the data according to the values of the attribute and outputting the classified data.
- - 2. The method according to claim 1, wherein identifying the at least first and second terms comprises identifying a group of the terms that occur with at least a predetermined frequency among the records such that the terms in the group optimally partition the records in the set.
  - 3. The method according to claim 2, wherein identifying the group comprises computing a metric that increases in response to a union of the subsets and decreases in response to an intersection of the subsets, and selecting the terms to add to the group so as to maximize the metric.
  - 4. The method according to claim 1, wherein identifying the at least first and second terms comprises identifying a first group of the terms that are associated with a first attribute of the items, and wherein the method comprises, after identifying the first group, identifying a second group of the terms, which is disjoint from the first group and partitions the records into different respective subsets from the terms in the first group, and determining that the terms in the second group correspond to respective values of a second attribute of the items.
  - 5. The method according to claim 1, wherein the selected field contains the multiple terms as unstructured data, without an identification of respective attributes of the items to which the terms correspond.
  - 6. The method according to claim 1, and comprising identifying a multi-term pattern among the terms in the selected field, wherein the at least first and second terms comprise the multi-term pattern as one of the values of the attribute.
  - 7. The method according to claim 1, and comprising cleansing the terms in the selected field so as to make an association between one of the terms that occurs among the records with a frequency less than a given threshold and the first term, and to determine, responsively to the association, that the one of the terms represents the first value of the attribute.
  - 8. The method according to claim 7, wherein the terms comprise characters, and wherein cleansing the terms comprises computing a measure of correlation between the one of the terms and the first term responsively to a difference between the characters in the one of the terms and the first term, and deciding whether the one of the terms represents the first value of the attribute responsively to the correlation.
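
Claims 2 and 3 describe choosing a group of sufficiently frequent terms by maximizing a metric that grows with the union of the term subsets and shrinks with their intersection. The claims do not give a formula, so the Python sketch below assumes one simple possibility: the number of records covered by the group minus a penalty for every record shared by two terms, with terms added greedily. The names `partition_metric` and `greedy_group`, the `overlap_penalty` weight, and the rule that skips terms present in every record are all assumptions of this example.

```python
from itertools import combinations


def partition_metric(group, subsets, overlap_penalty=2.0):
    """Assumed metric: size of the union minus a penalty on pairwise intersections."""
    if not group:
        return 0.0
    covered = set().union(*(subsets[term] for term in group))
    overlap = sum(len(subsets[a] & subsets[b]) for a, b in combinations(group, 2))
    return len(covered) - overlap_penalty * overlap


def greedy_group(subsets, n_records, min_frequency=2, overlap_penalty=2.0):
    """Greedily add the term that most improves the metric until no term helps."""
    # Keep terms that meet the frequency threshold; skip terms found in every
    # record, since they cannot split the records (an assumption of this sketch).
    candidates = sorted(
        term for term, recs in subsets.items()
        if min_frequency <= len(recs) < n_records
    )
    group, best = [], 0.0
    while candidates:
        term = max(
            candidates,
            key=lambda t: partition_metric(group + [t], subsets, overlap_penalty),
        )
        score = partition_metric(group + [term], subsets, overlap_penalty)
        if score <= best:
            break
        group.append(term)
        candidates.remove(term)
        best = score
    return group, best


# Hypothetical term -> record-index subsets for six records.
subsets = {
    "shirt": {0, 1, 2, 3, 4, 5},
    "red": {0, 1},
    "blue": {2, 3},
    "green": {4, 5},
    "xl": {5},
}
group, score = greedy_group(subsets, n_records=6)
```

A greedy search is only one way to maximize such a metric and can miss the best group on less tidy data; it is used here because it keeps the example short.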

9. Apparatus for processing information, comprising:
- a memory, which is configured to store a set of records, which comprise a plurality of fields containing data regarding respective items; and
- a processor, which is configured to select a field that occurs in all of the records and contains multiple terms in each of the records, to identify at least first and second terms that occur among the terms in the selected field in the records, such that the records are partitioned into at least first and second respective subsets by occurrences of the at least first and second terms in the selected field, to determine responsively to partitioning of the records by the occurrences, that the at least first and second terms correspond to at least first and second different values of an attribute of the items, and to classify the data according to the values of the attribute.
- - 10. The apparatus according to claim 9, wherein the processor is configured to identify a group of the terms, comprising the at least first and second terms, that occur with at least a predetermined frequency among the records such that the terms in the group optimally partition the records in the set.
  - 11. The apparatus according to claim 10, wherein the processor is configured to compute a metric that increases in response to a union of the subsets and decreases in response to an intersection of the subsets, and to select the terms to add to the group so as to maximize the metric.
  - 12. The apparatus according to claim 9, wherein the processor is configured, after identifying a first group of the terms that are associated with a first attribute of the items, to identify a second group of the terms, which is disjoint from the first group and partitions the records into different respective subsets from the terms in the first group, and to determine that the terms in the second group correspond to respective values of a second attribute of the items.
  - 13. The apparatus according to claim 9, wherein the processor is configured to identify a multi-term pattern among the terms in the selected field, wherein the at least first and second terms comprise the multi-term pattern as one of the values of the attribute.
  - 14. The apparatus according to claim 9, wherein the processor is configured to cleanse the terms in the selected field so as to make an association between one of the terms that occurs among the records with a frequency less than a given threshold and the first term, and to determine, responsively to the association, that the one of the terms represents the first value of the attribute.
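
The cleansing step in claim 14 associates a rare term with a frequent term that already stands for an attribute value, and claim 8 bases that association on character differences between the two terms. The Python sketch below is one possible reading: it uses `difflib.SequenceMatcher.ratio()` from the standard library as a stand-in for the claimed measure of correlation, which the claims do not define. The function name `cleanse` and the thresholds `min_frequency` and `min_similarity` are assumptions of this example.

```python
from difflib import SequenceMatcher


def cleanse(term_counts, known_values, min_frequency=3, min_similarity=0.8):
    """Map each rare term to the most similar known attribute value, if similar enough."""
    mapping = {}
    for term, count in term_counts.items():
        # Only terms below the frequency threshold are candidates for cleansing.
        if count >= min_frequency or term in known_values:
            continue
        best_value, best_score = None, 0.0
        for value in known_values:
            # Character-level similarity in [0, 1]; a stand-in for the claimed correlation.
            score = SequenceMatcher(None, term, value).ratio()
            if score > best_score:
                best_value, best_score = value, score
        if best_score >= min_similarity:
            mapping[term] = best_value
    return mapping


# Hypothetical term frequencies; "gren" and "blu" look like misspellings.
term_counts = {"green": 40, "blue": 35, "gren": 1, "blu": 2, "cotton": 1}
mapping = cleanse(term_counts, ["green", "blue"])
```

Rare terms that resemble no known value, such as `cotton` here, are left unmapped rather than forced onto the nearest value.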

15. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to receive a set of records, which comprise a plurality of fields containing data regarding respective items, to select a field that occurs in all of the records and contains multiple terms in each of the records, to identify at least first and second terms that occur among the terms in the selected field in the records, such that the records are partitioned into at least first and second respective subsets by occurrences of the at least first and second terms in the selected field, to determine responsively to partitioning of the records by the occurrences, that the at least first and second terms correspond to at least first and second different values of an attribute of the items, and to classify the data according to the values of the attribute.
- - 16. The product according to claim 15, wherein the instructions cause the computer to identify a group of the terms, comprising the at least first and second terms, that occur with at least a predetermined frequency among the records such that the terms in the group optimally partition the records in the set.
  - 18. The product according to claim 15, wherein the instructions cause the computer, after identifying a first group of the terms that are associated with a first attribute of the items, to identify a second group of the terms, which is disjoint from the first group and partitions the records into different respective subsets from the terms in the first group, and to determine that the terms in the second group correspond to respective values of a second attribute of the items.
  - 19. The product according to claim 15, wherein the instructions cause the computer to identify a multi-term pattern among the terms in the selected field, wherein the at least first and second terms comprise the multi-term pattern as one of the values of the attribute.
  - 20. The product according to claim 15, wherein the instructions cause the computer to cleanse the terms in the selected field so as to make an association between one of the terms that occurs among the records with a frequency less than a given threshold and the first term, and to determine, responsively to the association, that the one of the terms represents the first value of the attribute.
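
Claim 19 refers to identifying a multi-term pattern that acts as a single attribute value. The claims do not say how such patterns are found, so the Python sketch below assumes a simple co-occurrence test: two adjacent terms are treated as one pattern when the pair occurs often enough and neither term appears much without the other. The function name `multi_term_patterns` and the thresholds `min_count` and `min_cohesion` are assumptions of this example.

```python
from collections import Counter


def multi_term_patterns(values, min_count=2, min_cohesion=0.9):
    """Return adjacent term pairs that occur together often and rarely apart."""
    unigrams, bigrams = Counter(), Counter()
    for text in values:
        terms = text.lower().split()
        unigrams.update(terms)
        bigrams.update(zip(terms, terms[1:]))
    patterns = []
    for (first, second), count in bigrams.items():
        # Cohesion is 1.0 when both terms only ever appear as this pair.
        cohesion = count / max(unigrams[first], unigrams[second])
        if count >= min_count and cohesion >= min_cohesion:
            patterns.append(f"{first} {second}")
    return sorted(patterns)


# Hypothetical contents of the selected field in four records.
values = [
    "stainless steel kettle 1.5l",
    "stainless steel pan 24cm",
    "plastic kettle 1l",
    "plastic bowl 2l",
]
patterns = multi_term_patterns(values)
```

Once detected, a pattern such as `stainless steel` could be handled as a single term alongside `plastic` when looking for values of one attribute.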

17. The product according to claim 16, wherein the instructions cause the computer to compute a metric that increases in response to a union of the subsets and decreases in response to an intersection of the subsets, and to select the terms to add to the group so as to maximize the metric.

Current Assignee: Microsoft Corporation
Original Assignee: Zoomix Data Mastering Ltd. (Microsoft Corporation)
Inventors: Levin, Boris I.
Granted Patent: US 7,711,736 B2
US Class Current: 1/1
CPC Class Codes: G06F 16/313 Selection or weighting of t...
