Detection of attributes in unstructured data

US 7,711,736 B2
Filed: 05/31/2007
Issued: 05/04/2010
Est. Priority Date: 06/21/2006
Status: Active Grant

Abstract

A method for processing information includes receiving a set of records, which include a plurality of fields containing data regarding respective items, and selecting a field that occurs in all of the records and contains multiple terms in each of the records. At least first and second terms that occur among the terms in the selected field in the records are identified, such that the records are partitioned into at least first and second respective subsets by occurrences of the at least first and second terms in the selected field. Responsively to partitioning of the records by the occurrences, it is determined that the at least first and second terms correspond to at least first and second different values of an attribute of the items. The data are classified according to the values of the attribute.
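
The partitioning idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`partition_by_terms`, `are_candidate_values`), the whitespace tokenization, the sample records, and the use of strict disjointness as the test for "different values of an attribute" are all assumptions made for illustration.

```python
from collections import defaultdict


def partition_by_terms(records, field):
    """Map each term found in `field` to the set of record indices containing it."""
    occurrences = defaultdict(set)
    for idx, record in enumerate(records):
        # Assumption: terms are whitespace-separated and case-insensitive.
        for term in record[field].lower().split():
            occurrences[term].add(idx)
    return dict(occurrences)


def are_candidate_values(occurrences, term_a, term_b):
    """True when the two terms split the records into non-overlapping subsets."""
    return occurrences[term_a].isdisjoint(occurrences[term_b])


# Hypothetical records with one unstructured field.
records = [
    {"description": "red cotton shirt"},
    {"description": "blue cotton shirt"},
    {"description": "red wool scarf"},
    {"description": "blue wool scarf"},
]
occ = partition_by_terms(records, "description")
```

In this toy data, "red" and "blue" never occur in the same record, so they behave like two values of one attribute, whereas "red" and "cotton" co-occur and therefore do not.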

16 Claims

1. A method for processing information, comprising:
- receiving a set of records, which comprise a plurality of fields containing data regarding respective items;
- selecting a field that occurs in all of the records and contains multiple terms in each of the records;
- identifying, by at least one processor in a computer, at least first and second terms that occur among the terms in the selected field in the records, such that the records are partitioned into at least first and second respective subsets by occurrences of the at least first and second terms in the selected field, wherein identifying the at least first and second terms comprises identifying a first group of the terms that are associated with a first attribute of the items;
- determining, responsively to partitioning of the records by the occurrences, that the at least first and second terms correspond to at least first and second different values of the first attribute of the items;
- after identifying the first group, identifying a second group of the terms, which is disjoint from the first group and partitions the records into different respective subsets from the terms in the first group, and determining that the terms in the second group correspond to respective values of a second attribute of the items; and
- classifying the data according to the values of the first attribute and outputting the classified data.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method according to claim 1, wherein identifying a first group of the terms comprises identifying a first group of the terms that occur with at least a predetermined frequency among the records such that the terms in the first group optimally partition the records in the set.
  - 3. The method according to claim 2, wherein identifying the first group further comprises computing a metric that increases in response to a union of the at least first and second subsets and decreases in response to an intersection of the at least first and second subsets, and selecting the terms to add to the first group so as to maximize the metric.
  - 4. The method according to claim 1, wherein the selected field contains the multiple terms as unstructured data, without an identification of respective attributes of the items to which the terms correspond.
  - 5. The method according to claim 1, and comprising identifying a multi-term pattern among the terms in the selected field, wherein the at least first and second terms comprise the multi-term pattern as one of the values of the first attribute.
  - 6. The method according to claim 1, and comprising cleansing the terms in the selected field so as to make an association between one of the terms that occurs among the records with a frequency less than a given threshold and the first term, and to determine, responsively to the association, that the one of the terms represents the first value of the first attribute.
  - 7. The method according to claim 6, wherein the terms comprise characters, and wherein cleansing the terms comprises computing a measure of correlation between the one of the terms and the first term responsively to a difference between the characters in the one of the terms and the first term, and deciding whether the one of the terms represents the first value of the first attribute responsively to the correlation.
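
Claims 2 and 3 describe choosing frequent terms so as to maximize a metric that grows with the union of the record subsets and shrinks with their intersection. The claims do not give a formula, so the sketch below assumes one: union size minus a weighted count of records claimed by more than one term. The greedy search, the names `partition_metric` and `select_group`, and the parameters `min_frequency` and `overlap_penalty` are likewise hypothetical.

```python
def partition_metric(group, occurrences, overlap_penalty=2.0):
    """Assumed metric: records covered, minus a penalty for records hit by 2+ terms."""
    covered, overlap = set(), set()
    for term in group:
        subset = occurrences[term]
        overlap |= covered & subset   # records already claimed by an earlier term
        covered |= subset
    return len(covered) - overlap_penalty * len(overlap)


def select_group(occurrences, min_frequency=2, overlap_penalty=2.0):
    """Greedily add the frequent term that most improves the metric, until none does."""
    candidates = sorted(
        t for t, recs in occurrences.items() if len(recs) >= min_frequency
    )
    group, best_score = [], 0.0
    while True:
        best_term = None
        for term in candidates:
            if term in group:
                continue
            score = partition_metric(group + [term], occurrences, overlap_penalty)
            if score > best_score:
                best_term, best_score = term, score
        if best_term is None:
            return group
        group.append(best_term)


# Hypothetical term -> record-index sets for a selected field.
occurrences = {
    "red": {0, 2}, "blue": {1, 3},
    "cotton": {0, 1}, "wool": {2, 3},
    "xl": {3},  # too rare to be considered with min_frequency=2
}
first_group = select_group(occurrences)

# Second attribute (as in claim 1): repeat on the terms not used by the first group.
remaining = {t: r for t, r in occurrences.items() if t not in first_group}
second_group = select_group(remaining)
```

With this toy input the first pass groups the color terms and the second pass, run on the remaining terms, groups the material terms, giving two disjoint groups that partition the same records differently.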

8. Apparatus for processing information, comprising:
- a memory, which is configured to store a set of records, which comprise a plurality of fields containing data regarding respective items; and
- a processor, which is configured to select a field that occurs in all of the records and contains multiple terms in each of the records, to identify at least first and second terms that occur among the terms in the selected field in the records, such that the records are partitioned into at least first and second respective subsets by occurrences of the at least first and second terms in the selected field, to compute a metric that increases in response to a union of the subsets and decreases in response to an intersection of the subsets, to identify a group of the terms, comprising the at least first and second terms, that occur with at least a predetermined frequency among the records such that the terms in the group optimally partition the records in the set, wherein the terms to add to the group are selected so as to maximize the metric, to determine responsively to partitioning of the records by the occurrences, that the at least first and second terms correspond to at least first and second different values of an attribute of the items, and to classify the data according to the values of the attribute.
- Dependent claims (9, 10, 11):
  - 9. The apparatus according to claim 8, wherein the attribute comprises a first attribute and the group of terms comprises a first group of terms associated with the first attribute, and wherein the processor is configured, after identifying the first group of the terms that are associated with the first attribute of the items, to identify a second group of the terms, which is disjoint from the first group and partitions the records into different respective subsets from the terms in the first group, and to determine that the terms in the second group correspond to respective values of a second attribute of the items.
  - 10. The apparatus according to claim 8, wherein the processor is configured to identify a multi-term pattern among the terms in the selected field, wherein the at least first and second terms comprise the multi-term pattern as one of the values of the attribute.
  - 11. The apparatus according to claim 8, wherein the processor is configured to cleanse the terms in the selected field so as to make an association between one of the terms that occurs among the records with a frequency less than a given threshold and the first term, and to determine, responsively to the association, that the one of the terms represents the first value of the attribute.
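
The cleansing step of claim 11 associates a term that occurs below a frequency threshold with a frequent term. One way this might look, using a character-level similarity in the spirit of the character-difference measure the claims mention, is sketched below. The choice of `difflib.SequenceMatcher` as the similarity measure, the function name `cleanse`, and the thresholds `min_frequency` and `min_similarity` are assumptions, not details from the patent.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; stands in for the claimed correlation."""
    return SequenceMatcher(None, a, b).ratio()


def cleanse(term_counts, min_frequency=3, min_similarity=0.8):
    """Map each rare term to its most similar frequent term, if similar enough."""
    frequent = [t for t, n in term_counts.items() if n >= min_frequency]
    mapping = {}
    for term, count in term_counts.items():
        if count >= min_frequency or not frequent:
            continue
        best = max(frequent, key=lambda f: similarity(term, f))
        if similarity(term, best) >= min_similarity:
            mapping[term] = best  # treat the rare term as the same attribute value
    return mapping


# Hypothetical term frequencies across the records.
term_counts = {"cotton": 40, "wool": 35, "coton": 1, "wooll": 2, "silk": 1}
corrections = cleanse(term_counts)
```

Here the misspellings "coton" and "wooll" are folded into "cotton" and "wool", while "silk" stays unmapped because it resembles no frequent term closely enough.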

12. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to receive a set of records, which comprise a plurality of fields containing data regarding respective items, to select a field that occurs in all of the records and contains multiple terms in each of the records, to identify at least first and second terms that occur among the terms in the selected field in the records, such that the records are partitioned into at least first and second respective subsets by occurrences of the at least first and second terms in the selected field, to determine responsively to partitioning of the records by the occurrences, that the at least first and second terms correspond to at least first and second different values of a first attribute of the items, to identify, after identifying a first group of the terms that are associated with the first attribute of the items, a second group of the terms, which is disjoint from the first group and partitions the records into different respective subsets from the terms in the first group, to determine that the terms in the second group correspond to respective values of a second attribute of the items and to classify the data according to the values of the first attribute.
- Dependent claims (13, 14, 15, 16):
  - 13. The product according to claim 12, wherein identifying a first group of the terms comprises identifying a first group of the terms that occur with at least a predetermined frequency among the records such that the terms in the first group optimally partition the records in the set.
  - 14. The product according to claim 13 wherein the instructions cause the computer to compute a metric that increases in response to a union of the at least first and second subsets and decreases in response to an intersection of the at least first and second subsets, and selecting the terms to add to the first group so as to maximize the metric.
  - 15. The product according to claim 12, wherein the instructions cause the computer to identify a multi-term pattern among the terms in the selected field, wherein the at least first and second terms comprise the multi-term pattern as one of the values of the first attribute.
  - 16. The product according to claim 12, wherein the instructions cause the computer to cleanse the terms in the selected field so as to make an association between one of the terms that occurs among the records with a frequency less than a given threshold and the first term, and to determine, responsively to the association, that the one of the terms represents the first value of the first attribute.
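
Claims 5, 10, and 15 refer to identifying a multi-term pattern that serves as a single attribute value. The claims do not say how such patterns are found; the sketch below assumes a simple approach that counts adjacent term pairs and keeps those seen in at least `min_frequency` records. The function name and threshold are hypothetical.

```python
from collections import Counter


def find_multi_term_patterns(field_values, min_frequency=2):
    """Return adjacent term pairs that recur in at least `min_frequency` records."""
    pair_counts = Counter()
    for value in field_values:
        terms = value.lower().split()
        # Count each adjacent pair at most once per record.
        pair_counts.update(set(zip(terms, terms[1:])))
    return {" ".join(pair) for pair, n in pair_counts.items() if n >= min_frequency}


# Hypothetical values of the selected field.
field_values = [
    "stainless steel pot",
    "stainless steel pan",
    "cast iron pan",
    "cast iron pot",
]
patterns = find_multi_term_patterns(field_values)
```

Each detected pattern could then be treated as one term when partitioning the records, so that "stainless steel" and "cast iron" act as two values of a single attribute.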

Current Assignee: Microsoft Corporation
Original Assignee: Microsoft International Holdings BV (Microsoft Corporation)
Inventors: Levin, Boris I.
Primary Examiner(s): Wong; Don
Assistant Examiner(s): Nguyen; Merilyn P
Application Number: US11/809,167
Publication Number: US 20070299855A1
Time in Patent Office: 1,069 Days
Field of Search: 707/1-3, 707/5, 707/100-102, 707/205, 704/5, 704/10
US Class Current: 707/737
CPC Class Codes: G06F 16/313 Selection or weighting of t...
