Detection of attributes in unstructured data
First Claim
1. A method for processing information, comprising:
receiving a set of records, which comprise a plurality of fields containing data regarding respective items;
selecting a field that occurs in all of the records and contains multiple terms in each of the records;
identifying, by at least one processor in a computer, at least first and second terms that occur among the terms in the selected field in the records, such that the records are partitioned into at least first and second respective subsets by occurrences of the at least first and second terms in the selected field, wherein identifying the at least first and second terms comprises identifying a first group of the terms that are associated with a first attribute of the items;
determining, responsively to partitioning of the records by the occurrences, that the at least first and second terms correspond to at least first and second different values of the first attribute of the items;
after identifying the first group, identifying a second group of the terms, which is disjoint from the first group and partitions the records into different respective subsets from the terms in the first group, and determining that the terms in the second group correspond to respective values of a second attribute of the items; and
classifying the data according to the values of the first attribute and outputting the classified data.
3 Assignments
0 Petitions
Abstract
A method for processing information includes receiving a set of records, which include a plurality of fields containing data regarding respective items, and selecting a field that occurs in all of the records and contains multiple terms in each of the records. At least first and second terms that occur among the terms in the selected field in the records are identified, such that the records are partitioned into at least first and second respective subsets by occurrences of the at least first and second terms in the selected field. Responsively to partitioning of the records by the occurrences, it is determined that the at least first and second terms correspond to at least first and second different values of an attribute of the items. The data are classified according to the values of the attribute.
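The partitioning idea in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the whitespace tokenization, the strict reading of "partitioned" (every record contains exactly one of the candidate terms), the sample records, and the helper names `term_index`, `partitions_records` and `classify` are all assumptions made for illustration.

```python
from collections import defaultdict

def term_index(records, field):
    """Map each term in `field` to the set of record ids that contain it."""
    index = defaultdict(set)
    for rec_id, rec in enumerate(records):
        # Assumption: terms are lowercase whitespace-separated tokens.
        for term in set(rec[field].lower().split()):
            index[term].add(rec_id)
    return index

def partitions_records(terms, index, n_records):
    """True if every record contains exactly one of `terms` (assumed strict partition)."""
    seen = set()
    for term in terms:
        ids = index.get(term, set())
        if seen & ids:          # two terms share a record: subsets overlap
            return False
        seen |= ids
    return len(seen) == n_records  # no record left uncovered

def classify(records, field, terms):
    """Group record ids by which attribute value (term) they contain."""
    index = term_index(records, field)
    return {term: sorted(index.get(term, set())) for term in terms}

# Hypothetical sample data, not taken from the patent.
records = [
    {"description": "red cotton shirt"},
    {"description": "blue cotton shirt"},
    {"description": "red wool sweater"},
    {"description": "blue wool sweater"},
]
index = term_index(records, "description")
print(partitions_records(["red", "blue"], index, len(records)))    # True
print(partitions_records(["red", "cotton"], index, len(records)))  # False
print(classify(records, "description", ["red", "blue"]))
# {'red': [0, 2], 'blue': [1, 3]}
```

Here "red" and "blue" split the records into disjoint subsets that together cover the set, which is the signal that they are different values of one attribute (color in this toy data). "red" and "cotton" co-occur in a record, so they are not treated as values of the same attribute.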
30 Citations
16 Claims
1. A method for processing information, comprising:
receiving a set of records, which comprise a plurality of fields containing data regarding respective items;
selecting a field that occurs in all of the records and contains multiple terms in each of the records;
identifying, by at least one processor in a computer, at least first and second terms that occur among the terms in the selected field in the records, such that the records are partitioned into at least first and second respective subsets by occurrences of the at least first and second terms in the selected field, wherein identifying the at least first and second terms comprises identifying a first group of the terms that are associated with a first attribute of the items;
determining, responsively to partitioning of the records by the occurrences, that the at least first and second terms correspond to at least first and second different values of the first attribute of the items;
after identifying the first group, identifying a second group of the terms, which is disjoint from the first group and partitions the records into different respective subsets from the terms in the first group, and determining that the terms in the second group correspond to respective values of a second attribute of the items; and
classifying the data according to the values of the first attribute and outputting the classified data.
Dependent claims: 2, 3, 4, 5, 6, 7
8. Apparatus for processing information, comprising:
a memory, which is configured to store a set of records, which comprise a plurality of fields containing data regarding respective items; and
a processor, which is configured
to select a field that occurs in all of the records and contains multiple terms in each of the records,
to identify at least first and second terms that occur among the terms in the selected field in the records, such that the records are partitioned into at least first and second respective subsets by occurrences of the at least first and second terms in the selected field,
to compute a metric that increases in response to a union of the subsets and decreases in response to an intersection of the subsets,
to identify a group of the terms, comprising the at least first and second terms, that occur with at least a predetermined frequency among the records such that the terms in the group optimally partition the records in the set, wherein the terms to add to the group are selected so as to maximize the metric,
to determine responsively to partitioning of the records by the occurrences, that the at least first and second terms correspond to at least first and second different values of an attribute of the items, and
to classify the data according to the values of the attribute.
Dependent claims: 9, 10, 11
12. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer
to receive a set of records, which comprise a plurality of fields containing data regarding respective items,
to select a field that occurs in all of the records and contains multiple terms in each of the records,
to identify at least first and second terms that occur among the terms in the selected field in the records, such that the records are partitioned into at least first and second respective subsets by occurrences of the at least first and second terms in the selected field,
to determine responsively to partitioning of the records by the occurrences, that the at least first and second terms correspond to at least first and second different values of a first attribute of the items,
to identify, after identifying a first group of the terms that are associated with the first attribute of the items, a second group of the terms, which is disjoint from the first group and partitions the records into different respective subsets from the terms in the first group,
to determine that the terms in the second group correspond to respective values of a second attribute of the items and
to classify the data according to the values of the first attribute.
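The metric-driven group selection of claim 8 and the second, disjoint group of claims 1 and 12 can be sketched together as follows. The claims do not give a formula, so the score used here (records covered by the group minus records containing more than one group term) is only one assumed metric that rises with the union of the subsets and falls with their intersection. The greedy loop, the `min_freq` threshold, the exclusion of terms that occur in every record, the sample data, and the names `metric`, `grow_group` and `subsets` are likewise hypothetical.

```python
from collections import defaultdict

def term_index(records, field):
    """Map each term in `field` to the set of record ids that contain it."""
    index = defaultdict(set)
    for rec_id, rec in enumerate(records):
        for term in set(rec[field].lower().split()):
            index[term].add(rec_id)
    return dict(index)

def metric(group, index):
    """Assumed score: |union of subsets| minus records hit by more than one term."""
    hits = defaultdict(int)
    for term in group:
        for rec_id in index[term]:
            hits[rec_id] += 1
    covered = len(hits)                                   # grows with the union
    overlapping = sum(1 for c in hits.values() if c > 1)  # grows with intersections
    return covered - overlapping

def grow_group(index, n_records, min_freq=2, exclude=()):
    """Greedily add the term that most improves the metric; stop when none does."""
    # Assumption: terms present in every record cannot distinguish items, so skip them.
    candidates = sorted(
        t for t, ids in index.items()
        if min_freq <= len(ids) < n_records and t not in exclude
    )
    group, best = [], 0
    while candidates:
        term = max(candidates, key=lambda t: metric(group + [t], index))
        score = metric(group + [term], index)
        if score <= best:
            break
        group.append(term)
        candidates.remove(term)
        best = score
    return group

def subsets(group, index):
    """The record subsets induced by the terms of a group."""
    return {frozenset(index[t]) for t in group}

# Hypothetical sample data, not taken from the patent.
records = [
    {"description": "red cotton shirt"},
    {"description": "blue cotton shirt"},
    {"description": "red wool shirt"},
    {"description": "blue wool shirt"},
]
index = term_index(records, "description")
first = grow_group(index, len(records))
second = grow_group(index, len(records), exclude=first)  # disjoint from the first group
print(first, second)  # ['blue', 'red'] ['cotton', 'wool']
print(subsets(first, index) != subsets(second, index))  # True
```

In this toy data the first group corresponds to a color attribute and the second to a material attribute. The final comparison checks that the second group splits the records differently from the first, as the claims require. A production system would need a more careful stopping rule and tie-breaking than this sketch uses.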
Specification