Methods and apparatus for automated matching and classification of data
First Claim
1. A computer-implemented method for processing data, comprising:
receiving an initial set of records comprising initial terms describing respective items in specified categories;
calculating, based on the initial set of records, respective term weights for at least some of the initial terms with respect to at least some of the categories, each term weight indicating, for a given initial term and a given category, a likelihood that a record containing the given initial term belongs to the given category, wherein calculating term weights comprises computing a general probability of occurrence of the given initial term over all of the categories, computing a specific probability of the occurrence of the given initial term in the records belonging to the given category, and determining the term weight responsively to a difference between the specific probability and the general probability for the given initial term with respect to the given category;
receiving a new record, not included in the initial set, the new record comprising particular terms, wherein the particular terms are a subset of the initial terms;
computing respective assignment metrics for two or more of the categories using the respective term weights of the particular terms in the new record with respect to the two or more of the categories; and
classifying the new record in one of the two or more of the categories responsively to the assignment metrics.
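The steps recited in claim 1 can be illustrated with a minimal sketch. Several choices here are assumptions, not taken from the claim: the "general probability" is estimated as the fraction of all records containing the term, the "specific probability" as the fraction of the category's records containing it, the term weight is the raw difference of the two (the claim only requires that the weight be determined "responsively to" that difference), and the assignment metric is a plain sum of weights. The function names `term_weights` and `classify` and the sample records are hypothetical.

```python
from collections import Counter


def term_weights(records):
    """Compute {(term, category): weight} from (terms, category) pairs.

    Assumption: weight = specific probability - general probability,
    both estimated as simple record-count fractions.
    """
    n_total = len(records)
    n_in_cat = Counter()            # records per category
    n_with_term = Counter()         # records containing a term, any category
    n_with_term_in_cat = Counter()  # records containing a term, per category

    for terms, category in records:
        n_in_cat[category] += 1
        for term in set(terms):     # count each term once per record
            n_with_term[term] += 1
            n_with_term_in_cat[(term, category)] += 1

    weights = {}
    for term, n_t in n_with_term.items():
        general = n_t / n_total
        for category, n_c in n_in_cat.items():
            specific = n_with_term_in_cat[(term, category)] / n_c
            weights[(term, category)] = specific - general
    return weights


def classify(terms, weights, categories):
    """Return (best_category, metrics); metric = sum of term weights (assumed)."""
    metrics = {
        c: sum(weights.get((t, c), 0.0) for t in set(terms))
        for c in categories
    }
    return max(metrics, key=metrics.get), metrics


# Hypothetical initial set of records: (terms, category)
initial_records = [
    (["steel", "bolt", "m8"], "fasteners"),
    (["steel", "nut", "m8"], "fasteners"),
    (["copper", "cable", "2m"], "electrical"),
    (["steel", "cable", "tie"], "electrical"),
]

weights = term_weights(initial_records)
label, metrics = classify(["steel", "bolt"], weights, ["fasteners", "electrical"])
print(label, metrics)
```

In this toy data, "steel" appears in three of four records overall but in every "fasteners" record, so its weight is positive for "fasteners" and negative for "electrical"; the new record is therefore assigned to "fasteners".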
3 Assignments
0 Petitions
Abstract
A computer-implemented method for processing data includes receiving an initial set of records including terms describing respective items in specified categories. Based on the initial set of records, respective term weights are calculated for at least some of the terms with respect to at least some of the categories. Each term weight indicates, for a given term and a given category, a likelihood that a record containing the given term belongs to the given category. Upon receiving a new record, not included in the initial set, respective assignment metrics are computed for two or more of the categories using the respective term weights of the particular terms in the new record with respect to the two or more of the categories. The new record is classified in one of the two or more of the categories responsively to the assignment metrics.
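The classification step summarized in the abstract can be written compactly. The notation below is an assumption introduced for illustration: $w(t, c)$ denotes the term weight of term $t$ with respect to category $c$, $T_r$ the set of terms in the new record $r$, and $C$ the two or more candidate categories. The abstract does not specify how the weights are combined, so the additive form of the assignment metric $A$ is only one possible choice.

```latex
% Assumed additive assignment metric for a new record r and category c
A(r, c) = \sum_{t \in T_r} w(t, c)

% The record is classified in the category with the highest metric
\hat{c}(r) = \arg\max_{c \in C} A(r, c)
```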
37 Citations
17 Claims
1. A computer-implemented method for processing data, comprising:
receiving an initial set of records comprising initial terms describing respective items in specified categories;
calculating, based on the initial set of records, respective term weights for at least some of the initial terms with respect to at least some of the categories, each term weight indicating, for a given initial term and a given category, a likelihood that a record containing the given initial term belongs to the given category, wherein calculating term weights comprises computing a general probability of occurrence of the given initial term over all of the categories, computing a specific probability of the occurrence of the given initial term in the records belonging to the given category, and determining the term weight responsively to a difference between the specific probability and the general probability for the given initial term with respect to the given category;
receiving a new record, not included in the initial set, the new record comprising particular terms, wherein the particular terms are a subset of the initial terms;
computing respective assignment metrics for two or more of the categories using the respective term weights of the particular terms in the new record with respect to the two or more of the categories; and
classifying the new record in one of the two or more of the categories responsively to the assignment metrics.
Dependent claims: 2, 3, 4, 5, 6
7. Apparatus for processing data, comprising:
a memory, which is arranged to hold an initial set of records comprising initial terms describing respective items in specified categories; and
a processor, which is arranged to:
calculate, based on the initial set of records, respective term weights for at least some of the initial terms with respect to at least some of the categories, each term weight indicating, for a given initial term and a given category, a likelihood that a record containing the given initial term belongs to the given category, wherein calculating term weights comprises computing a general probability of occurrence of the given initial term over all of the categories, computing a specific probability of the occurrence of the given initial term in the records belonging to the given category, and determining the term weight responsively to a difference between the specific probability and the general probability for the given initial term with respect to the given category;
receive a new record, not included in the initial set, the new record comprising particular terms, wherein the particular terms are a subset of the initial terms;
compute respective assignment metrics for two or more of the categories using the respective term weights of the particular terms in the new record with respect to the two or more of the categories; and
classify the new record in one of the two or more of the categories responsively to the assignment metrics.
Dependent claims: 8, 9, 10, 11
12. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by the computer, cause the computer to:
receive an initial set of records comprising initial terms describing respective items in specified categories, and to calculate, based on the initial set of records, respective term weights for at least some of the initial terms with respect to at least some of the categories, each term weight indicating, for a given initial term and a given category, a likelihood that a record containing the given initial term belongs to the given category, wherein calculating the term weight for the given initial term comprises computing a general probability of occurrence of the given initial term over all of the categories, computing a specific probability of the occurrence of the given initial term in the records belonging to the given category, and determining the term weight responsively to a difference between the specific probability and the general probability for the given initial term with respect to the given category;
receive a new record, not included in the initial set, the new record comprising particular terms, wherein the particular terms are a subset of the initial terms;
compute respective assignment metrics for two or more of the categories using the respective term weights of the particular terms in the new record with respect to the two or more of the categories; and
classify the new record in one of the two or more of the categories responsively to the assignment metrics.
Dependent claims: 13, 14, 15, 16, 17
Specification