Categorization based on record linkage theory
Abstract
A method and apparatus for categorizing an item based on Record Linkage Theory are disclosed. A related method and apparatus for assigning a confidence level to the categorization process are also disclosed. In one aspect, the item to be categorized is parsed into at least one token. At least one category that contains the token in the training set is identified. A weight is calculated for each token with respect to a first category. The weights are combined to determine the total weight of the first category. The weighting process is repeated for each relevant category. Where one of a plurality of threshold values is met or exceeded, the item may be automatically assigned to the category with the highest total weight. The combination of threshold values may be selected based on the confidence level associated with that combination of threshold values. Weights for each relevant category, possibly ordered, may be presented.
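The last three sentences of the abstract describe a decision rule: assign automatically when any one threshold is met, otherwise present the ordered weights. A minimal Python sketch of that rule follows. The abstract does not name the variables the thresholds apply to, so `top_weight` and `margin` (the gap between the best and second-best category) are hypothetical choices, as is the function name `rank_and_assign`.

```python
def rank_and_assign(totals, thresholds):
    """Return (auto-assigned category or None, categories ordered by weight).

    totals:     dict mapping category -> total weight for one item
    thresholds: dict mapping a variable name -> threshold value
    The variable names 'top_weight' and 'margin' are assumptions made
    for illustration; the source does not specify them.
    """
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None, ranked

    top_category, top_weight = ranked[0]
    # With a single candidate there is no competitor, so the margin is
    # treated as unbounded (a design choice of this sketch).
    margin = top_weight - ranked[1][1] if len(ranked) > 1 else float("inf")
    variables = {"top_weight": top_weight, "margin": margin}

    # Auto-assign when any ONE of the thresholds is met or exceeded.
    met = any(variables[name] >= limit for name, limit in thresholds.items())
    return (top_category if met else None), ranked


# Made-up total weights for one item.
totals = {"tools": 2.6, "apparel": -0.3, "toys": 0.9}
assigned, ranked = rank_and_assign(totals, {"top_weight": 5.0, "margin": 1.5})
print(assigned, ranked)
```

When no threshold is met the function returns `None` for the category, and a caller could show the ordered list to a human reviewer instead.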
53 Claims
1. A method for categorizing an item based on a training set of categorized items, the method comprising the steps of:
(a) parsing an item into at least one token;
(b) identifying, from a plurality of categories, at least one category that contains at least one occurrence of at least one token from the item;
(c) calculating an agreement weight based on Record Linkage Theory for each of the at least one token from the item that occurs in a first category of the at least one category;
(d) combining the agreement weights to determine a total weight for the first category; and
(e) repeating steps (c) and (d) for each category that contains at least one occurrence of at least one token from the item.
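Steps (a) through (e) of claim 1 can be sketched in Python as below. The claim text does not give a formula for the agreement weight, so this sketch assumes the standard Fellegi-Sunter form from record linkage theory, log2(m/u), where m is the probability that a token appears in an item of the category and u is the probability that it appears in an item of any other category. Whitespace tokenization, add-one smoothing, summation as the combining step, the toy training data, and all function names are likewise assumptions for illustration, not the patent's implementation.

```python
import math
from collections import Counter, defaultdict

# Toy training set (made up for illustration): (item text, category) pairs.
TRAINING = [
    ("red cotton shirt", "apparel"),
    ("blue denim jeans", "apparel"),
    ("steel toe boots", "apparel"),
    ("steel claw hammer", "tools"),
    ("cordless power drill", "tools"),
]


def parse(item):
    """Step (a): split an item into lowercase, whitespace-delimited tokens."""
    return item.lower().split()


def build_index(training_set):
    """Count, per category, how many training items contain each token."""
    token_counts = defaultdict(Counter)  # category -> token -> item count
    category_sizes = Counter()           # category -> number of items
    for text, category in training_set:
        category_sizes[category] += 1
        for token in set(parse(text)):
            token_counts[category][token] += 1
    return token_counts, category_sizes


def agreement_weight(token, category, token_counts, category_sizes):
    """Step (c): log2(m/u), with add-one smoothing (an assumption)."""
    in_cat = token_counts[category][token]
    n_cat = category_sizes[category]
    in_rest = sum(counts[token] for cat, counts in token_counts.items()
                  if cat != category)
    n_rest = sum(category_sizes.values()) - n_cat
    m = (in_cat + 1) / (n_cat + 2)    # P(token | item is in the category)
    u = (in_rest + 1) / (n_rest + 2)  # P(token | item is in another category)
    return math.log2(m / u)


def categorize(item, token_counts, category_sizes):
    """Return a dict mapping each candidate category to its total weight."""
    tokens = set(parse(item))
    # Step (b): categories containing at least one token from the item.
    candidates = [cat for cat, counts in token_counts.items()
                  if any(t in counts for t in tokens)]
    totals = {}
    for category in candidates:  # step (e): repeat for every candidate
        weights = [agreement_weight(t, category, token_counts, category_sizes)
                   for t in tokens if t in token_counts[category]]
        totals[category] = sum(weights)  # step (d): combine by summing
    return totals


token_counts, category_sizes = build_index(TRAINING)
totals = categorize("steel hammer", token_counts, category_sizes)
for category, weight in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category}: {weight:+.2f}")
```

In this toy run "steel" occurs in both categories, so both become candidates in step (b), while "hammer" contributes a weight only to the category in which it occurs, as step (c) requires.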
15. A method for categorizing an item based on a training set of categorized items, the method comprising the steps of:
(a) parsing an item into at least one token;
(b) a step for identifying, from a plurality of categories, at least one category that contains at least one occurrence of the at least one token;
(c) calculating an agreement weight based on Record Linkage Theory for each of the at least one token from the item that occurs in a first category of the at least one category;
(d) a step for combining the agreement weights to determine a total weight for the first category; and
(e) repeating steps (c) and (d) for each category that contains at least one occurrence of at least one token from the item.
24. A method for selecting threshold values for automatically categorizing an item with a training set of categorized items, the method comprising the steps of:
(a) calculating a weight for at least one category for each item in a training set;
(b) selecting a combination of a plurality of threshold values for a plurality of variables;
(c) deeming an item in the training set assigned to the highest weight category for purposes of determining whether the item is correctly categorized when one of the threshold values in the combination of threshold values is met or exceeded by the item;
(d) repeating steps (b) through (c) for each combination of threshold values;
(e) calculating a confidence level for the combination of threshold values, the confidence level being a number of correctly categorized items in the training set less a number of incorrectly categorized items in the training set; and
(f) choosing the combination of threshold values that results in the highest confidence level.
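The selection procedure of claim 24 amounts to an exhaustive search over threshold combinations, scored by correct minus incorrect automatic assignments. The Python sketch below illustrates it under stated assumptions: the per-item variables `top_weight` and `margin` are hypothetical (the claim says only "a plurality of variables"), the candidate values in `grid` are made up, and ties between combinations are broken by keeping the first one found.

```python
from itertools import product


def item_variables(totals):
    """Hypothetical per-item variables derived from the category weights."""
    ranked = sorted(totals.values(), reverse=True)
    top = ranked[0]
    # A single candidate has no competitor; treat its margin as unbounded.
    margin = top - ranked[1] if len(ranked) > 1 else float("inf")
    return {"top_weight": top, "margin": margin}


def confidence(scored_items, thresholds):
    """Step (e): correct minus incorrect among auto-assigned items."""
    correct = incorrect = 0
    for totals, true_category in scored_items:
        variables = item_variables(totals)
        # Step (c): the item is deemed assigned to its highest-weight
        # category when any one threshold is met or exceeded.
        if any(variables[name] >= limit for name, limit in thresholds.items()):
            if max(totals, key=totals.get) == true_category:
                correct += 1
            else:
                incorrect += 1
    return correct - incorrect


def select_thresholds(scored_items, grid):
    """Steps (b), (d), (f): try every combination and keep the best."""
    names = list(grid)
    best_combo, best_conf = None, None
    for values in product(*(grid[name] for name in names)):
        combo = dict(zip(names, values))
        conf = confidence(scored_items, combo)
        if best_conf is None or conf > best_conf:
            best_combo, best_conf = combo, conf
    return best_combo, best_conf


# Step (a) is assumed done already: each training item carries its
# category weights and its known category (values made up for illustration).
scored_items = [
    ({"a": 5.0, "b": 1.0}, "a"),
    ({"a": 2.0, "b": 1.8}, "b"),
    ({"b": 3.0, "a": 0.5}, "b"),
]
grid = {"top_weight": [1.0, 3.0, 6.0], "margin": [0.1, 1.0, 5.0]}
print(select_thresholds(scored_items, grid))
```

Items that meet no threshold count neither for nor against a combination, so very strict thresholds can score a confidence of zero by assigning nothing.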
29. An apparatus for categorizing an item based on a training set of categorized items, the apparatus comprising:
an input device;
a parser in signal communication with the input device, said parser parsing an item from the input device into at least one token;
an occurrence identifier in signal communication with the parser, the occurrence identifier identifying, from a plurality of categories, at least one category that contains at least one occurrence of the at least one token;
a calculator in signal communication with the occurrence identifier, the calculator calculating an agreement weight based on Record Linkage Theory for each of the at least one token from the item that occurs in a first category of the at least one category;
a summer in signal communication with the calculator, the summer summing the agreement weights to determine a total weight for the first category; and
a looping mechanism in signal communication with the occurrence identifier and the summer, the looping mechanism initializing the calculator for each category that contains at least one occurrence of at least one token from the item.
41. An apparatus for categorizing an item based on a training set of categorized items, the apparatus comprising:
an input device;
a parser in signal communication with the input device, said parser parsing an item from the input device into at least one token;
a means for identifying at least one category that contains at least one occurrence of the at least one token, from a plurality of categories, the means for identifying at least one category in signal communication with the parser;
a means for calculating a weight based on Record Linkage Theory for each of the at least one token from the item that occurs in a first category of the at least one category, the means for calculating a weight in signal communication with the occurrence identifier;
a means for combining weights to determine a total weight for the first category, the means for combining weights in signal communication with the calculator; and
a looping mechanism in signal communication with the means for identifying at least one category and the means for calculating a weight, the looping mechanism initializing the means for calculating a weight for each category that contains at least one occurrence of at least one token from the item.
49. An apparatus for selecting threshold values for automatically categorizing an item with a training set of categorized items, the apparatus comprising:
a memory element, the memory element containing a plurality of items in a training set;
a categorizer in signal communication with the memory element, the categorizer calculating a weight for at least one category for each of the plurality of items in the training set;
a selector in signal communication with the categorizer, the selector selecting a combination of a plurality of threshold values for a plurality of variables;
a first comparator in signal communication with the selector, the first comparator deeming an item in the training set assigned to the highest weight category for purposes of determining whether the item is correctly categorized when one or more of the threshold values in the combination of threshold values is met or exceeded by the item;
a looping mechanism in signal communication with the selector and the first comparator, the looping mechanism initializing the first comparator for each combination of threshold values; and
a calculator in signal communication with the looping mechanism, the calculator calculating a confidence level for the combination of threshold values, the confidence level being a number of correctly categorized items in the training set less a number of incorrectly categorized items in the training set;
a second comparator in signal communication with the calculator, the second comparator identifying the combination of threshold values that results in the highest confidence level.
Specification