Method for detecting incorrectly categorized data
Abstract
A method for detecting incorrect categorization of data includes obtaining a database containing a plurality of entry-category pairs, calculating a score for each entry-category pair that corresponds to a likelihood that the pair contains an incorrect category assignment, and verifying the correctness of the assignment based on the score. The verification step can be conducted manually. The score assists users in focusing any manual verification efforts on data that may actually contain incorrect category assignments, thereby making the verification process more efficient. The method can be used to review and correct business name and phone number listings in telephone directories.
26 Claims
-
1. A method for determining the accuracy of an assignment of an entry to a category, comprising the steps of:
-
obtaining a database containing a plurality of entry-category pairs;
calculating a score for each entry-category pair, wherein the score corresponds to likelihood that the entry is correctly assigned to the category;
generating a curve based on the scores from the calculating step, wherein the curve indicates the likelihood that a given portion of the entry-category pairs contain inaccurate assignments; and
determining a threshold point on the curve corresponding to a predetermined inaccurate categorization threshold level.
selecting a reference entry-category pair;
counting a total number of times that the reference entry-category pair appear in the database; and
counting a total number of times that each entry and each category in the reference entry-category pair appear in the database.
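The three counting steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the toy `pairs` list and the helper name `count_pairs` are hypothetical and do not come from the patent text.

```python
from collections import Counter

# Hypothetical toy database of (entry, category) pairs; not data from the source.
pairs = [
    ("pizza", "restaurants"),
    ("pizza", "restaurants"),
    ("pizza", "plumbers"),
    ("pipe", "plumbers"),
    ("pipe", "plumbers"),
    ("grill", "restaurants"),
]


def count_pairs(pairs):
    """Return (pair_counts, entry_counts, category_counts, total)."""
    pair_counts = Counter(pairs)                              # occurrences of each pair
    entry_counts = Counter(entry for entry, _ in pairs)       # occurrences of each entry
    category_counts = Counter(cat for _, cat in pairs)        # occurrences of each category
    return pair_counts, entry_counts, category_counts, len(pairs)


pair_counts, entry_counts, category_counts, total = count_pairs(pairs)

# Select a reference pair and look up the three counts named in the steps.
reference = ("pizza", "plumbers")
print(pair_counts[reference], entry_counts["pizza"], category_counts["plumbers"], total)
```

A `Counter` returns 0 for a pair that never occurs, which is convenient when the reference pair is absent from the database.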
-
-
5. The method of claim 1, wherein the score in the calculating step for a given category is:
-
wherein J(c,w) is the total number of times that a selected entry-category pair having a selected entry and a selected category appear in the database, N is the total number of pairs in the database, C(c) is the total number of pairs that the selected category appears in the database, and W(w) is the total number of pairs that the selected entry appears in the database.
-
-
6. The method of claim 4, wherein the categories are business categories and the entries are words in a business name.
-
7. The method of claim 4, wherein the score in the calculating step for a given entry-category pair is the ratio of the probability that a given word in the entry is in the category and the probability that the given word is not in the category.
-
8. The method of claim 7, wherein the entry has a plurality of words, and wherein the ratio is calculated for at least one word in the entry.
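One possible reading of the ratio in claims 7 and 8 is sketched below, assuming that the two probabilities are estimated from word-category counts. The function names, the toy counts, and the additive `smoothing` term (added so that a word never seen outside the category does not cause a division by zero) are all assumptions made for illustration; the claims do not specify how the probabilities are estimated or how per-word ratios are combined.

```python
from collections import Counter


def word_odds(word, category, pair_counts, word_counts, smoothing=0.5):
    """Odds that `word` occurs in `category` versus in any other category."""
    in_category = pair_counts[(word, category)]
    elsewhere = word_counts[word] - in_category
    # The shared denominator word_counts[word] cancels, leaving a ratio of counts.
    return (in_category + smoothing) / (elsewhere + smoothing)


def entry_odds(name, category, pair_counts, word_counts):
    """Compute the ratio for every word of a multi-word entry (claim 8)."""
    return {
        word: word_odds(word, category, pair_counts, word_counts)
        for word in name.lower().split()
    }


# Hypothetical word-category counts; not data from the source.
pair_counts = Counter({
    ("pizza", "restaurants"): 8,
    ("pizza", "plumbers"): 1,
    ("pipe", "plumbers"): 5,
})
word_counts = Counter({"pizza": 9, "pipe": 5})

ratios = entry_odds("Pizza Pipe", "plumbers", pair_counts, word_counts)
print(ratios)
```

In this toy data, "pizza" gets a ratio well below 1 for the plumbers category while "pipe" gets a ratio well above 1, so the first word is the one that flags the assignment as suspect.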
-
9. The method of claim 7, further comprising the step of calculating a logodds ratio of a multivariate model of the entry-category pair and an independent model of the entry-category pair expressed as:
-
wherein N is the total number of entry-category pairs, P(w,c) is a probability that a selected entry-category pair having a selected entry and a selected category appears in the database, P(w) is a probability that the selected entry appears in the database, and P(c) is a probability that the selected category appears in the database.
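The expression referred to in claim 9 is not reproduced in this text, so the sketch below does not claim to be the patent's formula. It assumes the common form of a log ratio between a joint estimate P(w,c) and the independence estimate P(w)P(c), with each probability estimated as a count divided by N. The function name `log_odds` and its parameters are hypothetical.

```python
import math


def log_odds(joint_count, word_count, category_count, n):
    """log( P(w,c) / (P(w) * P(c)) ), with probabilities estimated as count / n.

    Assumed form only; the source text does not show the claimed expression.
    """
    if joint_count == 0:
        return float("-inf")  # the pair never occurs: as unlikely as this estimate allows
    p_wc = joint_count / n
    p_w = word_count / n
    p_c = category_count / n
    return math.log(p_wc / (p_w * p_c))


# The pair occurs twice as often as independence would predict.
print(log_odds(joint_count=2, word_count=4, category_count=4, n=16))
# The pair occurs exactly as often as independence would predict.
print(log_odds(joint_count=1, word_count=4, category_count=4, n=16))
```

Under this assumed form, a positive value means the entry and category co-occur more often than chance, and a negative value means they co-occur less often, which is the direction that suggests a miscategorization.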
-
-
10. The method of claim 1, wherein the categories are business categories and the entries are words in a business name.
-
11. The method of claim 1, further comprising the steps of:
-
ordering the entry-category pairs based on their scores; and
reviewing the entry-category pairs to detect at least one specific pair containing an inaccurate assignment.
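The ordering and reviewing steps of claim 11 might look like the sketch below. It assumes that a lower score means the assignment is less likely to be correct, as in claim 1, so the lowest scores are presented to a reviewer first. The name `review_queue` and the sample scores are hypothetical.

```python
def review_queue(scored_pairs, limit=None):
    """Order (entry, category, score) tuples so the most suspect come first.

    Assumes lower score = less likely to be a correct assignment.
    """
    ordered = sorted(scored_pairs, key=lambda item: item[2])
    return ordered if limit is None else ordered[:limit]


# Hypothetical scores; not data from the source.
scored = [
    ("pizza", "plumbers", -1.2),
    ("pipe", "plumbers", 0.9),
    ("grill", "restaurants", 0.4),
]

for entry, category, score in review_queue(scored, limit=2):
    print(f"{entry!r} in {category!r}: score {score}")
```

If the score instead measures the likelihood of an incorrect assignment, as the Abstract describes it, the sort would simply be reversed with `reverse=True`.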
-
-
24. The method of claim 1, wherein the threshold value is a constant value.
-
25. The method of claim 24, wherein the given portion of the entry-category pairs is an entirety of the entry-category pairs.
-
26. The method of claim 25, wherein an X axis of the curve represents individual entry-category pairs.
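A simple way to picture the curve and threshold point of claims 1 and 24 to 26 is sketched below. It assumes the curve is the sequence of scores sorted in ascending order, with the X axis being the rank of each individual pair, and that the threshold is a constant score level. These are interpretive assumptions, and the names `score_curve` and `threshold_point` are hypothetical.

```python
from bisect import bisect_left


def score_curve(scores):
    """Y values of the curve: scores in ascending order; X is the index of each pair."""
    return sorted(scores)


def threshold_point(curve, level):
    """Index of the first point on the curve whose score is >= the constant level."""
    return bisect_left(curve, level)


# Hypothetical scores for five entry-category pairs; not data from the source.
curve = score_curve([0.9, -1.2, 0.4, 2.0, -0.3])
cut = threshold_point(curve, level=0.0)

suspect_region = curve[:cut]   # pairs below the threshold level, to be reviewed
print(cut, suspect_region)
```

Because the curve is sorted, a binary search locates the threshold point without scanning every pair, and everything to its left forms the region most likely to contain inaccurate assignments.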
-
12. A method for determining the accuracy of an assignment of an entry to a category, comprising the steps of:
-
obtaining a first database containing a plurality of reference entry-category pairs;
obtaining a second database containing a plurality of test entry-category pairs;
calculating a score for each test entry-category pair corresponding to a comparison between the second database and the first database, wherein the score corresponds to a likelihood that the test entry is correctly assigned to the test category;
sorting the test entry-category pairs according to the scores from the calculating step;
generating a curve based on the scores from the calculating step, wherein the curve indicates the likelihood that a given portion of the entry-category pairs contain inaccurate assignments; and
determining a threshold point on the curve corresponding to a predetermined inaccurate categorization threshold level, wherein a region outside of the threshold level indicates a greater likelihood of inaccurate assignments.
selecting a reference entry-category pair;
counting a total number of times that the reference entry-category pair appear in the database; and
counting a total number of times that each entry and each category in the reference entry-category pair appear in the database.
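The two-database arrangement of claim 12 can be sketched as follows: counts are gathered from a reference database, and each test pair is scored against those counts and then sorted. The scoring function used here (a smoothed log ratio of joint to independent estimates) is an assumption chosen for illustration, as are all names and the toy data. The additive smoothing is crude and does not produce normalized probabilities.

```python
import math
from collections import Counter


def build_reference(reference_pairs):
    """Gather pair, entry, and category counts from the reference database."""
    return {
        "pairs": Counter(reference_pairs),
        "entries": Counter(entry for entry, _ in reference_pairs),
        "categories": Counter(cat for _, cat in reference_pairs),
        "total": len(reference_pairs),
    }


def score_test_pair(pair, ref, smoothing=0.5):
    """Assumed score: log of joint estimate over independence estimate."""
    entry, category = pair
    n = ref["total"]
    joint = (ref["pairs"][pair] + smoothing) / n
    p_entry = (ref["entries"][entry] + smoothing) / n
    p_category = (ref["categories"][category] + smoothing) / n
    return math.log(joint / (p_entry * p_category))


def rank_test_pairs(test_pairs, ref):
    """Sort test pairs so the lowest-scoring (most suspect) come first."""
    return sorted(test_pairs, key=lambda pair: score_test_pair(pair, ref))


# Hypothetical databases; not data from the source.
reference_db = [("pizza", "restaurants")] * 4 + [("pipe", "plumbers")] * 4
test_db = [("pipe", "plumbers"), ("pizza", "plumbers")]

ref = build_reference(reference_db)
ranked = rank_test_pairs(test_db, ref)
print(ranked)
```

Here the test pair that never occurs in the reference database receives a negative score and rises to the front of the ranking, while the pair that agrees with the reference data receives a positive score.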
-
-
17. The method of claim 12, wherein the score in the calculating step for a given category is:
-
wherein J(c,w) is the total number of times that a selected entry-category pair having a selected entry and a selected category appear in the database, N is the total number of pairs in the database, C(c) is the total number of pairs that the selected category appears in the database, and W(w) is the total number of pairs that the selected entry appears in the database.
-
-
18. The method of claim 17, wherein the categories are business categories and the entries are words in a business name.
-
19. The method of claim 17, wherein the score in the calculating step for a given entry-category pair is the ratio of the probability that a given word in the entry is in the category and the probability that the given word is not in the category.
-
21. The method of claim 19, further comprising the step of calculating a logodds ratio of a multivariate model of the entry-category pair and an independent model of the entry-category pair expressed as:
-
wherein N is the total number of entry-category pairs, P(w,c) is a probability that a selected entry-category pair having a selected entry and a selected category appears in the database, P(w) is a probability that the selected entry appears in the database, and P(c) is a probability that the selected category appears in the database.
-
-
22. The method of claim 12, wherein the categories are business categories and the entries are words in a business name.
-
23. The method of claim 12, further comprising the step of reviewing the entry-category pairs associated with the region outside the threshold level to detect at least one pair containing an inaccurate assignment.
-
20. The method of claim 19, wherein the entry has a plurality of words, and wherein the ratio is calculated for at least one word in the entry.
Specification