Method and apparatus for score normalization for information retrieval applications
Abstract
A method and apparatus for normalizing a score associated with a document is presented. Statistics relating to scores assigned to a set of training documents not relevant to a topic are determined. Scores represent a measure of relevance to the topic. After the various statistics have been collected, a score assigned to a testing document is normalized based on those statistics. The normalized score is then compared to a threshold score. Subsequently, the testing document is designated as relevant or not relevant to the topic based on the comparison.
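The flow described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: it assumes the statistics are the sample mean and sample standard deviation of the off-topic training scores and that normalization is a plain z-score. The function names (`off_topic_stats`, `normalize`, `designate`) and the example scores are hypothetical.

```python
from statistics import mean, stdev


def off_topic_stats(training_scores):
    """Return (mean, standard deviation) of off-topic training scores.

    Assumption: the abstract does not name the statistics, so the sample
    mean and sample standard deviation are used here for illustration.
    """
    return mean(training_scores), stdev(training_scores)


def normalize(score, mu, sigma):
    """Express a raw score in standard deviations above the off-topic mean."""
    if sigma == 0:
        raise ValueError("off-topic scores have zero spread; cannot normalize")
    return (score - mu) / sigma


def designate(score, mu, sigma, threshold):
    """Compare the normalized score to the threshold and label the document."""
    return "relevant" if normalize(score, mu, sigma) > threshold else "not relevant"


if __name__ == "__main__":
    # Hypothetical raw relevance scores of off-topic training documents.
    mu, sigma = off_topic_stats([0.10, 0.20, 0.30, 0.40, 0.50])
    for raw in (0.90, 0.35):
        print(raw, designate(raw, mu, sigma, threshold=3.0))
```

Under these assumptions, a testing document is designated relevant only when its raw score lies more than `threshold` standard deviations above the typical off-topic score.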
108 Citations
27 Claims
-
1. A method facilitated by a human annotator and performed in a computer environment for normalizing a score associated with a document, the method comprising the steps of:
-
(a) establishing, through the human annotator, a query relevant to a topic (on-topic) and a set of training documents not relevant to the topic (off-topic);
(b) assigning, through the computer environment, a training document relevance score to each one of the training documents, each training document relevance score representing a measure of relevance of its respective document to the topic;
(c) determining, through the computer environment, statistics relating to all training document relevance scores;
(d) receiving a testing document;
(e) calculating, through the computer environment, a score of relevance of the testing document to the topic to obtain a testing document relevance score;
(f) normalizing, through the computer environment and based on the statistics, the testing document relevance score to obtain a normalized score wherein;
normalizing adjusts the testing document relevance score based on the statistics to be comparable to other scores from which the statistics were determined, and the normalized score is a better predictor of probability of the testing document being relevant than the testing document relevant score;
(g) establishing, through the computer environment, a threshold score representing a relevance threshold for the topic;
(h) comparing the normalized score to the threshold score to obtain a comparison; and
(i) designating the testing document as relevant or not relevant to the topic based on the comparison.
-
-
4. The method of claim 1, said determining step further comprising:
determining statistics relating to scores assigned to a set of training documents relevant to the topic.
-
5. The method of claim 4, wherein the statistics relating to scores assigned to a set of training documents relevant to the topic include a mean score of the documents relevant to the topic and a standard deviation of the scores assigned to the set of training documents relevant to the topic.
-
6. The method of claim 5, wherein said normalizing step comprises:
normalizing a score assigned to a testing document based on the statistics relating to the scores assigned to the set of training documents not relevant to the topic and based on the statistics relating to the scores assigned to the set of training documents relevant to the topic.
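One way to combine off-topic and on-topic statistics, as claims 4 to 6 describe, is sketched below. This is an assumption chosen purely for illustration and is not the formula referenced in claim 7, which does not appear in this text. The sketch models each set of training scores as a Gaussian and reports the log-likelihood ratio of the two models; the function names are hypothetical.

```python
import math


def gaussian_log_density(x, mu, sigma):
    """Log of the normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def two_class_normalized_score(score, off_mu, off_sigma, on_mu, on_sigma):
    """Gaussian log-likelihood ratio: on-topic model versus off-topic model.

    Illustrative assumption only. Positive values mean the raw score is
    better explained by the on-topic statistics, negative values by the
    off-topic statistics.
    """
    return (gaussian_log_density(score, on_mu, on_sigma)
            - gaussian_log_density(score, off_mu, off_sigma))


if __name__ == "__main__":
    # Hypothetical statistics: off-topic (mean 0, sd 1), on-topic (mean 2, sd 1).
    for raw in (0.0, 1.0, 2.0):
        print(raw, two_class_normalized_score(raw, 0.0, 1.0, 2.0, 1.0))
```

With equal standard deviations the ratio is zero exactly halfway between the two means, so a threshold of zero would split the score range at that midpoint.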
-
7. The method of claim 6, wherein said normalizing step determines the normalized score according to the following formula:
-
8. The method of claim 1, wherein said designating step comprises:
-
designating the testing document as relevant to the topic based on a determination that the normalized score is greater than the threshold score; and
designating the testing document as not relevant to the topic based on a determination that the normalized score is not greater than the threshold score.
-
-
9. The method of claim 1, further comprising:
repeating steps (a)-(d) for a plurality of topics.
-
10. The method of claim 1, further comprising:
repeating steps (a)-(d) for a plurality of testing documents.
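Repeating the procedure over several topics and several testing documents, as in claims 9 and 10, might look like the following sketch. It assumes z-score normalization against per-topic off-topic statistics and a single shared threshold; the topic names, scores, and function names are hypothetical.

```python
from statistics import mean, stdev


def fit_topic_stats(off_topic_scores_by_topic):
    """Map each topic to the (mean, sd) of its off-topic training scores."""
    return {topic: (mean(scores), stdev(scores))
            for topic, scores in off_topic_scores_by_topic.items()}


def classify(raw_scores, stats, threshold):
    """Label every (topic, document) pair using that topic's own statistics.

    raw_scores maps (topic, doc_id) to a raw relevance score.
    """
    labels = {}
    for (topic, doc_id), raw in raw_scores.items():
        mu, sigma = stats[topic]
        labels[(topic, doc_id)] = (raw - mu) / sigma > threshold
    return labels


if __name__ == "__main__":
    # Hypothetical topics whose raw scores live on very different scales.
    stats = fit_topic_stats({
        "sports": [10, 12, 14, 16, 18],
        "finance": [0.1, 0.2, 0.3, 0.4, 0.5],
    })
    raw = {("sports", "d1"): 15, ("finance", "d1"): 0.9}
    print(classify(raw, stats, threshold=2.0))
```

In this toy data the raw score 15 exceeds 0.9, yet only the finance document clears the shared threshold once each score is measured against its own topic's off-topic distribution. That is the sense in which normalized scores become comparable across topics.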
-
11. The method of claim 1, wherein the statistics include a robust estimate of a mean score of the training documents not relevant to the topic and a robust estimate of a standard deviation of the scores assigned to the set of training documents not relevant to the topic.
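Claim 11 does not say which robust estimators are meant. A common choice, assumed here only for illustration, is the median as the robust center and the median absolute deviation (MAD), scaled by 1.4826 so that it matches the standard deviation for normally distributed data, as the robust spread. The function name `robust_stats` is hypothetical.

```python
from statistics import median

# Scale factor that makes the MAD consistent with the standard deviation
# when the underlying data are normally distributed.
MAD_TO_SIGMA = 1.4826


def robust_stats(scores):
    """Return (median, scaled MAD) as outlier-resistant center and spread."""
    scores = list(scores)
    center = median(scores)
    mad = median([abs(s - center) for s in scores])
    return center, MAD_TO_SIGMA * mad


if __name__ == "__main__":
    # One stray high score (for example, a mislabeled on-topic document)
    # barely moves the robust estimates.
    print(robust_stats([1, 2, 3, 4, 100]))
```

The design motivation is that an off-topic training set may contain a few documents that are in fact relevant and score very high; an ordinary mean and standard deviation would be inflated by them, while the median and MAD are not.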
-
12. A data processing system for normalizing a score associated with a document, comprising:
-
a memory having program instructions; and
a processor responsive to the program instructions to;
determine statistics relating to scores assigned to a set of training documents not relevant to a topic, the scores representing a measure of relevance to the topic;
normalize a score assigned to a testing document based on the statistics to obtain a normalized score wherein;
normalizing adjusts the testing document relevance score based on the statistics to be comparable to other scores from which the statistics were determined, and the normalized score is a better predictor of probability of the testing document being relevant than the testing document relevant score;
compare the normalized score to a threshold score to obtain a comparison; and
designate the testing document as relevant or not relevant to the topic based on the comparison.
-
-
13. A computer-readable medium containing instructions for performing a method for normalizing a score associated with a document, the method facilitated by a human annotator comprising:
-
(a) establishing, through the human annotator, a query relevant to a topic (on-topic) and a set of training documents not relevant to the topic (off-topic);
(b) assigning, through the computer environment, a training document relevance score to each one of the training documents, each training document relevance score representing a measure of relevance of its respective document to the topic;
(c) determining, through the computer environment, statistics relating to all training document relevance scores;
(d) receiving a testing document;
(e) calculating, through the computer environment, a score of relevance of the testing document to the topic to obtain a testing document relevance score;
(f) normalizing, through the computer environment and based on the statistics, the testing document relevance score to obtain a normalized score wherein;
normalizing adjusts the testing document relevance score based on the statistics to be comparable to other scores from which the statistics were determined, and the normalized score is a better predictor of probability of the testing document being relevant than the testing document relevant score;
(g) establishing, through the computer environment, a threshold score representing a relevance threshold for the topic;
(h) comparing the normalized score to the threshold score to obtain a comparison; and
(i) designating the testing document as relevant or not relevant to the topic based on the comparison.
-
-
16. The computer-readable medium of claim 13, said determining step further comprising:
determining statistics relating to scores assigned to a set of training documents relevant to the topic.
-
17. The computer-readable medium of claim 16, wherein the statistics relating to scores assigned to a set of training documents relevant to the topic include a mean score of the documents relevant to the topic and a standard deviation of the scores assigned to the set of training documents relevant to the topic.
-
18. The computer-readable medium of claim 17, wherein said normalizing step comprises:
normalizing a score assigned to a testing document based on the statistics relating to the scores assigned to the set of training documents not relevant to the topic and based on the statistics relating to the scores assigned to the set of training documents relevant to the topic.
-
19. The computer-readable medium of claim 18, wherein said normalizing step determines the normalized score according to the following formula:
-
20. The computer-readable medium of claim 13, wherein said designating step comprises:
-
designating the testing document as relevant to the topic based on a determination that the normalized score is greater than the threshold score; and
designating the testing document as not relevant to the topic based on a determination that the normalized score is not greater than the threshold score.
-
-
21. The computer-readable medium of claim 13, further comprising:
repeating steps (a)-(d) for a plurality of topics.
-
22. The computer-readable medium of claim 13, further comprising:
repeating steps (a)-(d) for a plurality of testing documents.
-
23. The computer-readable medium of claim 13, wherein the statistics include a robust estimate of a mean score of the training documents not relevant to the topic and a robust estimate of a standard deviation of the scores assigned to the set of training documents not relevant to the topic.
-
24. A method, facilitated by a human annotator and performed in a computer environment for normalizing a score associated with a document, the method comprising the steps of:
-
(a) receiving through the human annotator a query including a topic and a set of training documents not relevant to the topic (off-topic);
(b) assigning, through the computer environment, a training document relevance score to each one of the training documents, each training document relevance score representing a measure of relevance of its respective document to the topic;
(c) determining, through the computer environment, statistics relating to all training document relevance scores;
(d) receiving a testing document;
(e) calculating, through the computer environment, a score of relevance of the testing document to the topic to obtain a testing document relevance score; and
(f) normalizing, through the computer environment and based on the statistics, the testing document relevance score to obtain a normalized score wherein;
normalizing adjusts the testing document relevance score based on the statistics to be comparable to other scores from which the statistics were determined, and the normalized score is a better predictor of probability of the testing document being relevant than the testing document relevant score.
designating the testing document as relevant or not relevant to the topic based on the normalized score.
-
-
26. The method of claim 24, further comprising:
-
comparing the normalized score to a threshold score; and
designating the testing document as relevant or not relevant to the topic based on the comparison.
-
-
27. A method facilitated by a human annotator and performed by a processor for searching for documents relevant to a topic comprising the steps of:
-
the human annotator both sending a query including a topic to the processor and establishing a set of training documents not relevant to the topic (off-topic); and
the human annotator receiving results from the processor indicating a document relevant to the topic, wherein the processor;
assigns, through the computer environment, a training document relevance score to each one of the training documents, each training document relevance score representing a measure of relevance of its respective document to the topic;
determines, through the computer environment, statistics relating to all training document relevance scores;
receives a testing document;
calculates, through the computer environment, a score of relevance of the testing document to the topic to obtain a testing document relevance score;
normalizes, through the computer environment and based on the statistics, the testing document relevance score to obtain a normalized score wherein;
normalizing adjusts the testing document relevance score based on the statistics to be comparable to other scores from which the statistics were determined, and the normalized score is a better predictor of probability of the testing document being relevant than the testing document relevant score;
establishes, through the computer environment, a threshold score representing a relevance threshold for the topic;
compares the normalized score to the threshold score to obtain a comparison; and
designates the testing document as relevant or not relevant to the topic based on the comparison.
-
Specification