Method and apparatus for score normalization for information retrieval applications

US 7,062,485 B1
Filed: 09/18/2003
Issued: 06/13/2006
Est. Priority Date: 09/01/2000
Status: Expired due to Term

Abstract

A method and apparatus for normalizing a score associated with a document is presented. Statistics relating to scores assigned to a set of training documents not relevant to a topic are determined. Scores represent a measure of relevance to the topic. After the various statistics have been collected, a score assigned to a testing document is normalized based on those statistics. The normalized score is then compared to a threshold score. Subsequently, the testing document is designated as relevant or not relevant to the topic based on the comparison.
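The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the sample scores, the threshold value of 3.0, and the choice of the sample standard deviation are all assumptions made for the example.

```python
from statistics import mean, stdev


def fit_off_topic_stats(training_scores):
    """Return (mean, standard deviation) of the raw relevance scores
    assigned to training documents believed to be off-topic."""
    # ASSUMPTION: sample standard deviation; the text only says "statistics".
    return mean(training_scores), stdev(training_scores)


def normalize(score, mu_off, sigma_off):
    """Express a raw score in units of off-topic standard deviations."""
    return (score - mu_off) / sigma_off


def is_relevant(score, mu_off, sigma_off, threshold):
    """Designate a testing document as relevant when its normalized
    score exceeds the threshold."""
    return normalize(score, mu_off, sigma_off) > threshold


# Hypothetical raw relevance scores of off-topic training documents.
off_topic_scores = [0.10, 0.12, 0.08, 0.11, 0.09]
mu_off, sigma_off = fit_off_topic_stats(off_topic_scores)

# Hypothetical testing documents and an assumed threshold.
print(is_relevant(0.30, mu_off, sigma_off, threshold=3.0))
print(is_relevant(0.11, mu_off, sigma_off, threshold=3.0))
```

A raw score far above the off-topic mean clears the threshold, while one within the usual off-topic spread does not.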

29 Claims

1. A method facilitated by a human annotator and performed in a computer environment for normalizing a score associated with a document, the method comprising the steps of:
- (a) establishing (1) through the computer environment a set of training documents most of which are believed not to be relevant to a topic (off-topic) and (2) through the human annotator a query relevant to the topic (on-topic);
  
  (b) assigning, through the computer environment, a training document relevance score to each one of the training documents, each training document relevance score representing a measure of relevance of its respective document to the topic;
  
  (c) determining, through the computer environment, statistics relating to all training document relevance scores and thereby obtaining determined statistics;
  
  (d) receiving a testing document;
  
  (e) calculating, through the computer environment, a score of relevance of the testing document to the topic to obtain a testing document relevance score;
  
  (f) normalizing, through the computer environment and based on the statistics, the testing document relevance score to obtain a normalized score wherein;
  
  normalizing adjusts the testing document relevance score based on the statistics to be comparable to other scores from which the statistics were determined, and the normalized score is a better predictor of probability of the testing document being relevant than the testing document relevant score;
  
  (g) establishing, through the computer environment, a threshold score representing a relevance threshold for the topic;
  
  (h) comparing the normalized score to the threshold score to obtain a comparison; and
  
  (i) designating the testing document as relevant or not relevant to the topic based on the comparison.
- - 2. The method of claim 1, wherein the statistics include a mean score of the training documents not relevant to the topic and a standard deviation of the scores assigned to the set of training documents not relevant to the topic.
  - 3. The method of claim 2, wherein said normalizing step determines the normalized score according to the following formula:
    - $$\text{normalized\_score} = \frac{s - \mu_{\text{off\_topic}}}{\sigma_{\text{off\_topic}}}$$
      
      wherein $s$ represents the score assigned to the testing document, $\mu_{\text{off\_topic}}$ represents the mean score of the documents not relevant to the topic, and $\sigma_{\text{off\_topic}}$ represents the standard deviation of the scores assigned to the set of training documents not relevant to the topic.
  - 4. The method of claim 1, said determining step further comprising:
    - determining statistics relating to scores assigned to a set of training documents relevant to the topic.
  - 5. The method of claim 4, wherein the statistics relating to scores assigned to a set of training documents relevant to the topic include a mean score of the documents relevant to the topic and a standard deviation of the scores assigned to the set of training documents relevant to the topic.
  - 6. The method of claim 5, wherein said normalizing step comprises:
    - normalizing a score assigned to a testing document based on the statistics relating to the scores assigned to the set of training documents not relevant to the topic and based on the statistics relating to the scores assigned to the set of training documents relevant to the topic.
  - 7. The method of claim 6, wherein said normalizing step determines the normalized score according to the following formula:
    - $$\text{normalized\_score} = f_{\text{on\_topic}} \cdot \frac{s - \mu_{\text{off\_topic}}}{\sigma_{\text{off\_topic}}}$$
      
      wherein $f_{\text{on\_topic}}$ represents a scale factor based on the statistics relating to the scores assigned to the set of training documents relevant to the topic, $s$ represents the score assigned to the testing document, $\mu_{\text{off\_topic}}$ represents the mean score of the documents not relevant to the topic, and $\sigma_{\text{off\_topic}}$ represents the standard deviation of the scores assigned to the set of training documents not relevant to the topic.
  - 8. The method of claim 1, wherein said designating step comprises:
    - designating the testing document as relevant to the topic based on a determination that the normalized score is greater than the threshold score; and
      
      designating the testing document as not relevant to the topic based on a determination that the normalized score is not greater than the threshold score.
  - 9. The method of claim 1, further comprising:
    - repeating steps (a)–(d) for a plurality of topics.
  - 10. The method of claim 1, further comprising:
    - repeating steps (a)–(d) for a plurality of testing documents.
  - 11. The method of claim 1, wherein the statistics include a first robust estimate of a mean score of the set of training documents not relevant to the topic and a second robust estimate of a standard deviation of the scores assigned to the set of training documents not relevant to the topic.
  - 12. The method of claim 11 wherein the first robust estimate comprises:
    - setting a first robust estimate threshold, based on the determined statistics;
      
      removing each document of the set of training documents not relevant to the topic which is above the first robust estimate threshold, thereby creating a remaining set of training documents not relevant to the topic; and
      
      determining new estimates of the statistics that may be more appropriate for off-topic documents based solely on the remaining set of training documents.
  - 13. The method of claim 12 wherein the second robust estimate comprises:
    - setting a second robust estimate threshold, based on the determined statistics;
      
      removing each document of the set of training documents not relevant to the topic which is above the second robust estimate threshold, thereby creating a remaining set of training documents not relevant to the topic; and
      
      determining new estimates of the statistics that may be more appropriate for off-topic documents based solely on the remaining set of training documents.
  - 14. The method of claim 11 wherein the second robust estimate comprises:
    - setting a second robust estimate threshold, based on the determined statistics;
      
      removing each document of the set of training documents not relevant to the topic which is above the second robust estimate threshold, thereby creating a remaining set of training documents not relevant to the topic; and
      
      determining new estimates of the statistics that may be more appropriate for off-topic documents based solely on the remaining set of training documents.
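
The robust estimates of claims 11 through 14 (set a threshold from the determined statistics, remove training documents scoring above it, re-estimate on the remainder) might be sketched as below. The cutoff rule `mean + k * std`, the value `k = 2.0`, the use of the population standard deviation, and the sample scores are assumptions for illustration; the claims do not specify how the threshold is derived.

```python
from statistics import mean, pstdev


def robust_off_topic_stats(scores, k=2.0):
    """Re-estimate off-topic mean and standard deviation after dropping
    training documents that score above a cutoff.

    ASSUMPTION: the cutoff is mean + k * std of the initial estimates;
    the claims only say the threshold is based on the determined statistics.
    """
    mu, sigma = mean(scores), pstdev(scores)
    cutoff = mu + k * sigma
    # Keep only documents at or below the cutoff; high scorers are
    # suspected to be on-topic documents that leaked into the training set.
    remaining = [s for s in scores if s <= cutoff]
    return mean(remaining), pstdev(remaining), remaining


# Hypothetical training scores: mostly off-topic, with one suspiciously
# high score (0.90) that is probably an on-topic document.
training_scores = [0.10, 0.12, 0.08, 0.11, 0.09, 0.90]

naive_mu, naive_sigma = mean(training_scores), pstdev(training_scores)
robust_mu, robust_sigma, kept = robust_off_topic_stats(training_scores)

print(naive_mu, naive_sigma)
print(robust_mu, robust_sigma, kept)
```

Because the training set is only believed to be mostly off-topic, a single high-scoring document can inflate both estimates; trimming it brings the statistics back toward the off-topic population.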

15. A computer-readable storage medium containing instructions for performing a method in a computer environment for normalizing a score associated with a document, the method facilitated by a human annotator comprising:
- (a) establishing (1) through the computer environment a set of training documents most of which are believed not to be relevant to a topic (off-topic) and (2) through the human annotator a query relevant to the topic (on-topic);
  
  (b) assigning, through the computer environment, a training document relevance score to each one of the training documents, each training document relevance score representing a measure of relevance of its respective document to the topic;
  
  (c) determining, through the computer environment, statistics relating to all training document relevance scores;
  
  (d) receiving a testing document;
  
  (e) calculating, through the computer environment, a score of relevance of the testing document to the topic to obtain a testing document relevance score;
  
  (f) normalizing, through the computer environment and based on the statistics, the testing document relevance score to obtain a normalized score wherein;
  
  normalizing adjusts the testing document relevance score based on the statistics to be comparable to other scores from which the statistics were determined, and the normalized score is a better predictor of probability of the testing document being relevant than the testing document relevant score;
  
  (g) establishing, through the computer environment, a threshold score representing a relevance threshold for the topic;
  
  (h) comparing the normalized score to the threshold score to obtain a comparison; and
  
  (i) designating the testing document as relevant or not relevant to the topic based on the comparison.
- - 16. The computer-readable storage medium of claim 15, wherein the statistics include a mean score of the training documents not relevant to the topic and a standard deviation of the scores assigned to the set of training documents not relevant to the topic.
  - 17. The computer-readable storage medium of claim 16, wherein said normalizing step determines the normalized score according to the following formula:
    - $$\text{normalized\_score} = \frac{s - \mu_{\text{off\_topic}}}{\sigma_{\text{off\_topic}}}$$
      
      wherein $s$ represents the score assigned to the testing document, $\mu_{\text{off\_topic}}$ represents the mean score of the documents not relevant to the topic, and $\sigma_{\text{off\_topic}}$ represents the standard deviation of the scores assigned to the set of training documents not relevant to the topic.
  - 18. The computer-readable storage medium of claim 15, said determining step further comprising:
    - determining statistics relating to scores assigned to a set of training documents relevant to the topic.
  - 19. The computer-readable storage medium of claim 18, wherein the statistics relating to scores assigned to a set of training documents relevant to the topic include a mean score of the documents relevant to the topic and a standard deviation of the scores assigned to the set of training documents relevant to the topic.
  - 20. The computer-readable storage medium of claim 19, wherein said normalizing step comprises:
    - normalizing a score assigned to a testing document based on the statistics relating to the scores assigned to the set of training documents not relevant to the topic and based on the statistics relating to the scores assigned to the set of training documents relevant to the topic.
  - 21. The computer-readable storage medium of claim 20, wherein said normalizing step determines the normalized score according the following formula:
    - $$\text{normalized\_score} = f_{\text{on\_topic}} \cdot \frac{s - \mu_{\text{off\_topic}}}{\sigma_{\text{off\_topic}}}$$
      
      wherein $f_{\text{on\_topic}}$ represents a scale factor based on the statistics relating to the scores assigned to the set of training documents relevant to the topic, $s$ represents the score assigned to the testing document, $\mu_{\text{off\_topic}}$ represents the mean score of the documents not relevant to the topic, and $\sigma_{\text{off\_topic}}$ represents the standard deviation of the scores assigned to the set of training documents not relevant to the topic.
  - 22. The computer-readable storage medium of claim 15, wherein said designating step comprises:
    - designating the testing document as relevant to the topic based on a determination that the normalized score is greater than the threshold score; and
      
      designating the testing document as not relevant to the topic based on a determination that the normalized score is not greater than the threshold score.
  - 23. The computer-readable storage medium of claim 15, further comprising:
    - repeating steps (a)–(d) for a plurality of topics.
  - 24. The computer-readable storage medium of claim 15, further comprising:
    - repeating steps (a)–(d) for a plurality of testing documents.
  - 25. The computer-readable storage medium of claim 15, wherein the statistics include a robust estimate of a mean score of the training documents not relevant to the topic and a robust estimate of a standard deviation of the scores assigned to the set of training documents not relevant to the topic.
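
The scaled formula recited in claims 7 and 21 can be illustrated with the sketch below. The claims state only that the scale factor is based on the on-topic statistics, so the specific definition used here (the reciprocal of the normalized on-topic mean, which maps the average on-topic document to 1.0) is purely a hypothetical choice, as are the function names and sample scores.

```python
from statistics import mean, pstdev


def on_topic_scale_factor(on_topic_scores, mu_off, sigma_off):
    """Hypothetical scale factor derived from on-topic statistics.

    ASSUMPTION: defined so that the mean on-topic score maps to 1.0
    after scaling. Requires the on-topic mean to exceed the off-topic mean.
    """
    z_on_mean = (mean(on_topic_scores) - mu_off) / sigma_off
    return 1.0 / z_on_mean


def scaled_normalized_score(s, mu_off, sigma_off, f_on_topic):
    """normalized_score = f_on_topic * ((s - mu_off_topic) / sigma_off_topic)"""
    return f_on_topic * ((s - mu_off) / sigma_off)


# Hypothetical training scores for one topic.
off_topic_scores = [0.10, 0.12, 0.08, 0.11, 0.09]
on_topic_scores = [0.40, 0.50, 0.60]

mu_off, sigma_off = mean(off_topic_scores), pstdev(off_topic_scores)
f_on = on_topic_scale_factor(on_topic_scores, mu_off, sigma_off)

# Under this assumed definition, a typical off-topic score lands near 0
# and a typical on-topic score lands near 1.
print(scaled_normalized_score(0.10, mu_off, sigma_off, f_on))
print(scaled_normalized_score(0.50, mu_off, sigma_off, f_on))
```

Under this assumed definition the scaled score has the same anchor points for every topic, which is one way a scale factor could make a single threshold meaningful across topics.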

26. A method, facilitated by a human annotator and performed in a computer environment for normalizing a score associated with a document, the method comprising the steps of:
- (a) receiving (1) through the computer environment a set of training documents not relevant to a topic (off-topic) and (2) through the human annotator a query including the topic (on-topic);
  
  (b) assigning, through the computer environment, a training document relevance score to each one of the training documents, each training document relevance score representing a measure of relevance of its respective document to the topic;
  
  (c) determining, through the computer environment, statistics relating to all training document relevance scores;
  
  (d) receiving a testing document;
  
  (e) calculating, through the computer environment, a score of relevance of the testing document to the topic to obtain a testing document relevance score; and
  
  (f) normalizing, through the computer environment and based on the statistics, the testing document relevance score to obtain a normalized score wherein;
  
  normalizing adjusts the testing document relevance score based on the statistics to be comparable to other scores from which the statistics were determined, and the normalized score is a better predictor of probability of the testing document being relevant than the testing document relevant score.
- - 27. The method of claim 26, further comprising:
    - designating the testing document as relevant or not relevant to the topic based on the normalized score.
  - 28. The method of claim 26, further comprising:
    - comparing the normalized score to a threshold score; and
      
      designating the testing document as relevant or not relevant to the topic based on the comparison.
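
To illustrate why normalized scores are described as comparable, the sketch below applies one threshold to two topics whose raw scores live on very different scales. The topic data, the shared threshold of 3.0, and the helper name `make_normalizer` are assumptions for the example only.

```python
from statistics import mean, pstdev


def make_normalizer(off_topic_scores):
    """Build a per-topic function mapping a raw score to
    (s - mu_off_topic) / sigma_off_topic."""
    mu, sigma = mean(off_topic_scores), pstdev(off_topic_scores)
    return lambda s: (s - mu) / sigma


# Hypothetical off-topic training scores for two topics. The raw scorer
# for topic B happens to produce much larger numbers than for topic A.
topic_a_off = [0.10, 0.12, 0.08, 0.11, 0.09]
topic_b_off = [40.0, 55.0, 45.0, 60.0, 50.0]

norm_a = make_normalizer(topic_a_off)
norm_b = make_normalizer(topic_b_off)

THRESHOLD = 3.0  # ASSUMPTION: one threshold shared by both topics

raw_a, raw_b = 0.20, 58.0  # hypothetical testing-document raw scores
label_a = norm_a(raw_a) > THRESHOLD
label_b = norm_b(raw_b) > THRESHOLD

print(raw_a, norm_a(raw_a), label_a)
print(raw_b, norm_b(raw_b), label_b)
```

The raw score 58.0 is numerically far larger than 0.20, yet only the topic A document stands well clear of its own off-topic distribution, so only it is designated relevant.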

29. A method facilitated by a human annotator and performed by a processor in a computer environment for searching for documents relevant to a topic comprising the steps of:
- establishing through the computer environment a set of training documents not relevant to the topic (off-topic);
  
  the human annotator sending a query including the topic (on-topic) to the processor; and
  
  the human annotator receiving results from the processor indicating a document relevant to the topic, wherein the processor;
  
  assigns, through the computer environment, a training document relevance score to each one of the training documents, each training document relevance score representing a measure of relevance of its respective document to the topic;
  
  determines, through the computer environment, statistics relating to all training document relevance scores;
  
  receives a testing document;
  
  calculates, through the computer environment, a score of relevance of the testing document to the topic to obtain a testing document relevance score;
  
  normalizes, through the computer environment and based on the statistics, the testing document relevance score to obtain a normalized score wherein;
  
  normalizing adjusts the testing document relevance score based on the statistics to be comparable to other scores from which the statistics were determined, and the normalized score is a better predictor of probability of the testing document being relevant than the testing document relevant score;
  
  establishes, through the computer environment, a threshold score representing a relevance threshold for the topic;
  
  compares the normalized score to the threshold score to obtain a comparison; and
  
  designates the testing document as relevant or not relevant to the topic based on the comparison.

Current Assignee: Cxense ASA
Original Assignee: BBN Technologies Corporation (Rtx Corporation)
Inventors: Schwartz, Richard; Walls, Frederick G.; Sista, Sreenivasa P.; Jin, Huaichuan Hubert
Primary Examiner(s): Mofiz, Apu M
Application Number: US10/665,056
Time in Patent Office: 999 Days
Field of Search: 707/3, 707/5, 707/6, 707/7, 707/10, 707/102, 707/104.1, 600/301, 704/224, 704/272, 715/531
US Class Current: 1/1
CPC Class Codes:
G06F 16/33   Querying
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
