System and method for using an exemplar document to retrieve relevant documents from an inverted index of a large corpus

US 8,122,043 B2
Filed: 06/30/2009
Issued: 02/21/2012
Est. Priority Date: 06/30/2009
Status: Active Grant


Abstract

A system and method for using an exemplar document or search query to retrieve relevant documents from an inverted index of a large corpus of documents. The system and method groups words by synonym and calculates term frequency (TF) and inverse document frequency (IDF) scores for the respective word groups. A composite term frequency-inverse document frequency (TF-IDF) score is calculated for each word group and the documents of the corpus are ranked based on the TF-IDF scores, utilizing a vector space model incorporating a cosine similarity function.
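The abstract names a vector space model with a cosine similarity function as the ranking mechanism but does not spell it out. As a rough illustration only, the following Python sketch ranks documents by the cosine of the angle between a query vector and each document vector. The sparse dict representation and the function names `cosine_similarity` and `rank_documents` are assumptions made for this example and do not come from the patent text.

```python
import math


def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse vectors.

    Both vectors are dicts mapping a word-group label to a weight
    (for example a TF-IDF value). This layout is an assumption.
    """
    dot = sum(w * doc_vec.get(group, 0.0) for group, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        # An empty vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (q_norm * d_norm)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because cosine similarity normalizes by vector length, a long document is not favored merely for containing more words, which is one common reason for choosing it in vector space retrieval.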

20 Claims

1. A method for ranking the relevance of each of a plurality of documents in a corpus to a search query of words comprising the steps of:
- a) grouping words in the search query by synonym into one or more word groups, said grouping being performed by a processing unit;
  
  b) for each word group, counting the number of instances (the “FQ” value) that a word from the word group appears in the search query, said counting being performed by the processing unit;
  
  c) determining, by the processing unit, the maximum FQ value among all the word groups;
  
  d) calculating, by the processing unit, a scaling factor K;
  
  e) for each word group, calculating a term frequency (“TF”) value by dividing the FQ value for the word group by the maximum FQ value and applying scaling factor K to the resulting quotient, said calculating being performed by the processing unit;
  
  f) for each word group, counting the number of documents (“FC”) in the corpus that contain at least one word from the word group, said counting being performed by the processing unit;
  
  g) counting the number of documents (“N”) in the corpus, said counting being performed by the processing unit;
  
  h) for each word group, calculating an inverse document frequency (“IDF”) value by dividing N by FC, adding one to the resulting quotient, and taking the natural logarithm of the resulting sum, said calculating being performed by the processing unit;
  
  i) for each word group, calculating a TF-IDF value by multiplying said TF value by said IDF value, said calculating being performed by the processing unit; and
  
  j) ranking the relevance of each document in the corpus utilizing the TF-IDF values for the word groups in the search query, said ranking being performed by the processing unit.
- - 2. The method of claim 1 wherein said search query comprises a pre-existing document.
  - 3. The method of claim 1 wherein said search query comprises a string entered by a user in real time.
  - 4. The method of claim 1 wherein said scaling factor K is a monotonically decreasing function over the domain of positive integers whose range does not exceed 1 or fall below 0 where the domain represents the number of unique words (“C”) in the search query.
  - 5. The method of claim 1 wherein said scaling factor K is a strictly decreasing function over the domain of positive integers whose range does not exceed 1 or fall below 0 where the domain represents the number of unique words (“C”) in the search query.
  - 6. The method of claim 1 wherein step (d) further comprises:
    - a) counting, by the processing unit, the number of unique words (“C”) in the search query; and
      
      b) calculating scaling factor K by adding 2 to C, dividing the resulting sum by 3, taking the square root of the resulting quotient, and dividing 1 by the resulting square root, said calculating being performed by the processing unit.
  - 7. The method of claim 4 wherein step (e) further comprises:
    - a) for each word group, dividing the FQ value for the word group by the maximum FQ value to calculate an intermediate TF value, said dividing being performed by the processing unit; and
      
      b) calculating TF by subtracting scaling factor K from 1, multiplying the resulting difference by the intermediate TF value, and adding the resulting product to scaling factor K, said calculating being performed by the processing unit.
  - 8. The method of claim 6 wherein step (e) further comprises:
    - a) for each word group, dividing the FQ value for the word group by the maximum FQ value to calculate an intermediate TF value, said dividing being performed by the processing unit; and
      
      b) calculating TF by subtracting scaling factor K from 1, multiplying the resulting difference by the intermediate TF value, and adding the resulting product to scaling factor K, said calculating being performed by the processing unit.
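
Claims 1, 6 and 8 together describe a concrete computation: K = 1 / sqrt((C + 2) / 3), TF = K + (1 - K) * (FQ / max FQ), IDF = ln(N / FC + 1), and TF-IDF = TF * IDF. The following Python sketch walks through those steps for a single query. The function names, the `synonym_group` lookup table, and the `doc_freq` mapping are hypothetical and introduced only for illustration. Returning an empty result for an empty query and skipping word groups that appear in no document are assumptions of this sketch, not something the claims state. The ranking of step (j) is left out.

```python
import math
from collections import Counter


def scaling_factor(unique_words):
    """Claim 6: add 2 to C, divide by 3, take the square root, invert."""
    return 1.0 / math.sqrt((unique_words + 2) / 3.0)


def group_weights(query_words, synonym_group, doc_freq, corpus_size):
    """Return a TF-IDF value per word group for one search query.

    synonym_group: hypothetical dict mapping a word to its group label;
                   a word missing from the dict forms its own group.
    doc_freq:      FC per group label (documents containing any group word).
    corpus_size:   N, the number of documents in the corpus.
    """
    if not query_words:
        return {}  # assumption: an empty query has no word groups

    # Steps (a) and (b): group by synonym and count instances (FQ).
    fq = Counter(synonym_group.get(word, word) for word in query_words)
    max_fq = max(fq.values())                    # step (c)
    k = scaling_factor(len(set(query_words)))    # step (d), per claim 6

    weights = {}
    for group, count in fq.items():
        fc = doc_freq.get(group, 0)
        if fc == 0:
            continue  # assumption: a group found in no document is skipped
        tf = k + (1.0 - k) * (count / max_fq)    # step (e), per claim 8
        idf = math.log(corpus_size / fc + 1.0)   # step (h)
        weights[group] = tf * idf                # step (i)
    return weights
```

With this form of K, a query with a single unique word gives K = 1, so every TF equals 1, while a query with ten unique words gives K = 0.5, so TF values fall between 0.5 and 1 depending on how often each word group occurs in the query.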

9. A system for ranking the relevance of each of a plurality of documents in a corpus to a search query comprising:
- a) a processing unit capable of performing calculations;
  
  b) a storage device on which is stored a corpus of documents;
  
  c) an input device for receiving the search query;
  
  d) an output device for displaying the results of the ranking;
  
  wherein the processing unit groups words in the search query by synonym into one or more word groups;
  
  wherein the processing unit, for each word group, counts the number of instances (the “FQ” value) that a word from the word group appears in the search query;
  
  wherein the processing unit determines the maximum FQ value among all the word groups;
  
  wherein the processing unit calculates a scaling factor K;
  
  wherein the processing unit, for each word group, calculates a term frequency (“TF”) value by dividing the FQ value for the word group by the maximum FQ value and applying scaling factor K to the resulting quotient;
  
  wherein the processing unit, for each word group, counts the number of documents (“FC”) in the corpus that contain at least one word from the word group;
  
  wherein the processing unit counts the number of documents (“N”) in the corpus;
  
  wherein the processing unit, for each word group, calculates an inverse document frequency (“IDF”) value by dividing N by FC, adding one to the resulting quotient, and taking the natural logarithm of the resulting sum;
  
  wherein the processing unit, for each word group, calculates a TF-IDF value by multiplying said TF value by said IDF value; and
  
  wherein the processing unit ranks the relevance of each document in the corpus utilizing the TF-IDF values for the word groups in the search query.
- - 10. The system of claim 9 wherein said search query comprises a pre-existing document.
  - 11. The system of claim 9 wherein said search query comprises a string entered by a user in real time.
  - 12. The system of claim 9 wherein said scaling factor K is a monotonically decreasing function over the domain of positive integers whose range does not exceed 1 or fall below 0 where the domain represents the number of unique words (“C”) in the search query.
  - 13. The system of claim 9 wherein said scaling factor K is a strictly decreasing function over the domain of positive integers whose range does not exceed 1 or fall below 0 where the domain represents the number of unique words (“C”) in the search query.
  - 14. The system of claim 9 wherein the processing unit counts the number of unique words (“C”) in the search query and calculates scaling factor K by adding 2 to C, dividing the resulting sum by 3, taking the square root of the resulting quotient, and dividing 1 by the resulting square root.
  - 15. The system of claim 12 wherein the processing unit, for each word group, divides the FQ value for the word group by the maximum FQ value to calculate an intermediate TF value and calculates TF by subtracting scaling factor K from 1, multiplying the resulting difference by the intermediate TF value, and adding the resulting product to scaling factor K.
  - 16. The system of claim 14 wherein the processing unit, for each word group, divides the FQ value for the word group by the maximum FQ value to calculate an intermediate TF value and calculates TF by subtracting scaling factor K from 1, multiplying the resulting difference by the intermediate TF value, and adding the resulting product to scaling factor K.
  - 17. The system of claim 9 further comprising:
    - a) a document search server;
      
      b) a search appliance; and
      
      c) a database.
  - 18. The system of claim 17 wherein said document search server comprises:
    - a) a tokenizer;
      
      b) a synonym finder;
      
      c) a TF-IDF calculator;
      
      d) a query builder; and
      
      e) a search results formatter.
  - 19. The system of claim 17 wherein said search appliance comprises:
    - a) a synonym list;
      
      b) a term frequency calculator; and
      
      c) a search engine.
  - 20. The system of claim 17 wherein said database contains an index of said corpus.
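
Claim 9 has the processing unit count FC (documents containing at least one word from a word group) and N (documents in the corpus), and claim 20 places an index of the corpus in the database. One plausible way to obtain both counts from an inverted index is sketched below in Python. The index layout (a dict from word to a set of document ids), the `word_groups` mapping, and the function name `document_counts` are assumptions for this example. Deriving N from the postings additionally assumes every document contains at least one indexed word; a real system would more likely store N directly.

```python
def document_counts(inverted_index, word_groups):
    """Return (N, FC per group) from a toy inverted index.

    inverted_index: dict mapping a word to the set of ids of documents
                    that contain it (assumed layout).
    word_groups:    dict mapping a group label to the words in that
                    synonym group (assumed layout).
    """
    # N: every document id that appears in any posting set.
    all_docs = set()
    for postings in inverted_index.values():
        all_docs.update(postings)

    # FC: size of the union of the posting sets of the group's words,
    # so a document containing two synonyms is counted only once.
    fc = {}
    for label, words in word_groups.items():
        docs = set()
        for word in words:
            docs.update(inverted_index.get(word, ()))
        fc[label] = len(docs)

    return len(all_docs), fc
```

Taking the union rather than summing per-word counts matters here: a document that contains both "car" and "auto" must contribute one to FC for that group, not two.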

Current Assignee: EBSCO Industries Incorporated
Original Assignee: EBSCO Industries Incorporated
Inventors: Buckley, Brad; Motov, Igor
Primary Examiner(s): ORTIZ DITREN, BELIX M
Application Number: US12/494,452
Publication Number: US 20100332503A1
Time in Patent Office: 966 Days
Field of Search: None
US Class Current: 707/759
CPC Class Codes: G06F 16/31 Indexing; Data structures t...
