System and method for using an exemplar document to retrieve relevant documents from an inverted index of a large corpus
First Claim
1. A method for ranking the relevance of each of a plurality of documents in a corpus to a search query of words comprising the steps of:
a) grouping words in the search query by synonym into one or more word groups, said grouping being performed by a processing unit;
b) for each word group, counting the number of instances (the “FQ” value) that a word from the word group appears in the search query, said counting being performed by the processing unit;
c) determining, by the processing unit, the maximum FQ value among all the word groups;
d) calculating, by the processing unit, a scaling factor K;
e) for each word group, calculating a term frequency (“TF”) value by dividing the FQ value for the word group by the maximum FQ value and applying scaling factor K to the resulting quotient, said calculating being performed by the processing unit;
f) for each word group, counting the number of documents (“FC”) in the corpus that contain at least one word from the word group, said counting being performed by the processing unit;
g) counting the number of documents (“N”) in the corpus, said counting being performed by the processing unit;
h) for each word group, calculating an inverse document frequency (“IDF”) value by dividing N by FC, adding one to the resulting quotient, and taking the natural logarithm of the resulting sum, said calculating being performed by the processing unit;
i) for each word group, calculating a TF-IDF value by multiplying said TF value by said IDF value, said calculating being performed by the processing unit; and
j) ranking the relevance of each document in the corpus utilizing the TF-IDF values for the word groups in the search query, said ranking being performed by the processing unit.
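Steps a) through i) of the claim can be sketched in Python as follows. This is an illustrative sketch only, not the patented implementation. The synonym table `SYNONYMS`, the helper names, and the sample data are hypothetical. The claim does not say how K is computed or how it is "applied", so the sketch assumes K is supplied by the caller and multiplied into the quotient. Skipping word groups with FC = 0 is likewise an assumption made to avoid division by zero.

```python
import math
from collections import Counter

# Hypothetical synonym table: each word maps to a word-group label.
SYNONYMS = {
    "car": "car", "automobile": "car", "auto": "car",
    "fast": "fast", "quick": "fast",
}


def group_of(word):
    """Step a): map a word to its synonym group (itself if unlisted)."""
    return SYNONYMS.get(word, word)


def tf_idf_scores(query_words, docs, k=1.0):
    """Return {word group: TF-IDF} for the query against the corpus.

    query_words: list of words in the search query.
    docs: list of documents, each a list of words.
    k: scaling factor K (assumed here to be a plain multiplier).
    """
    if not query_words:
        return {}

    # Step b): FQ = occurrences of each word group in the query.
    fq = Counter(group_of(w) for w in query_words)
    # Step c): maximum FQ over all word groups.
    max_fq = max(fq.values())

    # Step g): N = number of documents in the corpus.
    n = len(docs)
    doc_groups = [{group_of(w) for w in doc} for doc in docs]

    scores = {}
    for group, count in fq.items():
        # Step e): TF = K * FQ / max FQ (multiplication is an assumption).
        tf = k * count / max_fq
        # Step f): FC = documents containing at least one word of the group.
        fc = sum(1 for groups in doc_groups if group in groups)
        if fc == 0:
            continue  # assumption: groups absent from the corpus are skipped
        # Step h): IDF = ln(N / FC + 1).
        idf = math.log(n / fc + 1)
        # Step i): TF-IDF = TF * IDF.
        scores[group] = tf * idf
    return scores


if __name__ == "__main__":
    corpus = [["auto", "red"], ["quick", "car"], ["blue"], ["green"]]
    print(tf_idf_scores(["car", "automobile", "fast"], corpus))
```

In the sample data, "car" and "automobile" fall into one word group, so that group has FQ = 2 while the "fast" group has FQ = 1.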
Abstract
A system and method for using an exemplar document or search query to retrieve relevant documents from an inverted index of a large corpus of documents. The system and method group words by synonym and calculate term frequency (TF) and inverse document frequency (IDF) scores for the respective word groups. A composite term frequency-inverse document frequency (TF-IDF) score is calculated for each word group, and the documents of the corpus are ranked based on the TF-IDF scores using a vector space model incorporating a cosine similarity function.
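The cosine similarity ranking mentioned in the abstract can be illustrated with a short Python sketch. It assumes, hypothetically, that the query and each document are already represented as sparse vectors mapping word groups to TF-IDF weights. The function names and sample vectors are illustrative and do not come from the patent.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (dicts of weights)."""
    dot = sum(weight * v.get(group, 0.0) for group, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    query = {"car": 1.0, "fast": 0.5}
    documents = {
        "d1": {"car": 1.0, "fast": 0.5},
        "d2": {"car": 1.0},
        "d3": {"blue": 1.0},
    }
    for doc_id, score in rank_documents(query, documents):
        print(doc_id, round(score, 3))
```

Because cosine similarity normalizes by vector length, a long document is not favored merely for containing more words than a short one.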
20 Claims
9. A system for ranking the relevance of each of a plurality of documents in a corpus to a search query comprising:
a) a processing unit capable of performing calculations;
b) a storage device on which is stored a corpus of documents;
c) an input device for receiving the search query;
d) an output device for displaying the results of the ranking;
wherein the processing unit groups words in the search query by synonym into one or more word groups;
wherein the processing unit, for each word group, counts the number of instances (the “FQ” value) that a word from the word group appears in the search query;
wherein the processing unit determines the maximum FQ value among all the word groups;
wherein the processing unit calculates a scaling factor K;
wherein the processing unit, for each word group, calculates a term frequency (“TF”) value by dividing the FQ value for the word group by the maximum FQ value and applying scaling factor K to the resulting quotient;
wherein the processing unit, for each word group, counts the number of documents (“FC”) in the corpus that contain at least one word from the word group;
wherein the processing unit counts the number of documents (“N”) in the corpus;
wherein the processing unit, for each word group, calculates an inverse document frequency (“IDF”) value by dividing N by FC, adding one to the resulting quotient, and taking the natural logarithm of the resulting sum;
wherein the processing unit, for each word group, calculates a TF-IDF value by multiplying said TF value by said IDF value; and
wherein the processing unit ranks the relevance of each document in the corpus utilizing the TF-IDF values for the word groups in the search query.
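One way a system could obtain the FC and N counts from a stored corpus is through an inverted index, as the title suggests. The Python sketch below is a minimal illustration under assumptions: the index layout (word group mapped to a set of document IDs), the function names, and the sample synonym table are all hypothetical and are not taken from the specification.

```python
from collections import defaultdict


def build_inverted_index(docs, group_of=lambda word: word):
    """Map each word group to the set of document IDs that contain it.

    docs: dict of {doc_id: list of words}.
    group_of: function mapping a word to its synonym group (hypothetical).
    """
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[group_of(word)].add(doc_id)
    return index


def document_frequency(index, group):
    """FC: number of documents with at least one word from the group."""
    return len(index.get(group, ()))


if __name__ == "__main__":
    synonyms = {"auto": "car", "automobile": "car", "quick": "fast"}
    corpus = {
        1: ["auto", "red"],
        2: ["quick", "car"],
        3: ["blue"],
    }
    index = build_inverted_index(corpus, lambda w: synonyms.get(w, w))
    n = len(corpus)  # N: number of documents in the corpus
    print(n, document_frequency(index, "car"), document_frequency(index, "fast"))
```

With the index built once, each FC lookup is a single posting-set size check rather than a scan of every document.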
Specification