System and method for accelerated query evaluation of very large full-text databases
First Claim
1. In an information retrieval apparatus including a database of documents, each document having a plurality of terms and a unique document identifier, the information retrieval apparatus further including a programmed processor adapted to receive a query containing at least one term and to compute in response to the query a document score for each of a selected plurality of documents, the document score being a function of the terms of the query, a computer memory readable by the processor and comprising:
- a first ordered plurality of unique terms, each unique term associated in the memory with;
a plurality of (document, term contribution) tuples, the term contribution computed by the processor prior to the receipt of some query containing the unique term and being a scalar measure of the contribution of the unique term to a document score computable by the processor for the document after receipt of a query containing the unique term, the tuples selected for those documents having the highest term contributions for the unique term from all documents in the database, the tuples ordered by the term contribution, such that the processor serially accesses a first subset of the tuples to compute a document score for each document in the first subset of tuples associated with a received term of a query.
7 Assignments
0 Petitions
Accused Products
Abstract
A system, method, and various software products provide for improved information retrieval in very large document databases through the use of a predetermined static cache. The static cache includes for terms that appear in a large number of documents, a plurality of documents ordered by a contribution that the term makes to the document score of the document. The contribution is a scalar measure of the influence of the term in the computed document score. The contribution reflects both the within document frequency and the between document frequency of the term. In addition, the static cache includes for each term a lookup table that references selected entries for the term in an inverted index. Queries to the database are then processed by first traversing the static cache and obtaining the contribution information thereform and computing the document score from this information. Additional term frequency information for other terms in the query is obtained by looking up the document in the lookup tables of the other query terms, and obtaining the term frequency information for such terms from the inverted index, or by searching the contribution caches of the query terms.
478 Citations
31 Claims
-
1. In an information retrieval apparatus including a database of documents, each document having a plurality of terms and a unique document identifier, the information retrieval apparatus further including a programmed processor adapted to receive a query containing at least one term and to compute in response to the query a document score for each of a selected plurality of documents, the document score being a function of the terms of the query, a computer memory readable by the processor and comprising:
a first ordered plurality of unique terms, each unique term associated in the memory with; a plurality of (document, term contribution) tuples, the term contribution computed by the processor prior to the receipt of some query containing the unique term and being a scalar measure of the contribution of the unique term to a document score computable by the processor for the document after receipt of a query containing the unique term, the tuples selected for those documents having the highest term contributions for the unique term from all documents in the database, the tuples ordered by the term contribution, such that the processor serially accesses a first subset of the tuples to compute a document score for each document in the first subset of tuples associated with a received term of a query. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
8. A computer implemented method of processing a query containing a single term, the method comprising:
storing a first ordered plurality of unique terms, each unique term associated with; a plurality of (document, term contribution) tuples, the term contribution computed prior to the receipt of the query containing the unique term and being a scalar measure of the contribution of the unique term to a document score computable for the document after receipt of the query containing the unique term, the tuples selected for those documents having the highest term contributions for the unique term from all documents in the database, the tuples ordered by the term contribution; matching the single term to one of the plurality of unique terms; in the tuples associated with the matched unique term, determining for a number of the tuples, a document score for the document in the tuple from the term contribution in the tuple; and
,returning as the results of the queries, the documents from the number of tuples that have been scored.
-
9. A computer implemented method of processing a query containing a plurality of terms to identify documents in a database in response to the query, the method comprising:
-
prior to the receipt of the query, determining for each of a plurality of terms a contribution of the term to a document score of each of a plurality of documents, the contribution of a term based on a frequency of the term in the document and a frequency of the term in the database; receiving the query; and
,for each term of the query, scoring a plurality of documents from the determined contribution of at least one term, and from frequency information for any remaining terms of the query. - View Dependent Claims (10, 11, 12, 13, 14)
-
-
15. A computer implemented method of preprocessing a database of documents for subsequent query processing, each document having a plurality of terms, each term contained in a number of documents, comprising:
-
selecting a plurality of terms T in the database for which the number of documents containing term T exceeds a threshold k, such that there are at least k documents containing term T; for each term T; determining for each document containing term T a contribution of term T to a document score of the document; ranking the documents containing term T by the contribution of term T to the document score; selecting a plurality of highest ranked documents; and
,for each of selected documents, storing in association with the term, indicia of the document and the contribution of the term T to the document. - View Dependent Claims (16, 17, 18, 19)
-
-
20. A computer memory readable by a processor in a database management system including a database of documents, for controlling the system to preprocess the documents, the memory including computer executable instructions for causing the system to perform the steps of:
-
selecting unique terms in the database for which a plurality of documents containing the term exceeds a threshold k, such that there are at least k documents containing the unique term; for each selected term; determining for each document containing the term a contribution of the term to a document score of the document; ranking the documents containing the term by the contribution of the term to the document score; selecting a plurality of highest ranked documents; and
,for each selected document, storing in association with the term indicia of the document and the contribution of the term to the document. - View Dependent Claims (21, 22, 23)
-
-
24. A computer memory readable by a processor in a database management system including a database of documents, for controlling the system to process a query for selected documents, the query including a plurality of terms, the memory including computer executable instructions for causing the system to perform the steps of:
-
prior to the receipt of the query, determining for each of a plurality of terms a contribution of the term to a document score of each of a plurality of documents, the contribution of a term based on a frequency of the term in the document and a frequency of the term in the database; receiving the query; and
,for each term of the query, scoring a plurality of documents from the determined contribution of at least one term, and from frequency information for any remaining terms of the query. - View Dependent Claims (25, 26, 27, 28, 29)
-
-
30. A database management system, comprising:
a first ordered plurality of unique terms stored in a computer readable memory, each unique term associated in the memory with; a plurality of (document, term contribution) tuples, the term contribution computed prior to the receipt of some query containing the unique term and being a scalar measure of the contribution of the unique term to a document score computable by a processor for the document after receipt of a query containing the unique term, the tuples selected as those documents having the highest term contributions for the unique term from all documents in the database, the tuples ordered by the term contribution; a first method executable by a processor that receives a query containing a plurality of terms and for each term of the query, serially accesses a first subset of the tuples associated with the term, and computes for each document in a tuple in the first subset of the tuples, a document score for the document based on the contribution in the tuple. - View Dependent Claims (31)
Specification