Lossy index compression
First Claim
Patent Images
1. Apparatus for indexing a corpus of text documents, comprising:
- an index processor configured to generate an inverted index of entries for terms appearing in a plurality of documents,each entry including a term present in the plurality of documents and a plurality of postings,each posting including an identifier that represents a document of the plurality of documents and a ranking that represents a rank of the term in the document,the index processor being further configured to update the ranking for each posting and to compress the inverted index by deleting a posting having a ranking set below a given level to generate a compressed inverted index.
1 Assignment
0 Petitions
Accused Products
Abstract
An apparatus and method is provided for pruning an index of a corpus of text documents by creating an inverted index of terms appearing in the documents, wherein the index includes postings of the terms in the documents, ranking the postings in the index, and pruning from the index the postings below a given level in the ranking.
-
Citations
15 Claims
-
1. Apparatus for indexing a corpus of text documents, comprising:
-
an index processor configured to generate an inverted index of entries for terms appearing in a plurality of documents, each entry including a term present in the plurality of documents and a plurality of postings, each posting including an identifier that represents a document of the plurality of documents and a ranking that represents a rank of the term in the document, the index processor being further configured to update the ranking for each posting and to compress the inverted index by deleting a posting having a ranking set below a given level to generate a compressed inverted index.
-
-
2. Apparatus according to claim 1, wherein the processor is arranged to determine a separate ranking for each of at least some of the terms separately, and to prune the separate ranking for each of at least some of the terms.
-
3. Apparatus according to claim 2, further comprising a user interface for receiving at least one parameter, wherein the processor is arranged to set the given level based on the parameter and the separate ranking.
-
4. Apparatus according to claim 3, wherein the at least one parameter comprises a number k of documents to retrieve from the index, and a number r of the terms permitted in a query, and wherein the processor is arranged to set the level based on the score of one of the documents ranked k from the top in the ranking.
-
5. Apparatus according to claim 4, wherein the processor is arranged to set the given level by dividing the score of the one of the documents by r.
-
6. Apparatus according to claim 3, wherein the at least one parameter comprises a number δ
- , of the portion of the documents to retrieve from the ranking, and a number r, of the terms permitted in a query, and wherein the processor is arranged to set the given level based on the score of the first document out of the documents in the ranking, δ
, and r.
- , of the portion of the documents to retrieve from the ranking, and a number r, of the terms permitted in a query, and wherein the processor is arranged to set the given level based on the score of the first document out of the documents in the ranking, δ
-
7. Apparatus according to claim 6, wherein the processor is arranged to multiply the score of the first document by δ
- and divide by r.
-
8. Apparatus according to claim 1, wherein the processor is arranged to select the postings to prune based on information on a statistical distribution of queries in a search space with respect to the document postings.
-
9. Apparatus according to claim 8, further comprising a user interface for receiving at least one parameter, wherein the processor is arranged to set the given level based on the parameter and the rankings of the postings.
-
10. Apparatus according to claim 9, wherein the at least one parameter comprises a number M of the scores to remain in the compressed inverted index.
-
11. Apparatus according to claim 10, wherein the processor is arranged to determine a probability of at least some of the terms and to multiply the posting scores for each of the at least some of the terms by the probability of the term, and to rank all the postings by the multiplied posting scores, and wherein the given level comprises the score of document M from the top of the rankings of the postings.
-
12. Apparatus according to claim 1, wherein the index processor is a computer having large memory capacity, and comprises apparatus for transferring the compressed inverted index to a device with limited memory capacity.
-
13. Apparatus according to claim 12, wherein the device with limited memory capacity comprises a handheld computing device.
-
14. A computer-implemented method for indexing a corpus of text documents, comprising the steps of:
-
generating an inverted index of entries for terms appearing in a plurality of documents, each entry including a term present in the plurality of documents and a plurality of postings, each posting including an identifier that represents a document of the plurality of documents and a ranking that represents a rank of the term in the document, determining a score for each posting and updating the ranking of each posting based on the score; and compressing the inverted index by deleting a posting having a ranking set below a given level to generate a compressed inverted index.
-
-
15. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform method steps for indexing a corpus of text documents, the method steps comprising:
-
generating an inverted index of entries for terms appearing in a plurality of documents, each entry including a term present in the plurality of documents and a plurality of postings, each posting including an identifier that represents a document of the plurality of documents and a ranking that represents a rank of the term in the document, determining a score for each posting and updating the ranking of each posting based on the score; and compressing the inverted index by deleting a posting having a ranking set below a given level to generate a compressed inverted index.
-
Specification