Efficient Indexing of Documents with Similar Content
Abstract
A set of documents may be stored and indexed as a compressed sequence of tokens. A set of documents is grouped into clusters. Sequences of tokens representing the clusters of documents are encoded to elide some repeating instances of tokens. A compressed sequence of tokens is generated from the compressed cluster sequences of tokens. Queries on the compressed sequence are performed by identifying cluster sequences within the compressed sequence that are likely to have documents that satisfy the query and then identifying, within these identified clusters, the documents that actually satisfy the query.
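The two-phase query described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it works on uncompressed token lists, assumes a conjunctive (all terms must match) query, and the names `build_cluster_index` and `query` are hypothetical. The cluster-level index only says that a term occurs somewhere in a cluster, so a second pass checks each document in the candidate clusters.

```python
from collections import defaultdict


def build_cluster_index(clusters):
    """Map each token to the ids of the clusters in which it occurs at least once.

    `clusters` is assumed to be {cluster_id: {doc_id: [token, ...]}}.
    """
    index = defaultdict(set)
    for cluster_id, docs in clusters.items():
        for tokens in docs.values():
            for token in tokens:
                index[token].add(cluster_id)
    return index


def query(clusters, index, terms):
    """Return the sorted ids of documents that contain every term."""
    if not terms:
        return []
    # Phase 1: clusters that are *likely* to hold a match, because every
    # term occurs somewhere in the cluster (not necessarily in one document).
    candidates = set.intersection(*(index.get(t, set()) for t in terms))
    # Phase 2: documents in those clusters that *actually* satisfy the query.
    hits = []
    for cluster_id in candidates:
        for doc_id, tokens in clusters[cluster_id].items():
            present = set(tokens)
            if all(t in present for t in terms):
                hits.append(doc_id)
    return sorted(hits)


clusters = {
    "c1": {"d1": ["apple", "pie"], "d2": ["apple", "tart"]},
    "c2": {"d3": ["banana", "pie"], "d4": ["banana", "apple"]},
}
index = build_cluster_index(clusters)
print(query(clusters, index, ["apple", "pie"]))  # ['d1']
```

In this example cluster `c2` passes the first phase, since "apple" and "pie" both occur in it, but no single document in `c2` contains both terms, so the second phase rejects it.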
24 Claims
1. A method of processing documents, comprising:
at a computer system comprising one or more processors and memory storing one or more programs for execution by the one or more processors so as to perform the method:
grouping a set of documents into a plurality of clusters, wherein each cluster includes one or more documents of the set of documents and a respective cluster of the plurality of clusters includes a plurality of documents including a first document and a second document;
compressing the plurality of documents in the respective cluster to generate compressed cluster data, wherein compressing the plurality of documents includes:
determining that the second document includes duplicate data that is duplicative of corresponding data in the first document;
eliding the duplicate data from compressed cluster data; and
storing document data from which the first document and the second document can be reconstructed, the document data including document reconstruction data; and
generating an index of the compressed cluster data.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
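The compressing step of claim 1 can be pictured with a short sketch. This is an illustration under stated assumptions, not the claimed implementation: documents are token lists, duplicate data is found with Python's standard `difflib.SequenceMatcher`, and the "document reconstruction data" is modelled as a list of copy and literal operations. The function names `compress_cluster` and `reconstruct` are hypothetical.

```python
from difflib import SequenceMatcher


def compress_cluster(first, second):
    """Store `first` in full and `second` as operations against `first`.

    Runs of tokens that `second` shares with `first` are elided and replaced
    by ("copy", start, end) references; only tokens unique to `second` are
    stored as ("literal", [tokens]).
    """
    ops = []
    matcher = SequenceMatcher(a=first, b=second, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))            # duplicate data: elided
        elif tag in ("replace", "insert"):
            ops.append(("literal", second[j1:j2]))  # data unique to `second`
        # "delete" spans exist only in `first`; nothing to record for `second`.
    return {"base": list(first), "ops": ops}


def reconstruct(compressed):
    """Rebuild both documents from the compressed cluster data."""
    base = compressed["base"]
    second = []
    for op in compressed["ops"]:
        if op[0] == "copy":
            second.extend(base[op[1]:op[2]])
        else:
            second.extend(op[1])
    return list(base), second


first = "the quick brown fox jumps over the lazy dog".split()
second = "the quick brown cat jumps over the lazy dog".split()
compressed = compress_cluster(first, second)
assert reconstruct(compressed) == (first, second)
```

An index built over data compressed this way would cover the base document plus the literal tokens, rather than every token of every document in the cluster.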
9. A computer system, comprising:
one or more processors;
memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
grouping a set of documents into a plurality of clusters, wherein each cluster includes one or more documents of the set of documents and a respective cluster of the plurality of clusters includes a plurality of documents including a first document and a second document;
compressing the plurality of documents in the respective cluster to generate compressed cluster data, wherein compressing the plurality of documents includes:
determining that the second document includes duplicate data that is duplicative of corresponding data in the first document;
eliding the duplicate data from compressed cluster data; and
storing document data from which the first document and the second document can be reconstructed, the document data including document reconstruction data; and
generating an index of the compressed cluster data.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A non-transitory computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computer system with one or more processors, cause the computer system to:
group a set of documents into a plurality of clusters, wherein each cluster includes one or more documents of the set of documents and a respective cluster of the plurality of clusters includes a plurality of documents including a first document and a second document;
compress the plurality of documents in the respective cluster to generate compressed cluster data, wherein compressing the plurality of documents includes:
determining that the second document includes duplicate data that is duplicative of corresponding data in the first document;
eliding the duplicate data from compressed cluster data; and
storing document data from which the first document and the second document can be reconstructed, the document data including document reconstruction data; and
generate an index of the compressed cluster data.
Dependent claims: 18, 19, 20, 21, 22, 23, 24
Specification