Efficient Indexing of Documents with Similar Content

  • US 20120023073A1
  • Filed: 09/29/2011
  • Published: 01/26/2012
  • Est. Priority Date: 05/19/2006
  • Status: Active Grant
First Claim

1. A method of processing documents, comprising:

  • at a computer system comprising one or more processors and memory storing one or more programs for execution by the one or more processors so as to perform the method:

    grouping a set of documents into a plurality of clusters, wherein each cluster includes one or more documents of the set of documents and a respective cluster of the plurality of clusters includes a plurality of documents including a first document and a second document;

    compressing the plurality of documents in the respective cluster to generate compressed cluster data, wherein compressing the plurality of documents includes:

        determining that the second document includes duplicate data that is duplicative of corresponding data in the first document;

        eliding the duplicate data from compressed cluster data; and

        storing document data from which the first document and the second document can be reconstructed, the document data including document reconstruction data; and

    generating an index of the compressed cluster data.
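
The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation; it only shows one plausible reading of the claim under simplifying assumptions. Every name here (`group_into_clusters`, `compress_cluster`, `reconstruct`, `build_index`) is hypothetical, documents are clustered by a caller-supplied key rather than by a real similarity measure, "duplicate data" is detected at the granularity of whole lines, and the index is a plain inverted index over the lines that survive elision.

```python
from collections import defaultdict


def group_into_clusters(docs, key):
    """Group documents (doc_id -> text) into clusters.

    `key` is a hypothetical stand-in for a similarity measure:
    documents with the same key land in the same cluster.
    """
    clusters = defaultdict(list)
    for doc_id, text in docs.items():
        clusters[key(text)].append(doc_id)
    return list(clusters.values())


def compress_cluster(docs, doc_ids):
    """Compress one cluster by storing each distinct line only once.

    Returns (chunks, recon): `chunks` is the compressed cluster data,
    `recon` maps each doc_id to the list of chunk numbers needed to
    rebuild it (the document reconstruction data).
    """
    chunks = []    # unique lines, in first-seen order
    position = {}  # line -> its number in `chunks`
    recon = {}
    for doc_id in doc_ids:
        refs = []
        for line in docs[doc_id].split("\n"):
            if line not in position:
                # First occurrence: keep the data.
                position[line] = len(chunks)
                chunks.append(line)
            # Duplicate occurrences are elided; only a reference is kept.
            refs.append(position[line])
        recon[doc_id] = refs
    return chunks, recon


def reconstruct(chunks, recon, doc_id):
    """Rebuild a document from the compressed cluster data."""
    return "\n".join(chunks[i] for i in recon[doc_id])


def build_index(compressed_clusters):
    """Inverted index: token -> set of (cluster number, chunk number)."""
    index = defaultdict(set)
    for c_no, (chunks, _recon) in enumerate(compressed_clusters):
        for k, chunk in enumerate(chunks):
            for token in chunk.lower().split():
                index[token].add((c_no, k))
    return index


# Example: "a" and "b" share content; "c" is unrelated.
docs = {
    "a": "Subject: meeting\nSee you at noon.",
    "b": "Subject: meeting\nSee you at noon.\nBring the slides.",
    "c": "Unrelated note",
}
# Assumption for the demo: the first line decides cluster membership.
clusters = group_into_clusters(docs, key=lambda text: text.split("\n")[0])
compressed = [compress_cluster(docs, ids) for ids in clusters]
index = build_index(compressed)
```

In this sketch the shared lines of documents "a" and "b" are stored and indexed once, so the index grows with the amount of distinct content in a cluster rather than with the number of documents. A query hit on a chunk can be mapped back to documents through the reconstruction data.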
