Efficient Indexing of Documents with Similar Content

US 20120023073A1
Filed: 09/29/2011
Published: 01/26/2012
Est. Priority Date: 05/19/2006
Status: Active Grant

Abstract

A set of documents may be stored and indexed as a compressed sequence of tokens. The documents are grouped into clusters. The sequences of tokens representing the clusters are encoded to elide some repeated instances of tokens. A compressed sequence of tokens is generated from the compressed cluster sequences. A query on the compressed sequence is performed by first identifying cluster sequences within the compressed sequence that are likely to contain documents satisfying the query, and then identifying, within those clusters, the documents that actually satisfy the query.
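The two-phase query described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it works on uncompressed token lists, and the names `build_cluster_index` and `query`, the toy corpus layout, and the "all terms must match" query semantics are assumptions made for illustration only.

```python
from typing import Dict, List, Set

# Hypothetical toy layout: cluster id -> {document id -> list of tokens}.
Clusters = Dict[str, Dict[str, List[str]]]


def build_cluster_index(clusters: Clusters) -> Dict[str, Set[str]]:
    """Map each token to the ids of the clusters in which it occurs."""
    index: Dict[str, Set[str]] = {}
    for cluster_id, docs in clusters.items():
        for tokens in docs.values():
            for token in tokens:
                index.setdefault(token, set()).add(cluster_id)
    return index


def query(clusters: Clusters, index: Dict[str, Set[str]], terms: List[str]) -> List[str]:
    """Return ids of documents containing every query term."""
    if not terms:
        return []
    # Phase 1: clusters in which every term occurs somewhere. Such a cluster
    # is only *likely* to hold a match, because the terms may be spread
    # across different documents of the cluster.
    candidates = set.intersection(*(index.get(t, set()) for t in terms))
    # Phase 2: check each document of the candidate clusters individually.
    hits: List[str] = []
    for cluster_id in sorted(candidates):
        for doc_id, tokens in clusters[cluster_id].items():
            if all(t in tokens for t in terms):
                hits.append(doc_id)
    return hits


corpus: Clusters = {
    "c1": {"a": ["red", "apple"], "b": ["green", "apple"]},
    "c2": {"c": ["red", "car"], "d": ["green", "car"]},
    "c3": {"e": ["red", "green", "kiwi"]},
}
cluster_index = build_cluster_index(corpus)
print(query(corpus, cluster_index, ["red", "green"]))
```

In this toy corpus all three clusters pass the first phase for the terms `red` and `green`, but only document `e` survives the per-document check, which is the distinction the abstract draws between clusters "likely" to match and documents that "actually" match.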


24 Claims

1. A method of processing documents, comprising:
- at a computer system comprising one or more processors and memory storing one or more programs for execution by the one or more processors so as to perform the method:
  
  grouping a set of documents into a plurality of clusters, wherein each cluster includes one or more documents of the set of documents and a respective cluster of the plurality of clusters includes a plurality of documents including a first document and a second document;
  
  compressing the plurality of documents in the respective cluster to generate compressed cluster data, wherein compressing the plurality of documents includes:
  
  determining that the second document includes duplicate data that is duplicative of corresponding data in the first document;
  
  eliding the duplicate data from compressed cluster data; and
  
  storing document data from which the first document and the second document can be reconstructed, the document data including document reconstruction data; and
  
  generating an index of the compressed cluster data.
  - 2. The method of claim 1, wherein a respective cluster of the plurality of clusters includes a plurality of documents that are determined to be related to each other.
  - 3. The method of claim 2, wherein a respective document is determined to be related to one or more other documents in the respective cluster based on an analysis of content of the respective document and content of the one or more documents in the respective cluster.
  - 4. The method of claim 2, wherein a respective document is determined to be related to one or more other documents in the respective cluster based on a resource locator of the respective document and resource locators of the one or more other documents in the respective cluster.
  - 5. The method of claim 4, wherein:
    - a plurality of documents in the set of documents each have a resource locator;
      
      grouping the set of documents into a plurality of clusters includes:
      
      ordering the set of documents in accordance with the resource locators; and
      
      selecting a respective plurality of consecutive documents from the ordering for inclusion in the respective cluster.
  - 6. The method of claim 5, wherein:
    - the resource locator is a URL; and
      
      the respective cluster includes documents from a particular sub-domain within a same domain.
  - 7. The method of claim 5, wherein:
    - a plurality of documents in the set of documents each have a URL including a respective plurality of domains and a respective protocol indicator;
      
      prior to ordering the set of documents, a modified locator is generated for each respective document, wherein generating a respective modified locator for a particular document having a particular URL includes reversing the domains of the particular URL and moving the protocol indicator for the particular URL to the end of the respective modified locator; and
      
      the documents are ordered in accordance with the modified locators.
  - 8. The method of claim 1, wherein:
    - the set of documents comprises a historical archive of different versions of documents; and
      
      a respective cluster of the plurality of clusters includes a plurality of different versions of a same document from different times.
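Claims 5 and 7 describe ordering documents by a modified locator and taking consecutive documents as a cluster. A minimal sketch of that idea follows. The exact textual form of the modified locator (here `reversed.host/path:scheme`), the fixed cluster size, and the function names are assumptions for illustration; the claims specify only that the domains are reversed, the protocol indicator is moved to the end, and consecutive documents are selected. Query strings and ports are ignored for brevity.

```python
from typing import List
from urllib.parse import urlsplit


def modified_locator(url: str) -> str:
    """Reverse the host's domain labels and move the scheme to the end."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    path = parts.path or "/"
    # Assumed layout: "<reversed host><path>:<scheme>".
    return f"{reversed_host}{path}:{parts.scheme}"


def cluster_by_locator(urls: List[str], cluster_size: int) -> List[List[str]]:
    """Order URLs by modified locator, then cut runs of consecutive URLs."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    ordered = sorted(urls, key=modified_locator)
    return [ordered[i:i + cluster_size] for i in range(0, len(ordered), cluster_size)]


urls = [
    "http://www.example.com/a",
    "https://mail.example.com/inbox",
    "http://www.other.org/x",
    "http://www.example.com/b",
]
for cluster in cluster_by_locator(urls, 3):
    print(cluster)
```

Because the most significant domain label comes first in the modified locator, pages from the same domain and sub-domain sort next to each other regardless of protocol, so a run of consecutive entries tends to contain documents from the same site.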

9. A computer system, comprising:
- one or more processors;
  
  memory; and
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  grouping a set of documents into a plurality of clusters, wherein each cluster includes one or more documents of the set of documents and a respective cluster of the plurality of clusters includes a plurality of documents including a first document and a second document;
  
  compressing the plurality of documents in the respective cluster to generate compressed cluster data, wherein compressing the plurality of documents includes:
  
  determining that the second document includes duplicate data that is duplicative of corresponding data in the first document;
  
  eliding the duplicate data from compressed cluster data; and
  
  storing document data from which the first document and the second document can be reconstructed, the document data including document reconstruction data; and
  
  generating an index of the compressed cluster data.
  - 10. The system of claim 9, wherein a respective cluster of the plurality of clusters includes a plurality of documents that are determined to be related to each other.
  - 11. The system of claim 10, wherein a respective document is determined to be related to one or more other documents in the respective cluster based on an analysis of content of the respective document and content of the one or more documents in the respective cluster.
  - 12. The system of claim 10, wherein a respective document is determined to be related to one or more other documents in the respective cluster based on a resource locator of the respective document and resource locators of the one or more other documents in the respective cluster.
  - 13. The system of claim 12, wherein:
    - a plurality of documents in the set of documents each have a resource locator;
      
      the instructions for grouping the set of documents into a plurality of clusters include instructions for:
      
      ordering the set of documents in accordance with the resource locators; and
      
      selecting a respective plurality of consecutive documents from the ordering for inclusion in the respective cluster.
  - 14. The system of claim 13, wherein:
    - the resource locator is a URL; and
      
      the respective cluster includes documents from a particular sub-domain within a same domain.
  - 15. The system of claim 13, wherein:
    - a plurality of documents in the set of documents each have a URL including a respective plurality of domains and a respective protocol indicator;
      
      prior to ordering the set of documents, a modified locator is generated for each respective document, wherein generating a respective modified locator for a particular document having a particular URL includes reversing the domains of the particular URL and moving the protocol indicator for the particular URL to the end of the respective modified locator; and
      
      the documents are ordered in accordance with the modified locators.
  - 16. The system of claim 9, wherein:
    - the set of documents comprises a historical archive of different versions of documents; and
      
      a respective cluster of the plurality of clusters includes a plurality of different versions of a same document from different times.
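The compressing step recited in the independent claims (determining that a second document duplicates data in a first document, eliding the duplicate data, and storing reconstruction data) could be sketched as follows. This is one possible illustration, not the method disclosed in the specification: it uses Python's standard `difflib.SequenceMatcher` to find duplicated token runs, and the `("copy", start, end)` / `("lit", tokens)` recipe format and function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Recipe = List[tuple]  # ("copy", start, end) or ("lit", [tokens])


def compress_cluster(docs: List[List[str]]) -> Tuple[List[str], List[Recipe]]:
    """Keep the first document whole; encode the others against it."""
    base = docs[0]
    recipes: List[Recipe] = []
    for doc in docs[1:]:
        ops: Recipe = []
        matcher = SequenceMatcher(a=base, b=doc, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Duplicate data: store only a reference into the first document.
                ops.append(("copy", i1, i2))
            elif j2 > j1:
                # New data ("replace" or "insert"): store the tokens literally.
                ops.append(("lit", doc[j1:j2]))
            # "delete" opcodes contribute nothing to the second document.
        recipes.append(ops)
    return base, recipes


def reconstruct(base: List[str], ops: Recipe) -> List[str]:
    """Rebuild a document from the first document plus its recipe."""
    out: List[str] = []
    for op in ops:
        if op[0] == "copy":
            out.extend(base[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


first = "the quick brown fox jumps".split()
second = "the quick red fox jumps high".split()
base, recipes = compress_cluster([first, second])
print(recipes[0])
print(reconstruct(base, recipes[0]) == second)
```

Here the recipe plays the role of the "document reconstruction data": the tokens shared with the first document are not stored a second time, yet both documents can be rebuilt exactly.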

17. A non-transitory computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computer system with one or more processors, cause the computer system to:
- group a set of documents into a plurality of clusters, wherein each cluster includes one or more documents of the set of documents and a respective cluster of the plurality of clusters includes a plurality of documents including a first document and a second document;
  
  compress the plurality of documents in the respective cluster to generate compressed cluster data, wherein compressing the plurality of documents includes:
  
  determining that the second document includes duplicate data that is duplicative of corresponding data in the first document;
  
  eliding the duplicate data from compressed cluster data; and
  
  storing document data from which the first document and the second document can be reconstructed, the document data including document reconstruction data; and
  
  generate an index of the compressed cluster data.
  - 18. The computer readable storage medium of claim 17, wherein a respective cluster of the plurality of clusters includes a plurality of documents that are determined to be related to each other.
  - 19. The computer readable storage medium of claim 18, wherein a respective document is determined to be related to one or more other documents in the respective cluster based on an analysis of content of the respective document and content of the one or more documents in the respective cluster.
  - 20. The computer readable storage medium of claim 18, wherein a respective document is determined to be related to one or more other documents in the respective cluster based on a resource locator of the respective document and resource locators of the one or more other documents in the respective cluster.
  - 21. The computer readable storage medium of claim 20, wherein:
    - a plurality of documents in the set of documents each have a resource locator;
      
      the instructions to group the set of documents into a plurality of clusters include instructions which, when executed by the one or more processors, cause the computer system to:
      
      order the set of documents in accordance with the resource locators; and
      
      select a respective plurality of consecutive documents from the ordering for inclusion in the respective cluster.
  - 22. The computer readable storage medium of claim 21, wherein:
    - the resource locator is a URL; and
      
      the respective cluster includes documents from a particular sub-domain within a same domain.
  - 23. The computer readable storage medium of claim 21, wherein:
    - a plurality of documents in the set of documents each have a URL including a respective plurality of domains and a respective protocol indicator;
      
      prior to ordering the set of documents, a modified locator is generated for each respective document, wherein generating a respective modified locator for a particular document having a particular URL includes reversing the domains of the particular URL and moving the protocol indicator for the particular URL to the end of the respective modified locator; and
      
      the documents are ordered in accordance with the modified locators.
  - 24. The computer readable storage medium of claim 17, wherein:
    - the set of documents comprises a historical archive of different versions of documents; and
      
      a respective cluster of the plurality of clusters includes a plurality of different versions of a same document from different times.
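The historical-archive variant in claims 8, 16, and 24, where a cluster holds different versions of the same document from different times, might look like the sketch below. The `Snapshot` record, its field names, and the use of the URL as the document identity are assumptions for illustration; the claims do not say how versions of a document are recognized.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Snapshot(NamedTuple):
    url: str         # assumed to identify "the same document"
    crawled_at: str  # ISO 8601 date, which sorts correctly as a string
    text: str


def cluster_versions(snapshots: List[Snapshot]) -> Dict[str, List[Snapshot]]:
    """Group snapshots by URL and order each group from oldest to newest."""
    groups: Dict[str, List[Snapshot]] = defaultdict(list)
    for snap in snapshots:
        groups[snap.url].append(snap)
    return {url: sorted(group, key=lambda s: s.crawled_at) for url, group in groups.items()}


archive = [
    Snapshot("http://www.example.com/news", "2006-05-19", "headline B"),
    Snapshot("http://www.example.com/about", "2006-01-02", "about us"),
    Snapshot("http://www.example.com/news", "2006-05-18", "headline A"),
]
for url, versions in cluster_versions(archive).items():
    print(url, [v.crawled_at for v in versions])
```

Successive versions of a page usually share most of their content, so clusters built this way are natural candidates for the duplicate-eliding compression recited in the independent claims.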


Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Gautham Thambidorai, Jeffrey A. Dean, Sanjay Ghemawat
Inventors: Dean, Jeffrey A.; Ghemawat, Sanjay; Thambidorai, Gautham
Granted Patent: US 8,244,530 B2
US Class Current: 707/693
CPC Class Codes: G06F 16/355 Class or cluster creation o...
