Representative Document Selection for a Set of Duplicate Documents
First Claim
1. A method, comprising:
- at a computing device having one or more processors and memory;
obtaining a plurality of documents, wherein a respective document in the plurality of documents is associated with a score and wherein each document in the plurality of documents is from a different data structure in a plurality of data structures, each data structure in the plurality of data structures representing a different portion of a document address space;
selecting a first document in the plurality of documents in accordance with the score associated with the first document, wherein the first document has a fingerprint that indicates that the first document has substantially identical content to every other document in the plurality of documents;
indexing, in accordance with the score, the first document thereby producing an indexed first document; and
with respect to the plurality of documents, including the indexed first document in a document index as representative of each document in the plurality of documents.
Abstract
Systems and methods are provided for obtaining a plurality of documents. A respective document in the plurality of documents is associated with a score and each document in the plurality of documents is from a different data structure in a plurality of data structures. Each data structure in the plurality of data structures represents a different portion of a document address space. A first document in the plurality of documents is selected in accordance with the score associated with the first document. The first document has a fingerprint that indicates that the first document has substantially identical content to every other document in the plurality of documents. In accordance with the score, the first document is indexed thereby producing an indexed first document. With respect to the plurality of documents, the indexed first document is included in a document index as representative of each document in the plurality of documents.
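The flow summarized in the abstract can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Document` type, the function names, and the dictionary-based index are not taken from the patent, and it assumes that "in accordance with the score" means choosing the highest-scoring duplicate. Each partition is modeled as a mapping from fingerprint to at most one document, standing in for a data structure that covers one portion of the document address space.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Document:
    url: str          # address of the document (hypothetical field)
    fingerprint: str  # equal fingerprints are treated as substantially identical content
    score: float      # score associated with the document (scale is an assumption)


def obtain_duplicates(partitions: List[Dict[str, Document]],
                      fingerprint: str) -> List[Document]:
    """Collect at most one document per partition that carries the fingerprint."""
    return [p[fingerprint] for p in partitions if fingerprint in p]


def index_representative(partitions: List[Dict[str, Document]],
                         fingerprint: str,
                         index: Dict[str, dict]) -> Optional[Document]:
    """Select one duplicate by score and record it as representative of the set."""
    candidates = obtain_duplicates(partitions, fingerprint)
    if not candidates:
        return None
    # Assumption: the highest score wins; ties go to the earliest partition.
    representative = max(candidates, key=lambda d: d.score)
    index[fingerprint] = {
        "representative": representative.url,
        "score": representative.score,
        "represents": sorted(d.url for d in candidates),
    }
    return representative
```

Calling `index_representative` with three partitions that each hold one copy of the same content would leave a single index entry for that fingerprint, pointing at the highest-scoring copy and listing all three addresses as represented by it.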
20 Claims
1. A method, comprising:
at a computing device having one or more processors and memory;
obtaining a plurality of documents, wherein a respective document in the plurality of documents is associated with a score and wherein each document in the plurality of documents is from a different data structure in a plurality of data structures, each data structure in the plurality of data structures representing a different portion of a document address space;
selecting a first document in the plurality of documents in accordance with the score associated with the first document, wherein the first document has a fingerprint that indicates that the first document has substantially identical content to every other document in the plurality of documents;
indexing, in accordance with the score, the first document thereby producing an indexed first document; and
with respect to the plurality of documents, including the indexed first document in a document index as representative of each document in the plurality of documents.
Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
11. A computing system, comprising:
one or more processors;
memory; and
one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the one or more processors, the one or more programs comprising instructions for:
selecting a first document in the plurality of documents in accordance with the score associated with the first document, wherein the first document has a fingerprint that indicates that the first document has substantially identical content to every other document in the plurality of documents;
indexing, in accordance with the score, the first document thereby producing an indexed first document; and
with respect to the plurality of documents, including the indexed first document in a document index as representative of each document in the plurality of documents.
Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19)
20. A non-transitory computer readable storage medium storing one or more programs to be executed by one or more processing units of a computing device, the one or more programs comprising instructions that, when executed by the one or more processing units, cause the computing device to:
obtain a plurality of documents, wherein a respective document in the plurality of documents is associated with a score and wherein each document in the plurality of documents is from a different data structure in a plurality of data structures, each data structure in the plurality of data structures representing a different portion of a document address space;
select a first document in the plurality of documents in accordance with the score associated with the first document, wherein the first document has a fingerprint that indicates that the first document has substantially identical content to every other document in the plurality of documents;
index, in accordance with the score, the first document thereby producing an indexed first document; and
with respect to the plurality of documents, include the indexed first document in a document index as representative of each document in the plurality of documents.
Specification