REPRESENTATIVE DOCUMENT SELECTION FOR A SET OF DUPLICATE DOCUMENTS
Abstract
Systems and methods for indexing a representative document from a set of duplicate documents are disclosed. Disclosed systems and methods comprise selecting a first document in a plurality of documents on the basis that the first document is associated with a query independent score. Each respective document in the plurality of documents has a fingerprint that indicates that the respective document has substantially identical content to every other document in the plurality of documents. Disclosed systems and methods further comprise indexing, in accordance with the query independent score, the first document thereby producing an indexed first document. With respect to the plurality of documents, only the indexed first document is included in a document index.
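The selection step summarized in the abstract can be sketched in a few lines. This is a minimal illustration, not the claimed implementation: the `Document` type and the function names are hypothetical, equal fingerprints are assumed to mean substantially identical content, "on the basis of the query independent score" is read here as "highest score wins", and indexing "in accordance with" the score is read as ordering by score.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    url: str
    fingerprint: int  # assumption: equal fingerprints mean duplicate content
    score: float      # query independent score


def select_representatives(documents):
    """Return one representative per set of duplicates, keyed by fingerprint."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc.fingerprint].append(doc)
    # Assumption: the representative is the document with the highest score.
    return {fp: max(docs, key=lambda d: d.score) for fp, docs in groups.items()}


def build_index(documents):
    """Keep only the representative of each duplicate set, ordered by score."""
    representatives = select_representatives(documents)
    return sorted(representatives.values(), key=lambda d: d.score, reverse=True)
```

With three documents where two share a fingerprint, `build_index` returns two entries: the higher-scored duplicate and the unique document.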
37 Claims
1. A method, comprising:
at a computing device having one or more processors and memory;
selecting a first document in a plurality of documents on the basis that the first document is associated with a query independent score, wherein the first document has a fingerprint that indicates that the first document has substantially identical content to every other document in the plurality of documents;
indexing, in accordance with the query independent score, the first document thereby producing an indexed first document; and
with respect to the plurality of documents, including only the indexed first document in a document index.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A server system, comprising:
one or more processors;
memory; and
one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the one or more processors, the one or more programs comprising instructions for:
selecting a first document in a plurality of documents on the basis that the first document is associated with a query independent score, wherein the first document has a fingerprint that indicates that the first document has substantially identical content to every other document in the plurality of documents;
indexing, in accordance with the query independent score, the first document thereby producing an indexed first document; and
with respect to the plurality of documents, including only the indexed first document in a document index.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. A computer readable storage medium storing one or more programs to be executed by one or more processing units of a computing device, the one or more programs comprising instructions that, when executed by the one or more processing units, cause the computing device to:
select a first document in a plurality of documents on the basis that the first document is associated with a query independent score, wherein each respective document in the plurality of documents has a fingerprint that indicates that the respective document has substantially identical content to every other document in the plurality of documents;
index, in accordance with the query independent score, the first document thereby producing an indexed first document; and
with respect to the plurality of documents, include only the indexed first document in a document index.
Dependent claims: 20
21. A method, comprising:
at a computing device having one or more processors and memory;
receiving a newly crawled document and a first query independent score associated with the newly crawled document; and
determining a first equivalence class, from among a plurality of equivalence classes, for the newly crawled document based upon a fingerprint for the newly crawled document, wherein the first equivalence class comprises a plurality of documents, wherein each document in the plurality of documents is uniquely associated with a respective query independent score and wherein the fingerprint of the newly crawled document indicates that the newly crawled document is substantially identical to each document in the plurality of documents.
Dependent claims: 22, 23, 24
25. A server system, comprising:
one or more processors;
memory; and
one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the one or more processors, the one or more programs comprising instructions for:
receiving a newly crawled document and a first query independent score associated with the newly crawled document; and
determining a first equivalence class, from among a plurality of equivalence classes, for the newly crawled document based upon a fingerprint for the newly crawled document, wherein the first equivalence class comprises a plurality of documents, wherein each document in the plurality of documents is uniquely associated with a respective query independent score and wherein the fingerprint of the newly crawled document indicates that the newly crawled document is substantially identical to each document in the plurality of documents.
Dependent claims: 26, 27, 28
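The equivalence-class determination recited in claims 21 and 25 might look roughly like the following. This is a sketch under stated assumptions: the `fingerprint` function and the `EquivalenceClasses` class are hypothetical, and a SHA-256 hash of case- and whitespace-normalized text stands in for whatever fingerprint the specification actually uses. A hash of this kind only catches documents that are identical after normalization, which is narrower than "substantially identical".

```python
import hashlib


def fingerprint(content: str) -> str:
    """Hypothetical fingerprint: SHA-256 of case- and whitespace-normalized text."""
    normalized = " ".join(content.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class EquivalenceClasses:
    """Maps each fingerprint to the documents (URL -> score) that share it."""

    def __init__(self):
        self._classes = {}

    def assign(self, url: str, content: str, score: float) -> str:
        """Place a newly crawled document in its class and return the class key."""
        key = fingerprint(content)
        # Each URL in a class carries its own query independent score.
        self._classes.setdefault(key, {})[url] = score
        return key

    def members(self, key: str) -> dict:
        """Return a copy of the URL -> score mapping for one class."""
        return dict(self._classes.get(key, {}))
```

Two crawled pages whose text differs only in case or spacing land in the same class, each keeping its own score.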
29. A method of associating anchor text to a web page, the method comprising:
at a computing device having one or more processors and memory;
receiving a newly crawled document and a first query independent score associated with the newly crawled document;
determining whether the newly crawled document is a canonical document for a first equivalence class, from among a plurality of equivalence classes, based upon a fingerprint for the newly crawled document, wherein the first equivalence class comprises a plurality of documents, wherein each document in the plurality of documents is uniquely associated with a respective query independent score and wherein the fingerprint of the newly crawled document indicates that the newly crawled document is substantially identical to each document in the plurality of documents;
providing, when the newly crawled document is the canonical document, a list of documents in a first plurality of documents that are ranked by respective query independent scores; and
indexing, when the newly crawled document is the canonical document, the newly crawled page, wherein the indexing comprises (i) retrieving anchor text of links to pages in the list of pages and (ii) associating this anchor text with the newly crawled page in a document index.
Dependent claims: 30, 31, 32, 33
34. A server system, comprising:
one or more processors;
memory; and
one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the one or more processors, the one or more programs comprising instructions for:
receiving a newly crawled document and a first query independent score associated with the newly crawled document;
determining whether the newly crawled document is a canonical document for a first equivalence class, from among a plurality of equivalence classes, based upon a fingerprint for the newly crawled document, wherein the first equivalence class comprises a plurality of documents, wherein each document in the plurality of documents is uniquely associated with a respective query independent score and wherein the fingerprint of the newly crawled document indicates that the newly crawled document is substantially identical to each document in the plurality of documents;
providing, when the newly crawled document is the canonical document, a list of documents in a first plurality of documents that are ranked by respective query independent scores; and
indexing, when the newly crawled document is the canonical document, the newly crawled page, wherein the indexing comprises (i) retrieving anchor text of links to pages in the list of pages and (ii) associating this anchor text with the newly crawled page in a document index.
Dependent claims: 35, 36, 37
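The canonical-document and anchor-text steps recited in claims 29 and 34 can be illustrated as follows. The function name, its parameters, and the plain dictionaries used for the equivalence class, the inbound anchor text, and the index are all hypothetical. The claims do not say how the canonical document is chosen, so this sketch assumes the newly crawled page is canonical only when its score is strictly higher than every existing member's score.

```python
def index_if_canonical(new_url, new_score, class_members, inbound_anchors, index):
    """Index the new page with its duplicates' anchor text if it is canonical.

    class_members:   dict of URL -> query independent score for the documents
                     already in the equivalence class (new page excluded).
    inbound_anchors: dict of URL -> list of anchor texts of links to that URL.
    index:           dict of URL -> list of anchor texts (index stand-in).
    Returns the ranked list of duplicate URLs, or None if not canonical.
    """
    # Assumption: canonical means strictly highest score in the class.
    if any(score >= new_score for score in class_members.values()):
        return None

    # List of duplicates ranked by query independent score, highest first.
    ranked = sorted(class_members, key=class_members.get, reverse=True)

    # (i) retrieve anchor text of links to the listed pages,
    # (ii) associate that anchor text with the new page in the index.
    anchors = []
    for url in ranked:
        anchors.extend(inbound_anchors.get(url, []))
    index[new_url] = anchors
    return ranked
```

When the new page is not canonical, the function leaves the index untouched and returns `None`, matching the "when the newly crawled document is the canonical document" condition on both the providing and indexing steps.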
Specification