Representative document selection for a set of duplicate documents
First Claim
1. A method, comprising:
at a computing device having one or more processors and memory;
obtaining a plurality of documents, wherein a respective document in the plurality of document is associated with a query independent score;
selecting a first document in the plurality of documents in accordance with a query independent score associated with the first document, wherein the first document has a fingerprint that indicates that the first document has substantially identical content to every other document in the plurality of documents;
indexing, in accordance with the query independent score, the first document thereby producing an indexed first document; and
with respect to the plurality of documents, including only the indexed first document in a document index.
Abstract
Systems and methods for indexing a representative document from a set of duplicate documents are disclosed. Disclosed systems and methods comprise selecting a first document in a plurality of documents on the basis that the first document is associated with a query independent score. Each respective document in the plurality of documents has a fingerprint that indicates that the respective document has substantially identical content to every other document in the plurality of documents. Disclosed systems and methods further comprise indexing, in accordance with the query independent score, the first document thereby producing an indexed first document. With respect to the plurality of documents, only the indexed first document is included in a document index.
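The selection described in the abstract can be illustrated with a minimal Python sketch. Several details here are assumptions made only for illustration and are not stated in the source: the fingerprint is modeled as a SHA-256 hash of whitespace-normalized, lowercased text (so it only catches near-exact copies), "in accordance with the query independent score" is interpreted as picking the highest-scoring duplicate, and the names `Document`, `fingerprint`, `select_representatives`, and `build_index` are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Document:
    url: str
    content: str
    score: float  # query independent score (illustrative values only)


def fingerprint(content: str) -> str:
    """Assumed fingerprint: hash of whitespace-normalized, lowercased text."""
    normalized = " ".join(content.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def select_representatives(documents: Iterable[Document]) -> Dict[str, Document]:
    """Keep one document per fingerprint: the one with the highest score.

    Choosing the maximum score is an assumption; the source only says the
    selection is made in accordance with the query independent score.
    """
    best: Dict[str, Document] = {}
    for doc in documents:
        fp = fingerprint(doc.content)
        current = best.get(fp)
        if current is None or doc.score > current.score:
            best[fp] = doc
    return best


def build_index(documents: Iterable[Document]) -> Dict[str, Document]:
    """Include only the representative of each duplicate set in the index."""
    return {doc.url: doc for doc in select_representatives(documents).values()}


docs = [
    Document("a.example/page", "Hello   world", 0.2),
    Document("b.example/page", "hello world", 0.9),
    Document("c.example/other", "Something else", 0.1),
]
index = build_index(docs)
```

In this sketch the two "hello world" pages share a fingerprint, so only the higher-scoring one (`b.example/page`) is placed in the index alongside the unrelated page.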
37 Claims
1. A method, comprising:
at a computing device having one or more processors and memory;
obtaining a plurality of documents, wherein a respective document in the plurality of document is associated with a query independent score;
selecting a first document in the plurality of documents in accordance with a query independent score associated with the first document, wherein the first document has a fingerprint that indicates that the first document has substantially identical content to every other document in the plurality of documents;
indexing, in accordance with the query independent score, the first document thereby producing an indexed first document; and
with respect to the plurality of documents, including only the indexed first document in a document index.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A computing system, comprising:
one or more processors;
memory; and
one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the one or more processors, the one or more programs comprising instructions for;
obtaining, using the one or more processors, a plurality of documents, wherein a respective document in the plurality of document is associated with a query independent score;
selecting a first document in the plurality of documents in accordance with a query independent score associated with the first document, wherein the first document has a fingerprint that indicates that the first document has substantially identical content to every other document in the plurality of documents;
indexing, in accordance with the query independent score, the first document thereby producing an indexed first document; and
with respect to the plurality of documents, including only the indexed first document in a document index.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. A non-transitory computer readable storage medium storing one or more programs to be executed by one or more processing units of a computing device, the one or more programs comprising instructions that, when executed by the one or more processing units, cause the computing device to:
obtain a plurality of documents, wherein a respective document in the plurality of document is associated with a query independent score;
select a first document in the plurality of documents in accordance with a query independent score associated with the first document, wherein the first document has a fingerprint that indicates that the first document has substantially identical content to every other document in the plurality of documents;
index, in accordance with the query independent score, the first document thereby producing an indexed first document; and
with respect to the plurality of documents, include only the indexed first document in a document index.
Dependent claims: 20
21. A method, comprising:
at a computing device having one or more processors and memory;
identifying a plurality of newly crawled documents, wherein a respective document in the plurality of newly crawled document is associated with a query independent score;
identifying (i) a newly crawled document in the plurality of newly crawled documents, and (ii) a first query independent score associated with the newly crawled document; and
determining a first equivalence class, from among a plurality of equivalence classes, for the newly crawled document based upon a fingerprint for the newly crawled document, wherein the first equivalence class comprises a plurality of documents, each document in the plurality of documents is uniquely associated with a respective query independent score, and the fingerprint of the newly crawled document indicates that the newly crawled document is substantially identical to each document in the plurality of documents.
Dependent claims: 22, 23, 24
25. A computing system, comprising:
one or more processors;
memory; and
one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the one or more processors, the one or more programs comprising instructions for;
identifying a plurality of newly crawled documents, wherein a respective document in the plurality of newly crawled document is associated with a query independent score;
identifying (i) a newly crawled document in the plurality of newly crawled documents, and (ii) a first query independent score associated with the newly crawled document; and
determining a first equivalence class, from among a plurality of equivalence classes, for the newly crawled document based upon a fingerprint for the newly crawled document, wherein the first equivalence class comprises a plurality of documents, each document in the plurality of documents is uniquely associated with a respective query independent score, and the fingerprint of the newly crawled document indicates that the newly crawled document is substantially identical to each document in the plurality of documents.
Dependent claims: 26, 27, 28
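The equivalence-class determination recited in claims 21 and 25 can be sketched as a lookup keyed by fingerprint. This is only an illustration under stated assumptions: the fingerprint is taken as an already-computed string, each class is modeled as a mapping from document URL to its query independent score, and the class name `EquivalenceClassTable` and its methods are hypothetical names not found in the source.

```python
from collections import defaultdict
from typing import Dict


class EquivalenceClassTable:
    """Hypothetical store of equivalence classes, keyed by fingerprint."""

    def __init__(self) -> None:
        # fingerprint -> {document URL -> query independent score}
        self._classes: Dict[str, Dict[str, float]] = defaultdict(dict)

    def add_member(self, fp: str, url: str, score: float) -> None:
        """Record a document and its own score under its fingerprint."""
        self._classes[fp][url] = score

    def class_for(self, fp: str) -> Dict[str, float]:
        """Return a copy of the class whose members share this fingerprint.

        An unknown fingerprint yields an empty class and is not stored.
        """
        return dict(self._classes.get(fp, {}))


table = EquivalenceClassTable()
table.add_member("fp-1", "a.example/page", 0.5)
table.add_member("fp-1", "b.example/mirror", 0.7)

# A newly crawled document with fingerprint "fp-1" falls into this class.
first_class = table.class_for("fp-1")
```

Keying the table by fingerprint means that determining the class for a newly crawled document is a single dictionary lookup, and each member keeps its own score for later ranking.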
29. A method of associating anchor text to a web page, the method comprising:
at a computing device having one or more processors and memory;
identifying a plurality of newly crawled documents, wherein a respective document in the plurality of newly crawled document is associated with a query independent score;
identifying (i) a newly crawled document in the plurality of newly crawled documents, and (ii) a first query independent score associated with the newly crawled document;
determining whether the newly crawled document is a canonical document for a first equivalence class, from among a plurality of equivalence classes, based upon a fingerprint for the newly crawled document, wherein the first equivalence class comprises a plurality of documents, wherein each document in the plurality of documents is uniquely associated with a respective query independent score and wherein the fingerprint of the newly crawled document indicates that the newly crawled document is substantially identical to each document in the plurality of documents;
providing, when the newly crawled document is the canonical document, a list of documents in a first plurality of documents that are ranked by respective query independent scores; and
indexing, when the newly crawled document is the canonical document, the newly crawled page, wherein the indexing comprises (i) retrieving anchor text of links to pages in the list of pages and (ii) associating this anchor text with the newly crawled page in a document index.
Dependent claims: 30, 31, 32, 33
34. A computing system, comprising:
one or more processors;
memory; and
one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the one or more processors, the one or more programs comprising instructions for;
identifying a plurality of newly crawled documents, wherein a respective document in the plurality of newly crawled document is associated with a query independent score;
identifying (i) a newly crawled document in the plurality of newly crawled documents, and (ii) a first query independent score associated with the newly crawled document;
determining whether the newly crawled document is a canonical document for a first equivalence class, from among a plurality of equivalence classes, based upon a fingerprint for the newly crawled document, wherein the first equivalence class comprises a plurality of documents, wherein each document in the plurality of documents is uniquely associated with a respective query independent score and wherein the fingerprint of the newly crawled document indicates that the newly crawled document is substantially identical to each document in the plurality of documents;
providing, when the newly crawled document is the canonical document, a list of documents in a first plurality of documents that are ranked by respective query independent scores; and
indexing, when the newly crawled document is the canonical document, the newly crawled page, wherein the indexing comprises (i) retrieving anchor text of links to pages in the list of pages and (ii) associating this anchor text with the newly crawled page in a document index.
Dependent claims: 35, 36, 37
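The anchor-text association recited in claims 29 and 34 can be sketched as follows. The claims do not say how the canonical document is chosen within a class, so this sketch assumes it is the member with the highest query independent score, with ties broken by URL. The function names, the `anchors_by_target` mapping (target URL to anchor texts of links pointing at it), and the dictionary used as a document index are all hypothetical.

```python
from typing import Dict, List


def ranked_members(members: Dict[str, float]) -> List[str]:
    """List class members by descending score (ties broken by URL)."""
    return sorted(members, key=lambda url: (-members[url], url))


def is_canonical(url: str, members: Dict[str, float]) -> bool:
    """Assumption: the canonical document is the top-ranked class member."""
    return bool(members) and ranked_members(members)[0] == url


def index_with_anchor_text(
    url: str,
    members: Dict[str, float],
    anchors_by_target: Dict[str, List[str]],
    index: Dict[str, List[str]],
) -> bool:
    """Index the page only if it is canonical.

    Anchor text of links to every page in the ranked list is gathered and
    associated with the canonical page. Returns True if the page was indexed.
    """
    if not is_canonical(url, members):
        return False
    texts: List[str] = []
    for member in ranked_members(members):
        texts.extend(anchors_by_target.get(member, []))
    index[url] = texts
    return True


members = {"a.example/mirror": 0.3, "b.example/page": 0.9}
anchors = {"a.example/mirror": ["mirror copy"], "b.example/page": ["original"]}
index: Dict[str, List[str]] = {}

indexed_b = index_with_anchor_text("b.example/page", members, anchors, index)
indexed_a = index_with_anchor_text("a.example/mirror", members, anchors, index)
```

Under these assumptions only `b.example/page` is indexed, and it collects the anchor text of links to both duplicates, so links that point at the mirror still contribute to the canonical page.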
Specification