Representative document selection for sets of duplicate documents in a web crawler system
Abstract
Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
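The merge step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `CrawledDoc`, `fingerprint`, `DuplicateSets`, and `max_set_size` are hypothetical, SHA-256 stands in for whatever content fingerprint the system uses, and "highest score wins" stands in for the unspecified "set of predefined conditions".

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CrawledDoc:
    url: str
    content: str
    score: float  # query-independent metric; assumed to be a single number


def fingerprint(content: str) -> str:
    """Content identifier: identical content yields an identical key (assumed SHA-256)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class DuplicateSets:
    """Tracks, per content fingerprint, a bounded set of duplicate documents."""

    def __init__(self, max_set_size: int = 3) -> None:
        self.max_set_size = max_set_size  # hypothetical cap on set size
        self.sets: Dict[str, List[CrawledDoc]] = {}

    def add(self, doc: CrawledDoc) -> CrawledDoc:
        """Merge a newly crawled document into its duplicate set.

        Returns the representative of the resulting set.
        """
        key = fingerprint(doc.content)
        existing = self.sets.get(key, [])
        # Merge: drop any stale record of the same URL, then add the new one.
        merged = [d for d in existing if d.url != doc.url] + [doc]
        # Include/exclude by the query-independent metric: keep the top scorers.
        merged.sort(key=lambda d: (-d.score, d.url))
        merged = merged[: self.max_set_size]
        self.sets[key] = merged
        # Assumed selection condition: the highest-scoring document represents the set.
        return merged[0]


if __name__ == "__main__":
    tracker = DuplicateSets(max_set_size=2)
    tracker.add(CrawledDoc("http://a.example/x", "same text", 0.2))
    tracker.add(CrawledDoc("http://b.example/x", "same text", 0.9))
    rep = tracker.add(CrawledDoc("http://c.example/x", "same text", 0.5))
    print(rep.url)
```

The sketch does not handle a URL whose content changes between crawls; such a document would have to be removed from its old set before being added to the new one.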
15 Claims
1. A method of detecting duplicate documents, comprising:
at a server having one or more processors and memory:
receiving documents from one or more databases of documents, wherein each received document is associated with a respective query independent score;
generating a document content identifier for each received document, each document content identifier comprising an identifier of a respective document's content;
indexing at least a subset of the received documents to produce a document index that maps terms to documents in the one or more databases of documents; and
while performing the indexing,
  identifying respective sets of received documents having the same content identifier,
  selecting a single document in each respective set of documents, in accordance with the query independent scores associated with the documents in the respective set of documents, as a representative document for the respective set of received documents,
  indexing the representative document, and
  with respect to each respective set of received documents, including only the representative document in the document index.
Dependent claims: 2, 3, 4, 5
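The steps recited in claim 1 can be illustrated with a short sketch. It rests on assumptions the claim does not state: documents arrive as hypothetical `(url, text, score)` tuples, the content identifier is a SHA-256 hash, terms are whitespace-separated lowercase tokens, and ties in score go to the document seen first. For brevity it picks representatives in one pass and indexes them in a second, whereas the claim performs the selection while indexing.

```python
import hashlib
from collections import defaultdict


def content_id(text):
    """Identifier of a document's content (assumed: SHA-256 of the text)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_index(docs):
    """Index only one representative per set of same-content documents.

    docs: iterable of (url, text, score) tuples, where score is the
    query-independent score. Returns (index, representatives), where index
    maps each term to a set of URLs and representatives maps each content
    identifier to the chosen URL.
    """
    best = {}  # content id -> (url, text, score) of the best document so far
    for url, text, score in docs:
        cid = content_id(text)
        current = best.get(cid)
        # Select by query-independent score; the earlier document wins ties.
        if current is None or score > current[2]:
            best[cid] = (url, text, score)

    index = defaultdict(set)
    for url, text, _score in best.values():
        for term in text.lower().split():
            index[term].add(url)  # duplicates never enter the index

    representatives = {cid: entry[0] for cid, entry in best.items()}
    return dict(index), representatives


if __name__ == "__main__":
    sample = [
        ("u1", "red fox", 0.1),
        ("u2", "red fox", 0.7),
        ("u3", "blue whale", 0.3),
    ]
    index, reps = build_index(sample)
    print(sorted(index.items()))
```

In this sample, `u1` and `u2` share a content identifier, so only the higher-scoring `u2` is indexed under the terms `red` and `fox`.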
6. A system for detecting duplicated documents, comprising:
one or more processors;
memory storing one or more programs executable by the one or more processors, the one or more programs comprising instructions to:
receive documents from one or more databases of documents, wherein each received document is associated with a respective query independent score;
obtain a document content identifier for each received document, each document content identifier comprising an identifier of a respective document's content;
index at least a subset of the received documents to produce a document index that maps terms to documents in the one or more databases of documents; and
while performing the indexing,
  identify respective sets of received documents having the same content identifier,
  select a single document in each respective set of documents, in accordance with the query independent scores associated with the documents in the respective set of documents, as a representative document for the respective set of received documents,
  index the representative document, and
  with respect to each respective set of received documents, include only the representative document in the document index.
Dependent claims: 7, 8, 9, 10
11. A non-transitory computer readable storage medium storing one or more programs that when executed by one or more processors of a computer system cause the computer system to:
receive documents from one or more databases of documents, wherein each received document is associated with a respective query independent score;
obtain a document content identifier for each received document, each document content identifier comprising an identifier of a respective document's content;
index at least a subset of the received documents to produce a document index that maps terms to documents in the one or more databases of documents; and
while performing the indexing,
  identify respective sets of received documents having the same content identifier;
  select a single document in each respective set of documents, in accordance with the query independent scores associated with the documents in the respective set of documents, as a representative document for the respective set of received documents;
  index the representative document; and
  with respect to each respective set of received documents, include only the representative document in the document index.
Dependent claims: 12, 13, 14, 15
Specification