Representative document selection for sets of duplicate documents in a web crawler system
First Claim
1. A method of detecting duplicate documents, comprising:
- at a server having one or more processors and memory:
receiving a first document, the received document characterized by a document content identifier;
selecting, from a plurality of previously received documents, a set of documents sharing the same document content identifier, the first document and the selected set having associated score information;
wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;
updating the selected set of documents with the first document, in accordance with the score information associated with the first document and the selected set of documents, to produce an updated set of documents;
determining a representative document for the updated set of documents in accordance with the score information;
indexing the received document when the received document is the representative document for the updated set of documents; and
repeating the receiving, selecting, updating, determining and indexing operations with respect to a plurality of received documents, each of which shares a respective document content identifier with a respective set of documents, such that at least some of the received documents are determined to be representative documents and are indexed.
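The steps recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation; it only assumes one plausible reading of the claim. The names `Document`, `content_id`, and `DuplicateIndex` are hypothetical, the content identifier is assumed to be a SHA-256 digest of the raw bytes (a hash only approximates the claim's requirement that different content yields different identifiers), and the representative is assumed to be the highest-scoring document in the set.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Document:
    url: str
    content: bytes
    score: float  # assumed query-independent score; higher is better


def content_id(content: bytes) -> str:
    """Hypothetical content identifier: SHA-256 digest of the raw bytes."""
    return hashlib.sha256(content).hexdigest()


@dataclass
class DuplicateIndex:
    # content identifier -> documents received so far with that identifier
    sets: dict = field(default_factory=dict)
    # content identifier -> URL of the most recently indexed representative
    indexed: dict = field(default_factory=dict)

    def receive(self, doc: Document) -> bool:
        """Process one received document; return True if it was indexed."""
        cid = content_id(doc.content)
        # Select the set of previously received documents sharing the identifier.
        dup_set = self.sets.setdefault(cid, [])
        # Update the selected set with the newly received document.
        dup_set.append(doc)
        # Determine the representative (assumption: highest score;
        # on a tie, the earlier document keeps the role).
        representative = max(dup_set, key=lambda d: d.score)
        # Index the received document only when it is the representative.
        if representative is doc:
            self.indexed[cid] = doc.url
            return True
        return False
```

Calling `receive` once per crawled document corresponds to the repeating step: a first copy of some content is indexed, a later copy with a higher score replaces it as the representative, and a later copy with a lower score is recorded in the set but not indexed.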
Abstract
Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included in or excluded from the new set of documents based on a query-independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
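The merge described in the abstract might look like the following sketch. The abstract does not state how documents are included or excluded, nor what the predefined conditions are, so everything here is an assumption made for illustration: the set is capped at a hypothetical `max_size`, entries are ranked by a single numeric `score` standing in for the query-independent metric, and the representative is simply the top-ranked entry. `Entry`, `merge`, and `representative` are hypothetical names.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    url: str
    score: float  # assumed query-independent metric; higher is better


def merge(existing: list[Entry], new: Entry, max_size: int = 3) -> list[Entry]:
    """Merge a newly crawled entry into a duplicate set, keeping the best max_size."""
    # Drop any stale entry for the same URL, then add the new one.
    merged = [e for e in existing if e.url != new.url] + [new]
    # Sort best-first; ties are broken by URL so the result is deterministic.
    merged.sort(key=lambda e: (-e.score, e.url))
    # Entries ranked beyond the cap are excluded from the new set.
    return merged[:max_size]


def representative(dup_set: list[Entry]) -> Entry:
    """Pick the single representative: the first entry of the best-first list."""
    return dup_set[0]
```

Capping the set is one way to keep per-content bookkeeping bounded when many URLs serve the same page; a real system could just as well apply a score threshold or other conditions.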
15 Claims
1. A method of detecting duplicate documents, comprising:
- at a server having one or more processors and memory:
receiving a first document, the received document characterized by a document content identifier;
selecting, from a plurality of previously received documents, a set of documents sharing the same document content identifier, the first document and the selected set having associated score information;
wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;
updating the selected set of documents with the first document, in accordance with the score information associated with the first document and the selected set of documents, to produce an updated set of documents;
determining a representative document for the updated set of documents in accordance with the score information;
indexing the received document when the received document is the representative document for the updated set of documents; and
repeating the receiving, selecting, updating, determining and indexing operations with respect to a plurality of received documents, each of which shares a respective document content identifier with a respective set of documents, such that at least some of the received documents are determined to be representative documents and are indexed.
Dependent claims: 2, 3, 4, 5
6. A system for detecting and managing duplicate documents, comprising:
- one or more processing units for executing programs;
a network interface for receiving documents; and
memory storing one or more programs to be executed by the one or more processing units, the one or more programs comprising instructions that when executed by the one or more processing units cause the system to:
receive a first document, the received document characterized by a document content identifier;
select, from a plurality of previously received documents, a set of documents sharing the same document content identifier, the first document and the selected set having associated score information;
wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;
update the selected set of documents with the first document, in accordance with the score information associated with the first document and the selected set of documents, to produce an updated set of documents;
determine a representative document for the updated set of documents in accordance with the score information;
index the received document when the received document is the representative document for the updated set of documents; and
repeat the receiving, selecting, updating, determining and indexing operations with respect to a plurality of received documents, each of which shares a respective document content identifier with a respective set of documents, such that at least some of the received documents are determined to be representative documents and are indexed.
Dependent claims: 7, 8, 9, 10
11. A computer readable storage medium storing one or more programs to be executed by one or more processing units of a computer system, the one or more programs comprising instructions that when executed by the one or more processing units cause the computer system to:
- receive a first document, the received document characterized by a document content identifier;
select, from a plurality of previously received documents, a set of documents sharing the same document content identifier, the first document and the selected set having associated score information;
wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;
update the selected set of documents with the first document, in accordance with the score information associated with the first document and the selected set of documents, to produce an updated set of documents;
determine a representative document for the updated set of documents in accordance with the score information;
index the received document when the received document is the representative document for the updated set of documents; and
repeat the receiving, selecting, updating, determining and indexing operations with respect to a plurality of received documents, each of which shares a respective document content identifier with a respective set of documents, such that at least some of the received documents are determined to be representative documents and are indexed.
Dependent claims: 12, 13, 14, 15
Specification