Representative Document Selection for Sets of Duplicate Documents in a Web Crawler System
First Claim
1. A method of detecting duplicate documents, comprising:
at a server having one or more processors and memory;
receiving a first document, the received document characterized by a document content identifier;
selecting, from a plurality of previously received documents, a set of documents sharing the same document content identifier, the first document and the selected set having associated score information;
wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;
updating the selected set of documents with the first document, in accordance with the score information associated with the first document and the selected set of documents, to produce an updated set of documents; and
identifying a representative document for the updated set of documents in accordance with the score information.
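The steps of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the SHA-256 digest used as the document content identifier, the `DuplicateIndex` class, the example URLs, and the rule that the highest-scoring member becomes the representative are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def content_id(content: str) -> str:
    """Hypothetical content identifier: a SHA-256 digest of the document body."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class DuplicateIndex:
    """Groups received documents by content identifier (illustrative only)."""

    def __init__(self):
        # content identifier -> {url: score}
        self._sets = defaultdict(dict)

    def receive(self, url: str, content: str, score: float) -> str:
        """Add a document to its duplicate set and return the representative URL."""
        cid = content_id(content)
        members = self._sets[cid]  # select the set sharing this identifier
        members[url] = score       # update the set with the received document
        # Assumed rule: the representative is the member with the highest score.
        return max(members, key=members.get)


index = DuplicateIndex()
index.receive("http://a.example/page", "same body", 0.4)
rep = index.receive("http://b.example/page", "same body", 0.9)
other = index.receive("http://c.example/page", "different body", 0.1)
```

Here `rep` is the URL of the higher-scoring of the two documents with identical content, while the third document has a different identifier and so forms its own set.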
Abstract
Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
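The merge step described in the abstract might look like the sketch below. The abstract does not state how documents are included or excluded, nor what the predefined conditions are, so the size cap (`max_size`), the top-score rule for the representative, and the function name `merge_duplicate_set` are hypothetical choices for illustration only.

```python
import heapq
from typing import List, Tuple

Entry = Tuple[str, float]  # (URL, query-independent score)


def merge_duplicate_set(existing: List[Entry], new_doc: Entry,
                        max_size: int = 3) -> Tuple[List[Entry], str]:
    """Merge new_doc into existing and return (new set, representative URL)."""
    # Drop any stale entry for the same URL, then add the newly crawled document.
    merged = [e for e in existing if e[0] != new_doc[0]] + [new_doc]
    # Assumed inclusion rule: keep only the max_size highest-scoring entries.
    kept = heapq.nlargest(max_size, merged, key=lambda e: e[1])
    # Assumed "predefined condition": the top-scoring entry is the representative.
    return kept, kept[0][0]


existing = [("http://a.example/", 0.7), ("http://b.example/", 0.5),
            ("http://c.example/", 0.2)]
new_set, representative = merge_duplicate_set(existing, ("http://d.example/", 0.6))
```

In this run the newly crawled document displaces the lowest-scoring member of the set, and the representative is unchanged because an existing member still has the highest score.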
52 Citations
6 Claims
Specification