Representative Document Selection for Sets of Duplicate Documents in a Web Crawler System

US 20100076954A1
Filed: 12/01/2009
Published: 03/25/2010
Est. Priority Date: 07/03/2003
Status: Active Grant

Abstract

Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
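
The flow described in the abstract can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the `Doc` and `DuplicateSets` names, the fixed cap on set size, and the "highest score wins" rule for the representative are all assumptions made for the example. The abstract only says that inclusion depends on a query independent metric and that the representative is chosen under a set of predefined conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    url: str
    content_id: str   # identifier shared by documents with the same content
    score: float      # query independent metric (hypothetical scale)


class DuplicateSets:
    """Tracks one bounded set of duplicates per content identifier."""

    def __init__(self, max_size: int = 3):
        # max_size is an assumed cap; the source does not state one.
        self.max_size = max_size
        self.sets: dict[str, list[Doc]] = {}

    def receive(self, doc: Doc) -> Doc:
        """Merge a newly crawled document into its set; return the representative."""
        existing = self.sets.get(doc.content_id, [])
        # Merge by URL so a recrawl of the same URL replaces the old entry.
        merged = {d.url: d for d in existing}
        merged[doc.url] = doc
        # Keep the best-scoring documents; lower-scoring duplicates are excluded.
        ranked = sorted(merged.values(), key=lambda d: (-d.score, d.url))
        updated = ranked[: self.max_size]
        self.sets[doc.content_id] = updated
        # Assumed selection rule: the top-ranked member represents the set.
        return updated[0]


if __name__ == "__main__":
    tracker = DuplicateSets(max_size=2)
    tracker.receive(Doc("http://a.example/page", "fp-1", 0.2))
    rep = tracker.receive(Doc("http://b.example/page", "fp-1", 0.9))
    print(rep.url)
```

Sorting on `(-score, url)` makes ties deterministic, which keeps the representative stable when two duplicates carry the same score.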

6 Claims

1. A method of detecting duplicate documents, comprising:
- at a server having one or more processors and memory;
  
  receiving a first document, the received document characterized by a document content identifier;
  
  selecting, from a plurality of previously received documents, a set of documents sharing the same document content identifier, the first document and the selected set having associated score information;
  
  wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;
  
  updating the selected set of documents with the first document, in accordance with the score information associated with the first document and the selected set of documents, to produce an updated set of documents; and
  
  identifying a representative document for the updated set of documents in accordance with the score information.
  - 2. The method of claim 1, wherein the document content identifier is a fixed length fingerprint of document content of a document characterized by the document content identifier.
  - 3. The method of claim 1, wherein the score information includes a document ranking value indicative of document importance.
  - 4. The method of claim 1, wherein the updating includes adding the first document to the selected set of documents only if the score information of the first document satisfies a predefined insertion condition.
  - 5. The method of claim 1, further including performing a data processing operation, with respect to the updated set of documents, on only the representative document.
  - 6. The method of claim 1, further including performing an indexing operation, with respect to the updated set of documents, on only the representative document so as to add to an index only the representative document and no other document of the updated set of documents.
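
The dependent claims add three details that lend themselves to a short sketch: a fixed length fingerprint (claim 2), a predefined insertion condition (claim 4), and indexing only the representative (claim 6). The Python below is one possible reading under stated assumptions. SHA-256 stands in for the fingerprint function, which the claims do not name; the insertion rule (admit when there is room, or when the newcomer outscores the weakest member) and the helper names `fingerprint`, `may_insert`, and `index_representative` are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Fixed length content identifier (assumed: SHA-256, 64 hex characters)."""
    return hashlib.sha256(content).hexdigest()


def may_insert(new_score: float, member_scores: list[float], capacity: int = 3) -> bool:
    """Assumed insertion condition for a bounded duplicate set."""
    if len(member_scores) < capacity:
        return True  # room left, so any duplicate is admitted
    # Set is full: admit only a document that beats the weakest member.
    return new_score > min(member_scores)


def index_representative(index: dict[str, str], content_id: str,
                         members: dict[str, float]) -> str:
    """Add only the best-scoring URL of a duplicate set to the index."""
    # members maps URL -> score; ties break on the lexicographically smaller URL.
    representative = min(members, key=lambda url: (-members[url], url))
    index[content_id] = representative  # no other member is indexed
    return representative


if __name__ == "__main__":
    cid = fingerprint(b"<html>same page body</html>")
    members = {"http://a.example/": 0.3, "http://b.example/": 0.8}
    index: dict[str, str] = {}
    print(index_representative(index, cid, members))
```

A cryptographic hash only approximates the claim's statement that different identifiers imply different content and equal identifiers imply equal content: equal digests for different inputs are possible in principle, although vanishingly unlikely in practice.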

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Alexandre A. Verstak, Daniel Dulitz, Jeffrey A. Dean, Sanjay Ghemawat
Inventors
Ghemawat, Sanjay, Dean, Jeffrey A., Verstak, Alexandre A., Dulitz, Daniel

Granted Patent

US 7,984,054 B2
US Class Current

707/709
CPC Class Codes

G06F 16/24578   using ranking

G06F 16/355   Class or cluster creation o...

G06F 16/951   Indexing; Web crawling tech...

G06F 16/9535   Search customisation based ...

G06F 16/9538   Presentation of query results

Y10S 707/99931   Database or file accessing

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99954   Version management
