Detecting duplicate and near-duplicate files

US 7,366,718 B1
Filed: 06/27/2003
Issued: 04/29/2008
Est. Priority Date: 01/24/2001
Status: Expired due to Term

Abstract

Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match.
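
The three steps in the abstract can be illustrated with a minimal sketch. The abstract does not say how parts are extracted, how they are assigned to lists, or which hash function produces a fingerprint, so everything below is an assumption made for illustration: parts are lowercase words, each word goes to the list chosen by its hash modulo the number of lists, and each populated list is fingerprinted with SHA-1. The names `extract_parts`, `fingerprints`, and `near_duplicates` are hypothetical.

```python
import hashlib


def extract_parts(text):
    """Step (i): extract parts. Assumption: parts are lowercase words."""
    return text.lower().split()


def _hash_int(s):
    """Stable integer hash of a string (assumption: first 8 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")


def fingerprints(text, num_lists=4):
    """Return one fingerprint per populated list for the given document."""
    # Step (ii): assign each part to one of a predetermined number of lists.
    lists = [[] for _ in range(num_lists)]
    for part in extract_parts(text):
        lists[_hash_int(part) % num_lists].append(part)

    # Step (iii): generate a fingerprint from each populated list.
    result = set()
    for index, parts in enumerate(lists):
        if parts:
            joined = "\x00".join(parts).encode("utf-8")
            result.add((index, hashlib.sha1(joined).hexdigest()))
    return result


def near_duplicates(doc_a, doc_b, num_lists=4):
    """Two documents are near-duplicates if any one fingerprint matches."""
    return bool(fingerprints(doc_a, num_lists) & fingerprints(doc_b, num_lists))
```

Under these assumptions, replacing a single word changes at most two lists (the one that lost the old word and the one that gained the new word), so a document with parts spread over three or more lists still shares a fingerprint with its edited copy. Documents with no words in common share none.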

2 Claims

1. A method for filtering a plurality of candidate search results to remove near-duplicates, the method comprising:
- a) for one of the plurality of candidate search results, determining whether the one candidate search result is a near-duplicate of another of the plurality of candidate search results by
  1) comparing a cluster identifier of the one candidate search result with a cluster identifier of the other candidate search result, and
  2) if the cluster identifiers of the one and the other candidate search results match, then concluding that the one candidate search is a near-duplicate of the other candidate search result; and
  
  b) in response to a determination that the one candidate search result is a near-duplicate of the other candidate search result, rejecting the one candidate search result thereby defining a filtered set of search results including only those of the plurality of candidate search results that have not been rejected.

2. A search filter for processing a plurality of search results to remove near-duplicates, the search filter comprising:
- a) a near-duplicate determination facility for determining, for one of the plurality of candidate search results, whether the one candidate search result is a near-duplicate of another of the plurality of candidate search results, and wherein the near-duplicate determination facility includes a comparison facility for comparing a cluster identifier of the one candidate search result with a cluster identifier of the other candidate search result, and wherein if the cluster identifiers of the one candidate search result and the other candidate search result match, then it is concluded that the one candidate search result and the other candidate search result are near-duplicates; and
  
  b) a filter for rejecting the one candidate search result if it is determined that the one candidate search result is a near-duplicate of the other candidate search result and passing the one candidate search result if it is not determined that the one candidate search result is a near-duplicate of the other candidate search result, thereby defining a filtered set of search results including only those of the plurality of candidate results that have not been rejected by the filter.
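
The filtering described in the claims can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: each candidate result is assumed to already carry a cluster identifier, the function name `filter_near_duplicates` and the `cluster_id_of` accessor are hypothetical, and the first result seen in each cluster is the one that is kept. Tracking seen identifiers in a set is equivalent to comparing each candidate's identifier against those of the results already accepted.

```python
def filter_near_duplicates(candidates, cluster_id_of):
    """Return the candidates with near-duplicates rejected.

    candidates    -- iterable of candidate search results, in ranked order
    cluster_id_of -- hypothetical accessor returning a result's cluster identifier
    """
    seen_cluster_ids = set()
    filtered = []
    for result in candidates:
        cluster_id = cluster_id_of(result)
        # Matching cluster identifiers: conclude near-duplicate and reject.
        if cluster_id in seen_cluster_ids:
            continue
        # No match: pass the result through to the filtered set.
        seen_cluster_ids.add(cluster_id)
        filtered.append(result)
    return filtered


if __name__ == "__main__":
    results = [
        {"url": "a", "cluster": 1},
        {"url": "b", "cluster": 2},
        {"url": "c", "cluster": 1},  # same cluster as "a", so it is rejected
    ]
    kept = filter_near_duplicates(results, lambda r: r["cluster"])
    print([r["url"] for r in kept])  # prints ['a', 'b']
```

Keeping the first result per cluster preserves the original ranking order; a different tie-breaking rule would be an equally valid reading of the claim language.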

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Pugh, William; Henzinger, Monika H.
Primary Examiner(s): Wong, Don
Assistant Examiner(s): Nguyen, Merilyn P
Application Number: US10/608,468
Time in Patent Office: 1,768 Days
Field of Search: 707/1, 707/2, 707/3-5, 707/6, 707/7, 707/9, 707/10, 707/100, 707/101, 707/102, 707/104.1, 715/501.1, 715/512, 715/500, 715/511, 704/9, 382/225, 382/229
US Class Current: 1/1
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
G06F 16/951   Indexing; Web crawling tech...
G06F 16/9532   Query formulation
G06F 16/9538   Presentation of query results
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
Y10S 707/99943   Generating database or data...
