DETECTING DUPLICATE AND NEAR-DUPLICATE FILES

US 20120078871A1
Filed: 12/07/2011
Published: 03/29/2012
Est. Priority Date: 01/24/2001
Status: Abandoned Application

Abstract

Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match.
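
The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the implementation from the specification: the choice of lowercase word tokens as "parts", the use of SHA-1 to pick a list and to fingerprint it, the value of `NUM_LISTS`, and the function names are all assumptions made for illustration.

```python
import hashlib
import re

NUM_LISTS = 4  # assumed; the abstract only says "a predetermined number of lists"


def _stable_hash(text):
    """64-bit hash that is stable across runs (unlike the built-in hash())."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def fingerprints(document, num_lists=NUM_LISTS):
    """Return a set of (list_index, fingerprint) pairs for one document."""
    # (i) Extract parts. Assumption: parts are lowercase word tokens.
    parts = re.findall(r"\w+", document.lower())

    # (ii) Assign each part to one of the lists. Assumption: the list is
    # chosen by hashing the part, so equal parts always share a list.
    lists = [[] for _ in range(num_lists)]
    for part in parts:
        lists[_stable_hash(part) % num_lists].append(part)

    # (iii) Generate one fingerprint per populated list.
    return {
        (index, _stable_hash(" ".join(members)))
        for index, members in enumerate(lists)
        if members
    }


def near_duplicates(doc_a, doc_b):
    """Two documents are near-duplicates if any one fingerprint matches."""
    return not fingerprints(doc_a).isdisjoint(fingerprints(doc_b))
```

Under these assumptions, changing a single word disturbs at most two lists (the one that held the old word and the one that receives the new word), so any other populated list still yields a matching fingerprint.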

21 Claims

1. A method for filtering a plurality of candidate search results to remove near-duplicates, the method comprising:
- a) for one or more of the plurality of candidate search results, determining that one candidate search result of the one or more candidate search results is a near-duplicate of another of the plurality of candidate search results by
  
  1) determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result; and
  
  2) in response to determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result, concluding that the one candidate search is a near-duplicate of the other candidate search result; and
  
  b) in response to the determination that the one candidate search result is a near-duplicate of the other candidate search result, rejecting the one candidate search result thereby defining a filtered set of search results including only those of the plurality of candidate search results that have not been rejected.
  - 2. The method of claim 1, wherein determining that the one candidate search result is a near-duplicate of another of the plurality of candidate search results is preceded by:
    - determining that the one candidate search result is a near-duplicate of a previously processed search result; and
      
      associating the one candidate search result with a cluster identifier of the previously processed search result.
  - 3. The method of claim 2, wherein associating the one candidate search result with the cluster identifier of the previously processed search result comprises:
    - determining that the one candidate search result is associated with a different cluster identifier, wherein the different cluster identifier identifies a cluster of related search results; and
      
      associating each of the cluster of related search results with the cluster identifier of the previously processed search result.
  - 4. The method of claim 1, comprising:
    - receiving the plurality of candidate search results from a search engine.
  - 5. The method of claim 1, wherein rejecting the one candidate search result comprises:
    - determining, for both of the one candidate search result and the other candidate search result, a quality measure associated with the search result;
      
      determining that the one candidate search result has a lower quality measure than the other candidate search result; and
      
      adding the other candidate search result to the filtered set of search results.
  - 6. The method of claim 1, wherein rejecting the one candidate search result comprises:
    - determining that the document referenced by the other candidate search result is more recent than the document associated with the one candidate search result; and
      
      adding the other candidate search result to the filtered set of search results.
  - 7. The method of claim 1, comprising:
    - providing the filtered set of search results for display.
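
Steps a) and b) of claim 1 can be sketched as follows. This is a minimal illustration, assuming each candidate already carries a precomputed cluster identifier; the `Result` type and the function name are hypothetical, and keeping the first result seen in each cluster is one possible policy rather than anything the claim requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Hypothetical candidate search result with a precomputed cluster id."""
    url: str
    cluster_id: int


def filter_near_duplicates(candidates):
    """Keep the first result of each cluster and reject the rest."""
    seen_clusters = set()
    filtered = []
    for result in candidates:
        # a) A matching cluster identifier means this result is a
        #    near-duplicate of a result that was already accepted.
        if result.cluster_id in seen_clusters:
            continue  # b) reject the near-duplicate
        seen_clusters.add(result.cluster_id)
        filtered.append(result)
    return filtered
```

Because the comparison is a set lookup on the cluster identifier, the filter runs in time linear in the number of candidates and never compares document contents at query time.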

8. An apparatus, comprising:
- at least one processor; and
  
  at least one storage device storing a processor executable program which, when executed by the at least one processor, causes the at least one processor to perform operations comprising:
  
  a) for one or more of the plurality of candidate search results, determining that one candidate search result of the one or more candidate search results is a near-duplicate of another of the plurality of candidate search results by
  
  1) determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result; and
  
  2) in response to determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result, concluding that the one candidate search is a near-duplicate of the other candidate search result; and
  
  b) in response to the determination that the one candidate search result is a near-duplicate of the other candidate search result, rejecting the one candidate search result thereby defining a filtered set of search results including only those of the plurality of candidate search results that have not been rejected.
  - 9. The apparatus of claim 8, wherein performing the operation of determining that the one candidate search result is a near-duplicate of another of the plurality of candidate search results is preceded by performing operations comprising:
    - determining that the one candidate search result is a near-duplicate of a previously processed search result; and
      
      associating the one candidate search result with a cluster identifier of the previously processed search result.
  - 10. The apparatus of claim 9, wherein performing the operation of associating the one candidate search result with the cluster identifier of the previously processed search result comprises:
    - determining that the one candidate search result is associated with a different cluster identifier, wherein the different cluster identifier identifies a cluster of related search results; and
      
      associating each of the cluster of related search results with the cluster identifier of the previously processed search result.
  - 11. The apparatus of claim 8, wherein the operations further comprise:
    - receiving the plurality of candidate search results from a search engine.
  - 12. The apparatus of claim 8, wherein performing the operations rejecting the one candidate search result comprises:
    - determining, for both of the one candidate search result and the other candidate search result, a quality measure associated with the search result;
      
      determining that the one candidate search result has a lower quality measure than the other candidate search result; and
      
      adding the other candidate search result to the filtered set of search results.
  - 13. The apparatus of claim 8, wherein performing the operations rejecting the one candidate search result comprises:
    - determining that the document referenced by the other candidate search result is more recent than the document associated with the one candidate search result; and
      
      adding the other candidate search result to the filtered set of search results.
  - 14. The apparatus of claim 8, wherein the operations further comprise:
    - providing the filtered set of search results for display.
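
The cluster bookkeeping described in claims 9 and 10 (and their counterparts, claims 2 and 3) can be sketched as below. The two dictionaries and the function name are assumptions for illustration; the claims do not prescribe a data structure.

```python
def assign_cluster(cluster_of, members_of, result, previous):
    """Associate `result` with the cluster of `previous`.

    cluster_of: dict mapping a result key to its cluster identifier.
    members_of: dict mapping a cluster identifier to a list of result keys.
    If `result` already belongs to a different cluster, every member of
    that cluster is moved as well, so the two clusters are merged.
    """
    target = cluster_of[previous]
    old = cluster_of.get(result)
    if old == target:
        return  # already in the same cluster

    # Either a new, unclustered result, or the whole of its old cluster.
    moved = [result] if old is None else members_of.pop(old)
    for key in moved:
        cluster_of[key] = target
    members_of.setdefault(target, []).extend(moved)
```

Relabelling every member of the old cluster keeps each lookup a single dictionary access; a union-find structure would be an alternative if merges were frequent.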

15. A computer storage device encoded with a computer program, the computer program comprising instructions that, when executed by data processing apparatus, cause the data processing apparatus to perform operations comprising:
- a) for one or more of the plurality of candidate search results, determining that one candidate search result of the one or more candidate search results is a near-duplicate of another of the plurality of candidate search results by
  
  1) determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result; and
  
  2) in response to determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result, concluding that the one candidate search is a near-duplicate of the other candidate search result; and
  
  b) in response to the determination that the one candidate search result is a near-duplicate of the other candidate search result, rejecting the one candidate search result thereby defining a filtered set of search results including only those of the plurality of candidate search results that have not been rejected.
  - 16. The computer storage device of claim 15, wherein performing the operation of determining that the one candidate search result is a near-duplicate of another of the plurality of candidate search results is preceded by performing operations comprising:
    - determining that the one candidate search result is a near-duplicate of a previously processed search result; and
      
      associating the one candidate search result with a cluster identifier of the previously processed search result.
  - 17. The computer storage device of claim 16, wherein performing the operation of associating the one candidate search result with the cluster identifier of the previously processed search result comprises:
    - determining that the one candidate search result is associated with a different cluster identifier, wherein the different cluster identifier identifies a cluster of related search results; and
      
      associating each of the cluster of related search results with the cluster identifier of the previously processed search result.
  - 18. The computer storage device of claim 15, wherein the operations further comprise:
    - receiving the plurality of candidate search results from a search engine.
  - 19. The computer storage device of claim 15, wherein performing the operations rejecting the one candidate search result comprises:
    - determining, for both of the one candidate search result and the other candidate search result, a quality measure associated with the search result;
      
      determining that the one candidate search result has a lower quality measure than the other candidate search result; and
      
      adding the other candidate search result to the filtered set of search results.
  - 20. The computer storage device of claim 15, wherein performing the operations rejecting the one candidate search result comprises:
    - determining that the document referenced by the other candidate search result is more recent than the document associated with the one candidate search result; and
      
      adding the other candidate search result to the filtered set of search results.
  - 21. The computer storage device of claim 15, wherein the operations further comprise:
    - providing the filtered set of search results for display.
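
The two rejection policies in claims 19 and 20 (likewise claims 5, 6, 12, and 13) differ only in how the surviving result of a cluster is chosen. The sketch below makes that choice a parameter. The `Result` fields `quality` and `modified`, and the function name, are hypothetical; the claims do not say how a quality measure or a document date is obtained.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    """Hypothetical candidate search result."""
    url: str
    cluster_id: int
    quality: float   # assumed: higher means better
    modified: date   # assumed: date of the referenced document


def filter_keep_best(candidates, key):
    """Keep one result per cluster: the one with the largest key(result)."""
    best = {}
    for result in candidates:
        current = best.get(result.cluster_id)
        # Reject whichever of the two near-duplicates scores lower.
        if current is None or key(result) > key(current):
            best[result.cluster_id] = result
    # Dicts preserve insertion order, so clusters appear in first-seen order.
    return list(best.values())
```

Passing `key=lambda r: r.quality` gives the quality-based policy, and `key=lambda r: r.modified` gives the recency-based one.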

Current Assignee: Google Inc. (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Pugh, William; Henzinger, Monika H.
Application Number: US13/313,913
Publication Number: US 20120078871A1
US Class Current: 707/706
CPC Class Codes:
- G06F 16/355   Class or cluster creation o...
- G06F 16/951   Indexing; Web crawling tech...
- G06F 16/9532   Query formulation
- G06F 16/9538   Presentation of query results
- Y10S 707/99933   Query processing, i.e. sear...
- Y10S 707/99935   Query augmenting and refini...
- Y10S 707/99936   Pattern matching access
- Y10S 707/99943   Generating database or data...
