Document similarity detection

US 7,734,627 B1
Filed: 06/17/2003
Issued: 06/08/2010
Est. Priority Date: 06/17/2003
Status: Active Grant

Abstract

A similarity detector detects similar or near-duplicate occurrences of a document. The similarity detector determines similarity of documents by characterizing the documents as clusters, each made up of a set of term entries, such as pairs of terms. A pair of terms, for example, indicates that the first term of the pair occurs before the second term of the pair in the underlying document. Another document that has a threshold level of term entries in common with a cluster is considered similar to the document characterized by the cluster.
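
The threshold test described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the whitespace tokenizer, and the 0.8 default threshold are assumptions made for illustration, and the cluster here holds every ordered pair of a document rather than a random sample.

```python
from itertools import combinations

def ordered_pairs(text):
    """Return the set of (first, second) pairs where first occurs before second."""
    terms = text.lower().split()  # assumption: naive whitespace tokenizer
    # combinations() preserves input order, so each pair is (earlier, later).
    return set(combinations(terms, 2))

def is_similar(cluster, other_text, threshold=0.8):
    """True if at least `threshold` of the cluster's pairs also occur in other_text."""
    if not cluster:
        return False
    other = ordered_pairs(other_text)
    shared = sum(1 for pair in cluster if pair in other)
    return shared / len(cluster) >= threshold

cluster = ordered_pairs("the quick brown fox jumps")
print(is_similar(cluster, "the quick brown fox jumps"))  # identical text
print(is_similar(cluster, "jumps fox brown quick the"))  # same terms, reversed order
```

Because the pairs are ordered, a document that contains the same terms in reverse order shares no pairs with the cluster, which is what separates this representation from a plain bag of words.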

20 Claims

1. A method performed by a computer system, the method comprising:
- randomly sampling pairs of ordered terms from a particular document of a set of documents, to generate a cluster of pairs of ordered terms for the particular document, where the pairs of ordered terms include a first term and a second term and where, in at least some of the pairs of ordered terms, the second term occurs after one or more intervening terms occurring after the first term in the particular document, and where the random sampling is biased to have a higher chance of including a first ordered pair in the cluster than a second ordered pair, if the first ordered pair has fewer intervening terms than the second ordered pair, the randomly sampling being performed using one or more processors associated with the computer system;
  
  building, using one or more processors associated with the computer system, a similarity model that includes the cluster of pairs;
  
  comparing, using one or more processors associated with the computer system, a cluster of pairs from a target document to clusters of pairs from the similarity model;
  
  generating, using one or more processors associated with the computer system, similarity metrics that measure similarity between the target document and particular documents in the set of documents, the generating being based on the comparing; and
  
  outputting the generated similarity metrics.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
- - 2. The method of claim 1, where the clusters and the pairs of ordered terms included in the clusters are stored in an inverted index that lists, for each of the pairs of ordered terms, the clusters in which the pair is found.
  - 3. The method of claim 1, where building the similarity model further includes:
    - updating a table that defines a number of pairs of ordered terms associated with each of the clusters.
  - 4. The method of claim 1, where a number of pairs of ordered terms sampled for the particular document is based on a length of the particular document.
  - 5. The method of claim 1, where the sampling includes biasing the sampling to give preference to rare terms.
  - 6. The method of claim 5, where the sampling includes excluding very rare terms from the sampling.
  - 7. The method of claim 1, where the sampling includes biasing the sampling to give preference to terms that occur in pre-selected sections of the documents.
  - 8. The method of claim 1, where the comparing includes:
    - comparing pairs of ordered terms from the target document to the similarity model by enumerating all possible pairs of ordered terms from the target document.
  - 9. The method of claim 1, where the comparing includes:
    - comparing pairs of ordered terms from the target document to the similarity model by enumerating pairs from the target document such that a first term and a second term in one of the enumerated pairs appears within a fixed distance of each other in the target document.
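
The biased sampling recited in claim 1 can be sketched in Python as follows. This is one possible reading, not the method disclosed in the specification: the geometric-style weighting, the `decay` and `pairs_per_term` parameters, and the function name are all hypothetical. The number of draws scales with document length, in the spirit of claim 4.

```python
import random

def sample_cluster(terms, pairs_per_term=2.0, decay=0.5, rng=None):
    """Randomly sample ordered (first, second) pairs, favoring small gaps.

    A gap of g intervening terms is drawn with weight decay**g (an assumed
    weighting), so a pair with fewer intervening terms is more likely to be
    picked than a pair with more.
    """
    rng = rng or random.Random()
    n = len(terms)
    cluster = set()
    if n < 2:
        return cluster
    draws = int(pairs_per_term * n)  # number of draws grows with document length
    for _ in range(draws):
        i = rng.randrange(n - 1)      # position of the first term
        max_gap = n - i - 2           # most intervening terms that still fit
        gaps = range(max_gap + 1)
        weights = [decay ** g for g in gaps]
        gap = rng.choices(gaps, weights=weights, k=1)[0]
        cluster.add((terms[i], terms[i + gap + 1]))
    return cluster

terms = "a similarity detector detects near duplicate documents".split()
print(sorted(sample_cluster(terms, rng=random.Random(0))))
```

Duplicate draws collapse in the set, so the cluster can hold fewer pairs than the number of draws. Biasing toward rare terms or toward pre-selected sections (claims 5 to 7) would change how the first position is chosen; that is left out of this sketch.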

10. A similarity detection device comprising:
- a cluster creation hardware-implemented component that generates clusters of pairs of ordered terms by randomly sampling documents, where a particular pair of ordered terms includes a first term that occurs before a second term in a particular document, and where at least some of the pairs of ordered terms include terms that occur non-consecutively in the particular document, and where the random sampling is biased such that terms closer to one another, as measured based on a number of terms separating two terms, have a greater chance of being included, in a particular cluster, as one of the pairs of ordered terms;
  
  an inverted index hardware-implemented component that relates an order of occurrence of pairs of ordered terms to clusters that contain the pairs of ordered terms;
  
  an enumeration hardware-implemented component that generates pairs of ordered terms for a first document that is to be compared to the inverted index;
  
  a pair lookup hardware-implemented component that looks up the generated pairs of ordered terms in the inverted index to obtain clusters that contain the generated pairs of ordered terms; and
  
  a cluster selection hardware-implemented component that selects clusters obtained by the pair lookup component that are similar to the first document.
- Dependent Claims (11, 12, 13, 14, 15)
- - 11. The similarity detection device of claim 10, where the cluster selection component selects the clusters that are similar to the first document based on a similarity metric that quantifies a presence of the generated pairs in the clusters.
  - 12. The similarity detection device of claim 10, where a number of pairs of terms sampled from each document is based on a length of the document.
  - 13. The similarity detection device of claim 10, where the sampling pairs of terms includes biasing the sampling to give preference to rare terms.
  - 14. The similarity detection device of claim 13, where the sampling pairs of terms excludes very rare terms from the sampling.
  - 15. The similarity detection device of claim 10, where the sampling pairs of terms includes biasing the sampling to give preference to terms that occur in a pre-determined section of the documents.
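
The components of claim 10 (inverted index, enumeration, pair lookup, cluster selection) can be modeled in software with a short Python sketch. The claim recites hardware-implemented components, so this is only an analogy under stated assumptions: the function names, the `window` size, and the score (fraction of a cluster's pairs found in the first document) are hypothetical choices, with the windowed enumeration borrowed from claim 9 and the per-cluster pair counts from claim 3.

```python
from collections import Counter, defaultdict

def build_index(clusters):
    """Map each ordered pair to the ids of the clusters that contain it.

    Also returns the number of pairs per cluster, used to normalize scores.
    """
    index = defaultdict(set)
    sizes = {}
    for cluster_id, pairs in clusters.items():
        sizes[cluster_id] = len(pairs)
        for pair in pairs:
            index[pair].add(cluster_id)
    return index, sizes

def enumerate_pairs(terms, window=5):
    """Ordered pairs whose terms lie within `window` positions of each other."""
    pairs = set()
    for i, first in enumerate(terms):
        for second in terms[i + 1 : i + 1 + window]:
            pairs.add((first, second))
    return pairs

def similar_clusters(terms, index, sizes, threshold=0.5, window=5):
    """Look up the document's pairs and keep clusters scoring at or above threshold."""
    hits = Counter()
    for pair in enumerate_pairs(terms, window):
        for cluster_id in index.get(pair, ()):
            hits[cluster_id] += 1
    scores = {cid: hits[cid] / sizes[cid] for cid in hits}
    return {cid: score for cid, score in scores.items() if score >= threshold}

clusters = {
    "doc1": {("quick", "brown"), ("brown", "fox"), ("the", "fox")},
    "doc2": {("lazy", "dog"), ("the", "dog")},
}
index, sizes = build_index(clusters)
print(similar_clusters("the quick brown fox sleeps".split(), index, sizes))
```

The inverted index means only clusters that share at least one pair with the first document are ever scored, so lookup cost depends on the number of enumerated pairs rather than on the number of stored clusters.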

16. A device for determining similarity of a target document to a first set of documents, the device comprising:
- a memory to store instructions; and
  
  a processor to execute the instructions to implement:
  
  means for randomly sampling pairs of ordered terms from a particular document of a set of documents, to generate a cluster of pairs of ordered terms for the particular document, where the pairs of ordered terms include a first term and a second term and where, in at least some of the pairs of ordered terms, the second term occurs after one or more intervening terms occurring after the first term in the particular document, and where the random sampling is biased to have a higher chance of including a first ordered pair in the cluster than a second ordered pair, if the first ordered pair has fewer intervening terms than the second ordered pair;
  
  means for building a similarity model that includes the cluster of pairs;
  
  means for comparing a cluster of pairs from a target document to clusters of pairs from the similarity model;
  
  means for generating similarity metrics that measure similarity between the target document and particular documents in the set of documents based on the comparing; and
  
  means for outputting the generated similarity metrics.
- Dependent Claims (18)
- - 18. The device of claim 16, where the processor further executes the instructions to implement:
    - means for storing clusters of pairs of ordered terms in an inverted index that associates a particular pair of ordered terms with clusters in which the particular pair is found.

17. One or more memory devices comprising program instructions executable by at least one processor, the one or more memory devices comprising:
- one or more instructions to randomly sample pairs of ordered terms from a particular document of a set of documents, to generate a cluster of pairs of ordered terms for the particular document, where the pairs of ordered terms include a first term and a second term and where, in at least some of the pairs of ordered terms, the second term occurs after one or more intervening terms occurring after the first term in the particular document, and where the random sampling is biased to have a higher chance of including a first ordered pair in the cluster than a second ordered pair, if the first ordered pair has fewer intervening terms than the second ordered pair;
  
  one or more instructions to build a similarity model that includes the cluster of pairs;
  
  one or more instructions to compare pairs of ordered terms from a target document to clusters of pairs of ordered terms from the similarity model;
  
  one or more instructions to generate similarity metrics that measure similarity between the target document and particular documents in the set of documents based on the comparing; and
  
  one or more instructions to output the generated similarity metrics.
- Dependent Claims (19, 20)
- - 19. The one or more memory devices of claim 17, further comprising:
    - one or more instructions to determine whether to add the target document to the set of documents, based on the generated similarity metrics.
  - 20. The one or more memory devices of claim 17, further comprising:
    - one or more instructions to determine whether to categorize the target document as spam, based on the generated similarity metrics.
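
Claims 19 and 20 use the similarity metrics to decide whether to add the target document to the set and whether to categorize it as spam. The claims do not state a decision rule, so the Python sketch below assumes a simple one: a single threshold, and a hypothetical `spam_ids` collection of documents already labeled as spam.

```python
def triage(scores, spam_ids, threshold=0.8):
    """Decide what to do with a target document given its similarity scores.

    scores: mapping of document id -> similarity metric in [0, 1].
    spam_ids: ids of documents already labeled as spam (hypothetical input).
    """
    # Documents in the set that the target is at least `threshold` similar to.
    near = {doc_id for doc_id, score in scores.items() if score >= threshold}
    return {
        "is_spam": bool(near & set(spam_ids)),  # close to a known spam document
        "add_to_set": not near,                 # skip near duplicates already held
    }

print(triage({"a": 0.9, "b": 0.1}, spam_ids={"a"}))
```

A real system would likely use separate thresholds for the two decisions; one threshold is used here only to keep the example short.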

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Tong, Simon
Primary Examiner(s): Jalil, Neveen Abel
Assistant Examiner(s): Radtke, Mark A
Application Number: US10/462,690
Time in Patent Office: 2,548 Days
Field of Search: None
US Class Current: 707/737
CPC Class Codes: G06F 16/319 (Inverted lists)
