Document similarity detection

US 8,650,199 B1
Filed: 06/25/2012
Issued: 02/11/2014
Est. Priority Date: 06/17/2003
Status: Active Grant

Abstract

A similarity detector detects similar or near duplicate occurrences of a document. The similarity detector determines similarity of documents by characterizing the documents as clusters each made up of a set of term entries, such as pairs of terms. A pair of terms, for example, indicates that the first term of the pair occurs before the second term of the pair in the underlying document. Another document that has a threshold level of term entries in common with a cluster is considered similar to the document characterized by the cluster.
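
The idea in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, a hypothetical `max_gap` window that limits how far apart the two terms of a pair may be, and a hypothetical 0.5 overlap threshold; none of these specifics come from the patent text.

```python
def term_pairs(text, max_gap=3):
    """Return the set of ordered (first, second) term pairs found in text.

    A pair (a, b) records that term a occurs before term b.
    max_gap is an illustrative assumption, not a value from the patent.
    """
    terms = text.lower().split()
    pairs = set()
    for i, first in enumerate(terms):
        # Pair each term with the next max_gap terms that follow it.
        for second in terms[i + 1 : i + 1 + max_gap]:
            pairs.add((first, second))
    return pairs


def shares_threshold(cluster, other_pairs, threshold=0.5):
    """True if at least `threshold` of the cluster's pairs occur in other_pairs."""
    if not cluster:
        return False
    return len(cluster & other_pairs) / len(cluster) >= threshold


cluster = term_pairs("the quick brown fox jumps over the lazy dog")
near_duplicate = term_pairs("the quick brown fox leaps over the lazy dog")
unrelated = term_pairs("stock prices fell sharply on monday")
```

Here `shares_threshold(cluster, near_duplicate)` holds because changing one word removes only the pairs that contain it, while `shares_threshold(cluster, unrelated)` does not hold.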

20 Claims

1. A method performed by one or more server devices, the method comprising:
- receiving, using one or more processors associated with the one or more server devices, a document;
  
  selecting, using one or more processors associated with the one or more server devices, terms from the received document to form a plurality of term groups for the received document, each term group, of the plurality of term groups, being associated with an indication that a first term, of the term group, occurs before a second term, of the term group, within the received document;
  
  identifying, using one or more processors associated with the one or more server devices and from an inverted index of term groups, one or more clusters of a plurality of clusters, each cluster, of the one or more identified clusters, comprising a set of term groups for a respective other document, each respective term group, of the set of term groups, being associated with an indication that a first term, of the respective term group, occurs before a second term, of the respective term group, within the respective other document;
  
  determining, using one or more processors associated with the one or more server devices, measures of similarity between the plurality of term groups for the received document and the set of term groups for each of the one or more identified clusters; and
  
  determining, using one or more processors associated with the one or more server devices and based on the determined measures of similarity, that the received document is similar to the respective other document.
  - 2. The method of claim 1, further comprising:
    - storing a respective value, associated with each of the one or more identified clusters, that indicates a quantity of term groups included in a respective one of the one or more identified clusters, the quantity of term groups included in the respective one of the one or more identified clusters being based on a length of the respective other document.
  - 3. The method of claim 1, where identifying the one or more clusters of the plurality of clusters includes:
    - sampling the respective other document using a bias such that terms that occur in a particular section of the respective other document are preferred over terms that do not occur in the particular section of the respective other document.
  - 4. The method of claim 3, where sampling the respective other document includes:
    - ignoring terms that appear in a top portion of the respective other document; and
      
      ignoring terms that appear in a bottom portion of the respective other document.
  - 5. The method of claim 1, further comprising:
    - storing the inverted index, the inverted index including:
      
      information regarding the cluster associated with the respective other document, and a list of the plurality of clusters that include a particular term group of the plurality of term groups.
  - 6. The method of claim 1, further comprising:
    - determining whether one or more of the plurality of clusters include more than one of the plurality of term groups;
      
      selecting, based on determining that one or more of the plurality of clusters include more than one of the plurality of term groups, at least one cluster, of the plurality of clusters, occurring greater than a threshold amount; and
      
      determining a similarity metric of the received document and another document that corresponds to the at least one cluster occurring greater than the threshold amount.
  - 7. The method of claim 1, where the first term of the term group and the second term of the term group are non-consecutive in at least some term groups of the plurality of term groups.
  - 8. The method of claim 1, where, when determining that the received document is similar to the respective other document, the method further includes:
    - creating the plurality of term groups from the received document based on enumerating possible term groups within the received document; and
      
      determining a quantity of matches between the enumerated term groups and the set of term groups of the cluster for the respective other document.
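
Claims 2, 5, 6 and 8 together describe a lookup that can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `ClusterIndex`, the parameters `min_hits` and `min_score`, and the choice of "matches divided by stored cluster size" as the similarity measure are all assumptions made for the example.

```python
from collections import Counter, defaultdict


class ClusterIndex:
    """Hypothetical inverted index from term groups to document clusters."""

    def __init__(self):
        self.postings = defaultdict(set)  # term group -> ids of clusters containing it
        self.sizes = {}                   # cluster id -> quantity of term groups stored

    def add(self, cluster_id, term_groups):
        groups = set(term_groups)
        self.sizes[cluster_id] = len(groups)
        for group in groups:
            self.postings[group].add(cluster_id)

    def similar(self, term_groups, min_hits=2, min_score=0.5):
        """Score clusters that share term groups with a received document."""
        hits = Counter()
        for group in set(term_groups):
            for cluster_id in self.postings.get(group, ()):
                hits[cluster_id] += 1
        # Keep clusters matched more than once and above the score threshold.
        return {
            cluster_id: count / self.sizes[cluster_id]
            for cluster_id, count in hits.items()
            if count >= min_hits and count / self.sizes[cluster_id] >= min_score
        }


index = ClusterIndex()
index.add("doc_a", [("a", "b"), ("b", "c"), ("a", "c")])
index.add("doc_b", [("x", "y"), ("a", "b")])
scores = index.similar([("a", "b"), ("b", "c"), ("q", "r")])
```

In this example only `doc_a` is returned: it matches two of its three stored term groups, while `doc_b` matches a single term group and is filtered out by `min_hits`.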

9. A system comprising:
- one or more devices, including at least one memory and at least one processor, to:
  
  receive a document;
  
  select terms from the received document to form a plurality of term groups for the received document, each term group, of the plurality of term groups, being associated with an indication that a first term, of the term group, occurs before a second term, of the term group, within the received document;
  
  identify, from an inverted index of term groups, one or more clusters of a plurality of clusters, each cluster, of the one or more identified clusters, comprising a set of term groups for a respective other document, each respective term group, of the set of term groups, being associated with an indication that a first term, of the respective term group, occurs before a second term, of the respective term group, within the respective other document;
  
  determine measures of similarity between the plurality of term groups for the received document and the set of term groups for each of the one or more identified clusters; and
  
  determine, based on the determined measures of similarity, that the received document is similar to the respective other document.
  - 10. The system of claim 9, where the one or more devices are further to:
    - store a respective value, associated with each of the one or more identified clusters, that indicates a quantity of term groups included in a respective one of the one or more identified clusters, the quantity of term groups included in the respective one of the one or more identified clusters being based on a length of the respective other document.
  - 11. The system of claim 9, where, when identifying the one or more clusters of the plurality of clusters, the one or more devices are further to:
    - sample the respective other document using a bias such that terms that occur in a particular section of the respective other document are preferred over terms that do not occur in the particular section of the respective other document.
  - 12. The system of claim 11, where, when sampling the respective other document, the one or more devices are further to:
    - ignore terms that appear in a top portion of the respective other document; and
      
      ignore terms that appear in a bottom portion of the respective other document.
  - 13. The system of claim 9, where the one or more devices are further to:
    - store the inverted index, the inverted index including:
      
      information regarding the cluster associated with the respective other document, and a list of the plurality of clusters that include a particular term group of the plurality of term groups.
  - 14. The system of claim 9, where the one or more devices are further to:
    - determine whether one or more of the plurality of clusters include more than one of the plurality of term groups;
      
      select, based on determining that one or more of the plurality of clusters include more than one of the plurality of term groups, at least one cluster, of the plurality of clusters, occurring greater than a threshold amount; and
      
      determine a similarity metric of the received document and another document that corresponds to the at least one cluster occurring greater than the threshold amount.
  - 15. The system of claim 9, where the first term of the term group and the second term of the term group are non-consecutive in at least some term groups of the plurality of term groups.
  - 16. The system of claim 9, where, when determining that the received document is similar to the respective other document, the one or more devices are further to:
    - create the plurality of term groups from the received document based on enumerating possible term groups within the received document; and
      
      determine a quantity of matches between the enumerated term groups and the set of term groups of the cluster for the respective other document.
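
The biased sampling of claims 11 and 12 (and of claims 3 and 4) could look like the sketch below. It shows only the extreme case of the bias, where the top and bottom portions are ignored entirely. The function name `sample_terms`, the 10% `skip_fraction`, the `sample_size` and the fixed `seed` are illustrative assumptions; the claims do not say how large the ignored portions are or how terms are drawn.

```python
import random


def sample_terms(terms, skip_fraction=0.1, sample_size=5, seed=0):
    """Sample terms from the middle of a document, keeping document order.

    Terms in the first and last skip_fraction of the document are ignored,
    on the assumption that they are more likely to be headers or footers.
    """
    n = len(terms)
    skip = int(n * skip_fraction)
    # Fall back to the whole document if trimming would leave nothing.
    body = terms[skip : n - skip] if n - 2 * skip > 0 else terms
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(body)), min(sample_size, len(body))))
    return [body[p] for p in positions]


document = [f"t{i}" for i in range(20)]
sampled = sample_terms(document)
```

Sorting the sampled positions preserves the original order, so term groups built from the sample still record which term occurs first in the document.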

17. A method performed by one or more server devices, the method comprising:
- selecting, using one or more processors associated with the one or more server devices, terms from a received document to form a plurality of term groups for the received document, each term group, of the plurality of term groups, being associated with an indication that a first term, of the term group, occurs before a second term, of the term group, within the received document, the first term and the second term being non-consecutive in at least some term groups of the plurality of term groups;
  
  accessing, using one or more processors associated with the one or more server devices, a stored inverted index of term groups for clusters representing term groups for multiple other documents, each cluster, of the clusters, comprising a set of term groups for a respective other document, each respective term group, of the set of term groups, being associated with an indication that a first term, of the respective term group, occurs before a second term, of the respective term group, within the respective other document, the first term of the respective term group and the second term of the respective term group being non-consecutive in at least some of the set of term groups;
  
  identifying, using one or more processors associated with the one or more server devices and from the inverted index, one or more clusters that have term groups in common with the received document; and
  
  evaluating, using one or more processors associated with the one or more server devices, a similarity between the received document and the respective other document based on one or more matches between the plurality of term groups for the received document and the set of term groups for each of the identified one or more clusters.
  - 18. The method of claim 17, further comprising:
    - storing a respective value, associated with each of the clusters, that indicates a quantity of term groups included in a respective one of the clusters, the quantity of term groups included in the respective one of the clusters being based on a length of the respective other document.
  - 19. The method of claim 17, where, when identifying the one or more clusters, the method includes:
    - sampling the respective other document using a bias such that terms that occur in a particular section of the respective other document are preferred over terms that do not occur in the particular section of the respective other document.
  - 20. The method of claim 17, further comprising:
    - determining whether one or more of the clusters include more than one of the plurality of term groups;
      
      selecting, based on determining that one or more of the plurality of clusters include more than one of the plurality of term groups, at least one frequently occurring cluster, of the plurality of clusters, occurring greater than a threshold amount; and
      
      determining a similarity metric of the received document and another document that corresponds to the at least one frequently occurring cluster occurring greater than the threshold amount.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Tong, Simon
Primary Examiner(s): Chempakaseril, Ann
Application Number: US13/531,670
Time in Patent Office: 596 Days
Field of Search: 707/663, 707/776, 707/749, 707/737
US Class Current: 707/749
CPC Class Codes: G06F 16/319 Inverted lists
