Document similarity detection

US 8,209,339 B1
Filed: 04/21/2010
Issued: 06/26/2012
Est. Priority Date: 06/17/2003
Status: Expired due to Term

1 Assignment

0 Petitions

Abstract

A similarity detector detects similar or near duplicate occurrences of a document. The similarity detector determines similarity of documents by characterizing the documents as clusters each made up of a set of term entries, such as pairs of terms. A pair of terms, for example, indicates that the first term of the pair occurs before the second term of the pair in the underlying document. Another document that has a threshold level of term entries in common with a cluster is considered similar to the document characterized by the cluster.
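The idea in the abstract can be illustrated with a minimal Python sketch. It assumes whitespace tokenization and keeps every ordered pair instead of a sample; the function names `ordered_pairs` and `shares_threshold` are hypothetical and do not come from the patent.

```python
from itertools import combinations

def ordered_pairs(text):
    """Return the set of (a, b) pairs where a occurs before b in the text.

    Assumption: terms are lowercase whitespace-separated tokens.
    """
    terms = text.lower().split()
    # combinations() preserves input order, so a always precedes b.
    return {(a, b) for a, b in combinations(terms, 2) if a != b}

def shares_threshold(cluster, other_text, threshold):
    """True if the other document has at least `threshold` pairs in common."""
    common = cluster & ordered_pairs(other_text)
    return len(common) >= threshold

# The cluster characterizes one document as a set of ordered term pairs.
cluster = ordered_pairs("the quick brown fox")
print(shares_threshold(cluster, "quick brown fox jumps", 3))
print(shares_threshold(cluster, "fox brown quick the", 1))
```

Because the pairs are ordered, a document containing the same terms in reverse order shares no pairs with the cluster.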

69 Citations

21 Claims

1. A method performed by one or more server devices, the method comprising:
- receiving, using one or more processors associated with the one or more server devices, a document;
- selecting, using one or more processors associated with the one or more server devices, terms from the document to form a plurality of term pairs, where the selection is biased such that terms that appear closer to each other in the document have a greater probability of being included in the plurality of term pairs than terms that appear further from each other in the document;
- creating, using one or more processors associated with the one or more server devices, a cluster that includes the plurality of term pairs, where creating the cluster includes:
  - sampling a quantity of the plurality of term pairs, where the quantity is determined based on a length of the document; and
- determining, using one or more processors associated with the one or more server devices, whether another document is similar to the document by comparing pairs of terms from the other document with the plurality of term pairs of the cluster.

2. The method of claim 1, where the terms in each of the plurality of term pairs are ordered such that a first term, of each of the plurality of term pairs, comes before a second term, of each of the plurality of term pairs, in the document.

3. The method of claim 1, further comprising:
- storing an inverted index including a plurality of clusters associated with a plurality of documents, where the inverted index comprises a list of the plurality of clusters that include a particular term pair of the plurality of term pairs.

4. The method of claim 1, further comprising:
- storing a value, associated with the cluster, that indicates a quantity of term pairs included in the cluster.

5. The method of claim 4, where the quantity of term pairs included in the cluster is based on a length of the document.

6. The method of claim 1, where the selection is further biased such that terms with a lower frequency of occurrence within the document have a greater probability of being included in the plurality of term pairs than terms with a higher frequency of occurrence within the document.

7. The method of claim 1, where the selection is further biased such that terms that appear within hypertext markup language (HTML) tags are excluded from inclusion in one of the plurality of term pairs.

8. The method of claim 1, where the selection is further biased such that terms that appear in a pre-selected section of the document have a greater probability of being included in the plurality of term pairs than terms that appear in other sections of the document, where the pre-selected section includes the upper middle section of the document.

9. The method of claim 1, further comprising:
- selecting terms from the document to form n-ary term sets.

10. The method of claim 9, where a first term in a particular n-ary term set occurs in the document before a second term in the particular n-ary term set, and where a third term in the particular n-ary term set occurs in the document after both the first and second terms in the particular n-ary term set.
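The selection and sampling steps of claim 1 could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the patented implementation: the geometric weight `decay ** (gap - 1)`, the `max_gap` cutoff, and the `sample_size` rule are all hypothetical choices made only to show a proximity bias and a length-dependent sample quantity.

```python
import math
import random

def sample_size(num_terms, per_term=0.5, minimum=4):
    """Hypothetical rule: the number of sampled pairs grows with document length."""
    return max(minimum, math.ceil(per_term * num_terms))

def build_cluster(text, max_gap=8, decay=0.5, seed=0):
    """Sample ordered term pairs, favoring terms that sit close together.

    Assumptions: whitespace tokenization; pairs more than `max_gap` positions
    apart are never selected; the weight halves with each extra position.
    """
    rng = random.Random(seed)
    terms = text.lower().split()
    candidates, weights = [], []
    for i, first in enumerate(terms):
        for gap in range(1, max_gap + 1):
            j = i + gap
            if j >= len(terms):
                break
            if first != terms[j]:
                # The first term always precedes the second in the document.
                candidates.append((first, terms[j]))
                weights.append(decay ** (gap - 1))
    if not candidates:
        return set()
    k = sample_size(len(terms))
    # choices() samples with replacement, so the set may hold fewer than k pairs.
    return set(rng.choices(candidates, weights=weights, k=k))

cluster = build_cluster("a similarity detector detects near duplicate documents")
print(sorted(cluster))
```

A fixed `seed` keeps the sketch reproducible; a real system would choose its own source of randomness and weighting.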

11. A server comprising:
- a memory to store instructions; and
- a processor to execute the instructions to:
  - receive a document;
  - select terms from the document to form a plurality of term pairs, where the selection of terms is weighted such that terms that appear closer to each other in the document have a higher probability of being included in the plurality of term pairs than terms that appear farther from each other in the document;
  - create a cluster that includes the plurality of term pairs, where the cluster is created by sampling at least one of the plurality of term pairs, and a quantity of the plurality of term pairs that is sampled is determined based on a length of the document; and
  - determine whether an input document is similar to the document by comparing pairs of terms from the input document with the plurality of term pairs in the cluster for the document.

12. The server of claim 11, where, when determining whether an input document is similar to the document, the processor executes instructions to:
- create the pairs of terms from the input document based on enumerating possible pairs of terms within the input document; and
- determine a quantity of matches between the enumerated pairs of terms and the plurality of term pairs of the cluster.

13. The server of claim 11, where, when determining whether an input document is similar to the document, the processor executes instructions to:
- enumerate, for a particular term within the input document, a plurality of pairs of terms, where each of the plurality of pairs of terms includes the particular term and another term that is within a particular distance of the particular term within the input document; and
- determine a quantity of matches between the enumerated pairs of terms and the plurality of term pairs of the cluster.

14. The server of claim 11, where the processor executes instructions to:
- store a plurality of clusters, where each cluster includes a plurality of term pairs and each cluster is associated with a different document.

15. The server of claim 14, where the processor executes instructions to:
- identify, for each of the pairs of terms from the input document, clusters that include a term pair that matches one of the pairs of terms from the input document.

16. The server of claim 15, where the processor executes instructions to:
- select the clusters that include more than a threshold number of term pairs that match with the pairs of terms from the input document; and
- provide an indication that the input document is similar to the documents associated with the selected clusters.

17. The server of claim 11, where, when determining whether an input document is similar to the document, the processor executes instructions to:
- determine a quantity of the pairs of terms from the input document that match with the plurality of term pairs of the cluster;
- calculate a similarity metric based on a percentage of the pairs of terms from the input document that match the plurality of term pairs of the cluster; and
- indicate that the input document is similar to the document associated with the cluster when the similarity metric is above a threshold.
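The lookup side described in claims 13 through 17 can be sketched as follows: enumerate pairs within a fixed distance in the input document, find matching clusters through an inverted index, and score each cluster by the fraction of input pairs that match. The names `build_inverted_index`, `window_pairs`, and `similar_documents`, the default `distance=3`, and the `threshold=0.5` value are assumptions for illustration only. For brevity the stored clusters here hold all windowed pairs, not a sample.

```python
from collections import Counter, defaultdict

def build_inverted_index(clusters):
    """Map each term pair to the ids of the clusters that contain it."""
    index = defaultdict(set)
    for doc_id, pairs in clusters.items():
        for pair in pairs:
            index[pair].add(doc_id)
    return index

def window_pairs(text, distance=3):
    """Enumerate ordered pairs whose terms lie within `distance` positions."""
    terms = text.lower().split()
    pairs = set()
    for i, first in enumerate(terms):
        for second in terms[i + 1 : i + 1 + distance]:
            if first != second:
                pairs.add((first, second))
    return pairs

def similar_documents(text, index, threshold=0.5, distance=3):
    """Return {doc_id: score} for clusters whose score exceeds the threshold.

    The score is the fraction of the input document's pairs found in the cluster.
    """
    input_pairs = window_pairs(text, distance)
    if not input_pairs:
        return {}
    hits = Counter()
    for pair in input_pairs:
        for doc_id in index.get(pair, ()):
            hits[doc_id] += 1
    scores = {doc_id: n / len(input_pairs) for doc_id, n in hits.items()}
    return {doc_id: s for doc_id, s in scores.items() if s > threshold}

clusters = {
    "doc1": window_pairs("the quick brown fox jumps"),
    "doc2": window_pairs("lorem ipsum dolor sit amet"),
}
index = build_inverted_index(clusters)
print(similar_documents("the quick brown fox sleeps", index))
```

The inverted index means only clusters that share at least one pair with the input document are ever scored, which avoids comparing the input against every stored cluster.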

18. A computer-readable memory device including instructions executable by at least one processor, the computer-readable memory device comprising:
- one or more instructions to receive a document;
- one or more instructions to select terms from the document to form a plurality of term pairs, where the selection is weighted such that terms that appear closer to each other in the document have a higher probability of being included in the plurality of term pairs than terms that appear farther from each other in the document;
- one or more instructions to create a cluster that includes the plurality of term pairs, where the one or more instructions to create the cluster include:
  - one or more instructions to sample at least one of the plurality of term pairs, where a quantity of the plurality of term pairs that is sampled is determined based on a length of the document; and
- one or more instructions to determine that another document is similar to the document by comparing pairs of terms from the other document with the pairs of terms of the cluster.

19. The computer-readable memory device of claim 18, where the other document is an email message.

20. The computer-readable memory device of claim 19, further comprising:
- one or more instructions to block the email message from reaching an intended recipient when the other document is determined to be similar to the document.

21. The computer-readable memory device of claim 18, where the document and the other document are included in search results based on a search query, the computer-readable memory device further comprising:
- one or more instructions to remove the other document from the search results when the other document is determined to be similar to the document.
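As a rough illustration of the search-result use in claim 21, the sketch below drops a result when it is similar to one already kept. It simplifies heavily: clusters hold every ordered pair with no sampling or proximity weighting, and `term_pairs`, `is_similar`, `dedupe_results`, and the `0.6` threshold are hypothetical. The same check could gate delivery of an email message as in claim 20.

```python
from itertools import combinations

def term_pairs(text):
    """All ordered pairs of distinct whitespace-separated terms (assumption)."""
    terms = text.lower().split()
    return {(a, b) for a, b in combinations(terms, 2) if a != b}

def is_similar(cluster, text, threshold=0.6):
    """True if more than `threshold` of the text's pairs appear in the cluster."""
    pairs = term_pairs(text)
    return bool(pairs) and len(pairs & cluster) / len(pairs) > threshold

def dedupe_results(results, threshold=0.6):
    """Keep each result unless it is similar to an earlier kept result."""
    kept, clusters = [], []
    for text in results:
        if any(is_similar(c, text, threshold) for c in clusters):
            continue  # near duplicate of an earlier result: remove it
        kept.append(text)
        clusters.append(term_pairs(text))
    return kept

results = [
    "cheap watches buy now online",
    "cheap watches buy now online today",
    "weather forecast for the coming week",
]
print(dedupe_results(results))
```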

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Tong, Simon
Primary Examiner(s): Radtke, Mark A
Application Number: US12/764,293
Time in Patent Office: 797 Days
Field of Search: 707/749
US Class Current: 707/749
CPC Class Codes: G06F 16/319 Inverted lists
