Document similarity detection

  • US 8,209,339 B1
  • Filed: 04/21/2010
  • Issued: 06/26/2012
  • Est. Priority Date: 06/17/2003
  • Status: Expired due to Term
First Claim

1. A method performed by one or more server devices, the method comprising:

  • receiving, using one or more processors associated with the one or more server devices, a document;

    selecting, using one or more processors associated with the one or more server devices, terms from the document to form a plurality of term pairs, where the selection is biased such that terms that appear closer to each other in the document have a greater probability of being included in the plurality of term pairs than terms that appear further from each other in the document;

    creating, using one or more processors associated with the one or more server devices, a cluster that includes the plurality of term pairs, where creating the cluster includes:

    sampling a quantity of the plurality of term pairs, where the quantity is determined based on a length of the document; and

    determining, using one or more processors associated with the one or more server devices, whether another document is similar to the document by comparing pairs of terms from the other document with the plurality of term pairs of the cluster.
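
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation; it only shows one way the recited steps could fit together. Every name and parameter here is hypothetical: the 1/gap weighting that favors nearby terms, the `max_gap` window, the `pairs_per_term` ratio that ties the sample size to document length, and the `threshold` used to decide similarity are all assumptions chosen for illustration, since the claim does not specify them.

```python
import random
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def biased_term_pairs(terms, num_pairs, max_gap=5, rng=None):
    """Draw term pairs so that nearby terms are more likely to be paired.

    The distance between the two terms is drawn with weight 1/gap. This
    weighting is an assumption; the claim only requires a bias toward
    terms that appear closer together.
    """
    rng = rng or random.Random(0)
    if len(terms) < 2:
        return set()
    gaps = list(range(1, min(max_gap, len(terms) - 1) + 1))
    weights = [1.0 / g for g in gaps]
    pairs = set()
    for _ in range(num_pairs):
        gap = rng.choices(gaps, weights=weights, k=1)[0]
        start = rng.randrange(len(terms) - gap)
        pairs.add((terms[start], terms[start + gap]))
    return pairs


def build_cluster(text, pairs_per_term=0.5, rng=None):
    """Form a cluster by sampling a number of pairs that scales with length."""
    rng = rng or random.Random(0)
    terms = tokenize(text)
    candidates = biased_term_pairs(terms, num_pairs=4 * len(terms), rng=rng)
    # Assumed rule: the sample size grows linearly with the term count.
    quantity = min(len(candidates), max(1, int(pairs_per_term * len(terms))))
    return set(rng.sample(sorted(candidates), quantity))


def is_similar(cluster, other_text, threshold=0.5, max_gap=5):
    """Compare the other document's nearby term pairs with the cluster."""
    terms = tokenize(other_text)
    other_pairs = set()
    for i, term in enumerate(terms):
        last = min(i + max_gap, len(terms) - 1)
        for j in range(i + 1, last + 1):
            other_pairs.add((term, terms[j]))
    if not cluster:
        return False
    # Fraction of cluster pairs that also occur in the other document.
    overlap = len(cluster & other_pairs) / len(cluster)
    return overlap >= threshold


doc = "the quick brown fox jumps over the lazy dog near the quiet river bank"
cluster = build_cluster(doc)
print(is_similar(cluster, doc))
print(is_similar(cluster, "completely unrelated words about quantum physics"))
```

In this sketch the other document is compared using all of its term pairs within the same window, so an identical document always matches every pair in the cluster, while a document with no shared vocabulary matches none.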
