Document similarity detection

  • US 7,734,627 B1
  • Filed: 06/17/2003
  • Issued: 06/08/2010
  • Est. Priority Date: 06/17/2003
  • Status: Active Grant
First Claim

1. A method performed by a computer system, the method comprising:

  • randomly sampling pairs of ordered terms from a particular document of a set of documents, to generate a cluster of pairs of ordered terms for the particular document, where the pairs of ordered terms include a first term and a second term and where, in at least some of the pairs of ordered terms, the second term occurs after one or more intervening terms occurring after the first term in the particular document, and where the random sampling is biased to have a higher chance of including a first ordered pair in the cluster than a second ordered pair, if the first ordered pair has fewer intervening terms than the second ordered pair, the randomly sampling being performed using one or more processors associated with the computer system;

    building, using one or more processors associated with the computer system, a similarity model that includes the cluster of pairs;

    comparing, using one or more processors associated with the computer system, a cluster of pairs from a target document to clusters of pairs from the similarity model;

    generating, using one or more processors associated with the computer system, similarity metrics that measure similarity between the target document and particular documents in the set of documents, the generating being based on the comparing; and

    outputting the generated similarity metrics.
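
The steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the whitespace tokenizer, the geometric `decay` weighting that favors pairs with fewer intervening terms, the `max_gap` cutoff, and the Jaccard overlap used as the similarity metric are all assumptions chosen for illustration, since the claim does not specify any of them.

```python
import random
from collections import Counter


def sample_ordered_pairs(terms, n_pairs, max_gap=5, decay=0.5, rng=None):
    """Randomly sample (first, second) term pairs from one document.

    The gap (number of intervening terms) is drawn with weight decay**gap,
    so pairs with fewer intervening terms are more likely to be sampled.
    The weighting scheme is an assumption, not taken from the claim.
    """
    rng = rng or random.Random()
    cluster = Counter()
    if len(terms) < 2:
        return cluster
    gaps = list(range(max_gap + 1))
    weights = [decay ** g for g in gaps]
    for _ in range(n_pairs):
        i = rng.randrange(len(terms) - 1)            # position of the first term
        gap = rng.choices(gaps, weights=weights)[0]  # biased toward small gaps
        j = min(i + 1 + gap, len(terms) - 1)         # clamp at the document end
        cluster[(terms[i], terms[j])] += 1
    return cluster


def build_model(docs, n_pairs=200, seed=0):
    """Build a similarity model: one cluster of sampled pairs per document."""
    rng = random.Random(seed)
    return {
        doc_id: sample_ordered_pairs(text.lower().split(), n_pairs, rng=rng)
        for doc_id, text in docs.items()
    }


def pair_overlap(cluster_a, cluster_b):
    """Jaccard overlap of the distinct pairs in two clusters (assumed metric)."""
    a, b = set(cluster_a), set(cluster_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_target(model, target_text, n_pairs=200, seed=1):
    """Compare a target document's cluster to every cluster in the model."""
    rng = random.Random(seed)
    target = sample_ordered_pairs(target_text.lower().split(), n_pairs, rng=rng)
    return {doc_id: pair_overlap(target, c) for doc_id, c in model.items()}


if __name__ == "__main__":
    docs = {
        "d1": "the quick brown fox jumps over the lazy dog",
        "d2": "stock prices fell sharply in early trading today",
    }
    model = build_model(docs)
    print(score_target(model, "the quick brown fox jumps over a sleeping dog"))
```

Because sampling is random, the scores vary with the seed; fixing the seeds as above makes a run repeatable, and a larger `n_pairs` should reduce the variance at the cost of more work per document.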

  • 3 Assignments