Document similarity detection
Abstract
A similarity detector detects similar or near duplicate occurrences of a document. The similarity detector determines similarity of documents by characterizing the documents as clusters each made up of a set of term entries, such as pairs of terms. A pair of terms, for example, indicates that the first term of the pair occurs before the second term of the pair in the underlying document. Another document that has a threshold level of term entries in common with a cluster is considered similar to the document characterized by the cluster.
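The threshold rule in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, a cluster built from every ordered pair (rather than a sampled subset), and a hypothetical `threshold` of 0.8; none of these choices come from the patent text, and the function names are illustrative only.

```python
from itertools import combinations


def ordered_pairs(text):
    """Return every (first, second) pair where `first` occurs before `second`."""
    terms = text.lower().split()
    # combinations() preserves input order, so each pair is (earlier, later).
    return set(combinations(terms, 2))


def is_similar(cluster, other_text, threshold=0.8):
    """True if at least `threshold` of the cluster's pairs also occur in other_text."""
    if not cluster:
        return False
    shared = cluster & ordered_pairs(other_text)
    return len(shared) / len(cluster) >= threshold


# Cluster characterizing one document (here: all of its ordered pairs).
cluster = ordered_pairs("the quick brown fox")

print(is_similar(cluster, "the quick brown fox jumps"))  # same word order plus one extra term
print(is_similar(cluster, "fox brown quick the"))        # same words, reversed order
```

Because the pairs are ordered, a document that reuses the same words in a different order shares few or no term entries with the cluster, which is what separates this representation from a plain bag of words.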
20 Claims
1. A method performed by a computer system, the method comprising:
- randomly sampling pairs of ordered terms from a particular document of a set of documents, to generate a cluster of pairs of ordered terms for the particular document, where the pairs of ordered terms include a first term and a second term and where, in at least some of the pairs of ordered terms, the second term occurs after one or more intervening terms occurring after the first term in the particular document, and where the random sampling is biased to have a higher chance of including a first ordered pair in the cluster than a second ordered pair, if the first ordered pair has fewer intervening terms than the second ordered pair, the randomly sampling being performed using one or more processors associated with the computer system;
- building, using one or more processors associated with the computer system, a similarity model that includes the cluster of pairs;
- comparing, using one or more processors associated with the computer system, a cluster of pairs from a target document to clusters of pairs from the similarity model;
- generating, using one or more processors associated with the computer system, similarity metrics that measure similarity between the target document and particular documents in the set of documents, the generating being based on the comparing; and
- outputting the generated similarity metrics.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
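One way to picture the biased sampling step of claim 1 is the sketch below. It is an assumption-laden illustration, not the patented implementation: the geometric weighting, the `decay` parameter, the `seed` argument, and the function name `sample_pair_cluster` are all hypothetical. Each candidate pair at positions `i < j` gets weight `decay ** (j - i - 1)`, so a pair with fewer intervening terms is always more likely to be drawn than one with more.

```python
import random


def sample_pair_cluster(terms, num_pairs, decay=0.5, seed=None):
    """Sample ordered (first, second) pairs, favouring pairs with fewer intervening terms.

    `decay` (hypothetical) controls the bias: the weight of a pair is
    decay ** (number of intervening terms).
    """
    rng = random.Random(seed)
    # Every position pair (i, j) with i < j is a candidate ordered pair.
    positions = [(i, j)
                 for i in range(len(terms))
                 for j in range(i + 1, len(terms))]
    if not positions:
        return set()
    # j - i - 1 is the number of terms between the first and second term.
    weights = [decay ** (j - i - 1) for i, j in positions]
    # Draws are made with replacement, so the resulting set may hold
    # fewer than num_pairs distinct pairs.
    drawn = rng.choices(positions, weights=weights, k=num_pairs)
    return {(terms[i], terms[j]) for i, j in drawn}


terms = "a similarity detector detects near duplicate documents".split()
cluster = sample_pair_cluster(terms, num_pairs=10, seed=42)
print(sorted(cluster))
```

Enumerating all position pairs costs quadratic time in the document length; a practical system would more likely draw a first position and then a gap, but the exhaustive form keeps the bias easy to verify.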
10. A similarity detection device comprising:
- a cluster creation hardware-implemented component that generates clusters of pairs of ordered terms by randomly sampling documents, where a particular pair of ordered terms includes a first term that occurs before a second term in a particular document, and where at least some of the pairs of ordered terms include terms that occur non-consecutively in the particular document, and where the random sampling is biased such that terms closer to one another, as measured based on a number of terms separating two terms, have a greater chance of being included, in a particular cluster, as one of the pairs of ordered terms;
- an inverted index hardware-implemented component that relates an order of occurrence of pairs of ordered terms to clusters that contain the pairs of ordered terms;
- an enumeration hardware-implemented component that generates pairs of ordered terms for a first document that is to be compared to the inverted index;
- a pair lookup hardware-implemented component that looks up the generated pairs of ordered terms in the inverted index to obtain clusters that contain the generated pairs of ordered terms; and
- a cluster selection hardware-implemented component that selects clusters obtained by the pair lookup component that are similar to the first document.
Dependent claims: 11, 12, 13, 14, 15
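The inverted index, enumeration, pair lookup, and cluster selection components of claim 10 can be sketched in software as follows. This is a rough illustration under stated assumptions: the claim describes hardware-implemented components, whereas this is plain Python; the `max_gap` window, the `threshold` of 0.5, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(clusters):
    """Map each ordered pair to the ids of the clusters that contain it."""
    index = defaultdict(set)
    for cluster_id, pairs in clusters.items():
        for pair in pairs:
            index[pair].add(cluster_id)
    return index


def enumerate_pairs(terms, max_gap=3):
    """Enumerate ordered pairs whose positions are at most `max_gap` apart."""
    return {(terms[i], terms[j])
            for i in range(len(terms))
            for j in range(i + 1, min(i + 1 + max_gap, len(terms)))}


def select_similar_clusters(index, clusters, terms, threshold=0.5):
    """Return ids of clusters sharing at least `threshold` of their pairs with the document."""
    hits = Counter()
    for pair in enumerate_pairs(terms):
        # Pair lookup: .get() avoids adding empty entries to the defaultdict.
        for cluster_id in index.get(pair, ()):
            hits[cluster_id] += 1
    return sorted(cid for cid, n in hits.items()
                  if n / len(clusters[cid]) >= threshold)


clusters = {
    "doc1": {("the", "quick"), ("quick", "fox"), ("the", "fox")},
    "doc2": {("lazy", "dog"), ("dog", "sleeps")},
}
index = build_inverted_index(clusters)
print(select_similar_clusters(index, clusters, "the quick brown fox".split()))
```

Looking pairs up in the index means only clusters that share at least one pair with the incoming document are ever scored, rather than comparing the document against every stored cluster.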
16. A device for determining similarity of a target document to a first set of documents, the device comprising:
- a memory to store instructions; and
- a processor to execute the instructions to implement:
- means for randomly sampling pairs of ordered terms from a particular document of a set of documents, to generate a cluster of pairs of ordered terms for the particular document, where the pairs of ordered terms include a first term and a second term and where, in at least some of the pairs of ordered terms, the second term occurs after one or more intervening terms occurring after the first term in the particular document, and where the random sampling is biased to have a higher chance of including a first ordered pair in the cluster than a second ordered pair, if the first ordered pair has fewer intervening terms than the second ordered pair;
- means for building a similarity model that includes the cluster of pairs;
- means for comparing a cluster of pairs from a target document to clusters of pairs from the similarity model;
- means for generating similarity metrics that measure similarity between the target document and particular documents in the set of documents based on the comparing; and
- means for outputting the generated similarity metrics.
Dependent claims: 18
17. One or more memory devices comprising program instructions executable by at least one processor, the one or more memory devices comprising:
- one or more instructions to randomly sample pairs of ordered terms from a particular document of a set of documents, to generate a cluster of pairs of ordered terms for the particular document, where the pairs of ordered terms include a first term and a second term and where, in at least some of the pairs of ordered terms, the second term occurs after one or more intervening terms occurring after the first term in the particular document, and where the random sampling is biased to have a higher chance of including a first ordered pair in the cluster than a second ordered pair, if the first ordered pair has fewer intervening terms than the second ordered pair;
- one or more instructions to build a similarity model that includes the cluster of pairs;
- one or more instructions to compare pairs of ordered terms from a target document to clusters of pairs of ordered terms from the similarity model;
- one or more instructions to generate similarity metrics that measure similarity between the target document and particular documents in the set of documents based on the comparing; and
- one or more instructions to output the generated similarity metrics.
Dependent claims: 19, 20
Specification