Document similarity detection
First Claim
1. A method performed by one or more server devices, the method comprising:
receiving, using one or more processors associated with the one or more server devices, a document;
selecting, using one or more processors associated with the one or more server devices, terms from the received document to form a plurality of term groups for the received document, each term group, of the plurality of term groups, being associated with an indication that a first term, of the term group, occurs before a second term, of the term group, within the received document;
identifying, using one or more processors associated with the one or more server devices and from an inverted index of term groups, one or more clusters of a plurality of clusters, each cluster, of the one or more identified clusters, comprising a set of term groups for a respective other document, each respective term group, of the set of term groups, being associated with an indication that a first term, of the respective term group, occurs before a second term, of the respective term group, within the respective other document;
determining, using one or more processors associated with the one or more server devices, measures of similarity between the plurality of term groups for the received document and the set of term groups for each of the one or more identified clusters; and
determining, using one or more processors associated with the one or more server devices and based on the determined measures of similarity, that the received document is similar to the respective other document.
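The identifying and determining steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the sample clusters, the use of Jaccard overlap as the measure of similarity, and the 0.5 cutoff are all assumptions chosen for illustration, since the claim does not fix any of them.

```python
from collections import defaultdict

def build_inverted_index(clusters):
    """Map each term group to the ids of the clusters that contain it."""
    index = defaultdict(set)
    for cluster_id, groups in clusters.items():
        for group in groups:
            index[group].add(cluster_id)
    return index

def candidate_clusters(index, doc_groups):
    """Identify clusters that share at least one term group with the document."""
    found = set()
    for group in doc_groups:
        found |= index.get(group, set())
    return found

def similarity_measures(clusters, index, doc_groups):
    """Jaccard overlap (an assumed measure) for each identified cluster."""
    return {
        cid: len(doc_groups & clusters[cid]) / len(doc_groups | clusters[cid])
        for cid in candidate_clusters(index, doc_groups)
    }

# Hypothetical clusters: one set of ordered (first, second) term groups per stored document.
clusters = {
    "doc_a": {("similarity", "detector"), ("detector", "documents"), ("term", "pairs")},
    "doc_b": {("inverted", "index"), ("index", "clusters")},
}
index = build_inverted_index(clusters)

# Term groups selected from the received document (hypothetical).
received = {("similarity", "detector"), ("term", "pairs"), ("near", "duplicate")}
scores = similarity_measures(clusters, index, received)
similar_to = [cid for cid, score in scores.items() if score >= 0.5]  # assumed cutoff
```

Only clusters reachable through the inverted index are scored, so documents that share no term group with the received document are never compared.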
Abstract
A similarity detector detects similar or near duplicate occurrences of a document. The similarity detector determines similarity of documents by characterizing the documents as clusters each made up of a set of term entries, such as pairs of terms. A pair of terms, for example, indicates that the first term of the pair occurs before the second term of the pair in the underlying document. Another document that has a threshold level of term entries in common with a cluster is considered similar to the document characterized by the cluster.
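As a rough illustration of the abstract, the sketch below characterizes a document as a set of ordered term pairs and treats another document as similar when it shares a threshold fraction of those pairs. The whitespace tokenization, the `window` parameter, the function names, and the 0.5 threshold are assumptions made for this example and are not taken from the patent.

```python
def term_pairs(text, window=3):
    """Return ordered pairs (a, b) where a occurs before b in the text.

    Pairs are limited to terms at most `window` positions apart
    (a hypothetical selection rule).
    """
    terms = text.lower().split()
    pairs = set()
    for i, first in enumerate(terms):
        for second in terms[i + 1 : i + 1 + window]:
            pairs.add((first, second))
    return pairs

def is_similar(cluster, other_pairs, threshold=0.5):
    """True when `other_pairs` shares at least `threshold` of the cluster's entries."""
    if not cluster:
        return False
    shared = len(cluster & other_pairs)
    return shared / len(cluster) >= threshold

original = term_pairs("the quick brown fox jumps over the lazy dog")
near_dup = term_pairs("the quick brown fox leaps over the lazy dog")
unrelated = term_pairs("stock prices fell sharply on monday")

print(is_similar(original, near_dup))   # one changed word, most pairs survive
print(is_similar(original, unrelated))  # no pairs in common
```

Because each pair records order, `("quick", "brown")` is an entry of the cluster while `("brown", "quick")` is not.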
20 Claims
1. A method performed by one or more server devices, the method comprising:
receiving, using one or more processors associated with the one or more server devices, a document;
selecting, using one or more processors associated with the one or more server devices, terms from the received document to form a plurality of term groups for the received document, each term group, of the plurality of term groups, being associated with an indication that a first term, of the term group, occurs before a second term, of the term group, within the received document;
identifying, using one or more processors associated with the one or more server devices and from an inverted index of term groups, one or more clusters of a plurality of clusters, each cluster, of the one or more identified clusters, comprising a set of term groups for a respective other document, each respective term group, of the set of term groups, being associated with an indication that a first term, of the respective term group, occurs before a second term, of the respective term group, within the respective other document;
determining, using one or more processors associated with the one or more server devices, measures of similarity between the plurality of term groups for the received document and the set of term groups for each of the one or more identified clusters; and
determining, using one or more processors associated with the one or more server devices and based on the determined measures of similarity, that the received document is similar to the respective other document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system comprising:
one or more devices, including at least one memory and at least one processor, to:
receive a document;
select terms from the received document to form a plurality of term groups for the received document, each term group, of the plurality of term groups, being associated with an indication that a first term, of the term group, occurs before a second term, of the term group, within the received document;
identify, from an inverted index of term groups, one or more clusters of a plurality of clusters, each cluster, of the one or more identified clusters, comprising a set of term groups for a respective other document, each respective term group, of the set of term groups, being associated with an indication that a first term, of the respective term group, occurs before a second term, of the respective term group, within the respective other document;
determine measures of similarity between the plurality of term groups for the received document and the set of term groups for each of the one or more identified clusters; and
determine, based on the determined measures of similarity, that the received document is similar to the respective other document.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A method performed by one or more server devices, the method comprising:
selecting, using one or more processors associated with the one or more server devices, terms from a received document to form a plurality of term groups for the received document, each term group, of the plurality of term groups, being associated with an indication that a first term, of the term group, occurs before a second term, of the term group, within the received document, the first term and the second term being non-consecutive in at least some term groups of the plurality of term groups;
accessing, using one or more processors associated with the one or more server devices, a stored inverted index of term groups for clusters representing term groups for multiple other documents, each cluster, of the clusters, comprising a set of term groups for a respective other document, each respective term group, of the set of term groups, being associated with an indication that a first term, of the respective term group, occurs before a second term, of the respective term group, within the respective other document, the first term of the respective term group and the second term of the respective term group being non-consecutive in at least some of the set of term groups;
identifying, using one or more processors associated with the one or more server devices and from the inverted index, one or more clusters that have term groups in common with the received document; and
evaluating, using one or more processors associated with the one or more server devices, a similarity between the received document and the respective other document based on one or more matches between the plurality of term groups for the received document and the set of term groups for each of the identified one or more clusters.
Dependent claims: 18, 19, 20
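Claim 17 adds that the two terms of a group may be non-consecutive and that similarity is evaluated from matches found through the stored inverted index. The Python sketch below shows one way this could look. The `max_gap` parameter, the function names, and the sample documents are hypothetical, and counting matching term groups per cluster is only one possible way to evaluate the matches.

```python
from collections import Counter, defaultdict

def ordered_groups(terms, max_gap=2):
    """Pairs (first, second) with first occurring before second.

    Up to `max_gap` terms may sit between the two, so some pairs are
    non-consecutive. `max_gap` is a hypothetical tuning knob.
    """
    groups = set()
    for i, first in enumerate(terms):
        for j in range(i + 1, min(i + 2 + max_gap, len(terms))):
            groups.add((first, terms[j]))
    return groups

def match_counts(index, groups):
    """Count, per cluster, how many of the document's term groups it contains."""
    counts = Counter()
    for group in groups:
        for cluster_id in index.get(group, ()):
            counts[cluster_id] += 1
    return counts

# Hypothetical stored documents, one cluster each.
stored = {
    "a": "fast near duplicate document detection".split(),
    "b": "inverted index lookup for clusters".split(),
}

# Stored inverted index: term group -> ids of clusters containing it.
index = defaultdict(set)
for cluster_id, terms in stored.items():
    for group in ordered_groups(terms):
        index[group].add(cluster_id)

# Received document: document "a" with the word "near" dropped.
received = ordered_groups("fast duplicate document detection".split())
matches = match_counts(index, received)
```

In this example the pair `("fast", "duplicate")` is non-consecutive in document "a" but still matches the received document, which is why allowing gaps makes the comparison more tolerant of small insertions and deletions.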
Specification