Document similarity detection
First Claim
1. A method performed by one or more server devices, the method comprising:
receiving, using one or more processors associated with the one or more server devices, a document;
selecting, using one or more processors associated with the one or more server devices, terms from the document to form a plurality of term pairs, where the selection is biased such that terms that appear closer to each other in the document have a greater probability of being included in the plurality of term pairs than terms that appear further from each other in the document;
creating, using one or more processors associated with the one or more server devices, a cluster that includes the plurality of term pairs, where creating the cluster includes:
sampling a quantity of the plurality of term pairs, where the quantity is determined based on a length of the document; and
determining, using one or more processors associated with the one or more server devices, whether another document is similar to the document by comparing pairs of terms from the other document with the plurality of term pairs of the cluster.
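The selecting and sampling steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the exponential gap distribution, the `mean_gap` default, and the square-root rule for the sample quantity are all assumptions chosen for illustration. The claim only requires that closer terms be more likely to be paired and that the sampled quantity depend on document length.

```python
import math
import random


def select_term_pairs(terms, num_pairs, rng, mean_gap=3.0):
    """Draw ordered (earlier, later) term pairs, favoring terms that sit close together.

    Assumption: the gap between paired terms follows an exponential
    distribution, so small gaps are the most likely.
    """
    pairs = []
    if len(terms) < 2:
        return pairs
    for _ in range(num_pairs):
        i = rng.randrange(len(terms) - 1)
        gap = 1 + int(rng.expovariate(1.0 / mean_gap))
        j = min(i + gap, len(terms) - 1)  # clamp to the last term
        pairs.append((terms[i], terms[j]))
    return pairs


def sample_quantity(doc_length, scale=4.0):
    """Hypothetical rule: the sample size grows with the square root of the length."""
    return max(1, int(scale * math.sqrt(doc_length)))


def build_cluster(text, seed=0):
    """Form candidate pairs, then keep a length-dependent sample of them as the cluster."""
    terms = text.lower().split()
    rng = random.Random(seed)
    candidates = select_term_pairs(terms, num_pairs=10 * len(terms), rng=rng)
    unique = sorted(set(candidates))  # sorted so a fixed seed gives a fixed cluster
    k = min(len(unique), sample_quantity(len(terms)))
    return set(rng.sample(unique, k))
```

Under these assumptions a longer document yields a larger cluster, and most sampled pairs join terms only a few positions apart.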
Abstract
A similarity detector detects similar or near duplicate occurrences of a document. The similarity detector determines similarity of documents by characterizing the documents as clusters each made up of a set of term entries, such as pairs of terms. A pair of terms, for example, indicates that the first term of the pair occurs before the second term of the pair in the underlying document. Another document that has a threshold level of term entries in common with a cluster is considered similar to the document characterized by the cluster.
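The comparison described in the abstract can be sketched as follows. This is an illustration under stated assumptions rather than the patent's method: `ordered_pairs`, `is_similar`, the five-term `window`, and the 0.5 `threshold` are hypothetical choices, and for simplicity every in-window pair is kept instead of a sample.

```python
def ordered_pairs(text, window=5):
    """Return all (first, second) pairs where `first` occurs before `second`
    within `window` terms of it in the text."""
    terms = text.lower().split()
    pairs = set()
    for i, first in enumerate(terms):
        for second in terms[i + 1 : i + 1 + window]:
            pairs.add((first, second))
    return pairs


def is_similar(cluster, other_text, threshold=0.5, window=5):
    """Treat the other document as similar when at least `threshold` of the
    cluster's term pairs also occur in it (assumed threshold form)."""
    if not cluster:
        return False
    other = ordered_pairs(other_text, window)
    shared = len(cluster & other)
    return shared / len(cluster) >= threshold


original = "a similarity detector detects near duplicate occurrences of a document"
cluster = ordered_pairs(original)
near_duplicate = "a similarity detector detects near duplicate copies of a document"
unrelated = "weather today is sunny with light wind from the north"
```

Here `is_similar(cluster, near_duplicate)` holds because changing one word removes only the pairs that involve it, while `unrelated` shares no pairs with the cluster. Because each pair records order, `("similarity", "detector")` is in the cluster but `("detector", "similarity")` is not.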
69 Citations
21 Claims
1. A method performed by one or more server devices, the method comprising:
receiving, using one or more processors associated with the one or more server devices, a document;
selecting, using one or more processors associated with the one or more server devices, terms from the document to form a plurality of term pairs, where the selection is biased such that terms that appear closer to each other in the document have a greater probability of being included in the plurality of term pairs than terms that appear further from each other in the document;
creating, using one or more processors associated with the one or more server devices, a cluster that includes the plurality of term pairs, where creating the cluster includes:
sampling a quantity of the plurality of term pairs, where the quantity is determined based on a length of the document; and
determining, using one or more processors associated with the one or more server devices, whether another document is similar to the document by comparing pairs of terms from the other document with the plurality of term pairs of the cluster.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A server comprising:
a memory to store instructions; and
a processor to execute the instructions to:
receive a document;
select terms from the document to form a plurality of term pairs, where the selection of terms is weighted such that terms that appear closer to each other in the document have a higher probability of being included in the plurality of term pairs than terms that appear farther from each other in the document;
create a cluster that includes the plurality of term pairs, where the cluster is created by sampling at least one of the plurality of term pairs, and a quantity of the plurality of term pairs that is sampled is determined based on a length of the document; and
determine whether an input document is similar to the document by comparing pairs of terms from the input document with the plurality of term pairs in the cluster for the document.
Dependent claims: 12, 13, 14, 15, 16, 17
18. A computer-readable memory device including instructions executable by at least one processor, the computer-readable memory device comprising:
one or more instructions to receive a document;
one or more instructions to select terms from the document to form a plurality of term pairs, where the selection is weighted such that terms that appear closer to each other in the document have a higher probability of being included in the plurality of term pairs than terms that appear farther from each other in the document;
one or more instructions to create a cluster that includes the plurality of term pairs, where the one or more instructions to create the cluster include:
one or more instructions to sample at least one of the plurality of term pairs, where a quantity of the plurality of term pairs that is sampled is determined based on a length of the document; and
one or more instructions to determine that another document is similar to the document by comparing pairs of terms from the other document with the pairs of terms of the cluster.
Dependent claims: 19, 20, 21
Specification