System and method for detecting duplicate and similar documents
First Claim
1. A method for processing data representing documents, comprising:
for individual documents of a set of documents, executing a software program to obtain a list of terms found in each document;
comparing the list of terms for a first document to the list of terms for a second document;
declaring the first document to be substantially identical to, or substantially similar to, the second document if some predetermined number of terms are found in each of the lists of the first document and the second document; and
wherein the step of comparing includes a preliminary step of sorting the documents into a document list in order of increasing size, and where the step of comparing compares the list of terms for a given document with the list of terms for the next larger documents in the document list.
Abstract
A system and a method are described for rapidly determining document similarity among a set of documents, such as a set of documents obtained from an information retrieval (IR) system. A ranked list of the most important terms in each document is obtained using a phrase recognizer system. The list is stored in a database and is used to compute document similarity with a simple database query. If the number of terms found not to be contained in both documents is less than some predetermined threshold compared to the total number of terms in the document, these documents are determined to be very similar. It is shown that these techniques may be employed to accurately recognize that documents that have been revised to contain parts of other documents are still closely related to the original document. These teachings further provide for the computation of a document signature that can then be used to make a rapid comparison between documents that are likely to be identical.
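The two comparisons described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the 10% default threshold, the use of the combined term count as the "total number of terms", and the SHA-1 hash over sorted terms are all hypothetical choices, since the abstract does not specify them.

```python
import hashlib


def unshared_fraction(terms_a, terms_b):
    """Fraction of distinct terms that occur in only one of the two lists.

    Assumption: the "total number of terms" is taken to be the number of
    distinct terms across both documents.
    """
    a, b = set(terms_a), set(terms_b)
    total = len(a | b)
    if total == 0:
        return 0.0
    return len(a ^ b) / total


def very_similar(terms_a, terms_b, threshold=0.10):
    """True if the unshared terms fall below the (assumed) threshold."""
    return unshared_fraction(terms_a, terms_b) < threshold


def signature(terms):
    """Order-independent signature of a term list (assumed: SHA-1 hash)."""
    normalized = sorted({t.lower() for t in terms})
    return hashlib.sha1("\n".join(normalized).encode("utf-8")).hexdigest()
```

Comparing signatures is a cheap equality check for documents that are likely to be identical, while the term-overlap test handles documents that have merely been revised.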
119 Citations
33 Claims
1. A method for processing data representing documents, comprising:
for individual documents of a set of documents, executing a software program to obtain a list of terms found in each document;
comparing the list of terms for a first document to the list of terms for a second document;
declaring the first document to be substantially identical to, or substantially similar to, the second document if some predetermined number of terms are found in each of the lists of the first document and the second document; and
wherein the step of comparing includes a preliminary step of sorting the documents into a document list in order of increasing size, and where the step of comparing compares the list of terms for a given document with the list of terms for the next larger documents in the document list.

2. A method for processing data representing documents, comprising:
for individual documents of a set of documents, executing a software program to obtain a list of salient terms found in each document;
comparing the list of salient terms for a first document to the list of salient terms for a second document; and
declaring the first document to be substantially identical to, or substantially similar to, the second document if some predetermined number of salient terms are found in each of the lists of the first document and the second document, wherein the comparing includes sorting the documents into a document list in order of increasing size, and where the step of comparing compares the list of salient terms for a given document with the list of salient terms for the next larger documents in the document list.
Dependent claims: 3, 4, 5, 6, 7, 8

9. A method for processing data representing documents, comprising:
for individual documents of a set of documents, executing a software program to obtain a list of terms found in each document;
comparing the list of terms for a first document to the list of terms for a second document;
declaring the first document to be substantially identical to, or substantially similar to, the second document if some predetermined number of terms are found in each of the lists of the first document and the second document; and
wherein the step of comparing includes a preliminary step of sorting the documents into a document list in order of increasing size, and where the step of comparing compares the list of terms of a given document only with the list of terms of another document in the list that is no more than a predetermined amount larger than the given document.

10. A method for processing data representing documents, comprising:
for individual ones of documents, executing a software program to obtain a list of salient terms found in each document;
computing a document signature for each document from the list of salient terms obtained for the document;
comparing the document signature for a first document to the document signature for a second document; and
declaring the first document to be substantially identical to the second document if the document signatures are substantially equal, wherein the step of comparing includes a preliminary step of sorting the documents into a document list in order of increasing size, and where the step of comparing compares the list of salient terms of a given document with the list of salient terms of the next larger documents in the document list.
Dependent claims: 11, 12, 13, 14, 15

16. A system for processing data representing documents comprising,
for individual documents of a set of documents, a processor that executes a software program to obtain a list of terms found in each document and compares the list of terms for a first document to the list of terms for a second document, said processor operating to declare the first document to be substantially identical to, or substantially similar to, the second document if some predetermined number of terms are found in each of the lists of the first document and the second document;
and wherein said processor is further operable, before comparing the lists of terms, to sort the documents into a document list in order of increasing size, and to then compare the list of terms of a given document with the list of terms of the next larger documents in the document list.

17. A system for processing data representing documents comprising,
for individual documents of a set of documents, a processor that executes a software program to obtain a list of salient terms found in each document and that compares the list of salient terms for a first document to the list of salient terms for a second document, said processor, in response to executing the software program, declaring the first document to be substantially identical to, or substantially similar to, the second document if some predetermined number of salient terms are found in each of the lists of the first document and the second document, wherein said comparing includes sorting the documents into a document list in order of increasing size, and where said comparing compares the list of salient terms for a given document with the list of salient terms for the next larger documents in the document list.

24. A system for processing data representing documents comprising,
for individual documents of a set of documents, a processor that executes a software program to obtain a list of terms found in each document and compares the list of terms for a first document to the list of terms for a second document, said processor, in response to executing the software program, declaring the first document to be substantially identical to, or substantially similar to, the second document if some predetermined number of terms are found in each of the lists of the first document and the second document;
and wherein said processor, before comparing the lists of terms, sorts the documents into a document list in order of increasing size, and compares the list of terms for a given document only with the list of terms for another document in the list that is no more than a predetermined amount larger than the given document.

25. A system for processing data representing documents, comprising, for individual documents of a set of documents, a processor that executes a software program to obtain a list of salient terms found in each document, computes a document signature for each document from the list of salient terms obtained for the document;
compares the document signature for a first document to the document signature for a second document; and
declares the first document to be substantially identical to the second document if the document signatures are equal, wherein said comparing includes sorting the documents into a document list in order of increasing size, and where said comparing compares the list of salient terms for a given document with the list of salient terms for the next larger documents in the document list.
Dependent claims: 26, 27, 28, 29, 30

31. A computer program recorded on a computer-readable media, said computer program comprising instructions for directing a data processor to process data representing documents by, for individual documents of a set of documents, obtaining a list of salient terms found in each document;
comparing the list of salient terms for a first document to the list of salient terms for a second document; and
declaring the first document to be substantially identical to, or substantially similar to, the second document if some predetermined number of salient terms are found in each of the lists of the first document and the second document, wherein said comparing includes sorting the documents into a document list in order of increasing size, and where said comparing compares the list of salient terms for a given document with the list of salient terms for the next larger documents in the document list.

32. A computer program recorded on a computer-readable media, said computer program comprising instructions for directing a data processor to process data representing documents by, for individual ones of documents, obtaining a list of salient terms found in each document;
computing a document signature for each document from the list of salient terms obtained for the document;
comparing the document signature for a first document to the document signature for a second document; and
declaring the first document to be substantially identical to the second document if the document signatures are equal, wherein said comparing includes sorting the documents into a document list in order of increasing size, and where said comparing compares the list of salient terms for a given document with the list of salient terms for the next larger documents in the document list.
Dependent claims: 33
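
The size-ordered comparison recited in the independent claims might be sketched as below. This is an illustrative reading only, not the claimed implementation: the function name `find_similar_pairs`, the `min_shared` count, and the interpretation of "a predetermined amount larger" as a size ratio (`max_size_ratio`) are assumptions introduced for the example.

```python
def find_similar_pairs(docs, min_shared=3, max_size_ratio=1.5):
    """Return pairs of document ids whose term lists share enough terms.

    docs maps a document id to a (size, list_of_terms) tuple.
    Assumption: a document is compared only with later documents in the
    size-ordered list whose size is at most max_size_ratio times its own.
    """
    # Preliminary step: sort the documents in order of increasing size.
    ordered = sorted(docs.items(), key=lambda item: item[1][0])
    pairs = []
    for i, (id_a, (size_a, terms_a)) in enumerate(ordered):
        set_a = set(terms_a)
        # Compare only with the next larger documents in the list.
        for id_b, (size_b, terms_b) in ordered[i + 1:]:
            if size_b > size_a * max_size_ratio:
                break  # all remaining documents are larger still
            if len(set_a & set(terms_b)) >= min_shared:
                pairs.append((id_a, id_b))
    return pairs
```

Because the list is sorted, the inner loop can stop at the first document that exceeds the size limit, which avoids comparing every document with every other one.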
Specification