SYSTEM AND METHOD FOR NEAR AND EXACT DE-DUPLICATION OF DOCUMENTS
First Claim
1. A method for identifying near and exact-duplicate documents in a document collection, the method comprising:
for each document in the collection;
reading textual content from the document;
filtering the textual content based on user settings;
determining N most frequent words from the filtered textual content of the document;
performing a quorum search of the N most frequent words in the document with a threshold M; and
sorting results from the quorum search based on relevancy, whereby based on the values of N and M near and exact-duplicate documents are identified in the document collection.
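The filtering and word-frequency steps of the claim can be illustrated with a minimal Python sketch. The claim does not say what the "user settings" are, so the stop-word list and minimum token length below are assumptions, and the names `filter_text`, `top_n_words`, and `DEFAULT_STOP_WORDS` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical "user settings": the claim does not enumerate them.
DEFAULT_STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is"})


def filter_text(text, stop_words=DEFAULT_STOP_WORDS, min_length=2):
    """Lowercase and tokenize the text, then drop stop words and short tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if len(t) >= min_length and t not in stop_words]


def top_n_words(text, n, **settings):
    """Return the n most frequent words of the filtered text."""
    counts = Counter(filter_text(text, **settings))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


words = top_n_words("The cat sat on the mat. The cat ran.", n=2)
```

Breaking ties alphabetically is a design choice made here so that two copies of the same document always yield the same word list; the claim itself does not address ties.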
Abstract
A system, method and computer program product for identifying near and exact-duplicate documents in a document collection, including for each document in the collection, reading textual content from the document; filtering the textual content based on user settings; determining N most frequent words from the filtered textual content of the document; performing a quorum search of the N most frequent words in the document with a threshold M; and sorting results from the quorum search based on relevancy. Based on the values of N and M, near and exact-duplicate documents are identified in the document collection.
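The role of the threshold M can be illustrated with a small Python sketch. It assumes that a quorum search returns every document containing at least M of the N query words, and it approximates relevancy by the number of matched words. Both are assumptions, since the abstract defines neither term, and the function name `quorum_search` and the in-memory `index` are hypothetical stand-ins for a real search engine.

```python
def quorum_search(query_words, index, m):
    """Return (doc_id, matched_count) pairs for documents that contain
    at least m of the query words, most matches first."""
    query = set(query_words)
    hits = []
    for doc_id, words in index.items():
        matched = len(query & set(words))
        if matched >= m:
            hits.append((doc_id, matched))
    # Relevancy is approximated by the match count (an assumption).
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits


# Toy index: document id -> its N most frequent words (N = 4).
index = {
    "a.txt": ["contract", "payment", "vendor", "invoice"],
    "b.txt": ["contract", "payment", "vendor", "refund"],
    "c.txt": ["holiday", "party", "invite", "vendor"],
}

query = index["a.txt"]
near = quorum_search(query, index, m=3)   # looser threshold
exact = quorum_search(query, index, m=4)  # M equal to N
```

In this sketch, lowering M below N admits documents that share most but not all of the frequent words, while M equal to N keeps only documents that share all of them. The query document always matches itself, and a full match on frequent words only marks a candidate; confirming an exact duplicate would need a further comparison.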
33 Claims
1. A method for identifying near and exact-duplicate documents in a document collection, the method comprising:
for each document in the collection;
reading textual content from the document;
filtering the textual content based on user settings;
determining N most frequent words from the filtered textual content of the document;
performing a quorum search of the N most frequent words in the document with a threshold M; and
sorting results from the quorum search based on relevancy, whereby based on the values of N and M near and exact-duplicate documents are identified in the document collection.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A computer program product for identifying near and exact-duplicate documents in a document collection and including one or more computer readable instructions embedded on a computer readable medium and configured to cause one or more computer processors to perform the steps of:
for each document in the collection;
reading textual content from the document;
filtering the textual content based on user settings;
determining N most frequent words from the filtered textual content of the document;
performing a quorum search of the N most frequent words in the document with a threshold M; and
sorting results from the quorum search based on relevancy, whereby based on the values of N and M near and exact-duplicate documents are identified in the document collection.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
23. A system for identifying near and exact-duplicate documents in a document collection, the system comprising:
for each document in the collection;
means for reading textual content from the document;
means for filtering the textual content based on user settings;
means for determining N most frequent words from the filtered textual content of the document;
means for performing a quorum search of the N most frequent words in the document with a threshold M; and
means for sorting results from the quorum search based on relevancy, whereby based on the values of N and M near and exact-duplicate documents are identified in the document collection.
Dependent claims: 24, 25, 26, 27, 28, 29, 30, 31, 32, 33
Specification