
METHOD FOR FILTERING OUT IDENTICAL OR SIMILAR DOCUMENTS

  • US 20100082626A1
  • Filed: 09/17/2009
  • Published: 04/01/2010
  • Est. Priority Date: 09/19/2008
  • Status: Active Grant
First Claim

1. A method for filtering out identical or similar documents, adapted to find out documents with identical or highly similar contents from a plurality of documents and cluster the documents by using an electronic device, the method comprising:

    (a) reading a plurality of documents to be filtered;

    (b) converting data structures of the documents to be filtered, and storing the converted data structures together as a preset data structure profile;

    (c) setting a lower threshold, representing a minimum consecutive character length;

    (d) setting a higher threshold, representing a consecutive character length;

    (e) searching for all string nodes (node I) with a consecutive character length reaching the lower threshold in the data structure profile, wherein each of the string nodes stores a document identity (DID) of a document therein;

    (f) recording the DID stored in each of the found string nodes (node I) as a string group (G); and

    (g) setting documents pointed to by all DIDs in the string group (G) to first-type documents, using a string content stored in the string node (node I) as a prefix to find a string node (node I1) with a consecutive character length equal to or higher than the higher threshold, and if the string node exists, marking a string group (G1) stored in the string node with a consecutive character length equal to or higher than the higher threshold as identical or highly similar documents.
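
Steps (a) through (g) can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not say what the "preset data structure profile" is, so this sketch assumes a plain dictionary that maps each run of consecutive characters to the set of DIDs containing it. The names build_profile and filter_similar are hypothetical, as is the rule that a string node only counts when at least two documents share it.

```python
from collections import defaultdict


def build_profile(docs, length):
    """Steps (a)-(b): map every run of `length` consecutive characters
    to the set of DIDs whose text contains it (assumed profile layout)."""
    profile = defaultdict(set)
    for did, text in docs.items():
        for i in range(len(text) - length + 1):
            profile[text[i:i + length]].add(did)
    return profile


def filter_similar(docs, lower, higher):
    """Steps (c)-(g): return (first_type, similar_groups).

    first_type     -- DIDs sharing a string of `lower` characters
    similar_groups -- sets of DIDs sharing a string of `higher` characters
    """
    if lower > higher:
        raise ValueError("lower threshold must not exceed higher threshold")

    low_nodes = build_profile(docs, lower)    # candidate "node I" entries
    high_nodes = build_profile(docs, higher)  # candidate "node I1" entries

    # Index the long strings by their first `lower` characters so that a
    # short string can be used as a prefix lookup key, as in step (g).
    by_prefix = defaultdict(list)
    for long_string in high_nodes:
        by_prefix[long_string[:lower]].append(long_string)

    first_type = set()
    similar_groups = set()
    for prefix, group in low_nodes.items():
        if len(group) < 2:  # assumption: a group needs two or more documents
            continue
        first_type |= group  # string group G
        for long_string in by_prefix.get(prefix, []):
            g1 = high_nodes[long_string]  # string group G1
            if len(g1) >= 2:
                similar_groups.add(frozenset(g1))
    return first_type, similar_groups
```

Calling filter_similar with a dictionary of DID to text, a lower threshold and a higher threshold returns the first-type documents and the groups marked as identical or highly similar. A dictionary of fixed-length strings keeps the sketch short; a suffix tree or similar structure would avoid storing every substring separately.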
