Method for filtering out identical or similar documents

  • US 8,185,532 B2
  • Filed: 09/17/2009
  • Issued: 05/22/2012
  • Est. Priority Date: 09/19/2008
  • Status: Expired due to Fees
First Claim

1. A method for filtering out identical or similar documents, adapted to find out documents with identical or highly similar contents from a plurality of documents and cluster the documents by using an electronic device, the method comprising:

    (a) reading a plurality of documents to be filtered;

    (b) converting data structures of the documents to be filtered, and storing the converted data structures together as a preset data structure profile;

    (c) setting a lower threshold, representing a minimum consecutive character length;

    (d) setting a higher threshold, representing a consecutive character length;

    (e) searching for all string nodes (node I) with a consecutive character length reaching the lower threshold in the data structure profile, wherein each of the string node stores a document identity (DID) of a document therein;

    (f) recording the DID stored in each of the found string nodes (node I) as a string group (G);

    (g) setting documents pointed to by all DIDs in the string group (G) to first-type documents, using a string content stored in the string node (node I) as a prefix to find a string node (node I1) with a consecutive character length equal to or higher than the higher threshold, and if the string node exists, marking a string group (G1) stored in the string node with a consecutive character length equal to or higher than the higher threshold as identical or highly similar documents;

    (h) finding second-type documents from a cluster formed by the first-type documents, wherein the second-type documents are a cluster of documents with a consecutive character length lower than the higher threshold in the first-type documents;

    (i) setting a ratio threshold; and

    (j) finding documents having identical consecutive characters with such a length that a ratio of the length of the identical consecutive characters to a total character length of each document reaches the ratio threshold from the cluster of the second-type documents, and setting the found documents to documents with identical or highly similar contents.
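
Steps (a) through (j) can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not name a concrete data structure, so a plain dictionary mapping every substring of length `lower` to the DIDs that contain it stands in for the "data structure profile", and a pairwise longest-common-run check stands in for the prefix walk to longer string nodes. The names `filter_similar`, `longest_common_run`, `lower`, `higher`, and `ratio` are hypothetical. The sketch also assumes the ratio in step (j) must hold for both documents of a pair, so it divides by the longer document's length.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, Set


def longest_common_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def filter_similar(docs: Dict[str, str], lower: int, higher: int,
                   ratio: float) -> Set[FrozenSet[str]]:
    """Return pairs of DIDs judged identical or highly similar."""
    # Steps (b), (e), (f): assumed profile, substring of length `lower` -> DIDs.
    profile = defaultdict(set)
    for did, text in docs.items():
        for i in range(len(text) - lower + 1):
            profile[text[i:i + lower]].add(did)

    # Step (g), first half: documents sharing a run of at least `lower`
    # characters are first-type candidates.
    candidates = set()
    for group in profile.values():
        candidates.update(combinations(sorted(group), 2))

    similar = set()
    for a, b in candidates:
        run = longest_common_run(docs[a], docs[b])
        if run >= higher:
            # Step (g), second half: shared run reaches the higher threshold.
            similar.add(frozenset((a, b)))
        elif run / max(len(docs[a]), len(docs[b])) >= ratio:
            # Steps (h)-(j): second-type pair accepted by the ratio test
            # (assumption: the ratio is taken against the longer document).
            similar.add(frozenset((a, b)))
    return similar
```

The ratio test matters for short documents: two eight-character strings that share seven consecutive characters can never reach a higher threshold of, say, twelve, yet they are nearly identical, and step (j) is what catches them.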
