Method for filtering out identical or similar documents
First Claim
1. A method for filtering out identical or similar documents, adapted to find out documents with identical or highly similar contents from a plurality of documents and cluster the documents by using an electronic device, the method comprising:
(a) reading a plurality of documents to be filtered;
(b) converting data structures of the documents to be filtered, and storing the converted data structures together as a preset data structure profile;
(c) setting a lower threshold, representing a minimum consecutive character length;
(d) setting a higher threshold, representing a consecutive character length;
(e) searching for all string nodes (node I) with a consecutive character length reaching the lower threshold in the data structure profile, wherein each of the string nodes stores a document identity (DID) of a document therein;
(f) recording the DID stored in each of the found string nodes (node I) as a string group (G);
(g) setting documents pointed to by all DIDs in the string group (G) to first-type documents, using a string content stored in the string node (node I) as a prefix to find a string node (node I1) with a consecutive character length equal to or higher than the higher threshold, and if the string node exists, marking a string group (G1) stored in the string node with a consecutive character length equal to or higher than the higher threshold as identical or highly similar documents;
(h) finding second-type documents from a cluster formed by the first-type documents, wherein the second-type documents are a cluster of documents with a consecutive character length lower than the higher threshold in the first-type documents;
(i) setting a ratio threshold; and
(j) finding documents having identical consecutive characters with such a length that a ratio of the length of the identical consecutive characters to a total character length of each document reaches the ratio threshold from the cluster of the second-type documents, and setting the found documents to documents with identical or highly similar contents.
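The three thresholds in steps (c) through (j) can be illustrated with a minimal Python sketch. It is not the claimed implementation: instead of the preset data structure profile, it compares documents pairwise with a dynamic-programming longest-common-substring routine, which is a swapped-in technique chosen only for brevity. The names `longest_common_run` and `filter_similar` are hypothetical, and the sketch assumes that "each document" in step (j) means both documents of a pair must reach the ratio threshold.

```python
from itertools import combinations


def longest_common_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous character of both strings.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def filter_similar(docs: dict, lower: int, higher: int, ratio: float) -> set:
    """Return pairs of document IDs judged identical or highly similar.

    Assumption: the ratio test must hold for both documents of a pair.
    """
    if lower < 1:
        raise ValueError("lower threshold must be at least 1")
    similar = set()
    for (id_a, text_a), (id_b, text_b) in combinations(docs.items(), 2):
        run = longest_common_run(text_a, text_b)
        if run < lower:
            continue  # below the lower threshold: not first-type documents
        if run >= higher:
            # Shared run reaches the higher threshold: marked similar directly.
            similar.add(frozenset((id_a, id_b)))
        elif run / len(text_a) >= ratio and run / len(text_b) >= ratio:
            # Second-type documents: decided by the ratio threshold instead.
            similar.add(frozenset((id_a, id_b)))
    return similar


if __name__ == "__main__":
    sample = {
        "d1": "the quick brown fox jumps",
        "d2": "xx the quick brown fox yy",
        "d3": "abcdef",
        "d4": "abcdeg",
    }
    print(filter_similar(sample, lower=5, higher=15, ratio=0.8))
```

The pairwise comparison is quadratic in the number of documents, which is the cost a shared index such as the data structure profile of step (b) is meant to avoid.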
Abstract
A method for filtering out identical or similar documents includes storing multiple documents to be filtered as a PAT tree (PT) data structure profile, searching the PT profile for all string nodes with a consecutive character length reaching a lower threshold and for all documents to which those string nodes belong, and finding, among those documents, the documents that share identical consecutive characters with a length reaching a higher threshold. Another technical solution performs the same search against the lower threshold and then finds, among the resulting documents, those sharing identical consecutive characters whose length, as a ratio of the total character length of the original document, reaches a ratio threshold; these documents are deemed similar.
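The idea of string nodes that store document identities can be sketched as follows. This is a simplified stand-in, not the PT profile itself: it uses a plain, uncompressed suffix trie truncated at a maximum depth, whereas a PAT tree is a more compact structure. The names `Node`, `build_profile`, and `shared_strings` are hypothetical and chosen for illustration.

```python
class Node:
    """One string node: its children and the IDs of documents containing its string."""

    def __init__(self):
        self.children = {}
        self.dids = set()


def build_profile(docs: dict, max_depth: int) -> Node:
    """Insert every suffix of every document, truncated to max_depth characters."""
    root = Node()
    for did, text in docs.items():
        for start in range(len(text)):
            node = root
            for ch in text[start:start + max_depth]:
                node = node.children.setdefault(ch, Node())
                node.dids.add(did)
    return root


def shared_strings(root: Node, length: int) -> dict:
    """Map each string of the given length found in two or more documents to their IDs."""
    found = {}
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if len(prefix) == length:
            found[prefix] = set(node.dids)
            continue
        for ch, child in node.children.items():
            # A descendant's ID set is a subset of its ancestor's, so
            # branches seen in fewer than two documents can be pruned.
            if len(child.dids) >= 2:
                stack.append((child, prefix + ch))
    return found


if __name__ == "__main__":
    docs = {1: "abcdefgh", 2: "xxabcdefyy", 3: "abcq"}
    profile = build_profile(docs, max_depth=6)
    print(shared_strings(profile, 3))  # candidates at a lower threshold of 3
    print(shared_strings(profile, 6))  # survivors at a higher threshold of 6
```

Because every node at the higher threshold is reached through a node at the lower threshold, the second query only ever narrows the document groups found by the first.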
8 Claims
1. Independent claim, set out in full under "First Claim" above. Dependent claims: 2, 3, 4, 5, 6, 7, 8.
Specification