METHOD FOR FILTERING OUT IDENTICAL OR SIMILAR DOCUMENTS
Abstract
A method for filtering out identical or similar documents includes storing a plurality of documents to be filtered as a pat tree (PT) data structure profile based on a pat tree data structure, searching the PT profile for all string nodes whose consecutive character length reaches a lower threshold and for all documents to which those string nodes belong, and finding, among those documents, the documents that share identical consecutive characters with a length reaching a higher threshold. Another technical solution includes searching the PT profile for all string nodes whose consecutive character length reaches a lower threshold and for all documents to which those string nodes belong, and finding, among those documents, the documents that share identical consecutive characters whose length, taken as a ratio of the total character length of the original document, reaches a ratio threshold; these documents are regarded as similar.
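The ratio test in the second solution can be illustrated with a minimal Python sketch. It assumes that the "original document" is the first of the two documents being compared, and it uses the standard library's `difflib.SequenceMatcher` in place of a PT profile to find the longest run of identical consecutive characters. The names `shared_ratio` and `is_similar_by_ratio` are hypothetical and do not appear in the source.

```python
from difflib import SequenceMatcher


def shared_ratio(doc: str, other: str) -> float:
    """Length of the longest run of identical consecutive characters shared
    by `doc` and `other`, divided by the total character length of `doc`."""
    if not doc:
        return 0.0
    match = SequenceMatcher(None, doc, other, autojunk=False).find_longest_match(
        0, len(doc), 0, len(other)
    )
    return match.size / len(doc)


def is_similar_by_ratio(doc: str, other: str, ratio_threshold: float) -> bool:
    """Assumed reading of the abstract: similar when the ratio reaches the threshold."""
    return shared_ratio(doc, other) >= ratio_threshold
```

For example, "abcdefgh" and "xxcdefyy" share the run "cdef", so four of the eight characters of the first document are covered.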
16 Claims
1. A method for filtering out identical or similar documents, adapted to find documents with identical or highly similar contents from a plurality of documents and cluster the documents by using an electronic device, the method comprising:
(a) reading a plurality of documents to be filtered;
(b) converting data structures of the documents to be filtered, and storing the converted data structures together as a preset data structure profile;
(c) setting a lower threshold, representing a minimum consecutive character length;
(d) setting a higher threshold, representing a consecutive character length;
(e) searching for all string nodes (node I) with a consecutive character length reaching the lower threshold in the data structure profile, wherein each of the string nodes stores a document identity (DID) of a document therein;
(f) recording the DID stored in each of the found string nodes (node I) as a string group (G); and
(g) setting documents pointed to by all DIDs in the string group (G) to first-type documents, using a string content stored in the string node (node I) as a prefix to find a string node (node I1) with a consecutive character length equal to or higher than the higher threshold, and if the string node (node I1) exists, marking the documents in a string group (G1) stored in the string node (node I1) as identical or highly similar documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
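Steps (c) through (g) can be illustrated with a minimal Python sketch. It is not the claimed data structure: a plain dictionary from fixed-length substrings to sets of DIDs stands in for the string nodes of the data structure profile, and only strings shared by at least two documents are treated as candidates (an assumption, since the claim does not state a minimum group size). The names `index_substrings` and `filter_similar` are hypothetical.

```python
from collections import defaultdict


def index_substrings(docs: dict[int, str], length: int) -> dict[str, set[int]]:
    """Map every substring of exactly `length` characters to the DIDs containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for did, text in docs.items():
        for i in range(len(text) - length + 1):
            index[text[i:i + length]].add(did)
    return index


def filter_similar(docs: dict[int, str], lower: int, higher: int):
    """Return (first_type_dids, similar_groups); assumes higher >= lower."""
    low_index = index_substrings(docs, lower)    # stand-in for string nodes (node I)
    high_index = index_substrings(docs, higher)  # stand-in for string nodes (node I1)

    # Steps (e)-(f): strings reaching the lower threshold and their groups G.
    shared_prefixes = {s for s, group in low_index.items() if len(group) > 1}
    first_type = set().union(*(low_index[s] for s in shared_prefixes))

    # Step (g): extend each shared prefix to the higher threshold; the
    # group G1 of any longer shared string is marked as highly similar.
    similar_groups = set()
    for s, group1 in high_index.items():
        if len(group1) > 1 and s[:lower] in shared_prefixes:
            similar_groups.add(frozenset(group1))
    return first_type, similar_groups


docs = {
    1: "the quick brown fox jumps",
    2: "a quick brown fox sleeps",
    3: "quick thinking wins",
    4: "unrelated text",
}
first_type, similar_groups = filter_similar(docs, lower=5, higher=12)
```

A tree-based profile would locate the longer node by descending from the prefix node rather than by building a second index; the dictionary version trades that efficiency for brevity.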
10. A method for filtering out identical or similar documents, adapted to find documents with identical or highly similar contents from a plurality of documents and cluster the documents by using an electronic device, the method comprising:
(a1) automatically abstracting contents of the plurality of documents to be filtered to generate abstract documents;
(a) reading the abstract documents;
(b) storing the abstract documents as a pat tree (PT) profile based on a PT data structure;
(c) setting a lower threshold, representing a minimum consecutive character length;
(d) setting a higher threshold, representing a consecutive character length;
(e) searching for all string nodes (node I) with a consecutive character length reaching the lower threshold in the PT profile;
(f) recording a document identity (DID) stored in each of the found string nodes (node I) as a string group (G); and
(g) setting documents pointed to by all DIDs in the string group (G) to first-type documents, comparing documents in a cluster of the first-type documents in pairs to find documents having identical consecutive characters with a length reaching the higher threshold from the cluster of the first-type documents, and marking the found documents as documents with identical or highly similar contents.
Dependent claims: 11, 12, 13, 14, 15, 16
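The pairwise comparison in step (g) of claim 10 can be illustrated with a minimal Python sketch. The claim does not say how the abstracting of step (a1) is performed, so `abstract` below is a purely hypothetical placeholder that keeps the leading characters of each document. The set of first-type DIDs is taken as given, and `difflib.SequenceMatcher` is an assumed way of measuring the longest run of identical consecutive characters. All function names are illustrative.

```python
from difflib import SequenceMatcher
from itertools import combinations


def abstract(text: str, max_chars: int = 200) -> str:
    """Hypothetical placeholder for step (a1): keep only the leading characters."""
    return text[:max_chars]


def longest_common_run(a: str, b: str) -> int:
    """Length of the longest run of identical consecutive characters in a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def pairwise_similar(abstracts: dict[int, str], first_type: set[int], higher: int) -> set[int]:
    """Compare first-type documents in pairs and return the DIDs of every
    document that shares a run reaching the higher threshold with another."""
    marked: set[int] = set()
    for d1, d2 in combinations(sorted(first_type), 2):
        if longest_common_run(abstracts[d1], abstracts[d2]) >= higher:
            marked.update((d1, d2))
    return marked


texts = {
    1: "the quick brown fox jumps",
    2: "a quick brown fox sleeps",
    3: "quick thinking wins",
}
abstracts = {did: abstract(text) for did, text in texts.items()}
marked = pairwise_similar(abstracts, first_type={1, 2, 3}, higher=12)
```

Restricting the pairwise pass to the first-type cluster keeps the number of comparisons small, because the number of pairs grows quadratically with the number of documents compared.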
Specification