METHOD FOR FILTERING OUT IDENTICAL OR SIMILAR DOCUMENTS

US 20100082626A1
Filed: 09/17/2009
Published: 04/01/2010
Est. Priority Date: 09/19/2008
Status: Active Grant

Abstract

A method for filtering out identical or similar documents includes storing a plurality of documents to be filtered as a pat tree (PT) profile based on a pat tree data structure, searching the PT profile for all string nodes with a consecutive character length reaching a lower threshold and for all documents to which those string nodes belong, and finding, among those documents, the documents having identical consecutive characters with a length reaching a higher threshold. Another technical solution includes searching the PT profile for all string nodes with a consecutive character length reaching a lower threshold and for all documents to which those string nodes belong, and finding, among those documents, the documents whose identical consecutive characters are long enough that the ratio of their length to the total character length of the original document reaches a ratio threshold; these documents are deemed similar.
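
The ratio test in the second solution can be illustrated with a short Python sketch. It is not taken from the patent: the function names `shared_run_ratio` and `similar_by_ratio` are hypothetical, the longest shared run is found with the standard library's `difflib.SequenceMatcher` rather than a pat tree, and the ratio is computed against one document only, which is an assumption about how "the original document" should be read.

```python
from difflib import SequenceMatcher


def shared_run_ratio(doc: str, other: str) -> float:
    """Length of the longest run of identical consecutive characters
    shared by `doc` and `other`, divided by the total length of `doc`."""
    if not doc:
        return 0.0
    matcher = SequenceMatcher(None, doc, other, autojunk=False)
    match = matcher.find_longest_match(0, len(doc), 0, len(other))
    return match.size / len(doc)


def similar_by_ratio(doc: str, other: str, ratio_threshold: float) -> bool:
    """True when the shared run covers at least `ratio_threshold` of `doc`."""
    return shared_run_ratio(doc, other) >= ratio_threshold
```

For example, "abcdefgh" and "xxabcdefyy" share the six-character run "abcdef", so the ratio for the first document is 6/8 = 0.75; it passes a ratio threshold of 0.7 and fails one of 0.8.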

16 Claims

1. A method for filtering out identical or similar documents, adapted to find out documents with identical or highly similar contents from a plurality of documents and cluster the documents by using an electronic device, the method comprising:
- (a) reading a plurality of documents to be filtered;
  
  (b) converting data structures of the documents to be filtered, and storing the converted data structures together as a preset data structure profile;
  
  (c) setting a lower threshold, representing a minimum consecutive character length;
  
  (d) setting a higher threshold, representing a consecutive character length;
  
  (e) searching for all string nodes (node I) with a consecutive character length reaching the lower threshold in the data structure profile, wherein each of the string nodes stores a document identity (DID) of a document therein;
  
  (f) recording the DID stored in each of the found string nodes (node I) as a string group (G); and
  
  (g) setting documents pointed to by all DIDs in the string group (G) to first-type documents, using a string content stored in the string node (node I) as a prefix to find a string node (node I1) with a consecutive character length equal to or higher than the higher threshold, and if the string node exists, marking a string group (G1) stored in the string node with a consecutive character length equal to or higher than the higher threshold as identical or highly similar documents.
  - 2. The method for filtering out identical or similar documents according to claim 1, further comprising:
    - (h) finding second-type documents from a cluster formed by the first-type documents, wherein the second-type documents are a cluster of documents with a consecutive character length lower than the higher threshold in the first-type documents;
      
      (i) setting a ratio threshold; and
      
      (j) finding documents having identical consecutive characters with such a length that a ratio of the length of the identical consecutive characters to a total character length of each document reaches the ratio threshold from the cluster of the second-type documents, and setting the found documents to documents with identical or highly similar contents.
  - 3. The method for filtering out identical or similar documents according to claim 1, wherein before the step (a), the method further comprises automatically abstracting contents of the documents to be filtered to generate abstract documents.
  - 4. The method for filtering out identical or similar documents according to claim 1, further comprising:
    - processing synonyms in contents of the documents to be filtered.
  - 5. The method for filtering out identical or similar documents according to claim 3, further comprising:
    - processing synonyms in contents of the abstract documents.
  - 6. The method for filtering out identical or similar documents according to claim 1, 3, 4, or 5, further comprising:
    - removing punctuation marks from contents of the documents to be filtered.
  - 7. The method for filtering out identical or similar documents according to claim 1, wherein after finding out identical or highly similar documents, the method further comprises displaying any one of the identical documents as a search result but not displaying other documents marked as identical or similar documents.
  - 8. The method for filtering out identical or similar documents according to claim 1, wherein the document is selected from a group consisting of a web page, a text, a database, and data stored in other forms.
  - 9. The method for filtering out identical or similar documents according to claim 1, wherein the data structure profile is a pat tree (PT) data structure or a character tree data structure.
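
Steps (e) through (g) of claim 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: a flat dictionary from fixed-length substrings to sets of DIDs stands in for the string nodes of the data structure profile (a real pat tree would share prefixes between nodes), only strings shared by at least two documents are treated as forming a group, and the names `substring_index` and `find_similar` are hypothetical.

```python
from collections import defaultdict


def substring_index(docs: dict[int, str], length: int) -> dict[str, set[int]]:
    """Map every substring of exactly `length` characters to the DIDs
    of the documents that contain it (a flat stand-in for string nodes)."""
    index: dict[str, set[int]] = defaultdict(set)
    for did, text in docs.items():
        for start in range(len(text) - length + 1):
            index[text[start:start + length]].add(did)
    return index


def find_similar(docs: dict[int, str], lower: int, higher: int):
    """Return (first_type, similar_groups) for the two thresholds."""
    assert 0 < lower <= higher
    low = substring_index(docs, lower)    # stands in for nodes I
    high = substring_index(docs, higher)  # stands in for nodes I1

    # Steps (e)-(f): strings of the lower length shared by several documents;
    # the DIDs stored with each one form a string group G.
    shared_prefixes = {s for s, group in low.items() if len(group) > 1}
    first_type: set[int] = set()
    for s in shared_prefixes:
        first_type |= low[s]

    # Step (g): extend each shared prefix to the higher length. In this flat
    # index the prefix check always succeeds for a shared long string; it is
    # kept only to mirror the prefix walk described in the claim.
    similar_groups: set[frozenset[int]] = set()
    for s, group in high.items():
        if len(group) > 1 and s[:lower] in shared_prefixes:
            similar_groups.add(frozenset(group))
    return first_type, similar_groups
```

With the three documents "the quick brown fox jumps" (DID 1), "a quick brown fox sleeps" (DID 2), and "quick start guide" (DID 3), a lower threshold of 5 and a higher threshold of 12, all three become first-type documents because they share "quick", but only DIDs 1 and 2 are grouped as identical or highly similar.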

10. A method for filtering out identical or similar documents, adapted to find out documents with identical or highly similar contents from a plurality of documents and cluster the documents by using an electronic device, the method comprising:
- (a1) automatically abstracting contents of the plurality of documents to be filtered to generate abstract documents;
  
  (a) reading the abstract documents;
  
  (b) storing the abstract documents as a pat tree (PT) profile based on a PT data structure;
  
  (c) setting a lower threshold, representing a minimum consecutive character length;
  
  (d) setting a higher threshold, representing a consecutive character length;
  
  (e) searching for all string nodes (node I) with a consecutive character length reaching the lower threshold in the PT profile;
  
  (f) recording a document identity (DID) stored in each of the found string nodes (node I) as a string group (G); and
  
  (g) setting documents pointed to by all DIDs in the string group (G) to first-type documents, comparing documents in a cluster of the first-type documents in pairs to find documents having identical consecutive characters with a length reaching the higher threshold from the cluster of the first-type documents, and marking the found documents as documents with identical or highly similar contents.
  - 11. The method for filtering out identical or similar documents according to claim 10, further comprising:
    - (h) finding second-type documents from the cluster of the first-type documents, wherein the second-type documents are a cluster of documents having identical consecutive characters with a length lower than the higher threshold;
      
      (i) setting a ratio threshold; and
      
      (j) finding documents having identical consecutive characters with such a length that a ratio of the length of the identical consecutive characters to a total character length of each document reaches the ratio threshold from the cluster of the second-type documents, and setting the found documents to documents with identical or highly similar contents.
  - 12. The method for filtering out identical or similar documents according to claim 10, further comprising:
    - processing synonyms in contents of the documents to be filtered.
  - 13. The method for filtering out identical or similar documents according to claim 10 or 12, further comprising:
    - removing punctuation marks from contents of the abstract documents.
  - 14. The method for filtering out identical or similar documents according to claim 10, wherein after finding out identical or highly similar documents (or web pages), the method further comprises displaying any one of the identical documents as a search result but not displaying other documents marked as identical or similar documents.
  - 15. The method for filtering out identical or similar documents according to claim 10, wherein the document is selected from a group consisting of a web page, a text, a database, and data stored in other forms.
  - 16. The method for filtering out identical or similar documents according to claim 10, wherein the data structure profile is a character tree data structure.
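
The pairwise comparison in step (g) of claim 10 can be sketched in Python as below. The claim does not say how two documents are compared, so the dynamic-programming longest-common-substring routine here is an assumption chosen for clarity, and the names `longest_common_run` and `pairwise_similar` are hypothetical.

```python
from itertools import combinations


def longest_common_run(a: str, b: str) -> int:
    """Length of the longest run of identical consecutive characters
    that appears in both `a` and `b` (longest common substring)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous character pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def pairwise_similar(first_type: dict[int, str], higher: int) -> list[tuple[int, int]]:
    """Compare first-type documents in pairs and return the DID pairs
    whose longest shared run reaches the higher threshold."""
    pairs = []
    for (did_a, a), (did_b, b) in combinations(sorted(first_type.items()), 2):
        if longest_common_run(a, b) >= higher:
            pairs.append((did_a, did_b))
    return pairs
```

Comparing every pair costs time proportional to the product of the two document lengths, which is one reason to restrict the comparison to the first-type cluster rather than to all documents.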

Current Assignee: esobi, Inc.
Original Assignee: esobi, Inc.
Inventors: Tsai, Hong Yang; Cho, Hsun Hsueh
Granted Patent: US 8,185,532 B2
US Class Current: 707/737
CPC Class Codes: G06F 16/335 Filtering based on addition...
