Method for filtering out identical or similar documents

US 8,185,532 B2
Filed: 09/17/2009
Issued: 05/22/2012
Est. Priority Date: 09/19/2008
Status: Expired due to Fees

Abstract

A method for filtering out identical or similar documents includes storing multiple documents to be filtered as a pat tree (PT) data structure profile based on a pat tree data structure, searching the PT profile for all string nodes with a consecutive character length reaching a lower threshold and for all documents to which those string nodes belong, and finding, among those documents, the documents having identical consecutive characters with a length reaching a higher threshold. Another technical solution includes searching the PT profile for all string nodes with a consecutive character length reaching a lower threshold and for all documents to which those string nodes belong, and finding, among those documents, the documents in which the ratio of the length of the identical consecutive characters to the total character length of the original document reaches a ratio threshold; these documents are treated as similar.
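The two decision rules in the abstract (an absolute length threshold and a length-to-document ratio) can be illustrated for a single pair of documents. The following is a minimal sketch, not the patented implementation: it replaces the pat tree with a brute-force dynamic-programming search for the longest shared run of consecutive characters, and the function names `longest_common_run` and `is_similar` are hypothetical. It also assumes the ratio is taken against the longer of the two documents, so that the threshold is met for each document.

```python
def longest_common_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous character pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def is_similar(a: str, b: str, lower: int, higher: int, ratio: float) -> bool:
    """Apply the lower, higher, and ratio thresholds to one document pair."""
    run = longest_common_run(a, b)
    if run < lower:
        return False  # too little shared text to be a candidate
    if run >= higher:
        return True   # first rule: shared run reaches the higher threshold
    # Second rule (assumption: ratio measured against the longer document).
    longest = max(len(a), len(b))
    return longest > 0 and run / longest >= ratio


doc_a = "the quick brown fox"
doc_b = "a quick brown dog"
print(longest_common_run(doc_a, doc_b))
print(is_similar(doc_a, doc_b, lower=5, higher=20, ratio=0.5))
```

The pairwise comparison costs time proportional to the product of the two document lengths, which is why an indexed structure such as the pat tree named in the abstract would be preferred for large collections.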

8 Claims

1. A method for filtering out identical or similar documents, adapted to find out documents with identical or highly similar contents from a plurality of documents and cluster the documents by using an electronic device, the method comprising:

(a) reading a plurality of documents to be filtered;

(b) converting data structures of the documents to be filtered, and storing the converted data structures together as a preset data structure profile;

(c) setting a lower threshold, representing a minimum consecutive character length;

(d) setting a higher threshold, representing a consecutive character length;

(e) searching for all string nodes (node I) with a consecutive character length reaching the lower threshold in the data structure profile, wherein each of the string nodes stores a document identity (DID) of a document therein;

(f) recording the DID stored in each of the found string nodes (node I) as a string group (G);

(g) setting documents pointed to by all DIDs in the string group (G) to first-type documents, using a string content stored in the string node (node I) as a prefix to find a string node (node I1) with a consecutive character length equal to or higher than the higher threshold, and if the string node exists, marking a string group (G1) stored in the string node with a consecutive character length equal to or higher than the higher threshold as identical or highly similar documents;

(h) finding second-type documents from a cluster formed by the first-type documents, wherein the second-type documents are a cluster of documents with a consecutive character length lower than the higher threshold in the first-type documents;

(i) setting a ratio threshold; and

(j) finding documents having identical consecutive characters with such a length that a ratio of the length of the identical consecutive characters to a total character length of each document reaches the ratio threshold from the cluster of the second-type documents, and setting the found documents to documents with identical or highly similar contents.

2. The method for filtering out identical or similar documents according to claim 1, wherein before the step (a), the method further comprises automatically abstracting contents of the documents to be filtered to generate abstract documents.

3. The method for filtering out identical or similar documents according to claim 1, further comprising: processing synonyms in contents of the documents to be filtered.

4. The method for filtering out identical or similar documents according to claim 2, further comprising: processing synonyms in contents of the abstract documents.

5. The method for filtering out identical or similar documents according to claim 1, 2, 3, or 4, further comprising: removing punctuation marks from contents of the documents to be filtered.

6. The method for filtering out identical or similar documents according to claim 1, wherein after finding out identical or highly similar documents, the method further comprises displaying any one of the identical documents as a search result but not displaying other documents marked as identical or similar documents.

7. The method for filtering out identical or similar documents according to claim 1, wherein the document is selected from a group consisting of a web page, a text, a database, and data stored in other forms.

8. The method for filtering out identical or similar documents according to claim 1, wherein the data structure profile is a pat tree (PT) data structure or a character tree data structure.
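Steps (a) through (j) of claim 1 can be walked through on a toy collection. The sketch below is illustrative only and makes several assumptions that are not in the claims: a dictionary of fixed-length substrings stands in for the pat tree or character tree profile, each dictionary key plays the role of a string node holding its DIDs, the shared run in step (j) is measured with Python's `difflib.SequenceMatcher`, and the ratio is taken against the longer document of each pair. The names `build_profile`, `shared_groups`, and `filter_similar` are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def build_profile(docs, length):
    """(b) Map every substring of `length` characters to the DIDs containing it."""
    profile = defaultdict(set)
    for did, text in docs.items():
        for i in range(len(text) - length + 1):
            profile[text[i:i + length]].add(did)
    return profile


def shared_groups(docs, length):
    """(e)-(f) DID groups for strings of `length` characters found in 2+ documents."""
    return [g for g in build_profile(docs, length).values() if len(g) > 1]


def filter_similar(docs, lower, higher, ratio):
    """Return (DIDs marked by the higher threshold, DIDs marked by the ratio)."""
    # (g) First-type documents share a run of at least `lower` characters.
    first_type = set().union(*shared_groups(docs, lower))
    # (g) Any shared run of `higher` characters begins with a shared run of
    # `lower` characters, so this equals extending each node I as a prefix.
    by_length = set().union(*shared_groups(docs, higher))
    # (h) Second-type documents: first-type documents not marked above.
    second_type = first_type - by_length
    # (i)-(j) Ratio test within the second-type cluster.
    by_ratio = set()
    for a, b in combinations(sorted(second_type), 2):
        ta, tb = docs[a], docs[b]
        matcher = SequenceMatcher(None, ta, tb, autojunk=False)
        run = matcher.find_longest_match(0, len(ta), 0, len(tb)).size
        if run >= lower and run / max(len(ta), len(tb)) >= ratio:
            by_ratio.update((a, b))
    return by_length, by_ratio


# (a) Documents keyed by DID; (c), (d), (i) thresholds chosen for illustration.
docs = {
    1: "alpha beta gamma delta epsilon",
    2: "alpha beta gamma delta epsilon zeta",
    3: "xx beta gamma yy",
    4: "beta gamma",
    5: "unrelated text",
}
by_length, by_ratio = filter_similar(docs, lower=5, higher=20, ratio=0.6)
print(by_length, by_ratio)
```

In this example documents 1 and 2 share a run longer than the higher threshold, documents 3 and 4 are too short for that test but share 10 of at most 16 characters, and document 5 shares no run of 5 characters with any other document, so it never becomes a first-type document.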

Current Assignee: esobi, Inc.
Original Assignee: esobi, Inc.
Inventors: Tsai, Hong Yang; Cho, Hsun Hsueh
Primary Examiner(s): NGUYEN, PHONG H
Application Number: US12/561,843
Publication Number: US 20100082626A1
Time in Patent Office: 978 Days
Field of Search: 707/602, 707/728, 707/749, 707/758, 707/737, 707/754, 715/234
US Class Current: 707/737
CPC Class Codes: G06F 16/335 Filtering based on addition...
