System for similar document detection

US 7,660,819 B1
Filed: 07/31/2000
Issued: 02/09/2010
Est. Priority Date: 07/31/2000
Status: Expired due to Fees

21 Assignments

0 Petitions

Abstract

A document is compared to the documents in a document collection using a hash algorithm and collection statistics to detect if the document is similar to any of the documents in the document collection.

58 Citations

9 Claims

1. A method for detecting similar documents using a computer, comprising:
- obtaining, using the computer, a document;
  
  parsing, using the computer, the document to remove formatting and to obtain a token stream, the token stream comprising a plurality of tokens;
  
  retaining, using the computer, only retained tokens in the token stream by using at least one token threshold;
  
  reordering, using the computer, the retained tokens to obtain an arranged token stream;
  
  processing, using the computer, in turn each retained token in the arranged token stream using a hash algorithm to obtain a single hash value for the document;
  
  generating, using the computer, a document identifier for the document;
  
  forming, using the computer, a single tuple for the document, the tuple comprising the document identifier for the document and the hash value for the document;
  
  inserting, using the computer, the tuple for the document into a document storage tree, the document storage tree comprising a plurality of tuples, each tuple located at a bucket of the document storage tree, each tuple in the plurality of tuples representing one of a plurality of documents, each tuple in the plurality of tuples comprising a document identifier and a hash value; and
  
  determining, using the computer, if the tuple for the document is co-located with another tuple at a same bucket in the document storage tree, and detecting if the document is similar to another document represented by the another tuple in a computer readable recording medium storing the document storage tree.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. A computer-readable storage medium having software stored therein for causing a computer to perform operations in accordance with claim 1.
  - 3. A method as claimed in claim 1, wherein reordering is based on Unicode ordering.
  - 4. A method as claimed in claim 1, wherein reordering is based on EBCDIC ordering.
  - 5. A method as claimed in claim 1, wherein reordering is based on ASCII ordering.
  - 6. A method as claimed in claim 1, wherein reordering is based on collection statistic measurements.
  - 7. A method as claimed in claim 6, wherein collection statistic measurements are determined based on an inverse document frequency.
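
The steps recited in claim 1, with the Unicode ordering of claim 3 and the inverse document frequency of claim 7, can be illustrated with a minimal sketch. It is not the patented implementation. The names `parse`, `retain`, `document_hash`, `fingerprint`, and `DocumentStore` are hypothetical, as are the tag-stripping regular expression, the single `min_idf` threshold, the collapsing of duplicate tokens, and the choice of SHA-1. A dictionary of buckets keyed by hash value stands in for the document storage tree.

```python
import hashlib
import math
import re
from collections import defaultdict


def parse(document):
    """Drop tag-like formatting and return a lowercase token stream."""
    text = re.sub(r"<[^>]+>", " ", document)
    return re.findall(r"[a-z0-9]+", text.lower())


def retain(tokens, doc_freq, num_docs, min_idf):
    """Keep tokens whose IDF meets one threshold (threshold choice is an assumption)."""
    kept = set()  # assumption: repeated tokens collapse to one occurrence
    for tok in tokens:
        idf = math.log(num_docs / (1 + doc_freq.get(tok, 0)))
        if idf >= min_idf:
            kept.add(tok)
    return kept


def document_hash(retained):
    """Feed the retained tokens, in code-point order, into one running hash."""
    h = hashlib.sha1()
    for tok in sorted(retained):  # Python sorts str by Unicode code point
        h.update(tok.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


def fingerprint(document, doc_freq, num_docs, min_idf):
    """Parse, retain, reorder, and hash a document into a single value."""
    return document_hash(retain(parse(document), doc_freq, num_docs, min_idf))


class DocumentStore:
    """Hypothetical stand-in for the storage tree: buckets keyed by hash value."""

    def __init__(self):
        self.buckets = defaultdict(list)

    def insert(self, doc_id, hash_value):
        """Insert the (doc_id, hash) tuple; return ids already in that bucket."""
        bucket = self.buckets[hash_value]
        similar = [other_id for other_id, _ in bucket]
        bucket.append((doc_id, hash_value))
        return similar
```

Under these assumptions, two documents that differ only in formatting, word order, or very common words produce the same hash value and land in the same bucket, so `insert` reports the earlier document as similar.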

8. A method for detecting similar documents using a computer, comprising:
- obtaining, using the computer, a document;
  
  filtering, using the computer, the document to eliminate tokens based on parts of speech and obtain a filtered document;
  
  generating, using the computer, a single tuple for the filtered document;
  
  comparing, using the computer, the tuple for the filtered document with a document storage structure comprising a plurality of tuples, each tuple in the plurality of tuples representing one of a plurality of documents; and
  
  determining, using the computer, if the tuple for the filtered document is clustered with another tuple in the document storage structure, and detecting if the document is similar to another document represented by the another tuple in a computer readable recording medium storing the document storage structure.

9. A computer-readable storage medium having a program stored therein for causing a computer to execute operations including detecting similar documents comprising:
- obtaining a document;
  
  filtering the document to eliminate tokens based on parts of speech and obtain a filtered document;
  
  generating a single tuple for the filtered document;
  
  comparing the tuple for the filtered document with a document storage structure comprising a plurality of tuples, each tuple in the plurality of tuples representing one of a plurality of documents; and
  
  determining if the tuple for the filtered document is clustered with another tuple in the document storage structure, based on the comparison, and detecting if the document is similar to another document represented by the another tuple in a computer readable recording medium storing the document storage structure.
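
Claims 8 and 9 filter tokens by part of speech before forming the tuple. The sketch below illustrates that idea only. The claims do not say which parts of speech are eliminated or how tuples are clustered, so everything here is an assumption: the toy lexicon `TOY_POS` replaces a real part-of-speech tagger, only nouns and verbs are kept, unknown words are treated as nouns, and two tuples count as clustered when their hash values are equal. The function names are hypothetical.

```python
import hashlib
import re

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_POS = {
    "the": "DET", "a": "DET", "quick": "ADJ", "lazy": "ADJ",
    "quickly": "ADV", "over": "ADP", "fox": "NOUN", "dog": "NOUN",
    "jumps": "VERB", "chased": "VERB",
}
KEEP = {"NOUN", "VERB"}  # assumption: parts of speech that survive filtering


def filter_by_pos(text, lexicon=TOY_POS, keep=KEEP):
    """Return the tokens whose part of speech is in `keep`, in document order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Assumption: words missing from the lexicon are treated as nouns.
    return [t for t in tokens if lexicon.get(t, "NOUN") in keep]


def make_tuple(doc_id, filtered_tokens):
    """Form one (doc_id, hash) tuple from the filtered document."""
    h = hashlib.sha1()
    for tok in sorted(set(filtered_tokens)):
        h.update(tok.encode("utf-8"))
        h.update(b"\x00")  # separator between tokens
    return (doc_id, h.hexdigest())


def find_similar(new_tuple, stored_tuples):
    """Return ids of stored tuples whose hash equals the new tuple's hash."""
    _, new_hash = new_tuple
    return [doc_id for doc_id, h in stored_tuples if h == new_hash]
```

With this lexicon, "The quick fox jumps over the lazy dog" and "A dog! The fox quickly jumps." both reduce to the tokens fox, jumps, and dog, so their tuples carry the same hash value.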

Current Assignee: Alion Science and Technology Corporation
Original Assignee: Alion Science and Technology Corporation
Inventors: Frieder, Ophir; Chowdhury, Abdur R.
Primary Examiner(s): Le, Uyen T.
Application Number: US09/629,175
Time in Patent Office: 3,480 Days
Field of Search: 707/1-10
US Class Current: 1/1
CPC Class Codes:
- G06F 16/3346 using probabilistic model
- Y10S 707/99948 Application of database or ...
