Method and apparatus for indexing documents for message filtering

US 6,314,421 B1
Filed: 05/12/1998
Issued: 11/06/2001
Est. Priority Date: 05/12/1998
Status: Expired due to Fees

First Claim

Patent Images

1. A method comprising the steps of:

parsing at least a part of a first document into a plurality of elements, wherein the set of possible elements is large enough that multiple randomly-selected elements infrequently occur in unrelated documents;

calculating a deterministic pseudo-random score for each element;

selecting only a small subset of the scores based on values of the scores; and

using the small subset of the scores as multiple indexes into a database of documents to be used to determine a match between the first document and another document, and discarding the remaining scores.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A method and apparatus for a document filtering mechanism is provided. A document indexing mechanism is described. A first document is parsed into elements, and a deterministic pseudo-random score is calculated for at least some of the elements. A subset of the scores is selected based on the values of the scores. The subset of the scores is used as multiple indexes into a database of documents.

126 Citations

25 Claims

1. A method comprising the steps of:
- parsing at least a part of a first document into a plurality of elements, wherein the set of possible elements is large enough that multiple randomly-selected elements infrequently occur in unrelated documents;
  
  calculating a deterministic pseudo-random score for each element;
  
  selecting only a small subset of the scores based on values of the scores; and
  
  using the small subset of the scores as multiple indexes into a database of documents to be used to determine a match between the first document and another document, and discarding the remaining scores.
- View Dependent Claims (2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
- - 2. The method of claim 1 further comprising the steps of:
3. The method of claim 2, wherein said step of performing the comparison comprises the steps of:
- parsing at least part of the document retrieved from the database of documents into elements;
  
  tallying how many elements exist in both the first document and the document retrieved, and how many exist in only one document;
  
  obtaining a tally; and
  
  determining if the documents are substantially identical based on the tally.
4. The method of claim 2 further comprising the steps of:
- selecting a second subset of scores based on the values;
  
  adding the second subset of the scores to the index of the database of documents; and
  
  adding the first document to the database of documents.
5. The method of claim 4 wherein the first subset and the second subset are identical.
7. The method of claim 1, wherein said step of parsing comprises the step of determining N-word sequences contained within the first document.
8. The method of claim 7, wherein N is three.
9. The method of claim 1, wherein said step of calculating a pseudo-random score uses a CRC function having an even distribution of values over all inputs.
10. The method of claim 1, wherein a constant is added to or combined with each element before the score is calculated.
11. The method of claim 10, wherein the constant is an Internet protocol (IP) address or name of a server system that holds the document database.
12. The method of claim 1, wherein the part of the first document that is parsed is the first Y bytes of the document.
13. The method of claim 12, wherein Y is between one kilobyte and one megabyte.
14. The method of claim 1, wherein the subset chosen is the M elements having a highest value.
15. The method of claim 14, wherein M is ten.
16. The method of claim 1, wherein the subset chosen is the M elements having a lowest value.
17. The method of claim 4, wherein the second subset has a different number of elements than the first subset.

6. The method of 2 further comprising the step of:
- identifying the first document as present in the database if there exists in the database a document that compares as substantially identical to the first document and is indexed by at least one score that is also present in the subset of scores calculated for the first document.

18. A method comprising the steps of:
- parsing at least part of a first document having a plurality of words into a first plurality of elements, each element including a plurality of words, adjacent elements sharing words;
  
  parsing at least part of a second document into a second plurality of elements, each element having a plurality of words, adjacent elements sharing words;
  
  retaining a subset of the first plurality of elements and a subset of the second plurality of elements;
  
  tallying how many of the subset of elements exist in both documents and how many exist in only one document;
  
  obtaining a tally; and
  
  determining if the documents are substantially identical based on the tally.
- View Dependent Claims (23, 24)
- - 23. The method of claim 18, wherein each of the words in the document starts an element.
  - 24. The method of claim 18, wherein each two subsequent elements have at least one overlapping word that appears in both elements.

19. An apparatus comprising:
- a parsing unit for parsing at least part of a first document into a plurality of elements, wherein the set of possible elements is large enough that multiple randomly-selected elements occur infrequently in unrelated documents;
  
  a score calculation unit for calculating a deterministic pseudo-random score for each element and selecting a small subset of the scores based on values of the scores, the score calculating unit further to discard the remaining scores; and
  
  a document database including a plurality of documents and using only the small subset of the scores as indexes into the document database.
- View Dependent Claims (20, 21)
- - 20. The apparatus of claim 19, further comprising:
21. The apparatus of claim 20 further comprising:
- a learning unit for selecting a second subset of scores based on the values, adding the second subset of the scores to the index of the document database, and adding the first document to the document database.

22. The apparatus of 20 further comprising:
- a tallying unit for identifying the first document as present in the database if there exists in the document database a document that compares as substantially identical to the first document and is indexed by at least one score that is also present in the subset of scores calculated for the first document.

25. A method comprising the steps of:
- parsing at least a part of a first document into a plurality of multi-word elements, adjacent elements sharing words;
  
  calculating a deterministic pseudo-random score for each element;
  
  selecting only a small subset of the scores based on values of the scores and discarding the remaining scores; and
  
  using the small subset of the scores as multiple indexes into a database of documents, to declare a match between the first document and another document.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
David M. Sharnoff
Original Assignee
David M. Sharnoff
Inventors
Sharnoff, David Muir, Dillon, Matthew D.
Primary Examiner(s)
Alam, Hosain T.

Application Number

US09/076,757
Time in Patent Office

1,274 Days
Field of Search

707/5, 707/6, 707/3-4, 707/100-102, 704/2, 704/7, 704/9, 704/10, 380/29, 380/37, 380/42, 380/28, 709/201
US Class Current

1/1
CPC Class Codes

G06F 16/3346 using probabilistic model

Y10S 707/99935 Query augmenting and refini...

Method and apparatus for indexing documents for message filtering

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

126 Citations

25 Claims

Specification

Use Cases

Quick Links

Others

Method and apparatus for indexing documents for message filtering

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

126 Citations

25 Claims

Specification

Subscription Required

Use Cases

Quick Links

Others