Method and apparatus for indexing documents for message filtering
First Claim
Patent Images
1. A method comprising the steps of:
- parsing at least a part of a first document into a plurality of elements, wherein the set of possible elements is large enough that multiple randomly-selected elements infrequently occur in unrelated documents;
calculating a deterministic pseudo-random score for each element;
selecting only a small subset of the scores based on values of the scores; and
using the small subset of the scores as multiple indexes into a database of documents to be used to determine a match between the first document and another document, and discarding the remaining scores.
1 Assignment
0 Petitions
Accused Products
Abstract
A method and apparatus for a document filtering mechanism is provided. A document indexing mechanism is described. A first document is parsed into elements, and a deterministic pseudo-random score is calculated for at least some of the elements. A subset of the scores is selected based on the values of the scores. The subset of the scores is used as multiple indexes into a database of documents.
126 Citations
25 Claims
-
1. A method comprising the steps of:
-
parsing at least a part of a first document into a plurality of elements, wherein the set of possible elements is large enough that multiple randomly-selected elements infrequently occur in unrelated documents;
calculating a deterministic pseudo-random score for each element;
selecting only a small subset of the scores based on values of the scores; and
using the small subset of the scores as multiple indexes into a database of documents to be used to determine a match between the first document and another document, and discarding the remaining scores. - View Dependent Claims (2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
determining a match between the first document and a document in the database of documents using some or all of the subset of scores of the first document; and
performing a comparison between each of the documents retrieved and the first document to determine if the first document is present in the database of documents.
-
-
3. The method of claim 2, wherein said step of performing the comparison comprises the steps of:
-
parsing at least part of the document retrieved from the database of documents into elements;
tallying how many elements exist in both the first document and the document retrieved, and how many exist in only one document;
obtaining a tally; and
determining if the documents are substantially identical based on the tally.
-
-
4. The method of claim 2 further comprising the steps of:
-
selecting a second subset of scores based on the values;
adding the second subset of the scores to the index of the database of documents; and
adding the first document to the database of documents.
-
-
5. The method of claim 4 wherein the first subset and the second subset are identical.
-
7. The method of claim 1, wherein said step of parsing comprises the step of determining N-word sequences contained within the first document.
-
8. The method of claim 7, wherein N is three.
-
9. The method of claim 1, wherein said step of calculating a pseudo-random score uses a CRC function having an even distribution of values over all inputs.
-
10. The method of claim 1, wherein a constant is added to or combined with each element before the score is calculated.
-
11. The method of claim 10, wherein the constant is an Internet protocol (IP) address or name of a server system that holds the document database.
-
12. The method of claim 1, wherein the part of the first document that is parsed is the first Y bytes of the document.
-
13. The method of claim 12, wherein Y is between one kilobyte and one megabyte.
-
14. The method of claim 1, wherein the subset chosen is the M elements having a highest value.
-
15. The method of claim 14, wherein M is ten.
-
16. The method of claim 1, wherein the subset chosen is the M elements having a lowest value.
-
17. The method of claim 4, wherein the second subset has a different number of elements than the first subset.
-
6. The method of 2 further comprising the step of:
identifying the first document as present in the database if there exists in the database a document that compares as substantially identical to the first document and is indexed by at least one score that is also present in the subset of scores calculated for the first document.
-
18. A method comprising the steps of:
-
parsing at least part of a first document having a plurality of words into a first plurality of elements, each element including a plurality of words, adjacent elements sharing words;
parsing at least part of a second document into a second plurality of elements, each element having a plurality of words, adjacent elements sharing words;
retaining a subset of the first plurality of elements and a subset of the second plurality of elements;
tallying how many of the subset of elements exist in both documents and how many exist in only one document;
obtaining a tally; and
determining if the documents are substantially identical based on the tally. - View Dependent Claims (23, 24)
-
-
19. An apparatus comprising:
-
a parsing unit for parsing at least part of a first document into a plurality of elements, wherein the set of possible elements is large enough that multiple randomly-selected elements occur infrequently in unrelated documents;
a score calculation unit for calculating a deterministic pseudo-random score for each element and selecting a small subset of the scores based on values of the scores, the score calculating unit further to discard the remaining scores; and
a document database including a plurality of documents and using only the small subset of the scores as indexes into the document database. - View Dependent Claims (20, 21)
retrieving documents from the document database using some or all of the subset of scores of the first document; and
performing a comparison between each of the documents retrieved and the first document to determine if the first document is present in the document database.
-
-
21. The apparatus of claim 20 further comprising:
a learning unit for selecting a second subset of scores based on the values, adding the second subset of the scores to the index of the document database, and adding the first document to the document database.
-
22. The apparatus of 20 further comprising:
a tallying unit for identifying the first document as present in the database if there exists in the document database a document that compares as substantially identical to the first document and is indexed by at least one score that is also present in the subset of scores calculated for the first document.
-
25. A method comprising the steps of:
-
parsing at least a part of a first document into a plurality of multi-word elements, adjacent elements sharing words;
calculating a deterministic pseudo-random score for each element;
selecting only a small subset of the scores based on values of the scores and discarding the remaining scores; and
using the small subset of the scores as multiple indexes into a database of documents, to declare a match between the first document and another document.
-
Specification