Document Comparison Method And Apparatus
First Claim
1. A document comparison and identification method, the method comprising the steps of:
- identifying, in a source document, words of a predetermined number of characters or greater;
generating a list containing the identified words, and excluding identified words from said list that occur with a predetermined frequency or greater in a set of documents to be searched;
searching each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list;
for each of the plurality of documents, determining how many identified words from the list occur in the document; and
calculating a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches.
1 Assignment
0 Petitions
Accused Products
Abstract
A document comparison and identification method comprises the steps of: identifying (S210), in a source document, words of a predetermined number of characters or greater; generating a list containing the identified words (S220), and excluding (S220) identified words occurring with a predetermined frequency or greater throughout a set of documents to be searched; searching (S230) each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list; for each of the plurality of documents, determining (S230) how many identified words from the list occur in the document; and calculating (S240) a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches.
-
Citations
21 Claims
-
1. A document comparison and identification method, the method comprising the steps of:
-
identifying, in a source document, words of a predetermined number of characters or greater; generating a list containing the identified words, and excluding identified words from said list that occur with a predetermined frequency or greater in a set of documents to be searched; searching each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list; for each of the plurality of documents, determining how many identified words from the list occur in the document; and calculating a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A document comparison and identification method, comprising the steps of:
-
performing a first search to identify documents identical to a source document; performing a second search to identify documents having an identical or a similar document name to the source document; performing a third search to identify documents of similar content to the source document; determining a ranking for the results of each of the first, second, and third searches; and presenting results of the first, second, and third searches in accordance with the determined ranking. - View Dependent Claims (10, 11, 12)
-
-
13. A document comparison and identification apparatus comprising:
-
a memory unit for storing data and program instructions; and a processing unit coupled to said memory unit; wherein said processing unit is programmed to; identify, in a source document, words of a predetermined number of characters or greater; generate a list containing the identified words, and exclude identified words from the list that occur with a predetermined frequency or greater in a set of documents to be searched; search each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list; determine, for each of the plurality of documents, how many identified words from the list occur in the document; and calculate a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches. - View Dependent Claims (14, 15)
-
-
16. A document comparison and identification apparatus, comprising:
-
a memory unit for storing data and program instructions; and a processing unit coupled to said memory unit; wherein said processing unit is programmed to; perform a first search to identify documents identical to a source document; perform a second search to identify documents having an identical or a similar document name to the source document; perform a third search to identify documents of similar content to the source document; determine a ranking for the results of each of the first, second, and third searches; and present results of the first, second, and third searches in accordance with the determined ranking. - View Dependent Claims (17, 18)
-
-
19. A computer program product comprising a computer readable medium comprising a computer program recorded therein for document comparison and identification, said computer program product comprising:
-
computer program code means for identifying, in a source document, words of a predetermined number of characters or greater; computer program code means for generating a list containing the identified words, and excluding identified words from said list that occur with a predetermined frequency or greater in a set of documents to be searched; computer program code means for searching each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list; computer program code means for, for each of the plurality of documents, determining how many identified words from the list occur in the document; and computer program code means for calculating a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches.
-
-
20. A computer program product comprising a computer readable medium comprising a computer program recorded therein for document comparison and identification, said computer program product comprising:
-
computer program code means for performing a first search to identify documents identical to a source document; computer program code means for performing a second search to identify documents having an identical or a similar document name to the source document; computer program code means for performing a third search to identify documents of similar content to the source document; computer program code means for determining a ranking for the results of each of the first, second, and third searches; and presenting results of the first, second, and third searches in accordance with the determined ranking. - View Dependent Claims (21)
-
Specification