Systems and methods for identifying similar documents
Abstract
The present invention provides systems and methods for identifying similar documents. In an embodiment, the present invention identifies similar documents by (1) receiving document text for a current document that includes at least one word; (2) calculating a prominence score and a descriptiveness score for each word and each pair of consecutive words; (3) calculating a comparison metric for the current document; (4) finding at least one potential document, where document text for each potential document includes at least one of the words; and (5) analyzing each potential document to identify at least one similar document.
263 Citations
25 Claims
1. A computer-implemented method for identifying similar documents, comprising:
(a) receiving document text for a current document that includes at least two words;
(b) calculating:
a prominence score for each word, wherein the prominence score for each word is based on a term weight and a non-compound likelihood for each word,
a prominence score for each pair of consecutive words, wherein the prominence score for each pair of consecutive words is based on a term weight and a compound probability for each pair of consecutive words,
a descriptiveness score for each word wherein the descriptiveness score for each word is based on a corpus, and
a descriptiveness score for each pair of consecutive words, wherein the descriptiveness score for each pair of consecutive words is based on the corpus;
(c) calculating a comparison metric for the current document, wherein the comparison metric is based on the combination of each prominence score and each descriptiveness score;
(d) finding, using a query processor, at least one potential document, wherein document text for each potential document includes at least one word from the current document; and
(e) analyzing each found potential document to identify at least one similar document as a function of a comparison metric for the respective potential document and the comparison metric for the current document.
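Step (d) of claim 1 only requires that each potential document share at least one word with the current document. The following is a minimal sketch of one way a query processor might do this, assuming an in-memory inverted index. The claim does not name an index structure, and `build_inverted_index`, `find_potential_documents`, and the sample corpus are hypothetical names and data introduced for illustration.

```python
from collections import defaultdict


def build_inverted_index(corpus):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def find_potential_documents(current_text, index):
    """Return ids of documents sharing at least one word with current_text."""
    candidates = set()
    for word in set(current_text.lower().split()):
        candidates |= index.get(word, set())
    return candidates


# Hypothetical three-document corpus, for illustration only.
corpus = {
    "d1": "solar panel efficiency",
    "d2": "wind turbine blades",
    "d3": "panel discussion notes",
}
index = build_inverted_index(corpus)
potential = find_potential_documents("solar panel cost", index)
```

Here `potential` contains `d1` and `d3`, which share "solar" or "panel" with the query text, while `d2` shares no word and is never scored.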
11. A system for identifying similar documents, comprising:
a processor and a memory;
a query processor using the processor and the memory;
a compound likelihood database, that stores bigram probabilities for a first set of words;
a descriptive database, that stores descriptiveness scores for a second set of words, wherein the first set of words and the second set of words includes common words; and
a similar document identifier configured to receive document text for a current document, wherein the document text includes at least two document words, wherein the similar document identifier uses the query processor, the query processor configured to identify at least one similar document based on prominence scores and descriptiveness scores for each document word and pair of consecutive document words, wherein the prominence scores are based on results from a query to the compound likelihood database and the descriptiveness scores are received based on a query to the descriptive database, and wherein the prominence score for each word is further based on a term weight and a non-compound likelihood for each word;
the prominence score for each pair of consecutive words is based on a term weight and a compound probability for each pair of consecutive words;
the descriptiveness score for each word is based on a corpus, and
the descriptiveness score for each pair of consecutive words is based on the corpus.
12. The system of claim 11, wherein the similar document identifier further comprises:
a feature extractor, configured to determine the comparison metric for the current document and the potential document, wherein the comparison metric is based on a combination of the prominence scores and the descriptiveness scores for each document word and pair of consecutive document words; and
a prominence calculator, configured to calculate the prominence scores based on the results from the compound likelihood database.
13. The system of claim 12, wherein the query processor identifying each similar document comprises:
calculating, for each potential document, a dot product of a scoring vector for the respective potential document and a scoring vector for the current document, wherein the scoring vectors are based on each document's respective comparison metric; and
using the calculated scoring vectors to identify a similar document.
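The dot product recited in claim 13 can be sketched as follows, assuming each scoring vector is stored sparsely as a dictionary from a term (a word or a pair of consecutive words) to its score. That representation, the function name `dot_product`, and the sample scores are assumptions made for illustration and are not taken from the claim.

```python
def dot_product(vec_a, vec_b):
    """Dot product of two sparse vectors stored as {term: score} dicts."""
    # Iterate over the shorter vector; terms absent from either side contribute 0.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    return sum(score * vec_b[term] for term, score in vec_a.items() if term in vec_b)


# Hypothetical scoring vectors, for illustration only.
current = {"solar": 0.8, "solar panel": 1.2, "cost": 0.3}
candidates = {
    "d1": {"solar": 0.5, "solar panel": 1.0},
    "d3": {"panel": 0.4, "notes": 0.2},
}

scores = {doc_id: dot_product(current, vec) for doc_id, vec in candidates.items()}
best = max(scores, key=scores.get)
```

Only terms present in both vectors contribute, so a candidate that shares no scored term with the current document gets a dot product of zero.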
14. The system of claim 12, wherein the query processor identifying each similar document comprises:
identifying the at least one similar document based on the comparison metric of each potential document, wherein potential documents with a comparison metric above a threshold are selected for identification.
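The threshold selection in claim 14 can be sketched as a simple filter. This assumes each potential document already has a single numeric comparison value. The function name `select_similar`, the threshold of 0.5, and the sample values are hypothetical.

```python
def select_similar(metric_by_doc, threshold):
    """Return ids whose comparison value is strictly above threshold, best first."""
    selected = [doc_id for doc_id, metric in metric_by_doc.items() if metric > threshold]
    return sorted(selected, key=lambda doc_id: metric_by_doc[doc_id], reverse=True)


# Hypothetical comparison values for three potential documents.
metrics = {"d1": 1.6, "d2": 0.05, "d3": 0.9}
similar = select_similar(metrics, threshold=0.5)
```

The claim only says "above a threshold"; treating the comparison as strict and sorting the survivors are design choices of this sketch.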
15. The system of claim 13, wherein calculating the dot product further comprises:
(i) calculating a normalization factor for each scoring vector, wherein the normalization factor converts each scoring vector into unit length;
(ii) calculating a quality score for each scoring vector based on term weights of each respective word; and
(iii) multiplying the dot product by the quality score and the inverse of the normalization factor for the respective potential document and by the quality score and the inverse of the normalization factor for the current document.
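Steps (i) to (iii) of claim 15 can be read as a cosine-style similarity scaled by a per-document quality score. The sketch below assumes the normalization factor is the Euclidean length of the scoring vector and takes each quality score as a precomputed input, because the claim does not give a formula for it. `vector_length` and `adjusted_similarity` are hypothetical names.

```python
import math


def vector_length(vec):
    """Euclidean length of a sparse {term: score} vector (assumed normalization factor)."""
    return math.sqrt(sum(score * score for score in vec.values()))


def adjusted_similarity(vec_a, quality_a, vec_b, quality_b):
    """Dot product times quality / length for each of the two vectors."""
    dot = sum(score * vec_b.get(term, 0.0) for term, score in vec_a.items())
    len_a, len_b = vector_length(vec_a), vector_length(vec_b)
    if len_a == 0.0 or len_b == 0.0:
        return 0.0  # an empty vector cannot be scaled to unit length
    return dot * (quality_a / len_a) * (quality_b / len_b)


# Two identical vectors of length 5; quality scores are hypothetical inputs.
current = {"x": 3.0, "y": 4.0}
candidate = {"x": 3.0, "y": 4.0}
similarity = adjusted_similarity(current, 1.0, candidate, 0.5)
```

With both quality scores set to 1.0 this reduces to ordinary cosine similarity. In the example the identical vectors would score 1.0, and the candidate's quality score of 0.5 halves that to 0.5.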
16. The system of claim 11, wherein the non-compound likelihood for each word is based on a probability that each respective word and a word to the left of the respective word do not combine to form a word compound and a probability that each respective word and a word to the right of the respective word do not combine to form a word compound.
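Claim 16 bases the non-compound likelihood on two probabilities: that a word does not form a compound with its left neighbor, and that it does not form one with its right neighbor. A minimal sketch follows, assuming the two probabilities are combined by multiplication and that a word at the edge of the text has no neighbor on that side. Both choices, and the name `non_compound_likelihood`, are assumptions; the claim does not specify how the probabilities are combined.

```python
def non_compound_likelihood(words, i, compound_prob):
    """Likelihood that words[i] stands alone rather than joining a neighbor.

    compound_prob maps (left_word, right_word) to the probability that the
    pair forms a compound; unseen pairs default to 0.0.
    """
    left = compound_prob.get((words[i - 1], words[i]), 0.0) if i > 0 else 0.0
    right = compound_prob.get((words[i], words[i + 1]), 0.0) if i < len(words) - 1 else 0.0
    # Assumed combination: product of the two "does not combine" probabilities.
    return (1.0 - left) * (1.0 - right)


# Hypothetical compound probabilities, for illustration only.
words = ["new", "york", "city"]
compound_prob = {("new", "york"): 0.9, ("york", "city"): 0.5}
york_alone = non_compound_likelihood(words, 1, compound_prob)
new_alone = non_compound_likelihood(words, 0, compound_prob)
```

A word that is very likely part of a compound on either side, such as "york" here, ends up with a low stand-alone likelihood.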
17. The system of claim 11, wherein the compound probability for each pair of consecutive words comprises a number of times each respective word pair appears in a training corpus and a number of times a first word and a second word of each respective word pair appear within a certain distance of each other in the training corpus.
18. The system of claim 17, wherein the term weight is based on a location of the respective word in the document text.
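Claim 18 ties the term weight to where a word occurs in the document text. A minimal sketch, assuming words near the start of the text count more than later ones and that weights of repeated words are summed. The cutoff `lead_length`, the two weight values, and the name `term_weights` are hypothetical, and the claim does not say which locations are weighted more heavily.

```python
def term_weights(words, lead_length=5, lead_weight=2.0, body_weight=1.0):
    """Sum a position-dependent weight over every occurrence of each word."""
    weights = {}
    for position, word in enumerate(words):
        weight = lead_weight if position < lead_length else body_weight
        weights[word] = weights.get(word, 0.0) + weight
    return weights


# Hypothetical text; only the first two positions count as the lead here.
words = "solar panels cut costs solar".split()
weights = term_weights(words, lead_length=2)
```

"solar" appears once in the lead and once in the body, so it accumulates 2.0 + 1.0 = 3.0, while "cut" appears only in the body and gets 1.0.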
19. A computer-implemented method for scoring documents, comprising:
(a) receiving document text for a current document that includes at least two words;
(b) calculating a prominence score and a descriptiveness score for each word from the current document and each pair of consecutive words from the current document, wherein:
the prominence score for each word is based on a term weight and a non-compound likelihood for each word,
the prominence score for each pair of consecutive words is based on a term weight and a bigram probability for each pair of consecutive words,
a descriptiveness score for each word is based on a corpus,
a descriptiveness score for each pair of consecutive words is based on the corpus; and
(c) creating a scoring vector for the current document, wherein the scoring vector is based on the combination of each prominence score and each descriptiveness score.
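The steps of claim 19 can be sketched end to end as follows. The sketch assumes "based on" means multiplication throughout: prominence is term weight times non-compound likelihood for a word, or times bigram probability for a pair, and each vector entry is prominence times descriptiveness. The lookup tables, the default term weight of 1.0, and the name `scoring_vector` are hypothetical. The claim itself does not fix any of these formulas.

```python
def scoring_vector(words, term_weight, bigram_prob, descriptiveness):
    """Build a sparse {term: score} vector over words and consecutive word pairs."""
    vector = {}
    for i, word in enumerate(words):
        left = bigram_prob.get((words[i - 1], word), 0.0) if i > 0 else 0.0
        right = bigram_prob.get((word, words[i + 1]), 0.0) if i + 1 < len(words) else 0.0
        # Assumed word prominence: term weight times non-compound likelihood.
        prominence = term_weight.get(word, 1.0) * (1.0 - left) * (1.0 - right)
        vector[word] = vector.get(word, 0.0) + prominence * descriptiveness.get(word, 0.0)
    for first, second in zip(words, words[1:]):
        pair = f"{first} {second}"
        # Assumed pair prominence: term weight times bigram probability.
        prominence = term_weight.get(pair, 1.0) * bigram_prob.get((first, second), 0.0)
        vector[pair] = vector.get(pair, 0.0) + prominence * descriptiveness.get(pair, 0.0)
    return vector


# Hypothetical lookup tables, for illustration only.
words = ["solar", "panel"]
bigram_prob = {("solar", "panel"): 0.75}
descriptiveness = {"solar": 0.8, "panel": 0.4, "solar panel": 2.0}
vector = scoring_vector(words, {}, bigram_prob, descriptiveness)
```

Because "solar panel" is a likely compound in this example, most of the score mass moves from the individual words to the pair entry.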
Specification