Systems and methods for identifying similar documents

US 7,958,136 B1
Filed: 03/18/2008
Issued: 06/07/2011
Est. Priority Date: 03/18/2008
Status: Active Grant

Abstract

The present invention provides systems and methods for identifying similar documents. In an embodiment, the present invention identifies similar documents by (1) receiving document text for a current document that includes at least one word; (2) calculating a prominence score and a descriptiveness score for each word and each pair of consecutive words; (3) calculating a comparison metric for the current document; (4) finding at least one potential document, where document text for each potential document includes at least one of the words; and (5) analyzing each potential document to identify at least one similar document.
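
Step (4) of the abstract, finding potential documents that share at least one word with the current document, can be illustrated with an inverted index. This is a minimal sketch and not the patented query processor: the function names, the whitespace tokenization, and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def find_potential_documents(current_text, index):
    """Return ids of documents sharing at least one word with current_text."""
    candidates = set()
    for word in set(current_text.lower().split()):
        # .get avoids creating empty entries for unseen words.
        candidates |= index.get(word, set())
    return candidates


# Hypothetical sample corpus.
docs = {
    "d1": "similar documents share descriptive words",
    "d2": "cooking pasta at home",
    "d3": "ranking documents by score",
}
index = build_inverted_index(docs)
print(sorted(find_potential_documents("identifying similar documents", index)))
```

Candidates found this way are only a superset; the later analysis step decides which of them count as similar.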

263 Citations

25 Claims

1. A computer-implemented method for identifying similar documents, comprising:
- (a) receiving document text for a current document that includes at least two words;
  
  (b) calculating:

  a prominence score for each word, wherein the prominence score for each word is based on a term weight and a non-compound likelihood for each word,

  a prominence score for each pair of consecutive words, wherein the prominence score for each pair of consecutive words is based on a term weight and a compound probability for each pair of consecutive words,

  a descriptiveness score for each word wherein the descriptiveness score for each word is based on a corpus, and

  a descriptiveness score for each pair of consecutive words, wherein the descriptiveness score for each pair of consecutive words is based on the corpus;
  
  (c) calculating a comparison metric for the current document, wherein the comparison metric is based on the combination of each prominence score and each descriptiveness score;
  
  (d) finding, using a query processor, at least one potential document, wherein document text for each potential document includes at least one word from the current document; and
  
  (e) analyzing each found potential document to identify at least one similar document as a function of a comparison metric for the respective potential document and the comparison metric for the current document.
  - 2. The method of claim 1, wherein the analyzing step (e) comprises:
    - calculating, for each potential document, a dot product of a scoring vector for the respective potential document and a scoring vector for the current document, wherein the scoring vectors are based on each document's respective comparison metric.
  - 3. The method of claim 1, wherein the analyzing step (e) comprises:
    - identifying the at least one similar document based on the comparison metric of each potential document and the current document, wherein potential documents with a comparison metric above a threshold are selected for identification.
  - 4. The method of claim 2, wherein calculating the dot product further comprises:
    - (i) calculating a normalization factor for each scoring vector, wherein the normalization factor converts each scoring vector into unit length;
      
      (ii) calculating a quality score for each scoring vector based on term weights of each respective word; and
      
      (iii) multiplying each dot product by the quality score and the inverse of the normalization factor for the respective potential document and by the quality score and the inverse of the normalization factor for the current document.
  - 5. The method of claim 1, wherein the non-compound likelihood for each word is based on a probability that each respective word and a word to the left of the respective word do not combine to form a word compound and a probability that each respective word and a word to the right of the respective word do not combine to form a word compound.
  - 6. The method of claim 1, wherein the compound probability for each pair of consecutive words is based on a number of times each respective word pair appears in a training corpus and a number of times a first word and a second word of each respective word pair appear within a certain distance of each other in the training corpus.
  - 7. The method of claim 1, wherein the term weight is based on a location of the respective word in the document text.
  - 8. The method of claim 1, wherein calculating the descriptiveness score comprises:
    - calculating a frequency of each word or pair of consecutive words in a purpose-relevant corpus and a frequency of each word or pair of consecutive words in a background corpus.
  - 9. The method of claim 8, wherein the purpose-relevant corpus includes a corpus of documents that are relevant in performing a desired task on the current document.
  - 10. The method of claim 1, further comprising:
    - (f) outputting a signal representing each similar document to a user via a communication network.
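
Claims 2 through 4 describe comparing documents through a dot product of scoring vectors, scaled by a quality score and by the inverse of a normalization factor for each document. The sketch below is one possible reading, not the patented implementation. It assumes scoring vectors are sparse dictionaries from term to score, takes the normalization factor to be the Euclidean length, accepts quality scores as plain inputs (the claims do not say how they are derived from term weights), and applies the threshold of claim 3 to the resulting similarity. All names are hypothetical.

```python
import math


def dot(a, b):
    """Sparse dot product over the terms present in both vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(value * b[term] for term, value in a.items() if term in b)


def norm(vec):
    """Euclidean length, used here as the normalization factor."""
    return math.sqrt(sum(value * value for value in vec.values()))


def similarity(current, candidate, q_current=1.0, q_candidate=1.0):
    """Dot product times quality / normalization factor for each document."""
    n_current, n_candidate = norm(current), norm(candidate)
    if n_current == 0 or n_candidate == 0:
        return 0.0
    return dot(current, candidate) * (q_current / n_current) * (q_candidate / n_candidate)


def similar_documents(current, candidates, threshold):
    """Keep the candidate ids whose similarity exceeds the threshold."""
    return sorted(
        doc_id
        for doc_id, vec in candidates.items()
        if similarity(current, vec) > threshold
    )


# Hypothetical scoring vectors.
current = {"similar": 3.0, "documents": 4.0}
candidates = {
    "d1": {"similar": 3.0, "documents": 4.0},
    "d2": {"pasta": 1.0},
    "d3": {"documents": 1.0, "ranking": 1.0},
}
print(similar_documents(current, candidates, threshold=0.5))
```

With both quality scores left at 1.0 the similarity reduces to cosine similarity, so identical vectors score 1 and vectors with no shared terms score 0.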

11. A system for identifying similar documents, comprising:
- a processor and a memory;
  
  a query processor using the processor and the memory;
  
  a compound likelihood database, that stores bigram probabilities for a first set of words;
  
  a descriptive database, that stores descriptiveness scores for a second set of words, wherein the first set of words and the second set of words includes common words; and
  
  a similar document identifier configured to receive document text for a current document, wherein the document text includes at least two document words,

  wherein the similar document identifier uses the query processor, the query processor configured to identify at least one similar document based on prominence scores and descriptiveness scores for each document word and pair of consecutive document words, wherein the prominence scores are based on results from a query to the compound likelihood database and the descriptiveness scores are received based on a query to the descriptive database, and wherein

  the prominence score for each word is further based on a term weight and a non-compound likelihood for each word;

  the prominence score for each pair of consecutive words is based on a term weight and a compound probability for each pair of consecutive words;

  the descriptiveness score for each word is based on a corpus, and

  the descriptiveness score for each pair of consecutive words is based on the corpus.
  - 12. The system of claim 11, wherein the query processor is further configured to:
    - find at least one potential document, wherein document text for each potential document includes at least one of the document words from the current document,

      analyze each potential document to identify each similar document as a function of a comparison metric for the respective potential document and a comparison metric for the current document, and

      output a signal representing each similar document to a user via a communication network; and
  - 13. The system of claim 12, wherein the query processor identifying each similar document comprises:
    - calculating, for each potential document, a dot product of a scoring vector for the respective potential document and a scoring vector for the current document, wherein the scoring vectors are based on each document's respective comparison metric; and

      using the calculated scoring vectors to identify a similar document.
  - 14. The system of claim 12, wherein the query processor identifying each similar document comprises:
    - identifying the at least one similar document based on the comparison metric of each potential document, wherein potential documents with a comparison metric above a threshold are selected for identification.
  - 15. The system of claim 13, wherein calculating the dot product further comprises:
    - (i) calculating a normalization factor for each scoring vector, wherein the normalization factor converts each scoring vector into unit length;

      (ii) calculating a quality score for each scoring vector based on term weights of each respective word; and

      (iii) multiplying the dot product by the quality score and the inverse of the normalization factor for the respective potential document and by the quality score and the inverse of the normalization factor for the current document.
  - 16. The system of claim 11, wherein the non-compound likelihood for each word is based on a probability that each respective word and a word to the left of the respective word do not combine to form a word compound and a probability that each respective word and a word to the right of the respective word do not combine to form a word compound.
  - 17. The system of claim 11, wherein the compound probability for each pair of consecutive words comprises a number of times each respective word pair appears in a training corpus and a number of times a first word and a second word of each respective word pair appear within a certain distance of each other in the training corpus.
  - 18. The system of claim 17, wherein the term weight is based on a location of the respective word in the document text.
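
Claims 16 and 17 name the inputs to the non-compound likelihood and the compound probability without fixing a formula. The sketch below is an assumption-laden illustration. It estimates the compound probability as the ratio of adjacent occurrences of a pair to its occurrences within a window, and takes the non-compound likelihood to be the product of the probabilities that a word compounds with neither neighbor. Multiplying by a term weight to obtain a prominence score is likewise an assumption, since the claims only say "based on". The function names, the window size, and the toy training text are hypothetical.

```python
from collections import Counter


def pair_counts(tokens, window=3):
    """Count adjacent pairs and pairs co-occurring within `window` tokens."""
    adjacent, nearby = Counter(), Counter()
    for i, first in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pair = (first, tokens[j])
            nearby[pair] += 1
            if j == i + 1:
                adjacent[pair] += 1
    return adjacent, nearby


def compound_probability(pair, adjacent, nearby):
    """Assumed estimate: adjacent occurrences / occurrences within the window."""
    if nearby[pair] == 0:
        return 0.0
    return adjacent[pair] / nearby[pair]


def non_compound_likelihood(left, word, right, adjacent, nearby):
    """Assumed form: probability that `word` compounds with neither neighbor."""
    p_left = 0.0 if left is None else compound_probability((left, word), adjacent, nearby)
    p_right = 0.0 if right is None else compound_probability((word, right), adjacent, nearby)
    return (1.0 - p_left) * (1.0 - p_right)


# Hypothetical training text.
tokens = "new york is new and york is old".split()
adjacent, nearby = pair_counts(tokens)

p_pair = compound_probability(("new", "york"), adjacent, nearby)
p_word = non_compound_likelihood(None, "new", "york", adjacent, nearby)

term_weight = 2.0  # hypothetical, e.g. a higher weight for a word in a title
print("pair prominence:", term_weight * p_pair)
print("word prominence:", term_weight * p_word)
```

In a system shaped like claim 11, counts of this kind would presumably be precomputed and served from the compound likelihood database instead of being recomputed per query.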

19. A computer-implemented method for scoring documents, comprising:
- (a) receiving document text for a current document that includes at least two words;
  
  (b) calculating a prominence score and a descriptiveness score for each word from the current document and each pair of consecutive words from the current document, wherein:

  the prominence score for each word is based on a term weight and a non-compound likelihood for each word,

  the prominence score for each pair of consecutive words is based on a term weight and a bigram probability for each pair of consecutive words,

  a descriptiveness score for each word is based on a corpus,

  a descriptiveness score for each pair of consecutive words is based on the corpus; and
  
  (c) creating a scoring vector for the current document, wherein the scoring vector is based on the combination of each prominence score and each descriptiveness score.
  - 20. The method of claim 19, wherein the non-compound likelihood is based on a probability that each respective word and a word to the left of the respective word do not combine to form a word compound and a probability that each respective word and a word to the right of the respective word do not combine to form a word compound.
  - 21. The method of claim 19, wherein the bigram probability comprises a number of times each respective word pair appears in a training corpus and a number of times a first word and a second word of each respective word pair appear within a certain distance of each other in the training corpus.
  - 22. The method of claim 21, wherein the term weight of each respective word is based on a location of the respective word in the document text.
  - 23. The method of claim 19, wherein calculating the descriptiveness score comprises:
    - calculating a frequency of each word or pair of consecutive words in a purpose-relevant corpus and a frequency of each word or pair of consecutive words in a background corpus.
  - 24. The method of claim 23, wherein the purpose-relevant corpus includes a corpus of documents that are relevant in performing a desired task on the current document.
  - 25. The method of claim 19, wherein the scoring vector creating step (c) further comprises:
    - (i) normalizing the scoring vector, wherein the normalization converts the scoring vector into unit length; and
      
      (ii) multiplying the normalized scoring vector by a quality score, wherein the quality score is based on the term weight of each word.
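
Claims 19, 23, and 25 can be read together as a recipe for building a scoring vector: score each term's descriptiveness from its frequency in a purpose-relevant corpus versus a background corpus, combine that with prominence, normalize to unit length, and scale by a quality score. The sketch below is one such reading under stated assumptions. The smoothed frequency ratio, the use of a product as the "combination", and the quality score passed in as a plain number are all choices made for illustration, and every name and count is hypothetical.

```python
import math
from collections import Counter


def descriptiveness(term, relevant, background, smoothing=1.0):
    """Assumed score: smoothed relative frequency in the purpose-relevant
    corpus divided by smoothed relative frequency in the background corpus."""
    rel = (relevant[term] + smoothing) / (sum(relevant.values()) + smoothing)
    bg = (background[term] + smoothing) / (sum(background.values()) + smoothing)
    return rel / bg


def scoring_vector(prominence, descriptive, quality=1.0):
    """Combine scores per term, normalize to unit length, scale by quality."""
    raw = {term: p * descriptive.get(term, 0.0) for term, p in prominence.items()}
    length = math.sqrt(sum(v * v for v in raw.values()))
    if length == 0:
        return raw
    return {term: quality * v / length for term, v in raw.items()}


# Hypothetical term counts for the two corpora.
relevant = Counter({"patent": 8, "the": 10})
background = Counter({"patent": 1, "the": 50})

descriptive = {t: descriptiveness(t, relevant, background) for t in ("patent", "the")}
prominence = {"patent": 1.0, "the": 1.0}  # hypothetical prominence scores
print(scoring_vector(prominence, descriptive))
```

A term such as "patent" that is common in the purpose-relevant corpus but rare in the background corpus ends up dominating the vector, while a function word such as "the" contributes little.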

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Heafield, Kenneth; Curtis, Taylor
Primary Examiner(s): Vy, Hung T
Assistant Examiner(s): Cao, Phuong Thao
Application Number: US12/050,626
Time in Patent Office: 1,176 Days
Field of Search: 707/999.003, 707/705, 707/758, 707/802
US Class Current: 707/758
CPC Class Codes: G06F 16/313 Selection or weighting of t...
