Systems and methods for identifying similar documents

US 8,713,034 B1
Filed: 06/03/2011
Issued: 04/29/2014
Est. Priority Date: 03/18/2008
Status: Active Grant

Abstract

The present invention provides systems and methods for identifying similar documents. In an embodiment, the present invention identifies similar documents by (1) receiving document text for a current document that includes at least one word; (2) calculating a prominence score and a descriptiveness score for each word and each pair of consecutive words; (3) calculating a comparison metric for the current document; (4) finding at least one potential document, where document text for each potential document includes at least one of the words; and (5) analyzing each potential document to identify at least one similar document.
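Steps (1) through (3) of the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the smoothed inverse-document-frequency formula for descriptiveness, the relative-frequency stand-in for prominence, the product used to combine them, and the names `terms`, `descriptiveness`, `comparison_metric`, and `doc_freq` are all assumptions made for illustration.

```python
import math
from collections import Counter


def terms(text):
    """Return the words of `text` followed by each pair of consecutive words."""
    words = text.lower().split()
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs


def descriptiveness(term, doc_freq, n_docs):
    """Assumed stand-in: smoothed inverse document frequency over a corpus."""
    return math.log((1 + n_docs) / (1 + doc_freq.get(term, 0)))


def comparison_metric(text, doc_freq, n_docs):
    """Map each term to prominence * descriptiveness (an assumed combination).

    Prominence is approximated here by the term's share of all terms.
    """
    counts = Counter(terms(text))
    total = sum(counts.values())
    return {
        term: (count / total) * descriptiveness(term, doc_freq, n_docs)
        for term, count in counts.items()
    }


# Hypothetical corpus statistics: number of corpus documents containing each term.
doc_freq = {"the": 1000, "red": 40, "bicycle": 5, "red bicycle": 2}
metric = comparison_metric("the red bicycle", doc_freq, n_docs=1000)
```

In this sketch a term that appears in every corpus document, such as "the", contributes nothing to the metric, while rarer terms and rarer pairs of consecutive words carry more weight.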

20 Claims

1. A computer-implemented method for identifying similar documents, comprising:
- receiving document text for a current document, the current document including a plurality of metadata fields and corresponding portions of the document text;
  
  determining a first descriptiveness score for each word of the document text based on a first corpus;
  
  determining a prominence score for each word of the document text, the prominence score for a particular word being based on a non-compound likelihood for the particular word, the non-compound likelihood for the particular word being based on a probability that the particular word and a word that is adjacent to the particular word in the current document do not combine to form a word compound;
  
  modifying the first descriptiveness score for each word based on a weight assigned to the metadata field to which the word corresponds;
  
  determining a comparison metric for the current document based on the modified first descriptiveness score for each word and the prominence score for each word;
  
  finding, using a query processor, at least one potential document, wherein document text for each potential document includes at least one word from the current document; and
  
  analyzing each found potential document to identify at least one similar document as a function of a comparison metric for the respective potential document and the comparison metric for the current document.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11):
  - 2. The method of claim 1, further comprising determining a second descriptiveness score for each word based on a second corpus, wherein the modifying the first descriptiveness score for a particular word comprises modifying the first descriptiveness score for the particular word based on the second descriptiveness score for the particular word.
  - 3. The method of claim 2, wherein the determining a second descriptiveness score for each word based on a second corpus comprises determining a second descriptiveness score for each word based on a corpus specific to a document type of the current document.
  - 4. The method of claim 3, wherein:
    - the receiving document text for a current document comprises receiving document text for an advertisement document, and
      
      the determining a second descriptiveness score for each word based on a corpus specific to a document type of the current document comprises determining a second descriptiveness score for each word based on a corpus of advertisements.
  - 5. The method of claim 1, wherein:
    - the determining a first descriptiveness score for a particular word comprises determining a first descriptiveness score for a pair of consecutive words that includes the particular word, and
      
      the modifying the first descriptiveness score for each word comprises modifying the first descriptiveness score for the pair of consecutive words that includes the particular word based on a factor.
  - 6. The method of claim 1, wherein the modifying the first descriptiveness score for each word comprises modifying the first descriptiveness score of a particular word based on a comparison of the particular word to a corpus of first and last names of persons.
  - 7. The method of claim 6, wherein the modifying the first descriptiveness score of the particular word based on a comparison of the particular word to a corpus of first and last names of persons comprises increasing the first descriptiveness score of the particular word based on a celebrity measure of a name matched to the particular word.
  - 8. The method of claim 6, wherein the modifying the first descriptiveness score of the particular word based on a comparison of the particular word to a corpus of first and last names of persons comprises increasing the first descriptiveness score of the particular word based on a comparison to a corpus of celebrity names matched to the particular word.
  - 9. The method of claim 1, further comprising:
    - selecting a second corpus based on the current document;
      
      determining a second descriptiveness score for each word based on the second corpus, and, wherein modifying the first descriptiveness score for a particular word comprises reducing the first descriptiveness score for the particular word based on the second descriptiveness score for the particular word.
  - 10. The method of claim 1, wherein modifying the first descriptiveness score for each word comprises reducing the first descriptiveness score for a particular word based on an extent to which the particular word describes a medium the current document describes.
  - 11. The method of claim 10, wherein the receiving document text for a current document comprises receiving document text for a document that describes a photograph, and reducing the first descriptiveness score for the particular word based on the extent to which the word describes the medium the current document describes comprises reducing the descriptiveness score for the particular word based on the extent to which the particular word describes a photographic medium.
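The prominence and metadata-weighting steps of claim 1 can be sketched as follows. The claim does not say how the non-compound likelihood is computed or how the scores are combined, so this sketch assumes that compound probabilities for adjacent pairs are available in a lookup table, that the two neighbors are treated independently, and that the metric sums modified descriptiveness times prominence per word. The names `non_compound_likelihood`, `comparison_metric`, `compound_prob`, and `field_weights` are hypothetical.

```python
def non_compound_likelihood(words, i, compound_prob):
    """Probability that words[i] forms no compound with either neighbor.

    `compound_prob` maps an adjacent (left, right) pair to an assumed
    probability that the pair is a word compound; unseen pairs default to 0.
    """
    likelihood = 1.0
    if i > 0:
        likelihood *= 1.0 - compound_prob.get((words[i - 1], words[i]), 0.0)
    if i + 1 < len(words):
        likelihood *= 1.0 - compound_prob.get((words[i], words[i + 1]), 0.0)
    return likelihood


def comparison_metric(fields, descriptiveness, field_weights, compound_prob):
    """Sum, per word, (descriptiveness * field weight) * prominence."""
    metric = {}
    for field, text in fields.items():
        words = text.lower().split()
        weight = field_weights.get(field, 1.0)  # unweighted fields count once
        for i, word in enumerate(words):
            modified = descriptiveness.get(word, 0.0) * weight
            prominence = non_compound_likelihood(words, i, compound_prob)
            metric[word] = metric.get(word, 0.0) + modified * prominence
    return metric


# Hypothetical inputs: a document with two metadata fields.
fields = {"title": "hot dog stand", "body": "a stand selling food"}
descriptiveness = {"hot": 0.5, "dog": 0.6, "stand": 0.4, "selling": 0.3, "food": 0.5}
field_weights = {"title": 2.0, "body": 1.0}
compound_prob = {("hot", "dog"): 0.9}

metric = comparison_metric(fields, descriptiveness, field_weights, compound_prob)
```

With these inputs, "hot" and "dog" are heavily discounted as standalone words because they very likely form the compound "hot dog", while "stand" benefits from appearing in the more heavily weighted title field.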

12. A system for identifying similar documents, comprising:
- a processor and a memory, the processor being configured to receive document text for a current document, the current document including a plurality of metadata fields and corresponding portions of the document text;
  
  a first descriptive database configured to store descriptiveness scores for a first corpus of words;
  
  a descriptiveness score retriever configured to retrieve a first descriptiveness score for each word in the current document based on the first descriptive database;
  
  a prominence score retriever configured to retrieve a prominence score for each word in the current document, the prominence score for a particular word being based on a non-compound likelihood for the particular word, the non-compound likelihood for the particular word being based on a probability that the particular word and a word that is adjacent to the particular word in the document text do not combine to form a word compound;
  
  a descriptiveness score modifier configured to use the processor and the memory to modify the first descriptiveness score for each word based on a weight assigned to the metadata field to which the word corresponds;
  
  a comparison metric generator configured to generate a comparison metric for the current document based on the modified first descriptiveness score for each word and the prominence score for each word;
  
  a query processor configured to find at least one potential document, wherein document text for each potential document includes at least one word from the current document; and
  
  a similar document identifier configured to identify at least one similar document from the found potential documents based on the generated comparison metric for the current document and a respective comparison metric for each found potential document.
- Dependent claims (13, 14, 15, 16, 17, 18, 19):
  - 13. The system of claim 12, further comprising:
    - a second descriptive database configured to store descriptiveness scores for a second corpus of words, wherein:
      
      the second corpus of words is based on the current document,
      
      the descriptiveness score retriever is further configured to retrieve a second descriptiveness score for each word in a current document based on the second descriptive database, and
      
      a factor upon which the modification of the first descriptiveness score is based is the second descriptiveness score.
  - 14. The system of claim 13, wherein the second corpus of words is specific to a document type of the current document.
  - 15. The system of claim 14, wherein:
    - the current document is an advertisement; and
      
      the second corpus of words is a corpus of advertisements.
  - 16. The system of claim 12, wherein the descriptiveness score retriever is configured to retrieve a first descriptiveness score for each pair of consecutive words, and the descriptiveness score modifier is configured to modify the first descriptiveness score for each pair of consecutive words.
  - 17. The system of claim 16, further comprising:
    - a second descriptive database configured to store descriptiveness scores for a second corpus, the second corpus being of first and last names of persons, wherein:
      
      the descriptiveness score retriever is configured to retrieve a second descriptiveness score for each pair of consecutive words in a current document from the second descriptive database, and
      
      a factor upon which the modification of the first descriptiveness score for each pair of consecutive words is based is the second descriptiveness score for the pair of consecutive words.
  - 18. The system of claim 17, wherein the factor further comprises a celebrity measure of a name matched to the pair of consecutive words.
  - 19. The system of claim 17, wherein a factor upon which the modification of the first descriptiveness score for a particular pair of consecutive words is based further comprises an extent to which the particular pair of words describes a medium the current document describes.
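The dependent claims name several factors that modify a first descriptiveness score: a second corpus, a celebrity measure for matched names, and the extent to which a term describes the document's medium. The claims do not give formulas, so the sketch below is only one possible reading. Capping by the second score, scaling up by `1 + celebrity`, and scaling down by `1 - medium_extent` are assumptions, as are the function name `modify_score` and all sample values.

```python
def modify_score(term, first_score, second_scores, celebrity, medium_extent):
    """Apply three assumed modifiers to a term's first descriptiveness score."""
    score = first_score
    # Second corpus (e.g. a corpus of advertisements): reduce the score of a
    # term that the type-specific corpus rates as less descriptive.
    if term in second_scores:
        score = min(score, second_scores[term])
    # Names corpus: increase the score by a celebrity measure in [0, 1].
    score *= 1.0 + celebrity.get(term, 0.0)
    # Medium: reduce the score by the extent, in [0, 1], to which the term
    # describes the medium (e.g. "photo" in a document describing a photograph).
    score *= 1.0 - medium_extent.get(term, 0.0)
    return score


# Hypothetical scores for single words and pairs of consecutive words.
first_scores = {"tom hanks": 2.0, "photo": 1.5, "sale": 1.0}
second_scores = {"sale": 0.2}
celebrity = {"tom hanks": 0.8}
medium_extent = {"photo": 0.9}

modified = {
    term: modify_score(term, score, second_scores, celebrity, medium_extent)
    for term, score in first_scores.items()
}
```

Under these assumptions the matched celebrity name gains weight, while the medium word "photo" and the advertisement-typical word "sale" lose most of theirs.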

20. A non-transitory computer-readable medium having computer-executable instructions stored thereon that, when executed by a computing device, cause the computing device to perform a method for identifying similar documents, comprising:
- receiving document text for a current document, the current document including a plurality of metadata fields and corresponding portions of the document text;
  
  determining a first descriptiveness score for each word of the document text based on a first corpus;
  
  determining a prominence score for each word of the document text, the prominence score for a particular word being based on a non-compound likelihood for the particular word, the non-compound likelihood for the particular word being based on a probability that the particular word and a word that is adjacent to the particular word in the document text do not combine to form a word compound;
  
  modifying the first descriptiveness score for each word based on a weight assigned to the metadata field to which the word corresponds;
  
  determining a comparison metric for the current document based on the modified first descriptiveness score for each word and the prominence score for each word;
  
  finding, using a query processor, at least one potential document, wherein document text for each potential document includes at least one word from the current document; and
  
  analyzing each found potential document to identify at least one similar document as a function of a comparison metric for the respective potential document and the comparison metric for the current document.
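The last two steps of the independent claims, finding potential documents that share at least one word and then comparing metrics, can be sketched with an inverted index and a similarity function. The claims require only that similarity be "a function of" the two comparison metrics, so the cosine similarity and the threshold used here are assumptions, as are the names `build_index`, `find_potential`, `cosine`, and `similar_documents`.

```python
import math
from collections import defaultdict


def build_index(metrics):
    """Inverted index: word -> ids of documents whose metric contains it."""
    index = defaultdict(set)
    for doc_id, metric in metrics.items():
        for word in metric:
            index[word].add(doc_id)
    return index


def find_potential(current, index):
    """Ids of documents sharing at least one word with the current document."""
    found = set()
    for word in current:
        found |= index.get(word, set())
    return found


def cosine(a, b):
    """Cosine similarity between two sparse word -> weight metrics."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_documents(current, metrics, index, threshold=0.5):
    """Rank potential documents by similarity, keeping those above threshold."""
    scored = [(cosine(current, metrics[d]), d) for d in find_potential(current, index)]
    return sorted(((s, d) for s, d in scored if s >= threshold), reverse=True)


# Hypothetical precomputed comparison metrics for three stored documents.
metrics = {
    "a": {"red": 1.0, "bicycle": 2.0},
    "b": {"red": 1.0, "wine": 2.0},
    "c": {"blue": 1.0},
}
index = build_index(metrics)
current = {"red": 1.0, "bicycle": 2.0, "bell": 0.5}

potential = find_potential(current, index)
similar = similar_documents(current, metrics, index)
```

Here documents "a" and "b" are found as potential documents because each shares the word "red" with the current document, but only "a" clears the assumed threshold once the full metrics are compared.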

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Curtis, Taylor; Heafield, Kenneth
Primary Examiner(s): Cao, Phuong Thao
Application Number: US13/153,319
Time in Patent Office: 1,061 Days
Field of Search: 707/802, 707/999.003, 707/758, 707/748
US Class Current: 707/758
CPC Class Codes: G06F 16/313 Selection or weighting of t...