Systems and methods for identifying similar documents

  • US 7,958,136 B1
  • Filed: 03/18/2008
  • Issued: 06/07/2011
  • Est. Priority Date: 03/18/2008
  • Status: Active Grant
First Claim

1. A computer-implemented method for identifying similar documents, comprising:

    (a) receiving document text for a current document that includes at least two words;

    (b) calculating:

      a prominence score for each word, wherein the prominence score for each word is based on a term weight and a non-compound likelihood for each word,

      a prominence score for each pair of consecutive words, wherein the prominence score for each pair of consecutive words is based on a term weight and a compound probability for each pair of consecutive words,

      a descriptiveness score for each word wherein the descriptiveness score for each word is based on a corpus, and

      a descriptiveness score for each pair of consecutive words, wherein the descriptiveness score for each pair of consecutive words is based on the corpus;

    (c) calculating a comparison metric for the current document, wherein the comparison metric is based on the combination of each prominence score and each descriptiveness score;

    (d) finding, using a query processor, at least one potential document, wherein document text for each potential document includes at least one word from the current document; and

    (e) analyzing each found potential document to identify at least one similar document as a function of a comparison metric for the respective potential document and the comparison metric for the current document.
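
Steps (a) through (e) can be illustrated with a minimal Python sketch. The claim does not give formulas, so everything concrete here is an assumption made for illustration: the term weight is taken as relative frequency, the non-compound likelihood as one minus the highest compound probability of any pair containing the word, the descriptiveness score as a smoothed inverse document frequency, the "combination" as a product, and the final comparison as cosine similarity. The names `split_terms`, `document_frequencies`, `comparison_metric`, `cosine`, `find_similar`, and the `compound_prob` table are hypothetical and do not come from the patent.

```python
import math
from collections import Counter, defaultdict


def split_terms(text):
    """Return (words, pairs of consecutive words) for a document text."""
    words = text.lower().split()
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words, pairs


def document_frequencies(corpus):
    """Count how many corpus documents contain each word and each pair."""
    df = Counter()
    for text in corpus.values():
        words, pairs = split_terms(text)
        df.update(set(words) | set(pairs))
    return df


def comparison_metric(text, df, n_docs, compound_prob):
    """Steps (b)-(c): map each term to prominence * descriptiveness."""
    words, pairs = split_terms(text)
    # Assumption: a word's non-compound likelihood is 1 minus the highest
    # compound probability of any pair in this document that contains it.
    in_compound = defaultdict(float)
    for pair in pairs:
        p = compound_prob.get(pair, 0.0)
        for word in pair.split():
            in_compound[word] = max(in_compound[word], p)

    metric = {}
    counts = Counter(words) + Counter(pairs)
    for term, count in counts.items():
        weight = count / len(words)  # assumed term weight: relative frequency
        if " " in term:
            likelihood = compound_prob.get(term, 0.0)  # compound probability
        else:
            likelihood = 1.0 - in_compound[term]  # non-compound likelihood
        prominence = weight * likelihood
        # Assumed descriptiveness: smoothed inverse document frequency.
        descriptiveness = math.log((1 + n_docs) / (1 + df.get(term, 0))) + 1.0
        metric[term] = prominence * descriptiveness
    return metric


def cosine(a, b):
    """Cosine similarity between two sparse term -> score mappings."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_similar(current_text, corpus, compound_prob, threshold=0.3):
    """Steps (d)-(e): pick candidates sharing a word, then compare metrics."""
    df = document_frequencies(corpus)
    n_docs = len(corpus)
    current = comparison_metric(current_text, df, n_docs, compound_prob)
    current_words = set(split_terms(current_text)[0])
    results = []
    for doc_id, text in corpus.items():
        # Step (d): keep only documents with at least one word in common.
        if current_words & set(split_terms(text)[0]):
            candidate = comparison_metric(text, df, n_docs, compound_prob)
            score = cosine(current, candidate)
            if score >= threshold:
                results.append((doc_id, score))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

As a usage illustration, calling `find_similar("hot dog stand downtown", corpus, {"hot dog": 0.9})` on a small dictionary of document texts returns `(doc_id, score)` tuples, highest score first. A linear scan stands in for the query processor of step (d); a real system would more plausibly use an inverted index, which this sketch omits for brevity.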

  • 2 Assignments