
Systems and methods for identifying similar documents

  • US 8,713,034 B1
  • Filed: 06/03/2011
  • Issued: 04/29/2014
  • Est. Priority Date: 03/18/2008
  • Status: Active Grant
First Claim

1. A computer-implemented method for identifying similar documents, comprising:

    receiving document text for a current document, the current document including a plurality of metadata fields and corresponding portions of the document text;

    determining a first descriptiveness score for each word of the document text based on a first corpus;

    determining a prominence score for each word of the document text, the prominence score for a particular word being based on a non-compound likelihood for the particular word, the non-compound likelihood for the particular word being based on a probability that the particular word and a word that is adjacent to the particular word in the current document do not combine to form a word compound;

    modifying the first descriptiveness score for each word based on a weight assigned to the metadata field to which the word corresponds;

    determining a comparison metric for the current document based on the modified first descriptiveness score for each word and the prominence score for each word;

    finding, using a query processor, at least one potential document, wherein document text for each potential document includes at least one word from the current document; and

    analyzing each found potential document to identify at least one similar document as a function of a comparison metric for the respective potential document and the comparison metric for the current document.
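
The steps recited in claim 1 can be illustrated with a minimal Python sketch. The claim does not specify any formulas, so everything concrete here is an assumption made for illustration: an IDF-style descriptiveness score, a prominence score computed as the product of (1 - compound probability) over a word's adjacent pairs, a per-word weighted sum as the comparison metric, and cosine similarity with a fixed threshold for the final analysis. All function names, parameters, and data structures (`doc_freq`, `compound_prob`, `field_weights`) are hypothetical and are not taken from the patent.

```python
import math
from collections import defaultdict


def descriptiveness(word, doc_freq, n_docs):
    # Assumed IDF-style score: words that are rare in the corpus score higher.
    return math.log((n_docs + 1) / (doc_freq.get(word, 0) + 1))


def prominence(words, i, compound_prob):
    # Assumed non-compound likelihood: probability that words[i] does NOT
    # form a compound with either of its neighbours in the document.
    p = 1.0
    if i > 0:
        p *= 1.0 - compound_prob.get((words[i - 1], words[i]), 0.0)
    if i + 1 < len(words):
        p *= 1.0 - compound_prob.get((words[i], words[i + 1]), 0.0)
    return p


def comparison_metric(fields, field_weights, doc_freq, n_docs, compound_prob):
    # fields maps a metadata field name (e.g. "title") to its portion of text.
    metric = defaultdict(float)
    for field, text in fields.items():
        words = text.lower().split()
        weight = field_weights.get(field, 1.0)
        for i, word in enumerate(words):
            # Descriptiveness modified by the weight of the word's field.
            modified = descriptiveness(word, doc_freq, n_docs) * weight
            metric[word] += modified * prominence(words, i, compound_prob)
    return dict(metric)


def cosine(a, b):
    # Cosine similarity between two sparse word -> score mappings.
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_similar(current, candidates, threshold=0.5):
    # Stand-in for the query processor: keep only potential documents that
    # share at least one word with the current document.
    potential = {d: m for d, m in candidates.items() if current.keys() & m.keys()}
    scored = [(d, cosine(current, m)) for d, m in potential.items()]
    # Keep documents whose similarity meets the (assumed) threshold.
    similar = [(d, s) for d, s in scored if s >= threshold]
    return sorted(similar, key=lambda pair: pair[1], reverse=True)
```

In this sketch, a pair such as ("neural", "network") with a high assumed compound probability lowers the prominence of both words individually, while a word in a heavily weighted field such as a title contributes more to the comparison metric than the same word in a lightly weighted field.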
