Systems and methods for identifying similar documents
Abstract
The present invention provides systems and methods for identifying similar documents. In an embodiment, the present invention identifies similar documents by (1) receiving document text for a current document that includes at least one word; (2) calculating a prominence score and a descriptiveness score for each word and each pair of consecutive words; (3) calculating a comparison metric for the current document; (4) finding at least one potential document, where document text for each potential document includes at least one of the words; and (5) analyzing each potential document to identify at least one similar document.
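The five steps in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented method: the abstract gives no formulas, so the sketch assumes that descriptiveness is a smoothed inverse document frequency, that prominence is a word's relative frequency in the document, that the comparison metric is a sparse vector of prominence times descriptiveness, and that similarity is cosine similarity. All function names are hypothetical, and for brevity only single words are scored, not pairs of consecutive words.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def descriptiveness(corpus):
    """Assumed stand-in: smoothed inverse document frequency per word."""
    n_docs = len(corpus)
    doc_freq = Counter(w for text in corpus.values() for w in set(tokenize(text)))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def comparison_metric(text, desc):
    """Assumed stand-in: vector of prominence (relative frequency) x descriptiveness."""
    words = tokenize(text)
    if not words:
        return {}
    counts = Counter(words)
    return {w: (c / len(words)) * desc.get(w, 1.0) for w, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def find_similar(current_text, corpus, top_k=3):
    """Return up to top_k (doc_id, score) pairs, most similar first."""
    desc = descriptiveness(corpus)

    # Inverted index: word -> ids of documents containing that word.
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for w in set(tokenize(text)):
            index[w].add(doc_id)

    current = comparison_metric(current_text, desc)

    # Step (4): potential documents share at least one word with the current one.
    candidates = set().union(*(index.get(w, set()) for w in current))

    # Step (5): compare each candidate's metric with the current document's metric.
    scored = [(cosine(current, comparison_metric(corpus[d], desc)), d) for d in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(d, s) for s, d in scored[:top_k]]
```

The inverted index keeps step (4) cheap: only documents that share a word with the current document are ever scored, so unrelated documents never reach the comparison step.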
20 Claims
1. A computer-implemented method for identifying similar documents, comprising:
receiving document text for a current document, the current document including a plurality of metadata fields and corresponding portions of the document text;
determining a first descriptiveness score for each word of the document text based on a first corpus;
determining a prominence score for each word of the document text, the prominence score for a particular word being based on a non-compound likelihood for the particular word, the non-compound likelihood for the particular word being based on a probability that the particular word and a word that is adjacent to the particular word in the current document do not combine to form a word compound;
modifying the first descriptiveness score for each word based on a weight assigned to the metadata field to which the word corresponds;
determining a comparison metric for the current document based on the modified first descriptiveness score for each word and the prominence score for each word;
finding, using a query processor, at least one potential document, wherein document text for each potential document includes at least one word from the current document; and
analyzing each found potential document to identify at least one similar document as a function of a comparison metric for the respective potential document and the comparison metric for the current document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
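To make the scoring steps of claim 1 concrete, here is a minimal sketch. The claim states what each score depends on but not how it is computed, so every formula below is an assumption: the compound probability of an adjacent pair is estimated from hypothetical bigram and unigram counts, the non-compound likelihood of a word is the product of (1 - compound probability) over its neighbors, prominence is taken to equal that likelihood, and each metric entry is the modified descriptiveness times the prominence. The names `FIELD_WEIGHTS`, `UNIGRAM_COUNTS`, and `BIGRAM_COUNTS`, and all of their values, are illustrative only.

```python
from collections import defaultdict

# Hypothetical weights per metadata field (illustrative values).
FIELD_WEIGHTS = {"title": 2.0, "body": 1.0}

# Hypothetical corpus statistics used to estimate compound probability.
UNIGRAM_COUNTS = {"hot": 100, "dog": 80, "barks": 20}
BIGRAM_COUNTS = {("hot", "dog"): 60}


def compound_probability(left, right):
    """Assumed estimate: share of the rarer word's occurrences that fall in this pair."""
    pair = BIGRAM_COUNTS.get((left, right), 0)
    base = min(UNIGRAM_COUNTS.get(left, 0), UNIGRAM_COUNTS.get(right, 0))
    return min(1.0, pair / base) if base else 0.0


def non_compound_likelihood(words, i):
    """Probability that words[i] forms no compound with either adjacent word."""
    likelihood = 1.0
    if i > 0:
        likelihood *= 1.0 - compound_probability(words[i - 1], words[i])
    if i + 1 < len(words):
        likelihood *= 1.0 - compound_probability(words[i], words[i + 1])
    return likelihood


def comparison_metric(fields, descriptiveness):
    """Combine field-weighted descriptiveness with prominence for each word."""
    metric = defaultdict(float)
    for field, text in fields.items():
        words = text.lower().split()
        weight = FIELD_WEIGHTS.get(field, 1.0)  # unknown fields get a neutral weight
        for i, word in enumerate(words):
            modified = descriptiveness.get(word, 1.0) * weight
            prominence = non_compound_likelihood(words, i)
            metric[word] += modified * prominence
    return dict(metric)
```

In this sketch, "hot" and "dog" in a title reading "hot dog" both receive a low prominence because the pair is likely a compound, while "dog" in a body reading "dog barks" keeps full prominence.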
12. A system for identifying similar documents, comprising:
a processor and a memory, the processor being configured to receive document text for a current document, the current document including a plurality of metadata fields and corresponding portions of the document text;
a first descriptive database configured to store descriptiveness scores for a first corpus of words;
a descriptiveness score retriever configured to retrieve a first descriptiveness score for each word in the current document based on the first descriptive database;
a prominence score retriever configured to retrieve a prominence score for each word in the current document, the prominence score for a particular word being based on a non-compound likelihood for the particular word, the non-compound likelihood for the particular word being based on a probability that the particular word and a word that is adjacent to the particular word in the document text do not combine to form a word compound;
a descriptiveness score modifier configured to use the processor and the memory to modify the first descriptiveness score for each word based on a weight assigned to the metadata field to which the word corresponds;
a comparison metric generator configured to generate a comparison metric for the current document based on the modified first descriptiveness score for each word and the prominence score for each word;
a query processor configured to find at least one potential document, wherein document text for each potential document includes at least one word from the current document; and
a similar document identifier configured to identify at least one similar document from the found potential documents based on the generated comparison metric for the current document and a respective comparison metric for each found potential document.
Dependent claims: 13, 14, 15, 16, 17, 18, 19
20. A non-transitory computer-readable medium having computer-executable instructions stored thereon that, when executed by a computing device, cause the computing device to perform a method for identifying similar documents, comprising:
receiving document text for a current document, the current document including a plurality of metadata fields and corresponding portions of the document text;
determining a first descriptiveness score for each word of the document text based on a first corpus;
determining a prominence score for each word of the document text, the prominence score for a particular word being based on a non-compound likelihood for the particular word, the non-compound likelihood for the particular word being based on a probability that the particular word and a word that is adjacent to the particular word in the document text do not combine to form a word compound;
modifying the first descriptiveness score for each word based on a weight assigned to the metadata field to which the word corresponds;
determining a comparison metric for the current document based on the modified first descriptiveness score for each word and the prominence score for each word;
finding, using a query processor, at least one potential document, wherein document text for each potential document includes at least one word from the current document; and
analyzing each found potential document to identify at least one similar document as a function of a comparison metric for the respective potential document and the comparison metric for the current document.
Specification