Deriving document similarity indices
First Claim
1. At a computer system including one or more processors and system memory, a method for deriving a document similarity index for a plurality of documents, the method comprising:
an act of accessing a document;
an act of computing a tag index for the document, the tag index including one or more keyword/weight pairs, each keyword/weight pair mapping a keyword to a corresponding weight for the keyword to indicate a significance of the keyword within the document;
an act of identifying a specified number of most significant keywords in the document based on weights in the tag index;
for each keyword in the specified number of the most significant keywords, an act of determining the corresponding weight of the keyword in each document in the plurality of documents;
an act of identifying a plurality of candidate documents, from among the plurality of documents, based on the corresponding weights of the specified number of the most significant keywords in the plurality of documents, at least some of the specified number of the most significant keywords in the document also being significant keywords in each of the plurality of candidate documents;
for each candidate document in the plurality of candidate documents, an act of calculating a full similarity between the document and the candidate document by determining the weight of additional keywords from the document within the candidate document;
an act of selecting full similarities for a prescribed number of candidate documents for inclusion in the document similarity index to indicate documents that are similar to the document, selection of the full similarities for the prescribed number of candidate documents based on the full similarity calculations and in accordance with one of a hard limit or an express threshold, the hard limit or the express threshold limiting the number of candidate documents that can be selected for inclusion in the document similarity index; and
for each candidate document included in the prescribed number of candidate documents, an act of storing information from the full similarity between the document and the candidate document in the document similarity index.
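The acts of claim 1 can be pictured as a small pipeline. The following is only a minimal sketch under stated assumptions, not the patented implementation: the claim does not say how weights are computed, so L2-normalized term frequency is assumed; the full similarity is assumed to be a dot product of those normalized weights; and every name here (`compute_tag_index`, `keyword_to_docs`, `derive_similarities`, `top_k`, `hard_limit`, `threshold`) is hypothetical.

```python
import math
from collections import Counter, defaultdict

def compute_tag_index(text):
    """Map each keyword to a weight. Assumption: L2-normalized term frequency."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}

def top_keywords(tag_index, k):
    """Return the k keywords with the largest weights (ties broken alphabetically)."""
    return sorted(tag_index, key=lambda w: (-tag_index[w], w))[:k]

def find_candidates(keywords, keyword_to_docs):
    """Documents in which at least one of the given keywords has a nonzero weight."""
    candidates = set()
    for kw in keywords:
        for doc_id, weight in keyword_to_docs.get(kw, {}).items():
            if weight > 0.0:
                candidates.add(doc_id)
    return candidates

def full_similarity(a, b):
    """Dot product over all keywords; equals cosine similarity for normalized weights."""
    return sum(w * b.get(kw, 0.0) for kw, w in a.items())

def derive_similarities(doc_id, text, tag_indices, keyword_to_docs, similarity_index,
                        top_k=3, hard_limit=2, threshold=None):
    """Score a document against stored documents and record its strongest matches."""
    tag_index = compute_tag_index(text)
    keywords = top_keywords(tag_index, top_k)
    candidates = find_candidates(keywords, keyword_to_docs) - {doc_id}
    scored = sorted(
        ((full_similarity(tag_index, tag_indices[c]), c) for c in candidates),
        reverse=True,
    )
    # Apply either the express threshold or the hard limit, never both.
    if threshold is not None:
        scored = [(s, c) for s, c in scored if s >= threshold]
    else:
        scored = scored[:hard_limit]
    similarity_index[doc_id] = [(c, s) for s, c in scored]
    # Register the document so that later documents can find it as a candidate.
    tag_indices[doc_id] = tag_index
    for kw, w in tag_index.items():
        keyword_to_docs[kw][doc_id] = w
    return similarity_index[doc_id]

tag_indices, keyword_to_docs, similarity_index = {}, defaultdict(dict), {}
corpus = {"a": "cats chase mice", "b": "dogs chase cats", "c": "stocks fell sharply"}
for doc_id, text in corpus.items():
    derive_similarities(doc_id, text, tag_indices, keyword_to_docs, similarity_index)
```

In this toy run, document "b" is linked to "a" because they share two keywords, while "c" has no candidates at all. The cheap candidate step means the full similarity is only computed for documents that already share a significant keyword.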
Abstract
The present invention extends to methods, systems, and computer program products for deriving document similarity indices. Embodiments of the invention include scalable and efficient mechanisms for deriving and updating a document similarity index for a plurality of documents. The number of maintained similarities can be controlled to conserve CPU and storage resources.
20 Claims
9. At a computer system including one or more processors and system memory, the computer system also including a plurality of documents and a document similarity index, the document similarity index indicating similarities between different documents in the plurality of documents, a method for updating the document similarity index, the method comprising:
an act of accessing a batch of documents;
for each document in the batch of documents, an act of computing a tag index for the document, the tag index including one or more keyword/weight pairs, each keyword/weight pair mapping a keyword to a corresponding weight for the keyword to indicate a significance of the keyword within the document;
for each document in the batch of documents subsequent to computing the tag indices:
an act of identifying a specified number of the most significant keywords in the document based on weights in the tag index;
for each keyword in the specified number of most significant keywords, an act of determining the corresponding weight of the keyword in each document in the plurality of documents and in the batch of documents;
an act of identifying a plurality of candidate documents, from among the plurality of documents and the batch of documents, based on the corresponding weights of the specified number of the most significant keywords in the plurality of documents and in the batch of documents, at least some of the specified number of the most significant keywords in the document also being significant keywords in each of the plurality of candidate documents;
for at least one candidate document identified from within the plurality of documents:
an act of calculating a full similarity between the document and the candidate document by determining the weight of additional keywords from the document within the candidate document;
an act of identifying the weakest similarity, from among a specified number of top similarities, for the candidate document from within the document similarity index, the weakest similarity indicating the similarity between the candidate document and another document in the plurality of documents;
an act of determining that the candidate document and the document are more similar than the candidate document and the other document by comparing the calculated full similarity to the identified weakest similarity; and
an act of replacing the weakest similarity with information from the calculated full similarity within the document similarity index to incrementally update the document similarity index, the replacement based on the determination; and
for any candidate documents identified from within the batch of documents:
an act of calculating a full similarity between the document and the candidate document by determining the weight of additional keywords from the document within the candidate document;
an act of selecting a prescribed number of candidate documents for inclusion in the document similarity index as documents that are similar to the document, selection of the prescribed number of candidate documents based on the full similarity calculations and in accordance with one of a hard limit or an express threshold, the hard limit or the express threshold limiting the number of candidate documents that can be selected for inclusion in the document similarity index; and
an act of storing information from the calculated full similarity between the document and the candidate document in the document similarity index.
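The incremental update in claim 9 turns on one comparison: the newly calculated full similarity against the weakest of the candidate's stored top similarities. A minimal sketch of that step follows, assuming each document's top similarities are held in a min-heap so the weakest entry is always at position 0. The function name `update_top_similarities`, the `max_entries` parameter, and the handling of a not-yet-full list are assumptions for illustration; the claim does not specify a data structure.

```python
import heapq

def update_top_similarities(similarity_index, candidate_id, new_doc_id, new_score,
                            max_entries=3):
    """Offer (new_score, new_doc_id) to the candidate's top-similarity list.

    similarity_index maps a document id to a min-heap of (score, other_doc_id)
    tuples. Returns True if the index changed, False if the entry was rejected.
    """
    heap = similarity_index.setdefault(candidate_id, [])
    if len(heap) < max_entries:
        # Assumption: while the list is not yet full, the entry is simply added.
        heapq.heappush(heap, (new_score, new_doc_id))
        return True
    weakest_score, _ = heap[0]  # the smallest score sits at the heap root
    if new_score > weakest_score:
        # Replace the weakest similarity with the stronger new one.
        heapq.heapreplace(heap, (new_score, new_doc_id))
        return True
    return False

index = {}
for other, score in [("d2", 0.2), ("d3", 0.5), ("d4", 0.4)]:
    update_top_similarities(index, "d1", other, score)
replaced = update_top_similarities(index, "d1", "d5", 0.3)  # beats 0.2, so it replaces d2
rejected = update_top_similarities(index, "d1", "d6", 0.1)  # weaker than 0.3, so no change
```

Because only the root of the heap is inspected, each update costs at most logarithmic time in `max_entries`, and the number of stored similarities per document never exceeds that bound.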
18. A computer program product for use at a computer system, the computer program product for implementing a method for deriving a document similarity index for a plurality of documents, the computer program product comprising one or more computer storage devices having stored thereon computer executable instructions that, when executed at a processor, cause the computer system to perform the method including the following:
access a document containing words in a written language;
compute a tag index for the document, the tag index including one or more keyword/weight pairs, each keyword/weight pair mapping a keyword to a corresponding weight for the keyword to indicate a significance of the keyword within the document;
identify a specified number of most significant keywords in the document based on weights in the tag index;
for each keyword in the specified number of the most significant keywords, determine the corresponding weight of the keyword in each document in the plurality of documents;
identify a plurality of candidate documents, from among the plurality of documents, based on the corresponding weights of the specified number of the most significant keywords in the plurality of documents, at least some of the specified number of the most significant keywords accessed from a least recently used cache;
for each candidate document in the plurality of candidate documents, use a cosine-similarity function to calculate a full similarity between the document and the candidate document by determining the weight of additional keywords from the document within the candidate document;
for a first one or more candidate documents:
select full similarities for a prescribed number of candidate documents for inclusion in the document similarity index to indicate documents that are similar to the document, selection of the full similarities for the prescribed number of candidate documents based on the full similarity calculations and in accordance with one of a hard limit or an express threshold, the hard limit or the express threshold limiting the number of candidate documents that can be selected for inclusion in the document similarity index; and
for each candidate document included in the prescribed number of candidate documents, store information from the full similarity between the document and the candidate document in the document similarity index;
for a second one or more candidate documents:
identify the weakest similarity, from among a specified number of top similarities, for the candidate document from within the document similarity index, the weakest similarity indicating the similarity between the candidate document and another document in the plurality of documents;
determine if the candidate document and the document are more similar than the candidate document and the other document by comparing the calculated full similarity to the identified weakest similarity;
replace the weakest similarity with information from the calculated full similarity within the document similarity index, when the candidate document and the document are more similar than the candidate document and the other document; and
retain the weakest similarity, when the candidate document and the document are not more similar than the candidate document and the other document.
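Claim 18 names two concrete mechanisms: a cosine-similarity function and a least recently used cache for keyword access. The sketch below shows one plausible reading of each, under assumptions: tag indices are plain keyword-to-weight dictionaries, and the cache wraps a per-keyword lookup against a backing store. `KEYWORD_STORE`, `weights_for_keyword`, and the cache size are hypothetical stand-ins; the claim does not describe how the cache is organized.

```python
import math
from functools import lru_cache

# Hypothetical backing store: keyword -> {doc_id: weight}. A real system would
# presumably read this from a database or an on-disk index.
KEYWORD_STORE = {
    "cats": {"a": 0.6, "b": 0.8},
    "dogs": {"b": 0.6},
}

@lru_cache(maxsize=1024)
def weights_for_keyword(keyword):
    """Return ((doc_id, weight), ...) for a keyword.

    Repeat lookups are served from the cache; when the cache is full, the
    least recently used keyword is evicted first.
    """
    return tuple(sorted(KEYWORD_STORE.get(keyword, {}).items()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse keyword/weight mappings."""
    dot = sum(w * b.get(kw, 0.0) for kw, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty tag index is treated as similar to nothing
    return dot / (norm_a * norm_b)

doc = {"cats": 0.6, "mice": 0.8}
other = {"cats": 0.8, "dogs": 0.6}
score = cosine_similarity(doc, other)
```

The lookup returns an immutable tuple so that callers cannot accidentally mutate a cached value. The resulting score can then feed the replace-or-retain comparison against the weakest stored similarity that the claim describes.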
Specification