Deriving document similarity indices

  • US 8,478,740 B2
  • Filed: 12/16/2010
  • Issued: 07/02/2013
  • Est. Priority Date: 12/16/2010
  • Status: Active Grant
First Claim

1. At a computer system including one or more processors and system memory, a method for deriving a document similarity index for a plurality of documents, the method comprising:

  • an act of accessing a document;

    an act of computing a tag index for the document, the tag index including one or more keyword/weight pairs, each keyword/weight pair mapping a keyword to a corresponding weight for the keyword to indicate a significance of the keyword within the document;

    an act of identifying a specified number of most significant keywords in the document based on weights in the tag index;

    for each keyword in the specified number of the most significant keywords, an act of determining the corresponding weight of the keyword in each document in the plurality of documents;

    an act of identifying a plurality of candidate documents, from among the plurality of documents, based on the corresponding weights of the specified number of the most significant keywords in the plurality of documents, at least some of the specified number of the most significant keywords in the document also being significant keywords in each of the plurality of candidate documents;

    for each candidate document in the plurality of candidate documents, an act of calculating a full similarity between the document and candidate document by determining the weight of additional keywords from the document within the candidate document;

    an act of selecting full similarities for a prescribed number of candidate documents for inclusion in the document similarity index to indicate documents that are similar to the document, selection of the full similarities for the prescribed number of candidate documents based on the full similarity calculations and in accordance with one of a hard limit or an express threshold, the hard limit or the express threshold limiting the number of candidate documents that can be selected for inclusion in the document similarity index; and

    for each candidate document included in the prescribed number of candidate documents, an act of storing information from the full similarity between the document and the candidate document in the document similarity index.
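
The acts recited in claim 1 can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the function names (`top_keywords`, `find_candidates`, `full_similarity`, `similar_documents`), the parameters (`n_top`, `min_shared`, `hard_limit`, `threshold`), and the use of cosine similarity as the "full similarity" are all assumptions introduced here. The claim does not specify a weighting scheme or a similarity formula, so tag indexes are taken as precomputed keyword-to-weight dictionaries, and a keyword is treated as significant in a candidate whenever its weight there is nonzero.

```python
import math


def top_keywords(index, n):
    """Return the n highest-weighted keywords of a tag index (ties broken alphabetically)."""
    return sorted(index, key=lambda k: (-index[k], k))[:n]


def find_candidates(doc_index, others, n_top, min_shared):
    """Pick documents that carry at least min_shared of the document's top keywords.

    Assumption: a nonzero weight counts as "significant" in the other document.
    """
    top = top_keywords(doc_index, n_top)
    candidates = []
    for doc_id, other_index in others.items():
        shared = sum(1 for k in top if other_index.get(k, 0.0) > 0.0)
        if shared >= min_shared:
            candidates.append(doc_id)
    return candidates


def full_similarity(doc_index, other_index):
    """Cosine similarity over all keywords (an assumed stand-in for the full similarity)."""
    dot = sum(w * other_index.get(k, 0.0) for k, w in doc_index.items())
    norm = math.sqrt(sum(w * w for w in doc_index.values())) * math.sqrt(
        sum(w * w for w in other_index.values())
    )
    return dot / norm if norm else 0.0


def similar_documents(doc_id, corpus, n_top=3, min_shared=1, hard_limit=2, threshold=None):
    """Build the similarity-index entry for one document.

    corpus maps document ids to tag indexes ({keyword: weight}).
    If threshold is given it is used as the express threshold;
    otherwise the hard limit caps how many candidates are kept.
    """
    doc_index = corpus[doc_id]
    others = {d: idx for d, idx in corpus.items() if d != doc_id}
    scored = [
        (full_similarity(doc_index, others[c]), c)
        for c in find_candidates(doc_index, others, n_top, min_shared)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    if threshold is not None:
        scored = [pair for pair in scored if pair[0] >= threshold]
    else:
        scored = scored[:hard_limit]
    # The stored "information from the full similarity" is just the score here.
    return {c: s for s, c in scored}


# Hypothetical corpus of precomputed tag indexes.
corpus = {
    "a": {"python": 0.5, "index": 0.3, "search": 0.2},
    "b": {"python": 0.4, "search": 0.4, "ranking": 0.2},
    "c": {"cooking": 0.7, "recipe": 0.3},
    "d": {"index": 0.6, "database": 0.4},
}
entry = similar_documents("a", corpus, n_top=2, hard_limit=1)
```

The cheap candidate pass looks only at the few most significant keywords, so the more expensive full comparison runs on a small subset of the corpus rather than on every document pair. In the sample corpus, document "c" shares none of the top keywords of "a" and is never scored.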