
Document similarity scoring and ranking method, device and computer program product

  • US 7,689,559 B2
  • Filed: 02/08/2006
  • Issued: 03/30/2010
  • Est. Priority Date: 02/08/2006
  • Status: Active Grant
First Claim

1. A computer-based method of searching, navigating or retrieving, from a set of electronic documents, comprising:

  • electronically constructing and storing a word corpus from said set of electronic documents, each document in said set of electronic documents having a corresponding document ID;

    electronically constructing an inverted index, based on the electronic corpus and set of electronic documents;

    for each word in the inverted index, obtaining a plurality of document similarity scores by:

      sorting the document IDs of said documents according to a word similarity score to form a sorted set of document IDs, wherein said word appears in each document represented in said sorted set, and wherein said sorted set is an index-word document list, said documents being sorted into decreasing order of similarity,

      calculating a document similarity score between pairs of documents identified in said index-word document list,

      entering the calculated document similarity scores into a matrix of similarity scores wherein each similarity score represents a degree of similarity between a pair of documents, said matrix being a similarity graph (S),

      in said matrix, treating a degree of similarity between each pair of documents for which a similarity score has not been calculated as being a zero value;

    using said similarity graph (S) when performing a similarity analysis of said documents for at least one of:

      searching said set of electronic documents based on said similarity analysis,

      navigating said set of electronic documents based on said similarity analysis, and

      retrieving from set of documents based on said similarity analysis,

    wherein the obtaining a plurality of document similarity scores further includes at least one of:

      the step of sorting the document IDs of said documents further includes truncating said sorted list by removing documents whose similarity is less than a threshold τ_word, and

      the step of calculating a document similarity score further includes calculating the document similarity score between pairs of documents identified in said index-word document list until a first occurrence of a similarity score lower than a threshold τ_set.
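
The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not specify how the word similarity score or the document similarity score is computed, so this sketch assumes relative term frequency for the former and cosine similarity over raw term counts for the latter. The function names and the parameters `tau_word` and `tau_set` are hypothetical and chosen only to mirror the thresholds named in the claim. The similarity graph (S) is held as a sparse dictionary, so any pair that was never scored reads back as zero.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math


def build_inverted_index(docs):
    """Return (index, vectors): index maps word -> {doc_id: word score}.

    Assumption: the word score is the word's relative frequency in the document.
    """
    index = defaultdict(dict)
    vectors = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        vectors[doc_id] = counts
        for word, n in counts.items():
            index[word][doc_id] = n / total
    return index, vectors


def cosine(a, b):
    """Cosine similarity of two term-count vectors (an assumed document score)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_graph(docs, tau_word=0.0, tau_set=0.0):
    """Build a sparse similarity graph S as {(doc_a, doc_b): score}."""
    index, vectors = build_inverted_index(docs)
    graph = {}
    for word, postings in index.items():
        # Index-word document list: doc IDs in decreasing order of word score.
        ranked = sorted(postings, key=lambda d: (-postings[d], d))
        # Optional truncation: drop documents scoring below tau_word.
        ranked = [d for d in ranked if postings[d] >= tau_word]
        for a, b in combinations(ranked, 2):
            key = tuple(sorted((a, b)))
            if key in graph:
                continue  # pair already scored via another word
            score = cosine(vectors[a], vectors[b])
            if score < tau_set:
                break  # stop this list at the first score below tau_set
            graph[key] = score
    return graph


def lookup(graph, a, b):
    """Read S[a, b]; pairs that were never scored count as zero."""
    return graph.get(tuple(sorted((a, b))), 0.0)
```

For example, with `docs = {1: "apple banana apple", 2: "apple banana", 3: "cherry durian"}`, only documents 1 and 2 share an index word, so the graph holds a single entry for that pair and `lookup(graph, 1, 3)` returns `0.0`. Raising either threshold prunes work: a higher `tau_word` shortens each index-word document list, and a higher `tau_set` ends the pairwise pass over a list early.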
