Document similarity scoring and ranking method, device and computer program product
First Claim
1. A computer-based method of searching, navigating or retrieving, from a set of electronic documents, comprising:
  electronically constructing and storing a word corpus from said set of electronic documents, each document in said set of electronic documents having a corresponding document ID;
  electronically constructing an inverted index, based on the electronic corpus and set of electronic documents;
  for each word in the inverted index, obtaining a plurality of document similarity scores by:
    sorting the document IDs of said documents according to a word similarity score to form a sorted set of document IDs, wherein said word appears in each document represented in said sorted set, and wherein said sorted set is an index-word document list, said documents being sorted into decreasing order of similarity,
    calculating a document similarity score between pairs of documents identified in said index-word document list,
    entering the calculated document similarity scores into a matrix of similarity scores wherein each similarity score represents a degree of similarity between a pair of documents, said matrix being a similarity graph (S),
    in said matrix, treating a degree of similarity between each pair of documents for which a similarity score has not been calculated as being a zero value;
  using said similarity graph (S) when performing a similarity analysis of said documents for at least one of:
    searching said set of electronic documents based on said similarity analysis,
    navigating said set of electronic documents based on said similarity analysis, and
    retrieving from set of documents based on said similarity analysis,
  wherein the obtaining a plurality of document similarity scores further includes at least one of:
    the step of sorting the document IDs of said documents further includes truncating said sorted list by removing documents whose similarity is less than a threshold τ_word, and
    the step of calculating a document similarity score further includes calculating the document similarity score between pairs of documents identified in said index-word document list until a first occurrence of a similarity score lower than a threshold τ_set.
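The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation, only one possible reading under stated assumptions: the word similarity score is taken to be a word's weight in the document's unit-normalized term-count vector, the document similarity score is taken to be cosine similarity, and the "first occurrence" rule is read as ending the scan for the current document. The names `build_similarity_graph`, `similarity`, `tau_word` and `tau_set` are hypothetical; the last two stand for the claim's two thresholds.

```python
from collections import Counter, defaultdict
from math import sqrt


def build_similarity_graph(docs, tau_word=0.0, tau_set=0.0):
    """docs maps document ID -> text. Returns a sparse dict {(id_a, id_b): score}."""
    # Word corpus: per-document term counts (whitespace tokenization is an assumption).
    counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    norms = {d: sqrt(sum(c * c for c in tf.values())) for d, tf in counts.items()}

    # Inverted index: word -> {doc_id: word score}.
    # Assumed word score: the word's weight in the unit-normalized count vector.
    index = defaultdict(dict)
    for d, tf in counts.items():
        for word, c in tf.items():
            index[word][d] = c / norms[d]

    def cosine(a, b):
        # Assumed document similarity score: cosine of the term-count vectors.
        dot = sum(c * counts[b].get(w, 0) for w, c in counts[a].items())
        return dot / (norms[a] * norms[b])

    graph = {}
    for word, postings in index.items():
        # Index-word document list: IDs in decreasing order of word score,
        # truncated by dropping documents scoring below tau_word.
        ranked = sorted(postings, key=postings.get, reverse=True)
        ranked = [d for d in ranked if postings[d] >= tau_word]
        for i, a in enumerate(ranked):
            for b in ranked[i + 1:]:
                key = (a, b) if a < b else (b, a)
                score = graph[key] if key in graph else cosine(a, b)
                if score < tau_set:
                    break  # assumed reading: stop this scan at the first low score
                graph[key] = score
    return graph


def similarity(graph, a, b):
    """Pairs never scored are treated as having zero similarity."""
    key = (a, b) if a < b else (b, a)
    return graph.get(key, 0.0)
```

Storing only the computed pairs in a dictionary and defaulting every other lookup to 0.0 mirrors the claim's treatment of uncalculated entries as zero. For example, `similarity(build_similarity_graph({1: "apple banana", 2: "apple banana", 3: "cherry"}), 1, 3)` returns 0.0 because documents 1 and 3 share no index word and are never compared.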
Abstract
A device, computer program product and a method for searching, navigating or retrieving documents in a set of electronic documents, including performing a link analysis of the set of electronic documents. The link analysis includes one of analyzing at least two of the set of documents with at least a portion of a similarity graph constructed among the set of documents and analyzing the at least two of the set of documents with the at least a portion of the similarity graph and at least a portion of a hyperlink graph constructed from hyperlinks between the set of documents. Also described is a method for building a similarity matrix.
12 Claims
Specification