Document similarity scoring and ranking method, device and computer program product
Abstract
A device, computer program product and a method for searching, navigating or retrieving documents in a set of electronic documents, including performing a link analysis of the set of electronic documents. The link analysis includes one of analyzing at least two of the set of documents with at least a portion of a similarity graph constructed among the set of documents and analyzing the at least two of the set of documents with the at least a portion of the similarity graph and at least a portion of a hyperlink graph constructed from hyperlinks between the set of documents. Also described is a method for building a similarity matrix.
67 Claims
1-38. (canceled)
39. A computer-based method of electronic document searching, navigating or retrieving, including building a similarity graph of a set of electronic documents, comprising:
representing the similarity graph as a similarity matrix for said set of electronic documents, said step of representing comprising;
electronically constructing and storing a word corpus from said set of electronic documents, each document in said set of electronic documents having a corresponding document ID;
electronically constructing an inverted index, based on the electronic corpus and set of electronic documents;
for each word in the inverted index, obtaining a plurality of document similarity scores by;
electronically sorting document IDs of said set of electronic documents according to a word similarity score to form a sorted set of document IDs, each document ID in said sorted set corresponding uniquely to a document in said set of electronic documents, where said word appears in each document represented in said sorted set, said sorted set being an index-word document list, and electronically calculating a document similarity score between each pair of documents identified in said index-word document list, for which said pair of documents meets a set of threshold criteria;
collecting, into said similarity matrix, the document similarity scores calculated;
assigning a value from said similarity matrix as a link weight between a corresponding two documents identified in said index-word document list; and
treating all remaining matrix elements of said similarity matrix as zero; and
at least one of electronically searching said set of electronic documents based on said link weight, navigating said set of electronic documents based on said link weight, and retrieving from said set of electronic documents based on said link weight. - View Dependent Claims (40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 66)
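The steps recited in claim 39 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not fix the word similarity score, the document similarity score, or the threshold criteria, so this sketch assumes term frequency, cosine similarity over term frequencies, and a top-k cutoff per word, respectively. The names `build_similarity_matrix`, `link_weight`, `top_k` and `min_word_score` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math


def build_similarity_matrix(docs, top_k=3, min_word_score=1):
    """docs maps an integer document ID to its text.

    Returns a sparse matrix as {(id_a, id_b): score} with id_a < id_b.
    Any pair that is absent is treated as zero.
    """
    # Word corpus: term frequencies per document (assumed word score).
    tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

    # Inverted index: word -> IDs of the documents that contain it.
    inverted = defaultdict(list)
    for doc_id, counts in tf.items():
        for word in counts:
            inverted[word].append(doc_id)

    norms = {d: math.sqrt(sum(c * c for c in counts.values()))
             for d, counts in tf.items()}

    def cosine(a, b):
        # Assumed document similarity score: cosine over term frequencies.
        shared = tf[a].keys() & tf[b].keys()
        dot = sum(tf[a][w] * tf[b][w] for w in shared)
        return dot / (norms[a] * norms[b])

    sim = {}
    for word, doc_ids in inverted.items():
        # Index-word document list: IDs sorted by word score, highest first.
        ranked = sorted(doc_ids, key=lambda d: (-tf[d][word], d))
        # Assumed threshold criteria: keep the top_k documents per word
        # whose word score is at least min_word_score.
        kept = [d for d in ranked[:top_k] if tf[d][word] >= min_word_score]
        for a, b in combinations(sorted(kept), 2):
            if (a, b) not in sim:
                sim[(a, b)] = cosine(a, b)
    return sim


def link_weight(sim, a, b):
    """Link weight between two documents; pairs never scored count as zero."""
    return sim.get((min(a, b), max(a, b)), 0.0)
```

Because only pairs that share a sufficiently high-scoring word are ever compared, the matrix stays sparse; the cutoff trades recall of weak similarities for a smaller number of pairwise computations.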
52. A computer-based method of electronic document searching, navigating or retrieving documents in a set of electronic documents, comprising:
performing a link analysis of the set of electronic documents, said step of performing link analysis including one of analyzing at least two of the set of documents with at least a portion of a similarity graph constructed among the set of documents and analyzing said at least two of the set of documents with said at least a portion of the similarity graph and at least a portion of a hyperlink graph constructed from hyperlinks between the set of documents, wherein;
said step of analyzing said at least two of the set of documents with said at least a portion of the similarity graph and at least a portion of a hyperlink graph includes one of;
combining the at least a portion of the similarity graph and the at least a portion of the hyperlink graph into a single, hybrid graph, and obtaining scores from an eigenvector of a matrix of the hybrid graph, the hybrid graph comprising one of;
the whole similarity graph and a subgraph of the hyperlink graph, and a subgraph of the similarity graph and the whole hyperlink graph; and
obtaining two eigenvectors of scores, one from each of the at least a portion of the similarity graph and the at least a portion of the hyperlink graph, and determining a net score for each document in the set of documents from a weighted combination of said two eigenvectors of scores, said method of searching, navigating or retrieving further comprising;
ranking at least one of the set of documents against another of the set of documents with corresponding document scores; and
at least one of electronically searching said set of electronic documents, navigating said set of electronic documents, and retrieving from said set of electronic documents based on said ranking. - View Dependent Claims (53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 67)
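The two scoring alternatives recited in claim 52 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented method: the claim does not say how the two graphs are combined, so the sketch assumes a weighted sum of their adjacency matrices; it assumes non-negative matrices in which `matrix[i][j]` is the weight of the link from document `j` into document `i`; and it uses a shifted power iteration to approximate the principal eigenvector. All function names and the mixing parameter `alpha` are hypothetical.

```python
def principal_eigenvector(matrix, iterations=500, tol=1e-12):
    """Approximate the principal eigenvector of a non-negative square matrix.

    Iterates with (matrix + I), which has the same eigenvectors but avoids
    the oscillation plain power iteration can show on some graphs.
    The result is normalized so that its entries sum to 1.
    """
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iterations):
        w = [v[i] + sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        w = [x / total for x in w]
        if max(abs(a - b) for a, b in zip(v, w)) < tol:
            return w
        v = w
    return v


def hybrid_scores(similarity, hyperlink, alpha=0.5):
    """First alternative: one hybrid graph, scores from one eigenvector."""
    n = len(similarity)
    hybrid = [[alpha * similarity[i][j] + (1 - alpha) * hyperlink[i][j]
               for j in range(n)] for i in range(n)]
    return principal_eigenvector(hybrid)


def combined_scores(similarity, hyperlink, alpha=0.5):
    """Second alternative: weighted combination of two eigenvectors."""
    s = principal_eigenvector(similarity)
    h = principal_eigenvector(hyperlink)
    return [alpha * a + (1 - alpha) * b for a, b in zip(s, h)]


def rank(scores):
    """Document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))


# Three documents: a symmetric similarity graph and a directed hyperlink cycle.
S = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
ranking_hybrid = rank(hybrid_scores(S, H, alpha=0.5))
ranking_combined = rank(combined_scores(S, H, alpha=0.5))
```

The two alternatives generally give different rankings: the hybrid graph lets similarity links and hyperlinks reinforce each other during the eigenvector computation, while the weighted combination scores each graph independently and mixes the results afterwards.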
Specification