Document similarity scoring and ranking method, device and computer program product
First Claim
1. A computer-based method of electronic document searching, navigating or retrieving documents in a set of electronic documents, comprising:
- analyzing the set of documents based on at least a portion of a similarity graph and at least a portion of a hyperlink graph constructed from hyperlinks between the set of documents, wherein;
said step of analyzing the set of documents using said at least a portion of the similarity graph and at least a portion of a hyperlink graph includes one of;
combining the at least a portion of the similarity graph and the at least a portion of the hyperlink graph into a single, hybrid graph by adding at least a portion of a hyperlink matrix to the at least a portion of a similarity matrix, and determining a score for the documents in the set of documents from an eigenvector of a matrix of the hybrid graph, the hybrid graph comprising one of;
the whole similarity graph and a subgraph of the hyperlink graph, and
a subgraph of the similarity graph and the whole hyperlink graph; and
obtaining two eigenvectors of scores, one from each of the at least a portion of the similarity graph and the at least a portion of the hyperlink graph, and determining a net score for each document in the set of documents from a weighted combination of said two eigenvectors of scores,
said method of searching, navigating or retrieving further comprising;
ranking at least one of the set of documents against another of the set of documents with corresponding document scores; and
at least one of electronically searching said set of electronic documents, navigating said set of electronic documents, and retrieving from said set of electronic documents based on said ranking.
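The two scoring alternatives recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The following are assumptions made only for illustration: the function names, the use of power iteration to obtain a dominant eigenvector, the normalization of scores to sum to 1, the convention that entry [i][j] holds the weight of the edge from document j to document i, the parameters `link_weight` and `alpha`, and the small example matrices. Power iteration also assumes non-negative, primitive matrices, which the claim does not require.

```python
def principal_eigenvector(matrix, iterations=1000, tol=1e-12):
    """Approximate the dominant eigenvector by power iteration.

    Assumes a non-negative, primitive square matrix given as a list of rows.
    The result is normalized so that its entries sum to 1.
    """
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        if total == 0:
            return v  # degenerate case (e.g. zero matrix): keep last estimate
        w = [x / total for x in w]
        if max(abs(a - b) for a, b in zip(v, w)) < tol:
            return w
        v = w
    return v


def hybrid_scores(similarity, hyperlinks, link_weight=1.0):
    """First alternative: add the hyperlink matrix to the similarity matrix
    and score documents from an eigenvector of the hybrid matrix.
    link_weight is a hypothetical scaling factor, not taken from the claim."""
    n = len(similarity)
    hybrid = [
        [similarity[i][j] + link_weight * hyperlinks[i][j] for j in range(n)]
        for i in range(n)
    ]
    return principal_eigenvector(hybrid)


def combined_scores(similarity, hyperlinks, alpha=0.5):
    """Second alternative: one eigenvector per graph, then a weighted
    combination. alpha is a hypothetical weight in [0, 1]."""
    s = principal_eigenvector(similarity)
    h = principal_eigenvector(hyperlinks)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(s, h)]


def rank(scores):
    """Return document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical three-document example.
S = [[0.0, 0.8, 0.1],
     [0.8, 0.0, 0.3],
     [0.1, 0.3, 0.0]]   # symmetric similarity weights
H = [[0, 0, 1],
     [1, 0, 0],
     [1, 1, 0]]         # H[i][j] = 1 if document j links to document i

hybrid = hybrid_scores(S, H)
combined = combined_scores(S, H, alpha=0.5)
print(rank(hybrid), rank(combined))
```

Using a subgraph of either graph, as the claim allows, would correspond here to zeroing out selected entries of `S` or `H` before scoring.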
Abstract
A device, computer program product and a method for computing the similarity of a set of documents that avoids the large, wasted computational effort involved in calculating very small similarity scores by using thresholds to stop a similarity calculation between documents, thus ensuring that, with high probability, all document pairs with higher similarity than the thresholds have been found.
16 Claims
Specification