Clustering hypertext with applications to WEB searching
Abstract
A method and structure for providing a database of documents comprising performing a search of the database using a query to produce query result documents, constructing a word dictionary of words within the query result documents, pruning function words from the word dictionary, forming first vectors for words remaining in the word dictionary, constructing an out-link dictionary of documents within the database that are pointed to by the query result documents, adding the query result documents to the out-link dictionary, pruning documents from the out-link dictionary that are pointed to by fewer than a first predetermined number of the query result documents, forming second vectors for documents remaining in the out-link dictionary, constructing an in-link dictionary of documents within the database that point to the query result documents, adding the query result documents to the in-link dictionary, pruning documents from the in-link dictionary that point to fewer than a second predetermined number of the query result documents, forming third vectors for documents remaining in the in-link dictionary, normalizing the first vectors, the second vectors, and the third vectors to create vector triplets for documents remaining in the in-link dictionary and the out-link dictionary, clustering the vector triplets using the toric k-means process, and annotating/summarizing the obtained clusters using nuggets of information, the nuggets including summary, breakthrough, review, keyword, citation, and reference.
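As a rough illustration of the pruning and normalization steps named in the abstract, the sketch below drops link targets referenced by fewer than a threshold number of query result documents and scales a feature vector to unit length. The function names, the `min_count` parameter, and the binary (present or absent) feature values are assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


def prune_link_dictionary(link_lists, min_count):
    """Keep link targets referenced by at least `min_count` documents.

    `link_lists` maps a query result document id to the ids it points to
    (hypothetical input layout).
    """
    counts = Counter(
        target for targets in link_lists.values() for target in set(targets)
    )
    return sorted(t for t, c in counts.items() if c >= min_count)


def unit_vector(values):
    """Scale a vector to unit Euclidean length; zero vectors are returned as-is."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else list(values)


# Toy data: three query results and their out-links.
out_links = {"d1": ["a", "b"], "d2": ["a"], "d3": ["a", "c"]}
kept = prune_link_dictionary(out_links, min_count=2)

# Binary out-link vector for d1 over the surviving dictionary (an assumption).
vec_d1 = unit_vector([1.0 if t in out_links["d1"] else 0.0 for t in kept])
```

The same two helpers could be applied to the in-link dictionary with its own threshold, which corresponds to the "second predetermined number" in the abstract.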
52 Citations
54 Claims
1. A method of searching a database containing hypertext documents, said method comprising:
searching said database using a query to produce a set of hypertext documents; and
clustering said set of hypertext documents into various clusters such that documents within each cluster are similar to each other.
Dependent claims: 2-14

15. A method of searching a database containing hypertext documents, said method comprising:
searching said database using a query to produce a set of hypertext documents; and
clustering said set of hypertext documents into various clusters such that documents within each cluster are similar to each other, wherein said clustering is based upon words contained in each hypertext document, out-links from each hypertext document, and in-links to each hypertext document.
Dependent claims: 16-27
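One way to read "based upon words, out-links, and in-links" is to represent each document as three vectors and to score two documents with a weighted sum of the three per-component inner products. The sketch below shows that reading only; the equal weights and the function names are assumptions, and the claim itself does not fix a particular similarity measure.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def triplet_similarity(x, y, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of inner products over (words, out-links, in-links).

    `x` and `y` are triplets of vectors; `weights` is a hypothetical
    tuning parameter, not something stated in the claim.
    """
    return sum(w * dot(a, b) for w, a, b in zip(weights, x, y))


# Two toy documents that share words and out-links but not in-links.
doc_x = ([1.0, 0.0], [1.0], [0.0, 1.0])
doc_y = ([1.0, 0.0], [1.0], [1.0, 0.0])
score = triplet_similarity(doc_x, doc_y)
```

Changing `weights` shifts the emphasis between textual and link-based evidence, which is the main design decision such a measure exposes.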

28. A method of searching a database of documents comprising:
performing a search of said database using a query to produce query result documents;
constructing a word dictionary of words within said query result documents;
constructing an out-link dictionary of documents within said database that are pointed to by said query result documents;
adding said query result documents to said out-link dictionary;
constructing an in-link dictionary of documents within said database that point to said query result documents; and
adding said query result documents to said in-link dictionary.
Dependent claims: 29-40
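The dictionary-building steps of claim 28 can be sketched as follows. The input layout (a list of result ids, a mapping from id to text, and a mapping from every database document to the ids it links to), the whitespace tokenizer, and the function name are all assumptions made for illustration.

```python
def build_dictionaries(results, text, links_out):
    """Build word, out-link, and in-link dictionaries for a query result set.

    results:   ids of the query result documents
    text:      id -> document text (hypothetical layout)
    links_out: id -> list of ids that document points to, for the whole database
    """
    result_set = set(results)

    # Word dictionary: every distinct word in the query result documents.
    word_dict = {w for d in results for w in text[d].lower().split()}

    # Out-link dictionary: documents pointed to by the results, plus the results.
    out_dict = {t for d in results for t in links_out.get(d, [])}
    out_dict |= result_set

    # In-link dictionary: documents pointing to any result, plus the results.
    in_dict = {s for s, targets in links_out.items() if result_set & set(targets)}
    in_dict |= result_set

    return sorted(word_dict), sorted(out_dict), sorted(in_dict)


text = {"p": "web search", "q": "search clusters"}
links_out = {"p": ["x"], "q": ["p"], "z": ["q"], "x": []}
words, out_links, in_links = build_dictionaries(["p", "q"], text, links_out)
```

Here `z` enters the in-link dictionary because it points to result `q`, and `x` enters the out-link dictionary because result `p` points to it.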

41. A method of searching a database of documents comprising:
performing a search of said database using a query to produce query result documents;
constructing a word dictionary of words within said query result documents;
pruning function words from said word dictionary;
forming first vectors for words remaining in said word dictionary;
constructing an out-link dictionary of documents within said database that are pointed to by said query result documents;
adding said query result documents to said out-link dictionary;
pruning documents from said out-link dictionary that are pointed to by fewer than a first predetermined number of said query result documents;
forming second vectors for documents remaining in said out-link dictionary;
constructing an in-link dictionary of documents within said database that point to said query result documents;
adding said query result documents to said in-link dictionary;
pruning documents from said in-link dictionary that point to fewer than a second predetermined number of said query result documents;
forming third vectors for documents remaining in said in-link dictionary;
normalizing said first vectors, said second vectors, and said third vectors to create vector triplets for documents remaining in said in-link dictionary and said out-link dictionary;
clustering said vector triplets using a four-step toric k-means process comprising:
(a) arbitrarily segregating said vector triplets into clusters;
(b) for each cluster, computing a set of concept triplets describing said cluster;
(c) re-segregating said vector triplets into a more coherent set of clusters by putting each vector triplet into the cluster corresponding to the concept triplet that is most similar to that vector triplet; and
(d) determining a coherence for each of said clusters based on a similarity of vector triplets within each of said clusters, and repeating steps (b)-(c) until coherence of the obtained clusters no longer significantly increases; and
annotating and summarizing said vector triplets using nuggets of information, said nuggets including summary, breakthrough, review, keyword, citation, and reference.
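Steps (a) through (d) of the clustering in claim 41 can be sketched as the loop below. This is an illustrative reading, not the patented implementation: the round-robin initial partition, the concept triplet computed as a component-wise normalized sum, the equal weights, the empty-cluster fallback, and the `tol` and `max_iter` parameters are all assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v] if n > 0 else list(v)


def similarity(x, c, weights):
    # Weighted sum of inner products over the three components (assumed measure).
    return sum(w * dot(a, b) for w, a, b in zip(weights, x, c))


def concept_triplet(members):
    # Component-wise sum over the cluster, each component scaled to unit length.
    return tuple(
        normalize([sum(col) for col in zip(*(m[j] for m in members))])
        for j in range(3)
    )


def toric_kmeans(triplets, k, weights=(1 / 3, 1 / 3, 1 / 3), tol=1e-6, max_iter=100):
    # (a) arbitrary initial partition: round-robin assignment.
    labels = [i % k for i in range(len(triplets))]
    previous = -math.inf
    concepts = []
    for _ in range(max_iter):
        # (b) one concept triplet per cluster.
        concepts = []
        for c in range(k):
            members = [x for x, lab in zip(triplets, labels) if lab == c]
            if not members:
                # Hypothetical fallback: reseed an empty cluster from one triplet.
                members = [triplets[c % len(triplets)]]
            concepts.append(concept_triplet(members))
        # (c) move each triplet to the cluster with the most similar concept.
        labels = [
            max(range(k), key=lambda c: similarity(x, concepts[c], weights))
            for x in triplets
        ]
        # (d) total coherence; stop once it no longer increases by more than tol.
        coherence = sum(
            similarity(x, concepts[lab], weights) for x, lab in zip(triplets, labels)
        )
        if coherence - previous <= tol:
            break
        previous = coherence
    return labels, concepts


# Toy data: two documents of one kind followed by two of another.
A = ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
B = ([0.0, 1.0], [0.0, 1.0], [0.0, 1.0])
labels, concepts = toric_kmeans([A, A, B, B], k=2)
```

On this toy input the initial round-robin partition mixes the two kinds of document, and the loop separates them after a few passes. The annotation step with nuggets is not covered by this sketch.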

42. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform a method of searching a database containing hypertext documents, said method comprising:
searching said database using a query to produce a set of hypertext documents; and
clustering said set of hypertext documents into various clusters such that documents within each cluster are similar to each other, wherein said clustering is based upon words contained in each hypertext document, out-links from each hypertext document, and in-links to each hypertext document.
Dependent claims: 43-54
Specification