Suffix Tree Similarity Measure for Document Clustering
First Claim
1. A computer system comprising at least one memory having stored therein computer executable components for facilitating document similarity measure and a processor that executes the computer executable components, the computer executable components comprising:
a mapping component to map a suffix tree document model to a vector document model, wherein the vector document model is a vector with M elements, and M is the total number of nodes in the suffix tree document model;
a weighting component to weight elements of the mapped vector document model; and
a similarity component to determine the similarity between two or more weighted vector document models.
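The three components of this claim can be illustrated with a minimal Python sketch. It is not the patented implementation. Several choices are assumptions made only for illustration: the suffix tree is simplified to an uncompressed word-level suffix trie (each distinct phrase is one node), the weight is a tf-idf style formula, and the similarity is the cosine of the weighted vectors. The claim itself does not name a weighting formula or a similarity function, and the function names build_suffix_trie, to_weighted_vectors, and cosine are hypothetical.

```python
import math
from collections import defaultdict


def build_suffix_trie(docs):
    """Return {phrase: {doc_id: count}} for every node of a word-level suffix trie.

    Simplification (assumption): each distinct word sequence that starts some
    suffix of a document is treated as one node; paths are not compressed.
    """
    nodes = defaultdict(lambda: defaultdict(int))
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                nodes[tuple(words[start:end])][doc_id] += 1
    return nodes


def to_weighted_vectors(docs):
    """Map the trie to one vector per document, with M = number of nodes."""
    nodes = build_suffix_trie(docs)
    node_ids = sorted(nodes)  # fixed node order shared by all vectors
    n_docs = len(docs)
    vectors = [[0.0] * len(node_ids) for _ in docs]
    for j, phrase in enumerate(node_ids):
        df = len(nodes[phrase])  # number of documents passing through the node
        idf = math.log(1 + n_docs / df)
        for doc_id, tf in nodes[phrase].items():
            # tf-idf style weight (an assumed formula, chosen for illustration)
            vectors[doc_id][j] = (1 + math.log(tf)) * idf
    return node_ids, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
    node_ids, vectors = to_weighted_vectors(docs)
    print("M =", len(node_ids))
    print("sim(d0, d1) =", cosine(vectors[0], vectors[1]))
```

Because every node is a phrase rather than a single word, two documents that share a multi-word phrase overlap in several vector elements at once, which is what distinguishes this mapping from a plain bag-of-words vector.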
Abstract
The subject innovation provides for systems and methods to facilitate weighted suffix tree clustering. Conventional suffix tree cluster models can be augmented by incorporating quality measures to facilitate improved performance. Further, the quality measure can be employed in determining cluster labels that show improvements in accuracy over conventional means. Additionally, “stopnodes” can be defined to facilitate traversing suffix tree models efficiently. Quality measurements can be determined based in part on weighting factors applied to terms in a vector model, said terms being mapped from a suffix tree model.
94 Citations
14 Claims
8. A method for Web document clustering in online forum communities, comprising:
acquiring a plurality of documents from an online forum system;
creating a suffix tree document model for the plurality of documents;
mapping the suffix tree document model to a vector document model;
weighting elements of the mapped vector document model to generate weighted elements;
based on the weighted elements, determining the similarity between two or more weighted vector document models;
building a clustering result according to the similarity between the two or more weighted vector document models by using a GAHC algorithm; and
generating a cluster topic summary for each cluster.
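The clustering step of this method can be sketched as follows. The sketch assumes that GAHC stands for group-average agglomerative hierarchical clustering, that cluster similarity is the average pairwise similarity between the members of two clusters, and that merging stops once the best average falls below a threshold. None of these details is stated in the claim, and the function name gahc and the threshold parameter are hypothetical.

```python
def gahc(sim, threshold):
    """Group-average agglomerative clustering over a similarity matrix.

    sim: square, symmetric list of lists of non-negative pairwise similarities.
    threshold: stop merging when the best average similarity drops below it
               (an assumed stopping rule).
    Returns a list of clusters, each a list of document indices.
    """
    clusters = [[i] for i in range(len(sim))]

    def average_similarity(a, b):
        # Mean similarity over all cross-cluster document pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = average_similarity(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]  # merge the closest pair
        del clusters[y]
    return clusters


if __name__ == "__main__":
    sim = [
        [1.0, 0.9, 0.1, 0.0],
        [0.9, 1.0, 0.2, 0.1],
        [0.1, 0.2, 1.0, 0.8],
        [0.0, 0.1, 0.8, 1.0],
    ]
    print(gahc(sim, threshold=0.5))
```

The similarity matrix would come from the preceding step of the method, that is, from pairwise similarities of the weighted vector document models. The exhaustive pair search keeps the sketch short; it is not tuned for large document collections.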
14. A system for document clustering, comprising:
means for mapping a suffix tree document model to a vector document model to generate a mapping;
means for weighting elements of the mapped vector document model to generate weighted elements to be applied to the mapping; and
means for determining similarity between two or more weighted vector document models based on the weighted elements.
Specification