Suffix tree similarity measure for document clustering
Abstract
The subject innovation provides systems and methods that facilitate weighted suffix tree clustering. Conventional suffix tree cluster models can be augmented with quality measures to improve performance. Further, the quality measure can be employed in determining cluster labels that are more accurate than those produced by conventional means. Additionally, “stopnodes” can be defined to facilitate efficient traversal of suffix tree models. Quality measurements can be determined based in part on weighting factors applied to terms in a vector model, said terms being mapped from a suffix tree model.
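The mapping from a suffix tree model to weighted vectors can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the text above: an uncompressed word-level suffix trie stands in for a true suffix tree, the vector has one element per non-root node, the weighting is a tf-idf variant, and the similarity is the cosine measure. The function names `build_suffix_trie`, `weighted_vectors`, and `cosine` are hypothetical.

```python
import math
from collections import defaultdict


def build_suffix_trie(docs):
    """Insert every word-level suffix of every document into one shared trie.

    Returns one dict per non-root node, mapping a document index to the
    number of times that document's suffixes passed through the node.
    """
    children = [{}]               # children[i]: word -> child node id (node 0 is the root)
    counts = [defaultdict(int)]   # counts[i]: document index -> traversal count
    for d, text in enumerate(docs):
        words = text.lower().split()
        for start in range(len(words)):
            node = 0
            for word in words[start:]:
                if word not in children[node]:
                    children[node][word] = len(children)
                    children.append({})
                    counts.append(defaultdict(int))
                node = children[node][word]
                counts[node][d] += 1
    return counts[1:]  # assumption: the root carries no phrase, so it is dropped


def weighted_vectors(docs):
    """Return one M-element vector per document, M = number of non-root nodes."""
    node_counts = build_suffix_trie(docs)
    n_docs = len(docs)
    vectors = [[0.0] * len(node_counts) for _ in docs]
    for m, per_doc in enumerate(node_counts):
        # Assumed tf-idf variant: nodes shared by fewer documents weigh more.
        idf = math.log(1 + n_docs / len(per_doc))
        for d, tf in per_doc.items():
            vectors[d][m] = (1 + math.log(tf)) * idf
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Illustrative input only; these documents are not taken from the patent.
docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
vectors = weighted_vectors(docs)
print(len(vectors[0]))                  # M, the shared vector length
print(cosine(vectors[0], vectors[1]))   # similarity of the first two documents
```

Because every node corresponds to a phrase rather than a single word, two documents score as similar only when they share word sequences, which is the property the vector mapping is meant to preserve.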
49 Citations
20 Claims
1. A system, comprising:
at least one memory having stored therein computer executable instructions; and
a processor, coupled to the at least one memory, configured to execute or facilitate execution of the computer executable instructions to at least:
create a suffix tree document model that is a representation of a plurality of documents;
convert the suffix tree document model into a vector document model that is a representation of a document of the plurality of documents to form the suffix tree document model converted into the vector document model, wherein the vector document model is a vector with M elements and M is a total number of nodes in the suffix tree document model;
weight elements of the suffix tree document model converted into the vector document model; and
determine a similarity between two or more weighted vector document models, each representing a respective document of the plurality of documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A method, comprising:
creating, by a device comprising a processor, a suffix tree document model for a plurality of documents received from an online forum system;
after the creating the suffix tree document model, generating the suffix tree document model into vector document models representing respective ones of the plurality of documents, wherein the vector document models are vectors with M elements, and M is a total number of nodes in the suffix tree document model;
weighting elements of the vector document models generated from the suffix tree document model to generate weighted elements;
based on the weighted elements, determining a similarity between weighted vector document models, representing respective documents from the plurality of documents;
forming a plurality of clusters according to the similarity between the weighted vector document models by using group agglomerative hierarchical clustering; and
generating respective cluster topic summaries for the plurality of clusters.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A non-transitory computer readable medium comprising computer executable instructions that, in response to execution, cause a system comprising a processor to perform operations, comprising:
creating a suffix tree document model that represents a plurality of documents;
translating the suffix tree document model representing the plurality of documents into a vector document model that represents one of the plurality of documents to generate a translation of the vector document model from the suffix tree document model, wherein the vector document model is a vector with M elements, and M is a total number of nodes in the suffix tree document model;
weighting elements of the translation of the vector document model to generate weighted vector document models comprising weighted elements; and
determining a similarity between at least two weighted vector models of the weighted vector document models, each representing a respective document from the plurality of documents, based on the weighted elements.
Dependent claims: 18, 19, 20
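The clustering step recited in claim 9 can be sketched as follows. This is only an illustration under stated assumptions: "group agglomerative hierarchical clustering" is read here as average-linkage merging over a precomputed pairwise similarity matrix, and merging stops at a similarity threshold. The function name `group_average_clustering`, the `threshold` parameter, and the sample matrix are hypothetical and do not come from the claims.

```python
def group_average_clustering(sim, threshold):
    """Greedy average-linkage clustering over a symmetric similarity matrix.

    Starts with one cluster per document and repeatedly merges the pair of
    clusters with the highest average pairwise similarity, stopping when
    that best average falls below `threshold` (an assumed stopping rule).
    """
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(a, b):
        # Average similarity over all cross-cluster document pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Made-up similarities for four documents; values are illustrative only.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.15, 0.1],
    [0.1, 0.15, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
print(group_average_clustering(sim, threshold=0.5))  # [[0, 1], [2, 3]]
```

The sketch recomputes every average on each pass, which is quadratic per merge; that is acceptable for illustration but a practical implementation would cache or update the cluster-to-cluster similarities.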
Specification