Suffix tree similarity measure for document clustering
Abstract
The subject innovation provides for systems and methods to facilitate weighted suffix tree clustering. Conventional suffix tree cluster models can be augmented by incorporating quality measures to facilitate improved performance. Further, the quality measure can be employed in determining cluster labels that show improvements in accuracy over conventional means. Additionally, “stopnodes” can be defined to facilitate traversing suffix tree models efficiently. Quality measurements can be determined based in part on weighting factors applied to terms in a vector model, said terms being mapped from a suffix tree model.
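The mapping from a suffix tree model to weighted vectors can be illustrated with a minimal Python sketch. It assumes a word-level suffix trie (an uncompacted stand-in for a suffix tree, so its node count differs from that of a compacted tree) and a tf-idf style weight. The weighting formula and the names `build_suffix_trie` and `to_weighted_vectors` are hypothetical and are not taken from the patent text.

```python
import math


def build_suffix_trie(docs):
    """Insert every word-level suffix of every document into a trie.

    Returns a list of nodes; node 0 is the root. Each node records, per
    document, how many times its phrase (the root-to-node path) was traversed.
    """
    nodes = [{"children": {}, "counts": {}}]  # index 0 is the root
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for start in range(len(words)):
            cur = 0
            for word in words[start:]:
                nxt = nodes[cur]["children"].get(word)
                if nxt is None:
                    nodes.append({"children": {}, "counts": {}})
                    nxt = len(nodes) - 1
                    nodes[cur]["children"][word] = nxt
                counts = nodes[nxt]["counts"]
                counts[doc_id] = counts.get(doc_id, 0) + 1
                cur = nxt
    return nodes


def to_weighted_vectors(nodes, num_docs):
    """Return one vector per document with M = len(nodes) - 1 elements."""
    m = len(nodes) - 1  # the root node is excluded
    vectors = [[0.0] * m for _ in range(num_docs)]
    for n, node in enumerate(nodes[1:]):
        df = len(node["counts"])  # documents that traverse this node
        idf = math.log(1 + num_docs / df)  # assumed weighting, not from the source
        for doc_id, tf in node["counts"].items():
            vectors[doc_id][n] = (1 + math.log(tf)) * idf
    return vectors


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
nodes = build_suffix_trie(docs)
vectors = to_weighted_vectors(nodes, len(docs))
print(len(vectors), "vectors of", len(nodes) - 1, "elements each")
```

Each vector position corresponds to one non-root node, so phrases shared by several documents occupy the same position in their vectors, which is what lets a standard vector similarity compare them.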
16 Citations
20 Claims
1. A system, comprising:
a memory having stored therein executable instructions; and
a processor, coupled to the memory, configured to execute or facilitate execution of the executable instructions to at least:
create a suffix tree document model that is a first representation of documents in at least one knowledge source on a computerized network;
convert the suffix tree document model to a vector document model that is a second representation of the documents, wherein the vector document model comprises respective weighted vectors for the documents, where each weighted vector of the respective weighted vectors consists of M elements and M is a total number of nodes in the suffix tree document model not including a root node of the suffix tree document model;
determine at least one similarity between two or more of the documents based upon the respective weighted vectors;
generate clusters of the documents based on the at least one similarity; and
in response to a search query of the at least one knowledge source, providing a search result based on the clusters of the documents.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method, comprising:
creating, by a device comprising a processor, a suffix tree document model for a set of documents in at least one knowledge source on a computerized network;
generating, by the device, a vector document model from the suffix tree document model, wherein the vector document model is a representation of the set of documents and comprises respective weighted vectors for documents of the set of documents, wherein each weighted vector of the respective weighted vectors consists of M elements, and M is a total number of nodes in the suffix tree document model not including a root node of the suffix tree document model;
determining, by the device, at least one similarity between two or more of the documents based upon the respective weighted vectors;
forming, by the device, clusters of the documents based on the at least one similarity; and
in response to a search query of the at least one knowledge source, providing, by the device, a search result based on the clusters of the documents.
Dependent claims: 9, 10, 11, 12, 13
14. A non-transitory, computer readable medium comprising executable instructions that, in response to execution, cause a system comprising a processor to perform operations, comprising:
creating a suffix tree document model that represents a plurality of documents in at least one knowledge source on a computerized network;
translating the suffix tree document model to a vector document model that represents the plurality of documents, wherein the vector document model comprises respective weighted vectors for documents of the plurality of documents, wherein each weighted vector of the respective weighted vectors consists of M elements, and M is a total number of nodes in the suffix tree document model not including a root node of the suffix tree document model;
determining at least one similarity between two or more of the documents based upon the respective weighted vectors;
forming a plurality of clusters of the documents based on the at least one similarity; and
in response to a search query of the at least one knowledge source, providing a search result based on the plurality of clusters of the documents.
Dependent claims: 15, 16, 17, 18, 19, 20
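The similarity and clustering steps recited in the independent claims can be sketched as follows. This is only an illustration under stated assumptions: the claims do not name a similarity function or a clustering algorithm, so cosine similarity and a threshold-based single-link grouping are stand-ins, and the names `cosine`, `cluster`, and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length weighted vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.5):
    """Group documents linked by a pairwise similarity >= threshold.

    Single-link grouping via union-find; an assumed stand-in for whatever
    clustering method the specification actually uses.
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy weighted vectors (made-up values): documents 0 and 1 share heavy nodes.
vectors = [[1.0, 0.2, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(cluster(vectors))  # prints [[0, 1], [2]]
```

A search front end could then return results grouped by these document index lists; how the clusters feed the search result is left open here because the claims do not specify it.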
Specification