Suffix tree similarity measure for document clustering

US 8,676,815 B2
Filed: 05/06/2009
Issued: 03/18/2014
Est. Priority Date: 05/07/2008
Status: Active Grant

Abstract

The subject innovation provides for systems and methods to facilitate weighted suffix tree clustering. Conventional suffix tree cluster models can be augmented by incorporating quality measures to facilitate improved performance. Further, the quality measure can be employed in determining cluster labels that show improvements in accuracy over conventional means. Additionally, “stopnodes” can be defined to facilitate traversing suffix tree models efficiently. Quality measurements can be determined based in part on weighting factors applied to terms in a vector model, said terms being mapped from a suffix tree model.

49 Citations

20 Claims

1. A system, comprising:
- at least one memory having stored therein computer executable instructions; and
  
  a processor, coupled to the at least one memory, configured to execute or facilitate execution of the computer executable instructions to at least:
  
  create a suffix tree document model that is a representation of a plurality of documents;
  
  convert the suffix tree document model into a vector document model that is a representation of a document of the plurality of documents to form the suffix tree document model converted into the vector document model, wherein the vector document model is a vector with M elements and M is a total number of nodes in the suffix tree document model;
  
  weight elements of the suffix tree document model converted into the vector document model; and
  
  determine a similarity between two or more weighted vector document models, each representing a respective document of the plurality of documents.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The system of claim 1, wherein the processor and memory are distributed across at least two network devices.
  - 3. The system of claim 1, wherein the elements of the suffix tree document model converted into the vector document model are weighted based on a term frequency of a corresponding suffix tree node, wherein the term frequency of the corresponding suffix tree node with respect to the document is a total number of times the document traverses the corresponding suffix tree node.
  - 4. The system of claim 1, wherein the elements of the suffix tree document model converted into the vector document model are weighted based on an inverse document frequency of a corresponding suffix tree node, and the inverse document frequency of the corresponding suffix tree node is a number of different documents that have traversed the corresponding suffix tree node.
  - 5. The system of claim 1, wherein the processor is further configured to execute or facilitate the execution of the computer executable instructions to determine the similarity using a cosine similarity function.
  - 6. The system of claim 1, wherein the processor is further configured to execute or facilitate the execution of the computer executable instructions to populate the suffix tree document model or clean data for conversion.
  - 7. The system of claim 1, wherein the processor is further configured to execute or facilitate the execution of the computer executable instructions to:
    - generate stopnodes, wherein a threshold of an inverse document frequency is determinative of whether a node is a stopnode of the stopnodes,
      
      generate cluster topics based at least in part on information related to unconverted data information, and
      
      determine a quality of the cluster topics based at least in part on the unconverted data information.
  - 8. The system of claim 7, wherein the processor is further configured to execute or facilitate the execution of the computer executable instructions to retain term information in a subset of branches of the suffix tree document model that relates to one or more stop words that are removed in phrases represented by the branches of the suffix tree document model.
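
The pipeline recited in claims 1 and 3 to 5 (shared suffix tree, one M-element vector per document, term-frequency and document-frequency weighting, cosine similarity) can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the patented implementation: the sketch uses an uncompacted word-level suffix trie instead of a compact suffix tree, so its node count M differs from that of a true suffix tree; the weight `(1 + log tf) * log(1 + N / df)` is one common tf-idf form chosen for illustration, since the formula of claim 10 is not reproduced in this text; and the names `build_suffix_trie`, `weighted_vectors`, and `cosine_similarity` are hypothetical.

```python
import math
from collections import defaultdict


def build_suffix_trie(docs):
    """Insert every word-level suffix of each document into one shared trie.

    Returns tf, where tf[node][doc_id] is how many times doc_id traversed node.
    Node 0 is the root; len(tf) is the total node count M.
    """
    children = [{}]            # children[node] maps a word to a child node id
    tf = [defaultdict(int)]    # per-node traversal counts, keyed by document id
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for start in range(len(words)):
            node = 0
            for word in words[start:]:
                if word not in children[node]:
                    children[node][word] = len(children)
                    children.append({})
                    tf.append(defaultdict(int))
                node = children[node][word]
                tf[node][doc_id] += 1
    return tf


def weighted_vectors(docs):
    """Map each document to an M-element vector with one weight per trie node."""
    tf = build_suffix_trie(docs)
    n_docs, m = len(docs), len(tf)
    vectors = [[0.0] * m for _ in docs]
    for node in range(1, m):               # the root slot (index 0) stays 0.0
        df = len(tf[node])                 # number of distinct documents at this node
        idf = math.log(1 + n_docs / df)    # assumed idf form, not taken from the patent
        for doc_id, count in tf[node].items():
            vectors[doc_id][node] = (1 + math.log(count)) * idf
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
vectors = weighted_vectors(docs)
print(len(vectors[0]))                               # M for this small trie
print(cosine_similarity(vectors[0], vectors[1]))     # similarity of documents 0 and 1
```

Because every node stands for a word sequence (a phrase) rather than a single word, two documents score as similar only to the extent that they share phrases, which is the point of building the vectors from the tree instead of from a bag of words.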

9. A method, comprising:
- creating, by a device comprising a processor, a suffix tree document model for a plurality of documents received from an online forum system;
  
  after the creating the suffix tree document model, generating the suffix tree document model into vector document models representing respective ones of the plurality of documents, wherein the vector document models are vectors with M elements, and M is a total number of nodes in the suffix tree document model;
  
  weighting elements of the vector document models generated from the suffix tree document model to generate weighted elements;
  
  based on the weighted elements, determining a similarity between weighted vector document models, representing respective documents from the plurality of documents;
  
  forming a plurality of clusters according to the similarity between the weighted vector documents models by using group agglomerative hierarchical clustering; and
  
  generating respective cluster topic summaries for the plurality of clusters.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. The method of claim 9, wherein the weighting the elements includes weighting according to the formula:
  - 11. The method of claim 9, wherein the determining the similarity includes determining the similarity based on a cosine similarity function.
  - 12. The method of claim 9, further comprising:
    - collecting a topic thread in an online forum of the online forum system, wherein the topic thread comprises a topic post and a plurality of reply posts;
      
      stripping non-word tokens of the topic post and the plurality of reply posts;
      
      parsing remaining texts of the topic post and the plurality of reply posts into words of parsed posts;
      
      identifying and removing stop words in the parsed posts from the parsing;
      
      applying Porter stemming to the parsed posts to generate stemmed posts having stemmed words;
      
      combining the stemmed words into objects including selecting a subject of the topic thread as a title of the plurality of documents, combining text of the stemmed posts into the plurality of documents in order of respective submitted times; and
      
      merging the objects with at least a predetermined number of words into a document that facilitates populating the suffix tree document model.
  - 13. The method of claim 12, further comprising:
    - creating stopnodes in the suffix tree document model to retain at least a part of information related to the stop words.
  - 14. The method of claim 9, further comprising at least one of cleaning the data before forming the suffix tree document model of the data to generate clean data, merging the clean data into an object for populating the suffix tree document model, and populating the suffix tree document model prior to the generating the suffix tree document model to a vector document model.
  - 15. The method of claim 9, further comprising at least one of:
    - determining a quality score for generating cluster topics based at least in part on information related to unmapped data information, selecting a first subset of documents based at least in part on the quality score, selecting and sorting a subset of nodes traversed by a first subset of documents based at least in part on the quality score by a predetermined metric, and labeling a cluster based at least in part on a calculation depending on a quality measure determination.
  - 16. The method of claim 9, wherein the weighting the elements includes determining the weighted elements as a function of a term frequency and a document frequency of a suffix tree node.
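
The clustering step of claim 9 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented method: it reads "group agglomerative hierarchical clustering" as group-average linkage (merge the two clusters whose members have the highest mean pairwise similarity), it stops at a caller-supplied cluster count, and the function names `group_average` and `gahc` are hypothetical. The input is any symmetric document-by-document similarity matrix, for example one filled with cosine similarities.

```python
from itertools import combinations


def group_average(a, b, sim):
    """Mean similarity over all cross-cluster document pairs (i in a, j in b)."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))


def gahc(sim, num_clusters):
    """Greedy bottom-up clustering over a symmetric similarity matrix.

    Starts with one cluster per document and repeatedly merges the pair of
    clusters with the highest group-average similarity.
    """
    num_clusters = max(1, num_clusters)        # at least one cluster must remain
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > num_clusters:
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average(clusters[p[0]], clusters[p[1]], sim),
        )
        clusters[x] = clusters[x] + clusters[y]    # x < y, so deleting y keeps x valid
        del clusters[y]
    return clusters


# Illustrative similarity matrix: documents 0 and 1 are alike, as are 2 and 3.
similarity = [
    [1.00, 0.90, 0.10, 0.20],
    [0.90, 1.00, 0.15, 0.10],
    [0.10, 0.15, 1.00, 0.80],
    [0.20, 0.10, 0.80, 1.00],
]
print(gahc(similarity, 2))
```

Recomputing every pairwise average on each pass keeps the sketch short at the cost of speed; a production version would typically cache inter-cluster averages or stop on a similarity threshold instead of a fixed count.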

17. A non-transitory computer readable medium comprising computer executable instructions that, in response to execution, cause a system comprising a processor to perform operations, comprising:
- creating a suffix tree document model that represents a plurality of documents;
  
  translating the suffix tree document model representing the plurality of documents into a vector document model that represents one of the plurality of documents to generate a translation of the vector document model from the suffix tree document model, wherein the vector document model is a vector with M elements, and M is a total number of nodes in the suffix tree document model;
  
  weighting elements of the translation of the vector document model to generate weighted vector document models comprising weighted elements; and
  
  determining a similarity between at least two weighted vector models of the weighted vector document models, each representing a respective document from the plurality of documents, based on the weighted elements.
- Dependent claims (18, 19, 20):
  - 18. The non-transitory computer readable medium of claim 17, wherein the operations further comprise determining vector factor weights corresponding to each of the elements of the vector document model as a function of a term frequency and a document frequency of a suffix tree node.
  - 19. The system of claim 18, wherein the processor is further configured to execute or facilitate the execution of the computer executable instructions to determine a longest common prefix in the suffix tree document model to compare sub-strings between vector terms.
  - 20. The system of claim 19, wherein the processor is further configured to execute or facilitate the execution of the computer executable instructions to determine weight factors for vector terms of the vector document model and incorporate the longest common prefix in a determination of at least one of the vector factor weights.

Current Assignee
City University of Hong Kong
Original Assignee
City University of Hong Kong
Inventors
Deng, Xiaotie; Chim, Hung
Primary Examiner(s)
Bhatia, Ajay
Assistant Examiner(s)
BURNS, RANDALL WHITMAN

Application Number

US12/436,722
Publication Number

US 20090307213A1
Time in Patent Office

1,777 Days
Field of Search

707/749, 707/776, 707/758, 707/E17.046, 707/E17.05, 707/E17.051, 707/E17.087, 707/E17.089, 707/999.007
US Class Current

707/749
CPC Class Codes

G06F 16/285   Clustering or classification

G06F 16/35   Clustering; Classification

G06F 16/93   Document management systems
