Suffix tree similarity measure for document clustering

US 10,565,233 B2
Filed: 03/17/2014
Issued: 02/18/2020
Est. Priority Date: 05/07/2008
Status: Active Grant

Abstract

The subject innovation provides systems and methods that facilitate weighted suffix tree clustering. Conventional suffix tree cluster models can be augmented with quality measures to improve performance. Further, the quality measures can be employed in determining cluster labels that are more accurate than those produced by conventional means. Additionally, “stop nodes” can be defined so that suffix tree models are traversed efficiently. Quality measurements can be determined based in part on weighting factors applied to terms in a vector model, the terms being mapped from a suffix tree model.
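
As a rough illustration of the stop node idea, the sketch below marks a suffix tree node as a stop node when its inverse document frequency falls below a threshold, which is how claims 5 and 19 describe the test in general terms. The function name `find_stop_nodes`, the formula `log(N / df)`, the threshold value, and the sample frequencies are all assumptions made for illustration and are not taken from the patent.

```python
import math


def find_stop_nodes(doc_freq, n_docs, idf_threshold):
    """Return the labels of nodes whose idf falls below the threshold.

    doc_freq maps a node label (the phrase on the path from the root)
    to the number of different documents that traverse that node.
    """
    stop = set()
    for label, df in doc_freq.items():
        idf = math.log(n_docs / df)  # assumed idf form, not from the patent
        if idf < idf_threshold:
            stop.add(label)
    return stop


# Hypothetical document frequencies in a 100-document corpus.
doc_freq = {"forum post": 95, "suffix tree": 12, "reply quote": 80, "cosine": 7}
stop_nodes = find_stop_nodes(doc_freq, n_docs=100, idf_threshold=0.5)
print(sorted(stop_nodes))
```

Nodes flagged this way could then be skipped when similarities are computed, in the same spirit as ignoring stop words in a bag-of-words model.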

16 Citations

20 Claims

1. A system, comprising:
- a memory having stored therein executable instructions; and
  
  a processor, coupled to the memory, configured to execute or facilitate execution of the executable instructions to at least:
  
  create a suffix tree document model that is a first representation of documents in at least one knowledge source on a computerized network;
  
  convert the suffix tree document model to a vector document model that is a second representation of the documents, wherein the vector document model comprises respective weighted vectors for the documents, where each weighted vector of the respective weighted vectors consists of M elements and M is a total number of nodes in the suffix tree document model not including a root node of the suffix tree document model;
  
  determine at least one similarity between two or more of the documents based upon the respective weighted vectors;
  
  generate clusters of the documents based on the at least one similarity; and
  
  in response to a search query of the at least one knowledge source, providing a search result based on the clusters of the documents.
- - 2. The system of claim 1, wherein the processor is further configured to execute or facilitate the execution of the executable instructions to:
    - weight elements of respective vectors resulting in the respective weighted vectors.
  - 3. The system of claim 2, wherein a weight of an element is based on a frequency, in a document associated with the weighted vector, of at least one word represented in a suffix tree node of the suffix tree document model corresponding to the element.
  - 4. The system of claim 3, wherein the frequency of the at least one word represented in the suffix tree node comprises a total number of times the document traverses the suffix tree node, and is a function of document frequency of the suffix tree node that comprises a number of different documents that traverse the suffix tree node.
  - 5. The system of claim 1, wherein the processor is further configured to execute or facilitate the execution of the executable instructions to generate a stop node based on a determination of whether a threshold condition of an inverse of a document frequency is satisfied by a node of the suffix tree document model, and wherein the stop node represents at least one word that is not a stop word and retains information related to at least one stop word removed from the plurality of documents.
  - 6. The system of claim 1, wherein the processor is further configured to execute or facilitate the execution of the executable instructions to remove one or more stop words from the documents before creating the suffix tree document model to generate a set of clean text, merging the set of clean text to an object for populating the suffix tree document model, and populating the suffix tree document model with the object.
  - 7. The system of claim 1, wherein generation of the clusters of the documents comprises the execution of the executable instructions to:
    - generate cluster topics based at least in part on additional information not converted from the suffix tree document model to the vector document model; and
      
      determine a quality of the cluster topics based at least in part on the additional information.
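
The pipeline in claims 1 to 4 can be sketched as follows: build a word-level suffix tree over the documents, turn each document into a vector with one element per non-root node, weight the elements by term frequency and document frequency, and compare documents by their vectors. This is a minimal sketch under stated assumptions. It uses an uncompressed suffix trie rather than a compact suffix tree (so M is larger than it would be in a compact tree), the weight `(1 + log tf) * log(1 + N / df)` is one common tf-idf form chosen for illustration, and cosine similarity is an assumed choice. All class and function names are hypothetical.

```python
import math
from collections import defaultdict


class Node:
    """One tree node; `docs` maps doc id -> number of traversals (tf)."""

    def __init__(self):
        self.children = {}
        self.docs = defaultdict(int)


def build_suffix_tree(documents):
    """Insert every word-level suffix of every document into a shared trie."""
    root = Node()
    for doc_id, words in enumerate(documents):
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node.children.setdefault(word, Node())
                node.docs[doc_id] += 1  # the document traverses this node again
    return root


def collect_nodes(root):
    """Return all nodes except the root, in a fixed depth-first order."""
    nodes, stack = [], list(root.children.values())
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(node.children.values())
    return nodes


def to_weighted_vectors(root, n_docs):
    """Map each document to an M-element vector, M = number of non-root nodes."""
    nodes = collect_nodes(root)
    vectors = [[0.0] * len(nodes) for _ in range(n_docs)]
    for i, node in enumerate(nodes):
        df = len(node.docs)  # number of different documents traversing the node
        idf = math.log(1 + n_docs / df)  # assumed weighting form
        for doc_id, tf in node.docs.items():
            vectors[doc_id][i] = (1 + math.log(tf)) * idf
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = [
    "cat ate cheese".split(),
    "mouse ate cheese too".split(),
    "cat ate mouse too".split(),
]
tree = build_suffix_tree(docs)
vectors = to_weighted_vectors(tree, len(docs))
print(len(vectors[0]), round(cosine(vectors[0], vectors[1]), 3))
```

Because every document shares the same node ordering, all vectors have the same length M, and phrases shared between documents (such as "ate cheese" here) contribute to the similarity through the nodes both documents traverse.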

8. A method, comprising:
- creating, by a device comprising a processor, a suffix tree document model for a set of documents in at least one knowledge source on a computerized network;
  
  generating, by the device, a vector document model from the suffix tree document model, wherein the vector document model is a representation of the set of documents and comprises respective weighted vectors for documents of the set of documents, wherein each weighted vector of the respective weighted vectors consists of M elements, and M is a total number of nodes in the suffix tree document model not including a root node of the suffix tree document model;
  
  determining, by the device, at least one similarity between two or more of the documents based upon the respective weighted vectors;
  
  forming, by the device, clusters of the documents based on the at least one similarity; and
  
  in response to a search query of the at least one knowledge source, providing, by the device, a search result based on the clusters of the documents.
- - 9. The method of claim 8, further comprising:
    - weighting, by the device, elements of respective vectors to generate the respective weighted vectors, wherein a weight of an element of a weighted vector of the respective weighted vectors is based on a frequency, in a document associated with the weighted vector, of a term represented in a suffix tree node of the suffix tree document model corresponding to the element, wherein the suffix tree node comprises an inverse relation to a document frequency.
  - 10. The method of claim 8, wherein the forming the clusters comprises:
    - forming, by the device, the clusters by using group agglomerative hierarchical clustering; and
      
      generating, by the device, respective cluster topic summaries for the plurality of clusters.
  - 11. The method of claim 8, further comprising:
    - collecting a topic thread that comprises a topic post and a set of reply posts;
      
      stripping non-word tokens of the topic post and the set of reply posts;
      
      parsing remaining texts of the topic post and the set of reply posts to words of parsed posts;
      
      identifying and removing stop words in the parsed posts;
      
      applying Porter stemming to the parsed posts to generate stemmed posts having stemmed words;
      
      combining the stemmed words to objects including selecting a subject of the topic thread as a title of the set of documents; and
      
      merging objects comprising at least a predetermined number of words into a document of the set of documents, wherein the objects are ordered in the document according to respective submitted times of the topic post or the reply posts from which the object was derived.
  - 12. The method of claim 8, further comprising generating, by the device, a set of clean data by removing stop words from the set of documents before creating the suffix tree document model, merging, by the device, the set of clean data to an object for populating the suffix tree document model, and populating, by the device, the suffix tree document model based on the object prior to the generating the vector document model.
  - 13. The method of claim 8, further comprising:
    - generating, by the device, a quality score and cluster topics based at least in part on additional information extracted during the generating the vector document model, selecting, by the device, a subset of documents of the set of documents based at least in part on the quality score, selecting and sorting, by the device, a subset of nodes traversed by the subset of documents based at least in part on the quality score by a predetermined metric, and labeling, by the device, a cluster based at least in part on a calculation depending on a quality measure.
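
The thread preprocessing steps recited in claims 11 and 12 can be sketched roughly as below. This is an illustration only: the stop word list is a tiny made-up sample, `toy_stem` is a crude placeholder standing in for Porter stemming (it is not the Porter algorithm), and the function names, the `min_words` parameter, and the returned dictionary layout are all hypothetical.

```python
import re

# Tiny illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "it", "i", "this"}


def toy_stem(word):
    """Placeholder for Porter stemming: strips a few common suffixes only."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_post(text):
    """Strip non-word tokens, drop stop words, and stem the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())  # keeps alphabetic runs only
    return [toy_stem(w) for w in words if w not in STOP_WORDS]


def thread_to_document(subject, posts, min_words=2):
    """Merge cleaned posts into one document, ordered by submitted time.

    posts is a list of (submitted_time, text) pairs; cleaned posts with
    fewer than min_words words are left out of the document.
    """
    ordered = sorted(posts, key=lambda post: post[0])
    objects = [clean_post(text) for _, text in ordered]
    return {"title": subject, "objects": [o for o in objects if len(o) >= min_words]}


document = thread_to_document(
    "Suffix trees",
    [
        (2, "Thanks!!"),
        (1, "Clustering the forum posts is working :)"),
        (3, "I tested it and the clusters looked good"),
    ],
)
print(document["objects"])
```

The word lists in `document["objects"]` are what would be fed, in order, into the suffix tree document model.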

14. A non-transitory, computer readable medium comprising executable instructions that, in response to execution, cause a system comprising a processor to perform operations, comprising:
- creating a suffix tree document model that represents a plurality of documents in at least one knowledge source on a computerized network;
  
  translating the suffix tree document model to a vector document model that represents the plurality of documents, wherein the vector document model comprises respective weighted vectors for documents of the plurality of documents, wherein each weighted vector of the respective weighted vectors consists of M elements, and M is a total number of nodes in the suffix tree document model not including a root node of the suffix tree document model;
  
  determining at least one similarity between two or more of the documents based upon the respective weighted vectors;
  
  forming a plurality of clusters of the documents based on the at least one similarity; and
  
  in response to a search query of the at least one knowledge source, providing a search result based on the plurality of clusters of the documents.
- - 15. The non-transitory computer readable medium of claim 14, wherein the operations further comprise:
    - weighting elements of respective vectors resulting in the respective weighted vectors, wherein a weight of an element of a weighted vector of the respective vectors is based on a term frequency of a suffix tree node of the suffix tree document model corresponding to the element and a document frequency of the suffix tree node that comprises a number of different documents that traverse the suffix tree node.
  - 16. The non-transitory computer readable medium of claim 15, wherein the weight of the element is determined as a function of the term frequency that is inversely related to the document frequency.
  - 17. The non-transitory computer readable medium of claim 16, wherein the processor is further configured to execute or facilitate the execution of the executable instructions to determine respective weight factors for elements of the respective vectors and incorporate a longest common prefix in a determination of at least one of the weight factors.
  - 18. The non-transitory computer readable medium of claim 14, wherein the forming the plurality of clusters comprises:
    - forming the plurality of clusters by utilizing a group agglomerative hierarchical clustering process; and
      
      generating respective cluster topic summaries for the plurality of clusters.
  - 19. The non-transitory computer readable medium of claim 14, further comprising:
    - creating stop nodes in the suffix tree document model by determining whether a threshold condition of a document frequency is satisfied by nodes of the suffix tree document model, wherein a stop node represents at least one word that is not a stop word and retains information related to at least one stop word removed from the plurality of documents.
  - 20. The non-transitory computer readable medium of claim 14, further comprising:
    - generating a set of clean data from the plurality of documents by removing one or more stop words from the plurality of documents, merging the set of clean data to an object with text data for populating the suffix tree document model, and populating the suffix tree document model based on the object.
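
Claims 10 and 18 name group agglomerative hierarchical clustering without spelling out the procedure. The sketch below shows one plausible reading, assuming average linkage over cross-cluster document pairs and a similarity threshold as the stopping rule. The function names, the `min_similarity` parameter, and the sample similarity matrix are assumptions for illustration, not details from the patent.

```python
def group_average(sim, a, b):
    """Mean pairwise similarity between the documents of clusters a and b."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))


def gahc(sim, min_similarity):
    """Greedily merge the most similar pair of clusters until none is close enough.

    sim is a symmetric matrix of non-negative document similarities.
    Returns a list of clusters, each a list of document indices.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = group_average(sim, clusters[x], clusters[y])
                if score > best:
                    best, pair = score, (x, y)
        if best < min_similarity:
            break  # the closest remaining pair is too dissimilar to merge
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical pairwise similarities for four documents.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
clusters = gahc(sim, min_similarity=0.5)
print(clusters)
```

This version rescans every cluster pair on each merge, which is simple to follow but slow on large collections; a practical implementation would cache the pairwise scores.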

Current Assignee: City University of Hong Kong
Original Assignee: City University of Hong Kong
Inventors: Deng, Xiaotie; Chim, Hung
Primary Examiner(s): Giuliani, Giuseppi

Application Number: US14/216,714
Publication Number: US 20140304267A1
Time in Patent Office: 2,164 Days
Field of Search: None
US Class Current:
CPC Class Codes:
G06F 16/285   Clustering or classification
G06F 16/35   Clustering; Classification
G06F 16/93   Document management systems
