
System and method for creating labels for clusters

  • US 10,210,251 B2
  • Filed: 02/25/2014
  • Issued: 02/19/2019
  • Est. Priority Date: 07/01/2013
  • Status: Active Grant
First Claim

1. A system for automatically creating at least one label for at least one cluster of text documents in a computing environment, the system comprising:

  • a processor; and

    a memory coupled to the processor, wherein the processor is capable of executing a plurality of modules stored in the memory, and wherein the plurality of modules comprise:

    a receiving module configured to receive an input data comprising a set of text documents;

    a candidate items selector configured to:

    select a plurality of candidate items occurring repetitively in the input data, wherein the selection of the plurality of candidate items is not based on verifying meaning of the candidate items or co-occurrence of the candidate items, wherein the plurality of candidate items comprises one or more words, a combination of words, and one or more phrases, wherein the candidate items are selected using n-gram selection technique, and a value of ‘n’ ranges from 1 to 5; and

    generate a sorted list of the plurality of candidate items based on a frequency of occurrence of the plurality of candidate items in the input data;

    a combination array generator configured to select a ‘i’ number of top scorer candidate items for the each n-gram from the sorted list of the plurality of candidate items to populate a two-dimensional array of ‘i*n × i*n’ size by creating pairs between each n-gram candidate items for ‘n’ varying from 1 to n, and wherein the two dimensional array is a matrix of the ‘i*n × i*n’ size, wherein the pairs of candidate items are generated by making one to one pair combinations of each of the ‘i’ number of candidate items of each n-gram in a {(unigram, unigram), (unigram, bigram), (unigram, trigram) up to (n-gram, n-gram)} pattern;

    a coverage value analyzer configured to determine a coverage value for each pair of the candidate items associated with each cell of the two-dimensional array to further populate a sorted two-dimensional array based on the coverage value, wherein each cell of the two-dimensional array contains the coverage value of each pair of the candidate items in the input data, wherein the coverage value for each pair of the candidate items of the matrix is computed by adding the coverage value for each candidate item of the pair of the candidate items and subtracting the coverage value for the pair of candidate items having in common from the addition;

    a candidate pair selector configured to select a predefined number of pairs of the candidate items from the sorted two-dimensional array to further process and generate a list of the pairs of the candidate items, wherein the two-dimensional array is sorted in accordance with a descending order of the coverage value for each pair of the candidate items and the candidate pair selector selects pairs of the candidate items occurring foremost, and wherein the candidate pair selector is further configured to select at least top two pairs of the candidate items from the sorted two-dimensional array;

    a unique word filter configured to accept the list of the pairs of the candidate items to determine a number of unique words in each of the pairs of the candidate items, wherein a unique word of the determined number of unique words further comprises a word appearing in no more than two documents in a single cluster of the at least one cluster; and

    a cluster label selector configured to sort the list of the pairs of the candidate items using the coverage value and the number of unique words to create a sorted list of the pairs of the candidate items for selecting a cluster label from the sorted list of the pairs of the candidate items.
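
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions the claim does not state: the coverage value of an item is taken to be the number of documents containing it, tokenization is lowercase whitespace splitting, self-pairs and mirrored pairs are skipped when reading the matrix, and ties on coverage are broken in favor of pairs with fewer unique words. All function and parameter names (`top_candidates`, `pair_coverage`, `count_unique_words`, `label_cluster`, `max_n`, `top_pairs`) are hypothetical. The pair coverage follows the claim's wording: coverage(a) + coverage(b) minus the coverage the two items have in common.

```python
from collections import Counter


def tokenize(doc):
    # Assumption: lowercase whitespace tokenization; the claim names no tokenizer.
    return doc.lower().split()


def ngrams(tokens, n):
    # All contiguous n-token sequences, joined back into strings.
    return [" ".join(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]


def top_candidates(docs, max_n=2, i=2):
    """Pick the i most frequent repeated n-grams for each n in 1..max_n."""
    selected = []
    for n in range(1, max_n + 1):
        counts = Counter(g for d in docs for g in ngrams(tokenize(d), n))
        # Keep only items that occur repetitively (more than once).
        repeated = [(g, c) for g, c in counts.items() if c > 1]
        # Sort by descending frequency; alphabetical order breaks ties.
        repeated.sort(key=lambda gc: (-gc[1], gc[0]))
        selected.extend(g for g, _ in repeated[:i])
    return selected


def docs_containing(item, docs):
    # Indices of the documents in which the n-gram occurs.
    n = len(item.split())
    return {idx for idx, d in enumerate(docs) if item in ngrams(tokenize(d), n)}


def pair_coverage(a, b, docs):
    # coverage(a) + coverage(b) - coverage common to both items.
    da, db = docs_containing(a, docs), docs_containing(b, docs)
    return len(da) + len(db) - len(da & db)


def count_unique_words(pair, docs, max_docs=2):
    # A "unique word" appears in no more than max_docs documents of the cluster.
    words = set(pair[0].split()) | set(pair[1].split())
    return sum(1 for w in words if len(docs_containing(w, docs)) <= max_docs)


def label_cluster(docs, max_n=2, i=2, top_pairs=2):
    items = top_candidates(docs, max_n, i)
    # Square matrix (up to i*max_n per side) holding each pair's coverage.
    matrix = [[pair_coverage(a, b, docs) for b in items] for a in items]
    # Assumption: read only the upper triangle, skipping self and mirrored pairs.
    pairs = [(matrix[r][c], items[r], items[c])
             for r in range(len(items)) for c in range(r + 1, len(items))]
    pairs.sort(key=lambda p: -p[0])  # descending coverage, stable on ties
    shortlist = pairs[:top_pairs]    # the foremost pairs (at least two)
    # Assumption: fewer unique words wins when coverage is tied.
    ranked = sorted(
        shortlist,
        key=lambda p: (-p[0], count_unique_words((p[1], p[2]), docs)),
    )
    return (ranked[0][1], ranked[0][2]) if ranked else None


SAMPLE_DOCS = [
    "battery drains fast",
    "battery drains overnight",
    "screen cracked easily",
    "screen cracked again",
    "battery drains and screen cracked",
]
```

Calling `label_cluster(SAMPLE_DOCS)` on this five-document sample returns a pair of candidate items that together occur in all five documents. The defaults `max_n=2` and `i=2` only keep the example small; the claim allows n to range from 1 to 5.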
