System and method for creating labels for clusters
First Claim
1. A system for automatically creating at least one label for at least one cluster of text documents in a computing environment, the system comprising:
a processor; and
a memory coupled to the processor, wherein the processor is capable of executing a plurality of modules stored in the memory, and wherein the plurality of modules comprise:
a receiving module configured to receive an input data comprising a set of text documents;
a candidate items selector configured to:
select a plurality of candidate items occurring repetitively in the input data, wherein the selection of the plurality of candidate items is not based on verifying meaning of the candidate items or co-occurrence of the candidate items, wherein the plurality of candidate items comprises one or more words, a combination of words, and one or more phrases, wherein the candidate items are selected using n-gram selection technique, and a value of ‘n’ ranges from 1 to 5; and
generate a sorted list of the plurality of candidate items based on a frequency of occurrence of the plurality of candidate items in the input data;
a combination array generator configured to select a ‘i’ number of top scorer candidate items for the each n-gram from the sorted list of the plurality of candidate items to populate a two-dimensional array of ‘i*n × i*n’ size by creating pairs between each n-gram candidate items for ‘n’ varying from 1 to n, and wherein the two dimensional array is a matrix of the ‘i*n × i*n’ size, wherein the pairs of candidate items are generated by making one to one pair combinations of each of the ‘i’ number of candidate items of each n-gram in a {(unigram, unigram), (unigram, bigram), (unigram, trigram) up to (n-gram, n-gram)} pattern;
a coverage value analyzer configured to determine a coverage value for each pair of the candidate items associated with each cell of the two-dimensional array to further populate a sorted two-dimensional array based on the coverage value, wherein each cell of the two-dimensional array contains the coverage value of each pair of the candidate items in the input data, wherein the coverage value for each pair of the candidate items of the matrix is computed by adding the coverage value for each candidate item of the pair of the candidate items and subtracting the coverage value for the pair of candidate items having in common from the addition;
a candidate pair selector configured to select a predefined number of pairs of the candidate items from the sorted two-dimensional array to further process and generate a list of the pairs of the candidate items, wherein the two-dimensional array is sorted in accordance with a descending order of the coverage value for each pair of the candidate items and the candidate pair selector selects pairs of the candidate items occurring foremost, and wherein the candidate pair selector is further configured to select at least top two pairs of the candidate items from the sorted two-dimensional array;
a unique word filter configured to accept the list of the pairs of the candidate items to determine a number of unique words in each of the pairs of the candidate items, wherein a unique word of the determined number of unique words further comprises a word appearing in no more than two documents in a single cluster of the at least one cluster; and
a cluster label selector configured to sort the list of the pairs of the candidate items using the coverage value and the number of unique words to create a sorted list of the pairs of the candidate items for selecting a cluster label from the sorted list of the pairs of the candidate items.
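The candidate selection and array population steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation: whitespace tokenisation, the "occurs more than once" reading of "occurring repetitively", the alphabetical tie-break, and the names `ngrams`, `top_candidates` and `combination_array` are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-gram of a token list as a tuple of words."""
    return [tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]


def top_candidates(documents, max_n=5, i=2):
    """For each n in 1..max_n, keep the i most frequent repeated n-grams."""
    top = []
    for n in range(1, max_n + 1):
        counts = Counter()
        for doc in documents:
            # Assumption: lower-cased whitespace tokens stand in for real tokenisation.
            counts.update(ngrams(doc.lower().split(), n))
        # Assumption: "occurring repetitively" is read as "occurs more than once".
        repeated = [(gram, c) for gram, c in counts.items() if c > 1]
        # Sort by descending frequency; ties are broken alphabetically.
        repeated.sort(key=lambda gc: (-gc[1], gc[0]))
        top.extend(gram for gram, _ in repeated[:i])
    return top


def combination_array(candidates):
    """Square array whose cell (r, c) holds the pair (candidates[r], candidates[c])."""
    return [[(a, b) for b in candidates] for a in candidates]


docs = [
    "network error on login",
    "network error on startup",
    "login page network error",
]
candidates = top_candidates(docs, max_n=2, i=2)
grid = combination_array(candidates)
```

With `max_n=2` and `i=2` the sketch yields four candidates and therefore a 4 × 4 array, matching the i*n × i*n size recited in the claim. If fewer than `i` repeated n-grams exist for some `n`, the array in this sketch is correspondingly smaller.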
Abstract
Disclosed are a method and a system for creating labels for clusters in a computing environment. The system comprises a receiving module, a candidate items selector, a combination array generator, a coverage value analyzer, a candidate pair selector, a unique word filter and a cluster label selector. The receiving module receives input data, and the candidate items selector selects candidate items that occur repetitively, using an n-gram technique, to generate a list of candidate items with their frequency of occurrence. The combination array generator selects candidate items to populate a two-dimensional array in which each element represents a pair of n-grams. The coverage value analyzer determines a coverage value for each pair of n-grams in the array. The candidate pair selector selects pairs of n-grams from the two-dimensional array for further processing and generates a list of candidate pairs. The unique word filter determines the number of unique words in each candidate pair. The cluster label selector sorts the list of candidate pairs using the coverage value and the number of unique words to select a cluster label.
13 Claims
6. A method for automatically creating at least one label for at least one cluster of text documents in a computing environment, the method comprising:
receiving an input data comprising a set of text documents;
selecting a plurality of candidate items occurring repetitively in the input data, wherein the selection of the plurality of candidate items is not based on verifying meaning of the candidate items or co-occurrence of the candidate items, wherein the plurality of candidate items comprises one or more words, a combination of words, and one or more phrases, wherein the candidate items are selected using n-gram selection technique, and a value of ‘n’ ranges from 1 to 5;
generating a sorted list of the plurality of candidate items based on a frequency of occurrence of the plurality of candidate items in the input data;
selecting a ‘i’ number of top scorer candidate items for the each n-gram from the sorted list of the plurality of candidate items to populate a two-dimensional array of ‘i*n × i*n’ size by creating pairs between each n-gram candidate items for ‘n’ varying from 1 to n, and wherein the two dimensional array is a matrix of the ‘i*n × i*n’ size, wherein the pairs of the candidate items are generated by making one to one pair combinations of each of the ‘i’ number of candidate items of each n-gram in a {(unigram, unigram), (unigram, bigram), (unigram, trigram) up to (n-gram, n-gram)} pattern;
determining a coverage value for each pair of the candidate items associated with each cell of the two-dimensional array to further populate a sorted two-dimensional array based on the coverage value, wherein each cell of the two-dimensional array contains the coverage value of each pair of the candidate items in the input data, wherein the coverage value for each pair of the candidate items of the matrix is computed by adding the coverage value for each candidate item of the pair of the candidate items and subtracting the coverage value for the pair of candidate items having in common from the addition;
selecting a predefined number of pairs of the candidate items from the sorted two-dimensional array to further process and generate a list of the pairs of the candidate items, wherein the two-dimensional array is sorted in accordance with a descending order of the coverage value for each pair of the candidate items and the candidate pair selector selects pairs of the candidate items occurring foremost, wherein at least top two pairs of the candidate items are selected from the sorted two-dimensional array;
accepting the list of the pairs of the candidate items to determine a number of unique words in each of the pairs of the candidate items; and
sorting the list of the pairs of the candidate items using the coverage value and the number of unique words to create a sorted list of the pairs of the candidate items for selecting a cluster label from the sorted list of the pairs of the candidate items, wherein a unique word of the determined number of unique words further comprises a word appearing in no more than two documents in a single cluster of the at least one cluster;
wherein the receiving, the selecting the plurality of candidates, the selecting the predefined number of the plurality of candidate items, the determining the coverage value, the selecting the predefined number of pairs, the accepting the list, and the sorting the list are performed by a processor of a computerized device.
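The coverage computation in the method claim adds the coverage of each item of a pair and subtracts the coverage the two items have in common, which has the shape of an inclusion-exclusion count. The following Python sketch illustrates that arithmetic under stated assumptions: the claims do not define how the coverage of a single item is measured, so the sketch takes it to be the number of documents containing the item, and it scores each unordered pair of distinct candidates once instead of filling every cell of the array. The names `contains`, `covered_docs`, `pair_coverage` and `top_pairs` are hypothetical.

```python
def contains(tokens, gram):
    """True if the n-gram (a tuple of words) occurs contiguously in tokens."""
    n = len(gram)
    return any(tuple(tokens[k:k + n]) == gram for k in range(len(tokens) - n + 1))


def covered_docs(docs, gram):
    """Indices of the documents that contain the n-gram."""
    # Assumption: lower-cased whitespace tokens stand in for real tokenisation.
    return {idx for idx, doc in enumerate(docs) if contains(doc.lower().split(), gram)}


def pair_coverage(docs, a, b):
    """Coverage of a plus coverage of b, minus the coverage they have in common."""
    docs_a, docs_b = covered_docs(docs, a), covered_docs(docs, b)
    return len(docs_a) + len(docs_b) - len(docs_a & docs_b)


def top_pairs(docs, candidates, k=2):
    """Score each unordered pair once, sort by descending coverage, keep the first k."""
    scored = [
        (pair_coverage(docs, a, b), a, b)
        for r, a in enumerate(candidates)
        for b in candidates[r + 1:]
    ]
    # Ties on coverage are broken alphabetically so the order is deterministic.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored[:k]


docs = [
    "network error on login",
    "network error on startup",
    "login page timeout",
    "password reset page",
]
candidates = [("login",), ("page",), ("network", "error")]
best = top_pairs(docs, candidates, k=2)
```

Here `("page",)` and `("network", "error")` each appear in two documents and share none, so the pair covers 2 + 2 - 0 = 4 documents and sorts first. The default `k=2` mirrors the recited minimum of the top two pairs.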
11. A non-transitory computer readable medium having embodied thereon a computer program for automatically creating at least one label for at least one cluster of text documents, the non-transitory computer readable medium comprising:
a program code for receiving an input data comprising a set of text documents;
a program code for selecting a plurality of candidate items occurring repetitively in the input data, wherein the selection of the plurality of candidate items is not based on verifying meaning of the candidate items or co-occurrence of the candidate items, wherein the plurality of candidate items comprises one or more words, a combination of words, and one or more phrases, wherein the candidate items are selected using n-gram selection technique, and a value of ‘n’ ranges from 1 to 5;
a program code for generating a sorted list of the plurality of candidate items based on a frequency of occurrence of the plurality of candidate items in the input data;
a program code for selecting a ‘i’ number of top scorer candidate items for the each n-gram from the sorted list of the plurality of candidate items to populate a two-dimensional array of ‘i*n × i*n’ size by creating pairs between each n-gram candidate items for ‘n’ varying from 1 to n, and wherein the two dimensional array is a matrix of the ‘i*n × i*n’ size, wherein the pairs of the candidate items are generated by making one to one pair combinations of each of the ‘i’ number of candidate items of each n-gram in a {(unigram, unigram), (unigram, bigram), (unigram, trigram) up to (n-gram, n-gram)} pattern;
a program code for determining a coverage value for each pair of the candidate items associated with each cell of the two-dimensional array to further sort the two-dimensional array based on the coverage value for each pair of the candidate items to populate a sorted two-dimensional array, wherein each cell of the two-dimensional array contains the coverage value of each pair of the candidate items in the input data, wherein the coverage value for each pair of the candidate items of the matrix is computed by adding the coverage value for each candidate item of the pair of the candidate items and subtracting the coverage value for the pair of candidate items having in common from the addition;
a program code for selecting a predefined number of pairs of the candidate items from the sorted two-dimensional array occurring foremost to further process and generate a list of the pairs of the candidate items, wherein the two-dimensional array is sorted in accordance with a descending order of the coverage value for each pair of the candidate items and the candidate pair selector selects pairs of the candidate items occurring foremost, wherein at least top two pairs of the candidate items are selected from the sorted two-dimensional array;
a program code for accepting the list of the pairs of the candidate items to determine a number of unique words in each of the pairs of the candidate items, wherein a unique word of the determined number of unique words further comprises a word appearing in no more than two documents in a single cluster of the at least one cluster; and
a program code for sorting the list of the pairs of the candidate items using the coverage value and the number of unique words to create a sorted list of the pairs of the candidate items for selecting a cluster label from the sorted list of the pairs of the candidate items.
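The unique word filter and the final sort can be sketched in Python as follows. The claims define a unique word as one appearing in no more than two documents of a cluster, but they do not say in which direction the unique-word count influences the ranking. As an assumption, the sketch ranks by descending coverage and then prefers pairs with fewer unique words. The input format (a list of `(coverage, pair)` tuples), the `" / "` label separator, and the names `document_frequency`, `count_unique_words` and `select_label` are likewise hypothetical.

```python
from collections import Counter


def document_frequency(cluster_docs):
    """Number of documents in the cluster in which each word appears."""
    df = Counter()
    for doc in cluster_docs:
        # A set is used so that a word counts once per document.
        df.update(set(doc.lower().split()))
    return df


def count_unique_words(pair, df, max_docs=2):
    """Count the words of a pair that appear in no more than max_docs documents."""
    words = {word for item in pair for word in item}
    return sum(1 for word in words if df[word] <= max_docs)


def select_label(scored_pairs, cluster_docs):
    """Rank (coverage, pair) tuples and return the top pair as a label string."""
    df = document_frequency(cluster_docs)
    # Assumption: higher coverage first, then fewer unique words, then alphabetical.
    ranked = sorted(
        scored_pairs,
        key=lambda p: (-p[0], count_unique_words(p[1], df), p[1]),
    )
    best_pair = ranked[0][1]
    return " / ".join(" ".join(item) for item in best_pair)


cluster_docs = [
    "network error on login",
    "network error on startup",
    "network error after login",
    "login page timeout",
]
scored_pairs = [
    (4, (("login",), ("page",))),
    (4, (("login",), ("network", "error"))),
]
label = select_label(scored_pairs, cluster_docs)
```

Both pairs have the same coverage, so the unique-word count decides: "page" appears in only one document and counts as unique, while "login", "network" and "error" each appear in three. Under the assumed ordering the second pair is chosen as the label.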
Specification