System and method for creating labels for clusters

US 10,210,251 B2
Filed: 02/25/2014
Issued: 02/19/2019
Est. Priority Date: 07/01/2013
Status: Active Grant

Abstract

Disclosed are a method and a system for creating labels for clusters in a computing environment. The system comprises a receiving module, a candidate items selector, a combination array generator, a coverage value analyzer, a candidate pair selector, a unique word filter, and a cluster label selector. The receiving module receives input data, and the candidate items selector selects candidate items that occur repetitively, using an n-gram technique, to generate a list of candidate items with their frequency of occurrence. The combination array generator selects candidate items to populate a two-dimensional array in which each element represents a pair of n-grams. The coverage value analyzer determines a coverage value for each pair of n-grams in the array. The candidate pair selector selects pairs of n-grams from the two-dimensional array to generate a list of candidate pairs. The unique word filter determines the number of unique words in each candidate pair. The cluster label selector sorts the list of candidate pairs using the coverage value and the number of unique words to select a cluster label.
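The candidate selection step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names `ngrams` and `top_candidates`, the lowercase whitespace tokenization, and the reading of "occurring repetitively" as a total count of at least 2 are all assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token phrases in a token list."""
    return [" ".join(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]

def top_candidates(documents, max_n=5, i=3):
    """For each n in 1..max_n, return the i most frequent repeated n-grams.

    Assumptions (not from the source): lowercase whitespace tokenization,
    and "repetitive" meaning a total count of at least 2.
    """
    tokenized = [doc.lower().split() for doc in documents]
    result = {}
    for n in range(1, max_n + 1):
        counts = Counter(g for toks in tokenized for g in ngrams(toks, n))
        repeated = [(g, c) for g, c in counts.items() if c >= 2]
        # Descending frequency; ties broken alphabetically for determinism.
        repeated.sort(key=lambda gc: (-gc[1], gc[0]))
        result[n] = repeated[:i]
    return result

# Hypothetical cluster of three short support tickets.
docs = [
    "printer driver install failed",
    "printer driver crash on startup",
    "cannot install printer driver",
]
cands = top_candidates(docs, max_n=2, i=2)
# cands[1] holds the top unigrams, cands[2] the top bigrams.
```

No meaning check or co-occurrence test is applied when picking candidates, which matches the selection criterion stated in the claims; only raw frequency is used.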

49 Citations

13 Claims

1. A system for automatically creating at least one label for at least one cluster of text documents in a computing environment, the system comprising:
- a processor; and
  
  a memory coupled to the processor, wherein the processor is capable of executing a plurality of modules stored in the memory, and wherein the plurality of modules comprise;
  
  a receiving module configured to receive an input data comprising a set of text documents;
  
  a candidate items selector configured to;
  
  select a plurality of candidate items occurring repetitively in the input data, wherein the selection of the plurality of candidate items is not based on verifying meaning of the candidate items or co-occurrence of the candidate items, wherein the plurality of candidate items comprises one or more words, a combination of words, and one or more phrases, wherein the candidate items are selected using n-gram selection technique, and a value of ‘n’ ranges from 1 to 5; and
  
  generate a sorted list of the plurality of candidate items based on a frequency of occurrence of the plurality of candidate items in the input data;
  
  a combination array generator configured to select a ‘i’ number of top scorer candidate items for the each n-gram from the sorted list of the plurality of candidate items to populate a two-dimensional array of ‘i*n × i*n’ size by creating pairs between each n-gram candidate items for ‘n’ varying from 1 to n, and wherein the two dimensional array is a matrix of the ‘i*n × i*n’ size, wherein the pairs of candidate items are generated by making one to one pair combinations of each of the ‘i’ number of candidate items of each n-gram in a {(unigram, unigram), (unigram, bigram), (unigram, trigram) up to (n-gram, n-gram)} pattern;
  
  a coverage value analyzer configured to determine a coverage value for each pair of the candidate items associated with each cell of the two-dimensional array to further populate a sorted two-dimensional array based on the coverage value, wherein each cell of the two-dimensional array contains the coverage value of each pair of the candidate items in the input data, wherein the coverage value for each pair of the candidate items of the matrix is computed by adding the coverage value for each candidate item of the pair of the candidate items and subtracting the coverage value for the pair of candidate items having in common from the addition;
  
  a candidate pair selector configured to select a predefined number of pairs of the candidate items from the sorted two-dimensional array to further process and generate a list of the pairs of the candidate items, wherein the two-dimensional array is sorted in accordance with a descending order of the coverage value for each pair of the candidate items and the candidate pair selector selects pairs of the candidate items occurring foremost, and wherein the candidate pair selector is further configured to select at least top two pairs of the candidate items from the sorted two-dimensional array;
  
  a unique word filter configured to accept the list of the pairs of the candidate items to determine a number of unique words in each of the pairs of the candidate items, wherein a unique word of the determined number of unique words further comprises a word appearing in no more than two documents in a single cluster of the at least one cluster; and
  
  a cluster label selector configured to sort the list of the pairs of the candidate items using the coverage value and the number of unique words to create a sorted list of the pairs of the candidate items for selecting a cluster label from the sorted list of the pairs of the candidate items.
- - 2. The system of claim 1, wherein the input data comprises at least one of:
    - a set of text documents and a set of text records associated with the at least one cluster.
  - 3. The system of claim 1, wherein the list of the plurality of candidate items is sorted in accordance with a descending order of the frequency of occurrence of the plurality of candidate items, wherein a foremost predefined number of the plurality of candidate items is selected from the sorted list of the plurality of candidate items.
  - 4. The system of claim 1, wherein the coverage value for each pair of the candidate items is determined to ensure a maximum coverage with a minimum overlap and the two-dimensional array is sorted in accordance with a descending order of the coverage value for each pair of the candidate items.
  - 5. The system of claim 1, wherein the cluster label selector sorts the list of the pairs of the candidate items by using at least one of:
    - the coverage value in a descending order and the number of unique words in an ascending order, and the coverage value in the ascending order and the number of unique words in the descending order, to create the sorted list of the pairs of the candidate items, and wherein the cluster label selector selects at least three pairs of the candidate items from the sorted list of pairs of the candidate items to select the cluster labels.
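The coverage computation recited in claim 1 (add the coverage of each item in the pair, then subtract the coverage the two items have in common) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: "coverage" is taken here to mean the number of documents containing a candidate, and the names `contains`, `coverage_matrix`, and `top_pairs` are hypothetical.

```python
def contains(doc, phrase):
    """True if the phrase occurs in the document as whole tokens."""
    toks, p = doc.lower().split(), phrase.split()
    return any(toks[k:k + len(p)] == p for k in range(len(toks) - len(p) + 1))

def coverage_matrix(documents, candidates):
    """Build a square matrix whose cell [r][c] holds the pair coverage.

    Assumption: coverage of a candidate = number of documents containing it.
    Pair coverage = cov(a) + cov(b) - cov(a and b), as recited in the claim.
    """
    covers = [{d for d, doc in enumerate(documents) if contains(doc, c)}
              for c in candidates]
    size = len(candidates)
    matrix = [[0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            common = len(covers[r] & covers[c])
            matrix[r][c] = len(covers[r]) + len(covers[c]) - common
    return matrix

def top_pairs(matrix, candidates, k=2):
    """Return the k distinct pairs with the highest coverage, descending."""
    pairs = [(matrix[r][c], candidates[r], candidates[c])
             for r in range(len(candidates))
             for c in range(r + 1, len(candidates))]
    pairs.sort(key=lambda p: (-p[0], p[1], p[2]))
    return pairs[:k]

# Hypothetical data: i = 2 candidates for each of n = 1..2 would give a 4 x 4 matrix.
docs = [
    "printer driver install failed",
    "printer driver crash on startup",
    "email login failed",
    "email password reset",
]
candidates = ["printer", "email", "failed", "printer driver"]
matrix = coverage_matrix(docs, candidates)
best = top_pairs(matrix, candidates)
```

Under this reading, a pair scores highest when its two items together reach many documents while sharing few of them, which is the "maximum coverage with a minimum overlap" goal named in claim 4. The sketch only scans the upper triangle when selecting pairs because the matrix is symmetric.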

6. A method for automatically creating at least one label for at least one cluster of text documents in a computing environment, the method comprising:
- receiving an input data comprising a set of text documents;
  
  selecting a plurality of candidate items occurring repetitively in the input data wherein the selection of the plurality of candidate items is not based on verifying meaning of the candidate items or co-occurrence of the candidate items, wherein the plurality of candidate items comprises one or more words, a combination of words, and one or more phrases, wherein the candidate items are selected using n-gram selection technique, and a value of ‘n’ ranges from 1 to 5;
  
  generating a sorted list of the plurality of candidate items based on a frequency of occurrence of the plurality of candidate items in the input data;
  
  selecting a ‘i’ number of top scorer candidate items for the each n-gram from the sorted list of the plurality of candidate items to populate a two-dimensional array of ‘i*n × i*n’ size by creating pairs between each n-gram candidate items for ‘n’ varying from 1 to n, and wherein the two dimensional array is a matrix of the ‘i*n × i*n’ size, wherein the pairs of the candidate items are generated by making one to one pair combinations of each of the ‘i’ number of candidate items of each n-gram in a {(unigram, unigram), (unigram, bigram), (unigram, trigram) up to (n-gram, n-gram)} pattern;
  
  determining a coverage value for each pair of the candidate items associated with each cell of the two-dimensional array to further populate a sorted two-dimensional array based on the coverage value, wherein each cell of the two-dimensional array contains the coverage value of each pair of the candidate items in the input data, wherein the coverage value for each pair of the candidate items of the matrix is computed by adding the coverage value for each candidate item of the pair of the candidate items and subtracting the coverage value for the pair of candidate items having in common from the addition;
  
  selecting a predefined number of pairs of the candidate items from the sorted two-dimensional array to further process and generate a list of the pairs of the candidate items, wherein the two-dimensional array is sorted in accordance with a descending order of the coverage value for each pair of the candidate items and the candidate pair selector selects pairs of the candidate items occurring foremost, wherein at least top two pairs of the candidate items are selected from the sorted two-dimensional array;
  
  accepting the list of the pairs of the candidate items to determine a number of unique words in each of the pairs of the candidate items; and
  
  sorting the list of the pairs of the candidate items using the coverage value and the number of unique words to create a sorted list of the pairs of the candidate items for selecting a cluster label from the sorted list of the pairs of the candidate items, wherein a unique word of the determined number of unique words further comprises a word appearing in no more than two documents in a single cluster of the at least one cluster;
  
  wherein the receiving, the selecting the plurality of candidates, the selecting the predefined number of the plurality of candidate items, the determining the coverage value, the selecting the predefined number of pairs, the accepting the list, and the sorting the list are performed by a processor of a computerized device.
- - 7. The method of claim 6, wherein the input data further comprises at least one of:
    - a set of text documents and a set of text records associated with the at least one cluster.
  - 8. The method of claim 6, wherein the list of the plurality of candidate items is sorted in accordance with a descending order of the frequency of occurrence of the plurality of candidate items, wherein a foremost predefined number of the plurality of candidate items are selected from the sorted list of the plurality of candidate items.
  - 9. The method of claim 6, wherein sorting the list of the pairs of the candidate items is performed by using at least one of:
    - the coverage value in a descending order and the number of unique words in an ascending order, and the coverage value in the ascending order and the number of unique words in the descending order, to create the sorted list of the pairs of the candidate items, wherein at least three pairs of the candidate items are selected from the sorted list of pairs of the candidate items to select the cluster labels.
  - 10. The method of claim 6, wherein the step of determining the coverage value for each pair of the candidate items further comprises determining the coverage value for each pair of the candidate items to ensure a maximum coverage with a minimum overlap.
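The unique word filter and the final sort in the method claims can be sketched as below. This is a minimal illustration, not the patented implementation: the names `unique_word_count` and `rank_labels` are hypothetical, whitespace tokenization is assumed, and the sort follows the first ordering named in claim 9 (coverage descending, unique-word count ascending).

```python
def unique_word_count(pair, cluster_docs, max_docs=2):
    """Count words of the pair that appear in at most max_docs cluster documents.

    The threshold of two documents comes from the claim text; tokenizing on
    whitespace is an assumption of this sketch.
    """
    doc_tokens = [set(doc.lower().split()) for doc in cluster_docs]
    words = set(" ".join(pair).split())
    return sum(1 for w in words
               if sum(w in toks for toks in doc_tokens) <= max_docs)

def rank_labels(scored_pairs, cluster_docs):
    """Sort (coverage, a, b) tuples by coverage desc, then unique words asc."""
    rows = [(cov, unique_word_count((a, b), cluster_docs), a, b)
            for cov, a, b in scored_pairs]
    rows.sort(key=lambda r: (-r[0], r[1], r[2], r[3]))
    return rows

# Hypothetical cluster; both pairs below reach all four documents.
cluster = [
    "printer driver install failed",
    "printer driver crash on startup",
    "printer offline update failed",
    "email login failed",
]
pairs = [(4, "email", "printer"), (4, "failed", "printer")]
ranked = rank_labels(pairs, cluster)
label = " / ".join(ranked[0][2:])
```

Here "email" appears in only one document, so it counts as a unique word, while "failed" and "printer" each appear in three. With coverage tied, the pair with fewer unique words ranks first under the assumed ordering; the claims also allow the opposite ordering.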

11. A non-transitory computer readable medium having embodied thereon a computer program for automatically creating at least one label for at least one cluster of text documents, the non-transitory computer readable medium comprising:
- a program code for receiving an input data comprising a set of text documents;
  
  a program code for selecting a plurality of candidate items occurring repetitively in the input data, wherein the selection of the plurality of candidate items is not based on verifying meaning of the candidate items or co-occurrence of the candidate items, wherein the plurality of candidate items comprises one or more words, a combination of words, and one or more phrases, wherein the candidate items are selected using n-gram selection technique, and a value of ‘n’ ranges from 1 to 5;
  
  a program code for generating a sorted list of the plurality of candidate items based on a frequency of occurrence of the plurality of candidate items in the input data;
  
  a program code for selecting a ‘i’ number of top scorer candidate items for the each n-gram from the sorted list of the plurality of candidate items to populate a two-dimensional array of ‘i*n × i*n’ size by creating pairs between each n-gram candidate items for ‘n’ varying from 1 to n, and wherein the two dimensional array is a matrix of the ‘i*n × i*n’ size, wherein the pairs of the candidate items are generated by making one to one pair combinations of each of the ‘i’ number of candidate items of each n-gram in a {(unigram, unigram), (unigram, bigram), (unigram, trigram) up to (n-gram, n-gram)} pattern;
  
  a program code for determining a coverage value for each pair of the candidate items associated with each cell of the two-dimensional array to further sort the two-dimensional array based on the coverage value for each pair of the candidate items to populate a sorted two-dimensional array, wherein each cell of the two-dimensional array contains the coverage value of each pair of the candidate items in the input data, wherein the coverage value for each pair of the candidate items of the matrix is computed by adding the coverage value for each candidate item of the pair of the candidate items and subtracting the coverage value for the pair of candidate items having in common from the addition;
  
  a program code for selecting a predefined number of pairs of the candidate items from the sorted two-dimensional array occurring foremost to further process and generate a list of the pairs of the candidate items, wherein the two-dimensional array is sorted in accordance with a descending order of the coverage value for each pair of the candidate items and the candidate pair selector selects pairs of the candidate items occurring foremost, wherein at least top two pairs of the candidate items are selected from the sorted two-dimensional array;
  
  a program code for accepting the list of the pairs of the candidate items to determine a number of unique words in each of the pairs of the candidate items, wherein a unique word of the determined number of unique words further comprises a word appearing in no more than two documents in a single cluster of the at least one cluster; and
  
  a program code for sorting the list of the pairs of the candidate items using the coverage value and the number of unique words to create a sorted list of the pairs of the candidate items for selecting a cluster label from the sorted list of the pairs of the candidate items.
- - 12. The non-transitory computer readable medium of claim 11, wherein the input data comprises at least one of:
    - a set of text documents and a set of text records associated with the at least one cluster.
  - 13. The non-transitory computer readable medium of claim 11, wherein the list of the plurality of candidate items is sorted in accordance with a descending order of the frequency of occurrence of the plurality of candidate items, wherein the foremost predefined number of the plurality of candidate items is selected from the sorted list of the plurality of candidate items.

Current Assignee: TATA Consultancy Services Limited (Tata Sons Pvt Ltd.)
Original Assignee: TATA Consultancy Services Limited (Tata Sons Pvt Ltd.)
Inventors: Deshpande, Shailesh Shankar; Palshikar, Girish Keshav; G, Athiappan
Primary Examiner(s): Le, Miranda
Application Number: US14/188,979
Publication Number: US 20150006531A1
Time in Patent Office: 1,820 Days
Field of Search: 707737
US Class Current:
CPC Class Codes: G06F 16/783 using metadata automaticall...
