Method and system for naming a cluster of words and phrases
First Claim
1. A method of naming a cluster of words and phrases extracted from a set of documents, using a lexical database, comprising the steps of:
a. generating definition vectors of words in the cluster using the lexical database;
b. determining support of the definition vectors;
c. determining most relevant definition vector corresponding to each word in the cluster;
d. generating concepts from the most relevant definition vectors using a preselected clustering method;
e. determining support of said concepts;
f. designating predetermined number of top ranked concepts as dominant concepts;
g. naming the cluster by a predetermined number of most frequent words from the cluster, if the dominant concepts are not designated; and
h. naming the cluster by choosing words from the lexical database that describe the dominant concepts in accurate detail, if the dominant concepts are designated.
Abstract
The present invention provides a method, system and computer program for naming a cluster, or a hierarchy of clusters, of words and phrases that have been extracted from a set of documents. The invention takes these clusters as the input and generates appropriate labels for the clusters using a lexical database. Naming involves first finding out all possible word senses for all the words in the cluster, using the lexical database; and then augmenting each word sense with words that are semantically similar to that word sense to form respective definition vectors. Thereafter, word sense disambiguation is done to find out the most relevant sense for each word. Definition vectors are clustered into groups. Each group represents a concept. These concepts are thereafter ranked based on their support. Finally, a pre-specified number of words and phrases from the definition vectors of the dominant concepts are selected as labels, based on their generality in the lexical database. Therefore, the labels may not necessarily consist of the original words in the cluster. A hierarchy of clusters is named in a recursive fashion starting from leaf clusters. Dominant concepts in child clusters are propagated into their parent to reduce the labeling complexity of parent clusters.
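As a rough illustration of the pipeline the abstract describes, the following Python sketch builds definition vectors, disambiguates senses by support, groups senses into concepts, and labels the cluster. Everything concrete here is an assumption made for illustration and is not taken from the patent: the toy lexical database `TOY_DB`, the overlap-based `support` measure, grouping senses by a shared hypernym as a stand-in for the clustering method, and the `min_support` threshold.

```python
from collections import Counter

# Hypothetical toy lexical database (an assumption, not a real resource):
# word -> list of senses, each sense = (sense_id, related_words, hypernym).
TOY_DB = {
    "bank": [
        ("bank.finance", {"money", "deposit", "loan"}, "financial institution"),
        ("bank.river", {"river", "shore", "water"}, "landform"),
    ],
    "loan": [("loan.finance", {"money", "credit", "deposit"}, "financial institution")],
    "credit": [("credit.finance", {"money", "loan", "deposit"}, "financial institution")],
}


def definition_vectors(word, db):
    """One vector per word sense: the word plus its semantically similar words."""
    return [(sid, frozenset(related | {word}), hyper)
            for sid, related, hyper in db.get(word, [])]


def support(vector, cluster_words):
    """Assumed support measure: how many distinct cluster words the vector contains."""
    return len(vector & set(cluster_words))


def name_cluster(cluster_words, db, n_labels=1, min_support=2):
    # Word sense disambiguation: keep the best-supported sense of each word.
    best = []
    for word in sorted(set(cluster_words)):
        vectors = definition_vectors(word, db)
        if vectors:
            best.append(max(vectors, key=lambda v: support(v[1], cluster_words)))

    # Group the chosen senses into concepts (here: by shared hypernym).
    concepts = {}
    for _sid, vector, hypernym in best:
        concepts.setdefault(hypernym, []).append(vector)

    # Rank concepts by total support and keep the top ones as dominant.
    ranked = sorted(
        ((sum(support(v, cluster_words) for v in vecs), hyper)
         for hyper, vecs in concepts.items()),
        reverse=True,
    )
    dominant = [hyper for score, hyper in ranked if score >= min_support][:n_labels]

    # Label with the more general term if a dominant concept exists,
    # otherwise fall back to the most frequent words in the cluster.
    if dominant:
        return dominant
    return [w for w, _ in Counter(cluster_words).most_common(n_labels)]


print(name_cluster(["bank", "loan", "credit", "loan"], TOY_DB))
# prints ['financial institution']
```

Note that the label in the example is not one of the cluster's own words, which matches the abstract's remark that labels need not come from the original cluster.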
217 Citations
48 Claims
1. A method of naming a cluster of words and phrases extracted from a set of documents, using a lexical database, comprising the steps of:
a. generating definition vectors of words in the cluster using the lexical database;
b. determining support of the definition vectors;
c. determining most relevant definition vector corresponding to each word in the cluster;
d. generating concepts from the most relevant definition vectors using a preselected clustering method;
e. determining support of said concepts;
f. designating predetermined number of top ranked concepts as dominant concepts;
g. naming the cluster by a predetermined number of most frequent words from the cluster, if the dominant concepts are not designated; and
h. naming the cluster by choosing words from the lexical database that describe the dominant concepts in accurate detail, if the dominant concepts are designated.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system for naming a cluster of words and phrases extracted from a document, using a lexical database, comprising:
a. means for generating definition vectors of words in said cluster using said lexical database;
b. means for determining support of said definition vectors;
c. means for determining most relevant definition vector corresponding to each said word in said cluster;
d. means for generating concepts from said most relevant definition vectors;
e. means for determining support of said concepts;
f. means for designating predetermined number of top ranked said concepts as dominant concepts; and
g. means for naming said cluster using said dominant concepts.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
20. A computer program product for naming a cluster of words and phrases, extracted from a document, using a lexical database, the computer program product embodied in a computer-readable medium and comprising:
a. computer-readable program code means for generating definition vectors of words in said cluster using said lexical database;
b. computer-readable program code means for determining support of said definition vectors;
c. computer-readable program code means for determining most relevant definition vector corresponding to each said word in said cluster;
d. computer-readable program code means for generating concepts from said most relevant definition vectors;
e. computer-readable program code means for determining support of said concepts;
f. computer-readable program code means for designating predetermined number of top ranked said concepts as dominant concepts; and
g. computer-readable program code means for naming said cluster using said dominant concepts.
Dependent claims: 21, 22, 23, 24, 25, 26, 27, 28
29. A method of naming clusters of words and phrases extracted from a set of documents in a hierarchy that is in the form of a tree, using a lexical database, comprising the steps of:
a. designating a cluster from the hierarchy of clusters as a selected cluster;
b. determining whether the selected cluster is a leaf cluster in the hierarchy;
c. if the selected cluster is a leaf cluster, performing the steps of;
  i. designating the selected cluster as an updated cluster;
  ii. generating definition vectors of words in the updated cluster using the lexical database;
  iii. determining support of the definition vectors;
  iv. determining most relevant definition vector corresponding to each word in the updated cluster;
  v. generating concepts from the most relevant definition vectors using a preselected clustering method;
  vi. determining support of the concepts;
  vii. designating predetermined number of top ranked concepts as dominant concepts;
  viii. naming the selected cluster by a predetermined number of most frequent words from the cluster, if the dominant concepts are not designated; and
  ix. naming the selected cluster by choosing words from the lexical database that describe the dominant concepts in accurate detail, if the dominant concepts are designated;
d. if the selected cluster is not a leaf cluster, performing the steps of;
  i. augmenting the selected cluster with already generated dominant concepts of the selected cluster's children clusters;
  ii. augmenting the selected cluster with concepts of the selected cluster's children clusters that may not be dominant in that children cluster;
  iii. designating augmented cluster as an updated cluster;
  iv. generating definition vectors of words in the updated cluster using the lexical database;
  v. determining support of the definition vectors;
  vi. determining most relevant definition vector corresponding to each word in the updated cluster;
  vii. generating concepts from the most relevant definition vectors using a preselected clustering method;
  viii. determining support of the concepts;
  ix. designating predetermined number of top ranked concepts as dominant concepts;
  x. naming the selected cluster by a predetermined number of most frequent words from the cluster, if the dominant concepts are not designated; and
  xi. naming the selected cluster by choosing words from the lexical database that describe the dominant concepts in accurate detail, if the dominant concepts are designated;
e. if all clusters of the hierarchy of clusters have not been designated as the selected cluster for naming, repeating steps a–d.
Dependent claims: 30, 31, 32, 33, 34, 35, 36, 37, 38
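The leaf-first ordering in claim 29 can be pictured as a post-order tree traversal. The Python sketch below is only one possible reading: the `Cluster` class, the `name_hierarchy` function, and the `most_frequent` placeholder are hypothetical names introduced here, and `most_frequent` merely stands in for the single-cluster naming steps rather than implementing them. The sketch propagates only each child's final labels upward; it does not model step d.ii, which also carries up child concepts that are not dominant.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical tree node: a cluster's words, its child clusters, its labels."""
    words: list
    children: list = field(default_factory=list)
    labels: list = field(default_factory=list)


def most_frequent(words, n=1):
    """Placeholder naming function (not the claimed method): top-n frequent words,
    ties broken alphabetically so the result is deterministic."""
    ranked = sorted(Counter(words).items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _count in ranked[:n]]


def name_hierarchy(node, name_fn=most_frequent):
    """Name children first, then augment the parent with their labels and name it."""
    updated = list(node.words)  # a leaf's updated cluster is the cluster itself
    for child in node.children:
        name_hierarchy(child, name_fn)
        updated.extend(child.labels)  # propagate the children's concepts upward
    node.labels = name_fn(updated)
    return node


leaf1 = Cluster(["loan", "loan", "credit"])
leaf2 = Cluster(["loan", "river"])
root = name_hierarchy(Cluster(["river"], [leaf1, leaf2]))
print(root.labels)  # prints ['loan']
```

Passing the naming routine in as `name_fn` keeps the traversal separate from whichever single-cluster naming procedure is actually used.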
39. A system for naming clusters of words and phrases extracted from a set of documents in a hierarchy that is in the form of a tree, using a lexical database, comprising:
a. means for selecting a cluster of words and phrases from said hierarchy of clusters;
b. means for updating said cluster depending upon whether said cluster is a leaf cluster or not;
c. means for generating definition vectors of words in said updated cluster using said lexical database;
d. means for determining support of said definition vectors;
e. means for determining most relevant definition vector corresponding to each said word in said updated cluster;
f. means for generating concepts from said most relevant definition vectors using a preselected clustering method;
g. means for determining support of said concepts;
h. means for designating predetermined number of top ranked said concepts as dominant concepts; and
i. means for naming said cluster using said dominant concepts.
Dependent claims: 40, 41, 42, 43, 44, 45, 46, 47, 48
Specification