System and method for automatically discovering a hierarchy of concepts from a corpus of documents
Abstract
The invention is a method, system and computer program for automatically discovering concepts from a corpus of documents and automatically generating a labeled concept hierarchy. The method extracts signatures from the corpus of documents and computes the similarity between signatures using a statistical measure. The frequency distribution of the signatures is refined to reduce inaccuracy in the similarity measure, and the signatures are disambiguated to address the polysemy problem. The similarity measure is then recomputed from the refined frequency distribution and the disambiguated signatures, so that it reflects the actual similarity between signatures. The recomputed similarity measure is used to cluster related signatures into concepts, and the concepts are arranged in a concept hierarchy. The concept hierarchy automatically generates a query for a particular concept and retrieves the relevant documents associated with that concept.
39 Claims
1. A computer-implemented method for automatically discovering a hierarchy of concepts from a corpus of documents, the concept hierarchy arranges concepts into multiple levels of abstraction, the method comprising:
a. extracting signatures from the corpus of documents, wherein a signature comprises a noun or a noun phrase;
b. identifying similarity between the signatures using a refined distribution, wherein the refined distribution is obtained by computing and iteratively refining similarity measures between the signatures;
c. hierarchically clustering related signatures to generate concepts, wherein a concept is a cluster of related nouns and noun phrases;
d. hierarchically arranging the concepts to obtain a concept hierarchy;
e. labeling the concepts arranged in the concept hierarchy; and
f. creating an interface for the concept hierarchy.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
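Steps c and d of claim 1 can be pictured with a small sketch. The claim does not say which clustering algorithm is used, so the code below assumes greedy average-link agglomerative clustering. The function names (`average_link`, `leaves`, `cluster`), the `threshold` parameter, and the similarity scores are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

def average_link(a, b, sim):
    """Mean pairwise similarity between two groups of signatures."""
    pairs = [(x, y) for x in a for y in b]
    return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

def leaves(node):
    """Flatten a nested-tuple tree into the signatures it contains."""
    if isinstance(node, str):
        return [node]
    return [s for child in node for s in leaves(child)]

def cluster(signatures, sim, threshold=0.2):
    """Merge the two most similar subtrees until no pair exceeds threshold.

    Returns a list of trees; each tree is a signature (str) or a nested
    tuple whose nesting depth gives the levels of the hierarchy.
    """
    forest = list(signatures)
    while len(forest) > 1:
        def score(ij):
            return average_link(leaves(forest[ij[0]]), leaves(forest[ij[1]]), sim)
        i, j = max(combinations(range(len(forest)), 2), key=score)
        if score((i, j)) <= threshold:
            break  # remaining groups are too dissimilar to merge
        merged = (forest[i], forest[j])
        forest = [t for k, t in enumerate(forest) if k not in (i, j)] + [merged]
    return forest

# Illustrative similarity scores (made up, not taken from the patent).
signatures = ["engine", "motor", "piston", "loan"]
sim = {
    frozenset(("engine", "motor")): 0.9,
    frozenset(("engine", "piston")): 0.6,
    frozenset(("motor", "piston")): 0.5,
    frozenset(("engine", "loan")): 0.05,
    frozenset(("motor", "loan")): 0.05,
    frozenset(("piston", "loan")): 0.05,
}
result = cluster(signatures, sim)
print(result)
```

In this toy run "engine" and "motor" merge first, "piston" joins them one level up, and "loan" stays a separate top-level concept because its similarity to the others never exceeds the threshold.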
12. A system for automatically discovering a hierarchy of concepts from a corpus of documents, the concept hierarchy organizes concepts into multiple levels of abstraction, the system comprising:
a. means for extracting signatures from the corpus of documents, wherein a signature comprises a noun or a noun phrase;
b. means for identifying similarity between the signatures using a refined distribution, wherein the refined distribution is obtained by computing and iteratively refining similarity measures between the signatures;
c. means for hierarchically clustering related signatures to generate concepts, wherein a concept is a cluster of related nouns and noun phrases;
d. means for hierarchically arranging the concepts to obtain a concept hierarchy;
e. means for labeling concepts arranged in the concept hierarchy; and
f. means for creating an interface for the concept hierarchy.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
26. A computer-implemented method for automatically discovering a hierarchy of concepts from a corpus of documents, the concept hierarchy arranging concepts into multiple levels of abstraction, the method comprising:
a. extracting signatures from the corpus of documents, wherein a signature comprises a noun or a noun phrase;
b. identifying similarity between the signatures; wherein the step of identifying similarity between the signatures includes the steps of:
  i. representing the signatures using distribution of the signatures in the corpus of documents;
  ii. computing similarity measure between the signatures;
  iii. refining distribution of the signatures in the corpus of documents;
  iv. re-computing similarity measure between the signatures based on the refined distribution; and
  v. identifying related signatures using the re-computed similarity measure;
c. hierarchically clustering related signatures to generate concepts, wherein a concept is a cluster of related nouns and noun phrases;
d. hierarchically arranging the concepts to obtain a concept hierarchy;
e. labeling the concepts arranged in the concept hierarchy; and
f. creating an interface for the concept hierarchy.
Dependent claims: 27, 28, 29, 30, 31, 32
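Sub-steps i to v of claim 26 describe a compute, refine, re-compute loop. The sketch below is one possible reading, not the patented procedure: it assumes cosine similarity as the similarity measure, and it assumes a refinement that adds similarity-weighted copies of the other signatures' vectors to each signature's per-document frequency vector. The names `cosine`, `similarity_matrix`, `refine`, `related_signatures`, the parameters `weight`, `rounds`, `threshold`, and the frequency counts are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two frequency vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(dist):
    """Steps ii and iv: similarity for every ordered pair of signatures."""
    return {(a, b): cosine(dist[a], dist[b])
            for a in dist for b in dist if a != b}

def refine(dist, sim, weight=0.5):
    """Step iii (hypothetical): smooth each vector with its neighbours,
    weighting every neighbour by its current similarity."""
    refined = {}
    for a, vec in dist.items():
        new = list(vec)
        for b, other in dist.items():
            if a != b:
                w = weight * sim[(a, b)]
                new = [x + w * y for x, y in zip(new, other)]
        refined[a] = new
    return refined

def related_signatures(dist, rounds=2, threshold=0.5):
    """Run steps ii to v and return the pairs judged related."""
    sim = similarity_matrix(dist)              # step ii
    for _ in range(rounds):
        dist = refine(dist, sim)               # step iii
        sim = similarity_matrix(dist)          # step iv
    return {pair for pair, s in sim.items()    # step v
            if s >= threshold and pair[0] < pair[1]}

# Step i: each signature as its frequency in four documents (made-up counts).
dist = {
    "engine": [2, 1, 0, 0],
    "motor":  [0, 1, 2, 0],
    "loan":   [0, 0, 0, 3],
}
print(related_signatures(dist, rounds=0))
print(related_signatures(dist, rounds=2))
```

With these made-up counts, "engine" and "motor" share only one document, so their raw similarity falls below the threshold. After two refinement rounds the pair passes it, while "loan" remains unrelated to both. Under this particular refinement, any pair with nonzero similarity drifts closer with every round, so the number of rounds is kept small here.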
33. A system for automatically discovering a hierarchy of concepts from a corpus of documents, the concept hierarchy arranging concepts into multiple levels of abstraction, the system comprising:
a. means for extracting signatures from the corpus of documents, wherein a signature comprises a noun or a noun phrase;
b. means for identifying similarity between the signatures; wherein the means for identifying similarity between the signatures includes:
  i. means for representing the signatures using distribution of the signatures in the corpus of documents;
  ii. means for computing similarity measure between the signatures;
  iii. means for refining distribution of the signatures in the corpus of documents;
  iv. means for re-computing similarity measure between the signatures based on the refined distribution; and
  v. means for identifying related signatures using the re-computed similarity measure;
c. means for hierarchically clustering related signatures to generate concepts, wherein a concept is a cluster of related nouns and noun phrases;
d. means for hierarchically arranging the concepts to obtain a concept hierarchy;
e. means for labeling the concepts arranged in the concept hierarchy; and
f. means for creating an interface for the concept hierarchy.
Dependent claims: 34, 35, 36, 37, 38, 39
Specification