System and method for automatically discovering a hierarchy of concepts from a corpus of documents
Abstract
The invention is a method, system and computer program for automatically discovering concepts from a corpus of documents and automatically generating a labeled concept hierarchy. The method extracts signatures from the corpus of documents and computes the similarity between signatures using a statistical measure. The frequency distribution of the signatures is then refined to reduce inaccuracy in the similarity measure, and the signatures are disambiguated to address the polysemy problem. The similarity measure is recomputed from the refined frequency distribution and the disambiguated signatures so that it reflects the actual similarity between signatures, and the recomputed measure is used to cluster related signatures. The clustered signatures form concepts, and the concepts are arranged in a concept hierarchy. For a particular concept, the concept hierarchy automatically generates a query and retrieves the relevant documents associated with that concept.
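The similarity step described in the abstract can be illustrated with a small sketch. The abstract does not name the statistical measure, so the cosine similarity over document co-occurrence used here is an assumption, as are the function name `signature_similarity` and the representation of each document as a set of already extracted signature strings. The refinement and disambiguation passes are not shown.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def signature_similarity(doc_signatures):
    """Return {(sig_a, sig_b): similarity} for signatures that co-occur.

    doc_signatures: list of sets, one set of signature strings per document.
    The measure is cosine similarity of binary document-occurrence vectors,
    an assumed stand-in for the unnamed statistical measure.
    """
    # Map each signature to the set of document indices that contain it.
    postings = defaultdict(set)
    for doc_id, signatures in enumerate(doc_signatures):
        for sig in signatures:
            postings[sig].add(doc_id)

    similarity = {}
    for a, b in combinations(sorted(postings), 2):
        shared = len(postings[a] & postings[b])
        if shared:
            # Shared documents divided by the geometric mean of the counts.
            similarity[(a, b)] = shared / sqrt(len(postings[a]) * len(postings[b]))
    return similarity


# Hypothetical toy corpus: three documents with hand-picked signatures.
docs = [
    {"stock market", "interest rate"},
    {"stock market", "interest rate", "central bank"},
    {"football", "world cup"},
]
sims = signature_similarity(docs)
```

In this toy corpus, "interest rate" and "stock market" always appear together and so receive the maximum score, while signatures that never share a document get no entry at all.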
25 Claims
1. A method for automatically discovering a hierarchy of concepts from a corpus of documents, the concept hierarchy organizes concepts into multiple levels of abstraction, the method comprising:
a. extracting signatures from the corpus of documents;
b. identifying similarity between signatures;
c. hierarchically clustering related signatures to generate concepts and hierarchically clustering concepts thus generated, whereby hierarchical clustering obtains a concept hierarchy;
d. labeling the concepts organized in the concept hierarchy; and
e. creating an interface for the concept hierarchy generated.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A system for automatically discovering a hierarchy of concepts from a corpus of documents, the concept hierarchy organizes concepts into multiple levels of abstraction, the system comprising:
a. means for extracting signatures from the corpus of documents;
b. means for identifying similarity between signatures;
c. means for hierarchically clustering related signatures to generate concepts and hierarchically clustering concepts thus generated, whereby hierarchical clustering obtains a concept hierarchy;
d. means for labeling concepts organized in the concept hierarchy; and
e. means for creating an interface for the concept hierarchy.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
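Steps c through e of the independent claims can be sketched in a few lines. This is only an illustration under stated assumptions: the claims do not specify a clustering algorithm, a labeling rule, or an interface, so the average-link agglomerative merge, the "most frequent member" label, and the indented text outline below are all hypothetical choices, as are the names `build_hierarchy` and `render`.

```python
from itertools import combinations


def build_hierarchy(signatures, sim, doc_freq):
    """Cluster signatures bottom-up and return the root concept as a dict."""

    def label(members):
        # Assumed labeling rule: the most frequent member names the concept.
        return max(sorted(members), key=lambda s: doc_freq.get(s, 0))

    def pair_sim(a, b):
        # The similarity table may store a pair in either order.
        return sim.get((a, b), sim.get((b, a), 0.0))

    def link(x, y):
        # Average pairwise similarity between the members of two clusters.
        total = sum(pair_sim(a, b) for a in x["members"] for b in y["members"])
        return total / (len(x["members"]) * len(y["members"]))

    nodes = [{"label": s, "members": [s], "children": []} for s in signatures]
    while len(nodes) > 1:
        # Merge the two most similar clusters into a parent concept.
        i, j = max(combinations(range(len(nodes)), 2),
                   key=lambda p: link(nodes[p[0]], nodes[p[1]]))
        left, right = nodes[i], nodes[j]
        members = left["members"] + right["members"]
        parent = {"label": label(members), "members": members,
                  "children": [left, right]}
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [parent]
    return nodes[0]


def render(node, depth=0):
    """Return the hierarchy as indented text lines, a stand-in interface."""
    lines = ["  " * depth + node["label"]]
    for child in node["children"]:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical inputs: four signatures, sparse similarities, document counts.
signatures = ["stock market", "interest rate", "football", "world cup"]
sim = {
    ("stock market", "interest rate"): 0.9,
    ("football", "world cup"): 0.8,
    ("stock market", "football"): 0.1,
}
doc_freq = {"stock market": 5, "interest rate": 3, "football": 4, "world cup": 2}

root = build_hierarchy(signatures, sim, doc_freq)
outline = render(root)
```

Each merge creates a parent node one level above its children, so the finance signatures and the sports signatures first form two separate concepts and are joined only at the root. A production system would likely stop merging below a similarity threshold rather than always building a single root.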
Specification