Automated category discovery for a terminological knowledge base
Abstract
A terminological system automatically generates sub-categories from categories of a knowledge base. The knowledge base includes a plurality of hierarchically arranged categories, as well as terms associated with the categories. A subset of the categories of the knowledge base is designated “dimensional categories.” The system also stores a corpus of documents, including themes and corresponding theme weights for each document. A target category is selected for which sub-categories are to be generated. A set of themes from the corpus of documents is selected for each term in the target category. Dimensional category vectors, one for each term, are generated by associating the set of themes for a term to a dimensional category in the knowledge base. The dimensional category vectors are then analyzed to determine whether one or more clusters of terms (terminological groups) exist, and any groups found become new sub-categories. A content processing system, which generates themes and theme weights, is also disclosed.
11 Claims
1. A method for automated generation of sub-categories from categories of a terminological knowledge base, said method comprising the steps of:
storing a corpus of documents, wherein a document comprises a plurality of themes, and a theme identifies thematic content contained within said document;
storing a knowledge base comprising a plurality of hierarchically arranged categories, wherein a subset of said categories of said knowledge base comprise dimensional categories;
selecting a target category in said knowledge base to generate at least one new sub-category, said target category comprising a plurality of terms classified within said target category, such that one or more groups of terms associated with said target category are divided for association with said new sub-category;
selecting, for each term classified within said target category, a plurality of themes from said corpus of documents;
generating a plurality of dimensional category vectors, one for each term, by associating said themes selected for a term to a dimensional category;
determining if one or more terminological groups of terms exist in said knowledge base by clustering said dimensional category vectors for each term; and
selecting, as said new sub-category for said target category, one or more terminological groups discovered.
2. The method as set forth in claim 1, further comprising the steps of:
storing theme weights for said themes;
generating cumulative weights for said dimensional category vectors from corresponding themes; and
utilizing said cumulative weights in said dimensional category vectors to ascertain terminological groups.
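The weighting steps of claim 2 can be illustrated with a minimal Python sketch. It assumes that each term already has a list of (theme, theme weight) pairs and that a lookup from theme to dimensional category is available; the function name `cumulative_vectors` and the `theme_to_dimension` mapping are hypothetical and do not come from the claims.

```python
from collections import defaultdict


def cumulative_vectors(term_themes, theme_to_dimension):
    """Build one weighted dimensional category vector per term.

    term_themes:        {term: [(theme, theme_weight), ...]}
    theme_to_dimension: {theme: dimensional_category}  (hypothetical lookup)
    """
    vectors = {}
    for term, themes in term_themes.items():
        vec = defaultdict(float)
        for theme, weight in themes:
            dim = theme_to_dimension.get(theme)
            if dim is not None:  # themes with no dimensional category are skipped
                vec[dim] += weight  # cumulative weight for this dimension
        vectors[term] = dict(vec)
    return vectors
```

Each resulting vector holds one cumulative weight per dimensional category, which is the quantity a later clustering step would compare across terms.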
3. The method as set forth in claim 2, wherein the step of generating cumulative weights for said dimensional category vectors comprises the steps of:
storing ancestor and descendant relationships for hierarchically arranged categories in said knowledge base; and
summing theme weights from said themes associated to a dimensional category if said category and said dimensional category comprise a descendant relationship, respectively, and not an ancestor relationship.
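One reading of claim 3 is that a theme's weight is added to a dimensional category only when the theme's category lies below that dimensional category in the hierarchy, and not when it lies above it. The sketch below follows that reading. It assumes the hierarchy is a tree stored as a child-to-parent dictionary (`parent_of`, a hypothetical representation), and it treats only strict descendants as matches, since the claim does not say how a theme equal to the dimensional category should be handled.

```python
def is_descendant(category, ancestor, parent_of):
    """Return True if `category` lies strictly below `ancestor` in the tree."""
    node = parent_of.get(category)
    while node is not None:
        if node == ancestor:
            return True
        node = parent_of.get(node)
    return False


def roll_up_weights(themes, dimensional_categories, parent_of):
    """Sum theme weights into each dimensional category they descend from.

    themes: [(theme_category, theme_weight), ...]
    """
    totals = {dim: 0.0 for dim in dimensional_categories}
    for theme_category, weight in themes:
        for dim in dimensional_categories:
            # Ancestors of `dim` (and unrelated categories) contribute nothing.
            if is_descendant(theme_category, dim, parent_of):
                totals[dim] += weight
    return totals
```

Walking parent pointers keeps the check simple; a system with a large knowledge base might instead precompute the stored ancestor and descendant relationships that the claim mentions.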
4. The method as set forth in claim 1, wherein the step of selecting, for each term associated with said target category, a plurality of themes from said corpus of documents comprises the steps of:
selecting a first document set, comprising a plurality of documents, that include said target category as a theme;
selecting a second set of documents, for each term, that includes documents from said first set that contain said term; and
selecting themes, for each term, that correspond to said second set of documents.
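The two-stage document selection of claim 4 can be sketched as follows. The `Document` class, its field names, and the function `themes_per_term` are illustrative assumptions; the claims do not prescribe how documents, terms, or themes are stored.

```python
from dataclasses import dataclass


@dataclass
class Document:
    terms: set    # terms that occur in the document (assumed representation)
    themes: dict  # {theme: theme_weight}


def themes_per_term(corpus, target_category, terms):
    """Select themes for each term using the two document sets of claim 4."""
    # First set: documents that include the target category as a theme.
    first_set = [doc for doc in corpus if target_category in doc.themes]
    selected = {}
    for term in terms:
        # Second set: documents from the first set that contain the term.
        second_set = [doc for doc in first_set if term in doc.terms]
        pairs = []
        for doc in second_set:
            pairs.extend(doc.themes.items())
        selected[term] = pairs  # (theme, weight) pairs for this term
    return selected
```

Restricting the search to documents about the target category is what ties each term's themes to the sense of the term that belongs in that category.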
5. The method as set forth in claim 1, wherein the step of determining if one or more terminological groups of terms exist in said knowledge base by clustering said dimensional category vectors for each term comprises the step of executing a multi-dimensional clustering algorithm, wherein each dimensional category comprises a single dimension in said multi-dimensional clustering algorithm.
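Claim 5 does not name a particular clustering algorithm. As a stand-in, the sketch below uses single-linkage grouping with a Euclidean distance threshold, where each dimensional category is one coordinate axis. The threshold parameter `max_distance` and the function names are assumptions made for illustration only.

```python
import math
from itertools import combinations


def to_point(vector, dimensions):
    """Lay out a {dimension: weight} vector as coordinates, one per dimension."""
    return [vector.get(dim, 0.0) for dim in dimensions]


def cluster_terms(vectors, dimensions, max_distance):
    """Group terms whose vectors are linked by distances <= max_distance."""
    terms = list(vectors)
    points = {t: to_point(vectors[t], dimensions) for t in terms}
    parent = {t: t for t in terms}  # union-find forest over terms

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if math.dist(points[a], points[b]) <= max_distance:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    return list(groups.values())
```

If the call returns more than one group for the terms of a target category, each group is a candidate new sub-category; a single group suggests the category does not split along the chosen dimensions.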
6. A computer readable medium for automated generation of sub-categories from categories of a terminological knowledge base comprising a set of instructions, which when executed by a computer, causes the computer to perform the steps of:
storing a corpus of documents, wherein a document comprises a plurality of themes, and a theme identifies thematic content contained within said document;
storing a knowledge base comprising a plurality of hierarchically arranged categories, wherein a subset of said categories of said knowledge base comprise dimensional categories;
selecting a target category in said knowledge base to generate at least one new sub-category, said target category comprising a plurality of terms classified within said target category, such that one or more groups of terms associated with said target category are divided for association with said new sub-category;
selecting, for each term classified within said target category, a plurality of themes from said corpus of documents;
generating a plurality of dimensional category vectors, one for each term, by associating said themes selected for a term to a dimensional category;
determining if one or more terminological groups of terms exist in said knowledge base by clustering said dimensional category vectors for each term; and
selecting, as said new sub-category for said target category, one or more terminological groups discovered.
7. The computer readable medium as set forth in claim 6, further comprising the steps of:
storing theme weights for said themes;
generating cumulative weights for said dimensional category vectors from corresponding themes; and
utilizing said cumulative weights in said dimensional category vectors to ascertain terminological groups.
8. The computer readable medium as set forth in claim 7, wherein the step of generating cumulative weights for said dimensional category vectors comprises the steps of:
storing ancestor and descendant relationships for hierarchically arranged categories in said knowledge base; and
summing theme weights from said themes associated to a dimensional category if said category and said dimensional category comprise a descendant relationship, respectively, and not an ancestor relationship.
9. The computer readable medium as set forth in claim 6, wherein the step of selecting, for each term associated with said target category, a plurality of themes from said corpus of documents comprises the steps of:
selecting a first document set, comprising a plurality of documents, that include said target category as a theme;
selecting a second set of documents, for each term, that includes documents from said first set that contain said term; and
selecting themes, for each term, that correspond to said second set of documents.
10. The computer readable medium as set forth in claim 6, wherein the step of determining if one or more terminological groups of terms exist in said knowledge base by clustering said dimensional category vectors for each term comprises the step of executing a multi-dimensional clustering algorithm, wherein each dimensional category comprises a single dimension in said multi-dimensional clustering algorithm.
11. An apparatus comprising:
a corpus of documents, wherein a document comprises a plurality of themes, and a theme identifies thematic content contained within said document;
a knowledge base comprising a plurality of hierarchically arranged categories, wherein a subset of said categories of said knowledge base comprise dimensional categories; and
a processor unit coupled to said corpus of documents and said knowledge base for selecting a target category in said knowledge base to generate at least one new sub-category, said target category comprising a plurality of terms classified within said target category such that one or more groups of terms associated with said target category are divided for association with said new sub-category, for selecting, for each term classified within said target category, a plurality of themes from said corpus of documents, for generating a plurality of dimensional category vectors, one for each term, by associating said themes selected for a term to a dimensional category, for determining if one or more terminological groups of terms exist in said knowledge base by clustering said dimensional category vectors for each term, and for selecting, as said new sub-category for said target category, one or more terminological groups discovered.
Specification