System and method of building and using hierarchical knowledge structures
First Claim
1. A computerized method for building hierarchical knowledge structures comprising:
receiving a first ontology including initial categories, an indication of sample data for a given category of the initial categories, and an indication of symbolic knowledge for the given category; and
populating the first ontology with new features to form a second ontology for use in document categorization, the populating comprising determining the new features from the sample data using a statistical machine learning process and retaining the new features and the symbolic knowledge within the second ontology in association with the given category, wherein the sample data comprises sample documents, and the determining comprises:
extracting attributes from the sample documents;
calculating a statistical concept-distance metric for the attributes;
selecting a first subset of the attributes that are more distinguishing with respect to the sample documents based on the calculated statistical concept-distance metric for the attributes and a first user-controllable input; and
selecting a second subset of the first attribute subset based on the given category and a relevance measure for attributes in the first attribute subset with respect to the given category, the relevance measure being affected by a second user-controllable input;
wherein the new features comprise attributes in the second attribute subset.
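The two-stage selection recited in this claim can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that the "statistical concept-distance metric" is information gain over attribute presence (a later dependent claim refers to calculating information gain), that the "relevance measure" is the share of documents containing an attribute that belong to the given category, and that both user-controllable inputs are plain thresholds. The function and parameter names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, attr):
    """Entropy reduction obtained by splitting documents on presence of attr.

    Assumption: this stands in for the claim's statistical
    concept-distance metric.
    """
    present = [lab for doc, lab in zip(docs, labels) if attr in doc]
    absent = [lab for doc, lab in zip(docs, labels) if attr not in doc]
    gain = entropy(labels)
    for part in (present, absent):
        if part:
            gain -= len(part) / len(labels) * entropy(part)
    return gain


def select_features(docs, labels, category, gain_threshold, relevance_threshold):
    """Return the second attribute subset for `category`.

    docs: list of sets of tokens (the extracted attributes per document).
    gain_threshold: hypothetical first user-controllable input.
    relevance_threshold: hypothetical second user-controllable input.
    """
    attributes = set().union(*docs)
    scores = {a: information_gain(docs, labels, a) for a in attributes}

    # First subset: attributes that distinguish well across the sample documents.
    first_subset = {a for a, g in scores.items() if g >= gain_threshold}

    # Second subset: of those, attributes relevant to the given category.
    second_subset = set()
    for a in first_subset:
        containing = [lab for doc, lab in zip(docs, labels) if a in doc]
        relevance = containing.count(category) / len(containing)
        if relevance >= relevance_threshold:
            second_subset.add(a)
    return second_subset
```

With four toy documents split evenly between a "finance" and a "sports" category, an attribute that appears in exactly the finance documents passes both stages, while one that appears only in sports documents passes the first stage but is dropped in the second.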
Abstract
The present disclosure includes systems and techniques relating to building and using hierarchical knowledge structures. In general, embodiments of the invention feature a computer program product and a method including receiving a first ontology including initial categories, an indication of sample data for a given category of the initial categories, and an indication of symbolic knowledge for the given category; and populating the first ontology with new features to form a second ontology, the populating comprising determining the new features from the sample data using a statistical machine learning process and retaining the new features and the symbolic knowledge within the second ontology in association with the given category. In another aspect, embodiments of the invention feature a knowledge management system including a hierarchical knowledge structure that categorizes information according to cognitive and semantic qualities within a knowledge domain.
36 Claims
-
1. A computerized method for building hierarchical knowledge structures comprising:
receiving a first ontology including initial categories, an indication of sample data for a given category of the initial categories, and an indication of symbolic knowledge for the given category; and
populating the first ontology with new features to form a second ontology for use in document categorization, the populating comprising determining the new features from the sample data using a statistical machine learning process and retaining the new features and the symbolic knowledge within the second ontology in association with the given category, wherein the sample data comprises sample documents, and the determining comprises:
extracting attributes from the sample documents;
calculating a statistical concept-distance metric for the attributes;
selecting a first subset of the attributes that are more distinguishing with respect to the sample documents based on the calculated statistical concept-distance metric for the attributes and a first user-controllable input; and
selecting a second subset of the first attribute subset based on the given category and a relevance measure for attributes in the first attribute subset with respect to the given category, the relevance measure being affected by a second user-controllable input;
wherein the new features comprise attributes in the second attribute subset.
Dependent claims: 2-15.
-
16. A computer program product, encoded on a computer-readable medium, operable to cause data processing apparatus to perform operations comprising:
receiving a first ontology including initial categories, an indication of sample data for a given category of the initial categories, and an indication of symbolic knowledge for the given category; and
populating the first ontology with new features to form a second ontology for use in document categorization, the populating comprising determining the new features from the sample data using a statistical machine learning process and retaining the new features and the symbolic knowledge within the second ontology in association with the given category, wherein the sample data comprises sample documents, and the determining comprises:
extracting attributes from the sample documents;
calculating a statistical concept-distance metric for the attributes;
selecting a first subset of the attributes that are more distinguishing with respect to the sample documents based on the calculated statistical concept-distance metric for the attributes and a first user-controllable input; and
selecting a second subset of the first attribute subset based on the given category and a relevance measure for attributes in the first attribute subset with respect to the given category, the relevance measure being affected by a second user-controllable input;
wherein the new features comprise attributes in the second attribute subset.
Dependent claims: 17, 24-36.
-
-
27. The computer program product of claim 24, wherein the determining comprises discretizing frequency values VA for the attribute A in the documents S based on a statistical variance of the frequency values VA; and wherein input for the calculating the information gain includes the discretized frequency values.
-
28. The computer program product of claim 27, wherein the discretizing comprises grouping the frequency values VA based on a maximum per-group variance determined according to a third user-controllable input.
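One way to picture grouping frequency values under a maximum per-group variance is the sketch below. It assumes a greedy pass over the sorted values that extends the current group only while its population variance stays at or below the limit. The claim does not specify a grouping procedure, so the greedy strategy and the names `discretize` and `max_group_variance` are illustrative assumptions.

```python
from statistics import pvariance


def discretize(values, max_group_variance):
    """Group frequency values so that no group exceeds the variance limit.

    max_group_variance: hypothetical stand-in for the third
    user-controllable input. Returns a list of groups in ascending order;
    a value's group index can serve as its discretized frequency.
    """
    groups = []
    for v in sorted(values):
        # Extend the current group only if it stays within the variance limit.
        if groups and pvariance(groups[-1] + [v]) <= max_group_variance:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups
```

A smaller limit yields more, tighter groups; a larger limit merges neighbouring values into fewer, coarser groups.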
-
29. The computer program product of claim 16, wherein the determining comprises determining a variance in the calculated statistical concept-distance metric for the attributes, and wherein the selecting the first subset comprises selecting the first subset based on the first user-controllable input combined with the determined variance in the calculated statistical concept-distance metric for the attributes.
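As a rough illustration of combining a user-controllable input with the variance of the metric, the sketch below sets the selection cutoff at the mean score plus a user-chosen multiple of the standard deviation. This particular formula is an assumption made for illustration only; the claim states just that the input is combined with the determined variance. The names `select_by_spread` and `user_factor` are hypothetical.

```python
from statistics import mean, pstdev


def select_by_spread(scores, user_factor):
    """Select attributes whose score reaches mean + user_factor * stddev.

    scores: mapping from attribute to its concept-distance metric value.
    user_factor: hypothetical form of the first user-controllable input.
    """
    values = list(scores.values())
    cutoff = mean(values) + user_factor * pstdev(values)
    return {attr for attr, score in scores.items() if score >= cutoff}
```

Tying the cutoff to the spread of the scores means the same user setting behaves comparably across categories whose metric values lie on different scales.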
-
30. The computer program product of claim 16, wherein the determining comprises determining a variance in frequency values for the attributes, and wherein the relevance measure is affected by the second user-controllable input combined with the determined variance in the frequency values for the attributes.
-
31. The computer program product of claim 16, wherein the indication of symbolic knowledge comprises a tag and a keyword, the tag indicating an existing symbolic ontology, and the populating comprising mining the existing symbolic ontology based on the keyword to obtain the symbolic knowledge.
-
32. The computer program product of claim 31, wherein the existing symbolic ontology comprises a public ontology of an online lexical reference system, and the mining comprises accessing the online lexical reference system over a network.
-
33. The computer program product of claim 31, wherein the indication of sample data comprises a second tag and references to sample documents, the second tag indicating the statistical machine learning process selected from multiple available statistical machine learning processes.
-
34. The computer program product of claim 17, wherein the query includes a balancing factor, and the combining comprises adjusting the contributions from the machine-learned new features and the symbolic knowledge based on the balancing factor.
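The balancing factor can be pictured as a weight that shifts a category score between its machine-learned and symbolic contributions. The sketch below assumes a linear interpolation with a factor in the range 0 to 1. The claim does not state how the adjustment is computed, so the formula and the names `combined_score` and `best_category` are hypothetical.

```python
def combined_score(statistical_score, symbolic_score, balance):
    """Blend the two contributions; balance=1.0 uses only the statistical one."""
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must lie between 0 and 1")
    return balance * statistical_score + (1.0 - balance) * symbolic_score


def best_category(scores_by_category, balance):
    """Pick the category with the highest blended score.

    scores_by_category: mapping from category name to a
    (statistical_score, symbolic_score) pair.
    """
    return max(
        scores_by_category,
        key=lambda c: combined_score(*scores_by_category[c], balance),
    )
```

For example, with scores of (0.8, 0.2) for one category and (0.3, 0.9) for another, a factor of 1.0 favours the first category and a factor of 0.0 favours the second.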
-
35. The computer program product of claim 17, wherein the query comprises a document, and the retrieving comprises identifying a category for the document.
-
36. The computer program product of claim 17, wherein the query comprises a search string, and the retrieving comprises:
identifying a document related to the search string; and
obtaining information associated with the identified document.
-
-
18. A computerized system comprising:
a knowledge management system including a hierarchical knowledge structure that categorizes information according to cognitive and semantic qualities within a knowledge domain, the hierarchical knowledge structure including discrete knowledge types included within a common information category of the hierarchical knowledge structure, the discrete knowledge types including knowledge represented explicitly through domain vocabulary words and relationships among the domain vocabulary words, and the discrete knowledge types including knowledge represented as designated sample data to be processed using statistical machine learning analysis, wherein the knowledge management system includes a computer program product operable to cause data processing apparatus to process the discrete knowledge types and to effect a programming interface used to access the hierarchical knowledge structure; and
a document handling system configured to use the programming interface to access and obtain information from the knowledge management system;
wherein the computer program product is operable to cause data processing apparatus to perform operations comprising:
extracting attributes from the sample data;
calculating a statistical concept-distance metric for the attributes;
selecting a first subset of the attributes that are more distinguishing with respect to the sample data based on the calculated statistical concept-distance metric for the attributes and a first user-controllable input;
selecting a second subset of the first attribute subset based on the given category and a relevance measure for attributes in the first attribute subset with respect to the given category, the relevance measure being affected by a second user-controllable input; and
augmenting the hierarchical knowledge with the second attribute subset.
Dependent claims: 19-23.
-
Specification