System and method for text categorization based on ontologies
First Claim
1. A system for text categorization based on ontologies, the system comprising:
a plurality of data collector software modules stored and operating on a plurality of network-attached computers;
a categorizer software module stored and operating on a network-attached server computer; and
a database server comprising an indexed database of documents and their categorizations, and further comprising a plurality of ontologies, each ontology comprising a plurality of hierarchical taxonomies and each hierarchical taxonomy comprising a plurality of taxons;
wherein the data collector software modules receive a text to be classified and submit the received text to the categorizer software module; and
further wherein the categorizer performs the following steps to categorize the received text:
splitting the text into sentences;
selecting words or phrases from the sentences of the received text that are present in one or more of the plurality of ontologies stored in the database server;
determining one or more specific subcategories that the sentence corresponds to in view of pattern analysis of the selected words or phrases;
selecting a plurality of subtrees from the plurality of ontologies based on the determination of the selected words or phrases belonging to one or more specific subcategories;
determining a weight for each subcategory within the one or more of the specific subcategories;
using the selected plurality of subtrees to create at least one modified subtree by eliminating from the selected plurality of subtrees subcategories having a category weight below a threshold;
for each of the at least one modified subtree, computing a conditionality coefficient to make a Boolean determination of whether to consider the respective modified subtree or not in categorization of the text; and
using any modified subtree that has been determined to be considered in categorization of the text to categorize the text;
wherein the conditionality coefficient is determined at least in part by user-defined rules or the presence, or absence, of diagnostics in nodes of at least one neighboring subtree.
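The pruning and gating steps that close the claim can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Node` class, the `prune`, `conditionality`, and `consider` functions, the rule format (lists of required and forbidden diagnostics), and the cutoff of 1.0 are all hypothetical, because the claim does not say how the coefficient is computed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One subcategory (taxon) in a subtree; all field names are hypothetical."""
    name: str
    weight: float = 0.0
    diagnostics: frozenset = frozenset()
    children: list = field(default_factory=list)


def prune(node, threshold):
    """Copy the subtree, dropping descendants whose weight is below threshold.

    Assumption: the subtree root is always kept, and dropping a node also
    drops everything beneath it.
    """
    kept = [prune(c, threshold) for c in node.children if c.weight >= threshold]
    return Node(node.name, node.weight, node.diagnostics, kept)


def collect_diagnostics(node):
    """Gather the diagnostics found anywhere in a subtree."""
    found = set(node.diagnostics)
    for child in node.children:
        found |= collect_diagnostics(child)
    return found


def conditionality(neighbors, required=(), forbidden=()):
    """Hypothetical coefficient: fraction of presence/absence rules satisfied
    by the diagnostics of the neighboring subtrees."""
    seen = set()
    for neighbor in neighbors:
        seen |= collect_diagnostics(neighbor)
    checks = [d in seen for d in required] + [d not in seen for d in forbidden]
    return sum(checks) / len(checks) if checks else 1.0


def consider(neighbors, required=(), forbidden=(), cutoff=1.0):
    """Boolean decision on whether a modified subtree takes part in categorization."""
    return conditionality(neighbors, required, forbidden) >= cutoff


# Illustrative data only; the categories and diagnostics are invented.
cardio = Node("cardiology", 0.9, children=[
    Node("arrhythmia", 0.7, frozenset({"ecg"})),
    Node("valve disease", 0.1),
])
pharma = Node("pharmacology", 0.8, frozenset({"dosage"}))

pruned = prune(cardio, threshold=0.5)
use_pruned = consider([pharma], required=["dosage"], forbidden=["veterinary"])
```

Here `pruned` keeps only the `arrhythmia` child, and `use_pruned` is true because the neighboring subtree carries the required diagnostic and lacks the forbidden one. Treating the coefficient as a fraction of satisfied rules is one possible reading; any scoring that ends in a yes/no decision would fit the claim language equally well.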
Abstract
A system for text categorization based on ontologies comprising data collector software modules; a categorizer software module; and a database comprising an indexed database of documents and their categorizations, and further comprising a plurality of ontologies, each ontology comprising a plurality of hierarchical taxonomies and each hierarchical taxonomy comprising a plurality of taxons. The data collector software modules receive documents to be classified and submit them to the categorizer software module; and the categorizer performs the following steps to categorize each document: splitting the document into sentences; selecting words or phrases that are present in ontologies stored in the database server; selecting a plurality of subtrees from the ontologies based on the presence of specific subcategories in the document; determining a weight for each subcategory; pruning subcategories having a weight below a threshold; and for each of the resulting modified subtrees, computing a conditionality coefficient.
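The first steps in the abstract (sentence splitting, term selection against the ontologies, and subcategory weighting) can be sketched as below. Everything named here is an assumption made for illustration: the nested-dictionary layout of `ONTOLOGIES` (ontology, then taxonomy, then taxon, then terms), the regular-expression sentence splitter, and the frequency-based weight, since the abstract does not state how weights are determined.

```python
import re
from collections import Counter

# Hypothetical ontology store: ontology -> taxonomy -> taxon -> terms.
ONTOLOGIES = {
    "medicine": {
        "cardiology": {
            "arrhythmia": ["atrial fibrillation", "tachycardia"],
            "valve disease": ["mitral stenosis"],
        },
    },
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_term_index(ontologies):
    """Map each lower-cased term to the (ontology, taxonomy, taxon) paths using it."""
    index = {}
    for ontology, taxonomies in ontologies.items():
        for taxonomy, taxons in taxonomies.items():
            for taxon, terms in taxons.items():
                for term in terms:
                    index.setdefault(term.lower(), []).append(
                        (ontology, taxonomy, taxon)
                    )
    return index


def select_terms(sentence, index):
    """Return the ontology terms that occur in the sentence as whole words."""
    lowered = sentence.lower()
    return {
        term: paths
        for term, paths in index.items()
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    }


def subcategory_weights(text, index):
    """Assumed weighting: each taxon's share of all term hits in the text."""
    counts = Counter()
    for sentence in split_sentences(text):
        for paths in select_terms(sentence, index).values():
            counts.update(paths)
    total = sum(counts.values())
    return {path: n / total for path, n in counts.items()} if total else {}


# Invented sample text for illustration.
SAMPLE = "The patient has atrial fibrillation. Tachycardia was noted! No mitral stenosis."
weights = subcategory_weights(SAMPLE, build_term_index(ONTOLOGIES))
```

With this sample, two of the three term hits fall under `arrhythmia` and one under `valve disease`, so a threshold of 0.5 would prune the latter. Note that the sketch matches terms literally and ignores negation ("No mitral stenosis"); the pattern analysis described in the claims would presumably handle such cases.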
2 Claims
2. A method for text categorization based on ontologies, the method comprising the steps of:
receiving, via a plurality of data collector software modules stored and operating on a plurality of network-attached computers, a text to be classified;
submitting the received text to a categorizer software module stored and operating on a network-attached server computer;
performing the following using the categorizer software module:
splitting the text into sentences;
selecting words or phrases from the sentences of the received text that are present in one or more of the plurality of ontologies stored in the database server;
determining one or more specific subcategories that the sentence corresponds to in view of pattern analysis of the selected words or phrases;
selecting a plurality of subtrees from the plurality of ontologies based on the determination of the selected words or phrases belonging to one or more specific subcategories;
determining a weight for each subcategory within the one or more of the specific subcategories;
using the selected plurality of subtrees to create at least one modified subtree by eliminating from the selected plurality of subtrees subcategories having a category weight below a threshold;
for each of the at least one modified subtree, computing a conditionality coefficient;
using the conditionality coefficient to make a Boolean determination of whether to consider the respective modified subtree or not in categorization of the text; and
using any modified subtree that has been determined to be considered to store a resulting document categorization in a database server comprising an indexed database of texts and each of their respective categorizations, and further comprising a plurality of ontologies, each ontology comprising a plurality of hierarchical taxonomies and each hierarchical taxonomy comprising a plurality of taxons;
wherein the conditionality coefficient is determined at least in part by user-defined rules or the presence, or absence, of diagnostics in nodes of at least one neighboring subtree.
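The final storing step of the method, writing the resulting categorization into an indexed database of texts, might look like the sketch below. It uses an in-memory SQLite database from the Python standard library as a stand-in for the database server; the table names, column names, index name, and the slash-separated category strings are all assumptions, not details taken from the claim.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the hypothetical schema: texts, their categorizations, and an index."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS texts (
            id   INTEGER PRIMARY KEY,
            body TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS categorizations (
            text_id  INTEGER NOT NULL REFERENCES texts(id),
            category TEXT NOT NULL,
            weight   REAL NOT NULL
        );
        CREATE INDEX IF NOT EXISTS idx_categorizations_category
            ON categorizations(category);
        """
    )
    return conn


def store_categorization(conn, body, categories):
    """Insert a text and its category weights in one transaction; return the text id."""
    with conn:
        cursor = conn.execute("INSERT INTO texts (body) VALUES (?)", (body,))
        text_id = cursor.lastrowid
        conn.executemany(
            "INSERT INTO categorizations (text_id, category, weight) VALUES (?, ?, ?)",
            [(text_id, category, weight) for category, weight in categories.items()],
        )
    return text_id


def texts_in_category(conn, category):
    """Look up stored texts by category, highest weight first."""
    rows = conn.execute(
        "SELECT t.body FROM texts AS t "
        "JOIN categorizations AS c ON c.text_id = t.id "
        "WHERE c.category = ? ORDER BY c.weight DESC",
        (category,),
    )
    return [row[0] for row in rows]


# Invented example record.
store = open_store()
doc_id = store_categorization(
    store,
    "The patient has atrial fibrillation.",
    {"medicine/cardiology/arrhythmia": 0.7},
)
```

Calling `texts_in_category(store, "medicine/cardiology/arrhythmia")` then returns the stored text. Indexing on the category column is one plausible interpretation of "indexed database"; a full-text index over the text bodies would be another.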
Specification