Hybrid classifier for assigning natural language processing (NLP) inputs to domains in real-time
First Claim
1. A method for classifying an input query within multiple domains for natural language processing, the method comprising generating trigram corpora, each trigram corpus for a domain, comprising:
defining a query model for the domain comprising at least an ontology having a set of semantic tokens organized in hierarchical levels, and a set of domain-specific semantic constructions each including one or more of the semantic tokens linked by one of a set of predefined grammatical relations;
obtaining an expanded set of semantic constructions as a semantic corpus for the domain by replacing at least one of the semantic tokens of at least one of the semantic constructions of the query model with corresponding semantic tokens at a lower hierarchical level in the ontology of the query model;
performing a trigram analysis on the semantic corpus to obtain the trigram corpus for the domain, the trigram corpus comprising entries each corresponding to a trigram, that is, a three-token sequence, appearing in the semantic corpus, each entry comprising:
a three-token sequence having a first, second and third semantic token; and
a corresponding trigram probability representing a relative probability that the third semantic token appears in the semantic corpus given the first and the second semantic tokens;
obtaining an input query text from a remote device via a network connection;
determining normalized relevance scores for the input query text corresponding to the multiple domains based on the input query text, query models of the multiple domains and the trigram corpora of the multiple domains;
ordering the normalized relevance scores for the input query text with respect to the multiple domains;
classifying the input query text according to the ordering; and
transmitting, to the remote device, a communication comprising the input query text based upon the classifying.
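The corpus-generation steps recited above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the toy ontology (`ONTOLOGY`), the constructions (`CONSTRUCTIONS`), and the function names are hypothetical; a semantic construction is reduced to a plain tuple of tokens, with grammatical relations omitted; sequences are padded with assumed `<s>` and `</s>` markers; and the trigram probability is estimated as count(t1, t2, t3) / count(t1, t2), an estimator the claim does not prescribe.

```python
from collections import Counter

# Hypothetical toy ontology: parent token -> tokens one hierarchical level lower.
ONTOLOGY = {
    "MEDIA": ["SONG", "MOVIE"],
    "ACTION": ["PLAY", "PAUSE"],
}

# Hypothetical domain-specific semantic constructions, as token sequences.
CONSTRUCTIONS = [("ACTION", "MEDIA"), ("ACTION", "MEDIA", "BY", "ARTIST")]


def expand(constructions, ontology):
    """Return the originals plus every variant obtained by repeatedly
    replacing a token with one of its lower-level tokens.
    Assumes the ontology is acyclic, so the loop terminates."""
    corpus = set(constructions)
    frontier = list(constructions)
    while frontier:
        seq = frontier.pop()
        for i, tok in enumerate(seq):
            for child in ontology.get(tok, []):
                variant = seq[:i] + (child,) + seq[i + 1:]
                if variant not in corpus:
                    corpus.add(variant)
                    frontier.append(variant)
    return sorted(corpus)


def trigram_corpus(semantic_corpus):
    """Map each trigram (t1, t2, t3) to count(t1, t2, t3) / count(t1, t2)."""
    tri, bi = Counter(), Counter()
    for seq in semantic_corpus:
        padded = ("<s>", "<s>") + tuple(seq) + ("</s>",)
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1  # context count for the first two tokens
    return {key: n / bi[key[:2]] for key, n in tri.items()}


semantic_corpus = expand(CONSTRUCTIONS, ONTOLOGY)
probabilities = trigram_corpus(semantic_corpus)
```

Because the corpus is derived entirely from the query model, it can be regenerated whenever the ontology or the constructions change, without any annotated text.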
Abstract
Systems and methods for domain classification in domain-based natural language processing are disclosed. The method includes generating a trigram corpus for classification, based on a trigram analysis of a domain model that contains a hierarchical ontology and semantic constructions that map patterns of semantic tokens to syntactic patterns. An input string is parsed and tokenized within each domain. The resulting trigrams for the input text in each domain are looked up in the corresponding trigram corpus to determine the relevancy of each domain to the input text. The input string is then classified based on the relevancy determination. The systems and methods avoid reliance on existing annotated domain corpora for classification and allow fast regeneration of the classifier when domain models are under frequent update and development.
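The lookup-and-classify flow described in the abstract can be sketched as follows. Everything here is an assumption made for illustration: the per-domain lexicons and trigram probabilities in `DOMAINS` are invented toy data, `tokenize` is a stand-in for the per-domain parsing step, the raw score is a geometric mean of trigram probabilities with an assumed floor (`FLOOR`) for unseen trigrams, and scores are normalized by dividing by their sum across domains. The source does not specify any of these choices.

```python
import math

FLOOR = 1e-6  # assumed probability for trigrams absent from a domain's corpus

# Hypothetical per-domain lexicons (word -> semantic token) and trigram corpora.
DOMAINS = {
    "music": {
        "lexicon": {"play": "PLAY", "song": "SONG"},
        "trigrams": {
            ("<s>", "<s>", "PLAY"): 0.5,
            ("<s>", "PLAY", "SONG"): 0.5,
            ("PLAY", "SONG", "</s>"): 1.0,
        },
    },
    "weather": {
        "lexicon": {"forecast": "FORECAST", "today": "DATE"},
        "trigrams": {
            ("<s>", "<s>", "FORECAST"): 1.0,
            ("<s>", "FORECAST", "DATE"): 0.5,
            ("FORECAST", "DATE", "</s>"): 1.0,
        },
    },
}


def tokenize(text, lexicon):
    """Map each word to a domain semantic token, or UNK if the domain lacks it."""
    return [lexicon.get(word, "UNK") for word in text.lower().split()]


def raw_score(tokens, trigrams):
    """Geometric mean of the looked-up trigram probabilities."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    logs = [
        math.log(trigrams.get(tri, FLOOR))
        for tri in zip(padded, padded[1:], padded[2:])
    ]
    return math.exp(sum(logs) / len(logs))


def classify(text, domains):
    """Return (domain, normalized score) pairs, most relevant first."""
    raw = {
        name: raw_score(tokenize(text, d["lexicon"]), d["trigrams"])
        for name, d in domains.items()
    }
    total = sum(raw.values())
    normalized = {name: score / total for name, score in raw.items()}
    return sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)


ranking = classify("play song", DOMAINS)
best_domain = ranking[0][0]
```

The geometric mean keeps scores comparable across inputs of different lengths; other length normalizations would fit the same structure.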
20 Claims
1. Independent method claim, set out in full under First Claim above. Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14.
15. A domain classification system for domain based natural language processing, comprising:
memory storing:
query models each for a domain of a set of domains and comprising at least an ontology having a set of semantic tokens organized in hierarchical levels, and a set of domain-specific semantic constructions each including one or more of the semantic tokens linked by one of a set of predefined grammatical relations;
trigram corpora, each corpus for a domain of the set of domains and each having entries comprising:
trigrams obtained based on the query model of the domain, each trigram having a first, second and third semantic token; and
corresponding trigram probabilities each representing a relative probability that the third semantic token appears in the trigrams given the first and the second semantic tokens; and
computer executable instructions; and
a processor in communication with the memory, and when executing the instructions, configured to implement software components comprising:
a receiving component configured to receive an input text from a remote device via a network connection;
a set of domain relevance analyzer components corresponding to the set of domains configured to determine relevance scores for the input text with respect to the set of domains based on tokenized sequences of the input text corresponding to the set of domains, the query models, and the trigram corpora;
a classifier component configured to classify the input text among the set of domains based on the relevance scores with respect to the set of domains for the input text; and
a component configured to transmit, to the remote device, a communication comprising the input text based upon the classifying.
Dependent claims: 16, 17.
18. A computer readable medium storing computer executable instructions that when executed by a processor cause the processor to perform a method for classifying an input query within multiple domains for natural language processing, the method comprising generating trigram corpora, each trigram corpus for a domain, comprising:
defining a query model for the domain comprising at least an ontology having a set of semantic tokens organized in hierarchical levels, and a set of domain-specific semantic constructions each including one or more of the semantic tokens linked by one of a set of predefined grammatical relations;
obtaining an expanded set of semantic constructions as a semantic corpus for the domain by replacing at least one of the semantic tokens of at least one of the semantic constructions of the query model with corresponding semantic tokens at a lower hierarchical level in the ontology of the query model;
performing a trigram analysis on the semantic corpus to obtain the trigram corpus for the domain, the trigram corpus comprising entries each corresponding to a trigram, that is, a three-token sequence, appearing in the semantic corpus, each entry comprising:
a three-token sequence having a first, second and third semantic token; and
a corresponding trigram probability representing a relative probability associated with the third semantic token appearing in the semantic corpus given the first and the second semantic tokens;
obtaining an input query text from a remote device via a network connection;
determining normalized relevance scores for the input query text corresponding to the multiple domains based on the input query text, query models of the multiple domains and the trigram corpora of the multiple domains;
ordering the normalized relevance scores for the input query text with respect to the multiple domains;
classifying the input query text according to the ordering; and
transmitting, to the remote device, a communication comprising the input query text based upon the classifying.
Dependent claims: 19, 20.
Specification