Adaptive and scalable method for resolving natural language ambiguities
First Claim
1. A method for resolving natural language ambiguities within text documents on a computer system comprising a processor and memory that would cause the processor to perform the following, comprising the steps of:
i. training probabilistic classifiers from annotated training data containing a sense tag for each polysemous word;
ii. processing said text documents into tokens and determining their part-of-speech tags;
iii. computing a measure of confidence using said probabilistic classifiers for each known sense of said tokens defined within a semantic lexicon based on contextual features and assigning a default sense for tokens absent from said semantic lexicon based on their part-of-speech tags;
iv. determining assignment of word senses for each said token in said sentence such that the combined probability across said sentence is maximized; and
v. integrating additional contextual features as generated by one or more of the following natural language processing modules into said probabilistic classifiers whereby said measure of confidence is improved;
using a chunking module to identify multi-word phrases and the associated measure of confidence for each phrase;
using a named-entity recognition module to identify named entities and the associated measure of confidence for each entity;
using a syntactic parsing module to construct sentential parse trees and the associated measure of confidence for each tree;
using an anaphora resolution module to identify anaphor references and the associated measure of confidence for each reference;
using a discourse categorization module to determine document categories and the associated measure of confidence for each category;
using a discourse structure analysis module to determine discourse structures and the associated measure of confidence for each structure.
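Steps iii and iv of the claim can be illustrated with a small sketch. It is not the patented implementation: the toy lexicon, the default-sense table, the fixed score table standing in for the trained classifiers, and the names `sense_confidences` and `best_assignment` are all hypothetical. The sketch computes a normalized confidence for each known sense of a token, falls back to a default sense chosen by part-of-speech tag for tokens absent from the lexicon, and then searches for the sense assignment whose combined probability across the sentence is highest.

```python
import math
from itertools import product

# Hypothetical toy semantic lexicon: token -> known senses.
LEXICON = {
    "bank": ["bank/finance", "bank/river"],
    "deposit": ["deposit/money", "deposit/sediment"],
}

# Hypothetical default senses, keyed by part-of-speech tag.
DEFAULT_SENSE = {"NN": "entity", "VB": "act", "DT": "function-word"}

# Fixed positive scores standing in for the output of trained classifiers.
TOY_SCORES = {
    ("bank", "bank/finance"): 0.8,
    ("bank", "bank/river"): 0.2,
    ("deposit", "deposit/money"): 0.6,
    ("deposit", "deposit/sediment"): 0.4,
}


def toy_scorer(token, sense):
    """Assumed classifier stand-in: look up a positive score for (token, sense)."""
    return TOY_SCORES[(token, sense)]


def sense_confidences(token, pos, scorer):
    """Step iii: confidence per known sense, or a default sense chosen by POS tag."""
    senses = LEXICON.get(token)
    if senses is None:
        return {DEFAULT_SENSE.get(pos, "unknown"): 1.0}
    raw = {sense: scorer(token, sense) for sense in senses}
    total = sum(raw.values())
    return {sense: value / total for sense, value in raw.items()}


def best_assignment(tagged_tokens, scorer):
    """Step iv: choose one sense per token so the product of confidences is maximal."""
    tables = [sense_confidences(tok, pos, scorer) for tok, pos in tagged_tokens]
    best, best_logp = None, -math.inf
    # Exhaustive search over every combination of senses (exponential in length).
    for combo in product(*(table.items() for table in tables)):
        logp = sum(math.log(conf) for _, conf in combo)
        if logp > best_logp:
            best, best_logp = [sense for sense, _ in combo], logp
    return best, math.exp(best_logp)


tagged = [("the", "DT"), ("bank", "NN"), ("deposit", "NN")]
senses, prob = best_assignment(tagged, toy_scorer)
print(senses, round(prob, 2))
```

Because the toy scores for each token are independent of the other tokens, the exhaustive search here gives the same answer as picking the best sense per token. A search over whole assignments only becomes necessary when the confidence of one sense depends on the senses chosen for neighboring tokens, which this sketch does not model.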
Abstract
A method for resolving ambiguities in natural language by organizing the task into multiple iterations of analysis done in successive levels of depth. The processing is adaptive to the users' need for accuracy and efficiency. At each level of processing the most accurate disambiguation is made based on the available information. As more analysis is done, additional knowledge is incorporated in a systematic manner to improve disambiguation accuracy. Associated with each level of processing is a measure of confidence, used to gauge the confidence of a process in its disambiguation accuracy. An overall confidence measure is also used to reflect the level of the analysis done.
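The adaptive, level-by-level processing described in the abstract can be pictured as a loop that runs successively deeper analyses and stops once the caller's accuracy target is met. The following is a minimal sketch under stated assumptions: the function `analyze`, the three toy levels, and their fixed confidence values are hypothetical, and the abstract does not say how the overall confidence measure is computed, so the sketch simply reports the confidence of the last level run together with the depth reached.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A level takes the text and the result so far, and returns an updated
# result together with a confidence in [0, 1].
Level = Callable[[str, Dict], Tuple[Dict, float]]


def analyze(
    text: str,
    levels: List[Level],
    min_confidence: float,
    max_levels: Optional[int] = None,
) -> Tuple[Dict, float, int]:
    """Run levels in order; stop at the confidence target or the level budget."""
    result: Dict = {}
    confidence, depth = 0.0, 0
    for level in levels[:max_levels]:
        result, confidence = level(text, result)
        depth += 1
        if confidence >= min_confidence:
            break
    return result, confidence, depth


# Hypothetical levels with made-up confidences, for illustration only.
def pos_level(text, result):
    return {**result, "pos": "tagged"}, 0.60


def sense_level(text, result):
    return {**result, "senses": "assigned"}, 0.85


def parse_level(text, result):
    return {**result, "parse": "built"}, 0.95


levels = [pos_level, sense_level, parse_level]
result, confidence, depth = analyze("He sat on the bank.", levels, min_confidence=0.8)
print(depth, confidence, sorted(result))
```

A caller who values speed passes a low `min_confidence` or a small `max_levels`; a caller who values accuracy raises the threshold and lets more levels run.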
30 Claims
1. A method for resolving natural language ambiguities within text documents on a computer system comprising a processor and memory that would cause the processor to perform the following, comprising the steps of:
i. training probabilistic classifiers from annotated training data containing a sense tag for each polysemous word;
ii. processing said text documents into tokens and determining their part-of-speech tags;
iii. computing a measure of confidence using said probabilistic classifiers for each known sense of said tokens defined within a semantic lexicon based on contextual features and assigning a default sense for tokens absent from said semantic lexicon based on their part-of-speech tags;
iv. determining assignment of word senses for each said token in said sentence such that the combined probability across said sentence is maximized; and
v. integrating additional contextual features as generated by one or more of the following natural language processing modules into said probabilistic classifiers whereby said measure of confidence is improved;
using a chunking module to identify multi-word phrases and the associated measure of confidence for each phrase;
using a named-entity recognition module to identify named entities and the associated measure of confidence for each entity;
using a syntactic parsing module to construct sentential parse trees and the associated measure of confidence for each tree;
using an anaphora resolution module to identify anaphor references and the associated measure of confidence for each reference;
using a discourse categorization module to determine document categories and the associated measure of confidence for each category;
using a discourse structure analysis module to determine discourse structures and the associated measure of confidence for each structure.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
16. An apparatus for use in a natural language processing system for resolving natural language ambiguities within text documents, comprising:
a trainer that trains probabilistic classifiers from annotated training data containing a sense tag for each polysemous word;
a part-of-speech processor that processes said text documents into tokens and determines their part-of-speech tags;
a classifier module that computes a measure of confidence using said probabilistic classifiers for each known sense of said tokens defined within a semantic lexicon based on contextual features and assigns a default sense for tokens absent from said semantic lexicon based on their part-of-speech tags;
a word sense disambiguator that determines assignment of word senses for each said token in said sentence such that the combined probability across said sentence is maximized; and
a context integrator that integrates additional contextual features as generated by one or more of the following natural language processing apparatuses into said probabilistic classifiers whereby said measure of confidence is improved;
using a chunking apparatus that identifies multi-word phrases and the associated measure of confidence for each phrase;
using a named-entity recognition apparatus that identifies named entities and the associated measure of confidence for each entity;
using a syntactic parsing apparatus that constructs sentential parse trees and the associated measure of confidence for each tree;
using an anaphora resolution apparatus that identifies anaphor references and the associated measure of confidence for each reference;
using a discourse categorization apparatus that determines document categories and the associated measure of confidence for each category;
using a discourse structure analysis apparatus that determines discourse structures and the associated measure of confidence for each structure.
Dependent claims: 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
Specification