Adaptive and scalable method for resolving natural language ambiguities

US 7,475,010 B2
Filed: 09/02/2004
Issued: 01/06/2009
Est. Priority Date: 09/03/2003
Status: Active Grant

3 Assignments

0 Petitions

Abstract

A method for resolving ambiguities in natural language by organizing the task into multiple iterations of analysis done in successive levels of depth. The processing is adaptive to the users' need for accuracy and efficiency. At each level of processing the most accurate disambiguation is made based on the available information. As more analysis is done, additional knowledge is incorporated in a systematic manner to improve disambiguation accuracy. Associated with each level of processing is a measure of confidence, used to gauge the confidence of a process in its disambiguation accuracy. An overall confidence measure is also used to reflect the level of the analysis done.
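
The level-by-level processing described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the names `LevelResult`, `analyze`, `target_confidence`, and the two toy levels are hypothetical, and the stopping rule (stop once a level reports enough confidence) is only one plausible reading of "adaptive to the users' need for accuracy and efficiency".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class LevelResult:
    labels: List[str]   # one disambiguated label per token
    confidence: float   # this level's own confidence, in [0, 1]


# A level receives the tokens and the previous level's result (None at first)
# and returns a refined result. This signature is an assumption of the sketch.
Level = Callable[[List[str], Optional[LevelResult]], LevelResult]


def analyze(
    tokens: List[str],
    levels: List[Level],
    target_confidence: float,
    max_levels: Optional[int] = None,
) -> Tuple[Optional[LevelResult], int]:
    """Run successively deeper levels until the confidence target is met.

    Returns the last result and the number of levels actually run, which
    stands in here for the "level of the analysis done".
    """
    result: Optional[LevelResult] = None
    depth = 0
    for level in levels[:max_levels]:
        result = level(tokens, result)
        depth += 1
        if result.confidence >= target_confidence:
            break  # accurate enough for this caller; skip deeper analysis
    return result, depth


# Toy levels, purely illustrative.
def shallow(tokens: List[str], prev: Optional[LevelResult]) -> LevelResult:
    return LevelResult(["?"] * len(tokens), 0.6)


def deeper(tokens: List[str], prev: Optional[LevelResult]) -> LevelResult:
    return LevelResult(["ok"] * len(tokens), 0.9)


fast_result, fast_depth = analyze(["a", "b"], [shallow, deeper], 0.5)
careful_result, careful_depth = analyze(["a", "b"], [shallow, deeper], 0.8)
```

A caller who values speed passes a low `target_confidence` and stops after the shallow level; a caller who values accuracy passes a higher one and pays for the deeper level.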

506 Citations

30 Claims

1. A method for resolving natural language ambiguities within text documents on a computer system comprising a processor and memory that would cause the processor to perform the following, comprising the steps of:
- i. training probabilistic classifiers from annotated training data containing a sense tag for each polysemous word;
  
  ii. processing said text documents into tokens and determining their part-of-speech tags;
  
  iii. computing a measure of confidence using said probabilistic classifiers for each known sense of said tokens defined within a semantic lexicon based on contextual features and assigning a default sense for tokens absent from said semantic lexicon based on their part-of-speech tags;
  
  iv. determining assignment of word senses for each said token in said sentence such that the combined probability across said sentence is maximized; and
  
  v. integrating additional contextual features as generated by one or more of the following natural language processing modules into said probabilistic classifiers whereby said measure of confidence is improved;
  
  using a chunking module to identify multi-word phrases and the associated measure of confidence for each phrase;
  
  using a named-entity recognition module to identify named entities and the associated measure of confidence for each entity;
  
  using a syntactic parsing module to construct sentential parse trees and the associated measure of confidence for each tree;
  
  using an anaphora resolution module to identify anaphor references and the associated measure of confidence for each reference;
  
  using a discourse categorization module to determine document categories and the associated measure of confidence for each category;
  
  using a discourse structure analysis module to determine discourse structures and the associated measure of confidence for each structure.
- - 2. The method for resolving natural language ambiguities within text documents of claim 1, wherein said part-of-speech tag is generated by a part-of-speech tagger comprising the following steps:
    - training a probabilistic part-of-speech classifier using annotated training data containing a part-of-speech tag for each token;
      
      computing outcome probabilities using said probabilistic part-of-speech classifier for each token for each sentence in said text documents based on contextual features;
      
      determining assignment of part-of-speech tags for each said token in said sentence such that the combined probability across said sentence is maximized; and
      
      integrating additional contextual features as generated by one or more of the following natural language processing modules into said probabilistic classifiers whereby said measure of confidence is improved;
      
      using a word sense disambiguation module to determine word senses and the associated measure of confidence for each word;
      
      using a chunking module to identify multi-word phrases and the associated measure of confidence for each phrase;
      
      using a named-entity recognition module to identify named entities and the associated measure of confidence for each entity;
      
      using a syntactic parsing module to construct sentential parse trees and the associated measure of confidence for each tree;
      
      using an anaphora resolution module to identify anaphor references and the associated measure of confidence for each reference;
      
      using a discourse categorization module to determine document categories and the associated measure of confidence for each category;
      
      using a discourse structure analysis module to determine discourse structures and the associated measure of confidence for each structure.
  - 3. The method for resolving natural language ambiguities within text documents of claim 1, further comprising, identifying multi-word phrases of said text documents using a chunking module whereby additional contextual features are extracted.
  - 4. The method for identifying multi-word phrases of claim 3, comprising the following steps of:
    - training a probabilistic chunking classifier using annotated training data containing a chunk tag for each token;
      
      computing outcome probabilities using said probabilistic chunking classifier for each token for each sentence in said text documents based on contextual features;
      
      determining assignment of chunk tags for each said token in said sentence such that the combined probability across said sentence is maximized; and
      
      integrating additional contextual features as generated by one or more of the following natural language processing modules into said probabilistic classifiers whereby said measure of confidence is improved;
      
      using a word sense disambiguation module to determine word senses and the associated measure of confidence for each word;
      
      using a named-entity recognition module to identify named entities and the associated measure of confidence for each entity;
      
      using a syntactic parsing module to construct sentential parse trees and the associated measure of confidence for each tree;
      
      using an anaphora resolution module to identify anaphor references and the associated measure of confidence for each reference;
      
      using a discourse categorization module to determine document categories and the associated measure of confidence for each category;
      
      using a discourse structure analysis module to determine discourse structures and the associated measure of confidence for each structure.
  - 5. The method for resolving natural language ambiguities within text documents of claim 1, further comprising resolving named-entity ambiguities of said text documents using a named-entity recognition module whereby additional contextual features are extracted.
  - 6. The method for resolving named-entity ambiguities of claim 5, comprising the following steps of:
    - training a probabilistic named-entity classifier using annotated training data containing a named-entity tag for each token;
      
      computing outcome probabilities using said probabilistic named-entity classifier for each token for each sentence in said text documents based on contextual features;
      
      determining assignment of named-entity tags for each said token in said sentence such that the combined probability across said sentence is maximized; and
      
      integrating additional contextual features as generated by one or more of the following natural language processing modules into said probabilistic classifiers whereby said measure of confidence is improved;
      
      using a word sense disambiguation module to determine word senses and the associated measure of confidence for each word;
      
      using a chunking module to identify multi-word phrases and the associated measure of confidence for each phrase;
      
      using a syntactic parsing module to construct sentential parse trees and the associated measure of confidence for each tree;
      
      using an anaphora resolution module to identify anaphor references and the associated measure of confidence for each reference;
      
      using a discourse categorization module to determine document categories and the associated measure of confidence for each category;
      
      using a discourse structure analysis module to determine discourse structures and the associated measure of confidence for each structure.
  - 7. The method for resolving natural language ambiguities within text documents of claim 1, further comprising resolving structural ambiguities of said text documents using a syntactical parsing module whereby additional contextual features are extracted.
  - 8. The method for resolving structural ambiguities of claim 7, comprising the following steps of:
    - inducing a grammar and training a probabilistic parse tree scorer using training data containing parse tree annotations;
      
      scoring potential parse tree candidates acceptable by said grammar using said probabilistic parse tree scorer for each sentence in said text documents;
      
      determining a parse tree that spans the entire said sentence having the highest score computed by said probabilistic parse tree scorer; and
      
      integrating additional contextual features as generated by one or more of the following natural language processing modules into said probabilistic classifiers whereby said measure of confidence is improved;
      
      using a word sense disambiguation module to determine word senses and the associated measure of confidence for each word;
      
      using a chunking module to identify multi-word phrases and the associated measure of confidence for each phrase;
      
      using a named-entity recognition module to identify named entities and the associated measure of confidence for each entity;
      
      using an anaphora resolution module to identify anaphor references and the associated measure of confidence for each reference;
      
      using a discourse categorization module to determine document categories and the associated measure of confidence for each category;
      
      using a discourse structure analysis module to determine discourse structures and the associated measure of confidence for each structure.
  - 9. The method for resolving natural language ambiguities within text documents of claim 1, further comprising resolving anaphora references of said text documents using an anaphora resolution module whereby additional contextual features are extracted.
  - 10. The method for resolving anaphora references of claim 9, comprising the following steps of:
    - training a probabilistic anaphora-alignment classifier using training data containing anaphora to antecedent annotations;
      
      determining an anaphor to antecedent alignment for each anaphor in said text documents by maximizing the probability computed using said probabilistic anaphora-alignment classifier based on contextual features; and
      
      integrating additional contextual features as generated by one or more of the following natural language processing modules into said probabilistic classifiers whereby said measure of confidence is improved;
      
      using a word sense disambiguation module to determine word senses and the associated measure of confidence for each word;
      
      using a chunking module to identify multi-word phrases and the associated measure of confidence for each phrase;
      
      using a named-entity recognition module to identify named entities and the associated measure of confidence for each entity;
      
      using a syntactic parsing module to construct sentential parse trees and the associated measure of confidence for each tree;
      
      using a discourse categorization module to determine document categories and the associated measure of confidence for each category;
      
      using a discourse structure analysis module to determine discourse structures and the associated measure of confidence for each structure.
  - 11. The method for resolving natural language ambiguities within text documents of claim 1, further comprising determining discourse categories of said text documents using a discourse category analysis module whereby additional contextual features are extracted.
  - 12. The method for determining discourse categories of claim 11, comprising the following steps of:
    - training probabilistic discourse category classifiers using annotated training data containing discourse categories for each document;
      
      determining discourse categories of said text documents by maximizing the probability computed using said probabilistic discourse category classifiers based on contextual features; and
      
      integrating additional contextual features as generated by one or more of the following natural language processing modules into said probabilistic classifiers whereby said measure of confidence is improved;
      
      using a word sense disambiguation module to determine word senses and the associated measure of confidence for each word;
      
      using a chunking module to identify multi-word phrases and the associated measure of confidence for each phrase;
      
      using a named-entity recognition module to identify named entities and the associated measure of confidence for each entity;
      
      using a syntactic parsing module to construct sentential parse trees and the associated measure of confidence for each tree;
      
      using an anaphora resolution module to identify anaphor references and the associated measure of confidence for each reference;
      
      using a discourse structure analysis module to determine discourse structures and the associated measure of confidence for each structure.
  - 13. The method for resolving natural language ambiguities within text documents of claim 1, further comprising determining discourse structures of said text documents using a discourse structure analysis module whereby additional contextual features are extracted.
  - 14. The method for determining discourse structures of claim 13, comprising the following steps of:
    - creating discourse structure templates containing slots to be filled by discourse subjects;
      
      training probabilistic discourse structure classifiers for said templates and said slots using training data containing discourse structure annotations;
      
      determining a discourse structure template of said text document by maximizing the probability computed using said probabilistic discourse structure classifiers based on contextual features;
      
      filling slots of said discourse structure template by maximizing the probability computed using said probabilistic discourse structure classifiers based on contextual features; and
      
      integrating additional contextual features as generated by one or more of the following natural language processing modules into said probabilistic classifiers whereby said measure of confidence is improved;
      
      using a word sense disambiguation module to determine word senses and the associated measure of confidence for each word;
      
      using a chunking module to identify multi-word phrases and the associated measure of confidence for each phrase;
      
      using a named-entity recognition module to identify named entities and the associated measure of confidence for each entity;
      
      using a syntactic parsing module to construct sentential parse trees and the associated measure of confidence for each tree;
      
      using an anaphora resolution module to identify anaphor references and the associated measure of confidence for each reference;
      
      using a discourse categorization module to determine document categories and the associated measure of confidence for each category.
  - 15. The method for resolving natural language ambiguities within text documents of claim 1, wherein said semantic lexicon is organized as an ontology.
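
Steps iii and iv of claim 1 (per-sense confidence, a default sense for tokens missing from the lexicon, and a sentence-wide assignment that maximizes the combined probability) can be sketched as a Viterbi search. This is a hedged illustration only: the claims do not name Viterbi or any particular search, and the first-order dependence on the previous sense, the `DEFAULT_SENSE` table, the `sense#domain` naming, and `toy_prob` are all assumptions made for the example.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical default senses keyed by part-of-speech tag.
DEFAULT_SENSE: Dict[str, str] = {"NN": "entity", "VB": "act"}

# prob(position, previous_sense_or_None, sense) -> probability in [0, 1].
ProbFn = Callable[[int, Optional[str], str], float]


def _log(p: float) -> float:
    return math.log(p) if p > 0 else float("-inf")


def candidate_senses(token: str, pos: str, lexicon: Dict[str, List[str]]) -> List[str]:
    # Tokens absent from the lexicon get one default sense chosen by POS tag.
    return lexicon.get(token) or [DEFAULT_SENSE.get(pos, "unknown")]


def best_senses(
    tagged: List[Tuple[str, str]],
    lexicon: Dict[str, List[str]],
    prob: ProbFn,
) -> Tuple[List[str], float]:
    """Return the sense sequence with the highest combined probability."""
    if not tagged:
        return [], 1.0
    options = [candidate_senses(tok, pos, lexicon) for tok, pos in tagged]
    # score[s]: best log-probability of any sequence ending in sense s.
    score = {s: _log(prob(0, None, s)) for s in options[0]}
    back: List[Dict[str, str]] = []
    for i in range(1, len(options)):
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in options[i]:
            prev = max(score, key=lambda p: score[p] + _log(prob(i, p, s)))
            new_score[s] = score[prev] + _log(prob(i, prev, s))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final sense.
    last = max(score, key=score.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, math.exp(score[last])


# Toy classifier: senses sharing a domain suffix are more likely to co-occur.
def toy_prob(i: int, prev: Optional[str], sense: str) -> float:
    if "#" not in sense:          # default sense: treated as certain here
        return 1.0
    if prev is None:
        return 0.6 if sense.endswith("#finance") else 0.4
    return 0.8 if prev.split("#")[-1] == sense.split("#")[-1] else 0.2


lexicon = {
    "bank": ["bank#finance", "bank#river"],
    "deposit": ["deposit#finance", "deposit#river"],
}
sentence = [("bank", "NN"), ("deposit", "NN"), ("xyzzy", "NN")]
senses, combined = best_senses(sentence, lexicon, toy_prob)
```

Working in log space avoids underflow on long sentences. The returned `combined` value is the product of the local probabilities along the chosen path, which is one way to express a sentence-level measure of confidence.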

16. An apparatus for use in a natural language processing system for resolving natural language ambiguities within text documents, comprising:
- a trainer that trains probabilistic classifiers from annotated training data containing a sense tag for each polysemous word;
  
  a part-of-speech processor that processes said text documents into tokens and determines their part-of-speech tags;
  
  a classifier module that computes a measure of confidence using said probabilistic classifiers for each known sense of said tokens defined within a semantic lexicon based on contextual features and assigns a default sense for tokens absent from said semantic lexicon based on their part-of-speech tags;
  
  a word sense disambiguator that determines assignment of word senses for each said token in said sentence such that the combined probability across said sentence is maximized; and
  
  a context integrator that integrates additional contextual features as generated by one or more of the following natural language processing apparatuses into said probabilistic classifiers whereby said measure of confidence is improved;
  
  using a chunking apparatus that identifies multi-word phrases and the associated measure of confidence for each phrase;
  
  using a named-entity recognition apparatus that identifies named entities and the associated measure of confidence for each entity;
  
  using a syntactic parsing apparatus that constructs sentential parse trees and the associated measure of confidence for each tree;
  
  using an anaphora resolution apparatus that identifies anaphor references and the associated measure of confidence for each reference;
  
  using a discourse categorization apparatus that determines document categories and the associated measure of confidence for each category;
  
  using a discourse structure analysis apparatus that determines discourse structures and the associated measure of confidence for each structure.
- - 17. The apparatus for use in a natural language processing system of claim 16, wherein said part-of-speech processor comprises:
    - a probabilistic part-of-speech classifier that computes a probability for each token for each sentence in said text documents based on contextual features;
      
      a trainer that trains said probabilistic part-of-speech classifier using annotated training data containing a part-of-speech tag for each token;
      
      a part-of-speech assigner that determines assignment of part-of-speech tags for each said token in said sentence such that the combined probability across said sentence is maximized; and
      
      a context integrator that integrates additional contextual features as generated by one or more of the following natural language processing apparatuses into said probabilistic classifiers whereby said measure of confidence is improved;
      
      using a word sense disambiguation apparatus that identifies the word senses and the associated measure of confidence for each word;
      
      using a chunking apparatus that identifies multi-word phrases and the associated measure of confidence for each phrase;
      
      using a named-entity recognition apparatus that identifies named entities and the associated measure of confidence for each entity;
      
      using a syntactic parsing apparatus that constructs sentential parse trees and the associated measure of confidence for each tree;
      
      using an anaphora resolution apparatus that identifies anaphor references and the associated measure of confidence for each reference;
      
      using a discourse categorization apparatus that determines document categories and the associated measure of confidence for each category;
      
      using a discourse structure analysis apparatus that determines discourse structures and the associated measure of confidence for each structure.
  - 18. The apparatus for use in a natural language processing system of claim 16, further comprising a chunking apparatus that identifies multi-word phrases of said text documents whereby additional contextual features are extracted.
  - 19. The chunking apparatus for identifying multi-word phrases of claim 18, comprising:
    - a probabilistic chunking classifier that computes outcome probabilities for each token for each sentence in said text documents based on contextual features;
      
      a trainer that trains said probabilistic chunking classifier using annotated training data containing a chunk tag for each token;
      
      a chunk tag assigner that determines assignment of chunk tags for each said token in said sentence such that the combined probability across said sentence is maximized; and
      
      a context integrator that integrates additional contextual features as generated by one or more of the following natural language processing apparatuses into said probabilistic classifiers whereby said measure of confidence is improved;
      
      using a word sense disambiguation apparatus that identifies the word senses and the associated measure of confidence for each word;
      
      using a named-entity recognition apparatus that identifies named entities and the associated measure of confidence for each entity;
      
      using a syntactic parsing apparatus that constructs sentential parse trees and the associated measure of confidence for each tree;
      
      using an anaphora resolution apparatus that identifies anaphor references and the associated measure of confidence for each reference;
      
      using a discourse categorization apparatus that determines document categories and the associated measure of confidence for each category;
      
      using a discourse structure analysis apparatus that determines discourse structures and the associated measure of confidence for each structure.
  - 20. The apparatus for use in a natural language processing system of claim 16, further comprising a named-entity recognition apparatus that resolves named-entity ambiguities of said text documents whereby additional contextual features are extracted.
  - 21. The named-entity recognition apparatus for resolving named-entity ambiguities as recited in claim 20, comprising:
    - a probabilistic named-entity classifier that computes outcome probabilities for each token for each sentence in said text documents based on contextual features;
      
      a trainer that trains said probabilistic named-entity classifier using annotated training data containing a named-entity tag for each token;
      
      a named-entity assigner that determines assignment of named-entity tags for each said token in said sentence such that the combined probability across said sentence is maximized; and
      
      a context integrator that integrates additional contextual features as generated by one or more of the following natural language processing apparatuses into said probabilistic classifiers whereby said measure of confidence is improved;
      
      using a word sense disambiguation apparatus that identifies the word senses and the associated measure of confidence for each word;
      
      using a chunking apparatus that identifies multi-word phrases and the associated measure of confidence for each phrase;
      
      using a syntactic parsing apparatus that constructs sentential parse trees and the associated measure of confidence for each tree;
      
      using an anaphora resolution apparatus that identifies anaphor references and the associated measure of confidence for each reference;
      
      using a discourse categorization apparatus that determines document categories and the associated measure of confidence for each category;
      
      using a discourse structure analysis apparatus that determines discourse structures and the associated measure of confidence for each structure.
  - 22. The apparatus for use in a natural language processing system of claim 16, further comprising a syntactical parsing apparatus that resolves structural ambiguities of said text documents whereby additional contextual features are extracted.
  - 23. The syntactical parsing apparatus for resolving structural ambiguities of claim 22, comprising:
    - a probabilistic parse tree scorer that scores potential parse tree candidates acceptable by a grammar for each sentence in said text documents;
      
      a trainer that induces said grammar and trains said probabilistic parse tree scorer using training data containing parse tree annotations;
      
      a parse disambiguator that determines a parse tree that spans the entire said sentence having the highest score computed by said probabilistic parse tree scorer; and
      
      a context integrator that integrates additional contextual features as generated by one or more of the following natural language processing apparatuses into said probabilistic classifiers whereby said measure of confidence is improved;
      
      using a word sense disambiguation apparatus that identifies the word senses and the associated measure of confidence for each word;
      
      using a chunking apparatus that identifies multi-word phrases and the associated measure of confidence for each phrase;
      
      using a named-entity recognition apparatus that identifies named entities and the associated measure of confidence for each entity;
      
      using an anaphora resolution apparatus that identifies anaphor references and the associated measure of confidence for each reference;
      
      using a discourse categorization apparatus that determines document categories and the associated measure of confidence for each category;
      
      using a discourse structure analysis apparatus that determines discourse structures and the associated measure of confidence for each structure.
  - 24. The apparatus for use in a natural language processing system of claim 16, further comprising an anaphora resolution apparatus that resolves anaphora references of said text documents whereby additional contextual features are extracted.
  - 25. The apparatus for resolving anaphora references of claim 24, comprising:
    - a probabilistic anaphora-alignment classifier that determines an anaphor to antecedent alignment for each anaphor in said text documents by maximizing the probability computed based on contextual features;
      
      a trainer that trains said probabilistic anaphora-alignment classifier using training data containing anaphora to antecedent annotations; and
      
      a context integrator that integrates additional contextual features as generated by one or more of the following natural language processing apparatuses into said probabilistic classifiers whereby said measure of confidence is improved;
      
      using a word sense disambiguation apparatus that identifies the word senses and the associated measure of confidence for each word;
      
      using a chunking apparatus that identifies multi-word phrases and the associated measure of confidence for each phrase;
      
      using a named-entity recognition apparatus that identifies named entities and the associated measure of confidence for each entity;
      
      using a syntactic parsing apparatus that constructs sentential parse trees and the associated measure of confidence for each tree;
      
      using a discourse categorization apparatus that determines document categories and the associated measure of confidence for each category;
      
      using a discourse structure analysis apparatus that determines discourse structures and the associated measure of confidence for each structure.
  - 26. The apparatus for use in a natural language processing system of claim 16, further comprising a discourse category analysis apparatus that determines discourse categories of said text documents whereby additional contextual features are extracted.
  - 27. The apparatus for determining discourse categories of claim 26, comprising:
    - probabilistic discourse category classifiers that determine discourse categories of said text documents by maximizing the probability computed based on contextual features;
      
      a trainer that trains said probabilistic discourse category classifiers for each category using annotated training data containing discourse categories for each document; and
      
      a context integrator that integrates additional contextual features as generated by one or more of the following natural language processing apparatuses into said probabilistic classifiers whereby said measure of confidence is improved;
      
      using a word sense disambiguation apparatus that identifies the word senses and the associated measure of confidence for each word;
      
      using a chunking apparatus that identifies multi-word phrases and the associated measure of confidence for each phrase;
      
      using a named-entity recognition apparatus that identifies named entities and the associated measure of confidence for each entity;
      
      using a syntactic parsing apparatus that constructs sentential parse trees and the associated measure of confidence for each tree;
      
      using an anaphora resolution apparatus that identifies anaphor references and the associated measure of confidence for each reference;
      
      using a discourse structure analysis apparatus that determines discourse structures and the associated measure of confidence for each structure.
  - 28. The apparatus for use in a natural language processing system of claim 16, further comprising a discourse structure analysis apparatus that determines discourse structures of said text documents whereby additional contextual features are extracted.
  - 29. The apparatus for determining discourse structures of claim 28, comprising:
    - a repository for storing discourse structure templates containing slots to be filled by discourse subjects;
      
      probabilistic discourse structure classifiers that determine a discourse structure template of said text document by maximizing the probability computed based on contextual features;
      
      a trainer that trains said probabilistic discourse structure classifiers for said templates and said slots using training data containing discourse structure annotations;
      
      slot fillers that fill slots of said discourse structure template by maximizing the probability computed using said probabilistic discourse structure classifiers based on contextual features; and
      
      a context integrator that integrates additional contextual features as generated by one or more of the following natural language processing apparatuses into said probabilistic classifiers whereby said measure of confidence is improved;
      
      using a word sense disambiguation apparatus that identifies the word senses and the associated measure of confidence for each word;
      
      using a chunking apparatus that identifies multi-word phrases and the associated measure of confidence for each phrase;
      
      using a named-entity recognition apparatus that identifies named entities and the associated measure of confidence for each entity;
      
      using a syntactic parsing apparatus that constructs sentential parse trees and the associated measure of confidence for each tree;
      
      using an anaphora resolution apparatus that identifies anaphor references and the associated measure of confidence for each reference;
      
      using a discourse categorization apparatus that determines document categories and the associated measure of confidence for each category.
  - 30. The apparatus for use in a natural language processing system of claim 16, wherein said semantic lexicon is organized as an ontology.
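
The "context integrator" recited throughout the apparatus claims merges features produced by other modules, each with its own measure of confidence, into the feature set a probabilistic classifier sees. The sketch below is one hypothetical way to do that. The claims do not specify how confidence enters the features, so using the confidence as the feature value and dropping outputs below a `min_confidence` cutoff are assumptions of this example, as are the names `ModuleOutput` and `integrate`.

```python
from typing import Dict, List, NamedTuple


class ModuleOutput(NamedTuple):
    module: str        # producing module, e.g. "chunker" or "ner"
    feature: str       # feature it proposes, e.g. "chunk=NP"
    confidence: float  # that module's measure of confidence, in [0, 1]


def integrate(
    base_features: Dict[str, float],
    outputs: List[ModuleOutput],
    min_confidence: float = 0.0,
) -> Dict[str, float]:
    """Return a new feature map extended with other modules' outputs."""
    features = dict(base_features)  # copy so the caller's map is untouched
    for out in outputs:
        if out.confidence < min_confidence:
            continue  # too uncertain to be useful as context
        key = f"{out.module}:{out.feature}"
        # If a module reports the same feature twice, keep the stronger one.
        features[key] = max(features.get(key, 0.0), out.confidence)
    return features


base = {"word=bank": 1.0}
module_outputs = [
    ModuleOutput("ner", "type=ORG", 0.9),
    ModuleOutput("chunker", "chunk=NP", 0.3),
]
augmented = integrate(base, module_outputs, min_confidence=0.5)
```

Prefixing each key with the module name keeps features from different modules from colliding, so the same classifier can be retrained as further modules are added.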

Current Assignee: PRJ Holding Company LLC
Original Assignee: Lingospot, Inc. (Piksel Incorporated)
Inventors: Chao, Gerald CheShun
Primary Examiner(s): Hudspeth; David R
Assistant Examiner(s): Rider; Justin W
Application Number: US10/932,836
Publication Number: US 20050049852A1
Time in Patent Office: 1,587 Days
Field of Search: 704/7, 704/10, 715/255
US Class Current: 704/10
CPC Class Codes:
G06F 40/284 Lexical analysis, e.g. toke...
G06F 40/30 Semantic analysis
