Extracting Tokens in a Natural Language Understanding Application

US 20080312905A1
Filed: 06/18/2007
Published: 12/18/2008
Est. Priority Date: 06/18/2007
Status: Active Grant

Abstract

A method of processing text within a natural language understanding system can include applying a first tokenization technique to a sentence using a statistical tokenization model. A second tokenization technique using a named entity can be applied to the sentence when the first tokenization technique does not extract a needed token according to a class of the sentence. A token determined according to at least one of the tokenization techniques can be output.

20 Claims

1. A method of processing text within a natural language understanding system, the method comprising:
- applying a first tokenization technique to a sentence using a statistical tokenization model;
- applying a second tokenization technique to the sentence using a named entity when the first tokenization technique does not extract a needed token according to a class of the sentence; and
- outputting a token determined according to at least one of the tokenization techniques.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The method of claim 1, further comprising first determining the class for the sentence using a statistical classification model.
  - 3. The method of claim 2, further comprising determining whether tokenization is needed according to the class of the sentence.
  - 4. The method of claim 3, further comprising identifying the needed token according to the class.
  - 5. The method of claim 1, wherein applying a first tokenization technique further comprises selecting the first statistical tokenization model to be a statistical tokenization model trained using sentences comprising a named entity having a low correlation with the needed token.
  - 6. The method of claim 1, further comprising:
    - determining whether at least one token is needed according to the class and a tokenization result of the first tokenization technique; and
    - when no further tokens are needed, discontinuing processing of the sentence and outputting at least one token determined for the sentence.
  - 7. The method of claim 1, further comprising applying a third tokenization technique to the sentence using a different statistical tokenization model when the second tokenization technique does not obtain the needed token according to the class of the sentence.
  - 8. The method of claim 1, wherein applying a third tokenization technique further comprises selecting the second statistical tokenization model to be a statistical tokenization model built using sentences that do not comprise a named entity.
  - 9. The method of claim 7, further comprising:
    - determining whether at least one token is needed according to the class and a tokenization result of the second tokenization technique; and
    - when no further tokens are needed, discontinuing processing of the sentence and outputting at least one token determined for the sentence.
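
The class-driven checks recited in claims 2 to 4, 6, and 9 (does this class need tokenization at all, which tokens does it need, and can processing stop) can be illustrated with a small sketch. Everything named here is an assumption made for illustration: the `NEEDED_TOKENS_BY_CLASS` table, the class names, and the three helper functions do not come from the patent, which does not specify how the class-to-token relationship is stored.

```python
from typing import Dict, List

# Hypothetical table (not from the patent): tokens each sentence class requires.
NEEDED_TOKENS_BY_CLASS: Dict[str, List[str]] = {
    "transfer_funds": ["amount", "account"],
    "greeting": [],  # a class for which no tokenization is needed
}


def tokenization_needed(sentence_class: str) -> bool:
    """Decide from the class alone whether any token must be extracted."""
    return bool(NEEDED_TOKENS_BY_CLASS.get(sentence_class, []))


def tokens_still_needed(sentence_class: str, extracted: Dict[str, str]) -> List[str]:
    """Compare the class requirements with the result of the previous stage."""
    needed = NEEDED_TOKENS_BY_CLASS.get(sentence_class, [])
    return [name for name in needed if name not in extracted]


def should_discontinue(sentence_class: str, extracted: Dict[str, str]) -> bool:
    """True when no further tokens are needed, so later stages can be skipped."""
    return not tokens_still_needed(sentence_class, extracted)
```

Under these assumptions, a caller would check `should_discontinue` after each tokenization technique and output whatever tokens have been collected as soon as it returns `True`.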

10. A method of processing text within a natural language understanding (NLU) system, the method comprising:
- determining a class for a sentence received by the NLU system at runtime;
- processing the sentence using a first statistical tokenization model;
- processing the sentence using a named entity when a token that is needed according to the class is not extracted using the first statistical tokenization model;
- processing the sentence using a second statistical tokenization model when a token that is needed according to the class is not extracted using the named entity; and
- outputting a token determined according to at least one of the first statistical tokenization model, the named entity, or the second statistical tokenization model.
- Dependent claims (11, 12):
  - 11. The method of claim 10, further comprising selecting the first statistical tokenization model to be a statistical tokenization model trained using sentences that comprise at least one named entity that has a low correlation with the token.
  - 12. The method of claim 10, further comprising selecting the second statistical tokenization model to be a statistical tokenization model trained using sentences that do not comprise a named entity.
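
The ordered, three-stage processing of claim 10 can be sketched as a simple cascade. This is a minimal illustration under stated assumptions, not the patented implementation: the two "statistical" stages are regular-expression stand-ins rather than trained models, the named entity is a hypothetical date pattern, the sentence class is passed in rather than determined by a classifier, and the names `NEEDED`, `STAGES`, and `run_cascade` are invented for this example.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical class-to-token requirements (not from the patent).
NEEDED: Dict[str, List[str]] = {"book_flight": ["city", "date"]}


def first_statistical_model(sentence: str) -> Dict[str, str]:
    """Stand-in for the first statistical tokenization model."""
    m = re.search(r"\bto ([A-Z][a-z]+)\b", sentence)
    return {"city": m.group(1)} if m else {}


def date_named_entity(sentence: str) -> Dict[str, str]:
    """Stand-in for a named entity, here a numeric date pattern."""
    m = re.search(r"\b\d{1,2}/\d{1,2}/\d{4}\b", sentence)
    return {"date": m.group(0)} if m else {}


def second_statistical_model(sentence: str) -> Dict[str, str]:
    """Stand-in for the second statistical tokenization model."""
    m = re.search(r"\b(today|tomorrow)\b", sentence)
    return {"date": m.group(1)} if m else {}


STAGES: List[Tuple[str, Callable[[str], Dict[str, str]]]] = [
    ("first_statistical_model", first_statistical_model),
    ("named_entity", date_named_entity),
    ("second_statistical_model", second_statistical_model),
]


def run_cascade(sentence: str, sentence_class: str) -> Tuple[Dict[str, str], List[str]]:
    """Run stages in order, moving on only while a needed token is missing."""
    needed = NEEDED.get(sentence_class, [])
    tokens: Dict[str, str] = {}
    stages_run: List[str] = []
    for name, stage in STAGES:
        missing = [t for t in needed if t not in tokens]
        if not missing:
            break  # every needed token was extracted; skip later stages
        stages_run.append(name)
        for key, value in stage(sentence).items():
            if key in missing:
                tokens[key] = value
    return tokens, stages_run
```

For a sentence such as "fly to Boston on 06/18/2007" the sketch stops after the named entity stage, while "fly to Boston tomorrow" falls through to the third stage because the date pattern finds nothing.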

13. A computer program product comprising:
- a computer-usable medium comprising computer-usable program code that processes text within a natural language understanding system, the computer-usable medium comprising:
  - computer-usable program code that applies a first tokenization technique to a sentence using a statistical tokenization model;
  - computer-usable program code that applies a second tokenization technique to the sentence using a named entity when the first tokenization technique does not extract a needed token according to a class of the sentence; and
  - computer-usable program code that outputs a token determined according to at least one of the tokenization techniques.
- Dependent claims (14, 15, 16, 17, 18, 19, 20):
  - 14. The computer program product of claim 13, wherein the computer-usable medium further comprises:
    - computer-usable program code that first determines the class for the sentence using a statistical classification model; and
    - computer-usable program code that determines whether tokenization is needed according to the class of the sentence.
  - 15. The computer program product of claim 14, wherein the computer-usable medium further comprises computer-usable program code that identifies the needed token according to the class.
  - 16. The computer program product of claim 13, wherein the computer-usable program code that applies a first tokenization technique further comprises computer-usable program code that selects the first statistical tokenization model to be a statistical tokenization model trained using sentences comprising a named entity having a low correlation with the needed token.
  - 17. The computer program product of claim 13, the computer-usable medium further comprising:
    - computer-usable program code that determines whether at least one token is needed according to the class and a tokenization result of the first tokenization technique; and
    - computer-usable program code that discontinues processing of the sentence and outputs at least one token determined for the sentence when no further tokens are needed.
  - 18. The computer program product of claim 13, wherein the computer-usable medium further comprises computer-usable program code that applies a third tokenization technique to the sentence using a different statistical tokenization model when the second tokenization technique does not obtain the needed token according to the class of the sentence.
  - 19. The computer program product of claim 13, wherein the computer-usable program code that applies a third tokenization technique further comprises computer-usable program code that selects the second statistical tokenization model to be a statistical tokenization model built using sentences that do not comprise a named entity.
  - 20. The computer program product of claim 18, wherein the computer-usable medium further comprises:
    - computer-usable program code that determines whether at least one token is needed according to the class and a tokenization result of the second tokenization technique; and
    - computer-usable program code that discontinues processing of the sentence and outputs at least one token determined for the sentence when no further tokens are needed.
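
Claims 16 and 19 describe which sentences each statistical model is trained on: sentences containing a named entity that has a low correlation with the needed token, and sentences containing no named entity. One possible way to split a training corpus along those lines is sketched below. The claims do not define how correlation is measured, so the measure used here (the share of entity matches that coincide with the annotated token value), the `0.5` threshold, the bare-number entity, and all identifiers are assumptions made purely for illustration.

```python
import re
from typing import Dict, List, NamedTuple


class Example(NamedTuple):
    text: str
    token_value: str  # annotated value of the needed token ("" if none)


# Hypothetical named entity: a bare number.
NUMBER_ENTITY = re.compile(r"\b\d+\b")


def entity_token_correlation(examples: List[Example]) -> float:
    """Assumed measure: fraction of entity matches equal to the annotated token."""
    matches = 0
    hits = 0
    for ex in examples:
        for m in NUMBER_ENTITY.finditer(ex.text):
            matches += 1
            if m.group(0) == ex.token_value:
                hits += 1
    return hits / matches if matches else 0.0


def partition(examples: List[Example], threshold: float = 0.5) -> Dict[str, List[Example]]:
    """Split a corpus into training sets for the first and second models."""
    with_entity = [ex for ex in examples if NUMBER_ENTITY.search(ex.text)]
    without_entity = [ex for ex in examples if not NUMBER_ENTITY.search(ex.text)]
    low_correlation = entity_token_correlation(with_entity) < threshold
    return {
        # Sentences whose named entity correlates poorly with the token.
        "first_model": with_entity if low_correlation else [],
        # Sentences that contain no named entity at all.
        "second_model": without_entity,
    }


corpus = [
    Example("book 2 seats on flight 815", "815"),
    Example("I need 3 tickets for flight 42 on day 7", "42"),
    Example("cancel my flight please", ""),
]
split = partition(corpus)
```

In this toy corpus only two of the five number matches are the flight number, so the number entity counts as weakly correlated and both number-bearing sentences go to the first model's training set.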

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Balchandran, Rajesh; Boyer, Linda M.; Purdy, Gregory

Granted Patent: US 8,285,539 B2
Field of Search
US Class Current: 704/9
CPC Class Codes:
- G06F 40/284 Lexical analysis, e.g. toke...
- G06F 40/295 Named entity recognition
