Extracting tokens in a natural language understanding application

US 8,285,539 B2
Filed: 06/18/2007
Issued: 10/09/2012
Est. Priority Date: 06/18/2007
Status: Active Grant

Abstract

A method of processing text within a natural language understanding system can include applying a first tokenization technique to a sentence using a statistical tokenization model. A second tokenization technique using a named entity can be applied to the sentence when the first tokenization technique does not extract a needed token according to a class of the sentence. A token determined according to at least one of the tokenization techniques can be output.
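
The fallback described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: the two tokenizers are regex stand-ins, and the names `statistical_stub`, `named_entity_stub`, `extract`, and the `AMOUNT` token are all hypothetical.

```python
import re
from typing import Callable, Dict

Tokens = Dict[str, str]
Tokenizer = Callable[[str], Tokens]


def statistical_stub(sentence: str) -> Tokens:
    """Stand-in for a trained statistical tokenization model (hypothetical).

    It only recognizes amounts written with a dollar sign.
    """
    match = re.search(r"\$(\d+(?:\.\d+)?)", sentence)
    return {"AMOUNT": match.group(1)} if match else {}


def named_entity_stub(sentence: str) -> Tokens:
    """Stand-in for a named-entity pass (hypothetical): a number followed by 'dollars'."""
    match = re.search(r"(\d+(?:\.\d+)?)\s+dollars", sentence)
    return {"AMOUNT": match.group(1)} if match else {}


def extract(sentence: str, needed: str, first: Tokenizer, second: Tokenizer) -> Tokens:
    """Apply `first`; apply `second` only if the needed token is still missing."""
    tokens = first(sentence)
    if needed not in tokens:
        # The second technique runs only when the first one came up short.
        tokens = {**tokens, **second(sentence)}
    return tokens
```

With these stubs, `extract("transfer $500 to savings", "AMOUNT", statistical_stub, named_entity_stub)` is resolved by the first pass alone, while `"transfer 500 dollars to savings"` falls through to the named-entity pass.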

20 Claims

1. A method of processing text within a natural language understanding system, the method comprising:
- via a processor, applying a first tokenization technique to a sentence using a statistical tokenization model;
  
  via the processor, applying a second subsequent tokenization technique to the sentence using a named entity only when the first tokenization technique does not extract a needed token according to a class of the sentence; and
  
  via the processor, outputting a token determined according to at least one of the tokenization techniques.
- - 2. The method of claim 1, further comprising, via the processor, first determining the class for the sentence using a statistical classification model.
  - 3. The method of claim 2, further comprising, via the processor, determining whether tokenization is needed according to the class of the sentence.
  - 4. The method of claim 3, further comprising, via the processor, identifying the needed token according to the class.
  - 5. The method of claim 1, wherein applying a first tokenization technique further comprises selecting the first statistical tokenization model to be a statistical tokenization model trained using sentences comprising a named entity having a low correlation with the needed token.
  - 6. The method of claim 1, further comprising:
    - via the processor, determining whether at least one token is needed according to the class and a tokenization result of the first tokenization technique; and
      
      when no further tokens are needed, via the processor, discontinuing processing of the sentence and outputting at least one token determined for the sentence.
  - 7. The method of claim 1, further comprising, via the processor, applying a third tokenization technique to the sentence using a different statistical tokenization model when the second subsequent tokenization technique does not obtain the needed token according to the class of the sentence.
  - 8. The method of claim 1, wherein applying a third tokenization technique further comprises selecting the second subsequent statistical tokenization model to be a statistical tokenization model built using sentences that do not comprise a named entity.
  - 9. The method of claim 7, further comprising:
    - via the processor, determining whether at least one token is needed according to the class and a tokenization result of the second subsequent tokenization technique; and
      
      when no further tokens are needed, via the processor, discontinuing processing of the sentence and outputting at least one token determined for the sentence.
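
Claims 2 through 4 and 6 tie tokenization to the class of the sentence: the class determines whether any token is needed, which tokens those are, and when processing can stop. The sketch below illustrates that bookkeeping only. The class names, the `NEEDED_BY_CLASS` table, and the keyword-based `classify` function are assumptions made for illustration; the claims call for a statistical classification model, which is not reproduced here.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical mapping from sentence class to the token names that class requires.
NEEDED_BY_CLASS: Dict[str, FrozenSet[str]] = {
    "TRANSFER_FUNDS": frozenset({"AMOUNT", "ACCOUNT"}),
    "CHECK_BALANCE": frozenset({"ACCOUNT"}),
    "GREETING": frozenset(),
}


def classify(sentence: str) -> str:
    """Keyword stand-in for a statistical classification model (an assumption)."""
    text = sentence.lower()
    if "transfer" in text:
        return "TRANSFER_FUNDS"
    if "balance" in text:
        return "CHECK_BALANCE"
    return "GREETING"


def tokenization_needed(sentence_class: str) -> bool:
    """True when the class requires at least one token."""
    return bool(NEEDED_BY_CLASS.get(sentence_class, frozenset()))


def still_needed(sentence_class: str, extracted: Dict[str, str]) -> Set[str]:
    """Token names the class requires that have not been extracted yet."""
    return set(NEEDED_BY_CLASS.get(sentence_class, frozenset())) - set(extracted)
```

An empty result from `still_needed` corresponds to the "no further tokens are needed" condition, at which point processing of the sentence can be discontinued and the extracted tokens output.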

10. A method of processing text within a natural language understanding (NLU) system, the method comprising:
- via a processor, determining a class for a sentence received by the NLU system at runtime;
  
  via the processor, processing the sentence using a first statistical tokenization model;
  
  via the processor, processing the sentence using a named entity when a token that is needed according to the class is not extracted using the first statistical tokenization model;
  
  via the processor, processing the sentence using a second subsequent statistical tokenization model only when a token that is needed according to the class is not extracted using the named entity; and
  
  via the processor, outputting a token determined according to at least one of the first statistical tokenization model, the named entity, or the second subsequent statistical tokenization model.
- - 11. The method of claim 10, further comprising, via the processor, selecting the first statistical tokenization model to be a statistical tokenization model trained using sentences that comprise at least one named entity that has a low correlation with the token.
  - 12. The method of claim 10, further comprising, via the processor, selecting the second subsequent statistical tokenization model to be a statistical tokenization model trained using sentences that do not comprise a named entity.
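
The ordering in claim 10 (first statistical model, then named entity, then second statistical model, each later stage running only when a needed token is still missing) can be sketched as an ordered cascade. Everything named here is hypothetical: `run_cascade`, the stage labels, and the regex stubs stand in for trained models and only illustrate the control flow.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

Tokens = Dict[str, str]
Stage = Tuple[str, Callable[[str], Tokens]]


def _regex_stage(token_name: str, pattern: str) -> Callable[[str], Tokens]:
    """Build a stub stage that emits one token when `pattern` matches."""
    def stage(sentence: str) -> Tokens:
        match = re.search(pattern, sentence)
        return {token_name: match.group(1)} if match else {}
    return stage


# Hypothetical stand-ins for the three stages named in claim 10.
STAGES: List[Stage] = [
    ("first_statistical_model", _regex_stage("AMOUNT", r"\$(\d+)")),
    ("named_entity", _regex_stage("ACCOUNT", r"\b(savings|checking)\b")),
    ("second_statistical_model", _regex_stage("AMOUNT", r"(\d+)\s+dollars")),
]


def run_cascade(
    sentence: str, needed: Set[str], stages: List[Stage]
) -> Dict[str, Tuple[str, str]]:
    """Run stages in order and stop once every needed token is found.

    Returns token name -> (value, name of the stage that produced it).
    """
    found: Dict[str, Tuple[str, str]] = {}
    for stage_name, stage in stages:
        missing = needed - set(found)
        if not missing:
            break  # nothing left to extract, so later stages are skipped
        for name, value in stage(sentence).items():
            if name in missing:
                found[name] = (value, stage_name)
    return found
```

Recording the producing stage alongside each value is a design choice of this sketch, useful for checking which sentences actually reach the later stages; the claims only require that a token be output.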

13. A computer program product comprising:
- a computer-readable storage comprising computer-usable program code stored thereon that processes text within a natural language understanding system, the computer-readable storage comprising:
  
  computer-usable program code that applies a first tokenization technique to a sentence using a statistical tokenization model;
  
  computer-usable program code that applies a second subsequent tokenization technique to the sentence using a named entity only when the first tokenization technique does not extract a needed token according to a class of the sentence; and
  
  computer-usable program code that outputs a token determined according to at least one of the tokenization techniques.
- - 14. The computer program product of claim 13, wherein the computer-readable storage further comprises:
    - computer-usable program code that first determines the class for the sentence using a statistical classification model; and
      
      computer-usable program code that determines whether tokenization is needed according to the class of the sentence.
  - 15. The computer program product of claim 14, wherein the computer-readable storage further comprises computer-usable program code that identifies the needed token according to the class.
  - 16. The computer program product of claim 13, wherein the computer-usable program code that applies a first tokenization technique further comprises computer-usable program code that selects the first statistical tokenization model to be a statistical tokenization model trained using sentences comprising a named entity having a low correlation with the needed token.
  - 17. The computer program product of claim 13, the computer-readable storage further comprising:
    - computer-usable program code that determines whether at least one token is needed according to the class and a tokenization result of the first tokenization technique; and
      
      computer-usable program code that discontinues processing of the sentence and outputs at least one token determined for the sentence when no further tokens are needed.
  - 18. The computer program product of claim 13, wherein the computer-readable storage further comprises computer-usable program code that applies a third tokenization technique to the sentence using a different statistical tokenization model when the second subsequent tokenization technique does not obtain the needed token according to the class of the sentence.
  - 19. The computer program product of claim 13, wherein the computer-usable program code that applies a third tokenization technique further comprises computer-usable program code that selects the second subsequent statistical tokenization model to be a statistical tokenization model built using sentences that do not comprise a named entity.
  - 20. The computer program product of claim 18, wherein the computer-readable storage further comprises:
    - computer-usable program code that determines whether at least one token is needed according to the class and a tokenization result of the second subsequent tokenization technique; and
      
      computer-usable program code that discontinues processing of the sentence and outputs at least one token determined for the sentence when no further tokens are needed.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Balchandran, Rajesh; Boyer, Linda M.; Purdy, Gregory
Primary Examiner(s): COLUCCI, MICHAEL C
Application Number: US11/764,285
Publication Number: US 20080312905A1
Time in Patent Office: 1,940 Days
Field of Search: 704/1, 704/9, 704/257, 704/10, 704/3, 709/232, 715/255, 718/100
US Class Current: 704/9
CPC Class Codes:
G06F 40/284 Lexical analysis, e.g. toke...
G06F 40/295 Named entity recognition
