Intelligent system that dynamically improves its knowledge and code-base for natural language understanding

US 9,965,458 B2
Filed: 12/09/2015
Issued: 05/08/2018
Est. Priority Date: 12/09/2014
Status: Active Grant

Abstract

Systems, methods, and apparatuses are presented for a novel natural language tokenizer and tagger. In some embodiments, a method for tokenizing text for natural language processing comprises: generating from a pool of documents, a set of statistical models comprising one or more entries each indicating a likelihood of appearance of a character/letter sequence in the pool of documents; receiving a set of rules comprising rules that identify character/letter sequences as valid tokens; transforming one or more entries in the statistical models into new rules that are added to the set of rules when the entries indicate a high likelihood; receiving a document to be processed; dividing the document to be processed into tokens based on the set of statistical models and the set of rules, wherein the statistical models are applied where the rules fail to unambiguously tokenize the document; and outputting the divided tokens for natural language processing.
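
The flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that a "statistical model" is a table of relative frequencies of whitespace-delimited sequences, that a "rule" is membership in a set of valid tokens, and that a "high likelihood" is any value at or above a chosen threshold. The function names (`build_model`, `promote`, `best_split`, `tokenize`), the 0.2 threshold, and the toy document pool are all hypothetical.

```python
from collections import Counter
from math import log

def build_model(pool):
    """Map each whitespace-delimited character sequence to its relative frequency."""
    counts = Counter(word for doc in pool for word in doc.lower().split())
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}

def promote(model, rules, threshold):
    """Return the rule set extended with every high-likelihood model entry."""
    return set(rules) | {seq for seq, p in model.items() if p >= threshold}

def best_split(chunk, model):
    """Most likely segmentation of chunk under the model, or None if none exists."""
    # best[i] holds (log-likelihood, tokens) for the prefix chunk[:i].
    best = [None] * (len(chunk) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(chunk) + 1):
        for start in range(end):
            piece = chunk[start:end]
            if best[start] is None or piece not in model:
                continue
            score = best[start][0] + log(model[piece])
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    return best[-1][1] if best[-1] else None

def tokenize(document, rules, model):
    """Apply the rules first; fall back to the model where no rule matches."""
    tokens = []
    for chunk in document.lower().split():
        if chunk in rules:
            tokens.append(chunk)  # a rule identifies the chunk as a valid token
        else:
            # Assumption: an unmatched chunk counts as "not unambiguously tokenized".
            tokens.extend(best_split(chunk, model) or [chunk])
    return tokens

pool = ["new york is big", "i love new york", "york city is new"]
model = build_model(pool)
rules = promote(model, {"city"}, threshold=0.2)  # hypothetical seed rule and threshold
print(tokenize("newyork city", rules, model))
```

In this toy pool, "new" and "york" each account for a quarter of the sequences, so both are promoted into the rule set, while "newyork" matches no rule and is split by the statistical fallback.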

20 Claims

1. A method for tokenizing text for natural language processing, the method comprising:
- generating, by one or more processors in a natural language processing platform, and from a pool of documents, a set of statistical models comprising one or more entries each indicating a likelihood of appearance of a character/letter sequence in the pool of documents;
  
  receiving, by the one or more processors, a set of rules comprising rules that identify character/letter sequences as valid tokens;
  
  transforming, by the one or more processors, one or more entries in the statistical models into new rules that are added to the set of rules when the entries indicate a high likelihood;
  
  receiving, by the one or more processors, a document to be processed;
  
  dividing, by the one or more processors, the document to be processed into tokens based on the set of statistical models and the set of rules, wherein the statistical models are applied where the rules fail to unambiguously tokenize the document; and
  
  outputting, by the one or more processors, the divided tokens for natural language processing.
- Dependent claims (2 to 17):
  - 2. The method of claim 1, wherein the set of statistical models further comprises statistical models based on human annotation, and the method further comprises:
    - generating, by the one or more processors, one or more human readable prompts configured to elicit annotations of one or more documents in the pool of documents, wherein the annotations comprise identification of one or more character/letter sequences in the documents as valid tokens;
      
      receiving, by the one or more processors, one or more annotations elicited by the human readable prompts; and
      
      generating, by the one or more processors, statistical models based on the received annotations, wherein the statistical models comprise one or more entries each indicating a likelihood of appearance of a character/letter sequence in the character/letter sequences annotated as valid tokens.
  - 3. The method of claim 1, wherein:
    - the document to be processed is in one or more languages; and
      
      the divided tokens are outputted in a language agnostic format.
  - 4. The method of claim 3, wherein:
    - the document to be processed is in more than one language;
      
      the set of rules further comprises a rule that divides portions of the document in different languages into different segments; and
      
      the segments of the document in different languages are divided into tokens based on a different combination of rules and statistical models.
  - 5. The method of claim 1, wherein the set of rules further comprises a rule that triggers the application of rules and/or statistical models for further tokenization.
  - 6. The method of claim 1, wherein at least one of the divided tokens contains a morpheme.
  - 7. The method of claim 1, wherein at least one of the divided tokens contains a group of words.
  - 8. The method of claim 7, wherein at least one of the divided tokens contains a turn in a conversation.
  - 9. The method of claim 1, wherein dividing the document to be processed into tokens based on the set of statistical models comprises comparing statistical likelihood of more than one candidate set of tokens.
  - 10. The method of claim 9, wherein the candidate set of tokens that contains tokens with smallest sizes is preferred.
  - 11. The method of claim 9, wherein more than one candidate set of tokens is outputted for natural language processing.
  - 12. The method of claim 1, wherein:
    - the set of statistical models further comprises one or more statistical models for normalizing variants of a token into a single token and/or the set of rules further comprises one or more rules for normalizing variants of a token into a single token; and
      
      the method further comprises normalizing variants of a token into a single token based on the statistical models and/or the rules.
  - 13. The method of claim 1, wherein:
    - the set of statistical models further comprises one or more statistical models for adding tags to the tokens and/or the set of rules further comprises one or more rules for adding tags to the tokens; and
      
      the method further comprises adding tags to the tokens based on the statistical models and/or the rules.
  - 14. The method of claim 13, wherein the tags are based on semantic information and/or structural information.
  - 15. The method of claim 1, wherein the set of rules further comprises one or more rules that identify markup language content, an Internet address, a hashtag, or an emoji/emoticon.
  - 16. The method of claim 1, wherein the set of statistical models and/or the set of rules are adjusted based at least in part on an author of the document.
  - 17. The method of claim 1, wherein the set of statistical models and/or the set of rules are based at least in part on intra-document information.
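
Claims 9 to 11 describe comparing more than one candidate set of tokens. The Python sketch below is one possible reading, offered as an assumption and not as the patented method: it enumerates every split of a chunk into sequences known to a model, ranks the candidates by log-likelihood (claim 9), and separately selects the candidate whose largest token is shortest (one interpretation of "smallest sizes" in claim 10). Returning the whole ranked list corresponds to outputting more than one candidate (claim 11). The names `candidates`, `log_likelihood`, `rank_by_likelihood`, `prefer_smallest`, and the toy probabilities are hypothetical.

```python
from math import log

def candidates(chunk, model):
    """Yield every way to split chunk into sequences known to the model."""
    if not chunk:
        yield []
        return
    for end in range(1, len(chunk) + 1):
        head = chunk[:end]
        if head in model:
            for rest in candidates(chunk[end:], model):
                yield [head] + rest

def log_likelihood(tokens, model):
    """Sum of log-probabilities, treating tokens as independent (an assumption)."""
    return sum(log(model[t]) for t in tokens)

def rank_by_likelihood(chunk, model):
    """All candidate token sets, most likely first."""
    return sorted(candidates(chunk, model),
                  key=lambda c: log_likelihood(c, model),
                  reverse=True)

def prefer_smallest(cands):
    """Pick the candidate whose largest token is shortest."""
    return min(cands, key=lambda c: max(len(t) for t in c))

# Hypothetical likelihood entries for illustration only.
model = {"some": 0.2, "one": 0.2, "someone": 0.1, "so": 0.05, "me": 0.05}
ranked = rank_by_likelihood("someone", model)
for cand in ranked:
    print(cand, round(log_likelihood(cand, model), 3))
print("smallest-token preference:", prefer_smallest(ranked))
```

The two criteria can disagree: here the likelihood ranking favors the single token "someone", while the smallest-size preference favors the three-token split.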

18. An apparatus for tokenizing text for natural language processing, the apparatus comprising one or more processors configured to:
- generate from a pool of documents a set of statistical models comprising one or more entries each indicating a likelihood of appearance of a character/letter sequence in the pool of documents;
  
  receive a set of rules comprising rules that identify character/letter sequences as valid tokens;
  
  transform one or more entries in the statistical models into new rules that are added to the set of rules when the entries indicate a high likelihood;
  
  receive a document to be processed;
  
  divide the document to be processed into tokens based on the set of statistical models and the set of rules, wherein the statistical models are applied where the rules fail to unambiguously tokenize the document; and
  
  output the divided tokens for natural language processing.
- Dependent claims (19):
  - 19. The apparatus of claim 18, wherein the set of statistical models further comprises statistical models based on human annotation, and the one or more processors are further configured to:
    - generate one or more human readable prompts configured to elicit annotations of one or more documents in the pool of documents, wherein the annotations comprise identification of one or more character/letter sequences in the documents as valid tokens;
      
      receive one or more annotations elicited by the human readable prompts; and
      
      generate statistical models based on the received annotations, wherein the statistical models comprise one or more entries each indicating a likelihood of appearance of a character/letter sequence in the character/letter sequences annotated as valid tokens.
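
The annotation loop in claims 2 and 19 (prompt, collect annotations, build a model from them) might be sketched as follows. This is a simplified illustration under stated assumptions: annotators answer with a comma-separated list of tokens, and the model is the relative frequency of each annotated sequence. The prompt wording and the names `make_prompt`, `parse_annotation`, and `build_annotation_model` are hypothetical.

```python
from collections import Counter

def make_prompt(doc):
    """Human-readable prompt asking an annotator to mark valid tokens."""
    return (f'Which character sequences in "{doc}" are valid tokens? '
            "List them separated by commas.")

def parse_annotation(answer):
    """Split a comma-separated answer into annotated sequences."""
    return [tok.strip() for tok in answer.split(",") if tok.strip()]

def build_annotation_model(answers):
    """Relative frequency of each sequence among all annotated sequences."""
    counts = Counter(tok for ans in answers for tok in parse_annotation(ans))
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}

print(make_prompt("new york city"))
# Hypothetical answers from two annotators.
answers = ["new york, city", "new york, pizza"]
annotation_model = build_annotation_model(answers)
print(annotation_model)
```

A model built this way could be consulted alongside the corpus-derived model when the rules leave a span ambiguous.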

20. A non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to:
- generate from a pool of documents a set of statistical models comprising one or more entries each indicating a likelihood of appearance of a character/letter sequence in the pool of documents;
  
  receive a set of rules comprising rules that identify character/letter sequences as valid tokens;
  
  transform one or more entries in the statistical models into new rules that are added to the set of rules when the entries indicate a high likelihood;
  
  receive a document to be processed;
  
  divide the document to be processed into tokens based on the set of statistical models and the set of rules, wherein the statistical models are applied where the rules fail to unambiguously tokenize the document; and
  
  output the divided tokens for natural language processing.

Current Assignee: AI IP Investments Limited
Original Assignee: Sansa AI Incorporated
Inventors: Munro, Robert J.; Voigt, Rob; Erle, Schuyler D.; Callahan, Brendan D.; King, Gary C.; Long, Jessica D.; Brenier, Jason; Saxena, Tripti; Krawczyk, Stefan
Primary Examiner: GUERRA-ERAZO, EDGAR X
Application Number: US14/964,512
Publication Number: US 20160162466A1
Time in Patent Office: 881 Days
Field of Search: 704/1-10
CPC Class Codes:
G06F 40/216   using statistical methods
G06F 40/284   Lexical analysis, e.g. toke...
G06F 40/30   Semantic analysis
