Intelligent system that dynamically improves its knowledge and code-base for natural language understanding
Abstract
Systems, methods, and apparatuses are presented for a novel natural language tokenizer and tagger. In some embodiments, a method for tokenizing text for natural language processing comprises: generating from a pool of documents, a set of statistical models comprising one or more entries each indicating a likelihood of appearance of a character/letter sequence in the pool of documents; receiving a set of rules comprising rules that identify character/letter sequences as valid tokens; transforming one or more entries in the statistical models into new rules that are added to the set of rules when the entries indicate a high likelihood; receiving a document to be processed; dividing the document to be processed into tokens based on the set of statistical models and the set of rules, wherein the statistical models are applied where the rules fail to unambiguously tokenize the document; and outputting the divided tokens for natural language processing.
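The first steps of the abstract (generating a statistical model from a pool of documents, then transforming high-likelihood entries into new rules) might be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: whitespace-delimited sequences stand in for "character/letter sequences", likelihood is taken to be relative frequency, and the names `build_model`, `promote_to_rules`, and the `threshold` parameter are hypothetical. The source does not quantify "high likelihood".

```python
from collections import Counter


def build_model(documents):
    """Map each whitespace-delimited sequence to its relative frequency.

    Assumption: relative frequency over the pool stands in for the
    "likelihood of appearance" named in the abstract.
    """
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {seq: n / total for seq, n in counts.items()}


def promote_to_rules(model, rules, threshold):
    """Return a new rule set that also accepts every high-likelihood sequence.

    Assumption: a rule is represented simply as a sequence accepted as a
    valid token, and "high likelihood" means likelihood >= threshold.
    """
    promoted = {seq for seq, p in model.items() if p >= threshold}
    return set(rules) | promoted


pool = ["the cat", "the dog", "the end"]
model = build_model(pool)
rules = promote_to_rules(model, {"a"}, threshold=0.4)
```

With this toy pool, "the" accounts for half of all sequences and is promoted, while the rarer sequences remain in the statistical model only.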
20 Claims
1. A method for tokenizing text for natural language processing, the method comprising:
generating, by one or more processors in a natural language processing platform, and from a pool of documents, a set of statistical models comprising one or more entries each indicating a likelihood of appearance of a character/letter sequence in the pool of documents;
receiving, by the one or more processors, a set of rules comprising rules that identify character/letter sequences as valid tokens;
transforming, by the one or more processors, one or more entries in the statistical models into new rules that are added to the set of rules when the entries indicate a high likelihood;
receiving, by the one or more processors, a document to be processed;
dividing, by the one or more processors, the document to be processed into tokens based on the set of statistical models and the set of rules, wherein the statistical models are applied where the rules fail to unambiguously tokenize the document; and
outputting, by the one or more processors, the divided tokens for natural language processing.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
18. An apparatus for tokenizing text for natural language processing, the apparatus comprising one or more processors configured to:
generate from a pool of documents a set of statistical models comprising one or more entries each indicating a likelihood of appearance of a character/letter sequence in the pool of documents;
receive a set of rules comprising rules that identify character/letter sequences as valid tokens;
transform one or more entries in the statistical models into new rules that are added to the set of rules when the entries indicate a high likelihood;
receive a document to be processed;
divide the document to be processed into tokens based on the set of statistical models and the set of rules, wherein the statistical models are applied where the rules fail to unambiguously tokenize the document; and
output the divided tokens for natural language processing.
Dependent claims: 19
20. A non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to:
generate from a pool of documents a set of statistical models comprising one or more entries each indicating a likelihood of appearance of a character/letter sequence in the pool of documents;
receive a set of rules comprising rules that identify character/letter sequences as valid tokens;
transform one or more entries in the statistical models into new rules that are added to the set of rules when the entries indicate a high likelihood;
receive a document to be processed;
divide the document to be processed into tokens based on the set of statistical models and the set of rules, wherein the statistical models are applied where the rules fail to unambiguously tokenize the document; and
output the divided tokens for natural language processing.
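The dividing step recited in the independent claims, in which the statistical models are applied only where the rules fail to unambiguously tokenize the document, could look roughly like the sketch below. It is illustrative only and rests on several assumptions not found in the claims: rules are a plain set of accepted sequences, the model is a dictionary of likelihoods, ambiguity means that more than one rule-based segmentation exists, and candidate segmentations are scored by summed log-likelihood. The names `segmentations`, `tokenize`, and `floor` are hypothetical.

```python
import math


def segmentations(text, rules):
    """Return every way to split text into sequences that the rules accept.

    Exhaustive recursion; adequate for a short illustration, not for
    long inputs.
    """
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in rules:
            for rest in segmentations(text[end:], rules):
                results.append([head] + rest)
    return results


def tokenize(document, rules, model, floor=1e-9):
    """Divide a document into tokens, consulting the model only on ambiguity."""
    tokens = []
    for chunk in document.split():
        options = segmentations(chunk, rules)
        if len(options) == 1:
            # The rules alone give a single reading.
            tokens.extend(options[0])
        elif not options:
            # Assumption: when no rule applies, keep the chunk whole.
            tokens.append(chunk)
        else:
            # The rules are ambiguous: pick the most likely segmentation.
            # Unseen sequences get the small `floor` likelihood.
            best = max(
                options,
                key=lambda seg: sum(math.log(model.get(t, floor)) for t in seg),
            )
            tokens.extend(best)
    return tokens


rules = {"no", "now", "here", "where", "is"}
model = {"now": 0.2, "here": 0.2, "no": 0.01, "where": 0.01}
tokens = tokenize("nowhere is", rules, model)
```

Here the rules permit both "no" + "where" and "now" + "here" for the first chunk, so the model breaks the tie; "is" has a single rule-based reading and never reaches the model.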
Specification