System and method for text normalization using atomic tokens
First Claim
1. A method comprising:
receiving a text corpus;
tokenizing, via a tokenization module on a computing device, the text corpus into application tokens, each application token of the application tokens comprising one of a sequence of letters, a sequence of digits, and punctuation, wherein the tokenization module is trained on training data generated by a feature extraction module that extracts morphological and lexical text features from a training data token and from an n-left token or an n-right token associated with the training data token;
comparing the application tokens to a language-independent pattern list that comprises number patterns, to yield a token comparison;
identifying text-to-speech pronunciation guidelines associated with each application token in the application tokens, wherein the text-to-speech pronunciation guidelines comprise at least one of reorder, asword, and split; and
generating, via a text-to-speech computer system and an output device, audible speech from the application tokens in the text corpus using the token comparison and the text-to-speech pronunciation guidelines.
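The tokenizing and comparing steps of claim 1 can be illustrated with a minimal sketch. It assumes that an application token is a maximal run of letters, a maximal run of digits, or a single punctuation character, and that a pattern is a sequence of token kinds and literal punctuation. The names `atomic_tokens`, `match_pattern`, and the two entries in `NUMBER_PATTERNS` are hypothetical; the claim says only that the pattern list "comprises number patterns".

```python
# Illustrative sketch only; not the patented implementation.
KINDS = {"letters", "digits", "punct"}

# Hypothetical language-independent number patterns (longest listed first).
NUMBER_PATTERNS = {
    "decimal": ("digits", ".", "digits"),
    "integer": ("digits",),
}


def atomic_tokens(text):
    """Split text into (kind, text) pairs: letter runs, digit runs, single punctuation."""
    tokens = []
    last_kind = None
    for ch in text:
        if ch.isspace():
            last_kind = None  # whitespace always ends the current run
            continue
        if ch.isalpha():
            kind = "letters"
        elif ch.isdigit():
            kind = "digits"
        else:
            kind = "punct"
        if kind != "punct" and kind == last_kind:
            tokens[-1] = (kind, tokens[-1][1] + ch)  # extend the current run
        else:
            tokens.append((kind, ch))
        last_kind = kind
    return tokens


def _fits(want, token):
    """A pattern element is either a token kind or a literal token text."""
    kind, text = token
    return kind == want if want in KINDS else text == want


def match_pattern(tokens, start):
    """Return (pattern_name, length) for the first pattern matching at start, else None."""
    for name, shape in NUMBER_PATTERNS.items():
        window = tokens[start:start + len(shape)]
        if len(window) == len(shape) and all(
            _fits(want, tok) for want, tok in zip(shape, window)
        ):
            return name, len(shape)
    return None


tokens = atomic_tokens("Pay $3.50 now!")
comparison = match_pattern(tokens, 2)
```

Because the patterns refer only to token kinds and punctuation, not to words, the same list could in principle be reused across languages, which is the property the claim calls language-independent.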
Abstract
A system, method, and computer-readable storage devices normalize text for automatic speech recognition (ASR) and text-to-speech (TTS) in a language-neutral way. The system divides Unicode text into meaningful chunks called “atomic tokens.” The atomic tokens correlate strongly with their actual pronunciation rather than with their meaning. The system combines the tokenization with a data-driven classification scheme, followed by class-determined actions that convert the text to normalized form. The classification labels are based on pronunciation, unlike alternative approaches that typically employ Named Entity-based categories. Thus, this approach is relatively simple to adapt to new languages. Non-experts can easily annotate training data because the tokens are based on pronunciation alone.
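The "class-determined actions" mentioned in the abstract can be sketched as follows. The label names `reorder`, `asword`, and `split` come from the claims, but their semantics are not defined in this section, so the behavior below is an assumption made purely for illustration: `asword` passes the token through to be read as a word, `split` emits each character separately, and `reorder` swaps the token with the one that follows it. The labels are supplied by hand here as a stand-in for the output of the data-driven classifier.

```python
# Illustrative sketch only; the action semantics are assumptions, not the patent's definitions.

def expand(token, label):
    """Apply a single-token action: 'split' spells the token out, anything else keeps it whole."""
    return list(token) if label == "split" else [token]


def normalize(tokens, labels):
    """Apply one pronunciation-based action label per token and return the normalized sequence."""
    out = []
    i = 0
    while i < len(tokens):
        if labels[i] == "reorder" and i + 1 < len(tokens):
            # Assumed meaning: speak the following token first, then this one.
            out.extend(expand(tokens[i + 1], labels[i + 1]))
            out.append(tokens[i])
            i += 2
        else:
            out.extend(expand(tokens[i], labels[i]))
            i += 1
    return out


normalized = normalize(
    ["$", "5", "USA", "now"],
    ["reorder", "asword", "split", "asword"],
)
```

Under these assumptions, `normalized` places the digit before the currency symbol and spells out the acronym letter by letter, so the labels describe how a token is spoken rather than what kind of entity it is.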
20 Claims
1. A method comprising:
receiving a text corpus;
tokenizing, via a tokenization module on a computing device, the text corpus into application tokens, each application token of the application tokens comprising one of a sequence of letters, a sequence of digits, and punctuation, wherein the tokenization module is trained on training data generated by a feature extraction module that extracts morphological and lexical text features from a training data token and from an n-left token or an n-right token associated with the training data token;
comparing the application tokens to a language-independent pattern list that comprises number patterns, to yield a token comparison;
identifying text-to-speech pronunciation guidelines associated with each application token in the application tokens, wherein the text-to-speech pronunciation guidelines comprise at least one of reorder, asword, and split; and
generating, via a text-to-speech computer system and an output device, audible speech from the application tokens in the text corpus using the token comparison and the text-to-speech pronunciation guidelines.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system comprising:
a processor configured to perform text-to-speech generation; and
a computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising:
receiving a text corpus;
tokenizing, via a tokenization module, the text corpus into application tokens, each application token of the application tokens comprising one of a sequence of letters, a sequence of digits, and punctuation, wherein the tokenization module is trained on training data generated by a feature extraction module that extracts morphological and lexical text features from a training data token and from an n-left token or an n-right token associated with the training data token;
comparing the application tokens to a language-independent pattern list that comprises number patterns, to yield a token comparison;
identifying text-to-speech pronunciation guidelines associated with each application token in the application tokens, wherein the text-to-speech pronunciation guidelines comprise at least one of reorder, asword, and split; and
generating audible speech from the application tokens in the text corpus using the token comparison and the text-to-speech pronunciation guidelines.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A computer-readable storage device having instructions stored which, when executed by a computing device configured to perform text-to-speech generation, cause the computing device to perform operations comprising:
receiving a text corpus;
tokenizing, via a tokenization module, the text corpus into application tokens, each application token of the application tokens comprising one of a sequence of letters, a sequence of digits, and punctuation, wherein the tokenization module is trained on training data generated by a feature extraction module that extracts morphological and lexical text features from a training data token and from an n-left token or an n-right token associated with the training data token;
comparing the application tokens to a language-independent pattern list that comprises number patterns, to yield a token comparison;
identifying text-to-speech pronunciation guidelines associated with each application token in the application tokens, wherein the text-to-speech pronunciation guidelines comprise at least one of reorder, asword, and split; and
generating audible speech from the application tokens in the text corpus using the token comparison and the text-to-speech pronunciation guidelines.
Dependent claims: 16, 17, 18, 19, 20
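The feature extraction recited in independent claims 1, 8, and 15 draws features from a training data token and from its n-left or n-right neighbors. A minimal sketch of such a context-window extractor is shown below. The specific features (`lower`, `is_digits`, `is_alpha`, `is_upper`, `length`, `suffix2`) and the function names are hypothetical; the claims say only that the features are morphological and lexical. The sketch reads both left and right neighbors, whereas the claims require only one or the other.

```python
# Illustrative sketch only; the feature set is an assumption, not taken from the patent.

def describe(token):
    """Return simple lexical and morphological features for one token (None means padding)."""
    if token is None:
        return {"pad": True}
    return {
        "lower": token.lower(),        # lexical identity, case-folded
        "is_digits": token.isdigit(),  # token consists only of digits
        "is_alpha": token.isalpha(),   # token consists only of letters
        "is_upper": token.isupper(),   # all cased characters are upper case
        "length": len(token),
        "suffix2": token[-2:],         # crude morphological cue
    }


def token_features(tokens, i, n=1):
    """Collect features for tokens[i] and its n left and n right neighbors."""
    feats = {}
    for offset in range(-n, n + 1):
        j = i + offset
        tok = tokens[j] if 0 <= j < len(tokens) else None
        for name, value in describe(tok).items():
            # Keys look like "-1:lower", "+0:length", "+1:is_upper".
            feats[f"{offset:+d}:{name}"] = value
    return feats


feats = token_features(["$", "5", "USA"], 1)
```

A dictionary of this shape could be fed to any standard classifier that accepts sparse named features; which learner the patent uses is not stated in this section.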
Specification