System and method for text normalization using atomic tokens

  • US 10,388,270 B2
  • Filed: 11/05/2014
  • Issued: 08/20/2019
  • Est. Priority Date: 11/05/2014
  • Status: Active Grant
First Claim

1. A method comprising:

    receiving a text corpus;

    tokenizing, via a tokenization module on a computing device, the text corpus into application tokens, each application token of the application tokens comprising one of a sequence of letters, a sequence of digits, and punctuation, wherein the tokenization module is trained on training data generated by a feature extraction module that extracts morphological and lexical text features from a training data token and from an n-left token or an n-right token associated with the training data token;

    comparing the application tokens to a language-independent pattern list that comprises number patterns, to yield a token comparison;

    identifying text-to-speech pronunciation guidelines associated with each application token in the application tokens, wherein the text-to-speech pronunciation guidelines comprise at least one of reorder, asword, and split; and

    generating, via a text-to-speech computer system and an output device, audible speech from the application tokens in the text corpus using the token comparison and the text-to-speech pronunciation guidelines.
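The tokenizing, comparing, and guideline-identifying steps of claim 1 can be illustrated with a minimal Python sketch. This is not the patented implementation. The claim recites a trained tokenization module, for which a fixed regular expression stands in here. The names `tokenize`, `compare`, `guideline`, `annotate`, `NUMBER_PATTERNS`, and `CURRENCY` are hypothetical, and the rules that assign `reorder`, `asword`, and `split` are illustrative guesses, since the claim names those guidelines without defining them. Speech generation is omitted.

```python
import re

# Assumption: a fixed regex stands in for the trained tokenization module.
# Each atomic token is a run of letters, a run of digits, or one punctuation mark.
ATOMIC = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")

# Hypothetical language-independent pattern list holding a single number pattern.
NUMBER_PATTERNS = {"integer": re.compile(r"\d+")}

# Hypothetical set of symbols that are spoken after the number that follows them.
CURRENCY = {"$", "€", "£"}


def tokenize(text):
    """Split text into atomic tokens (letters, digits, punctuation)."""
    return ATOMIC.findall(text)


def compare(tokens):
    """Pair each token with the name of the first pattern it fully matches, or None."""
    compared = []
    for tok in tokens:
        label = next(
            (name for name, pat in NUMBER_PATTERNS.items() if pat.fullmatch(tok)),
            None,
        )
        compared.append((tok, label))
    return compared


def guideline(tok, label, next_label):
    """Pick a pronunciation guideline using illustrative, assumed rules."""
    if tok in CURRENCY and next_label is not None:
        return "reorder"  # e.g. "$" is spoken after the amount
    if label == "integer" and len(tok) > 4:
        return "split"    # e.g. read a long digit string in pieces
    if tok.isalpha():
        return "asword"   # e.g. read a letter sequence as a word
    return None


def annotate(text):
    """Return (token, pattern label, guideline) triples for the text."""
    compared = compare(tokenize(text))
    out = []
    for i, (tok, label) in enumerate(compared):
        next_label = compared[i + 1][1] if i + 1 < len(compared) else None
        out.append((tok, label, guideline(tok, label, next_label)))
    return out


if __name__ == "__main__":
    for triple in annotate("Pay $25 by 11/05/2014."):
        print(triple)
```

In this sketch a date such as 11/05/2014 is broken into five atomic tokens (three digit runs and two slashes), so any date-specific handling would have to operate on the token sequence rather than on a single token.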
