Inverse text normalization for automatic speech recognition
Abstract
Techniques for inverse text normalization are provided. In some examples, speech input is received and a spoken-form text representation of the speech input is generated. The spoken-form text representation includes a token sequence. A feature representation is determined for the spoken-form text representation and a sequence of labels is determined based on the feature representation. The sequence of labels is assigned to the token sequence and specifies a plurality of edit operations to perform on the token sequence. Each edit operation of the plurality of edit operations corresponds to one of a plurality of predetermined types of edit operations. A written-form text representation of the speech input is generated by applying the plurality of edit operations to the token sequence in accordance with the sequence of labels. A task responsive to the speech input is performed using the generated written-form text representation.
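For illustration only, the final step described above (applying label-specified edit operations to the token sequence) might be sketched as follows. The `Label` fields, the four edit types they encode (rewrite, prepend, append, and space suppression), and the `apply_edits` helper are hypothetical names chosen for this sketch; the abstract states only that each edit operation corresponds to one of a plurality of predetermined types.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Label:
    """One label per token. Each field is one hypothetical edit-operation type."""
    rewrite: Optional[str] = None  # replacement text; None keeps the token as-is
    prepend: str = ""              # text inserted immediately before the token
    append: str = ""               # text inserted immediately after the token
    space_before: bool = True      # False joins the token to the previous one


def apply_edits(tokens: List[str], labels: List[Label]) -> str:
    """Turn a spoken-form token sequence into written-form text."""
    if len(tokens) != len(labels):
        raise ValueError("expected exactly one label per token")
    pieces: List[str] = []
    for token, label in zip(tokens, labels):
        text = token if label.rewrite is None else label.rewrite
        text = label.prepend + text + label.append
        # Never emit a leading space before the first token.
        if pieces and label.space_before:
            pieces.append(" ")
        pieces.append(text)
    return "".join(pieces)


tokens = ["call", "me", "at", "five", "thirty", "p", "m"]
labels = [
    Label(), Label(), Label(),
    Label(rewrite="5"),
    Label(rewrite="30", prepend=":", space_before=False),
    Label(rewrite="P"),
    Label(rewrite="M", space_before=False),
]
print(apply_edits(tokens, labels))  # call me at 5:30 PM
```

Because every label selects from a small fixed set of edit types, the written form is fully determined by the token sequence and its labels, so the labeling step can be treated as an ordinary sequence-tagging problem.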
23 Claims
1. A non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processors of an electronic device, the one or more programs including instructions for:
receiving speech input;
generating a spoken-form text representation of the speech input, the spoken-form text representation comprising a token sequence;
determining a feature representation for the spoken-form text representation;
determining, based on the feature representation, a sequence of labels assigned to the token sequence, the sequence of labels specifying a plurality of edit operations to perform on the token sequence, wherein each edit operation of the plurality of edit operations corresponds to one of a plurality of predetermined types of edit operations;
generating a written-form text representation of the speech input by applying the plurality of edit operations to the token sequence in accordance with the sequence of labels; and
performing, using the generated written-form text representation, a task responsive to the speech input.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A method for inverse text normalization, the method comprising:
at an electronic device having one or more processors and memory:
receiving speech input;
generating a spoken-form text representation of the speech input, the spoken-form text representation comprising a token sequence;
determining a feature representation for the spoken-form text representation;
determining, based on the feature representation, a sequence of labels assigned to the token sequence, the sequence of labels specifying a plurality of edit operations to perform on the token sequence, wherein each edit operation of the plurality of edit operations corresponds to one of a plurality of predetermined types of edit operations;
generating a written-form text representation of the speech input by applying the plurality of edit operations to the token sequence in accordance with the sequence of labels; and
performing, using the generated written-form text representation, a task responsive to the speech input.
15. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
receiving speech input;
generating a spoken-form text representation of the speech input, the spoken-form text representation comprising a token sequence;
determining a feature representation for the spoken-form text representation;
determining, based on the feature representation, a sequence of labels assigned to the token sequence, the sequence of labels specifying a plurality of edit operations to perform on the token sequence, wherein each edit operation of the plurality of edit operations corresponds to one of a plurality of predetermined types of edit operations;
generating a written-form text representation of the speech input by applying the plurality of edit operations to the token sequence in accordance with the sequence of labels; and
performing, using the generated written-form text representation, a task responsive to the speech input.
Dependent claims: 16, 17, 18, 19, 20, 21, 22, 23
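The two "determining" steps recited in the independent claims (a feature representation, then a sequence of labels based on it) can be pictured with the sketch below. It is an assumption-laden illustration, not the claimed implementation: the claims do not specify the features or the model, so the `featurize` and `label_tokens` functions, the `NUMBER_WORDS` lexicon, and the string label format are all hypothetical, and the hand-written rules stand in for whatever trained sequence labeler an embodiment would use.

```python
from typing import Dict, List

# Hypothetical lexicon for this sketch; a real system would not hard-code it.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}


def featurize(tokens: List[str]) -> List[Dict[str, object]]:
    """Build one feature dict per token from the token and its neighbors."""
    features = []
    for i, token in enumerate(tokens):
        features.append({
            "token": token,
            "is_number_word": token in NUMBER_WORDS,
            "prev": tokens[i - 1] if i > 0 else "<s>",
            "next": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
        })
    return features


def label_tokens(features: List[Dict[str, object]]) -> List[str]:
    """Rule-based stand-in for a trained labeler: emits one label per token."""
    labels = []
    for feat in features:
        if feat["is_number_word"]:
            # Rewrite a number word as its digit form.
            labels.append("REWRITE=" + NUMBER_WORDS[feat["token"]])
        elif feat["token"] == "percent" and feat["prev"] in NUMBER_WORDS:
            # Rewrite as a symbol and attach it to the preceding number.
            labels.append("REWRITE=%|NO_SPACE")
        else:
            labels.append("KEEP")
    return labels


tokens = "battery at five percent".split()
print(label_tokens(featurize(tokens)))
# ['KEEP', 'KEEP', 'REWRITE=5', 'REWRITE=%|NO_SPACE']
```

Each label names edit types drawn from a fixed inventory (here keep, rewrite, and space suppression), so a downstream step can apply the labels mechanically to produce the written form "battery at 5%".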
Specification