Inverse text normalization for automatic speech recognition

US 10,592,604 B2
Filed: 06/29/2018
Issued: 03/17/2020
Est. Priority Date: 03/12/2018
Status: Active Grant

Abstract

Techniques for inverse text normalization are provided. In some examples, speech input is received and a spoken-form text representation of the speech input is generated. The spoken-form text representation includes a token sequence. A feature representation is determined for the spoken-form text representation and a sequence of labels is determined based on the feature representation. The sequence of labels is assigned to the token sequence and specifies a plurality of edit operations to perform on the token sequence. Each edit operation of the plurality of edit operations corresponds to one of a plurality of predetermined types of edit operations. A written-form text representation of the speech input is generated by applying the plurality of edit operations to the token sequence in accordance with the sequence of labels. A task responsive to the speech input is performed using the generated written-form text representation.
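
As a rough illustration of the flow summarized above, the following Python sketch applies one label per token to turn a spoken-form token sequence into written-form text. The `Label` fields, the `apply_labels` helper, and the example labels are hypothetical names chosen for this sketch. The abstract does not specify a label encoding, and in the described technique the labels would be determined from the feature representation rather than written by hand.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Label:
    """Hypothetical per-token label bundling a few edit operations."""
    rewrite: Optional[str] = None  # None keeps the token; "" removes it
    prepend: str = ""              # characters inserted before the token
    append: str = ""               # characters inserted after the token
    space_after: bool = True       # False removes the space after the token


NO_EDIT = Label()  # label meaning "leave this token unchanged"


def apply_labels(tokens: List[str], labels: List[Label]) -> str:
    """Apply one label per token and join the results into written-form text."""
    if len(tokens) != len(labels):
        raise ValueError("expected exactly one label per token")
    pieces = []
    for token, label in zip(tokens, labels):
        text = token if label.rewrite is None else label.rewrite
        text = label.prepend + text + label.append
        pieces.append(text)
        # Removed tokens contribute no text and therefore no trailing space.
        if text and label.space_after:
            pieces.append(" ")
    return "".join(pieces).strip()


# Hand-written labels for illustration only; a model would predict these.
tokens = ["at", "five", "thirty", "p", "m"]
labels = [
    NO_EDIT,
    Label(rewrite="5", append=":", space_after=False),
    Label(rewrite="30"),
    Label(space_after=False),
    NO_EDIT,
]
written = apply_labels(tokens, labels)
print(written)  # at 5:30 pm
```

Framing the problem as labeling keeps the output tied to the input tokens: every written-form character comes either from a token or from an edit operation attached to it.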

23 Claims

1. A non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processors of an electronic device, the one or more programs including instructions for:
- receiving speech input;
  
  generating a spoken-form text representation of the speech input, the spoken-form text representation comprising a token sequence;
  
  determining a feature representation for the spoken-form text representation;
  
  determining, based on the feature representation, a sequence of labels assigned to the token sequence, the sequence of labels specifying a plurality of edit operations to perform on the token sequence, wherein each edit operation of the plurality of edit operations corresponds to one of a plurality of predetermined types of edit operations;
  
  generating a written-form text representation of the speech input by applying the plurality of edit operations to the token sequence in accordance with the sequence of labels; and
  
  performing, using the generated written-form text representation, a task responsive to the speech input.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13):
  - 2. The computer-readable storage medium of claim 1, wherein the feature representation for the spoken-form text representation is generated based on a representation of the token sequence and a representation of lexical features identified from the token sequence.
  - 3. The computer-readable storage medium of claim 2, wherein the feature representation for the spoken-form text representation is generated further based on a representation of features in the token sequence that correspond to one or more of the plurality of predetermined types of edit operations.
  - 4. The computer-readable storage medium of claim 1, wherein determining the sequence of labels includes assigning each label of the sequence of labels to a respective token in the token sequence.
  - 5. The computer-readable storage medium of claim 1, wherein the sequence of labels includes a label specifying that no edit operations are to be performed on an associated token of the token sequence.
  - 6. The computer-readable storage medium of claim 1, wherein one or more labels of the sequence of labels each specify one or more edit operations that apply to at most one token of the token sequence.
  - 7. The computer-readable storage medium of claim 1, wherein the sequence of labels defines a segment of the token sequence, and wherein the plurality of edit operations includes one or more first edit operations that each apply to two or more tokens in the segment.
  - 8. The computer-readable storage medium of claim 7, wherein the plurality of edit operations further includes one or more second edit operations that each apply to no more than one token of the token sequence, and wherein the one or more programs further include instructions for:
    - generating an intermediate text representation by applying the one or more second edit operations to the token sequence; and
      
      generating the written-form text representation by applying the one or more first edit operations to the intermediate text representation.
  - 9. The computer-readable storage medium of claim 1, wherein the plurality of predetermined types of edit operations includes a rewrite operation type for removing a first token of the token sequence or replacing the first token with a different token.
  - 10. The computer-readable storage medium of claim 1, wherein the plurality of predetermined types of edit operations includes a prepend type of edit operation for inserting one or more characters before a second token of the token sequence.
  - 11. The computer-readable storage medium of claim 1, wherein the plurality of predetermined types of edit operations includes an append type of edit operation for inserting one or more characters after a third token of the token sequence.
  - 12. The computer-readable storage medium of claim 1, wherein the plurality of predetermined types of edit operations includes a spacing type of edit operation for inserting or removing a space before or after a fourth token of the token sequence.
  - 13. The computer-readable storage medium of claim 1, wherein determining the sequence of labels further comprises:
    - determining a context-dependent feature vector of a fifth token in the token sequence based on a current backward context state of the fifth token and a current forward context state of the fifth token;
      
      determining, based on the context-dependent feature vector of the fifth token, a set of probabilities associated with a predetermined set of possible labels; and
      
      based on the determined set of probabilities, selecting, from the predetermined set of possible labels, a label to assign to the fifth token.
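
The two-stage application recited in claims 7 and 8 can be sketched as follows: edits that touch at most one token are applied first to produce an intermediate representation, and edits that span a multi-token segment are applied afterwards. This is a minimal sketch under stated assumptions. The `B`/`I`/`O` segment tags, the `DIGITS` table, and the helper names are hypothetical and are not taken from the claims, which do not say how a segment is encoded in the labels.

```python
from typing import Callable, List, Optional

# Hypothetical rewrite table for single-token edits.
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}


def apply_token_edits(tokens: List[str], rewrites: List[Optional[str]]) -> List[str]:
    """Stage 1: edits that each apply to no more than one token."""
    return [tok if new is None else new for tok, new in zip(tokens, rewrites)]


def apply_segment_edits(tokens: List[str], tags: List[str],
                        segment_edit: Callable[[List[str]], str]) -> List[str]:
    """Stage 2: edits that each apply to a whole segment.

    Tags are assumed to be "B" (segment start), "I" (inside), or "O" (outside).
    """
    out: List[str] = []
    segment: List[str] = []

    def flush() -> None:
        # Emit the pending segment, if any, as one edited unit.
        if segment:
            out.append(segment_edit(list(segment)))
            segment.clear()

    for tok, tag in zip(tokens, tags):
        if tag == "B":
            flush()
            segment.append(tok)
        elif tag == "I" and segment:
            segment.append(tok)
        else:
            flush()
            out.append(tok)
    flush()
    return out


tokens = "call five five five one two one two".split()
rewrites = [DIGITS.get(tok) for tok in tokens]        # None means no rewrite
intermediate = apply_token_edits(tokens, rewrites)    # ['call', '5', '5', ...]
tags = ["O", "B", "I", "I", "I", "I", "I", "I"]
written = " ".join(apply_segment_edits(intermediate, tags, "".join))
print(written)  # call 5551212
```

Running the single-token edits first means the segment-level edit only has to join already-rewritten digits, which is one plausible reason for ordering the stages this way.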

14. A method for inverse text normalization, the method comprising:
- at an electronic device having one or more processors and memory:
  
  receiving speech input;
  
  generating a spoken-form text representation of the speech input, the spoken-form text representation comprising a token sequence;
  
  determining a feature representation for the spoken-form text representation;
  
  determining, based on the feature representation, a sequence of labels assigned to the token sequence, the sequence of labels specifying a plurality of edit operations to perform on the token sequence, wherein each edit operation of the plurality of edit operations corresponds to one of a plurality of predetermined types of edit operations;
  
  generating a written-form text representation of the speech input by applying the plurality of edit operations to the token sequence in accordance with the sequence of labels; and
  
  performing, using the generated written-form text representation, a task responsive to the speech input.
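
The "determining a feature representation" step is narrowed in claims 2 and 16 to combine a representation of the token sequence with a representation of lexical features identified from it. The sketch below is one simple way to read that, assuming a one-hot token encoding concatenated with a few binary lexical flags. The vocabulary, the word lists, and the function names are hypothetical, and the claims do not specify the actual feature set or encoding.

```python
from typing import Dict, List

# Hypothetical word lists used to derive lexical features.
NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}
UNIT_WORDS = {"percent", "dollars", "cents"}


def token_one_hot(token: str, vocab: Dict[str, int]) -> List[float]:
    """Represent the token itself; the last slot is reserved for unknown words."""
    vec = [0.0] * (len(vocab) + 1)
    vec[vocab.get(token, len(vocab))] = 1.0
    return vec


def lexical_features(token: str) -> List[float]:
    """Binary flags: number word, unit word, single character."""
    return [
        float(token in NUMBER_WORDS),
        float(token in UNIT_WORDS),
        float(len(token) == 1),
    ]


def feature_representation(tokens: List[str], vocab: Dict[str, int]) -> List[List[float]]:
    """One feature vector per token: token representation plus lexical flags."""
    return [token_one_hot(tok, vocab) + lexical_features(tok) for tok in tokens]


vocab = {"three": 0, "percent": 1}
features = feature_representation(["three", "percent", "zzz"], vocab)
for row in features:
    print(row)
```

Each row has `len(vocab) + 1` token slots followed by three lexical flags, so a word missing from the vocabulary still carries whatever lexical evidence the flags provide.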

15. An electronic device, comprising:
- one or more processors;
  
  a memory; and
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  receiving speech input;
  
  generating a spoken-form text representation of the speech input, the spoken-form text representation comprising a token sequence;
  
  determining a feature representation for the spoken-form text representation;
  
  determining, based on the feature representation, a sequence of labels assigned to the token sequence, the sequence of labels specifying a plurality of edit operations to perform on the token sequence, wherein each edit operation of the plurality of edit operations corresponds to one of a plurality of predetermined types of edit operations;
  
  generating a written-form text representation of the speech input by applying the plurality of edit operations to the token sequence in accordance with the sequence of labels; and
  
  performing, using the generated written-form text representation, a task responsive to the speech input.
- Dependent claims (16, 17, 18, 19, 20, 21, 22, 23):
  - 16. The device of claim 15, wherein the feature representation for the spoken-form text representation is generated based on a representation of the token sequence and a representation of lexical features identified from the token sequence.
  - 17. The device of claim 16, wherein the feature representation for the spoken-form text representation is generated further based on a representation of features in the token sequence that correspond to one or more of the plurality of predetermined types of edit operations.
  - 18. The device of claim 15, wherein determining the sequence of labels includes assigning each label of the sequence of labels to a respective token in the token sequence.
  - 19. The device of claim 15, wherein the sequence of labels includes a label specifying that no edit operations are to be performed on an associated token of the token sequence.
  - 20. The device of claim 15, wherein one or more labels of the sequence of labels each specify one or more edit operations that apply to at most one token of the token sequence.
  - 21. The device of claim 15, wherein the sequence of labels defines a segment of the token sequence, and wherein the plurality of edit operations includes one or more first edit operations that each apply to two or more tokens in the segment.
  - 22. The device of claim 21, wherein the plurality of edit operations further includes one or more second edit operations that each apply to no more than one token of the token sequence, and wherein the one or more programs further include instructions for:
    - generating an intermediate text representation by applying the one or more second edit operations to the token sequence; and
      
      generating the written-form text representation by applying the one or more first edit operations to the intermediate text representation.
  - 23. The device of claim 15, wherein determining the sequence of labels further comprises:
    - determining a context-dependent feature vector of a fifth token in the token sequence based on a current backward context state of the fifth token and a current forward context state of the fifth token;
      
      determining, based on the context-dependent feature vector of the fifth token, a set of probabilities associated with a predetermined set of possible labels; and
      
      based on the determined set of probabilities, selecting, from the predetermined set of possible labels, a label to assign to the fifth token.
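
The label-selection steps of claims 13 and 23 can be sketched with a toy bidirectional recurrence: a forward pass and a backward pass each produce a context state per token, the two states are concatenated into a context-dependent feature vector, a softmax turns label scores into probabilities, and the most probable label is selected. This is an illustrative sketch only. The elementwise tanh recurrence, the hand-set weights, the two-label set, and all function names are assumptions, since the claims do not specify a network architecture or a decoding rule.

```python
import math
from typing import List, Sequence

Vector = List[float]


def context_states(inputs: Sequence[Vector], w_in: float = 1.0,
                   w_rec: float = 0.5) -> List[Vector]:
    """Toy elementwise tanh recurrence; returns the state after each input."""
    state = [0.0] * len(inputs[0])
    states = []
    for x in inputs:
        state = [math.tanh(w_in * xi + w_rec * si) for xi, si in zip(x, state)]
        states.append(state)
    return states


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert scores to probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_labels(features: Sequence[Vector], weights: Sequence[Vector],
                  labels: Sequence[str]) -> List[str]:
    """Pick the most probable label for each token from its two context states."""
    forward = context_states(features)
    # Run the same recurrence right to left, then restore token order.
    backward = context_states(features[::-1])[::-1]
    chosen = []
    for fwd, bwd in zip(forward, backward):
        context = fwd + bwd  # context-dependent feature vector (concatenation)
        scores = [sum(w * c for w, c in zip(row, context)) for row in weights]
        probs = softmax(scores)
        best = max(range(len(labels)), key=probs.__getitem__)
        chosen.append(labels[best])
    return chosen


LABELS = ["NO_EDIT", "REWRITE"]
# One weight row per label; columns are [forward state, backward state].
WEIGHTS = [[1.0, 0.0], [0.0, 2.0]]
features = [[1.0], [0.0]]  # one-dimensional toy feature per token
print(select_labels(features, WEIGHTS, LABELS))
```

Taking the highest-probability label independently for each token is the simplest selection rule consistent with the claim wording; a real system might instead decode the whole label sequence jointly.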

Current Assignee
Apple Inc.
Original Assignee
Apple Inc.
Inventors
Pusateri, Ernest J.; Ambati, Bharat Ram; Brooks, Elizabeth S.; McAllaster, Donald R.; Nagesha, Venkatesh; Platek, Ondrej
Primary Examiner(s)
Azad, Abul K

Application Number

US16/024,425
Publication Number

US 20190278841A1
Time in Patent Office

627 Days
CPC Class Codes

G06F 40/151   Transformation

G06F 40/284   Lexical analysis, e.g. toke...

G10L 15/22   Procedures used during a sp...

G10L 15/26   Speech to text systems G10L...

G10L 15/30   Distributed recognition, e....

G10L 2015/223   Execution procedure of a sp...
