Method and apparatus for generating synthetic speech with contrastive stress

US 8,571,870 B2
Filed: 08/09/2010
Issued: 10/29/2013
Est. Priority Date: 02/12/2010
Status: Active Grant

7 Assignments

0 Petitions

Abstract

Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings.

35 Citations

51 Claims

1. A method for providing speech output for a speech-enabled application, the method comprising:
- receiving from the speech-enabled application a text input comprising a text transcription of a desired speech output;
  
  generating, using at least one computer system, an audio speech output corresponding to at least a portion of the text input, the audio speech output comprising at least one portion carrying contrastive stress to contrast with at least one other portion of the audio speech output; and
  
  providing the audio speech output for the speech-enabled application;
  
  wherein the generating comprises:
  
  identifying a plurality of tokens of the text input of a same text normalization type for which a contrastive stress pattern is to be applied;
  
  identifying at least one token of the plurality of tokens to be rendered with contrastive stress; and
  
  assigning contrastive stress to be carried by at least one portion of the audio speech output corresponding to at least one portion of the at least one token of the text input;
  
  wherein the assigning comprises:
  
  identifying at least one first portion of the at least one token of the plurality of tokens that differs from at least one corresponding first portion of at least one other token of the plurality of tokens, and at least one second portion of the at least one token that does not differ from at least one corresponding second portion of the at least one other token; and
  
  assigning contrastive stress to be carried by at least one first portion of the audio speech output corresponding to the identified at least one first portion of the at least one token, but not to at least one second portion of the audio speech output corresponding to the identified at least one second portion of the at least one token.
- - 2. The method of claim 1, wherein the same text normalization type is selected from the group consisting of:
    - an alphanumeric sequence type, an address type, a Boolean value type, a currency type, a date type, a digit sequence type, a fractional number type, a proper name type, a number type, an ordinal number type, a telephone number type, a flight number type, a state name type, a street name type, a street number type, a time type and a zipcode type.
  - 3. The method of claim 1, wherein the plurality of tokens are identified based at least in part on at least one indication in the text input that the contrastive stress pattern is desired in association with the plurality of tokens.
  - 4. The method of claim 3, wherein the at least one indication comprises at least one Speech Synthesis Markup Language tag.
  - 5. The method of claim 1, wherein identifying the plurality of tokens comprises:
    - tokenizing the text input;
      
      automatically identifying the text normalization type of the plurality of tokens; and
      
      automatically determining that the contrastive stress pattern is to be applied for the plurality of tokens.
  - 6. The method of claim 1, wherein the at least one token to be rendered with contrastive stress is identified based at least in part on an order of the plurality of tokens in the text input.
  - 7. The method of claim 1, wherein identifying the at least one token to be rendered with contrastive stress further comprises:
    - identifying at least one linking token in the text input indicating applicability of contrastive stress; and
      
      based at least in part on the at least one linking token, identifying the at least one token to be rendered with contrastive stress.
  - 8. The method of claim 7, wherein the at least one linking token comprises at least one sequence of one or more tokens selected from the group consisting of:
    - originally, but, is now, or, and, whereas, as opposed to, as compared with, as contrasted with, and versus.
  - 9. The method of claim 1, wherein the at least one first portion of the at least one token that differs from the at least one corresponding first portion of the at least one other token is identified based at least in part on a normalized orthography of the at least a portion of the text input.
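
The assigning step of claim 1 can be illustrated with a minimal sketch. It assumes that a "portion" is a whitespace-separated word and that portions correspond by position; the claims fix neither choice, and the function names `split_portions` and `mark_contrastive_portions` are hypothetical. The sketch also compares raw text, whereas claim 9 contemplates comparing a normalized orthography.

```python
from typing import List, Tuple


def split_portions(token: str) -> List[str]:
    """Split a token into portions (assumption: whitespace-separated words)."""
    return token.split()


def mark_contrastive_portions(token: str, other: str) -> List[Tuple[str, bool]]:
    """Pair each portion of `token` with a flag that is True when the portion
    differs from the portion at the same position in `other`."""
    own = split_portions(token)
    reference = split_portions(other)
    marked = []
    for index, portion in enumerate(own):
        # A portion with no counterpart in the other token counts as differing.
        counterpart = reference[index] if index < len(reference) else None
        marked.append((portion, portion != counterpart))
    return marked


# Two tokens of the same (date) text normalization type.
print(mark_contrastive_portions("March 14 2010", "March 12 2010"))
# → [('March', False), ('14', True), ('2010', False)]
```

Only the portion flagged `True` would carry contrastive stress; the unchanged portions would be rendered without it.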

10. Apparatus for providing speech output for a speech-enabled application, the apparatus comprising:
- a memory storing a plurality of processor-executable instructions; and
  
  at least one processor, operatively coupled to the memory, that executes the instructions to:
  
  receive from the speech-enabled application a text input comprising a text transcription of a desired speech output;
  
  generate an audio speech output corresponding to at least a portion of the text input, the audio speech output comprising at least one portion carrying contrastive stress to contrast with at least one other portion of the audio speech output; and
  
  provide the audio speech output for the speech-enabled application;
  
  wherein the at least one processor executes the instructions to generate the audio speech output at least in part by:
  
  identifying a plurality of tokens of the text input of a same text normalization type for which a contrastive stress pattern is to be applied;
  
  identifying at least one token of the plurality of tokens to be rendered with contrastive stress; and
  
  assigning contrastive stress to be carried by at least one portion of the audio speech output corresponding to at least one portion of the at least one token of the text input;
  
  wherein the at least one processor executes the instructions to perform the assigning at least in part by:
  
  identifying at least one first portion of the at least one token of the plurality of tokens that differs from at least one corresponding first portion of at least one other token of the plurality of tokens, and at least one second portion of the at least one token that does not differ from at least one corresponding second portion of the at least one other token; and
  
  assigning contrastive stress to be carried by at least one first portion of the audio speech output corresponding to the identified at least one first portion of the at least one token, but not to at least one second portion of the audio speech output corresponding to the identified at least one second portion of the at least one token.
- - 11. The apparatus of claim 10, wherein the same text normalization type is selected from the group consisting of:
    - an alphanumeric sequence type, an address type, a Boolean value type, a currency type, a date type, a digit sequence type, a fractional number type, a proper name type, a number type, an ordinal number type, a telephone number type, a flight number type, a state name type, a street name type, a street number type, a time type and a zipcode type.
  - 12. The apparatus of claim 10, wherein the at least one processor executes the instructions to identify the plurality of tokens based at least in part on at least one indication in the text input that the contrastive stress pattern is desired in association with the plurality of tokens.
  - 13. The apparatus of claim 12, wherein the at least one indication comprises at least one Speech Synthesis Markup Language tag.
  - 14. The apparatus of claim 10, wherein the at least one processor executes the instructions to identify the plurality of tokens at least in part by:
    - tokenizing the text input;
      
      automatically identifying the text normalization type of the plurality of tokens; and
      
      automatically determining that the contrastive stress pattern is to be applied for the plurality of tokens.
  - 15. The apparatus of claim 10, wherein the at least one processor executes the instructions to identify the at least one token to be rendered with contrastive stress based at least in part on an order of the plurality of tokens in the text input.
  - 16. The apparatus of claim 10, wherein the at least one processor executes the instructions to identify the at least one token to be rendered with contrastive stress at least in part by:
    - identifying at least one linking token in the text input indicating applicability of contrastive stress; and
      
      based at least in part on the at least one linking token, identifying the at least one token to be rendered with contrastive stress.
  - 17. The apparatus of claim 16, wherein the at least one linking token comprises at least one sequence of one or more tokens selected from the group consisting of:
    - originally, but, is now, or, and, whereas, as opposed to, as compared with, as contrasted with, and versus.
  - 18. The apparatus of claim 10, wherein the at least one processor executes the instructions to identify the at least one first portion of the at least one token that differs from the at least one corresponding first portion of the at least one other token based at least in part on a normalized orthography of the at least a portion of the text input.
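
The linking-token claims (7, 8, 16 and 17) can be sketched as a phrase search over the text input. The phrase list below is taken from claims 8 and 17; everything else is an assumption made for illustration, including the function names `find_linking_phrase` and `split_on_link`, the use of the earliest match, and the case-insensitive whole-word matching.

```python
import re
from typing import Optional, Tuple

# Linking phrases recited in claims 8 and 17.
LINKING_PHRASES = [
    "as contrasted with", "as compared with", "as opposed to",
    "is now", "originally", "whereas", "versus", "but", "and", "or",
]


def find_linking_phrase(text: str) -> Optional[Tuple[str, int, int]]:
    """Return (phrase, start, end) for the earliest linking phrase, or None."""
    best = None
    for phrase in LINKING_PHRASES:
        # \b keeps short phrases such as "or" from matching inside longer words.
        match = re.search(r"\b" + re.escape(phrase) + r"\b", text, flags=re.IGNORECASE)
        if match and (best is None or match.start() < best[1]):
            best = (phrase, match.start(), match.end())
    return best


def split_on_link(text: str) -> Optional[Tuple[str, str, str]]:
    """Split text into (before, linking phrase, after), or None if no phrase is found."""
    found = find_linking_phrase(text)
    if found is None:
        return None
    phrase, start, end = found
    return text[:start].strip(), phrase, text[end:].strip()


print(split_on_link("Flight 1234 as opposed to flight 1235"))
# → ('Flight 1234', 'as opposed to', 'flight 1235')
```

Which side of the linking phrase receives the stress is left open here. Claims 6 and 15 say only that the choice may depend on the order of the tokens in the text input.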

19. At least one non-transitory computer-readable storage medium encoded with a plurality of computer-executable instructions that, when executed, perform a method for providing speech output for a speech-enabled application, the method comprising:
- receiving from the speech-enabled application a text input comprising a text transcription of a desired speech output;
  
  generating an audio speech output corresponding to at least a portion of the text input, the audio speech output comprising at least one portion carrying contrastive stress to contrast with at least one other portion of the audio speech output; and
  
  providing the audio speech output for the speech-enabled application;
  
  wherein the generating comprises:
  
  identifying a plurality of tokens of the text input of a same text normalization type for which a contrastive stress pattern is to be applied;
  
  identifying at least one token of the plurality of tokens to be rendered with contrastive stress; and
  
  assigning contrastive stress to be carried by at least one portion of the audio speech output corresponding to at least one portion of the at least one token of the text input;
  
  wherein the assigning comprises:
  
  identifying at least one first portion of the at least one token of the plurality of tokens that differs from at least one corresponding first portion of at least one other token of the plurality of tokens, and at least one second portion of the at least one token that does not differ from at least one corresponding second portion of the at least one other token; and
  
  assigning contrastive stress to be carried by at least one first portion of the audio speech output corresponding to the identified at least one first portion of the at least one token, but not to at least one second portion of the audio speech output corresponding to the identified at least one second portion of the at least one token.
- - 20. The at least one non-transitory computer-readable storage medium of claim 19, wherein the same text normalization type is selected from the group consisting of:
    - an alphanumeric sequence type, an address type, a Boolean value type, a currency type, a date type, a digit sequence type, a fractional number type, a proper name type, a number type, an ordinal number type, a telephone number type, a flight number type, a state name type, a street name type, a street number type, a time type and a zipcode type.
  - 21. The at least one non-transitory computer-readable storage medium of claim 19, wherein the plurality of tokens are identified based at least in part on at least one indication in the text input that the contrastive stress pattern is desired in association with the plurality of tokens.
  - 22. The at least one non-transitory computer-readable storage medium of claim 21, wherein the at least one indication comprises at least one Speech Synthesis Markup Language tag.
  - 23. The at least one non-transitory computer-readable storage medium of claim 19, wherein identifying the plurality of tokens comprises:
    - tokenizing the text input;
      
      automatically identifying the text normalization type of the plurality of tokens; and
      
      automatically determining that the contrastive stress pattern is to be applied for the plurality of tokens.
  - 24. The at least one non-transitory computer-readable storage medium of claim 19, wherein the at least one token to be rendered with contrastive stress is identified based at least in part on an order of the plurality of tokens in the text input.
  - 25. The at least one non-transitory computer-readable storage medium of claim 19, wherein identifying the at least one token to be rendered with contrastive stress further comprises:
    - identifying at least one linking token in the text input indicating applicability of contrastive stress; and
      
      based at least in part on the at least one linking token, identifying the at least one token to be rendered with contrastive stress.
  - 26. The at least one non-transitory computer-readable storage medium of claim 25, wherein the at least one linking token comprises at least one sequence of one or more tokens selected from the group consisting of:
    - originally, but, is now, or, and, whereas, as opposed to, as compared with, as contrasted with, and versus.
  - 27. The at least one non-transitory computer-readable storage medium of claim 19, wherein the at least one first portion of the at least one token that differs from the at least one corresponding first portion of the at least one other token is identified based at least in part on a normalized orthography of the at least a portion of the text input.
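
The automatic path of claims 5, 14 and 23 (tokenize, identify the text normalization type, decide whether a contrastive stress pattern applies) might look like the following sketch. The regular expressions cover only a few of the recited types and are deliberately simplistic. The rule that a pattern applies when two or more tokens share a type is an assumption, and `TYPE_PATTERNS`, `classify` and `tokens_for_contrast` are hypothetical names.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical, simplified recognizers for a few of the recited types.
TYPE_PATTERNS = [
    ("time", re.compile(r"^\d{1,2}:\d{2}$")),
    ("currency", re.compile(r"^\$\d+(\.\d{2})?$")),
    ("date", re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")),
    ("number", re.compile(r"^\d+$")),
]


def classify(token: str) -> Optional[str]:
    """Return the first matching normalization type name, or None."""
    for name, pattern in TYPE_PATTERNS:
        if pattern.match(token):
            return name
    return None


def tokens_for_contrast(text: str) -> Dict[str, List[str]]:
    """Group typed tokens and keep only types that occur at least twice."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for raw in text.split():          # tokenize on whitespace
        token = raw.strip(".,;")      # drop trailing sentence punctuation
        kind = classify(token)
        if kind is not None:
            groups[kind].append(token)
    # Assumed rule: a contrastive pattern needs two or more same-type tokens.
    return {kind: found for kind, found in groups.items() if len(found) >= 2}


print(tokens_for_contrast("The fare changed from $120.00 to $125.00 on 08/09/2010."))
# → {'currency': ['$120.00', '$125.00']}
```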

28. A method for providing speech output via a speech-enabled application, the method comprising:
- generating, using at least one computer system executing the speech-enabled application, a text input comprising a text transcription of a desired speech output, the text input comprising a plurality of tokens of a same text normalization type for which a contrastive stress pattern is to be applied, at least one token of the plurality of tokens comprising at least one first portion that differs from at least one corresponding first portion of at least one other token of the plurality of tokens, and at least one second portion that does not differ from at least one corresponding second portion of the at least one other token;
  
  inputting the text input to at least one speech synthesis engine configured to assign contrastive stress to be carried by at least one first portion of an audio speech output corresponding to the at least one first portion of the at least one token, but not to at least one second portion of the audio speech output corresponding to the at least one second portion of the at least one token;
  
  receiving the audio speech output from the at least one speech synthesis engine; and
  
  providing the audio speech output to at least one user of the speech-enabled application.
- - 29. The method of claim 28, wherein the generating comprises including in the text input at least one indication that a contrastive stress pattern is desired in association with at least one portion of the text input.
  - 30. The method of claim 29, wherein the at least one indication comprises at least one Speech Synthesis Markup Language tag.
  - 31. The method of claim 29, wherein the generating further comprises identifying a plurality of fields of the text input of a same text normalization type for which the contrastive stress pattern is desired.
  - 32. The method of claim 31, wherein the same text normalization type is selected from the group consisting of:
    - an alphanumeric sequence type, an address type, a Boolean value type, a currency type, a date type, a digit sequence type, a fractional number type, a proper name type, a number type, an ordinal number type, a telephone number type, a flight number type, a state name type, a street name type, a street number type, a time type and a zipcode type.
  - 33. The method of claim 31, wherein the at least one indication comprises specific identification of at least one portion of the text input that is to be rendered to carry contrastive stress.
  - 34. The method of claim 33, wherein the generating further comprises identifying the at least one portion of the text input that is to be rendered to carry contrastive stress as at least one portion of at least one field of the plurality of fields that differs from at least one corresponding portion of at least one other field of the plurality of fields.
  - 35. The method of claim 34, wherein identifying the at least one portion of the text input that is to be rendered to carry contrastive stress is performed by passing the plurality of fields to a function to identify the at least one portion that is to be rendered to carry contrastive stress.
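
On the application side, claims 29 to 35 describe passing fields to a function that identifies the differing portion and marks it in the text input, for example with a Speech Synthesis Markup Language tag. The sketch below assumes the standard SSML `<emphasis>` element as a stand-in marker; the claims do not name a specific tag, and a given synthesis engine may not render `<emphasis>` as contrastive stress. The function name `contrastive_ssml` and its parameters are hypothetical.

```python
from xml.sax.saxutils import escape


def contrastive_ssml(prefix: str, old_field: str, link: str, new_field: str) -> str:
    """Build an SSML string in which the portions of `new_field` that differ
    from the same-position portions of `old_field` are wrapped in <emphasis>."""
    old_parts = old_field.split()
    new_parts = new_field.split()
    rendered = []
    for index, part in enumerate(new_parts):
        # A portion differs if it has no counterpart or the counterpart is different.
        differs = index >= len(old_parts) or part != old_parts[index]
        text = escape(part)
        if differs:
            rendered.append('<emphasis level="strong">{}</emphasis>'.format(text))
        else:
            rendered.append(text)
    return "<speak>{} {} {} {}</speak>".format(
        escape(prefix), escape(old_field), escape(link), " ".join(rendered)
    )


print(contrastive_ssml("Departure moved from", "March 12", "to", "March 14"))
# → <speak>Departure moved from March 12 to March <emphasis level="strong">14</emphasis></speak>
```

The resulting string would then be input to the speech synthesis engine, which remains responsible for producing the audio speech output.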

36. Apparatus for providing speech output via a speech-enabled application, the apparatus comprising:
- a memory storing a plurality of processor-executable instructions; and
  
  at least one processor, operatively coupled to the memory, that executes the instructions to:
  
  generate a text input comprising a text transcription of a desired speech output, the text input comprising a plurality of tokens of a same text normalization type for which a contrastive stress pattern is to be applied, at least one token of the plurality of tokens comprising at least one first portion that differs from at least one corresponding first portion of at least one other token of the plurality of tokens, and at least one second portion that does not differ from at least one corresponding second portion of the at least one other token;
  
  input the text input to at least one speech synthesis engine configured to assign contrastive stress to be carried by at least one first portion of an audio speech output corresponding to the at least one first portion of the at least one token, but not to at least one second portion of the audio speech output corresponding to the at least one second portion of the at least one token;
  
  receive the audio speech output from the at least one speech synthesis engine; and
  
  provide the audio speech output to at least one user of the speech-enabled application.
- - 37. The apparatus of claim 36, wherein the at least one processor executes the instructions to generate the text input at least in part by including in the text input at least one indication that a contrastive stress pattern is desired in association with at least one portion of the text input.
  - 38. The apparatus of claim 37, wherein the at least one indication comprises at least one Speech Synthesis Markup Language tag.
  - 39. The apparatus of claim 37, wherein the at least one processor executes the instructions to generate the text input at least in part by identifying a plurality of fields of the text input of a same text normalization type for which the contrastive stress pattern is desired.
  - 40. The apparatus of claim 39, wherein the same text normalization type is selected from the group consisting of:
    - an alphanumeric sequence type, an address type, a Boolean value type, a currency type, a date type, a digit sequence type, a fractional number type, a proper name type, a number type, an ordinal number type, a telephone number type, a flight number type, a state name type, a street name type, a street number type, a time type and a zipcode type.
  - 41. The apparatus of claim 39, wherein the at least one indication comprises specific identification of at least one portion of the text input that is to be rendered to carry contrastive stress.
  - 42. The apparatus of claim 41, wherein the at least one processor executes the instructions to generate the text input at least in part by identifying the at least one portion of the text input that is to be rendered to carry contrastive stress as at least one portion of at least one field of the plurality of fields that differs from at least one corresponding portion of at least one other field of the plurality of fields.
  - 43. The apparatus of claim 42, wherein the at least one processor executes the instructions to identify the at least one portion of the text input that is to be rendered to carry contrastive stress at least in part by passing the plurality of fields to a function to identify the at least one portion that is to be rendered to carry contrastive stress.

44. At least one non-transitory computer-readable storage medium encoded with a plurality of computer-executable instructions that, when executed, perform a method for providing speech output via a speech-enabled application, the method comprising:
- generating a text input comprising a text transcription of a desired speech output, the text input comprising a plurality of tokens of a same text normalization type for which a contrastive stress pattern is to be applied, at least one token of the plurality of tokens comprising at least one first portion that differs from at least one corresponding first portion of at least one other token of the plurality of tokens, and at least one second portion that does not differ from at least one corresponding second portion of the at least one other token;
  
  inputting the text input to at least one speech synthesis engine configured to assign contrastive stress to be carried by at least one first portion of an audio speech output corresponding to the at least one first portion of the at least one token, but not to at least one second portion of the audio speech output corresponding to the at least one second portion of the at least one token;
  
  receiving the audio speech output from the at least one speech synthesis engine; and
  
  providing the audio speech output to at least one user of the speech-enabled application.
- - 45. The at least one non-transitory computer-readable storage medium of claim 44, wherein the generating comprises including in the text input at least one indication that a contrastive stress pattern is desired in association with at least one portion of the text input.
  - 46. The at least one non-transitory computer-readable storage medium of claim 45, wherein the at least one indication comprises at least one Speech Synthesis Markup Language tag.
  - 47. The at least one non-transitory computer-readable storage medium of claim 45, wherein the generating further comprises identifying a plurality of fields of the text input of a same text normalization type for which the contrastive stress pattern is desired.
  - 48. The at least one non-transitory computer-readable storage medium of claim 47, wherein the same text normalization type is selected from the group consisting of:
    - an alphanumeric sequence type, an address type, a Boolean value type, a currency type, a date type, a digit sequence type, a fractional number type, a proper name type, a number type, an ordinal number type, a telephone number type, a flight number type, a state name type, a street name type, a street number type, a time type and a zipcode type.
  - 49. The at least one non-transitory computer-readable storage medium of claim 47, wherein the at least one indication comprises specific identification of at least one portion of the text input that is to be rendered to carry contrastive stress.
  - 50. The at least one non-transitory computer-readable storage medium of claim 49, wherein the generating further comprises identifying the at least one portion of the text input that is to be rendered to carry contrastive stress as at least one portion of at least one field of the plurality of fields that differs from at least one corresponding portion of at least one other field of the plurality of fields.
  - 51. The at least one non-transitory computer-readable storage medium of claim 50, wherein identifying the at least one portion of the text input that is to be rendered to carry contrastive stress is performed by passing the plurality of fields to a function to identify the at least one portion that is to be rendered to carry contrastive stress.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Meyer, Darren C.; Springer, Stephen R.
Primary Examiner(s): Colucci, Michael C.
Application Number: US12/853,086
Publication Number: US 20110202346A1
Time in Patent Office: 1,177 Days
Field of Search: 704/260, 704/270.1, 704/9, 704/258, 704/209, 704/234, 704/235, 704/271, 704/275, 434/178, 434/236
US Class Current: 704/260
CPC Class Codes:
G10L 13/02 Methods for producing synth...
G10L 13/10 Prosody rules derived from ...
