Method and apparatus for generating synthetic speech with contrastive stress

US 8,825,486 B2
Filed: 01/22/2014
Issued: 09/02/2014
Est. Priority Date: 02/12/2010
Status: Active Grant

7 Assignments

0 Petitions

Abstract

Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings.

20 Claims

1. A method for use with a speech-enabled application, the method comprising:
- receiving, from the speech-enabled application, input comprising a plurality of text strings;
  
  identifying a first portion of a first text string of the plurality of text strings as differing from a corresponding first portion of a second text string of the plurality of text strings, and a second portion of the first text string as not differing from a corresponding second portion of the second text string;
  
  assigning contrastive stress to the first portion of the first text string and/or to the corresponding first portion of the second text string, but not to the second portion of the first text string, and not to the corresponding second portion of the second text string;
  
  generating, using at least one computer system, speech synthesis output to render the plurality of text strings as speech having the assigned contrastive stress; and
  
  providing the speech synthesis output for the speech-enabled application.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, wherein the identifying comprises identifying the first portion of the first text string as differing from the corresponding first portion of the second text string based at least in part on a normalized orthography of the first and second text strings.
  - 3. The method of claim 1, wherein the first and second text strings represent different numerical fields within a larger text string.
  - 4. The method of claim 3, wherein the numerical fields are selected from the group consisting of:
    - currency fields, date fields, digit sequence fields, number fields, fractional number fields, ordinal number fields, telephone number fields, flight number fields, street number fields, time fields, and zipcode fields.
  - 5. The method of claim 1, wherein the receiving comprises receiving the first and second text strings as first and second parameters passed to a function called by the speech-enabled application to render the first and second text strings with a contrastive stress pattern.
  - 6. The method of claim 1, wherein the speech synthesis output comprises identification of a plurality of audio recordings to render the plurality of text strings as speech, at least one of the plurality of audio recordings being selected to render the first portion of the first text string and/or the first portion of the second text string as speech carrying contrastive stress.
  - 7. The method of claim 1, wherein the speech synthesis output comprises an indication of the first portion of the first text string and/or the corresponding first portion of the second text string as being assigned contrastive stress.
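
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the word-level comparison via the standard library's `difflib.SequenceMatcher`, the function names `assign_contrastive_stress` and `to_markup`, and the asterisk notation for marking stressed words are all assumptions made for illustration. The two-parameter function loosely mirrors the calling pattern of claim 5, and the marked-up output is one possible form of the "indication" in claim 7.

```python
from difflib import SequenceMatcher


def assign_contrastive_stress(first, second):
    """Return two lists of (token, stressed) pairs, one per input string.

    Tokens are whitespace-separated words. A token is marked stressed when
    it lies in a span that differs between the two strings (assumed rule).
    """
    a, b = first.split(), second.split()
    stressed_a = [False] * len(a)
    stressed_b = [False] * len(b)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # portions that do not differ receive no stress
        for i in range(i1, i2):
            stressed_a[i] = True
        for j in range(j1, j2):
            stressed_b[j] = True
    return list(zip(a, stressed_a)), list(zip(b, stressed_b))


def to_markup(tagged):
    """Render (token, stressed) pairs, wrapping stressed tokens in asterisks."""
    return " ".join(f"*{tok}*" if stressed else tok for tok, stressed in tagged)


first, second = assign_contrastive_stress(
    "flight 1432 to Boston", "flight 1632 to Boston"
)
print(to_markup(first))   # flight *1432* to Boston
print(to_markup(second))  # flight *1632* to Boston
```

This sketch stresses the differing portion in both strings, which is one of the options the claim's "and/or" wording permits; stressing only the second string's portion would be an equally valid variant.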

8. At least one non-transitory computer-readable storage medium encoded with a plurality of computer-executable instructions that, when executed, perform a method for use with a speech-enabled application, the method comprising:
- receiving, from the speech-enabled application, input comprising a plurality of text strings;
  
  identifying a first portion of a first text string of the plurality of text strings as differing from a corresponding first portion of a second text string of the plurality of text strings, and a second portion of the first text string as not differing from a corresponding second portion of the second text string;
  
  assigning contrastive stress to the first portion of the first text string and/or to the corresponding first portion of the second text string, but not to the second portion of the first text string, and not to the corresponding second portion of the second text string;
  
  generating speech synthesis output to render the plurality of text strings as speech having the assigned contrastive stress; and
  
  providing the speech synthesis output for the speech-enabled application.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The at least one non-transitory computer-readable storage medium of claim 8, wherein the identifying comprises identifying the first portion of the first text string as differing from the corresponding first portion of the second text string based at least in part on a normalized orthography of the first and second text strings.
  - 10. The at least one non-transitory computer-readable storage medium of claim 8, wherein the first and second text strings represent different numerical fields within a larger text string.
  - 11. The at least one non-transitory computer-readable storage medium of claim 10, wherein the numerical fields are selected from the group consisting of:
    - currency fields, date fields, digit sequence fields, number fields, fractional number fields, ordinal number fields, telephone number fields, flight number fields, street number fields, time fields, and zipcode fields.
  - 12. The at least one non-transitory computer-readable storage medium of claim 8, wherein the receiving comprises receiving the first and second text strings as first and second parameters passed to a function called by the speech-enabled application to render the first and second text strings with a contrastive stress pattern.
  - 13. The at least one non-transitory computer-readable storage medium of claim 8, wherein the speech synthesis output comprises identification of a plurality of audio recordings to render the plurality of text strings as speech, at least one of the plurality of audio recordings being selected to render the first portion of the first text string and/or the first portion of the second text string as speech carrying contrastive stress.
  - 14. The at least one non-transitory computer-readable storage medium of claim 8, wherein the speech synthesis output comprises an indication of the first portion of the first text string and/or the corresponding first portion of the second text string as being assigned contrastive stress.
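
Claim 13 describes speech synthesis output that identifies audio recordings, with at least one recording selected to carry contrastive stress. The sketch below shows one way such a selection could work, assuming a lookup table keyed by word and stress flag. The `RECORDINGS` inventory, the file names, and the fallback to a neutral recording are hypothetical and do not come from the patent.

```python
# Hypothetical inventory: (word, stressed?) -> recording file name.
RECORDINGS = {
    ("gate", False): "gate_neutral.wav",
    ("twelve", False): "twelve_neutral.wav",
    ("twelve", True): "twelve_stressed.wav",
    ("fourteen", False): "fourteen_neutral.wav",
    ("fourteen", True): "fourteen_stressed.wav",
}


def select_recordings(tagged_tokens):
    """Map (token, stressed) pairs to recording file names.

    Falls back to the neutral recording when no stressed variant exists
    (an assumed policy) and raises KeyError if the word has no recording.
    """
    chosen = []
    for token, stressed in tagged_tokens:
        key = (token.lower(), stressed)
        if key not in RECORDINGS:
            key = (token.lower(), False)
        chosen.append(RECORDINGS[key])
    return chosen


playlist = select_recordings([("gate", False), ("fourteen", True)])
print(playlist)  # ['gate_neutral.wav', 'fourteen_stressed.wav']
```

The resulting list is only an identification of recordings; concatenating and playing them would be left to the speech-enabled application or a downstream component.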

15. A method for generating speech output via a speech-enabled application, the method comprising:
- generating, using at least one computer system executing the speech-enabled application, a plurality of text strings, each of the plurality of text strings corresponding to a portion of a desired speech output;
  
  inputting the plurality of text strings to at least one software module configured to identify a first portion of a first text string of the plurality of text strings as differing from a corresponding first portion of a second text string of the plurality of text strings, and a second portion of the first text string as not differing from a corresponding second portion of the second text string;
  
  receiving, from the at least one software module, speech synthesis output to render the plurality of text strings with contrastive stress assigned to the first portion of the first text string and/or to the corresponding first portion of the second text string, but not to the second portion of the first text string, and not to the corresponding second portion of the second text string; and
  
  generating, using the speech synthesis output, an audio speech output corresponding to the desired speech output.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The method of claim 15, wherein the at least one software module is configured to identify the first portion of the first text string as differing from the corresponding first portion of the second text string based at least in part on a normalized orthography of the first and second text strings.
  - 17. The method of claim 15, wherein the first and second text strings represent different numerical fields within a larger text string.
  - 18. The method of claim 17, wherein the numerical fields are selected from the group consisting of:
    - currency fields, date fields, digit sequence fields, number fields, fractional number fields, ordinal number fields, telephone number fields, flight number fields, street number fields, time fields, and zipcode fields.
  - 19. The method of claim 15, wherein the inputting comprises passing the first and second text strings as first and second parameters to a function called by the speech-enabled application to render the first and second text strings with a contrastive stress pattern.
  - 20. The method of claim 15, wherein the speech synthesis output comprises identification of a plurality of audio recordings to render the plurality of text strings as speech, at least one of the plurality of audio recordings being selected to render the first portion of the first text string and/or the first portion of the second text string as speech carrying contrastive stress.
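
Claims 16 to 18 compare numerical fields, such as time fields, on the basis of a normalized orthography. The claims do not define the normalization, so the sketch below assumes one plausible reading: each time field is expanded into its spoken-form words before the two fields are compared. The helper names and the spelling rules (for example "oh" for a leading zero in the minutes) are illustrative assumptions.

```python
from difflib import SequenceMatcher

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}


def number_to_words(n):
    """Spell out an integer in the range 0..59 as a list of words."""
    if n < 20:
        return [ONES[n]]
    tens, ones = divmod(n, 10)
    return [TENS[tens]] + ([ONES[ones]] if ones else [])


def normalize_time(text):
    """Expand 'H:MM' into spoken-form words, e.g. '3:15' -> three, fifteen."""
    hours, minutes = (int(part) for part in text.split(":"))
    words = number_to_words(hours)
    if minutes == 0:
        return words + ["o'clock"]
    if minutes < 10:
        return words + ["oh"] + number_to_words(minutes)
    return words + number_to_words(minutes)


def differing_words(first, second):
    """Return the spoken words that differ between two time fields."""
    a, b = normalize_time(first), normalize_time(second)
    ops = SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()
    diff_a = [w for tag, i1, i2, _, _ in ops if tag != "equal" for w in a[i1:i2]]
    diff_b = [w for tag, _, _, j1, j2 in ops if tag != "equal" for w in b[j1:j2]]
    return diff_a, diff_b


print(differing_words("3:15", "3:50"))  # (['fifteen'], ['fifty'])
print(differing_words("7:45", "7:55"))  # (['forty'], ['fifty'])
```

Comparing the spoken forms rather than the raw digits matters in the second example: the written fields differ in a single digit, but the word that would carry contrastive stress is "forty" or "fifty", while the shared "five" stays unstressed.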

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Meyer, Darren C.; Springer, Stephen R.
Primary Examiner(s): COLUCCI, MICHAEL C
Application Number: US14/161,535
Publication Number: US 20140129230A1
Time in Patent Office: 223 Days
Field of Search: 704/271, 704/260, 704/258, 704/234, 704/209, 434/236, 434/178
US Class Current: 704/260
CPC Class Codes:
- G10L 13/00   Speech synthesis; Text to s...
- G10L 13/02   Methods for producing synth...
- G10L 13/033   Voice editing, e.g. manipul...
- G10L 13/04   Details of speech synthesis...
