Inserting breath sounds into text-to-speech output

US 9,508,338 B1
Filed: 11/15/2013
Issued: 11/29/2016
Est. Priority Date: 11/15/2013
Status: Active Grant

Abstract

A text-to-speech (TTS) system may be configured to incorporate breath sounds in the output speech. By incorporating breath sounds into speech output from text, a TTS system may be able to mimic more natural-sounding human speech, particularly for long-form narration of text longer than short phrases. The breath sounds may be stored as units for unit selection or may be generated during parametric synthesis. The acoustic features of the breath sounds and the duration between breaths may depend upon the punctuation of the text, the linguistic distance between breaths, the breaks between intonational phrases, the linguistic context of the breaths, and other factors.
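
To make the idea of punctuation-driven breath placement concrete, the following is a minimal sketch, not the patented implementation. It assumes that candidate breath locations fall directly after punctuation marks and that "linguistic distance" can be approximated as a word count between consecutive candidates. The names `BREATH_PUNCTUATION`, `find_breath_locations`, and `linguistic_distances` are hypothetical and do not appear in the patent.

```python
# Assumption: these marks are treated as possible breath points. The abstract
# only says placement "may depend upon the punctuation" and names no set.
BREATH_PUNCTUATION = {".", ",", ";", ":", "?", "!"}


def find_breath_locations(text):
    """Return (word_index, punctuation) pairs marking where a breath could go.

    word_index is the number of words that precede the candidate breath.
    """
    locations = []
    for index, token in enumerate(text.split()):
        if token[-1] in BREATH_PUNCTUATION:
            locations.append((index + 1, token[-1]))
    return locations


def linguistic_distances(locations):
    """Approximate linguistic distance as words between consecutive breaths."""
    return [later[0] - earlier[0]
            for earlier, later in zip(locations, locations[1:])]
```

For the text "Hello there, friend. How are you today?" this sketch yields candidate locations after words 2, 3, and 7, with distances of 1 and 4 words. A real system would likely measure distance in syllables, phonemes, or phrases rather than whitespace-separated words.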

23 Claims

1. A computer-implemented method of generating speech including audible breath sounds, the method comprising:
- receiving input text for text-to-speech (TTS) processing;
  
  identifying punctuation in the input text;
  
  determining a first location in the input text for insertion of a breath sound based at least in part on the punctuation;
  
  determining a second location in the input text for insertion of a breath sound based at least in part on the punctuation;
  
  determining a linguistic distance between the first location and second location;
  
  using a cost function to identify a first breath unit for the first location, the cost function based at least in part on the identified punctuation, the linguistic distance between the first location and second location, and a linguistic context of the first location;
  
  using the cost function to identify a second breath unit for the second location, the cost function based at least in part on the identified punctuation, the linguistic distance between the first location and second location, and a linguistic context of the second location; and
  
  synthesizing speech corresponding to the input text, wherein the synthesized speech comprises a first breath sound corresponding to the first breath unit at the first location and a second breath sound corresponding to the second breath unit at the second location.
- - 2. The computer-implemented method of claim 1, further comprising identifying an intonational phrase in the input text, wherein:
    - the first location occurs at a beginning of the intonational phrase;
      
      the second location occurs at an end of the intonational phrase;
      
      using the cost function to identify the first breath unit is further based at least in part on linguistic features of the intonational phrase; and
      
      using the cost function to identify the second breath unit is further based at least in part on the linguistic features of the intonational phrase.
  - 3. The computer-implemented method of claim 1, further comprising determining a rate of speech based at least in part on a duration of the first breath sound, and wherein synthesizing the speech is based at least in part on the determined rate of speech.
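
The cost-function step in claim 1 can be illustrated with a small sketch. This is an assumption-laden example rather than the method the patent specifies: the claim says only that the cost function is based on the identified punctuation, the linguistic distance between locations, and the linguistic context. Here the cost is the gap between a stored breath unit's duration and a target duration, and linguistic context is reduced to a single `phrase_final` flag. `BreathUnit`, `BASE_MS`, the numeric weights, and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreathUnit:
    """A stored breath recording, described only by its duration (assumed)."""
    name: str
    duration_ms: int


# Hypothetical base breath durations per punctuation mark, in milliseconds.
BASE_MS = {",": 200, ";": 300, ".": 450}


def target_duration_ms(punctuation, distance_words, phrase_final):
    """Estimate how long a breath 'should' be at this location."""
    target = BASE_MS.get(punctuation, 250)
    # Assumption: more speech since the previous breath calls for a longer breath.
    target += 10 * distance_words
    # Assumption: a breath at the end of an intonational phrase is slightly longer.
    if phrase_final:
        target += 50
    return target


def breath_cost(unit, punctuation, distance_words, phrase_final):
    """Cost of a unit: absolute mismatch between its duration and the target."""
    return abs(unit.duration_ms
               - target_duration_ms(punctuation, distance_words, phrase_final))


def select_breath_unit(units, punctuation, distance_words, phrase_final):
    """Pick the lowest-cost breath unit for one location."""
    return min(units, key=lambda unit: breath_cost(
        unit, punctuation, distance_words, phrase_final))
```

Calling `select_breath_unit` once per location, with that location's punctuation and context, mirrors the claim's use of one cost function for both the first and second breath units. A production unit-selection system would typically combine several weighted target and join costs instead of a single duration mismatch.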

4. A computing system, comprising:
- at least one processor;
  
  a memory device including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor:
  
  to receive input text;
  
  to identify a location in the input text for a breath;
  
  to identify a duration for the breath in the location;
  
  to determine a breath sound for the location, the breath sound determined from a plurality of breath sounds; and
  
  to synthesize speech corresponding to the input text using the duration and data corresponding to the breath sound, the synthesized speech comprising the breath sound at substantially the location.
- - 5. The computing system of claim 4, wherein the at least one processor is further configured to identify punctuation in the input text, wherein the instructions further comprise instructions to configure the at least one processor to identify the location based at least in part on the identified punctuation.
  - 6. The computing system of claim 4, wherein the instructions further comprise instructions to configure the at least one processor to identify an intonational phrase in the input text, wherein the at least one processor is configured to identify the location based at least in part on the intonational phrase.
  - 7. The computing system of claim 4, wherein the plurality of breath sounds are represented by a plurality of parametric models of breath sounds and wherein the instructions further comprise instructions to configure the at least one processor to synthesize the breath sound using a vocoder.
  - 8. The computing system of claim 4, wherein the plurality of breath sounds are represented by a plurality of pre-recorded breath sounds and wherein the instructions further comprise instructions to configure the at least one processor to synthesize the breath sound using unit selection of the plurality of pre-recorded breath sounds.
  - 9. The computing system of claim 4, wherein the instructions further comprise instructions to configure the at least one processor to identify a second location in the input text for a second breath, and wherein the synthesized speech comprises a second breath sound at the second location.
  - 10. The computing system of claim 9, wherein the instructions further comprise instructions to configure the at least one processor to determine a distance between the identified location and second location, and wherein the breath sound is based at least in part on the distance.
  - 11. The computing system of claim 9, wherein the instructions further comprise instructions to configure the at least one processor to determine a distance between the identified location and second location, and wherein the second breath sound is based at least in part on the distance.
  - 12. The computing system of claim 9, wherein the second location is based at least in part on a duration of the breath sound.
  - 13. The computing system of claim 4, wherein the synthesized speech comprises spoken speech that overlaps at least a portion of the breath sound.
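
Claim 13 describes spoken speech that overlaps part of the breath sound. One way to picture this is the sketch below, which assumes audio is held as plain lists of float samples and that overlapping regions are mixed by simple sample summation. The patent does not state how the overlap is produced; the function `insert_breath` and its parameters are hypothetical.

```python
def insert_breath(before, breath, after, overlap):
    """Join speech, a breath, and more speech, letting the tail of the breath
    overlap the start of the following speech by `overlap` samples.

    Overlapping samples are summed (an assumed mixing strategy).
    """
    # Clamp the overlap so it never exceeds either signal's length.
    overlap = max(0, min(overlap, len(breath), len(after)))
    out = list(before) + list(breath)
    start = len(out) - overlap
    for offset, sample in enumerate(after):
        position = start + offset
        if position < len(out):
            out[position] += sample   # region where breath and speech coincide
        else:
            out.append(sample)        # speech continuing past the breath
    return out
```

With `overlap=0` the breath is simply placed between the two speech segments. With a positive overlap the output is shorter by that many samples, because the following speech begins before the breath has finished. A real synthesizer would likely apply a fade or gain adjustment in the overlapping region to avoid clipping.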

14. A non-transitory computer-readable storage medium storing processor-executable instructions for controlling a computing device, comprising:
- program code to receive input text;
  
  program code to identify a location in the input text for a breath;
  
  program code to identify a duration for the breath in the location;
  
  program code to determine a breath sound for the location, the breath sound determined from a plurality of breath sounds; and
  
  program code to synthesize speech corresponding to the input text using the duration and data corresponding to the breath sound, the synthesized speech comprising the breath sound at substantially the location.
- - 15. The non-transitory computer-readable storage medium of claim 14, further comprising program code to identify punctuation in the input text, wherein the program code to identify the location is based at least in part on the identified punctuation.
  - 16. The non-transitory computer-readable storage medium of claim 14, further comprising program code to identify an intonational phrase in the input text, wherein the program code to identify the location is based at least in part on the intonational phrase.
  - 17. The non-transitory computer-readable storage medium of claim 14, wherein the plurality of breath sounds are represented by a plurality of parametric models of breath sounds and wherein the non-transitory computer-readable storage medium further comprises program code to synthesize the breath sound using a vocoder.
  - 18. The non-transitory computer-readable storage medium of claim 14, wherein the plurality of breath sounds are represented by a plurality of pre-recorded breath sounds and wherein the non-transitory computer-readable storage medium further comprises program code to synthesize the breath sound using unit selection of the plurality of pre-recorded breath sounds.
  - 19. The non-transitory computer-readable storage medium of claim 14, further comprising program code to identify a second location in the input text for a second breath, and wherein the synthesized speech comprises a second breath sound at the second location.
  - 20. The non-transitory computer-readable storage medium of claim 19, further comprising program code to determine a distance between the identified location and second location, and wherein the breath sound is based at least in part on the distance.
  - 21. The non-transitory computer-readable storage medium of claim 19, further comprising program code to determine a distance between the identified location and second location, and wherein the second breath sound is based at least in part on the distance.
  - 22. The non-transitory computer-readable storage medium of claim 19, wherein the second location is based at least in part on a duration of the breath sound.
  - 23. The non-transitory computer-readable storage medium of claim 14, wherein the synthesized speech comprises spoken speech that overlaps at least a portion of the breath sound.

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Tegi, Maciej; Czuczman, Michal; Mois, Remus Razvan; Kaszczuk, Michal Tadeusz
Primary Examiner(s): Opsasnick, Michael N
Application Number: US14/081,233
Time in Patent Office: 1,110 Days
Field of Search: 704/266
US Class Current: 1/1
CPC Class Codes:
G10L 13/02   Methods for producing synth...
G10L 13/06   Elementary speech units use...
G10L 2013/083   Special characters, e.g. pu...
