Systems and methods for multi-style speech synthesis

US 9,570,065 B2
Filed: 09/29/2014
Issued: 02/14/2017
Est. Priority Date: 09/29/2014
Status: Active Grant

Abstract

Techniques for performing multi-style speech synthesis. The techniques include using at least one computer hardware processor to perform: obtaining input comprising text and an identification of a first speaking style to use in rendering the text as speech; identifying a plurality of speech segments for use in rendering the text as speech, the identified plurality of speech segments comprising a first speech segment having the first speaking style and a second speech segment having a second speaking style different from the first speaking style; and rendering the text as speech having the first speaking style, at least in part, by using the identified plurality of speech segments.
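
The selection step summarized in the abstract can be illustrated with a small sketch. It is not the patented implementation: it assumes a per-segment target cost, a hand-written style-similarity table, and a simple additive penalty, none of which are specified in the text above. The names `Segment`, `STYLE_SIMILARITY`, `DEMO_INVENTORY`, `select_segments`, and `style_weight` are all hypothetical, and join costs between adjacent segments are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    unit: str           # phonetic unit covered by the segment (hypothetical labels)
    style: str          # speaking style the segment was recorded or synthesized in
    target_cost: float  # assumed mismatch against the requested unit; lower is better


# Assumed similarity scores in [0, 1]; 1.0 means the styles are identical.
STYLE_SIMILARITY = {
    ("neutral", "news"): 0.8,
    ("neutral", "excited"): 0.3,
}

# Tiny made-up inventory holding segments in three speaking styles.
DEMO_INVENTORY = [
    Segment("h-e", "neutral", 0.10),
    Segment("h-e", "news", 0.05),
    Segment("e-l", "neutral", 0.90),
    Segment("e-l", "news", 0.20),
    Segment("e-l", "excited", 0.10),
]


def style_similarity(desired: str, actual: str) -> float:
    """Look up the similarity of two styles; unknown pairs count as dissimilar."""
    if desired == actual:
        return 1.0
    return STYLE_SIMILARITY.get(
        (desired, actual), STYLE_SIMILARITY.get((actual, desired), 0.0)
    )


def select_segments(units, desired_style, inventory, style_weight=1.0):
    """Pick one segment per unit, penalizing styles unlike the desired one."""
    chosen = []
    for unit in units:
        candidates = [seg for seg in inventory if seg.unit == unit]
        if not candidates:
            raise LookupError(f"no segment available for unit {unit!r}")
        best = min(
            candidates,
            key=lambda seg: seg.target_cost
            + style_weight * (1.0 - style_similarity(desired_style, seg.style)),
        )
        chosen.append(best)
    return chosen


picked = select_segments(["h-e", "e-l"], "neutral", DEMO_INVENTORY)
print([seg.style for seg in picked])
```

With this made-up inventory, a request for the "neutral" style takes the first unit from the neutral recordings and the second from the "news" recordings, because the neutral candidate for that unit has a high target cost and "news" is assumed to be close to "neutral". This mirrors the idea of rendering speech in one style while drawing on segments from another.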

20 Claims

1. A speech synthesis method, comprising:
- using at least one computer hardware processor to perform:
  - obtaining input comprising text and an identification of a desired speaking style to use in synthesizing the text as speech;
  - identifying a plurality of speech segments for use in synthesizing the text as speech, the identifying comprising:
    - identifying a first speech segment recorded and/or synthesized in a first speaking style based at least in part on a measure of similarity between the desired speaking style and the first speaking style; and
    - identifying a second speech segment recorded and/or synthesized in a second speaking style different from the first speaking style based at least in part on a measure of similarity between the desired speaking style and the second speaking style;
  - synthesizing speech from the text in the desired speaking style, at least in part, by using the first speech segment and the second speech segment; and
  - outputting the synthesized speech via at least one physical device.
  - 2. The speech synthesis method of claim 1, wherein the identifying comprises:
    - identifying the second speech segment based, at least in part, on how well acoustic characteristics of the second speech segment match acoustic characteristics associated with the desired speaking style.
  - 3. The speech synthesis method of claim 2, wherein the identifying the second speech segment is based, at least in part, on how well prosodic characteristics of the second speech segment match prosodic characteristics associated with the desired speaking style.
  - 4. The speech synthesis method of claim 2, wherein identifying the second speech segment comprises:
    - calculating a value indicative of how well the acoustic characteristics of the second speech segment match acoustic characteristics associated with the desired speaking style.
  - 5. The speech synthesis method of claim 4, wherein the calculating is performed based at least in part on a transformation from a second group of speech segments having the second speaking style to a first group of speech segments having the desired speaking style, wherein the second group of speech segments comprises the second speech segment, wherein the first and second groups of speech segments are associated with a same phonetic context.
  - 6. The speech synthesis method of claim 5, wherein the first group of speech segments is represented by a first statistical model and the second group of speech segments is represented by a second statistical model, and wherein calculating the value comprises:
    - using the transformation to transform the second statistical model to obtain a transformed statistical model; and
    - calculating the value as a distance between the transformed second statistical model and the first statistical model.
  - 7. The speech synthesis method of claim 6, wherein the distance between the transformed second statistical model and the first statistical model is a Kullback-Leibler divergence between the transformed second statistical model and the first statistical model.
  - 8. The speech synthesis method of claim 1, wherein the synthesizing comprises generating speech by applying concatenative synthesis techniques to the first speech segment and the second speech segment.
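
The distance computation recited in claims 5 through 7 can be sketched as follows. The claims do not say what form the statistical models or the transformation take, so this sketch assumes diagonal-covariance Gaussians and a per-dimension scale-and-shift transformation. It also assumes the divergence is taken from the transformed second model to the first model. `DiagGaussian`, `transform`, and `kl_divergence` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagGaussian:
    mean: tuple  # per-dimension means
    var: tuple   # per-dimension variances (all > 0)


def transform(model, scale, shift):
    """Apply x -> scale * x + shift per dimension (an assumed affine transformation)."""
    return DiagGaussian(
        mean=tuple(a * m + b for a, m, b in zip(scale, model.mean, shift)),
        var=tuple(a * a * v for a, v in zip(scale, model.var)),
    )


def kl_divergence(p, q):
    """KL(p || q) for two diagonal-covariance Gaussians of equal dimension."""
    total = 0.0
    for mp, vp, mq, vq in zip(p.mean, p.var, q.mean, q.var):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


# Made-up one-dimensional models for the same phonetic context in two styles.
first_model = DiagGaussian(mean=(0.0,), var=(1.0,))    # desired speaking style
second_model = DiagGaussian(mean=(1.0,), var=(4.0,))   # second speaking style

moved = transform(second_model, scale=(0.5,), shift=(-0.5,))
print(kl_divergence(second_model, first_model))  # before the transformation
print(kl_divergence(moved, first_model))         # after the transformation
```

In this toy case the transformation maps the second model exactly onto the first, so the divergence after the transformation is zero. A larger value would indicate that segments from the second style remain acoustically far from the desired style even after being transformed.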

9. A system, comprising:
- at least one computer hardware processor;
- at least one physical device for outputting sound; and
- at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform:
  - obtaining input comprising text and an identification of a desired speaking style to use in synthesizing the text as speech;
  - identifying a plurality of speech segments for use in synthesizing the text as speech, the identifying comprising:
    - identifying a first speech segment recorded and/or synthesized in a first speaking style based at least in part on a measure of similarity between the desired speaking style and the first speaking style; and
    - identifying a second speech segment recorded and/or synthesized in a second speaking style different from the first speaking style based at least in part on a measure of similarity between the desired speaking style and the second speaking style;
  - synthesizing speech from the text in the desired speaking style, at least in part, by using the first speech segment and the second speech segment; and
  - outputting the synthesized speech via the at least one physical device.
  - 10. The system of claim 9, wherein the identifying comprises:
    - identifying the second speech segment based, at least in part, on how well acoustic characteristics of the second speech segment match acoustic characteristics associated with the desired speaking style.
  - 11. The system of claim 10, wherein the identifying the second speech segment is based, at least in part, on how well prosodic characteristics of the second speech segment match prosodic characteristics associated with the desired speaking style.
  - 12. The system of claim 10, wherein identifying the second speech segment comprises:
    - calculating a value indicative of how well the acoustic characteristics of the second speech segment match acoustic characteristics associated with the desired speaking style.
  - 13. The system of claim 12, wherein the calculating is performed based at least in part on a transformation from a second group of speech segments having the second speaking style to a first group of speech segments having the desired speaking style, wherein the second group of speech segments comprises the second speech segment, wherein the first and second groups of speech segments are associated with a same phonetic context.
  - 14. The system of claim 13, wherein the first group of speech segments is represented by a first statistical model and the second group of speech segments is represented by a second statistical model, and wherein calculating the value comprises:
    - using the transformation to transform the second statistical model to obtain a transformed statistical model; and
    - calculating the value as a distance between the transformed second statistical model and the first statistical model.

15. At least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform:
- obtaining input comprising text and an identification of a desired speaking style to use in synthesizing the text as speech;
- identifying a plurality of speech segments for use in synthesizing the text as speech, the identifying comprising:
  - identifying a first speech segment recorded and/or synthesized in a first speaking style based at least in part on a measure of similarity between the desired speaking style and the first speaking style; and
  - identifying a second speech segment recorded and/or synthesized in a second speaking style different from the first speaking style based at least in part on a measure of similarity between the desired speaking style and the second speaking style;
- synthesizing speech from the text in the desired speaking style, at least in part, by using the first speech segment and the second speech segment; and
- outputting the synthesized speech via the at least one physical device.
  - 16. The at least one non-transitory computer-readable storage medium of claim 15, wherein the identifying comprises:
    - identifying the second speech segment based, at least in part, on how well acoustic characteristics of the second speech segment match acoustic characteristics associated with the desired speaking style.
  - 17. The at least one non-transitory computer-readable storage medium of claim 16, wherein the identifying the second speech segment is based, at least in part, on how well prosodic characteristics of the second speech segment match prosodic characteristics associated with the desired speaking style.
  - 18. The at least one non-transitory computer-readable storage medium of claim 16, wherein identifying the second speech segment comprises:
    - calculating a value indicative of how well the acoustic characteristics of the second speech segment match acoustic characteristics associated with the desired speaking style.
  - 19. The at least one non-transitory computer-readable storage medium of claim 18, wherein the calculating is performed based at least in part on a transformation from a second group of speech segments having the second speaking style to a first group of speech segments having the desired speaking style, wherein the second group of speech segments comprises the second speech segment, wherein the first and second groups of speech segments are associated with a same phonetic context.
  - 20. The at least one non-transitory computer-readable storage medium of claim 19, wherein the first group of speech segments is represented by a first statistical model and the second group of speech segments is represented by a second statistical model, and wherein calculating the value comprises:
    - using the transformation to transform the second statistical model to obtain a transformed statistical model; and
    - calculating the value as a distance between the transformed second statistical model and the first statistical model.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Pollet, Vincent
Primary Examiner(s): MCFADDEN, SUSAN IRIS
Application Number: US14/499,444
Publication Number: US 20160093289A1
Time in Patent Office: 869 Days
Field of Search: 704/260
US Class Current: 1/1
CPC Class Codes:
- G10L 13/027   Concept to speech synthesis...
- G10L 13/047   Architecture of speech synt...
- G10L 13/07   Concatenation rules
- G10L 13/08   Text analysis or generation...
