Systems and methods for multi-style speech synthesis

US 9,990,915 B2
Filed: 12/28/2016
Issued: 06/05/2018
Est. Priority Date: 09/29/2014
Status: Active Grant

Abstract

Techniques for performing multi-style speech synthesis. The techniques include using at least one computer hardware processor to perform: obtaining input comprising text and an identification of a desired speaking style to use in rendering the text as speech; identifying a plurality of speech segments for use in synthesizing the text as speech, the identifying comprising identifying a first speech segment recorded and/or synthesized in a first speaking style that is different from the desired speaking style based at least in part on a measure of similarity between the desired speaking style and the first speaking style; synthesizing speech from the text in the desired speaking style at least in part by using the first speech segment; and outputting the synthesized speech.
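The selection step described in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that each candidate segment carries a precomputed target cost and that style similarity is available as a lookup table with values in [0, 1]; the names `Segment`, `STYLE_SIMILARITY`, `select_segments`, and `style_weight` are illustrative and do not come from the patent. A segment recorded in a different style is penalized in proportion to how dissimilar its style is from the desired one, so it can still be chosen when it fits well otherwise.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    unit: str           # phonetic unit the segment realizes
    style: str          # speaking style it was recorded or synthesized in
    target_cost: float  # hypothetical mismatch cost for this unit (lower is better)


# Hypothetical similarity table; 1.0 would mean identical styles.
STYLE_SIMILARITY: Dict[Tuple[str, str], float] = {
    ("neutral", "newscast"): 0.8,
    ("neutral", "excited"): 0.3,
}


def similarity(a: str, b: str) -> float:
    """Symmetric lookup; unknown style pairs are treated as fully dissimilar."""
    if a == b:
        return 1.0
    return STYLE_SIMILARITY.get((a, b), STYLE_SIMILARITY.get((b, a), 0.0))


def select_segments(units: List[str], desired_style: str,
                    inventory: List[Segment],
                    style_weight: float = 1.0) -> List[Segment]:
    """Pick, for each unit, the segment with the lowest combined cost."""
    chosen = []
    for unit in units:
        candidates = [s for s in inventory if s.unit == unit]
        if not candidates:
            raise LookupError(f"no segment available for unit {unit!r}")
        # Combined cost: target cost plus a penalty for style dissimilarity.
        best = min(
            candidates,
            key=lambda s: s.target_cost
            + style_weight * (1.0 - similarity(s.style, desired_style)),
        )
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    demo_inventory = [
        Segment("a", "newscast", 0.5),
        Segment("a", "neutral", 0.1),
        Segment("b", "newscast", 0.1),
        Segment("b", "neutral", 0.1),
    ]
    picked = select_segments(["a", "b"], "newscast", demo_inventory)
    print([(s.unit, s.style) for s in picked])
```

In the demo inventory, the neutral-style segment for unit "a" wins over the newscast one because its lower target cost outweighs the style penalty, which mirrors the idea of using a segment from a similar but different style. A real unit-selection system would also weigh join costs between adjacent segments, which this sketch leaves out.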

20 Claims

1. A speech synthesis method, comprising:
- using at least one computer hardware processor to perform:
  
  obtaining input comprising text and an identification of a desired speaking style to use in synthesizing the text as speech;
  
  identifying a plurality of speech segments for use in synthesizing the text as speech, the identifying comprising identifying a first speech segment recorded and/or synthesized in a first speaking style that is different from the desired speaking style based at least in part on a measure of similarity between the desired speaking style and the first speaking style;
  
  synthesizing speech from the text in the desired speaking style at least in part by using the first speech segment; and
  
  outputting the synthesized speech.
- Dependent claims:
  - 2. The speech synthesis method of claim 1, wherein the identifying the first speech segment is based at least in part on how well acoustic characteristics of the first speech segment match acoustic characteristics associated with the desired speaking style.
  - 3. The speech synthesis method of claim 2, wherein the identifying the first speech segment is based at least in part on how well prosodic characteristics of the first speech segment match prosodic characteristics associated with the desired speaking style.
  - 4. The speech synthesis method of claim 2, wherein the identifying the first speech segment comprises:
    - calculating a value indicative of how well the acoustic characteristics of the first speech segment match the acoustic characteristics associated with the desired speaking style.
  - 5. The speech synthesis method of claim 4, wherein the calculating is performed based at least in part on a transformation from a first group of speech segments recorded and/or synthesized in the first speaking style to a second group of speech segments recorded and/or synthesized in the desired speaking style, wherein the first group of speech segments comprises the first speech segment, wherein the first and second groups of speech segments are associated with a same phonetic context.
  - 6. The speech synthesis method of claim 5, wherein the first group of speech segments is represented by a first statistical model and the second group of speech segments is represented by a second statistical model, and wherein the calculating comprises:
    - using the transformation to transform the first statistical model to obtain a transformed first statistical model; and
      
      calculating the value as a distance between the transformed first statistical model and the second statistical model.
  - 7. The speech synthesis method of claim 6, wherein the distance between the transformed first statistical model and the second statistical model is a Kullback-Leibler divergence between the transformed first statistical model and the second statistical model.
  - 8. The speech synthesis method of claim 1, wherein the identifying the plurality of segments comprises:
    - identifying a second speech segment recorded and/or synthesized in a second speaking style that is the same as the desired speaking style.
  - 9. The speech synthesis method of claim 8, wherein the synthesizing comprises:
    - synthesizing speech from the text in the desired speaking style at least in part by using the first speech segment and the second speech segment.
  - 10. The speech synthesis method of claim 9, wherein the synthesizing comprises:
    - generating speech by applying at least one concatenative synthesis technique to the first speech segment and the second speech segment.
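The calculation in claims 5 to 7 (transform the first statistical model, then measure its Kullback-Leibler divergence from the second) can be sketched as follows. The claims do not say what form the models or the transformation take, so this sketch assumes diagonal-covariance Gaussians and a per-dimension affine transformation; `DiagGaussian`, `affine_transform`, and `kl_divergence` are hypothetical names. For diagonal Gaussians p and q, the divergence is KL(p || q) = 0.5 * sum over dimensions of [log(var_q / var_p) + (var_p + (mean_p - mean_q)^2) / var_q - 1].

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class DiagGaussian:
    mean: List[float]  # per-dimension means
    var: List[float]   # per-dimension variances (must be positive)


def affine_transform(model: DiagGaussian, scale: List[float],
                     offset: List[float]) -> DiagGaussian:
    """Apply y = scale * x + offset per dimension (an assumed transformation)."""
    return DiagGaussian(
        mean=[a * m + b for a, m, b in zip(scale, model.mean, offset)],
        var=[a * a * v for a, v in zip(scale, model.var)],
    )


def kl_divergence(p: DiagGaussian, q: DiagGaussian) -> float:
    """KL(p || q) for diagonal Gaussians; note that it is not symmetric."""
    total = 0.0
    for mp, vp, mq, vq in zip(p.mean, p.var, q.mean, q.var):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def style_mismatch(first: DiagGaussian, second: DiagGaussian,
                   scale: List[float], offset: List[float]) -> float:
    """Value in the sense of claim 6: distance after transforming the first model."""
    return kl_divergence(affine_transform(first, scale, offset), second)
```

A value near zero means that, after the transformation, segments of the first style in this phonetic context look statistically like segments of the desired style. The direction of the divergence (transformed model first) is a choice made for this sketch, not something the claims fix.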

11. A system, comprising:
- at least one computer hardware processor; and
  
  at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform:
  
  obtaining input comprising text and an identification of a desired speaking style to use in synthesizing the text as speech;
  
  identifying a plurality of speech segments for use in synthesizing the text as speech, the identifying comprising identifying a first speech segment recorded and/or synthesized in a first speaking style that is different from the desired speaking style based at least in part on a measure of similarity between the desired speaking style and the first speaking style;
  
  synthesizing speech from the text in the desired speaking style at least in part by using the first speech segment; and
  
  outputting the synthesized speech.
- Dependent claims:
  - 12. The system of claim 11, wherein the identifying the first speech segment is based at least in part on how well acoustic characteristics of the first speech segment match acoustic characteristics associated with the desired speaking style.
  - 13. The system of claim 12, wherein the identifying the first speech segment comprises:
    - calculating a value indicative of how well the acoustic characteristics of the first speech segment match the acoustic characteristics associated with the desired speaking style.
  - 14. The system of claim 13, wherein the calculating is performed based at least in part on a transformation from a first group of speech segments recorded and/or synthesized in the first speaking style to a second group of speech segments recorded and/or synthesized in the desired speaking style, wherein the first group of speech segments comprises the first speech segment, wherein the first and second groups of speech segments are associated with a same phonetic context.
  - 15. The system of claim 14, wherein the first group of speech segments is represented by a first statistical model and the second group of speech segments is represented by a second statistical model, and wherein the calculating comprises:
    - using the transformation to transform the first statistical model to obtain a transformed first statistical model; and
      
      calculating the value as a distance between the transformed first statistical model and the second statistical model.

16. At least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform:
- obtaining input comprising text and an identification of a desired speaking style to use in synthesizing the text as speech;
  
  identifying a plurality of speech segments for use in synthesizing the text as speech, the identifying comprising identifying a first speech segment recorded and/or synthesized in a first speaking style that is different from the desired speaking style based at least in part on a measure of similarity between the desired speaking style and the first speaking style;
  
  synthesizing speech from the text in the desired speaking style at least in part by using the first speech segment; and
  
  outputting the synthesized speech.
- Dependent claims:
  - 17. The at least one non-transitory computer-readable storage medium of claim 16, wherein the identifying the first speech segment is based at least in part on how well acoustic characteristics of the first speech segment match acoustic characteristics associated with the desired speaking style.
  - 18. The at least one non-transitory computer-readable storage medium of claim 17, wherein the identifying the first speech segment comprises:
    - calculating a value indicative of how well the acoustic characteristics of the first speech segment match the acoustic characteristics associated with the desired speaking style.
  - 19. The at least one non-transitory computer-readable storage medium of claim 18, wherein the calculating is performed based at least in part on a transformation from a first group of speech segments recorded and/or synthesized in the first speaking style to a second group of speech segments recorded and/or synthesized in the desired speaking style, wherein the first group of speech segments comprises the first speech segment, wherein the first and second groups of speech segments are associated with a same phonetic context.
  - 20. The at least one non-transitory computer-readable storage medium of claim 19, wherein the first group of speech segments is represented by a first statistical model and the second group of speech segments is represented by a second statistical model, and wherein the calculating comprises:
    - using the transformation to transform the first statistical model to obtain a transformed first statistical model; and
      
      calculating the value as a distance between the transformed first statistical model and the second statistical model.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Pollet, Vincent
Primary Examiner(s): MCFADDEN, SUSAN IRIS
Application Number: US15/392,923
Publication Number: US 20170110110A1
Time in Patent Office: 524 Days
Field of Search: 704260
CPC Class Codes:
- G10L 13/027   Concept to speech synthesis...
- G10L 13/047   Architecture of speech synt...
- G10L 13/07   Concatenation rules
- G10L 13/08   Text analysis or generation...
