SYSTEMS AND METHODS FOR MULTI-STYLE SPEECH SYNTHESIS

US 20160093289A1
Filed: 09/29/2014
Published: 03/31/2016
Est. Priority Date: 09/29/2014
Status: Active Grant


Abstract

Techniques for performing multi-style speech synthesis. The techniques include using at least one computer hardware processor to perform: obtaining input comprising text and an identification of a first speaking style to use in rendering the text as speech; identifying a plurality of speech segments for use in rendering the text as speech, the identified plurality of speech segments comprising a first speech segment having the first speaking style and a second speech segment having a second speaking style different from the first speaking style; and rendering the text as speech having the first speaking style, at least in part, by using the identified plurality of speech segments.
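The identification step can be illustrated with a minimal sketch, not the patented implementation. It assumes a small in-memory inventory and a per-segment cost; the names `Segment`, `select_segments`, `target_cost`, and `style_penalty` are hypothetical and do not appear in the source. A segment recorded in a style other than the requested one pays an extra penalty, so it is chosen only when it is still the cheapest candidate. Concatenation (join) costs, which a real unit-selection system would also weigh, are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    phone: str          # phonetic unit this segment realizes
    style: str          # speaking style the segment was recorded in
    target_cost: float  # hypothetical context-mismatch cost (lower is better)


def select_segments(phones, target_style, inventory, style_penalty):
    """Pick one segment per phone, allowing segments from other styles.

    style_penalty maps (segment_style, target_style) to a non-negative cost.
    A segment already in the target style pays no penalty.
    """
    def total_cost(seg):
        if seg.style == target_style:
            return seg.target_cost
        return seg.target_cost + style_penalty[(seg.style, target_style)]

    chosen = []
    for phone in phones:
        candidates = [s for s in inventory if s.phone == phone]
        if not candidates:
            raise ValueError(f"no segment available for phone {phone!r}")
        chosen.append(min(candidates, key=total_cost))
    return chosen


# Toy inventory: two styles, two phones (all values are made up).
inventory = [
    Segment("h", "neutral", 0.2),
    Segment("h", "joyful", 0.9),
    Segment("ay", "joyful", 0.1),
    Segment("ay", "neutral", 0.3),
]
penalty = {("neutral", "joyful"): 0.4}
result = select_segments(["h", "ay"], "joyful", inventory, penalty)
```

In this toy run the requested style is "joyful", yet the selection mixes a "neutral" segment for the first phone with a "joyful" segment for the second, which mirrors the claimed combination of a first-style and a second-style segment.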

20 Claims

1. A speech synthesis method, comprising:
- using at least one computer hardware processor to perform:
  
  obtaining input comprising text and an identification of a first speaking style to use in rendering the text as speech;
  
  identifying a plurality of speech segments for use in rendering the text as speech, the identified plurality of speech segments comprising a first speech segment having the first speaking style and a second speech segment having a second speaking style different from the first speaking style; and
  
  rendering the text as speech having the first speaking style, at least in part, by using the identified plurality of speech segments.
- - 2. The speech synthesis method of claim 1, wherein the identifying comprises:
    - identifying the second speech segment based, at least in part, on how well acoustic characteristics of the second speech segment match acoustic characteristics associated with the first speaking style.
  - 3. The speech synthesis method of claim 2, wherein the identifying the second speech segment is based, at least in part, on how well prosodic characteristics of the second speech segment match prosodic characteristics associated with the first speaking style.
  - 4. The speech synthesis method of claim 2, wherein identifying the second speech segment comprises:
    - calculating a value indicative of how well the acoustic characteristics of the second speech segment match acoustic characteristics associated with the first speaking style.
  - 5. The speech synthesis method of claim 4, wherein the calculating is performed based at least in part on a transformation from a second group of speech segments having the second speaking style to a first group of speech segments having the first speaking style, wherein the second group of speech segments comprises the second speech segment, wherein the first and second groups of speech segments are associated with a same phonetic context.
  - 6. The speech synthesis method of claim 5, wherein the first group of speech segments is represented by a first statistical model and the second group of speech segments is represented by a second statistical model, and wherein calculating the value comprises:
    - using the transformation to transform the second statistical model to obtain a transformed statistical model; and
      
      calculating the value as a distance between the transformed second statistical model and the first statistical model.
  - 7. The speech synthesis method of claim 6, wherein the distance between the transformed second statistical model and the first statistical model is a Kullback-Leibler divergence between the transformed second statistical model and the first statistical model.
  - 8. The speech synthesis method of claim 1, wherein the rendering comprises generating speech by applying concatenative synthesis techniques to the identified plurality of speech segments.
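Claims 5 through 7 describe transforming the second-style statistical model and measuring its distance to the first-style model. The sketch below shows one way this could look, under assumptions the claims do not state: both models are diagonal-covariance Gaussians, the transformation is a per-dimension affine map (scale and shift), and the divergence is taken from the transformed model to the first-style model. The function names are hypothetical.

```python
import math


def transform_gaussian(mean, var, scale, shift):
    """Apply a per-dimension affine map x -> scale * x + shift to a diagonal Gaussian."""
    new_mean = [a * m + b for a, m, b in zip(scale, mean, shift)]
    new_var = [a * a * v for a, v in zip(scale, var)]
    return new_mean, new_var


def kl_diag_gaussians(mean_p, var_p, mean_q, var_q):
    """Closed-form KL(p || q) for diagonal Gaussians p and q."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


# Made-up second-style model and transformation toward the first style.
second_mean, second_var = [1.0, 2.0], [1.0, 4.0]
t_mean, t_var = transform_gaussian(second_mean, second_var, [2.0, 0.5], [0.0, 1.0])

# First-style model that the transformed model matches exactly.
kl_same = kl_diag_gaussians(t_mean, t_var, [2.0, 2.0], [4.0, 1.0])

# First-style model whose mean differs in the first dimension.
kl_shifted = kl_diag_gaussians(t_mean, t_var, [3.0, 2.0], [4.0, 1.0])
```

A value near zero suggests that, after transformation, the second-style group is acoustically close to the first-style group for that phonetic context; larger values would argue against borrowing segments from it.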

9. A system, comprising:
- at least one computer hardware processor; and
  
  at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform:
  
  obtaining input comprising text and an identification of a first speaking style to use in rendering the text as speech;
  
  identifying a plurality of speech segments for use in rendering the text as speech, the identified plurality of speech segments comprising a first speech segment having the first speaking style and a second speech segment having a second speaking style different from the first speaking style; and
  
  rendering the text as speech having the first speaking style, at least in part, by using the identified plurality of speech segments.
- - 10. The system of claim 9, wherein the identifying comprises:
    - identifying the second speech segment based, at least in part, on how well acoustic characteristics of the second speech segment match acoustic characteristics associated with the first speaking style.
  - 11. The system of claim 10, wherein the identifying the second speech segment is based, at least in part, on how well prosodic characteristics of the second speech segment match prosodic characteristics associated with the first speaking style.
  - 12. The system of claim 10, wherein identifying the second speech segment comprises:
    - calculating a value indicative of how well the acoustic characteristics of the second speech segment match acoustic characteristics associated with the first speaking style.
  - 13. The system of claim 12, wherein the calculating is performed based at least in part on a transformation from a second group of speech segments having the second speaking style to a first group of speech segments having the first speaking style, wherein the second group of speech segments comprises the second speech segment, wherein the first and second groups of speech segments are associated with a same phonetic context.
  - 14. The system of claim 13, wherein the first group of speech segments is represented by a first statistical model and the second group of speech segments is represented by a second statistical model, and wherein calculating the value comprises:
    - using the transformation to transform the second statistical model to obtain a transformed statistical model; and
      
      calculating the value as a distance between the transformed second statistical model and the first statistical model.
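Claims 3, 11, and 17 base the choice of the second segment on how well its prosodic characteristics match those of the first speaking style, without fixing a particular measure. As a rough illustration only, the sketch below scores a segment by the root-mean-square z-score of its prosodic features against per-style means and standard deviations. The feature names (`f0_hz`, `duration_ms`), the statistics, and the function `prosodic_mismatch` are assumptions made for this example.

```python
import math


def prosodic_mismatch(segment_features, style_stats):
    """Root-mean-square z-score of a segment's features against style statistics.

    style_stats maps a feature name to (mean, standard deviation) for the style.
    Lower values mean the segment's prosody sits closer to the style's typical range.
    """
    squared = []
    for name, (mean, std) in style_stats.items():
        z = (segment_features[name] - mean) / std
        squared.append(z * z)
    return math.sqrt(sum(squared) / len(squared))


# Made-up statistics for the requested (first) speaking style.
joyful_stats = {"f0_hz": (220.0, 20.0), "duration_ms": (80.0, 10.0)}

# Two candidate segments recorded in a different style.
close = prosodic_mismatch({"f0_hz": 210.0, "duration_ms": 85.0}, joyful_stats)
far = prosodic_mismatch({"f0_hz": 160.0, "duration_ms": 110.0}, joyful_stats)
```

Under this measure the first candidate would be the better one to borrow, since its pitch and duration fall within half a standard deviation of the target style's averages.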

15. At least one computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform:
- obtaining input comprising text and an identification of a first speaking style to use in rendering the text as speech;
  
  identifying a plurality of speech segments for use in rendering the text as speech, the identified plurality of speech segments comprising a first speech segment having the first speaking style and a second speech segment having a second speaking style different from the first speaking style; and
  
  rendering the text as speech having the first speaking style, at least in part, by using the identified plurality of speech segments.
- - 16. The at least one computer-readable storage medium of claim 15, wherein the identifying comprises:
    - identifying the second speech segment based, at least in part, on how well acoustic characteristics of the second speech segment match acoustic characteristics associated with the first speaking style.
  - 17. The at least one computer-readable storage medium of claim 16, wherein the identifying the second speech segment is based, at least in part, on how well prosodic characteristics of the second speech segment match prosodic characteristics associated with the first speaking style.
  - 18. The at least one computer-readable storage medium of claim 16, wherein identifying the second speech segment comprises:
    - calculating a value indicative of how well the acoustic characteristics of the second speech segment match acoustic characteristics associated with the first speaking style.
  - 19. The at least one computer-readable storage medium of claim 18, wherein the calculating is performed based at least in part on a transformation from a second group of speech segments having the second speaking style to a first group of speech segments having the first speaking style, wherein the second group of speech segments comprises the second speech segment, wherein the first and second groups of speech segments are associated with a same phonetic context.
  - 20. The at least one computer-readable storage medium of claim 19, wherein the first group of speech segments is represented by a first statistical model and the second group of speech segments is represented by a second statistical model, and wherein calculating the value comprises:
    - using the transformation to transform the second statistical model to obtain a transformed statistical model; and
      
      calculating the value as a distance between the transformed second statistical model and the first statistical model.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Pollet, Vincent
Granted Patent: US 9,570,065 B2
US Class Current: 1/1
CPC Class Codes:
G10L 13/027   Concept to speech synthesis...
G10L 13/047   Architecture of speech synt...
G10L 13/07   Concatenation rules
G10L 13/08   Text analysis or generation...
