Methods, apparatus and data structure for cross-language speech adaptation

US 9,798,653 B1
Filed: 05/05/2010
Issued: 10/24/2017
Est. Priority Date: 05/05/2010
Status: Expired due to Fees

First Claim

Patent Images

1. A system comprising:

data storage for storing;

a full speech model based on speech in a language spoken by a first person who is fluent in the language,a limited set of utterances in a fluent language of a second person based on speech spoken by the second person who is non-fluent in the language spoken by the first person, anda full speech model of the second person based on speech by the second person, anda processor configured to implement;

a cross-language speech adapter that processes the full speech model based on speech in the language spoken by the first person and the limited set of utterances in the fluent language of the second person based on speech spoken by the second person who is non-fluent in the language spoken by the first person and outputs an adapted speech model, the processing including applying at least one transformation to the full speech model according to the limited set of utterances to produce the adapted speech model, anda tree combination unit the tree combination unit combining the full speech model of the second person based on speech by the second person and the adapted speech model with Text-to Speech (TTS) engine files of the adapted speech model and the full speech model of the second person,wherein the transformation includes a plurality of;

(1) a constrained maximum likelihood linear regression (CMLLR) transformation, (2) a MLLR adaptation of the mean (MLLRMEAN) transformation, (3) a variance MLLR (MLLRVAR) transformation, and (4) a maximum a posteriori (MAP) linear regression transformation.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

Adapted speech models produce fluent synthesized speech in a voice that sounds as if the speaker were fluent in a language in which the speaker is actually non-fluent. A full speech model is obtained based on fluent speech in the language spoken by a first person who is fluent in the language. A limited set of utterances is obtained in the language spoken by a second person who is non-fluent in the language but able to speak the limited set of utterances in the language. The full speech model of the first person is then processed with the limited set of utterances of the second person to produce an adapted speech model. The adapted speech model may be stored to a multi-lingual speech model as a child node of a root with an associated language selection question and branches pointed to the adapted speech model and other speech models, respectively.

137 Citations

7 Claims

1. A system comprising:
- data storage for storing;
  
  a full speech model based on speech in a language spoken by a first person who is fluent in the language,a limited set of utterances in a fluent language of a second person based on speech spoken by the second person who is non-fluent in the language spoken by the first person, anda full speech model of the second person based on speech by the second person, anda processor configured to implement;
  
  a cross-language speech adapter that processes the full speech model based on speech in the language spoken by the first person and the limited set of utterances in the fluent language of the second person based on speech spoken by the second person who is non-fluent in the language spoken by the first person and outputs an adapted speech model, the processing including applying at least one transformation to the full speech model according to the limited set of utterances to produce the adapted speech model, anda tree combination unit the tree combination unit combining the full speech model of the second person based on speech by the second person and the adapted speech model with Text-to Speech (TTS) engine files of the adapted speech model and the full speech model of the second person,wherein the transformation includes a plurality of;
  
  (1) a constrained maximum likelihood linear regression (CMLLR) transformation, (2) a MLLR adaptation of the mean (MLLRMEAN) transformation, (3) a variance MLLR (MLLRVAR) transformation, and (4) a maximum a posteriori (MAP) linear regression transformation.
- View Dependent Claims (2, 3, 4, 5)
- - 2. The system of claim 1 further comprising a text-to-speech (TTS) engine.
  - 3. The system of claim 2 wherein the text-to-speech (TTS) engine outputs fluid synthesized speech.
  - 4. The system of claim 3 wherein the text-to-speech (TTS) engine receives a multi-lingual phoneme stream.
  - 5. The system of claim 4 wherein the multi-lingual phoneme stream was transformed from multi-lingual text by a text processor.

6. A method comprising:
- receiving at an input interface of a computer system having at least a processor and a memory in addition to the input and output interface, a full speech model based on speech in a language spoken by a first person who is fluent in the language;
  
  receiving at the input interface, a limited set of utterances in a fluent language of a second person based on speech spoken by the second person who is non-fluent in the language spoken by the first person;
  
  applying, in the computer system, a transformation technique with an adaptation module to the full speech model according to the limited set of utterances to produce a plurality of adapted speech models,wherein a cross-language speech adapter processes the full speech model based on speech in the language spoken by the first person and the limited set of utterances in the fluent language of the second person based on speech spoken by the second person who is non-fluent in the language spoken by the first person and outputs an adapted speech model, the processing including applying at least one transformation to the full speech model according to the limited set of utterances to produce the adapted speech model; and
  
  synthesizing, in the computer system, speech using each of the plurality of adapted speech models to generate a plurality of synthesized speech samples,wherein the transformation technique includes a plurality of;
  
  (1) a constrained maximum likelihood linear regression (CMLLR) transformation, (2) a MLLR adaptation of the mean (MLLRMEAN) transformation, (3) a variance MLLR (MLLRVAR) transformation, and (4) a maximum a posteriori (MAP) linear regression transformation.
- View Dependent Claims (7)
- - 7. The method of claim 6 wherein a plurality of speech samples are presented to the adaptation module for selection of one by the plurality of transformations that produced a synthesized speech sample having a voice that most closely resembles the voice of a second person and sounds as if the second person were fluent in the language.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Inventors
Shao, Xu, Breen, Andrew
Primary Examiner(s)
Sirjani, Fariba

Application Number

US12/773,998
Time in Patent Office

2,729 Days
Field of Search

704 1- 9, 704258, 704260, 704270, 704272
US Class Current
CPC Class Codes

G06F 12/00   Accessing, addressing or al...

G06F 40/00   Handling natural language d...

G06F 40/58   Use of machine translation,...

G10L 13/00   Speech synthesis; Text to s...

G10L 13/086   Detection of language

Methods, apparatus and data structure for cross-language speech adaptation

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

137 Citations

7 Claims

Specification

Use Cases

Quick Links

Others

Methods, apparatus and data structure for cross-language speech adaptation

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

137 Citations

7 Claims

Specification

Subscription Required

Use Cases

Quick Links

Others