Easy generation and automatic training of spoken dialog systems using text-to-speech

US 7,885,817 B2
Filed: 06/29/2005
Issued: 02/08/2011
Est. Priority Date: 03/08/2005
Status: Expired due to Fees

Abstract

A dialog system training environment and method using text-to-speech (TTS) are provided. The only knowledge a designer requires is a simple specification of when the dialog system has failed or succeeded, and for any state of the dialog, a list of the possible actions the system can take.

The training environment simulates a user using TTS varied at adjustable levels. A dialog action model of a dialog system responds to each produced utterance by trying out all possible actions until it has failed or succeeded. From the data accumulated in the training environment, the dialog action model can learn which states to go to when it observes the appropriate speech and dialog features, so as to increase the likelihood of success. The data can also be used to improve the speech model.
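
The loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the TTS engine and recognizer are replaced by a hypothetical `simulated_confidence` function, the action names (`execute`, `confirm`, `repeat`), the `low`/`high` feature buckets, and the success probabilities are all invented for the example.

```python
import random
from collections import defaultdict

ACTIONS = ["execute", "confirm", "repeat"]  # hypothetical action list


def simulated_confidence(rate, volume, rng):
    """Stand-in for TTS + recognizer: a made-up confidence score in [0, 1].

    Assumption: speech far from a normal rate or volume is recognized worse.
    """
    penalty = abs(rate - 1.0) + abs(volume - 1.0)
    return max(0.0, min(1.0, 1.0 - penalty + rng.uniform(-0.1, 0.1)))


def outcome(confidence, action, rng):
    """Toy version of the designer-supplied success/failure specification.

    Executing succeeds with probability equal to the confidence; the fixed
    probabilities for confirming and repeating are arbitrary assumptions.
    """
    p = {"execute": confidence, "confirm": 0.8, "repeat": 0.6}[action]
    return rng.random() < p


def train(n_rounds=2000, seed=0):
    rng = random.Random(seed)
    # counts[bucket][action] = [successes, trials]
    counts = defaultdict(lambda: {a: [0, 0] for a in ACTIONS})
    for _ in range(n_rounds):
        rate = rng.choice([0.5, 1.0, 1.5])   # TTS varied at adjustable levels
        volume = rng.choice([0.5, 1.0])
        conf = simulated_confidence(rate, volume, rng)
        bucket = "high" if conf >= 0.7 else "low"
        for action in ACTIONS:               # try out every possible action
            stats = counts[bucket][action]
            stats[0] += outcome(conf, action, rng)
            stats[1] += 1
    # Learned policy: per feature bucket, pick the action that succeeded most often.
    return {
        bucket: max(ACTIONS, key=lambda a: table[a][0] / table[a][1])
        for bucket, table in counts.items()
    }


policy = train()
```

In this toy setup the learned `policy` maps each confidence bucket to the action with the highest observed success rate, which is the kind of feature-to-action mapping the abstract says can be learned from the accumulated data.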

18 Claims

1. A computer-implemented dialog system training environment, comprising:
- a processor to execute components of a dialog system;
  
  a memory coupled to the processor;
  
  a user simulator that during the dialog system training provides at least one text to speech training output associated with an utterance, the output having variable qualities; and
  
  a dialog system that comprises;
  
  a speech model having a plurality of modifiable speech model parameters, the speech model receives the at least one text to speech training output as a speech model input related to the utterance and produces related speech model output features;
  
  a dialog action model having a plurality of modifiable dialog action model parameters, the dialog action model receives the related speech model output features from the speech model and produces related output actions, the plurality of modifiable speech model parameters, the plurality of modifiable dialog action model parameters, or a combination thereof, are based, at least in part, upon the utterance, the action taken by the dialog action model, or a combination thereof; and
  
  the dialog system identifies the utterance that is in need of clarification by initiating a repair dialog, wherein the utterance associated with the repair dialog is identified includes;
  
  determining what states of the repair dialog are reached from other states, the dialog system learns which states to go to when observing an appropriate speech and dialog features by trying all repair paths using the user simulator where a user's voice is generated using various text-to-speech (TTS) engines at adjustable levels; and
  
  determining which states of the repair dialog are failures or successes.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The computer-implemented dialog system training environment of claim 1, wherein the dialog action model is modified with data collected from the training environment.
  - 3. The computer-implemented dialog system training environment of claim 1, wherein the dialog action model further comprises a probability distribution associated with uncertainty regarding the plurality of modifiable dialog action model parameters.
  - 4. The computer-implemented dialog system training environment of claim 1, wherein the user simulator further comprises adjusting at least one of a voice, a pitch, a rate or volume settings of the text to speech training output.
  - 5. The computer-implemented dialog system training environment of claim 1, wherein the user simulator further comprises simulating a noisy environment.
  - 6. The computer-implemented dialog system training environment of claim 1, wherein the user simulator further comprises a language model storing information associated with the utterance.
  - 7. The computer-implemented dialog system training environment of claim 1, wherein the speech model is modified with data collected from the training environment.
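
The last two elements of claim 1 (working out which repair-dialog states are reached from which others, and which end states count as failures or successes) can be pictured as a walk over a small state graph. The sketch below is only an illustration: the state names and transitions in `REPAIR_GRAPH` are hypothetical, since the claim does not enumerate them, and the graph is assumed to be acyclic so that "all repair paths" is a finite set.

```python
# Hypothetical repair-dialog graph: state -> states reachable in one step.
REPAIR_GRAPH = {
    "start": ["execute", "confirm", "ask_repeat"],
    "confirm": ["execute", "ask_repeat", "give_up"],
    "ask_repeat": ["execute", "give_up"],
    "execute": ["success", "failure"],
    "give_up": ["failure"],
}
TERMINAL = {"success", "failure"}


def all_repair_paths(state="start", prefix=()):
    """Yield every path from `state` to a terminal state (graph is acyclic)."""
    path = prefix + (state,)
    if state in TERMINAL:
        yield path
        return
    for nxt in REPAIR_GRAPH[state]:
        yield from all_repair_paths(nxt, path)


def reachable_from(state):
    """Return the set of states that can be reached from `state`."""
    seen, stack = set(), [state]
    while stack:
        for nxt in REPAIR_GRAPH.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


paths = list(all_repair_paths())
successes = [p for p in paths if p[-1] == "success"]
failures = [p for p in paths if p[-1] == "failure"]
```

With this toy graph the walk finds 11 paths, 4 ending in `success` and 7 in `failure`. In the claimed environment each path would be exercised with simulated speech rather than merely listed.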

8. A method of training a speech recognition learning system either offline or online, the method comprising:
- generating at least one text to speech output to an utterance using a user simulator during the training, the output having qualities comprising at least one of a voice, a pitch, a rate or a volume;
  
  identifying speech features related to the utterance using a speech model or using a dialog action model;
  
  identifying the utterance that is in need of clarification by initiating a repair dialog, wherein the utterance associated with the repair dialog is identified includes;
  
  determining what states of the repair dialog are reached from other states, the dialog system learns which states to go to when observing an appropriate speech and dialog features by trying all repair paths using the user simulator where a user's voice is generated using various text-to-speech (TTS) engines at adjustable levels; and
  
  determining which states of the repair dialog are failures or successes;
  
  performing an action related to the speech features; and
  
  updating the dialog action model, the speech model or both, based at least in part on the speech features, the action, or a combination thereof.
- Dependent claims (9, 10, 11, 12, 13):
  - 9. The method of claim 8, wherein the qualities of the at least one text to speech output related to the utterance varies.
  - 10. The method of claim 8, wherein the dialog action model further comprises a probability distribution associated with uncertainty regarding the dialog action model parameters or the speech model parameters.
  - 11. The method of claim 8, wherein the user simulator further comprises simulating a noisy background.
  - 12. The method of claim 8, wherein the user simulator further comprises a language model storing information associated with the utterance.
  - 13. The method of claim 8, wherein the speech model is modified with data collected from the training.
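
Claim 10 refers to a probability distribution that captures uncertainty about model parameters, and claim 8 ends with an update step. One common way to represent such uncertainty for a single success probability is a Beta distribution; that choice is an assumption made here for illustration, since the claims do not name a distribution. The `BetaBelief` class and the outcome sequence below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief about one action's success probability."""

    alpha: float = 1.0  # prior pseudo-count of successes (uniform prior)
    beta: float = 1.0   # prior pseudo-count of failures

    def update(self, success: bool) -> None:
        """Fold one observed dialog outcome into the belief."""
        if success:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        n = self.alpha + self.beta
        return self.alpha * self.beta / (n * n * (n + 1))


belief = BetaBelief()
for succeeded in [True, True, False, True]:  # made-up simulated outcomes
    belief.update(succeeded)
```

As more simulated dialogs are observed, the variance shrinks, so the model can tell a well-tested action from one it has barely tried.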

14. A computer-implemented dialog training system, comprising:
- a processing unit to implement the dialog training system;
  
  a user simulator generating a simulated training data set for the dialog training system, the simulating training data set comprising text to speech training output associated with one or more simulated utterances varying over at least one of a pitch, a rate, a volume, a noise, or combinations thereof;
  
  a speech model having a plurality of modifiable speech model parameters relating to the one or more simulated utterances, the speech model receiving the simulated training data set as a speech model input and producing related speech model output features relating to the one or more simulated utterances;
  
  a dialog action model having a plurality of modifiable dialog action model parameters relating to the one or more simulated utterances, the dialog action model receiving the related speech model output features and producing corresponding output actions relating to the one or more simulated utterances; and
  
  the dialog training system further receives one or more simulated utterances that is in need of clarification by initiating a repair dialog, wherein the utterance associated with the repair dialog is identified includes;
  
  determining what states of the repair dialog are reached from other states, the dialog system learns which states to go to when observing an appropriate speech and dialog features by trying all repair paths using the user simulator where a user's voice is generated using various text-to-speech (TTS) engines at adjustable levels; and
  
  determining which states of the repair dialog are failures or successes.
- Dependent claims (15, 16, 17, 18):
  - 15. The computer-implemented dialog system of claim 14, wherein the dialog action model further comprises a probability distribution associated with uncertainty regarding the plurality of modifiable dialog action model parameters.
  - 16. The computer-implemented dialog system of claim 14, wherein the user simulator further comprises simulating a noisy environment.
  - 17. The computer-implemented dialog system of claim 14, wherein the user simulator further comprises a language model storing information associated with the one or more simulated utterances.
  - 18. The computer-implemented dialog system of claim 14, wherein the speech model is modified with data collected from the training system.
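
Claim 14 describes a simulated training data set whose utterances vary over pitch, rate, volume, and noise. A rough sketch of how such a set might be laid out follows. Everything concrete in it is an assumption: the setting values, the sample utterances, and the `add_noise` helper are hypothetical, and no real TTS engine is called (a sine tone stands in for synthesized audio).

```python
import itertools
import math
import random

PITCHES = [-2, 0, 2]        # hypothetical semitone offsets
RATES = [0.8, 1.0, 1.2]     # hypothetical speaking-rate multipliers
VOLUMES = [0.5, 1.0]        # hypothetical gain factors
NOISE_LEVELS = [0.0, 0.1]   # hypothetical noise amplitude, relative to full scale


def build_training_set(utterances):
    """Pair every utterance with every combination of settings."""
    return [
        {"text": text, "pitch": p, "rate": r, "volume": v, "noise": n}
        for text, p, r, v, n in itertools.product(
            utterances, PITCHES, RATES, VOLUMES, NOISE_LEVELS
        )
    ]


def add_noise(samples, noise_level, seed=0):
    """Add uniform noise to audio samples in [-1, 1], clipping the result."""
    rng = random.Random(seed)
    return [
        max(-1.0, min(1.0, s + rng.uniform(-noise_level, noise_level)))
        for s in samples
    ]


training_set = build_training_set(["call home", "play music"])
# Placeholder audio: 80 samples of a 440 Hz tone at an 8 kHz sample rate.
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(80)]
noisy_tone = add_noise(tone, noise_level=0.1)
```

Taking the full cross product keeps the coverage of settings explicit; a real system might sample from the grid instead if it grows too large.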

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Paek, Timothy S.; Chickering, David M.
Primary Examiner(s): Wozniak; James S
Assistant Examiner(s): He; Jialong
Application Number: US11/170,584
Publication Number: US 20060206332A1
Time in Patent Office: 2,050 Days
Field of Search: 704/270, 704/275
US Class Current: 704/270
CPC Class Codes:
G10L 13/00   Speech synthesis; Text to s...
G10L 15/063   Training
G10L 15/22   Procedures used during a sp...
