STATE MAPPING FOR CROSS-LANGUAGE SPEAKER ADAPTATION
Abstract
Creation of sub-phonemic Hidden Markov Model (HMM) states and the mapping of those states result in improved cross-language speaker adaptation. The smaller sub-phonemic mapping provides improvements in usability and intelligibility, particularly between languages with few common phonemes. HMM states of different languages may be mapped to one another using a distance between the HMM states in acoustic space. This distance may be calculated using Kullback-Leibler divergence and multi-space probability distribution. By combining distance mapping and context mapping for different speakers of the same language, improved cross-language speaker adaptation is possible.
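The abstract's distance-based state mapping can be illustrated in code. The sketch below assumes each HMM state's emission is a single diagonal Gaussian and maps each source state to the acoustically closest target state by closed-form KL divergence; the patent's formulation additionally uses multi-space probability distributions, which are omitted here, and all function names are hypothetical:

```python
import numpy as np

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """Closed-form KL divergence KL(p || q) between diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def map_states(src_states, tgt_states):
    """Map each source HMM state to the closest target state in acoustic space.

    Each state is a (mean, variance) pair of NumPy arrays."""
    mapping = {}
    for i, (mu_p, var_p) in enumerate(src_states):
        dists = [kl_gaussian(mu_p, var_p, mu_q, var_q)
                 for mu_q, var_q in tgt_states]
        mapping[i] = int(np.argmin(dists))
    return mapping
```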
20 Claims
1. One or more computer-readable storage media storing instructions for cross-language speaker adaptation in speech-to-speech language translation that when executed instruct a processor to perform acts comprising:
sampling a source speaker's voice in a speaker's language (VSLS);
sampling an auxiliary speaker's voice in the source speaker's language (VALS);
sampling the auxiliary speaker's voice in a listener's language (VALL);
sampling a listener's voice in the listener's language (VLLL);
recognizing VSLS into text of the source speaker's language (TLS);
translating the TLS to text of the listener's language (TLL);
generating a Hidden Markov Model (HMM) model for the VALS;
mapping VSLS samples to VALS HMM states using context mapping;
generating a HMM model for the VALL;
mapping VALS HMM model states to VALL HMM model states, wherein the HMM states of the VALS model are mapped to the HMM states of the VALL model which are closest in an acoustic space using distortion measure mapping;
generating a HMM model for the VLLL;
mapping states of the VALL HMM model to states of the VLLL HMM model using context mapping; and
modifying VLLL using the VSLS samples to form a source speaker's voice speaking the listener's language (VOLL).
(Dependent claims: 2, 3, 4, 5, 6)
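Claim 1 chains three mappings (context mapping, distortion-measure mapping, context mapping) into an end-to-end correspondence from source-speaker samples to listener-language HMM states. A toy sketch of that chaining, with integer state IDs and hypothetical variable names:

```python
def compose(m1, m2):
    """Compose two state mappings: apply m1, then m2."""
    return {s: m2[t] for s, t in m1.items()}

# Toy state inventories standing in for claim 1's three mapping stages.
ctx_map_ls = {0: 10, 1: 11, 2: 11}  # VSLS samples -> VALS states (context mapping)
dist_map   = {10: 20, 11: 21}       # VALS -> VALL states (distortion measure mapping)
ctx_map_ll = {20: 30, 21: 31}       # VALL -> VLLL states (context mapping)

# End-to-end: which VLLL state each VSLS sample ultimately adapts.
end_to_end = compose(compose(ctx_map_ls, dist_map), ctx_map_ll)
print(end_to_end)  # {0: 30, 1: 31, 2: 31}
```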
7. A method comprising:
sampling first speech from a speaker in a first language (VALS);
decomposing the first speech into first speech sub-phoneme samples;
generating a Hidden Markov Model (HMM) model of the VALS comprising HMM states, wherein each state represents a distinctive sub-phonemic acoustic-phonetic event derived from the first speech sub-phoneme samples;
training the first state model VALS using the sub-phoneme samples;
sampling second speech from the speaker in a second language (VALL);
decomposing the second speech into second speech sub-phoneme samples;
generating a Hidden Markov Model (HMM) model of the VALL comprising HMM states, wherein each state represents a distinctive sub-phonemic acoustic-phonetic event derived from the second speech sub-phoneme samples;
training the second state model VALL using the sub-phoneme samples; and
determining corresponding states between VALS HMM model states and VALL HMM model states using Kullback-Leibler Divergence with multi-space probability distribution (KLD).
(Dependent claims: 8, 9, 10, 11, 12, 13, 14)
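Claim 7's KLD with multi-space probability distribution handles streams (such as F0) that mix a continuous voiced space with a discrete unvoiced space. The sketch below uses a common two-space decomposition, KL of the space weights plus the voiced weight times the KL of the voiced Gaussians; this is an illustrative approximation, not the patent's exact formula, and the function names are hypothetical:

```python
import math

def kl_gauss_1d(mu_p, var_p, mu_q, var_q):
    """Closed-form KL divergence between two 1-D Gaussians."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def kl_msd(wp, mu_p, var_p, wq, mu_q, var_q):
    """Approximate KLD between two 2-space MSD states.

    wp, wq are the voiced-space weights; (mu, var) parameterize the
    voiced Gaussian. The unvoiced space carries only its weight."""
    kl_weights = (wp * math.log(wp / wq)
                  + (1 - wp) * math.log((1 - wp) / (1 - wq)))
    return kl_weights + wp * kl_gauss_1d(mu_p, var_p, mu_q, var_q)
```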
15. A system of speech-to-speech translation with cross-language speaker adaptation, the system comprising:
a processor;
a memory coupled to the processor;
a speaker adaptation module, stored in memory and configured to execute on the processor, the speaker adaptation module configured to map a first Hidden Markov Model (HMM) model of speech in a first language to a second HMM model of speech in a second language using Kullback-Leibler Divergence (KLD) with multi-space probability distribution (MSD).
(Dependent claims: 16, 17, 18, 19, 20)
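The claimed speaker adaptation module could be sketched as a small component parameterized by a state-distance function (for example, KLD with MSD); the class and its API below are hypothetical illustrations, not the patent's implementation:

```python
class SpeakerAdaptationModule:
    """Minimal sketch of a speaker adaptation module.

    Holds a distance function over HMM states and maps each state of a
    first-language model to the closest state of a second-language model."""

    def __init__(self, distance):
        self.distance = distance  # distance(state_a, state_b) -> float

    def map_models(self, hmm_first, hmm_second):
        """Return {first-model state index: closest second-model state index}."""
        return {
            i: min(range(len(hmm_second)),
                   key=lambda j: self.distance(hmm_first[i], hmm_second[j]))
            for i in range(len(hmm_first))
        }
```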
Specification