SPEECH PROCESSOR, A SPEECH PROCESSING METHOD AND A METHOD OF TRAINING A SPEECH PROCESSOR

US 20110218804A1
Filed: 01/26/2011
Published: 09/08/2011
Est. Priority Date: 03/02/2010
Status: Active Grant

Abstract

A speech recognition method, the method involving:

- receiving a speech input from a known speaker, the speech input comprising a sequence of observations; and
- determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model, the acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, the acoustic model having been trained using first training data and adapted using second training data to said speaker,
- the speech recognition method also determining the likelihood of a sequence of observations occurring in a given language using a language model; and
- combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal, wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, wherein the structure of said decision trees is based on second training data.

59 Citations

20 Claims

1. A speech recognition method, said method comprising:
- receiving a speech input from a speaker which comprises a sequence of observations; and
- determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker, the speech recognition method further comprising determining the likelihood of a sequence of observations occurring in a given language using a language model; and
- combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal, wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, wherein the structure of said decision trees is based on second training data.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 17):
  - 2. A speech recognition method according to claim 1, wherein the structure of the decision trees is based on both the first and second training data.
  - 3. A method according to claim 1, wherein the structure is determined from the splitting of the nodes of the trees and has been calculated using maximum a posteriori criteria.
  - 4. A method according to claim 3, wherein the splitting is calculated using maximum a posteriori criteria implemented as:
  - 5. A method according to claim 1, wherein the context dependency is implemented as tri-phones.
  - 6. A method according to claim 1, wherein said acoustic model comprises probability distributions which are represented by means and variances and wherein said decision trees are provided for both means and variances.
  - 7. A method according to claim 1, wherein said context based information is selected from phonetic, linguistic and prosodic contexts.
  - 8. A method according to claim 1, wherein said decision trees are used to model at least one selected from expressive contexts, gender, age or voice characteristics.
  - 17. A carrier medium carrying computer readable instructions for controlling the computer to carry out the method of claim 1.
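
The combining step of claim 1 can be illustrated with a minimal sketch. It assumes both models report log-likelihoods and that the combination is a weighted sum; the claim does not say how the likelihoods are combined, so the function names, the `lm_weight` parameter and the toy scores below are all hypothetical.

```python
import math

def combine_scores(acoustic_logp, language_logp, lm_weight=1.0):
    """Sum acoustic and language-model log-likelihoods for one hypothesis.

    lm_weight is an assumed tuning knob; it does not come from the claims.
    """
    return acoustic_logp + lm_weight * language_logp

def best_hypothesis(hypotheses, lm_weight=1.0):
    """Return the word sequence with the highest combined log score.

    `hypotheses` maps a tuple of words to (acoustic_logp, language_logp).
    """
    return max(
        hypotheses,
        key=lambda words: combine_scores(*hypotheses[words], lm_weight=lm_weight),
    )

# Toy scores, invented for illustration only.
hypotheses = {
    ("recognise", "speech"): (math.log(0.020), math.log(0.30)),
    ("wreck", "a", "nice", "beach"): (math.log(0.025), math.log(0.01)),
}
result = best_hypothesis(hypotheses)
```

With these toy numbers the acoustic model alone slightly prefers the second hypothesis, while the combined score prefers the first, which is the role the language model plays in the claim.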

9. A text to speech processing method, said method comprising:
- receiving a text input which comprises a sequence of words; and
- determining the likelihood of a sequence of speech vectors arising from the sequence of words using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker, wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, wherein the structure of said decision trees is based on second training data.
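
As a rough illustration of how context information can be held in a decision tree, the sketch below walks a binary tree of yes/no questions about a tri-phone context and returns the Gaussian parameters stored at the leaf. The `Node` class, the `lookup` function, the questions and the numbers are all assumptions made for illustration; the claims do not specify the tree representation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Triphone = Tuple[str, str, str]  # (left phone, centre phone, right phone)

@dataclass
class Node:
    # Internal nodes carry a yes/no question; leaves carry Gaussian parameters.
    question: Optional[Callable[[Triphone], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    mean: Optional[float] = None
    variance: Optional[float] = None

def lookup(node: Node, context: Triphone) -> Tuple[float, float]:
    """Walk the tree until a leaf is reached and return its (mean, variance)."""
    while node.question is not None:
        node = node.yes if node.question(context) else node.no
    return node.mean, node.variance

VOWELS = {"a", "e", "i", "o", "u"}

# Hypothetical two-question tree with made-up leaf parameters.
tree = Node(
    question=lambda c: c[0] in VOWELS,      # "Is the left phone a vowel?"
    yes=Node(mean=1.5, variance=0.2),
    no=Node(
        question=lambda c: c[2] == "sil",   # "Is the right phone silence?"
        yes=Node(mean=-0.3, variance=0.1),
        no=Node(mean=0.4, variance=0.5),
    ),
)
```

Calling `lookup(tree, ("a", "t", "e"))` reaches the first leaf. Contexts never seen in training still map to some leaf, which is the usual reason for clustering contexts with trees.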

10. A method of training an acoustic model for a speech processing system, the method comprising:
- receiving first training data, said first training data comprising speech and text corresponding to said speech;
- training a first acoustic model using said first training data;
- receiving second training data from a known speaker;
- adapting said first acoustic model to form a second acoustic model using said second training data, wherein adapting said first model to form said second model comprises constructing decision trees to model context dependency, and wherein the structure of the decision trees is based on the second training data.
- Dependent claims (11, 12, 13, 14, 15, 16):
  - 11. A method according to claim 10, further comprising storing the first acoustic model such that adaptation to the second acoustic model can be performed at a different location.
  - 12. A method according to claim 10, wherein training said first acoustic model comprises:
    - initialising a plurality of Hidden Markov Models (HMMs);
    - re-estimating the HMMs on the basis of the first training data; and
    - constructing decision trees to model contexts in said first training data.
  - 13. A method according to claim 12, wherein training of said first model further comprises re-estimating the HMMs clustered by the decision trees.
  - 14. A method according to claim 10, wherein training the second model comprises:
    - deriving HMM parameters for said second model by running the forward-backward algorithm on said second training data and said first training data;
    - scaling the statistics obtained from the first training data using a parameter; and
    - constructing decision trees using said first and second training data.
  - 15. A method according to claim 14, further comprising determining said parameter by trial and error.
  - 16. A method according to claim 14, wherein training of said second model further comprises re-estimating the HMMs clustered by the decision trees.
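
The scaling and tree-construction steps of claim 14 can be sketched as follows. This is an assumption-laden illustration, not the patent's criterion: the formula of claim 4 is not reproduced in this record, so the sketch swaps in the standard single-Gaussian maximum-likelihood split gain, applied to statistics pooled as `tau * first + second`. The names `Stats`, `split_gain` and `tau` are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Stats:
    """Sufficient statistics of a 1-D Gaussian: occupancy, sum, sum of squares."""
    occ: float = 0.0
    s1: float = 0.0
    s2: float = 0.0

    def __add__(self, other):
        return Stats(self.occ + other.occ, self.s1 + other.s1, self.s2 + other.s2)

    def scaled(self, factor):
        return Stats(factor * self.occ, factor * self.s1, factor * self.s2)

def stats_from(values):
    """Accumulate statistics from raw observations (all with occupancy 1)."""
    return Stats(float(len(values)), sum(values), sum(v * v for v in values))

def log_likelihood(st, var_floor=1e-4):
    """Log-likelihood of the pooled data under its own ML Gaussian."""
    mean = st.s1 / st.occ
    var = max(st.s2 / st.occ - mean * mean, var_floor)
    return -0.5 * st.occ * (math.log(2.0 * math.pi * var) + 1.0)

def split_gain(first_yes, first_no, second_yes, second_no, tau):
    """Gain of a yes/no split when first-data statistics are scaled by tau.

    Assumed stand-in for the MAP criterion, which this record does not give.
    """
    yes = first_yes.scaled(tau) + second_yes
    no = first_no.scaled(tau) + second_no
    return log_likelihood(yes) + log_likelihood(no) - log_likelihood(yes + no)

# Toy data: the question separates the second (speaker) data but not the first.
first_yes = stats_from([0.0, 0.2, -0.2, 0.1])
first_no = stats_from([0.1, -0.1, 0.0, 0.2])
second_yes = stats_from([1.0, 1.2, 0.8])
second_no = stats_from([-1.0, -0.8, -1.2])

gain_speaker_only = split_gain(first_yes, first_no, second_yes, second_no, tau=0.0)
gain_prior_heavy = split_gain(first_yes, first_no, second_yes, second_no, tau=10.0)
```

On this toy data the gain at `tau=10.0` is smaller than at `tau=0.0`, because heavily weighted first-data statistics give the split little support. Trying several values of `tau` and comparing results is one reading of the trial-and-error tuning in claim 15.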

18. A speech recognition apparatus comprising:
- a receiver for receiving a speech input from a speaker which comprises a sequence of observations; and
- a processor configured to:
  - determine the likelihood of a sequence of words arising from the sequence of observations using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker;
  - determine the likelihood of a sequence of observations occurring in a given language using a language model; and
  - combine the likelihoods determined by the acoustic model and the language model and output a sequence of words identified from said speech input signal, wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, wherein the structure of said decision trees is based on second training data.
- Dependent claims (20):
  - 20. A speech to speech translation system, said system comprising a speech recognition system according to claim 18 configured to recognise speech in a first language, a translation module configured to translate text received in a first language into text of a second language and a text to speech system according to claim 19 configured to output speech in said second language.

19. A text to speech system comprising:
- a receiver for receiving a text input which comprises a sequence of words; and
- a processor, said processor being configured to:
  - determine the likelihood of a sequence of speech vectors arising from the sequence of words using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker, wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, wherein the structure of said decision trees is based on second training data.

Current Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Original Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors: Chun, Byung Ha
Granted Patent: US 9,043,213 B2
US Class Current: 704/243

CPC Class Codes:
- G06F 40/58   Use of machine translation,...
- G10L 13/08   Text analysis or generation...
- G10L 15/07   to the speaker
- G10L 15/14   using statistical models, e...
- G10L 15/144   Training of HMMs
- G10L 15/187   Phonemic context, e.g. pron...
