Speech recognition and synthesis utilizing context dependent acoustic models containing decision trees

US 9,043,213 B2
Filed: 01/26/2011
Issued: 05/26/2015
Est. Priority Date: 03/02/2010
Status: Expired due to Fees

Abstract

A speech recognition method including the steps of receiving a speech input from a known speaker of a sequence of observations and determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model. The acoustic model has a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation and has been trained using first training data and adapted using second training data to said speaker. The speech recognition method also determines the likelihood of a sequence of observations occurring in a given language using a language model and combines the likelihoods determined by the acoustic model and the language model and outputs a sequence of words identified from said speech input signal. The acoustic model is context based for the speaker, the context based information being contained in the model using a plurality of decision trees and the structure of the decision trees is based on second training data.
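The combination step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes scores are kept as log-likelihoods, that candidate word sequences are already available, and that a language-model weight (`lm_weight`) is used. The function names, the weight, and the toy scores are all hypothetical.

```python
def combine_scores(acoustic_logp, language_logp, lm_weight=1.0):
    """Combine two log-likelihoods by weighted addition.

    Adding log-likelihoods corresponds to multiplying the likelihoods.
    lm_weight is an assumed tuning factor, not something the source specifies.
    """
    return acoustic_logp + lm_weight * language_logp


def best_hypothesis(hypotheses, lm_weight=1.0):
    """Return the word sequence with the highest combined score.

    hypotheses: iterable of (words, acoustic_logp, language_logp) tuples.
    """
    best = max(hypotheses, key=lambda h: combine_scores(h[1], h[2], lm_weight))
    return best[0]


# Toy candidates with made-up scores, for illustration only.
candidates = [
    ("recognize speech", -120.0, -8.0),
    ("wreck a nice beach", -118.5, -14.0),
]

result = best_hypothesis(candidates)
```

With these toy numbers the acoustic model alone slightly prefers the second candidate, while the combined score prefers the first, which is the reason for combining the two models.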

11 Claims

1. A speech recognition method executed by processing circuitry programmed to implement speech recognition, said method comprising:
- receiving a speech input from a speaker which comprises a sequence of observations; and
  
  determining, using the processing circuitry, a likelihood of a sequence of words arising from the sequence of observations using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker,
  
  determining, using the processing circuitry, a likelihood of a sequence of observations occurring in a given language using a language model; and
  
  combining, using the processing circuitry, the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
  
  wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as;
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The speech recognition method according to claim 1, wherein the structure of the decision trees is based on both the first and second training data.
  - 3. The method according to claim 1, wherein the context dependency is implemented as tri-phones.
  - 4. The method according to claim 1, wherein said acoustic model comprises probability distributions which are represented by means and variances and wherein said decision trees are provided for both means and variances.
  - 5. The method according to claim 1, wherein said context based information is selected from phonetic, linguistic and prosodic contexts.
  - 6. The method according to claim 1, wherein said decision trees are used to model at least one selected from expressive contexts, gender, age or voice characteristics.
  - 7. A non-transitory computer readable carrier medium carrying computer readable instructions for controlling the computer to carry out the method of claim 1.
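As a small illustration of the tri-phone context dependency named in claim 3, the sketch below expands a phone sequence into tri-phone labels, each pairing a phone with its left and right neighbours. The `left-centre+right` label format and the `sil` boundary padding are assumptions borrowed from common HMM toolkit conventions; the claims do not prescribe any notation, and `to_triphones` is a hypothetical helper.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into context-dependent tri-phone labels.

    Each label has the form "left-centre+right". The boundary symbol used
    at the utterance edges is an assumption made for this sketch.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


labels = to_triphones(["k", "ae", "t"])
# labels == ["sil-k+ae", "k-ae+t", "ae-t+sil"]
```

Because the number of distinct tri-phones grows quickly with the phone inventory, context-dependent models are typically grouped by decision trees so that contexts with too little training data share parameters.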

8. A text to speech processing method executed by processing circuitry programmed to implement text to speech processing, comprising:
- receiving a text input which comprises a sequence of words; and
  
  determining, using the processing circuitry, a likelihood of a sequence of speech vectors arising from the sequence of words using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to a speaker,
  
  wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as;

9. A speech recognition apparatus comprising:
- a receiver for receiving a speech input from a speaker which comprises a sequence of observations; and
  
  processing circuitry programmed to implement speech recognition and configured to;
  
  determine a likelihood of a sequence of words arising from the sequence of observations using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker;
  
  determine a likelihood of a sequence of observations occurring in a given language using a language model; and
  
  combine the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
  
  wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as;

10. A text to speech system comprising:
- a receiver for receiving a text input which comprises a sequence of words; and
  
  processing circuitry programmed to implement text to speech processing and configured to;
  
  determine a likelihood of a sequence of speech vectors arising from the sequence of words using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker,
  
  wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as;

11. The speech to speech translation system, said system comprising:
- a speech recognition system configured to recognize speech in a first language,
  
  a translation module configured to translate text received in a first language into text of a second language, and
  
  a text to speech system configured to output speech in said second language,
  
  wherein the speech recognition apparatus comprises;
  
  a receiver for receiving a speech input from a speaker which comprises a sequence of observations; and
  
  processing circuitry programmed to implement speech recognition and configured to;
  
  determine a likelihood of a sequence of words arising from the sequence of observations using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker;
  
  determine a likelihood of a sequence of observations occurring in a given language using a language model; and
  
  combine the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
  
  wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as;
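The three-stage arrangement in claim 11 (recognition, translation, synthesis) can be pictured as a simple composition of components. The sketch below only shows how the stages chain together; `make_speech_to_speech` and the stub components are hypothetical placeholders and do not perform real recognition, translation, or synthesis.

```python
def make_speech_to_speech(recognize, translate, synthesize):
    """Chain three stage functions into one speech-to-speech pipeline.

    recognize:  speech input in the first language -> text in the first language
    translate:  text in the first language -> text in the second language
    synthesize: text in the second language -> speech output
    """
    def pipeline(speech_input):
        source_text = recognize(speech_input)
        target_text = translate(source_text)
        return synthesize(target_text)

    return pipeline


# Stub components, for illustration only.
def stub_recognize(speech_input):
    return "hello"


def stub_translate(text):
    return {"hello": "bonjour"}.get(text, text)


def stub_synthesize(text):
    return f"<audio:{text}>"


speech_to_speech = make_speech_to_speech(stub_recognize, stub_translate, stub_synthesize)
output = speech_to_speech(b"raw-audio-bytes")
```

Keeping each stage behind a plain function boundary means a speaker-adapted recognizer or synthesizer could be swapped in without changing the rest of the chain.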

Current Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Original Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors: Chun, Byung Ha
Primary Examiner(s): WOZNIAK, JAMES S
Application Number: US13/014,185
Publication Number: US 20110218804A1
Time in Patent Office: 1,581 Days
Field of Search: 704/251, 704/257, 704/258, 704/260, 704/277, 704/255
US Class Current: 704/277
CPC Class Codes:
G06F 40/58   Use of machine translation,...
G10L 13/08   Text analysis or generation...
G10L 15/07   to the speaker
G10L 15/14   using statistical models, e...
G10L 15/144   Training of HMMs
G10L 15/187   Phonemic context, e.g. pron...
