Speech recognition and synthesis utilizing context dependent acoustic models containing decision trees
Abstract
A speech recognition method including the steps of receiving a speech input from a known speaker of a sequence of observations and determining the likelihood of a sequence of words arising from the sequence of observations using an acoustic model. The acoustic model has a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation and has been trained using first training data and adapted using second training data to said speaker. The speech recognition method also determines the likelihood of a sequence of observations occurring in a given language using a language model and combines the likelihoods determined by the acoustic model and the language model and outputs a sequence of words identified from said speech input signal. The acoustic model is context based for the speaker, the context based information being contained in the model using a plurality of decision trees and the structure of the decision trees is based on second training data.
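The abstract says only that the likelihoods from the acoustic model and the language model are combined; it does not say how. A minimal sketch of one common approach, assuming the two scores are added in the log domain and the highest-scoring candidate word sequence is output. The function names, the `lm_weight` parameter, and the toy probabilities are illustrative assumptions, not taken from the patent.

```python
import math

def combine_scores(acoustic_logp, lm_logp, lm_weight=1.0):
    # Sum of log-likelihoods corresponds to multiplying the two likelihoods.
    # lm_weight is a hypothetical tuning knob, not something the patent names.
    return acoustic_logp + lm_weight * lm_logp

def best_hypothesis(hypotheses, lm_weight=1.0):
    # hypotheses: iterable of (words, acoustic_logp, lm_logp) tuples.
    # Returns the word sequence with the highest combined score.
    best = max(hypotheses, key=lambda h: combine_scores(h[1], h[2], lm_weight))
    return best[0]

# Toy candidates with made-up likelihoods, for illustration only.
hyps = [
    (("recognize", "speech"), math.log(0.20), math.log(0.30)),
    (("wreck", "a", "nice", "beach"), math.log(0.25), math.log(0.01)),
]
best = best_hypothesis(hyps)
```

In this toy case the second candidate has the higher acoustic score, but the language model score outweighs it, so the first candidate is selected.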
11 Claims
1. A speech recognition method executed by processing circuitry programmed to implement speech recognition, said method comprising:
  receiving a speech input from a speaker which comprises a sequence of observations; and
  determining, using the processing circuitry, a likelihood of a sequence of words arising from the sequence of observations using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker,
  determining, using the processing circuitry, a likelihood of a sequence of observations occurring in a given language using a language model; and
  combining, using the processing circuitry, the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
  wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as;
  Dependent claims: 2, 3, 4, 5, 6, 7
8. A text to speech processing method executed by processing circuitry programmed to implement text to speech processing, comprising:
  receiving a text input which comprises a sequence of words; and
  determining, using the processing circuitry, a likelihood of a sequence of speech vectors arising from the sequence of words using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to a speaker,
  wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as;
9. A speech recognition apparatus comprising:
  a receiver for receiving a speech input from a speaker which comprises a sequence of observations; and
  processing circuitry programmed to implement speech recognition and configured to;
  determine a likelihood of a sequence of words arising from the sequence of observations using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker;
  determine a likelihood of a sequence of observations occurring in a given language using a language model; and
  combine the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
  wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as;
10. A text to speech system comprising:
  a receiver for receiving a text input which comprises a sequence of words; and
  processing circuitry programmed to implement text to speech processing and configured to;
  determine a likelihood of a sequence of speech vectors arising from the sequence of words using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker,
  wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as;
11. The speech to speech translation system, said system comprising:
  a speech recognition system configured to recognize speech in a first language,
  a translation module configured to translate text received in a first language into text of a second language, and
  a text to speech system configured to output speech in said second language,
  wherein the speech recognition apparatus comprises;
  a receiver for receiving a speech input from a speaker which comprises a sequence of observations; and
  processing circuitry programmed to implement speech recognition and configured to;
  determine a likelihood of a sequence of words arising from the sequence of observations using an acoustic model, said acoustic model having a plurality of model parameters describing probability distributions which relate a word or part thereof to an observation, said acoustic model having been trained using first training data and adapted using second training data to said speaker;
  determine a likelihood of a sequence of observations occurring in a given language using a language model; and
  combine the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
  wherein said acoustic model is context based for said speaker, said context based information being contained in said model using a plurality of decision trees, the structure of said decision trees being based on second training data, the decision trees splitting at nodes and wherein the structure is determined from the splitting of the nodes of the trees that has been calculated using maximum a posteriori criteria implemented as;
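Claim 11 chains three components: speech recognition in a first language, text translation into a second language, and text to speech in that second language. The following is a rough sketch of that composition only. Every name is hypothetical, and the toy stand-ins below do no real recognition, translation, or synthesis; they exist so the data flow can be run end to end.

```python
from typing import Callable

def make_speech_to_speech(
    recognize: Callable[[bytes], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Callable[[bytes], bytes]:
    # Compose the three stages in the order the claim lists them.
    def run(audio: bytes) -> bytes:
        source_text = recognize(audio)          # speech -> first-language text
        target_text = translate(source_text)    # first-language -> second-language text
        return synthesize(target_text)          # second-language text -> speech
    return run

# Toy stand-ins (assumptions for illustration, not real components).
def toy_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")

toy_lexicon = {"hello": "bonjour"}

def toy_translate(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(toy_lexicon.get(word, word) for word in text.split())

def toy_synthesize(text: str) -> bytes:
    return text.encode("utf-8")

pipeline = make_speech_to_speech(toy_recognize, toy_translate, toy_synthesize)
result = pipeline(b"hello world")
```

Passing the stages in as callables keeps the sketch independent of any particular recognizer or synthesizer, including the speaker-adapted acoustic model the claim describes.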
Specification