SPEECH RECOGNITION APPARATUS AND METHOD AND PROGRAM THEREFOR
Abstract
A speech recognition apparatus includes a generating unit generating a speech-feature vector expressing a feature for each of frames obtained by dividing an input speech, a storage unit storing a first acoustic model obtained by modeling a feature of each word by using a state transition model, a storage unit storing at least one second acoustic model, a calculation unit calculating, for each state, a first probability of transition to an at-end-frame state to obtain first probabilities, and selecting a maximum probability of the first probabilities, a selection unit selecting a maximum-probability-transition path, a conversion unit converting the maximum-probability-transition path into a corresponding-transition-path corresponding to the second acoustic model, a calculation unit calculating a second probability of transition to the at-end-frame state on the corresponding-transition-path, and a finding unit finding to which word the input speech corresponds based on the maximum probability and the second probability.
18 Claims
1. A speech recognition apparatus comprising:
a generating unit configured to generate a speech feature vector expressing a speech feature for each of a plurality of frames obtained by dividing an input speech between a start time and an end time and including frames from a start frame to an end frame;
a first storage unit configured to store a first acoustic model obtained by modeling a speech feature of each word by using a state transition model including a plurality of states and a plurality of transition paths, each word being included in the input speech;
a second storage unit configured to store at least one second acoustic model different from the first acoustic model;
a first calculation unit configured to calculate, for each state, a first probability of transition to a state at the end frame for each word from the first acoustic model and a speech feature vector sequence from the start frame to the end frame to obtain a plurality of first probabilities for each word, and select a maximum probability of the first probabilities;
a selection unit configured to select, for each word, a maximum probability transition path corresponding to the maximum probability, the maximum probability transition path indicating transition from a start state at the start frame to an end state at the end frame;
a conversion unit configured to convert, for each word, the maximum probability transition path into a corresponding transition path corresponding to the second acoustic model;
a second calculation unit configured to calculate, for each word, a second probability of transition to the state at the end frame on the corresponding transition path from the second acoustic model and the speech feature vector sequence; and
a finding unit configured to find to which word the input speech corresponds based on the maximum probability for each word and the second probability for each word.
Dependent claims: 2, 3, 4, 5, 6, 7
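The flow recited in claim 1 can be illustrated with a minimal sketch; it is not the patented implementation. It assumes log-domain scores, per-frame emission log-likelihoods that have already been computed from the speech feature vectors, a start in state 0, and a hypothetical `state_map` list that plays the role of the conversion unit by mapping each state of the first model to a state of the second model. The names `viterbi`, `rescore_path`, `recognize`, and `weight` are illustrative, and adding the two log scores is only one possible reading of "based on the maximum probability ... and the second probability".

```python
NEG_INF = float("-inf")

def viterbi(log_trans, log_emit):
    """First calculation unit + selection unit (sketch).

    log_trans[i][j]: log transition score from state i to state j.
    log_emit[t][j]:  log-likelihood of frame t in state j (assumed precomputed).
    Returns (maximum log score at the end frame, best state path).
    """
    n_states = len(log_trans)
    score = [NEG_INF] * n_states
    score[0] = log_emit[0][0]          # assumption: every path starts in state 0
    backptrs = []
    for t in range(1, len(log_emit)):
        new_score = [NEG_INF] * n_states
        ptr = [0] * n_states
        for j in range(n_states):
            i = max(range(n_states), key=lambda k: score[k] + log_trans[k][j])
            new_score[j] = score[i] + log_trans[i][j] + log_emit[t][j]
            ptr[j] = i
        score = new_score
        backptrs.append(ptr)
    end = max(range(n_states), key=lambda j: score[j])
    path = [end]
    for ptr in reversed(backptrs):     # trace back from the end frame
        path.append(ptr[path[-1]])
    path.reverse()
    return score[end], path

def rescore_path(path, state_map, log_trans2, log_emit2):
    """Conversion unit + second calculation unit (sketch)."""
    conv = [state_map[s] for s in path]            # corresponding transition path
    total = log_emit2[0][conv[0]]
    for t in range(1, len(conv)):
        total += log_trans2[conv[t - 1]][conv[t]] + log_emit2[t][conv[t]]
    return total, conv

def recognize(models, weight=1.0):
    """Finding unit (sketch): pick the word with the best combined score.

    models maps word -> (log_trans1, log_emit1, log_trans2, log_emit2, state_map).
    The additive combination and `weight` are assumptions of this sketch.
    """
    scores = {}
    for word, (tr1, em1, tr2, em2, state_map) in models.items():
        p1, path = viterbi(tr1, em1)
        p2, _ = rescore_path(path, state_map, tr2, em2)
        scores[word] = p1 + weight * p2
    return max(scores, key=scores.get), scores
```

In this sketch the second model is evaluated only along the single path chosen under the first model, so its cost per word is linear in the number of frames rather than a second full search over all states.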
8. A speech recognition apparatus comprising:
a generating unit configured to generate a first speech feature vector expressing a speech feature for each of a plurality of frames obtained by dividing an input speech between a start time and an end time and including frames from a first frame to a second frame next to the first frame;
a first storage unit configured to store a first acoustic model obtained by modeling a speech feature of each word by using a state transition model including a plurality of states and a plurality of first transition paths, each word being included in the input speech;
a second storage unit configured to store at least one second acoustic model different from the first acoustic model;
a first calculation unit configured to calculate, for each state, a first probability of transition to a state at the second frame for each word from the first acoustic model, a second probability of transition to the state at the first frame and a second speech feature vector of the second frame to obtain a plurality of first probabilities for each word, and select a maximum probability of the first probabilities;
a selection unit configured to select, for each word, at least a maximum probability transition path of the maximum probability transition path and a plurality of second transition paths, the maximum probability transition path corresponding to the maximum probability and indicating transition from a first state at the first frame to a second state at the second frame, the second transition paths having probabilities lower than the maximum probability;
a conversion unit configured to convert, for each word, the maximum probability transition path, or the maximum probability transition path and at least one second transition path when the second transition path of the second transition paths is selected, into at least one corresponding transition path corresponding to the second acoustic model;
a second calculation unit configured to calculate, for each word, a third probability of transition to the second state at the second frame on the corresponding transition path from the second acoustic model, the second speech feature vector and a fourth probability of transition to the second state at the first frame;
an operation unit configured to repeatedly operate the first calculation unit, the selection unit, the conversion unit, and the second calculation unit until the second frame becomes an end frame; and
a finding unit configured to find to which word the input speech corresponds based on the maximum probability at the end frame for each word and the third probability at the end frame for each word.
Dependent claims: 9, 10, 11, 12, 13, 14, 15, 16
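The frame-by-frame variant recited in claim 8 can be sketched in the same spirit; again, this is an illustration under stated assumptions rather than the patented implementation. It assumes log-domain scores, precomputed per-frame emission log-likelihoods, a start in state 0, and a hypothetical `state_map` list standing in for the conversion unit. Only the single maximum probability transition into each state is tracked; the optional lower-probability "second transition paths" of the claim are left out. The function name `frame_sync_scores` is illustrative.

```python
NEG_INF = float("-inf")

def frame_sync_scores(log_trans1, log_emit1, log_trans2, log_emit2, state_map):
    """Advance both models one frame at a time for a single word (sketch).

    s1[j]: best first-model log score of reaching state j at the current frame.
    s2[j]: second-model log score accumulated along that same best path.
    Returns (s1, s2) at the state with the maximum first-model score
    at the end frame.
    """
    n_states = len(log_trans1)
    s1 = [NEG_INF] * n_states
    s2 = [NEG_INF] * n_states
    s1[0] = log_emit1[0][0]                      # assumption: start in state 0
    s2[0] = log_emit2[0][state_map[0]]
    for t in range(1, len(log_emit1)):           # repeat until the end frame
        n1 = [NEG_INF] * n_states
        n2 = [NEG_INF] * n_states
        for j in range(n_states):
            # Selection: best predecessor of state j under the first model.
            i = max(range(n_states), key=lambda k: s1[k] + log_trans1[k][j])
            n1[j] = s1[i] + log_trans1[i][j] + log_emit1[t][j]
            # Conversion: map the chosen transition i -> j into the second model.
            ci, cj = state_map[i], state_map[j]
            # Second calculation: extend the second-model score on that path.
            n2[j] = s2[i] + log_trans2[ci][cj] + log_emit2[t][cj]
        s1, s2 = n1, n2
    end = max(range(n_states), key=lambda j: s1[j])
    return s1[end], s2[end]
```

Because the second-model score is carried forward alongside the first-model score, no traceback is needed at the end frame; a finding step could compare words using the two returned values, for example by adding them, which is an assumption of this sketch.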
17. A speech recognition method comprising:
generating a first speech feature vector expressing a speech feature for each of a plurality of frames obtained by dividing an input speech between a start time and an end time and including frames from a first frame to a second frame next to the first frame;
storing in a first storage unit a first acoustic model obtained by modeling a speech feature of each word by using a state transition model including a plurality of states and a plurality of first transition paths, each word being included in the input speech;
storing in a second storage unit at least one second acoustic model different from the first acoustic model;
calculating, for each state, a first probability of transition to a state at the second frame for each word from the first acoustic model, a second probability of transition to the state at the first frame and a second speech feature vector of the second frame to obtain a plurality of first probabilities for each word;
selecting a maximum probability of the first probabilities;
selecting, for each word, at least a maximum probability transition path of the maximum probability transition path and a plurality of second transition paths, the maximum probability transition path corresponding to the maximum probability and indicating transition from a first state at the first frame to a second state at the second frame, the second transition paths having probabilities lower than the maximum probability;
converting, for each word, the maximum probability transition path, or the maximum probability transition path and at least one second transition path when the second transition path of the second transition paths is selected, into at least one corresponding transition path corresponding to the second acoustic model;
calculating, for each word, a third probability of transition to the second state at the second frame on the corresponding transition path from the second acoustic model, the second speech feature vector and a fourth probability of transition to the second state at the first frame;
repeatedly operating calculating the first probability of transition, selecting the maximum probability transition path, converting the maximum probability transition path, or the maximum probability transition path and the second transition path when the second transition path of the second transition paths is selected, and calculating the second probability until the second frame becomes an end frame; and
finding to which word the input speech corresponds based on the maximum probability at the end frame for each word and the third probability at the end frame for each word.
18. A computer readable storage medium storing instructions of a computer program which when executed by a computer results in performance of steps comprising:
generating a first speech feature vector expressing a speech feature for each of a plurality of frames obtained by dividing an input speech between a start time and an end time and including frames from a first frame to a second frame next to the first frame;
storing in a first storage unit a first acoustic model obtained by modeling a speech feature of each word by using a state transition model including a plurality of states and a plurality of first transition paths, each word being included in the input speech;
storing in a second storage unit at least one second acoustic model different from the first acoustic model;
calculating, for each state, a first probability of transition to a state at the second frame for each word from the first acoustic model, a second probability of transition to the state at the first frame and a second speech feature vector of the second frame to obtain a plurality of first probabilities for each word;
selecting a maximum probability of the first probabilities;
selecting, for each word, at least a maximum probability transition path of the maximum probability transition path and a plurality of second transition paths, the maximum probability transition path corresponding to the maximum probability and indicating transition from a first state at the first frame to a second state at the second frame, the second transition paths having probabilities lower than the maximum probability;
converting, for each word, the maximum probability transition path, or the maximum probability transition path and at least one second transition path when the second transition path of the second transition paths is selected, into at least one corresponding transition path corresponding to the second acoustic model;
calculating, for each word, a third probability of transition to the second state at the second frame on the corresponding transition path from the second acoustic model, the second speech feature vector and a fourth probability of transition to the second state at the first frame;
repeatedly operating calculating the first probability of transition, selecting the maximum probability transition path, converting the maximum probability transition path, or the maximum probability transition path and the second transition path when the second transition path of the second transition paths is selected, and calculating the second probability until the second frame becomes an end frame; and
finding to which word the input speech corresponds based on the maximum probability at the end frame for each word and the third probability at the end frame for each word.
Specification