SPEECH RECOGNITION APPARATUS AND METHOD AND PROGRAM THEREFOR

US 20080243506A1
Filed: 02/08/2008
Published: 10/02/2008
Est. Priority Date: 03/28/2007
Status: Active Grant


Abstract

A speech recognition apparatus includes: a generating unit that generates a speech feature vector expressing a feature for each of the frames obtained by dividing an input speech; a first storage unit that stores a first acoustic model obtained by modeling a feature of each word with a state transition model; a second storage unit that stores at least one second acoustic model; a first calculation unit that calculates, for each state, a first probability of transition to a state at the end frame to obtain first probabilities, and selects a maximum probability among the first probabilities; a selection unit that selects a maximum probability transition path; a conversion unit that converts the maximum probability transition path into a corresponding transition path for the second acoustic model; a second calculation unit that calculates a second probability of transition to the state at the end frame on the corresponding transition path; and a finding unit that finds which word the input speech corresponds to based on the maximum probability and the second probability.
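The flow described in the abstract can be illustrated with a small sketch: search for the best path under the first acoustic model, convert that path, then score the same path under the second acoustic model without a second search. Everything concrete here is an assumption made for illustration and does not come from the patent: the 1-D unit-variance Gaussian emissions, the two-state left-to-right topology, the dictionary layout of `model_a` and `model_b`, the identity conversion `table`, and the function names.

```python
import math

NEG_INF = float("-inf")

def log_gauss(x, mean):
    # Toy emission model (assumption): unit-variance 1-D Gaussian log density.
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2 * math.pi)

def viterbi(model, frames):
    """Best log score at the end frame and the state path that achieves it."""
    n = len(model["means"])
    score = [NEG_INF] * n
    score[0] = log_gauss(frames[0], model["means"][0])  # assumed start state: 0
    back = []
    for x in frames[1:]:
        new, ptr = [NEG_INF] * n, [0] * n
        for (i, j), log_a in model["trans"].items():
            if score[i] + log_a > new[j]:
                new[j], ptr[j] = score[i] + log_a, i
        score = [new[j] + log_gauss(x, model["means"][j]) for j in range(n)]
        back.append(ptr)
    state = n - 1  # assumed end state: the last state
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return score[n - 1], path[::-1]

def convert_path(transitions, table):
    # Map each first-model transition to its second-model counterpart.
    return [table[tr] for tr in transitions]

def rescore(model, frames, start_state, transitions):
    """Log score of one fixed path under a model; no search is performed."""
    total = log_gauss(frames[0], model["means"][start_state])
    for x, (i, j) in zip(frames[1:], transitions):
        total += model["trans"][(i, j)] + log_gauss(x, model["means"][j])
    return total

# Hypothetical models for a single word, sharing one topology.
half = math.log(0.5)
trans = {(0, 0): half, (0, 1): half, (1, 1): half}
model_a = {"means": [0.0, 1.0], "trans": trans}
model_b = {"means": [0.2, 1.2], "trans": trans}
table = {tr: tr for tr in trans}  # identity mapping (assumption)

frames = [0.1, 0.0, 0.9, 1.1]  # one scalar feature per frame
first_score, path = viterbi(model_a, frames)
transitions = convert_path(list(zip(path, path[1:])), table)
second_score = rescore(model_b, frames, path[0], transitions)
print(path, first_score, second_score)
```

In this sketch the second model only costs one pass along a known path, which is the saving the conversion step appears to be aiming at. A real recognizer would repeat this per word and use multivariate feature vectors.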

18 Claims

1. A speech recognition apparatus comprising:
- a generating unit configured to generate a speech feature vector expressing a speech feature for each of a plurality of frames obtained by dividing an input speech between a start time and an end time and including frames from a start frame to an end frame;
  
  a first storage unit configured to store a first acoustic model obtained by modeling a speech feature of each word by using a state transition model including a plurality of states and a plurality of transition paths, each word being included in the input speech;
  
  a second storage unit configured to store at least one second acoustic model different from the first acoustic model;
  
  a first calculation unit configured to calculate, for each state, a first probability of transition to a state at the end frame for each word from the first acoustic model and a speech feature vector sequence from the start frame to the end frame to obtain a plurality of first probabilities for each word, and select a maximum probability of the first probabilities;
  
  a selection unit configured to select, for each word, a maximum probability transition path corresponding to the maximum probability, the maximum probability transition path indicating transition from a start state at the start frame to an end state at the end frame;
  
  a conversion unit configured to convert, for each word, the maximum probability transition path into a corresponding transition path corresponding to the second acoustic model;
  
  a second calculation unit configured to calculate, for each word, a second probability of transition to the state at the end frame on the corresponding transition path from the second acoustic model and the speech feature vector sequence; and
  
  a finding unit configured to find to which word the input speech corresponds based on the maximum probability for each word and the second probability for each word.
- Dependent claims (2, 3, 4, 5, 6, 7)
  - 2. The apparatus according to claim 1, wherein the finding unit finds, as a recognized word, a word corresponding to a maximum value among the maximum probability for each word at the end frame and the second probability for each word at the end frame.
  - 3. The apparatus according to claim 1, wherein the finding unit calculates, for each word, a sum of the maximum probability at the end frame and the second probability at the end frame and finds, as a recognized word, a word exhibiting a maximum sum of the sums.
  - 4. The apparatus according to claim 1, wherein the finding unit calculates, for each word, an absolute value of a difference between the maximum probability at the end frame and the second probability at the end frame and finds, as a recognized word, a finding word of words exhibiting absolute values not less than a threshold, the finding word corresponding to a maximum value among the maximum probability at the end frame and the probability at the end frame.
  - 5. The apparatus according to claim 1, wherein the first acoustic model and the second acoustic model use state transition models having the same topology.
  - 6. The apparatus according to claim 1, further comprising a storage unit configured to store a transition path conversion table indicating a correspondence relationship between first transition paths of the first acoustic model and second transition paths of the second acoustic model, and wherein the conversion unit converts a first transition path of the first acoustic model into a second transition path of the second acoustic model which corresponds to the first transition path.
  - 7. The apparatus according to claim 1, further comprising a setting unit configured to set a first acoustic model to be used by the first calculation unit and a second acoustic model to be used by the second calculation unit by switching the first acoustic model and the second acoustic model.
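
The three decision rules recited in claims 2, 3 and 4 can be compared side by side with a short sketch. The function names, the example scores, and the choice to return `None` when no word passes the threshold are assumptions made for illustration; the claims do not say what happens in that case. The scores are treated as plain numbers, so if they were log probabilities the "sum" rule would correspond to a product of probabilities.

```python
def find_by_max(first, second):
    # Claim 2 style: pick the word whose larger score is the largest overall.
    return max(first, key=lambda w: max(first[w], second[w]))

def find_by_sum(first, second):
    # Claim 3 style: pick the word with the largest sum of the two scores.
    return max(first, key=lambda w: first[w] + second[w])

def find_by_threshold(first, second, threshold):
    # Claim 4 style: keep words whose scores differ by at least the threshold,
    # then pick the one whose larger score is the largest.
    candidates = [w for w in first if abs(first[w] - second[w]) >= threshold]
    if not candidates:
        return None  # assumption: the claim does not cover this case
    return max(candidates, key=lambda w: max(first[w], second[w]))

# Hypothetical end-frame scores per word (log domain, made up for the example).
first = {"yes": -10.0, "no": -12.0, "stop": -10.5}   # first acoustic model
second = {"yes": -14.0, "no": -9.5, "stop": -10.6}   # second acoustic model

print(find_by_max(first, second))             # no
print(find_by_sum(first, second))             # stop
print(find_by_threshold(first, second, 3.0))  # yes
```

With these made-up scores the three rules pick three different words, which shows that the choice of rule is not a cosmetic detail.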

8. A speech recognition apparatus comprising:
- a generating unit configured to generate a first speech feature vector expressing a speech feature for each of a plurality of frames obtained by dividing an input speech between a start time and an end time and including frames from a first frame to a second frame next to the first frame;
  
  a first storage unit configured to store a first acoustic model obtained by modeling a speech feature of each word by using a state transition model including a plurality of states and a plurality of first transition paths, each word being included in the input speech;
  
  a second storage unit configured to store at least one second acoustic model different from the first acoustic model;
  
  a first calculation unit configured to calculate, for each state, a first probability of transition to a state at the second frame for each word from the first acoustic model, a second probability of transition to the state at the first frame and a second speech feature vector of the second frame to obtain a plurality of first probabilities for each word, and select a maximum probability of the first probabilities;
  
  a selection unit configured to select, for each word, at least a maximum probability transition path of the maximum probability transition path and a plurality of second transition paths, the maximum probability transition path corresponding to the maximum probability and indicating transition from a first state at the first frame to a second state at the second frame, the second transition paths having probabilities lower than the maximum probability;
  
  a conversion unit configured to convert, for each word, the maximum probability transition path, or the maximum probability transition path and at least one second transition path when the second transition path of the second transition paths is selected, into at least one corresponding transition path corresponding to the second acoustic model;
  
  a second calculation unit configured to calculate, for each word, a third probability of transition to the second state at the second frame on the corresponding transition path from the second acoustic model, the second speech feature vector and a fourth probability of transition to the second state at the first frame;
  
  an operation unit configured to repeatedly operate the first calculation unit, the selection unit, the conversion unit, and the second calculation unit until the second frame becomes an end frame; and
  
  a finding unit configured to find to which word the input speech corresponds based on the maximum probability at the end frame for each word and the third probability at the end frame for each word.
- Dependent claims (9, 10, 11, 12, 13, 14, 15, 16)
  - 9. The apparatus according to claim 8, wherein the finding unit finds, as a recognized word, a word corresponding to a maximum value among the maximum probability for each word at the end frame and the third probability for each word at the end frame.
  - 10. The apparatus according to claim 8, wherein the finding unit calculates, for each word, a sum of the maximum probability at the end frame and the third probability at the end frame and finds, as a recognized word, a word exhibiting a maximum sum of the sums.
  - 11. The apparatus according to claim 8, wherein the finding unit calculates, for each word, an absolute value of a difference between the maximum probability at the end frame and the third probability at the end frame and finds, as a recognized word, a finding word of words exhibiting absolute values not less than a threshold, the finding word corresponding to a maximum value among the maximum probability at the end frame and the third probability at the end frame.
  - 12. The apparatus according to claim 8, wherein the first acoustic model and the second acoustic model use state transition models having the same topology.
  - 13. The apparatus according to claim 8, further comprising a storage unit configured to store a transition path conversion table indicating a correspondence relationship between transition paths of the first acoustic model and transition paths of the second acoustic model, and wherein the conversion unit converts a first transition path of the first acoustic model into a second transition path of the second acoustic model which corresponds to the first transition path.
  - 14. The apparatus according to claim 8, further comprising a setting unit configured to set a first acoustic model to be used by the first calculation unit and a second acoustic model to be used by the second calculation unit by switching the first acoustic model and the second acoustic model.
  - 15. The apparatus according to claim 8, wherein the selection unit selects only the maximum probability transition path.
  - 16. The apparatus according to claim 8, wherein the selection unit selects a plurality of transition paths.
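
Claim 8 recites a frame-by-frame variant: at each new frame the first model's scores are updated, the best incoming transition is selected, converted, and used to update the second model's score for the same step. The sketch below is one possible reading of the case in claim 15, where only the maximum probability transition is selected. The 1-D Gaussian emissions, the two-state topology, the identity conversion `table`, the start and end state choices, and all names are assumptions for illustration, not details taken from the patent.

```python
import math

NEG_INF = float("-inf")

def log_gauss(x, mean):
    # Toy emission model (assumption): unit-variance 1-D Gaussian log density.
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2 * math.pi)

def step(first, second, table, s1, s2, x):
    """Advance both score vectors by one frame with feature value x."""
    n = len(first["means"])
    new1, new2 = [NEG_INF] * n, [NEG_INF] * n
    for j in range(n):
        # First model: best incoming transition into state j.
        best_i, best = None, NEG_INF
        for (i, dest), log_a in first["trans"].items():
            if dest == j and s1[i] + log_a > best:
                best_i, best = i, s1[i] + log_a
        if best_i is None:
            continue  # state j is not reachable at this frame
        new1[j] = best + log_gauss(x, first["means"][j])
        # Second model: follow the converted transition, with no search.
        i2, j2 = table[(best_i, j)]
        new2[j2] = (s2[i2] + second["trans"][(i2, j2)]
                    + log_gauss(x, second["means"][j2]))
    return new1, new2

# Hypothetical models for a single word, sharing one topology.
half = math.log(0.5)
trans = {(0, 0): half, (0, 1): half, (1, 1): half}
model_a = {"means": [0.0, 1.0], "trans": trans}
model_b = {"means": [0.2, 1.2], "trans": trans}
table = {tr: tr for tr in trans}  # identity mapping (assumption)

frames = [0.1, 0.0, 0.9, 1.1]
# Assumed start state 0 for both models at the first frame.
s1 = [log_gauss(frames[0], model_a["means"][0]), NEG_INF]
s2 = [log_gauss(frames[0], model_b["means"][0]), NEG_INF]
for x in frames[1:]:  # repeat until the end frame is reached
    s1, s2 = step(model_a, model_b, table, s1, s2, x)

final_first, final_second = s1[-1], s2[-1]  # assumed end state: last state
print(final_first, final_second)
```

Because the second model's score is carried along the first model's best predecessor at every frame, both end-frame scores are available as soon as the last frame has been processed, with no backtrace needed. Selecting several transitions per state, as in claim 16, would need a list of candidates per state instead of a single `best_i`.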

17. A speech recognition method comprising:
- generating a first speech feature vector expressing a speech feature for each of a plurality of frames obtained by dividing an input speech between a start time and an end time and including frames from a first frame to a second frame next to the first frame;
  
  storing in a first storage unit a first acoustic model obtained by modeling a speech feature of each word by using a state transition model including a plurality of states and a plurality of first transition paths, each word being included in the input speech;
  
  storing in a second storage unit at least one second acoustic model different from the first acoustic model;
  
  calculating, for each state, a first probability of transition to a state at the second frame for each word from the first acoustic model, a second probability of transition to the state at the first frame and a second speech feature vector of the second frame to obtain a plurality of first probabilities for each word;
  
  selecting a maximum probability of the first probabilities;
  
  selecting, for each word, at least a maximum probability transition path of the maximum probability transition path and a plurality of second transition paths, the maximum probability transition path corresponding to the maximum probability and indicating transition from a first state at the first frame to a second state at the second frame, the second transition paths having probabilities lower than the maximum probability;
  
  converting, for each word, the maximum probability transition path, or the maximum probability transition path and at least one second transition path when the second transition path of the second transition paths is selected, into at least one corresponding transition path corresponding to the second acoustic model;
  
  calculating, for each word, a third probability of transition to the second state at the second frame on the corresponding transition path from the second acoustic model, the second speech feature vector and a fourth probability of transition to the second state at the first frame;
  
  repeatedly operating calculating the first probability of transition, selecting the maximum probability transition path, converting the maximum probability transition path, or the maximum probability transition path and the second transition path when the second transition path of the second transition paths is selected, and calculating the second probability until the second frame becomes an end frame; and
  
  finding to which word the input speech corresponds based on the maximum probability at the end frame for each word and the third probability at the end frame for each word.

18. A computer readable storage medium storing instructions of a computer program which when executed by a computer results in performance of steps comprising:
- generating a first speech feature vector expressing a speech feature for each of a plurality of frames obtained by dividing an input speech between a start time and an end time and including frames from a first frame to a second frame next to the first frame;
  
  storing in a first storage unit a first acoustic model obtained by modeling a speech feature of each word by using a state transition model including a plurality of states and a plurality of first transition paths, each word being included in the input speech;
  
  storing in a second storage unit at least one second acoustic model different from the first acoustic model;
  
  calculating, for each state, a first probability of transition to a state at the second frame for each word from the first acoustic model, a second probability of transition to the state at the first frame and a second speech feature vector of the second frame to obtain a plurality of first probabilities for each word;
  
  selecting a maximum probability of the first probabilities;
  
  selecting, for each word, at least a maximum probability transition path of the maximum probability transition path and a plurality of second transition paths, the maximum probability transition path corresponding to the maximum probability and indicating transition from a first state at the first frame to a second state at the second frame, the second transition paths having probabilities lower than the maximum probability;
  
  converting, for each word, the maximum probability transition path, or the maximum probability transition path and at least one second transition path when the second transition path of the second transition paths is selected, into at least one corresponding transition path corresponding to the second acoustic model;
  
  calculating, for each word, a third probability of transition to the second state at the second frame on the corresponding transition path from the second acoustic model, the second speech feature vector and a fourth probability of transition to the second state at the first frame;
  
  repeatedly operating calculating the first probability of transition, selecting the maximum probability transition path, converting the maximum probability transition path, or the maximum probability transition path and the second transition path when the second transition path of the second transition paths is selected, and calculating the second probability until the second frame becomes an end frame; and
  
  finding to which word the input speech corresponds based on the maximum probability at the end frame for each word and the third probability at the end frame for each word.

Current Assignee
Kabushiki Kaisha Toshiba (Toshiba Corporation), Toshiba Digital Solutions Corporation (Toshiba Corporation)
Original Assignee
Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors
Fujimura, Hiroshi; Tanaka, Shinichi; Sakai, Masaru

Granted Patent

US 8,510,111 B2
US Class Current

704/256.4
CPC Class Codes

G10L 15/32 Multiple recognisers used i...

G10L 2015/085 Methods for reducing search...
