SPEECH RECOGNITION SYSTEM, SPEECH RECOGNITION PROGRAM, AND SPEECH RECOGNITION METHOD
First Claim
1. A speech recognition system comprising:
a feature calculating unit that converts an input sound signal into a feature for each frame;
a sound level calculating unit that calculates an input sound level expressed as either power of the sound signal in each frame or logarithm of the power or an amplitude of the sound signal in each frame or logarithm of the amplitude;
a decoding unit that receives the feature of each frame calculated by the feature calculating unit, matches the feature with an acoustic model and a linguistic model recorded in advance, and determines a recognized word sequence to be output based on the matching results, the acoustic model being data obtained by modeling of what feature speech is likely to have, and the linguistic model being data relating to a recognition word;
a start-point detector that compares the input sound level calculated by the sound level calculating unit with a reference value to determine a start frame serving as a start point of a speech section and notifies the decoding unit of the start frame;
an end-point detector that compares the input sound level calculated by the sound level calculating unit with a reference value to determine an end frame serving as an end point of the speech section and notifies the decoding unit of the end frame; and
a reference value updating unit that updates the reference value in accordance with variations in the input sound level after the start frame, wherein when the reference value updating unit updates the reference value, the start-point detector updates the start frame using the updated reference value and notifies the decoding unit of the updated start frame, and when the decoding unit starts matching after receiving the feature of each frame calculated by the feature calculating unit and then is notified of the start frame from the start-point detector before being notified of the end frame from the end-point detector, the decoding unit corrects the matching results in accordance with the notified start frame.
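The four level measures named in the claim (power, logarithm of the power, amplitude, logarithm of the amplitude) can be illustrated with a minimal sketch. The function name `frame_levels`, the non-overlapping framing, the use of mean squared sample value as power, the use of peak absolute value as amplitude, and the decibel scaling are all assumptions made for illustration; the claim does not fix any of them.

```python
import math

def frame_levels(samples, frame_len):
    """Split samples into non-overlapping frames and compute level measures.

    Assumptions (not taken from the claim): power is the mean squared
    sample value, amplitude is the peak absolute value, and logarithms
    are expressed in decibels with a small floor to avoid log(0).
    """
    eps = 1e-12  # floor so that silent frames do not produce log(0)
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        amplitude = max(abs(x) for x in frame)
        levels.append({
            "power": power,
            "log_power": 10.0 * math.log10(power + eps),
            "amplitude": amplitude,
            "log_amplitude": 20.0 * math.log10(amplitude + eps),
        })
    return levels
```

Any one of the four values could serve as the input sound level that the start-point and end-point detectors compare with the reference value.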
Abstract
A speech recognition system includes the following: a feature calculating unit; a sound level calculating unit that calculates an input sound level in each frame; a decoding unit that matches the feature of each frame with an acoustic model and a linguistic model, and outputs a recognized word sequence; a start-point detector that determines a start frame of a speech section based on a reference value; an end-point detector that determines an end frame of the speech section based on a reference value; and a reference value updating unit that updates the reference value in accordance with variations in the input sound level. The start-point detector updates the start frame every time the reference value is updated. The decoding unit starts matching before being notified of the end frame and corrects the matching results every time it is notified of the start frame. The speech recognition system can suppress a delay in response time while performing speech recognition based on a proper speech section.
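The interaction between the reference value updating unit and the start-point detector can be sketched as follows. This is only an illustration under stated assumptions: levels are taken to be in decibels, the reference value is raised to the running peak level minus a fixed margin `drop_db` (assumed positive), and the start frame is re-detected as the first frame above the updated reference. The names `detect_start`, `track_start_frame`, and `drop_db` are hypothetical, and the actual update rule of the patent is not given in this section.

```python
def detect_start(levels, reference):
    """Return the index of the first frame whose level exceeds reference."""
    for i, level in enumerate(levels):
        if level > reference:
            return i
    return None

def track_start_frame(levels, initial_reference, drop_db=30.0):
    """Scan frames in order and record every start-frame notification.

    Returns (notifications, final_reference), where each notification is
    (frame_at_which_notified, start_frame). The update rule used here,
    reference = max(reference, peak - drop_db), is an assumed example.
    """
    reference = initial_reference
    start = None
    peak = None
    notifications = []
    for t, level in enumerate(levels):
        if start is None:
            # Initial start-point detection against the current reference.
            if level > reference:
                start, peak = t, level
                notifications.append((t, start))
            continue
        if level > peak:
            # The input level rose after the start frame: update the reference.
            peak = level
            new_reference = max(reference, peak - drop_db)
            if new_reference > reference:
                reference = new_reference
                # Re-detect the start frame with the updated reference.
                new_start = detect_start(levels[:t + 1], reference)
                if new_start != start:
                    start = new_start
                    notifications.append((t, start))
    return notifications, reference
```

In this sketch a loud utterance that follows moderate background noise first triggers a start at the noise, and the start frame is moved later once the louder speech raises the reference value.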
41 Citations
9 Claims
1. A speech recognition system comprising:
a feature calculating unit that converts an input sound signal into a feature for each frame;
a sound level calculating unit that calculates an input sound level expressed as either power of the sound signal in each frame or logarithm of the power or an amplitude of the sound signal in each frame or logarithm of the amplitude;
a decoding unit that receives the feature of each frame calculated by the feature calculating unit, matches the feature with an acoustic model and a linguistic model recorded in advance, and determines a recognized word sequence to be output based on the matching results, the acoustic model being data obtained by modeling of what feature speech is likely to have, and the linguistic model being data relating to a recognition word;
a start-point detector that compares the input sound level calculated by the sound level calculating unit with a reference value to determine a start frame serving as a start point of a speech section and notifies the decoding unit of the start frame;
an end-point detector that compares the input sound level calculated by the sound level calculating unit with a reference value to determine an end frame serving as an end point of the speech section and notifies the decoding unit of the end frame; and
a reference value updating unit that updates the reference value in accordance with variations in the input sound level after the start frame, wherein when the reference value updating unit updates the reference value, the start-point detector updates the start frame using the updated reference value and notifies the decoding unit of the updated start frame, and when the decoding unit starts matching after receiving the feature of each frame calculated by the feature calculating unit and then is notified of the start frame from the start-point detector before being notified of the end frame from the end-point detector, the decoding unit corrects the matching results in accordance with the notified start frame.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A speech recognition program stored in a computer-readable recording medium that allows a computer to execute:
feature calculation processing that converts an input sound signal into a feature for each frame;
sound level calculation processing that calculates an input sound level expressed as either power or an amplitude of the sound signal in each frame;
decoding processing that receives the feature of each frame calculated by the feature calculation processing, matches the feature with an acoustic model and a linguistic model recorded in advance, and outputs a recognized word sequence based on the matching results, the acoustic model being data obtained by modeling of what feature speech is likely to have, and the linguistic model being data relating to a recognition word;
start-point detection processing that compares the input sound level calculated by the sound level calculation processing with a reference value to determine a start frame serving as a start point of a speech section;
end-point detection processing that compares the input sound level calculated by the sound level calculation processing with a reference value to determine an end frame serving as an end point of the speech section; and
reference value updating processing that updates the reference value in accordance with variations in the input sound level after the start frame, wherein when the reference value is updated by the reference value updating processing, the start frame is updated using the updated reference value by the start-point detection processing, and when matching is started by the decoding processing after receiving the feature of each frame calculated by the feature calculation processing, and then the start frame is updated by the start-point detection processing before the end frame is determined by the end-point detection processing, the matching results are corrected in accordance with the updated start frame by the decoding processing.
9. A speech recognition method comprising:
calculating a feature by converting an input sound signal into a feature for each frame;
calculating an input sound level expressed as either power or an amplitude of the sound signal in each frame;
decoding the feature of each frame by receiving the feature of each frame, matching the feature with an acoustic model and a linguistic model recorded in advance, and outputting a recognized word sequence based on the matching results, the acoustic model being data obtained by modeling of what feature speech is likely to have, and the linguistic model being data relating to a recognition word;
determining a start frame serving as a start point of a speech section by comparing the input sound level with a reference value;
determining an end frame serving as an end point of the speech section by comparing the input sound level with a reference; and
updating the reference value in accordance with variations in the input sound level after the start frame, wherein when the reference value is updated, the start frame is updated using the updated reference value, and when matching is started after receiving the feature of each frame, and then the start frame is updated before the end frame is determined, the matching results are corrected in accordance with the updated start frame.
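The final step of the method, in which matching begins before the end frame is known and the results are corrected whenever the start frame is updated, can be illustrated with a toy decoder. This sketch is an assumption throughout: the class `ToyDecoder`, its method names, and the idea of "correction" as recomputing a total score over the frames inside the current speech section are hypothetical stand-ins, and a real decoder would match features against acoustic and linguistic models rather than call a simple scoring function.

```python
class ToyDecoder:
    """Toy stand-in for a decoder that scores frames as they arrive.

    score_frame is a hypothetical per-frame scoring function supplied by
    the caller. Per-frame scores are kept for every received frame so the
    result can be corrected when the start frame moves in either direction.
    """

    def __init__(self, score_frame):
        self.score_frame = score_frame
        self.frame_scores = {}   # frame index -> matching score
        self.start_frame = 0     # matching starts before any notification
        self.end_frame = None    # unknown until the end point is detected

    def receive_feature(self, frame_index, feature):
        # Matching starts as soon as features arrive.
        self.frame_scores[frame_index] = self.score_frame(feature)

    def notify_start(self, start_frame):
        # A new or updated start frame changes which frames count.
        self.start_frame = start_frame

    def notify_end(self, end_frame):
        self.end_frame = end_frame

    def total_score(self):
        # Corrected result: only frames inside the current speech section.
        return sum(
            score for index, score in self.frame_scores.items()
            if index >= self.start_frame
            and (self.end_frame is None or index <= self.end_frame)
        )
```

Because scoring proceeds while frames arrive, the result is available shortly after the end frame is notified, which is the response-time benefit the abstract describes.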
Specification