Correction of matching results for speech recognition
First Claim
1. A speech recognition system implemented by a processor comprising:
a feature calculating unit that converts, by the processor, an input sound signal into a feature for each frame;
a sound level calculating unit that calculates, by the processor, an input sound level expressed as either power of the sound signal in each frame or logarithm of the power or an amplitude of the sound signal in each frame or logarithm of the amplitude;
a decoding unit that receives the feature of each frame calculated by the feature calculating unit, matches, by the processor, the feature with an acoustic model and a linguistic model recorded in advance, and determines a recognized word sequence to be output based on the matching results, the acoustic model being data obtained by modeling of what feature speech is likely to have, and the linguistic model being data relating to a recognition word;
a start-point detector that compares, by the processor, the input sound level calculated by the sound level calculating unit with a reference value to determine a start frame serving as a start point of a speech section and notifies the decoding unit of the start frame;
an end-point detector that compares, by the processor, the input sound level calculated by the sound level calculating unit with a reference value to determine an end frame serving as an end point of the speech section and notifies the decoding unit of the end frame; and
a reference value updating unit that updates, by the processor, the reference value in accordance with variations in the input sound level after the start frame,
wherein when the reference value updating unit updates the reference value, the start-point detector updates the start frame using the updated reference value and notifies the decoding unit of the updated start frame, and
when the decoding unit starts matching the feature with the acoustic model and the linguistic model and then is notified of the updated start frame from the start-point detector after starting the matching and before being notified of the end frame from the end-point detector, the decoding unit corrects the already existing matching results of the decoding unit in accordance with the notified updated start frame.
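The four sound level expressions named in the claim (power, logarithm of the power, amplitude, logarithm of the amplitude) and the comparison against a reference value can be illustrated with a minimal Python sketch. The function names, the use of non-overlapping frames, the mean absolute value as the "amplitude", and the small constant eps that keeps the logarithm finite are all assumptions made for illustration; the claim does not specify them.

```python
import math


def sound_level(frame, kind="log_power", eps=1e-12):
    """Input sound level of one frame in one of the four claimed forms.

    Assumption: "amplitude" is taken as the mean absolute sample value,
    and eps keeps the logarithm finite for silent frames.
    """
    power = sum(x * x for x in frame) / len(frame)
    amplitude = sum(abs(x) for x in frame) / len(frame)
    if kind == "power":
        return power
    if kind == "log_power":
        return math.log10(power + eps)
    if kind == "amplitude":
        return amplitude
    if kind == "log_amplitude":
        return math.log10(amplitude + eps)
    raise ValueError("unknown level kind: " + kind)


def frame_levels(samples, frame_len, kind="log_power"):
    """Split samples into non-overlapping frames and compute each frame's level."""
    last = len(samples) - frame_len + 1
    return [sound_level(samples[i:i + frame_len], kind)
            for i in range(0, last, frame_len)]


def detect_start_frame(levels, reference):
    """Index of the first frame whose level exceeds the reference, or None."""
    return next((i for i, lv in enumerate(levels) if lv > reference), None)
```

A production front end would more likely use overlapping, windowed frames; the non-overlapping split here only keeps the example short.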
Abstract
A speech recognition system includes the following: a feature calculating unit; a sound level calculating unit that calculates an input sound level in each frame; a decoding unit that matches the feature of each frame with an acoustic model and a linguistic model, and outputs a recognized word sequence; a start-point detector that determines a start frame of a speech section based on a reference value; an end-point detector that determines an end frame of the speech section based on a reference value; and a reference value updating unit that updates the reference value in accordance with variations in the input sound level. The start-point detector updates the start frame every time the reference value is updated. The decoding unit starts matching before being notified of the end frame and corrects the matching results every time it is notified of the start frame. The speech recognition system can suppress a delay in response time while performing speech recognition based on a proper speech section.
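The interaction described in the abstract, where matching begins before the end frame is known and is corrected whenever the start frame moves, can be sketched as a small streaming loop. This is only an illustration under stated assumptions: the update rule (reference raised to the peak level seen so far minus a fixed margin), the correction rule (discarding results for frames that now precede the start frame), and every name below are hypothetical and are not taken from the patent text. End-point detection is omitted.

```python
def track_start_frame(levels, initial_reference, margin):
    """Stream per-frame levels, updating the reference and the start frame.

    Assumptions (not from the source): the reference is raised to
    peak - margin, and "correcting" the matching results means dropping
    results for frames earlier than the updated start frame.
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    reference = initial_reference
    start = None          # current start frame, None until speech is detected
    peak = None           # highest level seen since the start frame
    matched = []          # frames for which matching results currently exist
    events = []           # notifications sent to the (toy) decoding unit
    for t, level in enumerate(levels):
        if start is None:
            if level <= reference:
                continue  # still outside the speech section
            start, peak = t, level
            events.append(("start", t))
        elif level > peak:
            peak = level
            new_reference = max(reference, peak - margin)
            if new_reference != reference:
                reference = new_reference
                # Re-run start-point detection with the updated reference.
                new_start = next(i for i in range(start, t + 1)
                                 if levels[i] > reference)
                if new_start != start:
                    start = new_start
                    # Correct the existing matching results.
                    matched = [f for f in matched if f >= start]
                    events.append(("start_updated", start))
        matched.append(t)  # the decoder matches frame t without waiting for the end frame
    return start, reference, matched, events


# Quiet noise at frames 1-2 triggers an early start; louder speech from
# frame 3 raises the reference and moves the start frame to 3.
result = track_start_frame([-60, -40, -38, -10, -8, -9],
                           initial_reference=-50, margin=20)
```

In this run the decoder is first notified of start frame 1, then of updated start frame 3, and the results it had already produced for frames 1 and 2 are dropped.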
9 Claims
1. A speech recognition system implemented by a processor comprising:
a feature calculating unit that converts, by the processor, an input sound signal into a feature for each frame;
a sound level calculating unit that calculates, by the processor, an input sound level expressed as either power of the sound signal in each frame or logarithm of the power or an amplitude of the sound signal in each frame or logarithm of the amplitude;
a decoding unit that receives the feature of each frame calculated by the feature calculating unit, matches, by the processor, the feature with an acoustic model and a linguistic model recorded in advance, and determines a recognized word sequence to be output based on the matching results, the acoustic model being data obtained by modeling of what feature speech is likely to have, and the linguistic model being data relating to a recognition word;
a start-point detector that compares, by the processor, the input sound level calculated by the sound level calculating unit with a reference value to determine a start frame serving as a start point of a speech section and notifies the decoding unit of the start frame;
an end-point detector that compares, by the processor, the input sound level calculated by the sound level calculating unit with a reference value to determine an end frame serving as an end point of the speech section and notifies the decoding unit of the end frame; and
a reference value updating unit that updates, by the processor, the reference value in accordance with variations in the input sound level after the start frame,
wherein when the reference value updating unit updates the reference value, the start-point detector updates the start frame using the updated reference value and notifies the decoding unit of the updated start frame, and
when the decoding unit starts matching the feature with the acoustic model and the linguistic model and then is notified of the updated start frame from the start-point detector after starting the matching and before being notified of the end frame from the end-point detector, the decoding unit corrects the already existing matching results of the decoding unit in accordance with the notified updated start frame.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A speech recognition program stored in a non-transitory computer-readable recording medium that allows a computer to execute:
feature calculation processing that converts an input sound signal into a feature for each frame;
sound level calculation processing that calculates an input sound level expressed as power or an amplitude of the sound signal in each frame;
decoding processing that receives the feature of each frame calculated by the feature calculation processing, matches the feature with an acoustic model and a linguistic model recorded in advance, and outputs a recognized word sequence based on the matching results, the acoustic model being data obtained by modeling of what feature speech is likely to have, and the linguistic model being data relating to a recognition word;
start-point detection processing that compares the input sound level calculated by the sound level calculation processing with a reference value to determine a start frame serving as a start point of a speech section;
end-point detection processing that compares the input sound level calculated by the sound level calculation processing with a reference value to determine an end frame serving as an end point of the speech section; and
reference value updating processing that updates the reference value in accordance with variations in the input sound level after the start frame,
wherein when the reference value is updated by the reference value updating processing, the start frame is updated using the updated reference value by the start-point detection processing, and
when matching is started by the decoding processing and then the start frame is updated by the start-point detection processing after the matching is started and before the end frame is determined by the end-point detection processing, the already existing matching results by the decoding processing are corrected in accordance with the updated start frame by the decoding processing.
9. A speech recognition method executed by a processor comprising:
calculating, by the processor, a feature by converting an input sound signal into a feature for each frame;
calculating, by the processor, an input sound level expressed as power or an amplitude of the sound signal in each frame;
decoding, by the processor, the feature of each frame by receiving the feature of each frame, matching the feature with an acoustic model and a linguistic model recorded in advance, and outputting a recognized word sequence based on the matching results, the acoustic model being data obtained by modeling of what feature speech is likely to have, and the linguistic model being data relating to a recognition word;
determining, by the processor, a start frame serving as a start point of a speech section by comparing the input sound level with a reference value;
determining, by the processor, an end frame serving as an end point of the speech section by comparing the input sound level with a reference value; and
updating, by the processor, the reference value in accordance with variations in the input sound level after the start frame,
wherein when the reference value is updated, the start frame is updated using the updated reference value, and
when matching is started and then the start frame is updated after the matching is started and before the end frame is determined, the already existing matching results are corrected in accordance with the updated start frame.
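The end-frame determination step of the method can be illustrated in the same spirit. The claim only states that the input sound level is compared with a reference value; the rule used below, which requires a run of consecutive frames under the reference before declaring the end of the speech section, is an assumption added for illustration, as are the function and parameter names.

```python
def detect_end_frame(levels, reference, start_frame, min_silence_frames):
    """Return the first frame of the silent run that closes the speech section.

    Assumption (not from the claim): the section ends only after
    min_silence_frames consecutive frames fall below the reference, so a
    short dip inside an utterance does not end it. Returns None while no
    such run has been observed.
    """
    silent_run = 0
    for t in range(start_frame, len(levels)):
        if levels[t] < reference:
            silent_run += 1
            if silent_run == min_silence_frames:
                return t - min_silence_frames + 1
        else:
            silent_run = 0  # speech resumed; the dip was not the end point
    return None


# Frame 3 is a one-frame dip inside the utterance; frames 5-7 are the
# trailing silence, so the end frame is 5.
end = detect_end_frame([-60, -10, -12, -55, -11, -58, -59, -60],
                       reference=-30, start_frame=1, min_silence_frames=3)
```

Because the end frame is only confirmed some frames after it occurs, a decoder that waited for it before matching would add that delay to its response time, which is the delay the claimed early start of matching avoids.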
Specification