Correction of matching results for speech recognition

US 7,991,614 B2
Filed: 09/11/2009
Issued: 08/02/2011
Est. Priority Date: 03/20/2007
Status: Active Grant

Abstract

A speech recognition system includes the following: a feature calculating unit; a sound level calculating unit that calculates an input sound level in each frame; a decoding unit that matches the feature of each frame with an acoustic model and a linguistic model, and outputs a recognized word sequence; a start-point detector that determines a start frame of a speech section based on a reference value; an end-point detector that determines an end frame of the speech section based on a reference value; and a reference value updating unit that updates the reference value in accordance with variations in the input sound level. The start-point detector updates the start frame every time the reference value is updated. The decoding unit starts matching before being notified of the end frame and corrects the matching results every time it is notified of the start frame. The speech recognition system can suppress a delay in response time while performing speech recognition based on a proper speech section.
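
The input sound level mentioned in the abstract is, per the claims, a power or an amplitude of the signal in each frame, or the logarithm of either. The following is a minimal sketch only, assuming non-overlapping frames, mean-square power, mean absolute amplitude, and decibel scaling. The function name `frame_levels`, its parameters, and the `eps` floor are illustrative and do not come from the patent.

```python
import math
from typing import List, Sequence


def frame_levels(samples: Sequence[float], frame_len: int,
                 kind: str = "log_power") -> List[float]:
    """Return one input sound level per non-overlapping frame (hypothetical helper)."""
    eps = 1e-12  # assumption of this sketch: keeps log10 defined on silent frames
    levels: List[float] = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len       # mean-square power
        amplitude = sum(abs(x) for x in frame) / frame_len  # mean absolute amplitude
        if kind == "power":
            levels.append(power)
        elif kind == "log_power":
            levels.append(10.0 * math.log10(power + eps))
        elif kind == "amplitude":
            levels.append(amplitude)
        elif kind == "log_amplitude":
            levels.append(20.0 * math.log10(amplitude + eps))
        else:
            raise ValueError(f"unknown level kind: {kind}")
    return levels
```

A trailing partial frame is dropped here for simplicity; a real front end would more likely use overlapping, windowed frames.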

9 Claims

1. A speech recognition system implemented by a processor comprising:
- a feature calculating unit that converts, by the processor, an input sound signal into a feature for each frame;
  
  a sound level calculating unit that calculates, by the processor, an input sound level expressed as either power of the sound signal in each frame or logarithm of the power or an amplitude of the sound signal in each frame or logarithm of the amplitude;
  
  a decoding unit that receives the feature of each frame calculated by the feature calculating unit, matches, by the processor, the feature with an acoustic model and a linguistic model recorded in advance, and determines a recognized word sequence to be output based on the matching results, the acoustic model being data obtained by modeling of what feature speech is likely to have, and the linguistic model being data relating to a recognition word;
  
  a start-point detector that compares, by the processor, the input sound level calculated by the sound level calculating unit with a reference value to determine a start frame serving as a start point of a speech section and notifies the decoding unit of the start frame;
  
  an end-point detector that compares, by the processor, the input sound level calculated by the sound level calculating unit with a reference value to determine an end frame serving as an end point of the speech section and notifies the decoding unit of the end frame; and
  
  a reference value updating unit that updates, by the processor, the reference value in accordance with variations in the input sound level after the start frame,
  
  wherein when the reference value updating unit updates the reference value, the start-point detector updates the start frame using the updated reference value and notifies the decoding unit of the updated start frame, and
  
  when the decoding unit starts matching the feature with the acoustic model and the linguistic model and then is notified of the updated start frame from the start-point detector after starting the matching and before being notified of the end frame from the end-point detector, the decoding unit corrects the already existing matching results of the decoding unit in accordance with the notified updated start frame.

2. The speech recognition system according to claim 1, wherein the decoding unit determines the recognized word sequence based on only the matching results of the features of the frames included in the speech section between the start frame received from the start-point detector and the end frame received from the end-point detector.

3. The speech recognition system according to claim 1, wherein the decoding unit corrects the matching results by assigning weights to the decoding results of the feature of each frame, and the decoding unit performs weighting so that the weights assigned to the matching results of the features of the frames outside the speech section between the start frame received from the start-point detector and the end frame received from the end-point detector are lighter than those assigned to the matching results of the features of the frames within the speech section between the start frame and the end frame.

4. The speech recognition system according to claim 1, wherein the decoding unit determines the recognized word sequence to be output by excluding a word containing frames that are not present in the section between the start frame received from the start-point detector and the end frame received from the end-point detector.

5. The speech recognition system according to claim 1, wherein the reference value updating unit calculates a maximum input sound level in the frames subsequent to the start frame and updates the reference value in accordance with the maximum input sound level.

6. The speech recognition system according to claim 1, wherein the reference value updating unit gradually reduces the reference value over time.

7. The speech recognition system according to claim 5, wherein the reference value updating unit gradually reduces the calculated maximum input sound level over time.
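
The interplay between the reference value updating unit and the start-point detector (claims 1, 5 and 7) can be illustrated with a small sketch. It is not the patented implementation. It assumes levels in decibels, a reference value set a fixed margin below the tracked maximum level, and a start frame defined as the first buffered frame above the reference. The class `AdaptiveStartDetector` and the parameters `initial_ref`, `margin_db` and `decay_db` are hypothetical names introduced for illustration.

```python
from typing import List, Optional


class AdaptiveStartDetector:
    """Toy start-point detector with an adaptive reference value (illustrative)."""

    def __init__(self, initial_ref: float, margin_db: float = 20.0,
                 decay_db: float = 0.0) -> None:
        self.initial_ref = initial_ref  # floor for the reference value
        self.ref = initial_ref          # current reference value
        self.margin_db = margin_db      # assumed gap between max level and reference
        self.decay_db = decay_db        # per-frame decay of the tracked max (cf. claim 7)
        self.levels: List[float] = []   # buffered input sound levels
        self.start_frame: Optional[int] = None
        self.max_level: Optional[float] = None

    def push(self, level: float) -> Optional[int]:
        """Feed one frame level; return a new or updated start frame, else None."""
        self.levels.append(level)
        if self.start_frame is None:
            if level > self.ref:
                self.start_frame = len(self.levels) - 1
                self.max_level = level
                return self.start_frame
            return None
        # After the start frame: track the (optionally decaying) maximum level.
        self.max_level = max(self.max_level - self.decay_db, level)
        new_ref = max(self.initial_ref, self.max_level - self.margin_db)
        if new_ref == self.ref:
            return None
        self.ref = new_ref
        # Re-detect the start frame against the updated reference value.
        new_start = next(
            (i for i, v in enumerate(self.levels) if v > self.ref), None)
        if new_start is not None and new_start != self.start_frame:
            self.start_frame = new_start
            return new_start  # the decoder would be notified of this value
        return None
```

For example, feeding the levels `-60, -38, -60, -60, -10, -12` with `initial_ref=-40.0` first reports frame 1 (a short noise burst above the initial reference) and later, once the louder frame raises the reference to -30, reports frame 4 as the updated start frame.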

8. A speech recognition program stored in a non-transitory computer-readable recording medium that allows a computer to execute:
- feature calculation processing that converts an input sound signal into a feature for each frame;
  
  sound level calculation processing that calculates an input sound level expressed as power or an amplitude of the sound signal in each frame;
  
  decoding processing that receives the feature of each frame calculated by the feature calculation processing, matches the feature with an acoustic model and a linguistic model recorded in advance, and outputs a recognized word sequence based on the matching results, the acoustic model being data obtained by modeling of what feature speech is likely to have, and the linguistic model being data relating to a recognition word;
  
  start-point detection processing that compares the input sound level calculated by the sound level calculation processing with a reference value to determine a start frame serving as a start point of a speech section;
  
  end-point detection processing that compares the input sound level calculated by the sound level calculation processing with a reference value to determine an end frame serving as an end point of the speech section; and
  
  reference value updating processing that updates the reference value in accordance with variations in the input sound level after the start frame,
  
  wherein when the reference value is updated by the reference value updating processing, the start frame is updated using the updated reference value by the start-point detection processing, and
  
  when matching is started by the decoding processing and then the start frame is updated by the start-point detection processing after the matching is started and before the end frame is determined by the end-point detection processing, the already existing matching results by the decoding processing are corrected in accordance with the updated start frame by the decoding processing.

9. A speech recognition method executed by a processor comprising:
- calculating, by the processor, a feature by converting an input sound signal into a feature for each frame;
  
  calculating, by the processor, an input sound level expressed as power or an amplitude of the sound signal in each frame;
  
  decoding, by the processor, the feature of each frame by receiving the feature of each frame, matching the feature with an acoustic model and a linguistic model recorded in advance, and outputting a recognized word sequence based on the matching results, the acoustic model being data obtained by modeling of what feature speech is likely to have, and the linguistic model being data relating to a recognition word;
  
  determining, by the processor, a start frame serving as a start point of a speech section by comparing the input sound level with a reference value;
  
  determining, by the processor, an end frame serving as an end point of the speech section by comparing the input sound level with a reference value; and
  
  updating, by the processor, the reference value in accordance with variations in the input sound level after the start frame,
  
  wherein when the reference value is updated, the start frame is updated using the updated reference value, and
  
  when matching is started and then the start frame is updated after the matching is started and before the end frame is determined, the already existing matching results are corrected in accordance with the updated start frame.
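
The final step, correcting already existing matching results once an updated start frame arrives, can be sketched with a toy decoder. This is a heavily simplified illustration, not the decoding described in the patent: it assumes each frame yields one log-likelihood per candidate word, and that correction means re-accumulating those scores with a lighter weight for frames before the start frame (a weight of 0 discards them entirely). `ToyDecoder`, `outside_weight` and the example words are hypothetical.

```python
from typing import Dict, List


class ToyDecoder:
    """Accumulates per-frame word scores and re-weights them on a start update."""

    def __init__(self, outside_weight: float = 0.0) -> None:
        self.outside_weight = outside_weight  # weight for frames before the start
        self.frame_scores: List[Dict[str, float]] = []
        self.start_frame = 0
        self.totals: Dict[str, float] = {}

    def _accumulate(self, index: int, scores: Dict[str, float]) -> None:
        weight = 1.0 if index >= self.start_frame else self.outside_weight
        for word, score in scores.items():
            self.totals[word] = self.totals.get(word, 0.0) + weight * score

    def match(self, scores: Dict[str, float]) -> None:
        """Record the matching result of one new frame (matching starts early)."""
        index = len(self.frame_scores)
        self.frame_scores.append(scores)
        self._accumulate(index, scores)

    def notify_start(self, start_frame: int) -> None:
        """Correct the existing matching results for an updated start frame."""
        self.start_frame = start_frame
        self.totals = {}
        for index, scores in enumerate(self.frame_scores):
            self._accumulate(index, scores)

    def best(self) -> str:
        """Return the currently best-scoring word."""
        return max(self.totals, key=self.totals.get)


decoder = ToyDecoder(outside_weight=0.0)
for scores in [{"go": -1.0, "no": -9.0}, {"go": -1.0, "no": -9.0},  # noise frames
               {"go": -5.0, "no": -1.0}, {"go": -5.0, "no": -1.0},
               {"go": -5.0, "no": -1.0}]:
    decoder.match(scores)
before = decoder.best()   # noise frames still count toward the result
decoder.notify_start(2)   # updated start frame arrives before the end frame
after = decoder.best()    # result after correcting the existing scores
```

Because matching begins before the end frame is known, the decoder keeps the per-frame scores so that a late start-frame update only requires re-weighting rather than a full rerun.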

Current Assignee: Fujitsu Limited
Original Assignee: Fujitsu Limited
Inventors: Washio, Nobuyuki; Harada, Shouji
Primary Examiner(s): Wozniak, James S.
Assistant Examiner(s): He, Jialong
Application Number: US12/558,249
Publication Number: US 20100004932A1
Time in Patent Office: 690 Days
Field of Search: 704/226, 704/233, 704/210, 704/215, 704/248
US Class Current: 704/248
CPC Class Codes: G10L 15/05 Word boundary detection
