Environment adaptive speech recognition method and device

US 9,870,771 B2
Filed: 05/09/2016
Issued: 01/16/2018
Est. Priority Date: 11/14/2013
Status: Active Grant

Abstract

A speech recognition method, a speech recognition device, and an electronic device. In this method, a first determination is made by using a sample environment corresponding to a detection speech and a previous environment type, so as to output a corresponding speech correction instruction to a speech engine. A to-be-recognized speech is then input to the speech engine and a noise type detection engine at the same time. The speech engine corrects the to-be-recognized speech by using the speech correction instruction, so that quality of the original speech is not impaired by noise processing, and outputs a corresponding initial recognition result. The noise type detection engine determines a current environment type by using the to-be-recognized speech and speech training samples under different environments. Finally, confidence of the initial recognition result is adjusted by using the current environment type.

17 Claims

1. A speech recognition method, comprising:
- receiving, by a speech recognition device, an input speech, wherein the speech recognition device comprises a noise type detection engine, a storage area and a speech engine;
  
  dividing the input speech, by the speech recognition device, into detection speech at a beginning of the input speech and a to-be-recognized speech following the detection speech, wherein a length of speech data comprised in the detection speech is less than a length of speech data comprised in the to-be-recognized speech;
  
  selecting, by the noise type detection engine based on comparing the detection speech with a plurality of speech training samples under a plurality of different sample environments, a sample environment corresponding to a speech training sample among the plurality of speech training samples that has a minimum difference with the detection speech, as a detection environment type, wherein the plurality of sample environments comprises a quiet environment and a noise environment;
  
  detecting, by the speech recognition device, a storage area;
  
  outputting, by the speech recognition device, when a recognizable previous environment type exists in the storage area, a speech correction instruction according to a result of comparison between the detection environment type and the previous environment type, wherein the previous environment type comprises a quiet environment or a noise environment;
  
  controlling, by the speech engine according to the speech correction instruction, correction on the to-be-recognized speech, and outputting an initial recognition result;
  
  separately comparing, by the noise type detection engine, the received to-be-recognized speech with the plurality of the speech training samples, and selecting a sample environment corresponding to a speech training sample among the plurality of speech training samples that has a minimum difference with the to-be-recognized speech, as a current environment type;
  
  storing, by the speech recognition device, the current environment type to the storage area, and abandoning the current environment type after a preset duration; and
  
  outputting, by the speech recognition device, a final recognition result after a confidence value of the initial recognition result is adjusted according to the current environment type.
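
The dividing and selecting steps of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the split fraction, the mean-absolute-amplitude feature, the reference levels, and the names `split_input`, `mean_level`, `nearest_environment`, and `TRAINING_LEVELS` are all hypothetical. The claim itself only requires that the detection speech be shorter than the to-be-recognized speech and that the sample environment with the minimum difference be selected.

```python
from typing import Dict, List, Sequence, Tuple


def split_input(samples: Sequence[float],
                detect_fraction: float = 0.1) -> Tuple[List[float], List[float]]:
    """Return (detection speech, to-be-recognized speech).

    detect_fraction is a hypothetical tuning parameter, not a value from the claim.
    """
    cut = max(1, int(len(samples) * detect_fraction))
    detection, to_recognize = list(samples[:cut]), list(samples[cut:])
    if len(detection) >= len(to_recognize):
        raise ValueError("detection speech must be shorter than the rest")
    return detection, to_recognize


def mean_level(samples: Sequence[float]) -> float:
    """Mean absolute amplitude, used here as a stand-in feature (assumption)."""
    return sum(abs(s) for s in samples) / len(samples)


def nearest_environment(segment: Sequence[float],
                        training_levels: Dict[str, float]) -> str:
    """Pick the sample environment whose training level differs least."""
    level = mean_level(segment)
    return min(training_levels, key=lambda env: abs(training_levels[env] - level))


# Hypothetical training data: one reference level per sample environment.
TRAINING_LEVELS = {"quiet": 0.02, "noise": 0.30}

speech = [0.01, 0.02] + [0.28, -0.31] * 9  # 20 samples, quiet start then loud
detection, to_recognize = split_input(speech)
detection_type = nearest_environment(detection, TRAINING_LEVELS)
current_type = nearest_environment(to_recognize, TRAINING_LEVELS)
```

In this sketch the same nearest-match routine yields both the detection environment type (from the short leading segment) and the current environment type (from the remainder), mirroring the two selecting steps of the claim.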
  - 2. The method according to claim 1, wherein when the previous environment type is not recognized in the storage area, the method further comprises:
    - acquiring, by the speech recognition device, a pre-stored initial environment type, wherein the initial environment type comprises a quiet environment or a noise environment; and
      
      determining, by the speech recognition device, according to the initial environment type and the detection environment type, and outputting the speech correction instruction.
  - 3. The method according to claim 2, wherein the determining, by the speech recognition device, according to the initial environment type and the detection environment type, and outputting the speech correction instruction comprises:
    - determining, by the speech recognition device, whether the initial environment type is the same as the detection environment type;
      
      if the initial environment type is the same as the detection environment type, outputting by the speech recognition device, when both the initial environment type and the detection environment type are noise environments, a speech correction instruction used for speech quality enhancement, and outputting by the speech recognition device, when both the initial environment type and the detection environment type are quiet environments, a speech correction instruction used for disabling noise reduction processing; and
      
      if the initial environment type is not the same as the detection environment type, outputting by the speech recognition device, when the initial environment type is a noise environment, a speech correction instruction used for speech quality enhancement, and outputting by the speech recognition device, when the initial environment type is a quiet environment, a speech correction instruction used for disabling noise reduction processing.
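
The decision rule of claim 3 can be written as a small function. The sketch below is illustrative only; the `Env` and `Correction` enums and the function name `correction_from_initial` are hypothetical names introduced here, not terms from the patent.

```python
from enum import Enum


class Env(Enum):
    QUIET = "quiet"
    NOISE = "noise"


class Correction(Enum):
    ENHANCE = "speech quality enhancement"
    DISABLE_NR = "disable noise reduction processing"


def correction_from_initial(initial: Env, detection: Env) -> Correction:
    """Choose a speech correction instruction when no previous type is stored."""
    if initial == detection:
        # Same type: noise/noise enhances, quiet/quiet disables noise reduction.
        return Correction.ENHANCE if detection is Env.NOISE else Correction.DISABLE_NR
    # Different types: the pre-stored initial environment type decides.
    return Correction.ENHANCE if initial is Env.NOISE else Correction.DISABLE_NR
```

As worded, both branches of claim 3 resolve to the initial environment type, so the function could be collapsed to a single check; the two branches are kept to mirror the claim language.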
  - 4. The method according to claim 1, wherein the outputting by the speech recognition device, when a recognizable previous environment type exists in the storage area, a speech correction instruction according to a result of comparison between the detection environment type and the previous environment type comprises:
    - acquiring the previous environment type and effective impact duration T of the previous environment type on the input speech;
      
      calculating a time difference t between time for inputting the detection speech and time for previously inputting a speech, and an impact value w(t) of the previous environment type on the detection environment type, wherein w(t) is a truncation function that decays with time t, a value of w(t) is obtained by training sample data of a speech training sample under a different sample environment, and values of t and T are positive integers;
      
      determining a balance relationship between the previous environment type and the detection environment type;
      
      outputting, when both the previous environment type and the detection environment type are noise environments, a speech correction instruction used for speech quality enhancement;
      
      outputting, when both the previous environment type and the detection environment type are quiet environments, a speech correction instruction used for disabling noise reduction processing;
      
      outputting, when the previous environment type is a noise environment, the detection environment type is a quiet environment, and w(t) >= 0.5, a speech correction instruction used for speech quality enhancement;

      outputting, when the previous environment type is a noise environment, the detection environment type is a quiet environment, and w(t) < 0.5, a speech correction instruction used for disabling noise reduction processing; and

      when the w(t) > T, outputting, when the detection environment type is a quiet environment, a speech correction instruction used for disabling noise reduction processing, and outputting, when the detection environment is a noise environment, a speech correction instruction used for speech quality enhancement.
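
A hedged sketch of the claim 4 decision logic follows. Several points are assumptions rather than statements of the claim: the claim says only that w(t) is a truncation function that decays with t and is obtained from training data, so the linear form in `impact_weight` is a placeholder; the final condition is printed in the claim as a comparison against T, and the sketch interprets it as the time difference t exceeding the effective impact duration T; and the case of a quiet previous type with a noise detection type is not spelled out in the claim, so the sketch falls back to the detection type. All function and constant names are hypothetical.

```python
QUIET, NOISE = "quiet", "noise"
ENHANCE = "speech quality enhancement"
DISABLE_NR = "disable noise reduction processing"


def impact_weight(t: int, T: int) -> float:
    """Placeholder truncated linear decay: 1.0 at t == 0, 0.0 for t >= T."""
    if T <= 0:
        raise ValueError("T must be a positive integer")
    return max(0.0, 1.0 - t / T)


def correction_from_previous(previous: str, detection: str, t: int, T: int) -> str:
    """Pick a speech correction instruction from previous and detection types."""
    by_detection = ENHANCE if detection == NOISE else DISABLE_NR
    if t > T:
        # Assumed reading: the previous type no longer has any impact.
        return by_detection
    if previous == detection:
        return by_detection
    if previous == NOISE:
        # Detection is quiet: the decayed impact of the earlier noise decides.
        return ENHANCE if impact_weight(t, T) >= 0.5 else DISABLE_NR
    # Previous quiet, detection noise: not covered by the claim (assumption).
    return by_detection
```

For example, with T = 10 a noise-then-quiet transition keeps enhancement on for t up to 5 under this placeholder weight and switches noise reduction off afterwards.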
  - 5. The method according to claim 1, wherein the separately comparing, by the noise type detection engine, the received to-be-recognized speech with a speech training sample under a different sample environment, and selecting a sample environment corresponding to a speech training sample that has a minimum difference with the to-be-recognized speech, as a current environment type comprises:
    - analyzing, by the noise type detection engine, a speech frame part and a noise frame part of the received to-be-recognized speech to acquire a noise level, a speech level, and a signal-to-noise ratio (SNR) of the to-be-recognized speech;
      
      comparing the noise level, the speech level, and the SNR of the to-be-recognized speech with a noise training level, a speech training level, and a training SNR of a speech training sample under a different sample environment, respectively; and
      
      determining that a sample environment corresponding to a noise training level that has a minimum difference with the noise level, a speech training level that has a minimum difference with the speech level, and a training SNR that has a minimum difference with the SNR is the current environment type.
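
The feature comparison in claim 5 might look like the sketch below. It is an illustration under assumptions: frames are labeled speech or noise with a simple energy threshold (`vad_threshold_db`, hypothetical), levels are averaged per-frame powers in dB, and the three differences are combined into one summed absolute distance. The claim does not prescribe any of these choices, and `Profile`, `measure`, and `current_environment` are names introduced here.

```python
import math
from typing import Dict, NamedTuple, Sequence


class Profile(NamedTuple):
    noise_db: float
    speech_db: float
    snr_db: float


def frame_db(frame: Sequence[float]) -> float:
    """Mean power of one frame in dB, floored to avoid log of zero."""
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(power, 1e-12))


def measure(frames: Sequence[Sequence[float]],
            vad_threshold_db: float = -20.0) -> Profile:
    """Derive noise level, speech level, and SNR from labeled frames."""
    levels = [frame_db(f) for f in frames]
    speech = [lv for lv in levels if lv >= vad_threshold_db]
    noise = [lv for lv in levels if lv < vad_threshold_db]
    if not speech or not noise:
        raise ValueError("need at least one speech frame and one noise frame")
    noise_db = sum(noise) / len(noise)      # average of per-frame dB levels
    speech_db = sum(speech) / len(speech)
    return Profile(noise_db, speech_db, speech_db - noise_db)


def current_environment(measured: Profile, training: Dict[str, Profile]) -> str:
    """Return the sample environment with the smallest summed difference."""
    def distance(env: str) -> float:
        return sum(abs(a - b) for a, b in zip(measured, training[env]))
    return min(training, key=distance)


# Hypothetical training profiles for two sample environments.
TRAINING = {
    "quiet": Profile(noise_db=-60.0, speech_db=-6.0, snr_db=54.0),
    "noise": Profile(noise_db=-25.0, speech_db=-6.0, snr_db=19.0),
}
```

A call such as `current_environment(measure(frames), TRAINING)` then returns the name of the closest sample environment for the to-be-recognized speech.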
  - 6. The method according to claim 1, wherein the preset duration when the current environment type is a quiet environment is longer than the preset duration when the current environment type is a noise environment.
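
The storing and abandoning behavior of claims 1 and 6 resembles a single-entry cache with a per-type expiry. The sketch below assumes that model; the class name `EnvironmentStore`, the injectable clock, and the example durations are hypothetical, and the claim only requires that the quiet duration be longer than the noise duration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class EnvironmentStore:
    """Holds the most recent environment type until its preset duration elapses."""

    def __init__(self, ttl_seconds: Dict[str, float],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entry: Optional[Tuple[str, float]] = None  # (type, expiry time)

    def store(self, env: str) -> None:
        """Store the current environment type with its type-specific expiry."""
        self._entry = (env, self._clock() + self._ttl[env])

    def previous(self) -> Optional[str]:
        """Return the stored type, or None if absent or already abandoned."""
        if self._entry is None:
            return None
        env, expires_at = self._entry
        if self._clock() >= expires_at:
            self._entry = None  # abandon the expired environment type
            return None
        return env


# Example durations (assumed values): quiet is kept longer than noise.
PRESET_DURATIONS = {"quiet": 60.0, "noise": 20.0}
```

When `previous()` returns `None`, a caller would fall back to the pre-stored initial environment type described in claim 2.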
  - 7. The method according to claim 1, wherein the noise environment comprises:
    - a vehicle-mounted low noise environment, a vehicle-mounted high noise environment, an ordinary roadside environment, a busy roadside environment, and a noisy environment.

8. A speech recognition device, comprising:
- a processor, configured to:
  
  acquire a detection speech and a to-be-recognized speech following the detection speech by sampling an input speech, and input the detection speech and the to-be-recognized speech into a noise type detection engine and a speech engine at the same time;
  
  detect a storage area, and output, when a recognizable previous environment type exists in the storage area, a speech correction instruction according to a result of comparison between a detection environment type output by the noise type detection engine in response to receiving the detection speech and the to-be-recognized speech and the previous environment type; and
  
  output a final recognition result after a confidence value of an initial recognition result output by the engine is adjusted according to a current environment type output by the noise type detection engine, wherein a length of speech data comprised in the detection speech is less than a length of speech data comprised in the to-be-recognized speech, and the previous environment type is a quiet environment or a noise environment;
  
  the noise type detection engine interfaced to the processor, configured to:
  
  separately compare the detection speech and the to-be-recognized speech that are output by the processor with a plurality of speech training samples under a plurality of different sample environments;
  
  select a sample environment corresponding to a speech training sample that has a minimum difference with the detection speech, as a detection environment type;
  
  select a sample environment corresponding to a speech training sample that has a minimum difference with the to-be-recognized speech, as a current environment type; and
  
  store the current environment type to the storage area, and abandon the current environment type after preset duration; and
  
  the speech engine interfaced to the noise type detection engine and the processor, configured to receive the speech correction instruction from the processor and control correction on the received to-be-recognized speech according to the speech correction instruction output by the processor, and output an initial recognition result.
  - 9. The device according to claim 8, wherein the processor, configured to detect a storage area, and output, when a recognizable previous environment exists in the storage area, a speech correction instruction according to a result of comparison between a detection environment type output by the noise type detection engine and the previous environment type comprises:
    - the processor, configured to acquire the previous environment type and effective impact duration T of the previous environment type on the input speech;
      
      calculate a time difference t between time for inputting the detection speech and time for previously inputting a speech, and an impact value w(t) of the previous environment type on the detection environment type;
      
      determine a balance relationship between the previous environment type and the detection environment type;
      
      output, when both the previous environment type and the detection environment type are noise environments, a speech correction instruction used for speech quality enhancement;
      
      output, when both the previous environment type and the detection environment type are quiet environments, a speech correction instruction used for disabling noise reduction processing;
      
      output, when the previous environment type is a noise environment, the detection environment is a quiet environment, and w(t) >= 0.5, a speech correction instruction used for speech quality enhancement;

      output, when the previous environment type is a noise environment, the detection environment is a quiet environment, and w(t) < 0.5, a speech correction instruction used for disabling noise reduction processing; and

      when w(t) > T, output, when the detection environment type is a quiet environment, a speech correction instruction used for disabling noise reduction processing; and

      output, when the detection environment type is a noise environment, a speech correction instruction used for speech quality enhancement, wherein w(t) is a truncation function that decays with time t, a value of w(t) is obtained by training sample data of a speech training sample under a different sample environment, and values of t and T are positive integers.
  - 10. The device according to claim 8, wherein the noise type detection engine, configured to compare the to-be-recognized speech output by the processor with a speech training sample under a different sample environment;
    - select a sample environment corresponding to a speech training sample that has a minimum difference with the to-be-recognized speech, as a current environment type comprises:
      
      the noise type detection engine, configured to analyze a speech frame part and a noise frame part of the received to-be-recognized speech to acquire a noise level, a speech level, and a signal-to-noise ratio (SNR) of the to-be-recognized speech;
      
      compare the noise level, the speech level, and the SNR of the to-be-recognized speech with a noise training level, a speech training level, and a training SNR of a speech training sample under a different sample environment, respectively; and
      
      determine that a sample environment corresponding to a noise training level that has a minimum difference with the noise level, a speech training level that has a minimum difference with the speech level, and a training SNR that has a minimum difference with the SNR is the current environment type.

11. An electronic device, comprising a speech recognition device, a speech recording device connected to the speech recognition device, and a microphone connected to the recording device;
- wherein the speech recording device is configured to collect and record an input speech by using the microphone, and is configured to input the recorded input speech to the speech recognition device;
  
  wherein the speech recognition device is configured to:
  
  receive an input speech;
  
  divide the input into a detection speech at the beginning of the input speech and a to-be-recognized speech following the detection speech, wherein a length of speech data comprised in the detection speech is less than a length of speech data comprised in the to-be-recognized speech;
  
  select, after comparing the detection speech with a plurality of speech training samples under a plurality of different sample environments, a sample environment corresponding to a speech training sample among the plurality of speech training samples that has a minimum difference with the detection speech, as a detection environment type, wherein the sample environment comprises a quiet environment and a noise environment;
  
  detect a storage area in the speech recognition device;
  
  output, when a recognizable previous environment type exists in the storage area, a speech correction instruction according to a result of comparison between the detection environment type and the previous environment type, wherein the previous environment type is a quiet environment or a noise environment;
  
  control, according to the speech correction instruction, correction on the to-be-recognized speech, and output an initial recognition result;
  
  separately compare the received to-be-recognized speech with the plurality of speech training samples, and select a sample environment corresponding to a speech training sample among the plurality of speech training samples that has a minimum difference with the to-be-recognized speech, as a current environment type;
  
  store the current environment type to the storage area, and abandon the current environment type after preset duration; and
  
  output a final recognition result after a confidence value of the initial recognition result is adjusted according to the current environment type.
  - 12. The electronic device according to claim 11, wherein when the previous environment type is not recognized in the storage area, the speech recording device is further configured to:
    - acquire a pre-stored initial environment type, wherein the initial environment type comprises a quiet environment or a noise environment;
      
      determine according to the initial environment type and the detection environment type; and
      
      output the speech correction instruction.
  - 13. The electronic device according to claim 12, wherein the determine according to the initial environment type and the detection environment type, and output the speech correction instruction comprises:
    - determine whether the initial environment type is the same as the detection environment type;
      
      if the initial environment type is the same as the detection environment type, output a speech correction instruction used for speech quality enhancement when both the initial environment type and the detection environment type are noise environments, and output a speech correction instruction used for disabling noise reduction processing when both the initial environment type and the detection environment type are quiet environments; and
      
      if the initial environment type is not the same as the detection environment type, output a speech correction instruction used for speech quality enhancement when the initial environment type is a noise environment, and output a speech correction instruction used for disabling noise reduction processing when the initial environment type is a quiet environment.
  - 14. The electronic device according to claim 11, wherein the output, when a recognizable previous environment type exists in the storage area, a speech correction instruction according to a result of comparison between the detection environment type and the previous environment type comprises:
    - acquire the previous environment type and effective impact duration T of the previous environment type on the input speech;
      
      calculate a time difference t between time for inputting the detection speech and time for previously inputting a speech, and an impact value w(t) of the previous environment type on the detection environment type, wherein w(t) is a truncation function that decays with time t, a value of w(t) is obtained by training sample data of a speech training sample under a different sample environment, and values of t and T are positive integers;
      
      determine a balance relationship between the previous environment type and the detection environment type;
      
      output, when both the previous environment type and the detection environment type are noise environments, a speech correction instruction used for speech quality enhancement;
      
      output, when both the previous environment type and the detection environment type are quiet environments, a speech correction instruction used for disabling noise reduction processing;
      
      output, when the previous environment type is a noise environment, the detection environment type is a quiet environment, and w(t) >= 0.5, a speech correction instruction used for speech quality enhancement;

      output, when the previous environment type is a noise environment, the detection environment type is a quiet environment, and w(t) < 0.5, a speech correction instruction used for disabling noise reduction processing; and

      when the w(t) > T, output, when the detection environment type is a quiet environment, a speech correction instruction used for disabling noise reduction processing, and output, when the detection environment is a noise environment, a speech correction instruction used for speech quality enhancement.
  - 15. The electronic device according to claim 11, wherein the separately compare the received to-be-recognized speech with a speech training sample under a different sample environment, and select a sample environment corresponding to a speech training sample that has a minimum difference with the to-be-recognized speech, as a current environment type comprises:
    - analyze a speech frame part and a noise frame part of the received to-be-recognized speech to acquire a noise level, a speech level, and a signal-to-noise ratio (SNR) of the to-be-recognized speech;
      
      compare the noise level, the speech level, and the SNR of the to-be-recognized speech with a noise training level, a speech training level, and a training SNR of a speech training sample under a different sample environment, respectively; and
      
      determine that a sample environment corresponding to a noise training level that has a minimum difference with the noise level, a speech training level that has a minimum difference with the speech level, and a training SNR that has a minimum difference with the SNR is the current environment type.
  - 16. The electronic device according to claim 11, wherein the preset duration when the current environment type is a quiet environment is longer than the preset duration when the current environment type is a noise environment.
  - 17. The electronic device according to claim 11, wherein the noise environment comprises:
    - a vehicle-mounted low noise environment, a vehicle-mounted high noise environment, an ordinary roadside environment, a busy roadside environment, and a noisy environment.

Current Assignee: Huawei Technologies Co., Ltd. (Huawei Investment & Holding Co., Ltd.)
Original Assignee: Huawei Technologies Co., Ltd. (Huawei Investment & Holding Co., Ltd.)
Inventors: Zhou, Junyang
Primary Examiner(s): MCFADDEN, SUSAN IRIS
Application Number: US15/149,599
Publication Number: US 20160253995A1
Time in Patent Office: 617 Days
Field of Search

704233
US Class Current
CPC Class Codes
- G10L 15/20   Speech recognition techniqu...
- G10L 21/02   Speech enhancement, e.g. no...
- G10L 21/0216   characterised by the method...
- G10L 21/0224   Processing in the time domain
- H04R 3/00   Circuits for transducers , ...
