Robust speech recognition in the presence of echo and noise using multiple signals for discrimination

US 9,672,821 B2
Filed: 08/25/2015
Issued: 06/06/2017
Est. Priority Date: 06/05/2015
Status: Active Grant

First Claim

1. A speech recognition system (SRS) comprising:

a first input to receive a raw microphone signal having a user voice signal based on user speech during a period of time and an echo signal based on sound produced by a speaker during the period of time;

a second input to receive a plurality of types of echo information signals during the period of time, each type of echo information signal including information derived from the echo signal by an echo cancellation system; and

a trained speech recognition processor to recognize speech based on the raw microphone signal and the plurality of types of echo information signals, wherein the processor was trained at least by inputting a plurality of different samples of raw microphone signals and a plurality of different samples of each of a plurality of the types of echo information signals.

Abstract

Systems and methods for a speech recognition system having a speech processor that is trained to recognize speech by considering (1) a raw microphone signal that includes an echo signal and (2) different types of echo information signals from an echo cancellation system (and optionally different types of ambient noise suppression signals from a noise suppressor). The different types of echo information signals may include those used for echo cancellation and those having echo information. The speech recognition system may include spectral feature transformers that convert the raw microphone signal and the different types of echo information signals (and optional noise suppression signals) into spectral features in the form of feature vectors, and a concatenator that combines the feature vectors into a total vector (for a period of time) that is used to train the speech processor and, during use of the speech processor, to recognize speech.
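
The feature step described in the abstract can be illustrated with a minimal sketch. It assumes log-magnitude DFT bins as the spectral features (the abstract does not name a specific transform), and the signal names, frame length, and bin count below are hypothetical choices made only for illustration.

```python
import cmath
import math

def log_spectrum(frame, n_bins=8):
    """Log-magnitude of the first n_bins DFT bins of one frame (naive DFT)."""
    n = len(frame)
    feats = []
    for k in range(n_bins):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        feats.append(math.log(abs(acc) + 1e-8))  # small offset avoids log(0)
    return feats

def total_vector(signals, n_bins=8):
    """Concatenate the feature vectors of all signals for one period of time."""
    vec = []
    for name in sorted(signals):  # fixed order so training and use agree
        vec.extend(log_spectrum(signals[name], n_bins))
    return vec

# Hypothetical 32-sample frames that cover the same period of time.
n = 32
frame = {
    "raw_mic": [math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 5 * t / n)
                for t in range(n)],
    "reference": [math.sin(2 * math.pi * 5 * t / n) for t in range(n)],
    "linear_echo_estimate": [0.45 * math.sin(2 * math.pi * 5 * t / n) for t in range(n)],
}
vec = total_vector(frame)
print(len(vec))  # 3 signals x 8 bins = 24 features
```

Sorting the signal names is one simple way to keep the layout of the total vector identical between training and recognition; any fixed ordering would serve the same purpose.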

183 Citations

21 Claims

1. A speech recognition system (SRS) comprising:
- a first input to receive a raw microphone signal having a user voice signal based on user speech during a period of time and an echo signal based on sound produced by a speaker during the period of time;
  
  a second input to receive a plurality of types of echo information signals during the period of time, each type of echo information signal including information derived from the echo signal by an echo cancellation system; and
  
  a trained speech recognition processor to recognize speech based on the raw microphone signal and the plurality of types of echo information signals, wherein the processor was trained at least by inputting a plurality of different samples of raw microphone signals and a plurality of different samples of each of a plurality of the types of echo information signals.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
- - 2. The system of claim 1, wherein the sound produced by the speaker is produced by the speaker in response to the speaker receiving an electronic reference signal;
    - and wherein the plurality of types of echo information signals include the electronic reference signal, a linear echo estimate signal, a residual echo estimate signal, and an echo canceller output signal.
  - 3. The system of claim 2, further comprising the echo cancellation system comprising:
    - a microphone input to receive the raw microphone signal;
      
      an electronic reference signal input to receive the electronic reference signal;
      
      an echo processor to use the raw microphone signal and the electronic reference signal to produce the linear echo estimate signal, the residual echo estimate signal, and the echo canceller output signal; and
      
      an output to output the linear echo estimate signal, the residual echo estimate signal, and the echo canceller output signal, to the second input.
  - 4. The system of claim 1, wherein the speech recognition processor includes a deep neural network that was trained using:
    - a plurality of different ones of electronic reference signals, a plurality of different ones of linear echo estimate signals, a plurality of different ones of residual echo estimate signals, and a plurality of different ones of echo canceller output signals.
  - 5. The system of claim 1, wherein the speech recognition processor includes:
    - spectral feature transformers to transform the raw microphone signal and different types of echo information signals into spectral features in the form of feature vectors; and
      
      a concatenator to combine the feature vectors into a total vector that represents the raw microphone signal and different types of echo information signals for a period of time.
  - 6. The system of claim 5, wherein the spectral feature transformers and the concatenator are one of (1) used to train the speech recognition processor based on recorded audio information, or (2) used to recognize user speech while using the system.
  - 7. The system of claim 1, wherein the plurality of types of echo information signals include different types of ambient noise suppression signals from a noise suppressor.
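
Claims 2 and 3 describe an echo processor that takes the raw microphone signal and the electronic reference signal and produces a linear echo estimate, a residual echo estimate, and an echo canceller output. The sketch below is one possible illustration and not the patented implementation: it assumes a normalized LMS (NLMS) adaptive filter, which the claims do not name, and the residual echo estimate is a crude hypothetical placeholder (a fixed leakage fraction of the linear estimate). The echo path, tap count, and step size are also assumptions.

```python
import random

def nlms_echo_canceller(mic, ref, taps=4, mu=0.5, eps=1e-6, leak=0.1):
    """Return (linear_echo_estimate, residual_echo_estimate, canceller_output)."""
    w = [0.0] * taps      # adaptive filter that models the echo path
    hist = [0.0] * taps   # most recent reference samples, newest first
    linear, residual, output = [], [], []
    for d, x in zip(mic, ref):
        hist = [x] + hist[:-1]
        y = sum(wi * hi for wi, hi in zip(w, hist))  # linear echo estimate
        e = d - y                                    # echo canceller output
        norm = sum(h * h for h in hist) + eps
        w = [wi + mu * e * hi / norm for wi, hi in zip(w, hist)]  # NLMS update
        linear.append(y)
        residual.append(leak * abs(y))  # hypothetical stand-in, not from the patent
        output.append(e)
    return linear, residual, output

random.seed(0)
ref = [random.uniform(-1.0, 1.0) for _ in range(2000)]  # electronic reference signal
echo_path = [0.6, -0.3, 0.1]                            # hypothetical room response
mic = [sum(echo_path[k] * ref[t - k] for k in range(len(echo_path)) if t - k >= 0)
       for t in range(len(ref))]                        # echo only, no user speech
linear, residual, out = nlms_echo_canceller(mic, ref)
print(max(abs(v) for v in out[-200:]))                  # remaining echo after adaptation
```

The demo omits user speech so that convergence is easy to observe. With near-end speech present, the canceller output would carry the voice plus whatever echo the linear filter fails to remove, which is why the claims pass the other signals to the recognizer as well.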

8. A method of speech recognition comprising:
- receiving at a first input, a raw microphone signal having a user voice signal based on user speech during a period of time and an echo signal based on sound produced by a speaker during the period of time;
  
  receiving at a second input, a plurality of types of echo information signals during the period of time, each type of echo information signal including information derived by an echo cancellation system from the echo signal; and
  
  recognizing speech at a trained speech recognition processor, based on the raw microphone signal and the plurality of types of echo information signals, wherein the processor was trained at least by inputting a plurality of different samples of raw microphone signals and a plurality of different samples of each of a plurality of the types of echo information signals.
- View Dependent Claims (9, 10, 11, 12, 13, 14)
- - 9. The method of claim 8, further comprising:
    - the speaker receiving an electronic reference signal; and
      
      the speaker producing the sound produced by the speaker in response to the speaker receiving an electronic reference signal, wherein the plurality of types of echo information signals include the electronic reference signal, a linear echo estimate signal, a residual echo estimate signal, and an echo canceller output signal, and wherein the plurality of types of echo information signals are derived by the echo cancellation system from the echo signal, the raw microphone signal, and the reference signal.
  - 10. The method of claim 9, further comprising:
    - receiving the raw microphone signal at a microphone input of the echo cancelation system;
      
      receiving the electronic reference signal at an electronic reference signal input of the echo cancelation system;
      
      an echo processor of the echo cancellation system using the raw microphone signal and the electronic reference signal to produce the linear echo estimate signal, the residual echo estimate signal, and the echo canceller output signal; and
      
      an output outputting the linear echo estimate signal, the residual echo estimate signal, and the echo canceller output signal to the second input.
  - 11. The method of claim 9, wherein the speech recognition processor includes a deep neural network that was trained using:
    - a plurality of different ones of electronic reference signals, a plurality of different ones of linear echo estimate signals, a plurality of different ones of residual echo estimate signals, and a plurality of different ones of echo canceller output signals.
  - 12. The method of claim 8, wherein recognizing speech at the trained speech recognition processor includes:
    - spectral feature transformers transforming the raw microphone signal and different types of echo information signals into spectral features in the form of feature vectors; and
      
      a concatenator combining the feature vectors into a total vector that represents the raw microphone signal and different types of echo information signals for a period of time.
  - 13. The method of claim 12, wherein the spectral feature transformers and the concatenator are used to recognize user speech while using the system.
  - 14. The method of claim 8, further comprising:
    - a noise suppressor producing different types of ambient noise suppression signals from the raw microphone signal, wherein the plurality of types of echo information signals include the different types of ambient noise suppression signals.
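
Claims 7 and 14 add a noise suppressor that derives several types of ambient noise suppression signals from the raw microphone signal. The claims do not list those types, so the sketch below is only an assumption: it tracks a noise floor on frame energies and emits three hypothetical signals per frame (a noise estimate, a suppression gain, and the suppressed energy) using a simple subtraction-style gain rule.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def noise_suppression_signals(frames, rise=1.05, floor=0.1):
    """Per-frame noise estimate, gain, and suppressed energy (all hypothetical)."""
    energies = [frame_energy(f) for f in frames]
    noise = energies[0]  # assumption: the first frame contains noise only
    signals = []
    for e in energies:
        # Follow drops in energy at once, but let the floor rise only slowly.
        noise = min(e, noise * rise)
        gain = max(floor, 1.0 - noise / e) if e > 0 else floor
        signals.append({"noise_estimate": noise,
                        "gain": gain,
                        "suppressed_energy": gain * e})
    return signals

# Hypothetical raw microphone frames: quiet, then two loud frames, then quiet.
frames = [[1.0, -1.0]] * 3 + [[3.0, -3.0]] * 2 + [[1.0, -1.0]]
signals = noise_suppression_signals(frames)
for s in signals:
    print(round(s["noise_estimate"], 3), round(s["gain"], 3))
```

Each of these per-frame signals could be turned into a feature vector and appended to the total vector alongside the echo information signals.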

15. A method of training a speech recognition system (SRS) comprising:
- receiving at a first input, a plurality of raw microphone signals having a user voice signal based on a sample of user speech during a period of time and an echo signal based on a sample of sound produced by a speaker during the period of time;
  
  receiving at a second input, a plurality of types of echo information signals during the period of time, each type of echo information signal including information derived by an echo cancellation system from the echo signal; and
  
  training a speech recognition processor to recognize speech based on the raw microphone signals and the plurality of types of echo information signals.
- View Dependent Claims (16, 17, 18, 19, 20, 21)
- - 16. The method of claim 15, further comprising:
    - the speaker receiving a plurality of electronic reference signals; and
      
      the speaker producing the sound in response to the speaker receiving the electronic reference signals, wherein the plurality of types of echo information signals include the plurality of electronic reference signals, a plurality of linear echo estimate signals, a plurality of residual echo estimate signals, and a plurality of echo canceller output signals, and wherein the plurality of types of echo information signals are derived by the echo cancellation system from the echo signal in the plurality of raw microphone signals, and the plurality of reference signals.
  - 17. The method of claim 16, further comprising:
    - receiving the raw microphone signals at a microphone input of the echo cancelation system;
      
      receiving the electronic reference signals at an electronic reference signal input of the echo cancelation system;
      
      an echo processor of the echo cancellation system using the raw microphone signals and the electronic reference signals to produce the linear echo estimate signals, the residual echo estimate signals, and the echo canceller output signals; and
      
      an output outputting the linear echo estimate signals, the residual echo estimate signals, and the echo canceller output signals to the second input.
  - 18. The method of claim 16, wherein the speech recognition processor includes a deep neural network (DNN), and further comprising training the DNN using:
    - the plurality of electronic reference signals, the plurality of linear echo estimate signals, the plurality of residual echo estimate signals, and the plurality of echo canceller output signals.
  - 19. The method of claim 15, wherein training the speech recognition processor includes:
    - spectral feature transformers transforming the raw microphone signals and the plurality of types of echo information signals into spectral features in the form of feature vectors; and
      
      a concatenator combining the feature vectors into a total vector that represents the raw microphone signals and the plurality of types of echo information signals for a period of time.
  - 20. The method of claim 19, wherein the spectral feature transformers and the concatenator are used to train the speech recognition processor based on recorded audio information.
  - 21. The method of claim 15, further comprising:
    - a noise suppressor producing a plurality of different types of ambient noise suppression signals from the raw microphone signals, wherein the plurality of types of echo information signals include the plurality of different types of ambient noise suppression signals.
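
The training method of claims 15 to 21 can be illustrated with a toy example. This sketch swaps the deep neural network of claim 18 for a single logistic unit so that it stays short, and it uses synthetic two-element total vectors (a raw microphone level and a linear echo estimate level) in place of real spectral features. All data, names, and hyperparameters are hypothetical.

```python
import math
import random

def train_logistic(vectors, labels, epochs=200, lr=0.5):
    """Stochastic gradient descent on a single logistic unit (stand-in for a DNN)."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

random.seed(1)
vectors, labels = [], []
for _ in range(200):
    echo = random.uniform(0.0, 1.0)                   # echo level at the microphone
    speaking = random.random() < 0.5
    voice = random.uniform(0.5, 1.0) if speaking else 0.0
    vectors.append([voice + echo, echo])              # [raw mic level, echo estimate level]
    labels.append(1 if speaking else 0)

w, b = train_logistic(vectors, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(vectors, labels)) / len(labels)
print(w, b, accuracy)
```

In this toy data the raw microphone level alone cannot separate the two classes, because a loud echo can look like speech. The echo estimate supplies the missing context, so the learned weight on the microphone level comes out positive and the weight on the echo level negative.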

Current Assignee
Apple Inc.
Original Assignee
Apple Inc.
Inventors
Krishnaswamy, Arvindh; Clark, Charles P.; Malik, Sarmad
Primary Examiner(s)
COLUCCI, MICHAEL C

Application Number

US14/835,588
Publication Number

US 20160358602A1
Time in Patent Office

651 Days
Field of Search

379/406.01, 379/406.05, 379/406.13, 379/406.14, 381/61, 381/71.11, 600/446, 704/200, 704/202, 704/226, 704/233
US Class Current
CPC Class Codes

G10L 15/16   using artificial neural net...

G10L 15/20   Speech recognition techniqu...

G10L 2021/02082   the noise being echo, rever...
