Mitigating effects of electronic audio sources in expression detection
First Claim
1. A system comprising:
a microphone array configured to produce microphone audio signals;
an audio beamformer configured to process the microphone audio signals to produce directional audio signals, wherein a first directional audio signal of the directional audio signals corresponds to a first direction with respect to the microphone array and wherein a second directional audio signal of the directional audio signals corresponds to a second direction with respect to the microphone array, wherein the first directional audio signal and the second directional audio signal emphasize sound from the first direction and the second direction, respectively;
a speech activity detector configured to analyze one or more frequency characteristics of the first directional audio signal and the second directional audio signal to determine a first level of speech presence and a second level of speech presence occurring in the first direction and the second direction, respectively, over time;
a source detector configured to analyze the first level of speech presence and the second level of speech presence occurring over a past time period to determine that an electronic source of sound is located in the first direction or the second direction; and
an expression detector configured to perform actions comprising:
identifying the first direction where a first occurring level of speech presence is a highest level of speech presence;
determining that the first direction corresponds to a direction in which the electronic source of sound is located;
identifying the second direction where a second occurring level of speech presence is a second highest level of speech presence;
analyzing the first directional audio signal corresponding to the first direction to produce a first score indicating a first likelihood that a trigger expression is represented in the first directional audio signal;
analyzing the second directional audio signal corresponding to the second direction to produce a second score indicating a second likelihood that the trigger expression is represented in the second directional audio signal;
comparing the first score to a first threshold;
comparing the second score to a second threshold, wherein the second threshold is less than the first threshold;
determining that (i) the first score is greater than the first threshold or (ii) the second score is greater than the second threshold;
concluding that the trigger expression has been uttered; and
performing speech recognition on subsequent speech, based at least in part on the trigger expression.
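For illustration only, the dual-threshold decision recited in the expression detector can be sketched as follows; the function name and the specific 0.8/0.6 threshold values are assumptions, not taken from the patent, which requires only that the second threshold be less than the first:

```python
# Sketch of the claimed dual-threshold trigger decision.
# The 0.8 / 0.6 thresholds are illustrative assumptions.

def trigger_detected(first_score: float, second_score: float,
                     first_threshold: float = 0.8,
                     second_threshold: float = 0.6) -> bool:
    """Return True when the trigger expression is deemed uttered.

    The first direction (strongest speech presence, coinciding with
    the electronic source of sound) is held to the stricter first
    threshold; the second direction uses the lower second threshold.
    """
    return first_score > first_threshold or second_score > second_threshold

# A score of 0.7 in the electronic source's direction fails the strict
# 0.8 threshold, but 0.65 in the second direction clears 0.6.
print(trigger_detected(first_score=0.7, second_score=0.65))  # True
```

The effect is that sound coming from the direction of a television must match the trigger expression much more convincingly than sound coming from any other direction.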
Abstract
In a speech-based system, a wake word or other trigger expression is used to preface user speech that is intended as a command. The system receives multiple directional audio signals, each of which emphasizes sound from a different direction. The signals are monitored and analyzed to detect the directions of interfering audio sources such as televisions or other types of electronic audio players. One of the directional signals having the strongest presence of speech is selected to be monitored for the trigger expression. If the directional signal corresponds to the direction of an interfering audio source, a more strict standard is used to detect the trigger expression. In addition, the directional audio signal having the second strongest presence of speech may also be monitored to detect the trigger expression.
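As a rough illustration of how the multiple directional audio signals might be produced, here is a minimal delay-and-sum beamformer; the patent does not specify a beamforming method, so this is only one plausible sketch with assumed integer sample delays:

```python
import numpy as np

def delay_and_sum(mic_signals: np.ndarray, delays: list[int]) -> np.ndarray:
    """Combine microphone signals into one directional signal.

    mic_signals: array of shape (num_mics, num_samples)
    delays: per-microphone integer sample delays that align sound
            arriving from the look direction
    """
    num_mics = mic_signals.shape[0]
    out = np.zeros(mic_signals.shape[1])
    for m in range(num_mics):
        # Advance each signal so the look-direction wavefront lines up.
        out += np.roll(mic_signals[m], -delays[m])
    return out / num_mics
```

Sound from the look direction adds coherently, while sound from other directions is misaligned and partially cancels, which is what lets each directional signal emphasize sound from a different direction.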
23 Claims
1. A system comprising:
a microphone array configured to produce microphone audio signals;
an audio beamformer configured to process the microphone audio signals to produce directional audio signals, wherein a first directional audio signal of the directional audio signals corresponds to a first direction with respect to the microphone array and wherein a second directional audio signal of the directional audio signals corresponds to a second direction with respect to the microphone array, wherein the first directional audio signal and the second directional audio signal emphasize sound from the first direction and the second direction, respectively;
a speech activity detector configured to analyze one or more frequency characteristics of the first directional audio signal and the second directional audio signal to determine a first level of speech presence and a second level of speech presence occurring in the first direction and the second direction, respectively, over time;
a source detector configured to analyze the first level of speech presence and the second level of speech presence occurring over a past time period to determine that an electronic source of sound is located in the first direction or the second direction; and
an expression detector configured to perform actions comprising:
identifying the first direction where a first occurring level of speech presence is a highest level of speech presence;
determining that the first direction corresponds to a direction in which the electronic source of sound is located;
identifying the second direction where a second occurring level of speech presence is a second highest level of speech presence;
analyzing the first directional audio signal corresponding to the first direction to produce a first score indicating a first likelihood that a trigger expression is represented in the first directional audio signal;
analyzing the second directional audio signal corresponding to the second direction to produce a second score indicating a second likelihood that the trigger expression is represented in the second directional audio signal;
comparing the first score to a first threshold;
comparing the second score to a second threshold, wherein the second threshold is less than the first threshold;
determining that (i) the first score is greater than the first threshold or (ii) the second score is greater than the second threshold;
concluding that the trigger expression has been uttered; and
performing speech recognition on subsequent speech, based at least in part on the trigger expression.
- View Dependent Claims (2, 3, 4, 5)
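One hypothetical way the claimed speech activity detector could score speech presence from frequency characteristics is a voice-band energy ratio; the band limits and scoring below are illustrative assumptions, as the patent does not disclose a particular frequency analysis:

```python
import numpy as np

def speech_presence(frame: np.ndarray, sample_rate: int = 16000) -> float:
    """Score speech presence as the fraction of spectral energy in the
    roughly 300-3400 Hz voice band (an illustrative heuristic)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    voice_band = (freqs >= 300.0) & (freqs <= 3400.0)
    total = spectrum.sum()
    return float(spectrum[voice_band].sum() / total) if total > 0 else 0.0
```

A source detector along the lines of the claim could then flag a direction whose speech-presence score stays high over a past time period, on the theory that a television emits near-continuous speech while a person speaks in bursts with pauses.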
6. A processor-implemented method, comprising:
receiving, from one or more microphones, a plurality of audio signals, wherein the plurality of audio signals are processed by an audio beamformer to produce directional audio signals, wherein a first audio signal of the directional audio signals corresponds to a first direction with respect to the one or more microphones and wherein a second audio signal of the directional audio signals corresponds to a second direction with respect to the one or more microphones, wherein the first audio signal and the second audio signal emphasize sound from the first direction and the second direction, respectively;
identifying the first direction as an identified direction in which a non-human sound source is located;
analyzing the first audio signal to identify a representation of speech;
determining that the first audio signal corresponds to the identified direction in which the non-human sound source is located;
selecting a first standard to analyze the first audio signal based at least in part on the first audio signal corresponding to the identified direction of the non-human sound source;
analyzing the first audio signal using the first standard to detect an utterance of a trigger expression;
analyzing the second audio signal using a second standard to detect the utterance of the trigger expression, wherein the first standard includes (i) a first threshold that is greater than a second threshold associated with the second standard or (ii) a first detection algorithm that is different than a second detection algorithm associated with the second standard; and
receiving, from the one or more microphones, a third audio signal including subsequent speech for performing subsequent speech recognition, based at least in part on the utterance of the trigger expression.
- View Dependent Claims (7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 23)
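The standard selection recited in claim 6 might look like the following sketch; the dictionary representation, the detector names, and the threshold values are hypothetical, chosen only to show a first standard that is stricter than the second:

```python
# Hypothetical detection standards for the two cases in claim 6: a
# stricter standard (higher threshold, or a different detection
# algorithm) for a beam aimed at the non-human sound source, and a
# default standard for every other direction.
STRICT_STANDARD = {"threshold": 0.8, "detector": "full_model"}
DEFAULT_STANDARD = {"threshold": 0.6, "detector": "light_model"}

def select_standard(beam_direction: int, non_human_direction: int) -> dict:
    """Pick the stricter first standard when the audio signal's beam
    corresponds to the identified non-human sound source direction."""
    if beam_direction == non_human_direction:
        return STRICT_STANDARD
    return DEFAULT_STANDARD
```

Either difference between the standards (threshold or algorithm) satisfies the claim language; this sketch shows both at once for concreteness.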
17. A processor-implemented method, comprising:
receiving, from one or more microphones, a plurality of audio signals, wherein the plurality of audio signals are processed by an audio beamformer to produce directional audio signals, wherein a first audio signal of the directional audio signals corresponds to a first area of an environment with respect to the one or more microphones and wherein a second audio signal of the directional audio signals corresponds to a second area of the environment with respect to the one or more microphones, wherein the first audio signal and the second audio signal emphasize sound from the first area of the environment and the second area of the environment, respectively;
determining that the first audio signal represents sound generated by a non-human sound source;
selecting a first standard to analyze the first audio signal based at least in part on the first audio signal representing the sound generated by the non-human sound source;
analyzing the first audio signal using the first standard to detect an utterance of a trigger expression;
analyzing the second audio signal using a second standard to detect the utterance of the trigger expression, wherein the first standard includes (i) a first threshold that is greater than a second threshold associated with the second standard or (ii) a first detection algorithm that is different than a second detection algorithm associated with the second standard; and
receiving, from the one or more microphones, subsequent speech for performing subsequent speech recognition, based at least in part on the utterance of the trigger expression.
- View Dependent Claims (18, 19, 20, 21, 22)
Specification