Methods and devices for ignoring similar audio being received by a system

US 9,728,188 B1
Filed: 06/28/2016
Issued: 08/08/2017
Est. Priority Date: 06/28/2016
Status: Active Grant

Abstract

Systems and methods for detecting similar audio being received by separate voice activated electronic devices, and ignoring those commands, are described herein. In some embodiments, a voice activated electronic device may be activated by a wakeword that is output by an additional electronic device, such as a television or radio, may capture audio of the sound following the wakeword, and may send audio data representing the sound to a backend system. Upon receipt, the backend system may, in parallel with performing automated speech recognition processing on the audio data, generate a sound profile of the audio data, and may compare that sound profile to sound profiles of recently received audio data and/or flagged sound profiles. If the generated sound profile is determined to match another sound profile, then the automated speech recognition processing may be stopped, and the voice activated electronic device may be instructed to return to a keyword spotting mode. If the matching sound profile is not already stored in a database of known sound profiles, it can be stored for future comparisons.
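
The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `ProfileStore`, `handle_profile`, and the `is_match` callback are hypothetical names, profiles are plain strings for brevity, and the returned strings stand in for stopping speech recognition and instructing the device to return to keyword spotting mode.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProfileStore:
    """Hypothetical in-memory store; the abstract only mentions a database."""
    recent: List[str] = field(default_factory=list)
    flagged: List[str] = field(default_factory=list)


def handle_profile(profile: str, store: ProfileStore,
                   is_match: Callable[[str, str], bool]) -> str:
    """Return "ignore" when the profile matches a flagged or recent profile,
    otherwise remember it and return "continue"."""
    # A match against an already flagged profile is ignored right away.
    if any(is_match(profile, known) for known in store.flagged):
        return "ignore"
    # A match against recently received audio flags that profile for later.
    for other in store.recent:
        if is_match(profile, other):
            store.flagged.append(other)
            return "ignore"
    # No match: keep the profile for comparison with later requests.
    store.recent.append(profile)
    return "continue"


store = ProfileStore()
same = lambda a, b: a == b  # stand-in for a real similarity test
print(handle_profile("ad-jingle", store, same))  # continue
print(handle_profile("ad-jingle", store, same))  # ignore
print(store.flagged)                             # ['ad-jingle']
```

In this sketch the first request is processed normally because nothing is known about it yet; only the second, matching request triggers the ignore path and the flagging.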

20 Claims

1. A method, comprising:
- receiving, at a backend system, first audio data;
  
  receiving a first timestamp indicating a first time that the first audio data was sent to the backend system by a first user device;
  
  receiving, at the backend system, second audio data;
  
  receiving a second timestamp indicating a second time that the second audio data was sent to the backend system by a second user device;
  
  determining that an amount of time between the first time and the second time is less than a predetermined period of time, which indicates that the first audio data and the second audio data were sent at a substantially same time;
  
  generating a first audio fingerprint of the first audio data by performing a first fast Fourier transform (“FFT”) on the first audio data, the first audio fingerprint comprising first data representing a first time-frequency profile of the first audio data;
  
  generating a second audio fingerprint of the second audio data by performing a second FFT on the second audio data, the second audio fingerprint comprising second data representing a second time-frequency profile of the second audio data;
  
  determining a bit error rate between the first audio fingerprint and the second audio fingerprint by determining a number of different bits between the first audio fingerprint and the second audio fingerprint, and then dividing the number by a total number of bits;
  
  determining that the bit error rate is less than a predefined bit error rate threshold value indicating that the first audio data and the second audio data both represent a same sound; and
  
  storing the first audio fingerprint as a flagged audio fingerprint in memory on the backend system such that receipt of additional audio data that has a matching audio fingerprint is ignored by the backend system.
- - 2. The method of claim 1, further comprising:
    - receiving, at the backend system, third audio data;
      
      generating a third audio fingerprint of the third audio data by performing a third FFT on the third audio data, the third audio fingerprint comprising third data representing a third time-frequency profile of the third audio data;
      
      determining an additional bit error rate between the third audio fingerprint and the flagged audio fingerprint;
      
      determining that the additional bit error rate is less than the predefined bit error rate threshold value indicating that the third audio data also represents the same sound; and
      
      causing the backend system to ignore the third audio data such that a response is not generated to respond to the third audio data.
  - 3. The method of claim 1, further comprising:
    - receiving, at the backend system, third audio data;
      
      generating a third audio fingerprint of the third audio data by performing a third FFT on the third audio data, the third audio fingerprint comprising third data representing a third time-frequency profile of the third audio data;
      
      determining a new bit error rate between the third audio fingerprint and the flagged audio fingerprint;
      
      determining that the new bit error rate is greater than the predefined bit error rate threshold value indicating that third audio data does not represent the same sound; and
      
      generating text data representing the third audio data by executing speech-to-text functionality on the third audio data.
  - 4. The method of claim 1, further comprising:
    - determining a first user identifier associated with the first user device;
      
      determining a second user identifier associated with the second user device;
      
      determining that the first user identifier is different than the second user identifier;
      
      generating a first instruction for the first user device that causes the first user device to return to a keyword spotting mode where the first user device will monitor sound signals received by a microphone for a subsequent utterance of a wakeword by continuously running the sound signals through a wakeword engine;
      
      generating a second instruction for the second user device that causes the second user device to return to the keyword spotting mode;
      
      sending the first instruction to the first user device; and
      
      sending the second instruction to the second user device.
  - 5. The method of claim 1, further comprising:
    - causing automated speech recognition processing to stop being performed to the first audio data; and
      
      causing the automated speech recognition processing to stop being performed to the second audio data.
  - 6. The method of claim 1, further comprising:
    - receiving, at the backend system, third audio data;
      
      receiving a third timestamp indicating a third time that the third audio data was sent to the backend system by a third user device;
      
      determining that an additional amount of time between the first time and the third time is greater than the predetermined period of time, which indicates that the first audio data and the third audio data were sent at a different time;
      
      generating a third audio fingerprint of the third audio data by performing a third FFT on the third audio data, the third audio fingerprint comprising third data representing a third time-frequency profile of the third audio data;
      
      determining a new bit error rate between the flagged audio fingerprint and the third audio fingerprint;
      
      determining that the new bit error rate is greater than the predefined bit error rate threshold value indicating that third audio data does not represent the same sound;
      
      receiving a first plurality of audio fingerprints corresponding to a second plurality of audio data that were received during the additional amount of time;
      
      determining a third plurality of bit error rates between the third audio fingerprint and each of the first plurality of audio fingerprints;
      
      determining that each of the third plurality of bit error rates are greater than the predefined bit error rate threshold value, indicating that each of the second plurality of audio data represent a different sound than the third audio data; and
      
      causing automated speech recognition processing to continue to be performed to the third audio data.
  - 7. The method of claim 6, further comprising:
    - determining a new amount of time between the third time and a fourth time, the fourth time corresponding to a fourth audio fingerprint of fourth audio data received prior to the first audio data, the second audio data, and the third audio data;
      
      determining that the new amount of time is greater than the amount of time;
      
      determining that the new amount of time is greater than the additional amount of time;
      
      determining that the fourth audio fingerprint correspond to an oldest audio fingerprint of the plurality of audio fingerprints;
      
      causing the fourth audio fingerprint to be deleted;
      
      determining an updated first plurality of audio fingerprints comprising the first plurality of audio fingerprints minus the fourth audio fingerprint; and
      
      generating a fourth plurality of audio fingerprints comprising the updated first plurality of audio fingerprints and the third audio fingerprint.
  - 8. The method of claim 1, further comprising:
    - receiving a third audio fingerprint of third audio data, wherein the first audio fingerprint is generated at a first speech processing component, and the third audio fingerprint is generated at a second speech processing component;
      
      causing the third audio fingerprint to be stored in the memory;
      
      determining an additional bit error rate between first audio fingerprint and the third audio fingerprint;
      
      determining that the additional bit error rate is less than the predefined bit error rate threshold value; and
      
      causing automated speech recognition processing to stop being performed to the third audio data.
  - 9. The method of claim 1, further comprising:
    - receiving, at the backend system, third audio data;
      
      generating a third audio fingerprint of the third audio data by performing a third FFT on the third audio data, the third audio fingerprint comprising third data representing a third time-frequency profile of the third audio data;
      
      determining an additional bit error rate between a first portion of the flagged audio fingerprint and a second portion of the third audio fingerprint;
      
      determining that the additional bit error rate is less than the predefined bit error rate threshold value; and
      
      causing automated speech recognition processing to stop being performed on the third audio data.
  - 10. The method of claim 1, further comprising:
    - receiving, at the backend system, third audio data;
      
      generating a third audio fingerprint of the third audio data by performing a third FFT on the third audio data, the third audio fingerprint comprising third data representing a third time-frequency profile of the third audio data;
      
      determining an additional bit error rate between the third audio fingerprint and the flagged audio fingerprint;
      
      determining that the additional bit error rate is less than the predefined bit error rate threshold value indicating that the third audio data also represents the same sound; and
      
      causing the third audio data to be deleted.
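
The fingerprint comparison recited in claim 1 can be illustrated with a short sketch. The claims specify only an FFT-based time-frequency profile and a bit error rate computed as differing bits divided by total bits. Everything else here is an assumption made for illustration: the scheme that derives bits from the sign of band-energy differences, the naive DFT that stands in for an FFT so the example needs no third-party library, and the names `band_energies`, `fingerprint`, `bit_error_rate`, `frame_len`, and `n_bands`.

```python
import cmath
import math


def band_energies(frame, n_bands):
    """Power spectrum of one frame (naive DFT), summed into equal-width bands."""
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        power.append(abs(acc) ** 2)
    width = half // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]


def fingerprint(samples, frame_len=64, n_bands=9):
    """Return a list of 0/1 bits forming a coarse time-frequency profile.

    Assumed scheme: one bit per adjacent band pair and adjacent frame pair,
    set when the band-energy difference grows from one frame to the next.
    """
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    energies = [band_energies(samples[i:i + frame_len], n_bands) for i in starts]
    bits = []
    for prev, cur in zip(energies, energies[1:]):
        for b in range(n_bands - 1):
            diff = (cur[b] - cur[b + 1]) - (prev[b] - prev[b + 1])
            bits.append(1 if diff > 0 else 0)
    return bits


def bit_error_rate(fp_a, fp_b):
    """Number of differing bits divided by the total number of bits."""
    if len(fp_a) != len(fp_b) or not fp_a:
        raise ValueError("fingerprints must be non-empty and of equal length")
    differing = sum(a != b for a, b in zip(fp_a, fp_b))
    return differing / len(fp_a)


# A synthetic chirp stands in for captured audio.
signal = [math.sin(0.001 * t * t) for t in range(1024)]
fp = fingerprint(signal)
print(len(fp))                 # 120 bits: 15 frame pairs x 8 band pairs
print(bit_error_rate(fp, fp))  # 0.0 for identical fingerprints
```

A backend following the claims would compare the resulting rate with its predefined threshold. The claims do not give a value for that threshold, so none is assumed here.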

11. A backend system, comprising:
- memory;
  
  communications circuitry; and
  
  at least one processor operable to:
  
  receive first audio data;
  
  receive a first timestamp indicating a first time that the first audio data was sent to the backend system by a first user device;
  
  receive second audio data;
  
  receive a second time stamp indicating a second time that the second audio data was sent to the backend system by a second user device;
  
  determine that an amount of time between the first time and the second time is less than a predetermined period of time, which indicates that the first audio data and the second audio data were sent at a substantially same time;
  
  generate a first audio fingerprint of the first audio data by performing a first fast Fourier transform (“FFT”) on the first audio data, the first audio fingerprint comprising first data representing a first time-frequency profile of the first audio data;
  
  generate a second audio fingerprint of the second audio data by performing a second FFT on the second audio data, the second audio fingerprint comprising second data representing a second time-frequency profile of the second audio data;
  
  determine a bit error rate between the first audio fingerprint and the second audio fingerprint by determining a number of different bits between the first audio fingerprint and the second audio fingerprint, and then dividing the number by a total number of bits;
  
  determine that the bit error rate is less than a predefined bit error rate threshold value indicating that the first audio data and the second audio data both represent a same sound; and
  
  store the first audio fingerprint as a flagged audio fingerprint in the memory such that receipt of additional audio data that has a matching audio fingerprint is ignored.
- - 12. The backend system of claim 11, wherein the at least one processor is further operable to:
    - receive third audio data;
      
      generate a third audio fingerprint of the third audio data by performing a third FFT on the third audio data, the third audio fingerprint comprising third data representing a third time-frequency profile of the third audio data;
      
      determine an additional bit error rate between the third audio fingerprint and the flagged audio fingerprint;
      
      determine that the additional bit error rate is less than the predefined bit error rate threshold value indicating that the third audio data also represents the same sound; and
      
      cause the third audio data to be ignored such that a response is not generated to respond to the third audio data.
  - 13. The backend system of claim 11, wherein the at least one processor is further operable to:
    - receive third audio data;
      
      generate a third audio fingerprint of the third audio data by performing a third FFT on the third audio data, the third audio fingerprint comprising third data representing a third time-frequency profile of the third audio data;
      
      determine a new bit error rate between the third audio fingerprint and the flagged audio fingerprint;
      
      determine that the new bit error rate is greater than the predefined bit error rate threshold value indicating that third audio data does not represent the same sound; and
      
      generate text data representing the third audio data by executing speech-to-text functionality on the third audio data.
  - 14. The backend system of claim 11, wherein the at least one processor is further operable to:
    - determine a first user identifier associated with the first user device;
      
      determine a second user identifier associated with the second user device;
      
      determine that the first user identifier is different than the second user identifier;
      
      generate a first instruction for the first user device that causes the first user device to return to a keyword spotting mode where the first user device will monitor sound signals received by a microphone for a subsequent utterance of a wakeword by continuously running the sound signals through a wakeword engine;
      
      generate a second instruction for the second user device that causes the second user device to return to the keyword spotting mode;
      
      send the first instruction to the first user device; and
      
      send the second instruction to the second user device.
  - 15. The backend system of claim 11, wherein the at least one processor is further operable to:
    - cause automated speech recognition processing to stop being performed to the first audio data; and
      
      cause the automated speech recognition processing to stop being performed to the second audio data.
  - 16. The backend system of claim 11, wherein the at least one processor is further operable to:
    - receive third audio data;
      
      receive a third timestamp indicating a third time that the third audio data was sent to the backend system by a third user device;
      
      determine that an additional amount of time between the first time and the third time is greater than the predetermined period of time, which indicates that the first audio data and the third audio data were sent at a different time;
      
      generate a third audio fingerprint of the third audio data by performing a third FFT on the third audio data, the third audio fingerprint comprising third data representing a third time-frequency profile of the third audio data;
      
      determine a new bit error rate between the flagged audio fingerprint and the third audio fingerprint;
      
      determine that the new bit error rate is greater than the predefined bit error rate threshold value indicating that third audio data does not represent the same sound;
      
      receive a first plurality of audio fingerprints corresponding to a second plurality of audio data that were received during the additional amount of time;
      
      determine a third plurality of bit error rates between the third audio fingerprint and each of the first plurality of audio fingerprints;
      
      determine that each of the third plurality of bit error rates are greater than the predefined bit error rate threshold value, indicating that each of the second plurality of audio data represent a different sound than the third audio data; and
      
      cause automated speech recognition processing to continue to be performed to the third audio data.
  - 17. The backend system of claim 16, wherein the at least one processor is further operable to:
    - determine a new amount of time between the third time and a fourth time, the fourth time corresponding to a fourth audio fingerprint of fourth audio data received prior to the first audio data, the second audio data, and the third audio data;
      
      determine that the new amount of time is greater than the amount of time;
      
      determine that the new amount of time is greater than the additional amount of time;
      
      determine that the fourth audio fingerprint correspond to an oldest audio fingerprint of the plurality of audio fingerprints;
      
      cause the fourth audio fingerprint to be deleted;
      
      determine an updated first plurality of audio fingerprints comprising the first plurality of audio fingerprints minus the fourth audio fingerprint; and
      
      generate a fourth plurality of audio fingerprints comprising the updated first plurality of audio fingerprints and the third audio fingerprint.
  - 18. The backend system of claim 11, wherein the at least one processor is further operable to:
    - receive a third audio fingerprint of third audio data, wherein the first audio fingerprint is generated at a first speech processing component, and the third audio fingerprint is generated at a second speech processing component;
      
      cause the third audio fingerprint to be stored in the memory;
      
      determine an additional bit error rate between first audio fingerprint and the third audio fingerprint;
      
      determine that the additional bit error rate is less than the predefined bit error rate threshold value; and
      
      cause automated speech recognition processing to stop being performed to the third audio data.
  - 19. The backend system of claim 11, wherein the at least one processor is further operable to:
    - receive third audio data;
      
      generate a third audio fingerprint of the third audio data by performing a third FFT on the third audio data, the third audio fingerprint comprising third data representing a third time-frequency profile of the third audio data;
      
      determine an additional bit error rate between a first portion of the flagged audio fingerprint and a second portion of the third audio fingerprint;
      
      determine that the additional bit error rate is less than the predefined bit error rate threshold value; and
      
      cause automated speech recognition processing to stop being performed on the third audio data.
  - 20. The backend system of claim 11, wherein the at least one processor is further operable to:
    - receive third audio data;
      
      generate a third audio fingerprint of the third audio data by performing a third FFT on the third audio data, the third audio fingerprint comprising third data representing a third time-frequency profile of the third audio data;
      
      determine an additional bit error rate between the third audio fingerprint and the flagged audio fingerprint;
      
      determine that the additional bit error rate is less than the predefined bit error rate threshold value indicating that the third audio data also represents the same sound; and
      
      cause the third audio data to be deleted.
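
Claims 6, 7, 16, and 17 describe keeping a set of recently received fingerprints and deleting the oldest one as new audio arrives. One way to picture that bookkeeping is a time-ordered queue. The sketch below is an illustration under stated assumptions, not the claimed method: `RecentFingerprints`, `window_seconds`, and the rule of evicting entries older than the window are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class RecentFingerprints:
    """Hypothetical rolling set of (timestamp, fingerprint) pairs, oldest first."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._items = deque()

    def add(self, timestamp, fp):
        # Drop entries that are older than the window relative to the newcomer.
        while self._items and timestamp - self._items[0][0] > self.window_seconds:
            self._items.popleft()
        self._items.append((timestamp, fp))

    def fingerprints(self):
        """Fingerprints currently eligible for comparison, oldest first."""
        return [fp for _, fp in self._items]


recent = RecentFingerprints(window_seconds=5.0)
recent.add(0.0, "fp-a")
recent.add(2.0, "fp-b")
recent.add(6.0, "fp-c")       # "fp-a" is now 6 s old and is evicted
print(recent.fingerprints())  # ['fp-b', 'fp-c']
```

A deque keeps removal of the oldest entry cheap, which suits a store that is updated on every incoming request.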

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Rosen, Alexander David; Rodehorst, Michael James; Tucker, George Jay; Challenner, Aaron Lee Mathers
Primary Examiner(s): Baker, Matthew
Application Number: US15/195,587
Time in Patent Office: 406 Days
CPC Class Codes

G10L 15/22   Procedures used during a sp...

G10L 19/08   Determination or coding of ...

G10L 2015/223   Execution procedure of a sp...

G10L 25/18   the extracted parameters be...

G10L 25/51   for comparison or discrimin...
