Post-speech recognition request surplus detection and prevention

US 10,453,460 B1
Filed: 03/30/2016
Issued: 10/22/2019
Est. Priority Date: 02/02/2016
Status: Active Grant

Abstract

Systems and methods for determining that artificial commands, in excess of a threshold value, are detected by multiple voice activated electronic devices are described herein. In some embodiments, numerous voice activated electronic devices may send audio data representing a phrase to a backend system at substantially the same time. Text data representing the phrase, and counts for instances of that text data, may be generated. If the number of counts exceeds a predefined threshold, the backend system may cause any remaining response generation functionality for that particular command that is in excess of the predefined threshold to be stopped, and those devices returned to a sleep state. In some embodiments, a sound profile unique to the phrase that caused the excess of the predefined threshold may be generated such that future instances of the same phrase may be recognized prior to text data being generated, conserving the backend system's resources.

22 Claims

1. A method for preventing a backend system from providing an error message, comprising:
- receiving, from a requesting device, audio data representing a phrase;
  
  determining a temporal window during which the audio data was received;
  
  generating text data from the audio data by executing speech-to-text functionality;
  
  identifying a category that has been generated for the phrase, the category signifying that the text data represents the phrase;
  
  adding a count to the category to indicate that another instance of the category has been identified;
  
  determining that a number of additional instances of the text data have been recognized from additional outputs from speech recognition functionality corresponding to additional audio data that also represent the phrase has been received by the backend system from additional requesting devices;
  
  adding additional counts to the category for each of the number of additional instances such that a total number of counts for the category is representative of how many of the additional requesting devices sent similar audio data representing the phrase to the backend system;
  
  determining that the additional audio data was also received within the temporal window;
  
  determining a threshold count value indicative of the phrase originating from a non-human source;
  
  determining that the total number of counts is greater than the threshold count value;
  
  based at least in part on determining that the total number of counts is greater than the threshold count value, causing the speech recognition functionality to stop prior to providing the text data to natural language understanding functionality;
  
  generating an instruction for the requesting device to return to a sleep state; and
  
  sending the instruction to the requesting device.
- - 2. The method of claim 1, further comprising:
    - storing the phrase in memory on the backend system;
      
      receiving, from the requesting device, supplemental audio data representing the phrase;
      
      generating supplemental text data from the supplemental audio data by executing the speech-to-text functionality;
      
      determining that the supplemental text data also represents the phrase;
      
      searching the memory for the phrase;
      
      determining that the phrase is stored in the memory;
      
      causing the text data to not be provided to the natural language understanding functionality;
      
      generating the instruction; and
      
      sending the instruction to the requesting device.
  - 3. The method of claim 1, further comprising:
    - receiving new audio data that also represents the phrase;
      
      generating new text data from the new audio data by executing the speech-to-text functionality;
      
      determining a new time that the new audio data was received;
      
      determining that the new time is not included within the temporal window;
      
      determining that the new audio data originates from a different source than a source of the audio data; and
      
      sending the new text data to the natural language understanding functionality.
  - 4. The method of claim 1, further comprising:
    - determining that a television commercial is occurring during the temporal window;
      
      determining that the phrase originated from the television commercial based on the total number of counts being greater than the threshold count value;
      
      determining a future temporal window that a future occurrence of the television commercial is going to air;
      
      generating a directive that instructs the requesting device to ignore an utterance of a wake word that is determined to occur within the future temporal window; and
      
      sending the directive to the requesting device prior to a start of the future temporal window such that the future occurrence of the television commercial does not activate the requesting device.
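
The counting steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `SurplusDetector`, the text normalization used as the "category", the sliding-window bookkeeping, and the returned dictionary are all assumptions made for illustration. The sketch counts distinct devices that sent the same recognized phrase inside the temporal window and, once the count exceeds the threshold, withholds the text from natural language understanding and returns a sleep instruction.

```python
from collections import defaultdict, deque


class SurplusDetector:
    """Hypothetical helper: counts devices that sent the same phrase in a time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        # category (normalized phrase text) -> deque of (timestamp, device_id)
        self._hits = defaultdict(deque)

    @staticmethod
    def category_for(text):
        # Assumed normalization: case-insensitive, whitespace-collapsed text.
        return " ".join(text.lower().split())

    def observe(self, device_id, text, timestamp):
        """Record one speech-to-text result and return the action for this request."""
        hits = self._hits[self.category_for(text)]
        hits.append((timestamp, device_id))
        # Discard instances that have fallen out of the temporal window.
        while hits and timestamp - hits[0][0] > self.window_seconds:
            hits.popleft()
        distinct_devices = {dev for _, dev in hits}
        if len(distinct_devices) > self.threshold:
            # Stop before NLU and tell the requesting device to return to sleep.
            return {"forward_to_nlu": False, "instruction": "return_to_sleep"}
        return {"forward_to_nlu": True, "instruction": None}
```

With `SurplusDetector(window_seconds=5.0, threshold=2)`, the first two devices reporting a phrase are forwarded as usual, while the third and later devices inside the same window receive the sleep instruction. A request arriving after the window has passed is forwarded again, which mirrors the behavior recited in claim 3.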

5. A method for preventing a system error, comprising:
- receiving, at a backend system, first audio data from a plurality of different voice-activated electronic devices;
  
  determining, at the backend system and using the first audio data, that a first phrase was detected by more than a threshold number of the voice-activated electronic devices within a temporal window; and
  
  at the backend system and based at least in part on the first phrase being detected by more than the threshold number of the voice-activated electronic devices within the temporal window, causing speech recognition functionality to stop for at least a first portion of the first audio data received from at least one of the voice-activated electronic devices.
- - 6. The method of claim 5, further comprising:
    - identifying a category that has been generated for the first phrase; and
      
      incrementing a count for the category, the count corresponding to a number of the voice-activated electronic devices that detected the first phrase within the temporal window.
  - 7. The method of claim 5, further comprising:
    - based at least in part on the first phrase being detected by more than the threshold number of the voice-activated electronic devices within the temporal window, generating, for the at least one voice-activated electronic device that sent the first portion of the first audio data, an instruction to return to a standby mode; and
      
      sending the instruction to the at least one voice-activated electronic device.
  - 8. The method of claim 5, further comprising:
    - determining a time that second audio data corresponding to the first phrase was received from a voice-activated electronic device;
      
      generating text data corresponding to the second audio data;
      
      determining that the time is one of before or after the temporal window; and
      
      providing the text data to natural language understanding functionality.
  - 9. The method of claim 5, further comprising:
    - generating first text data corresponding to a first utterance detected by a first one of the voice-activated electronic devices;
      
      storing the first text data;
      
      generating second text data corresponding to a second utterance detected by a second one of the voice-activated electronic devices;
      
      determining that the second text data matches the first text data; and
      
      causing the speech recognition functionality to stop for the first portion of the first audio data based at least in part on the second text data matching the first text data.
  - 10. The method of claim 5, further comprising:
    - determining that a media event is occurring during the temporal window; and
      
      determining that the first phrase corresponds to the media event based at least in part on the first phrase being detected by more than the threshold number of the voice-activated electronic devices within the temporal window.
  - 11. The method of claim 10, further comprising:
    - determining at least one future time that the media event is to occur;
      
      generating an instruction for a first voice-activated electronic device to ignore detected audio data corresponding to the first phrase at the at least one future time; and
      
      sending the instruction to the first voice-activated electronic device prior to the at least one future time.
  - 12. The method of claim 5, further comprising:
    - determining an average number of different voice-activated electronic devices that detect the first phrase during a predefined time period; and
      
      setting the threshold number as the average number.
  - 22. The method of claim 5, wherein:
    - the method further comprises:
      
      generating first text data corresponding to a first utterance detected by a first one of the voice-activated electronic devices, and determining that the first text data matches stored text data corresponding to the first phrase; and
      
      the determining that the first phrase was detected by more than the threshold number of the voice-activated electronic devices within the temporal window is based at least in part on the first text data matching the stored text data.
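
Claim 12 sets the threshold to the average number of different devices that detect the phrase during a predefined time period. One possible reading is sketched below in Python. The function name `average_device_count`, the bucketing of timestamps into consecutive periods, and the choice to average only over periods that contain at least one detection are assumptions, not details taken from the claims.

```python
from statistics import mean


def average_device_count(detections, period_seconds):
    """Average number of distinct devices per period for one phrase.

    detections: iterable of (timestamp_seconds, device_id) pairs.
    Assumption: only periods with at least one detection are averaged.
    """
    buckets = {}
    for timestamp, device_id in detections:
        # Group detections into consecutive, fixed-length periods.
        bucket = int(timestamp // period_seconds)
        buckets.setdefault(bucket, set()).add(device_id)
    if not buckets:
        return 0.0
    return mean(len(devices) for devices in buckets.values())


def exceeds_threshold(devices_in_window, threshold):
    """True when more than `threshold` distinct devices detected the phrase."""
    return len(set(devices_in_window)) > threshold
```

A backend could recompute `average_device_count` from historical traffic and pass the result to `exceeds_threshold` for the current temporal window. Whether a plain average is too sensitive in practice is a design question the claim itself leaves open.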

13. A system, comprising:
- at least one processor; and
  
  at least one computer-readable medium encoded with instructions which, when executed by the at least one processor, cause the system to:
  
  receive first audio data from a plurality of different voice-activated electronic devices;
  
  determine, using the first audio data, that a first phrase was detected by more than a threshold number of the voice-activated electronic devices within a temporal window;
  
  based at least in part on the first phrase being detected by more than the threshold number of the voice-activated electronic devices during the temporal window, cause speech recognition functionality to stop for at least a first portion of the first audio data received from at least one of the voice-activated electronic devices.
- - 14. The system of claim 13, wherein the at least one computer-readable medium is encoded with additional instructions which, when executed by the at least one processor, further cause the system to:
    - identify a category that has been generated for the first phrase; and
      
      increment a count for the category, the count corresponding to a number of the voice-activated electronic devices that detected the first phrase within the temporal window.
  - 15. The system of claim 13, wherein the at least one computer-readable medium is encoded with additional instructions which, when executed by the at least one processor, further cause the system to:
    - based at least in part on the first phrase being detected by more than the threshold number of the voice-activated electronic devices within the temporal window, generate, for the at least one voice-activated electronic device that sent the first portion of the first audio data, an instruction to return to a standby mode; and
      
      send the instruction to the at least one voice-activated electronic device.
  - 16. The system of claim 13, wherein the at least one computer-readable medium is encoded with additional instructions which, when executed by the at least one processor, further cause the system to:
    - determine a time that second audio data corresponding to the first phrase was received from a voice-activated electronic device;
      
      generate text data corresponding to the second audio data;
      
      determine that the time is one of before or after the temporal window; and
      
      provide the text data to natural language understanding functionality.
  - 17. The system of claim 13, wherein the at least one computer-readable medium is encoded with additional instructions which, when executed by the at least one processor, further cause the system to:
    - generate first text data corresponding to a first utterance detected by a first one of the voice-activated electronic devices;
      
      store the first text data;
      
      generate second text data corresponding to a second utterance detected by a second one of the voice-activated electronic devices;
      
      determine that the second text data matches the first text data; and
      
      cause the speech recognition functionality to stop for the first portion of the first audio data based at least in part on the second text data matching the first text data.
  - 18. The system of claim 13, wherein the at least one computer-readable medium is encoded with additional instructions which, when executed by the at least one processor, further cause the system to:
    - determine that a media event is occurring during the temporal window; and
      
      determine that the first phrase corresponds to the media event based at least in part on the first phrase being detected by more than the threshold number of the voice-activated electronic devices within the temporal window.
  - 19. The system of claim 18, wherein the at least one computer-readable medium is encoded with additional instructions which, when executed by the at least one processor, further cause the system to:
    - determine at least one future time that the media event is to occur;
      
      generate an instruction for a first voice-activated electronic device to ignore detected audio data corresponding to the first phrase at the at least one future time; and
      
      send the instruction to the first voice-activated electronic device prior to the at least one future time.
  - 20. The system of claim 13, wherein the at least one computer-readable medium is encoded with additional instructions which, when executed by the at least one processor, further cause the system to:
    - determine an average number of different voice-activated electronic devices that detect the first phrase being generated during a predefined time period; and
      
      set the threshold number as the average number.
  - 21. The system of claim 13, wherein the at least one computer-readable medium is encoded with additional instructions which, when executed by the at least one processor, further cause the system to:
    - generate first text data corresponding to a first utterance detected by a first one of the voice-activated electronic devices;
      
      determine that the first text data matches stored text data corresponding to the first phrase; and
      
      determine, based at least in part on the first text data matching the stored text data, that the first phrase was detected by more than the threshold number of the voice-activated electronic devices within the temporal window.
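
Claims 11 and 19 describe sending a device an instruction, ahead of time, to ignore the phrase when a media event recurs. The Python sketch below shows one way such instructions might be built and checked. The `IgnoreInstruction` type, both function names, and the fixed `duration_seconds` around each airing are hypothetical; the claims recite only "at least one future time".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IgnoreInstruction:
    device_id: str
    phrase: str
    start: float  # time at which the device should begin ignoring the phrase
    end: float    # time at which normal handling resumes


def build_ignore_instructions(device_ids, phrase, future_airings, now, duration_seconds):
    """Create one instruction per device for each airing that has not started yet."""
    instructions = []
    for airing_start in sorted(future_airings):
        if airing_start <= now:
            # Too late: the instruction could not arrive prior to the future time.
            continue
        for device_id in device_ids:
            instructions.append(
                IgnoreInstruction(device_id, phrase, airing_start, airing_start + duration_seconds)
            )
    return instructions


def should_ignore(instruction, phrase, heard_at):
    """Device-side check: is this phrase covered by the instruction right now?"""
    return phrase == instruction.phrase and instruction.start <= heard_at <= instruction.end
```

Filtering on `airing_start <= now` reflects the requirement that the instruction be sent prior to the future time. Scheduling and delivery to the device are outside the scope of this sketch.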

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Wightman, Colin Wills; Narayanan, Naresh; Rashid, Daniel Robert
Primary Examiner(s)
Ortiz-Sanchez, Michael

Application Number

US15/085,692
Time in Patent Office

1,301 Days
CPC Class Codes

G06F 16/316   Indexing structures

G06F 40/289   Phrasal analysis, e.g. fini...

G10L 15/10   using distance or distortio...

G10L 15/20   Speech recognition techniqu...

G10L 15/26   Speech to text systems G10L...

G10L 15/285   Memory allocation or algori...

G10L 17/04   Training, enrolment or mode...

G10L 2015/223   Execution procedure of a sp...

G10L 25/51   for comparison or discrimin...
