Sound profile generation based on speech recognition results exceeding a threshold

US 10,074,364 B1
Filed: 03/30/2016
Issued: 09/11/2018
Est. Priority Date: 02/02/2016
Status: Active Grant

1 Assignment

0 Petitions

Abstract

Systems and methods for generating sound profiles of artificial commands detected by multiple voice activated electronic devices are described herein. In some embodiments, numerous voice activated electronic devices may send audio data representing a phrase to a backend system at substantially the same time. Text data representing the phrase, and counts for instances of that text data, may be generated. If the number of counts exceeds a predefined threshold, the backend system may cause any remaining response generation functionality for that particular command that is in excess of the predefined threshold to be stopped, and those devices returned to a sleep state. In some embodiments, a sound profile unique to the phrase that caused the excess of the predefined threshold may be generated such that future instances of the same phrase may be recognized prior to text data being generated, conserving the backend system's resources.
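
The counting step described in the abstract can be illustrated with a minimal sketch. It is not taken from the patent: the class name `PhraseCounter`, the 5-second window, and the threshold of 3 are hypothetical values chosen only for illustration, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class PhraseCounter:
    """Counts instances of the same text data inside a sliding time window.

    Hypothetical illustration: the window length and threshold are assumed
    values, not figures from the patent.
    """

    def __init__(self, window_seconds: float = 5.0, threshold: int = 3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        # Maps phrase text to the timestamps at which it arrived.
        self._arrivals = defaultdict(deque)

    def record(self, phrase_text: str, timestamp: float) -> bool:
        """Record one instance; return True if the count now exceeds the threshold."""
        arrivals = self._arrivals[phrase_text]
        arrivals.append(timestamp)
        # Drop arrivals that have fallen out of the temporal window.
        while arrivals and timestamp - arrivals[0] > self.window_seconds:
            arrivals.popleft()
        return len(arrivals) > self.threshold


counter = PhraseCounter(window_seconds=5.0, threshold=3)
results = [counter.record("what is the weather", t) for t in (0.0, 0.5, 1.0, 1.5)]
late_result = counter.record("what is the weather", 60.0)
```

In this sketch the fourth near-simultaneous instance pushes the count over the assumed threshold, while an instance arriving a minute later starts a fresh count.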

56 Citations

22 Claims

1. A method, comprising:
- receiving, at an electronic device, audio data representing a phrase;
  
  generating text data representing the phrase by executing speech-to-text functionality;
  
  identifying a category that has been generated for the phrase, the category signifying that the text data represents the phrase;
  
  adding a count to the category to indicate that another instance of the category has been identified;
  
  determining a total number of counts for the category;
  
  determining, based on the total number of counts for the category, that multiple requesting devices have sent audio data representing the phrase to the electronic device during a same temporal window;
  
  based at least in part on a determination that multiple requesting devices have sent audio data representing the phrase to the electronic device during the same temporal window, generating an audio fingerprint corresponding to the audio data;
  
  storing the audio fingerprint on the electronic device;
  
  receiving additional audio data also representing the phrase;
  
  generating an additional audio fingerprint corresponding to the additional audio data;
  
  determining a bit error rate of the additional audio fingerprint as compared to the audio fingerprint;
  
  determining that the bit error rate is less than a bit error rate threshold value indicating that the audio data and the additional audio data both represent the phrase; and
  
  based at least in part on a determination that the bit error rate is less than the bit error rate threshold value, refraining from performing at least some automatic speech recognition processing for the additional audio data.
- View Dependent Claims (2, 3, 4)
- - 2. The method of claim 1, further comprising:
    - receiving new audio data representing a different phrase;
      
      generating a new audio fingerprint corresponding to the new audio data;
      
      determining a new bit error rate of the new audio fingerprint as compared to the audio fingerprint;
      
      determining that the new bit error rate is greater than the bit error rate threshold value that indicates that the new audio fingerprint and the audio fingerprint represent different phrases; and
      
      enabling speech recognition processing to proceed for the new audio data such that text data representing the different phrase is generated.
  - 3. The method of claim 1, further comprising:
    - receiving new audio data representing the phrase;
      
      generating a candidate audio fingerprint corresponding to a beginning portion of the phrase;
      
      determining a new bit error rate of the candidate audio fingerprint as compared to an initial portion of the audio fingerprint;
      
      determining that the new bit error rate is greater than the bit error rate threshold value indicating that the new audio data and the audio data differ;
      
      generating a full audio fingerprint corresponding to the new audio data such that the full audio fingerprint represents an entirety of the phrase;
      
      determining a supplemental bit error rate of the full audio fingerprint as compared to the audio fingerprint;
      
      determining that the supplemental bit error rate is less than the bit error rate threshold value indicating that the new audio data and the audio data both represent the phrase; and
      
      based at least in part on a determination that the supplemental bit error rate is less than the bit error rate threshold value, refraining from performing at least some automatic speech recognition processing for the new audio data.
  - 4. The method of claim 1, further comprising:
    - receiving new audio data representing the phrase;
      
      generating a candidate audio fingerprint corresponding to a beginning portion of the phrase;
      
      determining a new bit error rate of the candidate audio fingerprint as compared to an initial portion of the audio fingerprint;
      
      determining that the new bit error rate is less than the bit error rate threshold value indicating that the new audio data represents the phrase; and
      
      based at least in part on a determination that the new bit error rate is less than the bit error rate threshold value, prior to a full audio fingerprint corresponding to an entirety of the phrase being generated, refraining from performing at least some automatic speech recognition processing for the new audio data.
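
The bit error rate comparisons recited in claims 1 through 4 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: fingerprints are assumed to be equal-length `bytes` objects, the function names `bit_error_rate` and `should_skip_asr` are hypothetical, and the 0.3 threshold and one-byte prefix are arbitrary example values. For simplicity the whole candidate fingerprint is assumed to be available up front, whereas claim 4 contemplates deciding before the full fingerprint exists.

```python
def bit_error_rate(fp_a: bytes, fp_b: bytes) -> float:
    """Return the fraction of differing bits between two equal-length fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    if not fp_a:
        raise ValueError("fingerprints must not be empty")
    # XOR leaves a 1 wherever the two fingerprints disagree.
    differing = sum(bin(x ^ y).count("1") for x, y in zip(fp_a, fp_b))
    return differing / (8 * len(fp_a))


def should_skip_asr(candidate: bytes, stored: bytes,
                    threshold: float, prefix_len: int) -> bool:
    """Return True when speech recognition can be skipped for the candidate.

    The beginning portion is compared first (the claim 4 style early exit);
    if that fails, the full fingerprints are compared (the claim 3 fallback).
    """
    if prefix_len < 1:
        raise ValueError("prefix_len must be at least 1")
    if bit_error_rate(candidate[:prefix_len], stored[:prefix_len]) < threshold:
        return True
    return bit_error_rate(candidate, stored) < threshold


# Hypothetical 32-bit fingerprints, for illustration only.
stored = bytes([0b10101010, 0b11110000, 0b00001111, 0b11001100])
noisy_start = bytes([0b01010101, 0b11110000, 0b00001111, 0b11001100])
different = bytes([0b01010101, 0b00001111, 0b11110000, 0b00110011])

skip_same = should_skip_asr(stored, stored, threshold=0.3, prefix_len=1)
skip_noisy = should_skip_asr(noisy_start, stored, threshold=0.3, prefix_len=1)
skip_different = should_skip_asr(different, stored, threshold=0.3, prefix_len=1)
```

Here `noisy_start` fails the prefix check but passes the full comparison, which mirrors the claim 3 scenario, while `different` exceeds the threshold in both checks and would proceed to speech recognition as in claim 2.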

5. A method, comprising:
- receiving a first instance of audio data representing a first sound;
  
  determining that, within a temporal window, a plurality of additional instances of audio data representing the first sound are also received;
  
  determining a number of the instances of audio data representing the first sound that are received within the temporal window;
  
  determining that the number of the instances is greater than a threshold value;
  
  based at least in part on a determination that the number of the instances is greater than the threshold value, generating a first sound profile of the first sound;
  
  storing the first sound profile;
  
  receiving second audio data representing a second sound;
  
  generating a second sound profile of the second sound;
  
  determining that a similarity value of the second sound profile and the first sound profile is greater than a similarity threshold value; and
  
  based at least in part on a determination that the similarity value is greater than the similarity threshold value, refraining from performing at least some automated speech recognition processing for the second audio data.
- View Dependent Claims (6, 7, 8, 9, 10, 11, 12, 13)
- - 6. The method of claim 5, wherein the method is performed by at least one electronic device that is separate from one or more user devices that generate the first audio data and the second audio data.
  - 7. The method of claim 5, wherein determining the similarity value comprises:
    - determining a bit error rate of the second sound profile as compared to the first sound profile;
      
      determining that the bit error rate is less than a bit rate threshold signifying a bit difference between the first sound profile and the second sound profile; and
      
      determining, based at least in part on the bit error rate value being less than the bit rate threshold, that the first sound profile and the second sound profile are substantially similar to one another.
  - 8. The method of claim 5, further comprising:
    - receiving third audio data representing a third sound;
      
      generating a third sound profile of the third sound;
      
      determining that a second similarity value of the third sound profile and the first sound profile is less than the similarity threshold value; and
      
      enabling automated speech recognition processing to continue for the third audio data.
  - 9. The method of claim 5, further comprising:
    - generating, prior to determining that the number of instances of audio data representing the first sound is greater than the threshold value, a first instance of text data representing the first sound;
      
      generating an additional plurality of instances of text data corresponding to the plurality of additional instances of audio data;
      
      determining a total number of counts corresponding to the first instance of the text data and the additional plurality of instances of the text data; and
      
      determining that the total number of counts occurring within the temporal window is greater than the threshold value.
  - 10. The method of claim 5, further comprising:
    - determining that the first sound was produced by a media event;
      
      obtaining a total audio output of the media event; and
      
      generating a media event sound profile based on the total audio output.
  - 11. The method of claim 10, further comprising:
    - receiving third audio data representing a third sound;
      
      generating a third sound profile of the third audio data;
      
      determining that a second similarity value of the third sound profile as compared to a first portion of the media event sound profile is greater than a media event similarity threshold value; and
      
      based at least in part on a determination that the second similarity value is greater than the media event similarity threshold value, refraining from performing at least some automated speech recognition processing for the third audio data.
  - 12. The method of claim 5, wherein:
    - at least one first electronic device receives the first instance of audio data, determines that the plurality of additional instances of audio data representing the first sound are also received within the temporal window, determines the number of instances of audio data, determines that the number of instances is greater than the threshold value, and generates the first sound profile of the first sound;
      
      the method further comprises sending the first sound profile from the at least one first electronic device to at least one second electronic device for storage; and
      
      the at least one second electronic device receives the second audio data, generates the second sound profile, determines that the similarity value of the second sound profile and the first sound profile is greater than the similarity threshold value, and refrains from causing the at least some automated speech recognition processing to be performed for the second audio data.
  - 13. The method of claim 12, further comprising:
    - based at least in part on the determination that the similarity value is greater than the similarity threshold value, refraining from sending at least a portion of the second audio data from the at least one second electronic device to the at least one first electronic device for automatic speech recognition processing.
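
The profile comparison in claims 5 and 8 can be sketched as below. The claims do not specify how a sound profile is computed, so `toy_sound_profile` is a deliberately simplistic stand-in (one bit per pair of adjacent frames, set when frame energy rises) and should not be read as the patent's method. The class name `AsrGate`, the frame size, and the 0.9 similarity threshold are likewise hypothetical. The sketch covers only the comparison after a profile has been stored; the instance counting that triggers profile generation is omitted.

```python
from typing import List, Sequence


def toy_sound_profile(samples: Sequence[float], frame_size: int = 4) -> List[int]:
    """Toy stand-in for a sound profile: 1 where frame energy rose, else 0."""
    energies = [
        sum(s * s for s in samples[i:i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    return [1 if later > earlier else 0
            for earlier, later in zip(energies, energies[1:])]


def similarity(profile_a: Sequence[int], profile_b: Sequence[int]) -> float:
    """Fraction of matching bits; profiles of unequal length score 0.0."""
    if len(profile_a) != len(profile_b) or not profile_a:
        return 0.0
    matches = sum(1 for x, y in zip(profile_a, profile_b) if x == y)
    return matches / len(profile_a)


class AsrGate:
    """Decides whether speech recognition should run for incoming audio."""

    def __init__(self, similarity_threshold: float = 0.9):
        self.similarity_threshold = similarity_threshold
        self.stored_profiles: List[List[int]] = []

    def store(self, samples: Sequence[float]) -> None:
        """Store the profile of a sound that was flagged elsewhere."""
        self.stored_profiles.append(toy_sound_profile(samples))

    def should_run_asr(self, samples: Sequence[float]) -> bool:
        """Return False when the audio matches any stored profile."""
        profile = toy_sound_profile(samples)
        return not any(
            similarity(profile, stored) > self.similarity_threshold
            for stored in self.stored_profiles
        )


# Made-up sample values, for illustration only.
first_sound = [0.1] * 4 + [0.5] * 4 + [0.2] * 4 + [0.9] * 4 + [0.3] * 4
second_sound = [1.1 * s for s in first_sound]  # same sound, slightly louder
third_sound = [0.9] * 4 + [0.5] * 4 + [0.6] * 4 + [0.2] * 4 + [0.1] * 4

gate = AsrGate(similarity_threshold=0.9)
gate.store(first_sound)
run_second = gate.should_run_asr(second_sound)
run_third = gate.should_run_asr(third_sound)
```

In this example the louder copy of the first sound matches the stored profile, so recognition is skipped for it, while the third sound falls below the assumed similarity threshold and recognition continues.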

14. An electronic system, comprising:
- and at least one processor operable to:
  
  receive a first instance of audio data representing a first sound;
  
  determine that, within a temporal window, a plurality of additional instances of audio data representing the first sound are also received;
  
  determine a number of the instances of audio data representing the first sound that are received within the temporal window;
  
  determine that the number of instances is greater than a threshold value;
  
  based at least in part on a determination that the number of instances is greater than the threshold value, generate a first sound profile of the first sound;
  
  store the first sound profile;
  
  receive second audio data representing a second sound;
  
  generate a second sound profile of the second sound;
  
  determine that a similarity value of the second sound profile and the first sound profile is greater than a similarity threshold value; and
  
  based at least in part on a determination that the similarity value is greater than the similarity threshold value, refrain from performing at least some automated speech recognition processing for the second audio data.
- View Dependent Claims (15, 16, 17, 18, 19, 20, 21, 22)
- - 15. The electronic system of claim 14, wherein the at least one processor is separate from one or more user devices that generate the first audio data and the second audio data.
  - 16. The electronic system of claim 14, wherein the at least one processor is further operable to:
    - determine a bit error rate of the second sound profile as compared to the first sound profile;
      
      determine that the bit error rate value is less than a bit rate threshold signifying a bit difference between the first sound profile and the second sound profile; and
      
      determine, based at least in part on the bit error rate value being less than the bit rate threshold, that the first sound profile and the second sound profile are substantially similar to one another.
  - 17. The electronic system of claim 14, wherein the at least one processor is further operable to:
    - receive third audio data representing a third sound;
      
      generate a third sound profile of the third sound;
      
      determine that a second similarity value of the second sound profile and the first sound profile is less than the similarity threshold value; and
      
      enable automatic speech recognition processing to continue for the third audio data.
  - 18. The electronic system of claim 14, wherein the at least one processor is further operable to:
    - generate, prior to determining that the number of instances of audio data representing the first sound is greater than the threshold value, a first instance of text data representing the first sound;
      
      generate an additional plurality of instances of text data corresponding to the plurality of additional instances of audio data;
      
      determine a total number of counts corresponding to the first instance of text data and the additional plurality of instances of text data; and
      
      determine that the total number of counts occurring within the temporal window is greater than the threshold value.
  - 19. The electronic system of claim 14, wherein the at least one processor is further operable to:
    - determine that the first sound was produced by a media event;
      
      obtain a total audio output of the media event; and
      
      generate a media event sound profile based on the total audio output.
  - 20. The electronic system of claim 19, wherein the at least one processor is further operable to:
    - receive third audio data representing a third sound;
      
      generate a third sound profile of the third audio data;
      
      determine that a second similarity value of the third sound profile as compared to a first portion of the media event sound profile is greater than a media event similarity threshold value; and
      
      based at least in part on a determination that the second similarity value is greater than the media event similarity threshold value, refrain from performing at least some automated speech recognition processing for the third audio data.
  - 21. The electronic system of claim 14, wherein:
    - the at least one processor comprises at least one first processor associated with at least one first electronic device and at least one second processor associated with at least one second electronic device;
      
      the at least one first processor is operable to:
      
      receive the first instance of the audio data,
      
      determine that the plurality of additional instances of audio data representing the first sound are also received,
      
      determine the number of the instances of audio data,
      
      determine that the number of instances is greater than the threshold value,
      
      generate the first sound profile, and
      
      cause a communication to be sent to the second electronic device that causes the first sound profile to be stored; and
      
      the at least one second processor is operable to:
      
      generate the second sound profile of the second sound,
      
      determine that the similarity value of the second sound profile and the first sound profile is greater than the similarity threshold value, and
      
      based at least in part on the determination that the similarity value is greater than the similarity threshold value, refrain from causing the at least some automated speech recognition processing to be performed for the second audio data.
  - 22. The electronic system of claim 21, wherein the at least one second processor is further operable to:
    - based at least in part on the determination that the similarity value is greater than the similarity threshold value, refrain from sending at least a portion of the second audio data from the at least one second electronic device to the at least one first electronic device for automatic speech recognition processing.
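
The division of work between a first and a second electronic device in claims 12, 13, 21, and 22 can be sketched as follows. Everything here is an assumption made for illustration: the class names `FirstDevice` and `SecondDevice`, the use of a tuple of bits as a sound profile, counting by exact profile equality, and the threshold values. The temporal window from the independent claims is omitted to keep the sketch short.

```python
from typing import Callable, Dict, List, Tuple

Profile = Tuple[int, ...]  # hypothetical representation of a sound profile


def similarity(a: Profile, b: Profile) -> float:
    """Fraction of matching bits; profiles of unequal length score 0.0."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


class SecondDevice:
    """Stores profiles sent by the first device and filters audio locally."""

    def __init__(self, similarity_threshold: float = 0.9):
        self.similarity_threshold = similarity_threshold
        self.stored: List[Profile] = []

    def receive_profile(self, profile: Profile) -> None:
        self.stored.append(profile)

    def handle(self, profile: Profile,
               send_for_asr: Callable[[Profile], None]) -> bool:
        """Forward audio for recognition unless it matches a stored profile.

        Returns True when the audio was forwarded.
        """
        if any(similarity(profile, s) > self.similarity_threshold
               for s in self.stored):
            return False  # refrain from sending the audio at all
        send_for_asr(profile)
        return True


class FirstDevice:
    """Counts instances and distributes a profile once the count is exceeded."""

    def __init__(self, instance_threshold: int, subscribers: List[SecondDevice]):
        self.instance_threshold = instance_threshold
        self.subscribers = subscribers
        self.counts: Dict[Profile, int] = {}
        self.asr_log: List[Profile] = []

    def run_asr(self, profile: Profile) -> None:
        self.asr_log.append(profile)  # stand-in for real speech recognition
        self.counts[profile] = self.counts.get(profile, 0) + 1
        # Push the profile out the first time the count exceeds the threshold.
        if self.counts[profile] == self.instance_threshold + 1:
            for device in self.subscribers:
                device.receive_profile(profile)


device = SecondDevice(similarity_threshold=0.9)
backend = FirstDevice(instance_threshold=2, subscribers=[device])
advert = (1, 0, 1, 1, 0, 0, 1, 0)
forwarded = [device.handle(advert, backend.run_asr) for _ in range(4)]
```

In this run the first three instances reach the first device; the third exceeds the assumed threshold and causes the profile to be stored on the second device, which then declines to forward the fourth.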

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Wightman, Colin Wills; Narayanan, Naresh; Rosen, Alexander David; Rodehorst, Michael James; Rashid, Daniel Robert
Primary Examiner(s)
Sanchez, Michael Ortiz

Application Number

US15/085,772
Time in Patent Office

895 Days
CPC Class Codes

G06F 16/316   Indexing structures

G06F 40/289   Phrasal analysis, e.g. fini...

G10L 15/10   using distance or distortio...

G10L 15/20   Speech recognition techniqu...

G10L 15/26   Speech to text systems G10L...

G10L 15/285   Memory allocation or algori...

G10L 17/04   Training, enrolment or mode...

G10L 2015/223   Execution procedure of a sp...

G10L 25/51   for comparison or discrimin...
