USER SPECIFIED KEYWORD SPOTTING USING LONG SHORT TERM MEMORY NEURAL NETWORK FEATURE EXTRACTOR

US 20160180838A1
Filed: 12/22/2014
Published: 06/23/2016
Est. Priority Date: 12/22/2014
Status: Active Grant

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for recognizing keywords using a long short term memory neural network. One of the methods includes receiving, by a device for each of multiple variable length enrollment audio signals, a respective plurality of enrollment feature vectors that represent features of the respective variable length enrollment audio signal, processing each of the plurality of enrollment feature vectors using a long short term memory (LSTM) neural network to generate a respective enrollment LSTM output vector for each enrollment feature vector, and generating, for the respective variable length enrollment audio signal, a template fixed length representation for use in determining whether another audio signal encodes another spoken utterance of the enrollment phrase by combining at most a quantity k of the enrollment LSTM output vectors for the enrollment audio signal.

59 Citations

20 Claims

1. A method comprising:
- receiving, by a device for each of multiple variable length enrollment audio signals each encoding a respective spoken utterance of an enrollment phrase, a respective plurality of enrollment feature vectors that represent features of the respective variable length enrollment audio signal, wherein when the device determines that another audio signal encodes another spoken utterance of the enrollment phrase, the device performs a particular action assigned to the enrollment phrase; and
  
  for each of the multiple variable length enrollment audio signals:
  
  processing each of the plurality of enrollment feature vectors for the respective variable length enrollment audio signal using a long short term memory (LSTM) neural network to generate a respective enrollment LSTM output vector for each enrollment feature vector; and
  
  generating, for the respective variable length enrollment audio signal, a template fixed length representation for use in determining whether the other audio signal encodes another spoken utterance of the enrollment phrase by combining at most a quantity k of the enrollment LSTM output vectors for the enrollment audio signal, wherein a predetermined length of each of the template fixed length representations is the same.
  - 2. The method of claim 1, comprising, for each of the multiple variable length enrollment audio signals:
    - determining whether at least the quantity k of enrollment feature vectors were generated for the respective enrollment audio signal; and
      
      in response to determining that less than the quantity k of enrollment feature vectors were generated for the respective enrollment audio signal, adding leading zeros to a front of the respective template fixed length representation so that the respective template fixed length representation has the predetermined length.
  - 3. The method of claim 2, comprising determining an average number of enrollment frames in all of the enrollment audio signals wherein the quantity k comprises the average number of enrollment frames.
  - 4. The method of claim 2, wherein:
    - each of the enrollment output vectors has a predetermined size l that corresponds to a size of a last layer in the long short term memory neural network; and
      
      adding leading zeros to the front of the respective template fixed length representation comprises adding leading zeros to the front of the respective template fixed length representation so that the respective template fixed length representation has a total of l times k values.
  - 5. The method of claim 4, wherein the last layer in the long short term memory neural network comprises a hidden layer during training of the long short term memory neural network.
  - 6. The method of claim 1, comprising, for at least one of the multiple variable length enrollment audio signals:
    - determining that more than the quantity k of enrollment LSTM output vectors were generated for the respective enrollment audio signal; and
      
      in response, generating the template fixed length representation for the respective enrollment audio signal by combining the quantity k most recent enrollment LSTM output vectors.
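
The template construction in claims 1 to 6 can be illustrated with a minimal sketch. It assumes that "combining" means concatenating the output vectors in time order, which the claims do not state. The function names `choose_k` and `fixed_length_representation` are hypothetical, and rounding the average frame count to an integer is also an assumption.

```python
import numpy as np


def choose_k(frame_counts):
    """Claim 3: k is the average number of enrollment frames.

    Rounding to the nearest integer (and forcing k >= 1) is an assumption.
    """
    return max(1, int(round(sum(frame_counts) / len(frame_counts))))


def fixed_length_representation(lstm_outputs, k):
    """Combine at most the k most recent LSTM output vectors into l * k values.

    lstm_outputs: array of shape (T, l), one row per frame, where l is the
    size of the last LSTM layer. k must be at least 1.
    """
    outputs = np.asarray(lstm_outputs, dtype=float)
    num_frames, l = outputs.shape
    recent = outputs[-k:]          # last k rows, or all rows when T < k (claim 6)
    flat = recent.reshape(-1)      # concatenate the kept rows in time order
    pad = l * k - flat.size        # number of missing values when T < k
    # Claims 2 and 4: leading zeros bring the result to l * k values.
    return np.concatenate([np.zeros(pad), flat])
```

With this layout, every enrollment utterance yields a vector of the same length `l * k`, regardless of how many frames it contained.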

7. A system comprising:
- a computer and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving, by the computer for each of multiple variable length enrollment audio signals each encoding a respective spoken utterance of an enrollment phrase, a respective plurality of enrollment feature vectors that represent features of the respective variable length enrollment audio signal, wherein when the computer determines that another audio signal encodes another spoken utterance of the enrollment phrase, the computer performs a particular action assigned to the enrollment phrase; and
  
  for each of the multiple variable length enrollment audio signals:
  
  processing each of the plurality of enrollment feature vectors for the respective variable length enrollment audio signal using a long short term memory (LSTM) neural network to generate a respective enrollment LSTM output vector for each enrollment feature vector; and
  
  generating, for the respective variable length enrollment audio signal, a template fixed length representation for use in determining whether the other audio signal encodes another spoken utterance of the enrollment phrase by combining at most a quantity k of the LSTM output vectors for the enrollment audio signal, wherein a predetermined length of each of the template fixed length representations is the same.
  - 8. The system of claim 7, the operations comprising:
    - receiving, for an audio signal encoding a spoken utterance of a phrase, a respective plurality of feature vectors each comprising values that represent features of the audio signal;
      
      processing each of the feature vectors using the long short term memory neural network to generate a respective LSTM output vector for each of the feature vectors;
      
      generating a fixed length representation for the audio signal by combining at most the quantity k of the LSTM output vectors; and
      
      determining whether the phrase and the enrollment phrase are the same and the phrase was spoken by the same person using a comparison of the fixed length representation and the template fixed length representations.
  - 9. The system of claim 8, wherein determining whether the phrase and the enrollment phrase are the same using a comparison of the fixed length representation and all of the template fixed length representations comprises determining whether the phrase and the enrollment phrase are the same using a comparison of the fixed length representation and an average template fixed length representation created by averaging the values in each of the template fixed length representations to determine a corresponding value in the average template fixed length representation.
  - 10. The system of claim 8, wherein determining whether the phrase and the enrollment phrase are the same using a comparison of the fixed length representation and all of the template fixed length representations comprises determining a confidence score that represents a distance between the fixed length representation and the template fixed length representations.
  - 11. The system of claim 10, wherein determining the confidence score that represents the distance between the fixed length representation and the template fixed length representations comprises determining the distance between the fixed length representation and the template fixed length representations using a cosine distance function.
  - 12. The system of claim 10, the operations comprising determining that the confidence score satisfies a threshold value, wherein determining whether the phrase and the enrollment phrase are the same using a comparison of the representation and all of the template fixed length representations comprises determining that the phrase and the enrollment phrase are the same in response to determining that the confidence score satisfies the threshold value.
  - 13. The system of claim 12, the operations comprising:
    - receiving input indicating an action to perform in response to receipt of an audio signal encoding a spoken utterance of the enrollment phrase; and
      
      performing the action in response to determining that the phrase and the enrollment phrase are the same.
  - 14. The system of claim 13, wherein:
    - receiving input indicating the action to perform in response to receipt of an audio signal encoding a spoken utterance of the enrollment phrase comprises receiving input indicating that when a particular device is asleep and receives an audio signal encoding a spoken utterance of the enrollment phrase, the particular device should wake up; and
      
      performing the action in response to determining that the phrase and the enrollment phrase are the same comprises waking up by the particular device.
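
The comparison steps in claims 8 to 14 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the threshold value `0.3`, the rule that a smaller distance means a match, the handling of all-zero vectors, and all function names are hypothetical.

```python
import numpy as np


def average_template(templates):
    """Claim 9: element-wise average of the template representations."""
    return np.mean(np.stack(templates), axis=0)


def cosine_distance(a, b):
    """Claim 11: cosine distance, computed here as 1 - cosine similarity.

    Returning 1.0 when either vector is all zeros is an assumption.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 1.0
    return 1.0 - float(np.dot(a, b) / denom)


def matches_enrollment(representation, templates, threshold=0.3):
    """Claims 10 and 12: the distance acts as the confidence score.

    Assumed rule: the score satisfies the threshold when distance <= threshold.
    """
    distance = cosine_distance(representation, average_template(templates))
    return distance <= threshold, distance


def handle_audio(representation, templates, action, threshold=0.3):
    """Claims 13 and 14: run the user-assigned action (e.g. wake up) on a match."""
    matched, _ = matches_enrollment(representation, templates, threshold)
    if matched:
        action()
    return matched
```

A caller would pass the fixed length representation of the incoming audio, the stored templates, and a callback such as a device wake-up routine.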

15. A computer-readable medium storing software comprising instructions executable by a computer which, upon such execution, cause the computer to perform operations comprising:
- receiving, by the computer for each of multiple variable length enrollment audio signals each encoding a respective spoken utterance of an enrollment phrase, a respective plurality of enrollment feature vectors that represent features of the respective variable length enrollment audio signal, wherein when the computer determines that another audio signal encodes another spoken utterance of the enrollment phrase, the computer performs a particular action assigned to the enrollment phrase; and
  
  for each of the multiple variable length enrollment audio signals:
  
  processing each of the plurality of enrollment feature vectors for the respective variable length enrollment audio signal using a long short term memory (LSTM) neural network to generate a respective enrollment LSTM output vector for each enrollment feature vector; and
  
  generating, for the respective variable length enrollment audio signal, a template fixed length representation for use in determining whether the other audio signal encodes another spoken utterance of the enrollment phrase by combining at most a quantity k of the LSTM output vectors for the enrollment audio signal, wherein a predetermined length of each of the template fixed length representations is the same.
  - 16. The computer-readable medium of claim 15, comprising resetting, for each of the audio signals, values stored in cells of the long short term memory neural network prior to generating a first LSTM enrollment output vector for the respective audio signal.
  - 17. The computer-readable medium of claim 15, comprising receiving, from another computer, the long short term memory neural network.
  - 18. The computer-readable medium of claim 17, wherein receiving the long short term memory neural network comprises receiving a long short term memory neural network that does not include an output layer.
  - 19. The computer-readable medium of claim 15, comprising creating an average template fixed length representation by averaging values in each of the template fixed length representations to determine a corresponding value in the average template fixed length representation.
  - 20. The computer-readable medium of claim 15, comprising:
    - receiving, for an audio signal encoding a spoken utterance of a phrase, a respective plurality of feature vectors each comprising values that represent features of the audio signal;
      
      processing each of the feature vectors using the long short term memory neural network to generate a respective LSTM output vector for each of the feature vectors;
      
      generating a fixed length representation for the audio signal by combining at most the quantity k of the LSTM output vectors; and
      
      determining that the phrase and the enrollment phrase are not the same or were spoken by different people using a comparison of the representation and the template fixed length representations.
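
Claims 16 to 18 describe an LSTM whose cell values are reset before each audio signal and which has no output layer. The sketch below shows one way this could look with a single standard LSTM layer in NumPy. It is an assumption-laden illustration: the class name `LSTMFeatureExtractor` is hypothetical, the random weights stand in for a trained network received from another computer, and the claims do not specify the number of layers or the gate variant.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LSTMFeatureExtractor:
    """Single standard LSTM layer with no output layer (hypothetical sketch).

    The hidden vector h at each frame is used directly as the LSTM output
    vector for that frame.
    """

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        # Placeholder weights; a real system would load trained parameters.
        self.W = rng.normal(0.0, 0.1, (4 * hidden_size, input_size + hidden_size))
        self.b = np.zeros(4 * hidden_size)
        self.hidden_size = hidden_size
        self.reset()

    def reset(self):
        """Claim 16: clear the stored cell and hidden values."""
        self.h = np.zeros(self.hidden_size)
        self.c = np.zeros(self.hidden_size)

    def step(self, x):
        """Process one feature vector and return one output vector of size l."""
        z = self.W @ np.concatenate([x, self.h]) + self.b
        i, f, o, g = np.split(z, 4)  # input, forget, output gates and candidate
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        self.c = f * self.c + i * np.tanh(g)
        self.h = o * np.tanh(self.c)
        return self.h.copy()

    def process(self, feature_vectors):
        """Reset state, then return an array of shape (T, l) for T frames."""
        self.reset()
        return np.stack([self.step(np.asarray(x, dtype=float)) for x in feature_vectors])
```

Because `process` resets the state first, running the same utterance twice gives identical output vectors, so one utterance cannot influence the representation of the next.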

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Parada San Martin, Maria Carolina; Sainath, Tara N.; Chen, Guoguo

Granted Patent: US 9,508,340 B2
Field of Search
US Class Current: 1/1
CPC Class Codes
- G06F 1/3203   Power management, i.e. even...
- G06N 3/02   Neural networks
- G10L 15/02   Feature extraction for spee...
- G10L 15/16   using artificial neural net...
- G10L 15/28   Constructional details of s...
- G10L 2015/0631   Creating reference template...
- G10L 2015/088   Word spotting
- G10L 25/51   for comparison or discrimin...
