Frame skipping with extrapolation and outputs on demand neural network for automatic speech recognition

US 9,520,128 B2
Filed: 09/23/2014
Issued: 12/13/2016
Est. Priority Date: 09/23/2014
Status: Active Grant


Abstract

Techniques related to implementing neural networks for speech recognition systems are discussed. Such techniques may include implementing frame skipping with approximated skip frames and/or distances on demand such that only those outputs needed by a speech decoder are provided via the neural network or approximation techniques.
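The frame skipping described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `frame_skip_distances`, the `evaluate` callback standing in for the neural network, and the choice to reuse the latest distances until two evaluation frames exist are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Sequence


def frame_skip_distances(
    features: Sequence[Sequence[float]],
    evaluate: Callable[[Sequence[float]], List[float]],
    skips: int = 1,
) -> List[List[float]]:
    """Return one distance vector per frame.

    The network (`evaluate`, a hypothetical stand-in) runs only on every
    (skips + 1)-th frame; the frames in between are approximated.
    """
    step = skips + 1
    prev: Optional[List[float]] = None  # evaluation frame before `last`
    last: Optional[List[float]] = None  # most recent evaluation frame
    out: List[List[float]] = []
    for t, frame in enumerate(features):
        if t % step == 0:
            # Evaluation frame: run the network.
            prev, last = last, evaluate(frame)
            out.append(last)
        elif prev is None:
            # Assumption: with only one evaluation so far, reuse its distances.
            out.append(list(last))
        else:
            # Skip frame: linear extrapolation from the last two evaluations.
            k = t % step  # frames elapsed since the last evaluation
            out.append([b + (b - a) * k / step for a, b in zip(prev, last)])
    return out
```

With `skips=1` the extrapolation reduces to the latest value plus half the change between the two most recent evaluation frames, and the network runs on only half of the frames.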

21 Claims

1. A computer-implemented method for providing automatic speech recognition comprising:
- receiving, by a microphone, speech for evaluation;
- converting the received speech to a speech recording;
- extracting features from the speech recording;
- evaluating, for a first time instance, a neural network based on the extracted features to determine a first distance value associated with the first time instance, wherein the first distance value corresponds to an output node from the neural network;
- evaluating, for a second time instance, the neural network based on the extracted features to determine a second distance value associated with the second time instance, wherein the second distance value corresponds to the output node from the neural network;
- approximating, for a third time instance, a third distance value based on at least one of an extrapolation or an interpolation of the first and second distance values, wherein the neural network is not evaluated for the third time instance;
- converting the speech recording to a recognized word sequence based on a plurality of distance values comprising the first, the second, and the third distance values; and
- storing the recognized word sequence in a system memory.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10):
  - 2. The method of claim 1, further comprising:
    - generating one or more output indices for the first time instance, wherein the first distance value is associated with a first output index of the output indices, wherein the neural network comprises an output layer having a plurality of output layer nodes, and wherein evaluating the neural network for the first time instance comprises evaluating only a subset of the plurality of output layer nodes associated with the output indices.
  - 3. The method of claim 2, wherein the neural network further comprises a final hidden layer having final hidden layer nodes, and wherein evaluating the neural network for the first time instance comprises evaluating all of the final hidden layer nodes.
  - 4. The method of claim 1, wherein approximating the third distance value comprises extrapolating the third distance value based on at least one of a linear function, a non-linear function, or a variance function.
  - 5. The method of claim 1, wherein approximating the third distance value comprises extrapolating the third distance value, wherein the first and second time instances are prior to the third time instance.
  - 6. The method of claim 5, wherein extrapolating the third distance value comprises extrapolating the third distance value via a linear function based on the first distance value and the second distance value, and wherein the linear function comprises the second distance value added to half of a difference between the first distance value and the second distance value.
  - 7. The method of claim 1, wherein the neural network comprises an output layer having a plurality of output layer nodes, and wherein evaluating the neural network for the first time instance comprises evaluating all of the plurality of output layer nodes.
  - 8. The method of claim 1, wherein the second time instance is associated with a neural network evaluation frame, the third time instance is associated with a skip frame, and wherein one, two, or three additional skip frames are between the evaluation frame and the skip frame.
  - 9. The method of claim 1, wherein the second time instance is associated with a neural network evaluation frame, the third time instance is associated with a skip frame, and the method further comprises:
    - determining a frame skipping rate based on at least one of available computing resources or a current real time factor; and
    - providing an additional skip frame between the evaluation frame and the skip frame based on the frame skipping rate.
  - 10. The method of claim 1, wherein converting the speech recording to the recognized word sequence comprises decoding the plurality of distance values via a Viterbi beam searching decoder.
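
The linear function recited in claim 6 can be written out as a short worked formula. The symbols $d_1$, $d_2$, $d_3$ (first, second, and third distance values) are introduced here for illustration, and the sign of the difference is an assumption: the claim says only "a difference between the first distance value and the second distance value", and $d_2 - d_1$ is taken here so that the result continues the trend from the first to the second time instance.

```latex
d_3 \approx d_2 + \tfrac{1}{2}\,(d_2 - d_1)
% Worked example with hypothetical values d_1 = 4.0 and d_2 = 5.0:
% d_3 \approx 5.0 + \tfrac{1}{2}\,(5.0 - 4.0) = 5.5
```

Under this reading, the factor of one half is consistent with two evaluation frames spaced two frames apart and a skip frame one frame after the second, though the claim itself does not state that spacing.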

11. A system for providing automatic speech recognition comprising:
- a microphone to receive speech and convert the received speech to a digital signal;
- a system memory configured to store a speech recording corresponding to the digital signal; and
- a central processing unit coupled to the memory, the central processing unit to extract features from the speech recording, to implement, for a first time instance, a neural network based on the extracted features to determine a first distance value associated with the first time instance, wherein the first distance value corresponds to an output node from the neural network, to implement, for a second time instance, the neural network based on the extracted features to determine a second distance value associated with the second time instance, wherein the second distance value corresponds to the output node from the neural network, to approximate, for a third time instance, a third distance value based on at least one of an extrapolation or an interpolation of the first and second distance values, and to convert decode the speech recording to a recognized word sequence based on a plurality of distance values comprising the first, the second, and the third distance values, and to determine a recognized word sequence corresponding to the speech recording to store the recognized word sequence in the system memory.
- Dependent claims (12, 13, 14, 15, 16):
  - 12. The system of claim 11, wherein the central processing unit is to generate one or more output indices for the first time instance, wherein the first distance value is associated with a first output index of the output indices, wherein the neural network comprises an output layer having a plurality of output layer nodes, and wherein the neural network circuitry is configured to evaluate only a subset of the plurality of output layer nodes associated with the output indices for the first time instance.
  - 13. The system of claim 12, wherein the neural network further comprises a final hidden layer having final hidden layer nodes, and wherein central processing unit is to evaluate all of the final hidden layer nodes for the first time instance.
  - 14. The system of claim 11, wherein the central processing unit to approximate the third distance value comprises the central processing unit to extrapolate the third distance value via a linear function based on the first and second distance values.
  - 15. The system of claim 11, wherein the second time instance is associated with a neural network evaluation frame, the third time instance is associated with a skip frame, and wherein one, two, or three additional skip frames are between the evaluation frame and the skip frame.
  - 16. The system of claim 11, wherein the central processing unit is to determine a frame skipping rate based on at least one of available computing resources of the system or a current real time factor.
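
The "outputs on demand" idea in claims 12 and 13 (all final hidden layer nodes evaluated, but only the output layer nodes named by the output indices) might look like the sketch below. The function names, the single hidden layer, the `tanh` activation, and returning the raw affine output rather than a finished distance value are all assumptions for illustration; the claims do not specify them.

```python
import math
from typing import Dict, Iterable, List, Sequence


def affine(weights_row: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Weighted sum of the inputs plus a bias for a single node."""
    return sum(w * v for w, v in zip(weights_row, x)) + bias


def outputs_on_demand(
    x: Sequence[float],
    hidden_w: Sequence[Sequence[float]],
    hidden_b: Sequence[float],
    out_w: Sequence[Sequence[float]],
    out_b: Sequence[float],
    wanted: Iterable[int],
) -> Dict[int, float]:
    """Evaluate every hidden node, but only the requested output nodes."""
    # All final hidden layer nodes are evaluated (activation is an assumption).
    hidden: List[float] = [
        math.tanh(affine(row, b, x)) for row, b in zip(hidden_w, hidden_b)
    ]
    # Only the output nodes whose indices were requested are computed.
    return {i: affine(out_w[i], out_b[i], hidden) for i in wanted}
```

The saving comes from the output layer: with many output nodes and only a few requested indices, most rows of the output weight matrix are never touched.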

17. At least one non-transitory machine readable medium comprising a plurality of instructions that, in response to being executed on a computing device, cause the computing device to perform a method comprising:
- receiving, by a microphone, speech for evaluation;
- converting the received speech to a speech recording;
- extracting features from the speech recording;
- evaluating, for a first time instance, a neural network based on the extracted features to determine a first distance value associated with the first time instance, wherein the first distance value corresponds to an output node from the neural network;
- evaluating, for a second time instance, the neural network based on the extracted features to determine a second distance value associated with the second time instance, wherein the second distance value corresponds to the output node from the neural network;
- approximating, for a third time instance, a third distance value based on at least one of an extrapolation or an interpolation of the first and second distance values, wherein the neural network is not evaluated for the third time instance;
- converting the speech recording to a recognized word sequence based on a plurality of distance values comprising the first, the second, and the third distance values; and
- storing the recognized word sequence in a system memory.
- Dependent claims (18, 19, 20, 21):
  - 18. The machine readable medium of claim 17 further comprising instructions that, in response to being executed on the computing device, cause the computing device to perform speech recognition by:
    - generating one or more output indices for the first time instance, wherein the first distance value is associated with a first output index of the output indices, wherein the neural network comprises an output layer having a plurality of output layer nodes, and wherein evaluating the neural network for the first time instance comprises evaluating only a subset of the plurality of output layer nodes associated with the output indices.
  - 19. The machine readable medium of claim 17, wherein approximating the third distance value comprises extrapolating the third distance value via a linear function based on the first and second distance values.
  - 20. The machine readable medium of claim 17, wherein the second time instance is associated with a neural network evaluation frame, the third time instance is associated with a skip frame, and wherein one, two, or three additional skip frames are between the evaluation frame and the skip frame.
  - 21. The machine readable medium of claim 17, wherein the second time instance is associated with a neural network evaluation frame and the third time instance is associated with a skip frame, the machine readable medium further comprising instructions that, in response to being executed on the computing device, cause the computing device to perform speech recognition by:
    - determining a frame skipping rate based on at least one of available computing resources or a current real time factor; and
    - providing an additional skip frame between the evaluation frame and the skip frame based on the frame skipping rate.
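
Claims 9, 16, and 21 tie the frame skipping rate to available computing resources or the current real time factor without giving a rule. The sketch below shows one hypothetical policy. Both function names and the policy itself are assumptions: it supposes that network evaluation dominates the cost, so that using `s` skip frames per evaluation frame cuts the cost to roughly `1 / (s + 1)`.

```python
import math


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; above 1.0 means falling behind."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def choose_skip_count(rtf: float, max_skips: int = 3) -> int:
    """Hypothetical policy: pick the fewest skip frames that bring the
    estimated real time factor back to 1.0 or below, capped at max_skips."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    # Need (skips + 1) >= rtf under the stated cost assumption.
    return min(max_skips, max(0, math.ceil(rtf) - 1))
```

For example, a measured real time factor of 2.5 would select two skip frames per evaluation frame under this policy, while anything at or below 1.0 would select none.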

Current Assignee: Intel Corporation
Original Assignee: Intel Corporation
Inventors: Stemmer, Georg; Bauer, Josef; Rozen, Piotr
Primary Examiner(s): Chawan, Vijay B
Application Number: US14/493,434
Publication Number: US 20160086600A1
Time in Patent Office: 812 Days
Field of Search: 704/243, 704/251, 704/235, 704/240, 704/242, 704/245, 704/255, 704/257, 704/10, 704/231, 704/232, 704/233, 704/236, 704/238, 704/239, 704/246, 704/247, 704/249, 704/250, 704/254, 704/256, 704/256.8, 704/260, 704/265, 704/266
US Class Current: 1/1
CPC Class Codes:
G10L 15/02   Feature extraction for spee...
G10L 15/08   Speech classification or se...
G10L 15/12   using dynamic programming t...
G10L 15/16   using artificial neural net...
