Speech classification of audio for wake on voice

US 10,714,122 B2
Filed: 06/06/2018
Issued: 07/14/2020
Est. Priority Date: 06/06/2018
Status: Active Grant

Abstract

Speech or non-speech detection techniques are discussed and include updating a speech pattern model using probability scores from an acoustic model to generate a score for each state of the speech pattern model, such that the speech pattern model includes a first non-speech state having multiple self loops each associated with a non-speech probability score of the probability scores, a plurality of speech states following the first non-speech state, and a second non-speech state following the speech states, and detecting speech based on a comparison of a score of the first non-speech state and a score of the last speech state of the multiple speech states.

25 Claims

1. A speech detection system comprising:
- a memory to store received audio input; and
  
  a processor coupled to the memory, the processor to:
  
  generate, via acoustic scoring of an acoustic model based on the received audio input, a plurality of probability scores each for a corresponding audio unit;
  
  update a speech pattern model based on at least some of the probability scores to generate a score for each state of the speech pattern model, wherein the speech pattern model comprises a first non-speech state comprising a plurality of self loops each associated with a non-speech probability score of the probability scores, a plurality of speech states following the first non-speech state, and a second non-speech state following the speech states, wherein the speech states comprise a first speech state following and connected to the first non-speech state by a plurality of first transitions each corresponding to a speech probability score of the probability scores and a second speech state following the first speech state and preceding the second non-speech state;
  
  determine whether the received audio input comprises speech based on a comparison of a first score of the first non-speech state and a second score of the second speech state; and
  
  provide a speech detection indicator when the received audio input comprises speech.
- - 2. The speech detection system of claim 1, wherein the comparison of the first score and the second score comprises a comparison of a difference between the second score and the first score to a threshold, and wherein the second non-speech state is connected to the second speech state by a plurality of second transitions each corresponding to a non-speech probability score of the probability scores.
  - 3. The speech detection system of claim 1, the processor further to:
    - detect an end of speech for a speech signal based on a third score of the second non-speech state exceeding the second score.
  - 4. The speech detection system of claim 3, wherein the processor to detect the end of speech comprises the processor to determine a score of the second non-speech state exceeds a score of the second speech state for a majority of a plurality of consecutive speech pattern model updates.
  - 5. The speech detection system of claim 3, the processor further to:
    - detect, based on a prior updating of the speech pattern model, a beginning of speech for the speech signal based on a fourth score of the first speech state exceeding a fifth score of the first non-speech state; and
      
      provide temporal indicators of the speech signal based on the beginning of speech and the end of speech.
  - 6. The speech detection system of claim 1, the processor further to:
    - train a second acoustic model, wherein the second acoustic model comprises a plurality of output nodes each corresponding to one of noise, silence, or sub-phonetic units each associated with one of a plurality of monophones;
      
      determine a usage rate for each of the sub-phonetic units during the training;
      
      determine a selected output node corresponding to a highest usage rate sub-phonetic unit for each of the plurality of monophones; and
      
      include, in the acoustic model, the selected output nodes corresponding to the highest usage rate sub-phonetic units and discard remaining output nodes corresponding to the sub-phonetic units.
  - 7. The speech detection system of claim 1, wherein the second non-speech state is a silence state connected to the second speech state by a plurality of transitions each corresponding to a silence score of the plurality of scores.
  - 8. The speech detection system of claim 1, wherein the speech pattern model comprises one or more third non-speech states immediately following the second speech state and immediately preceding the second non-speech state, wherein one of the third non-speech states is connected to the second non-speech state by a plurality of transitions each corresponding to the non-speech probability scores of the plurality of self loops.
  - 9. The speech detection system of claim 1, wherein speech states subsequent to the first speech state are connected to previous speech states by corresponding pluralities of second transitions corresponding to the speech probability scores, and wherein the second non-speech state is connected to the second speech state by a plurality of third transitions each corresponding to the non-speech probability scores of the plurality of self loops.
  - 10. The speech detection system of claim 9, wherein the processor to update the speech pattern model comprises the processor to:
    - provide a continual summing at the first non-speech state based on a previous score of the first non-speech state and a maximum probability score of the non-speech probability scores of the plurality of self loops; and
      
      provide a value at each of the speech states exclusive of the second speech state based on a sum of a previous score at an immediately preceding state and a maximum probability score of the speech probability scores.
  - 11. The speech detection system of claim 10, wherein the processor to update the speech pattern model further comprises the processor to:
    - provide a value of the second speech state based on a sum of a maximum of a previous score of an immediately preceding speech state and a previous score of the second speech state with a maximum probability score of the speech probability scores.
  - 12. The speech detection system of claim 1, wherein the acoustic model comprises a deep neural network and generating the plurality of probability scores comprises scoring a feature vector comprising a stack of a time series of coefficients each associated with a sampling time.
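
The output-node selection recited in claim 6 can be illustrated with a minimal Python sketch. The function name `select_output_nodes`, the dictionary layout, and the usage counts are hypothetical and do not come from the patent. Keeping the noise and silence nodes is an assumption, made because the claim discards only the remaining sub-phonetic nodes.

```python
def select_output_nodes(nodes, usage_counts):
    """Keep non-speech nodes plus the most-used sub-phonetic node per monophone.

    nodes: dict of node_id -> (kind, monophone), where kind is "noise",
           "silence", or "subphone" (hypothetical layout).
    usage_counts: dict of node_id -> usage count observed during training.
    """
    kept = []
    best = {}  # monophone -> (count, node_id) of the most-used node so far
    for node_id, (kind, monophone) in nodes.items():
        if kind != "subphone":
            kept.append(node_id)  # assumption: noise/silence nodes are retained
            continue
        count = usage_counts.get(node_id, 0)
        # On a tie, the node encountered first is kept.
        if monophone not in best or count > best[monophone][0]:
            best[monophone] = (count, node_id)
    kept.extend(node_id for _, node_id in best.values())
    return sorted(kept)


# Illustrative data only: two monophones with two sub-phonetic nodes each.
nodes = {
    0: ("silence", None),
    1: ("noise", None),
    2: ("subphone", "a"),
    3: ("subphone", "a"),
    4: ("subphone", "b"),
    5: ("subphone", "b"),
}
usage_counts = {2: 10, 3: 40, 4: 7, 5: 3}
print(select_output_nodes(nodes, usage_counts))  # [0, 1, 3, 4]
```

In this sketch the pruned acoustic model would expose one output node per monophone plus the non-speech nodes, which is one plausible reading of how the smaller acoustic model of claim 1 is derived from the trained second acoustic model.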

13. A computer-implemented method for speech detection comprising:
- generating, via acoustic scoring of an acoustic model based on received audio input, a plurality of probability scores each for a corresponding audio unit;
  
  updating a speech pattern model based on at least some of the probability scores to generate a score for each state of the speech pattern model, wherein the speech pattern model comprises a first non-speech state comprising a plurality of self loops each associated with a non-speech probability score of the probability scores, a plurality of speech states following the first non-speech state, and a second non-speech state following the speech states, wherein the speech states comprise a first speech state following and connected to the first non-speech state by a plurality of first transitions each corresponding to a speech probability score of the probability scores and a second speech state following the first speech state and preceding the second non-speech state;
  
  determining whether the received audio input comprises speech based on a comparison of a first score of the first non-speech state and a second score of the second speech state; and
  
  providing a speech detection indicator when the received audio input comprises speech.
- - 14. The method of claim 13, wherein the comparison of the first score and the second score comprises a comparison of a difference between the second score and the first score to a threshold, and wherein the second non-speech state is connected to the second speech state by a plurality of second transitions each corresponding to a non-speech probability score of the probability scores.
  - 15. The method of claim 13, further comprising:
    - detecting an end of speech for a speech signal based on a third score of the second non-speech state exceeding the second score, wherein detecting the end of speech comprises determining a score of the second non-speech state exceeds a score of the second speech state for a majority of a plurality of consecutive speech pattern model updates.
  - 16. The method of claim 15, further comprising:
    - detecting, based on a prior updating of the speech pattern model, a beginning of speech for the speech signal based on a fourth score of the first speech state exceeding a fifth score of the first non-speech state; and
      
      providing temporal indicators of the speech signal based on the beginning of speech and the end of speech.
  - 17. The method of claim 13, wherein speech states subsequent to the first speech state are connected to previous speech states by corresponding pluralities of second transitions corresponding to the speech probability scores, wherein the second non-speech state is connected to the second speech state by a plurality of third transitions each corresponding to the non-speech probability scores of the plurality of self loops.
  - 18. The method of claim 17, wherein updating the speech pattern model comprises:
    - providing a continual summing at the first non-speech state based on a previous score of the first non-speech state and a maximum probability score of the non-speech probability scores of the plurality of self loops; and
      
      providing a value at each of the speech states exclusive of the second speech state based on a sum of a previous score at an immediately preceding state and a maximum probability score of the speech probability scores.
  - 19. The method of claim 18, wherein updating the speech pattern model further comprises:
    - providing a value of the second speech state based on a sum of a maximum of a previous score of an immediately preceding speech state and a previous score of the second speech state with a maximum probability score of the speech probability scores.
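
The beginning-of-speech and end-of-speech tests of claims 15 and 16 can be sketched as follows. This is a hedged illustration, not the patented implementation: the function name `find_speech_bounds`, the tuple layout, the window length, and the example scores are all hypothetical. The sketch assumes the "majority of a plurality of consecutive speech pattern model updates" is evaluated over a fixed-length sliding window.

```python
def find_speech_bounds(frames, window=3):
    """Return (begin, end) update indices, or None for a bound not detected.

    frames: one tuple per model update, laid out as
            (first_nonspeech, first_speech, last_speech, final_nonspeech).
    window: number of consecutive updates used for the majority test
            (hypothetical parameter).
    """
    begin = end = None
    for t, (ns_first, sp_first, _sp_last, _ns_final) in enumerate(frames):
        if begin is None:
            # Beginning of speech: first speech state outscores first non-speech state.
            if sp_first > ns_first:
                begin = t
            continue
        recent = frames[begin:t + 1][-window:]
        if len(recent) < window:
            continue
        # End of speech: final non-speech state outscores the last speech
        # state in a majority of the most recent `window` updates.
        wins = sum(1 for f in recent if f[3] > f[2])
        if 2 * wins > window:
            end = t
            break
    return begin, end


# Illustrative scores only.
frames = [
    (0.0, -1.0, -2.0, -3.0),
    (0.0, 1.0, 0.5, -1.0),
    (0.0, 1.0, 2.0, 0.0),
    (0.0, 1.0, 2.0, 2.5),
    (0.0, 1.0, 2.0, 1.5),
    (0.0, 1.0, 2.0, 3.0),
]
print(find_speech_bounds(frames))  # (1, 5)
```

The returned pair plays the role of the temporal indicators of claim 16. Requiring a majority over several updates, rather than a single crossing, keeps one noisy update from ending the speech segment early.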

20. At least one non-transitory machine readable medium comprising a plurality of instructions that, in response to being executed on a device, cause the device to perform speech detection by:
- generating, via acoustic scoring of an acoustic model based on received audio input, a plurality of probability scores each for a corresponding audio unit;
  
  updating a speech pattern model based on at least some of the probability scores to generate a score for each state of the speech pattern model, wherein the speech pattern model comprises a first non-speech state comprising a plurality of self loops each associated with a non-speech probability score of the probability scores, a plurality of speech states following the first non-speech state, and a second non-speech state following the speech states, wherein the speech states comprise a first speech state following and connected to the first non-speech state by a plurality of first transitions each corresponding to a speech probability score of the probability scores and a second speech state following the first speech state and preceding the second non-speech state;
  
  determining whether the received audio input comprises speech based on a comparison of a first score of the first non-speech state and a second score of the second speech state; and
  
  providing a speech detection indicator when the received audio input comprises speech.
- - 21. The non-transitory machine readable medium of claim 20, wherein the comparison of the first score and the second score comprises a comparison of a difference between the second score and the first score to a threshold, and wherein the second non-speech state is connected to the second speech state by a plurality of second transitions each corresponding to a non-speech probability score of the probability scores.
  - 22. The non-transitory machine readable medium of claim 20, the machine readable medium further comprising instructions that, in response to being executed on the device, cause the device to perform speech detection by:
    - detecting an end of speech for a speech signal based on a third score of the second non-speech state exceeding the second score, wherein detecting the end of speech comprises determining a score of the second non-speech state exceeds a score of the second speech state for a majority of a plurality of consecutive speech pattern model updates.
  - 23. The non-transitory machine readable medium of claim 22, the machine readable medium further comprising instructions that, in response to being executed on the device, cause the device to perform speech detection by:
    - detecting, based on a prior updating of the speech pattern model, a beginning of speech for the speech signal based on a fourth score of the first speech state exceeding a fifth score of the first non-speech state; and
      
      providing temporal indicators of the speech signal based on the beginning of speech and the end of speech.
  - 24. The non-transitory machine readable medium of claim 20, wherein speech states subsequent to the first speech state are connected to previous speech states by corresponding pluralities of second transitions corresponding to the speech probability scores, wherein the second non-speech state is connected to the second speech state by a plurality of third transitions each corresponding to the non-speech probability scores of the plurality of self loops.
  - 25. The non-transitory machine readable medium of claim 24, wherein updating the speech pattern model comprises:
    - providing a continual summing at the first non-speech state based on a previous score of the first non-speech state and a maximum probability score of the non-speech probability scores of the plurality of self loops; and
      
      providing a value at each of the speech states exclusive of the second speech state based on a sum of a previous score at an immediately preceding state and a maximum probability score of the speech probability scores.
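
The state updates recited in claims 18, 19, and 25, together with the threshold comparison of claims 14 and 21, can be sketched in Python as below. All names (`init_scores`, `update`, `is_speech`), the initial values, and the example scores are hypothetical. The sketch assumes the probability scores are log-domain values so that summing is meaningful, and that speech is indicated when the difference exceeds the threshold. The second non-speech state is left out because the claims shown here do not spell out its update.

```python
NEG_INF = float("-inf")


def init_scores(num_speech_states):
    """Index 0 is the first non-speech state; the rest are speech states."""
    if num_speech_states < 2:
        raise ValueError("the model needs at least two speech states")
    # Assumption: speech states start unreachable until a path arrives.
    return [0.0] + [NEG_INF] * num_speech_states


def update(scores, nonspeech_scores, speech_scores):
    """One model update from the acoustic scores of a single time step."""
    best_ns = max(nonspeech_scores)  # best of the self-loop scores
    best_sp = max(speech_scores)     # best of the speech transition scores
    last = len(scores) - 1
    new = [0.0] * len(scores)
    # Continual summing at the first non-speech state.
    new[0] = scores[0] + best_ns
    # Speech states other than the last: previous score of the preceding state.
    for i in range(1, last):
        new[i] = scores[i - 1] + best_sp
    # Last speech state: best of the preceding state and its own previous score.
    new[last] = max(scores[last - 1], scores[last]) + best_sp
    return new


def is_speech(scores, threshold):
    """Compare (last speech score - first non-speech score) to a threshold."""
    return scores[-1] - scores[0] > threshold


# Illustrative log-domain scores for two speech-like time steps.
scores = init_scores(2)
scores = update(scores, [-3.0, -2.5], [-0.5, -1.0])
print(scores, is_speech(scores, 3.0))  # [-2.5, -0.5, -inf] False
scores = update(scores, [-3.0, -2.5], [-0.5, -1.0])
print(scores, is_speech(scores, 3.0))  # [-5.0, -3.0, -1.0] True
```

Because every update reads only the previous scores, the new values are written to a fresh list rather than in place. With N speech states, the last speech state first becomes reachable after N updates, so in this sketch the number of speech states acts as a minimum speech duration.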

Current Assignee: Intel Corporation
Original Assignee: Intel Corporation
Inventors: Muchlinski, Maciej; Bocklet, Tobias
Primary Examiner(s): Vo, Huyen X
Application Number: US16/001,496
Publication Number: US 20190043529A1
Time in Patent Office: 769 Days
CPC Class Codes

G10L 15/02   Feature extraction for spee...

G10L 15/063   Training

G10L 15/14   using statistical models, e...

G10L 15/142   Hidden Markov Models [HMMs]

G10L 15/16   using artificial neural net...

G10L 15/22   Procedures used during a sp...

G10L 2015/088   Word spotting

G10L 25/84   for discriminating voice fr...

G10L 25/87   Detection of discrete point...
