Detection and use of acoustic signal quality indicators

US 8,521,537 B2
Filed: 04/03/2007
Issued: 08/27/2013
Est. Priority Date: 04/03/2006
Status: Active Grant

Abstract

A computer-driven device assists a user in self-regulating speech control of the device. The device processes an input signal representing human speech to compute acoustic signal quality indicators indicating conditions likely to be problematic to speech recognition, and advises the user of those conditions.

36 Claims

1. A computer-driven method to regulate a speaker's issuance of speech based commands, the method comprising operations of:
- receiving an input signal representing audio content including speech of the speaker, the audio content occurring between a prescribed begin event and a prescribed end event;
  
  processing at least a portion of the input signal to determine whether the processed portion of the input signal contains speech exhibiting one or more predetermined conditions capable of posing difficulty in successfully performing speech recognition upon the input signal;
  
  where the processed portion of the input signal comprises one of: all of the input signal, a part of the input signal, speech present in the input signal;
  
  if the processing operation determines that the processed portion of the input signal contains speech exhibiting one or more of the predetermined conditions, then performing operations comprising:
  
  issuing at least one human comprehensible speech quality alert corresponding to each exhibited predetermined condition;
  
  the processing operation comprising:
  
  determining whether the speaker was speaking too loudly relative to prescribed criteria;
  
  the operation of determining whether the speaker was speaking too loudly relative to prescribed criteria comprising:
  
  computing a non-silent-frame-count by computing a total number of frames of the input signal with acoustic energy above a first threshold;
  
  computing a loud-frame-count by computing a total number of frames of the input signal with acoustic energy above a second threshold higher than the first threshold;
  
  computing a ratio between the loud-frame-count and the non-silent-frame-count;
  
  determining whether the ratio exceeds a third threshold, and if so, concluding that the speaker was speaking too loudly relative to the prescribed criteria.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12):
  - 2. The method of claim 1, the operations further comprising receiving activation and deactivation of a push-to-talk mechanism;
    - the activation and the deactivation defining the begin event and end event, respectively;
      
      the processing operation comprising:
      
      determining whether the speaker began speaking before activation of the push-to-talk mechanism.
  - 3. The method of claim 2, the operation of determining whether the speaker began speaking before activation of the push-to-talk mechanism comprising:
    - applying a predetermined speech detection process to the input signal;
      
      determining whether the input signal contains detected speech occurring within a predetermined time period after the begin event, and if so, concluding that the speaker began speaking before activation of the push-to-talk mechanism.
  - 4. The method of claim 1, the operations further comprising receiving activation and deactivation of a push-to-talk mechanism;
    - the activation and the deactivation defining the begin event and end event, respectively;
      
      the processing operation comprising:
      
      determining whether the speaker was still speaking upon deactivation of the push-to-talk mechanism.
  - 5. The method of claim 4, the operation of determining whether the speaker was still speaking upon deactivation of the push-to-talk mechanism comprising:
    - applying a predetermined speech detection process to the input signal;
      
      determining whether the input signal contains detected speech occurring within a predetermined time period before the end event, and if so, concluding that the speaker was still speaking upon deactivation of the push-to-talk mechanism.
  - 6. The method of claim 3 or 5, where the operation of applying the predetermined speech detection process further comprises:
    - excluding breath noise from consideration as speech.
  - 7. The method of claim 3 or 5, where the operation of applying the predetermined speech detection process further comprises:
    - excluding energy of a prescribed frequency range from consideration as speech.
  - 8. The method of claim 1, the operation of determining whether the speaker was speaking too loudly relative to prescribed criteria comprising:
    - computing one of the following quantities: acoustic energy of the input signal, acoustic energy of speech of the input signal;
      
      determining whether the computed quantity exceeds a prescribed maximum, and if so, concluding that the speaker was speaking too loudly relative to prescribed criteria.
  - 9. The method of claim 1, the operation of processing at least a portion of the input signal comprising one or more of:
    - computing one or more acoustic signal quality indicators each indicating an affirmative output when the processed portion of the input signal contains speech exhibiting a corresponding one of the predetermined conditions;
      
      computing one or more acoustic signal quality indicators each functionally related to a degree of presence of a corresponding one of the predetermined conditions.
  - 10. The method of claim 1, the operations further comprising:
    - receiving or calculating an utterance confidence score representing a likelihood that results of speech recognition performed upon the input signal are correct;
      
      comparing the utterance confidence score with a prescribed threshold;
      
      if the utterance confidence score exceeds the prescribed threshold, omitting performance of one or both of the following:
      
      the operation of processing at least a portion of the input signal as to one or more of the predetermined conditions;
      
      the operation of issuing the at least one human comprehensible speech quality alert.
  - 11. The method of claim 10, where:
    - the comparing operation uses different prescribed thresholds corresponding to different predetermined conditions;
      
      the omitting operation is only performed for each of the predetermined conditions whose corresponding prescribed threshold was not exceeded by the utterance confidence score.
  - 12. The method of claim 1, where:
    - the operations further comprise receiving or calculating an utterance confidence score representing a likelihood that results of speech recognition performed upon the input signal are correct;
      
      the operation of processing at least the portion of the input signal comprises:
      
      computing one or more acoustic signal quality indicators each functionally related to a degree of presence of a corresponding one of the predetermined conditions;
      
      applying a predetermined function to each combination of: computed acoustic signal quality indicator and utterance confidence score;
      
      the operations further comprise:
      
      if the input signal contains speech exhibiting a given one of the predetermined conditions, suspending issuance of the human comprehensible speech quality alert corresponding to the given predetermined condition if the application of the predetermined function to the utterance confidence score and the acoustic signal quality indicator corresponding to the given predetermined condition yields a predetermined result.
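
The too-loud test recited in claim 1 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the use of mean squared amplitude as "acoustic energy", the non-overlapping framing, and all threshold values are assumptions chosen for the example.

```python
from typing import List, Sequence


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each full, non-overlapping frame.

    Assumption: the claim does not define "acoustic energy"; mean squared
    amplitude is used here only as a stand-in.
    """
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def spoke_too_loudly(
    energies: Sequence[float],
    silence_thr: float,   # "first threshold" (hypothetical value)
    loud_thr: float,      # "second threshold", must be higher than the first
    ratio_thr: float,     # "third threshold" on the loud/non-silent ratio
) -> bool:
    if loud_thr <= silence_thr:
        raise ValueError("loud_thr must be higher than silence_thr")
    # non-silent-frame-count: frames with energy above the first threshold
    non_silent = sum(1 for e in energies if e > silence_thr)
    # loud-frame-count: frames with energy above the second threshold
    loud = sum(1 for e in energies if e > loud_thr)
    if non_silent == 0:
        return False  # assumption: no non-silent frames means no alert
    return loud / non_silent > ratio_thr


energies = [0.0, 0.2, 0.9, 0.95, 0.1, 0.0]
# 4 non-silent frames, 2 loud frames, ratio 0.5
print(spoke_too_loudly(energies, silence_thr=0.05, loud_thr=0.8, ratio_thr=0.4))  # True
```

Normalizing by the non-silent frame count rather than the total frame count keeps leading and trailing silence from diluting the ratio, which appears to be the reason the claim counts non-silent frames separately.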

13. A computer-driven method to regulate a speaker's issuance of speech based commands, the method comprising operations of:
- receiving an input signal representing audio content including speech of the speaker, the audio content occurring between a prescribed begin event and a prescribed end event;
  
  processing at least a portion of the input signal to determine whether the processed portion of the input signal contains speech exhibiting one or more predetermined conditions capable of posing difficulty in successfully performing speech recognition upon the input signal;
  
  where the processed portion of the input signal comprises one of: all of the input signal, a part of the input signal, speech present in the input signal;
  
  if the processing operation determines that the processed portion of the input signal contains speech exhibiting one or more of the predetermined conditions, then performing operations comprising:
  
  issuing at least one human comprehensible speech quality alert corresponding to each exhibited predetermined condition;
  
  the processing operation comprising:
  
  determining whether the speaker was speaking too softly relative to prescribed criteria;
  
  the operation of determining whether the speaker was speaking too softly relative to prescribed criteria comprising:
  
  computing a non-silent-frame-count by computing a total number of frames of the input signal that have acoustic energy above a first threshold;
  
  computing a soft-frame-count by computing a total number of frames of the input signal that have acoustic energy less than a second threshold;
  
  computing a ratio between the soft-frame-count and the non-silent-frame-count;
  
  determining whether the ratio is greater than a third threshold, and if so, concluding that the speaker was speaking too softly relative to the prescribed criteria.
- Dependent claims (14, 15, 16):
  - 14. The method of claim 13, the operation of determining whether the speaker was speaking too softly relative to prescribed criteria comprising:
    - computing one of the following quantities: acoustic energy of the input signal, acoustic energy of speech of the input signal;
      
      determining whether the computed quantity is less than a predetermined minimum, and if so, concluding that the speaker was speaking too softly relative to prescribed criteria.
  - 15. The method of claim 8 or 14, where the computing operation comprises one of:
    - analyzing raw audio stream;
      
      analyzing mel frequency cepstral coefficients.
  - 16. The method of claim 8 or 14, the operations further comprising utilizing the input signal to compute a C0 element of a mel-frequency cepstral coefficient vector;
    - the computing operation utilizing the computed C0 element.
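
The too-soft test of claim 13 mirrors the too-loud test, and a rough sketch follows. The per-frame energies could come from the raw audio stream or from the C0 element of a mel-frequency cepstral coefficient vector, as claims 15 and 16 mention; here they are simply passed in. The function name, the threshold values, and the `only_non_silent` option are assumptions for illustration. The claim text counts soft frames as frames with energy below the second threshold without saying whether silent frames are excluded, so the sketch exposes both readings.

```python
from typing import Sequence


def spoke_too_softly(
    energies: Sequence[float],
    silence_thr: float,            # "first threshold" (hypothetical value)
    soft_thr: float,               # "second threshold" (hypothetical value)
    ratio_thr: float,              # "third threshold" on the soft/non-silent ratio
    only_non_silent: bool = False, # assumption: optional stricter reading
) -> bool:
    # non-silent-frame-count: frames with energy above the first threshold
    non_silent = sum(1 for e in energies if e > silence_thr)
    if only_non_silent:
        # stricter reading: soft frames must also be non-silent
        soft = sum(1 for e in energies if silence_thr < e < soft_thr)
    else:
        # literal reading: every frame with energy below the second threshold
        soft = sum(1 for e in energies if e < soft_thr)
    if non_silent == 0:
        return False  # assumption: no non-silent frames means no alert
    return soft / non_silent > ratio_thr


energies = [0.0, 0.5, 0.6, 0.06]
# literal reading: 2 soft frames over 3 non-silent frames
print(spoke_too_softly(energies, 0.05, 0.1, 0.5))
# stricter reading: 1 soft frame over 3 non-silent frames
print(spoke_too_softly(energies, 0.05, 0.1, 0.5, only_non_silent=True))
```

Under the literal reading the ratio can exceed 1 when an utterance contains a lot of silence, so the choice between the two readings affects how the third threshold should be set.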

17. A non-transitory computer medium storing one of (1) a program to perform operations for regulating a speaker's issuance of speech-based commands, or (2) a program for installing a target program of machine-readable instructions on a computer, where the target program is executable to perform the operations for regulating a speaker's issuance of speech-based commands, the operations comprising:
- receiving an input signal representing audio content including speech of the speaker, the audio content occurring between a prescribed begin event and a prescribed end event;
  
  processing at least a portion of the input signal to determine whether the processed portion of the input signal contains speech exhibiting one or more predetermined conditions capable of posing difficulty in successfully performing speech recognition upon the input signal;
  
  where the processed portion of the input signal comprises one of: all of the input signal, a part of the input signal, speech present in the input signal;
  
  if the processing operation determines that the processed portion of the input signal contains speech exhibiting one or more of the predetermined conditions, then performing operations comprising:
  
  issuing at least one human comprehensible speech quality alert corresponding to each exhibited predetermined condition;
  
  where the input signal originated from a microphone;
  
  the processing operation comprising:
  
  determining whether the microphone was held too near to the speaker; and
  
  determining whether the speaker was speaking too loudly relative to prescribed criteria;
  
  the operation of determining whether the speaker was speaking too loudly relative to prescribed criteria comprising:
  
  computing a non-silent-frame-count by computing a total number of frames of the input signal with acoustic energy above a first threshold;
  
  computing a loud-frame-count by computing a total number of frames of the input signal with acoustic energy above a second threshold higher than the first threshold;
  
  computing a ratio between the loud-frame-count and the non-silent-frame-count;
  
  determining whether the ratio exceeds a third threshold, and if so, concluding that the speaker was speaking too loudly relative to the prescribed criteria.

18. A non-transitory computer medium storing one of (1) a program to perform operations for regulating a speaker's issuance of speech-based commands, or (2) a program for installing a target program of machine-readable instructions on a computer, where the target program is executable to perform the operations for regulating a speaker's issuance of speech-based commands, the operations comprising:
- receiving an input signal representing audio content including speech of the speaker, the audio content occurring between a prescribed begin event and a prescribed end event;
  
  processing at least a portion of the input signal to determine whether the processed portion of the input signal contains speech exhibiting one or more predetermined conditions capable of posing difficulty in successfully performing speech recognition upon the input signal;
  
  where the processed portion of the input signal comprises one of: all of the input signal, a part of the input signal, speech present in the input signal;
  
  if the processing operation determines that the processed portion of the input signal contains speech exhibiting one or more of the predetermined conditions, then performing operations comprising:
  
  issuing at least one human comprehensible speech quality alert corresponding to each exhibited predetermined condition;
  
  where the input signal originated from a microphone;
  
  the processing operation comprising:
  
  determining whether the microphone was held too near to the speaker; and
  
  determining whether the speaker was speaking too softly relative to prescribed criteria, the operation of determining whether the speaker was speaking too softly relative to prescribed criteria comprising:
  
  computing a non-silent-frame-count by computing a total number of frames of the input signal that have acoustic energy above a first threshold;
  
  computing a soft-frame-count by computing a total number of frames of the input signal that have acoustic energy less than a second threshold;
  
  computing a ratio between the soft-frame-count and the non-silent-frame-count;
  
  determining whether the ratio is greater than a third threshold, and if so, concluding that the speaker was speaking too softly relative to the prescribed criteria.

19. A non-transitory computer medium having program instructions stored therein which, when executed by a processor, perform operations comprising:
- receiving an input signal representing audio content including speech of the speaker, the audio content occurring between a prescribed begin event and a prescribed end event;
  
  processing at least a portion of the input signal to determine whether the processed portion of the input signal contains speech exhibiting one or more predetermined conditions capable of posing difficulty in successfully performing speech recognition upon the input signal;
  
  where the processed portion of the input signal comprises one of: all of the input signal, a part of the input signal, speech present in the input signal;
  
  if the processing operation determines that the processed portion of the input signal contains speech exhibiting one or more of the predetermined conditions, then performing operations comprising:
  
  issuing at least one human comprehensible speech quality alert corresponding to each exhibited predetermined condition;
  
  the processing operation comprising:
  
  determining whether the speaker was speaking too loudly relative to prescribed criteria;
  
  the operation of determining whether the speaker was speaking too loudly relative to prescribed criteria comprising:
  
  computing a non-silent-frame-count by computing a total number of frames of the input signal with acoustic energy above a first threshold;
  
  computing a loud-frame-count by computing a total number of frames of the input signal with acoustic energy above a second threshold higher than the first threshold;
  
  computing a ratio between the loud-frame-count and the non-silent-frame-count;
  
  determining whether the ratio exceeds a third threshold, and if so, concluding that the speaker was speaking too loudly relative to the prescribed criteria.
- Dependent claims (20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35):
  - 20. The non-transitory computer medium of claim 19, the operations further comprising receiving activation and deactivation of a push-to-talk mechanism;
    - the activation and the deactivation defining the begin event and end event, respectively;
      
      the processing operation comprising:
      
      determining whether the speaker began speaking before activation of the push-to-talk mechanism.
  - 21. The non-transitory computer medium of claim 20, the operation of determining whether the speaker began speaking before activation of the push-to-talk mechanism comprising:
    - applying a predetermined speech detection process to the input signal;
      
      determining whether the input signal contains detected speech occurring within a predetermined time period after the begin event, and if so, concluding that the speaker began speaking before activation of the push-to-talk mechanism.
  - 22. The non-transitory computer medium of claim 19, the operations further comprising receiving activation and deactivation of a push-to-talk mechanism;
    - the activation and the deactivation defining the begin event and end event, respectively;
      
      the processing operation comprising:
      
      determining whether the speaker was still speaking upon deactivation of the push-to-talk mechanism.
  - 23. The non-transitory computer medium of claim 22, the operation of determining whether the speaker was still speaking upon deactivation of the push-to-talk mechanism comprising:
    - applying a predetermined speech detection process to the input signal;
      
      determining whether the input signal contains detected speech occurring within a predetermined time period before the end event, and if so, concluding that the speaker was still speaking upon deactivation of the push-to-talk mechanism.
  - 24. The non-transitory computer medium of claim 21 or 23, where the operation of applying the predetermined speech detection process further comprises:
    - excluding breath noise from consideration as speech.
  - 25. The non-transitory computer medium of claim 21 or 23, where the operation of applying the predetermined speech detection process further comprises:
    - excluding energy of a prescribed frequency range from consideration as speech.
  - 26. The non-transitory computer medium of claim 19, the processing operation comprising:
    - determining whether the speaker was speaking too loudly relative to prescribed criteria.
  - 27. The non-transitory computer medium of claim 26, the operation of determining whether the speaker was speaking too loudly relative to prescribed criteria comprising:
    - computing one of the following quantities: acoustic energy of the input signal, acoustic energy of speech of the input signal;
      
      determining whether the computed quantity exceeds a prescribed maximum, and if so, concluding that the speaker was speaking too loudly relative to prescribed criteria.
  - 28. The non-transitory computer medium of claim 19, the processing operation comprising:
    - determining whether the speaker was speaking too softly relative to prescribed criteria.
  - 29. The non-transitory computer medium of claim 28, the operation of determining whether the speaker was speaking too softly relative to prescribed criteria comprising:
    - computing one of the following quantities: acoustic energy of the input signal, acoustic energy of speech of the input signal;
      
      determining whether the computed quantity is less than a predetermined minimum, and if so, concluding that the speaker was speaking too softly relative to prescribed criteria.
  - 30. The non-transitory computer medium of claim 27 or 29, where the computing operation comprises one of:
    - analyzing raw audio stream;
      
      analyzing mel frequency cepstral coefficients.
  - 31. The non-transitory computer medium of claim 27 or 29, the operations further comprising utilizing the input signal to compute a C0 element of a mel-frequency cepstral coefficient vector;
    - the computing operation utilizing the computed C0 element.
  - 32. The non-transitory computer medium of claim 19, the operation of processing at least a portion of the input signal comprising one or more of:
    - computing one or more acoustic signal quality indicators each indicating an affirmative output when the processed portion of the input signal contains speech exhibiting a corresponding one of the predetermined conditions;
      
      computing one or more acoustic signal quality indicators each functionally related to a degree of presence of a corresponding one of the predetermined conditions.
  - 33. The non-transitory computer medium of claim 19, the operations further comprising:
    - receiving or calculating an utterance confidence score representing a likelihood that results of speech recognition performed upon the input signal are correct;
      
      comparing the utterance confidence score with a prescribed threshold;
      
      if the utterance confidence score exceeds the prescribed threshold, omitting performance of one or both of the following:
      
      the operation of processing at least a portion of the input signal as to one or more of the predetermined conditions;
      
      the operation of issuing the at least one human comprehensible speech quality alert.
  - 34. The non-transitory computer medium of claim 33, where:
    - the comparing operation uses different prescribed thresholds corresponding to different predetermined conditions;
      
      the omitting operation is only performed for each of the predetermined conditions whose corresponding prescribed threshold was not exceeded by the utterance confidence score.
  - 35. The non-transitory computer medium of claim 19, where:
    - the operations further comprise receiving or calculating an utterance confidence score representing a likelihood that results of speech recognition performed upon the input signal are correct;
      
      the operation of processing at least the portion of the input signal comprises:
      
      computing one or more acoustic signal quality indicators each functionally related to a degree of presence of a corresponding one of the predetermined conditions;
      
      applying a predetermined function to each combination of: computed acoustic signal quality indicator and utterance confidence score;
      
      the operations further comprise:
      
      if the input signal contains speech exhibiting a given one of the predetermined conditions, suspending issuance of the human comprehensible speech quality alert corresponding to the given predetermined condition if the application of the predetermined function to the utterance confidence score and the acoustic signal quality indicator corresponding to the given predetermined condition yields a predetermined result.
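
The push-to-talk checks in claims 20 to 23 (and their method counterparts, claims 2 to 5) can be illustrated with a short sketch. It assumes that some speech detection process has already produced speech segments as `(start, end)` times in seconds measured from the begin event; that detector, the function names, and the guard-window lengths are all hypothetical.

```python
from typing import Iterable, Tuple

Segment = Tuple[float, float]  # (start_s, end_s), measured from the begin event


def _overlaps(segments: Iterable[Segment], lo: float, hi: float) -> bool:
    """True if any detected speech segment overlaps the window [lo, hi]."""
    return any(start < hi and end > lo for start, end in segments)


def spoke_before_activation(segments: Iterable[Segment], guard_s: float) -> bool:
    # Speech detected within a short period after the begin event suggests
    # the speaker was already talking when the button was pressed.
    return _overlaps(segments, 0.0, guard_s)


def still_speaking_at_release(
    segments: Iterable[Segment], duration_s: float, guard_s: float
) -> bool:
    # Speech detected within a short period before the end event suggests
    # the speaker was cut off when the button was released.
    return _overlaps(segments, max(0.0, duration_s - guard_s), duration_s)


early = [(0.02, 1.2)]   # speech starts 20 ms after the begin event
late = [(0.30, 1.48)]   # speech runs to 20 ms before the end event
print(spoke_before_activation(early, guard_s=0.1))                    # True
print(still_speaking_at_release(late, duration_s=1.5, guard_s=0.1))   # True
```

Claims 24 and 25 add that the speech detection step may exclude breath noise or energy in a prescribed frequency range; in this sketch that filtering would happen upstream, before the segments are produced.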

36. A non-transitory computer medium having program instructions stored therein which, when executed by a processor, perform operations comprising:
- receiving an input signal representing audio content including speech of the speaker, the audio content occurring between a prescribed begin event and a prescribed end event;
  
  processing at least a portion of the input signal to determine whether the processed portion of the input signal contains speech exhibiting one or more predetermined conditions capable of posing difficulty in successfully performing speech recognition upon the input signal;
  
  where the processed portion of the input signal comprises one of: all of the input signal, a part of the input signal, speech present in the input signal;
  
  if the processing operation determines that the processed portion of the input signal contains speech exhibiting one or more of the predetermined conditions, then performing operations comprising:
  
  issuing at least one human comprehensible speech quality alert corresponding to each exhibited predetermined condition;
  
  the processing operation comprising:
  
  determining whether the speaker was speaking too softly relative to prescribed criteria;
  
  the operation of determining whether the speaker was speaking too softly relative to prescribed criteria comprising:
  
  computing a non-silent-frame-count by computing a total number of frames of the input signal that have acoustic energy above a first threshold;
  
  computing a soft-frame-count by computing a total number of frames of the input signal that have acoustic energy less than a second threshold;
  
  computing a ratio between the soft-frame-count and the non-silent-frame-count;
  
  determining whether the ratio is greater than a third threshold, and if so, concluding that the speaker was speaking too softly relative to the prescribed criteria.
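
Claims 10 and 33 describe skipping the speech quality alert when the utterance confidence score exceeds a prescribed threshold. A minimal sketch of that gating step follows. The condition names, the threshold values, and the default threshold are hypothetical, and applying the claim 10 rule separately with one threshold per condition is an assumption about how the per-condition thresholds of claims 11 and 34 might be used.

```python
from typing import Dict, Iterable, List


def alerts_to_issue(
    detected: Iterable[str],
    confidence: float,
    thresholds: Dict[str, float],
    default_threshold: float = 1.0,  # assumption: never suppress by default
) -> List[str]:
    """Return the detected conditions whose alerts should still be issued.

    An alert is suppressed when the recognizer's utterance confidence
    exceeds the threshold configured for that condition.
    """
    return [
        condition
        for condition in detected
        if not confidence > thresholds.get(condition, default_threshold)
    ]


thresholds = {"too_loud": 0.7, "too_soft": 0.9}  # hypothetical values
print(alerts_to_issue(["too_loud", "too_soft"], 0.8, thresholds))  # ['too_soft']
```

The idea, as the claims frame it, is that a warning about speech quality adds little when recognition is already likely to be correct.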

Current Assignee: Promptu Systems Corporation
Original Assignee: Promptu Systems Corporation
Inventors: Chittar, Naren; Gulati, Vikas; Pratt, Matthew; Printz, Harry
Primary Examiner(s): SAINT CYR, LEONARD
Application Number: US12/295,981
Publication Number: US 20090299741A1
Time in Patent Office: 2,338 Days
Field of Search: None
US Class Current: 704/275
CPC Class Codes:
G10L 15/01   Assessment or evaluation of...
G10L 15/22   Procedures used during a sp...
G10L 21/06   Transformation of speech in...