ANALYZING AUDIO INPUT FOR EFFICIENT SPEECH AND MUSIC RECOGNITION

US 20150332667A1
Filed: 09/29/2014
Published: 11/19/2015
Est. Priority Date: 05/15/2014
Status: Active Grant

First Claim

1. A method for analyzing audio input, the method comprising:

at an electronic device:

receiving an audio input;

determining whether the audio input includes music;

determining whether the audio input includes speech;

in response to determining that the audio input includes music, generating an acoustic fingerprint representing a portion of the audio input that includes music; and

in response to determining that the audio input includes speech rather than music, identifying an end-point of a speech utterance of the audio input.

Abstract

Systems and processes for analyzing audio input for efficient speech and music recognition are provided. In one example process, an audio input can be received. A determination can be made as to whether the audio input includes music. In addition, a determination can be made as to whether the audio input includes speech. In response to determining that the audio input includes music, an acoustic fingerprint representing a portion of the audio input that includes music is generated. In response to determining that the audio input includes speech rather than music, an end-point of a speech utterance of the audio input is identified.
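The branching described in the abstract can be sketched as a small dispatch function. This is only an illustrative reading of the abstract, not the patented implementation: the names `Action` and `choose_action` are hypothetical, and the two boolean inputs stand in for whatever music and speech determinations the device actually makes.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for the next processing step."""
    FINGERPRINT_MUSIC = "generate acoustic fingerprint"
    ENDPOINT_SPEECH = "identify speech end-point"
    NONE = "no action"


def choose_action(includes_music: bool, includes_speech: bool) -> Action:
    """Map the two determinations to the step named in the abstract."""
    if includes_music:
        # Music detected: fingerprint the portion that includes music.
        return Action.FINGERPRINT_MUSIC
    if includes_speech:
        # Speech rather than music: find the end of the utterance.
        return Action.ENDPOINT_SPEECH
    # Neither detected: the abstract names no step for this case.
    return Action.NONE


print(choose_action(includes_music=False, includes_speech=True).value)
```

Checking music first mirrors the "speech rather than music" wording: end-pointing is chosen only when music was not detected.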

21 Claims

1. A method for analyzing audio input, the method comprising:
- at an electronic device:
  
  receiving an audio input;
  
  determining whether the audio input includes music;
  
  determining whether the audio input includes speech;
  
  in response to determining that the audio input includes music, generating an acoustic fingerprint representing a portion of the audio input that includes music; and
  
  in response to determining that the audio input includes speech rather than music, identifying an end-point of a speech utterance of the audio input.
  - 2. The method according to claim 1, wherein the audio input comprises a sequence of audio segments, and wherein determining whether the audio input includes music further comprises:
    - extracting from an audio segment of the sequence of audio segments one or more features that characterize the audio segment; and
      
      determining whether the audio segment includes music using an audio classifier and based on the one or more features that characterize the audio segment.
  - 3. The method according to claim 2, wherein the one or more features that characterize the audio segment include at least one of root mean square amplitude, zero crossing rate, spectral centroid, spectral roll-off, spectral flux, and spectral flatness.
  - 4. The method according to claim 2, wherein the one or more features that characterize the audio segment include at least one of frequency cepstrum coefficients, linear predictive cepstral coefficients, bark scale frequency cepstral coefficients, mel-frequency discrete wavelet coefficients, or mel-frequency cepstral coefficients.
  - 5. The method according to claim 2, wherein the audio classifier is a neural network classifier.
  - 6. The method according to claim 2, wherein the audio classifier is a rule-based classifier.
  - 7. The method according to claim 6, wherein the rule-based classifier determines whether the audio segment includes music by comparing the one or more features to one or more predetermined thresholds.
  - 8. The method according to claim 2, wherein the audio input comprises a sequence of audio frames, and wherein determining whether the audio input includes speech further comprises:
    - extracting from an audio frame of the sequence of audio frames one or more features that characterize the audio frame; and
      
      determining whether the audio frame includes speech based on the one or more features that characterize the audio frame and one or more predetermined thresholds, wherein a duration of the audio frame is different from a duration of the audio segment.
  - 9. The method according to claim 2, wherein the audio input comprises a sequence of audio frames, and wherein determining whether the audio input includes speech further comprises:
    - extracting from an audio frame of the sequence of audio frames one or more features that characterize the audio frame; and
      
      determining whether the audio frame includes speech based on the one or more features that characterize the audio frame and one or more predetermined thresholds.
  - 10. The method according to claim 8, wherein the one or more features that characterize the audio frame include at least one of short-term energy level, zero crossing rate, spectral centroid, spectral roll-off, spectral flux, spectral flatness, and autocorrelation.
  - 11. The method according to claim 1, wherein determining whether the audio input includes music is performed independent of determining whether the audio input includes speech.
  - 12. The method according to claim 1, wherein determining whether the audio input includes music and determining whether the audio input includes speech are performed at least in part simultaneously.
  - 13. The method according to claim 1, wherein the acoustic fingerprint is generated from an uncompressed representation of a portion of the audio input.
  - 14. The method according to claim 1, further comprising:
    - in response to determining that the audio input includes speech, presenting a relevant dialog response to a speech utterance of the audio input.
  - 15. The method according to claim 1, further comprising:
    - in response to determining that the audio input includes music:
      
      obtaining an identity of the music in the audio input based on the acoustic fingerprint; and
      
      displaying the identity of the music.
  - 16. The method according to claim 1, further comprising:
    - in response to determining that the audio input includes music:
      
      processing the audio input for speech comprising:
      
      identifying a speech utterance of the audio input;
      
      determining an inferred user intent based on the speech utterance; and
      
      determining whether the inferred user intent includes identifying music in the audio input; and
      
      in response to determining that the inferred user intent does not include identifying music in the audio input, ceasing to generate the acoustic fingerprint.
  - 17. The method according to claim 1, wherein receiving the audio input begins in response to receiving a signal to begin receiving the audio input, and further comprising:
    - in response to determining that the audio input includes neither speech nor music for a predetermined duration, ceasing to receive the audio input.
  - 18. The method according to claim 1, further comprising:
    - in response to determining that the audio input includes music, ceasing to determine whether the audio input includes speech.
  - 19. The method according to claim 1, wherein receiving the audio input begins in response to receiving a signal to begin receiving the audio input, and further comprising:
    - in response to determining that the audio input includes speech rather than music, ceasing to receive the audio input a predetermined duration after the end-point is identified.
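
Claims 2, 3, 6 and 7 describe extracting features from an audio segment and comparing them to predetermined thresholds. The sketch below illustrates that idea with two of the features named in claim 3 (root mean square amplitude and zero crossing rate). The function names, the threshold values, and the rule itself are assumptions made for illustration; the claims do not specify them, and this toy rule would not reliably detect real music.

```python
import math


def rms_amplitude(samples):
    """Root mean square amplitude of one audio segment."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


# Hypothetical thresholds, chosen only for this illustration.
MIN_RMS = 0.05
MAX_ZCR = 0.25


def segment_includes_music(samples, min_rms=MIN_RMS, max_zcr=MAX_ZCR):
    """Toy rule-based classifier: loud enough and not noise-like."""
    return (
        rms_amplitude(samples) >= min_rms
        and zero_crossing_rate(samples) <= max_zcr
    )


# A 440 Hz tone sampled at 16 kHz for 0.1 s stands in for a music segment.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
print(segment_includes_music(tone))
```

A neural network classifier, as in claim 5, would replace `segment_includes_music` while the feature extraction step stays the same.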

20. A non-transitory computer-readable storage medium comprising instructions for causing one or more processors to:
- receive audio input;
  
  determine whether the audio input includes music;
  
  determine whether the audio input includes speech;
  
  responsive to determining that the audio input includes music, generate an acoustic fingerprint representing a portion of the audio input that includes music; and
  
  responsive to determining that the audio input includes speech rather than music, identify an end-point of a speech utterance of the audio input.

21. An electronic device, comprising:
- one or more processors;
  
  memory;
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  receiving audio input;
  
  determining whether the audio input includes music;
  
  determining whether the audio input includes speech;
  
  responsive to determining that the audio input includes music, generating an acoustic fingerprint representing a portion of the audio input that includes music; and
  
  responsive to determining that the audio input includes speech rather than music, identifying an end-point of a speech utterance of the audio input.
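
The claims do not say how the end-point of a speech utterance is identified. One common approach, shown here purely as an assumption, is to flag frames by short-term energy (a feature listed in claim 10) and declare the end-point once a run of quiet frames follows the last speech frame. The names `short_term_energy` and `find_end_point`, the energy threshold, and the trailing-frame count are all hypothetical.

```python
def short_term_energy(frame):
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in frame) / len(frame)


def find_end_point(frames, energy_threshold=0.01, trailing_frames=3):
    """Return the index of the last speech frame once `trailing_frames`
    consecutive quiet frames follow it, or None if no end-point is found."""
    last_speech = None
    quiet_run = 0
    for i, frame in enumerate(frames):
        if short_term_energy(frame) >= energy_threshold:
            # Speech-like frame: remember it and reset the quiet counter.
            last_speech = i
            quiet_run = 0
        elif last_speech is not None:
            # Quiet frame after speech has started.
            quiet_run += 1
            if quiet_run >= trailing_frames:
                return last_speech
    return None


loud = [0.5] * 160   # stand-in for a frame containing speech
quiet = [0.0] * 160  # stand-in for a silent frame
frames = [quiet, loud, loud, quiet, loud, quiet, quiet, quiet, loud]
print(find_end_point(frames))
```

Requiring several quiet frames in a row keeps a brief pause inside an utterance (frame 3 in the example) from being treated as its end.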

Current Assignee
Apple Inc.
Original Assignee
Apple Inc.
Inventors
MASON, Henry

Granted Patent

US 9,620,105 B2
US Class Current

1/1
CPC Class Codes

G10L 15/02   Feature extraction for spee...

G10L 2015/025   Phonemes, fenemes or fenone...

G10L 25/03   characterised by the type o...

G10L 25/81   for discriminating voice fr...
