LEARNING SPEECH MODELS FOR MOBILE DEVICE USERS

US 20130006633A1
Filed: 01/05/2012
Published: 01/03/2013
Est. Priority Date: 07/01/2011
Status: Abandoned Application

Abstract

Techniques are provided to recognize a speaker's voice. In one embodiment, received audio data may be separated into a plurality of signals. Each signal may be associated with one or more values for one or more features (e.g., Mel-Frequency Cepstral Coefficients). The received data may be clustered (e.g., by clustering features associated with the signals). A predominate voice cluster may be identified and associated with a user. A speech model (e.g., a Gaussian Mixture Model or Hidden Markov Model) may be trained based on data associated with the predominate cluster. A received audio signal may then be processed using the speech model to, e.g.: determine who was speaking; determine whether the user was speaking; determine whether anyone was speaking; and/or determine what words were said. A context of the device or the user may then be inferred based at least partly on the processed signal.
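
The flow described in the abstract (cluster per-frame features, pick the predominate cluster, train a model on it) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the application: the two-dimensional synthetic features stand in for cepstral features, plain k-means with a farthest-first start stands in for whatever clustering is used, and a single diagonal Gaussian stands in for a full Gaussian Mixture Model. All function names are hypothetical.

```python
import math


def farthest_first_centroids(points, k):
    """Deterministic start: first point, then the point farthest from those chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=10):
    """Plain k-means; returns one cluster label per feature vector."""
    centroids = farthest_first_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every feature vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def predominant_cluster(labels):
    """Label of the cluster associated with the greatest number of segments."""
    return max(set(labels), key=labels.count)


def fit_diagonal_gaussian(vectors):
    """Per-dimension mean and variance (a one-component stand-in for a GMM)."""
    n = len(vectors)
    means = [sum(col) / n for col in zip(*vectors)]
    variances = [
        sum((x - m) ** 2 for x in col) / n + 1e-6  # small floor avoids zero variance
        for col, m in zip(zip(*vectors), means)
    ]
    return means, variances


def log_likelihood(model, vector):
    """Log-density of one feature vector under the diagonal Gaussian."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(vector, means, variances)
    )


# Synthetic in-call features: 30 frames from the device user, 10 from the far end.
user_frames = [(0.01 * i, 1.0 - 0.01 * i) for i in range(30)]
other_frames = [(5.0 + 0.01 * i, 5.0) for i in range(10)]
points = user_frames + other_frames

labels = kmeans(points, k=2)
main = predominant_cluster(labels)
model = fit_diagonal_gaussian([p for p, lab in zip(points, labels) if lab == main])
```

A later frame can then be scored with `log_likelihood(model, frame)`; a higher score suggests the frame resembles the predominate cluster. Treating the largest cluster as the user rests on the premise that the device owner speaks most during that owner's own calls.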

301 Citations

30 Claims

1. A method for training a user speech model, the method comprising:
- accessing audio data captured while a mobile device is in an in-call state;
  
  clustering the captured audio data into a plurality of clusters, each cluster of the plurality of clusters being associated with one or more audio segments from the accessed audio data;
  
  identifying a predominate voice cluster; and
  
  training the user speech model based, at least in part, on audio data associated with the predominate voice cluster.
- - 2. The method of claim 1, further comprising:
    - determining that the mobile device is currently in the in-call state.
  - 3. The method of claim 2, wherein determining that a mobile device is currently in an in-call state comprises determining that the mobile device is currently executing a software application, wherein the software application collects user speech.
  - 4. The method of claim 1, further comprising:
    - receiving, at a remote server, the audio data from the mobile device.
  - 5. The method of claim 1, wherein identifying the predominate voice cluster comprises:
    - identifying one or more of the plurality of clusters as voice clusters, each of the identified voice clusters being primarily associated with audio segments estimated to include speech; and
      
      identifying a select voice cluster amongst the identified voice clusters that, relative to all other voice clusters, is associated with the greatest number of audio segments.
  - 6. The method of claim 1, wherein identifying the predominate voice cluster comprises:
    - identifying a cluster that, relative to all other clusters, is associated with the greatest number of audio segments.
  - 7. The method of claim 1, wherein the user speech model is trained only using the audio data captured while the mobile device was in the in-call state.
  - 8. The method of claim 1, wherein the user speech model is trained after the predominate voice cluster is identified.
  - 9. The method of claim 1, further comprising:
    - storing at least part of the accessed audio data, wherein it is not possible to reconstruct a message spoken during the in-call state by a speaker based on the stored accessed audio data.
  - 10. The method of claim 1, wherein the user speech model is trained to recognize words spoken by a user of the mobile device.
  - 11. The method of claim 1, further comprising:
    - analyzing a second set of audio data using the user speech model;
      
      recognizing, based on the analyzed second set of audio data, one or more particular words spoken by a user; and
      
      inferring a context at least partly based on the recognized one or more words.
  - 12. The method of claim 1, further comprising:
    - accessing second audio data captured while the mobile device is in a second and distinct in-call state;
      
      clustering the accessed second audio data;
      
      identifying a subsequent predominate voice cluster; and
      
      training the user speech model based, at least in part, on audio data associated with the subsequent predominate voice cluster.
  - 13. The method of claim 1, further comprising:
    - storing the accessed audio data;
      
      determining a plurality of cepstral coefficients associated with each of a plurality of portions of the accessed audio data;
      
      clustering the accessed audio data based on the determined plurality of cepstral coefficients, and
      
      training the user speech model based, at least in part, on the stored audio data, wherein the stored audio data comprises temporally varying data.
  - 14. The method of claim 1, wherein the user speech model comprises a Hidden Markov Model.
  - 15. The method of claim 1, wherein the user speech model comprises a Gaussian Mixture Model.
  - 16. The method of claim 1, further comprising:
    - accessing second audio data captured after a user was presented with text to read, the accessed second audio data including a second set of speech segments, wherein the second set of speech segments are based on the presented text; and
      
      training the user speech model based, at least in part, on the second set of speech segments.
  - 17. The method of claim 1, wherein the audio data comprises data collected across a plurality of calls.
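
Claim 13 refers to determining cepstral coefficients for portions of the audio data. As a rough illustration, the sketch below splits a signal into frames and computes the plain real cepstrum of each frame (inverse DFT of the log-magnitude spectrum). This is an assumption-laden stand-in: the abstract mentions Mel-Frequency Cepstral Coefficients as one example, and the mel filterbank step is omitted here. The frame length, hop size, and function names are hypothetical.

```python
import cmath
import math


def dft(values):
    """Naive O(n^2) discrete Fourier transform, adequate for short frames."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def real_cepstrum(frame, num_coefficients):
    """First `num_coefficients` values of the real cepstrum of one frame."""
    n = len(frame)
    # Floor the magnitude so log() never receives zero.
    log_magnitude = [math.log(max(abs(x), 1e-12)) for x in dft(frame)]
    # Inverse DFT of the log-magnitude spectrum; the result is real for real
    # input, so only the real part is kept.
    return [
        sum(
            lm * cmath.exp(2j * cmath.pi * k * q / n)
            for k, lm in enumerate(log_magnitude)
        ).real / n
        for q in range(num_coefficients)
    ]


def split_into_frames(samples, frame_length, hop):
    """Overlapping fixed-length portions of the signal."""
    return [
        samples[i:i + frame_length]
        for i in range(0, len(samples) - frame_length + 1, hop)
    ]


# Synthetic 40-sample tone, cut into 16-sample frames with a hop of 8.
samples = [math.sin(2 * math.pi * 5 * t / 40) for t in range(40)]
features = [real_cepstrum(f, 4) for f in split_into_frames(samples, 16, 8)]
```

Each entry of `features` is one feature vector per frame, which is the kind of input a clustering step could consume. A production system would more likely use a tested signal-processing library than a hand-written DFT.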

18. An apparatus for training a user speech model, the apparatus comprising:
- a mobile device comprising:
  
  a microphone configured to, upon being in an active state, receive audio signals and convert the received audio signals into radio signals; and
  
  a transmitter configured to transmit the radio signals; and
  
  one or more processors configured to:
  
  determine that the microphone is in the active state;
  
  capture audio data while the microphone is in the active state;
  
  cluster the captured audio data into a plurality of clusters, each cluster of the plurality of clusters being associated with one or more audio segments from the captured audio data;
  
  identify a predominate voice cluster; and
  
  train the user speech model based, at least in part, on audio data associated with the predominate voice cluster.
- - 19. The apparatus of claim 18, wherein the mobile device comprises at least one of the one or more processors.
  - 20. The apparatus of claim 18, wherein the mobile device comprises all of the one or more processors.
  - 21. The apparatus of claim 18, wherein the mobile device is configured to execute at least one software application that activates the microphone.
  - 22. The apparatus of claim 18, wherein the audio data is captured only when the mobile device is engaged in a telephone call.

23. A computer-readable medium containing a program which executes the steps of:
- accessing audio data captured while a mobile device is in an in-call state;
  
  clustering the accessed audio data into a plurality of clusters, each cluster of the plurality of clusters being associated with one or more audio segments from the accessed audio data;
  
  identifying a predominate voice cluster; and
  
  training the user speech model based, at least in part, on audio data associated with the predominate voice cluster.
- - 24. The computer-readable medium of claim 23, wherein the step of identifying the predominate voice cluster comprises identifying a cluster that, relative to all other clusters, is associated with the greatest number of audio segments.
  - 25. The computer-readable medium of claim 23, wherein the program further executes the step of:
    - storing at least part of the accessed audio data, wherein it is not possible to reconstruct a message spoken during the in-call state by a speaker based on the stored data.
  - 26. The computer-readable medium of claim 23, wherein the program further executes the steps of:
    - storing the accessed audio data;
      
      determining a plurality of cepstral coefficients associated with each of a plurality of portions of the accessed audio data;
      
      clustering the accessed audio data based on the determined cepstral coefficients, and
      
      training the user speech model based, at least in part, on the stored audio data, wherein the stored audio data comprises temporally varying data.
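
Claim 25 requires that a spoken message cannot be reconstructed from the stored data, but the claims do not say how that is achieved. One conceivable approach, shown purely as an assumption, is to store per-frame feature vectors rather than raw audio, keep only a random subset of frames, and discard their original order. The function name and the `keep_fraction` parameter are hypothetical, and this sketch does not establish that such a scheme would satisfy the claim language.

```python
import random


def obfuscate_for_storage(frame_features, keep_fraction=0.5, seed=None):
    """Return a random subset of frame feature vectors in random order.

    Dropping frames and discarding their original order removes the temporal
    sequence that would be needed to follow what was said.
    """
    rng = random.Random(seed)
    keep_count = max(1, int(len(frame_features) * keep_fraction))
    # random.sample draws without replacement and returns items in selection
    # order, so the result is both thinned and shuffled.
    return rng.sample(frame_features, keep_count)


# Ten hypothetical two-dimensional feature vectors, one per frame.
frame_features = [(float(i), i + 0.5) for i in range(10)]
stored = obfuscate_for_storage(frame_features, keep_fraction=0.5, seed=7)
```

Clustering by feature similarity does not depend on frame order, so data stored this way could still feed a clustering step. A sequence model such as a Hidden Markov Model would need temporal ordering, which is a trade-off to weigh against the privacy goal.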

27. A system for training a user speech model, the system comprising:
- means for accessing audio data captured while a mobile device is in an in-call state;
  
  means for clustering the accessed audio data into a plurality of clusters, each cluster of the plurality of clusters being associated with one or more audio segments from the accessed audio data;
  
  means for identifying a predominate voice cluster; and
  
  means for training the user speech model based, at least in part, on audio data associated with the predominate voice cluster.
- - 28. The system of claim 27, wherein the means for training the user speech model comprises means for training a Hidden Markov Model.
  - 29. The system of claim 27, wherein the predominate voice cluster comprises a voice cluster associated with a highest number of audio frames.
  - 30. The system of claim 27, further comprising means for identifying at least one of the clusters associated with one or more speech signals.
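
Claims 29 and 30 together describe identifying which clusters are associated with speech and then taking the voice cluster with the highest number of audio frames. A minimal sketch of that selection follows. It assumes a per-frame speech estimate is already available from some voice-activity detector, which is not shown, and the majority threshold `min_speech_ratio` and the function name are hypothetical.

```python
from collections import Counter


def predominant_voice_cluster(labels, is_speech, min_speech_ratio=0.5):
    """Pick the largest cluster among those made up mostly of speech frames.

    labels[i] is the cluster of frame i; is_speech[i] is the speech estimate
    for frame i. Returns None when no cluster qualifies as a voice cluster.
    """
    totals = Counter(labels)
    speech_counts = Counter(lab for lab, s in zip(labels, is_speech) if s)
    # A cluster counts as a voice cluster when most of its frames are speech.
    voice_clusters = [
        c for c in totals if speech_counts[c] / totals[c] > min_speech_ratio
    ]
    if not voice_clusters:
        return None
    # Among voice clusters, take the one with the most frames.
    return max(voice_clusters, key=lambda c: totals[c])


# Cluster 0: six frames, mostly background noise. Cluster 1: four speech
# frames. Cluster 2: three speech frames.
labels = [0] * 6 + [1] * 4 + [2] * 3
is_speech = [True] + [False] * 5 + [True] * 4 + [True] * 3
chosen = predominant_voice_cluster(labels, is_speech)
```

Here cluster 0 has the most frames overall but is excluded because it is mostly non-speech, so the selection falls to cluster 1. Ties between equally large voice clusters are broken by whichever cluster was encountered first.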

Current Assignee
Qualcomm, Inc.
Original Assignee
Qualcomm, Inc.
Inventors
Grokop, Leonard Henry; Narayanan, Vidya

Application Number
US13/344,026
Publication Number
US 20130006633A1
US Class Current
704/245
CPC Class Codes
G06N 7/01   Probabilistic graphical mod...
G10L 15/063   Training
G10L 2015/0631   Creating reference template...
