I-Vector Based Clustering Training Data in Speech Recognition

US 20150199960A1
Filed: 08/24/2012
Published: 07/16/2015
Est. Priority Date: 08/24/2012
Status: Abandoned Application

Abstract

Methods and systems for i-vector based clustering training data in speech recognition are described. An i-vector may be extracted from a speech segment of speech training data to represent acoustic information. The i-vectors extracted from the speech training data may be clustered into multiple clusters using a hierarchical divisive clustering algorithm. Using a cluster of the multiple clusters, an acoustic model may be trained. This trained acoustic model may be used in speech recognition.

20 Claims

1. A computer-implemented method for clustering training data in speech recognition, the method comprising:
- extracting a plurality of i-vectors from speech data including a plurality of speech segments;
  
  clustering the plurality of i-vectors into a plurality of clusters;
  
  training an acoustic model using one of the plurality of clusters; and
  
  recognizing one or more other speech segments using the trained acoustic model.
  - 2. The computer-implemented method as recited in claim 1, wherein the extracting the plurality of i-vectors from the speech data comprises:
    - training a Gaussian mixture model (GMM) to represent the speech data;
      
      calculating a set of hyperparameters based on the speech data; and
      
      extracting the plurality of i-vectors based on the GMM and the set of hyperparameters.
  - 3. The computer-implemented method as recited in claim 2, wherein the calculating the set of hyperparameters comprises:
    - initializing the set of hyperparameters;
      
      calculating statistics corresponding to the plurality of speech segments;
      
      calculating a posterior expectation associated with the speech data using the one or more corresponding statistics and the set of hyperparameters; and
      
      updating the set of hyperparameters based on the posterior expectation to generate an updated set of hyperparameters, wherein the extracting the i-vector is further based on the updated set of hyperparameters.
  - 4. The computer-implemented method as recited in claim 2, further comprising:
    - calculating an additional set of hyperparameters using a residual term to model variabilities associated with the speech data that are not captured by the set of hyperparameters, and wherein the extracting the i-vector is further based on the additional set of hyperparameters.
  - 5. The computer-implemented method as recited in claim 1, wherein a similarity between two i-vectors of the plurality of i-vectors is measured using one of a Euclidean distance or a cosine measure.
  - 6. The computer-implemented method as recited in claim 1, wherein the acoustic model is cluster-dependent and trained based on a cluster-independent acoustic model that is trained using speech data.
  - 7. The computer-implemented method as recited in claim 6, wherein the recognizing the one or more speech segments using the trained acoustic model comprises recognizing the one or more speech segments using the cluster-dependent acoustic model and the cluster-independent acoustic model.
  - 8. The computer-implemented method as recited in claim 1, further comprising:
    - receiving other speech data;
      
      generating the one or more other speech segments based on the other speech data;
      
      extracting an i-vector from one segment of the one or more other speech segments;
      
      selecting a cluster corresponding to the i-vector; and
      
      determining an acoustic model that is trained by the cluster, and wherein the recognizing the one or more other speech segments using the trained acoustic model comprises recognizing the one segment using the acoustic model.
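
Claims 2 and 3 describe extraction in terms of a GMM, per-segment statistics, and a posterior expectation. The following is a minimal NumPy sketch of how that computation is commonly organized in the standard i-vector formulation, not the applicants' implementation. It assumes a diagonal-covariance GMM and treats the "set of hyperparameters" as a total variability matrix, here the hypothetical `T_mat`; the claims do not spell out either choice. The function names, shapes, and random toy parameters are illustrative only. The iterative hyperparameter update of claim 3 and the residual term of claim 4 are not covered.

```python
import numpy as np

def baum_welch_stats(frames, weights, means, variances):
    """Zeroth-order and centered first-order statistics of one segment.

    frames: (T, D) feature frames; weights: (C,); means, variances: (C, D)
    for an assumed diagonal-covariance GMM.
    """
    diff = frames[:, None, :] - means[None, :, :]            # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss                    # (T, C)
    log_post -= log_post.max(axis=1, keepdims=True)           # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                   # component posteriors
    N = post.sum(axis=0)                                      # (C,)
    F = post.T @ frames - N[:, None] * means                  # (C, D), centered
    return N, F

def extract_ivector(N, F, variances, T_mat):
    """Posterior mean of the latent vector given the segment statistics.

    T_mat: (C * D, R) assumed total variability matrix.
    Solves (I + T' S^-1 N T) w = T' S^-1 F for w.
    """
    C, D = F.shape
    R = T_mat.shape[1]
    T3 = T_mat.reshape(C, D, R)
    T_over_var = T3 / variances[:, :, None]
    precision = np.eye(R) + np.einsum("c,cdr,cds->rs", N, T_over_var, T3)
    rhs = np.einsum("cdr,cd->r", T_over_var, F)
    return np.linalg.solve(precision, rhs)

# Toy usage with random, untrained parameters (checks shapes only; not real speech).
rng = np.random.default_rng(0)
C, D, R = 4, 3, 2
weights = np.full(C, 1.0 / C)
means = rng.normal(size=(C, D))
variances = np.ones((C, D))
T_mat = 0.1 * rng.normal(size=(C * D, R))
frames = rng.normal(size=(50, D))
N, F = baum_welch_stats(frames, weights, means, variances)
ivector = extract_ivector(N, F, variances, T_mat)
```

In a real system the GMM and `T_mat` would be estimated from the training segments first. The sketch shows only how one segment's statistics turn into one fixed-length vector.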

9. A method comprising:
- under control of one or more computing systems comprising one or more processors, receiving speech data including a plurality of speech segments;
  
  extracting an i-vector from a speech segment of the plurality of speech segments;
  
  selecting a cluster corresponding to the i-vector; and
  
  determining an acoustic model corresponding to the cluster; and
  
  recognizing the speech segment using the acoustic model.
  - 10. The method as recited in claim 9, further comprising:
    - extracting a plurality of i-vectors from a plurality of training speech segments;
      
      clustering the plurality of i-vectors into multiple clusters that includes the cluster; and
      
      training acoustic models using the multiple clusters, the acoustic models including the acoustic model.
  - 11. The method as recited in claim 10, wherein the extracting the plurality of i-vectors from the plurality of training speech segments comprises:
    - training a GMM based on the plurality of training speech segments;
      
      calculating hyperparameters of the plurality of training speech segments;
      
      calculating additional hyperparameters to model variabilities of the plurality of training speech segments not captured by the hyperparameters; and
      
      extracting the plurality of i-vectors based on the GMM, the hyperparameters and the additional hyperparameters.
  - 12. The method as recited in claim 9, wherein the selecting the cluster corresponding to the i-vector comprises:
    - normalizing the i-vector using a cosine similarity measure; and
      
      selecting the cluster based on a similarity between the i-vector and a centroid of the cluster.
  - 13. The method as recited in claim 12, wherein the selecting the cluster comprises selecting multiple clusters based on similarities between the i-vector and centroids of the multiple clusters, and wherein the determining the acoustic model corresponding to the cluster comprises determining multiple acoustic models corresponding to the multiple clusters.
  - 14. The method as recited in claim 9, wherein the determining the acoustic model comprises determining a cluster-dependent acoustic model and a cluster-independent acoustic model, and wherein the cluster-dependent acoustic model is trained based on the cluster-independent acoustic model.
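
Claims 12 and 13 select one or several clusters by comparing a normalized i-vector against cluster centroids. A minimal sketch of that lookup follows. It assumes that "normalizing the i-vector using a cosine similarity measure" means scaling to unit length, so that a dot product with a unit-length centroid equals the cosine similarity. That reading is an interpretation, not something the claims state. The names `length_normalize`, `select_clusters`, and `top_n` are hypothetical.

```python
import numpy as np

def length_normalize(v, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero vectors)."""
    return v / max(np.linalg.norm(v), eps)

def select_clusters(ivector, centroids, top_n=1):
    """Return indices and cosine similarities of the top_n closest centroids.

    ivector: (R,) vector for the segment to recognize.
    centroids: (K, R) one centroid per trained cluster.
    """
    query = length_normalize(ivector)
    norms = np.maximum(np.linalg.norm(centroids, axis=1, keepdims=True), 1e-12)
    sims = (centroids / norms) @ query           # cosine similarity per cluster
    order = np.argsort(-sims)[:top_n]            # most similar first
    return order.tolist(), sims[order].tolist()

# Toy usage: three made-up centroids in two dimensions.
centroids = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
cluster_ids, similarities = select_clusters(np.array([2.0, 0.5]), centroids, top_n=2)
```

The returned indices could then key into a table of cluster-dependent acoustic models. With `top_n` greater than one, several models are retrieved, as in claim 13.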

15. One or more computer-readable media storing instructions that are executable by one or more processors to perform acts comprising:
- receiving a plurality of training speech segments;
  
  extracting multiple i-vectors from the plurality of training speech segments based on a set of hyperparameters of the plurality of training speech segments, individual ones of the i-vectors of the multiple i-vectors corresponding to a training speech segment of the plurality of training speech segments;
  
  clustering the i-vectors into multiple clusters;
  
  training a cluster-dependent acoustic model using a cluster of the multiple clusters; and
  
  recognizing an unknown speech segment using the cluster-dependent acoustic model.
  - 16. The one or more computer-readable media as recited in claim 15, wherein an i-vector extracted from the unknown speech segment is associated with a cluster corresponding to the cluster-dependent acoustic model.
  - 17. The one or more computer-readable media as recited in claim 15, wherein the extracting multiple i-vectors comprises extracting multiple i-vectors further based on an additional set of hyperparameters that model variabilities of the plurality of training speech segments not captured by the set of hyperparameters.
  - 18. The one or more computer-readable media as recited in claim 15, wherein the set of hyperparameters are determined based on Baum-Welch statistics that correspond to the plurality of training speech segments and a GMM that is trained to represent the plurality of training speech segments.
  - 19. The one or more computer-readable media as recited in claim 15, wherein the clustering the i-vectors into multiple clusters comprises clustering the i-vectors into multiple clusters using a Linde-Buzo-Gray (LBG) algorithm.
  - 20. The one or more computer-readable media as recited in claim 15, wherein a similarity between two i-vectors of the multiple i-vectors is measured using one of a Euclidean distance or a cosine measure.
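
Claim 19 names the Linde-Buzo-Gray (LBG) algorithm for clustering, and claim 20 allows either a Euclidean distance or a cosine measure. The sketch below shows one common way to organize LBG-style splitting under those two measures. The perturbation size `eps`, the fixed iteration count `n_iter`, the additive split along the per-dimension standard deviation, and the power-of-two restriction are all assumptions made for brevity; none of them come from the application.

```python
import numpy as np

def assign(X, centroids, metric):
    """Index of the closest centroid for every row of X."""
    if metric == "cosine":
        norms = np.maximum(np.linalg.norm(centroids, axis=1, keepdims=True), 1e-12)
        return np.argmax(X @ (centroids / norms).T, axis=1)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d2, axis=1)

def lbg_cluster(X, n_clusters, metric="euclidean", eps=0.01, n_iter=20):
    """Divisive clustering: start with one centroid, split each, then refine."""
    if n_clusters < 1 or n_clusters & (n_clusters - 1):
        raise ValueError("this sketch only supports a power-of-two cluster count")
    X = np.asarray(X, dtype=float)
    if metric == "cosine":
        # Unit-length rows make the dot product equal to cosine similarity.
        X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    centroids = X.mean(axis=0, keepdims=True)
    delta = eps * X.std(axis=0)                  # small additive perturbation
    while len(centroids) < n_clusters:
        centroids = np.vstack([centroids + delta, centroids - delta])
        for _ in range(n_iter):                  # k-means style refinement
            labels = assign(X, centroids, metric)
            for k in range(len(centroids)):
                members = X[labels == k]
                if len(members):                 # keep old centroid if cluster is empty
                    centroids[k] = members.mean(axis=0)
    return centroids, assign(X, centroids, metric)

# Toy usage: two well-separated synthetic blobs standing in for i-vectors.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.1, size=(20, 3)) + np.array([5.0, 0.0, 0.0])
blob_b = rng.normal(0.0, 0.1, size=(20, 3)) + np.array([-5.0, 0.0, 0.0])
data = np.vstack([blob_a, blob_b])
centroids, labels = lbg_cluster(data, n_clusters=2)
```

Each resulting cluster would then supply the training segments for one cluster-dependent acoustic model. A production version would normally stop refining on a convergence criterion rather than after a fixed number of passes.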

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Huo, Qiang; Yan, Zhi-Jie; Zhang, Yu; Xu, Jian

Application Number
US13/640,804
Publication Number
US 20150199960A1
US Class Current
1/1
CPC Class Codes
G10L 15/063   Training
G10L 15/14   using statistical models, e...
G10L 2015/0631   Creating reference template...
