Speaker adaptation based on lateral tying for large-vocabulary continuous speech recognition

US 5,737,487 A
Filed: 02/13/1996
Issued: 04/07/1998
Est. Priority Date: 02/13/1996
Status: Expired due to Term

Abstract

A system and method for performing speaker adaptation in a speech recognition system which includes a set of reference models corresponding to speech data from a plurality of speakers. The speech data is represented by a plurality of acoustic models and corresponding sub-events, and each sub-event includes one or more observations of speech data. A degree of lateral tying is computed between each pair of sub-events, wherein the degree of tying indicates the degree to which a first observation in a first sub-event contributes to the remaining sub-events. When adaptation data from a new speaker becomes available, a new observation from adaptation data is assigned to one of the sub-events. Each of the sub-events is then populated with the observations contained in the assigned sub-event based on the degree of lateral tying that was computed between each pair of sub-events. The reference models corresponding to the populated sub-events are then adapted to account for speech pattern idiosyncrasies of the new speaker, thereby reducing the error rate of the speech recognition system.

190 Citations

30 Claims

1. A method of performing speaker adaptation in a speech recognition system which includes a set of reference models corresponding to speech data from a plurality of speakers, the speech data represented by a plurality of acoustic models and corresponding sub-events, wherein each sub-event includes one or more observations of speech data, the method comprising the steps of:
- (a) computing a degree of lateral tying between each pair of sub-events, wherein the degree of tying indicates the degree to which a first observation in a first sub-event contributes to the remaining sub-events;
  
  (b) assigning a new observation from adaptation data of a new speaker to one of the sub-events;
  
  (c) populating each of the sub-events with a transformed version of the observation contained in the assigned sub-event based on the degree of lateral tying computed between each pair of sub-events;
  
  (d) adapting the reference models that correspond to the populated sub-events to account for speech pattern idiosyncrasies of the new speaker, thereby reducing the error rate of the speech recognition system.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 20, 21, 22, 23, 24, 25, 26, 27)
- - 2. A method as in claim 1 wherein step (a) further includes the step of:
    - (a1) computing a transformation for each pair of sub-events to indicate if a high degree of lateral tying exists between the pair.
  - 3. A method as in claim 2 wherein step (a) further includes the step of:
    - (a2) for each of the sub-events, identifying a neighborhood of other sub-events that exhibit a high degree of tying with the respective sub-event.
  - 4. A method as in claim 3 wherein step (c) further includes the step of:
    - (c1) populating only those sub-events that are included in the neighborhood of the assigned sub-event.
  - 5. A method as in claim 4 wherein step (a1) further includes the steps of:
    - (a1i) representing observations by feature vectors;
      
      (a1ii) segregating each feature vector in the sub-events according to which speaker the feature vector originated from;
      
      (a1iii) for each speaker s, determining an anchor point of the feature vectors in each of the sub-events i, given by X_i (s);
      (a1iv) defining the degree of tying between sub-events E₁ and E₂ is based on a cross-covariance matrix between E₁ and E₂ by ##EQU6## where the summation is over all the speakers; and
      
      (a1v) computing a transformation matrix for every pair of sub-events using the quantity
      
      $\Gamma_{12} = U_{12} V_{12}^{T}$ (6)
      
      to represent the least squares rotation that must be applied to X₁ (s) in sub-event E₁ in order to obtain an estimate of X₂ (s) in sub-event E₂.
  - 6. A method as in claim 5 wherein step (a2) further includes the step of:
    - (a2i) determining which sub-events E_j are closely correlated with E_i by a distance measure ##EQU7## which is computed for each E_j, and wherein λ_j has a value that ranges from one to zero, where a value of one represents a maximum degree of tying.
  - 7. A method as in claim 6 wherein step (a2) further includes the step of:
    - (a2ii) creating a neighborhood for sub-event E_i by excluding those sub-events E_j from the neighborhood whose value of λ_j deviates from the maximum value of λ by more than a pre-determined percentile.
  - 8. A method as in claim 6 wherein step (a2) further includes the step of:
    - (a2iii) creating a neighborhood for sub-event E_i by ranking each E_j from the highest value of λ_j to the lowest, and by including in the neighborhood only a predetermined number of sub-events E_j that have the highest ranked values of λ_j.
  - 9. A method as in claim 8 wherein step (c1) further includes the step of:
    - (c1i) calculating
      
      $Z_j = (S_j^{1/2} \lambda_j \Gamma_{ij} S_i^{-1/2}) Z_i + (M_j - S_j^{1/2} \lambda_j \Gamma_{ij} S_i^{-1/2} M_i)$
      
      wherein Z_i denotes a first feature vector from the adaptation data assigned to sub-event E_i, which is used to obtain an estimate of an unobserved feature vector Z_j in sub-event E_j that is contained in neighborhood of sub-event E_i.
  - 20. A computer-readable medium as in claim 1 wherein instruction (a) further includes the instruction of:
    - (a1) computing a transformation for each pair of sub-events to indicate if a high degree of lateral tying exists between the pair.
  - 21. A computer-readable medium as in claim 2 wherein instruction (a) further includes the instruction of:
    - (a2) for each of the sub-events, identifying a neighborhood of other sub-events that exhibit a high degree of tying with the respective sub-event.
  - 22. A computer-readable medium as in claim 3 wherein instruction (c) further includes the instruction of:
    - (c1) populating only those sub-events that are included in the neighborhood of the assigned sub-event.
  - 23. A computer-readable medium as in claim 4 wherein instruction (a1) further includes the instructions of:
    - (a1i) representing observations by feature vectors;
      
      (a1ii) segregating each feature vector in the sub-events according to which speaker the feature vector originated from;
      
      (a1iii) for each speaker s, determining an anchor point of the feature vectors in each of the sub-events i, given by X_i (s);
      (a1iv) defining the degree of tying between sub-events E₁ and E₂ is based on a cross-covariance matrix between E₁ and E₂ by ##EQU10## where the summation is over all the speakers; and
      
      (a1v) computing a transformation matrix for every pair of sub-events using the quantity
      
      $\Gamma_{12} = U_{12} V_{12}^{T}$ (6)
      
      to represent the least squares rotation that must be applied to X₁ (s) in sub-event E₁ in order to obtain an estimate of X₂ (s) in sub-event E₂.
  - 24. A computer-readable medium as in claim 5 wherein instruction (a2) further includes the instruction of:
    - (a2i) determining which sub-events E_j are closely correlated with E_i by a distance measure ##EQU11## which is computed for each E_j, and wherein λ_j has a value that ranges from one to zero, where a value of one represents a maximum degree of tying.
  - 25. A computer-readable medium as in claim 6 wherein instruction (a2) further includes the instruction of:
    - (a2ii) creating a neighborhood for sub-event E_i by excluding those sub-events E_j from the neighborhood whose value of λ_j deviates from the maximum value of λ by more than a pre-determined percentile.
  - 26. A computer-readable medium as in claim 6 wherein instruction (a2) further includes the instruction of:
    - (a2iii) creating a neighborhood for sub-event E_i by ranking each E_j from the highest value of λ_j to the lowest, and by including in the neighborhood only a predetermined number of sub-events E_j that have the highest ranked values of λ_j.
  - 27. A computer-readable medium as in claim 8 wherein instruction (c1) further includes the instruction of:
    - (c1i) calculating
      
      $Z_j = (S_j^{1/2} \lambda_j \Gamma_{ij} S_i^{-1/2}) Z_i + (M_j - S_j^{1/2} \lambda_j \Gamma_{ij} S_i^{-1/2} M_i)$
      
      wherein Z_i denotes a first feature vector from the adaptation data assigned to sub-event E_i, which is used to obtain an estimate of an unobserved feature vector Z_j in sub-event E_j that is contained in neighborhood of sub-event E_i.
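The two neighborhood rules recited in claims 7 and 8 (and again in claims 25 and 26) can be illustrated with a minimal Python sketch. It assumes the tying values λ_j for one sub-event E_i have already been computed, since the distance measure itself is not reproduced in this record. The function names, the parameter `max_rel_dev`, and the reading of "pre-determined percentile" as a fractional deviation from the maximum λ are assumptions made for illustration, not language from the patent.

```python
import numpy as np

def neighborhood_by_deviation(lam, max_rel_dev):
    """Keep sub-events whose tying value lies within a fraction of the maximum.

    lam         : tying values lambda_j in [0, 1], one per candidate sub-event E_j
    max_rel_dev : hypothetical cut-off, e.g. 0.1 keeps values within 10% of the max
    """
    lam = np.asarray(lam, dtype=float)
    keep = lam >= lam.max() * (1.0 - max_rel_dev)
    return np.flatnonzero(keep)

def neighborhood_top_k(lam, k):
    """Keep the k sub-events with the highest tying values, best first."""
    lam = np.asarray(lam, dtype=float)
    order = np.argsort(-lam, kind="stable")  # descending by lambda_j
    return order[:k]

lam_example = [0.9, 0.5, 0.85, 0.2]
near = neighborhood_by_deviation(lam_example, 0.1)
best = neighborhood_top_k(lam_example, 2)
```

Both functions return indices into the list of candidate sub-events; the first rule yields a variable-size neighborhood, the second a fixed-size one.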

10. A speech recognition system that performs speaker adaptation, the system including a set of reference models corresponding to speech data from a plurality of speakers, the speech data represented by a plurality of acoustic models and corresponding sub-events, wherein each sub-event includes one or more observations of speech data, the speech recognition system comprising:
- (a) means for computing a degree of lateral tying between each pair of sub-events, wherein the degree of tying indicates the degree to which a first observation in a first sub-event contributes to the remaining sub-events;
  
  (b) means for assigning a new observation from adaptation data of a new speaker to one of the sub-events;
  
  (c) means for populating each of the sub-events with a transformed version of the observation contained in the assigned sub-event based on the degree of lateral tying computed between each pair of sub-events;
  
  (d) means for adapting the reference models that correspond to the populated sub-events to account for speech pattern idiosyncrasies of the new speaker, thereby reducing the error rate of the speech recognition system.
- View Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
  - 11. A system as in claim 10 wherein the means for computing computes a transformation for each pair of sub-events to indicate if a high degree of lateral tying exists between the pair.
  - 12. A system as in claim 11 wherein the means for computing further includes means for identifying a neighborhood for each of the sub-events that represents other sub-events that exhibit a high degree of tying with the respective sub-event.
  - 13. A system as in claim 12 wherein the means for populating only populates those sub-events that are included in the neighborhood of the assigned sub-event.
  - 14. A system as in claim 13 wherein the means for computing further includes:
    - means for representing observations by feature vectors;
      
      means for segregating each feature vector in the sub-events according to which speaker the feature vector originated from;
      
      means for determining for each speaker s, an anchor point of the feature vectors in each of the sub-events i, given by X_i (s);
      means for defining the degree of tying between sub-events E₁ and E₂ is based on a cross-covariance matrix between E₁ and E₂ by ##EQU8## where the summation is over all the speakers; and
      
      (a1v) means for computing a transformation matrix for every pair of sub-events using the quantity
      
      $\Gamma_{12} = U_{12} V_{12}^{T}$ (6)
      
      to represent the least squares rotation that must be applied to X₁ (s) in sub-event E₁ in order to obtain an estimate of X₂ (s) in sub-event E₂.
  - 15. A system as in claim 14 wherein means for computing further includes means for determining which sub-events E_j are closely correlated with E_i by a distance measure ##EQU9## which is computed for each E_j, and wherein λ_j has a value that ranges from one to zero, where a value of one represents a maximum degree of tying.
  - 16. A system as in claim 15 wherein the means for computing creates a neighborhood for sub-event E_i by excluding those sub-events E_j from the neighborhood whose value of λ_j deviates from the maximum value of λ by more than a pre-determined percentile.
  - 17. A system as in claim 16 wherein the means for computing creates a neighborhood for sub-event E_i by ranking each E_j from the highest value of λ_j to the lowest, and by including in the neighborhood only a predetermined number of sub-events E_j that have the highest ranked values of λ_j.
  - 18. A system as in claim 17 wherein the means for populating includes means for calculating
    
    $Z_j = (S_j^{1/2} \lambda_j \Gamma_{ij} S_i^{-1/2}) Z_i + (M_j - S_j^{1/2} \lambda_j \Gamma_{ij} S_i^{-1/2} M_i)$
    
    wherein Z_i denotes a first feature vector from the adaptation data assigned to sub-event E_i, which is used to obtain an estimate of an unobserved feature vector Z_j in sub-event E_j that is contained in neighborhood of sub-event E_i.
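The populating formula recited in claims 9, 18 and 27 can be evaluated with a few lines of NumPy. The sketch below assumes that M_i and S_i are the mean vector and covariance matrix of sub-event E_i, that λ_j is a scalar, and that S^{1/2} denotes the symmetric matrix square root; the claims as reproduced here do not define these symbols, so those readings and the function names are assumptions.

```python
import numpy as np

def sqrtm_spd(S):
    """Symmetric square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(w)) @ V.T  # V diag(sqrt(w)) V^T

def populate_neighbor(z_i, M_i, S_i, M_j, S_j, gamma_ij, lam_j):
    """Estimate the unobserved vector Z_j in E_j from an observed Z_i in E_i.

    Implements Z_j = A Z_i + (M_j - A M_i) with
    A = S_j^{1/2} * lambda_j * Gamma_ij * S_i^{-1/2}.
    """
    S_j_half = sqrtm_spd(S_j)
    S_i_inv_half = np.linalg.inv(sqrtm_spd(S_i))
    A = lam_j * S_j_half @ gamma_ij @ S_i_inv_half
    return A @ z_i + (M_j - A @ M_i)

# Toy usage: identity covariances and rotation reduce the map to a mean shift.
I2 = np.eye(2)
z_j = populate_neighbor(np.array([2.0, 2.0]),
                        np.array([1.0, 2.0]), I2,
                        np.array([-3.0, 0.5]), I2,
                        I2, 1.0)
```

One property worth noting is that an observation equal to M_i always maps to M_j, whatever the values of λ_j and Γ_ij; only deviations from the mean are rotated and rescaled.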

19. A computer-readable medium containing program instructions for performing speaker adaptation in a speech recognition system which includes a set of reference models corresponding to speech data from a plurality of speakers, the speech data represented by a plurality of acoustic models and corresponding sub-events, wherein each sub-event includes one or more observations of speech data, the program instructions for:
- (a) computing a degree of lateral tying between each pair of sub-events, wherein the degree of tying indicates the degree to which a first observation in a first sub-event contributes to the remaining sub-events;
  
  (b) assigning a new observation from adaptation data of a new speaker to one of the sub-events;
  
  (c) populating each of the sub-events with a transformed version of the observation contained in the assigned sub-event based on the degree of lateral tying computed between each pair of sub-events;
  
  (d) adapting the reference models that correspond to the populated sub-events to account for speech pattern idiosyncrasies of the new speaker, thereby reducing the error rate of the speech recognition system.
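Steps (b) through (d) of the independent claims can be pictured with a deliberately simplified Python sketch. It assumes step (a) has already produced, for each ordered pair of tied sub-events, a callable that maps a vector from one sub-event into the other. Everything else is a stand-in chosen for illustration: reference models are reduced to one mean vector per sub-event, assignment uses the nearest mean rather than an alignment to allophones, and adaptation is a fixed-weight interpolation. None of these choices is stated in the claims.

```python
import numpy as np

def adapt_means(means, transforms, neighborhoods, observation, weight=0.5):
    """Toy version of steps (b)-(d) for a single new observation.

    means         : (n, d) array, one reference mean per sub-event (stand-in model)
    transforms    : dict {(i, j): callable} from step (a), hypothetical container
    neighborhoods : dict {i: [j, ...]} of sub-events tied to E_i
    weight        : hypothetical interpolation weight for the new data
    """
    means = np.asarray(means, dtype=float)
    observation = np.asarray(observation, dtype=float)

    # (b) assign the observation to a sub-event (nearest mean, an assumption)
    i = int(np.argmin(np.linalg.norm(means - observation, axis=1)))

    # (c) populate tied sub-events with transformed copies of the observation
    populated = {i: observation}
    for j in neighborhoods.get(i, []):
        populated[j] = transforms[(i, j)](observation)

    # (d) adapt every populated sub-event's model toward its new data
    adapted = means.copy()
    for j, z in populated.items():
        adapted[j] = (1.0 - weight) * means[j] + weight * z
    return i, adapted

assigned, new_means = adapt_means(
    [[0.0, 0.0], [10.0, 10.0]],
    {(0, 1): lambda z: z + 10.0},
    {0: [1]},
    [1.0, 1.0],
)
```

The point of the sketch is that a single observation assigned to one sub-event also moves the models of the tied sub-events, which is what lets sparse adaptation data reach models that received no direct observations.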

28. A method of performing speaker adaptation in a speech recognition system which includes a set of reference models corresponding to speech data, the speech data represented by a first and second reference model, the first reference model comprising a first sub-event and second reference model comprising a second sub-event, wherein the first sub-event is well populated with a plurality of feature vectors and the second sub-event is sparsely populated with feature vectors, the method comprising the steps of:
- (a) computing a transformation between the first and second sub-event to indicate a degree of lateral tying between the first and second sub-event;
  
  (b) assigning a new feature vector extracted from adaptation data to the first sub-event;
  
  (c) if the computed transformation indicates a degree of lateral tying that surpasses a desired threshold, applying the computed transformation to the feature vectors in the first sub-event to transform the feature vectors into the space of the second sub-event to thereby populate the second sub-event with feature vectors from the first sub-event; and
  
  (d) adapting the reference models that correspond to the populated sub-events to account for speech pattern idiosyncrasies of the new speaker, thereby reducing the error rate of the speech recognition system.
- View Dependent Claims (29, 30)
- - 29. A method as in claim 28 wherein step (c) further includes the steps of:
    - (c1) segregating each feature vector in the sub-events according to which speaker the feature vector originated from;
      
      (c2) for each speaker s, determining an anchor point of the feature vectors in each of the sub-events i, given by X_i (s);
      (c3) defining the degree of tying between sub-events E₁ and E₂ is based on a cross-covariance matrix between E₁ and E₂ by ##EQU12## where the summation is over all the speakers; and
      
      (c4) computing a transformation matrix between sub-events E₁ and E₂ using the quantity
      
      $\Gamma_{12} = U_{12} V_{12}^{T}$ (6)
      
      to represent the least squares rotation that must be applied to X₁ (s) in sub-event E₁ in order to obtain an estimate of X₂ (s) in sub-event E₂.
  - 30. A method as in claim 29 wherein step (b) further includes the steps of:
    - (b1) acquiring the speech from the new user in the form of input speech signals;
      
      (b2) transforming the input speech signals into a physical representation;
      
      (b3) extracting acoustic features from the physical representation and generating a series of feature vectors therefrom; and
      
      (b4) determining what allophone each feature vector represents and assigning each feature vector to the sub-event representing the corresponding allophone.
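The least squares rotation described in claim 29 has the form of the classical orthogonal Procrustes solution, which the following Python sketch illustrates. The cross-covariance formula is not reproduced in this record, so the sketch fixes one convention as an assumption: per-speaker anchor points are stacked as rows, the cross-covariance is K = Σ_s X₂(s) X₁(s)ᵀ, and its singular value decomposition K = U Σ Vᵀ gives Γ = U Vᵀ, so that Γ X₁(s) approximates X₂(s). The function name and array layout are hypothetical.

```python
import numpy as np

def lateral_rotation(anchors_1, anchors_2):
    """Least squares rotation taking anchors of E_1 onto anchors of E_2.

    anchors_k : (num_speakers, dim) array; row s is the anchor point X_k(s)
    Returns (gamma, singular_values) with gamma @ X_1(s) ~ X_2(s).
    """
    A1 = np.asarray(anchors_1, dtype=float)
    A2 = np.asarray(anchors_2, dtype=float)
    K = A2.T @ A1                       # sum over speakers of X_2(s) X_1(s)^T
    U, sing, Vt = np.linalg.svd(K)
    gamma = U @ Vt                      # orthogonal factor of K
    return gamma, sing

# Toy check: anchors of E_2 are an exact orthogonal transform of those of E_1.
rng = np.random.default_rng(0)
A1 = rng.normal(size=(20, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
A2 = A1 @ Q.T                           # each row: X_2(s) = Q X_1(s)
gamma, sing = lateral_rotation(A1, A2)
```

With the opposite ordering of the product inside K, the same U Vᵀ would instead act on row vectors from the right, so the convention has to be fixed before the rotation is applied.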

Current Assignee: Apple Inc.
Original Assignee: Apple Computer Incorporated (Apple Inc.)
Inventors: Bellegarda, Jerome R.; Butzberger, John W.; Chow, Yen-Lu
Primary Examiner(s): Hafiz, Tariq R.
Application Number: US08/600,859
Time in Patent Office: 784 Days
Field of Search: 395/2.1, 395/2.4, 395/2.45, 395/2.46, 395/2.47-2.49, 395/2.55-2.59
US Class Current: 704/250
CPC Class Codes: G10L 15/065 Adaptation
