High quality speech reconstruction for a dialog method and system

US 20070129946A1
Filed: 12/06/2005
Published: 06/07/2007
Est. Priority Date: 12/06/2005
Status: Abandoned Application

Abstract

An electronic device (400) for speech dialog includes functions that receive (405, 205) a speech phrase that includes an instantiated variable (315), generate pitch and voicing characteristics (330) of the instantiated variable, and perform voice recognition (410, 220) of the instantiated variable to determine a most likely set of recognition acoustic states (335). A trained map (358) is established (115) that maps recognition feature vectors derived from training speech (105) to synthesis feature vectors derived from the same training speech (110). Recognition feature vectors that represent the most likely set of recognition acoustic states for the recognized instantiated variable are converted to a most likely set of synthesis acoustic states (420) in accordance with the map. The electronic device may generate (421, 440, 445) a synthesized value of the instantiated variable using the most likely set of synthesis acoustic states and the pitch and voicing characteristics extracted from the instantiated variable.
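
The abstract does not say how the pitch and voicing characteristics are obtained. As an illustration only, the sketch below estimates both for a single frame with normalized autocorrelation, a common textbook approach that is assumed here and not taken from the application. The function name `pitch_and_voicing`, the 60 to 400 Hz search range, and the 0.5 voicing threshold are all hypothetical choices.

```python
import math

def pitch_and_voicing(frame, sample_rate, fmin=60.0, fmax=400.0, threshold=0.5):
    """Return (f0_hz, voiced) for one frame via normalized autocorrelation.

    Assumption: this estimator is illustrative; the patent application
    does not specify the extraction method.
    """
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0, False  # silent frame: unvoiced, no pitch

    # Search only lags that correspond to plausible pitch periods.
    lag_min = max(1, int(sample_rate / fmax))
    lag_max = min(int(sample_rate / fmin), len(frame) - 1)

    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        score = corr / energy
        if score > best_score:
            best_lag, best_score = lag, score

    # A strong autocorrelation peak is treated as evidence of voicing.
    voiced = best_lag > 0 and best_score >= threshold
    return (sample_rate / best_lag if voiced else 0.0), voiced
```

Running this once per frame of the instantiated variable would give a pitch contour and a voiced/unvoiced flag per frame, which is the kind of side information the abstract says is carried around the recognizer and reused at synthesis time.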

31 Citations

20 Claims

1. A method for speech dialog, comprising:
- receiving an input speech phrase that includes an instantiated variable;
  
  extracting pitch and voicing characteristics for the instantiated variable;
  
  performing voice recognition of the instantiated variable to determine a most likely set of recognition acoustic states;
  
  converting the most likely set of recognition acoustic states to a most likely set of synthesis acoustic states; and
  
  generating a synthesized value of the instantiated variable using the most likely set of synthesis acoustic states and the extracted pitch and voicing characteristics.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
- - 2. The method for speech dialog according to claim 1, wherein said performing voice recognition of the instantiated variable comprises:
    - extracting acoustic characteristics in the form of recognition feature vectors of the instantiated variable; and
      
      comparing the extracted acoustic characteristics to a mathematical model of stored lookup values to determine a most likely set of the extracted recognition feature vectors representing the most likely set of recognition acoustic states.
  - 3. The method for speech dialog according to claim 2, said converting further comprising:
    - deriving recognition feature vectors from training speech uttered by at least one speaker;
      
      deriving synthesis feature vectors from the training speech uttered by the at least one speaker, each of the derived synthesis vectors corresponding on a one-to-one basis to one of the derived recognition vectors;
      
      mapping a plurality of subsets of the derived recognition feature vectors to a most likely set of synthesis feature vectors, and for each of the most likely set of recognition feature vectors;
      
      determining the probability that the most likely recognition feature vector belongs in one or more of the subsets; and
      
      selecting a most likely synthesis feature vector based on the determined probability and the mapping.
  - 4. The method for speech dialog according to claim 3, wherein the extracted and derived recognition feature vectors comprise Mel-frequency cepstrum coefficients and the extracted and derived synthesis feature vectors comprise linear prediction coding compatible coefficients.
  - 5. The method for speech dialog according to claim 1, wherein said generating the synthesized value of the instantiated variable is performed when a metric of the most likely set of recognition acoustic states meets a criterion, and further comprising presenting an acoustically stored out-of-vocabulary response phrase when the metric of the most likely set of recognition acoustic states fails to meet the criterion.
  - 6. The method for speech dialog according to claim 1, wherein the speech phrase further includes a non-variable segment that is associated with the instantiated variable, further comprising:
    - performing voice recognition of the non-variable segment; and
      
      presenting an acoustically stored response phrase based on the recognized non-variable segment.
  - 7. The method for speech dialog according to claim 3, wherein said mapping further comprises creating the subsets of extracted recognition feature vectors through vector quantization;
    - establishing a vector quantization table comprising a plurality of entries, each of the entries comprising a centroid for a different one of the subsets; and
      
      determining the most likely synthesis feature vector for each of the entries, comprising:
      
      selecting an appropriate one of the derived recognition feature vectors from the subset corresponding to each entry; and
      
      associating the entry with the derived synthesis feature vector that corresponds to the appropriate recognition feature vector on a one-to-one basis.
  - 8. The method for speech dialog according to claim 7, wherein the selected appropriate one of the derived recognition feature vectors is the derived recognition feature vector of the subset that is closest to the centroid.
  - 9. The method for speech dialog according to claim 3, wherein said mapping further comprises:
    - modeling the derived recognition feature vectors such that each of the subsets is a statistical distribution characterized by a mean vector, a covariance matrix and a non-negative weight; and
      
      determining the most likely synthesis feature vector that corresponds to each of the most likely set of recognition feature vectors based on the probability that each of the set of most likely recognition feature vectors is in any one of the subsets.
  - 10. The method for speech dialog according to claim 9, wherein said determining the most likely synthesis feature vector further comprises:
    - computing a weight for each of the set of most likely recognition feature vectors, each weight corresponding to the probability the most likely synthesis feature vector is in that one of the subsets; and
      
      applying the computed weights for each of the subsets to the derived synthesis feature vectors each of which corresponds to the derived recognition feature vectors comprising each of the subsets to obtain the converted most likely synthesis feature vector.
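
The vector quantization mapping of claims 7 and 8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicants' implementation: the centroids are assumed to have been computed beforehand (for example by k-means), distance is assumed to be Euclidean, and the names `nearest`, `build_vq_table`, and `convert_vq` are hypothetical.

```python
import math

def nearest(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance, assumed)."""
    return min(range(len(centroids)), key=lambda k: math.dist(vec, centroids[k]))

def build_vq_table(rec_vectors, syn_vectors, centroids):
    """Build a table: entry index -> synthesis feature vector.

    rec_vectors[i] and syn_vectors[i] come from the same training frame
    (the one-to-one correspondence of claim 3). For each entry, the
    recognition vector closest to the centroid is selected (claim 8) and
    the entry is associated with its paired synthesis vector (claim 7).
    """
    best = {}  # entry index -> (distance to centroid, training index)
    for i, rec in enumerate(rec_vectors):
        k = nearest(rec, centroids)
        d = math.dist(rec, centroids[k])
        if k not in best or d < best[k][0]:
            best[k] = (d, i)
    return {k: syn_vectors[i] for k, (_, i) in best.items()}

def convert_vq(rec, centroids, table):
    """Map one recognition vector to a synthesis vector via the table.

    Raises KeyError if the nearest entry received no training vectors.
    """
    return table[nearest(rec, centroids)]
```

In this hard-assignment variant, the probability of claim 3 collapses to 1 for the nearest subset and 0 for the others; the statistical variant of claims 9 and 10 replaces that with soft weights.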

11. An electronic device for speech dialog, comprising:
- means for receiving an input speech phrase that includes an instantiated variable;
  
  means for extracting pitch and voicing characteristics for the instantiated variable;
  
  means for performing voice recognition of the instantiated variable to determine a most likely set of recognition acoustic states;
  
  means for converting the most likely set of recognition acoustic states to a most likely set of synthesis acoustic states; and
  
  means for generating a synthesized value of the instantiated variable using the most likely set of synthesis acoustic states and the extracted pitch and voicing characteristics.
- View Dependent Claims (12, 13, 14, 15, 16, 17)
- - 12. The electronic device for speech dialog according to claim 11, wherein said means for performing voice recognition of the instantiated variable comprises:
    - means for extracting acoustic characteristics in the form of recognition feature vectors of the instantiated variable; and
      
      means for comparing the extracted acoustic characteristics to a mathematical model of stored lookup values to determine a most likely set of the extracted recognition feature vectors representing the most likely set of recognition acoustic states.
  - 13. The electronic device for speech dialog according to claim 12, said means for converting further comprising:
    - means for deriving recognition feature vectors from training speech uttered by at least one speaker;
      
      means for deriving synthesis feature vectors from the training speech uttered by the at least one speaker, each of the derived synthesis vectors corresponding on a one-to-one basis to one of the derived recognition vectors;
      
      means for mapping a plurality of subsets of the derived recognition feature vectors to a most likely set of synthesis feature vectors, and means for determining the probability that each of the most likely set of recognition feature vectors belongs in one or more of the subsets; and
      
      means for selecting a most likely synthesis feature vector for each of the most likely set of recognition feature vectors based on the determined probability and the mapping.
  - 14. The electronic device for speech dialog according to claim 13, wherein said means for mapping further comprises means for creating the subsets of extracted recognition feature vectors through vector quantization;
    - means for establishing a vector quantization table comprising a plurality of entries, each of the entries comprising a centroid for a different one of the subsets; and
      
      means for determining the most likely synthesis feature vector for each of the entries, comprising:
      
      means for selecting an appropriate one of the derived recognition feature vectors from the subset corresponding to each entry; and
      
      means for associating the entry with the derived synthesis feature vector that corresponds to the appropriate recognition feature vector on a one-to-one basis.
  - 15. The electronic device for speech dialog according to claim 14, wherein the selected appropriate one of the derived recognition feature vectors is the derived recognition feature vector of the subset that is closest to the centroid.
  - 16. The electronic device for speech dialog according to claim 13, wherein said means for mapping further comprises:
    - means for modeling the derived recognition feature vectors such that each of the subsets is a statistical distribution characterized by a mean vector, a covariance matrix and a non-negative weight; and
      
      means for determining the most likely synthesis feature vector that corresponds to each of the most likely set of recognition feature vectors based on the probability that each of the set of recognition feature vectors is in any one of the subsets.
  - 17. The electronic device for speech dialog according to claim 11, wherein said means for determining the most likely synthesis feature vector further comprises:
    - means for computing a weight for each of the set of most likely recognition feature vectors, each weight corresponding to the probability the most likely synthesis feature vector is in that one of the subsets; and
      
      means for applying the computed weights for each of the subsets to the derived synthesis feature vectors each of which corresponds to the derived recognition feature vectors comprising each of the subsets to obtain the converted most likely synthesis feature vector.
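
The statistical mapping of claims 16 and 17 (and the parallel method claims 9 and 10) describes subsets modeled by a mean vector, a covariance matrix, and a weight, which matches a Gaussian mixture model. The sketch below is one possible reading, with several assumptions that are not in the claims: covariances are diagonal, mixture weights are strictly positive, the mixture parameters are already trained, and each subset is summarized by a posterior-weighted mean of its paired synthesis vectors. All function names are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (assumed form)."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def posteriors(x, weights, means, variances):
    """Probability that x belongs to each subset (mixture component)."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def component_synthesis_means(rec_vectors, syn_vectors, weights, means, variances):
    """Per-subset synthesis vector: posterior-weighted mean of the paired
    training synthesis vectors (an assumed summarization)."""
    dim = len(syn_vectors[0])
    sums = [[0.0] * dim for _ in weights]
    mass = [0.0] * len(weights)
    for rec, syn in zip(rec_vectors, syn_vectors):
        for k, p in enumerate(posteriors(rec, weights, means, variances)):
            mass[k] += p
            for d in range(dim):
                sums[k][d] += p * syn[d]
    return [[s / m for s in row] for row, m in zip(sums, mass)]

def convert_gmm(x, weights, means, variances, syn_means):
    """Blend the per-subset synthesis vectors using the posterior weights."""
    post = posteriors(x, weights, means, variances)
    dim = len(syn_means[0])
    return [sum(p * sm[d] for p, sm in zip(post, syn_means)) for d in range(dim)]
```

Compared with a hard table lookup, the soft weighting lets a recognition vector that falls between two subsets produce an interpolated synthesis vector, which tends to avoid abrupt jumps between frames.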

18. A media that includes a set of stored program instructions, comprising:
- a function for receiving an input speech phrase that includes an instantiated variable;
  
  a function for extracting pitch and voicing characteristics for the instantiated variable;
  
  a function for performing voice recognition of the instantiated variable to determine a most likely set of recognition acoustic states;
  
  a function for converting the most likely set of recognition acoustic states to a most likely set of synthesis acoustic states; and
  
  a function for generating a synthesized value of the instantiated variable using the most likely set of synthesis acoustic states and the extracted pitch and voicing characteristics.
- View Dependent Claims (19, 20)
- - 19. The media that includes a set of stored program instructions according to claim 18, wherein said function for performing voice recognition of the instantiated variable comprises:
    - a function for extracting acoustic characteristics in the form of recognition feature vectors of the instantiated variable; and
      
      a function for comparing the extracted acoustic characteristics to a mathematical model of stored lookup values to determine a most likely set of the extracted recognition feature vectors representing the most likely set of recognition acoustic states.
  - 20. The media that includes a set of stored program instructions according to claim 19, said function for mapping further comprising:
    - a function for modeling the derived recognition feature vectors such that each of the subsets is a statistical distribution characterized by a mean vector, a covariance matrix and a non-negative weight; and
      
      a function for determining the most likely synthesis feature vector that corresponds to each of the most likely set of recognition feature vectors based on the probability that each of the set of most likely recognition feature vectors is in any one of the subsets.
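
The five functions recited in claim 18 can be read as a small pipeline in which pitch and voicing bypass the recognizer and rejoin the converted states at synthesis. The skeleton below only shows that data flow. Every name and signature is hypothetical, and the four callables stand in for components the claims leave unspecified.

```python
from typing import Any, Callable, List, Sequence, Tuple

Frame = Sequence[float]
Prosody = Tuple[float, bool]  # (pitch in Hz, voiced flag), assumed layout

def reconstruct_variable(
    frames: List[Frame],
    extract_pitch_voicing: Callable[[Frame], Prosody],
    recognize: Callable[[List[Frame]], List[Any]],
    convert: Callable[[Any], Any],
    synthesize: Callable[[List[Any], List[Prosody]], Any],
) -> Any:
    """Data flow of claim 18 for the frames of one instantiated variable."""
    # Pitch and voicing are taken from the user's own speech, frame by frame.
    prosody = [extract_pitch_voicing(f) for f in frames]
    # Recognition yields the most likely set of recognition acoustic states.
    rec_states = recognize(frames)
    # Each recognition state is mapped to a synthesis state.
    syn_states = [convert(s) for s in rec_states]
    # Synthesis combines the converted states with the original prosody.
    return synthesize(syn_states, prosody)
```

Passing the stages in as callables keeps the sketch independent of any particular recognizer, mapping, or vocoder.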

Current Assignee: Motorola, Inc. (Motorola Solutions, Inc.)
Original Assignee: Motorola, Inc. (Motorola Solutions, Inc.)
Inventors: Ma, Changxue; Cheng, Yan; Ramabadran, Tenkasi
Application Number: US11/294,964
Publication Number: US 20070129946A1
US Class Current: 704/256
CPC Class Codes:
G10L 13/027   Concept to speech synthesis...
G10L 15/02   Feature extraction for spee...
G10L 15/142   Hidden Markov Models [HMMs]
