Communication Device Having Speaker Independent Speech Recognition

US 20070203701A1
Filed: 02/13/2007
Published: 08/30/2007
Est. Priority Date: 02/14/2006
Status: Abandoned Application

Abstract

Techniques for performing speech recognition in a communication device with a voice dialing function are provided. Upon receipt of a voice input in a speech recognition mode, input feature vectors are generated from the voice input. A likelihood vector sequence is then calculated from the input feature vectors, indicating the likelihood over time of an utterance of phonetic units. In a warping operation, the likelihood vector sequence is compared to phonetic word models, and word model match likelihoods are calculated for those word models. After the best-matching word model is determined, a name is synthesized from it and the number corresponding to that name is dialed in a dialing operation.
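
The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation: the Gaussian-style scoring in `likelihood_vector`, the stay-or-advance dynamic-programming alignment in `warp_score`, and all names and toy data are assumptions introduced here. The source only states that likelihood vectors are warped to phonetic word models and that the best match is selected.

```python
import math

def likelihood_vector(feature, prototypes):
    """Score one feature vector against one prototype per phonetic unit.

    Assumption: an isotropic Gaussian-style score, normalised to sum to 1.
    """
    scores = [math.exp(-sum((f - p) ** 2 for f, p in zip(feature, proto)))
              for proto in prototypes]
    total = sum(scores)
    return [s / total for s in scores]

def warp_score(likelihoods, word_model):
    """Align frames to word model positions by dynamic programming.

    Each frame either stays on the current model position or advances by one.
    The local score is the dot product of the frame's likelihood vector and
    the word model vector. Returns the best path score divided by frame count.
    """
    n_frames, n_pos = len(likelihoods), len(word_model)
    neg_inf = float("-inf")

    def local(t, n):
        return sum(l * w for l, w in zip(likelihoods[t], word_model[n]))

    best = [[neg_inf] * n_pos for _ in range(n_frames)]
    best[0][0] = local(0, 0)
    for t in range(1, n_frames):
        for n in range(n_pos):
            prev = best[t - 1][n]
            if n > 0:
                prev = max(prev, best[t - 1][n - 1])
            if prev > neg_inf:
                best[t][n] = prev + local(t, n)
    return best[-1][-1] / n_frames

def recognise(features, prototypes, word_models):
    """Return the best matching word model name and all match scores."""
    likelihoods = [likelihood_vector(f, prototypes) for f in features]
    scores = {name: warp_score(likelihoods, model)
              for name, model in word_models.items()}
    return max(scores, key=scores.get), scores

# Toy data (hypothetical): three phonetic units in a 2-D feature space.
prototypes = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
word_models = {
    "ab": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "ac": [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
}
features = [[0.1, 0.0], [0.0, 0.1], [2.9, 0.1], [3.0, 0.0]]
best_name, match_scores = recognise(features, prototypes, word_models)
```

Normalising the path score by the number of frames is a design choice made here so that utterances of different lengths stay comparable; the source does not specify a normalisation.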

36 Claims

1. A method for performing speech recognition in a communication device with a voice dialing function, comprising:
- a) entering a speech recognition mode;
  
  b) upon receipt of a voice input in the speech recognition mode, generating input feature vectors from the voice input;
  
  c) calculating a likelihood vector sequence from the input feature vectors indicating a likelihood in time of an utterance of phonetic units;
  
  d) warping the likelihood vector sequence to phonetic word models;
  
  e) calculating word model match likelihoods from the phonetic word models; and
  
  f) determining a best matching one of the word model matches as a recognition result.
- - 2. The method of claim 1, wherein the phonetic units serve as word sub-models for the phonetic word models, each of the phonetic word models includes a sequence of word model vectors, and a component of the word model vector indicates an expectation of finding a respective one of the phonetic units at a respective position of the phonetic word model.
  - 3. The method of claim 1, wherein each of the likelihood vectors is calculated from the respective input feature vector using an internal representation of a chosen language.
  - 4. The method of claim 3, wherein the internal language representation includes likelihood distributions calculated from representative ones of the feature vectors of the phonetic units indicating a statistical distribution of the representative feature vectors in feature space.
  - 5. The method of claim 4, wherein the calculation of the likelihood distributions is carried out in a registration mode, comprising:
    - recording of voice input samples spoken by different speakers in a noise free environment;
      
      selecting parts of the voice input samples corresponding to the phonetic units required in the chosen language; and
      
      generating of the representative feature vectors from the selected parts.
  - 6. The method of claim 4, further comprising:
    - determining a speaker characteristic adaptation vector for the present user and updating the likelihood distributions by reflecting the speaker characteristic adaptation vector in the representative feature vectors.
  - 7. The method of claim 4, further comprising:
    - measuring noise in the communication device environment;
      
      processing a noise feature vector from the measured noise; and
      
      updating the likelihood distributions by associating the noise feature vector with the representative feature vectors.
  - 8. The method of claim 7, wherein the noise feature vector, the speaker characteristic adaptation vector, and the representative feature vectors are spectral vectors, and updating the likelihood distributions comprises:
    - multiplying the speaker characteristic adaptation vector with each of the representative feature vectors to generate first modified representative feature vectors;
      
      adding to the first modified representative feature vectors the noise feature vector to generate second modified representative feature vectors; and
      
      determining a statistical distribution of the second modified representative feature vectors in feature space as updated likelihood distributions.
  - 9. The method of claim 7, wherein the input feature vectors, the noise feature vector, the speaker characteristic adaptation vector, and the representative feature vectors are spectral vectors, the noise feature vector and the representative feature vectors have non-logarithmic components, and the input feature vectors and the speaker characteristic adaptation vector have logarithmic components, and updating the likelihood distribution comprises:
    - adding the noise feature vector to each of the representative feature vectors to generate first modified representative feature vectors;
      
      logarithmizing each component of the first modified representative feature vectors;
      
      adding to the first modified and logarithmized representative feature vectors the speaker characteristic adaptation vector to generate second modified representative feature vectors; and
      
      determining a statistical distribution of the second modified representative feature vectors in feature space as likelihood distribution.
  - 10. The method of claim 7, wherein determining the speaker characteristic adaptation vector comprises calculating a speaker characteristic adaptation vector for each of the representative feature vectors, further comprising:
    - assigning a best matching phonetic unit to each of the input feature vectors;
      
      calculating a difference vector between each of the input feature vectors and the respective representative feature vector; and
      
      calculating a phoneme specific averaged difference vector as speaker characteristic adaptation vector for each of the respective representative feature vectors.
  - 11. The method of claim 10, wherein the speaker characteristic adaptation vector is averaged over the phoneme specific averaged difference vectors.
  - 12. The method of claim 1, further comprising:
    - synthesizing a name from the best matching word model and dialing a number corresponding to that name.
  - 13. The method of claim 1, wherein the phonetic word models are generated from names in a phone book as sequences of the word sub-models using a grapheme-to-phoneme translation.
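
The update recited in claim 8 (multiply by the speaker characteristic adaptation vector, add the noise feature vector, then re-estimate the distribution) could look roughly like the sketch below. It assumes linear-domain spectral vectors and summarises the "statistical distribution" as a per-component mean and variance; that distribution form, the function names, and the sample values are assumptions made for illustration and are not stated in the claims.

```python
from statistics import mean, pvariance

def update_representatives(reps, speaker, noise):
    """Apply the claim 8 update to the representative vectors of one phonetic unit.

    reps    : list of representative spectral vectors (linear magnitudes)
    speaker : speaker characteristic adaptation vector (per-component gain)
    noise   : noise feature vector (linear magnitudes)
    """
    # First modified vectors: component-wise multiplication with the speaker vector.
    first = [[s * r for s, r in zip(speaker, rep)] for rep in reps]
    # Second modified vectors: component-wise addition of the noise vector.
    return [[f + n for f, n in zip(vec, noise)] for vec in first]

def diagonal_statistics(vectors):
    """Per-component mean and population variance (assumed distribution form)."""
    columns = list(zip(*vectors))
    return [mean(c) for c in columns], [pvariance(c) for c in columns]

# Hypothetical sample values for a single phonetic unit.
reps = [[1.0, 2.0], [3.0, 4.0]]
speaker = [2.0, 0.5]
noise = [0.5, 0.5]

updated = update_representatives(reps, speaker, noise)
means, variances = diagonal_statistics(updated)
```

With these sample values the updated vectors are `[[2.5, 1.5], [6.5, 2.5]]`, giving means `[4.5, 2.0]` and variances `[4.0, 0.25]`.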

14. An apparatus for performing speech recognition in a communication device with a voice dialing function, comprising:
- a first memory configured to store word models of names in a phone book;
  
  a vocoder configured to generate input feature vectors from a voice input in a speech recognition mode;
  
  a speech recognition component including (a) a likelihood vector calculation device configured to calculate a likelihood vector sequence from the input feature vectors indicating a likelihood in time of an utterance of phonetic units, (b) a warper configured to warp the likelihood vector sequence to the word models, (c) a calculation device configured to calculate word model match likelihoods from the word models, and (d) a determining device configured to determine a best matching word model as a recognition result; and
  
  a controller configured to initiate the speech recognition mode.
- - 15. The apparatus of claim 14, wherein each of the likelihood vectors is calculated from the respective input feature vector using a likelihood distribution calculated from representative feature vectors of the phonetic units, and the apparatus further comprises:
    - a microphone configured to record the voice input and environmental noise as noise input;
      
      wherein the vocoder processes a noise feature vector from the noise input; and
      
      wherein the speech recognition component updates the likelihood distribution by reflecting the noise feature vector in the representative feature vectors.
  - 16. The apparatus of claim 14, wherein each of the likelihood vectors is calculated from the respective input feature vector using a likelihood distribution calculated from representative feature vectors of the phonetic units, and the apparatus further comprises:
    - a speaker characteristic adaptation device configured to determine a speaker characteristic adaptation vector for the present user and to update the likelihood distribution by reflecting the speaker characteristic adaptation vector in the representative feature vectors.
  - 17. The apparatus of claim 16, wherein the noise feature vector, the speaker characteristic adaptation vector, and the representative feature vectors are spectral vectors and the speaker characteristic adaptation device is configured to update the likelihood distribution by:
    - multiplying the speaker characteristic adaptation vector with each of the representative feature vectors to generate first modified representative feature vectors;
      
      adding to the first modified representative feature vectors the noise feature vector to generate second modified representative feature vectors; and
      
      determining a statistical distribution of the second modified representative feature vectors in feature space as likelihood distribution.
  - 18. The apparatus of claim 16, wherein the speaker characteristic adaptation device is configured to determine or update the speaker characteristic adaptation vector by:
    - assigning a best matching phonetic unit to each of the input feature vectors;
      
      calculating a difference vector between each of the input feature vectors and the respective representative feature vector;
      
      averaging over the difference vectors per phonetic unit and generating a phoneme specific averaged difference vector; and
      
      averaging over the phoneme specific averaged difference vectors.
  - 19. The apparatus of claim 14, further comprising:
    - a synthesizer configured to synthesize a name from the best matching word model; and
      
      wherein the controller dials a number in the phone book corresponding to the name synthesized from the best matching word model.
  - 20. The apparatus as claimed in claim 19, wherein:
    - the warper is configured to determine a list of best matching word models;
      
      the synthesizer is configured to synthesize a name for each of the best matching word models in the list;
      
      the apparatus further comprises an output device configured to output the synthesized names; and
      
      a selecting device configured to select one of the output names by the user; and
      
      the controller dials the number in the phone book corresponding to the selected name.
  - 21. The apparatus as claimed in claim 20, wherein:
    - the output device comprises a loudspeaker of the communication device that outputs control commands from the controller;
      
      the microphone records the environmental noise while the loudspeaker is outputting; and
      
      the apparatus further comprises an interference elimination device configured to remove the loudspeaker interference from the recorded noise to generate a noise input.
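
The four steps of claim 18 (assign a phonetic unit to each input feature vector, take difference vectors, average per phonetic unit, then average over phonemes) can be illustrated as below. The sketch assumes one representative vector per phonetic unit and uses squared Euclidean distance as the "best matching" criterion; neither choice is specified in the claims, and the function names and data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def speaker_adaptation_vector(features, representatives):
    """Estimate per-unit and overall speaker adaptation vectors.

    features        : list of input feature vectors
    representatives : dict mapping phonetic unit -> representative vector
    Returns (per-unit averaged difference vectors, average over units).
    """
    sums = {}  # unit -> (running sum of difference vectors, count)
    for feat in features:
        # Step 1: assign the best matching unit (nearest representative, assumed).
        unit = min(representatives,
                   key=lambda u: squared_distance(feat, representatives[u]))
        # Step 2: difference between the input and its representative vector.
        diff = [f - r for f, r in zip(feat, representatives[unit])]
        total, count = sums.get(unit, ([0.0] * len(feat), 0))
        sums[unit] = ([t + d for t, d in zip(total, diff)], count + 1)
    if not sums:
        return {}, []
    # Step 3: phoneme-specific averaged difference vectors.
    per_unit = {u: [t / c for t in total] for u, (total, c) in sums.items()}
    # Step 4: average over the phoneme-specific vectors.
    dims = len(next(iter(per_unit.values())))
    overall = [sum(v[i] for v in per_unit.values()) / len(per_unit)
               for i in range(dims)]
    return per_unit, overall

# Hypothetical data: two phonetic units and three observed frames.
representatives = {"a": [0.0, 0.0], "o": [10.0, 10.0]}
features = [[1.0, 1.0], [1.0, 3.0], [12.0, 10.0]]
per_unit, overall = speaker_adaptation_vector(features, representatives)
```

Averaging per unit first and then across units, as the claim orders the steps, keeps frequently spoken phonemes from dominating the overall vector.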

22. A computer program product comprising a computer useable medium having a computer program logic recorded thereon for controlling at least one processor, the computer program logic comprising:
- computer program code means for entering a speech recognition mode;
  
  computer program code means for generating input feature vectors from a voice input upon receipt of the voice input in the speech recognition mode;
  
  computer program code means for calculating a likelihood vector sequence from the input feature vectors indicating a likelihood in time of an utterance of phonetic units;
  
  computer program code means for warping the likelihood vector sequence to phonetic word models;
  
  computer program code means for calculating word model match likelihoods from the phonetic word models; and
  
  computer program code means for determining a best matching one of the word model matches as a recognition result.

23. A memory device comprising computer program code, which when executed on a communication device enables the communication device to carry out a method comprising:
- a) entering a speech recognition mode;
  
  b) upon receipt of a voice input in the speech recognition mode, generating input feature vectors from the voice input;
  
  c) calculating a likelihood vector sequence from the input feature vectors indicating a likelihood in time of an utterance of phonetic units;
  
  d) warping the likelihood vector sequence to phonetic word models;
  
  e) calculating word model match likelihoods from the phonetic word models; and
  
  f) determining a best matching one of the word model matches as a recognition result.

24. A computer-readable medium containing instructions for controlling at least one processor of a communications device, by a method comprising:
- a) entering a speech recognition mode;
  
  b) upon receipt of a voice input in the speech recognition mode, generating input feature vectors from the voice input;
  
  c) calculating a likelihood vector sequence from the input feature vectors indicating a likelihood in time of an utterance of phonetic units;
  
  d) warping the likelihood vector sequence to phonetic word models;
  
  e) calculating word model match likelihoods from the phonetic word models; and
  
  f) determining a best matching one of the word model matches as a recognition result.
- - 25. The computer-readable medium controlling the processor using the method of claim 24, wherein the phonetic units serve as word sub-models for the phonetic word models, each of the phonetic word models includes a sequence of word model vectors, and a component of the word model vector indicates an expectation of finding a respective one of the phonetic units at a respective position of the phonetic word model.
  - 26. The computer-readable medium controlling the processor using the method of claim 24, wherein each of the likelihood vectors is calculated from the respective input feature vector using an internal representation of a chosen language.
  - 27. The computer-readable medium controlling the processor using the method of claim 26, wherein the internal language representation includes likelihood distributions calculated from representative ones of the feature vectors of the phonetic units indicating a statistical distribution of the representative feature vectors in feature space.
  - 28. The computer-readable medium controlling the processor using the method of claim 27, wherein the calculation of the likelihood distributions is carried out in a registration mode, comprising:
    - recording of voice input samples spoken by different speakers in a noise free environment;
      
      selecting parts of the voice input samples corresponding to the phonetic units required in the chosen language; and
      
      generating of the representative feature vectors from the selected parts.
  - 29. The computer-readable medium controlling the processor using the method of claim 28, further comprising:
    - determining a speaker characteristic adaptation vector for the present user and updating the likelihood distributions by reflecting the speaker characteristic adaptation vector in the representative feature vectors.
  - 30. The computer-readable medium controlling the processor using the method of claim 28, further comprising:
    - measuring noise in the communication device environment;
      
      processing a noise feature vector from the measured noise; and
      
      updating the likelihood distributions by associating the noise feature vector with the representative feature vectors.
  - 31. The computer-readable medium controlling the processor using the method of claim 30, wherein the noise feature vector, the speaker characteristic adaptation vector, and the representative feature vectors are spectral vectors, and updating the likelihood distributions comprises:
    - multiplying the speaker characteristic adaptation vector with each of the representative feature vectors to generate first modified representative feature vectors;
      
      adding to the first modified representative feature vectors the noise feature vector to generate second modified representative feature vectors; and
      
      determining a statistical distribution of the second modified representative feature vectors in feature space as updated likelihood distributions.
  - 33. The computer-readable medium controlling the processor using the method of claim 30, wherein determining the speaker characteristic adaptation vector comprises calculating a speaker characteristic adaptation vector for each of the representative feature vectors, further comprising:
    - assigning a best matching phonetic unit to each of the input feature vectors;
      
      calculating a difference vector between each of the input feature vectors and the respective representative feature vector; and
      
      calculating a phoneme specific averaged difference vector as speaker characteristic adaptation vector for each of the respective representative feature vectors.
  - 34. The computer-readable medium controlling the processor using the method of claim 33, wherein the speaker characteristic adaptation vector is averaged over the phoneme specific averaged difference vectors.
  - 35. The computer-readable medium controlling the processor using the method of claim 24, further comprising:
    - synthesizing a name from the best matching word model and dialing a number corresponding to that name.
  - 36. The computer-readable medium controlling the processor using the method of claim 24, wherein the phonetic word models are generated from names in a phone book as sequences of the word sub-models using a grapheme-to-phoneme translation.
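
Claims 25 and 36 describe word models as sequences of word model vectors built from phone book names by grapheme-to-phoneme translation. A toy sketch of that construction follows. The phonetic inventory, the one-letter-per-phoneme table `TOY_G2P`, the one-hot expectation values, and the phone book entries are all hypothetical; a real grapheme-to-phoneme translation would need language-specific rules for multi-letter graphemes and context.

```python
# Toy phonetic inventory and letter-to-phoneme table (both hypothetical).
PHONETIC_UNITS = ["a", "n", "b", "o"]
TOY_G2P = {"a": "a", "n": "n", "b": "b", "o": "o"}

def graphemes_to_phonemes(name):
    """Translate a name letter by letter, skipping letters not in the toy table."""
    return [TOY_G2P[ch] for ch in name.lower() if ch in TOY_G2P]

def word_model(name):
    """Build a sequence of word model vectors for a name.

    Each vector has one component per phonetic unit. Here the expected unit
    at a position gets 1.0 and all others 0.0 (an assumed, simplest-case
    encoding of the 'expectation' described in the claims).
    """
    model = []
    for phoneme in graphemes_to_phonemes(name):
        vector = [0.0] * len(PHONETIC_UNITS)
        vector[PHONETIC_UNITS.index(phoneme)] = 1.0
        model.append(vector)
    return model

# Hypothetical phone book with fictional numbers.
phone_book = {"Anna": "555-0100", "Bob": "555-0101"}
word_models = {name: word_model(name) for name in phone_book}
```

Because the models are derived from the spelled names, no per-name voice enrollment is needed, which fits the speaker-independent approach named in the title.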

32. The computer-readable medium controlling the processor using the method of claim, wherein the input feature vectors, the noise feature vector, the speaker characteristic adaptation vector, and the representative feature vectors are spectral vectors, the noise feature vector and the representative feature vectors have non-logarithmic components, and the input feature vectors and the speaker characteristic adaptation vector have logarithmic components, and updating the likelihood distribution comprises:
- adding the noise feature vector to each of the representative feature vectors to generate first modified representative feature vectors;
  
  logarithmizing each component of the first modified representative feature vectors;
  
  adding to the first modified and logarithmized representative feature vectors the speaker characteristic adaptation vector to generate second modified representative feature vectors; and
  
  determining a statistical distribution of the second modified representative feature vectors in feature space as likelihood distribution.
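
The log-domain variant recited above (add noise in the non-logarithmic domain, take the logarithm of each component, then add the logarithmic speaker characteristic adaptation vector) might be sketched as follows. The natural logarithm, the per-component mean used as a stand-in for the resulting distribution, and all names and values are assumptions; the claim does not specify a logarithm base or a distribution form.

```python
import math
from statistics import mean

def update_log_domain(reps_linear, noise_linear, speaker_log):
    """Apply the three update steps to the representative vectors of one unit.

    reps_linear  : representative spectral vectors, non-logarithmic components
    noise_linear : noise feature vector, non-logarithmic components
    speaker_log  : speaker characteristic adaptation vector, logarithmic components
    All linear components must be positive so the logarithm is defined.
    """
    # Step 1: add the noise vector to each representative vector.
    first = [[r + n for r, n in zip(rep, noise_linear)] for rep in reps_linear]
    # Step 2: logarithmize each component (natural log assumed).
    logged = [[math.log(x) for x in vec] for vec in first]
    # Step 3: add the log-domain speaker adaptation vector.
    return [[l + s for l, s in zip(vec, speaker_log)] for vec in logged]

# Hypothetical sample values.
reps_linear = [[1.0, 3.0], [3.0, 7.0]]
noise_linear = [1.0, 1.0]
speaker_log = [0.5, -0.5]

second_modified = update_log_domain(reps_linear, noise_linear, speaker_log)
component_means = [mean(column) for column in zip(*second_modified)]
```

Adding noise before taking the logarithm reflects that noise combines additively with speech in the linear spectral domain, while the speaker term, held in logarithmic form, is applied additively after the logarithm.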

Current Assignee: Intellectual Ventures Fund 21 LLC (Intellectual Ventures LLC)
Original Assignee: Intellectual Ventures Fund 21 LLC (Intellectual Ventures LLC)
Inventors: RUWISCH, Dietmar
Application Number: US11/674,424
Publication Number: US 20070203701A1

Field of Search
US Class Current: 704/254
CPC Class Codes:
G10L 15/065   Adaptation
G10L 15/12   using dynamic programming t...
G10L 15/187   Phonemic context, e.g. pron...
G10L 2015/025   Phonemes, fenemes or fenone...
H04M 1/271   controlled by voice recogni...
