Method of speech recognition
First Claim
1. A method of speech recognition, comprising the steps of:
- generating "m" feature parameters every frame from reference speech which is spoken by at least one speaker and which represents recognition-object words, where "m" denotes a preset integer;
previously generating "n" types of standard patterns of a set of preset phonemes on the basis of speech data of a plurality of speakers, where "n" denotes a preset integer;
executing a matching between the feature parameters of the reference speech and each of the standard patterns, and generating a vector of "n" reference similarities between the feature parameters of the reference speech and each of the standard patterns every frame;
generating temporal sequences of the reference similarity vectors of respective frames, the reference similarity vector sequences corresponding to the recognition-object words respectively;
previously registering the reference similarity vector sequences as dictionary similarity vector sequences;
analyzing input speech to be recognized, and generating "m" feature parameters from the input speech;
executing a matching between the feature parameters of the input speech and the standard patterns, and generating a vector of "n" input-speech similarities between the feature parameters of the input speech and the standard patterns every frame;
generating a temporal sequence of the input-speech similarity vectors of respective frames;
collating the input-speech similarity vector sequence with the dictionary similarity-vector sequences; and
recognizing the input speech based on a result of the collating step.
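A minimal sketch of the per-frame matching step recited above, in Python. The similarity measure (exponential of negative squared Euclidean distance) and the plain-list data shapes are illustrative assumptions; the claim does not fix either.

```python
import math

def similarity_vector(frame, standard_patterns):
    # One m-dimensional feature frame matched against n phoneme standard
    # patterns (each an m-dimensional reference vector) yields an
    # n-dimensional similarity vector for that frame.
    sims = []
    for pattern in standard_patterns:
        d2 = sum((f - p) ** 2 for f, p in zip(frame, pattern))
        sims.append(math.exp(-d2))  # 1.0 for a perfect match
    return sims

def similarity_vector_sequence(frames, standard_patterns):
    # One similarity vector per frame gives the temporal sequence that
    # the claim registers as a dictionary similarity vector sequence.
    return [similarity_vector(frame, standard_patterns) for frame in frames]
```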
Abstract
A set of "m" feature parameters is generated every frame from reference speech which is spoken by at least one speaker and which represents recognition-object words, where "m" denotes a preset integer. A set of "n" types of standard patterns is previously generated on the basis of speech data of a plurality of speakers, where "n" denotes a preset integer. Matching between the feature parameters of the reference speech and each of the standard patterns is executed to generate a vector of "n" reference similarities between the feature parameters of the reference speech and each of the standard patterns every frame. The reference similarity vectors of respective frames are arranged into temporal sequences corresponding to the recognition-object words respectively. The reference similarity vector sequences are previously registered as dictionary similarity vector sequences. Input speech to be recognized is analyzed to generate "m" feature parameters from the input speech. Matching between the feature parameters of the input speech and the standard patterns is executed to generate a vector of "n" input-speech similarities between the feature parameters of the input speech and the standard patterns every frame. The input-speech similarity vectors of respective frames are arranged into a temporal sequence. The input-speech similarity vector sequence is collated with the dictionary similarity vector sequences to recognize the input speech.
Claims (35)
1. (Independent claim; reproduced in full above under "First Claim".) Dependent claims 2-23 refer back to claim 1.
24. A method of speech recognition, comprising the steps of:
previously setting a set of words in consideration of phonetic environments;
making at least one speaker speak the word set, and obtaining "m" feature parameters therefrom every frame;
previously generating "n" types of standard patterns of a set of preset phonemes from speech data generated by many speakers;
executing a matching between the feature parameters and each of the standard patterns to obtain a vector of "n" similarities every frame;
generating a temporal sequence pattern from the similarity vectors;
extracting speech fragments from the temporal sequence pattern, and registering the speech fragments as a speech fragment dictionary;
generating a connection sequence of the speech fragments or a temporal sequence pattern of similarity vectors for each of speech-recognition object words, wherein the temporal sequence pattern of the similarity vectors is made by combining the speech fragments in the speech fragment dictionary;
storing the connection sequence of the speech fragments or the temporal sequence pattern of the similarity vectors into a recognition-object dictionary for each of the recognition-object words;
analyzing input speech to obtain "m" feature parameters every frame;
executing a matching between the input-speech feature parameters and each of the standard patterns to obtain a temporal sequence of vectors of "n" similarities;
performing one of first and second collating steps, wherein said first collating step comprises collating the input-speech temporal similarity vector sequence with each of the temporal sequence patterns of the similarities which are registered in respective items of the recognition-object dictionary, and said second collating step comprises collating the input-speech temporal similarity vector sequence with each of the temporal sequence patterns of the similarities which are generated according to the connection sequences of the speech fragments; and
recognizing the input speech based on a result of the collating step as performed.
Dependent claims 25-33 refer back to claim 24.
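The speech-fragment mechanism of claim 24 (registering fragments and combining them into word patterns for the recognition-object dictionary) can be sketched as follows; the fragment names and the list-of-frames representation are illustrative assumptions.

```python
def word_pattern_from_fragments(connection_sequence, fragment_dictionary):
    # Build a word's similarity-vector temporal sequence pattern by
    # concatenating registered speech fragments along the time axis.
    # Each fragment is a list of n-dimensional similarity vectors.
    pattern = []
    for fragment_name in connection_sequence:
        pattern.extend(fragment_dictionary[fragment_name])
    return pattern
```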
34. A method of speech recognition, comprising the steps of:
extracting feature parameters from input speech representing one of preset words;
calculating a set of input-speech similarities between the input-speech feature parameters and standard patterns of a set of preset phonemes;
collating the set of the input-speech similarities with sets of predetermined reference similarities which correspond to the preset words respectively; and
recognizing the input speech based on a result of said collating step.
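Claim 34 drops the temporal alignment and collates a single set of similarities per word. A sketch under the assumption that the collation measure is Euclidean distance, which the claim leaves open:

```python
import math

def collate(input_similarities, reference_sets):
    # Compare one set of n input-speech similarities with each preset
    # word's predetermined reference set; the nearest word wins.
    best_word, best_dist = None, float("inf")
    for word, reference in reference_sets.items():
        d = math.dist(input_similarities, reference)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word
```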
-
35. A method of speech recognition, comprising the steps of:
extracting feature parameters from input speech representing one of preset words;
calculating a set of input-speech similarities between the input-speech feature parameters and standard patterns of a set of preset phonemes;
calculating a set of time-domain variations in the input-speech similarities;
collating the set of the input-speech similarities with sets of predetermined reference similarities which correspond to the preset words respectively;
collating the set of the time-domain variations in the input-speech similarities with sets of predetermined reference time-domain variations which correspond to the preset words respectively; and
recognizing the input speech based on results of said two collating steps.
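The "time-domain variations" of claim 35 can be read as frame-to-frame first differences of the similarity vectors; a sketch under that assumption:

```python
def time_domain_variations(similarity_sequence):
    # First difference of consecutive n-dimensional similarity vectors;
    # for T frames this yields T-1 variation vectors.
    return [[b - a for a, b in zip(prev, curr)]
            for prev, curr in zip(similarity_sequence, similarity_sequence[1:])]
```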
Specification