Multiple template speech recognition system

US 4,181,821 A
Filed: 10/31/1978
Issued: 01/01/1980
Est. Priority Date: 10/31/1978
Status: Expired

First Claim

1. A circuit for recognizing an unknown utterance as one of a set of reference words comprising means responsive to each of a plurality of utterances of a reference word for generating a set of signals representative of the features of said utterance;

means responsive to the feature signal sets of each reference word for generating at least one template signal, each template signal being representative of a group of said reference word feature signal sets;

means responsive to the unknown utterance for generating a set of signals representative of the features of said unknown utterance;

means jointly responsive to said unknown utterance feature signal set and each reference word template signal for forming a set of signals each representative of the similarity between said unknown utterance feature signal set and said reference word template signal;

characterized in that selection means (130) are responsive to the similarity signals for each reference word to select a plurality of said reference word similarity signals;

averaging means (135) are adapted to form a signal corresponding to the average of said selected similarity signals for each reference word; and

identifying apparatus (140, 145) is responsive to the average similarity signals for said reference words to identify said unknown utterance as the most similar reference word.

Abstract

A speech analyzer for recognizing an unknown utterance as one of a set of reference words is adapted to generate a feature signal set for each utterance of every reference word. At least one template signal is produced for each reference word which template signal is representative of a group of feature signal sets. Responsive to a feature signal set formed from the unknown utterance and each reference word template signal, a signal representative of the similarity between the unknown utterance and the template signal is generated. A plurality of similarity signals for each reference word is selected and a signal corresponding to the average of said selected similarity signals is formed. The average similarity signals are compared to identify the unknown utterance as the most similar reference word. Features of the invention include: template formation by successive clustering involving partitioning feature signal sets into groups of predetermined similarity by centerpoint clustering, and recognition by comparing the average of selected similarity measures of a time-warped unknown feature signal set with the cluster-derived reference templates for each vocabulary word.
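The decision rule summarized in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented circuit: it assumes the similarity signals are distances (smaller means more similar, as in claim 3), that time warping has already been applied, and that the distances arrive as a plain dictionary. The function name `recognize`, the input format, and the default `k` are hypothetical.

```python
from statistics import mean


def recognize(distances_by_word, k=2):
    """Pick the word whose k smallest template distances have the lowest mean.

    distances_by_word maps each reference word to the distances between the
    unknown utterance and each of that word's templates (assumed input format).
    """
    averages = {}
    for word, distances in distances_by_word.items():
        selected = sorted(distances)[:k]    # select the k smallest distances
        averages[word] = mean(selected)     # average the selected distances
    best = min(averages, key=averages.get)  # least average distance wins
    return best, averages
```

With distances `{"yes": [0.2, 0.9, 0.4], "no": [0.1, 1.0, 1.2]}` and `k=2`, the single closest template belongs to "no", but the averaged rule selects "yes", because one close template no longer decides the outcome on its own.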

215 Citations

26 Claims

1. A circuit for recognizing an unknown utterance as one of a set of reference words comprising means responsive to each of a plurality of utterances of a reference word for generating a set of signals representative of the features of said utterance;
- means responsive to the feature signal sets of each reference word for generating at least one template signal, each template signal being representative of a group of said reference word feature signal sets;
  
  means responsive to the unknown utterance for generating a set of signals representative of the features of said unknown utterance;
  
  means jointly responsive to said unknown utterance feature signal set and each reference word template signal for forming a set of signals each representative of the similarity between said unknown utterance feature signal set and said reference word template signal;
  
  characterized in that selection means (130) are responsive to the similarity signals for each reference word to select a plurality of said reference word similarity signals;
  
  averaging means (135) are adapted to form a signal corresponding to the average of said selected similarity signals for each reference word; and
  
  identifying apparatus (140, 145) is responsive to the average similarity signals for said reference words to identify said unknown utterance as the most similar reference word.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
- - 2. A circuit for recognizing an unknown utterance as one of a set of reference words according to claim 1 characterized in that the template signal generating means (112) further comprises means (222, 224, 225, 226, 228, 230) for successively partitioning said reference word feature signal sets into clusters of feature signal sets, the feature signal sets of each cluster having a predetermined degree of similarity;
    - and means (600, 230, 216) for identifying a feature signal set in each cluster as the template signal representative of all feature signal sets in said cluster.
  - 3. A circuit for recognizing an unknown utterance as one of a set of reference words according to claim 2 wherein each feature signal set generating means comprises means for producing prediction parameter signals representative of the utterance;
    - said similarity signal generating means comprises means jointly responsive to the prediction parameter signals of the unknown utterance and the prediction parameter signals of each reference word template signal for producing a signal representative of the distance between said unknown utterance prediction parameter signals and said reference word template prediction parameter signals;
      
      characterized in that said selection means (130) further comprises means (1825) for selecting a plurality of the smallest distance signals for each reference word, said averaging means (135) comprises means (1830, 1833, 1836) for forming a signal representative of the average of said selected distance signals for each reference word; and
      
      said identifying apparatus (140, 145) comprises means (1839, 1891) responsive to the average distance signals formed for all reference words for identifying the unknown utterance as the reference word having the least average distance signal.
  - 4. A circuit for recognizing an unknown utterance as one of a set of reference words according to claim 3 characterized in that said partitioning means (224, 225, 226, 228, 230) further comprises means (222, 224, 493) responsive to the feature signal sets of each reference word for generating and storing a set of signals corresponding to the distances between pairs of said reference word feature signal sets;
    - means (225, 226, 228, 505, 587) responsive to said stored distance signals for determining the centermost of said reference word feature signal sets and for identifying a first group of said sets which are within a predetermined distance of said centermost set;
      
      means (225, 226, 230, 232, 234, 505, 600) operative repetitively for forming successive groups of feature signal sets having a predetermined degree of similarity including means (226, 230, 505) responsive to said stored distance signals for determining the centermost set of the immediately preceding group of feature signal sets;
      
      means (225, 232, 234, 600) for identifying the feature signal sets of said preceding group which are within said predetermined distance of said preceding group centermost set as members of the next successive group;
      
      means (225, 600) responsive to all feature signal sets of a preceding group being within said predetermined distance of said preceding group centermost set for identifying said preceding group as a cluster of feature signal sets having a predetermined degree of similarity.
  - 5. A circuit for recognizing an utterance as one of a set of reference words according to claim 4 wherein said template signal identifying means is further characterized by means (216, 230, 640) for identifying the centermost feature signal set of said cluster as the template signal for said cluster of feature signal sets.
  - 6. A circuit for recognizing an unknown utterance as one of a set of reference words according to claim 5 wherein said means for producing a signal representative of the distance between said unknown utterance prediction parameter signals and said template prediction parameter signals is further characterized in that means (205) are responsive to the unknown utterance to determine the number of frames to the endpoint frame of said unknown utterance;
    - means (880, 1806) are adapted to generate a first signal corresponding to the average frame distance between the unknown utterance prediction parameter signals and said template signal prediction parameter signals until said endpoint frame, to determine the unknown utterance intermediate frame at which the unknown utterance speech signal energy from said intermediate frame to said endpoint frame is a predetermined portion of the total unknown utterance speech signal energy, and to generate a second signal corresponding to the average frame distance between said unknown utterance prediction parameter signals and said template prediction parameter signals until said intermediate frame; and
      
      means (1817, 1820) are responsive to said first and second signals to select the minimum of said first and second signals as said distance representative signal.
  - 7. A circuit for recognizing an unknown utterance as one of a set of reference words according to claim 6 wherein said distance signal generating and storing means comprises means for storing the distance signals for each reference word in a storage matrix of J rows and J columns;
    - and said centermost set determining means is characterized by means (702-1 through 702-J) responsive to the distance signals of each column of said matrix for selecting the maximum distance signal of said column;
      
      means (720) responsive to said selected maximum distance signals for determining the minimum of said selected maximum distance signals; and
      
      means (228, 230) for storing the row position of said determined minimum of the selected maximum distance signals to identify the centermost set.
  - 8. A circuit for recognizing an unknown utterance as one of a set of reference words according to claim 7 characterized in that said means (1825) for selecting the smallest distance signals for each reference word further comprises means for successively receiving reference word distance signals from said distance signal generating means (1803, 1806, 1817, 1820);
    - means (2002, 2012, 2022, 2032) for storing a set of minimum value distance signals in ascending order;
      
      means (2004, 2014, 2024, 2034) for comparing the distance signal from said receiving means with each of the distance signals stored in said distance signal storing means (2002, 2012, 2022, 2032);
      
      means (e.g., 2020, 2026, 2028) responsive to the operation of each comparing means (e.g., 2024) for replacing the distance signal in said distance signal storing means (e.g., 2022) with said received distance signal if said received distance signal is smaller than the distance signal in said distance signal storing means (e.g., 2022) and greater than the distance signal in the next lower order distance storing means (e.g., 2012), for retaining the distance signal in said distance signal storing means (e.g., 2022) if said received distance signal is greater than the distance signal in said distance signal storing means (e.g., 2022), and for transferring the distance signal in the next lower order distance signal storing means (e.g., 2012) to said distance signal storing means (e.g., 2022) if said received distance signal is smaller than the distance signals in both the next lower order distance signal storing means (e.g., 2012) and said distance signal storing means (e.g., 2022).
  - 9. A circuit for recognizing an unknown utterance as one of a set of reference words according to claim 3, wherein said distance representative signal producing means is further characterized in that means (205) are responsive to said unknown utterance to determine the number of frames in said utterance to the endpoint frame thereof;
    - means (880, 1806) are operative to generate a first signal corresponding to the average frame distance between the unknown utterance prediction parameter signals and said template prediction parameter signals until said endpoint frame, to determine the unknown utterance intermediate frame at which the unknown utterance speech signal energy from said intermediate frame to said endpoint frame is a predetermined portion of the total speech signal energy of said unknown utterance, and to generate a second signal corresponding to the average frame distance between said unknown utterance prediction parameter signals and said template prediction parameter signals until said intermediate frame; and
      
      means (1817, 1820) are adapted to select the smaller of said first and second signals as said distance representative signal.
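The ordered store recited in claim 8 can be mimicked in software as a fixed-length ascending list that is updated once per received distance. The sketch below is an interpretation under stated assumptions, not the circuit itself: the names `update_smallest` and `k_smallest` are hypothetical, empty slots are modeled as positive infinity, and ties are resolved by retaining the stored value.

```python
import math


def update_smallest(store, received):
    """Update `store`, an ascending fixed-length list, with one received distance.

    Each position applies one of the three cases described in claim 8:
    retain its value, take the received value, or take the value shifted
    up from the next lower order position.
    """
    previous = list(store)  # every comparison uses the values held before the update
    for i in range(len(store)):
        lower = previous[i - 1] if i > 0 else -math.inf
        if received >= previous[i]:
            continue                # retain: received is not smaller than this slot
        if received >= lower:
            store[i] = received     # replace: received fits between lower slot and this slot
        else:
            store[i] = lower        # transfer: the lower order value moves up one slot
    return store


def k_smallest(distances, k):
    """Return the k smallest distances in ascending order."""
    store = [math.inf] * k          # assumption: infinity marks an empty slot
    for d in distances:
        update_smallest(store, d)
    return store
```

For finite inputs with at least `k` values, the result matches `sorted(distances)[:k]`; the per-slot form is kept here only because it follows the claim's compare, replace, and transfer wording.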

10. A method of recognizing an unknown utterance as one of a set of reference words comprising the steps of generating a set of feature signals representative of each of a plurality of utterances of each reference word;
- generating at least one template signal for each reference word responsive to the feature signal sets of said reference word, each template signal being representative of a group of said reference word feature signal sets;
  
  responsive to the unknown utterance, generating a set of signals representative of the features of said unknown utterance;
  
  jointly responsive to said unknown utterance feature signal set and each reference word template signal, forming a set of signals each representative of the similarity between said unknown utterance feature signal set and said template signal;
  
  characterized in that a plurality of similarity signals for each reference word is selected;
  
  a signal corresponding to the average of each reference word selected similarity signals is formed; and
  
  responsive to the average signals for all reference words, the unknown utterance is identified as the most similar reference word.
- View Dependent Claims (11, 12, 13, 14, 15, 16)
- - 11. A method for recognizing an unknown utterance as one of a set of reference words according to claim 10 wherein said template signal generating step comprises successively partitioning said reference word feature signal sets into clusters of feature signal sets, the feature signal sets of each cluster having a predetermined degree of similarity;
    - and identifying a feature signal set in each cluster as a template signal representative of all feature signal sets in said cluster.
  - 12. A method for recognizing an unknown utterance as one of a set of reference words according to claim 11 wherein said feature signal set generating step comprises producing prediction parameter signals representative of the utterance;
    - and said similarity signal generating step comprises producing a signal representative of the distance between the unknown utterance prediction parameter signals and the reference word template prediction parameter signals jointly responsive to the prediction parameter signals of the unknown utterance and the prediction parameter signals of each reference word template characterized in that the smallest distance signals for each reference word are selected;
      
      a signal representative of the average of the selected distance signals for each reference word is formed; and
      
      responsive to the average distance signals formed for all reference words, the unknown utterance is identified as the reference word having the least average distance signal.
  - 13. A method for recognizing an unknown utterance as one of a set of reference words according to claim 12 wherein said partitioning step is further characterized in that a set of signals corresponding to the distances between pairs of said reference word feature signal sets is generated and stored;
    - responsive to the stored distance signals, the centermost set of the reference word feature signal sets is determined and a first group of said feature signal sets which are within a predetermined distance of said centermost set is identified;
      
      successive groups of unclustered feature signal sets having a predetermined degree of similarity are formed by determining the centermost set of the preceding group of feature signal sets from the stored distance signals and identifying the feature signal sets of said preceding group which are within the predetermined distance of said preceding group centermost set as members of the next successive group;
      
      responsive to all feature signal sets of a formed group being within the predetermined distance of the group centermost set, identifying the formed group as a cluster of feature signal sets having a predetermined degree of similarity.
  - 14. A method for recognizing an unknown utterance as one of a set of reference words according to claim 13 wherein said template signal identification step is characterized in that the centermost feature signal set of said cluster is stored as the template signal for said cluster of feature signal sets.
  - 15. A method for recognizing an unknown utterance as one of a set of reference words according to claim 14 wherein said step of producing a signal representative of the distance between the unknown utterance prediction parameter signals and the reference word template prediction parameter signals is characterized in that the number of frames to the endpoint frame of the unknown utterance is determined;
    - a first signal corresponding to the average frame distance between the unknown utterance prediction parameter signals and the template prediction parameter signals until said endpoint frame is generated;
      
      the unknown utterance intermediate frame at which the unknown utterance speech signal energy from said intermediate frame to said endpoint frame is a predetermined portion of the total unknown utterance speech signal energy is determined;
      
      a second signal corresponding to the average frame distance between the unknown utterance prediction parameter signals and said template prediction parameter signals until said intermediate frame is generated; and
      
      the minimum of said first and second signals is selected as said distance representative signal.
  - 16. A method for recognizing an unknown utterance as one of a set of reference words according to claim 15 wherein said distance signal generating and storing step includes storing the distance signals for each reference word in a storage matrix of J rows and J columns;
    - and said centermost set determining step is characterized in that the maximum distance signal in each column of said matrix is selected;
      
      responsive to said selected maximum distance signals, the minimum of said selected maximum distance signals is determined; and
      
      a signal corresponding to the row position of said determined minimum of the selected maximum distance signals is stored to identify the centermost set.
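The centermost-set rule of claims 7 and 16 (take the maximum of each column of the J by J distance matrix, then the minimum of those maxima) is a minimax selection. A minimal sketch follows, assuming the matrix is symmetric with a zero diagonal so that a column index and a row index name the same feature signal set; the function name `centermost` is hypothetical.

```python
def centermost(dist):
    """Return the index of the minimax center of a J x J distance matrix.

    dist[r][c] is the distance between feature signal sets r and c
    (assumed symmetric, with zeros on the diagonal).
    """
    j = len(dist)
    # Maximum distance found in each column of the matrix.
    column_maxima = [max(dist[row][col] for row in range(j)) for col in range(j)]
    # The column whose maximum is smallest identifies the centermost set.
    return min(range(j), key=lambda col: column_maxima[col])
```

When several columns share the same smallest maximum, this sketch returns the lowest index; the claims do not state a tie-breaking rule.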

17. A speech recognition circuit for identifying an unknown utterance as one of a set of reference words comprising means responsive to each of a plurality of utterances of a reference word for generating a first signal representative of the prediction parameters of said utterance;
- means responsive to the first signals of each reference word for generating at least one template signal for said reference word, each template signal being representative of a group of the reference word first signals;
  
  means responsive to the unknown utterance for generating a second signal representative of the prediction parameters of said unknown utterance;
  
  means jointly responsive to the template signals of each reference word and the second signal for forming a set of signals each representative of the distance between said second signal and said reference word template signal; and
  
  means responsive to said reference word distance signals for identifying said unknown utterance as the reference word having the minimum distance signals characterized in that said template signal generating means (112) further comprises means (222, 224) responsive to the first signals of each reference word for generating and storing a set of signals each representative of the distance between a pair of said reference word first signals;
  
  means (222, 225, 226, 228, 230) responsive to said stored distance signals for successively partitioning the first signals of each reference word into clusters, the first signals of each cluster having a predetermined degree of similarity; and
  
  means (216, 230, 600) responsive to said distance signals for determining the centermost first signal of each cluster and for identifying said centermost first signal as the cluster template signal.
- View Dependent Claims (18, 19)
- - 18. A speech recognition circuit for identifying an unknown utterance as one of a set of reference words according to claim 17 characterized in that said partitioning means further comprises means (225, 226, 228, 505, 587) responsive to said stored distance signals for determining the centermost of said first signals and a first group of first signals which are within a predetermined distance of said centermost first signal;
    - means (225, 226, 230, 232, 505, 600) responsive to said first group distance signals for successively forming groups of first signals having a predetermined degree of similarity including means (702-1 through 702-J, 720, 225, 232, 234, 600) responsive to the stored distance signals of said reference word for determining the centermost first signal of the immediately preceding group and for identifying the first signals of the immediately preceding group which are within said predetermined distance of the centermost first signal of the immediately preceding group as members of the next group of first signals;
      
      means (225, 600) responsive to all first signals of a formed group being within the predetermined distance of said formed group centermost first signal for identifying the first signals of said formed group as members of a cluster of first signals having a predetermined degree of similarity.
  - 19. A speech recognition circuit for identifying an unknown utterance as one of a set of reference words according to claim 18 further characterized by means (216, 230, 646) for storing the centermost first signal of the cluster as the template signal for said cluster of first signals.

20. A method for identifying an unknown utterance as one of a set of reference words comprising the steps of generating a first signal representative of the prediction parameters of each of a plurality of utterances of a reference word;
- generating at least one template signal for each reference word responsive to the reference word first signals, each template signal being representative of a group of the reference word first signals;
  
  generating a second signal representative of the prediction parameters of said unknown utterance;
  
  jointly responsive to the template signals of each reference word and the second signal, forming a set of signals each representative of the distance between said second signal and said reference word template signal;
  
  responsive to the distance signals of all reference words, identifying the unknown utterance as the reference word having the minimum distance signals characterized in that said template signal generation for each reference word includes generating and storing a set of signals each representative of the distance between a pair of reference word first signals responsive to the first signals of said reference word; and
  
  successively partitioning the first signals of said reference word into clusters responsive to the stored reference word distance signals, the first signals of each cluster having a predetermined degree of similarity;
  
  determining the centermost first signal of each cluster responsive to said stored reference word distance signals; and
  
  identifying said centermost first signal as the cluster template signal.
- View Dependent Claims (21, 22)
- - 21. A method for identifying an unknown utterance as one of a set of reference words according to claim 20 characterized in that the successive partitioning further includes determining the centermost of said first signals of said reference word and identifying the first group of said reference word first signals which are within a predetermined distance of said determined centermost first signal responsive to said reference word stored distance signals;
    - successively forming groups of first signals having a predetermined degree of similarity comprising determining the centermost first signal of the immediately preceding group from the stored distance signals and identifying the first signals of the immediately preceding group which are within a predetermined distance of the centermost first signal of the immediately preceding group as members of the next group of first signals; and
      
      responsive to all first signals of the immediately preceding group being within the predetermined distance of said immediately preceding group centermost first signal, identifying the immediately preceding group as a cluster of first signals having said predetermined degree of similarity.
  - 22. A method for identifying an unknown utterance as one of a set of reference words according to claim 21 further characterized in that the centermost first signal of said cluster is identified and stored as the template signal for said cluster.
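The successive partitioning of claims 20 to 22 can be read as the loop sketched below: find the centermost item, keep only the items within the predetermined distance of it, and repeat until the group stops shrinking, at which point the group is a cluster and its centermost item is the template. This is one reading of the claims rather than a confirmed implementation. In particular, repeating the procedure on the items left unclustered is an assumption drawn from the phrase "successively partitioning", and the names `cluster_templates` and `threshold` are hypothetical.

```python
def centermost(dist, members):
    """Minimax center: the member whose largest distance to any member is smallest."""
    return min(members, key=lambda c: max(dist[r][c] for r in members))


def cluster_templates(dist, threshold):
    """Partition items 0..J-1 into clusters; return (template_index, members) pairs.

    dist is a symmetric J x J distance matrix with a zero diagonal and
    threshold is the predetermined distance (assumed non-negative).
    """
    unclustered = list(range(len(dist)))
    clusters = []
    while unclustered:
        group = unclustered
        while True:
            center = centermost(dist, group)
            # Members of the preceding group within the threshold of its center.
            next_group = [m for m in group if dist[m][center] <= threshold]
            if len(next_group) == len(group):
                break               # all members are within the threshold: a cluster
            group = next_group
        clusters.append((center, group))
        unclustered = [m for m in unclustered if m not in group]
    return clusters
```

Because the center is always within the threshold of itself, each group is non-empty and strictly shrinks until it stabilizes, so both loops terminate.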

23. A speech recognition circuit for identifying an unknown utterance as one of a set of reference words comprising:
- means responsive to each of a plurality of utterances of a reference word for generating a first signal representative of the prediction parameters of said utterance;
  
  means responsive to the first signals of each reference word for generating at least one template signal for each reference word, each template signal being representative of a group of reference word first signals;
  
  means responsive to the unknown utterance for generating a second signal representative of the prediction parameters of said unknown utterance;
  
  means jointly responsive to the template signals of each reference word and the second signal for forming a set of signals each representative of the distance between the second signal and said reference word template signal; and
  
  means responsive to said reference word distance signals for identifying said unknown utterance as the reference word having the minimum distance signals;
  
  characterized in that said distance representative signal forming means (1803, 1806, 1810, 1815, 1817, 1820) further comprises means (205) responsive to said unknown utterance for determining the number of frames to the endpoint frame of the unknown utterance;
  
  means (880, 1806) for generating a third signal corresponding to the average frame distance between said second signal and said template prediction parameter signals until said endpoint frame of said unknown utterance, for determining the intermediate frame of the unknown utterance at which the speech signal energy of the unknown utterance from said intermediate frame to said endpoint frame is a predetermined portion of the total speech signal energy of the unknown utterance, and for generating a fourth signal corresponding to the average frame distance between said second signal and said template prediction parameter signals until said intermediate frame; and
  
  means (1817, 1820) for selecting the minimum of said third and fourth signals as said distance representative signal.
- View Dependent Claims (24)
- - 24. A speech recognition circuit for identifying an unknown utterance as one of a set of reference words according to claim 23 characterized in that said means (130, 135, 140, 145) for identifying said unknown utterance as the reference word having the minimum distance signals further comprises means (130) responsive to said distance representative signals of each reference word for selecting a plurality of said reference word smallest distance representative signals;
    - means (135) for forming a signal corresponding to the average of said reference word selected distance representative signals; and
      
      means (140, 145) responsive to the average selected distance representative signals for all reference words for identifying the unknown utterance as the reference word having the minimum average distance representative signal.

25. A method for identifying an unknown utterance as one of a set of reference words comprising the steps of generating a first signal representative of the prediction parameters of each of a plurality of utterances of each reference word;
- generating at least one template signal for each reference word responsive to the reference word first signals, each template signal being representative of a group of the reference word first signals;
  
  generating a second signal representative of the prediction parameters of said unknown utterance;
  
  jointly responsive to the template signals of each reference word and the second signal, forming a set of signals each representative of the distance between said second signal and said reference word template signal;
  
  responsive to the distance signals of all reference words, identifying the unknown utterance as the reference word having the minimum distance signals characterized in that the step of forming a set of signals each representative of the distance between said second signal and said reference template signal comprises the steps of determining the endpoint frame of the unknown utterance;
  
  generating a third signal corresponding to the average frame distance between said second signal and said template prediction parameter signals until said endpoint frame;
  
  determining the intermediate frame of said unknown utterance at which the unknown utterance speech signal energy from said intermediate frame to said endpoint frame is a predetermined portion of the total speech signal energy of said unknown utterance;
  
  generating a fourth signal corresponding to the average frame distance between the second signal and the template prediction parameter signals until said intermediate frame; and
  
  selecting the minimum of said third and fourth signals as said distance representative signal.
- View Dependent Claims (26)
- - 26. A method for identifying an unknown utterance as one of a set of reference words according to claim 25 further characterized in that said step of identifying the unknown utterance as the reference word having the minimum distance signals further comprises the steps of:
    - selecting a plurality of smallest distance signals for each reference word;
      
      forming a signal corresponding to the average of said reference word selected smallest distance signals; and
      
      identifying the unknown utterance as the reference word having the minimum average distance signal responsive to the average distance signals of all reference words.
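The two-signal distance of claims 23 and 25 can be illustrated as follows. The sketch assumes the per-frame distances between the unknown utterance and one template have already been computed (for example after time alignment) and that a per-frame energy value is available. It reads "intermediate frame" as the latest frame whose tail, from that frame through the endpoint, holds at least the predetermined portion of the total energy, and it includes the intermediate frame in the second average. The function names and the default `tail_fraction` are hypothetical.

```python
def intermediate_frame(energies, tail_fraction):
    """Latest frame index whose tail (that frame through the endpoint) holds
    at least tail_fraction of the total energy."""
    total = sum(energies)
    tail = 0.0
    for i in range(len(energies) - 1, -1, -1):
        tail += energies[i]
        if tail >= tail_fraction * total:
            return i
    return 0


def utterance_distance(frame_distances, energies, tail_fraction=0.25):
    """Smaller of the average frame distance to the endpoint frame and the
    average frame distance to the intermediate frame."""
    if not frame_distances or len(frame_distances) != len(energies):
        raise ValueError("need one distance and one energy value per frame")
    # Average over every frame up to the endpoint frame.
    to_endpoint = sum(frame_distances) / len(frame_distances)
    # Average over the frames up to and including the intermediate frame.
    mid = intermediate_frame(energies, tail_fraction)
    to_intermediate = sum(frame_distances[: mid + 1]) / (mid + 1)
    return min(to_endpoint, to_intermediate)
```

Taking the minimum means that a poorly matching low-energy tail, such as a breath or a trailing release, does not have to inflate the distance for an otherwise well-matched word.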

Current Assignee
Bell Telephone Laboratories, Inc. (Nokia Corporation)
Original Assignee
Bell Telephone Laboratories, Inc. (Nokia Corporation)
Inventors
Rabiner, Lawrence R., Pirz, Frank C.
Primary Examiner(s)
Morrison, Malcolm A.
Assistant Examiner(s)
Kemeny, E. S.

Application Number

US05/956,438
Time in Patent Office

427 Days
Field of Search

179/1 SD, 179/1 SB, 179/1 SC, 340/146.3 AQ, 340/146.3 WD, 340/146.3 CA
US Class Current

704/252
CPC Class Codes

G10L 15/063 Training

G10L 25/87 Detection of discrete point...
