Continuous speech recognition system

US 4,349,700 A
Filed: 04/08/1980
Issued: 09/14/1982
Est. Priority Date: 04/08/1980
Status: Expired due to Term

First Claim

1. Apparatus for recognizing continuous speech comprising:

means (105) for storing signals representative of the acoustic features of a set of reference words;

means (103) responsive to an unknown utterance for producing a sequence of signals representative of the acoustic features of the utterance;

means (110,120,130,140,160) responsive to the reference word acoustic feature signals and the utterance acoustic feature signals for generating at least one reference word series as a candidate for said utterance;


Abstract

Recognition of continuous speech by comparison with prestored isolated words may be confused by the merging together of spoken adjacent words (coarticulation). Improved recognition is attained by generating overlap-words, e.g., words whose first phoneme is the end phoneme of the preceding word in a string of words. The reference candidate series of overlap-words is transformed under dynamic time warping so as to time-match the utterance series of overlap-words.
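
The dynamic time warping mentioned in the abstract can be illustrated with a minimal sketch. This is a textbook formulation rather than the apparatus the patent describes: the Euclidean frame distance, the three-way step pattern, and the name `dtw_distance` are all assumptions chosen for illustration.

```python
from math import dist, inf


def dtw_distance(reference, utterance):
    """Cumulative DTW distance between two sequences of feature vectors.

    Assumptions (not from the patent): Euclidean local distance and
    unconstrained steps (match, insertion, deletion).
    """
    n, m = len(reference), len(utterance)
    # cost[i][j] holds the best cumulative distance for aligning
    # reference[:i] with utterance[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(reference[i - 1], utterance[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # reference frame advances alone
                cost[i][j - 1],      # utterance frame advances alone
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]
```

A reference pattern that differs from the utterance only in timing, such as a repeated frame, yields a distance of zero, which is the time-matching effect the abstract refers to.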

28 Claims

1. Apparatus for recognizing continuous speech comprising:
- means (105) for storing signals representative of the acoustic features of a set of reference words;
  
  means (103) responsive to an unknown utterance for producing a sequence of signals representative of the acoustic features of the utterance;
  
  means (110,120,130,140,160) responsive to the reference word acoustic feature signals and the utterance acoustic feature signals for generating at least one reference word series as a candidate for said utterance;
- View Dependent Claims (3, 4, 6, 8, 10, 12, 13, 14, 16, 17, 20, 21, 25, 28)
- - 3. Apparatus for recognizing continuous speech according to claim 1 further characterized in that said utterance identifying means (170) comprises means (335) responsive to the endpoints of all reference word series partial candidates of a word position interval being within a predetermined range of the utterance endpoint for generating a selection signal;
    - and means (311) responsive to the selection signal and said word position similarity signals for generating a signal for each reference word series candidate representative of the correspondence of the reference word series candidate to the utterance; and
      
      means (313) responsive to the reference word series candidate correspondence signals for selecting the reference word series candidate most closely corresponding to the utterance.
  - 4. Apparatus for recognizing continuous speech according to claim 1 further
  - 6. Apparatus for recognizing continuous speech according to claim 3 further
  - 8. Apparatus for recognizing continuous speech according to claim 4, further characterized in that said utterance segment determining means (307,309,311,320,345) comprises means (901,905,907,909) operative for each reference word series partial candidate of the preceding word position responsive to the series utterance portion endpoint of the preceding word position interval for generating a set of precessed utterance segment beginning points for each reference word of the current word position interval;
    - said word position interval similarity signal forming means (307,320) comprises means (310,305,307,311) jointly responsive to said set of precessed utterance segment beginning points, the utterance feature signal sequence and the feature signals of said reference word for selecting the utterance segment for said reference word and for generating a signal representative of the distance between the feature signals of the reference word and the feature signals of the selected utterance segment for said reference word.
  - 10. Apparatus for recognizing continuous speech according to claims 1, 2, 3, 4, 5, or 6 further characterized in that said signals representative of the acoustic features of a set of reference words are representative of the acoustic features of reference words spoken in isolation.
  - 12. A method for recognizing continuous speech according to claim 8 further characterized in that said utterance identifying step comprises generating a selection signal responsive to the endpoints of all reference word series partial candidates of a word position interval being within a predetermined range of the utterance endpoint;
    - generating a signal for each reference word series candidate representative of the correspondence of the reference word series candidate feature signals to the utterance acoustic feature signals responsive to the selection signal;
      
      and selecting the reference word series candidate most closely corresponding to the utterance responsive to said reference word series candidate correspondence signals.
  - 13. A method for recognizing continuous speech according to claim 8 further characterized in that said utterance identifying step comprises generating a selection signal responsive to the endpoints of all reference word series partial candidate of a word position interval being within a predetermined range of the utterance endpoint;
    - generating a sequence of feature signals corresponding to each reference word series candidate responsive to said selection signal;
      
      producing a signal representative of the correspondence of the utterance to each reference word series candidate responsive to the reference word series candidate feature signal sequence and said utterance acoustic feature signal sequence;
      
      and identifying the reference word series candidate having the closest correspondence to the utterance responsive to said correspondence signals.
  - 14. A method for recognizing continuous speech according to claim 10 further characterized in that said word position reference word combining step comprises generating a signal representative of the similarity between each reference word series partial candidate and the utterance portion corresponding thereto responsive to said word position interval similarity signals;
    - and said reference word series candidate identifying step comprises combining the similarity and correspondence signals for each reference word series candidate;
      
      and determining the reference word series candidate most closely corresponding to the utterance responsive to the combined similarity and correspondence signals for the reference word series candidate.
  - 16. A method for recognizing continuous speech according to claim 12 further characterized in that the word position reference word selecting step comprises identifying the reference word with the minimum distance signal below a first preassigned threshold and the reference words with distance signals within a preassigned distance of the minimum distance signal reference word responsive to the reference word distance signals for each reference word series partial candidate.
  - 17. A method for recognizing continuous speech according to claim 8, 9, 10, 11, 12, or 13 further characterized in that said signals representative of the acoustic features of a set of reference words are representative of the acoustic features of a set of reference words spoken in isolation.
  - 20. Apparatus for recognizing an utterance as a series of predetermined reference words according to claim 16 wherein said utterance identifying means further comprises means operative in each word position for generating a signal representative of the similarity of each word position reference word series candidate to the utterance portion corresponding thereto;
    - and said reference word series candidate selecting means comprises means jointly responsive to the detected word position reference word series partial candidate similarity signals and the correspondence signals for the detected word position reference word series partial candidates for selecting the reference word series candidate most closely corresponding to the utterance.
  - 21. Apparatus for recognizing an utterance as a series of predetermined reference words according to claim 17 wherein:
    - said utterance segment determining means comprises means responsive to the endpoint of each preceding word position reference word series partial candidate utterance segment for generating a precessing sequence of beginning points for the utterance segment, and means jointly responsive to the set of reference word feature signals and the feature signals of the utterance from each of said beginning points for forming an utterance segment corresponding to the feature signals of each reference word.
  - 25. A method for recognizing an utterance as a series of predetermined reference words according to claim 21 wherein:
    - said utterance identifying step comprises detecting the final word position in which the utterance segments of all reference word series partial candidates are within a predetermined range of the utterance termination;
      
      responsive to the detection of the final word position, generating a set of feature signals corresponding to each reference word series partial candidate of the final word position;
      
      generating a signal for each final word position reference word series partial candidate representative of the correspondence of the final word position reference word series partial candidate to the utterance from the reference word series candidate feature signals and the utterance acoustic feature signals;
      
      and selecting the reference word series candidate most closely corresponding to the utterance responsive to the final word position reference word series partial candidate correspondence signals.
  - 28. A method for recognizing an utterance as a series of predetermined reference words according to claims 21, 22, 23 or 24 wherein the signals representative of the acoustic features of the set of predetermined reference words are representative of the acoustic features of the set of predetermined reference words spoken in isolation.
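
The final-position test and selection recited in the dependent claims above (a selection signal once every partial candidate ends near the utterance endpoint, followed by choosing the closest candidate) might be sketched as follows. The function name `pick_final_candidate`, the `(words, endpoint)` tuple layout, the `tolerance` parameter, and the caller-supplied `correspondence` scorer are hypothetical; the claims do not prescribe any of them.

```python
def pick_final_candidate(partials, utterance_length, tolerance, correspondence):
    """Return the best word series, or None if the final position is not reached.

    partials       : list of (tuple_of_words, endpoint_frame) pairs
    correspondence : hypothetical callable mapping a word tuple to a
                     distance-like score (lower means a closer match)
    """
    # Selection signal: every partial candidate must end within
    # `tolerance` frames of the utterance endpoint.
    if not partials:
        return None
    if any(abs(utterance_length - end) > tolerance for _, end in partials):
        return None
    # Pick the candidate whose correspondence score is smallest.
    best_words, _ = min(partials, key=lambda p: correspondence(p[0]))
    return best_words
```

In practice the correspondence score would come from comparing each candidate's feature sequence with the whole utterance, for example by dynamic time warping; here it is left as a parameter so the selection logic stays visible.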

2. and means (170) responsive to said at least one reference word series candidate and said utterance acoustic feature signals for identifying said utterance as one of said reference word series candidates;
- characterized in that said reference word series candidate generating means (110,120,130,140,160) comprises;
  
  means (401,340) for generating a signal for identifying successive word position intervals;
  
  means (307,309,311) operative in each successive word position interval for forming reference word series partial candidates including means (307,309,311,320,345) responsive to each reference word series partial candidate of the preceding word position for determining for each reference word a plurality of utterance segments beginning within a predetermined range of the utterance segment endpoint of said preceding word position reference word series partial candidate and corresponding to the reference word feature signals, said range overlapping the preceding word position partial candidate utterance segment endpoint;
  
  means (307,320) responsive to each reference word feature signals and the feature signals of the corresponding determined utterance segments for forming a signal representative of the similarity between said reference word and said utterance segments;
  
  means (309,325) responsive to said similarity signals for selecting reference words having a prescribed similarity to their corresponding utterance segments;
  
  and means (311) for combining said selected reference words with the reference word series partial candidates of the preceding word position to form at least one reference word series partial candidate for said word position interval.
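
The word-position step recited above (overlapping beginning points, a similarity signal per reference word, selection by a prescribed similarity, and combination with the preceding partial candidates) can be sketched in simplified form. This is an illustration under stated assumptions, not the claimed apparatus: `extend_candidates` and its parameters are hypothetical, each candidate segment is given the same length as the reference template instead of being time-warped, and "prescribed similarity" is reduced to a single distance threshold.

```python
from math import inf


def extend_candidates(partials, references, utterance, distance,
                      overlap=1, threshold=1.0):
    """Extend each partial candidate by one word position.

    partials   : list of (tuple_of_words, endpoint_frame) pairs
    references : dict mapping word -> list of feature frames
    utterance  : list of feature frames
    distance   : callable(template_frames, segment_frames) -> float
    """
    extended = []
    for words, endpoint in partials:
        # Beginning points lie in a range that overlaps the previous endpoint.
        starts = range(max(0, endpoint - overlap),
                       min(len(utterance), endpoint + overlap + 1))
        for word, template in references.items():
            best_dist, best_end = inf, None
            for start in starts:
                end = start + len(template)  # assumption: fixed-length segment
                if end > len(utterance):
                    continue
                d = distance(template, utterance[start:end])
                if d < best_dist:
                    best_dist, best_end = d, end
            # Keep only words that are similar enough to their best segment.
            if best_dist <= threshold:
                extended.append((words + (word,), best_end))
    return extended
```

Calling the function repeatedly, feeding each result back in as `partials`, builds word series one position at a time, starting from `[((), 0)]`.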

5. characterized in that said utterance identifying means (170) comprises means (335) responsive to the endpoints of all reference word series partial candidates of a word position interval being within a predetermined range of the utterance endpoint for generating a selection signal;
- means (365,370,387,389,460) responsive to said selection signal for generating a sequence of feature signals corresponding to each reference word series candidate;
  
  means (311) jointly responsive to each reference word series candidate feature signal sequence and the utterance feature signal sequence for producing a signal representative of the correspondence of the utterance to said reference word series candidate and means (313) responsive to said correspondence signals for identifying the reference word series candidate having the closest correspondence to said utterance.
- View Dependent Claims (9)
- - 9. Apparatus for recognizing continuous speech according to claim 5 further characterized in that said word position reference word selecting means (309) comprises means (309,440) responsive to said reference word distance signals of the word position for identifying the reference word with the minimum distance signal below a first preassigned threshold and the reference words with distance signals within a preassigned distance of said minimum distance signal reference word.
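
The selection rule in claim 9 (the minimum-distance word if it falls below a first threshold, plus any word within a preassigned distance of that minimum) can be expressed in a few lines. The name `select_words`, the parameter names, and the default values are hypothetical, and returning an empty list when the minimum fails the threshold is an assumption about how the rule would be applied.

```python
def select_words(distances, first_threshold=0.5, margin=0.1):
    """Select reference words for one word position.

    distances : dict mapping word -> distance signal (lower is closer)
    """
    if not distances:
        return []
    best = min(distances.values())
    # The minimum-distance word must itself be below the first threshold.
    if best >= first_threshold:
        return []
    # Also keep every word whose distance is within `margin` of the minimum.
    return [word for word, d in distances.items() if d - best <= margin]
```

For example, with distances of 0.2, 0.25, and 0.6 and the defaults above, the first two words are kept and the third is dropped.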

7. characterized in that the word position reference word combining means (311) further comprises means (510) for generating a signal representative of the similarity between each reference word series partial candidate of the word position and the utterance portion corresponding thereto;
- and said reference word series candidate identifying means (170) comprises means (503,505,510) for combining the similarity and correspondence signals for each reference word series candidate;
  
  and means (311,313) responsive to said combined similarity and correspondence signals for determining the reference word series candidate most closely corresponding to the utterance.

11. A method for recognizing continuous speech comprising storing signals representative of the acoustic features of a set of reference words;
- producing a sequence of signals representative of the acoustic features of an unknown utterance;
  
  generating at least one reference word series as a candidate for said utterance responsive to the reference word and utterance feature signals;
  
  and identifying the utterance as one of said reference word series candidates;
  
  characterized in that said reference word series candidate generating step comprises;
  
  generating a signal identifying successive word position intervals for said utterance;
  
  in each identified word position interval, forming reference word series partial candidates including determining for each reference word series partial candidate of the preceding word position interval a plurality of utterance segments for each reference word beginning within a predetermined range of the utterance segment endpoint of the reference word series partial candidate of the preceding word position and corresponding to the reference word feature signals, said range overlapping the preceding word position partial candidate utterance segment;
  
  forming a signal representative of the similarity between each reference word and the corresponding determined utterance segments responsive to the reference word feature signals and said determined utterance segment feature signals;
  
  selecting reference words having a prescribed similarity to their corresponding utterance segments responsive to said similarity signals;
  
  and combining each selected reference word with the reference word series partial candidates of the preceding word position to form at least one reference word series partial candidate for said word position interval.
- View Dependent Claims (15, 19, 23, 27)
- - 15. A method for recognizing continuous speech according to claim 11 further characterized in that said utterance segment determining step comprises for each reference word series partial candidate of the preceding word position interval generating a set of utterance segment beginning points for each reference word of the current word position interval;
    - said word position interval similarity forming step comprises selecting the utterance segment for said reference word jointly responsive to the set of utterance beginning points, the utterance feature signals and said reference word feature signals; and
      
      generating a signal representative of the distance between the feature signals of the reference word and the feature signals of the selected utterance segment for said reference word.
  - 19. Apparatus for recognizing an utterance as a series of predetermined reference words according to claim 15 wherein said utterance identifying means comprises means for detecting the word position in which all reference word series partial candidate utterance segment endpoints are within a predetermined range of the utterance endpoint;
    - means responsive to the operation of said detecting means for generating a signal representative of the feature signals of each reference word series partial candidate of the detected word position;
      
      means jointly responsive to the detected word position reference word series partial candidate feature signals and the utterance feature signals for generating a correspondence signal for each detected word position reference word series partial candidate; and
      
      means responsive to the reference word series candidate correspondence signals for selecting the reference word series candidate most closely corresponding to the utterance.
  - 23. Apparatus for recognizing an utterance as a series of predetermined reference words according to claims 15, 16, 17, 18 or 19 wherein each signal representative of the acoustic features of a reference word is representative of the acoustic features of a reference word spoken in isolation.
  - 27. A method for recognizing an utterance as a series of predetermined reference words according to claim 23 wherein said reference word corresponding utterance segment determining step comprises:
    - generating a set of utterance segment beginning points overlapping the utterance segment of the preceding word position reference word series partial candidate;
      
      and selecting the utterance segment having feature signals most similar to the feature signals of the reference word.

18. Apparatus for recognizing an utterance as a series of predetermined reference words comprising:
- means for storing a set of signals each representative of the acoustic features of a predetermined reference word;
  
  means responsive to the utterance for generating a sequence of signals representative of the acoustic features of the utterance;
  
  means jointly responsive to the utterance acoustic feature signals and reference word feature signals for producing a set of reference word series candidates for the utterance;
  
  and means responsive to the reference word series candidates for identifying the utterance;
  
  said reference word series candidate producing means includes means for generating a signal for identifying successive word positions for the utterance;
  
  means operative in each identified word position for generating reference word series partial candidates for said word position comprising means operative for each reference word series partial candidate of the preceding word position responsive to the set of reference word feature signals and the utterance feature signals for determining an utterance segment best corresponding to the feature signals of each reference word and beginning within a predetermined range of the utterance portion endpoint of the preceding word position reference word series partial candidate, said range overlapping the preceding word position partial candidate utterance portion;
  
  means for selecting reference words having a prescribed similarity to their corresponding utterance segments and means for combining said selected reference words with the reference word series partial candidates of the preceding word position to form reference word series partial candidates for said word position.
- View Dependent Claims (22, 26)
- - 22. Apparatus for recognizing an utterance as a series of predetermined reference words according to claim 18 wherein:
    - said reference word selecting means comprises means responsive to the feature signals of each reference word and the feature signals of each utterance segment formed for said reference word for generating a signal representative of the vector distance between the reference word and formed utterance segment feature signals; and
      
      means for selecting the reference word and utterance segment corresponding thereto having the minimum distance signal and reference words and utterance segments corresponding thereto having distance signals within a preassigned distance of said minimum reference word distance signals to combine with the preceding word position partial candidate series.
  - 26. A method for recognizing an utterance as a series of reference words according to claim 22 wherein:
    - said utterance identifying step further comprises producing a signal representative of the similarity between each reference word series partial candidate feature signals and the acoustic feature signals of the utterance portion corresponding thereto in each word position;
      
      and wherein said reference word series candidate selecting step comprises combining the final word position reference word series partial candidate similarity signal with the final word position reference word series partial candidate correspondence signal;
      
      and responsive to the combined similarity and correspondence signals of the final word position reference word series partial candidates, identifying the reference word series candidate most similar to the utterance.

24. A method for recognizing an utterance as a series of predetermined reference words comprising the steps of storing signals representative of the acoustic features of a set of predetermined reference words;
- generating a sequence of signals representative of the acoustic features of the utterance;
  
  producing at least one series of reference words as a candidate for the utterance responsive to the reference word feature signals and the utterance acoustic feature signals;
  
  and identifying the utterance as one of said reference word series candidates;
  
  wherein the reference word series candidate generating step comprises identifying successive word positions for the utterance;
  
  in each identified word position, generating reference word series partial candidate including determining, for each reference word series partial candidate of the preceding word position and each reference word, an utterance segment in the current word position overlapping the utterance segment of the preceding word position reference word series partial candidate that best corresponds to the reference word acoustic feature signals; and
  
  combining reference words having a prescribed similarity to their corresponding utterance segments with the partial reference word series candidates of the preceding word position to form reference word series partial candidates for the current word position.

Current Assignee: Bell Telephone Laboratories, Inc. (Nokia Corporation)
Original Assignee: Bell Telephone Laboratories, Inc. (Nokia Corporation)
Inventors: Rabiner, Lawrence R.; Pirz, Frank C.
Primary Examiner(s): Krass, Errol A.
Assistant Examiner(s): Kemeny, Emanuel
Application Number: US06/138,647
Time in Patent Office: 889 Days
Field of Search: 179/1 SD, 179/1 SB, 364/728, 340/146.3 WD, 340/146.3 AQ, 340/146.3 SG
US Class Current: 704/241
CPC Class Codes:
G10L 15/00 Speech recognition G10L17/0...
G10L 15/12 using dynamic programming t...
