Multistage word recognizer based on reliably detected phoneme similarity regions

US 5,822,728 A
Filed: 09/08/1995
Issued: 10/13/1998
Est. Priority Date: 09/08/1995
Status: Expired due to Fees

Abstract

The multistage word recognizer uses a word reference representation based on reliably detected peaks of phoneme similarity values. The word reference representation captures the basic features of the words by targets that describe the location and shape of stable peaks of phoneme similarity values. The first stage of the word hypothesizer represents each reference word with statistical information on the number of high similarity regions over a predefined number of time intervals. The second stage represents each word by a prototype that consists of a series of phoneme targets and global statistics, namely the average word duration and average match rate. These represent the degree of fit of the word prototype to its training data. Word recognition scores generated in the two stages are converted to dimensionless normalized values and combined by averaging for use in selecting the most probable word candidates.
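
The final step of the abstract (normalizing the two stage scores and averaging them) can be sketched as follows. This is a minimal illustration, not the patented implementation: the abstract does not say how scores are made dimensionless, so z-score normalization is an assumption here, as are the function names `normalize` and `combine` and the convention that a higher score means a better fit in both stages.

```python
from statistics import mean, pstdev


def normalize(scores):
    """Convert raw scores to dimensionless z-scores (assumed normalization)."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero when all scores tie
    return {word: (s - mu) / sigma for word, s in scores.items()}


def combine(region_count_scores, target_congruence_scores):
    """Average the normalized scores of words that survived both stages.

    Assumption: both inputs are oriented so that higher means better fit.
    """
    rc = normalize(region_count_scores)
    tc = normalize(target_congruence_scores)
    return {word: (rc[word] + tc[word]) / 2.0 for word in tc if word in rc}


# Hypothetical scores: stage 1 kept three words, stage 2 kept two of them.
stage1 = {"cat": 0.9, "cap": 0.5, "cut": 0.1}
stage2 = {"cat": 0.8, "cap": 0.6}
combined = combine(stage1, stage2)
best = max(combined, key=combined.get)
```

Normalizing each stage separately before averaging keeps one stage from dominating simply because its raw scores span a larger numeric range.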

135 Citations

34 Claims

1. A word recognition processor for processing an input speech utterance in a speech recognition system, comprising:
- a phoneme similarity module receptive of said input speech utterance for producing phone similarity data indicative of the correlation between said input speech utterance and predetermined phone model speech data;
  
  a high similarity module coupled to said phoneme similarity module for identifying those regions of the phone similarity data that exceed a predetermined threshold;
  
  a region count stage having a first word prototype database for storing similarity region count data for a plurality of predetermined words;
  
  said region count stage coupled to said high similarity module and generating a first list of word candidates selected from said first word prototype database based on similarity regions;
  
  a target congruence stage having a second word prototype database for storing word prototype data corresponding to a said plurality of predetermined words;
  
  said target congruence stage being receptive of said first list of word candidates and being coupled to said high similarity module for generating a second list of at least one word candidate, selected from said first list based on similarity regions.
- - 2. The word recognition processor of claim 1 further comprising a fine match stage having word template database for storing word template data corresponding to said plurality of predetermined words;
    - said fine match stage being receptive of said second list of word candidates for selecting the recognized word.
  - 3. The word recognition processor of claim 1 wherein said phoneme similarity module includes a phone model database for storing phone model speech data corresponding to a plurality of phonemes that comprise said predetermined phone model speech data.
  - 4. The word recognition processor of claim 1 wherein said region count stage produces a first score corresponding to the degree of fit between the input utterance and each of the first list of word candidates.
  - 5. The word recognition processor of claim 1 wherein said target congruence stage produces a second score corresponding to the degree of fit between the input utterance and each of the second list of word candidates.
  - 6. The word recognition processor of claim 1 wherein said region count stage produces a first score corresponding to the degree of fit between the input utterance and each of the first list of word candidates;
    - wherein said target congruence stage produces a second score corresponding to the degree of fit between the input utterance and each of the second list of word candidates; and
      
      wherein said recognizer combines the first and second scores and selects at least the word with the best score as a final word candidate.
  - 7. The word recognition processor of claim 6 wherein said processor combines the first and second scores by averaging.
  - 8. The word recognition processor of claim 1 wherein said high similarity module produces a parameterized representation of high similarity regions of the phone similarity data.
  - 9. The word recognition processor of claim 8 wherein said parameterized representation includes a representation of the phone similarity peak location and peak height.
  - 10. The word recognition processor of claim 8 wherein said parameterized representation includes a representation of the phone similarity peak location, peak height and the left and right frame locations.
  - 11. The word recognition processor of claim 1 wherein said region count stage represents an instance of a given spoken word or phrase by the number of high similarity regions found corresponding to each of a plurality of phoneme identifiers.
  - 12. The word recognition processor of claim 11 further comprising means for breaking said phone similarity data into a plurality of time intervals and wherein said instance of a given spoken word or phrase is represented by the number of high similarity regions in each of said time intervals.
  - 13. The word recognition processor of claim 1 further comprising building a region count prototype corresponding to a plurality of training instances of a spoken word or phrase.
  - 14. The word recognition processor of claim 13 wherein said region count prototype consists of statistics based on the number of high phoneme similarity regions found for each phoneme identifier in each of a plurality of time intervals.
  - 15. The word recognition processor of claim 14 wherein said statistics comprise the mean and inverse variance of the number of said high phoneme similarity regions found for each phoneme identifier in each of said plurality of time intervals.
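
Claims 8 through 12 describe parameterizing high similarity regions (peak location, peak height, left and right frame locations) and counting them per phoneme identifier and time interval. The sketch below illustrates one way this could look. It is an assumption-laden example rather than the patent's implementation: the names `Region`, `find_regions`, and `region_counts` are hypothetical, a region is taken to be a contiguous run of frames above the threshold, and each region is assigned to the time interval that contains its peak.

```python
from dataclasses import dataclass


@dataclass
class Region:
    phoneme: str    # phoneme identifier
    left: int       # first frame above the threshold
    right: int      # last frame above the threshold
    peak: int       # frame index of the maximum similarity
    height: float   # similarity value at the peak


def find_regions(phoneme, similarity, threshold):
    """Return contiguous runs of frames whose similarity exceeds threshold."""
    similarity = list(similarity)
    regions = []
    start = None
    # A trailing sentinel closes a region that runs to the last frame.
    for t, value in enumerate(similarity + [float("-inf")]):
        if value > threshold:
            if start is None:
                start = t
        elif start is not None:
            segment = similarity[start:t]
            height = max(segment)
            peak = start + segment.index(height)
            regions.append(Region(phoneme, start, t - 1, peak, height))
            start = None
    return regions


def region_counts(regions, num_frames, num_intervals):
    """Count regions per (phoneme, time interval), keyed by peak position."""
    counts = {}
    for r in regions:
        interval = min(r.peak * num_intervals // num_frames, num_intervals - 1)
        key = (r.phoneme, interval)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical similarity curve for one phoneme over nine frames.
curve = [10, 50, 80, 60, 20, 20, 70, 90, 30]
regions = find_regions("ae", curve, threshold=40)
counts = region_counts(regions, num_frames=len(curve), num_intervals=3)
```

With this curve, two regions are found (frames 1 to 3 and frames 6 to 7), and they fall into the first and third of the three time intervals.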

16. A method for processing an input speech utterance for word recognition, comprising:
- representing the input speech utterance as a phone similarity data indicative of the correlation between the input speech utterance and predetermined phone model speech data;
  
  selecting from said phone similarity data those regions of high similarity that exceed a predetermined threshold;
  
  testing the high similarity regions against a first predetermined word prototype database using a region count procedure that selects first list of word candidates minimizing the region count distortion with respect to the input speech utterance;
  
  testing the high similarity regions of words in said first list against a second predetermined word prototype database using a target congruence procedure that selects from the first list a second list of word candidates having high similarity regions substantially congruent with the input speech utterance.
- - 17. The method of claim 16 further comprising performing a fine match upon said second list of word candidates to select a single recognized word from said second list.
  - 18. The method of claim 16 wherein said region count procedure produces a first score corresponding to the degree of fit between the input utterance and each of the first list of word candidates.
  - 19. The method of claim 16 wherein said target congruence procedure produces a second score corresponding to the degree of fit between the input utterance and each of the second list of word candidates.
  - 20. The method of claim 16 wherein said region count procedure produces a first score corresponding to the degree of fit between the input utterance and each of the first list of word candidates;
    - wherein said target congruence procedure produces a second score corresponding to the degree of fit between the input utterance and each of the second list of word candidates; and
      
      further comprising combining the first and second normalized scores and selecting the word with the best score as a final word candidate.
  - 21. The method of claim 20 wherein said combining step is performed by averaging said first and second normalized scores.
  - 22. The method of claim 16 further comprising representing said high similarity regions as a parameter representing the high similarity regions of the phone similarity data.
  - 23. The method of claim 22 wherein said parameters include a representation of the phone similarity peak location and peak height.
  - 24. The method of claim 22 wherein said parameters include a representation of the phone similarity peak location, peak height and the left and right frame locations.
  - 25. The method of claim 16 further comprising the step of:
    - representing an instance of a given spoken word or phrase by the number of high phoneme similarity regions found for each of a plurality of phoneme identifiers.
  - 26. The method of claim 25 further comprising breaking said phone similarity data into a plurality of time intervals and representing an instance of a given spoken word or phrase by the number of high similarity regions in each of said time intervals.
  - 27. The method of claim 25 further comprising the step of:
    - building a region count prototype corresponding to a spoken word or phrase that consists of statistics based on the number of high phoneme similarity regions found for each phoneme identifier in each of a plurality of time intervals in the phoneme similarity data.
  - 28. The method of claim 27 further comprising the step of:
    - calculating said statistics as the mean and inverse variance of the number of said high phoneme similarity regions found for each phoneme identifier, in each of a plurality of time intervals, of the training instances of the given spoken word or phrase.
  - 29. The method of claim 27 further comprising the step of computing the recognition score for a given instance of a spoken word or phrase with respect to a given region count prototype.
  - 30. The method of claim 29 wherein said recognition score is the Euclidean distance between:
    - (a) the number of high phoneme similarity regions found, for each phoneme identifier, in the phoneme similarity data and (b) the mean of the number of high phoneme similarity regions found for each phoneme identifier.
  - 31. The method of claim 29 wherein said recognition score is the Euclidean distance between:
    - (a) the number of high phoneme similarity regions found, for each phoneme identifier, in each of a plurality of time intervals within the phoneme similarity data and (b) the mean of the number of high phoneme similarity regions, in each of said plurality of time intervals of the training instances found in the given region count prototype.
  - 32. The method of claim 29 wherein said recognition score is a weighted Euclidean distance.
  - 33. The method of claim 32 wherein the weight is the inverse variance from the given region count prototype.
  - 34. The method of claim 16 further comprising:
    - comparing the input speech utterance with each prototype to provide a recognition score.
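
Claims 27 through 33 describe a region count prototype built from the mean and inverse variance of region counts, and a recognition score computed as a Euclidean distance weighted by that inverse variance. A minimal sketch under stated assumptions follows. The names `build_prototype` and `region_count_score` are hypothetical, counts are stored in dictionaries keyed by (phoneme identifier, time interval), and the `variance_floor` parameter is an addition of this example (not from the claims) to keep the inverse variance finite when all training instances agree.

```python
import math
from statistics import mean, pvariance


def build_prototype(training_counts, keys, variance_floor=0.01):
    """Compute (mean, inverse variance) of region counts for each key.

    training_counts: one dict per training instance, mapping
    (phoneme, interval) to the number of high similarity regions.
    variance_floor is an assumption of this sketch.
    """
    prototype = {}
    for key in keys:
        values = [counts.get(key, 0) for counts in training_counts]
        variance = max(pvariance(values), variance_floor)
        prototype[key] = (mean(values), 1.0 / variance)
    return prototype


def region_count_score(counts, prototype):
    """Weighted Euclidean distance between observed counts and the prototype.

    Lower values indicate a closer fit.
    """
    total = 0.0
    for key, (mu, inv_var) in prototype.items():
        diff = counts.get(key, 0) - mu
        total += inv_var * diff * diff
    return math.sqrt(total)


# Hypothetical training data: two instances of the same word.
keys = [("k", 0), ("ae", 1)]
training = [{("k", 0): 1}, {("k", 0): 3}]
prototype = build_prototype(training, keys)
exact = region_count_score({("k", 0): 2}, prototype)
off_by_two = region_count_score({("k", 0): 4}, prototype)
```

Weighting by inverse variance means a deviation counts for more where the training instances were consistent and for less where they varied.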

Current Assignee: Matsushita Electric Industrial Company Limited (Panasonic Holdings Corporation)
Original Assignee: Matsushita Electric Industrial Company Limited (Panasonic Holdings Corporation)
Inventors: Applebaum, Ted H.; Morin, Philippe R.
Primary Examiner(s): Dorvil, Richemond
Application Number: US08/526,746
Time in Patent Office: 1,131 Days
Field of Search: 395/2.63, 395/2.79, 395/2.55, 395/2.41, 395/2.65, 395/2.58, 395/2.6, 395/2.61, 395/2.64, 704/254, 704/270, 704/246, 704/232, 704/256, 704/249, 704/251, 704/255, 704/239
US Class Current: 704/254
CPC Class Codes:
G10L 15/02 Feature extraction for spee...
G10L 15/08 Speech classification or se...
