Speech recognition system utilizing vocabulary model preselection

US 5,386,492 A
Filed: 06/29/1992
Issued: 01/31/1995
Est. Priority Date: 06/29/1992
Status: Expired due to Term


Abstract

Preliminary screening of vocabulary models is provided by successively applying two different high speed distance measuring functions which provide progressively increasing measurement accuracy. Both distance measuring functions utilize subsampled representations of the unknown speech segment and the vocabulary models. The initial screening function achieves very high speed by eliminating certain usual time warping constraints and by precalculating a table of distance values which can be used for all vocabulary models. The second screening function yields improved accuracy in spite of possible endpointing errors by comparing extra frames, preceding and following the presumed unknown word, with noise models appended to each vocabulary model.
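
As an illustration of the subsampling that both screening functions rely on, here is a minimal Python sketch. The even-spacing rule, the name `subsample`, and the target length are assumptions made for illustration; the abstract does not say how the coarse sequences are chosen.

```python
def subsample(seq, n):
    """Pick n roughly evenly spaced items from seq.

    Hypothetical scheme: the first and last items are always kept and the
    rest are spread evenly in between, using integer arithmetic only.
    """
    if n <= 0 or not seq:
        return []
    if n == 1:
        return [seq[len(seq) // 2]]
    last = len(seq) - 1
    step = n - 1
    # Rounded integer division gives the nearest index to i * last / step.
    return [seq[(i * last + step // 2) // step] for i in range(n)]


fine_frames = list(range(10))            # stand-in for a fine sequence of frames
coarse_frames = subsample(fine_frames, 4)
```

The same helper could be applied to a model's fine sequence of states, so that every model and every utterance is reduced to a fixed, predetermined length before any distances are computed.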

9 Claims

1. In a speech recognition system which compares an unknown speech segment represented by a fine sequence of frames with a vocabulary of models represented by respective fine sequences of states, said states being selected from a limited collection of predetermined states, thereby to determine the best matches;
- a computer implemented method of preselecting candidate models for accurate comparison, said method comprising;
  
  for each model to be considered, subsampling the corresponding fine sequence of states to obtain a respective coarse sequence comprising a predetermined number of states;
  
  subsampling said fine sequence of frames to obtain a coarse sequence comprising a predetermined number of frames, said predetermined numbers together defining a matrix having frame positions along one axis and state positions along another axis, there being a preselected region within said matrix which is examined by said method;
  
  for each state in said limited collection, determining for each state position in said matrix the input frame which provides the best match with that state, irrespective of the frame determined in connection with any adjacent state position and considering and examining only frames which lie within said preselected region, a measure of the match being stored in a table;
  
  calculating, using said table, for each model to be considered a value representing the overall match of said coarse sequence of frames with the respective coarse sequence of states;
  
  preselecting for accurate comparison those models with the better overall match values as so calculated.

2. The method as set forth in claim 1 wherein said overall match value is obtained by accumulating the respective measures of match stored in said table.

3. The method as set forth in claim 1 wherein, in determining the input frame which provides the best match for each possible state in each possible matrix position, the method examines not only the respective subsampled frame but also a preselected number of frames which precede and follow the respective subsampled frame in said fine sequence of frames.
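
The table-driven screening recited in claims 1 and 2 can be sketched roughly as follows. This is not the patented implementation: the squared-Euclidean `frame_distance`, the diagonal band used as the "preselected region", and all function and parameter names are assumptions chosen to keep the example small.

```python
def frame_distance(frame, state):
    """Hypothetical distance: squared Euclidean between two feature vectors."""
    return sum((f - s) ** 2 for f, s in zip(frame, state))


def build_table(coarse_frames, states, n_positions, band=1):
    """table[s][j] = smallest distance between state s and any coarse frame
    in the examined region around state position j.

    Each position is handled on its own, irrespective of which frame was
    chosen for neighbouring positions (no time-warping constraint).
    """
    n_frames = len(coarse_frames)
    table = []
    for state in states:
        row = []
        for j in range(n_positions):
            # Assumed region: frames within `band` of the matrix diagonal.
            centre = j * (n_frames - 1) // max(n_positions - 1, 1)
            lo = max(0, centre - band)
            hi = min(n_frames - 1, centre + band)
            row.append(min(frame_distance(coarse_frames[i], state)
                           for i in range(lo, hi + 1)))
        table.append(row)
    return table


def score_model(table, coarse_states):
    """Accumulate the table entries for one model (lower is better)."""
    return sum(table[s][j] for j, s in enumerate(coarse_states))


def preselect(table, models, keep):
    """Return the names of the `keep` best-scoring models."""
    return sorted(models, key=lambda name: score_model(table, models[name]))[:keep]


states = [(0.0,), (1.0,), (2.0,)]          # the limited collection of states
frames = [(0.0,), (1.0,), (2.0,)]          # coarse input frames
models = {"up": [0, 1, 2], "down": [2, 1, 0], "flat": [1, 1, 1]}
table = build_table(frames, states, n_positions=3, band=0)
candidates = preselect(table, models, keep=2)
```

Because the table depends only on the input and the state collection, it is built once per utterance; scoring each model then costs one lookup per state position, which is where the speed of this first pass would come from.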

4. In a speech recognition system which compares an unknown input speech segment with a vocabulary of models and in which input speech is encoded as a fine sequence of frames and means are provided for identifying the likely start and finish endpoints of words in said fine sequence, said models being represented by correspondingly fine sequences of states;
- a computer implemented method of preselecting candidate models for accurate comparison, said method comprising;
  
  for each model to be considered, subsampling the corresponding fine sequence of frames to obtain a respective coarse sequence comprising a predetermined number of states;
  
  subsampling said fine sequence of frames between said endpoints to obtain a coarse sequence comprising a predetermined number of frames, said predetermined numbers together defining a matrix having frame positions along one axis and state positions along another axis;
  
  comparing a preselected number of frames preceding said start endpoint with a preselected noise model thereby to precalculate cost values for entry into said matrix at different frame position locations;
  
  comparing a preselected number of frames following said finish endpoint with a preselected noise model thereby to precalculate cost values for exit from said matrix at different frame position locations;
  
  for each model to be considered, determining a best match path across said matrix including the cost of entry to and exit from the matrix at different frame position locations, and scoring the model on the basis of that best path;
  
  selecting, for accurate comparison with the input speech segment, those models with the best scores thusly obtained.

5. The method as set forth in claim 4 wherein, in determining the input frame which provides the best match for each possible state in each possible matrix position, the method examines not only the respective subsampled frame but also a preselected number of frames which precede and follow the respective subsampled frame in said fine sequence of frames.
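
One way to picture the best-path scoring of claim 4 is a small dynamic-programming pass over the frame/state matrix. The sketch below is a guess at the general shape only: the allowed moves, the name `best_path_score`, and the convention that entry and exit costs arrive as ready-made lists indexed by frame position are all assumptions. The claim states that those costs come from comparing frames outside the endpoints with a noise model, but not the exact formula, so they are plain inputs here.

```python
def best_path_score(local, entry_cost, exit_cost):
    """Cheapest connected path across the matrix, including entry and exit.

    local[i][j]   : distance between coarse frame i and coarse state j
    entry_cost[i] : precalculated cost of entering the matrix at frame i
    exit_cost[i]  : precalculated cost of leaving the matrix at frame i
    """
    n_frames, n_states = len(local), len(local[0])
    inf = float("inf")
    d = [[inf] * n_states for _ in range(n_frames)]
    for i in range(n_frames):
        for j in range(n_states):
            prev = inf
            if j == 0:
                prev = entry_cost[i]                    # path may start here
            if i > 0:
                prev = min(prev, d[i - 1][j])           # next frame, same state
                if j > 0:
                    prev = min(prev, d[i - 1][j - 1])   # next frame, next state
            if j > 0:
                prev = min(prev, d[i][j - 1])           # same frame, next state
            d[i][j] = prev + local[i][j]
    # The path must end in the last state; add the cost of leaving there.
    return min(d[i][n_states - 1] + exit_cost[i] for i in range(n_frames))


local = [[0, 1, 1],
         [1, 0, 1],
         [1, 1, 0]]
score = best_path_score(local, entry_cost=[0, 5, 5], exit_cost=[5, 5, 0])
```

Letting the path enter and leave at different frame positions is what gives this pass some tolerance to endpointing errors: a model is not penalised in full when the presumed word boundaries are off by a few frames.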

6. In a speech recognition system which compares an unknown speech segment represented by a fine sequence of frames with a vocabulary of models represented by respective fine sequences of states thereby to determine the best matches;
- a computer implemented method of preselecting candidate models for accurate comparison, said method comprising;
  
  for each model to be considered, subsampling the corresponding fine sequence of states to obtain a respective coarse sequence comprising a predetermined number of states;
  
  subsampling said fine sequence of frames to obtain a coarse sequence comprising a predetermined number of frames, said predetermined numbers together defining a matrix having frame positions along one axis and state positions along another axis, there being a preselected region within said matrix which is examined by said method;
  
  determining for each state position in said matrix the input frame which provides the best match with that state, irrespective of the frame determined in connection with any adjacent state position and considering and examining only frames which lie within said preselected region, and providing a measure of the degree of match;
  
  combining the measures for the several state positions thereby to obtain a value representing the overall match of said coarse sequence of frames with the respective coarse sequence of states;
  
  preselecting for accurate comparison those models with the better overall match values as so calculated.

7. The method as set forth in claim 6 wherein, in determining the input frame which provides the best match for each possible state in each possible matrix position, the method examines not only the respective subsampled frame but also a preselected number of frames which precede and follow the respective subsampled frame in said fine sequence of frames.

8. In a speech recognition system which compares an unknown speech segment represented by a fine sequence of frames with a vocabulary of models represented by respective fine sequences of states, said vocabulary being partitioned into acoustically similar groups of model with one model of each group being representative of the group thereby to determine the best matches;
- a computer implemented method of preselecting candidate models for accurate comparison, said method comprising;
  
  for each model, subsampling the corresponding fine sequence of states to obtain a respective coarse sequence comprising a predetermined number of states;
  
  subsampling said fine sequence of frames to obtain a coarse sequence comprising a predetermined number of frames, said predetermined numbers together defining a matrix having frame positions along one axis and state positions along another axis, there being a preselected region within said matrix which is examined by said method;
  
  providing a first distance measuring function which determines for each state position in said matrix the input frame which provides the best match with that state, considering and examining only frames which lie within said preselected region, and provides a measure of the degree of match;
  
  combining the measures for the several state positions thereby to obtain a first value representing the overall match of said coarse sequence of frames with the respective coarse sequence of states;
  
  providing a second distance measuring function which determines a connected path across said matrix and calculates a second value representing the overall match of said coarse sequence of frames with the respective coarse sequence of states;
  
  applying said first distance measuring function to the group representative models;
  
  selecting the better scoring representative models and applying to the selected models said second distance measuring function thereby to identify a reduced number of better scoring groups;
  
  applying said first distance measuring function to the members of said better scoring groups;
  
  selecting the better scoring member models and applying to the selected member models said second distance measuring function thereby to preselect a reduced number of member models for accurate comparison with said unknown speech segment.

9. In a speech recognition system which compares an unknown speech segment represented by a fine sequence of frames with a vocabulary of models represented by respective fine sequences of states thereby to determine the best matches;
- a computer implemented method of selecting candidate models, said method comprising;
  
  for each model to be considered, subsampling the corresponding fine sequence of states to obtain a respective coarse sequence comprising a predetermined number of states;
  
  subsampling said fine sequence of frames to obtain a coarse sequence comprising a predetermined number of frames, said predetermined numbers together defining a matrix having frame positions along one axis and state positions along another axis, there being a preselected region within said matrix which is examined by said method;
  
  determining for each state position in said matrix the input frame which provides the best match with that state, irrespective of the frame determined in connection with any adjacent state position and considering and examining only frames which lie within said preselected region, and providing a measure of the degree of match;
  
  combining the measures for the several state positions thereby to obtain a value representing the overall match of said coarse sequence of frames with the respective coarse sequence of states;
  
  selecting (for accurate comparison) those models with the better overall match values as so calculated; and
  
  for only those models with the better overall match values, comparing the fine sequence of frames with the respective fine sequence of states thereby to identify at least one recognition candidate model.

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: Kurzweil Applied Intelligence, Inc. (Intel Corporation)
Inventors: Ganong, William F.; Yegnanarayanan, Girija; Sejnoha, Vladimir; Wilson, Brian H.
Primary Examiner(s): Knepper, David D.
Application Number: US07/905,345
Time in Patent Office: 946 Days
Field of Search: 395/2, 395/2.31, 395/2.5, 395/2.51, 395/2.52, 395/2.62, 395/2.63-2.65, 381/41-43
US Class Current: 704/252
CPC Class Codes: G10L 15/08 Speech classification or se...; G10L 2015/085 Methods for reducing search...
