Speech recognition using dynamic features

US 5,615,299 A
Filed: 06/20/1994
Issued: 03/25/1997
Est. Priority Date: 06/20/1994
Status: Expired due to Term

Abstract

A speech recognition technique uses a set of N different principal discriminant matrices, each associated with a distinct class. The class indicates the proximity of a speech segment to neighboring phones. The associated speech encoding technique arranges a speech signal into a series of frames and derives, for each frame, a feature vector that represents the speech signal for a speech segment or series of speech segments. A set of N different projected vectors is generated for each frame by multiplying each of the principal discriminant matrices by the feature vector. The encoding can be used in speech recognition systems that employ models in which each model transition is tagged with one of the N classes. The projected vector is used with the corresponding tag to compute the probability that at least one particular speech part is present in that frame.

25 Claims

1. A method for speech encoding, comprising the steps of:
- producing a set of N distinct principal discriminant matrices, each principal discriminant matrix being associated with a different class, each class being an indication of the proximity of a speech segment to one or more neighboring speech segments,
  
  arranging a speech signal into a series of frames;
  
  deriving a feature vector which represents said speech signal for each frame; and
  
  generating a set of N different projected vectors for each frame, by multiplying each of said N distinct principal discriminant matrices by said feature vector.
- Dependent claims (2, 3, 4):
  - 2. The method as described in claim 1, further comprising the step of:
    - splicing a series of adjacent feature vectors together to derive a spliced vector.
  - 3. The method as described in claim 1, further comprising the step of:
    - tagging each frame with one of said classes.
  - 4. The method as described in claim 1, wherein said indication of the proximity of a speech segment to one or more neighboring speech segments includes an indication of different amounts of overlap with said neighboring speech segments.
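
The encoding steps of claim 1, together with the splicing of claim 2, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names (`splice`, `mat_vec`, `project_all`), the edge-padding rule, and the toy matrices are all hypothetical, and the claims do not say how the matrices are estimated or what their dimensions are.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def splice(features: List[Vector], t: int, context: int) -> Vector:
    """Concatenate the feature vectors of frames t-context .. t+context.

    Frames beyond either end of the utterance are replaced by the nearest
    valid frame (an assumed padding rule; the claims do not specify one).
    """
    spliced: Vector = []
    for k in range(t - context, t + context + 1):
        k = min(max(k, 0), len(features) - 1)
        spliced.extend(features[k])
    return spliced


def mat_vec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("matrix column count must match vector length")
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def project_all(matrices: List[Matrix], vector: Vector) -> List[Vector]:
    """Return one projected vector per class-specific matrix (N in total)."""
    return [mat_vec(m, vector) for m in matrices]


# Toy data: 3 frames of 2-dimensional features, N = 2 classes.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
spliced = splice(features, t=1, context=1)  # 6-dimensional spliced vector

# Hypothetical 1 x 6 matrices; real ones would be estimated from training data.
matrices = [
    [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]],  # class 0: picks the first component
    [[0.0, 0.0, 0.5, 0.5, 0.0, 0.0]],  # class 1: averages the centre frame
]
projected = project_all(matrices, spliced)  # N projected vectors for frame 1
```

Every frame yields N projected vectors regardless of which class it belongs to; choosing among them is deferred to the recognition stage, where the tag on a model transition selects the projection to score.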

5. A method for speech recognition, the method of speech recognition comprising the steps of:
- deriving N distinct transformations, each distinct transformation is respectively associated with one of N classes, each class providing an indication of the proximity of a speech segment to one or more neighboring speech segments,
  
  arranging a speech signal into a series of frames;
  
  deriving a vector, within each said frame, which represents said speech signal;
  
  generating a set of N different projected vectors for each frame, by multiplying said transformations by said vector;
  
  utilizing models for tagging each model transition with one of said N classes; and
  
  utilizing the projected vector with the corresponding tag to compute a probability that a particular speech segment is present in said frame.
- Dependent claims (6, 7, 8, 9, 10):
  - 6. The method as described in claim 5, wherein said models are based on fenones, and each fenone is always associated with one of said N tags.
  - 7. The method as described in claim 5, wherein said models are based on phones, and each phone is associated with one of said N tags.
  - 8. The method as described in claim 5 further comprising the step of:
    - splicing a series of adjacent vectors together to derive a spliced vector.
  - 9. The method as described in claim 5, wherein said indication of the proximity of a speech segment to one or more neighboring speech segments includes an indication of different amounts of overlap with said neighboring speech segments.
  - 10. The method as described in claim 5, wherein said transformations are principal discriminant matrices.
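
The last two steps of claim 5 (tagged model transitions, and scoring a frame with the projected vector that matches the tag) might look like the sketch below. The claim does not state what form the probability model takes; a diagonal-covariance Gaussian is assumed here purely for illustration, and the `Transition` class and `log_likelihood` function are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Transition:
    """A model transition carrying a class tag and an output density.

    The diagonal Gaussian (mean, var) is an assumption for illustration;
    the claim only requires that each transition be tagged with a class.
    """
    tag: int
    mean: List[float]
    var: List[float]


def log_likelihood(transition: Transition, projected: List[List[float]]) -> float:
    """Score a frame using the projected vector selected by the transition's tag."""
    x = projected[transition.tag]
    if not (len(x) == len(transition.mean) == len(transition.var)):
        raise ValueError("dimension mismatch between vector and density")
    total = 0.0
    for xi, mu, var in zip(x, transition.mean, transition.var):
        total += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
    return total


# One frame with N = 2 projected vectors (toy values).
projected = [[0.0, 0.0], [1.0, -1.0]]
near_boundary = Transition(tag=1, mean=[1.0, -1.0], var=[1.0, 1.0])
mid_segment = Transition(tag=0, mean=[2.0, 2.0], var=[1.0, 1.0])

score_a = log_likelihood(near_boundary, projected)  # uses projected[1]
score_b = log_likelihood(mid_segment, projected)    # uses projected[0]
```

The point of the sketch is the indexing step `projected[transition.tag]`: each transition is scored only against the projection built for its own class.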

11. An apparatus for speech encoding comprising:
- means for producing a set of N distinct principal discriminant matrices, each principal discriminant matrix being associated with a different class, the class being an indication of the proximity of the speech segment to one or more neighboring speech segments;
  
  means for arranging a speech signal into a series of frames;
  
  means for deriving a feature vector which represents said speech signal for each frame; and
  
  means for generating a set of N different projected vectors for each frame, by multiplying each of said principal discriminant matrices by said vector.
- Dependent claims (12, 13, 14):
  - 12. The apparatus described in claim 11, further comprising:
    - means for splicing a series of adjacent feature vectors together to derive a spliced vector.
  - 13. The apparatus described in claim 11, further comprising:
    - means for tagging each frame with one of said classes.
  - 14. The apparatus described in claim 11, wherein said indication of the proximity of a speech segment to one or more neighboring speech segments includes an indication of different amounts of overlap with said neighboring speech segments.

15. A speech recognition system comprising:
- means for arranging speech segments into a series of frames;
  
  means for deriving a vector, within each of said frames, which represents said speech signal;
  
  means for deriving N distinct transformations, each distinct transformation is respectively associated with one of N classes, each class providing an indication of the proximity of a speech part to neighboring speech parts,
  
  means for generating a set of N different, projected vectors for each frame, by multiplying said N transformations by said vector;
  
  means for utilizing models for tagging each model transition with one of said N classes; and
  
  means for utilizing the projected vector with the corresponding tag to compute the probability that a particular speech part is present in said frame.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The system described in claim 15, wherein said models are based on fenones, and each fenone is always associated with one of said N tags.
  - 17. The system described in claim 15, wherein said models are based on phones, and each phone is associated with one of said N tags.
  - 18. The system described in claim 15, further comprising:
    - means for splicing a series of adjacent vectors together to derive a spliced vector.
  - 19. The system described in claim 15, wherein said indication of the proximity of a speech part to neighboring speech parts includes an indication for different amounts of overlap with neighboring speech parts.
  - 20. The apparatus described in claim 15, wherein said transformations are principal discriminant matrices.

21. A method for speech recognition which comprises the steps of:
- arranging a speech signal into a series of frames;
  
  varying the width of one or more windows to be utilized for a speech encoding system in accordance with a principal discriminant matrix, each window being defined as a number of successive frames which have a same speech segment associated therewith;
  
  deriving a feature vector which represents said speech signal for each frame; and
  
  generating a projected vector for each frame by multiplying said principal discriminant matrix by said feature vector, wherein said principal discriminant matrix represents the values of the projected vectors which are indicative of the speech signal.
- Dependent claim (22):
  - 22. The method as described in claim 21, wherein there are N principal discriminant matrices which are associated with N respective, distinct classes, the different classes being an indication of the proximity of the speech segment to neighboring speech segments.

23. An apparatus which comprises:
- means for arranging a speech signal into a series of frames;
  
  means for varying the width of one or more windows to be utilized for a speech encoding system, based upon a principal discriminant matrix, each window is defined as the number of successive frames which has the same speech segment associated with it,
  
  means for deriving a feature vector which represents said speech signal for a speech segment or series of speech segments for each frame; and
  
  means for generating a projected vector for each frame by multiplying said principal discriminant matrix by said feature vector, wherein said principal discriminant matrix equates the values of the projected vectors which are representative of the speech signal.
- Dependent claim (24):
  - 24. The apparatus as described in claim 23, wherein there are N principal discriminant matrices which are associated with N respective, distinct classes, the different classes being an indication of the proximity of the speech segment to neighboring speech segments.

25. A method for applying a value to each tag from a series of tags, to be utilized in a speech recognition application, comprising the steps of:
- determine whether a frame F belongs to a phone whose duration is M frames or less, if so, set the tag for each frame in the phone at a first value;
  
  otherwise, proceed with the next step;
  
  determine whether the window of frame F overlaps the preceding phone by N frames or more, if so, set the value of the tag at a second value, otherwise proceed with the next step;
  
  determine whether the window overlaps the following phone by N frames or more, if so, set frame tag at a third value, otherwise proceed with the next step;
  
  determine whether the window overlaps the preceding phone at all, if so, set the tag to a fourth value, otherwise proceed with the next step;
  
  determine whether the window overlaps the following phone at all, if so, set the tag to a fifth value, otherwise proceed to the next step; and
  
  set the tag to a sixth value.
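
The decision sequence of claim 25 maps naturally onto a chain of early returns. The sketch below is one possible reading, with several assumptions that the claim leaves open: the window of frame F is taken to be symmetric (`half_width` frames on each side), phone boundaries are given as inclusive frame indices, the six tag values are represented by the integers 1 to 6, and the overlap threshold is at least 1. The names `tag_frame`, `max_short` (the claim's M), and `min_overlap` (the claim's N) are hypothetical.

```python
def tag_frame(frame: int, phone_start: int, phone_end: int,
              half_width: int, max_short: int, min_overlap: int) -> int:
    """Return a tag in 1..6 for `frame`, testing conditions in the order of claim 25.

    phone_start, phone_end: first and last frame (inclusive) of the phone
        that contains `frame`.
    half_width: assumed window shape, frame-half_width .. frame+half_width.
    max_short: the claim's M (phones this short or shorter get tag 1).
    min_overlap: the claim's N (threshold for a "large" overlap), assumed >= 1.
    """
    if not phone_start <= frame <= phone_end:
        raise ValueError("frame must lie inside its phone")

    duration = phone_end - phone_start + 1
    if duration <= max_short:
        return 1  # short phone: every frame in it gets the first value

    # Number of window frames that fall before / after the current phone.
    overlap_prev = max(0, phone_start - (frame - half_width))
    overlap_next = max(0, (frame + half_width) - phone_end)

    if overlap_prev >= min_overlap:
        return 2  # large overlap with the preceding phone
    if overlap_next >= min_overlap:
        return 3  # large overlap with the following phone
    if overlap_prev > 0:
        return 4  # some overlap with the preceding phone
    if overlap_next > 0:
        return 5  # some overlap with the following phone
    return 6      # window lies entirely inside the phone


# A 20-frame phone spanning frames 10..29, window half-width 4, M = 5, N = 3.
tags = [tag_frame(f, 10, 29, 4, 5, 3) for f in (10, 12, 20, 27, 29)]
short = tag_frame(41, 40, 43, 4, 5, 3)  # 4-frame phone, so M = 5 applies
```

Because the tests run in a fixed order, a frame whose window overlaps both neighbors is tagged by the first condition it satisfies, which is why the preceding-phone checks take precedence over the following-phone checks at each threshold.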

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Bahl, Lalit R.; Gopalakrishnan, Ponani; Picheny, Michael A.; de Souza, Peter V.
Primary Examiner(s): MacDonald, Allen R.
Assistant Examiner(s): Smits, Talivaldis Ivars
Application Number: US08/262,093
Time in Patent Office: 1,009 Days
Field of Search: 395/2.4, 395/2.49, 395/2.51, 395/2.6, 395/2.63, 395/2.64, 395/214, 381/41, 381/42, 381/43, 381/45
US Class Current: 704/254
CPC Class Codes:
G10L 15/02   Feature extraction for spee...
G10L 19/0018   Speech coding using phoneti...
G10L 2015/025   Phonemes, fenemes or fenone...
