Knowledge-based speech recognition system and methods having frame length computed based upon estimated pitch period of vocalic intervals
Abstract
Knowledge based speech recognition apparatus and methods are provided for translating an input speech signal to text. The speech recognition apparatus captures an input speech signal, segments it based on the detection of pitch period, and generates a series of hypothesized acoustic feature vectors for the input speech signal that characterizes the signal in terms of primary acoustic events, detectable vowel sounds and other acoustic features. The apparatus and methods employ a largely speaker-independent dictionary based upon the application of phonological and phonetic/acoustic rules to generate acoustic event transcriptions against which the series of hypothesized acoustic feature vectors are compared to select word choices. Local and global syntactic analysis of the word choices is provided to enhance the recognition capability of the methods and apparatus.
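The abstract and the independent claims state that the frame length for vocalic intervals is computed from an estimate of the pitch period, but this section does not say how the estimate is obtained or how the frame length follows from it. The sketch below is only an illustration under stated assumptions: it estimates the pitch period with a plain autocorrelation peak search (a common textbook technique, not necessarily the patented one) and takes the frame length to be a whole number of pitch periods. The names `estimate_pitch_period`, `frame_length`, `f_min`, `f_max` and `periods_per_frame` are hypothetical.

```python
import math


def estimate_pitch_period(samples, sample_rate, f_min=60.0, f_max=400.0):
    """Return the lag (in samples) with the largest autocorrelation inside
    the assumed pitch range, or None if no lag in that range scores above 0.

    Assumption: autocorrelation peak picking stands in for whatever pitch
    estimator the patent actually uses.
    """
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(samples) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the buffer at this lag.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def frame_length(pitch_period, periods_per_frame=1):
    """Assumption: one frame spans a whole number of pitch periods."""
    return pitch_period * periods_per_frame


# Synthetic "vocalic interval": a 100 Hz sine sampled at 8 kHz for 0.1 s.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(800)]

period = estimate_pitch_period(tone, sample_rate)
print(period, frame_length(period))  # 80 80
```

For the synthetic tone, the expected period is 8000 / 100 = 80 samples, so a frame of one period is 80 samples long. Real speech would need voicing detection and a more robust estimator before any such frame length could be trusted.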
52 Claims
1. A knowledge-based speech recognition system for recognizing an input speech signal comprising vocalic and non-vocalic intervals, each of the vocalic intervals having a pitch period, the system comprising:
means for capturing the input speech signal;
means for segmenting the input speech signal into a series of segments including vocalic intervals and non-vocalic intervals, the vocalic intervals having a frame length computed based on an estimation of the pitch period of the vocalic intervals of the input speech signal;
means for characterizing the series of segments based upon acoustic events detected within the input speech signal to obtain an acoustic feature vector;
means for storing a dictionary having a multiplicity of words, each one of the multiplicity of words described by a phonetic transcription and at least one acoustic event transcription; and
means for selecting a word choice by comparing the acoustic feature vector to the acoustic event transcriptions of the multiplicity of words.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9

10. A knowledge-based speech recognition system for recognizing an input speech signal, the input speech signal including a sequence of phonemes, the system comprising:
means for capturing the input speech signal;
means for segmenting the input speech signal into a series of segments including vocalic intervals and non-vocalic intervals, the vocalic intervals having a frame length computed based on an estimation of the pitch period of the vocalic intervals of the input speech signal, the series of segments approximately corresponding to the sequence of phonemes;
means for characterizing the series of segments based upon acoustic events detected within the input speech signal to obtain an acoustic feature vector;
means for storing a dictionary having a multiplicity of words, each one of the multiplicity of words described by a phonetic transcription and an acoustic event transcription, the acoustic event transcription comprising a lattice of acoustic events corresponding to a plurality of pronunciations; and
means for selecting a word choice by comparing the acoustic feature vector to selected lattices of acoustic event transcriptions.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23

24. A knowledge-based speech recognition system for recognizing an input speech signal, the input speech signal including a sequence of phonemes, the system comprising:
means for capturing the input speech signal;
means for segmenting the input speech signal into a series of segments including vocalic intervals and non-vocalic intervals, the vocalic intervals having a frame length computed based on an estimation of the pitch period of the vocalic intervals of the input speech signal, the series of segments approximately corresponding to the sequence of phonemes;
means for characterizing the acoustic and spectral characteristics of the input speech signal;
means for storing a dictionary having a multiplicity of words, each one of the multiplicity of words described by a series of acoustic events, spectral characteristics and a series of grammatical, semantic and syntactic attributes; and
means for selecting a word choice from amongst the multiplicity of words by weighing, for one or more of the multiplicity of words, correspondence between the acoustic events and the acoustic characteristics, between the spectral characteristics of the words and the input speech signal, and the grammatical, semantic and syntactic attributes of the words.
Dependent claims: 25, 26

27. A method of applying knowledge-based rules in a processor-based system to recognize an input speech signal as a word in a natural language, the input speech signal comprising vocalic and non-vocalic intervals, each of the vocalic intervals having a pitch period, the method comprising steps of:
capturing and storing the input speech signal;
segmenting the input speech signal into a series of segments including vocalic intervals and non-vocalic intervals, the vocalic intervals having a frame length computed based on an estimation of the pitch period of the vocalic intervals of the input speech signal;
generating an acoustic feature vector by characterizing the series of segments based upon acoustic events detected within the input speech signal;
retrieving from a store a dictionary having a multiplicity of words, each one of the multiplicity of words described by a phonetic transcription and at least one acoustic event transcription; and
comparing the acoustic feature vector to the acoustic event transcriptions of the multiplicity of words to select a word choice.
Dependent claims: 28, 29, 30, 31, 32, 33, 34, 35

36. A method of applying knowledge-based rules in a processor-based system to recognize an input speech signal as a word in a natural language, the input speech signal including a sequence of phonemes, the method comprising steps of:
capturing and storing the input speech signal;
segmenting the input speech signal into a series of segments including vocalic intervals and non-vocalic intervals, the vocalic intervals having a frame length computed based on an estimation of the pitch period of the vocalic intervals of the input speech signal, the series of segments approximately corresponding to the sequence of phonemes;
generating an acoustic feature vector by characterizing the series of segments based upon acoustic events detected within the input speech signal;
retrieving from a store a dictionary having a multiplicity of words, each one of the multiplicity of words described by a phonetic transcription and an acoustic event transcription, the acoustic event transcription comprising a lattice of acoustic events corresponding to a plurality of pronunciations; and
comparing the acoustic feature vector to selected lattices of acoustic event transcriptions to select a word choice.
Dependent claims: 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49

50. A method of applying knowledge-based rules in a processor-based system to recognize an input speech signal as a word in a natural language, the input speech signal including a sequence of phonemes, the method comprising steps of:
capturing and storing the input speech signal;
segmenting the input speech signal into a series of segments including vocalic intervals and non-vocalic intervals, the vocalic intervals having a frame length computed based on an estimation of the pitch period of the vocalic intervals of the input speech signal, the series of segments approximately corresponding to the sequence of phonemes;
computing acoustic and spectral characteristics representative of the input speech signal;
retrieving from a store a dictionary having a multiplicity of words, each one of the multiplicity of words described by a series of acoustic events, spectral characteristics and a series of grammatical, semantic and syntactic attributes; and
selecting a word choice from amongst the multiplicity of words by weighing, for one or more of the multiplicity of words, correspondence between the acoustic events and the acoustic characteristics, between the spectral characteristics of the words and the input speech signal, and the grammatical, semantic and syntactic attributes of the words.
Dependent claims: 51, 52

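Claims 10 and 36 describe each dictionary word by a lattice of acoustic events that covers a plurality of pronunciations, and select a word by comparing the acoustic feature vector to those lattices. The claims do not define the lattice layout or the comparison measure, so the following is only a minimal sketch under assumptions: the lattice is simplified to one set of acceptable events per segment position, the feature vector is reduced to a list of event labels, and the score is the fraction of positions that match. The event labels, the dictionary entries, and the names `lattice_score` and `select_word` are all hypothetical.

```python
from typing import Dict, List, Set

# Assumption: a lattice is simplified to a sequence of positions, each
# holding the set of acoustic events accepted at that position.
Lattice = List[Set[str]]

# Hypothetical dictionary entries with made-up event labels.
DICTIONARY: Dict[str, Lattice] = {
    "sat": [{"NOISE"}, {"VOWEL"}, {"PAUSE"}, {"BURST", "NOISE"}],
    "at": [{"VOWEL"}, {"PAUSE"}, {"BURST", "NOISE"}],
    "see": [{"NOISE"}, {"VOWEL"}],
}


def lattice_score(events: List[str], lattice: Lattice) -> float:
    """Fraction of positions whose observed event is allowed by the lattice.

    Assumption: sequences of different length are treated as a non-match;
    a real system would need some form of alignment instead.
    """
    if len(events) != len(lattice):
        return 0.0
    hits = sum(1 for event, allowed in zip(events, lattice) if event in allowed)
    return hits / len(lattice)


def select_word(events: List[str], dictionary: Dict[str, Lattice]) -> str:
    """Return the dictionary word whose lattice scores highest."""
    return max(dictionary, key=lambda word: lattice_score(events, dictionary[word]))


# Observed events for one utterance (hypothetical labels).
observed = ["NOISE", "VOWEL", "PAUSE", "BURST"]
choice = select_word(observed, DICTIONARY)
print(choice)  # sat
```

Storing alternatives per position lets one entry accept several pronunciations (here a final stop realized as either a burst or noise) without listing each variant as a separate transcription.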
Specification