Labelling speech using context-dependent acoustic prototypes
First Claim
1. A computer based speech recognition system for labeling speech using context-dependent label prototype vectors, the system having an input comprising a sequence of phones from a training text, each of said sequence of phones having an associated phonetic context the system comprising:
a user interface configured to receive spoken sounds corresponding to a spoken version of the training text, and further configured to generate an output signal representative of said spoken sounds;
a signal processor, coupled to said user interface, configured to convert said output signal into a series of feature vector signals; and
a context-dependent labeller, coupled to said signal processor, configured to assign a context-dependent label to each feature vector signal of said series of feature vector signals to result in tagged feature vectors, comprising;
aligning means, coupled to said signal processor, for aligning each of said feature vector signals with a corresponding phone to result in aligned feature vector signals,
tagging means, coupled to said aligning means, for tagging each of said aligned feature vector signals with the phonetic context associated with said corresponding phone to result in tagged prototype vector signals, and
first associating means, coupled to said tagging means, for associating a label with each of said tagged prototype vector signals based upon a context-dependent prototype vector signal, comprising;
phonetic context identifying means for determining, for each said label, whether a context-dependent prototype vector signal exists corresponding to the phonetic context of the tagged prototype vector signal,
matching score generating means, coupled to said phonetic context identifying means, for generating a score for achieving said tagged feature vector signal given each of said context-dependent prototype vector signals having the same phonetic context as the tagged feature vector signal as determined in said phonetic context identifying means, and
associating means, coupled to said matching score generating means, for associating a label which is associated with a context-dependent prototype vector signal having the highest score as generated by said matching score generating means with said tagged feature vector signal.
Abstract
The present invention relates to labelling of speech in a context-dependent speech recognition system. When labelling speech using context-dependent prototypes, the phone context of a frame of speech needs to be aligned with the appropriate acoustic parameter vector. Since aligning a large amount of data is difficult if based upon arc ranks, the present invention aligns the data using context-independent acoustic prototypes. The phonetic context of each phone of the data is known. Therefore, after the alignment step, the acoustic parameter vectors are tagged with a corresponding phonetic context. Context-dependent prototype vectors exist for each label. For all labels, the context-dependent prototype vectors having the same phonetic context as the tagged acoustic parameter vector are determined. For each label, the probability of achieving the tagged acoustic parameter vector is determined given each of the context-dependent label prototype vectors having the same phonetic context as the tagged acoustic parameter vector. The label with the highest probability is associated with the context-dependent acoustic parameter vector.
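The final selection step described in the abstract can be illustrated with a minimal sketch. It is not taken from the patent: the diagonal-covariance Gaussian score, the (previous phone, phone, next phone) context tuple, and the names `log_gaussian`, `label_tagged_vector`, and `prototypes` are all assumptions made for illustration.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (assumed scoring model)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def label_tagged_vector(vector, context, prototypes):
    """Return the label whose same-context prototype scores highest.

    prototypes maps label -> {context: (mean, var)}. Labels that have no
    prototype for this context are skipped. Returns None if no label has one.
    """
    best_label, best_score = None, -math.inf
    for label, by_context in prototypes.items():
        if context not in by_context:
            continue  # no context-dependent prototype for this label
        mean, var = by_context[context]
        score = log_gaussian(vector, mean, var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical prototypes: two labels, contexts written as (prev, phone, next).
prototypes = {
    "L1": {("k", "ae", "t"): ([0.0, 0.0], [1.0, 1.0])},
    "L2": {
        ("k", "ae", "t"): ([3.0, 3.0], [1.0, 1.0]),
        ("b", "ae", "t"): ([0.0, 0.0], [1.0, 1.0]),
    },
}
chosen = label_tagged_vector([2.8, 3.1], ("k", "ae", "t"), prototypes)
```

Here both labels have a prototype for the context ("k", "ae", "t"), and the vector lies closest to the one belonging to `L2`, so `chosen` is `"L2"`. Any other probability model could be substituted for `log_gaussian` without changing the selection logic.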
45 Citations
14 Claims
1. A computer based speech recognition system for labeling speech using context-dependent label prototype vectors, the system having an input comprising a sequence of phones from a training text, each of said sequence of phones having an associated phonetic context the system comprising:
a user interface configured to receive spoken sounds corresponding to a spoken version of the training text, and further configured to generate an output signal representative of said spoken sounds;
a signal processor, coupled to said user interface, configured to convert said output signal into a series of feature vector signals; and
a context-dependent labeller, coupled to said signal processor, configured to assign a context-dependent label to each feature vector signal of said series of feature vector signals to result in tagged feature vectors, comprising;
aligning means, coupled to said signal processor, for aligning each of said feature vector signals with a corresponding phone to result in aligned feature vector signals,
tagging means, coupled to said aligning means, for tagging each of said aligned feature vector signals with the phonetic context associated with said corresponding phone to result in tagged prototype vector signals, and
first associating means, coupled to said tagging means, for associating a label with each of said tagged prototype vector signals based upon a context-dependent prototype vector signal, comprising;
phonetic context identifying means for determining, for each said label, whether a context-dependent prototype vector signal exists corresponding to the phonetic context of the tagged prototype vector signal,
matching score generating means, coupled to said phonetic context identifying means, for generating a score for achieving said tagged feature vector signal given each of said context-dependent prototype vector signals having the same phonetic context as the tagged feature vector signal as determined in said phonetic context identifying means, and
associating means, coupled to said matching score generating means, for associating a label which is associated with a context-dependent prototype vector signal having the highest score as generated by said matching score generating means with said tagged feature vector signal.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
8. A method for creating labels which are necessary for leafemic baseform construction, wherein a spoken version of a training text is converted into signals representing feature vectors in a signal processor, and wherein a sequence of phones are input from said training text each phone having a phonetic context associated with it, said phonetic context comprising one or more phones occurring immediately prior to or subsequent to said phone, context-independent prototype vectors and context-dependent prototype vectors having previously been stored in a memory module, comprising the steps of:
(1) matching each of said feature vector signals with a most similar signal representing a context-independent label prototype vector said most similar signal determined by comparing one or more parameters of said feature vector signal with one or more parameters of each of said signals representing said context-independent label prototype vectors, which is stored in the memory module, to label each of said feature vector signals with said most similar signal representing a context-independent label;
(2) aligning each of said context-independent labelled feature vector signals with a corresponding phone from said training text;
(3) tagging each of said aligned feature vector signals with the phonetic context associated with said corresponding phone;
(4) identifying signals representing one or more context-dependent label prototype vectors having the same phonetic context of a given tagged feature vector signal;
(5) determining the score for achieving a feature vector signal given each of the context-dependent label prototype vector signals identified in step (4);
(6) identifying a context-dependent label prototype vector signal which maximizes the score of said feature vector signal as determined in step (5);
(7) replacing the context-independent label associated with said feature vector signal with a label associated with said maximum score context-dependent label prototype vector signal as identified in step (6) to label said feature vector signal with a context-dependent label; and
(8) repeating steps (4)-(7) for each tagged feature vector signal.
- View Dependent Claims (9, 10, 11, 12, 13, 14)
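As a reading aid only, steps (1) through (8) of claim 8 can be sketched in a few lines. The sketch rests on several assumptions that the claim does not state: similarity and score are both taken to be negative squared Euclidean distance, the alignment of step (2) is supplied as a precomputed `frame_to_phone` list rather than computed from the context-independent labels, and every name (`relabel`, `nearest_label`, `ci_protos`, `cd_protos`) is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_label(vector, prototypes):
    """Step (1): label of the closest context-independent prototype."""
    return min(prototypes, key=lambda lab: sq_dist(vector, prototypes[lab]))

def relabel(frames, frame_to_phone, phone_contexts, ci_protos, cd_protos):
    # Step (1): context-independent labelling of every feature vector.
    labels = [nearest_label(f, ci_protos) for f in frames]
    # Steps (2)-(3): the alignment is assumed to be given; each frame is
    # tagged with the phonetic context of the phone it is aligned to.
    tags = [phone_contexts[p] for p in frame_to_phone]
    # Steps (4)-(8): for each tagged frame, score the prototypes that share
    # its context and replace the label with the best-scoring one.
    for i, (frame, ctx) in enumerate(zip(frames, tags)):
        candidates = {
            lab: by_ctx[ctx] for lab, by_ctx in cd_protos.items() if ctx in by_ctx
        }
        if candidates:  # keep the context-independent label otherwise
            labels[i] = max(
                candidates, key=lambda lab: -sq_dist(frame, candidates[lab])
            )
    return labels

# Hypothetical data: three frames, two phones, contexts as (prev, phone, next).
frames = [[0.1, 0.0], [0.9, 1.1], [2.0, 2.1]]
frame_to_phone = [0, 0, 1]
phone_contexts = [("sil", "k", "ae"), ("k", "ae", "t")]
ci_protos = {"A": [0.0, 0.0], "B": [1.0, 1.0], "C": [2.0, 2.0]}
cd_protos = {
    "A_1": {("sil", "k", "ae"): [0.0, 0.0]},
    "B_1": {("sil", "k", "ae"): [1.0, 1.0]},
    "C_1": {("k", "ae", "t"): [2.0, 2.0]},
}
result = relabel(frames, frame_to_phone, phone_contexts, ci_protos, cd_protos)
```

With this toy data the three frames receive the labels `A_1`, `B_1`, and `C_1`. Falling back to the context-independent label when no same-context prototype exists is a choice made for this sketch, not something the claim specifies.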
Specification