Transcription of speech data with segments from acoustically dissimilar environments
Abstract
A technique for improving recognition accuracy when transcribing speech data drawn from a wide range of environments. In many situations the input contains data from a variety of sources recorded in different acoustic environments. The classes of data include clean speech, speech corrupted by noise (e.g., music), non-speech (e.g., pure music with no speech), and telephone speech; segments can also be distinguished by the identity of the speaker. In the described technique, the different classes of data are first identified automatically, and each class is then transcribed by a system built specifically for it. The invention also describes a segmentation algorithm that builds an acoustic model characterizing the data in each class and then uses a dynamic programming algorithm (the Viterbi algorithm) to automatically identify the segments that belong to each class. The acoustic models are built in a particular feature space, and the invention also describes different feature spaces for use with different classes.
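The segmentation step summarized in the abstract can be illustrated with a minimal sketch. It assumes that per-frame log-likelihoods under each class model are already available, and it enforces a minimum segment duration by expanding each class into duration-counting states. The function name `viterbi_segment`, the `switch_penalty` parameter, the class names, and the toy scores are all hypothetical and are not taken from the patent.

```python
import math

def viterbi_segment(frame_loglik, min_duration=3, switch_penalty=0.0):
    """Label frames with the best class sequence under a minimum-duration constraint.

    frame_loglik[t][c] is the log-likelihood of frame t under class model c
    (assumed to be computed elsewhere). Returns (start, end_exclusive, class) tuples.
    """
    n_frames, n_classes = len(frame_loglik), len(frame_loglik[0])
    if n_frames < min_duration:
        raise ValueError("fewer frames than the minimum segment duration")
    last = min_duration - 1  # state (c, last) means the segment is already long enough
    neg_inf = -math.inf
    score = [[neg_inf] * min_duration for _ in range(n_classes)]
    for c in range(n_classes):
        score[c][0] = frame_loglik[0][c]
    backptrs = []
    for t in range(1, n_frames):
        new_score = [[neg_inf] * min_duration for _ in range(n_classes)]
        ptr = [[None] * min_duration for _ in range(n_classes)]
        for c in range(n_classes):
            for d in range(min_duration):
                cands = []
                if d == 0:
                    # Start a new segment: only allowed from a different class
                    # whose current segment has reached the minimum duration.
                    cands += [(score[p][last] - switch_penalty, (p, last))
                              for p in range(n_classes) if p != c]
                else:
                    cands.append((score[c][d - 1], (c, d - 1)))
                if d == last:
                    cands.append((score[c][last], (c, last)))  # keep extending
                best, prev = max(cands, key=lambda x: x[0], default=(neg_inf, None))
                if best > neg_inf:
                    new_score[c][d] = best + frame_loglik[t][c]
                    ptr[c][d] = prev
        score = new_score
        backptrs.append(ptr)
    # The final segment must also satisfy the minimum duration.
    state = (max(range(n_classes), key=lambda k: score[k][last]), last)
    labels = [state[0]]
    for ptr in reversed(backptrs):
        state = ptr[state[0]][state[1]]
        labels.append(state[0])
    labels.reverse()
    segments, start = [], 0
    for t in range(1, n_frames + 1):
        if t == n_frames or labels[t] != labels[start]:
            segments.append((start, t, labels[start]))
            start = t
    return segments

# Toy input: two hypothetical classes, with a one-frame "music" blip at frame 4.
CLASSES = ["clean_speech", "music"]
preferred = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1]
frames = [[-1.0 if c == p else -5.0 for c in range(2)] for p in preferred]
for start, end, c in viterbi_segment(frames, min_duration=3):
    print(start, end, CLASSES[c])
```

With a minimum duration of three frames, the isolated blip at frame 4 is absorbed into the surrounding clean-speech segment, so the toy input yields two segments (frames 0 to 7 and frames 8 to 11). Setting `min_duration=1` removes the constraint and the blip becomes its own segment.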
148 Citations
37 Claims
1. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes; and
- transcribing each type identifier tagged segment using a specific system created for that type.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.
12. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes, wherein it is assumed that the input data is produced by a parallel combination of models, each model corresponding to one of the predetermined classes, the identifier tag assigned to a segment being the class identifier tag of the model that gives the segment the highest probability, subject to certain constraints, wherein one of the constraints is a minimum duration on the segment, and wherein a number of classes that the acoustic input can be classified into are identified that represent the most acoustically dissimilar classes possible; and
- transcribing each type identifier tagged segment using a specific system created for that type.
13. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes using a binary tree hierarchy, wherein at each level of the tree, segments corresponding to one of the predetermined classifications are isolated, and wherein a number of classes that the acoustic input can be classified into are identified that represent the most acoustically dissimilar classes possible; and
- transcribing each type identifier tagged segment using a specific system created for that type.
14. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes, wherein segmentation is carried out using a Hidden Markov Model to model each class and the Viterbi algorithm to isolate and assign type identifier tags to the segments; and
- transcribing each type identifier tagged segment using a specific system created for that type.
15. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes, wherein it is assumed that the input data is produced by a parallel combination of models, each model corresponding to one of the predetermined classes, the identifier tag assigned to a segment being the class identifier tag of the model that gives the segment the highest probability, subject to certain constraints, wherein a number of classes that the acoustic input can be classified into are identified that represent the most acoustically dissimilar classes possible and, wherein the process of creating the models comprises identifying a feature space for the individual predetermined classes; and
- transcribing each type identifier tagged segment using a specific system created for that type.
Dependent claims: 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26.
27. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes, wherein the classes include non-speech, telephone speech, noise-corrupted speech, and clean speech, and wherein clean speech segments are further segmented into smaller segments that can be assigned a speaker identifier tag; and
- transcribing each type identifier tagged segment using a specific system created for that type.
Dependent claims: 28, 29, 30, 31, 32.
33. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes, wherein the classes include non-speech, telephone speech, noise-corrupted speech, and clean speech; and
- transcribing each type identifier tagged segment using a specific speech recognition system created for that type, wherein setting up a system for transcribing telephone speech comprises transforming the training data from which the speech recognition system was made so that it matches the acoustic environment of telephone speech, wherein the transformation comprises band limiting the training data to telephone bandwidths.
34. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes, wherein the classes include non-speech, telephone speech, noise-corrupted speech, and clean speech; and
- transcribing each type identifier tagged segment using a specific speech recognition system created for that type, wherein setting up a system for transcribing noise-corrupted speech comprises transforming the training data from which the speech recognition system was made so that it matches the acoustic environment of noise-corrupted speech, wherein the transformation comprises adding pure noise to the clean speech in the training data.
Dependent claims: 35.
36. A system for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- means for inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes; and
- means for transcribing each type identifier tagged segment using a specific system created for that type.
37. Apparatus for transcribing a segment of data that includes speech in one or more environments and non-speech data, the apparatus comprising:
- a segmenter which produces a series of segments from the data, each segment being given a type identifier tag selected from a predetermined set of classes; and
- a plurality of speech recognizers coupled to the segmenter which are specifically created for each type and which respectively transcribe segments having corresponding type identifier tags.
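The two training-data transformations recited in claims 33 and 34 (band limiting to telephone bandwidths and adding pure noise to clean speech) might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the 300 to 3400 Hz pass band, the 16 kHz sample rate, the Hamming-windowed FIR filter, and the scaling of the noise to a target signal-to-noise ratio are all choices made for this example, and the function names are hypothetical.

```python
import math
import random

def add_noise(speech, noise, snr_db):
    """Add noise to clean speech, scaled to a target SNR in dB (an assumed mixing rule)."""
    if len(noise) < len(speech):
        noise = noise * (len(speech) // len(noise) + 1)  # loop short noise clips
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

def bandpass_taps(low_hz, high_hz, sample_rate, num_taps=101):
    """Hamming-windowed sinc band-pass FIR: low-pass(high_hz) minus low-pass(low_hz).

    num_taps must be odd so that the filter has a centre tap.
    """
    mid = (num_taps - 1) // 2
    taps = []
    for i in range(num_taps):
        k = i - mid
        if k == 0:
            ideal = 2.0 * (high_hz - low_hz) / sample_rate
        else:
            ideal = (math.sin(2 * math.pi * high_hz * k / sample_rate)
                     - math.sin(2 * math.pi * low_hz * k / sample_rate)) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def band_limit(signal, low_hz=300.0, high_hz=3400.0, sample_rate=16000, num_taps=101):
    """Band-limit a signal to an assumed telephone band by centred FIR filtering."""
    taps = bandpass_taps(low_hz, high_hz, sample_rate, num_taps)
    half = num_taps // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for i, h in enumerate(taps):
            j = n + half - i  # centred convolution, so the output is not delayed
            if 0 <= j < len(signal):
                acc += h * signal[j]
        out.append(acc)
    return out

# Hypothetical usage: derive telephone-like and noisy copies of one clean utterance.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(800)]  # stand-in for speech
babble = [rng.gauss(0.0, 1.0) for _ in range(200)]                     # stand-in for noise
telephone_like = band_limit(clean)
noisy = add_noise(clean, babble, snr_db=10.0)
```

A recognizer for each class would then be trained on the correspondingly transformed copy of the clean training set. In practice a signal-processing library would replace the hand-written filter; the pure-Python version is used here only to keep the sketch self-contained.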
Specification