Transcription of speech data with segments from acoustically dissimilar environments

US 6,067,517 A
Filed: 02/02/1996
Issued: 05/23/2000
Est. Priority Date: 02/02/1996
Status: Expired due to Term

Abstract

A technique is described for improving recognition accuracy when transcribing speech data drawn from a wide range of environments. In many situations the input contains data from a variety of sources in different environments, which can be grouped into classes such as clean speech, speech corrupted by noise (e.g., music), non-speech (e.g., pure music with no speech), telephone speech, and the identity of a speaker. The different classes of data are first identified automatically, and each class is then transcribed by a system built specifically for it. The invention also describes a segmentation algorithm that builds an acoustic model characterizing the data in each class and then uses a dynamic programming algorithm (the Viterbi algorithm) to automatically identify the segments that belong to each class. The acoustic models are built in a particular feature space, and the invention also describes different feature spaces for use with different classes.

148 Citations

37 Claims

1. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes; and
  
  transcribing each type identifier tagged segment using a specific system created for that type.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
- - 2. The method of claim 1, wherein the step of segmenting comprises:
    - identifying a number of classes that the acoustic input can be classified into that represent the most acoustically dissimilar classes possible.
  - 3. The method of claim 2, wherein the classes include non-speech, telephone speech, noise-corrupted speech, and clean speech.
  - 4. The method of claim 3, wherein the non-speech class includes music.
  - 5. The method of claim 3, wherein the noise-corrupted speech includes music.
  - 6. The method of claim 2, wherein the step of giving a type identifier tag comprises:
    - assuming that the input data is produced by a parallel combination of models, each model corresponding to one of the predetermined classes;
      
      the identifier tag assigned to a segment being the class identifier tag of the model that gives the segment the highest probability, subject to certain constraints.
  - 7. The method of claim 1, further comprising creating a system for transcribing data from each class.
  - 8. The method of claim 1, wherein the classes include the identity of a speaker.
  - 9. The method of claim 1, wherein one of the classes in the predetermined set of classes is a speaker identification class.
  - 10. The method of claim 9, wherein the speaker identification classes are not known a priori and are determined automatically based on updating classes corresponding to the speakers.
  - 11. The method of claim 9, wherein the speaker identification classes further comprise varying background environments, wherein speaker identification classes are determined in light of those varying environments.
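
The two steps of claim 1 can be illustrated with a minimal Python sketch: a list of already tagged segments is routed to a recognizer chosen by tag. The tag names, the `transcribe_segments` helper, and the stand-in recognizers are hypothetical, and returning an empty transcript for a class with no recognizer (for example pure music) is an assumption of this sketch, not something the claims state.

```python
from typing import Callable, Dict, List, Tuple

# A segment is (start_frame, end_frame, class_tag); the tag names are illustrative.
Segment = Tuple[int, int, str]


def transcribe_segments(
    segments: List[Segment],
    recognizers: Dict[str, Callable[[int, int], str]],
) -> List[Tuple[Segment, str]]:
    """Route each tagged segment to the recognizer registered for its class."""
    results = []
    for seg in segments:
        start, end, tag = seg
        recognizer = recognizers.get(tag)
        # Assumption: a class without a recognizer produces an empty transcript.
        text = recognizer(start, end) if recognizer else ""
        results.append((seg, text))
    return results


# Stand-in recognizers; real ones would decode the audio frames of the segment.
recognizers = {
    "clean": lambda s, e: f"<clean {s}-{e}>",
    "telephone": lambda s, e: f"<telephone {s}-{e}>",
}
segments = [(0, 120, "clean"), (120, 300, "non_speech"), (300, 450, "telephone")]
out = transcribe_segments(segments, recognizers)
```

Keeping the recognizers in a dictionary keyed by tag mirrors the idea of one system per class and makes adding a class a one-line change.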

12. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes, wherein it is assumed that the input data is produced by a parallel combination of models, each model corresponding to one of the predetermined classes, the identifier tag assigned to a segment being the class identifier tag of the model that gives the segment the highest probability, subject to certain constraints, wherein one of the constraints is a minimum duration on the segment, and wherein a number of classes that the acoustic input can be classified into are identified that represent the most acoustically dissimilar classes possible; and
  
  transcribing each type identifier tagged segment using a specific system created for that type.
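
One possible way to realize the segmentation of claim 12 is sketched below: each class contributes a per-frame log-likelihood, and a Viterbi search picks the best class sequence in which every run of one class lasts at least a minimum number of frames. Enforcing the minimum duration by expanding each class into a chain of duration states is a choice made for this sketch; the function name `segment_min_duration` and the toy scores are assumptions.

```python
from typing import List

NEG_INF = float("-inf")


def segment_min_duration(loglik: List[List[float]], min_dur: int) -> List[int]:
    """Best class index per frame, with every run at least min_dur frames long.

    loglik[t][c] is the log-likelihood of frame t under the model of class c.
    State (c, d) means "in class c for d + 1 frames", with d capped at min_dur - 1.
    """
    T, C, D = len(loglik), len(loglik[0]), min_dur
    if T < D:
        raise ValueError("input is shorter than the minimum duration")
    score = [[NEG_INF] * D for _ in range(C)]
    for c in range(C):
        score[c][0] = loglik[0][c]
    back = []  # back[t - 1][c][d] is the predecessor state of (c, d) at frame t
    for t in range(1, T):
        new = [[NEG_INF] * D for _ in range(C)]
        ptr = [[(c, 0)] * D for c in range(C)]
        for c in range(C):
            # Enter class c from another class that has met its minimum duration.
            for p in range(C):
                if p != c and score[p][D - 1] > new[c][0]:
                    new[c][0], ptr[c][0] = score[p][D - 1], (p, D - 1)
            # Stay in class c while still counting up to the minimum.
            for d in range(1, D):
                if score[c][d - 1] > new[c][d]:
                    new[c][d], ptr[c][d] = score[c][d - 1], (c, d - 1)
            # Stay in class c after the minimum has been reached.
            if score[c][D - 1] > new[c][D - 1]:
                new[c][D - 1], ptr[c][D - 1] = score[c][D - 1], (c, D - 1)
            for d in range(D):
                new[c][d] += loglik[t][c]
        score = new
        back.append(ptr)
    # The final run must also satisfy the minimum duration.
    c = max(range(C), key=lambda k: score[k][D - 1])
    state, path = (c, D - 1), [c]
    for ptr in reversed(back):
        state = ptr[state[0]][state[1]]
        path.append(state[0])
    return path[::-1]


# Toy scores: class 0 fits every frame except a one-frame blip at frame 3.
loglik = [[0, -1], [0, -1], [0, -1], [-1, 0], [0, -1], [0, -1], [0, -1]]
free = segment_min_duration(loglik, 1)
constrained = segment_min_duration(loglik, 3)
```

With a minimum duration of one frame the blip becomes its own segment; with a minimum of three frames it is absorbed into the surrounding class.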

13. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes using a binary tree hierarchy, wherein at each level of the tree, segments corresponding to one of the predetermined classifications are isolated, and wherein a number of classes that the acoustic input can be classified into are identified that represent the most acoustically dissimilar classes possible; and
  
  transcribing each type identifier tagged segment using a specific system created for that type.
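
The binary tree hierarchy of claim 13 can be pictured as a cascade of yes/no decisions, each level peeling off one class. The sketch below is a simplification: it applies hypothetical per-frame predicates rather than a model-based segmentation at each level, and the level order, feature names (`speech_score`, `band_ratio`, `snr_db`), and thresholds are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Frame = Dict[str, float]  # hypothetical per-frame features
Level = Tuple[str, Callable[[Frame], bool]]


def tree_classify(frame: Frame, levels: List[Level], default: str) -> str:
    """Walk down the tree; the first level whose test fires isolates the frame."""
    for tag, is_member in levels:
        if is_member(frame):
            return tag
    return default


def to_segments(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Merge runs of identical tags into (start, end_exclusive, tag) segments."""
    segments, start = [], 0
    for t in range(1, len(tags) + 1):
        if t == len(tags) or tags[t] != tags[start]:
            segments.append((start, t, tags[start]))
            start = t
    return segments


# Assumed order: non-speech first, then telephone speech, then noisy vs. clean.
levels = [
    ("non_speech", lambda f: f["speech_score"] < 0.5),
    ("telephone", lambda f: f["band_ratio"] > 0.9),
    ("noisy", lambda f: f["snr_db"] < 15.0),
]
frames = [
    {"speech_score": 0.9, "band_ratio": 0.5, "snr_db": 30.0},
    {"speech_score": 0.9, "band_ratio": 0.5, "snr_db": 30.0},
    {"speech_score": 0.2, "band_ratio": 0.5, "snr_db": 30.0},
    {"speech_score": 0.8, "band_ratio": 0.95, "snr_db": 30.0},
    {"speech_score": 0.8, "band_ratio": 0.6, "snr_db": 5.0},
]
tags = [tree_classify(f, levels, default="clean") for f in frames]
segs = to_segments(tags)
```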

14. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes, wherein segmentation is carried out using a Hidden Markov Model to model each class and the Viterbi algorithm to isolate and assign type identifier tags to the segments; and
  
  transcribing each type identifier tagged segment using a specific system created for that type.

15. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes, wherein it is assumed that the input data is produced by a parallel combination of models, each model corresponding to one of the predetermined classes, the identifier tag assigned to a segment being the class identifier tag of the model that gives the segment the highest probability, subject to certain constraints, wherein a number of classes that the acoustic input can be classified into are identified that represent the most acoustically dissimilar classes possible and, wherein the process of creating the models comprises identifying a feature space for the individual predetermined classes; and
  
  transcribing each type identifier tagged segment using a specific system created for that type.
- View Dependent Claims (16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26)
- - 16. The method of claim 15, wherein the feature space for the model for non-speech is created by:
    - taking a window of input speech every 10 milliseconds and computing a vector comprising the energy or log energy in logarithmically spaced frequency bands on that window, the feature being the variance across the dimensions of the vector.
  - 17. The method of claim 15, wherein the feature space for the model for non-speech is created by:
    - taking a window of input speech every 10 milliseconds and computing the log of the energy in logarithmically spaced frequency bands;
      
      computing the cepstra from this vector, the feature being the cepstra.
  - 18. The method of claim 15, wherein the feature space for the model for non-speech is created by:
    - taking a window of input speech every 10 milliseconds and computing the log of the energy in logarithmically spaced frequency bands;
      
      computing a linear discriminant to separate out non-speech and speech.
  - 19. The method of claim 15, wherein the feature space for the model for non-speech is created by:
    - taking a window of input speech every 10 milliseconds and computing the log of the energy in logarithmically spaced frequency bands;
      
      computing the variance across the dimensions of the vector, the cepstra of the vector and a linear discriminant;
      
      wherein the feature is the variance across the dimensions of the vector, the cepstra of the vector or a linear discriminant.
  - 20. The method of claim 15, wherein the feature space for the model for non-speech is created by:
    - taking a window of input speech every 10 milliseconds and computing the pitch;
      
      wherein the feature is the mean and the variance of the pitch across a plurality of consecutive windows.
  - 21. The method of claim 15, wherein the feature space for the model for telephone speech is created by:
    - taking a window of input speech every 10 milliseconds;
      
      computing a ratio of the energies in the telephone frequency band (300-3700 Hz) to the total energy of the signal.
  - 22. The method of claim 15, wherein the feature space for the model for telephone speech is created by:
    - taking a window of input speech every 10 milliseconds and computing the log of the energy in logarithmically spaced frequency bands;
      
      computing the cepstra from this vector, the feature being the cepstra.
  - 23. The method of claim 15, wherein the feature space for the model for telephone speech is created by:
    - taking a window of input speech every 10 milliseconds and computing the log of the energy in logarithmically spaced frequency bands;
      
      computing a linear discriminant to separate telephone speech and non-telephone speech.
  - 24. The method of claim 15, wherein the feature space for the model for clean speech is created by:
    - taking a window of input speech every 10 milliseconds;
      
      computing the energy in the window, wherein the feature is related to the variation of energy across a plurality of consecutive windows.
  - 25. The method of claim 15, wherein the feature space for the model for clean speech is created by:
    - taking a window of input speech every 10 milliseconds and computing the log of the energy in logarithmically spaced frequency bands;
      
      computing the cepstra from this vector, the feature being the cepstra.
  - 26. The method of claim 15, wherein the feature space for the model for clean speech is created by:
    - taking a window of input speech every 10 milliseconds and computing the log of the energy in logarithmically spaced frequency bands;
      
      computing a linear discriminant to separate out clean speech and noisy speech.
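
Two of the features named in claims 16 and 21 can be sketched in plain Python: the variance across log energies in logarithmically spaced bands, and the ratio of energy in the 300-3700 Hz telephone band to total energy. The 16 kHz sampling rate, the 160-sample (10 ms) window, the eight bands starting at 100 Hz, and the small floor inside the logarithm are assumptions of this sketch; a direct DFT is used only to avoid external dependencies.

```python
import cmath
import math
from typing import List


def power_spectrum(window: List[float]) -> List[float]:
    """One-sided power spectrum of a real window via a direct DFT."""
    n = len(window)
    spec = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(window))
        # Interior bins stand for a conjugate pair, so they count twice.
        weight = 1.0 if k == 0 or 2 * k == n else 2.0
        spec.append(weight * abs(coeff) ** 2)
    return spec


def telephone_band_ratio(window, sample_rate, lo=300.0, hi=3700.0) -> float:
    """Energy inside [lo, hi] Hz divided by the total energy of the window."""
    spec, n = power_spectrum(window), len(window)
    total = sum(spec)
    if total == 0:
        return 0.0
    return sum(p for k, p in enumerate(spec) if lo <= k * sample_rate / n <= hi) / total


def log_band_variance(window, sample_rate, n_bands=8, f_min=100.0) -> float:
    """Variance across log energies in logarithmically spaced bands."""
    spec, n = power_spectrum(window), len(window)
    f_max = sample_rate / 2
    edges = [f_min * (f_max / f_min) ** (b / n_bands) for b in range(n_bands + 1)]
    logs = []
    for b in range(n_bands):
        energy = sum(p for k, p in enumerate(spec)
                     if edges[b] <= k * sample_rate / n < edges[b + 1])
        logs.append(math.log(energy + 1e-10))  # assumed floor to avoid log(0)
    mean = sum(logs) / len(logs)
    return sum((v - mean) ** 2 for v in logs) / len(logs)


RATE, N = 16000, 160  # assumed: 16 kHz audio, 10 ms window
tone_1k = [math.sin(2 * math.pi * 1000 * i / RATE) for i in range(N)]
tone_6k = [math.sin(2 * math.pi * 6000 * i / RATE) for i in range(N)]
impulse = [1.0] + [0.0] * (N - 1)  # flat spectrum
```

A tone inside the telephone band gives a ratio near one and a tone above it gives a ratio near zero, while a single tone yields a much larger band variance than a spectrally flat impulse.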

27. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes, wherein the classes include non-speech, telephone speech, noise-corrupted speech, and clean speech, and wherein clean speech segments are further segmented into smaller segments that can be assigned a speaker identifier tag; and
  
  transcribing each type identifier tagged segment using a specific system created for that type.
- View Dependent Claims (28, 29, 30, 31, 32)
- - 28. The method of claim 27, further comprising providing a script to allow supervised speaker identification and thereby improve the speaker identifier segmentation.
  - 29. The method of claim 28, wherein the models for the training speakers are generated by combining sub-models that correspond to each phonetic or sub-phonetic class.
  - 30. The method of claim 28, wherein first the clear speech is Viterbi aligned against the given script, using speaker independent models, to identify regions of silence and to tag every feature vector between two consecutive silence regions with the identifier tag of a phonetic or sub-phonetic class.
  - 31. The method of claim 30, wherein a speaker identifier tag is assigned to a speech segment between two consecutive silences, where the likelihood of each feature vector is computed given each speaker model for the sub-phonetic class that was assigned to that feature vector.
  - 32. The method of claim 27, wherein the procedure for segmenting is carried out using a parallel technique using a word transcription for the clean speech.
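
The speaker tagging described in claims 30 and 31 can be sketched as follows: every feature vector in a stretch between two silences already carries a sub-phonetic class tag, each speaker has one model per class, and the speaker whose models give the highest total log-likelihood wins. Scalar features, single Gaussians, and the speaker and class names are assumptions chosen to keep the sketch short.

```python
import math
from typing import Dict, List, Tuple

Gaussian = Tuple[float, float]  # (mean, variance); scalar features for brevity


def log_gauss(x: float, mean: float, var: float) -> float:
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def tag_speaker(
    features: List[float],
    class_tags: List[str],
    models: Dict[str, Dict[str, Gaussian]],
) -> str:
    """Pick the speaker whose per-class models best explain the segment."""
    best_speaker, best_score = "", float("-inf")
    for speaker, by_class in models.items():
        # Each vector is scored only against the model of its assigned class.
        score = sum(log_gauss(x, *by_class[tag]) for x, tag in zip(features, class_tags))
        if score > best_score:
            best_speaker, best_score = speaker, score
    return best_speaker


# Hypothetical speaker models: one Gaussian per sub-phonetic class.
models = {
    "spk_a": {"AA": (0.0, 1.0), "S": (5.0, 1.0)},
    "spk_b": {"AA": (2.0, 1.0), "S": (3.0, 1.0)},
}
first = tag_speaker([0.1, 4.8, -0.2], ["AA", "S", "AA"], models)
second = tag_speaker([2.1, 3.2, 1.9], ["AA", "S", "AA"], models)
```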

33. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes, wherein the classes include non-speech, telephone speech, noise-corrupted speech, and clean speech; and
  
  transcribing each type identifier tagged segment using a specific speech recognition system created for that type, wherein setting up a system for transcribing telephone speech comprises transforming the training data from which the speech recognition system was made so that it matches the acoustic environment of telephone speech, wherein the transformation comprises band limiting the training data to telephone bandwidths.
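
The band limiting of training data in claim 33 can be illustrated with a minimal sketch. It zeroes every DFT bin outside an assumed telephone band and transforms back; this brick-wall mask is a stand-in chosen for brevity, not the filter used in the patent, and the 300-3700 Hz limits are borrowed from claim 21 as an assumption. The direct DFT is quadratic in the signal length, so the sketch is only suitable for short signals.

```python
import cmath
import math
from typing import List


def band_limit(signal: List[float], sample_rate: float,
               lo: float = 300.0, hi: float = 3700.0) -> List[float]:
    """Keep only frequency content in [lo, hi] Hz using a DFT mask."""
    n = len(signal)
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))
        for k in range(n)
    ]
    for k in range(n):
        # Bins k and n - k carry the same physical frequency for a real signal.
        freq = min(k, n - k) * sample_rate / n
        if not lo <= freq <= hi:
            spectrum[k] = 0
    # Inverse DFT; the result is real up to rounding error.
    return [
        (sum(c * cmath.exp(2j * math.pi * k * i / n) for k, c in enumerate(spectrum)) / n).real
        for i in range(n)
    ]


RATE, N = 16000, 160  # assumed: 16 kHz wideband training audio
low = [math.sin(2 * math.pi * 1000 * i / RATE) for i in range(N)]
high = [math.sin(2 * math.pi * 6000 * i / RATE) for i in range(N)]
wideband = [a + b for a, b in zip(low, high)]
narrowband = band_limit(wideband, RATE)
```

In this toy case the 6 kHz component is removed and the 1 kHz component passes through unchanged.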

34. A method for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes, wherein the classes include non-speech, telephone speech, noise-corrupted speech, and clean speech; and
  
  transcribing each type identifier tagged segment using a specific speech recognition system created for that type, wherein setting up a system for transcribing noise-corrupted speech comprises transforming the training data from which the speech recognition system was made so that it matches the acoustic environment of noise-corrupted speech, wherein the transformation comprises adding pure noise to the clean speech in the training data.
- View Dependent Claims (35)
- - 35. The method of claim 34, wherein the noise includes music.
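
Adding pure noise (for example music) to clean training speech, as in claims 34 and 35, can be sketched as below. The claims do not specify a mixing level, so the target signal-to-noise ratio, the looping of the noise to cover the speech, and the `mix_at_snr` name are assumptions of this sketch.

```python
import math
from typing import List


def mix_at_snr(speech: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Add looped noise to speech, scaled to reach the requested SNR in dB."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise so it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in tiled) / len(tiled)
    if p_noise == 0:
        return list(speech)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * v for s, v in zip(speech, tiled)]


speech = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(400)]
noise = [1.0, -1.0, 0.5, -0.5]  # stand-in for a music or noise recording
noisy = mix_at_snr(speech, noise, snr_db=10.0)

# Recover the achieved SNR from the added component.
residual = [n - s for n, s in zip(noisy, speech)]
achieved = 10 * math.log10(sum(x * x for x in speech) / sum(x * x for x in residual))
```

Training a recognizer on data mixed this way is one way to match the acoustic environment of noise-corrupted test segments.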

36. A system for transcribing a segment of data that includes speech in one or more environments and non-speech data, comprising:
- means for inputting the data to a segmenter and producing a series of segments, each segment being given a type identifier tag selected from a predetermined set of classes; and
  
  means for transcribing each type identifier tagged segment using a specific system created for that type.

37. Apparatus for transcribing a segment of data that includes speech in one or more environments and non-speech data, the apparatus comprising:
- a segmenter which produces a series of segments from the data, each segment being given a type identifier tag selected from a predetermined set of classes; and
  
  a plurality of speech recognizers coupled to the segmenter which are specifically created for each type and which respectively transcribe segments having corresponding type identifier tags.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Gopinath, Ramesh Ambat; Bahl, Lalit Rai; Polymenakos, Lazaros; Padmanabhan, Mukund; Gopalakrishnan, Ponani; Maes, Stephane Herman
Primary Examiner(s): Dorvil, Richemond
Application Number: US08/595,722
Time in Patent Office: 1,572 Days
Field of Search: 395/2.44, 395/2.52, 395/2.53, 395/2.86, 395/2.4, 395/2.79, 395/2.6, 395/2.64, 704/235, 704/236, 704/243, 704/244, 704/245, 704/251, 704/252, 704/255, 704/256, 704/257, 704/277, 704/276, 704/270, 704/278, 704/200, 704/242, 704/241
US Class Current: 704/256.4
CPC Class Codes: G10L 15/20 Speech recognition techniqu...
