Aligning a transcript to audio data
First Claim
1. A method, comprising:
receiving audio data and a textual transcript of the audio data to be aligned with the audio data;
generating, from the textual transcript, a language model that represents a set of particular substrings of the textual transcript, the language model comprising allowed states of the language model and one or more transitions that link the allowed states;
receiving, from a speech recognizer, recognized language elements from the received audio data and times at which the recognized language elements occur in the audio data;
comparing the recognized language elements from the audio data to substrings represented by the language model to identify times at which particular ones of the substrings occur in the audio data;
aligning a portion of the textual transcript with a portion of the audio data using the identified times; and
outputting the aligned portion of the textual transcript.
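The comparing and aligning steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: it swaps in the standard library's `difflib.SequenceMatcher` to find runs of recognized words that also occur as contiguous substrings of the transcript, then copies the recognizer's times onto the matching transcript words. The function name `align_transcript`, the `(word, start_time)` recognizer output format, and the `min_run` threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def align_transcript(transcript_words, recognized, min_run=2):
    """Map transcript word indices to start times taken from recognizer output.

    transcript_words: list of words from the textual transcript.
    recognized: list of (word, start_time_seconds) pairs (hypothetical format).
    min_run: shortest matching run trusted as an anchor (assumed heuristic).
    """
    hyp_words = [word for word, _ in recognized]
    matcher = SequenceMatcher(None, transcript_words, hyp_words, autojunk=False)

    times = {}
    for block in matcher.get_matching_blocks():
        if block.size < min_run:
            continue  # very short matches are too ambiguous to anchor on
        for k in range(block.size):
            # Transcript word block.a + k was heard as recognized word block.b + k.
            times[block.a + k] = recognized[block.b + k][1]
    return times


transcript = "we choose to go to the moon".split()
# The recognizer misheard "choose" as "chose"; the other words are correct.
recognized = [("we", 0.0), ("chose", 0.4), ("to", 0.9), ("go", 1.1),
              ("to", 1.3), ("the", 1.5), ("moon", 1.7)]
aligned = align_transcript(transcript, recognized)
for index in sorted(aligned):
    print(f"{aligned[index]:.1f}s  {transcript[index]}")
```

Only the five-word run "to go to the moon" is long enough to serve as an anchor in this example; the isolated match on "we" is skipped, which leaves the misrecognized region unaligned rather than guessing at its times.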
Abstract
The subject matter of this specification can be implemented in, among other things, a computer-implemented method including receiving audio data and a transcript of the audio data. The method further includes generating a language model including a factor automaton that includes automaton states and arcs, each of the automaton arcs corresponding to a language element from the transcript. The method further includes receiving language elements recognized from the received audio data and times at which each of the recognized language elements occur in the audio data. The method further includes comparing the recognized language elements to one or more of the language elements from the factor automaton to identify times at which the one or more of the language elements from the transcript occur in the audio data. The method further includes aligning a portion of the transcript with a portion of the audio data using the identified times.
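To make the idea of a factor automaton concrete, the following Python sketch builds a simplified, nondeterministic acceptor for every contiguous word sequence (factor) of a transcript. It assumes one state per word boundary, one arc per transcript word, and that every state may serve as both a start and a final state. A production system would more likely use a minimized deterministic automaton; the class name `FactorAcceptor` and its methods are hypothetical and are not taken from the specification.

```python
class FactorAcceptor:
    """Accepts exactly the contiguous word sequences of one transcript."""

    def __init__(self, words):
        self.words = list(words)
        # State i sits just before word i; the arc (i, word) leads to state i + 1.
        self.arcs = {(i, w): i + 1 for i, w in enumerate(self.words)}

    def match_starts(self, query):
        """Return transcript indices at which `query` occurs as a factor."""
        # Every state is a start state, so all of them are active initially.
        active = [(s, s) for s in range(len(self.words) + 1)]
        for word in query:
            # Keep only the paths that can follow an arc labeled with this word.
            active = [(start, self.arcs[(cur, word)])
                      for start, cur in active
                      if (cur, word) in self.arcs]
        return [start for start, _ in active]

    def accepts(self, query):
        """True if `query` is a contiguous substring of the transcript."""
        return bool(self.match_starts(query))


acceptor = FactorAcceptor("the cat sat on the mat".split())
print(acceptor.match_starts(["the"]))         # [0, 4]
print(acceptor.match_starts(["the", "mat"]))  # [4]
print(acceptor.accepts(["cat", "on"]))        # False
```

Because the acceptor rejects any word sequence that does not occur in the transcript, a recognizer constrained by such a model can only hypothesize transcript substrings, which is what makes the recognized times usable for alignment.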
26 Citations
24 Claims
1. A method, comprising:
receiving audio data and a textual transcript of the audio data to be aligned with the audio data;
generating, from the textual transcript, a language model that represents a set of particular substrings of the textual transcript, the language model comprising allowed states of the language model and one or more transitions that link the allowed states;
receiving, from a speech recognizer, recognized language elements from the received audio data and times at which the recognized language elements occur in the audio data;
comparing the recognized language elements from the audio data to substrings represented by the language model to identify times at which particular ones of the substrings occur in the audio data;
aligning a portion of the textual transcript with a portion of the audio data using the identified times; and
outputting the aligned portion of the textual transcript.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
17. A computer-implemented method comprising:
receiving audio data and a textual transcript of the audio data to be aligned with the audio data;
generating, from the textual transcript, a language model that represents a set of particular substrings of the textual transcript, the language model comprising allowed states of the language model and one or more transitions that link the allowed states;
generating a speech hypothesis of one or more language elements included in the audio using a speech recognizer, wherein the speech hypothesis includes time stamps associated with an occurrence of each of the one or more language elements;
comparing the one or more language elements of the speech hypothesis to the substrings generated by the language model to identify times at which particular ones of the substrings occur in the audio data; and
associating at least a portion of the textual transcript with the time stamps based on the comparison.
Dependent claims: 18, 19, 20, 21, 22, 23
24. A computer-implemented system, comprising:
one or more computers having an interface that receives audio data and a textual transcript of the audio data to be aligned with the audio data;
a model builder module that generates, from the textual transcript, a language model that represents a set of particular substrings of the textual transcript, the language model comprising allowed states of the language model and one or more transitions that link the allowed states;
a storage device that stores the set of substrings of the textual transcript;
a speech recognizer that recognizes language elements from the received audio data and determines times at which particular ones of the recognized language elements occur in the audio data; and
means for comparing the recognized language elements to substrings to identify times at which particular ones of the substrings occur in the audio data and aligning a portion of the textual transcript with a portion of the audio data using the identified times, and outputting the aligned portion of the textual transcript.
Specification