Efficient method for producing off-line closed captions
Abstract
Disclosed is a five-step process for producing closed captions for a television program, subtitles for a movie or other uses for time-aligned transcripts. An operator transcribes the audio track while listening to the recorded material. The system helps him/her to work efficiently and produce precisely aligned captions. The first step consists of identifying the portions of the input audio that contain spoken text. Only the spoken parts are further processed by the invention system. The other parts may be used to generate non-spoken captions. The second step controls the rate of speech depending on how fast the operator types. While the operator types, the third module records the time the words were typed in. This provides a rough time alignment for the transcribed text. Then the fourth module realigns precisely the transcribed text on the audio track. A final module segments the transcribed text into captions, based on acoustic clues and natural language constraints. Further, the speech rate-control component of the system may be used in other systems where transcripts are required to be generated from spoken audio.
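The third step of the abstract (recording when words are typed to obtain a rough alignment) can be illustrated with a minimal sketch. It is not the patented implementation. The names `KeystrokeTimeMarker`, `TimeMarkedWord` and `on_key` are hypothetical, and the choice of a whitespace keystroke as the trigger event is an assumption made for illustration; the caller is assumed to supply the current audio playback position with each keystroke.

```python
from dataclasses import dataclass, field


@dataclass
class TimeMarkedWord:
    text: str
    audio_time: float  # seconds into the audio track when the word was completed


@dataclass
class KeystrokeTimeMarker:
    """Collects typed characters and stamps each finished word with an audio time."""
    words: list = field(default_factory=list)
    _buffer: str = ""

    def on_key(self, char: str, audio_time: float) -> None:
        # Assumption: a whitespace keystroke is the trigger event that closes a word.
        if char.isspace():
            if self._buffer:
                self.words.append(TimeMarkedWord(self._buffer, audio_time))
                self._buffer = ""
        else:
            self._buffer += char
```

The resulting time marks are only rough, since they reflect typing time rather than speaking time; that is why the abstract describes a separate realignment step afterwards.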
15 Claims
1. A method for producing time-aligned transcripts of an audio track, comprising the steps of:
in response to an input audio stream, determining spoken parts of the audio stream;
transcribing the determined spoken parts of the audio stream by using an audio rate control routine, said transcribing producing transcription text;
adding time marks to the transcription text by detecting trigger events based on time of event keystrokes by an operator performing the transcribing; and
re-aligning precisely the transcription text on the input audio stream.
detecting pauses acoustically;
determining from the detected pauses potential end of sentences; and
accounting for natural language constraints in the potential end of sentences to determine language legitimate end of sentences, said segmenting being based on the determined legitimate end of sentences.
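The three segmenting steps above (acoustic pauses, candidate sentence ends, language constraints) could be sketched as follows. This is an illustrative approximation only: the function name `segment`, the 0.5 second pause threshold, and the `NON_FINAL_WORDS` list are all assumptions, and the word list is a crude stand-in for whatever natural language constraints the patent actually applies.

```python
# Hypothetical constraint: a sentence is assumed not to end on one of these words.
NON_FINAL_WORDS = {"a", "an", "the", "of", "and", "or", "to", "in"}


def segment(words, min_pause=0.5):
    """Split time-aligned words into segments.

    words: list of (text, start_sec, end_sec) tuples in time order.
    """
    segments, current = [], []
    for i, (text, start, end) in enumerate(words):
        current.append(text)
        is_last = i == len(words) - 1
        # Acoustic cue: silence between this word and the next one.
        pause = 0.0 if is_last else words[i + 1][1] - end
        candidate = is_last or pause >= min_pause
        # Language constraint: reject a boundary right after a function word.
        legitimate = text.lower().strip(".,!?") not in NON_FINAL_WORDS
        if candidate and (legitimate or is_last):
            segments.append(" ".join(current))
            current = []
    return segments
```

A pause after "the" is ignored in this sketch because a sentence is assumed not to end there, while a pause after a content word closes the segment.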
5. A method as claimed in claim 1 wherein the audio rate control routine includes:
counting speech units in the spoken parts and producing a count of speech units for a given unit of time;
estimating a speech rate from the count of speech units; and
using the estimated speech rate, controlling playback of the spoken parts of the audio stream to match a target rate.
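The rate control routine of claim 5 can be sketched under a few assumptions. The speech units are taken to be already detected and given as timestamps, the rate is measured over a fixed window, and the playback factor is clamped to a range (0.5x to 2x) chosen here only for illustration. The function names and the clamping bounds are hypothetical and do not come from the patent.

```python
def estimate_speech_rate(unit_times, window_start, window_end):
    """Return speech units per second within [window_start, window_end)."""
    duration = window_end - window_start
    if duration <= 0:
        raise ValueError("window must have positive duration")
    # Count the speech units whose timestamp falls inside the window.
    count = sum(1 for t in unit_times if window_start <= t < window_end)
    return count / duration


def playback_factor(speech_rate, target_rate, min_factor=0.5, max_factor=2.0):
    """Return the playback speed multiplier that brings speech_rate to target_rate."""
    if speech_rate <= 0:
        # Assumption: with no speech in the window, play as fast as allowed.
        return max_factor
    return max(min_factor, min(max_factor, target_rate / speech_rate))
```

For example, five units in a two-second window give 2.5 units per second; with a target of 1.25 units per second the sketch slows playback to half speed.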
6. A method as claimed in claim 5 wherein the target rate is about equal to rate of operator operating a keyboard to effect the transcribing.
7. A method as claimed in claim 1 wherein the steps of determining spoken parts, adding time marks and realigning are performed in a digital processor; and
the audio rate control routine is performed by a digital processor.
8. A method as claimed in claim 1 wherein the realigning produces time marks correlating character strings of the transcription text to corresponding parts of the input audio stream; and
further comprising the step of using the produced time marks for indexing respective position in time of the audio stream of various character strings in the transcription text, the indexing enabling a search on a desired character string to produce location in the audio stream where the corresponding audio part for the desired character string exists.
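The indexing step of claim 8 can be pictured as an inverted index from character strings to audio positions. The sketch below is only one possible reading: it assumes the realigned transcript is available as (word, start time) pairs, normalizes case and trailing punctuation, and uses the hypothetical names `build_index` and `search`.

```python
from collections import defaultdict


def build_index(time_marked_words):
    """Map each normalized word to the list of audio times (seconds) where it occurs.

    time_marked_words: iterable of (text, start_sec) pairs in time order.
    """
    index = defaultdict(list)
    for text, start_sec in time_marked_words:
        # Assumption: matching ignores case and surrounding punctuation.
        index[text.lower().strip(".,!?;:")].append(start_sec)
    return dict(index)


def search(index, query):
    """Return every audio position at which the query word was spoken."""
    return index.get(query.lower(), [])
```

Searching for a word then returns the positions in the audio stream to which a player could seek.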
9. Apparatus for producing time aligned transcripts of an audio track comprising:
an audio classifier, in response to an input audio stream, the audio classifier determining spoken parts of the audio stream;
audio rate controller coupled to receive the determined spoken parts of the audio stream from the audio classifier, the audio rate controller controlling rate of playback of the determined spoken parts of the audio stream to a transcriber transcribing the determined spoken parts and producing transcription text;
a time event tracker for adding time marks to the transcription text by detecting trigger events based on time of event keystrokes by the transcriber performing the transcribing;
a realigner responsive to output by the time event tracker, for precisely realigning the transcription text on the input audio stream; and
a segmenter coupled to receive from the realigner the realigned transcription text, the segmenter segmenting the realigned transcription text to form closed captions.
counts speech units in the spoken parts and produces a count of speech units for a given unit of time;
estimates a speech rate from the counted speech unit; and
using the estimated speech rate controls playback of the spoken parts of the audio stream to match a target rate.
12. Apparatus as claimed in claim 11 wherein the target rate is about equal to rate of transcriber operating a keyboard to effect the transcribing.
13. Apparatus as claimed in claim 9 wherein the audio classifier, audio rate controller, time event tracker, realigner and segmenter are executed in a digital processor.
14. Apparatus as claimed in claim 9 wherein the realigner produces time marks correlating character strings of the transcription text to corresponding parts of the input audio stream; and
the apparatus further comprises an indexer, said indexer using the produced time marks to index respective position in time of the audio stream of various character strings in the transcription text, such that in response to a search on a desired character string, the indexer produces location in the audio stream where the corresponding audio part for the desired character string exists.
15. Apparatus as claimed in claim 9 wherein the segmenter further:
detects pauses acoustically;
determines from the detected pauses potential ends of sentences; and
accounts for natural language constraints in the potential ends of sentences to determine legitimate end of sentences, said segmenter segmenting the realigned transcription text according to the determined legitimate end of sentences, to form closed captions.
Specification