Automatic indexing and aligning of audio and text using speech recognition
Abstract
A method of automatically aligning a written transcript with speech in video and audio clips. The disclosed technique uses an automatic speech recognizer as its basic component. The recognizer decodes speech (recorded on a tape) and produces a file of decoded text. This decoded text is then matched with the original written transcript by identifying similar words or clusters of words. The result of this matching is an alignment of the speech with the original transcript. The method can be used (a) to create indexing of video clips, (b) for "teleprompting" (i.e. showing the next portion of text when someone is reading from a television screen), or (c) to enhance editing of a text that was dictated to a stenographer or recorded on a tape for its subsequent textual reproduction by a typist.
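The matching step described in the abstract can be illustrated with a minimal sketch. It assumes the recognizer output is available as (word, start time) pairs, and it uses Python's standard `difflib.SequenceMatcher` as a stand-in for the patent's own matching procedure, which the abstract does not spell out. The function name `align_by_matching_blocks` and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def align_by_matching_blocks(transcript_words, decoded):
    """Map transcript word positions to start times of matching decoded words.

    transcript_words: list of words from the written transcript.
    decoded: list of (word, start_seconds) pairs from a speech recognizer
             (assumed format, not taken from the patent).
    Returns a dict {transcript_position: start_seconds}.
    """
    decoded_words = [word for word, _ in decoded]
    matcher = SequenceMatcher(a=transcript_words, b=decoded_words, autojunk=False)
    index = {}
    # Each matching block is a run ("cluster") of identical words in both texts.
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            index[block.a + k] = decoded[block.b + k][1]
    return index


transcript = "the quick brown fox jumps over the lazy dog".split()
decoded = [
    ("the", 0.0), ("quick", 0.4), ("brown", 0.8), ("box", 1.1),  # "fox" misrecognized
    ("jumps", 1.5), ("over", 1.9), ("the", 2.2), ("lazy", 2.5), ("dog", 2.9),
]
index = align_by_matching_blocks(transcript, decoded)
```

Transcript words that the recognizer got wrong (here "fox") simply receive no time stamp; a real system would presumably interpolate from the aligned neighbours.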
8 Claims
1. An apparatus for indexing an audio recording comprising:
an acoustic recorder for storing an ordered series of acoustic information signal units representing sounds generated from spoken words, said acoustic recorder having a plurality of recording locations, each recording location storing at least one acoustic information signal unit;
a timer connected to said acoustic recorder for time stamping said acoustic information signal units;
a speech recognizer connected to said acoustic recorder for generating an ordered series of recognized words having a high conditional probability of occurrence given the occurrence of the sounds represented by the acoustic information signal units from said acoustic recorder, each recognized word corresponding to at least one acoustic information signal unit and comprising a series of one or more characters, each recognized word having a context of at least one preceding or following recognized word;
a time alignment device connected to said speech recognizer and receiving time stamps of said acoustic information signal units for aligning said acoustic information signal units according to respective time stamps of said acoustic information signal units;
a text storage device for storing a transcript of text of the spoken words corresponding to ordered series of acoustic information signal units stored on said acoustic recorder;
mapping means connected to said text storage device for determining a size of an acoustic information signal unit to be passed to said speech recognizer from said acoustic recorder, said mapping means generating an ordered series of index words, said ordered series of index words comprising a representation of at least some of the spoken words represented by the acoustic information signal units, each index word having a context of at least one preceding or following index word and comprising a series of one or more characters;
a segmenter controlled by said mapping means for controlling playback of acoustic information signal units to said speech recognizer; and
alignment means connected to said acoustic recorder and to said mapping means for comparing the ordered series of recognized words with the ordered series of index words to pair recognized words and index words which are the same word and which have matching contexts, a recognized word being the same as an index word when both words comprise the same series of characters, a context of a target recognized word comprises the number of other recognized words preceding and following the target recognized word in the ordered series of recognized words, a context of a target index word comprises the number of other index words preceding and following the target index word in the ordered series of index words, and the context of a recognized word matches the context of an index word if the context of the target recognized word is within a selected threshold value of the context of the target index word, said alignment means tagging each paired index word with the recording location of the acoustic information signal unit corresponding to the recognized word paired with the index word.
Dependent Claims (2, 3, 4)
5. A method of indexing an audio recording comprising:
storing in an acoustic recorder an ordered series of acoustic information signal units representing sounds generated from spoken words, said acoustic recorder having a plurality of recording locations, each recording location storing at least one acoustic information signal unit;
time stamping said acoustic information signal units;
generating by a speech recognizer an ordered series of recognized words having a high conditional probability of occurrence given the occurrence of the sounds represented by the acoustic information signal units from said acoustic recorder, each recognized word corresponding to at least one acoustic information signal unit and comprising a series of one or more characters, each recognized word having a context of at least one preceding or following recognized word;
time aligning said acoustic information signal units according to respective time stamps of said acoustic information signal units;
storing a transcript of text of the spoken words corresponding to ordered series of acoustic information signal units on said acoustic recorder;
mapping the transcript of text to determine a size of an acoustic information signal unit to be passed to said speech recognizer from said acoustic recorder;
generating an ordered series of index words, said ordered series of index words comprising a representation of at least some of the spoken words represented by the acoustic information signal units, each index word having a context of at least one preceding or following index word and comprising a series of one or more characters;
segmenting said ordered series of index words to control playback of acoustic information signal units by said acoustic recorder;
comparing the ordered series of recognized words with the ordered series of index words to pair recognized words and index words which are the same word and which have matching contexts, a recognized word being the same as an index word when both words comprise the same series of characters, a context of a target recognized word comprises the number of other recognized words preceding and following the target recognized word in the ordered series of recognized words, a context of a target index word comprises the number of other index words preceding and following the target index word in the ordered series of index words, and the context of a recognized word matches the context of an index word if the context of the target recognized word is within a selected threshold value of the context of the target index word; and
tagging each paired index word with the recording location of the acoustic information signal unit corresponding to the recognized word paired with the index word.
Dependent Claims (6, 7, 8)
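The comparing and tagging steps recited in the claims can be sketched as follows. This is one possible reading, not the patented implementation: it assumes a word's context is the pair (number of words before it, number of words after it) in its series, and that two contexts match when both counts differ by at most the selected threshold. The names `RecognizedWord` and `tag_index_words`, the greedy closest-match rule, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecognizedWord:
    text: str       # series of characters produced by the recognizer
    location: int   # recording location of the corresponding signal unit


def tag_index_words(index_words: List[str],
                    recognized: List[RecognizedWord],
                    threshold: int) -> List[Optional[int]]:
    """Return, for each index word, the recording location of its paired
    recognized word, or None when no pairing satisfies the context rule."""
    tags: List[Optional[int]] = [None] * len(index_words)
    used = set()  # each recognized word is paired at most once (assumption)
    n_index, n_rec = len(index_words), len(recognized)
    for i, word in enumerate(index_words):
        best = None
        for j, rec in enumerate(recognized):
            if j in used or rec.text != word:
                continue  # not the same series of characters
            before_gap = abs(i - j)  # difference in preceding-word counts
            after_gap = abs((n_index - 1 - i) - (n_rec - 1 - j))  # following-word counts
            if before_gap <= threshold and after_gap <= threshold:
                if best is None or before_gap < abs(i - best):
                    best = j  # prefer the candidate with the closest context
        if best is not None:
            used.add(best)
            tags[i] = recognized[best].location
    return tags


index_words = ["we", "hold", "these", "truths", "to", "be", "self", "evident"]
recognized = [
    RecognizedWord("we", 0), RecognizedWord("hold", 10),
    RecognizedWord("the", 20),                            # "these" misrecognized
    RecognizedWord("truths", 30), RecognizedWord("to", 40),
    RecognizedWord("to", 45),                             # spurious repeated word
    RecognizedWord("be", 50), RecognizedWord("self", 60),
    RecognizedWord("evident", 70),
]
tags = tag_index_words(index_words, recognized, threshold=1)
```

With a threshold of 1, the misrecognized word is left untagged and the spurious repetition is skipped, while words shifted by one position after the insertion are still paired.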
Specification