Methods and apparatus for automatically synchronizing electronic audio files with electronic text files
Abstract
Automated methods and apparatus for synchronizing audio and text data, e.g., in the form of electronic files, representing audio and text expressions of the same work or information are described. A statistical language model is generated from the text data. A speech recognition operation is then performed on the audio data using the generated language model and a speaker independent acoustic model. Silence is modeled as a word which can be recognized. The speech recognition operation produces a time indexed set of recognized words, some of which may be silence. The recognized words are globally aligned with the words in the text data. Recognized periods of silence which correspond to expected periods of silence, and which are adjoined by one or more correctly recognized words, are identified as points where the text and audio files should be synchronized, e.g., by the insertion of bi-directional pointers. In one embodiment, for a text location to be identified for synchronization purposes, both words which bracket, e.g., precede and follow, the recognized silence must be correctly identified. Pointers corresponding to identified locations of silence to be used for synchronization purposes are inserted into the text and/or audio files at the identified locations. Audio time stamps obtained from the speech recognition operation may be used as the bi-directional pointers. Synchronized text and audio data may be output in a variety of file formats.
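The pipeline the abstract describes can be sketched in a few lines. This is a hedged illustration, not the patented implementation: the recognizer output format (a list of `(token, start_time)` pairs with a `"<sil>"` token for recognized silence) and the use of `difflib` for the global alignment are assumptions made for this example.

```python
import difflib

# Recognizer output as (token, start_time_seconds); "<sil>" marks a
# recognized period of silence. This layout is an assumption for the
# sketch, not the patent's actual data format.
recognized = [("once", 0.0), ("upon", 0.4), ("<sil>", 0.8),
              ("a", 1.3), ("time", 1.5)]
source_text = ["once", "upon", "a", "time"]

def silence_anchors(recognized, source_text):
    """Return (source_word_index, audio_time) synchronization points:
    recognized silences whose neighboring words align correctly with
    the source text."""
    rec_words = [tok for tok, _ in recognized if tok != "<sil>"]
    # Global alignment of recognized words against the source text.
    matcher = difflib.SequenceMatcher(a=rec_words, b=source_text,
                                      autojunk=False)
    correct = {}  # recognized-word position -> aligned source position
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            correct[block.a + k] = block.b + k
    anchors = []
    word_pos = -1  # index among non-silence recognized tokens
    for tok, t in recognized:
        if tok != "<sil>":
            word_pos += 1
            continue
        # Claim 1 requires at least one correctly recognized adjoining
        # word; this sketch requires both the preceding and following
        # word, as in the stricter embodiment.
        if word_pos in correct and (word_pos + 1) in correct:
            anchors.append((correct[word_pos + 1], t))
    return anchors

print(silence_anchors(recognized, source_text))  # [(2, 0.8)]
```

The returned pair says a pointer belongs before source word index 2 ("a"), carrying the audio time 0.8 s of the recognized silence, which is exactly the bi-directional pointer role the abstract describes.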
36 Claims
1. A method of processing audio data and text data comprising:
operating a speech recognizer device to perform a speech recognition operation on the audio data to produce a set of recognized text;
globally aligning the recognized text with words included in the text data;
identifying a first location in the recognized text where silence was recognized and where at least one correctly recognized word adjoins the recognized silence; and
inserting into the text data, at the location aligned with said first identified location, a pointer to the audio data corresponding to the recognized silence.
2. The method of claim 1, wherein the first identified location is a location in the recognized text where silence is expected to occur based on information included in the aligned text data, the method further comprising:
identifying an additional location in the recognized text where silence was recognized and additional silence was expected to occur based on information included in the aligned text data and where at least one correctly recognized word adjoins the additional recognized silence; and
inserting into the text data, at the location aligned with said additional identified location, a pointer to the audio data corresponding to the additional recognized silence.
3. The method of claim 1, wherein identifying a first location includes the act of determining if two correctly identified words adjoin the recognized silence.
4. The method of claim 3, wherein determining if two correctly identified words adjoin the recognized silence includes the act of:
comparing a first word in the recognized text adjoining and preceding the recognized silence to a corresponding word in the aligned text data to determine if the first word was correctly recognized.
5. The method of claim 4, wherein determining if two correctly identified words adjoin the recognized silence further includes the act of:
comparing a second word in the recognized text adjoining and following the recognized silence to a corresponding word in the aligned text data to determine if the second word was correctly recognized.
6. The method of claim 3, wherein determining if two correctly identified words adjoin the recognized silence includes the act of:
comparing two consecutive words in the recognized text adjoining and preceding the recognized silence to two corresponding consecutive words in the aligned text data to determine if the two consecutive words in the recognized text were correctly recognized.
7. The method of claim 3, wherein determining if two correctly identified words adjoin the recognized silence includes the act of:
comparing two consecutive words in the recognized text adjoining and following the recognized silence to two corresponding consecutive words in the aligned text data to determine if the two consecutive words in the recognized text were correctly recognized.
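Claims 3 through 7 vary only in how many adjoining words must match on each side of the silence. A minimal sketch of that bracketing test, assuming position-aligned recognized and source word lists (with `None` marking an alignment gap, such as the silence itself):

```python
def brackets_match(rec_words, aligned_words, sil_index, n=1):
    """Return True if the n words immediately before and after the
    silence at sil_index were correctly recognized (claims 3-7 use
    one or two words on either side)."""
    def ok(i):
        return (0 <= i < len(rec_words)
                and aligned_words[i] is not None
                and rec_words[i] == aligned_words[i])
    return (all(ok(i) for i in range(sil_index - n, sil_index))
            and all(ok(i) for i in range(sil_index + 1, sil_index + 1 + n)))

# "thyme" is a misrecognition of the source word "time".
rec     = ["once", "upon", "<sil>", "a", "thyme"]
aligned = ["once", "upon", None,    "a", "time"]
print(brackets_match(rec, aligned, 2, n=1))  # True
print(brackets_match(rec, aligned, 2, n=2))  # False
```

With one-word bracketing the silence qualifies as a synchronization point; widening the window to two words (claims 6-7) rejects it because of the misrecognized "time", illustrating the precision/recall trade-off of the wider test.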
8. The method of claim 3, wherein operating a speech recognizer apparatus to perform a speech recognition operation includes the act of:
operating the speech recognizer to generate time indexes into the audio data corresponding to the locations of audio data recognized as silence.
9. The method of claim 8, wherein inserting into the text data a pointer to the audio data includes the act of:
inserting as the pointer, a time index into the audio data corresponding to the recognized period of silence.
10. The method of claim 9, further comprising:
inserting into the audio data the same time index inserted into the text data.
11. The method of claim 9, further comprising the step of:
storing the text data including the inserted time stamp in a data storage device.
12. The method of claim 11, further comprising:
operating a computer device to read the time stamp in the stored text file and to access the audio data using the time stamp as an index into the audio file.
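Claims 9 through 12 describe inserting an audio time stamp into the text and later reading it back as an index into the audio file. The marker syntax `<ts:SECONDS>` below is a hypothetical serialization invented for this sketch; the patent does not define a concrete format.

```python
import re

# Hypothetical marker syntax "<ts:SECONDS>" for an inserted time-stamp
# pointer; an assumption for this example only.
text_with_pointers = "once upon <ts:0.8> a time"

def pointers(text):
    """Extract (character_offset, audio_seconds) pairs from the text."""
    return [(m.start(), float(m.group(1)))
            for m in re.finditer(r"<ts:([0-9.]+)>", text)]

for offset, seconds in pointers(text_with_pointers):
    # A real reader application would seek its audio stream here
    # (e.g. a hypothetical player.seek(seconds)); the sketch just
    # reports the text-to-audio mapping.
    print(f"text offset {offset} -> audio {seconds:.1f}s")
```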
13. The method of claim 1, wherein said audio data and text data are audio and text versions of the same literary work.
14. The method of claim 13, wherein the pointer includes an audio time stamp.
15. The method of claim 13, wherein the pointer includes an audio file identifier and a value used to index the identified audio file.
16. The method of claim 8, further comprising:
generating a statistical language model from the text data; and
wherein operating a speech recognizer includes the act of:
using the statistical language model and a speaker independent acoustic model to recognize words and silence in the audio data.
17. The method of claim 1, further comprising:
generating a statistical language model from the text data.
18. The method of claim 17, wherein operating a speech recognizer includes the act of:
using the statistical language model and a speaker independent acoustic model to recognize words and silence in the audio data.
19. The method of claim 1, wherein inserting a pointer into the text data includes the act of:
operating a computer device to add the pointer to the text data.
20. The method of claim 19, further comprising:
storing the text data including the inserted pointer in a data storage device.
21. A computer readable medium, comprising:
computer executable instructions for controlling a computer device to process audio data and text data, said processing including:
performing a speech recognition operation on the audio data to produce a set of recognized text;
globally aligning the recognized text with words included in the text data;
identifying a first location in the recognized text where silence was recognized and where at least one correctly recognized word adjoins the recognized silence; and
inserting into the text data, at the location aligned with said first identified location, a pointer to the audio data corresponding to the recognized silence.
22. A method of processing audio data and text data comprising:
operating a speech recognizer device to perform a speech recognition operation on the audio data to produce a set of recognized text;
globally aligning the recognized text with words included in the text data;
identifying a location in the recognized text where silence was recognized and where at least one correctly recognized word adjoins the recognized silence; and
segmenting the audio and text data into multiple audio and text files including corresponding informational content, as a function of the location in the recognized text corresponding to the identified recognized silence, and the location of the identified recognized silence in the audio data.
23. The method of claim 22, further comprising:
generating a statistical language model from the text data.
24. The method of claim 23, wherein performing a speech recognition operation further includes the act of:
using the statistical language model and a speaker independent acoustic model to recognize words in the audio data.
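The segmentation in claim 22 can be sketched as splitting the source text and the audio timeline at each qualifying silence, so each text piece pairs with the audio span carrying the same content. The input shapes below (anchor pairs of word index and audio time) are assumptions carried over for illustration.

```python
def segment(source_words, anchors, audio_duration):
    """Split the text and the audio timeline at each synchronization
    point. anchors: sorted (source_word_index, audio_time) pairs."""
    text_segs, audio_spans = [], []
    prev_i, prev_t = 0, 0.0
    for word_i, audio_t in anchors:
        text_segs.append(source_words[prev_i:word_i])
        audio_spans.append((prev_t, audio_t))
        prev_i, prev_t = word_i, audio_t
    # Final segment runs from the last anchor to the end of both media.
    text_segs.append(source_words[prev_i:])
    audio_spans.append((prev_t, audio_duration))
    return text_segs, audio_spans

words = ["once", "upon", "a", "time"]
print(segment(words, [(2, 0.8)], 2.0))
# ([['once', 'upon'], ['a', 'time']], [(0.0, 0.8), (0.8, 2.0)])
```

Each text segment and its audio span share the same informational content, which is what lets the resulting files be played back or displayed independently yet stay in step.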
25. A method of synchronizing audio data and text data comprising:
operating a speech recognizer device to perform a speech recognition operation on the audio data to produce a set of recognized text;
aligning the recognized text with words included in the text data;
identifying a location in the recognized text where silence was recognized and where the silence is preceded and followed by at least one correctly recognized word; and
inserting into the text data, at a location in the text data corresponding to the identified location in the recognized text, a pointer to the audio data corresponding to the recognized silence.
26. The method of claim 25, further comprising:
generating a statistical language model from the text data; and
supplying the statistical language model to the speech recognizer.
27. The method of claim 26, wherein operating a speech recognizer to perform a speech recognition operation includes the act of:
operating the speech recognizer to use the statistical language model and a speaker independent acoustic model to recognize words in the audio data.
28. The method of claim 27, wherein the step of operating a speech recognizer to perform a speech recognition operation further includes the act of:
operating the speech recognizer to generate audio time stamps identifying the locations within the audio data of the recognized words and silence; and
wherein inserting a pointer into the text data includes the act of:
inserting one of the generated audio time stamps as the pointer.
29. The method of claim 27, wherein the statistical language model is an N-gram language model where N is an integer greater than one.
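Claims 26 through 29 call for an N-gram statistical language model built from the text data. A minimal N=2 sketch follows; the maximum-likelihood estimator with no smoothing is an assumption for illustration, since the claims do not prescribe how the model is estimated.

```python
from collections import Counter

def bigram_probs(words):
    """Maximum-likelihood bigram probabilities P(b | a). A production
    language model would add smoothing for unseen pairs (omitted)."""
    pairs = Counter(zip(words, words[1:]))   # counts of (a, b) bigrams
    history = Counter(words[:-1])            # counts of each history word a
    return {(a, b): c / history[a] for (a, b), c in pairs.items()}

text = "the cat sat on the mat".split()
probs = bigram_probs(text)
print(probs[("the", "cat")])  # 0.5: "the" is followed by "cat" in 1 of its 2 uses
```

Biasing the recognizer with such counts is what makes recognition against a known text far more reliable than open-vocabulary dictation, which in turn makes the correctly-recognized-word test around each silence practical.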
30. A computer readable medium, comprising:
computer executable instructions for controlling a computer device to process audio data and text data, said processing including:
performing a speech recognition operation on the audio data to produce a set of recognized text;
aligning the recognized text with words included in the text data;
identifying a location in the recognized text where silence was recognized and where the silence is preceded and followed by at least one correctly recognized word; and
inserting into the text data, at a location in the text data corresponding to the identified location in the recognized text, a pointer to the audio data corresponding to the recognized silence.
31. A device for processing electronic text data and electronic audio data, comprising:
a speech recognizer for performing a speech recognition operation on the audio data to produce a set of recognized text;
means for globally aligning the recognized text with words included in the text data;
means for identifying a first location in the recognized text where silence was recognized and where at least one correctly recognized word adjoins the recognized silence; and
means for inserting into the text data, at the location aligned with said first identified location, a pointer to the audio data corresponding to the recognized silence.
32. The system of claim 31, further comprising:
a statistical language model generation module coupled to the speech recognizer for generating from the text data a statistical language model used by the speech recognizer when performing a speech recognition operation on the audio data.
33. The system of claim 32, further comprising:
a speaker independent acoustic model used by the speech recognizer when performing a speech recognition operation on the audio data.
34. The system of claim 31, wherein the statistical language model generation module generates N-gram models, where N is an integer greater than two.
35. The system of claim 31, wherein the means for identifying a first location in the recognized text includes computer instructions for identifying recognized periods of silence bracketed by correctly recognized words.
36. The device of claim 31, wherein the audio data and text data correspond to the same literary work, the device further comprising:
a display;
an audio output system; and
means for simultaneously presenting the audio data via the audio output system and the text data via the display in a synchronized manner using the inserted pointer.