Methods and apparatus for automatically synchronizing electronic audio files with electronic text files
Abstract
Automated methods and apparatus for synchronizing audio and text data, e.g., in the form of electronic files, representing audio and text expressions of the same work or information are described. A statistical language model is generated from the text data. A speech recognition operation is then performed on the audio data using the generated language model and a speaker independent acoustic model. Silence is modeled as a word which can be recognized. The speech recognition operation produces a time indexed set of recognized words, some of which may be silence. The recognized words are globally aligned with the words in the text data. Recognized periods of silence which correspond to expected periods of silence, and which are adjoined by one or more correctly recognized words, are identified as points where the text and audio files should be synchronized, e.g., by the insertion of bi-directional pointers. In one embodiment, for a text location to be identified for synchronization purposes, both words which bracket, e.g., precede and follow, the recognized silence must be correctly identified. Pointers corresponding to identified locations of silence to be used for synchronization purposes are inserted into the text and/or audio files at the identified locations. Audio time stamps obtained from the speech recognition operation may be used as the bi-directional pointers. Synchronized text and audio data may be output in a variety of file formats.
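The pipeline the abstract describes can be sketched in a few lines. This is a hedged illustration, not the patented implementation: the recognizer output format (a list of `(token, start_time)` pairs with a `"<sil>"` token for recognized silence) and the use of `difflib` for the global alignment are assumptions made for this example.

```python
import difflib

# Recognizer output as (token, start_time_seconds); "<sil>" marks a
# recognized period of silence. This layout is an assumption for the
# sketch, not the patent's actual data format.
recognized = [("once", 0.0), ("upon", 0.4), ("<sil>", 0.8),
              ("a", 1.3), ("time", 1.5)]
source_text = ["once", "upon", "a", "time"]

def silence_anchors(recognized, source_text):
    """Return (source_word_index, audio_time) synchronization points:
    recognized silences whose neighboring words align correctly with
    the source text."""
    rec_words = [tok for tok, _ in recognized if tok != "<sil>"]
    # Global alignment of recognized words against the source text.
    matcher = difflib.SequenceMatcher(a=rec_words, b=source_text,
                                      autojunk=False)
    correct = {}  # recognized-word position -> aligned source position
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            correct[block.a + k] = block.b + k
    anchors = []
    word_pos = -1  # index among non-silence recognized tokens
    for tok, t in recognized:
        if tok != "<sil>":
            word_pos += 1
            continue
        # Claim 1 requires at least one correctly recognized adjoining
        # word; this sketch requires both the preceding and following
        # word, as in the stricter embodiment.
        if word_pos in correct and (word_pos + 1) in correct:
            anchors.append((correct[word_pos + 1], t))
    return anchors

print(silence_anchors(recognized, source_text))  # [(2, 0.8)]
```

The returned pair says a pointer belongs before source word index 2 ("a"), carrying the audio time 0.8 s of the recognized silence, which is exactly the bi-directional pointer role the abstract describes.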
36 Claims
1. A method of processing audio data and text data comprising:
operating a speech recognizer device to perform a speech recognition operation on the audio data to produce a set of recognized text;
globally aligning the recognized text with words included in the text data;
identifying a first location in the recognized text where silence was recognized and where at least one correctly recognized word adjoins the recognized silence; and
inserting into the text data, at the location aligned with said first identified location, a pointer to the audio data corresponding to the recognized silence.
2. The method of claim 1, wherein the first identified location is a location in the recognized text where silence is expected to occur based on information included in the aligned text data, the method further comprising:
identifying an additional location in the recognized text where silence was recognized and additional silence was expected to occur based on information included in the aligned text data and where at least one correctly recognized word adjoins the additional recognized silence; and
inserting into the text data, at the location aligned with said additional identified location, a pointer to the audio data corresponding to the additional recognized silence.
3. The method of claim 1, wherein identifying a first location includes the act of determining if two correctly identified words adjoin the recognized silence.
4. The method of claim 3, wherein determining if two correctly identified words adjoin the recognized silence includes the act of:
comparing a first word in the recognized text adjoining and preceding the recognized silence to a corresponding word in the aligned text data to determine if the first word was correctly recognized.
5. The method of claim 4, wherein determining if two correctly identified words adjoin the recognized silence further includes the act of:
comparing a second word in the recognized text adjoining and following the recognized silence to a corresponding word in the aligned text data to determine if the second word was correctly recognized.
6. The method of claim 3, wherein determining if two correctly identified words adjoin the recognized silence includes the act of:
comparing two consecutive words in the recognized text adjoining and preceding the recognized silence to two corresponding consecutive words in the aligned text data to determine if the two consecutive words in the recognized text were correctly recognized.
7. The method of claim 3, wherein determining if two correctly identified words adjoin the recognized silence includes the act of:
comparing two consecutive words in the recognized text adjoining and following the recognized silence to two corresponding consecutive words in the aligned text data to determine if the two consecutive words in the recognized text were correctly recognized.
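Claims 3 through 7 vary only in how many adjoining words must match on each side of the silence. A minimal sketch of that bracketing test, assuming position-aligned recognized and source word lists (with `None` marking an alignment gap, such as the silence itself):

```python
def brackets_match(rec_words, aligned_words, sil_index, n=1):
    """Return True if the n words immediately before and after the
    silence at sil_index were correctly recognized (claims 3-7 use
    one or two words on either side)."""
    def ok(i):
        return (0 <= i < len(rec_words)
                and aligned_words[i] is not None
                and rec_words[i] == aligned_words[i])
    return (all(ok(i) for i in range(sil_index - n, sil_index))
            and all(ok(i) for i in range(sil_index + 1, sil_index + 1 + n)))

# "thyme" is a misrecognition of the source word "time".
rec     = ["once", "upon", "<sil>", "a", "thyme"]
aligned = ["once", "upon", None,    "a", "time"]
print(brackets_match(rec, aligned, 2, n=1))  # True
print(brackets_match(rec, aligned, 2, n=2))  # False
```

With one-word bracketing the silence qualifies as a synchronization point; widening the window to two words (claims 6-7) rejects it because of the misrecognized "time", illustrating the precision/recall trade-off of the wider test.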
8. The method of claim 3, wherein operating a speech recognizer apparatus to perform a speech recognition operation includes the act of:
operating the speech recognizer to generate time indexes into the audio data corresponding to the locations of audio data recognized as silence.
9. The method of claim 8, wherein inserting into the text data a pointer to the audio data includes the act of:
inserting as the pointer, a time index into the audio data corresponding to the recognized period of silence.
10. The method of claim 9, further comprising:
inserting into the audio data the same time index inserted into the text data.
11. The method of claim 9, further comprising the step of:
storing the text data including the inserted time stamp in a data storage device.
12. The method of claim 11, further comprising:
operating a computer device to read the time stamp in the stored text file and to access the audio data using the time stamp as an index into the audio file.
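Claims 9 through 12 describe inserting an audio time stamp into the text and later reading it back as an index into the audio file. The marker syntax `<ts:SECONDS>` below is a hypothetical serialization invented for this sketch; the patent does not define a concrete format.

```python
import re

# Hypothetical marker syntax "<ts:SECONDS>" for an inserted time-stamp
# pointer; an assumption for this example only.
text_with_pointers = "once upon <ts:0.8> a time"

def pointers(text):
    """Extract (character_offset, audio_seconds) pairs from the text."""
    return [(m.start(), float(m.group(1)))
            for m in re.finditer(r"<ts:([0-9.]+)>", text)]

for offset, seconds in pointers(text_with_pointers):
    # A real reader application would seek its audio stream here
    # (e.g. a hypothetical player.seek(seconds)); the sketch just
    # reports the text-to-audio mapping.
    print(f"text offset {offset} -> audio {seconds:.1f}s")
```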
13. The method of claim 1, wherein said audio data and text data are audio and text versions of the same literary work.
14. The method of claim 13, wherein the pointer includes an audio time stamp.
15. The method of claim 13, wherein the pointer includes an audio file identifier and a value used to index the identified audio file.
16. The method of claim 8, further comprising:
generating a statistical language model from the text data; and
wherein operating a speech recognizer includes the act of:
using the statistical language model and a speaker independent acoustic model to recognize words and silence in the audio data.
17. The method of claim 1, further comprising:
generating a statistical language model from the text data.
18. The method of claim 17, wherein operating a speech recognizer includes the act of:
using the statistical language model and a speaker independent acoustic model to recognize words and silence in the audio data.
19. The method of claim 1, wherein inserting a pointer into the text data includes the act of:
operating a computer device to add the pointer to the text data.
20. The method of claim 19, further comprising:
storing the text data including the inserted pointer in a data storage device.
21. A computer readable medium, comprising:
computer executable instructions for controlling a computer device to process audio data and text data, said processing including:
performing a speech recognition operation on the audio data to produce a set of recognized text;
globally aligning the recognized text with words included in the text data;
identifying a first location in the recognized text where silence was recognized and where at least one correctly recognized word adjoins the recognized silence; and
inserting into the text data, at the location aligned with said first identified location, a pointer to the audio data corresponding to the recognized silence.
22. A method of processing audio data and text data comprising:
operating a speech recognizer device to perform a speech recognition operation on the audio data to produce a set of recognized text;
globally aligning the recognized text with words included in the text data;
identifying a location in the recognized text where silence was recognized and where at least one correctly recognized word adjoins the recognized silence; and
segmenting the audio and text data into multiple audio and text files including corresponding informational content, as a function of the location in the recognized text corresponding to the identified recognized silence, and the location of the identified recognized silence in the audio data.
23. The method of claim 22, further comprising:
generating a statistical language model from the text data.
24. The method of claim 23, wherein performing a speech recognition operation further includes the act of:
using the statistical language model and a speaker independent acoustic model to recognize words in the audio data.
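The segmentation in claim 22 can be sketched as splitting the source text and the audio timeline at each qualifying silence, so each text piece pairs with the audio span carrying the same content. The input shapes below (anchor pairs of word index and audio time) are assumptions carried over for illustration.

```python
def segment(source_words, anchors, audio_duration):
    """Split the text and the audio timeline at each synchronization
    point. anchors: sorted (source_word_index, audio_time) pairs."""
    text_segs, audio_spans = [], []
    prev_i, prev_t = 0, 0.0
    for word_i, audio_t in anchors:
        text_segs.append(source_words[prev_i:word_i])
        audio_spans.append((prev_t, audio_t))
        prev_i, prev_t = word_i, audio_t
    # Final segment runs from the last anchor to the end of both media.
    text_segs.append(source_words[prev_i:])
    audio_spans.append((prev_t, audio_duration))
    return text_segs, audio_spans

words = ["once", "upon", "a", "time"]
print(segment(words, [(2, 0.8)], 2.0))
# ([['once', 'upon'], ['a', 'time']], [(0.0, 0.8), (0.8, 2.0)])
```

Each text segment and its audio span share the same informational content, which is what lets the resulting files be played back or displayed independently yet stay in step.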
25. A method of synchronizing audio data and text data comprising:
operating a speech recognizer device to perform a speech recognition operation on the audio data to produce a set of recognized text;
aligning the recognized text with words included in the text data;
identifying a location in the recognized text where silence was recognized and where the silence is preceded and followed by at least one correctly recognized word; and
inserting into the text data, at a location in the text data corresponding to the identified location in the recognized text, a pointer to the audio data corresponding to the recognized silence.
26. The method of claim 25, further comprising:
generating a statistical language model from the text data; and
supplying the statistical language model to the speech recognizer.
27. The method of claim 26, wherein operating a speech recognizer to perform a speech recognition operation includes the act of:
operating the speech recognizer to use the statistical language model and a speaker independent acoustic model to recognize words in the audio data.
28. The method of claim 27, wherein the step of operating a speech recognizer to perform a speech recognition operation further includes the act of:
operating the speech recognizer to generate audio time stamps identifying the locations within the audio data of the recognized words and silence; and
wherein inserting a pointer into the text data includes the act of:
inserting one of the generated audio time stamps as the pointer.
29. The method of claim 27, wherein the statistical language model is an N-gram language model where N is an integer greater than one.
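Claims 26 through 29 call for an N-gram statistical language model built from the text data. A minimal N=2 sketch follows; the maximum-likelihood estimator with no smoothing is an assumption for illustration, since the claims do not prescribe how the model is estimated.

```python
from collections import Counter

def bigram_probs(words):
    """Maximum-likelihood bigram probabilities P(b | a). A production
    language model would add smoothing for unseen pairs (omitted)."""
    pairs = Counter(zip(words, words[1:]))   # counts of (a, b) bigrams
    history = Counter(words[:-1])            # counts of each history word a
    return {(a, b): c / history[a] for (a, b), c in pairs.items()}

text = "the cat sat on the mat".split()
probs = bigram_probs(text)
print(probs[("the", "cat")])  # 0.5: "the" is followed by "cat" in 1 of its 2 uses
```

Biasing the recognizer with such counts is what makes recognition against a known text far more reliable than open-vocabulary dictation, which in turn makes the correctly-recognized-word test around each silence practical.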
30. A computer readable medium, comprising:
computer executable instructions for controlling a computer device to process audio data and text data, said processing including:
performing a speech recognition operation on the audio data to produce a set of recognized text;
aligning the recognized text with words included in the text data;
identifying a location in the recognized text where silence was recognized and where the silence is preceded and followed by at least one correctly recognized word; and
inserting into the text data, at a location in the text data corresponding to the identified location in the recognized text, a pointer to the audio data corresponding to the recognized silence.
31. A device for processing electronic text data and electronic audio data, comprising:
a speech recognizer for performing a speech recognition operation on the audio data to produce a set of recognized text;
means for globally aligning the recognized text with words included in the text data;
means for identifying a first location in the recognized text where silence was recognized and where at least one correctly recognized word adjoins the recognized silence; and
means for inserting into the text data, at the location aligned with said first identified location, a pointer to the audio data corresponding to the recognized silence.
32. The system of claim 31, further comprising:
a statistical language model generation module coupled to the speech recognizer for generating from the text data a statistical language model used by the speech recognizer when performing a speech recognition operation on the audio data.
33. The system of claim 32, further comprising:
a speaker independent acoustic model used by the speech recognizer when performing a speech recognition operation on the audio data.
34. The system of claim 31, wherein the statistical language model generation module generates N-gram models, where N is an integer greater than two.
35. The system of claim 31, wherein the means for identifying a first location in the recognized text includes computer instructions for identifying recognized periods of silence bracketed by correctly recognized words.
36. The device of claim 31, wherein the audio data and text data correspond to the same literary work, the device further comprising:
a display;
an audio output system; and
means for simultaneously presenting the audio data via the audio output system and the text data via the display in a synchronized manner using the inserted pointer.