Aligning a transcript to audio data

US 8,719,024 B1
Filed: 03/05/2012
Issued: 05/06/2014
Est. Priority Date: 09/25/2008
Status: Active Grant

Abstract

The subject matter of this specification can be implemented in, among other things, a computer-implemented method including receiving audio data and a transcript of the audio data. The method further includes generating a language model including a factor automaton that includes automaton states and arcs, each of the automaton arcs corresponding to a language element from the transcript. The method further includes receiving language elements recognized from the received audio data and times at which each of the recognized language elements occurs in the audio data. The method further includes comparing the recognized language elements to one or more of the language elements from the factor automaton to identify times at which the one or more of the language elements from the transcript occur in the audio data. The method further includes aligning a portion of the transcript with a portion of the audio data using the identified times.
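
As a rough illustration of the factor automaton mentioned in the abstract, the Python sketch below builds the simplest (non-minimized) automaton over the words of a transcript: one arc per word, with every state treated as both a start state and a final state, so the accepted sequences are exactly the contiguous substrings of the transcript. The names `build_factor_automaton` and `accepts` are hypothetical, and the construction is an assumption made for illustration, not necessarily the one used in the patent.

```python
from typing import Dict, List, Tuple

# Arc table: state -> (word label, next state). Hypothetical representation.
Arcs = Dict[int, Tuple[str, int]]


def build_factor_automaton(words: List[str]) -> Arcs:
    """Build a linear automaton with one arc per transcript word.

    States are 0..len(words). Every state is treated as both initial and
    final, so the accepted strings are the contiguous word substrings
    (factors) of the transcript.
    """
    return {i: (word, i + 1) for i, word in enumerate(words)}


def accepts(arcs: Arcs, query: List[str]) -> bool:
    """Return True if `query` is a contiguous substring of the transcript."""
    n_states = len(arcs) + 1
    for start in range(n_states):  # any state may serve as the start state
        state = start
        matched = True
        for word in query:
            arc = arcs.get(state)
            if arc is None or arc[0] != word:
                matched = False
                break
            state = arc[1]
        if matched:  # any state may serve as the final state
            return True
    return False
```

For example, with the transcript "the quick brown fox jumps", the word sequence "quick brown" is accepted while "brown quick" is not. A production system would more likely use a minimized weighted automaton from a finite-state toolkit.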

24 Claims

1. A method, comprising:
- receiving audio data and a textual transcript of the audio data to be aligned with the audio data;
  
  generating, from the textual transcript, a language model that represents a set of particular substrings of the textual transcript, the language model comprising allowed states of the language model and one or more transitions that link the allowed states;
  
  receiving, from a speech recognizer, recognized language elements from the received audio data and times at which the recognized language elements occur in the audio data;
  
  comparing the recognized language elements from the audio data to substrings represented by the language model to identify times at which particular ones of the substrings occur in the audio data;
  
  aligning a portion of the textual transcript with a portion of the audio data using the identified times; and
  
  outputting the aligned portion of the textual transcript.
- Dependent claims (2 to 16):
  - 2. The method of claim 1, wherein the set of substrings in the textual transcript and the recognized language elements comprise words, syllables, phonemes, or phones.
  - 3. The method of claim 1, wherein comparing comprises transitioning to a subsequent substring when a recognized language element is successfully matched with a substring.
  - 4. The method of claim 1, wherein comparing the recognized language elements to the substrings further comprises searching the substrings for a substring that substantially matches a recognized language element.
  - 5. The method of claim 4, wherein searching further comprises ending the comparison or continuing to a next substring adjacent to a previous substring, wherein the previous substring comprises a language element from the textual transcript that was last successfully matched to a recognized language element.
  - 6. The method of claim 1, further comprising estimating times at which the substrings likely occur in the audio data, and weighting the comparison of the recognized language elements and the substrings based on a difference between the estimated times and the times associated with the recognized language elements from the audio data.
  - 7. The method of claim 1, further comprising determining a level of confidence for the alignment of the substrings with the recognized language elements from the audio.
  - 8. The method of claim 7, further comprising selecting audio data and substrings associated with the level of confidence when the level does not meet a threshold level of confidence.
  - 9. The method of claim 8, further comprising iteratively performing the method of claim 8 on the selected audio data and transcript portions until the threshold level is met for the selected audio data and the substrings or until a stop condition occurs.
  - 10. The method of claim 9, wherein the stop condition comprises a number of iterations, an indication that the confidence level is within a satisfactory range from the threshold level, an elapsed period of time, or an amount of processing power used.
  - 11. The method of claim 1, wherein steps of claim 1 are distributed across multiple systems.
  - 12. The method of claim 1, further comprising outputting the received audio with the output aligned portion of the textual transcript.
  - 13. The method of claim 12, further comprising storing the output audio and the aligned portion of the textual transcript in a training corpus configured for use in training speech recognition systems.
  - 14. The method of claim 1, wherein the received audio data is based on television broadcasts, online videos, audio books, or music.
  - 15. The method of claim 1, wherein the received textual transcript is based on closed captions, lyrics, or books.
  - 16. The method of claim 1, wherein the transitions include at least one of self loops and epsilon transitions.
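
The time-based weighting in claim 6 can be illustrated with a small Python sketch. It assumes, purely for illustration, that transcript words are spread uniformly over the audio duration and that the weight decays exponentially with the gap between the estimated and recognized times. The function names, the uniform-rate estimate, and the `scale` parameter are all hypothetical and are not taken from the patent.

```python
import math
from typing import List, Optional


def estimate_times(n_words: int, audio_duration: float) -> List[float]:
    """Estimate when each transcript word occurs, assuming a uniform rate."""
    return [(i + 0.5) * audio_duration / n_words for i in range(n_words)]


def match_weight(estimated: float, observed: float, scale: float = 5.0) -> float:
    """Weight in (0, 1]; larger when estimated and observed times agree."""
    return math.exp(-abs(estimated - observed) / scale)


def best_position(
    word: str,
    observed_time: float,
    transcript: List[str],
    audio_duration: float,
    scale: float = 5.0,
) -> Optional[int]:
    """Pick the transcript index of `word` whose estimated time is closest
    to the time reported by the recognizer, or None if the word is absent."""
    estimates = estimate_times(len(transcript), audio_duration)
    candidates = [
        (match_weight(estimates[i], observed_time, scale), i)
        for i, w in enumerate(transcript)
        if w == word
    ]
    if not candidates:
        return None
    return max(candidates)[1]
```

The point of the weighting is disambiguation: when a word such as "the" appears several times in the transcript, the occurrence whose estimated time is nearest the recognized time receives the highest weight.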

17. A computer-implemented method comprising:
- receiving audio data and a textual transcript of the audio data to be aligned with the audio data;
  
  generating, from the textual transcript, a language model that represents a set of particular substrings of the textual transcript, the language model comprising allowed states of the language model and one or more transitions that link the allowed states;
  
  generating a speech hypothesis of one or more language elements included in the audio using a speech recognizer, wherein the speech hypothesis includes time stamps associated with an occurrence of each of the one or more language elements;
  
  comparing the one or more language elements of the speech hypothesis to the substrings generated by the language model to identify times at which particular ones of the substrings occur in the audio data; and
  
  associating at least a portion of the textual transcript with the time stamps based on the comparison.
- Dependent claims (18 to 23):
  - 18. The method of claim 17, further comprising outputting the received audio with the associated portion of the textual transcript.
  - 19. The method of claim 17, wherein the received audio data is based on television broadcasts, online videos, audio books, or music.
  - 20. The method of claim 17, wherein the received textual transcript is based on closed captions, lyrics, or books.
  - 21. The method of claim 17, wherein the substrings from the set of substrings in the textual transcript and the recognized language elements comprise words, syllables, phonemes, or phones.
  - 22. The method of claim 17, wherein comparing comprises transitioning to a subsequent substring when a language element from the speech hypothesis is successfully matched with a substring.
  - 23. The method of claim 17, wherein comparing further comprises searching the substrings for a substring that substantially matches a language element from the speech hypothesis.
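
The final step of claim 17, associating transcript text with time stamps from a speech hypothesis, can be sketched in Python as follows. The sketch swaps in a plain longest-common-contiguous-run search (dynamic programming) in place of a language-model-constrained recognizer, which is a simplifying assumption. The `Hyp` tuple layout and both function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical hypothesis entry: (word, start time in s, end time in s).
Hyp = Tuple[str, float, float]


def longest_matching_run(
    transcript: List[str], hyp_words: List[str]
) -> Tuple[int, int, int]:
    """Return (length, transcript start index, hypothesis start index) of the
    longest contiguous word run shared by the transcript and the hypothesis."""
    best_len, best_t, best_h = 0, 0, 0
    prev = [0] * (len(hyp_words) + 1)
    for i, t_word in enumerate(transcript, 1):
        cur = [0] * (len(hyp_words) + 1)
        for j, h_word in enumerate(hyp_words, 1):
            if t_word == h_word:
                cur[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_t, best_h = i - cur[j], j - cur[j]
        prev = cur
    return best_len, best_t, best_h


def associate_timestamps(
    transcript: List[str], hypothesis: List[Hyp]
) -> Dict[int, Tuple[float, float]]:
    """Map transcript word indices to (start, end) times for the matched run."""
    hyp_words = [word for word, _, _ in hypothesis]
    length, t0, h0 = longest_matching_run(transcript, hyp_words)
    return {
        t0 + k: (hypothesis[h0 + k][1], hypothesis[h0 + k][2])
        for k in range(length)
    }
```

Only the words inside the matched run receive time stamps. Misrecognized words on either side (for example "hole" for "hold") are left unaligned, which is where a confidence measure and a second pass over the unmatched region would come in.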

24. A computer-implemented system, comprising:
- one or more computers having an interface that receives audio data and a textual transcript of the audio data to be aligned with the audio data;
  
  a model builder module that generates, from the textual transcript, a language model that represents a set of particular substrings of the textual transcript, the language model comprising allowed states of the language model and one or more transitions that link the allowed states;
  
  a storage device that stores the set of substrings of the textual transcript;
  
  a speech recognizer that recognizes language elements from the received audio data and determines times at which particular ones of the recognized language elements occur in the audio data; and
  
  means for comparing the recognized language elements to substrings to identify times at which particular ones of the substrings occur in the audio data and aligning a portion of the textual transcript with a portion of the audio data using the identified times, and outputting the aligned portion of the textual transcript.

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Google Inc. (Alphabet Inc.)
Inventors
Moreno, Pedro J.; Alberti, Christopher
Primary Examiner(s)
AZAD, ABUL K

Application Number

US13/412,320
Time in Patent Office

792 Days
Field of Search

704/235, 704/260, 704/270, 704/9, 704/255-257
US Class Current

704/257
CPC Class Codes

G10L 15/04 Segmentation; Word boundary...
