Timeline Alignment for Closed-Caption Text Using Speech Recognition Transcripts

US 20110134321A1
Filed: 09/13/2010
Published: 06/09/2011
Est. Priority Date: 09/11/2009
Status: Active Grant

13 Assignments

0 Petitions

Abstract

Methods, systems, and computer program products for synchronizing text with audio in a multimedia file, wherein the multimedia file is defined by a timeline having a start point, an end point, and respective points in time therebetween. An N-gram analysis compares each word of the closed-captioned text associated with the multimedia file with words generated by an automated speech recognition (ASR) analysis of the file's audio. The result is an accurate, time-based metadata file in which each closed-captioned word is associated with the point on the timeline at which the word is actually spoken in the audio and occurs within the video.
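The abstract describes comparing closed-caption words with timestamped ASR words through an N-gram analysis. The sketch below is one possible reading of that step, not the patent's implementation. It assumes a hypothetical ASR output format of `(word, start_seconds)` pairs, treats N as the number of consecutive words compared (as in claim 8), moves on to the next caption word when no match is found (as in claim 4), and bounds the forward search with a hypothetical `window` parameter.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical ASR output format: (word, start time in seconds on the timeline).
AsrWord = Tuple[str, float]


def normalize(word: str) -> str:
    """Lowercase and drop punctuation so 'Dog.' and 'dog' compare equal."""
    return re.sub(r"[^a-z0-9']", "", word.lower())


def align_ngrams(cc_words: List[str], asr: List[AsrWord], n: int = 3,
                 window: int = 50) -> List[Optional[float]]:
    """Assign each caption word the start time of the ASR word it matches.

    The N-gram starting at caption position i is compared with ASR N-grams,
    scanning forward from the last matched ASR position for at most `window`
    start positions. Caption words that never match are left as None.
    """
    cc_norm = [normalize(w) for w in cc_words]
    asr_norm = [normalize(w) for w, _ in asr]
    times: List[Optional[float]] = [None] * len(cc_words)
    asr_pos = 0  # ASR words before this index have already been consumed
    i = 0
    while i < len(cc_norm):
        m = min(n, len(cc_norm) - i)  # final gram may be shorter than n
        gram = cc_norm[i:i + m]
        stop = min(len(asr_norm) - m + 1, asr_pos + window)
        hit = next((j for j in range(asr_pos, stop)
                    if asr_norm[j:j + m] == gram), None)
        if hit is None:
            i += 1  # no match: move on to the next caption word
            continue
        for k in range(m):
            times[i + k] = asr[hit + k][1]  # copy ASR timing onto caption word
        asr_pos = hit + m
        i += m
    return times


cc = "The quick brown fox jumps over the lazy dog.".split()
asr = [("the", 0.0), ("quick", 0.4), ("brown", 0.7), ("fax", 1.0),
       ("jumps", 1.3), ("over", 1.7), ("the", 1.9), ("lazy", 2.1),
       ("dog", 2.5)]
print(align_ngrams(cc, asr, n=3))
# [0.0, 0.4, 0.7, None, 1.3, 1.7, 1.9, 2.1, 2.5]
```

Here the ASR engine misheard "fox" as "fax", so that caption word keeps `None` while its neighbors inherit ASR timestamps. Scanning only forward keeps the alignment monotonic in time; the `window` bound is an assumed guard against a repeated phrase matching audio far from where it was actually spoken.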

77 Citations

20 Claims

1. A method of synchronizing text with audio in a multimedia file, wherein the multimedia file includes previously synchronized video and audio, wherein the multimedia file has a start time and a stop time that defines a timeline for the multimedia file, wherein the frames of the video and the corresponding audio are each associated with respective points in time along the timeline, comprising the steps of:
- receiving the multimedia file and parsing the audio therefrom, but maintaining the timeline synchronization between the video and the audio;
- receiving closed-captioned data associated with the multimedia file, wherein the closed-captioned data contains closed-captioned text, wherein each word of the closed-captioned text is associated with a corresponding word spoken in the audio, wherein each word of the closed-captioned text has a high degree of accuracy with the corresponding word spoken in the audio but a low correlation with the respective point in time along the timeline at which the corresponding word was spoken in the audio;
- using automated speech recognition (ASR) software, generating ASR text of the parsed audio, wherein each word of the ASR text is associated approximately with the corresponding words spoken in the audio, wherein each word of the ASR text has a lower degree of accuracy with the corresponding words spoken in the audio than the respective words of the closed-captioned text but a high correlation with the respective point in time along the timeline at which the corresponding word was spoken in the audio;
- thereafter, using N-gram analysis, comparing each word of the closed-captioned text with a plurality of words of the ASR text until a match is found;
- for each matched word from the closed-captioned text, associating therewith the respective point in time along the timeline of the matched word from the ASR text corresponding therewith, whereby each closed-captioned word is associated with a respective point on the timeline corresponding to the same point in time on the timeline in which the word is actually spoken in the audio and occurs within the video.

2. The method of claim 1 wherein the closed-captioned text and the ASR text represent only a portion of the audio of the multimedia file.

3. The method of claim 1 wherein the closed-captioned text and the ASR text represent all of the audio of the multimedia file.

4. The method of claim 1 wherein the step of comparing each word of the closed-captioned text with a plurality of words of the ASR text until a match is found further comprises the step of moving on to the next respective word of the closed-captioned text for comparison purposes if the prior word of the closed-captioned text is not matched with any of the plurality of words of the ASR text.

5. The method of claim 1 wherein, for any unmatched word in the closed-captioned text, identifying the closest matched words in the closed-captioned text on either side of the unmatched word along the timeline and then comparing the unmatched word with words of the ASR text between the two points on the timeline and selecting the most likely match or matches thereto.

6. The method of claim 1 wherein the step of comparing comprises matching strings of characters between the words of the closed-captioned text with the words of the ASR text to attempt to find exact or phonetically similar matches.

7. The method of claim 1 wherein the step of comparing comprises matching strings of characters between the words of the closed-captioned text with the words of the ASR text to attempt to find approximate matches based on the proximity of the respective points on the timeline of the respective words.

8. The method of claim 1 wherein N represents the number of words to be analyzed.

9. The method of claim 1 further comprising the step of creating a time-based metadata file that contains all of the correct words associated with the audio of the multimedia file and wherein each of the correct words is associated with the respective point in time along the timeline of the matched word from the ASR text corresponding therewith.

10. The method of claim 9 further comprising associating the time-based metadata file with the corresponding multimedia file.
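Claim 5 bounds the search for an unmatched caption word by the closest matched words on either side of it. The following is a minimal sketch of that idea, assuming the same hypothetical `(word, start_seconds)` ASR format and a `times` list in which unmatched caption words are `None`. Claim 6 speaks of exact or phonetically similar matches; this sketch swaps in character-level similarity from Python's `difflib.SequenceMatcher` as a simpler stand-in, with an assumed acceptance threshold `min_ratio`.

```python
import difflib
import math
import re
from typing import List, Optional, Tuple

AsrWord = Tuple[str, float]  # hypothetical (word, start_seconds) format


def normalize(word: str) -> str:
    """Lowercase and drop punctuation before comparing words."""
    return re.sub(r"[^a-z0-9']", "", word.lower())


def fill_gaps(cc_words: List[str], times: List[Optional[float]],
              asr: List[AsrWord], min_ratio: float = 0.5) -> List[Optional[float]]:
    """Time each unmatched caption word from the ASR words between its anchors."""
    filled = list(times)
    for i, t in enumerate(times):
        if t is not None:
            continue
        # Closest matched caption words on either side bound the search window.
        lo = next((times[k] for k in range(i - 1, -1, -1)
                   if times[k] is not None), -math.inf)
        hi = next((times[k] for k in range(i + 1, len(times))
                   if times[k] is not None), math.inf)
        target = normalize(cc_words[i])
        best_time, best_ratio = None, -1.0
        for word, start in asr:
            if not lo < start < hi:
                continue  # outside the window between the two anchors
            ratio = difflib.SequenceMatcher(None, target, normalize(word)).ratio()
            if ratio >= min_ratio and ratio > best_ratio:
                best_time, best_ratio = start, ratio  # most likely match so far
        filled[i] = best_time  # stays None if nothing is similar enough
    return filled


cc = "The quick brown fox jumps".split()
asr = [("the", 0.0), ("quick", 0.4), ("brown", 0.7), ("fax", 1.0), ("jumps", 1.3)]
print(fill_gaps(cc, [0.0, 0.4, 0.7, None, 1.3], asr))
# [0.0, 0.4, 0.7, 1.0, 1.3]
```

"fox" and "fax" share two of three characters (ratio about 0.67), so the misrecognized ASR word between the anchors at 0.7 s and 1.3 s supplies the timestamp. Consecutive unmatched caption words share one window here and may pick the same ASR word; a fuller version would track which ASR words have already been used.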

11. A computer program product, comprising:
- a computer readable medium; and
- computer program instructions stored on the computer readable medium that, when processed by a computer, instruct the computer to perform a process of synchronizing text with audio in a multimedia file, wherein the multimedia file includes previously synchronized video and audio, wherein the multimedia file has a start time and a stop time that defines a timeline for the multimedia file, wherein the frames of the video and the corresponding audio are each associated with respective points in time along the timeline, the process comprising:
  - receiving the multimedia file and parsing the audio therefrom, but maintaining the timeline synchronization between the video and the audio;
  - receiving closed-captioned data associated with the multimedia file, wherein the closed-captioned data contains closed-captioned text, wherein each word of the closed-captioned text is associated with a corresponding word spoken in the audio, wherein each word of the closed-captioned text has a high degree of accuracy with the corresponding word spoken in the audio but a low correlation with the respective point in time along the timeline at which the corresponding word was spoken in the audio;
  - using automated speech recognition (ASR) software, generating ASR text of the parsed audio, wherein each word of the ASR text is associated approximately with the corresponding words spoken in the audio, wherein each word of the ASR text has a lower degree of accuracy with the corresponding words spoken in the audio than the respective words of the closed-captioned text but a high correlation with the respective point in time along the timeline at which the corresponding word was spoken in the audio;
  - thereafter, using N-gram analysis, comparing each word of the closed-captioned text with a plurality of words of the ASR text until a match is found; and
  - for each matched word from the closed-captioned text, associating therewith the respective point in time along the timeline of the matched word from the ASR text corresponding therewith, whereby each closed-captioned word is associated with a respective point on the timeline corresponding to the same point in time on the timeline in which the word is actually spoken in the audio and occurs within the video.

12. The computer program product of claim 11 wherein the closed-captioned text and the ASR text represent only a portion of the audio of the multimedia file.

13. The computer program product of claim 11 wherein the closed-captioned text and the ASR text represent all of the audio of the multimedia file.

14. The computer program product of claim 11 wherein, within the process, the step of comparing each word of the closed-captioned text with a plurality of words of the ASR text until a match is found further comprises the step of moving on to the next respective word of the closed-captioned text for comparison purposes if the prior word of the closed-captioned text is not matched with any of the plurality of words of the ASR text.

15. The computer program product of claim 11 wherein, for any unmatched word in the closed-captioned text, the process further comprises identifying the closest matched words in the closed-captioned text on either side of the unmatched word along the timeline and then comparing the unmatched word with words of the ASR text between the two points on the timeline and selecting the most likely match or matches thereto.

16. The computer program product of claim 11 wherein, within the process, the step of comparing comprises matching strings of characters between the words of the closed-captioned text with the words of the ASR text to attempt to find exact or phonetically similar matches.

17. The computer program product of claim 11 wherein, within the process, the step of comparing comprises matching strings of characters between the words of the closed-captioned text with the words of the ASR text to attempt to find approximate matches based on the proximity of the respective points on the timeline of the respective words.

18. The computer program product of claim 11 wherein N represents the number of words to be analyzed by the process.

19. The computer program product of claim 11 wherein the process further comprises creating a time-based metadata file that contains all of the correct words associated with the audio of the multimedia file and wherein each of the correct words is associated with the respective point in time along the timeline of the matched word from the ASR text corresponding therewith.

20. The computer program product of claim 19 wherein the process further comprises associating the time-based metadata file with the corresponding multimedia file.
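Claims 19 and 20, like claims 9 and 10, describe writing the corrected words and their timeline positions to a time-based metadata file and associating that file with the multimedia file. The claims do not name a file format, so the sketch below assumes a JSON sidecar; the `.words.json` naming scheme, the `media` field, and both function names are hypothetical choices for illustration.

```python
import json
from pathlib import Path
from typing import List, Optional


def build_metadata(media_path: str, cc_words: List[str],
                   times: List[Optional[float]]) -> dict:
    """Pair each caption word with its timeline position (None if unaligned)."""
    return {
        "media": Path(media_path).name,  # records which media file this describes
        "words": [{"word": w, "start": t} for w, t in zip(cc_words, times)],
    }


def write_sidecar(media_path: str, metadata: dict) -> Path:
    """Write metadata beside the media file: show.mp4 -> show.words.json."""
    media = Path(media_path)
    out = media.with_name(media.stem + ".words.json")
    out.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return out
```

For example, `build_metadata("show.mp4", ["Hello", "world"], [0.0, 0.6])` yields a dict whose `words` list holds `{"word": "Hello", "start": 0.0}` and `{"word": "world", "start": 0.6}`. Because `None` serializes as JSON `null`, words that could not be aligned remain visible in the file instead of silently disappearing.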

Current Assignee
ROVI Product Corporation (Xperi Inc.)
Original Assignee
Digitalsmiths Corporation (Adeia Inc.)
Inventors
Berry, Matthew G.; Yang, Changwen

Granted Patent

US 8,281,231 B2

US Class Current

348/464
CPC Class Codes

G11B 27/10   Indexing; Addressing; Timin...

G11B 27/28   by using information signal...

G11B 27/322   used signal is digitally coded

H04N 21/234336   by media transcoding, e.g. ...

H04N 21/43074   of additional data with con...

H04N 21/4884   for displaying subtitles

H04N 7/0885   for the transmission of sub...