Automated synchronization of video image sequences to new soundtracks

US 5,880,788 A
Filed: 03/25/1996
Issued: 03/09/1999
Est. Priority Date: 03/25/1996
Status: Expired due to Term

Abstract

The synchronization of an existing video to a new soundtrack is carried out through the phonetic analysis of the original soundtrack and the new soundtrack. Individual speech sounds, such as phones, are identified in the soundtrack for the original video recording, and the images corresponding thereto are stored. The new soundtrack is similarly analyzed to identify individual speech sounds, which are used to select the stored images and create a new video sequence. The sequence of images is then smoothly fitted to one another, to provide a video stream that is synchronized to the new soundtrack. This approach permits a given video sequence to be synchronized to any arbitrary utterance. Furthermore, the matching of the video images to the new speech sounds can be carried out in a highly automated manner, thereby reducing required manual effort.
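
As a rough illustration of the store-then-select flow described in the abstract, the following minimal Python sketch builds a phone-indexed table of frames and looks frames up for a new phone sequence. It assumes the phone labels have already been produced by some recognizer; the function names, the string frame identifiers, and the "first stored frame wins" rule are illustrative assumptions, not details taken from the patent.

```python
from collections import defaultdict


def build_phone_database(frame_phones, frames):
    """Group frames by the phone label spoken while each frame was shown."""
    database = defaultdict(list)
    for phone, frame in zip(frame_phones, frames):
        database[phone].append(frame)
    return dict(database)


def select_frames(database, new_phones):
    """Pick one stored frame per phone of the new soundtrack.

    Assumption: the first stored frame for a phone is used. A phone that
    never occurred in the original recording raises KeyError.
    """
    return [database[phone][0] for phone in new_phones]


# Hypothetical labels for four frames of the original recording.
original_phones = ["HH", "EH", "L", "OW"]
original_frames = ["frame0", "frame1", "frame2", "frame3"]

db = build_phone_database(original_phones, original_frames)
new_sequence = select_frames(db, ["L", "OW", "HH"])
```

The sketch stops at selection; the smooth fitting of adjacent images that the abstract mentions is a separate step and is not modeled here.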

237 Citations

39 Claims

1. A method for modifying a video recording having an accompanying audio track to produce a new video presentation with a different audio track, comprising the steps of:
- analyzing said accompanying audio track by means of automatic speech recognition techniques to identify video frames in the video recording that are associated with individual speech characteristics in said accompanying audio track, and storing video image information from each of said frames in a database;
  
  analyzing video image information from said frames to identify predetermined features associated with the video image, and annotating the video image information stored in said database with data relating to said features;
  
  analyzing a sound utterance to identify individual speech characteristics in said sound utterance;
  
  selecting video image information stored in said database according to the identified speech characteristics in said sound utterance, and assembling the selected items of image information to form a sequence; and
  
  smoothly fitting the selected items of information in said sequence to one another in accordance with the annotated data to produce a video presentation that is synchronized to said sound utterance.
  - 2. The method of claim 1 wherein said individual speech characteristics in said audio track and in said sound utterance include phones.
  - 3. The method of claim 2 wherein said individual speech characteristics comprise diphones.
  - 4. The method of claim 2 wherein said individual speech characteristics comprise triphones.
  - 5. The method of claim 1 wherein said annotated data comprises control points in the video images, and said step of smoothly fitting the items of information in said sequence comprises the process of morphing between pairs of adjacent items of information in the sequence, using said control points.
  - 6. The method of claim 5 wherein said control points identify the location of a speaker's lips in the video images.
  - 7. The method of claim 1 wherein each item of video image information stored in said database is a subimage comprising a portion of an entire image in a video frame.
  - 8. The method of claim 7 further including the step of incorporating the subimages into full video frames to produce said video presentation.
  - 9. The method of claim 1 wherein the video recording includes an image of a person's head, and wherein each item of video image information stored in said database comprises a subimage of an area encompassing the mouth of the person in the image.
  - 10. The method of claim 9 further including the step of incorporating the subimages of a person's mouth into a video frame that includes an image of a person's head.
  - 11. The method of claim 1 wherein the step of analyzing the video recording comprises the step of analyzing said predetermined features to identify individual speech characteristics associated with said features.
  - 12. The method of claim 11 wherein said predetermined features comprise control points which define the shape of a speaker's lips.
  - 13. The method of claim 12 wherein said analysis comprises detection of the relative motion of said control points.
  - 14. The method of claim 12 wherein said analysis comprises detection of the spatial distribution of said control points.
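
Claim 5 recites morphing between adjacent items using control points. The sketch below shows only the geometric half of such a morph: linear interpolation of matching control points to obtain intermediate lip shapes. The function name, the (x, y) tuple representation, and the choice of linear interpolation are assumptions for illustration; the claims do not prescribe a particular morphing algorithm, and warping the pixels to follow the interpolated points is not shown.

```python
def interpolate_control_points(points_a, points_b, steps):
    """Return `steps` point sets lying strictly between points_a and points_b.

    Each point set is a list of (x, y) tuples; the two inputs must list the
    same control points in the same order.
    """
    if len(points_a) != len(points_b):
        raise ValueError("both images need the same number of control points")
    in_between = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)  # fraction of the way from a to b
        in_between.append([
            ((1 - t) * xa + t * xb, (1 - t) * ya + t * yb)
            for (xa, ya), (xb, yb) in zip(points_a, points_b)
        ])
    return in_between


# Two hypothetical lip outlines, each reduced to two control points.
closed_lips = [(0.0, 0.0), (4.0, 0.0)]
open_lips = [(0.0, 2.0), (4.0, 4.0)]
halfway = interpolate_control_points(closed_lips, open_lips, steps=1)
```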

15. A method for synchronizing a video sequence having an accompanying audio track with a different audio track, comprising the steps of:
- analyzing the audio track accompanying said video sequence by means of automatic speech recognition techniques to identify individual speech characteristics in said accompanying audio track;
  
  analyzing a sound utterance in said different audio track by means of automatic speech recognition techniques to identify individual speech characteristics in said sound utterance; and
  
  temporally modifying said video sequence so that identified individual speech characteristics in said video sequence are temporally aligned with corresponding individual speech characteristics in said sound utterance.
  - 16. The method of claim 15 wherein said modifying step comprises the process of reordering frames of said video sequence to align them with individual speech characteristics in said sound utterance.
  - 17. The method of claim 15 wherein said modifying step comprises the process of altering the timing of frames of said video sequence to align them with individual speech characteristics in said sound utterance.
  - 18. The method of claim 15 wherein said individual speech characteristics in said audio track and in said sound utterance include phones.
  - 19. The method of claim 15 wherein said sound utterance is similar to said audio track, and said modifying step includes the step of temporally warping said video sequence to align corresponding individual speech characteristics.
  - 20. The method of claim 19 wherein said video sequence is temporally warped by removing one or more video frames from said sequence, and wherein the frames to be removed are selected in accordance with individual speech characteristics associated with the respective frames of the sequence.
  - 21. The method of claim 15 wherein said modifying step comprises the steps of:
    - storing video image data for individual speech components that are identified in said soundtrack; and
      
      retrieving stored video image data in a sequence corresponding to the identified individual speech components in said utterance, to produce a new video presentation.
  - 22. The method of claim 21 further including the step of smoothly fitting the retrieved video image data in said sequence corresponding to successive individual speech components in said utterance.
  - 23. The method of claim 22 wherein said smooth fitting step comprises the process of morphing between successive sets of retrieved video image data.
  - 24. The method of claim 22 further including the steps of analyzing images in said video sequence to define control information therein, storing said control information with the stored video image data, and smoothly fitting the video image data in accordance with the stored control information.
  - 25. The method of claim 24 wherein said control information comprises points in the video images which relate to features in the images.
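
The temporal warping of claims 19 and 20 can be pictured as resampling each phone's span of frames to the duration that phone has in the new utterance. The following sketch assumes both tracks contain the same phone sequence and that segment boundaries are already known; the function name, the tuple layouts, and the evenly spaced sampling are illustrative assumptions. Claim 20 says only that the removed frames are selected according to their speech characteristics, not how they are spaced.

```python
def warp_frames(original_segments, target_lengths):
    """Build a playback schedule of original frame indices.

    original_segments: list of (phone, start, end), with end exclusive.
    target_lengths:    list of (phone, n_frames) for the new utterance.
    Frames are dropped when a phone gets shorter and repeated when it
    gets longer.
    """
    if [p for p, _, _ in original_segments] != [p for p, _ in target_lengths]:
        raise ValueError("phone sequences of the two tracks must match")
    schedule = []
    for (_, start, end), (_, n_frames) in zip(original_segments, target_lengths):
        length = end - start
        for k in range(n_frames):
            # Evenly spaced sample inside the original segment.
            schedule.append(start + (k * length) // n_frames)
    return schedule


# "AH" shrinks from 4 frames to 2; "B" stretches from 2 frames to 4.
schedule = warp_frames([("AH", 0, 4), ("B", 4, 6)], [("AH", 2), ("B", 4)])
```

In this example the schedule keeps frames 0 and 2 for the shortened phone and shows frames 4 and 5 twice each for the lengthened one.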

26. A system for modifying a recorded video image stream to synchronize it to a soundtrack which is generated separately from the recorded video image stream, comprising:
- means for automatically analyzing the recorded video image stream to identify sequences of images that are associated with individual speech characteristics;
  
  a memory storing a database containing said identified sequences of images;
  
  means for automatically analyzing said soundtrack to identify individual speech characteristics contained therein; and
  
  means for selecting sequences of images contained in said database that correspond to individual speech characteristics that are identified in said soundtrack and assembling the selected sequences of images into a video image stream that is synchronized with said soundtrack.
  - 27. The system of claim 26 wherein each of said automatic analyzing means comprises a speech recognition system.
  - 28. The system of claim 27 wherein said speech recognition system is a Hidden Markov Model system.
  - 29. The system of claim 27 wherein said speech recognition system is a neural network.
  - 30. The system of claim 27 wherein said speech recognition system comprises a Hidden Markov Model system and a neural network.
  - 31. The system of claim 26 wherein said individual speech characteristics include speech phones.
  - 32. The system of claim 26 further including means for smoothly fitting said selected images to one another to produce said synchronized video image stream.
  - 33. The system of claim 32 wherein said fitting means includes an image morphing system.
  - 34. The system of claim 26 wherein said means for automatically analyzing the recorded video image stream includes means for defining control points in said images which relate to predetermined features, and means for analyzing said control points to recognize speech characteristics associated with said features.
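
Claim 28 names a Hidden Markov Model as one form the speech recognition system may take. As a toy illustration of how an HMM assigns a phone label to each time step, here is a textbook Viterbi decoder over a discrete model. The two-phone model, its probabilities, and the "closed"/"open" observation symbols are invented for the example, and the claims do not state which decoding procedure is used.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state sequence for a discrete HMM.

    Assumes `observations` is non-empty.
    """
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this time step.
            prev, score = max(
                ((p, best[-1][p] * trans_p[p][s]) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical two-phone model with made-up probabilities.
states = ["M", "AA"]
start_p = {"M": 0.5, "AA": 0.5}
trans_p = {"M": {"M": 0.7, "AA": 0.3}, "AA": {"M": 0.3, "AA": 0.7}}
emit_p = {"M": {"closed": 0.9, "open": 0.1}, "AA": {"closed": 0.2, "open": 0.8}}
labels = viterbi(["closed", "closed", "open", "open"], states, start_p, trans_p, emit_p)
```

A practical recognizer would work with log probabilities and continuous acoustic features; plain products are used here only to keep the sketch short.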

35. A system for modifying a recorded video image stream to synchronize it to a soundtrack which is generated separately from the recorded video image stream, comprising:
- means for analyzing the recorded video image stream to identify images that are associated with individual speech characteristics;
  
  a memory storing a first database containing sub images, each of which comprises a predetermined portion of one of said identified images;
  
  means for analyzing said identified images to define control features within the subimage portions of said images;
  
  means for annotating said stored subimages with data relating to said defined control features;
  
  a memory storing a second database containing full-frame images from said video image sequence, together with said defined control features;
  
  means for analyzing said soundtrack to identify individual speech characteristics contained therein;
  
  means for selecting subimages contained in said first database that correspond to individual speech characteristics that are identified in said sound track; and
  
  means for incorporating the selected subimages into full-frame images stored in said second database, in accordance with the defined control features, to form a video stream that is synchronized with said soundtrack.
  - 36. The system of claim 35 wherein said incorporating means aligns the control features in said subimages with corresponding control features in the full frame images, and cross-fades the subimages into the full-frame images.
  - 37. The system of claim 35 wherein said incorporating means comprises a morphing system which morphs the subimages into the full-frame images in accordance with said control features.
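
Claim 36 describes aligning control features and cross-fading a subimage into a full frame. A minimal sketch of that idea follows, using grayscale images stored as nested lists and a single control point per image. The function name, the one-anchor alignment, and the uniform blend weight `alpha` are simplifying assumptions; a real compositor would more plausibly use several control points and feather the blend toward the subimage edges.

```python
def cross_fade_subimage(frame, sub, frame_anchor, sub_anchor, alpha):
    """Blend `sub` into a copy of `frame` so that the two anchors coincide.

    frame, sub: 2D lists of grayscale values.
    frame_anchor, sub_anchor: (row, col) of the same control feature in
    each image.
    alpha: weight of the subimage, 0.0 (invisible) to 1.0 (opaque).
    """
    out = [row[:] for row in frame]  # leave the input frame untouched
    d_row = frame_anchor[0] - sub_anchor[0]
    d_col = frame_anchor[1] - sub_anchor[1]
    for r, sub_row in enumerate(sub):
        for c, value in enumerate(sub_row):
            fr, fc = r + d_row, c + d_col
            # Skip subimage pixels that fall outside the frame.
            if 0 <= fr < len(out) and 0 <= fc < len(out[fr]):
                out[fr][fc] = (1 - alpha) * out[fr][fc] + alpha * value
    return out


# A 3x3 black frame and a 1x1 "mouth" patch, anchored at the frame center.
frame = [[0.0, 0.0, 0.0] for _ in range(3)]
blended = cross_fade_subimage(frame, [[100.0]], (1, 1), (0, 0), alpha=0.5)
```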

38. A method for synchronizing a video sequence having an accompanying audio track with a different audio track, comprising the steps of:
- analyzing the audio track accompanying said video sequence to identify individual speech characteristics in said audio track;
  
  analyzing a sound utterance in said different audio track by means of automatic speech recognition techniques to identify individual speech characteristics in said sound utterance; and
  
  reordering frames of said video sequence so that identified individual speech characteristics in said video sequence are temporally aligned with corresponding individual speech characteristics in said sound utterance.

39. A method for modifying a video recording that is associated with a first audio track to produce a video presentation corresponding to a second audio track, comprising the steps of:
- analyzing said video recording to identify sequences of video frames that are associated with individual features in said first audio track, and storing said sequences of frames in a database in accordance with said identified features;
  
  analyzing said second audio track to identify individual features therein;
  
  selecting sequences of frames stored in said database according to the identified features in said second audio track, and assembling the selected sequences of frames to form a video stream that is synchronized to said second audio track.

Current Assignee: Vulcan Patents LLC
Original Assignee: Interval Research Corporation
Inventors: Bregler, Christoph
Primary Examiner(s): Lee, Michael
Application Number: US08/620,949
Time in Patent Office: 1,079 Days
Field of Search: 348/515, 348/512, 348/518, 348/571, 348/96, 348/97, 348/14, 348/576
US Class Current: 348/515
CPC Class Codes:
G03B 31/02   in which sound track is on ...
G10L 15/24   Speech recognition using no...
G10L 2021/105   Synthesis of the lips movem...
G11B 27/032   on tapes G11B27/036, G11B27...
G11B 27/034   on discs G11B27/036, G11B27...
G11B 27/10   Indexing; Addressing; Timin...
H04N 5/14   Picture signal circuitry fo...
