Methods and apparatuses for segmenting an audio-visual recording using image similarity searching and audio speaker recognition

US 6,404,925 B1
Filed: 03/11/1999
Issued: 06/11/2002
Est. Priority Date: 03/11/1999
Status: Expired due to Term

First Claim

1. A method of segmenting an audio-video recording, comprising the steps of:

identifying one or more video frame intervals having similarity to a predetermined video image class;

extracting one or more audio intervals corresponding to the one or more video frame intervals;

applying an acoustic clustering method on the one or more audio intervals to produce one or more audio clusters; and

wherein the step of identifying one or more video frame intervals comprises decimating a video portion of the audio-visual recording in time and space to produce decimated frames; and

for each decimated frame, transforming the decimated frame to produce a transform matrix;

extracting a feature vector from the transform matrix; and

determining similarity of the frame using the feature vector and a video image class statistical model.

7 Assignments

0 Petitions

Abstract

Methods for segmenting audio-video recordings of meetings that contain slide presentations by one or more speakers are described. The resulting segments serve as indexes into the recorded meeting. If an agenda is provided for the meeting, the segments can be labeled using information from the agenda. The system automatically detects intervals of video that correspond to presentation slides. Under the assumption that only one person is speaking while slides are displayed in the video, candidate speaker intervals are extracted from the audio soundtrack at the times of these slide intervals. Since the same speaker may talk across multiple slide intervals, the acoustic data from these intervals is clustered to yield an estimate of the number of distinct speakers and their order. Merged clustered audio intervals corresponding to a single speaker are then used as training data for a speaker segmentation system. Using speaker identification techniques, the full video is then segmented into individual presentations based on the extent of each presenter's speech. The speaker identification system optionally includes the construction of a hidden Markov model trained on the audio data from each slide interval. A Viterbi assignment then segments the audio according to speaker.
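
As an illustration of the first two steps in the abstract, the following is a minimal sketch of turning per-frame slide detections into slide intervals and then into audio sample ranges. It is not taken from the patent: the function names, the boolean per-frame input, and the `fps`, `min_seconds`, and `sample_rate` parameters are all assumptions made for the example. The minimum-duration filter mirrors the "longer than a predetermined time duration" condition recited in the claims.

```python
from typing import List, Tuple


def slide_intervals(is_slide: List[bool], fps: float,
                    min_seconds: float) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) for runs of slide frames lasting at least min_seconds.

    is_slide[i] is assumed to be the output of some per-frame slide detector.
    """
    intervals = []
    start = None
    # A trailing False sentinel closes a run that reaches the last frame.
    for i, flag in enumerate(list(is_slide) + [False]):
        if flag and start is None:
            start = i                      # a run of slide frames begins
        elif not flag and start is not None:
            if (i - start) / fps >= min_seconds:
                intervals.append((start / fps, i / fps))
            start = None                   # the run has ended
    return intervals


def audio_sample_ranges(intervals: List[Tuple[float, float]],
                        sample_rate: int) -> List[Tuple[int, int]]:
    """Map time intervals in seconds to (first_sample, end_sample) index pairs."""
    return [(int(round(s * sample_rate)), int(round(e * sample_rate)))
            for s, e in intervals]
```

Each returned sample range can then be cut from the soundtrack and treated as a candidate single-speaker interval for the clustering step.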

19 Claims

1. A method of segmenting an audio-video recording, comprising the steps of:
- identifying one or more video frame intervals having similarity to a predetermined video image class;
  
  extracting one or more audio intervals corresponding to the one or more video frame intervals;
  
  applying an acoustic clustering method on the one or more audio intervals to produce one or more audio clusters; and
  
  wherein the step of identifying one or more video frame intervals comprises decimating a video portion of the audio-visual recording in time and space to produce decimated frames; and
  
  for each decimated frame, transforming the decimated frame to produce a transform matrix;
  
  extracting a feature vector from the transform matrix; and
  
  determining similarity of the frame using the feature vector and a video image class statistical model.
2. A method as in claim 1, wherein the step of measuring similarity of the frame includes the steps of:
3. A method as in claim 2, wherein the step of comparing a magnitude of the difference vector to a threshold comprises the step of:
- comparing the magnitude of the difference vector to a predetermined multiple of a standard deviation associated with the video image class statistical model.
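
The frame-classification steps of claims 1 to 3 can be sketched roughly as follows. This is an illustrative sketch only, under several assumptions that are not in the text above: decimation is done by simple subsampling, the transform is a 2-D discrete cosine transform, the feature vector is the low-frequency corner of the transform matrix, the difference vector is the feature vector minus the class mean, and the class model is reduced to a mean vector plus a single scalar standard deviation. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def decimate_time(frames: Sequence[Matrix], step: int) -> List[Matrix]:
    """Keep every step-th frame of the video."""
    return list(frames[::step])


def decimate_space(frame: Matrix, step: int) -> Matrix:
    """Keep every step-th pixel in both directions (simple subsampling)."""
    return [row[::step] for row in frame[::step]]


def dct_2d(block: Matrix) -> Matrix:
    """Unnormalized 2-D DCT-II computed directly from its definition."""
    n, m = len(block), len(block[0])
    out = [[0.0] * m for _ in range(n)]
    for u in range(n):
        for v in range(m):
            total = 0.0
            for x in range(n):
                cx = math.cos(math.pi * (x + 0.5) * u / n)
                for y in range(m):
                    cy = math.cos(math.pi * (y + 0.5) * v / m)
                    total += block[x][y] * cx * cy
            out[u][v] = total
    return out


def feature_vector(transform: Matrix, size: int) -> List[float]:
    """Flatten the size x size low-frequency corner of the transform matrix."""
    return [transform[u][v] for u in range(size) for v in range(size)]


def is_similar(feature: Sequence[float], class_mean: Sequence[float],
               class_std: float, multiple: float) -> bool:
    """Compare the difference-vector magnitude to a multiple of the class std."""
    diff = [f - m for f, m in zip(feature, class_mean)]
    magnitude = math.sqrt(sum(d * d for d in diff))
    return magnitude <= multiple * class_std
```

The direct DCT above is quadratic in the number of pixels, which is acceptable only because the frames are heavily decimated first; a production system would use a fast transform.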

4. A method of segmenting an audio-video recording, comprising the steps of:
- identifying one or more video frame intervals having similarity to a predetermined video image class;
  
  extracting one or more audio intervals corresponding to the one or more video frame intervals;
  
  applying an acoustic clustering method on the one or more audio intervals to produce one or more audio clusters; and
  
  wherein the step of identifying one or more video frame intervals having similarity to a predetermined video class includes the step of finding video frame intervals corresponding to slide intervals longer than a predetermined time duration.
5. A method as in claim 4, wherein the step of applying an acoustic clustering method comprises the steps of:
6. A method as in claim 5, wherein the mean vector is a mel-frequency cepstral coefficient mean vector.
7. A method as in claim 6, wherein the mean vector is a filterbank or linear predictive coding coefficient mean vector.
8. A method as in claim 4, further comprising the steps of:
- merging the audio intervals within same audio clusters to produce merged audio intervals; and
  
  training source-specific speaker models on the merged audio intervals.
9. A method as in claim 8, further comprising the step of:
- segmenting the audio-visual recording by speaker using the source-specific speaker models to identify each speaker.
10. A method as in claim 8, further comprising the step of:
- creating a speaker transition model using a speaker sequence indicated by the merged audio intervals and the source-specific speaker models; and
  
  segmenting the audio-visual recording using the speaker transition model.
11. A method as in claim 10, wherein the speaker transition model includes a sequence of speaker units, each speaker unit including a source-specific speaker model and a filler model.
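
The clustering and merging steps of claims 5 to 8 could look roughly like the sketch below. The steps of claim 5 are not reproduced in this record, so the details here are assumptions: each audio interval is summarized by the mean of its per-frame feature vectors (for example mel-frequency cepstral coefficients), clusters are merged bottom-up by Euclidean distance between centroids, and merging stops at a hypothetical `threshold`. The patent's actual distance measure and stopping rule may differ.

```python
import math
from typing import Dict, List, Sequence, Tuple


def mean_vector(frames: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a set of equal-length feature vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def agglomerative_labels(means: Sequence[Sequence[float]],
                         threshold: float) -> List[int]:
    """Cluster interval mean vectors bottom-up; return one label per interval."""
    clusters = [[i] for i in range(len(means))]

    def centroid(members: List[int]) -> List[float]:
        return mean_vector([means[i] for i in members])

    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = math.dist(centroid(clusters[a]), centroid(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # remaining clusters are treated as distinct speakers
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # Number the clusters by first appearance, giving an estimated speaker order.
    clusters.sort(key=min)
    labels = [0] * len(means)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def merge_by_cluster(intervals: Sequence[Tuple[float, float]],
                     labels: Sequence[int]) -> Dict[int, List[Tuple[float, float]]]:
    """Group audio intervals by cluster label to form per-speaker training data."""
    merged: Dict[int, List[Tuple[float, float]]] = {}
    for interval, label in zip(intervals, labels):
        merged.setdefault(label, []).append(interval)
    return merged
```

The number of distinct labels is the estimated speaker count, and the merged interval lists are what a source-specific speaker model would be trained on.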

12. A computer readable storage medium, comprising:
- computer readable program code embodied on said computer readable storage medium, said computer readable program code for programming a computer to perform a method of segmenting an audio-video recording, comprising the steps of:
  
  identifying one or more video frame intervals having similarity to a predetermined video image class;
  
  extracting one or more audio intervals corresponding to the one or more video frame intervals;
  
  applying an acoustic clustering method on the one or more audio intervals to produce one or more audio clusters; and
  
  wherein the step of identifying one or more video frame intervals comprises:
  
  decimating a video portion of the audio-visual recording in time and space to produce decimated frames; and
  
  for each decimated frame, transforming the decimated frame to produce a transform matrix;
  
  extracting a feature vector from the transform matrix; and
  
  determining similarity of the frame using the feature vector and a video image class statistical model.
13. A computer readable storage medium, comprising:

14. A computer readable storage medium, comprising:
- computer readable program code embodied on said computer readable storage medium, said computer readable program code for programming a computer to perform a method of segmenting an audio-video recording, comprising the steps of:
  
  identifying one or more video frame intervals having similarity to a predetermined video image class;
  
  extracting one or more audio intervals corresponding to the one or more video frame intervals;
  
  applying an acoustic clustering method on the one or more audio intervals to produce one or more audio clusters; and
  
  wherein the step of identifying one or more video frame intervals having similarity to a predetermined video image class includes the step of finding video frame intervals corresponding to slide intervals longer than a predetermined time duration.
15. A computer readable storage medium, comprising:
16. A computer readable storage medium, comprising:
- computer readable program code embodied on said computer readable storage medium, said computer readable program code for programming a computer to perform a method as in claim 14, further comprising the steps of:
  
  merging the audio intervals within same audio clusters to produce merged audio intervals; and
  
  training source-specific speaker models on the merged audio intervals.
17. A computer readable storage medium, comprising:
- computer readable program code embodied on said computer readable storage medium, said computer readable program code for programming a computer to perform a method as in claim 16, further comprising the step of:
  
  segmenting the audio-visual recording by speaker using the source-specific speaker models to identify each speaker.
18. A computer readable storage medium, comprising:
- computer readable program code embodied on said computer readable storage medium, said computer readable program code for programming a computer to perform a method as in claim 16, further comprising the steps of:
  
  creating a speaker transition model using a speaker sequence indicated by the merged audio intervals and the source-specific speaker models; and
  
  segmenting the audio-visual recording using the speaker transition model.
19. A computer readable storage medium, comprising:
- computer readable program code embodied on said computer readable storage medium, said computer readable program code for programming a computer to perform a method as in claim 18, wherein the speaker transition model includes a sequence of speaker units, each speaker unit including a source-specific speaker model and a filler model.
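
The speaker segmentation recited in claims 9 to 11 and 17 to 19 relies on a Viterbi assignment, which can be sketched as below. This is a deliberate simplification and not the patent's speaker transition model: it assumes per-frame log-likelihoods have already been computed for each speaker model, allows a transition between any two speakers, and charges a fixed, hypothetical `switch_penalty` for changing speaker. The sequence of speaker units with filler models described in claims 11 and 19 is not modeled.

```python
from typing import List, Tuple


def viterbi_segment(log_likes: List[List[float]],
                    switch_penalty: float) -> List[int]:
    """Return the most likely speaker index for each audio frame.

    log_likes[t][s] is the log-likelihood of frame t under speaker model s.
    """
    n_states = len(log_likes[0])
    score = list(log_likes[0])          # best path score ending in each state
    back: List[List[int]] = []          # back-pointers, one list per later frame
    for frame in log_likes[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            # Staying with the same speaker is free; switching costs a penalty.
            costs = [score[p] - (0.0 if p == s else switch_penalty)
                     for p in range(n_states)]
            best_prev = max(range(n_states), key=lambda p: costs[p])
            new_score.append(costs[best_prev] + frame[s])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def speaker_runs(path: List[int]) -> List[Tuple[int, int, int]]:
    """Collapse a frame-level path into (speaker, first_frame, end_frame) runs."""
    runs: List[Tuple[int, int, int]] = []
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            runs.append((path[start], start, t))
            start = t
    return runs
```

The switch penalty plays the role of a transition cost: a larger value suppresses brief, spurious speaker changes caused by a few poorly matching frames.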

Current Assignee
Fuji Xerox Company Limited (Fujifilm Holdings Corporation), Xerox Corporation (Xerox Holdings Corp.)
Original Assignee
Fuji Xerox Company Limited (Fujifilm Holdings Corporation), Xerox Corporation (Xerox Holdings Corp.)
Inventors
Wilcox, Lynn; Foote, Jonathan T.
Primary Examiner(s)
Boudreau, Leo
Assistant Examiner(s)
Mariam, Daniel G.

Application Number
US09/266,561
Time in Patent Office
1,188 Days
Field of Search
382/181, 382/224, 382/225, 382/226, 382/227, 382/229, 382/159, 382/171, 382/173, 382/190, 382/197, 382/209, 382/218, 382/219, 382/232, 382/238, 382/243, 382/305, 348/480, 348/484, 704/239, 704/243, 707/1, 707/3
US Class Current
382/224
CPC Class Codes
G06F 16/685   using automatically derived...
G06F 16/7834   using audio features
G06V 20/48   Matching video sequences
G10L 17/00   Speaker identification or v...
G11B 27/28   by using information signal...
Y10S 707/99931   Database or file accessing
Y10S 707/99933   Query processing, i.e. sear...
