Method and apparatus for extracting indexing information from digital video data

US 5,828,809 A
Filed: 10/01/1996
Issued: 10/27/1998
Est. Priority Date: 10/01/1996
Status: Expired due to Fees

Abstract

A method and apparatus to automatically index the locations of specified events on a video tape. The events include, for example, touchdowns, fumbles and other football-related events. An index to the locations where these events occur is created by using both speech detection and video analysis algorithms. A speech detection algorithm locates specific words in the audio portion of the video tape. Locations where the specific words are found are passed to the video analysis algorithm. A range around each of the locations is established. Each range is segmented into shots using a histogram technique. The video analysis algorithm analyzes each segmented range for certain video features using line extraction techniques to identify the event. The final product of the video analysis is a set of pointers (or indexes) to the locations of the events in the video tape.
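
The range step described in the abstract can be illustrated with a short sketch. It assumes candidate positions are given in seconds from the start of the tape and uses the window named in claims 2 and 27 (one minute before and two minutes after each candidate). The function name `establish_ranges`, the clipping to the tape duration, and the decision not to merge overlapping ranges are illustrative assumptions, not details taken from the patent.

```python
from typing import List, Tuple

# Window from claims 2 and 27: one minute before, two minutes after.
BEFORE_S = 60.0
AFTER_S = 120.0


def establish_ranges(
    candidates: List[float],
    duration: float,
    before: float = BEFORE_S,
    after: float = AFTER_S,
) -> List[Tuple[float, float]]:
    """Return one (start, end) range in seconds per wordspotting candidate.

    Clipping each range to [0, duration] is an assumption of this sketch.
    """
    ranges = []
    for t in sorted(candidates):
        start = max(0.0, t - before)
        end = min(duration, t + after)
        ranges.append((start, end))
    return ranges


if __name__ == "__main__":
    # Hypothetical candidates at 30 s, 600 s and 3550 s on a one-hour tape.
    print(establish_ranges([30.0, 600.0, 3550.0], duration=3600.0))
```

Only the video inside these ranges would then be passed on to shot segmentation, which is what keeps the more expensive video analysis away from the rest of the tape.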

50 Claims

1. A computer-implemented speech and video analysis system for creating an index to indicate locations of a first event occurring within audio-video data, said audio-video data containing audio data synchronized with video data to represent a plurality of events, said first event having at least one audio-feature and at least one video-feature indicative of said first event, comprising the steps of:
- (a) providing a model speech database for storing speech models representative of said audio-feature;
  
  (b) providing a model video database for storing video models representative of said video-feature;
  
  (c) performing wordspotting to determine candidates by comparing said audio data with said stored speech models, said candidates indicating positions of said audio-feature within said audio data;
  
  (d) establishing predetermined ranges around each of said candidates;
  
  (e) segmenting into shots those portions of said video data which are located within said ranges;
  
  (f) analyzing said segmented video data to determine video-locations based on a comparison between said segmented video data and said stored video models, said video-locations indicating positions of said video-feature within said segmented video data; and
  
  (g) generating an index to indicate locations of said first event based on said video-locations.
  - 2. The method according to claim 1 wherein said predetermined ranges have a starting position of one minute before each of said candidates and an ending position of two minutes after each of said candidates.
  - 3. The method according to claim 1 further comprising the step of deriving said audio-video data from a video tape.
  - 4. The method according to claim 1 further comprising the step of digitizing said audio data.
  - 5. The method according to claim 1 further comprising the step of digitizing said video data.
  - 6. The method according to claim 1 wherein said audio-feature is a predefined utterance.
  - 7. The method according to claim 6 further comprising the steps of:
    - determining energy of said predefined utterance; and
      
      storing said determined energy in said speech models.
  - 8. The method according to claim 7 further comprising the step of determining said candidates based on Euclidean distance between said energy of said audio data and said energy speech models.
  - 9. The method according to claim 6 further comprising the steps of:
    - determining Hidden Markov Models of said predefined utterance;
      
      storing said determined Hidden Markov Models in said speech models.
  - 10. The method according to claim 9 further comprising the step of determining said candidates based on a Hidden Markov Model comparison between said audio data and said Hidden Markov Model speech models.
  - 11. The method according to claim 6 further comprising the steps of:
    - determining a phonetic model of said predefined utterance; and
      
      storing said determined phonetic model in said speech models.
  - 12. The method according to claim 11 further comprising the step of determining said candidates based on a dynamic time warping analysis performed between said audio data and said speech models.
  - 13. The method according to claim 1 wherein each of said shots being a contiguous set of video data depicting a discrete activity within an event.
  - 14. The method according to claim 13 further comprising the step of segmenting said video data based upon a histogram difference χ² comparison between said segmented video data and said stored video models.
  - 15. The method according to claim 13 further comprising the step of storing line representations of said video-feature within said stored video models.
  - 16. The method according to claim 15 further comprising the step of performing line extraction upon said segmented video data.
  - 17. The method according to claim 14 further comprising the step of storing color characteristics of said video-feature within said stored video models.
  - 18. The method according to claim 17 further comprising the step of determining video-locations based on comparing color data of said video data with said color characteristics of said stored video models.
  - 19. The method according to claim 13 further comprising the step of storing texture characteristics of said video feature within said stored video models.
  - 20. The method according to claim 19 further comprising the step of determining video-locations based on comparing texture data of said video data with said texture characteristics of said stored video models.
  - 21. The method according to claim 1 further comprising the step of storing a predefined transition of shots within said video models, each of said shots being a contiguous set of video data depicting a discrete activity within an event.
  - 22. The method according to claim 21 wherein said discrete activity includes two football teams lining up in a football formation.
  - 23. The method according to claim 21 wherein said discrete activity includes a football team attempting a field goal.
  - 24. The method according to claim 21 wherein said predefined transition of shots includes a lining up shot, an action shot, an aftermath shot, and an extra point shot.
  - 25. The method according to claim 21 further comprising the step of comparing said shots from said video data to said stored predefined transition of shots to identify said first event.
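
The histogram-based shot segmentation named in claim 14 and in the abstract can be sketched as follows. This is a minimal illustration, not the patented procedure: it assumes each frame is a flat sequence of 8-bit grayscale pixel values, compares consecutive frames (a common form of histogram-based shot boundary detection, chosen here as an assumption), and uses the symmetric chi-squared form sum((a - b)^2 / (a + b)). The function names, the bin count, and the threshold are hypothetical.

```python
from typing import List, Sequence, Tuple


def histogram(frame: Sequence[int], bins: int = 16, levels: int = 256) -> List[int]:
    """Count pixel intensities (0..levels-1) into equal-width bins."""
    counts = [0] * bins
    width = levels // bins
    for p in frame:
        counts[min(p // width, bins - 1)] += 1
    return counts


def chi_square(h1: Sequence[int], h2: Sequence[int]) -> float:
    """Symmetric chi-squared difference between two histograms."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return total


def segment_into_shots(
    frames: Sequence[Sequence[int]], threshold: float
) -> List[Tuple[int, int]]:
    """Return (start, end) frame-index pairs, end exclusive, one per shot.

    A new shot starts wherever the chi-squared difference between
    consecutive frame histograms exceeds `threshold`.
    """
    if not frames:
        return []
    shots = []
    start = 0
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        if chi_square(prev, cur) > threshold:
            shots.append((start, i))
            start = i
        prev = cur
    shots.append((start, len(frames)))
    return shots


if __name__ == "__main__":
    dark = [10] * 64     # hypothetical 8x8 dark frame
    bright = [200] * 64  # hypothetical 8x8 bright frame
    print(segment_into_shots([dark, dark, dark, bright, bright], threshold=32.0))
```

In practice the threshold would need tuning against real footage, since gradual transitions and camera motion also change the histogram between frames.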

26. An apparatus for creating an index to indicate locations of a first event occurring within audio-video data, said audio-video data containing audio data synchronized with video data to represent a plurality of events, said first event having at least one audio-feature and at least one video-feature indicative of said first event, comprising:
- a model speech database for storing speech models representative of said audio-feature;
  
  a model video database for storing video models representative of said video-feature;
  
  a wordspotter coupled to said model speech database for determining candidates based on comparison between said audio data with said stored speech models, said candidates indicating positions of said audio-feature within said audio data;
  
  range establishing means coupled to said wordspotter for establishing predetermined ranges around each of said candidates;
  
  a segmenting device coupled to said range establishing means for segmenting into shots those portions of said video data which are located within said ranges;
  
  a video analyzer coupled to said segmenting device and to said model video database for determining video-locations based on a comparison between said video data and said stored video models, said video-locations indicating positions of said video-feature within said video data; and
  
  an indexer coupled to said video analyzer for creating an index indicating the locations of said first event within said audio-video data based on said determined video-locations.
- View Dependent Claims (27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50)
  - 27. The apparatus according to claim 26 wherein said predetermined ranges have a starting position of one minute before each of said candidates and an ending position of two minutes after each of said candidates.
  - 28. The apparatus according to claim 26 wherein said audio-video data is derived from a video tape.
  - 29. The apparatus according to claim 26 wherein said audio data is digital audio data.
  - 30. The apparatus according to claim 26 wherein said video data is digital video data.
  - 31. The apparatus according to claim 26 wherein said audio-feature is a predefined utterance.
  - 32. The apparatus according to claim 31 wherein said speech models are based upon energy of said predefined utterance.
  - 33. The apparatus according to claim 32 wherein said wordspotter selects said audio-locations based on Euclidean distance between said energy of said audio data and said speech models.
  - 34. The apparatus according to claim 31 wherein said speech models are based upon Hidden Markov speech models of said predefined utterance.
  - 35. The apparatus according to claim 34 wherein said wordspotter selects said audio-locations based on a Hidden Markov Model comparison between said audio data and said Hidden Markov speech models.
  - 36. The apparatus according to claim 31 wherein said speech models are based upon a phonetic model of said predefined utterance.
  - 37. The apparatus according to claim 36 wherein said wordspotter selects said audio-locations based on a dynamic time warping analysis performed on said audio data and said speech models.
  - 38. The method according to claim 26 wherein each of said shots being a contiguous set of video data depicting a discrete activity within an event.
  - 39. The apparatus according to claim 38 wherein said segmenter device segments said portions of said video data based upon a histogram difference χ² comparison.
  - 40. The apparatus according to claim 38 wherein said video models are based upon line representations of said video-feature.
  - 41. The apparatus according to claim 40 wherein said video analyzer includes a line extraction device for representing said video data as a set of lines.
  - 42. The apparatus according to claim 38 wherein said video models include color characteristics of said video-feature.
  - 43. The apparatus according to claim 42 wherein said video analyzer includes a color analysis device for comparing color data of said video data with said color characteristics of said video models.
  - 44. The apparatus according to claim 38 wherein said video models include texture characteristics of said video-feature.
  - 45. The apparatus according to claim 44 wherein said video analyzer includes a texture analysis device for comparing texture data of said video data with said texture characteristics of said video models.
  - 46. The apparatus according to claim 26 wherein said video models are based upon a predefined transition of shots, each of said shots being a contiguous set of video data depicting a discrete activity within an event.
  - 47. The apparatus according to claim 46 wherein said discrete activity includes two football teams lining up in a football formation.
  - 48. The apparatus according to claim 46 wherein said discrete activity includes a football team attempting a field goal.
  - 49. The apparatus according to claim 46 wherein said predefined transition of shots includes a lining up shot, an action shot, an aftermath shot, and an extra point shot.
  - 50. The apparatus according to claim 46 wherein said video analyzer compares shots from said video data to said predefined transition of shots to identify said first event.
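
The comparison of shots against a predefined transition (claims 25 and 50) can be sketched as a simple sequence match. The sketch assumes each shot has already been given a label by some earlier analysis (for example the line, color, or texture comparisons of the preceding claims), which is outside its scope. The label strings, the `Shot` tuple layout, and the function name `find_events` are hypothetical; only the order of the four shots comes from claims 24 and 49.

```python
from typing import List, Sequence, Tuple

# Transition named in claims 24 and 49; the label strings are hypothetical.
TOUCHDOWN_TRANSITION = ("lining_up", "action", "aftermath", "extra_point")

Shot = Tuple[str, float, float]  # (label, start_s, end_s)


def find_events(
    shots: Sequence[Shot],
    transition: Sequence[str] = TOUCHDOWN_TRANSITION,
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) for each contiguous run of shots whose
    labels match `transition` in order."""
    n = len(transition)
    if n == 0:
        return []
    index = []
    i = 0
    while i + n <= len(shots):
        window = shots[i:i + n]
        if all(shot[0] == want for shot, want in zip(window, transition)):
            # Event spans from the first matched shot to the last one.
            index.append((window[0][1], window[-1][2]))
            i += n  # continue after the matched shots
        else:
            i += 1
    return index


if __name__ == "__main__":
    # Hypothetical labelled shots from one candidate range.
    shots = [
        ("crowd", 0.0, 5.0),
        ("lining_up", 5.0, 12.0),
        ("action", 12.0, 20.0),
        ("aftermath", 20.0, 31.0),
        ("extra_point", 31.0, 40.0),
        ("lining_up", 40.0, 47.0),
        ("action", 47.0, 55.0),
    ]
    print(find_events(shots))
```

An exact contiguous match is the strictest possible reading; a more tolerant matcher that allows an inserted replay or crowd shot would be a natural variation, but nothing in the claims shown here specifies one.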

Current Assignee: Matsushita Electric Industrial Company Limited (Panasonic Holdings Corporation)
Original Assignee: Matsushita Electric Industrial Company Limited (Panasonic Holdings Corporation)
Inventors: Chang, Yuh-Lin; Zeng, Wenjun
Primary Examiner(s): Chevalier, Robert
Application Number: US08/723,594
Time in Patent Office: 756 Days
Field of Search: 386/69, 386/96, 386/95, 386/104, 386/52, 386/46, 386/39, 386/68
US Class Current: 386/241
CPC Class Codes:
- G06F 16/71: Indexing; Data structures t...
- G06F 16/7834: using audio features
- G06F 16/7844: using original textual cont...
- G06F 16/785: using colour or luminescence
- G06V 20/40: in video content extracting...
- G10L 15/26: Speech to text systems G10L...
- G11B 2220/90: Tape-like record carriers
- G11B 27/28: by using information signal...
- H04H 60/37: for identifying segments of...
- H04H 60/48: for recognising items expre...
- H04H 60/58: of audio determination or d...
- H04H 60/59: of video recognising charac...
- H04N 17/004: for digital television systems
- H04N 5/147: Scene change detection
