Systems and methods for semantically classifying and normalizing shots in video
First Claim
1. A method comprising:
- within each frame of a sequence of video frames, for each spatial segment of a plurality of spatial segments within the frame, determining likelihoods of the spatial segment corresponding to specific types of contents;
- based on the likelihoods, generating arrangement data for each frame in the sequence, the arrangement data representing a spatial arrangement of the specific types of contents within the frame;
- identifying groups of consecutive video frames, within the sequence, that have similar arrangement data; and
- based on the identified groups of consecutive video frames, identifying start times and end times for scenes within a video, the video comprising the video frames.
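The claimed method can be illustrated with a minimal sketch. The choices below are assumptions, not the claim's specifics: arrangement data is modeled as the concatenation of per-segment likelihood vectors, "similar" is modeled as Euclidean distance under a fixed threshold, and scene boundaries fall where consecutive frames' arrangements diverge.

```python
import numpy as np

def arrangement_data(segment_likelihoods):
    # segment_likelihoods: (num_segments, num_content_types) array of
    # per-segment content-type likelihoods. Flattening preserves the
    # spatial order of segments, so the result encodes arrangement.
    return segment_likelihoods.flatten()

def find_scenes(frame_arrangements, fps, threshold=0.5):
    # Group consecutive frames whose arrangement vectors are similar,
    # then report each group's (start, end) time in seconds.
    # `threshold` is an illustrative assumption.
    scenes = []
    start = 0
    for i in range(1, len(frame_arrangements)):
        dist = np.linalg.norm(frame_arrangements[i] - frame_arrangements[i - 1])
        if dist > threshold:  # arrangement changed: close the current scene
            scenes.append((start / fps, i / fps))
            start = i
    scenes.append((start / fps, len(frame_arrangements) / fps))
    return scenes
```

For example, a sequence of ten frames dominated by one arrangement followed by ten frames of another, at 10 fps, yields two scenes with a boundary at the one-second mark.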
Abstract
The present disclosure relates to systems and methods for classifying videos based on video content. For a given video file comprising a plurality of frames, a subset of frames is extracted for processing. Frames that are too dark, blurry, or otherwise poor classification candidates are discarded from the subset. Material classification scores, which describe the types of material content likely present in each frame, are then calculated for the remaining frames in the subset. The material classification scores are used to generate material arrangement vectors that represent the spatial arrangement of material content in each frame. The material arrangement vectors are in turn classified to generate a scene classification score vector for each frame. The scene classification results are averaged (or otherwise aggregated) across all frames in the subset to associate the video file with one or more predefined scene categories describing the overall scene content of the video file.
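The pipeline described in the abstract can be sketched end to end. Here `material_model` and `scene_model` are hypothetical callables standing in for the trained classifiers, and the frame-sampling stride, darkness threshold, and gradient-variance sharpness measure are illustrative assumptions rather than the disclosure's specifics.

```python
import numpy as np

def sharpness(frame):
    # Crude focus measure (assumption): variance of horizontal and
    # vertical pixel differences; low values suggest a blurry frame.
    gx = np.diff(frame, axis=1)
    gy = np.diff(frame, axis=0)
    return gx.var() + gy.var()

def classify_video(frames, material_model, scene_model,
                   darkness_thresh=30.0, blur_thresh=100.0, stride=10):
    # 1. Extract a subset of frames for processing.
    subset = frames[::stride]
    # 2. Discard frames that are too dark or too blurry.
    usable = [f for f in subset
              if f.mean() > darkness_thresh and sharpness(f) > blur_thresh]
    scene_scores = []
    for frame in usable:
        # 3. Per-segment material classification scores for the frame.
        material_scores = material_model(frame)   # (segments, materials)
        # 4. Material arrangement vector: spatially ordered concatenation.
        arrangement = material_scores.flatten()
        # 5. Scene classification score vector for the frame.
        scene_scores.append(scene_model(arrangement))
    # 6. Average across frames to associate the video with scene categories.
    return np.mean(scene_scores, axis=0)
```

The averaged score vector can then be thresholded or arg-maxed against the predefined scene categories.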
21 Claims
1. A method comprising:
- within each frame of a sequence of video frames, for each spatial segment of a plurality of spatial segments within the frame, determining likelihoods of the spatial segment corresponding to specific types of contents;
- based on the likelihoods, generating arrangement data for each frame in the sequence, the arrangement data representing a spatial arrangement of the specific types of contents within the frame;
- identifying groups of consecutive video frames, within the sequence, that have similar arrangement data; and
- based on the identified groups of consecutive video frames, identifying start times and end times for scenes within a video, the video comprising the video frames.
View Dependent Claims (2, 3, 4, 5, 6, 7)
8. One or more non-transitory media storing instructions that, when executed by one or more computing devices, cause performance of:
- within each frame of a sequence of video frames, for each spatial segment of a plurality of spatial segments within the frame, determining likelihoods of the spatial segment corresponding to specific types of contents;
- based on the likelihoods, generating arrangement data for each frame in the sequence, the arrangement data representing a spatial arrangement of the specific types of contents within the frame;
- identifying groups of consecutive video frames, within the sequence, that have similar arrangement data; and
- based on the identified groups of consecutive video frames, identifying start times and end times for scenes within a video, the video comprising the video frames.
View Dependent Claims (9, 10, 11, 12, 13, 14)
15. A system comprising:
- a module, implemented at least partially by computing hardware, configured to, within each frame of a sequence of video frames, for each spatial segment of a plurality of spatial segments within the frame, determine likelihoods of the spatial segment corresponding to specific types of contents;
- a module, implemented at least partially by computing hardware, configured to, based on the likelihoods, generate arrangement data for each frame in the sequence, the arrangement data representing a spatial arrangement of the specific types of contents within the frame;
- a module, implemented at least partially by computing hardware, configured to identify groups of consecutive video frames, within the sequence, that have similar arrangement data; and
- a module, implemented at least partially by computing hardware, configured to, based on the identified groups of consecutive video frames, identify start times and end times for scenes within a video, the video comprising the video frames.
View Dependent Claims (16, 17, 18, 19, 20, 21)
Specification