Methods and systems for representation and matching of video content

US 20100104184A1
Filed: 01/06/2009
Published: 04/29/2010
Est. Priority Date: 07/16/2007
Status: Active Grant

Abstract

The described methods and systems provide for the representation and matching of video content, including spatio-temporal matching of different video sequences. A particular method of determining temporal correspondence between different sets of video data inputs the sets of video data and represents the video data as ordered sequences of visual nucleotides. Temporally corresponding subsets of video data are determined by aligning the sequences of visual nucleotides.

186 Citations

61 Claims

1. A method of determining spatio-temporal correspondence between different sets of video data, the method comprising:
- inputting the sets of video data;
  
  representing the video data as ordered sequences of visual nucleotides;
  
  determining temporally corresponding subsets of video data by aligning the sequences of visual nucleotides;
  
  computing a spatial correspondence between the temporally corresponding subsets of video data (spatio-temporal correspondence); and
  
  outputting the spatio-temporal correspondence between subsets of the video data.
- - 2. The method of claim 1, wherein the video data is a collection of video sequences comprising query video data and corpus video data, or subsets of a single video sequence or modified subsets of a video sequence from the corpus video data.
  - 3. The method of claim 2, wherein the spatio-temporal correspondence is established between a subset of a video sequence from the query video data and a subset of a video sequence from the corpus video data.
  - 4. The method of claim 2, wherein the query video contains modified subsets of the corpus video data, and the modification is a combination of one or more modifications selected from the group consisting of:
    - frame rate change, spatial resolution change, non-uniform spatial scaling, histogram modification, cropping, overlay of new video content, and temporal insertion of new video content.
  - 5. The method of claim 1, wherein the video data are segmented into temporal intervals and one visual nucleotide is computed for each interval.
  - 6. The method of claim 5, wherein the temporal intervals comprise a plurality of time-consecutive video image frames.
  - 7. The method of claim 5, wherein the temporal intervals span time intervals between 1/30 second and 1 second.
  - 8. The method of claim 5, wherein the video data are segmented into temporal intervals of constant duration or temporal intervals of variable duration.
  - 9. The method of claim 5, wherein the temporal interval start and end times are computed according to shot transitions in the video data.
  - 10. The method of claim 5, wherein the temporal intervals are either non-overlapping or overlapping.
  - 11. The method of claim 1, wherein the visual nucleotide is computed by:
    - representing a temporal interval of the video data as a collection of visual atoms; and
      
      constructing the nucleotide as a grouping function of at least one of the visual atoms.
  - 12. The method of claim 11, wherein the grouping function used to construct the nucleotide is a histogram of the appearance frequency of visual atoms in the temporal interval, or the grouping function is a weighted function histogram of the appearance frequency of the visual atoms in the temporal interval.
  - 13. The method of claim 12, wherein the grouping function is a weighted function histogram, and the weight function assigned to a particular visual atom in the nucleotide comprises a combination of the temporal location of the visual atom in the temporal interval, the spatial location of the visual atom in the temporal interval, and the significance of the visual atom.
  - 14. The method of claim 12, wherein the grouping function is a weighted function histogram, and the weight function assigned to a particular visual atom in the nucleotide is one of:
    - constant over the interval;
      
      Gaussian with the maximum weight being inside the interval;
      
      set to a large value for the visual content belonging to the same shot as the center of the interval, and to a small value for the visual content belonging to different shots;
      
      set to a large value for visual atoms located closer to the center of the frame, and to a small value for visual atoms located closer to the boundaries of the frame.
  - 15. The method of claim 11, wherein representing a temporal interval of the video data as a collection of visual atoms is performed by:
    - detecting a collection of invariant feature points in the temporal interval;
      
      computing a collection of descriptors of the local spatio-temporal region of the video data around each invariant feature point;
      
      removing a subset of invariant feature points and their descriptors; and
      
      constructing a collection of visual atoms as a function of the remaining invariant feature point locations and descriptors.
  - 16. The method of claim 15, wherein the invariant feature points in the temporal interval are computed using detectors selected from the group consisting of Harris-Laplace corner detectors, affine-invariant Harris-Laplace corner detectors, spatio-temporal corner detectors, or an MSER algorithm.
  - 17. The method of claim 15, wherein the MSER algorithm is used, and the MSER algorithm is applied individually to a subset of frames in the video data or is applied to a spatio-temporal subset of the video data.
  - 18. The method of claim 15, wherein the descriptors of the invariant feature points are SIFT descriptors, spatio-temporal SIFT descriptors, or SURF descriptors.
  - 19. The method of claim 15, wherein computing a collection of descriptors is performed by:
    - tracking of corresponding invariant feature points in the temporal interval of the video data;
      
      computing a single descriptor as a function of the descriptors of the invariant feature points belonging to a track; and
      
      assigning the descriptor to all features belonging to the track.
  - 20. The method of claim 19, wherein the function of the descriptors of the invariant feature points belonging to a track is the average of the invariant feature point descriptors, or the median of the invariant feature point descriptors.
  - 21. The method of claim 15, wherein removing a subset of invariant feature points is performed by:
    - tracking corresponding invariant feature points in the temporal interval of the video data;
      
      assigning a track quality metric for each track; and
      
      removing the invariant feature points belonging to tracks having track quality metric values below a predefined track quality threshold.
  - 22. The method of claim 21, wherein the track quality metric assigned for a track is a consistency function of a combination of descriptor values of the invariant feature points belonging to the track and locations of the invariant feature points belonging to the track.
  - 23. The method of claim 22, wherein the consistency function is proportional to the variance of the descriptor values, or to the total variation of the invariant feature point locations.
  - 24. The method of claim 15, wherein constructing a collection of visual atoms is performed by constructing a single visual atom for each of the remaining invariant feature points as a function of the invariant feature point descriptor.
  - 25. The method of claim 15, wherein the function of the invariant feature point descriptor is performed by:
    - receiving an invariant feature point descriptor as the input;
      
      finding, from an ordered collection of representative descriptors, the representative descriptor that best matches the invariant feature point descriptor received as the input; and
      
      outputting the index of the found representative descriptor.
  - 26. The method of claim 25, wherein finding a representative descriptor is performed using a vector quantization algorithm or using an approximate nearest neighbor algorithm.
  - 27. The method of claim 25, wherein the ordered collection of representative descriptors may be fixed and computed offline from training data, or may be adaptive and updated online from the input video data.
  - 28. The method of claim 15, wherein constructing the collection of visual atoms also comprises removing a subset of the visual atoms, where removing a subset of the visual atoms is performed by:
    - assigning an atom quality metric for each visual atom in the collection; and
      
      removing the visual atoms having atom quality metric values below a predefined atom quality threshold.
  - 29. The method of claim 28, wherein the atom quality threshold value is either fixed, adapted to maintain a minimum number of visual atoms in the collection, or adapted to limit the maximum number of visual atoms in the collection.
  - 30. The method of claim 28, wherein assigning the atom quality metric is performed by:
    - receiving a visual atom as the input;
      
      computing a vector of similarities of the visual atom to visual atoms in a collection of representative visual atoms; and
      
      outputting the atom quality metric as a function of the vector of similarities.
  - 31. The method of claim 30, wherein the function of the vector of similarities is either:
    - proportional to the largest value in the vector of similarities;
      
      proportional to the ratio between the largest value in the vector of similarities and the second-largest value in the vector of similarities;
      
      a function of the largest value in the vector of similarities and the ratio between the largest value in the vector of similarities and the second-largest value in the vector of similarities.
  - 32. The method of claim 1, wherein aligning the sequences of visual nucleotides is performed by:
    - receiving two sequences of visual nucleotides s={s₁, . . . , s_M} and q={q₁, . . . , q_M} as the input;
      
      receiving a score function σ(s_i, q_j) and a gap penalty function γ(i, j, n) as the parameters;
      
      finding the partial correspondence C={(i₁, j₁), . . . , (i_K, j_K)} and the collection of gaps G={(l₁, m₁, n₁), . . . , (l_L, m_L, n_L)} maximizing the F(C,G) function;
  - 33. The method of claim 32, wherein maximizing the F(C,G) function is performed by using the Smith-Waterman algorithm, the Needleman-Wunsch algorithm, the BLAST algorithm or by a hierarchical algorithm.
  - 34. The method of claim 32, wherein the score function is inversely proportional to a distance function d(s_i, q_j), and the distance function comprises a combination of distance functions selected from the group consisting of the Euclidean distance, the L1 distance, the Mahalanobis distance, the Kullback-Leibler divergence distance, and the Earth Mover's distance.
  - 35. The method of claim 32, wherein the score function is a combination of one or more functions of the form
  - 36. The method of claim 32, wherein the score function is proportional to the conditional probability P(q_j|s_i) of the nucleotide q_j being a mutation of the nucleotide s_i, and the mutation probability may be estimated empirically from training data.
  - 37. The method of claim 36, wherein the score function is proportional to the ratio of probabilities
  - 38. The method of claim 35, wherein the diagonal elements of the matrix A are proportional to
  - 39. The method of claim 38, wherein E_i is determined from training video data or from the input video data. And the diagonal elements of the matrix A are proportional to
  - 40. The method of claim 32, wherein the gap penalty is a parametric function of the form γ(i, j, n; θ), where i and j are the starting positions of the gap in the two sequences, n is the gap length, and θ are parameters.
  - 41. The method of claim 40, wherein the θ parameters may be estimated empirically from the training data, and the training data comprise examples of video sequences with inserted and deleted content.
  - 42. The method of claim 32, wherein the gap penalty is a function of the form γ(n)=a+bn, where n is the gap length and a and b are parameters.
  - 43. The method of claim 32, wherein the gap penalty is a convex function or inversely proportional to the probability of finding a gap of length n starting at positions i and j in the two sequences.
  - 44. The method of claim 1, wherein computing spatial correspondence is performed by:
    - inputting temporally corresponding subsets of video data;
      
      providing feature points in subsets of video data;
      
      finding correspondence between feature points; and
      
      finding correspondence between spatial coordinates.
  - 45. The method of claim 44, wherein the temporally corresponding subsets of video data include at least one pair of temporally corresponding frames.
  - 46. The method of claim 44, wherein finding correspondence between feature points is performed by:
    - inputting two sets of feature points;
      
      providing descriptors of feature points, wherein the feature points and the descriptors are the same feature points and descriptors that were used for video nucleotides computation; and
      
      matching descriptors.
  - 47. The method of claim 44, wherein finding correspondence between feature points is performed using a RANSAC algorithm.
  - 48. The method of claim 44, wherein finding correspondence between feature points is performed by finding parameters of a model describing the transformation between two sets of feature points;
    - wherein finding parameters of a model is performed by solving the following optimization problem
  - 49. The method of claim 44, wherein the correspondence between spatial coordinates is a map between the spatial system of coordinates (x, y) in one subset of video data and the spatial system of coordinates (x′, y′) in another subset of video data.
  - 50. The method of claim 1, wherein computing spatio-temporal correspondence is performed by:
    - inputting temporally corresponding subsets of video data;
      
      providing feature points in subsets of video data;
      
      finding correspondence between feature points; and
      
      finding correspondence between spatial coordinates.
  - 51. The method of claim 1, wherein computing spatio-temporal correspondence is performed by:
    - inputting temporally corresponding subsets of video data;
      
      providing feature points in subsets of video data;
      
      finding correspondence between feature points;
      
      finding correspondence between spatial coordinates; and
      
      finding correspondence between time coordinates.
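
As an illustration of the alignment step in claims 32, 33, and 42, the following is a minimal sketch of a Smith-Waterman style local alignment over sequences of nucleotide histograms, using the gap penalty γ(n)=a+bn. It is not taken from the patent. The score used here (a threshold minus the L1 distance) is an assumption chosen so that dissimilar nucleotides score negatively, and the names `l1_distance`, `score`, `gap_penalty`, `local_align`, and the default values of `match_threshold`, `a`, and `b` are hypothetical.

```python
def l1_distance(s, q):
    """L1 distance between two equal-length nucleotide histograms."""
    return sum(abs(x - y) for x, y in zip(s, q))


def score(s, q, match_threshold=1.0):
    """Assumed score: positive when the L1 distance is below the threshold."""
    return match_threshold - l1_distance(s, q)


def gap_penalty(n, a=1.0, b=0.5):
    """Gap penalty gamma(n) = a + b*n for a gap of length n."""
    return a + b * n


def local_align(seq_s, seq_q):
    """Smith-Waterman recurrence with a general gap function.

    Returns (best_score, pairs), where pairs lists the matched 0-based
    index pairs (i, j) in increasing order.
    """
    m, n = len(seq_s), len(seq_q)
    H = [[0.0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    best, best_cell = 0.0, None
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Option 1: start a new local alignment here (score 0).
            candidates = [(0.0, None)]
            # Option 2: match s_i with q_j.
            diag = H[i - 1][j - 1] + score(seq_s[i - 1], seq_q[j - 1])
            candidates.append((diag, (i - 1, j - 1)))
            # Option 3: a gap of length k in either sequence.
            for k in range(1, i + 1):
                candidates.append((H[i - k][j] - gap_penalty(k), (i - k, j)))
            for k in range(1, j + 1):
                candidates.append((H[i][j - k] - gap_penalty(k), (i, j - k)))
            H[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    pairs = []
    cell = best_cell
    while cell is not None and H[cell[0]][cell[1]] > 0:
        i, j = cell
        prev = back[i][j]
        if prev == (i - 1, j - 1):
            pairs.append((i - 1, j - 1))
        cell = prev
    pairs.reverse()
    return best, pairs
```

The general-gap recurrence costs more than the affine-gap variants usually used in practice, but it accepts any gap function of the gap length, which keeps the sketch close to the wording of the claims.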

52. A method of determining spatio-temporal correspondence between different sets of video data, the method comprising:
- inputting the sets of video data;
  
  representing the video data as ordered sequences of visual nucleotides, wherein the video data are segmented into temporal intervals;
  
  computing at least one visual nucleotide for each temporal interval, wherein each of the visual nucleotides is a grouping function collection of a plurality of visual atoms from a different temporal interval of the video data, and wherein each of the visual atoms describes the visual content of a local spatio-temporal region of the video data;
  
  constructing the visual atoms by:
  
  detecting a collection of invariant feature points in the temporal interval;
  
  computing a collection of descriptors of the local spatio-temporal region of the video data around each invariant feature point;
  
  removing a subset of invariant feature points and their descriptors;
  
  constructing a collection of visual atoms as a function of the remaining invariant feature point locations and descriptors;
  
  determining temporally corresponding subsets of video data by aligning sequences of visual nucleotides;
  
  computing spatial correspondence between temporally corresponding subsets of video data (spatio-temporal correspondence); and
  
  outputting the spatio-temporal correspondence between subsets of the video data.
- - 53. The method of claim 52, wherein the grouping function used to construct the nucleotide is a histogram of the appearance frequency of visual atoms in the temporal interval, or the grouping function is a weighted function histogram of the appearance frequency of the visual atoms in the temporal interval.
  - 54. The method of claim 52, wherein the invariant feature points in the temporal interval are computed using detectors selected from the group consisting of Harris-Laplace corner detectors, affine-invariant Harris-Laplace corner detectors, spatio-temporal corner detectors, or an MSER algorithm.
  - 55. The method of claim 52, wherein the descriptors of the invariant feature points are SIFT descriptors, spatio-temporal SIFT descriptors, or SURF descriptors, and in which computing a collection of descriptors is performed by:
    - tracking of corresponding invariant feature points in the temporal interval of the video data;
      
      computing a single descriptor as a function of the descriptors of the invariant feature points belonging to a track; and
      
      assigning the descriptor to all features belonging to the track.
  - 56. The method of claim 52, wherein removing a subset of invariant feature points is performed by:
    - tracking of corresponding invariant feature points in the temporal interval of the video data;
      
      assigning a track quality metric for each track, wherein the track quality metric assigned for a track is a consistency function of a combination of the descriptor values of the invariant feature points belonging to the track, and/or the locations of the invariant feature points belonging to the track; and
      
      removing the invariant feature points belonging to tracks whose track quality metric value is below a predefined track quality threshold.
  - 57. The method of claim 52, wherein constructing a collection of visual atoms is performed by constructing a single visual atom for each of the remaining invariant feature points as a function of the invariant feature point descriptor, wherein the function of the invariant feature point descriptor is performed by the steps of:
    - receiving an invariant feature point descriptor as the input;
      
      finding, from an ordered collection of representative descriptors, the representative descriptor that best matches the invariant feature point descriptor received as the input; and
      
      outputting the index of the found representative descriptor, wherein the ordered collection of representative descriptors may be fixed and computed offline from training data, or may be adaptive and updated online from the input video data.
  - 58. The method of claim 52, wherein aligning the sequences of visual nucleotides is performed by:
    - receiving two sequences of visual nucleotides s={s₁, . . . , s_M} and q={q₁, . . . , q_M} as the input;
      
      receiving a score function σ(s_i, q_j) and a gap penalty function γ(i, j, n) as the parameters, and in which the score function is inversely proportional to a distance function d(s_i, q_j);
      
      finding the partial correspondence C={(i₁, j₁), . . . , (i_K, j_K)} and the collection of gaps G={(l₁, m₁, n₁), . . . , (l_L, m_L, n_L)} maximizing the F(C,G) function;
  - 59. The method of claim 52, wherein computing spatial correspondence is performed by:
    - inputting temporally corresponding subsets of video data;
      
      providing feature points in subsets of video data;
      
      finding correspondence between feature points, by the steps of inputting two sets of feature points, providing descriptors of feature points, and matching the descriptors; and
      
      finding correspondence between spatial coordinates.
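
The spatial-correspondence steps of claim 59 (matching descriptors, then relating spatial coordinates) can be sketched as below. This is an illustrative example only, not the patent's method: it assumes nearest-neighbor matching under squared Euclidean distance and an assumed transformation model with an independent scale and offset per axis, x′ = a_x·x + b_x and y′ = a_y·y + b_y. The function names `match_descriptors`, `fit_axis`, and `spatial_map` are hypothetical.

```python
def match_descriptors(desc_a, desc_b):
    """For each descriptor in desc_a, return the index of its nearest
    neighbor in desc_b under squared Euclidean distance."""
    def sqdist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    return [
        min(range(len(desc_b)), key=lambda j: sqdist(d, desc_b[j]))
        for d in desc_a
    ]


def fit_axis(src, dst):
    """Least-squares fit of dst ~ scale * src + offset along one axis.

    Assumes src contains at least two distinct values.
    """
    n = len(src)
    mean_s = sum(src) / n
    mean_d = sum(dst) / n
    var = sum((s - mean_s) ** 2 for s in src)
    cov = sum((s - mean_s) * (d - mean_d) for s, d in zip(src, dst))
    scale = cov / var
    return scale, mean_d - scale * mean_s


def spatial_map(points_a, desc_a, points_b, desc_b):
    """Estimate ((a_x, b_x), (a_y, b_y)) mapping frame A onto frame B."""
    matches = match_descriptors(desc_a, desc_b)
    xs = [p[0] for p in points_a]
    ys = [p[1] for p in points_a]
    xs_matched = [points_b[j][0] for j in matches]
    ys_matched = [points_b[j][1] for j in matches]
    return fit_axis(xs, xs_matched), fit_axis(ys, ys_matched)
```

A plain least-squares fit is sensitive to wrong matches. A robust estimator such as the RANSAC algorithm named in claim 47 would typically wrap a fit like this one.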

60. A method of determining spatio-temporal correspondence between different sets of video data, the method comprising:
- creating a sequence of visual nucleotides by the steps of:
  
  analyzing a series of time successive video images from the video data for features;
  
  pruning the features to remove features that are only present on one video image;
  
  time averaging the remaining video features and discarding outlier features from the average;
  
  using a nearest neighbor fit to assign the remaining features to a standardized array of different features;
  
  counting the number of each type of assigned feature in the series of time successive video images, thus creating coefficients for the standardized array of different features, where each visual nucleotide consists of this array of coefficients, and the sequence of visual nucleotides consists of sequential time successive visual nucleotides;
  
  determining temporally corresponding subsets of video data by aligning sequences of the visual nucleotides;
  
  computing spatial correspondence between temporally corresponding subsets of video data (spatio-temporal correspondence); and
  
  outputting the spatio-temporal correspondence between subsets of the video data.
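
The nucleotide construction recited in claim 60 can be sketched roughly as follows. The sketch makes several assumptions that are not in the claim: features are already grouped into tracks (one list of descriptors per tracked feature), each surviving track is counted once, the "standardized array of different features" is a small list called `codebook`, and the outlier-discarding step is omitted for brevity. The name `build_nucleotide` is hypothetical.

```python
def build_nucleotide(tracks, codebook):
    """Build one visual nucleotide (a count histogram) for a temporal interval.

    tracks:   dict mapping a track id to the list of descriptors observed
              for that feature, one per frame in which it appears.
    codebook: ordered list of representative descriptors.
    """
    counts = [0] * len(codebook)
    for descriptors in tracks.values():
        # Prune features that are present in only one video image.
        if len(descriptors) < 2:
            continue
        # Time-average the descriptors along the track.
        dim = len(descriptors[0])
        avg = [sum(d[k] for d in descriptors) / len(descriptors) for k in range(dim)]
        # Nearest-neighbor assignment to the codebook (squared Euclidean).
        nearest = min(
            range(len(codebook)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(avg, codebook[c])),
        )
        # Count the assigned feature type.
        counts[nearest] += 1
    return counts
```

Calling `build_nucleotide` once per temporal interval and collecting the results in time order would give the sequence of nucleotides that the alignment step operates on.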

61. An apparatus comprising:
- a source of video data;
  
  a video segmenter coupled to the source of video data and configured to segment video data into temporal intervals;
  
  a video processor coupled to the source of video data and configured to detect feature locations within the video data, generate feature descriptors associated with the feature locations, and prune the detected feature locations to generate a subset of feature locations; and
  
  a video aggregator coupled to the video segmenter and the video processor, the video aggregator configured to generate a video DNA associated with the video data, wherein the video DNA includes video data ordered as sequences of visual nucleotides.
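
As a rough illustration of the video segmenter in claim 61, the sketch below splits a frame range into temporal intervals of constant duration, either overlapping or non-overlapping depending on the step. The function name `segment_intervals` and its parameters are hypothetical, and shot-based or variable-duration segmentation is not covered.

```python
def segment_intervals(num_frames, fps, interval_seconds, step_seconds):
    """Return (start, end) frame index pairs, with end exclusive.

    A step shorter than the interval yields overlapping intervals;
    a step equal to the interval yields non-overlapping ones.
    Intervals at the end of the video are truncated to num_frames.
    """
    length = max(1, round(interval_seconds * fps))
    step = max(1, round(step_seconds * fps))
    return [
        (start, min(start + length, num_frames))
        for start in range(0, num_frames, step)
    ]
```

For example, a 3-second clip at 30 frames per second with 1-second intervals and a 1-second step gives three non-overlapping intervals of 30 frames each.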

Current Assignee
Novafora, Inc.
Original Assignee
Novafora, Inc.
Inventors
Bronstein, Michael; Bronstein, Alexander; Rakib, Shlomo Selim

Granted Patent

US 8,358,840 B2
US Class Current

382/170
CPC Class Codes

G06F 16/783   using metadata automaticall...

G06V 20/48   Matching video sequences

H04N 21/4334   Recording operations record...

H04N 21/44008   involving operations for an...

H04N 21/4402   involving reformatting oper...

H04N 21/4532   involving end-user characte...

H04N 21/845   Structuring of content, e.g...

H04N 21/8455   involving pointers to the c...

H04N 21/8456   by decomposing the content ...
