MULTI-VIEW OBJECT DETECTION USING APPEARANCE MODEL TRANSFER FROM SIMILAR SCENES

US 20130272573A1
Filed: 06/07/2013
Published: 10/17/2013
Est. Priority Date: 07/15/2011
Status: Active Grant

Abstract

View-specific object detectors are learned as a function of scene geometry and object motion patterns. Motion directions are determined for object images extracted from a training dataset and collected from different camera scene viewpoints. The object images are categorized into clusters as a function of similarities of their determined motion directions, wherein the object images in each cluster are acquired from the same camera scene viewpoint. Zenith angles are estimated for object image poses in the clusters relative to a position of a horizon in the cluster camera scene viewpoint, and azimuth angles of the poses are estimated as a function of a relation of the determined motion directions of the clustered images to the cluster camera scene viewpoint. Detectors are thus built for recognizing objects in input video, one for each of the clusters, and associated with the estimated zenith angles and azimuth angles of the poses of the respective clusters.

25 Claims

1. A method for learning a plurality of view-specific object detectors as a function of scene geometry and object motion patterns, the method comprising:
- determining via a processing unit motion directions for each of a plurality of object images that are extracted from a source training video dataset input and that each have size and motion dimension values that meet an expected criterion of an object of interest, wherein the object images are collected from each of a plurality of different camera scene viewpoints;
  
  categorizing via the processing unit the plurality of object images into a plurality of clusters as a function of similarities of their determined motion directions, wherein the object images in each of the clusters are also acquired from one of the different camera scene viewpoints;
  
  estimating via the processing unit zenith angles for poses of the object images in each of the clusters relative to a position of a horizon in the camera scene viewpoint from which the clustered object images are acquired, and azimuth angles of the poses as a function of a relation of the determined motion directions of the clustered object images to the camera scene viewpoint from which the clustered object images are acquired; and
  
  building via the processing unit a plurality of detectors for recognizing objects in input video, one for each of the clusters of the object images, and associating each of the built detectors with the estimated zenith angles and azimuth angles of the poses of the cluster for which the detectors are built.
  - 2. The method of claim 1, further comprising:
    - estimating the position of the horizon in a target camera viewpoint scene, wherein the target camera viewpoint scene is different from and not included in the source domain camera scene viewpoints;
      
      determining a motion direction for the target scene object image;
      
      estimating a zenith angle for a pose of the target scene objects relative to the estimated target camera viewpoint scene horizon and an azimuth angle of the target scene object pose as a function of the determined target scene object image motion direction to the target camera scene viewpoint; and
      
      selecting one or more of the built detectors that have an associated cluster zenith angle and an associated cluster azimuth angle that best match the target scene object image pose zenith angle and target scene object image pose azimuth angle, wherein the built detector is selected for recognizing objects in video data of the target domain acquired from the target camera viewpoint that have the size and motion dimension values that meet the expected criterion of the object of interest.
  - 3. The method of claim 2, further comprising:
    - representing variations of the poses of the objects in each of the clusters with respect to the camera viewpoint from which the clustered objects are acquired, by a range of the zenith angles determined for the cluster objects from the minimum determined zenith angle to the maximum determined zenith angle; and
      
      representing variations of the determined directions of motion of each of the objects in each of the clusters with respect to the camera viewpoint from which the clustered objects are acquired, as a range of the azimuth angles determined for the cluster objects from the minimum determined azimuth angle to the maximum determined azimuth angle.
  - 4. The method of claim 3, wherein determining the motion directions for the source training video dataset object images and for the target scene object image further comprises:
    - estimating a direction of motion of objects appearing in each scene for each respective camera viewpoint through an optical flow process; and
      
      representing each space-time point in the estimated optical flow directions of motion of the objects appearing for each respective camera viewpoint by a four-dimensional vector, the vector comprising a location of each space-time point in an image plane, a magnitude and a direction of its optical flow; and
      
      wherein the clusters are optical flow map clusters, and categorizing the plurality of object images into the plurality of optical flow map clusters as a function of the similarities of their determined motion directions further comprises:
      
      discarding the space-time points that have an optical flow magnitude that is above or below certain respective fixed thresholds as noise;
      
      after discarding the noise points, randomly sub-sampling and clustering a remainder of the space-time points into the optical flow map clusters by using a self-tuning variant of spectral clustering that automatically selects a scale of analysis and the total number of the clusters; and
      
      representing different values of the directions of motion of the objects appearing in the scene viewpoint of each optical flow map cluster by the dominant direction of motion of the points within each optical flow map cluster and by the location of the cluster in the image plane.
  - 5. The method of claim 4, further comprising:
    - estimating the position of the horizon in each of the clustered training camera views and the target camera view by utilizing structures in images of the camera scene viewpoint that have an inherent geometric relationship to an image horizon inferred from the real-world, three-dimensional geometry of the structures in the camera scene viewpoint images.
  - 6. The method of claim 5, wherein the step of estimating the position of the horizon in at least one of the clustered camera views and the target camera view by utilizing the structures in images of the camera scene viewpoint that have an inherent geometric relationship to the image horizon inferred from the real-world, three-dimensional geometry of the structures in the camera scene viewpoint images comprises:
    - identifying a plurality of structures in the images of the camera scene viewpoint through geometric parsing via the processing unit that are generally parallel to each other in the real-world, three-dimensional geometry of the structures in the camera scene viewpoint images;
      
      using the plurality of structures to define multiple sets of parallel lines in the camera scene viewpoint images that are each aligned with structures such as buildings, roads etc. wherein each set of parallel lines intersects at a vanishing point; and
      
      estimating the horizon line as a line passing through the vanishing points of at least two sets of parallel lines.
  - 7. The method of claim 6, wherein the step of building via the processing unit the plurality of detectors for recognizing objects in input video comprises:
    - building Deformable Parts Model (DPM)-based object detectors that treat positions of parts of the objects as latent variables; and
      
      employing a latent Support Vector Machine (SVM) to infer the positions of the parts of the objects from image data in the input video.
  - 14. The method of claim 6, wherein the object detector modeler builds the plurality of detectors for recognizing objects in input video by:
    - building Deformable Parts Model (DPM)-based object detectors that treat positions of parts of the objects as latent variables; and
      
      employing a latent Support Vector Machine (SVM) to infer the positions of the parts of the objects from image data in the input video.
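The horizon estimation recited in claims 5 and 6 can be illustrated with a small sketch in homogeneous image coordinates. It assumes that line segments have already been grouped into sets that are parallel in the 3-D scene (the geometric parsing step is not shown), and the function names `line_through`, `vanishing_point`, and `horizon_line` are hypothetical helpers introduced only for this illustration, not anything named in the claims.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line (a, b, c) with a*x + b*y + c = 0 through image points p and q."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def vanishing_point(segments):
    """Least-squares intersection of image segments assumed parallel in 3-D.

    segments: list of ((x1, y1), (x2, y2)) pairs.
    Returns a homogeneous point (x, y, w), defined up to scale.
    """
    lines = np.array([line_through(p, q) for p, q in segments])
    # Normalise each line so every segment contributes comparably.
    lines /= np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    # The unit vector v minimising sum((l_i . v)^2) is the right singular
    # vector associated with the smallest singular value.
    _, _, vt = np.linalg.svd(lines)
    return vt[-1]

def horizon_line(vp_a, vp_b):
    """Homogeneous line through two vanishing points, scaled so (a, b) has unit length.

    Assumes the two vanishing points are distinct and not both at infinity.
    """
    line = np.cross(vp_a, vp_b)
    return line / np.linalg.norm(line[:2])
```

Using two vanishing points is the minimum the claim recites; with more than two sets of parallel lines, a line could be fitted to all vanishing points instead, which is a design choice this sketch leaves out.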

8. A method of providing a service for learning a plurality of view-specific object detectors as a function of scene geometry and object motion patterns, the method comprising providing:
- a motion direction determiner that determines motion directions for each of a plurality of object images that are extracted from a source training video dataset input and that each have size and motion dimension values that meet an expected criterion of an object of interest, wherein the object images are collected from each of a plurality of different camera scene viewpoints;
  
  an object classifier that categorizes the plurality of object images into a plurality of clusters as a function of similarities of their determined motion directions, wherein the object images in each of the clusters are also acquired from one of the different camera scene viewpoints;
  
  a pose parameterizer that estimates zenith angles for poses of the object images in each of the clusters relative to a position of a horizon in the camera scene viewpoint from which the clustered object images are acquired, and azimuth angles of the poses as a function of a relation of the determined motion directions of the clustered object images to the camera scene viewpoint from which the clustered object images are acquired; and
  
  an object detector modeler that builds a plurality of detectors for recognizing objects, one for each of the clusters of the object images, and associates each of the built detectors with the estimated zenith angles and azimuth angles of the poses of the cluster for which the detectors are built.
  - 9. The method of claim 8, wherein the motion direction determiner further determines a motion direction for the target scene object image, wherein the target camera viewpoint scene is different from and not included in the source domain camera scene viewpoints;
    - wherein the pose parameterizer further estimates a position of a horizon in a target camera viewpoint scene of an object image acquired from a target camera, a zenith angle for the pose of the target scene object relative to the estimated target camera viewpoint scene horizon and an azimuth angle of the target scene object pose as a function of a relation of the determined target scene object image motion direction to the target camera scene viewpoint; and
      
      wherein the method further comprises:
      
      providing a detector selector that selects one of the built detectors that has an associated cluster zenith angle and an associated cluster azimuth angle that best matches the target scene object image pose zenith angle and target scene object image pose azimuth angle; and
      
      a detector applicator that applies the selected previously learned detector to video data of the target domain acquired from the target camera viewpoint to recognize objects in the target domain video data.
  - 10. The method of claim 9, wherein the object classifier further:
    - represents variations of the poses of the objects in each of the clusters with respect to the camera viewpoint from which the clustered objects are acquired, by a range of the zenith angles determined for the cluster objects from the minimum determined zenith angle to the maximum determined zenith angle; and
      
      represents variations of the determined directions of motion of each of the objects in each of the clusters with respect to the camera viewpoint from which the clustered objects are acquired, as a range of the azimuth angles determined for the cluster objects from the minimum determined azimuth angle to the maximum determined azimuth angle.
  - 11. The method of claim 10, wherein the motion direction determiner determines the motion directions for the source training video dataset object images and for the target scene object image by:
    - estimating a direction of motion of objects appearing in each scene for each respective camera viewpoint through an optical flow process; and
      
      representing each space-time point in the estimated optical flow directions of motion of the objects appearing for each respective camera viewpoint by a four-dimensional vector, the vector comprising a location of each space-time point in an image plane, a magnitude and a direction of its optical flow; and
      
      wherein the clusters are optical flow map clusters, and the object classifier categorizes the plurality of object images into the plurality of optical flow map clusters as a function of the similarities of their determined motion directions by:
      
      discarding the space-time points that have an optical flow magnitude that is above or below certain respective fixed thresholds as noise;
      
      after discarding the noise points, randomly sub-sampling and clustering a remainder of the space-time points into the optical flow map clusters by using a self-tuning variant of spectral clustering that automatically selects a scale of analysis and the total number of the clusters; and
      
      representing different values of the directions of motion of the objects appearing in the scene viewpoint of each optical flow map cluster by a dominant direction of motion of the points within each optical flow map cluster and by a location in the image plane.
  - 12. The method of claim 11, wherein the pose parameterizer further:
    - estimates the position of the horizon in each of the clustered training camera views and the target camera view by utilizing structures in images of the camera scene viewpoint that have an inherent geometric relationship to an image horizon inferred from the real-world, three-dimensional geometry of the structures in the camera scene viewpoint images.
  - 13. The method of claim 12, wherein the pose parameterizer estimates the position of the horizon in at least one of the clustered camera views and the target camera view by utilizing the structures in images of the camera scene viewpoint that have an inherent geometric relationship to the image horizon inferred from the real-world, three-dimensional geometry of the structures in the camera scene viewpoint images by:
    - identifying a plurality of structures in the images of the camera scene viewpoint through geometric parsing that are generally parallel to each other in the real-world, three-dimensional geometry of the structures in the camera scene viewpoint images;
      
      using the plurality of structures to define multiple sets of parallel lines in the camera scene viewpoint images that are each aligned with structures such as buildings, roads etc. wherein each set of parallel lines intersects at a vanishing point; and
      
      estimating the horizon line as a line passing through the vanishing points of at least two sets of parallel lines.
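The detector selector of claim 9, combined with the per-cluster angle ranges of claim 10, can be sketched as a nearest-range lookup. The claims only require a "best match"; the cost used here (sum of the zenith gap and the circular azimuth gap, in degrees) is an assumption made for illustration, as are the names `ViewDetector`, `range_gap`, and `select_detector`.

```python
from dataclasses import dataclass

@dataclass
class ViewDetector:
    """A built detector tagged with the pose ranges of its cluster (degrees)."""
    name: str
    zenith_range: tuple   # (min, max) zenith angle observed in the cluster
    azimuth_range: tuple  # (min, max) azimuth angle observed in the cluster

def circ_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def range_gap(angle, lo, hi, circular=False):
    """0 if angle lies in [lo, hi], otherwise the distance to the nearer end."""
    if lo <= angle <= hi:
        return 0.0
    if circular:
        return min(circ_diff(angle, lo), circ_diff(angle, hi))
    return min(abs(angle - lo), abs(angle - hi))

def select_detector(detectors, zenith, azimuth):
    """Pick the detector whose cluster ranges lie closest to the target pose.

    Assumed cost: zenith gap plus circular azimuth gap.
    """
    return min(
        detectors,
        key=lambda d: range_gap(zenith, *d.zenith_range)
        + range_gap(azimuth, *d.azimuth_range, circular=True),
    )
```

A target pose that falls inside both ranges of a cluster has zero cost, so the detector trained on that cluster is chosen; otherwise the closest cluster wins.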

15. A system, comprising:
- a processing unit, a computer readable memory and a computer-readable storage medium;
  
  wherein the processing unit, when executing program instructions stored on the computer-readable storage medium via the computer readable memory:
  
  determines motion directions for each of a plurality of object images that are extracted from a source training video dataset input and that each have size and motion dimension values that meet an expected criterion of an object of interest, wherein the object images are collected from each of a plurality of different camera scene viewpoints;
  
  categorizes the plurality of object images into a plurality of clusters as a function of similarities of their determined motion directions, wherein the object images in each of the clusters are also acquired from one of the different camera scene viewpoints;
  
  estimates zenith angles for poses of the object images in each of the clusters relative to the position of the horizon in the camera scene viewpoint from which the clustered object images are acquired, and azimuth angles of the poses as a function of the determined motion directions of the clustered object images; and
  
  builds a plurality of detectors for recognizing objects in input video, one for each of the clusters of the object images, and associates each of the built detectors with the estimated zenith angles and azimuth angles of the poses of the cluster for which the detectors are built.
  - 16. The system of claim 15, wherein the processing unit, when executing the program instructions stored on the computer-readable storage medium via the computer readable memory, further:
    - determines a motion direction for the target scene object image, wherein the target camera viewpoint scene is different from and not included in the source domain camera scene viewpoints;
      
      estimates a position of the horizon in a target camera viewpoint scene of an object image acquired from a target camera, a zenith angle for the pose of the target scene object relative to the estimated target camera viewpoint scene horizon and an azimuth angle of the target scene object pose as a function of the determined target scene object image motion direction;
      
      selects one of the built detectors that has an associated cluster zenith angle and an associated cluster azimuth angle that best matches the target scene object image pose zenith angle and target scene object image pose azimuth angle; and
      
      applies the selected previously learned detector to video data of the target domain acquired from the target camera viewpoint to recognize objects in the target domain video data.
  - 17. The system of claim 16, wherein the processing unit, when executing the program instructions stored on the computer-readable storage medium via the computer readable memory, further:
    - represents variations of the poses of the objects in each of the clusters with respect to the camera viewpoint from which the clustered objects are acquired, by a range of the zenith angles determined for the cluster objects from the minimum determined zenith angle to the maximum determined zenith angle; and
      
      represents variations of the determined directions of motion of each of the objects in each of the clusters with respect to the camera viewpoint from which the clustered objects are acquired, as a range of the azimuth angles determined for the cluster objects from the minimum determined azimuth angle to the maximum determined azimuth angle.
  - 18. The system of claim 17, wherein the clusters are optical flow map clusters, and wherein the processing unit, when executing the program instructions stored on the computer-readable storage medium via the computer readable memory, further:
    - determines the motion directions for the source training video dataset object images and for the target scene object image by:
      
      estimating a direction of motion of objects appearing in each scene for each respective camera viewpoint through an optical flow process; and
      
      representing each space-time point in the estimated optical flow directions of motion of the objects appearing for each respective camera viewpoint by a four-dimensional vector, the vector comprising a location of each space-time point in an image plane, a magnitude and a direction of its optical flow; and
      
      categorizes the plurality of object images into the plurality of optical flow map clusters as a function of the similarities of their determined motion directions by:
      
      discarding the space-time points that have an optical flow magnitude that is above or below certain respective fixed thresholds as noise;
      
      after discarding the noise points, randomly sub-sampling and clustering a remainder of the space-time points into the optical flow map clusters by using a self-tuning variant of spectral clustering that automatically selects the scale of analysis and the total number of the clusters; and
      
      representing different values of the directions of motion of the objects appearing in the scene viewpoint of each optical flow map cluster by the dominant direction of motion of the points within each optical flow map cluster and by a location in the image plane.
  - 19. The system of claim 18, wherein the processing unit, when executing the program instructions stored on the computer-readable storage medium via the computer readable memory, further estimates the position of the horizon in each of the clustered camera views and the target camera view by utilizing the structures in images of the camera scene viewpoint that have an inherent geometric relationship to the image horizon inferred from the real-world, three-dimensional geometry of the structures in the camera scene viewpoint images by:
    - identifying a plurality of structures in the images of the camera scene viewpoint through geometric parsing that are generally parallel to each other in the real-world, three-dimensional geometry of the structures in the camera scene viewpoint images;
      
      using the plurality of structures to define multiple sets of parallel lines in the camera scene viewpoint images that are each aligned with structures such as buildings, roads etc. wherein each set of parallel lines intersects at a vanishing point; and
      
      estimating the horizon line as a line passing through the vanishing points of at least two sets of parallel lines.
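The optical flow preprocessing recited in claim 18 can be sketched as follows. The sketch builds the four-dimensional vectors (x, y, magnitude, direction) from a dense flow field for a single frame pair, discards points outside fixed magnitude thresholds, and randomly sub-samples the remainder. The self-tuning spectral clustering step is not shown. Taking the circular mean as the "dominant direction" of a cluster is an assumption, and `flow_points` and `dominant_direction` are hypothetical names.

```python
import numpy as np

def flow_points(flow, low, high, n_samples, rng):
    """Turn a dense flow field into filtered, sub-sampled 4-D points.

    flow: array of shape (H, W, 2) holding per-pixel (dx, dy).
    low, high: fixed magnitude thresholds; points outside are treated as noise.
    rng: a numpy.random.Generator used for the random sub-sampling.
    Returns an (N, 4) array with columns x, y, magnitude, direction (radians).
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    mag = np.hypot(flow[..., 0], flow[..., 1])
    ang = np.arctan2(flow[..., 1], flow[..., 0])
    pts = np.stack([xs, ys, mag, ang], axis=-1).reshape(-1, 4)
    # Discard points whose magnitude is below or above the thresholds.
    pts = pts[(pts[:, 2] >= low) & (pts[:, 2] <= high)]
    # Randomly sub-sample what is left before clustering.
    if len(pts) > n_samples:
        pts = pts[rng.choice(len(pts), size=n_samples, replace=False)]
    return pts

def dominant_direction(pts):
    """Circular mean of the flow directions of a set of points, in radians."""
    return float(np.arctan2(np.sin(pts[:, 3]).mean(), np.cos(pts[:, 3]).mean()))
```

A circular mean is used rather than an arithmetic mean so that directions near the wrap-around at plus or minus pi average sensibly; a histogram mode would be another reasonable reading of "dominant".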

20. An article of manufacture, comprising:
- a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising instructions that, when executed by a computer processor, cause the computer processor to:
  
  determine motion directions for each of a plurality of object images that are extracted from a source training video dataset input and that each have size and motion dimension values that meet an expected criterion of an object of interest, wherein the object images are collected from each of a plurality of different camera scene viewpoints;
  
  categorize the plurality of object images into a plurality of clusters as a function of similarities of their determined motion directions, wherein the object images in each of the clusters are also acquired from one of the different camera scene viewpoints;
  
  estimate zenith angles for poses of the object images in each of the clusters relative to the position of a horizon in the camera scene viewpoint from which the clustered object images are acquired, and azimuth angles of the poses as a function of the determined motion directions of the clustered object images to the camera scene viewpoint from which the clustered object images are acquired; and
  
  build a plurality of detectors for recognizing objects in input video, one for each of the clusters of the object images, and associate each of the built detectors with the estimated zenith angles and azimuth angles of the poses of the cluster for which the detectors are built.
  - 21. The article of manufacture of claim 20, wherein the computer readable program code instructions, when executed by the computer processor, further cause the computer processor to:
    - determine a motion direction for the target scene object image, wherein the target camera viewpoint scene is different from and not included in the source domain camera scene viewpoints;
      
      estimate a position of a horizon in a target camera viewpoint scene of an object image acquired from a target camera, a zenith angle for the pose of the target scene object relative to the estimated target camera viewpoint scene horizon and an azimuth angle of the target scene object pose as a function of the determined target scene object image motion directions;
      
      select one of the built detectors that has an associated cluster zenith angle and an associated cluster azimuth angle that best matches the target scene object image pose zenith angle and target scene object image pose azimuth angle; and
      
      apply the selected previously learned detector to video data of the target domain acquired from the target camera viewpoint to recognize objects in the target domain video data.
  - 22. The article of manufacture of claim 21, wherein the computer readable program code instructions, when executed by the computer processor, further cause the computer processor to:
    - represent variations of the poses of the objects in each of the clusters with respect to the camera viewpoint from which the clustered objects are acquired, by a range of the zenith angles determined for the cluster objects from the minimum determined zenith angle to the maximum determined zenith angle; and
      
      represent variations of the determined directions of motion of the objects in each of the clusters with respect to the camera viewpoint from which the clustered objects are acquired, as a range of the azimuth angles determined for the cluster objects from the minimum determined azimuth angle to the maximum determined azimuth angle.
  - 23. The article of manufacture of claim 22, wherein the clusters are optical flow map clusters, and wherein the computer readable program code instructions, when executed by the computer processor, further cause the computer processor to:
    - determine the motion directions for the source training video dataset object images and for the target scene object image by:
      
      estimating a direction of motion of objects appearing in each scene through an optical flow process; and
      
      representing each space-time point in the estimated optical flow directions of motion of the objects appearing for each respective camera viewpoint by a four-dimensional vector, the vector comprising a location of each space-time point in an image plane, a magnitude and a direction of its optical flow; and
      
      categorize the plurality of object images into the plurality of optical flow map clusters as a function of the similarities of their determined motion directions by:
      
      discarding the space-time points that have an optical flow magnitude that is above or below certain respective fixed thresholds as noise;
      
      after discarding the noise points, randomly sub-sampling and clustering a remainder of the space-time points into the optical flow map clusters by using a self-tuning variant of spectral clustering that automatically selects a scale of analysis and a total number of the clusters; and
      
      representing different values of the directions of motion of the objects appearing in the scene viewpoint of each optical flow map cluster by the dominant direction of motion of the points within each optical flow map cluster and by the location of the cluster in the image plane.
  - 24. The article of manufacture of claim 23, wherein the computer readable program code instructions, when executed by the computer processor, further cause the computer processor to estimate the position of the horizon in each of the clustered camera views and the target camera view by utilizing the structures in images of the camera scene viewpoint that have an inherent geometric relationship to the image horizon inferred from the real-world, three-dimensional geometry of the structures in the camera scene viewpoint images by:
    - identifying a plurality of structures in the images of the camera scene viewpoint through geometric parsing that are generally parallel to each other in the real-world, three-dimensional geometry of the structures in the camera scene viewpoint images;
      
      using the plurality of structures to define multiple sets of parallel lines in the camera scene viewpoint images that are each aligned with structures such as buildings, roads etc. wherein each set of parallel lines intersects at a vanishing point; and
      
      estimating the horizon line as a line passing through the vanishing points of at least two sets of parallel lines.
  - 25. The article of manufacture of claim 24, wherein the computer readable program code instructions, when executed by the computer processor, further cause the computer processor to build the plurality of detectors for recognizing objects in input video by:
    - building Deformable Parts Model (DPM)-based object detectors that treat positions of parts of the objects as latent variables; and
      
      employing a latent Support Vector Machine (SVM) to infer the positions of the parts of the objects from image data in the input video.
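The idea in claim 25 of treating part positions as latent variables can be illustrated on the inference side: for a candidate window, each part is placed where its filter response minus a displacement penalty is largest, and the window score sums the root score and the best part scores. This is a minimal sketch in the style of common deformable parts formulations; the quadratic penalty, the precomputed response maps, and the names `best_part_placement` and `window_score` are assumptions, and the latent SVM training loop is not shown.

```python
import numpy as np

def best_part_placement(response, anchor, deform):
    """Infer the latent position of one part.

    response: 2-D array of part filter responses over candidate positions.
    anchor: (x, y) expected position of the part relative to the window.
    deform: (wx, wy) weights of an assumed quadratic displacement penalty.
    Returns (score, (x, y)) for the best placement.
    """
    h, w = response.shape
    ys, xs = np.mgrid[0:h, 0:w]
    penalty = deform[0] * (xs - anchor[0]) ** 2 + deform[1] * (ys - anchor[1]) ** 2
    total = response - penalty
    y, x = np.unravel_index(np.argmax(total), total.shape)
    return float(total[y, x]), (int(x), int(y))

def window_score(root_score, part_responses, anchors, deforms):
    """Root score plus the best achievable score of every part."""
    score = float(root_score)
    placements = []
    for response, anchor, deform in zip(part_responses, anchors, deforms):
        part_score, position = best_part_placement(response, anchor, deform)
        score += part_score
        placements.append(position)
    return score, placements
```

During training, a latent SVM would alternate between fixing these inferred placements and updating the filter and penalty weights; only the placement step is sketched here.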

Current Assignee: Kyndryl Incorporated (Kyndryl Holdings, Inc.)
Original Assignee: International Business Machines Corporation
Inventors: Feris, Rogerio S.; Pankanti, Sharathchandra U.; Siddiquie, Behjat

Granted Patent: US 8,983,133 B2
US Class Current: 382/103
CPC Class Codes:
- G06F 18/23   Clustering techniques
- G06T 2207/10016   Video; Image sequence
- G06T 2207/20081   Training; Learning
- G06T 7/246   using feature-based methods...
- G06T 7/73   using feature-based methods
- G06V 20/41   Higher-level, semantic clus...
- G06V 20/52   Surveillance or monitoring ...
- G06V 20/54   of traffic, e.g. cars on th...
