END-TO-END VISUAL RECOGNITION SYSTEM AND METHODS

US 20130215264A1
Filed: 01/07/2013
Published: 08/22/2013
Est. Priority Date: 07/08/2010
Status: Active Grant

Abstract

We describe an end-to-end visual recognition system, where “end-to-end” means that the system performs every stage of the recognition task, from constructing “maps” of scenes or “models” of objects from training data to determining the class, identity, location and other inferred parameters from test data. The system can operate on a mobile hand-held device, such as a mobile phone, tablet or other portable device equipped with sensing and computing power. It employs a video-based feature descriptor, whose invariance and discriminative properties we characterize. Feature selection and tracking are performed in real time and are used to train a template-based classifier during a capture phase prompted by the user. During normal operation, the system scores objects in the field of view based on their ranking.
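
The template-based scoring mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each object learned during the capture phase is stored as a single descriptor vector and that cosine similarity is the score; the abstract specifies neither, and the function names, labels and numbers below are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def rank_templates(query, templates):
    """Return (label, score) pairs sorted from best to worst match.

    `templates` maps a label to a descriptor stored during the capture phase
    (an assumed storage layout, not taken from the patent).
    """
    scores = [(label, cosine_similarity(query, d)) for label, d in templates.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical descriptors learned during capture.
templates = {
    "mug": [0.9, 0.1, 0.0, 0.2],
    "book": [0.1, 0.8, 0.3, 0.0],
    "keys": [0.0, 0.2, 0.9, 0.4],
}
query = [0.8, 0.2, 0.1, 0.1]  # descriptor computed from the live video
ranking = rank_templates(query, templates)
```

Here `ranking` lists the stored objects from most to least similar to the query, so the first entry is the candidate that would be overlaid on the live view.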

64 Citations

37 Claims

1. A visual recognition apparatus for identifying objects captured in a video stream having a captured time period, the apparatus comprising:
- an image sensor configured for capturing a video stream;
  
  a computer processor; and
  
  programming for processing said video stream to perform visual recognition by performing steps comprising:
  
  capturing the video stream from said image sensor;
  
  associating each frame in an image with a corresponding frame in temporally adjacent images, or in images taken from nearby vantage points; and
  
  temporally aggregating statistics computed at one or more collections of temporally corresponding frames, into a descriptor.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
- - 2. The apparatus recited in claim 1, wherein said temporal aggregating of statistics is performed by computing a mean, or median, or mode, or sample histogram of a contrast-invariant function of the image in said frames.
  - 3. The apparatus recited in claim 1, wherein said programming performs steps comprising:
    - spatially aggregating such statistics into a representation that is insensitive to nuisance factor and distinctive;
      
      exploiting such a representation within a classification scheme to enable the detection, localization, recognition and categorization of objects and scenes in video; and
      
      displaying the result of the classification scheme by overlaying information on the live video stream, optionally localized and overlaid on the object of interest.
  - 4. The apparatus recited in claim 1, wherein said programming performs steps comprising:
    - selecting a plurality of features corresponding to translational, similarity, affine or more general reference frames from the video stream for objects in a field of view of the video stream; and
      
      performing such a selection at a plurality of scales, and using topological consistency across scale as a criterion for propagating said general reference frames across different scales.
  - 5. The apparatus recited in claim 4, wherein said plurality of features comprises a plurality of feature points.
  - 6. The apparatus recited in claim 4, wherein said programming ranks features according to their structural stability margin.
  - 7. The apparatus recited in claim 6, wherein said structural stability margin comprises a maximum norm of the nuisance that does not cause a singularity in the detection mechanism.
  - 8. The apparatus recited in claim 1, wherein said programming includes a canonization mechanism which does not rely on a co-variant detector.
  - 9. The apparatus recited in claim 1, wherein said programming canonizes rotation in response to a gravity sensor signal.
  - 10. The apparatus recited in claim 4, wherein said programming performs steps comprising:
    - computing a co-variant region that is proximate to a feature point of said feature;
      
      computing a contrast invariant feature; and
      
      performing a temporal aggregation operation of a number of statistics computed on each image associated with the plurality of video frames over a time period.
  - 11. The apparatus recited in claim 10, wherein the temporal aggregation operation comprises aggregating the contrast invariant feature at each video frame during the time period at the corresponding scale of a feature point of the feature.
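
The temporal aggregation recited in claims 2, 10 and 11 can be sketched as follows. The sketch assumes that the contrast-invariant function is the gradient direction at the center of a tracked 3x3 patch, and it aggregates that direction over frames with a median and a sample histogram. The choice of gradient direction, the patch size and all names are assumptions made for illustration. The plain median ignores angle wrap-around, so it is only meaningful when the directions stay away from ±π.

```python
import math
from statistics import median

def gradient_orientation(patch, r, c):
    """Gradient direction (radians) at pixel (r, c) by central differences.

    The direction is unchanged by an affine contrast change a*I + b with a > 0.
    """
    gx = patch[r][c + 1] - patch[r][c - 1]
    gy = patch[r + 1][c] - patch[r - 1][c]
    return math.atan2(gy, gx)

def orientation_histogram(angles, bins=8):
    """Normalized sample histogram of angles over [-pi, pi)."""
    counts = [0] * bins
    for a in angles:
        idx = int((a + math.pi) / (2 * math.pi) * bins) % bins
        counts[idx] += 1
    return [c / len(angles) for c in counts]

def temporal_descriptor(patches, bins=8):
    """Aggregate the center-pixel direction over temporally corresponding patches."""
    angles = [gradient_orientation(p, 1, 1) for p in patches]
    return {"median": median(angles), "histogram": orientation_histogram(angles, bins)}

# The same horizontal intensity ramp seen under three contrast changes.
base = [[10, 20, 30], [10, 20, 30], [10, 20, 30]]
frames = [[[gain * v + bias for v in row] for row in base]
          for gain, bias in [(1.0, 0), (0.5, 40), (2.0, -5)]]
descriptor = temporal_descriptor(frames)
```

Because the three frames differ only in gain and bias, every frame yields the same direction, and the histogram concentrates in a single bin.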

12. A visual recognition method for identifying objects captured in a video stream having a captured time period, the method comprising:
- capturing the video stream on an electronic device;
  
  enabling the user to select a target object or scene for training;
  
  capturing the video stream from said image sensor;
  
  associating each frame in an image with a corresponding frame in temporally adjacent images, or in images taken from nearby vantage points; and
  
  temporally aggregating statistics computed at one or more collections of temporally corresponding frames, into a descriptor.
- View Dependent Claims (13, 14, 15)
- - 13. The method recited in claim 12, wherein said aggregation is performed by computing a mean, or median, or mode, or sample histogram of a contrast-invariant function of the image in said frames.
  - 14. The method recited in claim 12, further comprising:
    - spatially aggregating such statistics into a representation that is insensitive to nuisance factor and distinctive;
      
      exploiting such a representation within a classification scheme to enable the detection, localization, recognition and categorization of objects and scenes in video; and
      
      displaying the result of the classification scheme by overlaying information on the live video stream, optionally localized and overlaid on the object of interest.
  - 15. The method recited in claim 12, further comprising:
    - selecting a plurality of features corresponding to translational, similarity, affine or more general reference frames from the video stream for objects in a field of view of the video stream; and
      
      performing such a selection at a plurality of scales, and using topological consistency across scale as a criterion for propagating said general reference frames across different scales.
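
One way to picture the spatial aggregation of claim 14 is pooling quantized statistics into per-cell histograms over a coarse grid, which makes the result insensitive to small displacements inside a cell. This is a minimal sketch under that assumption; the claim does not prescribe a grid, a bin count or L2 normalization, and `spatial_pool` is a hypothetical name.

```python
import math

def spatial_pool(orientation_bins, grid=2, bins=4):
    """Pool a square map of quantized orientations into grid x grid cell histograms.

    `orientation_bins[r][c]` holds an integer bin index in [0, bins).
    Returns one L2-normalized vector of length grid * grid * bins.
    """
    cell = len(orientation_bins) // grid
    descriptor = []
    for gr in range(grid):
        for gc in range(grid):
            hist = [0.0] * bins
            for r in range(gr * cell, (gr + 1) * cell):
                for c in range(gc * cell, (gc + 1) * cell):
                    hist[orientation_bins[r][c]] += 1.0
            descriptor.extend(hist)
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0 else descriptor

# Two maps whose entries are rearranged only within each 2x2 cell.
map_a = [[0, 1, 2, 2], [1, 0, 2, 3], [3, 3, 1, 1], [3, 0, 1, 0]]
map_b = [[1, 0, 2, 2], [0, 1, 3, 2], [3, 3, 1, 1], [0, 3, 0, 1]]
pooled_a = spatial_pool(map_a)
pooled_b = spatial_pool(map_b)
```

The two maps produce identical pooled vectors, which is the kind of nuisance insensitivity the claim refers to, while maps that differ across cells still produce different vectors.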

16. A visual recognition apparatus for identifying objects captured in a video stream having a captured time period, the apparatus comprising:
- an image sensor configured for capturing a video stream;
  
  a computer processor; and
  
  programming for processing said video stream to perform visual recognition by performing steps comprising:
  
  capturing the video stream from said image sensor;
  
  associating each frame in an image with a corresponding frame in temporally adjacent images, or in images taken from nearby vantage points;
  
  temporally aggregating statistics computed at one or more collections of temporally corresponding frames, into a descriptor;
  
  spatially aggregating such statistics into a representation that is insensitive to nuisance factor and distinctive;
  
  exploiting such a representation within a classification scheme to enable the detection, localization, recognition and categorization of objects and scenes in video; and
  
  displaying the result of the classification scheme by overlaying information on the live video stream, optionally localized and overlaid on the object of interest.
- View Dependent Claims (18, 19, 23, 24, 25, 26)
- - 18. The apparatus recited in claim 16, wherein said programming performs steps comprising:
    - selecting a plurality of features corresponding to translational, similarity, affine or more general reference frames from the video stream for objects in a field of view of the video stream; and
      
      performing such a selection at a plurality of scales, and using topological consistency across scale as a criterion for propagating said general reference frames across different scales.
  - 19. The apparatus recited in claim 16 or 17, wherein said temporal aggregating of statistics is performed by computing a mean, or median, or mode, or sample histogram of a contrast-invariant function of the image in said frames.
  - 23. The apparatus recited in claim 16 or 17, wherein said programming includes a canonization mechanism which does not rely on a co-variant detector.
  - 24. The apparatus recited in claim 16 or 17, wherein said programming canonizes rotation in response to a gravity sensor signal.
  - 25. The apparatus recited in claim 16 or 17, wherein said programming performs steps comprising:
    - computing a co-variant region that is proximate to a feature point of said feature;
      
      computing a contrast invariant feature; and
      
      performing a temporal aggregation operation of a number of statistics computed on each image associated with the plurality of video frames over a time period.
  - 26. The apparatus recited in claim 25, wherein the temporal aggregation operation comprises aggregating the contrast invariant feature at each video frame during the time period at the corresponding scale of a feature point of the feature.
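
Canonizing rotation from a gravity sensor signal, as in claim 24, can be sketched by expressing every measured orientation relative to the projection of gravity onto the image plane. The sketch assumes that this projection is already available as a 2-D vector `(gx, gy)` in image coordinates and that it is not close to zero (the camera is not pointing straight up or down). The function names are hypothetical.

```python
import math

def gravity_angle(gx, gy):
    """In-plane angle (radians) of the gravity vector in image coordinates."""
    return math.atan2(gy, gx)

def canonize_orientation(theta, gx, gy):
    """Express orientation `theta` relative to gravity, wrapped to [-pi, pi)."""
    rel = theta - gravity_angle(gx, gy)
    return (rel + math.pi) % (2 * math.pi) - math.pi

# The same scene edge observed with the device upright and rolled by 30 degrees.
theta_upright = 0.4          # measured edge orientation, device upright
g_upright = math.pi / 2      # gravity points along +y (down) in the image
roll = math.pi / 6           # in-plane device rotation

a = canonize_orientation(theta_upright, math.cos(g_upright), math.sin(g_upright))
b = canonize_orientation(theta_upright + roll,
                         math.cos(g_upright + roll), math.sin(g_upright + roll))
```

An in-plane roll shifts the measured orientation and the gravity direction by the same angle, so `a` and `b` agree up to rounding and no co-variant orientation detector is needed.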

17. A visual recognition apparatus for identifying objects captured in a video stream having a captured time period, the apparatus comprising:
- an image sensor configured for capturing a video stream;
  
  a computer processor; and
  
  programming for processing said video stream to perform visual recognition by performing steps comprising:
  
  capturing the video stream from said image sensor;
  
  optionally selecting a plurality of features corresponding to translational, similarity, affine or more general reference frames from the video stream for objects in a field of view of the video stream;
  
  performing such a selection at a plurality of scales, and using topological consistency across scale as a criterion for propagating said general reference frames across different scales;
  
  associating each frame in an image with a corresponding frame in temporally adjacent images, or in images taken from nearby vantage points;
  
  temporally aggregating statistics computed at one or more collections of temporally corresponding frames, into a descriptor;
  
  spatially aggregating such statistics into a representation that is insensitive to nuisance factor and distinctive;
  
  exploiting such a representation within a classification scheme to enable the detection, localization, recognition and categorization of objects and scenes in video; and
  
  displaying the result of the classification scheme by overlaying information on the live video stream, optionally localized and overlaid on the object of interest.
- View Dependent Claims (20, 21, 22)
- - 20. The apparatus recited in claim 17 or 18, wherein said plurality of features comprises a plurality of feature points.
  - 21. The apparatus recited in claim 17 or 18, wherein said programming ranks features according to their structural stability margin.
  - 22. The apparatus recited in claim 21, wherein said structural stability margin comprises a maximum norm of the nuisance that does not cause a singularity in the detection mechanism.
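
Claims 21 and 22 rank features by a structural stability margin, defined as the largest nuisance that does not cause a singularity in the detection mechanism. As a toy proxy only, the sketch below treats a detector that picks strict local maxima of a 1-D response and uses the gap between a maximum and its larger neighbor as the margin: the smaller the gap, the smaller the perturbation that could make the maximum disappear or move. The proxy, the 1-D setting and the names are assumptions, not the patent's definition.

```python
def local_maxima_with_margin(response):
    """Find strict local maxima of a 1-D detector response.

    Returns (index, margin) pairs, where margin is the gap to the larger
    neighbor: a toy stand-in for the structural stability margin.
    """
    found = []
    for i in range(1, len(response) - 1):
        left, mid, right = response[i - 1], response[i], response[i + 1]
        if mid > left and mid > right:
            found.append((i, mid - max(left, right)))
    return found

def rank_by_margin(response):
    """Order detected maxima from most to least stable under the toy margin."""
    return sorted(local_maxima_with_margin(response), key=lambda f: f[1], reverse=True)

response = [0.0, 0.2, 0.1, 0.9, 0.3, 0.35, 0.3]  # hypothetical detector output
ranked = rank_by_margin(response)
```

The sharp peak at index 3 ranks first, while the shallow bump at index 5 ranks last and would be the first feature dropped when only the most stable ones are kept.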

27. A visual recognition method for identifying objects captured in a video stream having a captured time period, the method comprising:
- capturing the video stream on an electronic device;
  
  enabling the user to select a target object or scene for training;
  
  capturing the video stream from said image sensor;
  
  associating each frame in an image with a corresponding frame in temporally adjacent images, or in images taken from nearby vantage points;
  
  temporally aggregating statistics computed at one or more collections of temporally corresponding frames, into a descriptor;
  
  spatially aggregating such statistics into a representation that is insensitive to nuisance factor and distinctive;
  
  exploiting such a representation within a classification scheme to enable the detection, localization, recognition and categorization of objects and scenes in video; and
  
  displaying the result of the classification scheme by overlaying information on the live video stream, optionally localized and overlaid on the object of interest.
- View Dependent Claims (29, 30, 34, 35, 36, 37)
- - 29. The method recited in claim 27, wherein said programming performs steps comprising:
    - selecting a plurality of features corresponding to translational, similarity, affine or more general reference frames from the video stream for objects in a field of view of the video stream; and
      
      performing such a selection at a plurality of scales, and using topological consistency across scale as a criterion for propagating said general reference frames across different scales.
  - 30. The method recited in claim 27 or 28, wherein said temporal aggregating of statistics is performed by computing a mean, or median, or mode, or sample histogram of a contrast-invariant function of the image in said frames.
  - 34. The method recited in claim 27 or 28, further comprising employing a canonization mechanism which does not rely on a co-variant detector.
  - 35. The method recited in claim 27 or 28, further comprising canonizing rotation in response to a gravity sensor signal.
  - 36. The method recited in claim 27 or 28, further comprising:
    - computing a co-variant region that is proximate to a feature point of said feature;
      
      computing a contrast invariant feature; and
      
      performing a temporal aggregation operation of a number of statistics computed on each image associated with the plurality of video frames over a time period.
  - 37. The method recited in claim 36, wherein the temporal aggregation operation comprises aggregating the contrast invariant feature at each video frame during the time period at the corresponding scale of a feature point of the feature.
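
The associating step of claim 27 pairs each reference frame in one image with its counterpart in a temporally adjacent image. A minimal sketch, assuming that each reference frame is reduced to a 2-D location and that correspondence is decided by greedy nearest-neighbor matching with a distance gate; the claim does not name a matching rule, and `associate` and `max_dist` are hypothetical.

```python
import math

def associate(prev_points, next_points, max_dist=5.0):
    """Greedy nearest-neighbor matching of feature locations between two images.

    Returns {index_in_prev: index_in_next}; points with no neighbor within
    max_dist pixels stay unmatched.
    """
    candidates = []
    for i, p in enumerate(prev_points):
        for j, q in enumerate(next_points):
            d = math.dist(p, q)
            if d <= max_dist:
                candidates.append((d, i, j))
    matches, used_prev, used_next = {}, set(), set()
    for d, i, j in sorted(candidates):  # closest pairs are committed first
        if i not in used_prev and j not in used_next:
            matches[i] = j
            used_prev.add(i)
            used_next.add(j)
    return matches

prev_points = [(10, 10), (50, 40), (80, 20)]
next_points = [(51, 42), (11, 9), (200, 200)]
matches = associate(prev_points, next_points)
```

The third point in `prev_points` has no neighbor within the gate and is left unmatched, which corresponds to a track that ends; statistics would then be aggregated only over the frames that remain in correspondence.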

28. A visual recognition method for identifying objects captured in a video stream having a captured time period, the method comprising:
- capturing the video stream on an electronic device;
  
  enabling the user to select a target object or scene for training;
  
  capturing the video stream from said image sensor;
  
  optionally selecting a plurality of features corresponding to translational, similarity, affine or more general reference frames from the video stream for objects in a field of view of the video stream;
  
  performing such a selection at a plurality of scales, and using topological consistency across scale as a criterion for propagating said general reference frames across different scales;
  
  associating each frame in an image with a corresponding frame in temporally adjacent images, or in images taken from nearby vantage points;
  
  temporally aggregating statistics computed at one or more collections of temporally corresponding frames, into a descriptor;
  
  spatially aggregating such statistics into a representation that is insensitive to nuisance factor and distinctive;
  
  exploiting such a representation within a classification scheme to enable the detection, localization, recognition and categorization of objects and scenes in video; and
  
  displaying the result of the classification scheme by overlaying information on the live video stream, optionally localized and overlaid on the object of interest.
- View Dependent Claims (31, 32, 33)
- - 31. The method recited in claim 28 or 29, wherein said plurality of features comprises a plurality of feature points.
  - 32. The method recited in claim 28 or 29, further comprising ranking features according to their structural stability margin.
  - 33. The method recited in claim 32, wherein said structural stability margin comprises a maximum norm of the nuisance that does not cause a singularity in the detection mechanism.

Current Assignee
Regents of the University of California (University of California)
Original Assignee
Regents of the University of California (University of California)
Inventors
Soatto, Stefano; Lee, Taehee

Granted Patent

US 8,717,437 B2
US Class Current

348/143
CPC Class Codes

G06F 18/24765   Rule-based classification

G06F 18/28   Determining representative ...

G06T 7/207   for motion estimation over ...

G06T 7/246   using feature-based methods...

G06V 10/462   Salient features, e.g. scal...

G06V 10/772   Determining representative ...

G06V 20/20   in augmented reality scenes

G06V 20/46   Extracting features or char...
