HAND MOTION IDENTIFICATION METHOD AND APPARATUS
Abstract
A hand motion identification method includes: obtaining a to-be-identified video; performing area localization and tracking of a hand for the to-be-identified video, and extracting a red-green-blue (RGB) video and a depth information video of the hand; detecting the RGB video and the depth information video of the hand, to obtain a feature point; representing the feature point by using a 3D Mesh motion scale-invariant feature transform (MoSIFT) feature descriptor; and comparing the 3D Mesh MoSIFT feature descriptor of the feature point with a 3D Mesh MoSIFT feature descriptor in a positive sample obtained through beforehand training, to obtain a hand motion category in the to-be-identified video.
Claims
1. A hand motion identification method, comprising:
a computing device having one or more processors and a memory storing programs executed by the one or more processors;
obtaining a to-be-identified video;
performing area localization and tracking of a hand for the to-be-identified video;
extracting a red-green-blue (RGB) video and a depth information video of the located and tracked hand;
detecting the RGB video and the depth information video of the hand to obtain a feature point;
representing, by using a 3D Mesh motion scale-invariant feature transform (MoSIFT) feature descriptor, the feature point; and
comparing the 3D Mesh MoSIFT feature descriptor of the feature point with 3D Mesh MoSIFT feature descriptors in positive samples obtained through beforehand training, to obtain a hand motion category in the to-be-identified video.
2. The method according to claim 1, wherein the step of performing area localization and tracking of a hand for the to-be-identified video, and extracting an RGB video and a depth information video of the hand comprises:
locating a hand area by using an adaptive window; and
tracking the hand area of a current frame by using a minimized energy function in combination with hand state prediction of a previous frame to extract the RGB video and the depth information video of the hand.
3. The method according to claim 2, wherein the minimized energy function is a sum of a data term, a smoothness term, a distance term, a space constraint, a motion constraint, and a Chamfer distance term, wherein:
the data term is used to estimate likelihood values of pixels associated with the hand;
the smoothness term is used to estimate smoothness of two adjacent pixels;
the distance term is used to constrain a new state estimation to be within a predicted space domain;
the space constraint is used to distinguish a color-similar area of the hand;
the motion constraint is used to separate the hand from another portion other than the hand; and
the Chamfer distance term is used to distinguish an overlapping area of the hand.
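Illustrative note (not claim language): one plausible formalization of the minimized energy function of claim 3, writing s_t for the hand state of the current frame and ŝ_t for the state predicted from the previous frame; the weights λ_1 … λ_5 are assumed here and are not recited in the claim:

```latex
E(s_t) = E_{\mathrm{data}}(s_t) + \lambda_1 E_{\mathrm{smooth}}(s_t)
       + \lambda_2 E_{\mathrm{dist}}(s_t, \hat{s}_t) + \lambda_3 E_{\mathrm{space}}(s_t)
       + \lambda_4 E_{\mathrm{motion}}(s_t) + \lambda_5 E_{\mathrm{chamfer}}(s_t),
\qquad s_t^{*} = \operatorname*{arg\,min}_{s_t} E(s_t)
```

Tracking would then adopt, for each frame, the state s_t* that minimizes this sum.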
4. The method according to claim 1, wherein the step of detecting the RGB video and the depth information video of the hand, to obtain a feature point comprises:
converting the RGB video and the depth information video of the hand into grayscale and depth data, and converting the grayscale and depth data into 3D grid data;
calculating a local density of depth information of vertices within a preset neighborhood in the 3D grid data; and
selecting a vertex corresponding to a maximum value of the local density of the depth information within the preset neighborhood, to be used as a feature point of the preset neighborhood.
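Illustrative note (not claim language): a minimal sketch of the detection step of claim 4, assuming the 3D grid data is sampled as a per-pixel depth map and that the "local density of depth information" is approximated by a neighborhood mean of the depth values; the claim does not fix a density formula, so both choices are assumptions:

```python
import numpy as np
from scipy.ndimage import maximum_filter, uniform_filter

def detect_feature_points(depth, k=5):
    """Select, per k x k neighborhood, the vertex whose local density of
    depth information is maximal (claim 4). The density is approximated
    here as the neighborhood mean of the depth values (an assumption)."""
    density = uniform_filter(depth.astype(np.float64), size=k)
    # Keep vertices whose density equals the maximum over their neighborhood.
    local_max = maximum_filter(density, size=k)
    ys, xs = np.nonzero(density == local_max)
    return list(zip(xs.tolist(), ys.tolist()))
```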
5. The method according to claim 1, wherein the step of representing the feature point by using a 3D Mesh MoSIFT feature descriptor comprises:
representing the feature point by using a 3D gradient space descriptor and a 3D motion space descriptor, wherein the 3D gradient space descriptor comprises image gradient descriptors in a horizontal direction and a vertical direction, and the 3D motion space descriptor comprises a rate descriptor.
6. The method according to claim 5, wherein a step of calculating the image gradient descriptors in the horizontal direction and the vertical direction comprises:
rotating coordinate axes to a direction of the feature point, projecting the feature point to an xy plane, an xz plane, and a yz plane of 3D space coordinates, separately taking m×m windows by using points formed by projecting the feature point to the xy plane, the xz plane, and the yz plane as centers, calculating, on each r×r block, gradient histograms in 8 directions, evaluating an accumulated value of each gradient direction, to form one seed point, and making up the feature point by using
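Illustrative note (not claim language): under claim 6, each r×r block contributes an 8-direction gradient histogram whose accumulated values form one seed point. A sketch of that per-block computation follows; weighting each vote by gradient magnitude is an assumption carried over from common SIFT practice, not something the claim recites:

```python
import numpy as np

def seed_point_histogram(block):
    """8-direction gradient histogram over one r x r block (claim 6).
    The accumulated value in each direction bin forms one seed point."""
    gy, gx = np.gradient(block.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)  # in [-pi, pi)
    # Quantize orientations into 8 equal bins and accumulate magnitudes.
    bins = ((orientation + np.pi) / (2 * np.pi) * 8).astype(int) % 8
    hist = np.zeros(8)
    np.add.at(hist, bins.ravel(), magnitude.ravel())
    return hist
```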
7. The method according to claim 5, wherein components of the rate descriptor on 3D space x-, y-, and z-coordinate axes comprise:
a component of the rate descriptor on the x-axis being a difference between coordinate values of the x-axis to which the feature point is projected on two adjacent frames of videos;
a component of the rate descriptor on the y-axis being a difference between coordinate values of the y-axis to which the feature point is projected on two adjacent frames of videos; and
a component of the rate descriptor on the z-axis being a difference between coordinate values of the z-axis to which the feature point is projected on depth information of two adjacent frames of videos.
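Illustrative note (not claim language): the three rate components of claim 7 reduce to frame-to-frame coordinate differences; representing each tracked point as an (x, y, depth) tuple is an assumed convention:

```python
def rate_descriptor(prev, curr):
    """Per-axis rate components per claim 7: differences of the
    projected coordinates of one feature point on adjacent frames.
    `prev` and `curr` are assumed (x, y, depth) tuples."""
    vx = curr[0] - prev[0]  # x-axis: image-plane x difference
    vy = curr[1] - prev[1]  # y-axis: image-plane y difference
    vz = curr[2] - prev[2]  # z-axis: depth-value difference
    return vx, vy, vz
```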
8. The method according to claim 1, wherein the step of comparing the 3D Mesh MoSIFT feature descriptor of the feature point with 3D Mesh MoSIFT feature descriptors in positive samples obtained through beforehand training, to obtain a hand motion category in the to-be-identified video comprises:
dimensionally reducing the 3D Mesh MoSIFT feature descriptor of the feature point to a dimension that is the same as that of a 3D Mesh MoSIFT feature descriptor in a positive sample obtained through the beforehand training;
evaluating a Euclidean distance between the 3D Mesh MoSIFT feature descriptor of the feature point after the dimension reduction and the 3D Mesh MoSIFT feature descriptor in the positive sample; and
selecting a category corresponding to the 3D Mesh MoSIFT feature descriptor in one of the positive samples with a minimum Euclidean distance to the 3D Mesh MoSIFT feature descriptor of the feature point, to be used as the hand motion category in the to-be-identified video.
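Illustrative note (not claim language): a minimal sketch of the matching step of claim 8. The claim does not name the dimension-reduction method; a fixed linear projection (e.g., one learned by PCA during training) is an assumed stand-in, and all names below are hypothetical:

```python
import numpy as np

def classify(descriptor, sample_descriptors, sample_labels, projection):
    """Nearest-neighbour matching per claim 8: reduce the descriptor to
    the training dimension, then return the category of the positive
    sample at minimum Euclidean distance."""
    reduced = projection @ descriptor                 # dimension reduction
    dists = np.linalg.norm(sample_descriptors - reduced, axis=1)
    return sample_labels[int(np.argmin(dists))]       # closest positive sample
```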
9. The method according to claim 1, wherein the method further comprises:
performing beforehand training, to obtain the positive samples each comprising a 3D Mesh MoSIFT feature descriptor and a corresponding category.
10. A hand motion identification apparatus, comprising:
multiple instruction modules that can be executed by a processor, the multiple instruction modules comprising:
a to-be-identified video obtaining module, configured to obtain a to-be-identified video;
a to-be-identified video pair extraction module, configured to perform area localization and tracking of a hand for the to-be-identified video, and extract a red-green-blue (RGB) video and a depth information video of the hand;
a to-be-identified feature point detection module, configured to detect the RGB video and the depth information video of the hand, to obtain a feature point;
a to-be-identified feature point representation module, configured to represent the feature point by using a 3D Mesh motion scale-invariant feature transform (MoSIFT) feature descriptor; and
a category identification module, configured to compare the 3D Mesh MoSIFT feature descriptor of the feature point with a 3D Mesh MoSIFT feature descriptor in a positive sample obtained through beforehand training, to obtain a hand motion category in the to-be-identified video.
11. The apparatus according to claim 10, wherein the to-be-identified video pair extraction module comprises:
a to-be-identified video locating submodule, configured to locate a hand area by using an adaptive window; and
a to-be-identified video extraction submodule, configured to track the hand area of a current frame by using a minimized energy function in combination with hand state prediction of a previous frame to extract the RGB video and the depth information video of the hand.
12. The apparatus according to claim 11, wherein the minimized energy function is a sum of a data term, a smoothness term, a distance term, a space constraint, a motion constraint, and a Chamfer distance term, wherein:
the data term is used to estimate likelihood values of pixels associated with the hand;
the smoothness term is used to estimate smoothness of two adjacent pixels;
the distance term is used to constrain a new state estimation to be within a predicted space domain;
the space constraint is used to distinguish a color-similar area of the hand;
the motion constraint is used to separate the hand from another portion other than the hand; and
the Chamfer distance term is used to distinguish an overlapping area of the hand.
13. The apparatus according to claim 10, wherein the to-be-identified feature point detection module comprises:
a to-be-identified data conversion submodule, configured to convert the RGB video and the depth information video of the hand into grayscale and depth data, and convert the grayscale and depth data into 3D grid data;
a to-be-identified density obtaining submodule, configured to calculate a local density of depth information of vertices within a preset neighborhood in the 3D grid data; and
a to-be-identified feature point selection submodule, configured to select a vertex corresponding to a maximum value of the local density of the depth information within the preset neighborhood, to be used as a feature point of the preset neighborhood.
14. The apparatus according to claim 10, wherein the to-be-identified feature point representation module is further configured to represent the feature point by using a 3D gradient space descriptor and a 3D motion space descriptor, wherein the 3D gradient space descriptor comprises image gradient descriptors in a horizontal direction and a vertical direction, and the 3D motion space descriptor comprises a rate descriptor.
15. The apparatus according to claim 14, wherein the to-be-identified feature point representation module is further configured to rotate coordinate axes to a direction of the feature point, project the feature point to an xy plane, an xz plane, and a yz plane of 3D space coordinates, separately take m×m windows by using points formed by projecting the feature point to the xy plane, the xz plane, and the yz plane as centers, calculate, on each r×r block, gradient histograms in 8 directions, evaluate an accumulated value of each gradient direction, to form one seed point, and make up the feature point by using
16. The apparatus according to claim 14, wherein components of the rate descriptor on 3D space x-, y-, and z-coordinate axes comprise:
a component of the rate descriptor on the x-axis being a difference between coordinate values of the x-axis to which the feature point is projected on two adjacent frames of videos;
a component of the rate descriptor on the y-axis being a difference between coordinate values of the y-axis to which the feature point is projected on two adjacent frames of videos; and
a component of the rate descriptor on the z-axis being a difference between coordinate values of the z-axis to which the feature point is projected on depth information of two adjacent frames of videos.
17. The apparatus according to claim 10, wherein the category identification module comprises:
a dimension reduction submodule, configured to dimensionally reduce the 3D Mesh MoSIFT feature descriptor of the feature point to a dimension that is the same as that of a 3D Mesh MoSIFT feature descriptor in a positive sample obtained through the beforehand training;
a distance obtaining submodule, configured to evaluate a Euclidean distance between the 3D Mesh MoSIFT feature descriptor of the feature point after the dimension reduction and the 3D Mesh MoSIFT feature descriptor in the positive sample; and
a category determining submodule, configured to select a category corresponding to the 3D Mesh MoSIFT feature descriptor in one of the positive samples with a minimum Euclidean distance to the 3D Mesh MoSIFT feature descriptor of the feature point, to be used as the hand motion category in the to-be-identified video.
18. The apparatus according to claim 10, wherein the apparatus further comprises:
a construction module, configured to perform beforehand training, to obtain the positive samples each comprising a 3D Mesh MoSIFT feature descriptor and a corresponding category.
Specification