HAND MOTION IDENTIFICATION METHOD AND APPARATUS
Abstract
A hand motion identification method includes: obtaining a to-be-identified video; performing area localization and tracking of a hand for the to-be-identified video, and extracting a red-green-blue (RGB) video and a depth information video of the hand; detecting the RGB video and the depth information video of the hand, to obtain a feature point; representing the feature point by using a 3D Mesh motion scale-invariant feature transform (MoSIFT) feature descriptor; and comparing the 3D Mesh MoSIFT feature descriptor of the feature point with a 3D Mesh MoSIFT feature descriptor in a positive sample obtained through beforehand training, to obtain a hand motion category in the to-be-identified video.
Claims
1. A hand motion identification method, comprising:
a computing device having one or more processors and a memory storing programs executed by the one or more processors;
obtaining a to-be-identified video;
performing area localization and tracking of a hand for the to-be-identified video;
extracting a red-green-blue (RGB) video and a depth information video of the located and tracked hand;
detecting the RGB video and the depth information video of the hand to obtain a feature point;
representing, by using a 3D Mesh motion scale-invariant feature transform (MoSIFT) feature descriptor, the feature point; and
comparing the 3D Mesh MoSIFT feature descriptor of the feature point with 3D Mesh MoSIFT feature descriptors in positive samples obtained through beforehand training, to obtain a hand motion category in the to-be-identified video.
2. The method according to claim 1, wherein the step of performing area localization and tracking of a hand for the to-be-identified video, and extracting an RGB video and a depth information video of the hand comprises:
locating a hand area by using an adaptive window; and
tracking the hand area of a current frame by using a minimized energy function in combination with hand state prediction of a previous frame to extract the RGB video and the depth information video of the hand.
3. The method according to claim 2, wherein the minimized energy function is a sum of a data term, a smoothness term, a distance term, a space constraint, a motion constraint, and a Chamfer distance term, wherein:
the data term is used to estimate likelihood values of pixels associated with the hand;
the smoothness term is used to estimate smoothness of two adjacent pixels;
the distance term is used to constrain a new state estimation to be within a predicted space domain;
the space constraint is used to distinguish a color-similar area of the hand;
the motion constraint is used to separate the hand from another portion other than the hand; and
the Chamfer distance term is used to distinguish an overlapping area of the hand.
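Illustrative note (not claim language): one plausible formalization of the minimized energy function of claim 3, writing s_t for the hand state of the current frame and ŝ_t for the state predicted from the previous frame; the weights λ_1 … λ_5 are assumed here and are not recited in the claim:

```latex
E(s_t) = E_{\mathrm{data}}(s_t) + \lambda_1 E_{\mathrm{smooth}}(s_t)
       + \lambda_2 E_{\mathrm{dist}}(s_t, \hat{s}_t) + \lambda_3 E_{\mathrm{space}}(s_t)
       + \lambda_4 E_{\mathrm{motion}}(s_t) + \lambda_5 E_{\mathrm{chamfer}}(s_t),
\qquad s_t^{*} = \operatorname*{arg\,min}_{s_t} E(s_t)
```

Tracking would then adopt, for each frame, the state s_t* that minimizes this sum.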
4. The method according to claim 1, wherein the step of detecting the RGB video and the depth information video of the hand, to obtain a feature point comprises:
converting the RGB video and the depth information video of the hand into grayscale and depth data, and converting the grayscale and depth data into 3D grid data;
calculating a local density of depth information of vertices within a preset neighborhood in the 3D grid data; and
selecting a vertex corresponding to a maximum value of the local density of the depth information within the preset neighborhood, to be used as a feature point of the preset neighborhood.
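Illustrative note (not claim language): a minimal sketch of the detection step of claim 4, assuming the 3D grid data is sampled as a per-pixel depth map and that the "local density of depth information" is approximated by a neighborhood mean of the depth values; the claim does not fix a density formula, so both choices are assumptions:

```python
import numpy as np
from scipy.ndimage import maximum_filter, uniform_filter

def detect_feature_points(depth, k=5):
    """Select, per k x k neighborhood, the vertex whose local density of
    depth information is maximal (claim 4). The density is approximated
    here as the neighborhood mean of the depth values (an assumption)."""
    density = uniform_filter(depth.astype(np.float64), size=k)
    # Keep vertices whose density equals the maximum over their neighborhood.
    local_max = maximum_filter(density, size=k)
    ys, xs = np.nonzero(density == local_max)
    return list(zip(xs.tolist(), ys.tolist()))
```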
5. The method according to claim 1, wherein the step of representing the feature point by using a 3D Mesh MoSIFT feature descriptor comprises:
representing the feature point by using a 3D gradient space descriptor and a 3D motion space descriptor, wherein the 3D gradient space descriptor comprises image gradient descriptors in a horizontal direction and a vertical direction, and the 3D motion space descriptor comprises a rate descriptor.
6. The method according to claim 5, wherein a step of calculating the image gradient descriptors in the horizontal direction and the vertical direction comprises:
rotating coordinate axes to a direction of the feature point, projecting the feature point to an xy plane, an xz plane, and a yz plane of 3D space coordinates, separately taking m×m windows by using points formed by projecting the feature point to the xy plane, the xz plane, and the yz plane as centers, calculating, on each r×r block, gradient histograms in 8 directions, evaluating an accumulated value of each gradient direction, to form one seed point, and making up the feature point by using
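Illustrative note (not claim language): under claim 6, each r×r block contributes an 8-direction gradient histogram whose accumulated values form one seed point. A sketch of that per-block computation follows; weighting each vote by gradient magnitude is an assumption carried over from common SIFT practice, not something the claim recites:

```python
import numpy as np

def seed_point_histogram(block):
    """8-direction gradient histogram over one r x r block (claim 6).
    The accumulated value in each direction bin forms one seed point."""
    gy, gx = np.gradient(block.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)  # in [-pi, pi)
    # Quantize orientations into 8 equal bins and accumulate magnitudes.
    bins = ((orientation + np.pi) / (2 * np.pi) * 8).astype(int) % 8
    hist = np.zeros(8)
    np.add.at(hist, bins.ravel(), magnitude.ravel())
    return hist
```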
7. The method according to claim 5, wherein components of the rate descriptor on 3D space x-, y-, and z-coordinate axes comprise:
a component of the rate descriptor on the x-axis being a difference between coordinate values of the x-axis to which the feature point is projected on two adjacent frames of videos;
a component of the rate descriptor on the y-axis being a difference between coordinate values of the y-axis to which the feature point is projected on two adjacent frames of videos; and
a component of the rate descriptor on the z-axis being a difference between coordinate values of the z-axis to which the feature point is projected on depth information of two adjacent frames of videos.
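Illustrative note (not claim language): the three rate components of claim 7 reduce to frame-to-frame coordinate differences; representing each tracked point as an (x, y, depth) tuple is an assumed convention:

```python
def rate_descriptor(prev, curr):
    """Per-axis rate components per claim 7: differences of the
    projected coordinates of one feature point on adjacent frames.
    `prev` and `curr` are assumed (x, y, depth) tuples."""
    vx = curr[0] - prev[0]  # x-axis: image-plane x difference
    vy = curr[1] - prev[1]  # y-axis: image-plane y difference
    vz = curr[2] - prev[2]  # z-axis: depth-value difference
    return vx, vy, vz
```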
8. The method according to claim 1, wherein the step of comparing the 3D Mesh MoSIFT feature descriptor of the feature point with 3D Mesh MoSIFT feature descriptors in positive samples obtained through beforehand training, to obtain a hand motion category in the to-be-identified video comprises:
dimensionally reducing the 3D Mesh MoSIFT feature descriptor of the feature point to a dimension that is the same as that of a 3D Mesh MoSIFT feature descriptor in a positive sample obtained through the beforehand training;
evaluating a Euclidean distance between the 3D Mesh MoSIFT feature descriptor of the feature point after the dimension reduction and the 3D Mesh MoSIFT feature descriptor in the positive sample; and
selecting a category corresponding to the 3D Mesh MoSIFT feature descriptor in one of the positive samples with a minimum Euclidean distance to the 3D Mesh MoSIFT feature descriptor of the feature point, to be used as the hand motion category in the to-be-identified video.
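Illustrative note (not claim language): a minimal sketch of the matching step of claim 8. The claim does not name the dimension-reduction method; a fixed linear projection (e.g., one learned by PCA during training) is an assumed stand-in, and all names below are hypothetical:

```python
import numpy as np

def classify(descriptor, sample_descriptors, sample_labels, projection):
    """Nearest-neighbour matching per claim 8: reduce the descriptor to
    the training dimension, then return the category of the positive
    sample at minimum Euclidean distance."""
    reduced = projection @ descriptor                 # dimension reduction
    dists = np.linalg.norm(sample_descriptors - reduced, axis=1)
    return sample_labels[int(np.argmin(dists))]       # closest positive sample
```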
9. The method according to claim 1, wherein the method further comprises:
performing beforehand training, to obtain the positive samples each comprising a 3D Mesh MoSIFT feature descriptor and a corresponding category.
10. A hand motion identification apparatus, comprising:
multiple instruction modules that can be executed by a processor, the multiple instruction modules comprising:
a to-be-identified video obtaining module, configured to obtain a to-be-identified video;
a to-be-identified video pair extraction module, configured to perform area localization and tracking of a hand for the to-be-identified video, and extract a red-green-blue (RGB) video and a depth information video of the hand;
a to-be-identified feature point detection module, configured to detect the RGB video and the depth information video of the hand, to obtain a feature point;
a to-be-identified feature point representation module, configured to represent the feature point by using a 3D Mesh motion scale-invariant feature transform (MoSIFT) feature descriptor; and
a category identification module, configured to compare the 3D Mesh MoSIFT feature descriptor of the feature point with a 3D Mesh MoSIFT feature descriptor in a positive sample obtained through beforehand training, to obtain a hand motion category in the to-be-identified video.
11. The apparatus according to claim 10, wherein the to-be-identified video pair extraction module comprises:
a to-be-identified video locating submodule, configured to locate a hand area by using an adaptive window; and
a to-be-identified video extraction submodule, configured to track the hand area of a current frame by using a minimized energy function in combination with hand state prediction of a previous frame to extract the RGB video and the depth information video of the hand.
12. The apparatus according to claim 11, wherein the minimized energy function is a sum of a data term, a smoothness term, a distance term, a space constraint, a motion constraint, and a Chamfer distance term, wherein:
the data term is used to estimate likelihood values of pixels associated with the hand;
the smoothness term is used to estimate smoothness of two adjacent pixels;
the distance term is used to constrain a new state estimation to be within a predicted space domain;
the space constraint is used to distinguish a color-similar area of the hand;
the motion constraint is used to separate the hand from another portion other than the hand; and
the Chamfer distance term is used to distinguish an overlapping area of the hand.
13. The apparatus according to claim 10, wherein the to-be-identified feature point detection module comprises:
a to-be-identified data conversion submodule, configured to convert the RGB video and the depth information video of the hand into grayscale and depth data, and convert the grayscale and depth data into 3D grid data;
a to-be-identified density obtaining submodule, configured to calculate a local density of depth information of vertices within a preset neighborhood in the 3D grid data; and
a to-be-identified feature point selection submodule, configured to select a vertex corresponding to a maximum value of the local density of the depth information within the preset neighborhood, to be used as a feature point of the preset neighborhood.
14. The apparatus according to claim 10, wherein the to-be-identified feature point representation module is further configured to represent the feature point by using a 3D gradient space descriptor and a 3D motion space descriptor, wherein the 3D gradient space descriptor comprises image gradient descriptors in a horizontal direction and a vertical direction, and the 3D motion space descriptor comprises a rate descriptor.
15. The apparatus according to claim 14, wherein the to-be-identified feature point representation module is further configured to rotate coordinate axes to a direction of the feature point, project the feature point to an xy plane, an xz plane, and a yz plane of 3D space coordinates, separately take m×m windows by using points formed by projecting the feature point to the xy plane, the xz plane, and the yz plane as centers, calculate, on each r×r block, gradient histograms in 8 directions, evaluate an accumulated value of each gradient direction, to form one seed point, and make up the feature point by using
16. The apparatus according to claim 14, wherein components of the rate descriptor on 3D space x-, y-, and z-coordinate axes comprise:
a component of the rate descriptor on the x-axis being a difference between coordinate values of the x-axis to which the feature point is projected on two adjacent frames of videos;
a component of the rate descriptor on the y-axis being a difference between coordinate values of the y-axis to which the feature point is projected on two adjacent frames of videos; and
a component of the rate descriptor on the z-axis being a difference between coordinate values of the z-axis to which the feature point is projected on depth information of two adjacent frames of videos.
17. The apparatus according to claim 10, wherein the category identification module comprises:
a dimension reduction submodule, configured to dimensionally reduce the 3D Mesh MoSIFT feature descriptor of the feature point to a dimension that is the same as that of a 3D Mesh MoSIFT feature descriptor in a positive sample obtained through the beforehand training;
a distance obtaining submodule, configured to evaluate a Euclidean distance between the 3D Mesh MoSIFT feature descriptor of the feature point after the dimension reduction and the 3D Mesh MoSIFT feature descriptor in the positive sample; and
a category determining submodule, configured to select a category corresponding to the 3D Mesh MoSIFT feature descriptor in one of the positive samples with a minimum Euclidean distance to the 3D Mesh MoSIFT feature descriptor of the feature point, to be used as the hand motion category in the to-be-identified video.
18. The apparatus according to claim 10, wherein the apparatus further comprises:
a construction module, configured to perform beforehand training, to obtain the positive samples each comprising a 3D Mesh MoSIFT feature descriptor and a corresponding category.
Specification