System, method, and apparatus for multiple face tracking
Abstract
A system, method, and apparatus are disclosed that support automatic tracking of multiple faces in a sequence of digital images. Temporal filtering may be applied to reduce both missed detections and false alarms. Multiple modes may also be implemented to reduce processor load.
33 Claims
1. A method of tracking a plurality of face candidate regions in a digital video sequence comprising:
receiving a plurality of frames of video data;
a first operating mode, including constructing a score map based at least in part on one of said frames of video data, producing a mask for each frame of said plurality of frames based on said score map, said mask indicating a plurality of face candidate regions, and filtering said mask to remove an indication of at least one among the plurality of face candidate regions; and
a standby mode, wherein the first operating mode is practiced to track face candidate regions for each frame, and the standby mode performs face-tracking operations only on a portion of the frames of video data, said standby mode being operational when a specified number of consecutive frames have been processed without any face candidate regions being located.
2. The method of claim 1, wherein said constructing a score map is based at least in part on a set of predetermined relations, each among said set of predetermined relations associating one among a range of possible pixel values with a flesh-tone probability value.
3. The method of claim 1, wherein said mask comprises a plurality of entries, and wherein each among said plurality of entries corresponds to a plurality of pixels.
4. The method of claim 1, wherein said score map comprises a plurality of entries, and wherein said producing a mask from said score map comprises locating the entry having the highest flesh-tone probability value.
5. The method of claim 4, wherein said producing a mask from said score map further comprises growing a region, said region comprising entries connected to the located entry and having a flesh-tone probability value higher than a predetermined threshold.
7. The method of claim 1, wherein said filtering said mask to remove an indication of at least one among the plurality of face candidate regions comprises testing at least one among a size, a position, an aspect ratio, an intra-region flatness, a shape, and a contrast between interior and boundary areas of at least one among the plurality of face candidate regions.
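The score-map and region-growing steps recited in the dependent claims above can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `build_score_map`, `grow_region`, and `flesh_tone_lut`, the use of a dictionary as the set of predetermined relations, and the choice of 4-connectivity are all assumptions made for illustration. The sketch grows a single region from the highest-scoring entry; a segmenter that indicates a plurality of regions would presumably repeat the step on the remaining entries.

```python
from collections import deque


def build_score_map(frame, flesh_tone_lut):
    """Look up a flesh-tone probability for every pixel value in the frame.

    `flesh_tone_lut` is a hypothetical table mapping a pixel value to a
    probability; values missing from the table score 0.0.
    """
    return [[flesh_tone_lut.get(pixel, 0.0) for pixel in row] for row in frame]


def grow_region(score_map, threshold):
    """Seed at the highest-scoring entry and grow over connected entries.

    Entries join the region when they are 4-connected to it and their
    score is higher than `threshold`. Returns a mask of 0/1 entries.
    """
    rows, cols = len(score_map), len(score_map[0])
    mask = [[0] * cols for _ in range(rows)]
    seed = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: score_map[rc[0]][rc[1]],
    )
    if score_map[seed[0]][seed[1]] <= threshold:
        return mask  # no entry is flesh-like enough to start a region
    mask[seed[0]][seed[1]] = 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and not mask[nr][nc] and score_map[nr][nc] > threshold:
                mask[nr][nc] = 1
                queue.append((nr, nc))
    return mask
```

Here the pixel values are plain integers for brevity; a real mapper would more likely index the table by quantized chrominance values.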
6. A method comprising:
receiving a frame of video data;
constructing a score map based at least in part on said frame of data;
producing a mask from said score map, said mask indicating a plurality of face candidate regions;
filtering said mask to remove an indication of at least one among the plurality of face candidate regions, said filtering including temporal filtering said mask, wherein said mask comprises a plurality of entries, and wherein said temporal filtering said mask comprises obtaining an average of each entry and a corresponding entry of at least one temporally distinct mask, wherein each among said at least one temporally distinct mask is produced from a corresponding one among a plurality of temporally distinct score maps, each among said temporally distinct score maps being constructed based at least in part on a corresponding one among a plurality of temporally distinct frames of video data, and wherein said plurality of temporally distinct frames of video data includes at least one among a set of previous frames of video data and a set of subsequent frames of video data.
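The entry-wise averaging recited in claim 6 can be sketched as follows. This is only an illustration under stated assumptions: the function names, the symmetric window of `radius` previous and subsequent masks, and the final thresholding step in `binarize` are hypothetical and are not recited in the claim, which speaks only of obtaining an average.

```python
def temporal_average(masks, index, radius=1):
    """Average each entry of masks[index] with the corresponding entries
    of up to `radius` previous and `radius` subsequent masks."""
    lo = max(0, index - radius)
    hi = min(len(masks), index + radius + 1)
    window = masks[lo:hi]
    rows, cols = len(masks[index]), len(masks[index][0])
    return [
        [sum(m[r][c] for m in window) / len(window) for c in range(cols)]
        for r in range(rows)
    ]


def binarize(averaged, threshold=0.5):
    """Assumed post-processing step: keep an entry only when its average
    is strictly greater than `threshold`."""
    return [[1 if value > threshold else 0 for value in row] for row in averaged]
```

With masks `[[1, 0]]`, `[[0, 0]]`, `[[1, 1]]` for three consecutive frames, the middle mask averages to roughly `[[0.67, 0.33]]`, so an entry missed in one frame but present in its neighbors survives, while an entry that appears in only one of the three is suppressed.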
8. An apparatus for tracking a plurality of face candidate regions in a digital video sequence comprising:
a mapper configured and arranged to receive a plurality of frames of data and produce a score map based at least in part on a frame of the data;
a segmenter configured and arranged to receive the score map and produce a mask for each frame, the mask indicating a plurality of face candidate regions; and
a filter configured and arranged to receive the mask and remove an indication of at least one among the plurality of face candidate regions, wherein the apparatus has a normal operational mode in which the apparatus tracks face candidate regions for each frame, and a standby mode in which the apparatus performs face-tracking operations only on a portion of the plurality of frames, said standby mode becoming operational when a specified number of consecutive frames have been processed without any face candidate regions being located.
9. The apparatus of claim 8, wherein said mapper produces the score map according to at least a set of predetermined relations, each among said set of predetermined relations associating one among a range of possible pixel values with a flesh-tone probability value.
10. The apparatus of claim 8, wherein the mask comprises a plurality of entries, and wherein each among the plurality of entries corresponds to a plurality of pixels.
11. The apparatus of claim 8, wherein the score map comprises a plurality of entries, and wherein said segmenter produces the mask from the score map at least in part by locating the entry having the highest flesh-tone probability value.
12. The apparatus of claim 11, wherein said segmenter further produces the mask from the score map by growing a region, the region comprising entries connected to the located entry and having a flesh-tone probability value higher than a predetermined threshold.
14. The apparatus of claim 8, wherein said filter tests at least one among a size, a position, an aspect ratio, an intra-region flatness, a shape, and a contrast between interior and boundary areas of at least one among the plurality of face candidate regions.
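Two of the tests listed in claim 14, size and aspect ratio, can be sketched as a simple geometric filter. The `Region` bounding-box type, the function names, and every numeric limit below are hypothetical; the claim does not specify thresholds, and the remaining tests (position, flatness, shape, contrast) are omitted here.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical bounding box of one face candidate region."""
    x: int
    y: int
    width: int
    height: int


def passes_tests(region, min_area=16, min_aspect=0.8, max_aspect=2.0):
    """Apply a size test and an aspect-ratio (height / width) test."""
    if region.width <= 0 or region.height <= 0:
        return False
    area = region.width * region.height
    aspect = region.height / region.width
    return area >= min_area and min_aspect <= aspect <= max_aspect


def filter_candidates(regions, **limits):
    """Remove the indication of every candidate that fails a test."""
    return [region for region in regions if passes_tests(region, **limits)]
```

For example, a 10 by 12 region passes with the default limits, a 2 by 2 region is removed as too small, and a 40 by 5 region is removed as too wide to be a face.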
13. An apparatus comprising:
a mapper configured and arranged to receive a frame of data and produce a score map based at least in part on the frame of data;
a segmenter configured and arranged to receive the score map and produce a mask, the mask indicating a plurality of face candidate regions; and
a filter configured and arranged to receive the mask and remove an indication of at least one among the plurality of face candidate regions, said filter being a temporal filter configured and arranged to receive the mask and produce a temporally filtered mask, wherein the temporally filtered mask comprises a plurality of entries, and wherein said temporal filter obtains an average of each entry and a corresponding entry of at least one temporally distinct mask, wherein each among said at least one temporally distinct mask is produced from a corresponding one among a plurality of temporally distinct score maps, each among said temporally distinct score maps being constructed based at least in part on a corresponding one among a plurality of temporally distinct frames of video data, and wherein said plurality of temporally distinct frames of video data includes at least one among a set of previous frames of video data and a set of subsequent frames of video data.
15. A system comprising:
an apparatus including a mapper configured and arranged to receive a frame of data and produce a score map based at least in part on the frame of data;
a segmenter configured and arranged to receive the score map and produce a mask, the mask indicating a plurality of face candidate regions; and
a filter configured and arranged to receive the mask and remove an indication of at least one among the plurality of face candidate regions, and a video encoder configured and arranged to receive the frame of data and the mask and produce an encoded stream representing the frame of data, said encoded stream comprising a plurality of bits, wherein, in accordance at least with said mask, the video encoder is further configured and arranged to allocate a disproportionate number of said plurality of bits to at least one part of the encoded stream that represents an area of the frame of data that corresponds to one among the plurality of face candidate regions.
16. The system of claim 15, wherein said mapper produces the score map according to at least a set of predetermined relations, each among said set of predetermined relations associating one among a range of possible pixel values with a flesh-tone probability value.
17. The system of claim 15, wherein the score map comprises a plurality of entries, and wherein said segmenter produces the mask from the score map at least in part by locating the entry having the highest flesh-tone probability value.
18. The system of claim 17, wherein said segmenter further produces the mask from the score map by growing a region, the region comprising entries connected to the located entry and having a flesh-tone probability value higher than a predetermined threshold.
19. The system of claim 15, said filter further comprising a temporal filter, wherein the mask comprises a plurality of entries, and wherein said temporal filter obtains an average of each entry and a corresponding entry of at least one temporally distinct mask, wherein each among said at least one temporally distinct mask is produced from a corresponding one among a plurality of temporally distinct score maps, each among said temporally distinct score maps being constructed based at least in part on a corresponding one among a plurality of temporally distinct frames of video data, and wherein said plurality of temporally distinct frames of video data includes at least one among a set of previous frames of video data and a set of subsequent frames of video data.
20. The system of claim 15, wherein said filter tests at least one among a size, a position, an aspect ratio, an intra-region flatness, a shape, and a contrast between interior and boundary areas of at least one among the plurality of face candidate regions.
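The mask-driven bit allocation recited in claim 15 can be illustrated with a toy budget split. This is a sketch only: `allocate_bits` and `face_weight` are hypothetical names, and a real video encoder would more likely steer bits indirectly, for instance through per-block quantizer settings, than by assigning a bit count to each block directly.

```python
def allocate_bits(mask, total_bits, face_weight=4.0):
    """Split `total_bits` across mask entries (e.g. blocks of pixels),
    giving entries inside a face candidate region `face_weight` times
    the share of entries outside one."""
    weights = [[face_weight if entry else 1.0 for entry in row] for row in mask]
    total_weight = sum(sum(row) for row in weights)
    return [[total_bits * w / total_weight for w in row] for row in weights]
```

With a 2 by 2 mask containing one face entry, a budget of 700 bits, and the default weight, the face entry receives 400 bits and each of the other three receives 100.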
21. A system comprising:
a camera configured and arranged to output frames of data over successive periods of time;
an apparatus for performing face-tracking operations including a mapper configured and arranged to receive a plurality of the frames of data and produce a plurality of score maps, each among said plurality of score maps based at least in part on a corresponding one among the plurality of frames of data;
a segmenter configured and arranged to receive the plurality of score maps and produce a corresponding plurality of masks, each among said plurality of masks indicating a corresponding plurality of face candidate regions;
a filter configured and arranged to receive the plurality of masks and remove an indication of at least one among the corresponding plurality of face candidate regions indicated by at least one among the plurality of masks, wherein the apparatus has a normal operational mode in which the apparatus tracks face candidate regions for each frame, and a standby mode in which the apparatus performs face-tracking operations only on a portion of the plurality of frames, said standby mode becoming operational when a specified number of consecutive frames have been processed without any face candidate regions being located; and
a camera control unit configured and arranged to receive the plurality of masks and control at least one movement of the camera, wherein said at least one movement of the camera is responsive to at least one comparison of (A) a face candidate region indicated in a mask corresponding to one frame of data to (B) a face candidate region indicated in a mask corresponding to another frame of data.
22. The system of claim 21, wherein said mapper produces the plurality of score maps according to at least a set of predetermined relations, each among said set of predetermined relations associating one among a range of possible pixel values with a flesh-tone probability value.
23. The system of claim 21, wherein each among the plurality of score maps comprises a plurality of entries, and wherein said segmenter produces each among the plurality of masks from a corresponding score map at least in part by locating the entry having the highest flesh-tone probability value.
24. The system of claim 23, wherein said segmenter further produces each among the plurality of masks from a corresponding score map by growing a region, the region comprising entries connected to the located entry and having a flesh-tone probability value higher than a predetermined threshold.
26. The system of claim 21, wherein said filter tests at least one among a size, a position, an aspect ratio, an intra-region flatness, a shape, and a contrast between interior and boundary areas of at least one among the plurality of face candidate regions.
25. A system comprising:
a camera configured and arranged to output frames of data over successive periods of time;
an apparatus including:
a mapper configured and arranged to receive a plurality of the frames of data and produce a plurality of score maps, each among said plurality of score maps based at least in part on a corresponding one among the plurality of frames of data;
a segmenter configured and arranged to receive the plurality of score maps and produce a corresponding plurality of masks, each among said plurality of masks indicating a corresponding plurality of face candidate regions; and
a filter configured and arranged to receive the plurality of masks and remove an indication of at least one among the corresponding plurality of face candidate regions indicated by at least one among the plurality of masks, a temporal filter configured and arranged to receive the plurality of masks and produce a corresponding plurality of temporally filtered masks, wherein each among the plurality of masks comprises a plurality of entries, wherein each among the plurality of temporally filtered masks comprises a plurality of temporally filtered entries, wherein each among the plurality of entries in one among the plurality of masks corresponds to one among the plurality of entries in each of the others among the plurality of masks, and wherein said temporal filter obtains an average of each among the plurality of entries in one among the plurality of masks and the corresponding one among the plurality of entries in at least one of the others among the plurality of masks; and
a camera control unit configured and arranged to receive the plurality of masks and control at least one movement of the camera, wherein said at least one movement of the camera is responsive to at least one comparison of (A) a face candidate region indicated in a mask corresponding to one frame of data to (B) a face candidate region indicated in a mask corresponding to another frame of data.
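The comparison of a face candidate region in one mask to a region in another mask, recited for the camera control unit, can be sketched as a centroid displacement. Everything here is an assumption made for illustration: the claim does not say what is compared or how the result maps to camera movement, the names `centroid` and `camera_move` are hypothetical, and the sketch treats each mask as indicating a single region.

```python
def centroid(mask):
    """Return the (row, column) centroid of the set entries, or None
    when the mask indicates no region."""
    points = [(r, c) for r, row in enumerate(mask)
              for c, entry in enumerate(row) if entry]
    if not points:
        return None
    count = len(points)
    return (sum(r for r, _ in points) / count,
            sum(c for _, c in points) / count)


def camera_move(prev_mask, curr_mask, dead_zone=1.0):
    """Compare region positions in two masks and return (pan, tilt) in
    mask-entry units; shifts within `dead_zone` are ignored."""
    before, after = centroid(prev_mask), centroid(curr_mask)
    if before is None or after is None:
        return (0.0, 0.0)
    d_row, d_col = after[0] - before[0], after[1] - before[1]
    pan = d_col if abs(d_col) > dead_zone else 0.0
    tilt = d_row if abs(d_row) > dead_zone else 0.0
    return (pan, tilt)
```

The dead zone is a design choice that keeps the camera from chasing small frame-to-frame jitter in the mask.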
27. An apparatus comprising a data storage medium, said data storage medium having machine-readable code stored thereon, the machine-readable code including instructions executable by an array of logic elements, the instructions defining a method including:
receiving a plurality of frames of video data;
a first operating mode, including constructing a score map based at least in part on a frame of video data, producing a mask for each frame from said score map, said mask indicating a plurality of face candidate regions, and filtering said mask to remove an indication of at least one among the plurality of face candidate regions; and
a standby mode, wherein the first operating mode is practiced to track face candidate regions for each frame, and the standby mode performs face-tracking operations only on a portion of the frames of video data, said standby mode being operational when a specified number of consecutive frames have been processed without any face candidate regions being located.
28. The apparatus of claim 27, wherein said constructing a score map is based at least in part on a set of predetermined relations, each among said set of predetermined relations associating one among a range of possible pixel values with a flesh-tone probability value.
29. The apparatus of claim 27, wherein said mask comprises a plurality of entries, and wherein each among said plurality of entries corresponds to a plurality of pixels.
30. The apparatus of claim 27, wherein said score map comprises a plurality of entries, and wherein said producing a mask from said score map comprises locating the entry having the highest flesh-tone probability value.
31. The apparatus of claim 30, wherein said producing a mask from said score map further comprises growing a region, said region comprising entries connected to the located entry and having a flesh-tone probability value higher than a predetermined threshold.
33. The apparatus of claim 27, wherein said filtering said mask to remove an indication of at least one among the plurality of face candidate regions comprises testing at least one among a size, a position, an aspect ratio, an intra-region flatness, a shape, and a contrast between interior and boundary areas of at least one among the plurality of face candidate regions.
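The switch between the first operating mode and the standby mode recited in claim 27 can be sketched as a small controller. The class name, `miss_limit`, and `standby_stride` are hypothetical. Returning to the first operating mode as soon as a candidate is found is also an assumption; the claim states only when standby becomes operational.

```python
class ModeController:
    """Tracks consecutive frames without candidates and decides which
    frames to process."""

    def __init__(self, miss_limit=3, standby_stride=5):
        self.miss_limit = miss_limit          # misses before standby
        self.standby_stride = standby_stride  # process 1 in N frames in standby
        self.misses = 0
        self.frame_index = 0
        self.standby = False

    def should_process(self):
        """Return True when the current frame should be face-tracked."""
        process = (not self.standby
                   or self.frame_index % self.standby_stride == 0)
        self.frame_index += 1
        return process

    def report(self, num_candidates):
        """Record the result of processing one frame."""
        if num_candidates > 0:
            self.misses = 0
            self.standby = False  # assumed: any candidate ends standby
        else:
            self.misses += 1
            if self.misses >= self.miss_limit:
                self.standby = True
```

A caller would invoke `should_process()` once per incoming frame and call `report()` only for the frames it actually processed, so that skipped frames do not count toward the miss total.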
32. An apparatus comprising a data storage medium, said data storage medium having machine-readable code stored thereon, the machine-readable code including instructions executable by an array of logic elements, the instructions defining a method including:
receiving a frame of video data;
constructing a score map based at least in part on said frame of data;
producing a mask from said score map, said mask indicating a plurality of face candidate regions; and
filtering said mask to remove an indication of at least one among the plurality of face candidate regions, said filtering including temporal filtering said mask, wherein said mask comprises a plurality of entries, wherein said temporal filtering said mask comprises obtaining an average of each entry and a corresponding entry of at least one temporally distinct mask, wherein each among said at least one temporally distinct mask is produced from a corresponding one among a plurality of temporally distinct score maps, each among said temporally distinct score maps being constructed based at least in part on a corresponding one among a plurality of temporally distinct frames of video data, and wherein said plurality of temporally distinct frames of video data includes at least one among a set of previous frames of video data and a set of subsequent frames of video data.
Specification