System and method for enhancing speech activity detection using facial feature detection
First Claim
1. A method, comprising:
capturing video content by a camera;
analyzing, by a system including a processor, the video content, wherein the analyzing comprises detecting a user is depicted in the video content and detecting that a computing device is also depicted in the video content;
responsive to a first determination by the system that the user is looking at the computing device and a second determination by the system that a mouth of the user is moving, determining, by the system, an audio start event of a voice signal of the user, wherein the first and second determinations are based on the analyzing of the video content and the detecting of the user and the computing device; and
responsive to the audio start event, initiating processing of the voice signal.
Abstract
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for processing audio. A system configured to practice the method monitors, via a processor of a computing device, an image feed of a user interacting with the computing device and identifies an audio start event in the image feed based on face detection of the user looking at the computing device or a specific region of the computing device. The image feed can be a video stream. The audio start event can be based on a head size, orientation or distance from the computing device, eye position or direction, device orientation, mouth movement, and/or other user features. Then the system initiates processing of a received audio signal based on the audio start event. The system can also identify an audio end event in the image feed and end processing of the received audio signal based on the end event.
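The start/end gating described in the abstract can be sketched as a small state machine. This is only an illustrative sketch, not the patented implementation: the `FrameObservation` fields stand in for the output of a hypothetical video analyzer, and the `end_hold_frames` debounce (waiting a few inactive frames before emitting an end event) is an assumption that does not appear in the source.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class FrameObservation:
    """Hypothetical per-frame result of analyzing the video content."""
    user_detected: bool
    device_detected: bool
    looking_at_device: bool
    mouth_moving: bool


def speech_cues_present(obs: FrameObservation) -> bool:
    # Both determinations (gaze at device, mouth movement) must hold,
    # and both the user and the device must be depicted in the frame.
    return (obs.user_detected and obs.device_detected
            and obs.looking_at_device and obs.mouth_moving)


def detect_audio_events(frames: Iterable[FrameObservation],
                        end_hold_frames: int = 3) -> List[Tuple[str, int]]:
    """Return ("start", i) / ("end", i) events keyed by frame index.

    end_hold_frames is an assumed debounce: an end event is emitted only
    after that many consecutive frames without the speech cues.
    """
    events: List[Tuple[str, int]] = []
    processing = False
    inactive_run = 0
    for index, obs in enumerate(frames):
        active = speech_cues_present(obs)
        if not processing:
            if active:
                processing = True      # audio start event: begin processing
                inactive_run = 0
                events.append(("start", index))
        elif active:
            inactive_run = 0           # user is still speaking
        else:
            inactive_run += 1
            if inactive_run >= end_hold_frames:
                processing = False     # audio end event: stop processing
                events.append(("end", index))
    return events
```

A caller would start feeding the received audio signal to a speech processor on each "start" event and stop on the matching "end" event.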
19 Claims
1. A method, comprising:
capturing video content by a camera;
analyzing, by a system including a processor, the video content, wherein the analyzing comprises detecting a user is depicted in the video content and detecting that a computing device is also depicted in the video content;
responsive to a first determination by the system that the user is looking at the computing device and a second determination by the system that a mouth of the user is moving, determining, by the system, an audio start event of a voice signal of the user, wherein the first and second determinations are based on the analyzing of the video content and the detecting of the user and the computing device; and
responsive to the audio start event, initiating processing of the voice signal.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. An apparatus comprising:
a processor; and
a non-transitory computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising:
analyzing video content captured by a camera to determine that the video content includes a depiction of a user and that the video content includes a depiction of a computing device;
responsive to a first determination of a distance of the user to a screen of the computing device and a second determination that a mouth of the user is moving, determining an audio start event of a voice signal of the user, wherein the first and second determinations are based on the analyzing of the video content that depicts the user and the computing device; and
responsive to the audio start event, initiating processing of the voice signal.
Dependent claims: 12, 13, 14, 15
16. A non-transitory computer-readable storage device having instructions which, when executed by a processor, cause the processor to perform operations comprising:
analyzing video content captured by a camera, the video content depicting a user in the video content and a computing device depicted in the video content;
responsive to a first determination that the user is looking at the computing device and a second determination that a mouth of the user is moving, determining an audio start event of a voice signal of the user, wherein the first and second determinations are based on the analyzing of the video content; and
responsive to the audio start event, initiating processing of the voice signal.
Dependent claims: 17, 18, 19
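Claim 11 relies on a determination of the user's distance to the screen, and the abstract lists head size as one possible cue. One way such a distance might be estimated is the standard pinhole-camera relation between apparent and real width. The sketch below is an assumption for illustration only: the function names, the 15 cm average face width, and the 60 cm threshold are hypothetical and are not taken from the patent.

```python
def estimate_distance_cm(face_width_px: float,
                         focal_length_px: float,
                         real_face_width_cm: float = 15.0) -> float:
    """Estimate camera-to-face distance with the pinhole-camera model.

    distance = focal_length * real_width / apparent_width
    real_face_width_cm is an assumed average, not a value from the source.
    """
    if face_width_px <= 0:
        raise ValueError("face_width_px must be positive")
    return focal_length_px * real_face_width_cm / face_width_px


def within_speaking_range(face_width_px: float,
                          focal_length_px: float,
                          max_distance_cm: float = 60.0) -> bool:
    """Hypothetical first determination: is the user close enough to the screen?"""
    distance = estimate_distance_cm(face_width_px, focal_length_px)
    return distance <= max_distance_cm
```

In a full pipeline this check would be combined with the mouth-movement determination before an audio start event is raised.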
Specification