SYSTEM AND METHOD FOR ENHANCING SPEECH ACTIVITY DETECTION USING FACIAL FEATURE DETECTION

US 20160189733A1
Filed: 03/08/2016
Published: 06/30/2016
Est. Priority Date: 07/18/2011
Status: Active Grant

Abstract

Disclosed herein are systems, methods, and non-transitory computer-readable storage media for processing audio. A system configured to practice the method monitors, via a processor of a computing device, an image feed of a user interacting with the computing device and identifies an audio start event in the image feed based on face detection of the user looking at the computing device or a specific region of the computing device. The image feed can be a video stream. The audio start event can be based on a head size, orientation or distance from the computing device, eye position or direction, device orientation, mouth movement, and/or other user features. Then the system initiates processing of a received audio signal based on the audio start event. The system can also identify an audio end event in the image feed and end processing of the received audio signal based on the end event.
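As an illustration only, the start and end logic described in the abstract can be sketched as a small state machine over per-frame observations. The `FrameObservation` type and the `detect_audio_events` function are hypothetical names, not taken from the application, and the particular rule used here (gaze plus mouth movement to start, loss of either cue to end) is an assumed combination of the cues the abstract lists.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class FrameObservation:
    """Hypothetical per-frame output of a face analyzer (assumed, not from the source)."""
    looking_at_device: bool
    mouth_moving: bool


def detect_audio_events(frames: Iterable[FrameObservation]) -> List[Tuple[str, int]]:
    """Return ("start", i) and ("end", i) events keyed by frame index.

    Assumed rule: a start event fires on the first frame where the user is
    both looking at the device and moving their mouth; an end event fires on
    the first later frame where either cue is absent.
    """
    events: List[Tuple[str, int]] = []
    active = False
    for index, frame in enumerate(frames):
        attending = frame.looking_at_device and frame.mouth_moving
        if attending and not active:
            events.append(("start", index))
            active = True
        elif active and not attending:
            events.append(("end", index))
            active = False
    return events


if __name__ == "__main__":
    feed = [
        FrameObservation(False, False),
        FrameObservation(True, False),
        FrameObservation(True, True),
        FrameObservation(True, True),
        FrameObservation(False, False),
    ]
    print(detect_audio_events(feed))
```

A practical detector would likely smooth these cues over several frames so that a blink or a brief pause between words does not end processing; that smoothing is omitted here for brevity.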

40 Claims

1-20. (canceled)

21. A method, comprising:
- analyzing, by a system including a processor, video content captured by a camera, wherein the analyzing comprises detecting a user and a computing device depicted in the video content;
  
  responsive to a first determination by the system that the user is looking at the computing device and a second determination by the system that a mouth of the user is moving, determining, by the system, an audio start event of a voice signal of the user, wherein the first and second determinations are based on the analyzing of the video content and the detecting of the user and the computing device; and
  
  responsive to the audio start event, initiating processing of the voice signal.
- Dependent Claims (22, 23, 24, 25, 26, 27, 28, 29, 30, 31)
  - 22. The method of claim 21, wherein the computing device is a mobile computing device.
  - 23. The method of claim 21, wherein the video content shows the user looking at a specific region of the computing device.
  - 24. The method of claim 21, wherein the determining of the audio start event is based on head orientation, eye position, eye direction, device orientation, or any combination thereof.
  - 25. The method of claim 21, further comprising:
    - determining an audio end event based on the analyzing of the video content; and
      
      causing the computing device to cease the processing of the voice signal based on the audio end event.
  - 26. The method of claim 25, wherein the determining of the audio end event is based on a third determination by the system that the video content shows the user looking away from the computing device.
  - 27. The method of claim 25, wherein the determining of the audio end event is based on a third determination by the system that the video content shows the mouth of the user is not moving.
  - 28. The method of claim 21, wherein the processing of the voice signal is by the computing device, and wherein the processor comprises multiple processors operating in a distributed processing environment.
  - 29. The method of claim 21, wherein the processing of the voice signal comprises performing speech recognition of the voice signal.
  - 30. The method of claim 21, wherein the processing of the voice signal comprises transmitting the voice signal to a second device separate from the computing device.
  - 31. The method of claim 21, further comprising ignoring a portion of an audio signal that is received prior to the audio start event, the audio signal including the voice signal.
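Claims 25 and 31 together imply that only the audio between the start event and the end event is processed. A minimal sketch of that gating follows. It assumes that the audio and video streams share a time origin and that events are expressed as video frame indices; the function name `gate_audio` and all of its parameters are hypothetical and do not appear in the claims.

```python
from typing import List, Optional, Sequence


def gate_audio(
    samples: Sequence[float],
    sample_rate: int,
    frame_rate: float,
    start_frame: int,
    end_frame: Optional[int] = None,
) -> List[float]:
    """Keep only the audio between a visual start event and an optional end event.

    Assumptions (not from the claims): sample 0 and video frame 0 coincide in
    time, and both rates are constant.
    """
    if sample_rate <= 0 or frame_rate <= 0:
        raise ValueError("sample_rate and frame_rate must be positive")

    # Convert frame indices to sample indices through elapsed time in seconds.
    start = int(round(start_frame / frame_rate * sample_rate))
    if end_frame is None:
        end = len(samples)
    else:
        end = int(round(end_frame / frame_rate * sample_rate))

    # Clamp to the buffer so out-of-range events yield an empty or shortened slice.
    start = max(0, min(start, len(samples)))
    end = max(start, min(end, len(samples)))

    # Audio before `start` is ignored; audio from `end` onward is not processed.
    return list(samples[start:end])


if __name__ == "__main__":
    audio = list(range(100))  # stand-in for 10 seconds of audio at 10 samples/s
    print(gate_audio(audio, sample_rate=10, frame_rate=5.0, start_frame=2, end_frame=4))
```

The returned slice is what would be handed to the downstream step, for example the speech recognition of claim 29 or the transmission to a second device of claim 30.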

32. An apparatus comprising:
- a processor; and
  
  a computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising:
  
  analyzing video content captured by a camera;
  
  responsive to a first determination of a distance of a user to a screen of a computing device and a second determination that a mouth of the user is moving, determining an audio start event of a voice signal of the user, wherein the first and second determinations are based on the analyzing of the video content that depicts the user and the computing device; and
  
  responsive to the audio start event, initiating processing of the voice signal.
- Dependent Claims (33, 34, 35, 36)
  - 33. The apparatus of claim 32, wherein the computing device is a mobile computing device.
  - 34. The apparatus of claim 32, wherein the video content shows the user looking at a specific region of the computing device.
  - 35. The apparatus of claim 32, wherein the determining of the audio start event is based on head orientation, eye position, eye direction, device orientation, or any combination thereof, and wherein the processor comprises multiple processors operating in a distributed processing environment.
  - 36. The apparatus of claim 32, wherein the operations further comprise ignoring a portion of an audio signal that is received prior to the audio start event, the audio signal including the voice signal.
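Claim 32 does not say how the distance determination is made. One possibility, consistent with the abstract's mention of head size, is a pinhole-camera estimate from the width of the detected face in pixels. The sketch below assumes a camera mounted at the screen, a known focal length in pixels, and a nominal face width of 0.15 m; these values, the 0.6 m threshold, and the function names are illustrative assumptions and are not taken from the application.

```python
def estimate_distance_m(
    face_width_px: float,
    focal_length_px: float,
    assumed_face_width_m: float = 0.15,
) -> float:
    """Estimate user-to-camera distance with the pinhole model.

    distance = focal_length_px * real_width / pixel_width.
    The nominal face width is an assumption, so the result is approximate.
    """
    if face_width_px <= 0 or focal_length_px <= 0:
        raise ValueError("face_width_px and focal_length_px must be positive")
    return focal_length_px * assumed_face_width_m / face_width_px


def is_audio_start(
    face_width_px: float,
    mouth_moving: bool,
    focal_length_px: float,
    max_distance_m: float = 0.6,
) -> bool:
    """Combine the two determinations of claim 32 under the assumed distance rule.

    First determination: the estimated distance is within `max_distance_m`.
    Second determination: the mouth is moving (supplied by a separate detector).
    """
    close_enough = estimate_distance_m(face_width_px, focal_length_px) <= max_distance_m
    return close_enough and mouth_moving


if __name__ == "__main__":
    print(estimate_distance_m(face_width_px=300, focal_length_px=1000))
    print(is_audio_start(face_width_px=300, mouth_moving=True, focal_length_px=1000))
```

A threshold on estimated distance is only one way to read "a first determination of a distance"; a change in distance, such as the user raising the device toward their face, would fit the claim language equally well.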

37. A computer-readable storage device having instructions which, when executed by a processor, cause the processor to perform operations comprising:
- analyzing video content captured by a camera, the video content depicting a user and a computing device;
  
  responsive to a first determination that the user is looking at the computing device and a second determination that a mouth of the user is moving, determining an audio start event of a voice signal of the user, wherein the first and second determinations are based on the analyzing of the video content; and
  
  responsive to the audio start event, initiating processing of the voice signal.
- Dependent Claims (38, 39, 40)
  - 38. The computer-readable storage device of claim 37, wherein the processor comprises multiple processors operating in a distributed processing environment, and wherein the operations further comprise:
    - ignoring a portion of an audio signal that is received prior to the audio start event, the audio signal including the voice signal.
  - 39. The computer-readable storage device of claim 37, wherein the video content shows the user looking at a specific region of the computing device.
  - 40. The computer-readable storage device of claim 37, wherein the determining of the audio start event is based on head orientation, eye position, eye direction, device orientation, or any combination thereof.

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
AT&T Intellectual Property I LP (AT&T, Inc.)
Inventors
VASILIEFF, BRANT JAMESON; EHLEN, PATRICK JOHN; LIESKE, JAY HENRY Jr.

Granted Patent

US 10,109,300 B2
Field of Search
US Class Current

1/1
CPC Class Codes

G10L 15/20   Speech recognition techniqu...

G10L 25/57   for processing of video sig...

G10L 25/78   Detection of presence or ab...

H04N 1/00403   Voice input means, e.g. voi...

H04N 7/183   for receiving images from a...
