SYSTEM AND METHOD FOR ENHANCING SPEECH ACTIVITY DETECTION USING FACIAL FEATURE DETECTION

US 20130021459A1
Filed: 07/18/2011
Published: 01/24/2013
Est. Priority Date: 07/18/2011
Status: Active Grant

Abstract

Disclosed herein are systems, methods, and non-transitory computer-readable storage media for processing audio. A system configured to practice the method monitors, via a processor of a computing device, an image feed of a user interacting with the computing device and identifies an audio start event in the image feed based on face detection of the user looking at the computing device or a specific region of the computing device. The image feed can be a video stream. The audio start event can be based on a head size, orientation or distance from the computing device, eye position or direction, device orientation, mouth movement, and/or other user features. Then the system initiates processing of a received audio signal based on the audio start event. The system can also identify an audio end event in the image feed and end processing of the received audio signal based on the end event.
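
The start and end events described in the abstract can be pictured as a small state machine over the image feed. The following is only a minimal sketch under stated assumptions: `FrameObservation` and `find_audio_segments` are hypothetical names, the per-frame flags stand in for the output of whatever face detector is used, and the particular rule (start on looking plus mouth movement, end on looking away) is one illustrative choice among the cues the abstract lists.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class FrameObservation:
    """Hypothetical per-frame face-detection result (not defined in the patent)."""
    timestamp: float          # seconds since the image feed started
    looking_at_device: bool   # a face is detected and oriented toward the device
    mouth_moving: bool        # mouth movement is detected in this frame


def find_audio_segments(frames: Iterable[FrameObservation]) -> List[Tuple[float, float]]:
    """Return (start, end) times during which received audio would be processed."""
    segments: List[Tuple[float, float]] = []
    start: Optional[float] = None
    last_ts: Optional[float] = None
    for frame in frames:
        last_ts = frame.timestamp
        if start is None:
            # Assumed audio start event: user looks at the device while the mouth moves.
            if frame.looking_at_device and frame.mouth_moving:
                start = frame.timestamp
        elif not frame.looking_at_device:
            # Assumed audio end event: user looks away from the device.
            segments.append((start, frame.timestamp))
            start = None
    if start is not None and last_ts is not None:
        # The feed ended while audio was still being processed; close the segment.
        segments.append((start, last_ts))
    return segments
```

In this sketch the end event is of a different type from the start event (gaze only, rather than gaze plus mouth movement), which is one of the variations the claims allow.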

20 Claims

1. A method comprising:
- monitoring, via a processor of a computing device, an image feed of a user interacting with the computing device;
- identifying an audio start event in the image feed based on face detection of the user looking at the computing device; and
- based on the audio start event, initiating processing of a received audio signal.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
  - 2. The method of claim 1, wherein the image feed is a video stream.
  - 3. The method of claim 1, wherein the user looking at the computing device further comprises the user looking at a specific region of the computing device.
  - 4. The method of claim 1, wherein the audio start event is based on at least one of a head size of the user in the image feed, head orientation, head distance from the computing device, eye position, eye direction, device orientation, and other user features.
  - 5. The method of claim 1, wherein identifying the audio start event is further based on mouth movement of the user.
  - 6. The method of claim 1, further comprising:
    - identifying an audio end event in the image feed; and
    - ending processing of the received audio signal based on the end event.
  - 7. The method of claim 6, wherein identifying the audio end event in the image feed is based on at least one of the user looking away from the computing device, a linear model that combines events, and ending mouth movement of the user.
  - 8. The method of claim 6, wherein the audio end event is of a different type from the audio start event.
  - 9. The method of claim 1, wherein processing of the received audio signal comprises performing speech recognition of the received audio signal.
  - 10. The method of claim 1, wherein processing of the received audio signal occurs on a second device separate from the computing device.
  - 11. The method of claim 1, further comprising ignoring a portion of the received audio signal that is received prior to the audio start event.
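
Claims 6 and 11 together imply that only audio received between the start event and the end event is processed. As a rough illustration, assuming audio arrives as timestamped chunks (the `(time, bytes)` layout and the function name `chunks_to_process` are hypothetical, not taken from the patent), the filtering could look like this:

```python
from typing import List, Optional, Tuple

# Hypothetical chunk layout: (capture time in seconds, raw audio bytes).
AudioChunk = Tuple[float, bytes]


def chunks_to_process(
    chunks: List[AudioChunk],
    start_time: float,
    end_time: Optional[float] = None,
) -> List[AudioChunk]:
    """Keep only chunks captured at or after the start event and before the end event."""
    kept: List[AudioChunk] = []
    for ts, data in chunks:
        if ts < start_time:
            continue  # received prior to the audio start event: ignored
        if end_time is not None and ts >= end_time:
            continue  # received at or after the audio end event: not processed
        kept.append((ts, data))
    return kept
```

The kept chunks could then be handed to speech recognition, either locally or on a second device, as claims 9 and 10 describe.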

12. A system comprising:
- a processor;
- a display;
- a microphone;
- a camera; and
- a memory storing instructions for controlling the processor to perform steps comprising:
  - monitoring an image feed of a user received via the camera;
  - identifying an audio start event in the image feed based on face detection of the user looking at the display; and
  - initiating processing of an audio signal received via the microphone based on the audio start event.
- Dependent Claims (13, 14, 15, 16)
  - 13. The system of claim 12, wherein the image feed is a video stream.
  - 14. The system of claim 12, wherein the user looking at the display further comprises the user looking at a specific region of the display.
  - 15. The system of claim 12, wherein the audio start event is based on at least one of a head size of the user in the image feed, head orientation, head distance from the display, eye position, eye direction, device orientation, other user features, and a linear model that combines events.
  - 16. The system of claim 12, wherein identifying the audio start event is further based on mouth movement of the user.
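
Claim 15 mentions a linear model that combines events. The claims do not specify its form, so the following is only an assumed illustration: a weighted sum of per-frame cue values compared against a threshold. The cue names, weights, and threshold are all hypothetical values picked for the example.

```python
from typing import Dict

# Hypothetical weights and threshold, chosen only for illustration.
WEIGHTS: Dict[str, float] = {
    "looking_at_display": 0.5,
    "mouth_moving": 0.3,
    "head_close": 0.2,
}
THRESHOLD = 0.6


def start_score(cues: Dict[str, float]) -> float:
    """Weighted sum of cue values in [0, 1]; cues that are absent count as 0."""
    return sum(weight * cues.get(name, 0.0) for name, weight in WEIGHTS.items())


def is_audio_start(cues: Dict[str, float], threshold: float = THRESHOLD) -> bool:
    """Declare an audio start event when the combined score reaches the threshold."""
    return start_score(cues) >= threshold
```

With these example weights, looking at the display alone does not trigger a start event, but looking at the display combined with mouth movement does.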

17. A non-transitory computer-readable storage medium storing instructions which, when executed by a computing device, cause the computing device to perform steps comprising:
- monitoring an image feed of a user interacting with the computing device;
- identifying an audio start event in the image feed based on face detection of the user looking at the computing device; and
- initiating processing of a received audio signal based on the audio start event.
- Dependent Claims (18, 19, 20)
  - 18. The non-transitory computer-readable storage medium of claim 17, the instructions further comprising:
    - identifying an audio end event in the image feed; and
    - ending processing of the received audio signal based on the end event.
  - 19. The non-transitory computer-readable storage medium of claim 18, wherein identifying the audio end event in the image feed is based on at least one of the user looking away from the computing device and ending mouth movement of the user.
  - 20. The non-transitory computer-readable storage medium of claim 18, wherein the audio end event is of a different type from the audio start event.


Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: AT&T Intellectual Property I LP (AT&T, Inc.)
Inventors: Vasilieff, Brant Jameson; Ehlen, Patrick John; Lieske, Jay Henry Jr.

Granted Patent: US 9,318,129 B2
US Class Current: 348/77
CPC Class Codes
- G10L 15/20   Speech recognition techniqu...
- G10L 25/57   for processing of video sig...
- G10L 25/78   Detection of presence or ab...
- H04N 1/00403   Voice input means, e.g. voi...
- H04N 7/183   for receiving images from a...
