System and method for enhancing speech activity detection using facial feature detection

US 10,109,300 B2
Filed: 03/08/2016
Issued: 10/23/2018
Est. Priority Date: 07/18/2011
Status: Active Grant

Abstract

Disclosed herein are systems, methods, and non-transitory computer-readable storage media for processing audio. A system configured to practice the method monitors, via a processor of a computing device, an image feed of a user interacting with the computing device and identifies an audio start event in the image feed based on face detection of the user looking at the computing device or a specific region of the computing device. The image feed can be a video stream. The audio start event can be based on a head size, orientation or distance from the computing device, eye position or direction, device orientation, mouth movement, and/or other user features. Then the system initiates processing of a received audio signal based on the audio start event. The system can also identify an audio end event in the image feed and end processing of the received audio signal based on the end event.
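
The start/end behaviour described in the abstract can be sketched as a small state machine over per-frame observations. This is only an illustrative sketch: the `FrameObservation` type, the `detect_audio_events` function, and the specific rule used here (start when the user is looking at the device and the mouth is moving, end when the user looks away) are assumptions chosen for the example, not the implementation disclosed in the patent. The per-frame flags are assumed to come from some upstream face and feature detector that is not shown.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class FrameObservation:
    """Per-frame output of a hypothetical face/feature detector."""
    looking_at_device: bool
    mouth_moving: bool


def detect_audio_events(frames: Iterable[FrameObservation]) -> List[Tuple[int, str]]:
    """Return (frame_index, "start" | "end") events for a frame sequence.

    Assumed rule: a start event fires on the first frame where the user is
    looking at the device and the mouth is moving; an end event fires on the
    first later frame where the user is no longer looking at the device.
    """
    events: List[Tuple[int, str]] = []
    listening = False
    for index, frame in enumerate(frames):
        if not listening and frame.looking_at_device and frame.mouth_moving:
            events.append((index, "start"))
            listening = True
        elif listening and not frame.looking_at_device:
            events.append((index, "end"))
            listening = False
    return events
```

Audio processing would then be initiated at each "start" event and ended at the matching "end" event. Other cues named in the abstract (head size, device orientation, eye direction) could be folded into the same per-frame flags.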

19 Claims

1. A method, comprising:
- capturing video content by a camera;
  
  analyzing, by a system including a processor, the video content, wherein the analyzing comprises detecting a user is depicted in the video content and detecting that a computing device is also depicted in the video content;
  
  responsive to a first determination by the system that the user is looking at the computing device and a second determination by the system that a mouth of the user is moving, determining, by the system, an audio start event of a voice signal of the user, wherein the first and second determinations are based on the analyzing of the video content and the detecting of the user and the computing device; and
  
  responsive to the audio start event, initiating processing of the voice signal.
  - 2. The method of claim 1, wherein the computing device is a mobile computing device.
  - 3. The method of claim 1, wherein the video content shows the user looking at a specific region of the computing device.
  - 4. The method of claim 1, wherein the determining of the audio start event is based on head orientation, eye position, eye direction, device orientation, or any combination thereof.
  - 5. The method of claim 1, further comprising:
    - determining an audio end event based on the analyzing of the video content; and
      
      causing the computing device to cease the processing of the voice signal based on the audio end event.
  - 6. The method of claim 5, wherein the determining of the audio end event is based on a third determination by the system that the video content shows the user looking away from the computing device.
  - 7. The method of claim 1, wherein the processing of the voice signal is by the computing device.
  - 8. The method of claim 1, wherein the processing of the voice signal comprises performing speech recognition of the voice signal.
  - 9. The method of claim 1, wherein the processing of the voice signal comprises transmitting the voice signal to a second device separate from the computing device.
  - 10. The method of claim 1, further comprising ignoring a portion of an audio signal that is received prior to the audio start event, the audio signal including the voice signal.
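
Claim 10's step of ignoring audio received before the audio start event can be illustrated by mapping the video frame at which the start event was determined onto an audio sample index. The sketch below is hypothetical: the function names, the assumption that the audio and video streams share a common time origin, and the use of a plain list of samples are all choices made for the example and are not taken from the patent.

```python
from typing import List, Sequence


def start_sample_index(start_frame: int, video_fps: float, audio_sample_rate: int) -> int:
    """Map a video frame index to the index of the first audio sample to keep.

    Assumes audio and video start at the same instant (an assumption for this
    sketch; a real system would need explicit stream synchronisation).
    """
    if video_fps <= 0 or audio_sample_rate <= 0:
        raise ValueError("video_fps and audio_sample_rate must be positive")
    start_time_s = start_frame / video_fps
    return int(round(start_time_s * audio_sample_rate))


def drop_audio_before_start(
    samples: Sequence[float],
    start_frame: int,
    video_fps: float,
    audio_sample_rate: int,
) -> List[float]:
    """Return only the samples at or after the audio start event."""
    first = start_sample_index(start_frame, video_fps, audio_sample_rate)
    return list(samples[first:])
```

For example, with 30 frames per second video and 16 kHz audio, a start event determined at frame 30 corresponds to one second into the stream, so the first 16,000 samples would be discarded before speech recognition or transmission.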

11. An apparatus comprising:
- a processor; and
  
  a non-transitory computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising;
  
  analyzing video content captured by a camera to determine that the video content includes a depiction of a user and that the video content includes a depiction of a computing device;
  
  responsive to a first determination of a distance of the user to a screen of the computing device and a second determination that a mouth of the user is moving, determining an audio start event of a voice signal of the user, wherein the first and second determinations are based on the analyzing of the video content that depicts the user and the computing device; and
  
  responsive to the audio start event, initiating processing of the voice signal.
  - 12. The apparatus of claim 11, wherein the computing device is a mobile computing device.
  - 13. The apparatus of claim 11, wherein the video content shows the user looking at a specific region of the computing device.
  - 14. The apparatus of claim 11, wherein the determining of the audio start event is based on head orientation, eye position, eye direction, device orientation, or any combination thereof.
  - 15. The apparatus of claim 11, wherein the operations further comprise ignoring a portion of an audio signal that is received prior to the audio start event, the audio signal including the voice signal.
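
Claim 11 bases the audio start event on a distance determination together with a mouth-movement determination. One way this combination might look in code is sketched below. Everything specific here is an assumption made for illustration: the claim does not say how distance is measured or that it is compared against a threshold, and the lip-aperture measure, the function names, and the default values `max_distance_m` and `min_range` are hypothetical. The distance and the per-frame lip aperture values are assumed to be supplied by an upstream video analysis step.

```python
from typing import Sequence


def mouth_is_moving(apertures: Sequence[float], min_range: float = 0.05) -> bool:
    """Return True if lip aperture varies by at least min_range over the window.

    `apertures` holds one normalised lip-opening value per recent frame
    (a hypothetical measure, e.g. lip gap divided by face height).
    """
    if len(apertures) < 2:
        return False
    return max(apertures) - min(apertures) >= min_range


def is_audio_start_event(
    distance_m: float,
    apertures: Sequence[float],
    max_distance_m: float = 0.6,
    min_range: float = 0.05,
) -> bool:
    """Combine the distance and mouth-movement determinations.

    Assumed rule: the user must be within max_distance_m of the screen
    and the mouth must be moving over the observed window.
    """
    close_enough = 0.0 <= distance_m <= max_distance_m
    return close_enough and mouth_is_moving(apertures, min_range)
```

Using the range of the aperture over a short window, rather than a single frame, is a simple way to tell a moving mouth from one that is merely open; a production system would likely need something more robust.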

16. A non-transitory computer-readable storage device having instructions which, when executed by a processor, cause the processor to perform operations comprising:
- analyzing video content captured by a camera, the video content depicting a user in the video content and a computing device depicted in the video content;
  
  responsive to a first determination that the user is looking at the computing device and a second determination that a mouth of the user is moving, determining an audio start event of a voice signal of the user, wherein the first and second determinations are based on the analyzing of the video content; and
  
  responsive to the audio start event, initiating processing of the voice signal.
  - 17. The non-transitory computer-readable storage device of claim 16, wherein the operations further comprise:
    - ignoring a portion of an audio signal that is received prior to the audio start event, the audio signal including the voice signal.
  - 18. The non-transitory computer-readable storage device of claim 16, wherein the video content shows the user looking at a specific region of the computing device.
  - 19. The non-transitory computer-readable storage device of claim 16, wherein the determining of the audio start event is based on head orientation, eye position, eye direction, device orientation, or any combination thereof.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Vasilieff, Brant Jameson; Ehlen, Patrick John; Lieske, Jay Henry
Primary Examiner(s): Kalapodas, Dramos I
Application Number: US15/063,928
Publication Number: US 20160189733A1
Time in Patent Office: 959 Days
CPC Class Codes

G10L 15/20   Speech recognition techniqu...

G10L 25/57   for processing of video sig...

G10L 25/78   Detection of presence or ab...

H04N 1/00403   Voice input means, e.g. voi...

H04N 7/183   for receiving images from a...
