Speech recognition system adaptation based on non-acoustic attributes and face selection based on mouth motion using pixel intensities

US 9,899,025 B2
Filed: 07/02/2015
Issued: 02/20/2018
Est. Priority Date: 11/13/2014
Status: Active Grant

Abstract

Non-acoustic data from a vicinity of speech input is obtained. A subject speaker is identified as the source of the speech input from the obtained non-acoustic data by detecting mouth motion on one or more faces segmented from the non-acoustic data by comparing a first pixel intensity associated at a first time with a second pixel intensity at a second time, and selecting a face corresponding to the subject speaker from the one or more faces in response to a determination that a number of significantly changed pixels between the first pixel intensity and the second pixel intensity exceeds a threshold. A demographic is assigned to the subject speaker based on an analysis of one or more non-acoustic attributes of the subject speaker extracted from the non-acoustic data. The speech input is processed using a speech recognition system adjusted using a model selected based on the demographic.
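
The face selection described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the per-pixel intensity delta of 25, and the count threshold of 3 are hypothetical values chosen for illustration. The sketch assumes each face has already been segmented into aligned grayscale crops of equal size at the two times. When several faces exceed the threshold, it picks the one with the most changed pixels, which is also an assumption.

```python
from typing import Optional, Sequence

# A grayscale face crop: rows of pixel intensities in the range 0-255.
Frame = Sequence[Sequence[int]]


def count_changed_pixels(first: Frame, second: Frame, intensity_delta: int = 25) -> int:
    """Temporal differencing: count pixels whose intensity differs by more
    than intensity_delta between the first and second time."""
    changed = 0
    for row_a, row_b in zip(first, second):
        for a, b in zip(row_a, row_b):
            if abs(a - b) > intensity_delta:
                changed += 1
    return changed


def select_speaker_face(
    faces_t1: Sequence[Frame],
    faces_t2: Sequence[Frame],
    count_threshold: int = 3,
    intensity_delta: int = 25,
) -> Optional[int]:
    """Return the index of the face whose changed-pixel count exceeds
    count_threshold (the largest count wins), or None if no face qualifies."""
    best_index, best_count = None, count_threshold
    for index, (first, second) in enumerate(zip(faces_t1, faces_t2)):
        changed = count_changed_pixels(first, second, intensity_delta)
        if changed > best_count:
            best_index, best_count = index, changed
    return best_index


if __name__ == "__main__":
    # Two 3x4 face crops: one unchanged, one whose bottom row changes.
    still = [[100] * 4 for _ in range(3)]
    talking = [[100] * 4, [100] * 4, [160] * 4]
    print(select_speaker_face([still, still], [still, talking]))
```

Counting changed pixels, rather than summing raw differences, keeps a few very bright pixels from dominating the decision. Both thresholds would need tuning for real camera data.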

20 Claims

1. A method, comprising the steps of:
   - determining a vicinity from which speech input to a speech recognition system originates, wherein determining the vicinity comprises estimating a sound direction of a source of the speech input based on a signal processing method;
   - obtaining non-acoustic data from the vicinity of the speech input using one or more non-acoustic sensors, wherein obtaining the non-acoustic data comprises capturing visual data of the vicinity of the speech input;
   - identifying a subject speaker as the source of the speech input from the obtained non-acoustic data, wherein identifying the subject speaker comprises:
     - segmenting one or more faces from the captured visual data;
     - detecting mouth motion on the one or more faces, wherein detecting the mouth motion comprises applying temporal differencing on each of the one or more faces by comparing a first pixel intensity associated at a first time with a second pixel intensity at a second time; and
     - selecting a face corresponding to the subject speaker from the one or more faces in response to determining that a number of significantly changed pixels between the first pixel intensity and the second pixel intensity exceeds a threshold;
   - extracting one or more non-acoustic attributes associated with the subject speaker from the obtained non-acoustic data;
   - analyzing the one or more extracted non-acoustic attributes, and assigning at least one demographic to the subject speaker based on the analysis;
   - selecting at least one model for use by the speech recognition system based on the at least one demographic assigned to the subject speaker;
   - adjusting the speech recognition system using the at least one selected model; and
   - processing the speech input using the adjusted speech recognition system;
   - wherein the steps are performed by at least one processor device coupled to a memory.
2. The method of claim 1, wherein the one or more non-acoustic attributes comprise hair color.
3. The method of claim 2, wherein assigning the at least one demographic to the subject speaker comprises assigning an age demographic to the subject speaker based on the analysis of the hair color.
4. The method of claim 1, wherein analyzing the extracted one or more non-acoustic attributes further comprises mapping the selected face to a cluster to infer one or more characteristics of the subject speaker.
5. The method of claim 4, wherein the cluster comprises at least one of a gender cluster and an ethnicity cluster.
6. The method of claim 1, wherein the one or more non-acoustic sensors comprise a camera.
7. The method of claim 6, wherein the camera is a pan-tilt-zoom (PTZ) camera.
8. The method of claim 1, wherein segmenting the one or more faces from the captured visual data comprises using at least one face finding algorithm to find one or more likely humans in the vicinity.
9. The method of claim 8, wherein the at least one face finding algorithm comprises a Jones-Viola object detection framework.
10. The method of claim 1, wherein the one or more non-acoustic attributes comprise a height associated with the subject speaker.
11. The method of claim 10, wherein assigning the at least one demographic to the subject speaker comprises assigning an age demographic to the subject speaker based on the analysis of the height.
12. The method of claim 1, wherein obtaining the non-acoustic data comprises locating one or more head regions using at least one of the one or more non-acoustic sensors.
13. The method of claim 1, wherein the one or more non-acoustic attributes comprise one or more facial features of the subject speaker extracted from the selected face.
14. The method of claim 1, wherein the at least one model comprises an acoustic model.
15. The method of claim 1, wherein the at least one model comprises a language model.
16. The method of claim 1, wherein the direction of the source of the speech input is estimated based on a beamformer based method.
17. The method of claim 1, wherein the direction of the source of the speech input is estimated based on a time delay of arrival based method.
18. The method of claim 1, wherein the direction of the source of the speech input is estimated based on a spectrum estimation based method.
19. The method of claim 1, wherein the one or more non-acoustic sensors comprise a three-dimensional depth sensor.
20. The method of claim 1, wherein detecting the mouth motion comprises applying the temporal differencing on a lower third of each of the one or more faces.
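
Claim 17 names a time delay of arrival based method without specifying one. As an illustration only, the Python sketch below assumes two microphones, a far-field source, and a brute-force cross-correlation search for the delay. The function names, the 343 m/s speed of sound, and the sample rate and microphone spacing in the example are assumptions, not details taken from the patent.

```python
import math
from typing import Sequence

SPEED_OF_SOUND_M_S = 343.0  # assumed value, roughly dry air near 20 degrees C


def estimate_delay_samples(left: Sequence[float], right: Sequence[float], max_lag: int) -> int:
    """Return the lag (in samples) that maximizes the cross-correlation of the
    two microphone signals. A positive lag means the right signal is a delayed
    copy of the left one, so the sound reached the left microphone first."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, sample in enumerate(left):
            j = i + lag
            if 0 <= j < len(right):
                score += sample * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle_degrees(delay_samples: int, sample_rate_hz: float, mic_spacing_m: float) -> float:
    """Convert a delay to an angle from broadside under a far-field assumption:
    sin(angle) = speed_of_sound * delay_seconds / mic_spacing."""
    path_difference_m = SPEED_OF_SOUND_M_S * delay_samples / sample_rate_hz
    # Clamp so that noise-induced delays beyond the physical limit stay in asin's domain.
    ratio = max(-1.0, min(1.0, path_difference_m / mic_spacing_m))
    return math.degrees(math.asin(ratio))


if __name__ == "__main__":
    # Synthetic pulse that reaches the right microphone two samples later.
    left = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
    right = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]
    lag = estimate_delay_samples(left, right, max_lag=4)
    print(lag, delay_to_angle_degrees(lag, sample_rate_hz=16000, mic_spacing_m=0.1))
```

The resulting angle could then be used to point a camera at the vicinity of the speech input. A practical system would likely use more microphones and a noise-robust correlation method.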


Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Connell, II, Jonathan H.; Marcheret, Etienne
Primary Examiner(s): Kazeminezhad, Farzad
Application Number: US14/790,142
Publication Number: US 20160140964A1
Time in Patent Office: 964 Days
Field of Search: 704231
US Class Current
CPC Class Codes:
- G06F 3/016: Input arrangements with for...
- G06N 3/008: based on physical entities ...
- G06V 40/171: Local features and componen...
- G10L 15/07: to the speaker
- G10L 15/24: Speech recognition using no...
- G10L 15/25: using position of the lips,...
- G10L 2015/227: of the speaker; Human-fact...
- H04R 25/407: Circuits for combining sign...
