Speech recognition system adaptation based on non-acoustic attributes and face selection based on mouth motion using pixel intensities

US 9,881,610 B2
Filed: 11/13/2014
Issued: 01/30/2018
Est. Priority Date: 11/13/2014
Status: Active Grant

Assignments: 1
Petitions: 0

Abstract

Non-acoustic data from a vicinity of speech input is obtained. A subject speaker is identified as the source of the speech input from the obtained non-acoustic data by detecting mouth motion on one or more faces segmented from the non-acoustic data by comparing a first pixel intensity associated at a first time with a second pixel intensity at a second time, and selecting a face corresponding to the subject speaker from the one or more faces in response to a determination that a number of significantly changed pixels between the first pixel intensity and the second pixel intensity exceeds a threshold. A demographic is assigned to the subject speaker based on an analysis of one or more non-acoustic attributes of the subject speaker extracted from the non-acoustic data. The speech input is processed using a speech recognition system adjusted using a model selected based on the demographic.
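The mouth-motion test described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `intensity_delta` cutoff for a "significantly changed" pixel, the `count_threshold` value, and the rule of preferring the face with the most changed pixels when several exceed the threshold are all assumptions made for the example. Frames are taken to be equally sized grayscale crops of each segmented face's mouth region.

```python
from typing import Dict, Optional, Sequence, Tuple

# A grayscale crop: rows of pixel intensities in the range 0-255.
Frame = Sequence[Sequence[int]]


def count_changed_pixels(first: Frame, second: Frame, intensity_delta: int = 25) -> int:
    """Temporal differencing: count pixels whose intensity differs by more
    than intensity_delta (a hypothetical cutoff) between the two frames."""
    if len(first) != len(second) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(first, second)
    ):
        raise ValueError("frames must have identical dimensions")
    return sum(
        1
        for row_a, row_b in zip(first, second)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > intensity_delta
    )


def select_speaking_face(
    face_frames: Dict[str, Tuple[Frame, Frame]],
    intensity_delta: int = 25,
    count_threshold: int = 50,
) -> Optional[str]:
    """Return the id of the face whose changed-pixel count exceeds
    count_threshold; if several do, prefer the largest count (an assumption).
    Return None when no face exceeds the threshold."""
    best_id, best_count = None, count_threshold
    for face_id, (first, second) in face_frames.items():
        changed = count_changed_pixels(first, second, intensity_delta)
        if changed > best_count:
            best_id, best_count = face_id, changed
    return best_id
```

In practice both thresholds would likely need tuning for frame rate, crop size, and lighting; the defaults above are placeholders.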

44 Citations

20 Claims

1. An apparatus, comprising:
- a memory; and
  
  a processor operatively coupled to the memory and configured to:
  
  determine a vicinity from which speech input to a speech recognition system originates, wherein the determination of the vicinity comprises an estimation of a sound direction of a source of the speech input based on a signal processing method;
  
  obtain non-acoustic data from the vicinity of the speech input using one or more non-acoustic sensors, wherein, in the obtaining of the non-acoustic data, the processor is configured to capture visual data of the vicinity of the speech input;
  
  identify a subject speaker as the source of the speech input from the obtained non-acoustic data, wherein, in the identification of the subject speaker, the processor is configured to:
  
  segment one or more faces from the captured visual data;
  
  detect mouth motion on the one or more faces, wherein the detection of the mouth motion comprises an application of temporal differencing on each of the one or more faces by comparing a first pixel intensity associated at a first time with a second pixel intensity at a second time; and
  
  select a face corresponding to the subject speaker from the one or more faces in response to a determination that a number of significantly changed pixels between the first pixel intensity and the second pixel intensity exceeds a threshold;
  
  extract one or more non-acoustic attributes associated with the subject speaker from the obtained non-acoustic data;
  
  analyze the one or more non-acoustic attributes, and assign at least one demographic to the subject speaker based on the analysis;
  
  select at least one model for use by the speech recognition system based on the demographic assigned to the subject speaker;
  
  adjust the speech recognition system using the at least one selected model; and
  
  process the speech input using the adjusted speech recognition system.
  - 2. The apparatus of claim 1, wherein, in the segmentation of the one or more faces from the captured visual data, the processor is further configured to use at least one face finding algorithm to find one or more likely humans in the vicinity.
  - 3. The apparatus of claim 2, wherein the at least one face finding algorithm comprises a Jones-Viola object detection framework.
  - 4. The apparatus of claim 1, wherein, in the obtaining of the non-acoustic data, the processor is further configured to locate one or more head regions using the one or more non-acoustic sensors.
  - 5. The apparatus of claim 1, wherein the one or more non-acoustic attributes comprise one or more facial features of the subject speaker extracted from the selected face, and wherein the analysis of the extracted one or more non-acoustic attributes further comprises a mapping of the selected face to a cluster to infer one or more characteristics of the subject speaker.
  - 6. The apparatus of claim 1, wherein the at least one model comprises at least one of an acoustic model and a language model.
  - 7. The apparatus of claim 1, wherein the direction of the source of the speech input is estimated based on at least one of a beamformer based method, a time delay of arrival based method, and a spectrum estimation based method.
  - 8. The apparatus of claim 1, wherein the cluster comprises at least one of a gender cluster and an ethnicity cluster.
  - 9. The apparatus of claim 1, wherein the one or more non-acoustic attributes comprise hair color, and wherein, in the assignment of the at least one demographic to the subject speaker, the processor is further configured to assign an age demographic to the subject speaker based on the analysis of the hair color.
  - 10. The apparatus of claim 1, wherein the one or more non-acoustic attributes comprise a height associated with the subject speaker, and wherein, in the assignment of the at least one demographic to the subject speaker, the processor is further configured to assign an age demographic to the subject speaker based on the analysis of the height.
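Claim 7 names a time delay of arrival based method as one option for estimating the sound direction. A minimal sketch of that idea for a two-microphone array is shown below. The function names, the far-field assumption, the brute-force cross-correlation, and the default speed of sound are illustrative choices, not details taken from the patent.

```python
import math
from typing import Sequence


def best_lag(left: Sequence[float], right: Sequence[float], max_lag: int) -> int:
    """Return the lag (in samples) of `right` relative to `left` that
    maximizes their cross-correlation. A positive lag means the sound
    reached the left microphone first."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(
            left[i] * right[i + lag]
            for i in range(len(left))
            if 0 <= i + lag < len(right)
        )
        if score > best_score:
            best, best_score = lag, score
    return best


def direction_degrees(
    left: Sequence[float],
    right: Sequence[float],
    sample_rate_hz: float,
    mic_spacing_m: float,
    speed_of_sound_m_s: float = 343.0,  # assumed value for air at room temperature
) -> float:
    """Estimate the source angle from broadside under a far-field
    assumption: sin(theta) = c * tau / d."""
    # Lags beyond the acoustic travel time between microphones are not physical.
    max_lag = math.ceil(mic_spacing_m / speed_of_sound_m_s * sample_rate_hz)
    tau = best_lag(left, right, max_lag) / sample_rate_hz
    ratio = max(-1.0, min(1.0, speed_of_sound_m_s * tau / mic_spacing_m))
    return math.degrees(math.asin(ratio))
```

A positive angle here means the source is on the left microphone's side. The estimated direction could then be used to decide which part of the camera's view to treat as the vicinity of the speech input.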

11. An article of manufacture comprising a non-transitory computer readable storage medium for storing computer readable program code which, when executed, causes a computer to:
- determine a vicinity from which speech input to a speech recognition system originates, wherein the determination of the vicinity comprises an estimation of a sound direction of a source of the speech input based on a signal processing method;
  
  obtain non-acoustic data from the vicinity of the speech input using one or more non-acoustic sensors, wherein the obtaining of the non-acoustic data comprises program code that causes the computer to capture visual data of the vicinity of the speech input;
  
  identify a subject speaker as the source of the speech input from the obtained non-acoustic data, wherein the identification of the subject speaker comprises program code that causes the computer to:
  
  segment one or more faces from the captured visual data;
  
  detect mouth motion on the one or more faces, wherein the detection of the mouth motion comprises an application of temporal differencing on each of the one or more faces by comparing a first pixel intensity associated at a first time with a second pixel intensity at a second time; and
  
  select a face corresponding to the subject speaker from the one or more faces in response to a determination that a number of significantly changed pixels between the first pixel intensity and the second pixel intensity exceeds a threshold;
  
  extract one or more non-acoustic attributes associated with the subject speaker from the obtained non-acoustic data;
  
  analyze the one or more non-acoustic attributes, and assign at least one demographic to the subject speaker based on the analysis;
  
  select at least one model for use by the speech recognition system based on the demographic assigned to the subject speaker;
  
  adjust the speech recognition system using the at least one selected model; and
  
  process the speech input using the adjusted speech recognition system.
  - 12. The article of claim 11, wherein the segmentation of the one or more faces from the captured visual data comprises program code to cause the computer to use at least one face finding algorithm to find one or more likely humans in the vicinity.
  - 13. The article of claim 12, wherein the at least one face finding algorithm comprises a Jones-Viola object detection framework.
  - 14. The article of claim 11, wherein the obtaining of the non-acoustic data comprises program code that causes the computer to locate one or more head regions using the one or more non-acoustic sensors.
  - 15. The article of claim 11, wherein the one or more non-acoustic attributes comprise one or more facial features of the subject speaker extracted from the selected face, and wherein the analysis of the extracted one or more non-acoustic attributes further comprises a mapping of the selected face to a cluster to infer one or more characteristics of the subject speaker.
  - 16. The article of claim 11, wherein the at least one model comprises at least one of an acoustic model and a language model.
  - 17. The article of claim 11, wherein the direction of the source of the speech input is estimated based on at least one of a beamformer based method, a time delay of arrival based method, and a spectrum estimation based method.
  - 18. The article of claim 11, wherein the cluster comprises at least one of a gender cluster and an ethnicity cluster.
  - 19. The article of claim 11, wherein the one or more non-acoustic attributes comprise hair color, and wherein the assignment of the at least one demographic to the subject speaker further comprises program code that causes the computer to assign an age demographic to the subject speaker based on the analysis of the hair color.
  - 20. The article of claim 11, wherein the one or more non-acoustic attributes comprise a height associated with the subject speaker, wherein the assignment of the at least one demographic to the subject speaker further comprises program code that causes the computer to assign an age demographic to the subject speaker based on the analysis of the height.
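The final steps of the independent claims (assign a demographic, select a model, adjust the recognizer) can be sketched as a simple lookup. The example below follows the height-based age demographic of claims 10 and 20 and the acoustic/language model pairing of claims 6 and 16. The height cutoff, the demographic labels, and the model identifiers are hypothetical placeholders; the patent text shown here does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSet:
    acoustic_model: str
    language_model: str


# Hypothetical model identifiers keyed by age demographic.
MODELS = {
    "child": ModelSet(acoustic_model="am_child", language_model="lm_general"),
    "adult": ModelSet(acoustic_model="am_adult", language_model="lm_general"),
}
DEFAULT_MODELS = MODELS["adult"]


def assign_age_demographic(height_m: float, child_max_height_m: float = 1.3) -> str:
    """Assign an age demographic from estimated height.
    The 1.3 m cutoff is an illustrative assumption."""
    return "child" if height_m < child_max_height_m else "adult"


def select_models(demographic: str) -> ModelSet:
    """Pick the model set for a demographic, falling back to a default
    when the demographic has no dedicated models."""
    return MODELS.get(demographic, DEFAULT_MODELS)
```

A recognizer would then be reconfigured with the returned `ModelSet` before decoding the speech input; how that adjustment is performed depends on the speech recognition system in use.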

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Connell, II, Jonathan H.; Marcheret, Etienne
Primary Examiner(s): Kazeminezhad, Farzad
Application Number: US14/540,554
Publication Number: US 20160140959A1
Time in Patent Office: 1,174 Days
Field of Search: 704231

US Class Current
CPC Class Codes:
G06F 3/016   Input arrangements with for...
G06N 3/008   based on physical entities ...
G06V 40/171   Local features and componen...
G10L 15/07   to the speaker
G10L 15/24   Speech recognition using no...
G10L 15/25   using position of the lips,...
G10L 2015/227   of the speaker; Human-fact...
H04R 25/407   Circuits for combining sign...
