Speech recognition system adaptation based on non-acoustic attributes and face selection based on mouth motion using pixel intensities
First Claim
1. A method, comprising the steps of:
determining a vicinity from which speech input to a speech recognition system originates, wherein determining the vicinity comprises estimating a sound direction of a source of the speech input based on a signal processing method;
obtaining non-acoustic data from the vicinity of the speech input using one or more non-acoustic sensors, wherein obtaining the non-acoustic data comprises capturing visual data of the vicinity of the speech input;
identifying a subject speaker as the source of the speech input from the obtained non-acoustic data, wherein identifying the subject speaker comprises:
segmenting one or more faces from the captured visual data;
detecting mouth motion on the one or more faces, wherein detecting the mouth motion comprises applying temporal differencing on each of the one or more faces by comparing a first pixel intensity associated at a first time with a second pixel intensity at a second time; and
selecting a face corresponding to the subject speaker from the one or more faces in response to determining that a number of significantly changed pixels between the first pixel intensity and the second pixel intensity exceeds a threshold;
extracting one or more non-acoustic attributes associated with the subject speaker from the obtained non-acoustic data;
analyzing the one or more extracted non-acoustic attributes, and assigning at least one demographic to the subject speaker based on the analysis;
selecting at least one model for use by the speech recognition system based on the at least one demographic assigned to the subject speaker;
adjusting the speech recognition system using the at least one selected model; and
processing the speech input using the adjusted speech recognition system;
wherein the steps are performed by at least one processor device coupled to a memory.
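The face selection steps of the claim (temporal differencing on each face, then a threshold on the number of significantly changed pixels) can be illustrated with a minimal sketch. It is not the patented implementation. The names count_changed_pixels and select_speaking_face, the parameters intensity_delta and count_threshold, their default values, and the rule of preferring the face with the most changed pixels when several exceed the threshold are all assumptions made for illustration. Each face is assumed to be already segmented and represented by two equally sized grayscale crops taken at a first and a second time.

```python
from typing import Dict, List, Optional, Tuple

# A grayscale crop of one face (or its mouth region): one inner list per row.
Frame = List[List[int]]


def count_changed_pixels(first: Frame, second: Frame, intensity_delta: int) -> int:
    """Count pixels whose intensity differs by more than intensity_delta.

    intensity_delta is a hypothetical cutoff for a "significantly changed" pixel.
    """
    if len(first) != len(second) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(first, second)
    ):
        raise ValueError("frames must have the same shape")
    return sum(
        1
        for row_a, row_b in zip(first, second)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > intensity_delta
    )


def select_speaking_face(
    faces: Dict[str, Tuple[Frame, Frame]],
    intensity_delta: int = 25,   # assumed value, not from the source
    count_threshold: int = 3,    # assumed value, not from the source
) -> Optional[str]:
    """Return the id of a face whose changed-pixel count exceeds count_threshold.

    If several faces qualify, the one with the most changed pixels wins
    (an assumption; the claim only requires exceeding the threshold).
    Returns None when no face qualifies.
    """
    best_id: Optional[str] = None
    best_count = count_threshold
    for face_id, (first, second) in faces.items():
        changed = count_changed_pixels(first, second, intensity_delta)
        if changed > best_count:
            best_id, best_count = face_id, changed
    return best_id


# Two toy faces: face_a does not move, face_b changes in four pixels.
still = [[100, 100, 100], [100, 100, 100]]
moved = [[100, 160, 150], [40, 30, 100]]
example_faces = {"face_a": (still, still), "face_b": (still, moved)}
print(select_speaking_face(example_faces))  # prints "face_b"
```

In this toy input, face_b has four pixels that differ by more than 25 intensity levels, which exceeds the assumed threshold of three, so it is selected as the subject speaker.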
Abstract
Non-acoustic data from a vicinity of speech input is obtained. A subject speaker is identified as the source of the speech input from the obtained non-acoustic data by detecting mouth motion on one or more faces segmented from the non-acoustic data by comparing a first pixel intensity associated at a first time with a second pixel intensity at a second time, and selecting a face corresponding to the subject speaker from the one or more faces in response to a determination that a number of significantly changed pixels between the first pixel intensity and the second pixel intensity exceeds a threshold. A demographic is assigned to the subject speaker based on an analysis of one or more non-acoustic attributes of the subject speaker extracted from the non-acoustic data. The speech input is processed using a speech recognition system adjusted using a model selected based on the demographic.
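The final step of the abstract, adjusting the recognizer with a model selected from the assigned demographic, can be sketched as a simple lookup. This is only an illustration under stated assumptions: the demographic labels, the model identifiers, the fallback to a general model, and the names MODEL_REGISTRY, RecognizerConfig, select_model and adjust_recognizer are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional

# Hypothetical registry mapping a demographic label to a model identifier.
MODEL_REGISTRY: Dict[str, str] = {
    "adult_female": "acoustic_model_adult_female",
    "adult_male": "acoustic_model_adult_male",
    "child": "acoustic_model_child",
}
# Assumed fallback when no demographic-specific model is available.
DEFAULT_MODEL = "acoustic_model_general"


@dataclass(frozen=True)
class RecognizerConfig:
    """Stand-in for the adjustable state of a speech recognition system."""
    active_model: str = DEFAULT_MODEL


def select_model(demographic: Optional[str]) -> str:
    """Pick the model for a demographic, falling back to the general model."""
    return MODEL_REGISTRY.get(demographic, DEFAULT_MODEL)


def adjust_recognizer(config: RecognizerConfig, demographic: Optional[str]) -> RecognizerConfig:
    """Return a copy of the configuration that uses the selected model."""
    return replace(config, active_model=select_model(demographic))


adjusted = adjust_recognizer(RecognizerConfig(), "child")
print(adjusted.active_model)  # prints "acoustic_model_child"
```

A lookup table keeps the selection step separate from whatever analysis assigns the demographic, so either side can change without touching the other.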
20 Claims
1. A method (independent claim, recited in full above under First Claim). Dependent claims: 2 through 20.
Specification