Method and Apparatus for Automatically Determining Speaker Characteristics for Speech-Directed Advertising or Other Enhancement of Speech-Controlled Devices or Services
Abstract
In addition to conveying primary information, human speech also conveys information concerning the speaker's gender, age, socioeconomic status, accent, language spoken, emotional state, or other personal characteristics, which is referred to as secondary information. Disclosed herein are means for the automatic discovery of such secondary information and for its use to direct other aspects of the behavior of a controlled system. One embodiment of the invention comprises an improved method to determine, with high reliability, the gender of an adult speaker. A further embodiment of the invention comprises the use of this information to display a gender-appropriate advertisement to the user of an information retrieval system that uses a cell phone as the input and output device. The invention is not limited to gender: such secondary information can include, for example, information concerning the speaker's age, socioeconomic status, accent, language spoken, emotional state, or other personal characteristics.
13 Claims
1. A method for automatically determining speaker characteristics, comprising the steps of:
using a spoken utterance to specify an input or action, wherein text that corresponds to said spoken utterance, and its associated meaning or interpretation, comprises primary information conveyed by said utterance;
using said spoken utterance to convey non-text information concerning any of a speaker's gender, age, socioeconomic status, accent, language spoken, emotional state, or other personal characteristics, wherein said non-text information comprises secondary information; and
using said primary information and said secondary information to direct behavior of a controlled system. (Dependent claims: 2, 3.)
4. An apparatus for automatically determining speaker characteristics, comprising:
a speech input device;
a primary information extraction module for receiving utterances from said speech input device and comprising an automatic speech recognition (ASR) module;
a secondary information extraction module for receiving utterances from said speech input device and comprising an automatic speech characteristics (ASC) module that estimates or extracts explicit or implicit speech indicators of interest; and
a controlled system for using primary and secondary information extracted, respectively, by said ASR and ASC, to produce a system action or response as a system output. (Dependent claims: 5, 6, 7, 8.)
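As a minimal sketch, the apparatus of claim 4 can be wired together as a pipeline. The module names and return values below are hypothetical stand-ins; the patent does not specify an implementation:

```python
# Sketch of the claimed apparatus: speech input feeds both an ASR module
# (primary information) and an ASC module (secondary information), and a
# controlled system combines the two. All names and values are illustrative.

def asr_module(utterance_audio):
    # Placeholder: a real ASR system would return the recognized text.
    return "show me running shoes"

def asc_module(utterance_audio):
    # Placeholder: a real ASC system would estimate speaker traits
    # (gender, age, accent, emotional state, ...) from the signal.
    return {"gender": "female", "age_group": "adult"}

def controlled_system(primary, secondary):
    # Combine the recognized request (primary) with estimated speaker
    # traits (secondary) to select a system action, e.g. a targeted ad.
    if secondary.get("gender") == "female":
        ad = "women's running shoes"
    else:
        ad = "men's running shoes"
    return {"query": primary, "advertisement": ad}

audio = b"..."  # raw speech samples from the speech input device
response = controlled_system(asr_module(audio), asc_module(audio))
```

Here the gender-appropriate advertisement of the abstract falls out of a single dictionary lookup on the secondary information; a real system would route the same two signals into whatever behavior the controlled system supports.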
9. An apparatus for automatically associating speaker characteristics with speaker behaviors, comprising:
a speech input device;
a secondary information extraction module for receiving utterances from said speech input device and comprising an automatic speech characteristics (ASC) module that estimates or extracts explicit or implicit speech indicators of interest; and
a learning module for recording both said secondary information and user behavior, and for analyzing said secondary information and user behavior to determine relationships between speech and said behavior, and/or speech and speaker personal characteristics. (Dependent claim: 10.)
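A minimal sketch of the claimed learning module, assuming secondary information arrives as a dictionary of speaker traits and user behavior as a simple event label (both hypothetical representations):

```python
from collections import Counter, defaultdict

# Sketch of the claim-9 learning module: it records pairs of (secondary
# information, observed user behavior) and tallies how often each trait
# value co-occurs with each behavior, exposing the strongest association.

class LearningModule:
    def __init__(self):
        self.counts = defaultdict(Counter)

    def record(self, secondary_info, behavior):
        # Record one observation: each (trait, value) pair is credited
        # with the behavior that followed it.
        for trait, value in secondary_info.items():
            self.counts[(trait, value)][behavior] += 1

    def most_common_behavior(self, trait, value):
        # Analyze the recordings: return the behavior most often seen
        # for a given trait value, or None if never observed.
        tally = self.counts[(trait, value)]
        return tally.most_common(1)[0][0] if tally else None

lm = LearningModule()
lm.record({"gender": "male"}, "clicked_sports_ad")
lm.record({"gender": "male"}, "clicked_sports_ad")
lm.record({"gender": "male"}, "ignored_ad")
```

A production system would replace the raw co-occurrence counts with a proper statistical model, but the record/analyze split mirrors the claim's two functions.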
11. A method for gender classification based upon speech, comprising the steps of:
processing an utterance (utt) as a sequence of frames;
classifying each frame of speech as voiced (V), unvoiced (U), or silence (S), with unvoiced and silence frames discarded;
using an autocorrelation algorithm to extract a pitch estimate for every frame in the utterance;
to obtain an estimate for the utt's pitch frequency (F0), histogramming the F0 values for the frames in the utt and selecting the greatest peak, or mode, of the histogram; and
comparing pitch with a threshold to decide if the speech is from a male or female speaker.
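The steps of claim 11 can be sketched as follows. The 16 kHz sample rate, the energy-based voicing test, and the 160 Hz male/female threshold are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

# Sketch of the claim-11 pipeline: frame the utterance, discard
# non-voiced frames, estimate per-frame F0 by autocorrelation, take the
# mode of the F0 histogram, and threshold it to decide gender.

def frame_pitch_autocorr(frame, sr=16000, fmin=60.0, fmax=400.0):
    # Pitch estimate from the highest autocorrelation peak within the
    # lag range corresponding to plausible speech F0 values.
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

def classify_gender(signal, sr=16000, frame_len=512, hop=256,
                    energy_thresh=1e-4, f0_thresh=160.0):
    f0s = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        # Crude voicing test standing in for the V/U/S classifier:
        # low-energy frames are treated as unvoiced/silence and discarded.
        if np.mean(frame ** 2) < energy_thresh:
            continue
        f0s.append(frame_pitch_autocorr(frame, sr))
    # Histogram the per-frame F0 values and take the greatest peak (mode).
    hist, edges = np.histogram(f0s, bins=34, range=(60.0, 400.0))
    mode_f0 = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
    return "male" if mode_f0 < f0_thresh else "female"
```

On synthetic tones, a 120 Hz signal classifies as male and a 220 Hz signal as female; real speech would of course require the proper V/U/S classifier the claim names.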
12. A method for gender classification based upon speech, comprising the steps of:
performing center-clipping on each of a plurality of signal frames;
keeping a first half of a Fast Fourier Transform (FFT)-size buffer of a resulting center-clipped frame and zeroing out a second half of said buffer;
taking a forward FFT;
computing a squared magnitude of said FFT;
taking an inverse FFT of a squared magnitude spectrum to effect frame autocorrelation;
searching for a highest peak in said autocorrelation;
classifying each frame of speech as voiced (V), unvoiced (U), or silence (S), with unvoiced and silence frames discarded;
if voiced, finding pitch for said frame from the peak's position;
incorporating the pitch in a histogram;
determining pitch for an entire utterance by employing a histogram method; and
comparing pitch with a threshold to decide if the speech is from a male or female speaker.
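The per-frame portion of claim 12 (center-clipping, half-buffer zeroing, forward FFT, squared magnitude, inverse FFT, peak search) can be sketched as below; the FFT size, center-clipping fraction, and F0 search range are illustrative assumptions:

```python
import numpy as np

# Per-frame pitch extraction as in claim 12: center-clip the frame, then
# compute its autocorrelation via the power spectrum (forward FFT,
# squared magnitude, inverse FFT) and locate the highest peak.

def center_clip(frame, fraction=0.3):
    # Zero samples whose magnitude falls below a fraction of the frame
    # peak; this suppresses formant structure that can confuse the
    # autocorrelation pitch peak.
    c = fraction * np.max(np.abs(frame))
    return np.where(frame > c, frame - c,
                    np.where(frame < -c, frame + c, 0.0))

def frame_pitch_fft(frame, sr=16000, fft_size=1024, fmin=60.0, fmax=400.0):
    clipped = center_clip(frame)
    # Keep the frame in the first half of the FFT buffer, zero the second
    # half, so the inverse transform yields a linear autocorrelation.
    buf = np.zeros(fft_size)
    buf[:fft_size // 2] = clipped[:fft_size // 2]
    spectrum = np.fft.fft(buf)              # forward FFT
    power = np.abs(spectrum) ** 2           # squared magnitude
    autocorr = np.real(np.fft.ifft(power))  # inverse FFT -> autocorrelation
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(autocorr[lo:hi]))  # highest peak in F0 range
    return sr / lag
```

The histogram-over-frames and threshold steps are identical to claim 11; only the per-frame autocorrelation differs, here computed in the frequency domain.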
13. A method for discriminating between adults and children by spoken utterances, comprising the steps of:
analyzing a given utterance on a frame-by-frame basis;
classifying each frame of speech as voiced (V), unvoiced (U), or silence (S), with unvoiced and silence frames discarded;
providing two probability distribution functions (pdfs) of the distribution of spectral peaks, one for adults and one for children;
dividing an utterance in question into frames, where unvoiced and silence frames are discarded;
for each frame, computing a Hamming-windowed FFT, and then computing a squared magnitude of each FFT coefficient;
finding a spectral peak index of a maximum of said coefficients within said FFT;
from a sequence of spectral peak indices, computing the log-probability of said utterance under each of said two pdfs, and computing their difference

Δ = log P_utterance(child) − log P_utterance(adult); and   (3)

comparing said difference quantity Δ with an experimentally determined threshold to classify said utterance: child if Δ exceeds the threshold, adult if it does not.
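The claim-13 procedure can be sketched as below, assuming the two pdfs are supplied as discrete probability vectors indexed by FFT bin (their training is outside the claim) and that only voiced frames are passed in:

```python
import numpy as np

# Sketch of the claim-13 adult/child discriminator. For each voiced
# frame: Hamming window, FFT, squared magnitudes, spectral peak index;
# then accumulate the log-probability difference of the peak index under
# the child and adult pdfs, and threshold the total.

def classify_age_group(frames, pdf_child, pdf_adult, threshold=0.0,
                       eps=1e-12):
    delta = 0.0
    for frame in frames:  # voiced frames only; U/S frames pre-discarded
        windowed = frame * np.hamming(len(frame))
        power = np.abs(np.fft.rfft(windowed)) ** 2  # squared magnitudes
        k = int(np.argmax(power))                   # spectral peak index
        # Delta = log P_utterance(child) - log P_utterance(adult),
        # accumulated as a sum of per-frame log-probability differences.
        delta += np.log(pdf_child[k] + eps) - np.log(pdf_adult[k] + eps)
    return "child" if delta > threshold else "adult"
```

The zero threshold used here is the illustrative default; the claim calls for an experimentally determined value.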
Specification