Speaker verifier using nearest-neighbor distance measure
Abstract
A speaker verification system which accepts or rejects the claimed identity of an individual based on analysis and measurements of the speaker's utterances. The utterances are elicited by prompting the individual seeking identification to read test phrases, chosen at random by the verification system and composed of words from a small vocabulary. Nearest-neighbor distances are computed between speech frames derived from such spoken test phrases and speech frames of corresponding vocabulary "words" from previously stored utterances of the speaker seeking identification, along with distances between such spoken test phrases and corresponding vocabulary words for a set of reference speakers. The claim for identification is accepted or rejected based on the relationship among such distances and a predetermined threshold value.
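As an illustration of the prompting step described in the abstract, the following minimal sketch draws a test phrase at random from a small vocabulary. The vocabulary contents, the phrase length, and the choice to use distinct words are all assumptions made for the example; the text above does not specify them.

```python
import random

# Hypothetical small vocabulary; the actual word list is not given in the text.
VOCABULARY = ["one", "two", "three", "four", "five", "six"]


def make_prompt(vocabulary, num_words, rng=None):
    """Return a prompt phrase of num_words distinct words chosen at random.

    Using distinct words is an assumption of this sketch.
    """
    rng = rng or random.Random()
    return rng.sample(vocabulary, num_words)


# A seeded generator keeps the example reproducible.
prompt = make_prompt(VOCABULARY, 4, random.Random(0))
print(" ".join(prompt))
```

Because the phrase is generated anew for each verification attempt, a claimant cannot know in advance which words will be requested.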
33 Claims
1. In a Speaker Verification System comprising a means for processing spoken text into frames of speech, a means for enrolling a speaker into the system, a means for eliciting a spoken test phrase from a speaker claiming to be a specified enrolled speaker, a means for determining one or more verification distances between said spoken test phrase and corresponding "words" entered into the system during said enrollment into the system of said specified enrolled speaker, and a means for determining a verification score from such verification distance data and for determining therefrom whether said claiming speaker is said specified enrolled speaker, the improvement wherein:
said processing means includes a means for converting said spoken text into non-parametric speech vectors, whereby at least one of said speech vectors is included in each of said frames of speech; and
said determination of said verification distance includes a determination of nearest-neighbor Euclidean distances between single frames of speech associated with said spoken test phrase and corresponding frames of speech associated with said "words" entered into the system during said enrollment into the system of said specified enrolled speaker and between single frames of speech associated with said enrollment "words" of said specified enrolled speaker and corresponding frames of speech associated with said spoken test phrase.
Dependent claims: 2, 3, 4, 5, 6
7. The Speaker Verification System comprising:
means for processing spoken text entered into the system whereby said spoken text is sampled, digitized and converted into frames of speech, each frame being comprised of multiple speech vector components, said speech vector components being non-parametric in nature;
means for enrolling a speaker into the system whereby predetermined spoken text is entered into the system by said speaker and processed by said means for processing and thereafter stored by the system;
means responsive to a request for identification for a speaker claiming to be a specified enrolled speaker for generating a prompt phrase comprising one or more "words" derived from said predetermined spoken text entered by said specified enrolled speaker and whereupon said prompt phrase is spoken by said claiming speaker and said spoken prompt phrase is entered into the system and processed by said means for processing;
means for determining nearest-neighbor distances di,T, wherein said nearest-neighbor distances di,T are computed as the Euclidian distances between each frame of said processed spoken prompt phrase and speech frames from corresponding regions of each occurrence of the same "word" stored during said enrollment into the system of said specified enrolled speaker;
means for determining nearest-neighbor distances dj,E, wherein said nearest-neighbor distances dj,E are computed as the Euclidian distances between each frame of each occurrence of each "word" comprising said prompt phrase and speech frames from corresponding regions of each occurrence of the same "word" in said processed spoken prompt phrase;
means for determining a distance dT,E between said processed spoken prompt phrase and corresponding "words" entered into the system during said enrollment into the system of said specified enrolled speaker, wherein said distance dT,E is derived from an average of all said nearest-neighbor distances di,T and an average of all said nearest-neighbor distances dj,E; and
means for determining a verification score related to said distances di,T, dj,E and dT,E and for determining therefrom whether said claiming speaker is said specified enrolled speaker.
Dependent claims: 8, 9, 10, 11, 12, 13, 14
15. A speaker verification system comprising:
means for entering spoken text into the system;
means for sampling and digitizing said spoken text;
means for converting said digitized samples into frames of speech, each frame being comprised of multiple speech vector components, said speech vector components being non-parametric in nature;
means for enrolling one or more speakers into the system during an enrollment session whereby predetermined spoken text is entered into the system by each such speaker and processed by said means for sampling and means for converting and thereafter stored by the system;
means for identifying stored enrollment data for a particular enrolled speaker based on a claim for verification as said particular enrolled speaker;
means for identifying one or more "words" derived from the spoken text entered by said particular enrolled speaker during said enrollment session and means for presentation of said "words" as a prompt to be spoken by a speaker during a verification session, said prompted spoken "words" being thereupon entered into the system via said means for entering and processed by said means for sampling and means for converting;
means for storing said prompted spoken "words";
means for comparing each speech frame from said verification session with speech frames from corresponding regions of each occurrence of the same "word" stored during said particular speaker's enrollment session, and computing nearest-neighbor distances di,T between all such pairs of verification and enrollment frames;
means for comparing each speech frame from each occurrence of "words" comprising said prompt stored during said particular speaker's enrollment session with speech frames from corresponding regions of said prompted spoken "words", and computing nearest-neighbor distances dj,E between all such pairs of enrollment and verification frames;
means for computing a distance dT,E from an average of all said nearest-neighbor distances di,T and an average of all said nearest-neighbor distances dj,E;
means for comparing said distance dT,E with a predetermined value and causing an output signal to occur based on the difference between said distance dT,E and said predetermined value, said output signal being indicative of acceptance or rejection for a speaker claiming to be said particular enrolled speaker.
Dependent claims: 16, 17, 18, 19
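The final element of claim 15 compares dT,E with a predetermined value and emits an accept/reject signal based on the difference. A minimal sketch of that comparison is shown below. The function name, the return shape, and the convention that a distance at or below the threshold means acceptance are assumptions of this sketch; the claim only states that the output depends on the difference.

```python
def verification_output(d_te, threshold):
    """Compare the distance d_te with a predetermined threshold.

    Returns (accepted, margin), where margin = threshold - d_te.
    Assumption: a smaller distance means a closer match, so the claim
    of identity is accepted when the margin is non-negative.
    """
    margin = threshold - d_te
    return margin >= 0, margin


accepted, margin = verification_output(d_te=0.5, threshold=1.0)
print("accept" if accepted else "reject", margin)
```

Returning the margin alongside the decision keeps the difference available, for example for logging or for tuning the predetermined value.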
20. In a method of automatically verifying a speaker as matching a claimed identity, including the steps of processing spoken input speech signals into a series of frames of digital data representing said input speech, analyzing the speech frames by a speaker verification module which compares the incoming speech to a reference set of speech features and generates respective match scores therefrom, and determining whether the input speech corresponds with the identified speaker based upon the match scores, the improvement wherein:
said step of processing spoken input speech signals includes a substep of converting said spoken input speech into non-parametric speech vectors, whereby at least one of said speech vectors is included in each of said frames of data representing said input speech; and
said comparison of incoming speech to reference speech features by said speaker recognition module includes generating a match score which is a sum of a first score set equal to the average of the minimum Euclidian squared distances between an input speech frame for a given region of a particular "word" and speech frames from said reference set of speech features corresponding to the same region of the same "word" over all frames of all "words" of said input speech, and a second score set equal to the average of the minimum Euclidian squared distances between a speech frame for a given region of a particular "word" from said reference set of speech features and an input speech frame corresponding to the same region of the same "word" over all frames of all "words" comprising said reference set of speech features.
Dependent claims: 21, 22, 23, 24, 25, 26
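The match score recited in claim 20 can be illustrated with a short sketch: each frame is compared only with frames from the same region of the same "word", the minimum squared Euclidean distance is taken per frame, and the two directional averages are summed. The data layout (a frame as a `(word, region, vector)` tuple), the function names, and the toy values are assumptions made for illustration; how regions are assigned to frames is not specified in the claim.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def group_by_region(frames):
    """Index frame vectors by their (word, region) label."""
    groups = defaultdict(list)
    for word, region, vector in frames:
        groups[(word, region)].append(vector)
    return groups


def directed_score(source, target):
    """Average, over all source frames, of the minimum squared distance
    to the target frames that share the same word and region."""
    if not source:
        raise ValueError("source has no frames")
    target_groups = group_by_region(target)
    minima = []
    for word, region, vector in source:
        candidates = target_groups.get((word, region))
        if not candidates:
            raise ValueError(f"no target frame for {word!r}, region {region}")
        minima.append(min(squared_distance(vector, c) for c in candidates))
    return sum(minima) / len(minima)


def match_score(input_frames, reference_frames):
    """Sum of the input-to-reference and reference-to-input averages."""
    return (directed_score(input_frames, reference_frames)
            + directed_score(reference_frames, input_frames))


# Toy frames with hypothetical labels: (word, region index, speech vector).
reference = [("one", 0, (0.0, 0.0)), ("one", 0, (1.0, 0.0)), ("one", 1, (2.0, 2.0))]
spoken = [("one", 0, (0.0, 1.0)), ("one", 1, (2.0, 3.0))]
score = match_score(spoken, reference)
print(score)
```

Computing the distance in both directions makes the score symmetric: it stays the same if the input and reference sets are swapped, and it is zero when the two sets contain identical frames.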
27. In a method of automatically verifying a speaker as matching a claimed identity, including the steps of establishing the claimed identity, generation of a verification phrase comprising one or more "words" to be spoken by the speaker, processing the spoken input speech signals into a series of frames of digital data representing the input speech, analyzing the speech frames by a speaker verification module which compares the input speech to a reference set of speech features of the identified speaker obtained during prior enrollment sessions and generates respective match scores therefrom, and determining whether the input speech is identified with the identified speaker based upon the match scores, the improvement wherein:
said step of processing spoken input speech signals includes a substep of converting said spoken input speech into non-parametric speech vectors, whereby at least one of said speech vectors is included in each of said frames of data representing the input speech; and
said comparison of incoming speech to reference speech features by said speaker recognition module includes generating a match score which is a sum of a first score set equal to the average of the minimum Euclidian squared distances between an input speech frame for a given region of a particular "word" and enrollment speech frames corresponding to the same region of the same "word", over all frames of all "words" of the input speech, and a second score set equal to the average of the minimum Euclidian squared distance between an enrollment speech frame for a given region of a particular "word" with an input speech frame corresponding to the same region of the same "word", over all frames of all "words" comprising the reference set of speech features, wherein the distance from tj to the corresponding enrollment "word" E is:
##EQU10##
and the distance from ei to the corresponding test "word" T is:
##EQU11##
wherein tj is the j-th frame of the input "word" T and ei is the i-th frame of enrollment "word" E, Wi and Fi are respectively the word and frame indexes for frame i, and Wj and Fj are respectively the word and frame indexes for frame j, and
wherein said first score is equal to the average of dj,E over all frames and said second score is equal to the average of di,T over all frames.
Dependent claims: 28, 29, 30, 31, 32, 33
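The equations marked ##EQU10## and ##EQU11## are not reproduced in this text. Based only on the wording of claim 27, the per-frame distances plausibly take the following general form. This is a hedged sketch, not the patent's equations: the candidate sets C_E(j) and C_T(i), and the counts N_T and N_E, are notation introduced here, and the exact rule that decides which frame indexes count as the "same region" cannot be recovered from the claim.

```latex
% Assumed candidate sets: frames of the same word whose frame index
% falls in the region corresponding to the frame being matched.
\[
  d_{j,E} = \min_{i \in C_E(j)} \lVert t_j - e_i \rVert^2 ,
  \qquad
  C_E(j) = \{\, i : W_i = W_j,\ F_i \text{ in the region corresponding to } F_j \,\}
\]
\[
  d_{i,T} = \min_{j \in C_T(i)} \lVert e_i - t_j \rVert^2 ,
  \qquad
  C_T(i) = \{\, j : W_j = W_i,\ F_j \text{ in the region corresponding to } F_i \,\}
\]
% First and second scores are the averages over input and enrollment frames.
\[
  \text{match score}
  = \frac{1}{N_T} \sum_{j=1}^{N_T} d_{j,E}
  + \frac{1}{N_E} \sum_{i=1}^{N_E} d_{i,T}
\]
```

Here N_T is the number of input frames and N_E the number of enrollment frames, so the two terms correspond to the first and second scores named in the claim.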
Specification