Method of and device for phone-based speaker recognition
First Claim
1. A device for phone-based speaker recognition, comprising:
at least one phone recognizer for converting input digitized voice signals into a time ordered stream of phones based on at least one linguistic characteristic, with each of said phone recognizers having a voice input, to receive said input digitized voice signals, and an output for transmitting said time ordered stream of phones;
for each of said phone recognizers, a corresponding tokenizer, having an input for receiving said time ordered stream of phones, with each of said tokenizers creating a set containing phone n-grams and the number of times each of said phone n-grams occurred in said time ordered stream of phones, and having an output for transmitting said set containing phone n-grams and the number of times each of said phone n-grams occurred;
for each of said tokenizers, a corresponding recognition scorer further comprising:
(a) at least one speaker model scorer, each of said speaker model scorers receives the corresponding set containing phone n-grams and the number of times each of said phone n-grams occurred in said time ordered stream of phones and computes a speaker log-likelihood score for each of said phone n-grams in said set containing phone n-grams and the number of times each of said phone n-grams occurred in said time ordered stream of phones using a corresponding speaker model which contains the number of occurrences of each of said phone n-grams that occurred in a speaker training speech set collected from a particular speaker;
(b) a background model scorer for computing a background log-likelihood score for each of said phone n-grams in said set containing phone n-grams and the number of times each of said phone n-grams occurred in said time ordered stream of phones using a corresponding background model which contains the number of occurrences of each of said phone n-grams that occurred in a background training speech set collected from many speakers, excluding all of said particular speakers; and
(c) for each of said speaker model scorers, a ratio scorer that produces a speaker log-likelihood ratio from said speaker log-likelihood score and said background log-likelihood score;
for each of said recognition scorers, a corresponding fusion scorer which combines all of said corresponding speaker log-likelihood ratios from said corresponding ratio scorers to produce a single speaker score; and
a speaker selector which evaluates all of said single speaker scores to determine a speaker identity for the speaker of said input digitized voice signals.
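The tokenizer element of the claim turns a time ordered stream of phones into a set of phone n-grams with their occurrence counts. The following is a minimal sketch of that counting step only, not the patented implementation. The function name `phone_ngram_counts`, the default n-gram order, and the example phone labels are illustrative assumptions.

```python
from collections import Counter


def phone_ngram_counts(phones, n=3):
    """Count every run of n consecutive phones in a time-ordered phone stream.

    `phones` is a list of phone labels as a phone recognizer might emit them.
    The result maps each n-gram (a tuple of labels) to its occurrence count.
    Streams shorter than n yield an empty Counter.
    """
    return Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))


# Hypothetical phone stream; the labels are made up for illustration.
stream = ["s", "iy", "s", "iy", "t"]
counts = phone_ngram_counts(stream, n=2)
```

In the claimed arrangement there would be one such set of counts per phone recognizer, each passed on to its own recognition scorer.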
Abstract
A language-independent speaker-recognition system based on parallel cumulative differences in dynamic realization of phonetic features (i.e., pronunciation) between speakers rather than spectral differences in voice quality. The system exploits phonetic information from many phone recognizers to perform text-independent speaker recognition. A digitized speech signal from a speaker is converted to a sequence of phones by each phone recognizer. Each phone sequence is then modified based on the energy in the signal. The modified phone sequences are tokenized to produce phone n-grams that are compared against a speaker and a background model for each phone recognizer to produce log-likelihood ratio scores. The log-likelihood ratio scores from each phone recognizer are fused to produce a final recognition score for each speaker model. The recognition score for each speaker model is then evaluated to determine which of the modeled speakers, if any, produced the digitized speech signal.
15 Claims
15. A method of phone-based speaker recognition, comprising the steps of:
converting input digitized voice signals into at least one time ordered stream of phones with each of said time ordered stream of phones based on at least one linguistic characteristic;
creating a set containing phone n-grams and the number of times each of said phone n-grams occurred in each of said time ordered stream of phones;
computing a speaker log-likelihood score for each of at least one possible particular speaker for each of said phone n-grams in said set containing phone n-grams and the number of times each of said phone n-grams occurred in said time ordered stream of phones using a corresponding particular speaker model which contains the number of occurrences of each of said phone n-grams that occurred in a speaker training speech set collected from the particular speaker;
computing a background log-likelihood score for each of said phone n-grams in said set containing phone n-grams and the number of times each of said phone n-grams occurred in said time ordered stream of phones using a corresponding background model which contains the number of occurrences of each of said phone n-grams that occurred in a background training speech set collected from many speakers, excluding all of said particular speakers;
producing a speaker log-likelihood ratio from each of said speaker log-likelihood scores and said background log-likelihood scores;
combining all of said corresponding speaker log-likelihood ratios to produce a single speaker score; and
determining a speaker identity based on an evaluation of all of said single speaker scores.
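The scoring, fusion, and selection steps recited above can be sketched as follows. The claim does not give formulas, so several choices here are assumptions made only for illustration: likelihoods are add-alpha smoothed relative frequencies from the model counts, the per-recognizer score is a count-weighted average of per-n-gram log-likelihood ratios, fusion is a plain mean across recognizers, and selection uses a fixed threshold. All function names, the `vocab_size` parameter, and the toy counts are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(model_counts, ngram, vocab_size, alpha=1.0):
    """Log of the add-alpha smoothed relative frequency of `ngram` in a model.

    Smoothing is an assumption of this sketch; it keeps unseen n-grams finite.
    """
    total = sum(model_counts.values())
    return math.log((model_counts.get(ngram, 0) + alpha) / (total + alpha * vocab_size))


def llr_score(test_counts, speaker_counts, background_counts, vocab_size):
    """Count-weighted average of per-n-gram speaker/background log-likelihood ratios."""
    n_tokens = sum(test_counts.values())
    if n_tokens == 0:
        return 0.0
    score = 0.0
    for ngram, count in test_counts.items():
        speaker_ll = log_likelihood(speaker_counts, ngram, vocab_size)
        background_ll = log_likelihood(background_counts, ngram, vocab_size)
        score += count * (speaker_ll - background_ll)
    return score / n_tokens


def fuse(ratios):
    """Combine one speaker's ratios from all phone recognizers (simple mean, assumed)."""
    return sum(ratios) / len(ratios)


def select_speaker(fused_scores, threshold=0.0):
    """Return the best-scoring speaker, or None if no score exceeds the threshold."""
    best = max(fused_scores, key=fused_scores.get)
    return best if fused_scores[best] > threshold else None


# Toy counts for a single phone recognizer; all values are made up.
test = Counter({("s", "iy"): 2, ("iy", "t"): 1})
speakers = {
    "alice": Counter({("s", "iy"): 8, ("iy", "t"): 2}),
    "bob": Counter({("ah", "b"): 9, ("iy", "t"): 1}),
}
background = Counter({("s", "iy"): 3, ("iy", "t"): 3, ("ah", "b"): 4})

fused = {
    name: fuse([llr_score(test, model, background, vocab_size=3)])
    for name, model in speakers.items()
}
identity = select_speaker(fused)
```

With several phone recognizers, the list passed to `fuse` would hold one ratio per recognizer for the same speaker. Returning `None` below the threshold corresponds to the case where none of the modeled speakers is judged to have produced the input.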
Specification