Method and apparatus for text-independent speaker recognition
Abstract
A method and apparatus for recognizing an unknown speaker from a plurality of speaker candidates. Portions of speech from the speaker candidates and from the unknown speaker are sampled and digitized. The digitized samples are converted into frames of speech, each frame representing a point in an LPC-12 multi-dimensional speech space. Using a character covering algorithm, a set of frames of speech, called characters, is selected from the frames of speech of all speaker candidates. The speaker candidates' portions of speech are divided into smaller portions called segments. A smaller plurality of model characters for each speaker candidate is selected from the character set. For each set of model characters, the distance from each of the speaker candidate's frames of speech to the closest character in the model set is determined and stored in a model histogram. When a model histogram is completed for a segment, a distance D is found such that at least a majority of frames have distances greater than D. The mean and variance of D across all segments are then calculated for both speaker and imposter. These values are added to the set of model characters to form the speaker model. To perform recognition, the frames of the unknown speaker are buffered as they are received and compared with the sets of model characters to form model histograms for each speaker. A likelihood ratio is formed. The speaker candidate with the highest likelihood ratio is chosen as the unknown speaker.
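The scoring steps in the abstract can be illustrated with a minimal Python sketch. Several details here are assumptions that the abstract does not state: the Euclidean metric, the use of a sorted list in place of a binned histogram, the choice of the lower median as the distance D, and Gaussian densities for the likelihood ratio. All function names and the layout of the `models` dictionary are hypothetical.

```python
import math
from statistics import mean, pvariance


def nearest_distance(frame, characters):
    # Assumption: Euclidean distance; the abstract does not name a metric.
    return min(math.dist(frame, c) for c in characters)


def segment_statistic(frames, model_characters):
    # Assumption: D is taken as the lower median of the nearest-character
    # distances, so more than half of the frames lie at or above D.
    distances = sorted(nearest_distance(f, model_characters) for f in frames)
    return distances[(len(distances) - 1) // 2]


def d_statistics(segments, model_characters):
    # Mean and (population) variance of D across all training segments.
    ds = [segment_statistic(seg, model_characters) for seg in segments]
    return mean(ds), pvariance(ds)


def gaussian_log_pdf(x, mu, var):
    # Assumption: D is modeled as Gaussian; the variance is floored to avoid log(0).
    var = max(var, 1e-12)
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def log_likelihood_ratio(d, speaker_stats, impostor_stats):
    # Log of p(D | speaker) / p(D | impostor); each stats argument is (mean, variance).
    return gaussian_log_pdf(d, *speaker_stats) - gaussian_log_pdf(d, *impostor_stats)


def identify(frames, models):
    # Score the unknown speaker's frames against every model, keep the best ratio.
    scores = {}
    for name, m in models.items():
        d = segment_statistic(frames, m["characters"])
        scores[name] = log_likelihood_ratio(d, m["speaker"], m["impostor"])
    return max(scores, key=scores.get)
```

Working in the log domain leaves the ranking of candidates unchanged and avoids underflow when a frame set is far from a model.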
16 Claims
1. A method of recognizing an unknown speaker as one of a plurality of speaker candidates from portions of speech of each of said speaker candidates and said unknown speaker comprising the steps of:
converting digitized samples of said portions of speech of said speaker candidates into frames of speech, each frame representing a point in a preselected multi-dimensional speech space;
generating a character set representative of said speech space comprising a plurality of characters selected from said frames of speech of all of said speaker candidates;
generating a speaker model of each of said speaker candidates, said speaker model comprising a plurality of model characters selected from said character set and representative of an associated speaker's voice characteristics;
converting digitized samples of said portions of speech of said unknown speaker into frames of speech; and
comparing said frames of speech from said unknown speaker with said speaker models to determine which one of said speaker candidates has the greatest likelihood of being said unknown speaker.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
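The two generating steps of claim 1 can be illustrated with a short Python sketch. Neither the covering algorithm nor the rule for choosing model characters is spelled out in the text above, so both are stand-ins: a greedy radius-based cover for the character set, and a "most frequently nearest" rule for the per-speaker model characters. The names `greedy_cover`, `select_model_characters`, `radius`, and `k` are hypothetical, and Euclidean distance is assumed.

```python
import math
from collections import Counter


def greedy_cover(frames, radius):
    # Assumed stand-in for the covering step: keep a frame as a new
    # character only if no existing character lies within `radius` of it.
    # Afterwards every frame is within `radius` of some character.
    characters = []
    for frame in frames:
        if all(math.dist(frame, c) > radius for c in characters):
            characters.append(frame)
    return characters


def select_model_characters(speaker_frames, characters, k):
    # Assumed selection rule: count how often each character is the
    # nearest one to a frame of this speaker, then keep the k most used.
    counts = Counter(
        min(characters, key=lambda c: math.dist(frame, c))
        for frame in speaker_frames
    )
    return [c for c, _ in counts.most_common(k)]


# Frames pooled from all speaker candidates (toy 2-D points).
pooled = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.2, 0.0), (10.0, 0.0)]
character_set = greedy_cover(pooled, radius=1.0)

# Model characters for one candidate, drawn from the shared character set.
model = select_model_characters(
    [(0.0, 0.1), (0.2, 0.0), (3.0, 0.1)], character_set, k=1
)
```

A smaller `radius` gives a finer character set at the cost of more distance computations during recognition.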
13. An apparatus for recognizing an unknown speaker as one of a plurality of speaker candidates from portions of speech of each of said speaker candidates and unknown speaker comprising:
means for sampling and digitizing said portions of speech to produce digitized samples;
means for converting said digitized samples into frames of speech, each frame representing a point in a multi-dimensional speech space;
means for generating a speaker model of each of said speaker candidates, said speaker model comprising a plurality of model characters selected from said frames of speech associated with a speaker candidate's portion of speech and representative of said speaker candidate's voice characteristics; and
means for comparing said frames of speech from said unknown speaker with said speaker models to determine which one of said speaker candidates has the greatest likelihood of being said unknown speaker.
Dependent claims: 14, 15, 16.
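As an illustration of converting digitized samples into frames, the following Python sketch maps each block of samples to a vector of linear-prediction coefficients, one common way to obtain a point in an LPC speech space. The autocorrelation method with the Levinson-Durbin recursion is an assumed choice, as are the non-overlapping framing and the `frame_len` parameter; the claims do not specify any of these, and all function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    # r[k] = sum over i of x[i] * x[i + k], for k = 0..max_lag.
    return [
        sum(x[i] * x[i + k] for i in range(len(x) - k))
        for k in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    # Solve for the prediction-error filter 1 + a[1] z^-1 + ... + a[order] z^-order.
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: leave remaining coefficients at zero
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a


def lpc_frames(samples, frame_len, order=12):
    # Assumption: non-overlapping frames; a trailing partial frame is dropped.
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        block = samples[start:start + frame_len]
        r = autocorrelation(block, order)
        frames.append(tuple(levinson_durbin(r, order)[1:]))
    return frames
```

Each returned tuple has `order` entries, so the default `order=12` yields 12-dimensional frames that distance-based comparison can consume directly.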
Specification