Method and apparatus for text-independent speaker recognition

US 4,720,863 A
Filed: 11/03/1982
Issued: 01/19/1988
Est. Priority Date: 11/03/1982
Status: Expired due to Term

Abstract

A method and apparatus for recognizing an unknown speaker from a plurality of speaker candidates. Portions of speech from the speaker candidates and from the unknown speaker are sampled and digitized. The digitized samples are converted into frames of speech, each frame representing a point in an LPC-12 multi-dimensional speech space. Using a character covering algorithm, a set of frames of speech, called characters, is selected from the frames of speech of all speaker candidates. The speaker candidates' portions of speech are divided into smaller portions called segments. A smaller plurality of model characters for each speaker candidate is selected from the character set. For each set of model characters, the distance from each of the speaker candidate's frames of speech to the closest character in the model set is determined and stored in a model histogram. When a model histogram is completed for a segment, a distance D is found such that at least a majority of frames have distances greater than D. The mean and variance of D across all segments are then calculated for both speaker and imposter. These values are added to the set of model characters to form the speaker model. To perform recognition, the frames of the unknown speaker are buffered as they are received and compared with the sets of model characters to form model histograms for each speaker. A likelihood ratio is formed. The speaker candidate with the highest likelihood ratio is chosen as the unknown speaker.
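
The character covering step named in the abstract can be illustrated with a minimal sketch. It assumes a single greedy pass over the pooled frames and uses Euclidean distance as a stand-in for the Itakura distance. The function names `cover_characters` and `closest_character`, the toy 2-D frames, and the threshold value are hypothetical and are not taken from the patent.

```python
import math


def distance(a, b):
    """Euclidean distance, a stand-in for the Itakura distance (assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cover_characters(frames, min_dist):
    """Greedy covering (assumed strategy): keep a frame as a new character
    only if it lies at least min_dist from every character chosen so far."""
    characters = []
    for frame in frames:
        if all(distance(frame, c) >= min_dist for c in characters):
            characters.append(frame)
    return characters


def closest_character(frame, characters):
    """Return (index, distance) of the character nearest to frame."""
    dists = [distance(frame, c) for c in characters]
    idx = min(range(len(dists)), key=dists.__getitem__)
    return idx, dists[idx]


# Toy 2-D frames pooled from all speaker candidates.
# Real frames would be 12-coefficient LPC vectors.
frames = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0), (2.0, 2.0), (2.2, 2.1)]
characters = cover_characters(frames, min_dist=1.5)
nearest_idx, nearest_dist = closest_character((2.2, 2.1), characters)
```

With this kind of covering, every pair of retained characters is separated by at least the chosen minimum distance, which is the property the character set is described as having.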

16 Claims

1. A method of recognizing an unknown speaker as one of a plurality of speaker candidates from portions of speech of each of said speaker candidates and said unknown speaker comprising the steps of:
- converting digitized samples of said portions of speech of said speaker candidates into frames of speech, each frame representing a point in a preselected multi-dimensional speech space;
  
  generating a character set representative of said speech space comprising a plurality of characters selected from said frames of speech of all of said speaker candidates;
  
  generating a speaker model of each of said speaker candidates, said speaker model comprising a plurality of model characters selected from said character set and representative of an associated speaker's voice characteristics;
  
  converting digitized samples of said portions of speech of said unknown speaker into frames of speech; and
  
  comparing said frames of speech from said unknown speaker with said speaker models to determine which one of said speaker candidates has the greatest likelihood of being said unknown speaker.
  - 2. The method of claim 1 wherein said portions of speech of said speaker candidates are provided over a first communication channel and said portion of speech of said unknown speaker is provided over a second communication channel having different characteristics from said first communication channel;
    - and said method further comprises the step of:
      
      performing blind deconvolution of the digitized samples of both the speaker candidates' and unknown speaker's portions of speech before converting them into frames of speech.
  - 3. The method of claim 1 wherein said preselected multi-dimensional speech space is a 12 coefficient linear predictive code (LPC-12) speech space.
  - 4. The method of claim 1 wherein each of said characters in said character set is separated in said speech space from all other characters in said character set by a predetermined minimum distance.
  - 5. The method of claim 4 wherein said predetermined minimum distance is 1.5 Itakura units.
  - 6. The method of claim 1 wherein the step of generating a speaker model comprises the steps of:
    - dividing each of said portions of speech of each of said speaker candidates into a plurality of speech segments, each segment comprising a plurality of frames;
      
      determining character frequency of occurrence data for each segment which is the number of times in a selected segment that each character in the character set is the closest character to the frames of said selected segment;
      
      generating character occurrence statistics, which comprises:
      
      selecting one of the speaker candidates as the speaker and setting the remainder of said speaker candidates collectively as an imposter;
      
      calculating for each character in the character set the mean frequency of occurrence and standard deviation (std) across all speaker segments and the mean frequency of occurrence and standard deviation (std) across all imposter segments from the character frequency of occurrence data; and
      
      selecting a smaller plurality of model characters from said character set based on said character occurrence statistics.
  - 7. The method of claim 6 wherein the step of selecting model characters comprises selecting N characters having the top N f values, where N is any integer and f is determined as below:
    - if the speaker mean > 0.9 the imposter mean, then f = 4 * (Speaker mean - Imposter mean) - (Speaker std & Imposter std).
  - 8. The method of claim 7 wherein N is less than or equal to 40.
  - 9. The method of claim 1 wherein the step of generating a speaker model further comprises the steps of:
    - dividing each of said portions of speech of said speaker candidates into a plurality of speech segments, each segment comprising a plurality of frames;
      
      determining the distance, D, from each frame in each segment to the closest model character for each plurality of model characters associated with a speaker candidate;
      
      saving the distance, D, in a model histogram for each speaker candidate for each segment of speech;
      
      operating on each model histogram to select an optimum value of "D" such that at least a majority of input frames have distances greater than "D";
      
      selecting one of the speaker candidates as speaker and setting the remainder of said speaker candidates collectively as an imposter;
      
      calculating the mean distance value of "D" and standard deviation (std) across all speaker segments and the mean distance value of "D" and standard deviation (std) across all imposter segments from the optimum "D" values; and
      
      appending said mean distance values and std's to each plurality of model characters of an associated speaker candidate.
  - 10. The method of claim 9 wherein the step of operating on each model histogram comprises:
    - finding the distance value "D" from said histogram such that 30% of the frames had distances less than "D" and 70% of the frames had distances greater than "D".
  - 11. The method of claim 9 wherein the step of comparing said frames of speech from said unknown speaker with said speaker models further comprises:
    - determining the distance, D, from each frame from said unknown speaker to the closest model character for each plurality of model characters associated with a speaker candidate;
      
      saving the distance, D, in a model histogram for each speaker candidate;
      
      operating on a selected model histogram to select an optimum value of "D" such that at least a majority of input frames have distances greater than "D";
      
      using said "D" value and an associated speaker model's mean and std values to calculate the probability that the associated speaker candidate would produce distance "D" (Prob (D/Spk)) and the probability that the imposter would produce distance "D" (Prob (D/Imp));
      
      forming and saving the likelihood ratio Prob (D/Spk)/Prob (D/Imp);
      
      repeating the above steps of using said "D" values and forming and saving said likelihood ratio for each of said speaker models; and
      
      choosing said speaker which has the highest likelihood ratio as said unknown speaker.
  - 12. The method of claim 9 wherein the step of operating on a selected model histogram comprises:
    - finding the distance "D" such that 30% of the frames had distances less than "D" and 70% of the frames had distances greater than "D".
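
The selection and scoring steps recited in claims 7, 10 and 11 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the "&" in claim 7 is read as the sum of the two standard deviations, population standard deviation is used, the probabilities in claim 11 are modeled as Gaussian densities, and all function names and sample values are hypothetical.

```python
import math
import statistics


def f_score(spk_freqs, imp_freqs):
    """Claim 7 score for one character from per-segment occurrence counts.
    Returns None when the stated condition does not hold, since the claim
    gives no value of f for that case."""
    spk_mean = statistics.mean(spk_freqs)
    imp_mean = statistics.mean(imp_freqs)
    if spk_mean <= 0.9 * imp_mean:
        return None
    spk_std = statistics.pstdev(spk_freqs)
    imp_std = statistics.pstdev(imp_freqs)
    # Assumption: "(Speaker std & Imposter std)" means the sum of the two.
    return 4 * (spk_mean - imp_mean) - (spk_std + imp_std)


def select_model_characters(spk_counts, imp_counts, n):
    """Indices of the n characters with the highest f values."""
    scored = []
    for c in range(len(spk_counts)):
        f = f_score(spk_counts[c], imp_counts[c])
        if f is not None:
            scored.append((f, c))
    scored.sort(reverse=True)
    return [c for _, c in scored[:n]]


def optimum_d(distances, percent_below=30):
    """Distance D with roughly 30% of frames below it and 70% above."""
    ordered = sorted(distances)
    k = (percent_below * len(ordered)) // 100
    return ordered[min(k, len(ordered) - 1)]


def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def likelihood_ratio(d, spk_mean, spk_std, imp_mean, imp_std):
    """Prob(D/Spk) / Prob(D/Imp) under the Gaussian assumption."""
    return gaussian_pdf(d, spk_mean, spk_std) / gaussian_pdf(d, imp_mean, imp_std)


# Hypothetical occurrence counts: one row per character, one value per segment.
spk_counts = [[10, 12, 11], [2, 3, 2], [5, 5, 5]]
imp_counts = [[4, 5, 3], [6, 7, 8], [5, 6, 4]]
model_chars = select_model_characters(spk_counts, imp_counts, n=2)

# Hypothetical closest-character distances for the frames of one segment.
d = optimum_d([0.9, 0.2, 0.5, 0.7, 0.3, 1.1, 0.4, 0.8, 0.6, 1.0])
ratio = likelihood_ratio(d, spk_mean=0.5, spk_std=0.1, imp_mean=0.9, imp_std=0.2)
```

In recognition, a ratio of this kind would be computed once per speaker model, and the candidate with the highest ratio chosen, as claim 11 describes.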

13. An apparatus for recognizing an unknown speaker as one of a plurality of speaker candidates from portions of speech of each of said speaker candidates and unknown speaker comprising:
- means for sampling and digitizing said portions of speech to produce digitized samples;
  
  means for converting said digitized samples into frames of speech, each frame representing a point in a multi-dimensional speech space;
  
  means for generating a speaker model of each of said speaker candidates, said speaker model comprising a plurality of model characters selected from said frames of speech associated with a speaker candidate's portion of speech and representative of said speaker candidate's voice characteristics; and
  
  means for comparing said frames of speech from said unknown speaker with said speaker models to determine which one of said speaker candidates has the greatest likelihood of being said unknown speaker.
  - 14. The apparatus of claim 13 wherein said means for generating a speaker model further comprises:
    - means for generating a character set having a plurality of characters from said frames of speech of all of said speaker candidates which characters are representative of said speech space wherein each of said characters in said character set is separated in said speech space from all other characters in said character set by a predetermined minimum distance; and
      
      means for selecting said model characters for each of said speaker candidates from said character set.
  - 15. The apparatus of claim 13 wherein said apparatus further comprises means for channel normalization when said portions of speech of said speaker candidates are received over a separate and different channel from said portion of speech of said unknown speaker whereby said apparatus is still capable of recognizing said unknown speaker.
  - 16. The apparatus of claim 15 wherein said means for channel normalization comprises means for blind deconvolution of said digitized samples.
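
The idea behind the channel normalization of claims 15 and 16 can be illustrated with a simplified sketch. The claims do not specify the deconvolution algorithm, so this example substitutes a different and simpler technique, long-term log-spectral mean subtraction, and applies it to log-spectral frames rather than to the digitized samples. It relies on the fact that a fixed linear channel adds a constant offset to every log-spectral frame. The function name `normalize_channel` and all data values are hypothetical.

```python
def normalize_channel(log_spectra):
    """Subtract the long-term mean log spectrum from every frame.
    A time-invariant channel adds the same offset to each frame in the
    log-spectral domain, so removing the mean removes that offset."""
    n = len(log_spectra)
    dims = len(log_spectra[0])
    mean = [sum(frame[k] for frame in log_spectra) / n for k in range(dims)]
    return [[frame[k] - mean[k] for k in range(dims)] for frame in log_spectra]


# Hypothetical log-spectral frames of the same speech sent over two channels.
speech = [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [0.0, 1.0, 2.0]]
channel_a = [0.5, -1.0, 0.25]   # assumed fixed log response of channel A
channel_b = [-2.0, 0.75, 1.5]   # assumed fixed log response of channel B
over_a = [[s + c for s, c in zip(frame, channel_a)] for frame in speech]
over_b = [[s + c for s, c in zip(frame, channel_b)] for frame in speech]

norm_a = normalize_channel(over_a)
norm_b = normalize_channel(over_b)
```

After normalization the two versions of the speech coincide up to rounding, which is why a model trained over one channel can still be compared with speech received over another.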

Current Assignee: ITT Corporation (ITT, Inc.)
Original Assignee: ITT Defense Communications
Inventors: Wrench, Edwin H. Jr.; Li, Kung-Pu
Primary Examiner(s): CANGIALOSI, SALVATORE A
Application Number: US06/439,010
Time in Patent Office: 1,903 Days
Field of Search: 381/4, 381/37, 381/41, 381/42, 381/43, 381/44, 381/45, 364/513, 364/513.5
US Class Current: 704/247
CPC Class Codes:
G10L 17/02   Preprocessing operations, e...
G10L 17/04   Training, enrolment or mode...
G10L 17/06   Decision making techniques;...
