Hierarchical real-time speaker recognition for biometric VoIP verification and targeting

US 8,160,877 B1
Filed: 08/06/2009
Issued: 04/17/2012
Est. Priority Date: 08/06/2009
Status: Active Grant


Abstract

A method for real-time speaker recognition including obtaining speech data of a speaker, extracting, using a processor of a computer, a coarse feature of the speaker from the speech data, identifying the speaker as belonging to a pre-determined speaker cluster based on the coarse feature of the speaker, extracting, using the processor of the computer, a plurality of Mel-Frequency Cepstral Coefficients (MFCC) and a plurality of Gaussian Mixture Model (GMM) components from the speech data, determining a biometric signature of the speaker based on the plurality of MFCC and the plurality of GMM components, and determining in real time, using the processor of the computer, an identity of the speaker by comparing the biometric signature of the speaker to one of a plurality of biometric signature libraries associated with the pre-determined speaker cluster.
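
The narrowing described in the abstract can be illustrated with a minimal sketch, not the patented implementation. It assumes the coarse feature is a single number (for example a pitch estimate in Hz), that each cluster's speaker-independent parameter is a scalar centroid, and that signatures are short vectors compared by Euclidean distance. The `Cluster` class, the cluster names, and all values are hypothetical.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Cluster:
    name: str
    centroid: float                                  # hypothetical speaker-independent parameter
    children: list = field(default_factory=list)     # next-level partitions, if any
    signatures: dict = field(default_factory=dict)   # speaker id -> signature vector (leaves only)

def descend(root, coarse_feature):
    """Walk down the hierarchy, picking the child whose centroid is closest at each level."""
    node = root
    while node.children:
        node = min(node.children, key=lambda c: abs(c.centroid - coarse_feature))
    return node

def identify(root, coarse_feature, signature):
    """Compare the signature only against the library of the selected leaf cluster."""
    leaf = descend(root, coarse_feature)
    best = min(leaf.signatures, key=lambda sid: math.dist(leaf.signatures[sid], signature))
    return leaf.name, best

# Two-level toy hierarchy with made-up centroids and signatures.
root = Cluster("all", 0.0, children=[
    Cluster("low-pitch", 120.0, children=[
        Cluster("low-pitch/a", 110.0, signatures={"spk1": [0.1, 0.2], "spk2": [0.9, 0.8]}),
        Cluster("low-pitch/b", 140.0, signatures={"spk3": [0.5, 0.5]}),
    ]),
    Cluster("high-pitch", 210.0, children=[
        Cluster("high-pitch/a", 200.0, signatures={"spk4": [0.3, 0.7]}),
        Cluster("high-pitch/b", 240.0, signatures={"spk5": [0.6, 0.1]}),
    ]),
])

result = identify(root, 115.0, [0.15, 0.25])
print(result)
```

Only the two signatures in the selected leaf are compared in this example, rather than all five, which is the reason for searching hierarchically.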

22 Claims

1. A method for real-time speaker recognition, comprising:
- obtaining speech data of a speaker to identify the speaker from a plurality of speakers;
  
  extracting, using a processor of a computer, a coarse feature of the speaker from the speech data;
  
  identifying the speaker as belonging to a pre-determined speaker cluster that is one of a plurality of partitions of the plurality of speakers and corresponds to a subset of a plurality of biometric signatures of the plurality of speakers, wherein identifying the speaker as belonging to the pre-determined speaker cluster is based on comparing the coarse feature of the speaker to a speaker independent parameter representing the subset of the plurality of biometric signatures;
  
  further identifying, in response to identifying the speaker as belonging to the pre-determined speaker cluster, the speaker as belonging to a second level pre-determined speaker cluster that is one of a plurality of second level partitions of the pre-determined speaker cluster and corresponds to a second level subset of the subset of the plurality of biometric signatures, wherein identifying the speaker as belonging to the second level pre-determined speaker cluster is based on comparing the coarse feature of the speaker to a second level speaker independent parameter representing the second level subset of the subset of the plurality of biometric signatures;
  
  extracting, using the processor of the computer, a plurality of Mel-Frequency Cepstral Coefficients (MFCC) and a plurality of Gaussian Mixture Model (GMM) components from the speech data;
  
  determining a biometric signature of the speaker based on the plurality of MFCC and the plurality of GMM components; and
  
  determining in real time, using the processor of the computer, an identity of the speaker by comparing the biometric signature of the speaker to the second level subset of the subset of the plurality of biometric signatures, wherein each of the plurality of biometric signatures is specific to one of the plurality of speakers.
- - 2. The method of claim 1, wherein each of the plurality of partitions of the plurality of speakers comprises one of a plurality of speaker gender clusters, wherein each of the plurality of second level partitions of the pre-determined speaker cluster comprises one of a plurality of speaker age clusters, the method further comprising:
    - further identifying, in response to identifying the speaker as belonging to the second level pre-determined speaker cluster, the speaker as belonging to a third level pre-determined speaker cluster that is one of a plurality of third level partitions of the second level pre-determined speaker cluster and corresponds to a third level subset of the second level subset of the subset of the plurality of biometric signatures, wherein identifying the speaker as belonging to the third level pre-determined speaker cluster is based on comparing the coarse feature of the speaker to a third level speaker independent parameter representing the third level subset of the second level subset of the subset of the plurality of biometric signatures,
      
      wherein each of the plurality of third level partitions of the second level pre-determined speaker cluster comprises one of a plurality of speaker native language clusters, and
      
      wherein comparing the biometric signature of the speaker to the second level subset of the subset of the plurality of biometric signatures is limited to comparing to the third level subset of the second level subset of the subset of the plurality of biometric signatures.
  - 3. The method of claim 2, wherein the speech data is extracted from a phone call conversation of the speaker originated from a pre-determined voice over Internet protocol (VoIP) phone number, the method further comprising:
    - assigning a biometric signature of a owner of the pre-determined VoIP phone number to the pre-determined VoIP phone number;
      
      determining in real time that the speaker is not the owner of the pre-determined VoIP phone number when the biometric signature of the speaker mis-matches the biometric signature assigned to the pre-determined VoIP phone number.
  - 4. The method of claim 3, wherein the biometric signature assigned to the pre-determined VoIP phone number is generated based on another speech data of the owner obtained during a training period, wherein one of language and content of the speech data and the another speech data are different.
  - 5. The method of claim 1, further comprising:
    - generating pre-processed speech data by removing silence frames and oversaturation frames from the speech data as well as normalizing loudness of the speech data,
      
      wherein the plurality of MFCC and the plurality of GMM components are extracted from the pre-processed speech data.
  - 6. The method of claim 5, wherein the speech data is less than 3 seconds in duration.
  - 7. The method of claim 1, further comprising:
    - representing the speech data as a sum of a plurality of amplitude modulation and frequency modulation (AM-FM) models corresponding to a plurality of formants;
      
      extracting, using a processor of a computer, a plurality of vowel-like speech frames from the speech data;
      
      determining, using the processor of the computer, one or more dominant formant frequencies based on the plurality of vowel-like speech frames and the plurality of AM-FM models;
      
      generating one or more set of filtered speech data by band-passed filtering the plurality of vowel-like speech frames based on the one or more dominant formant frequencies;
      
      generating one or more quasi-periodic amplitude envelops from the one or more set of filtered speech data using discrete energy separation algorithm (DESA); and
      
      determining a pitch period of the speaker from the one or more quasi-periodic amplitude envelops,
      
      wherein the coarse feature of the speaker comprises the pitch period of the speaker.
  - 8. The method of claim 7, wherein the pre-determined speaker cluster corresponds to one of male speaker cluster and female speaker cluster.
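
Claim 7 names the discrete energy separation algorithm (DESA) without specifying a variant. The following is a minimal sketch assuming the published DESA-2 form, which builds on the Teager-Kaiser energy operator; it is not taken from the patent specification. The function names and the test tone are hypothetical, and the band-pass filtering and pitch-period steps of the claim are not shown.

```python
import math

def teager(x, n):
    """Teager-Kaiser energy operator at sample n: x[n]^2 - x[n-1]*x[n+1]."""
    return x[n] * x[n] - x[n - 1] * x[n + 1]

def desa2(x):
    """Estimate amplitude envelope and frequency (rad/sample) for samples 2..len(x)-3."""
    # Symmetric difference y[n] = x[n+1] - x[n-1]; end samples are padded with zeros.
    y = [0.0] + [x[n + 1] - x[n - 1] for n in range(1, len(x) - 1)] + [0.0]
    amp, freq = [], []
    for n in range(2, len(x) - 2):
        psi_x = teager(x, n)
        psi_y = teager(y, n)
        if psi_x <= 0.0 or psi_y <= 0.0:
            # The operator can go non-positive on noise; report zeros for that sample.
            amp.append(0.0)
            freq.append(0.0)
            continue
        ratio = max(-1.0, min(1.0, 1.0 - psi_y / (2.0 * psi_x)))
        freq.append(0.5 * math.acos(ratio))
        amp.append(2.0 * psi_x / math.sqrt(psi_y))
    return amp, freq

# Sanity check on a pure tone with amplitude 0.7 and frequency 0.3 rad/sample.
tone = [0.7 * math.cos(0.3 * n + 0.1) for n in range(50)]
amp, freq = desa2(tone)
print(round(amp[0], 6), round(freq[0], 6))
```

For an amplitude-modulated band-passed frame, the `amp` sequence would be the quasi-periodic envelope from which a pitch period could then be estimated, for example by locating its dominant periodicity.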

9. A non-transitory computer readable medium, embodying instructions when executed by the computer to perform real-time speaker recognition, the instructions comprising functionality for:
- obtaining speech data of a speaker to identify the speaker from a plurality of speakers;
  
  extracting a coarse feature of the speaker from the speech data;
  
  identifying the speaker as belonging to a pre-determined speaker cluster that is one of a plurality of partitions of the plurality of speakers and corresponds to a subset of a plurality of biometric signatures of the plurality of speakers, wherein identifying the speaker as belonging to the pre-determined speaker cluster is based on comparing the coarse feature of the speaker to a speaker independent parameter representing the subset of the plurality of biometric signatures;
  
  further identifying, in response to identifying the speaker as belonging to the pre-determined speaker cluster, the speaker as belonging to a second level pre-determined speaker cluster that is one of a plurality of second level partitions of the pre-determined speaker cluster and corresponds to a second level subset of the subset of the plurality of biometric signatures, wherein identifying the speaker as belonging to the second level pre-determined speaker cluster is based on comparing the coarse feature of the speaker to a second level speaker independent parameter representing the second level subset of the subset of the plurality of biometric signatures;
  
  extracting a plurality of Mel-Frequency Cepstral Coefficients (MFCC) and a plurality of Gaussian Mixture Model (GMM) components from the speech data;
  
  determining a biometric signature of the speaker based on the plurality of MFCC and the plurality of GMM components; and
  
  determining in real time, using the processor of the computer, an identity of the speaker by comparing the biometric signature of the speaker to the second level subset of the subset of the plurality of biometric signatures, wherein each of the plurality of biometric signatures is specific to one of the plurality of speakers.
- - 10. The non-transitory computer readable medium of claim 9, wherein each of the plurality of partitions of the plurality of speakers comprises one of a plurality of speaker gender clusters hierarchy, wherein each of the plurality of second level partitions of the pre-determined speaker cluster comprises one of a plurality of speaker age clusters, the instructions further comprising functionality for:
    - further identifying, in response to identifying the speaker as belonging to the second level pre-determined speaker cluster, the speaker as belonging to a third level pre-determined speaker cluster that is one of a plurality of third level partitions of the second level pre-determined speaker cluster and corresponds to a third level subset of the second level subset of the subset of the plurality of biometric signatures, wherein identifying the speaker as belonging to the third level pre-determined speaker cluster is based on comparing the coarse feature of the speaker to a third level speaker independent parameter representing the third level subset of the second level subset of the subset of the plurality of biometric signatures,
      
      wherein each of the plurality of third level partitions of the second level pre-determined speaker cluster comprises one of a plurality of speaker native language clusters, and
      
      wherein comparing the biometric signature of the speaker to the second level subset of the subset of the plurality of biometric signatures is limited to comparing to the third level subset of the second level subset of the subset of the plurality of biometric signatures.
  - 11. The non-transitory computer readable medium of claim 10, wherein the plurality of hierarchical speaker clusters comprises more than 500 clusters and the one of the plurality of biometric signature libraries comprises over 1000 biometric signatures.
  - 12. The non-transitory computer readable medium of claim 9, the instructions when executed by the processor further comprising functionality for:
    - generating pre-processed speech data by removing silence frames and oversaturation frames from the speech data as well as normalizing loudness of the speech data,
      
      wherein the plurality of MFCC and the plurality of GMM components are extracted from the pre-processed speech data.
  - 13. The non-transitory computer readable medium of claim 12, wherein the speech data is less than 3 seconds in duration.
  - 14. The non-transitory computer readable medium of claim 9, the instructions when executed by the processor further comprising functionality for:
    - representing the speech data as a sum of a plurality of amplitude modulation and frequency modulation (AM-FM) models corresponding to a plurality of formants;
      
      extracting a plurality of vowel-like speech frames from the speech data;
      
      determining one or more dominant formant frequencies based on the plurality of vowel-like speech frames and the plurality of AM-FM models;
      
      generating one or more set of filtered speech data by band-passed filtering the plurality of vowel-like speech frames based on the one or more dominant formant frequencies;
      
      generating one or more quasi-periodic amplitude envelops from the one or more set of filtered speech data using discrete energy separation algorithm (DESA); and
      
      determining a pitch period of the speaker from the one or more quasi-periodic amplitude envelops,
      
      wherein the coarse feature of the speaker comprises the pitch period of the speaker.
  - 15. The non-transitory computer readable medium of claim 14, wherein the pre-determined speaker cluster corresponds to one of male speaker cluster and female speaker cluster.
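
The pre-processing recited in claim 12 (dropping silence and oversaturation frames, then normalizing loudness) could look roughly like the sketch below. The claims give no thresholds or frame sizes, so `frame_len`, `silence_rms`, `clip_level`, `clip_fraction`, and `target_rms` are all hypothetical parameters, and RMS is only one possible loudness measure. Samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math

def rms(values):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def preprocess(samples, frame_len=160, silence_rms=0.01,
               clip_level=0.99, clip_fraction=0.1, target_rms=0.1):
    """Drop silent and oversaturated frames, then scale the rest to target_rms."""
    kept = []
    # Non-overlapping frames; a trailing partial frame is discarded.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if rms(frame) < silence_rms:
            continue  # silence frame
        clipped = sum(1 for s in frame if abs(s) >= clip_level) / frame_len
        if clipped > clip_fraction:
            continue  # oversaturated (clipped) frame
        kept.extend(frame)
    if not kept:
        return []
    gain = target_rms / rms(kept)
    return [s * gain for s in kept]

# Toy input: one silent frame, one fully clipped frame, one voiced frame.
silent = [0.0] * 160
clipped = [1.0] * 160
voiced = [0.5 * math.sin(2 * math.pi * n / 16) for n in range(160)]
clean = preprocess(silent + clipped + voiced)
print(len(clean), round(rms(clean), 6))
```

Only the voiced frame survives in this toy input, and its level is rescaled to the target.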

16. A system for speaker recognition, comprising:
- a repository storing a plurality of biometric signature libraries;
  
  a processor; and
  
  memory storing instructions when executed by the processor comprising functionalities for;
  
  obtaining speech data of a speaker to identify the speaker from a plurality of speakers;
  
  extracting a coarse feature of the speaker from the speech data;
  
  identifying the speaker as belonging to a pre-determined speaker cluster that is one of a plurality of partitions of the plurality of speakers and corresponds to a subset of a plurality of biometric signatures of the plurality of speakers, wherein identifying the speaker as belonging to the pre-determined speaker cluster is based on comparing the coarse feature of the speaker to a speaker independent parameter representing the subset of the plurality of biometric signatures;
  
  further identifying, in response to identifying the speaker as belonging to the pre-determined speaker cluster, the speaker as belonging to a second level pre-determined speaker cluster that is one of a plurality of second level partitions of the pre-determined speaker cluster and corresponds to a second level subset of the subset of the plurality of biometric signatures, wherein identifying the speaker as belonging to the second level pre-determined speaker cluster is based on comparing the coarse feature of the speaker to a second level speaker independent parameter representing the second level subset of the subset of the plurality of biometric signatures;
  
  extracting a plurality of Mel-Frequency Cepstral Coefficients (MFCC) for a Gaussian Mixture Model (GMM) from the speech data;
  
  determining a biometric signature of the speaker based on the plurality of MFCC and the GMM; and
  
  determining in real time, using the processor of the computer, an identity of the speaker by comparing the biometric signature of the speaker to the second level subset of the subset of the plurality of biometric signatures, wherein each of the plurality of biometric signatures is specific to one of the plurality of speakers.
- - 17. The system of claim 16, wherein each of the plurality of partitions of the plurality of speakers comprises one of a plurality of speaker gender clusters, wherein each of the plurality of second level partitions of the pre-determined speaker cluster comprises one of a plurality of speaker age clusters, the instructions when executed by the processor further comprising functionalities for:
    - further identifying, in response to identifying the speaker as belonging to the second level pre-determined speaker cluster, the speaker as belonging to a third level pre-determined speaker cluster that is one of a plurality of third level partitions of the second level pre-determined speaker cluster and corresponds to a third level subset of the second level subset of the subset of the plurality of biometric signatures, wherein identifying the speaker as belonging to the third level pre-determined speaker cluster is based on comparing the coarse feature of the speaker to a third level speaker independent parameter representing the third level subset of the second level subset of the subset of the plurality of biometric signatures,
      
      wherein each of the plurality of third level partitions of the second level pre-determined speaker cluster comprises one of a plurality of speaker native language clusters, and
      
      wherein comparing the biometric signature of the speaker to the second level subset of the subset of the plurality of biometric signatures is limited to comparing to the third level subset of the second level subset of the subset of the plurality of biometric signatures.
  - 18. The system of claim 17, wherein the plurality of hierarchical speaker clusters comprises more than 500 clusters and the one of the plurality of biometric signature libraries comprises over 1000 biometric signatures.
  - 19. The system of claim 16, the instructions when executed by the processor further comprising functionality for:
    - generating pre-processed speech data by removing silence frames and oversaturation frames from the speech data as well as normalizing loudness of the speech data,
      
      wherein the plurality of MFCC for the GMM are extracted from the pre-processed speech data.
  - 20. The system of claim 19, wherein the speech data is less than 3 seconds in duration.
  - 21. The system of claim 16, the instructions when executed by the processor further comprising functionality for:
    - representing the speech data as a sum of a plurality of amplitude modulation and frequency modulation (AM-FM) models corresponding to a plurality of formants;
      
      extracting a plurality of vowel-like speech frames from the speech data;
      
      determining one or more dominant formant frequencies based on the plurality of vowel-like speech frames and the plurality of AM-FM models;
      
      generating one or more set of filtered speech data by band-passed filtering the plurality of vowel-like speech frames based on the one or more dominant formant frequencies;
      
      generating one or more quasi-periodic amplitude envelops from the one or more set of filtered speech data using discrete energy separation algorithm (DESA); and
      
      determining a pitch period of the speaker from the one or more quasi-periodic amplitude envelops,
      
      wherein the coarse feature of the speaker comprises the pitch period of the speaker.
  - 22. The system of claim 21, wherein the pre-determined speaker cluster corresponds to one of male speaker cluster and female speaker cluster.
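
Claim 16 recites MFCC extracted for a GMM but does not say how a signature is scored against the stored signatures. One common approach in speaker recognition, assumed here purely for illustration, is to model each enrolled speaker with a diagonal-covariance GMM and pick the model with the highest average log-likelihood over the feature frames. The speaker names, the two-dimensional "MFCC" vectors, and the model parameters below are made up.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def avg_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood; gmm is a list of (weight, mean, var) tuples."""
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        top = max(logs)
        # Log-sum-exp over mixture components for numerical stability.
        total += top + math.log(sum(math.exp(l - top) for l in logs))
    return total / len(frames)

def best_speaker(frames, models):
    """Return the speaker id whose GMM best explains the frames."""
    return max(models, key=lambda sid: avg_log_likelihood(frames, models[sid]))

# Hypothetical enrolled models restricted to one leaf cluster, and two feature frames.
models = {
    "alice": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])],
    "bob": [(1.0, [5.0, 5.0], [1.0, 1.0])],
}
frames = [[0.1, -0.2], [1.9, 2.1]]
winner = best_speaker(frames, models)
print(winner)
```

In a hierarchical setup, `models` would contain only the speakers in the selected lowest-level cluster, so the number of likelihood evaluations grows with the cluster size rather than with the full speaker population.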

Current Assignee: The Boeing Co.
Original Assignee: Narus, Inc. (Gen Digital Inc.)
Inventors: Nucci, Antonio; Keralapura, Ram
Primary Examiner(s): Godbold, Douglas
Application Number: US12/536,784
Time in Patent Office: 985 Days
Field of Search: 704/246-250
US Class Current: 704/246
CPC Class Codes: G10L 17/06 Decision making techniques;...
