Speaker verification utilizing compressed audio formants

US 6,898,568 B2
Filed: 07/13/2001
Issued: 05/24/2005
Est. Priority Date: 07/13/2001
Status: Active Grant

Abstract

The identity of a remote speaker is verified by receiving compressed audio formants from a remote Internet telephony client and comparing the compressed audio formants with sample compressed audio formants known to represent the person the remote speaker purports to be. The compressed audio formants include energy and pitch data characterizing the residue of the speaker uttering a predetermined pass phrase and a plurality of formant coefficients characterizing the resonance of the speaker.

12 Claims

1. A method of performing speaker verification to determine whether a speaker is a registered speaker, the method comprising:
- a) obtaining an array of frames of compressed audio formants representing the speaker uttering a predetermined pass phrase, each frame within the array including;
  
  i) energy data and pitch data characterizing the residue of the speaker uttering the predetermined pass phrase; and
  
  ii) a plurality of formant coefficients characterizing the resonance of the speaker uttering the predetermined pass phrase; and
  
  b) performing a time domain normalization of the array of frames of compressed audio formants to a sample array of frames of compressed audio formants such that the two arrays are of an equal quantity of frames;
  
  c) determining whether the speaker is the registered speaker by;
  
  generating an array of discrepancy values, each discrepancy value representing the difference between one of i) an energy value, ii) a pitch value, and iii) a formant coefficient value of a frame of the array and a corresponding i) energy value, ii) pitch value, and iii) formant coefficient value of a corresponding frame in the sample array; and
  
  determining whether the array of discrepancy values is within a predetermined threshold.
- - 2. The method of performing speaker verification of claim 1, wherein performing a time domain normalization comprises:
    - comparing the quantity of frames in the array with the quantity of frames in the sample array to determine the quantity of frames to be decimated from the larger of the two arrays such that the two arrays are of an equal quantity of frames;
      
      selecting a pitch decimation group of frames from the larger of the two arrays, the pitch decimation group being the selection of frames which, if decimated, yields the best alignment between the pitch values of the two arrays after decimation;
      
      selecting an energy decimation group of frames from the larger of the two arrays, the energy decimation group being the selection of frames which, if decimated, yields the best alignment between the energy values of the two arrays after decimation;
      
      selecting a plurality of formant coefficient decimation groups, each formant coefficient decimation group being a selection of frames from the larger of the two arrays which, if decimated, yields the best alignment between the formant coefficient values of the two arrays after decimation; and
      
      determining a decimation group of frames from the larger of the two arrays, the decimation group being a quantity of frames equal to the quantity of frames to be decimated and being the frames which are selected by weighted average from the pitch decimation group, the energy decimation group, and each formant coefficient decimation group; and
      
      decimating the decimation group of frames from the larger of the two arrays.
  - 3. The method of performing speaker verification of claim 2, wherein the step of obtaining an array of frames of compressed audio formants includes receiving the frames of compressed audio formants from a remote Internet telephony device.
  - 4. The method of performing speaker verification of claim 3, wherein the step of obtaining an array of frames of compressed audio formants from the remote Internet telephony device comprises receiving audio input of the speaker uttering the pass phrase from a microphone, digitizing the audio input, and converting the digitized audio input to a sequence of frames of compressed audio formants.
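
The comparison in step c) of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Frame` type, the use of an absolute difference as the discrepancy, and the rule that every discrepancy value must fall at or below a single threshold are all hypothetical. The claim itself only requires that the array of discrepancy values be within a predetermined threshold.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Frame:
    """One frame of compressed audio formants (hypothetical layout)."""
    energy: float
    pitch: float
    formants: Sequence[float]  # formant coefficients for this frame


def discrepancies(array: List[Frame], sample: List[Frame]) -> List[float]:
    """Return one discrepancy per energy, pitch, and formant coefficient value.

    Assumes both arrays were already normalized to the same number of frames.
    """
    if len(array) != len(sample):
        raise ValueError("arrays must have an equal quantity of frames")
    out: List[float] = []
    for a, s in zip(array, sample):
        out.append(abs(a.energy - s.energy))
        out.append(abs(a.pitch - s.pitch))
        out.extend(abs(x - y) for x, y in zip(a.formants, s.formants))
    return out


def is_registered_speaker(array: List[Frame], sample: List[Frame],
                          threshold: float) -> bool:
    """Assumed decision rule: every discrepancy is within the threshold."""
    return all(d <= threshold for d in discrepancies(array, sample))
```

Energy, pitch, and formant coefficients are on different scales, so a practical system would likely scale each value or use a separate threshold per value type; the single threshold here only keeps the sketch short.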

5. A method of determining whether a speaker is a registered speaker, the method comprising:
- a) obtaining compressed audio formants for each frame of an array of frames representing the speaker uttering a predetermined pass phrase;
  
  b) performing a time domain normalization of the array to a sample array of frames stored in a memory and representing the registered speaker uttering the predetermined pass phrase to decimate a portion of the frames of the larger of the two arrays such that the two arrays, after decimation, are of an equal quantity of frames, the portion of the frames to be decimated being selected by;
  
  selecting a plurality of audio formant decimation groups, each audio formant decimation group being a selection of frames from the larger of the two arrays which, if decimated, yields the best alignment between a formant coefficient value of each frame of the array and the corresponding formant coefficient value of each frame of the sample array; and
  
  determining a decimation group of frames from the larger of the two arrays, the decimation group being a quantity of frames equal to the quantity of frames to be decimated and being the frames which are selected by weighted average from each of the audio formant decimation groups;
  
  c) generating an array of discrepancy values, each discrepancy value representing the difference between one of an audio formant value of a frame of the array and a corresponding audio formant value of a corresponding frame of the sample array; and
  
  d) determining that the remote speaker is the registered speaker if the array of discrepancy values is within a predetermined threshold.
- - 6. The method of determining whether a speaker is a registered speaker of claim 5, wherein determining the decimation group of frames comprises:
    - selecting a pitch decimation group of frames from the larger of the two arrays, the pitch decimation group being the selection of frames which, if decimated, yields the best alignment between the pitch values of the two arrays after decimation;
      
      selecting an energy decimation group of frames from the larger of the two arrays, the energy decimation group being the selection of frames which, if decimated, yields the best alignment between the energy values of the two arrays after decimation;
      
      selecting a plurality of formant coefficient decimation groups, each formant coefficient decimation group being a selection of frames from the larger of the two arrays which, if decimated, yields the best alignment between the formant coefficient values of the two arrays after decimation; and
      
      selecting frames from the larger of the two arrays for the decimation group by weighted average from the pitch decimation group, the energy decimation group, and each formant coefficient decimation group.
  - 7. The method of determining whether a speaker is a registered speaker of claim 6, wherein the step of obtaining compressed audio formants includes obtaining the compressed audio formants from a remote location and sending the compressed audio formants from the remote location.
  - 8. The method of determining whether a speaker is a registered speaker of claim 7, wherein the step of obtaining compressed audio formants at a remote location includes receiving audio input of the speaker uttering the pass phrase from a microphone, digitizing the audio input, and compressing the digitized audio input to generate compressed audio formants.
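
The time domain normalization recited in claims 5 and 6 can be sketched as below. This is one possible reading, offered as an assumption rather than the method of the specification: each feature track (pitch, energy, each formant coefficient) proposes its own decimation group by exhaustive search, alignment is scored as the sum of absolute differences, and "selected by weighted average" is interpreted as a weighted vote over frame indices. The function names, the `weights` mapping, and the tie-breaking rule are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def best_group(larger: Sequence[float], smaller: Sequence[float]) -> Tuple[int, ...]:
    """Indices to drop from `larger` so it best aligns with `smaller`.

    Exhaustive search, so only practical for short arrays (assumption).
    """
    k = len(larger) - len(smaller)

    def cost(drop: Tuple[int, ...]) -> float:
        kept = [v for i, v in enumerate(larger) if i not in drop]
        return sum(abs(a - b) for a, b in zip(kept, smaller))

    return min(combinations(range(len(larger)), k), key=cost)


def decimation_group(
    larger: Dict[str, Sequence[float]],
    smaller: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[int]:
    """Combine the per-feature groups into one group by weighted vote."""
    n = len(next(iter(larger.values())))
    k = n - len(next(iter(smaller.values())))
    score = [0.0] * n
    for name, values in larger.items():
        for i in best_group(values, smaller[name]):
            score[i] += weights[name]
    # Take the k highest-scoring frames; ties go to the earlier frame.
    ranked = sorted(range(n), key=lambda i: (-score[i], i))
    return sorted(ranked[:k])


def decimate(values: Sequence[float], group: Sequence[int]) -> List[float]:
    """Remove the frames in `group` from one feature track."""
    drop = set(group)
    return [v for i, v in enumerate(values) if i not in drop]
```

For example, if the pitch track favors dropping frames 1 and 3 while the energy track favors frames 2 and 4, a larger pitch weight makes the final decimation group `[1, 3]`.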

9. A speaker verification server for determining whether a remote speaker is a registered speaker, the server comprising:
- a) a network interface for receiving, via a packet switched network, compressed audio formants for each frame of an array of frames representing the remote speaker uttering a predetermined pass phrase as audio input to a remote telephony client;
  
  b) a database storing compressed audio formants for each frame of a sample array of frames representing the registered speaker uttering the predetermined pass phrase as audio input; and
  
  c) a verification application operatively coupled to each of the network interface and the database for comparing the compressed audio formants of the array of frames to the compressed audio formants of the sample array of frames to determine whether the remote speaker is the registered speaker by;
  
  performing a time domain normalization of the array to the sample array such that the two arrays are of an equal quantity of frames;
  
  generating an array of discrepancy values, each discrepancy value representing the difference between one of an audio formant value of a frame of the array and a corresponding audio formant value of a corresponding frame of the sample array; and
  
  determining that the remote speaker is the registered speaker if the array of discrepancy values is within a predetermined threshold.
- - 10. The speaker verification server of claim 9, wherein the verification application performs time domain normalization by:
    - comparing the quantity of frames in the array with the quantity of frames in the sample array to determine the quantity of frames to be decimated from the larger of the two arrays such that the two arrays are of an equal quantity of frames;
      
      selecting a plurality of audio formant decimation groups, each audio formant decimation group being a selection of frames from the larger of the two arrays which, if decimated, yields the best alignment between a formant coefficient value of each frame of the array and the corresponding formant coefficient value of each frame of the sample array after decimation; and
      
      determining a decimation group of frames from the larger of the two arrays, the decimation group being a quantity of frames equal to the quantity of frames to be decimated and being the frames which are selected by weighted average from each of the audio formant decimation groups; and
      
      decimating the decimation group of frames from the larger of the two arrays.
  - 11. The speaker verification server of claim 10, wherein the compressed audio formants include energy data and pitch data characterizing the residue of the speaker uttering the predetermined pass phrase and formant coefficients characterizing the resonance of the speaker uttering the predetermined pass phrase;
    - and each frame includes an energy value and pitch value characterizing the residue of the registered speaker uttering the registered pass phrase and formant coefficient values characterizing the resonance of the registered speaker uttering the registered pass phrase.
  - 12. The speaker verification server of claim 11, wherein the verification application determines the decimation group of frames by:
    - selecting a pitch decimation group of frames from the larger of the two arrays, the pitch decimation group being the selection of frames which, if decimated, yields the best alignment between the pitch values of the two arrays after decimation;
      
      selecting an energy decimation group of frames from the larger of the two arrays, the energy decimation group being the selection of frames which, if decimated, yields the best alignment between the energy values of the two arrays after decimation;
      
      selecting a plurality of formant coefficient decimation groups, each formant coefficient decimation group being a selection of frames from the larger of the two arrays which, if decimated, yields the best alignment between the formant coefficient values of the two arrays after decimation; and
      
      selecting frames from the larger of the two arrays for the decimation group by weighted average from the pitch decimation group, the energy decimation group, and each formant coefficient decimation group.
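
The arrangement of claim 9 (a database of sample arrays plus a verification application) might be organized roughly as follows. This is a hedged sketch only: the class name, the in-memory dictionary standing in for the database, the flat per-frame value layout, and the evenly spaced frame dropping used as a placeholder for the claimed time domain normalization are all assumptions, and the network interface is omitted entirely.

```python
from typing import Dict, List, Sequence

# Assumed per-frame layout: energy, pitch, then formant coefficients.
FrameValues = Sequence[float]


class VerificationApplication:
    """Hypothetical stand-in for the verification application of claim 9."""

    def __init__(self, database: Dict[str, List[FrameValues]], threshold: float):
        self.database = database    # registered speaker id -> sample array
        self.threshold = threshold  # predetermined threshold (assumed scalar)

    @staticmethod
    def _shrink(frames: List[FrameValues], n: int) -> List[FrameValues]:
        """Placeholder normalization: keep n evenly spaced frames."""
        return [frames[i * len(frames) // n] for i in range(n)]

    def verify(self, claimed_id: str, frames: List[FrameValues]) -> bool:
        """Return True if `frames` match the sample stored for `claimed_id`."""
        sample = self.database.get(claimed_id)
        if not sample or not frames:
            return False
        n = min(len(frames), len(sample))
        a, s = self._shrink(frames, n), self._shrink(sample, n)
        diffs = (abs(x - y) for fa, fs in zip(a, s) for x, y in zip(fa, fs))
        return all(d <= self.threshold for d in diffs)
```

The placeholder `_shrink` step is where the decimation-group selection of claims 10 and 12 would go in a fuller implementation.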

Current Assignee: Innomedia Pre Limited
Original Assignee: Innomedia Pte Ltd
Inventors: Ng, Kai Wa; Lin, Nan Sheng; Ouyang, Jing Zheng
Primary Examiner(s): MCFADDEN, SUSAN IRIS
Application Number: US09/904,999
Publication Number: US 20030014247A1
Time in Patent Office: 1,411 Days
Field of Search: 704/243, 704/246, 704/250, 704/231, 704/270, 704/273, 704/209, 704/207, 704/208
US Class Current: 704/246
CPC Class Codes:
G10L 17/06 Decision making techniques;...
G10L 25/27 characterised by the analys...
