Speaker recognition using spectral coefficients normalized with respect to unequal frequency bands

US 5,583,961 A
Filed: 08/13/1993
Issued: 12/10/1996
Est. Priority Date: 03/25/1993
Status: Expired due to Term

1 Assignment

0 Petitions

Abstract

Apparatus and method for speaker recognition includes generating, in response to a speech signal, a plurality of feature data having a series of coefficient sets, each set having a plurality of coefficients indicating the short term spectral amplitude in a plurality of frequency bands. The feature data is compared with predetermined speaker reference data, and recognition of a corresponding speaker is indicated in dependence upon such comparison. The frequency bands are unevenly spaced along the frequency axis, and a long term average spectral magnitude of at least one of said coefficients is derived and used for normalizing the at least one coefficient.
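
A short derivation, offered as a sketch rather than as the patent's own reasoning, suggests why normalizing by a long term average can reduce dependence on a stationary spectral envelope when the logarithmic form recited in claim 4 is used. The symbols are assumptions introduced here: Y_k(t) is the observed magnitude in band k at frame t, S_k(t) is the speech contribution, H_k is a time-invariant gain (for example a telephone channel), and T is the number of frames averaged.

```latex
\begin{align*}
Y_k(t) &= H_k \, S_k(t) \\
\log Y_k(t) &= \log H_k + \log S_k(t) \\
\overline{\log Y_k} &= \frac{1}{T}\sum_{t=1}^{T} \log Y_k(t)
                     = \log H_k + \overline{\log S_k} \\
\log Y_k(t) - \overline{\log Y_k} &= \log S_k(t) - \overline{\log S_k}
\end{align*}
```

Under these assumptions the stationary term log H_k cancels, so the normalized coefficient depends only on how the speech varies around its own long term average.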

77 Citations

49 Claims

1. A method of speaker recognition, said method comprising the steps of:
- deriving recognition feature data from an input speech signal represented by plural successive frames of digital data for a speech utterance, said recognition feature data comprising a plurality of coefficients each related to speech signal magnitude in a predetermined frequency band;
  
  comparing said feature data with predetermined speaker reference data;
  
  indicating recognition of a speaker in dependence upon the comparison;
  
  said frequency bands being unevenly spaced with respect to frequency, said deriving step including a step of deriving a long term average spectral magnitude extending over plural of said frames of digital data; and
  
  processing at least one of said coefficients so as to generate a normalized coefficient in which the effect of said long term magnitude is substantially reduced.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10):
  - 2. A method as in claim 1, in which the frequency bands are spaced on a mel-frequency scale.
  - 3. A method as in claim 1, in which the frequency bands are spaced linearly with respect to frequency below a predetermined limit and logarithmically with respect to frequency above said limit.
  - 4. A method as in claim 1 in which the deriving step includes a step of generating a logarithm of said magnitude, generating a logarithmic long term average value and subtracting the logarithmic long term average from the logarithmic magnitude.
  - 5. A method as in claim 1 in which said comparing step time-aligns the feature data with the reference data.
  - 6. A method as in claim 5, in which the comparing step employs a Dynamic Time Warp process.
  - 7. A method as in claim 1 further comprising:
    - recognizing a speech start point and a speech end point within said input speech signal; and
      
      deriving said long term average over the duration between said start point and said end point.
  - 8. A method as in claim 1 in which said long term average comprises the long term mean.
  - 9. A method as in claim 1 in which said long term average comprises a moving average which is periodically updated.
  - 10. A method as in claim 1 further comprising:
    - inputting a plurality of words one after another, and forming said long term average over all of said words.
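
Claim 2 recites frequency bands spaced on a mel-frequency scale. A minimal sketch of how such band edges might be computed follows. The conversion formula (2595 log10(1 + f/700)) is a commonly used mel approximation and is an assumption here, since the claims do not fix one; the function names, the band count, and the 100 Hz to 4000 Hz range are likewise hypothetical.

```python
import math


def hz_to_mel(f_hz):
    # Common mel approximation; assumed here, not taken from the claims.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_low_hz, f_high_hz):
    """Return n_bands + 1 edge frequencies in Hz, equally spaced in mel."""
    m_low, m_high = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (m_high - m_low) / n_bands
    return [mel_to_hz(m_low + i * step) for i in range(n_bands + 1)]


# Hypothetical configuration: 8 bands across a telephone-like bandwidth.
edges = mel_band_edges(n_bands=8, f_low_hz=100.0, f_high_hz=4000.0)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

Because the edges are equally spaced in mel, the band widths in Hz grow with frequency, which is one way of obtaining bands that are "unevenly spaced with respect to frequency".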

11. Apparatus for speaker recognition which comprises:
- means for generating a plurality of feature data comprising a series of coefficient sets from a speech signal represented by plural successive frames of digital data for a speech utterance, each set comprising a plurality of coefficients indicating short term spectral magnitude in a plurality of unevenly spaced frequency bands,
  
  means for comparing said feature data with predetermined speaker reference data and for indicating recognition of a corresponding speaker in dependence upon said comparison, and
  
  means for deriving a long term average spectral magnitude of at least one of said coefficients extending over plural of said frames of digital data and for normalizing the said at least one coefficient by said long term average.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22):
  - 12. Apparatus as in claim 11, in which the frequency bands are spaced on a mel-frequency scale.
  - 13. Apparatus as in claim 11, in which the frequency bands are spaced linearly with respect to frequency below a predetermined limit and logarithmically with respect to frequency above said limit.
  - 14. Apparatus as in claim 11 in which the means for generating said coefficients are arranged to generate a logarithm of said magnitude, to generate a logarithmic long term average value and to subtract the logarithmic long term average from the logarithmic coefficient magnitude.
  - 15. Apparatus as in claim 11 in which said means for comparing is arranged to time-align the feature data with the reference data.
  - 16. Apparatus as in claim 15, in which the means for comparing employs a Dynamic Time Warp process.
  - 17. Apparatus as in claim 11 further comprising:
    - means for recognizing a start point and an end point within said speech signal, said means for deriving and normalizing is arranged to derive said long term average over the duration of a speech utterance between said start point and said end point.
  - 18. Apparatus as in claim 11 in which said long term average comprises the long term mean.
  - 19. Apparatus as in claim 11 in which said long term average comprises a moving average which is periodically updated.
  - 20. Apparatus as in claim 11 arranged for inputting speech signals representing a plurality of words one after another, in which said means for deriving and normalizing is arranged to form said long term average over all of the said words.
  - 21. Apparatus as in claim 11 adapted to be connected to a telephone network.
  - 22. A telephone network comprising apparatus as in claim 21.
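
Claims 15 and 16 recite time-aligning the feature data with the reference data using a Dynamic Time Warp process. The following is a minimal sketch of a textbook DTW recursion, not the patent's specific implementation. The Euclidean frame distance, the unconstrained step pattern, the function names, and the acceptance threshold are all assumptions made for illustration.

```python
import math


def frame_distance(a, b):
    # Euclidean distance between two coefficient vectors (assumed metric).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(test, ref):
    """Accumulated distance along the cheapest monotonic alignment path."""
    n, m = len(test), len(ref)
    inf = float("inf")
    # cost[i][j] is the best accumulated distance aligning test[:i] with ref[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(test[i - 1], ref[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # test frame repeats a ref frame
                                 cost[i][j - 1],      # ref frame repeats a test frame
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(test, ref, threshold):
    # Hypothetical decision rule: accept when the aligned distance is small.
    return dtw_distance(test, ref) <= threshold


reference = [[0.0], [1.0], [2.0]]
slow_copy = [[0.0], [0.0], [1.0], [1.0], [2.0]]   # same pattern, spoken more slowly
different = [[3.0], [3.0], [3.0]]
```

In this toy example `slow_copy` aligns with `reference` at zero accumulated distance despite its different length, while `different` does not, which is the property that makes time alignment useful when a speaker utters the same word at varying speeds.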

23. Apparatus for recognition processing of a voice signal represented by plural successive frames of digital data for a speech utterance, said apparatus comprising:
- means for deriving recognition data comprising a plurality of signals each related to short term amplitude in a corresponding frequency band of said voice signal, said frequency bands being unevenly spaced in the frequency domain,
  
  means for performing recognition processing using said recognition data,
  
  means for periodically generating or updating a moving long term average spectral amplitude extending over plural of said frames of digital data in said frequency bands; and
  
  means for processing feature data based on said recognition data using said long term average to reduce their dependence upon stationary spectral envelope components.
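
Claim 23 recites a moving long term average that is periodically generated or updated. One possible realization, sketched below, is an exponentially weighted running mean of the log band magnitudes that is updated once per frame. The exponential form, the class name `MovingLogAverage`, and the smoothing constant `alpha` are assumptions; the claim does not specify how the moving average is formed.

```python
class MovingLogAverage:
    """Running per-band mean of log magnitudes (exponential weighting assumed)."""

    def __init__(self, n_bands, alpha=0.05):
        self.alpha = alpha            # hypothetical smoothing constant
        self.mean = [0.0] * n_bands
        self.started = False

    def update(self, log_frame):
        # Seed the average with the first frame, then blend in each new frame.
        if not self.started:
            self.mean = list(log_frame)
            self.started = True
        else:
            a = self.alpha
            self.mean = [(1.0 - a) * m + a * x
                         for m, x in zip(self.mean, log_frame)]

    def normalize(self, log_frame):
        # Subtract the current long term average from each band.
        return [x - m for x, m in zip(log_frame, self.mean)]


def run(frames):
    tracker = MovingLogAverage(n_bands=len(frames[0]))
    out = []
    for frame in frames:
        tracker.update(frame)
        out.append(tracker.normalize(frame))
    return out


clean = [[1.0, 2.0], [1.5, 1.0], [0.5, 3.0], [2.0, 2.5]]
# Same frames with a constant per-band offset, standing in for a fixed channel gain.
shifted = [[x + 4.0, y - 2.0] for x, y in clean]
```

Running both sequences through `run` gives matching normalized outputs (up to floating-point rounding), illustrating how a constant offset in the log domain is absorbed by the running average.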

24. A method of speaker recognition, said method comprising:
- generating recognition feature data from an input speech signal, said recognition feature data comprising a plurality of coefficients each related to the speech signal magnitude in a predetermined frequency band, said frequency bands being unevenly spaced along the frequency axis, the step of generating said coefficients including a step of deriving a long term average spectral magnitude and processing at least one of said coefficients so as to generate a normalized coefficient in which the effect of said long term magnitude is substantially reduced;
  
  comparing said feature data with predetermined speaker reference data; and
  
  indicating recognition of a speaker in dependence upon the comparison.
- Dependent claims (25, 26, 27, 28, 29, 30, 31, 32, 33):
  - 25. A method as in claim 24 in which the frequency bands are spaced on a mel-frequency scale.
  - 26. A method as in claim 24 in which the frequency bands are spaced linearly for frequencies below a predetermined limit and logarithmically for frequencies above said limit.
  - 27. A method as in claim 24 wherein the step of generating said coefficients includes the sub-steps of:
    - generating a logarithm of said magnitude, generating a logarithmic long term average value, and subtracting the logarithmic long term average from the logarithmic magnitude.
  - 28. A method as in claim 24 wherein said comparison includes time-alignment of the feature data with the reference data.
  - 29. A method as in claim 28 wherein the comparison employs a Dynamic Time Warp process.
  - 30. A method as in claim 24 further comprising:
    - recognizing a speech start point and a speech end point within said input speech signal; and
      
      deriving said long term average over the duration between said start point and said end point.
  - 31. A method as in claim 24 wherein said long term average comprises the long term mean.
  - 32. A method as in claim 24 in which said long term average comprises a moving average which is periodically updated.
  - 33. A method as in claim 24 including inputting a speech signal representing a plurality of words one after another, and forming said long term average over all of said words.
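
Claims 27 and 30 together describe subtracting a logarithmic long term average that is derived over the span between a detected speech start point and end point. A minimal sketch follows. The energy-threshold endpoint detector is an assumption (the claims do not say how the endpoints are recognized), and the function names, threshold, and sample values are hypothetical.

```python
def find_endpoints(frame_energies, threshold):
    """Return (start, end) indices of the first and last frames above threshold."""
    active = [i for i, e in enumerate(frame_energies) if e > threshold]
    if not active:
        return None  # no speech detected
    return active[0], active[-1]


def normalize_utterance(log_frames, start, end):
    """Subtract the per-band mean, computed over frames start..end inclusive."""
    segment = log_frames[start:end + 1]
    n_bands = len(segment[0])
    mean = [sum(f[k] for f in segment) / len(segment) for k in range(n_bands)]
    return [[f[k] - mean[k] for k in range(n_bands)] for f in segment]


# Hypothetical input: per-frame energies and log band magnitudes for 6 frames.
energies = [0.1, 0.2, 5.0, 6.0, 4.0, 0.1]
log_frames = [[0.0, 0.0], [0.1, 0.0], [2.0, 1.0], [3.0, 2.0], [4.0, 3.0], [0.0, 0.1]]

endpoints = find_endpoints(energies, threshold=1.0)
normalized = normalize_utterance(log_frames, *endpoints)
```

Restricting the average to the detected utterance keeps leading and trailing silence from biasing the long term mean.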

34. Apparatus for speaker recognition, said apparatus comprising:
- means for generating from a speech signal, a plurality of feature data comprising a series of coefficient sets, each set comprising a plurality of coefficients indicating the short term spectral magnitude in a plurality of frequency bands, said frequency bands being unevenly spaced along the frequency axis;
  
  means for deriving a long term average spectral magnitude of at least one of said coefficients;
  
  means for normalizing the or each of said at least one coefficient by said long term average;
  
  means for comparing said feature data with predetermined speaker reference data; and
  
  means for indicating recognition of a corresponding speaker in dependence upon said comparison.
- Dependent claims (35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45):
  - 35. Apparatus as in claim 34 wherein the frequency bands are spaced on a mel-frequency scale.
  - 36. Apparatus as in claim 34 in which the frequency bands are spaced linearly for frequencies below a predetermined limit and logarithmically for frequencies above said limit.
  - 37. Apparatus as in claim 34 wherein the means for generating said coefficients include means to:
    - generate a logarithm of said magnitude, generate a logarithmic long term average value and subtract the logarithmic long term average from the logarithmic coefficient magnitude.
  - 38. Apparatus as in claim 34 wherein means for comparing is arranged to time-align the feature data with the reference data.
  - 39. Apparatus as in claim 38 in which the means for comparing employs a Dynamic Time Warp process.
  - 40. Apparatus as in claim 34 further comprising:
    - means for recognizing a start point and an end point within said speech signal, in which said means for normalizing is arranged to derive said long term average over the duration of the utterance between said start point and said end point.
  - 41. Apparatus as in claim 34 wherein the long term average comprises the long term mean.
  - 42. Apparatus as in claim 34 wherein said long term average comprises a moving average which is periodically updated.
  - 43. Apparatus as in claim 34 arranged for inputting a speech signal representing a plurality of words one after another, in which said means for normalizing is arranged to form said long term average over all of said words.
  - 44. Apparatus as in claim 34 adapted to be connected to a telephone network.
  - 45. A telephone network comprising apparatus as in claim 44.
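
Claim 36 recites bands spaced linearly below a predetermined limit and logarithmically above it. A minimal sketch of one way to lay out such band edges follows. The function name, the 1000 Hz limit, and the band counts are hypothetical values chosen for illustration and do not come from the claims.

```python
def linear_log_band_edges(limit_hz, n_linear, n_log, f_high_hz):
    """Band edges: equal widths up to limit_hz, equal ratios from there to f_high_hz."""
    step = limit_hz / n_linear
    edges = [i * step for i in range(n_linear + 1)]
    # Constant ratio between successive edges above the limit.
    ratio = (f_high_hz / limit_hz) ** (1.0 / n_log)
    for i in range(1, n_log + 1):
        edges.append(limit_hz * ratio ** i)
    return edges


# Hypothetical layout: 4 linear bands below 1 kHz, 2 logarithmic bands up to 4 kHz.
edges = linear_log_band_edges(limit_hz=1000.0, n_linear=4, n_log=2, f_high_hz=4000.0)
```

With these values the bands below the limit are each 250 Hz wide, while each band above it is twice as wide as the one before.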

46. Apparatus for recognition processing of a voice signal, said apparatus comprising:
- means for deriving recognition data comprising a plurality of signals each related to the short term amplitude in a corresponding frequency band of said voice signal, said frequency bands being unevenly spaced in the frequency domain;
  
  means for periodically generating or updating a moving long term average spectral amplitude in said frequency bands,
  
  means for processing said feature data using said long term average to reduce their dependence upon stationary spectral envelope components; and
  
  means for performing recognition processing in dependence thereon.

47. A method of speaker recognition comprising:
- generating recognition feature data from an input speech signal,
  
  comparing said feature data with predetermined speaker reference data; and
  
  indicating recognition of a speaker in dependence upon the comparison;
  
  wherein said recognition feature generating step comprises:
  
  identifying a portion of the input speech signal as representing a single contiguous utterance;
  
  generating a plurality of coefficients each relating to the signal magnitude in one of a plurality of predetermined frequency bands of an identified portion of the signal, said frequency bands being unevenly spaced in the frequency domain;
  
  deriving a long term average spectral magnitude of the coefficients of the single contiguous utterance; and
  
  processing at least one of said coefficients so as to generate a normalized coefficient in which the effect of said long term magnitude is substantially reduced.
- Dependent claims (48, 49):
  - 48. A method as in claim 47 wherein the recognition feature generation step is repeated for each successive single contiguous utterance in isolation.
  - 49. A method as in claim 47 wherein said long term average comprises a running average which is periodically updated.


Current Assignee: British Telecommunications (BT Group Plc) (BT Group PLC)
Original Assignee: British Telecommunications PLC (BT Group PLC)
Inventors: Tang, Joseph G.; Pawlewski, Mark
Primary Examiner(s): MacDonald, Allen R.
Assistant Examiner(s): CHOWDHURY, INDRINAL
Application Number: US08/105,583
Time in Patent Office: 1,215 Days
Field of Search: 381/42, 381/43, 395/2.5, 395/2.51, 395/2.52, 395/2.43
US Class Current: 704/241
CPC Class Codes: G10L 17/02 Preprocessing operations, e...; G10L 25/87 Detection of discrete point...
