Automatic Text-Independent, Language-Independent Speaker Voice-Print Creation and Speaker Recognition

US 20080312926A1
Filed: 05/24/2005
Published: 12/18/2008
Est. Priority Date: 05/24/2005
Status: Abandoned Application

Abstract

An automatic dual-step, text-independent, language-independent speaker voice-print creation and speaker recognition method, wherein a neural network-based technique is used in a first step and a Markov model-based technique is used in a second step. In particular, the first step uses a neural network-based technique for decoding the content of what is uttered by the speaker in terms of language-independent acoustic-phonetic classes, and the second step uses the sequence of language-independent acoustic-phonetic classes from the first step and employs a Markov model-based technique for creating the speaker voice-print and for recognizing the speaker. The combination of the two steps enables improvement in the accuracy and efficiency of the speaker voice-print creation and of the speaker recognition, without setting any constraints on the lexical content of the speaker utterance and on the language thereof.
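
The hand-off between the two steps can be pictured with a minimal sketch. It assumes the first step has already produced a list of (class label, start frame, end frame) segments. That tuple layout, the function name `group_frames_by_class`, and the toy class labels are all hypothetical and are not taken from the application. The sketch only collects the observation vectors that fall under each acoustic-phonetic class, which is the data the second step would work on.

```python
from collections import defaultdict


def group_frames_by_class(frames, segments):
    """Collect observation vectors per acoustic-phonetic class.

    frames:   list of observation vectors, one per fixed time frame.
    segments: list of (class_label, start_frame, end_frame) tuples with an
              exclusive end index. This layout is an assumption standing in
              for the output of the first (decoding) step.
    """
    by_class = defaultdict(list)
    for label, start, end in segments:
        by_class[label].extend(frames[start:end])
    return dict(by_class)


# Toy data: five one-dimensional frames and three decoded segments.
frames = [[0.1], [0.2], [0.9], [1.1], [0.3]]
segments = [("vowel", 0, 2), ("fricative", 2, 4), ("vowel", 4, 5)]
grouped = group_frames_by_class(frames, segments)
```

Frames of the same class that occur in separate segments end up in one list, so the later adaptation sees all the speaker's data for that class regardless of where in the utterance it occurred.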

52 Claims

1-26. (canceled)

27. A method for creating a voice-print of a speaker based on an input voice signal representing an utterance of said speaker, comprising:
- processing said input voice signal to provide a sequence of language-independent acoustic-phonetic classes associated with corresponding temporal segments of said input voice signal, said language-independent acoustic-phonetic classes representing sounds in said utterance and being represented by respective original acoustic models;
  
  adapting the original acoustic model of each of said language-independent acoustic-phonetic classes to the speaker, based on the temporal segment of the input voice signal associated with a language-independent acoustic-phonetic class; and
  
  creating said voice-print based on the adapted acoustic models of said language-independent acoustic-phonetic classes.
- View Dependent Claims (28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46)
- - 28. The method of claim 27, wherein processing said input voice signal comprises:
    - carrying out a neural network-based decoding.
  - 29. The method of claim 28, wherein said neural network-based decoding is performed by using a hybrid hidden Markov models/artificial neural networks decoder.
  - 30. The method of claim 27, wherein said original acoustic models of said language-independent acoustic-phonetic classes are hidden Markov models.
  - 31. The method of claim 27, wherein processing said input voice signal comprises:
    - extracting observation vectors from said input voice signal, each observation vector being formed by parameters extracted from the input voice signal at a fixed time frame; and
      
      temporally aligning said observation vectors with said input voice signal so as to associate sets of observation vectors with corresponding temporal segments of the input voice signal; and
      
      wherein adapting the original acoustic model of each of said language-independent acoustic-phonetic classes to the speaker, based on the temporal segment of the input voice signal associated with a language-independent acoustic-phonetic class comprises:
      
      adapting the original acoustic model of each of said language-independent acoustic-phonetic classes to the speaker, based on the set of observation vectors associated with the temporal segment of the input voice signal in turn associated with the language-independent acoustic-phonetic class.
  - 32. The method of claim 31, wherein the original acoustic model of each of said language-independent acoustic-phonetic classes is formed by a number of acoustic states, and wherein adapting the original acoustic model of each of said language-independent acoustic-phonetic classes to the speaker, based on the set of observation vectors associated with the corresponding temporal segment of the input voice signal, comprises:
    - associating sub-sets of observation vectors in said set of observation vectors with corresponding acoustic states of the original acoustic model of said language-independent acoustic-phonetic class; and
      
      adapting each acoustic state of the original acoustic model of said language-independent acoustic-phonetic class to the speaker, based on the corresponding sub-set of observation vectors.
  - 33. The method of claim 32, wherein adaptation of an original acoustic model of a language-independent acoustic-phonetic class to a speaker is performed by implementing a maximum a posteriori adaptation technique.
  - 34. The method of claim 32, wherein association of sub-sets of observation vectors with acoustic states of said original acoustic models of said language-independent acoustic-phonetic classes is carried out by means of dynamic programming techniques which perform dynamic time-warping based on said original acoustic models.
  - 35. A method for verifying a speaker based on a voice-print created according to claim 27, and on an input voice signal representing an utterance of said speaker, comprising:
    - processing said input voice signal to provide a sequence of language-independent acoustic-phonetic classes associated with corresponding temporal segments of said input voice signal; and
      
      computing a likelihood score indicative of a probability that said utterance has been made by the same speaker as the speaker to whom said voice-print belongs, said likelihood score being computed based on said input speech signal, said original acoustic models of said language-independent acoustic-phonetic classes and the adapted acoustic models of said language-independent acoustic-phonetic classes used to create said voice-print.
  - 36. The method of claim 35, wherein said language-independent acoustic-phonetic classes are represented by respective original acoustic models having the same topology as the original acoustic models used to create said voice-print.
  - 37. The method of claim 35, wherein computing said likelihood score comprises:
    - computing first contributions to said likelihood score, one for each one of said language-independent acoustic-phonetic classes, each first contribution being computed based on a corresponding temporal segment of said input voice signal, and on the adapted acoustic model of said language-independent acoustic-phonetic class used to create said speaker voice-print;
      
      computing second contributions to said likelihood score, one for each language-independent acoustic-phonetic class, each second contribution being computed based on a corresponding temporal segment of said input voice signal, and on the original acoustic model of said language-independent acoustic-phonetic class; and
      
      computing said likelihood score based on said first and second contributions.
  - 38. The method of claim 36, wherein processing said input voice signal comprises:
    - extracting observation vectors from said input voice signal, each observation vector being formed by parameters extracted from the input voice signal at a fixed time frame;
      
      temporally aligning said observation vectors with said input voice signal so as to associate sets of observation vectors with corresponding temporal segments of the input voice signal;
      
      wherein computing a first contribution to said likelihood score for each language-independent acoustic-phonetic class comprises:
      
      computing said first contribution to said likelihood score based on a set of observation vectors associated with the language-independent acoustic-phonetic class and the adapted acoustic model of said language-independent acoustic-phonetic class used to create said speaker voice-print;
      
      and wherein computing said second contribution to said likelihood score for each language-independent acoustic-phonetic class comprises:
      
      computing said second contribution to said likelihood score based on the set of observation vectors associated with said language-independent acoustic-phonetic class and said original acoustic model of said language-independent acoustic-phonetic class.
  - 39. The method of claim 35, further comprising:
    - verifying said speaker based on said likelihood score.
  - 40. The method of claim 39, wherein verifying said speaker comprises:
    - comparing said likelihood score with a given threshold; and
      
      verifying said speaker based on an outcome of said comparison.
  - 41. The method of claim 35, wherein processing said input voice signal comprises:
    - carrying out a neural network-based decoding.
  - 42. The method of claim 41, wherein said neural network-based decoding is performed by using a hybrid hidden Markov models/artificial neural networks decoder.
  - 43. The method of claim 35, wherein said original acoustic models of said language-independent acoustic-phonetic classes are hidden Markov models.
  - 44. A method for identifying a speaker based on a number of voice-prints, each created according to claim 27, and on an input voice signal, representing an utterance of said speaker, comprising:
    - performing a number of speaker verifications according to a method for verifying a speaker based on a voice-print created according to the method of claim 27, and on an input voice signal representing an utterance of said speaker, comprising:
      
      processing said input voice signal to provide a sequence of language-independent acoustic-phonetic classes associated with corresponding temporal segments of said input voice signal; and
      
      computing a likelihood score indicative of a probability that said utterance has been made by the same speaker as the speaker to whom said voice-print belongs, said likelihood score being computed based on said input speech signal, said original acoustic models of said language-independent acoustic-phonetic classes and the adapted acoustic models of said language-independent acoustic-phonetic classes used to create said voice-print, each speaker verification being based on a respective one of said voice-prints; and
      
      identifying said speaker based on outcomes of said speaker verifications.
  - 45. The method of claim 44, wherein each speaker verification provides a corresponding likelihood score, and identifying said speaker based on outcomes of said speaker verifications comprises:
    - identifying said speaker based on said likelihood scores.
  - 46. The method of claim 45, wherein identifying said speaker based on said likelihood scores comprises:
    - identifying the maximum likelihood score;
      
      comparing said maximum likelihood score with a given threshold; and
      
      identifying said speaker based on an outcome of said comparison.
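
Claims 27 to 46 can be illustrated with a simplified sketch that is not the applicant's implementation. It assumes each class is modeled by a single diagonal Gaussian rather than a multi-state hidden Markov model, adapts only the means with a maximum a posteriori update, and combines the first and second contributions as an average log-likelihood ratio. The claims do not specify how the contributions are combined, so that choice, the `relevance` factor, and every function name below are assumptions made for illustration.

```python
import math


def map_adapt_mean(prior_mean, observations, relevance=10.0):
    """MAP mean update: interpolate between the prior mean and the data."""
    n = len(observations)
    sums = [sum(o[d] for o in observations) for d in range(len(prior_mean))]
    return [(relevance * m + s) / (relevance + n)
            for m, s in zip(prior_mean, sums)]


def log_gauss(x, mean, var):
    """Log density of a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def create_voice_print(original, grouped, relevance=10.0):
    """Adapt each class model (mean, var) to the speaker's frames."""
    return {c: (map_adapt_mean(original[c][0], obs, relevance), original[c][1])
            for c, obs in grouped.items() if c in original}


def likelihood_score(original, voice_print, grouped):
    """Average per-frame log-likelihood ratio, adapted vs. original models."""
    first = second = 0.0
    n_frames = 0
    for c, obs in grouped.items():
        if c not in voice_print:
            continue  # class never seen at enrollment: no contribution
        for x in obs:
            first += log_gauss(x, *voice_print[c])   # adapted model
            second += log_gauss(x, *original[c])     # original model
            n_frames += 1
    return (first - second) / n_frames if n_frames else float("-inf")


def verify(score, threshold):
    """Accept the claimed identity when the score exceeds the threshold."""
    return score > threshold


def identify(original, voice_prints, grouped, threshold):
    """Return the best-scoring enrolled speaker, or None if below threshold."""
    scores = {name: likelihood_score(original, vp, grouped)
              for name, vp in voice_prints.items()}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if verify(scores[best], threshold) else None


# Toy original models for two classes, then two enrolled speakers.
original = {"a": ([0.0], [1.0]), "b": ([5.0], [1.0])}
prints = {
    "alice": create_voice_print(
        original, {"a": [[1.0]] * 20, "b": [[6.0]] * 20}),
    "bob": create_voice_print(
        original, {"a": [[-1.0]] * 20, "b": [[4.0]] * 20}),
}
test_utt = {"a": [[1.0]] * 5, "b": [[6.0]] * 5}
who = identify(original, prints, test_utt, threshold=0.0)
```

Scoring against the original models as well as the adapted ones normalizes the result, so a frame that fits every speaker equally well contributes little to the decision. The threshold of 0.0 is arbitrary here and would need tuning on real data.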

47. A speaker recognition system capable of being configured to implement a method for creating a voice-print of a speaker based on an input voice signal representing an utterance of said speaker, comprising:
- processing said input voice signal to provide a sequence of language-independent acoustic-phonetic classes associated with corresponding temporal segments of said input voice signal, said language-independent acoustic-phonetic classes representing sounds in said utterance and being represented by respective original acoustic models;
  
  adapting the original acoustic model of each of said language-independent acoustic-phonetic classes to the speaker, based on the temporal segment of the input voice signal associated with a language-independent acoustic-phonetic class; and
  
  creating said voice-print based on the adapted acoustic models of said language-independent acoustic-phonetic classes.
- View Dependent Claims (48, 49)
- - 48. The system of claim 47, capable of being further configured to implement a method for verifying a speaker based on a voice-print created according to the method for creating a voice-print of a speaker and on an input voice signal representing an utterance of said speaker, comprising:
    - processing said input voice signal to provide a sequence of language-independent acoustic-phonetic classes associated with corresponding temporal segments of said input voice signal; and
      
      computing a likelihood score indicative of a probability that said utterance has been made by the same speaker as the speaker to whom said voice-print belongs, said likelihood score being computed based on said input speech signal, said original acoustic models of said language-independent acoustic-phonetic classes, and the adapted acoustic models of said language-independent acoustic-phonetic classes used to create said voice-print.
  - 49. The system of claim 47, capable of being further configured to implement a method for identifying a speaker based on a number of voice-prints, each created according to the method for creating a voice-print of a speaker, and on an input voice signal, representing an utterance of said speaker, comprising:
    - performing a number of speaker verifications by a method for verifying a speaker based on a voice-print created according to the method for creating a voice-print of a speaker and on an input voice signal representing an utterance of said speaker, comprising:
      
      processing said input voice signal to provide a sequence of language-independent acoustic-phonetic classes associated with corresponding temporal segments of said input voice signal; and
      
      computing a likelihood score indicative of a probability that said utterance has been made by the same speaker as the one to whom said voice-print belongs, said likelihood score being computed based on said input speech signal, said original acoustic models of said language-independent acoustic-phonetic classes, and the adapted acoustic models of said language-independent acoustic-phonetic classes used to create said voice-print, each speaker verification being based on a respective one of said voice-prints; and
      
      identifying said speaker based on outcomes of said speaker verifications.

50. A computer program product loadable in a memory of a processing system and comprising software code portions capable of implementing, when the computer program product is run on the processing system, a method for creating a voice-print of a speaker based on an input voice signal representing an utterance of said speaker, comprising:
- processing said input voice signal to provide a sequence of language-independent acoustic-phonetic classes associated with corresponding temporal segments of said input voice signal, said language-independent acoustic-phonetic classes representing sounds in said utterance and being represented by respective original acoustic models;
  
  adapting the original acoustic model of each of said language-independent acoustic-phonetic classes to the speaker, based on the temporal segment of the input voice signal associated with a language-independent acoustic-phonetic class; and
  
  creating said voice-print based on the adapted acoustic models of said language-independent acoustic-phonetic classes.
- View Dependent Claims (51, 52)
- - 51. The computer program product of claim 50, further comprising software code portions capable of implementing, when the computer program product is run on the processing system, a method for verifying a speaker based on a voice-print created according to the method for creating a voice-print of a speaker and on an input voice signal representing an utterance of said speaker, comprising:
    - processing said input voice signal to provide a sequence of language-independent acoustic-phonetic classes associated with corresponding temporal segments of said input voice signal; and
      
      computing a likelihood score indicative of a probability that said utterance has been made by the same speaker as the speaker to whom said voice-print belongs, said likelihood score being computed based on said input speech signal, said original acoustic models of said language-independent acoustic-phonetic classes, and the adapted acoustic models of said language-independent acoustic-phonetic classes used to create said voice-print.
  - 52. The computer program product of claim 50, further comprising software code portions capable of implementing, when the computer program product is run on the processing system, a method for identifying a speaker based on a number of voice-prints, each created according to the method for creating a voice-print of a speaker, and on an input voice signal representing an utterance of said speaker, comprising:
    - performing a number of speaker verifications by a method for verifying a speaker based on a voice-print created according to the method for creating a voice-print of a speaker and on an input voice signal representing an utterance of said speaker, comprising:
      
      processing said input voice signal to provide a sequence of language-independent acoustic-phonetic classes associated with corresponding temporal segments of said input voice signal; and
      
      computing a likelihood score indicative of a probability that said utterance has been made by the same speaker as the speaker to whom said voice-print belongs, said likelihood score being computed based on said input speech signal, said original acoustic models of said language-independent acoustic-phonetic classes, and the adapted acoustic models of said language-independent acoustic-phonetic classes used to create said voice-print, each speaker verification being based on a respective one of said voice-prints; and
      
      identifying said speaker based on outcomes of said speaker verifications.

Current Assignee: Loquendo SpA (Microsoft Corporation)
Original Assignee: Loquendo SpA (Microsoft Corporation)
Inventors: Colibro, Daniele; Vair, Claudio; Fissore, Luciano

Application Number: US11/920,849
Publication Number: US 20080312926A1
Field of Search
US Class Current: 704/249
CPC Class Codes:
G10L 17/04   Training, enrolment or mode...
G10L 17/14   Use of phonemic categorisat...
G10L 17/16   Hidden Markov models [HMM]
