Background learning of speaker voices

US 7,171,360 B2
Filed: 05/07/2002
Issued: 01/30/2007
Est. Priority Date: 05/10/2001
Status: Active Grant

Abstract

A speaker identification system includes a speaker model generator 110 for generating a plurality of speaker models. To this end, the generator records training utterances from a plurality of speakers in the background, without prior knowledge of the speakers who spoke the utterances. The generator performs a blind clustering of the training utterances based on a predetermined criterion. For each of the clusters a corresponding speaker model is trained.

A speaker identifier 130 identifies a speaker by determining a most likely one of the speaker models for a test utterance received from the speaker. The speaker associated with the most likely speaker model is identified as the speaker of the test utterance.
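The identification step described in the abstract amounts to scoring the test utterance against every speaker model and picking the best one. The following is a minimal sketch of that idea, not the patented implementation: it assumes one-dimensional features and a single Gaussian `(mean, variance)` per speaker as a stand-in for a real speaker model, and the names `log_likelihood`, `identify_speaker`, `cluster_0` and `cluster_1` are hypothetical.

```python
import math


def log_likelihood(utterance, model):
    """Average per-frame log-likelihood of 1-D features under a Gaussian.

    `model` is a (mean, variance) tuple; this toy model is an assumption,
    the source does not specify the model family.
    """
    mean, var = model
    total = 0.0
    for x in utterance:
        total += -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    return total / len(utterance)


def identify_speaker(utterance, speaker_models):
    """Return the key of the speaker model with the highest likelihood."""
    return max(
        speaker_models,
        key=lambda name: log_likelihood(utterance, speaker_models[name]),
    )


# Two hypothetical models, as might come out of background training.
models = {"cluster_0": (0.0, 1.0), "cluster_1": (5.0, 1.0)}
best = identify_speaker([4.8, 5.3, 5.1], models)
```

Here `best` is the key of the model whose mean lies closest to the test frames; the speaker associated with that model would be reported as the speaker of the test utterance.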

20 Claims

1. A method of automatically identifying a speaker;
- the method including;
  
  identifying a speaker by;
  
  receiving a test utterance from the speaker;
  
  determining a most likely one of a plurality of speaker models for the test utterance; and
  
  identifying the speaker associated with the most likely speaker model as the speaker of the test utterance;
  
  wherein the method includes generating the plurality of speaker models in the background by;
  
  receiving training utterances from a pool of utterances from the plurality of speakers in the background, without prior knowledge of the speakers who spoke the respective training utterances;
  
  blind clustering of the training utterances from the pool of utterances based on a predetermined criterion, wherein the blind clustering includes calculating a likelihood vector for each training utterance; and
  
  non-explicitly training for each of the clusters a corresponding speaker model, each of the models representing a speaker.
  - 2. A method as claimed in claim 1, wherein the step of blind clustering of the training utterances x_i, i < N based on the predetermined criterion includes;
    - modeling each respective one of the training utterances x_i by a respective model λ_i;

      calculating for each training utterance x_i a corresponding likelihood vector L_i, where each vector element L_ij, 1 ≦ j ≦ N represents a likelihood of the training utterance x_i against a respective one of the models λ_j;

      determining for each training utterance x_i a corresponding ranking vector F_i, where each element F_ij of the ranking vector F_i is assigned a ranking value representing a ranking of the corresponding likelihood L_ij compared to the other elements of the likelihood vector L_i such that a higher likelihood value of L_ij is reflected by a higher ranking value of F_ij;

      clustering the training utterances x_i based on a criterion that a minimum in a distance measure between F_i and F_j indicates that the training utterances x_i and x_j originate from a same speaker.
  - 3. A method as claimed in claim 2, wherein the ranking is such that η lowest likelihood values of the elements L_ij of the likelihood vector L_i are represented by distinct values of the corresponding elements F_ij of the ranking vector F_i, and that the remaining N-η elements L_ij of the likelihood vector L_i are represented by a same predetermined ranking value of the corresponding elements F_ij of the ranking vector F_i, where η represents an expected number of training utterances per cluster, and the predetermined ranking value being lower than any of the η distinct ranking values.
  - 4. A method as claimed in claim 1, wherein the method includes:
    - receiving an enrolment utterance from a speaker, determining a most likely one of a plurality of speaker models for the enrolment utterance;
      
      receiving identifying information of the speaker; and
      
      storing the identifying information in association with the most likely speaker model.
  - 5. A method as claimed in claim 4, wherein the method includes:
    - verifying whether a likelihood of the most likely speaker model is above a predetermined threshold; and
      
      if the likelihood is below the predetermined threshold, requesting a further utterance from the speaker, and iteratively receiving the further utterance;
      
      adapting the most likely speaker model with the further utterance; and
      
      determining the likelihood of the adapted speaker model;
      
      until the likelihood is above the predetermined threshold.
  - 6. A method as claimed in claim 1, wherein the steps of recording training utterances, blind clustering the utterances and training the speaker models is performed iteratively until a predetermined level of confidence has been achieved.
  - 7. A method as claimed in claim 6, wherein in response to achieving the predetermined confidence level the speaker is automatically requested to provide information identifying the speaker, followed by receiving the identifying information and storing the identifying information in association with the most likely speaker model.
  - 8. A method as claimed in claim 1, wherein the method includes, in response to having identified the speaker, automatically retrieving a personal profile for interacting with a CE (consumer electronics) device.
  - 9. A method as claimed in claim 1, wherein the method includes recognizing the test utterance used for identifying the speaker as a voice command;
    - and executing the recognized voice command in a speaker-dependent way.
  - 10. A computer program product comprising a computer readable medium into which is embodied a computer program operative to cause a processor to perform the method as claimed in claim 1.
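The enrolment steps of claims 4 and 5 can be read as a loop: pick the most likely model for the enrolment utterance, and while its likelihood stays below a threshold, request a further utterance, adapt the model, and re-score. The sketch below illustrates that control flow only. Everything specific in it is an assumption: the one-dimensional Gaussian models, the mean-shifting `adapt` rule, the `request_utterance` callback, and the `max_rounds` safety cap are hypothetical and not taken from the claims.

```python
import math


def log_likelihood(utterance, model):
    """Average per-frame log-likelihood under a (mean, variance) Gaussian."""
    mean, var = model
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for x in utterance
    ) / len(utterance)


def adapt(model, utterance, weight=0.5):
    """Hypothetical adaptation: move the model mean toward the utterance mean."""
    mean, var = model
    utt_mean = sum(utterance) / len(utterance)
    return ((1 - weight) * mean + weight * utt_mean, var)


def enrol(utterance, identity, models, identities, request_utterance,
          threshold, max_rounds=10):
    """Associate `identity` with the most likely model, adapting if needed."""
    best = max(models, key=lambda k: log_likelihood(utterance, models[k]))
    score = log_likelihood(utterance, models[best])
    rounds = 0
    # Iterate until the likelihood reaches the threshold (or the cap is hit).
    while score < threshold and rounds < max_rounds:
        utterance = request_utterance()           # receive a further utterance
        models[best] = adapt(models[best], utterance)
        score = log_likelihood(utterance, models[best])
        rounds += 1
    identities[best] = identity                   # store identifying information
    return best, score


models = {"cluster_0": (0.0, 1.0), "cluster_1": (5.0, 1.0)}
identities = {}
best, score = enrol([3.0, 3.2, 2.8], "alice", models, identities,
                    request_utterance=lambda: [3.0, 3.1, 2.9],
                    threshold=-1.5)
```

In this toy run the first utterance scores below the threshold against its most likely model, so one further utterance is requested and the model mean shifts toward it before the identity is stored.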

11. A system for automatically identifying a speaker;
- the system includes;
  
  a speaker identifier operative to identify a speaker by;
  
  receiving a test utterance from the speaker;
  
  determining a most likely one of a plurality of speaker models for the test utterance; and
  
  identifying the speaker associated with the most likely speaker model as the speaker of the test utterance; and
  
  a speaker model generator operative to generate the plurality of speaker models, wherein the speaker model generator is operative to generate the plurality of speaker models in the background by;
  
  receiving training utterances from a pool of utterances from the plurality of speakers in the background, without prior knowledge of the speakers who spoke the respective training utterances;
  
  blind clustering of the training utterances from the pool of utterances based on a predetermined criterion, wherein the blind clustering includes calculating a likelihood vector for each training utterance; and
  
  non-explicitly training for each of the clusters a corresponding speaker model, each of the models representing a speaker.

12. A method for automatically identifying based on speech, comprising:
- receiving training utterances from a pool of utterances from the plurality of speakers;
  
  performing blind clustering of the received training utterances from the pool of utterances based on a predetermined criterion to produce clusters, wherein the blind clustering includes calculating a likelihood vector for each training utterance;
  
  non-explicitly training, for said clusters, corresponding speaker models representing associated speakers;
  
  receiving a test utterance from a speaker;
  
  determining a most likely one of said speaker models for the test utterance; and
  
  identifying a speaker associated with the determined most likely speaker model as the speaker of the test utterance, wherein said receiving training utterances occurs in the background, without prior knowledge of the speakers who respectively spoke said training utterances, and said speaker models are generated in the background.
  - 13. A method as claimed in claim 12, wherein said blind clustering comprises:
    - the generating of the speaker models for respective ones of said received training utterances;
      
      calculating, for said received training utterances, corresponding likelihood vectors, whose elements individually represent likelihood of a respective one of the training utterances based on a multiple ones of the models;
      
      forming, for the training utterances, corresponding ranking vectors, whose elements are assigned corresponding ranking values representing relative ranking of likelihood compared to other elements of a respective one of the likelihood vectors; and
      
      clustering the training utterances based on a criterion that judges training utterances to have originated from the same speaker based on distance between ones of said corresponding ranking vectors.
  - 14. A method as claimed in claim 12, wherein said blind clustering comprises clustering the training utterances according to distance between ones of ranking vectors which respectively indicate likelihood of a particular one of said training utterances based on a plurality of said models.
  - 15. The method of claim 14, wherein said clustering of training utterances is based on a criterion that a minimum in a distance measure between pairs of ranking vectors indicates that a respective pair of said training utterances originate from a same speaker.
  - 16. The method of claim 14, wherein at least some of said ranking vectors utilized in said clustering the training utterances relate to utterances of a same speaker.
  - 17. The method of claim 16, wherein said ranking vectors which respectively indicate likelihood are divided into groups, each group relating to utterances of a single speaker particular to that group.
  - 18. A method as claimed in claim 12, wherein the method includes, in response to having identified the speaker, automatically retrieving a personal profile for interacting with a CE (consumer electronics) device.
  - 19. A method as claimed in claim 12, wherein the method includes recognizing the test utterance used for identifying the speaker as a voice command;
    - and executing the recognized voice command in a speaker-dependent way.
  - 20. A computer program product comprising a computer readable medium into which is embodied a computer program operative to cause a processor to perform the method as claimed in claim 12.
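The ranking-vector clustering recited in claims 2 and 13 to 15 can be illustrated with a small example: model each training utterance, score every utterance against every model to get likelihood vectors, convert each likelihood vector into a ranking vector, and group utterances whose ranking vectors are closest. The sketch below is one possible reading under stated assumptions, not the patented method. It assumes one-dimensional features, a single Gaussian per utterance, squared Euclidean distance, and a nearest-neighbour linking rule with union-find to form clusters; the claims name none of these, and all function names are hypothetical. It needs at least two utterances.

```python
import math


def fit_model(utterance, var_floor=1e-3):
    """Model one utterance x_i by a (mean, variance) Gaussian, lambda_i."""
    mean = sum(utterance) / len(utterance)
    var = sum((x - mean) ** 2 for x in utterance) / len(utterance)
    return mean, max(var, var_floor)


def log_likelihood(utterance, model):
    """Average per-frame log-likelihood of the utterance under the model."""
    mean, var = model
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for x in utterance
    ) / len(utterance)


def ranking_vector(likelihoods):
    """Rank 0 for the lowest likelihood; higher likelihood, higher rank."""
    order = sorted(range(len(likelihoods)), key=lambda j: likelihoods[j])
    ranks = [0] * len(likelihoods)
    for rank, j in enumerate(order):
        ranks[j] = rank
    return ranks


def sq_dist(a, b):
    """Squared Euclidean distance between two ranking vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def blind_cluster(utterances):
    """Return one cluster label per utterance (requires >= 2 utterances)."""
    models = [fit_model(u) for u in utterances]
    rankings = [
        ranking_vector([log_likelihood(u, m) for m in models])  # L_i -> F_i
        for u in utterances
    ]
    n = len(utterances)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Assumed rule: link each utterance to the one with the closest F vector.
    for i in range(n):
        nearest = min((j for j in range(n) if j != i),
                      key=lambda j: sq_dist(rankings[i], rankings[j]))
        parent[find(i)] = find(nearest)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Four unlabeled utterances; the first two and last two share a "speaker".
pool = [[0.1, -0.2, 0.3], [0.0, 0.2, -0.1], [5.1, 4.9, 5.2], [4.8, 5.0, 5.3]]
labels = blind_cluster(pool)
```

A speaker model per cluster could then be trained by pooling the frames of all utterances that share a label. Ranking vectors are used instead of raw likelihoods so that the comparison depends on the ordering of the models rather than on the scale of the scores.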

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Koninklijke Philips Electronics N.V. (Koninklijke Philips N.V.)
Inventors: Cheng, Jyh-Min; Chu, Ya-Cherng; Huang, Chao-Shih; Tsai, Wei-Ho
Primary Examiner(s): Lerner; Martin
Application Number: US10/140,499
Publication Number: US 20030088414A1
Time in Patent Office: 1,729 Days
Field of Search: 704/236, 704/239, 704/240, 704/243, 704/244, 704/245, 704/246, 704/250, 704/256.2, 704/270, 704/273, 704/275
US Class Current: 704/245
CPC Class Codes:
G10L 15/07 to the speaker
G10L 17/04 Training, enrolment or mode...
