SPEAKER SELECTING DEVICE, SPEAKER ADAPTIVE MODEL CREATING DEVICE, SPEAKER SELECTING METHOD, SPEAKER SELECTING PROGRAM, AND SPEAKER ADAPTIVE MODEL MAKING PROGRAM

US 20100114572A1
Filed: 02/29/2008
Published: 05/06/2010
Est. Priority Date: 03/27/2007
Status: Active Grant

Abstract

To enable accurate and stable selection of a speaker whose acoustic feature value is similar to that of an utterance speaker, while following changes even when the speaker's acoustic feature value changes from moment to moment. A speaker score calculating means (22) calculates a long-time speaker score (the log likelihood of each of a plurality of speaker models stored in a speaker model storage section (31) with respect to the acoustic feature value) based on, for example, an arbitrary number of utterances, and calculates a short-time speaker score based on, for example, a short-time utterance. A long-time speaker selecting means (23) selects speakers corresponding to a predetermined number of speaker models having a high long-time speaker score. A short-time speaker selecting means (24) selects, from among the speakers selected by the long-time speaker selecting means (23), speakers corresponding to speaker models that have a high short-time speaker score and whose number is smaller than the predetermined number.
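The two-stage narrowing described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_speakers`, the parameters `n_long` and `n_short`, and the example scores are all hypothetical, and the scores are assumed to be precomputed log likelihoods keyed by speaker ID.

```python
def select_speakers(long_scores, short_scores, n_long, n_short):
    """Two-stage speaker selection (illustrative sketch).

    long_scores / short_scores: dicts mapping a speaker ID to a score
    (assumed here to be a log likelihood; higher is better).
    """
    if not 0 < n_short < n_long:
        raise ValueError("n_short must be positive and smaller than n_long")
    # Stage 1: keep the n_long speakers with the highest long-time score.
    long_selected = sorted(long_scores, key=long_scores.get, reverse=True)[:n_long]
    # Stage 2: among those, keep the n_short speakers with the highest
    # short-time score.
    short_selected = sorted(long_selected, key=short_scores.get, reverse=True)[:n_short]
    return long_selected, short_selected


# Hypothetical scores for four stored speaker models.
long_scores = {"spk_a": -120.0, "spk_b": -95.0, "spk_c": -101.0, "spk_d": -150.0}
short_scores = {"spk_a": -9.0, "spk_b": -14.0, "spk_c": -11.0, "spk_d": -8.0}
long_sel, short_sel = select_speakers(long_scores, short_scores, n_long=3, n_short=2)
```

In this example `spk_d` has the best short-time score but is never considered in the second stage, because it did not survive the long-time selection.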

42 Claims

1-21. (canceled)

22. A speaker selecting device comprising:
- a speaker model storage means that stores a plurality of speaker models;
  
  an acoustic feature value calculating means that calculates a feature value from received voice signals; and
  
  a speaker score calculating means that calculates a likelihood of each of the plurality of speaker models stored in the speaker model storage means with respect to the feature value calculated by the acoustic feature value calculating means, wherein the speaker score calculating means calculates a first likelihood and a second likelihood based on the voice signals of two relatively different time lengths, the speaker score calculating means comprises:
  
  a first selection means that selects speakers corresponding to a predetermined number of speaker models the first likelihood of which is high; and
  
  a second selection means that narrows the speakers selected by the first selection means down to speakers the number of which is smaller than the predetermined number and the second likelihood of which is high, and the speaker score calculating means sequentially outputs information corresponding to the speakers selected by the second selection means.
- Dependent claims (23, 24, 25, 26, 27, 28, 29)
  - 23. The speaker selecting device according to claim 22, wherein the speaker score calculating means calculates, as the first likelihood, a long-time likelihood based on a voice signal of a relatively long time, and calculates, as the second likelihood, a short-time likelihood based on a voice signal of a relatively short time, the first selection means is a long-time speaker selecting means that selects speakers corresponding to a predetermined number of speaker models the long-time likelihood of which is high, and the second selection means is a short-time speaker selecting means that selects speakers corresponding to speaker models the number of which is smaller than the predetermined number and the short-time likelihood of which is high.
  - 24. The speaker selecting device according to claim 22, wherein the speaker score calculating means calculates, as the first likelihood, a short-time likelihood based on a voice signal of a relatively short time, and calculates, as the second likelihood, a long-time likelihood based on a voice signal of a relatively long time, the first selection means is a short-time speaker selecting means that selects speakers corresponding to a predetermined number of speaker models the short-time likelihood of which is high, and the second selection means is a long-time speaker selecting means that selects speakers corresponding to speaker models the number of which is smaller than the predetermined number and the long-time likelihood of which is high.
  - 25. The speaker selecting device according to claim 23, wherein the long-time speaker selecting means selects the speakers using the likelihoods calculated by the speaker score calculating means, and a first threshold relating to a predetermined likelihood, and the short-time speaker selecting means selects the speakers using the likelihoods calculated by the speaker score calculating means, and a second threshold which is a threshold relating to a predetermined likelihood and which is a value equal to or different from the first threshold.
  - 26. The speaker selecting device according to claim 23, further comprising an utterance dependence storage means that stores data indicating a temporal dependence between utterances, wherein the speaker score calculating means calculates the likelihoods by reflecting the data stored in the utterance dependence storage means.
  - 27. A speaker adaptive model creating device comprising:
    - a speaker selecting device according to claim 22; and
      
      an adaptive model creating means that creates a speaker adaptive model by statistical calculation based on sufficient statistics corresponding to speakers selected by the speaker selecting device.
  - 28. A speaker adaptive model creating device comprising:
    - a speaker selecting device according to claim 23;
      
      a means that creates one sufficient statistic relating to a long-time speaker by statistical calculation, from sufficient statistics respectively corresponding to a plurality of speakers selected by a long-time speaker selecting means;
      
      a means that creates one sufficient statistic relating to a short-time speaker by statistical calculation, from sufficient statistics respectively corresponding to a plurality of speakers selected by a short-time speaker selecting means; and
      
      an adaptive model creating means that integrates, by statistical calculation, the sufficient statistics calculated by each of the means, to thereby create a speaker adaptive model.
  - 29. A speaker adaptive model creating device comprising:
    - a speaker selecting device according to claim 23;
      
      a short-time speaker integrating means that calculates a frequency of occurrence of speakers selected by a short-time speaker selecting means; and
      
      an adaptive model creating means that creates one speaker adaptive model by weighting and integrating sufficient statistics based on the frequency of occurrence of speakers.
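The frequency-weighted integration recited in claim 29 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: each speaker's sufficient statistics are assumed to be those of a one-dimensional Gaussian (frame count, sum, and sum of squares), the weight is assumed to be the speaker's relative frequency of occurrence, and the names `integrate_by_frequency`, `selections`, and `stats` are hypothetical.

```python
from collections import Counter


def integrate_by_frequency(selections, stats):
    """Weight sufficient statistics by how often each speaker was selected.

    selections: list of speaker-ID lists, one per short-time selection.
    stats: speaker ID -> (n, sum_x, sum_x2), assumed 1-D Gaussian statistics.
    Returns the (mean, variance) of the integrated model.
    """
    # Frequency of occurrence of each speaker across all short-time selections.
    freq = Counter(spk for sel in selections for spk in sel)
    total = sum(freq.values())
    n = s1 = s2 = 0.0
    for spk, count in freq.items():
        weight = count / total  # assumed weighting: relative frequency
        spk_n, spk_s1, spk_s2 = stats[spk]
        n += weight * spk_n
        s1 += weight * spk_s1
        s2 += weight * spk_s2
    mean = s1 / n
    var = s2 / n - mean ** 2
    return mean, var


# Hypothetical data: "a" was selected three times, "b" twice, "c" once.
selections = [["a", "b"], ["a", "c"], ["a", "b"]]
stats = {
    "a": (10, 20.0, 50.0),   # mean 2, variance 1
    "b": (10, 40.0, 170.0),  # mean 4, variance 1
    "c": (10, 0.0, 10.0),    # mean 0, variance 1
}
mean, var = integrate_by_frequency(selections, stats)
```

Speakers that are selected more often pull the integrated mean toward their own statistics, so the resulting model tracks whichever speakers currently resemble the utterance speaker.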

30. A speaker selecting method comprising:
- storing a plurality of speaker models in advance;
  
  calculating a feature value from received voice signals;
  
  calculating a first likelihood and a second likelihood based on the voice signals of two relatively different time lengths, for each of the plurality of speaker models stored with respect to the calculated feature value, and selecting speakers using the calculated likelihood, the method comprising:
  
  selecting speakers corresponding to a predetermined number of speaker models the first likelihood of which is high;
  
  narrowing speakers selected as the speakers corresponding to the predetermined number of speaker models the first likelihood of which is high, down to speaker models the number of which is smaller than the predetermined number and the second likelihood of which is high; and
  
  sequentially outputting information corresponding to speakers narrowed down to the speaker models the number of which is smaller than the predetermined number and the second likelihood of which is high.
- Dependent claims (31, 32, 33, 34)
  - 31. The speaker selecting method according to claim 30, wherein in calculating the first likelihood and the second likelihood, a long-time likelihood based on a voice signal of a relatively long time is calculated as the first likelihood, and a short-time likelihood based on a voice signal of a relatively short time is calculated as the second likelihood, in selecting the speakers corresponding to the predetermined number of speaker models the first likelihood of which is high, speakers corresponding to a predetermined number of speaker models the long-time likelihood of which is high are selected, and in narrowing down to the speaker models the second likelihood of which is high, speakers the number of which is smaller than the predetermined number and the short-time likelihood of which is high are selected.
  - 32. The speaker selecting method according to claim 30, wherein in calculating the first likelihood and the second likelihood, a short-time likelihood based on a voice signal of a relatively short time is calculated as the first likelihood, and a long-time likelihood based on a voice signal of a relatively long time is calculated as the second likelihood, in selecting the speakers corresponding to the predetermined number of speaker models the first likelihood of which is high, speakers corresponding to a predetermined number of speaker models the short-time likelihood of which is high are selected, and in narrowing down to the speaker models the second likelihood of which is high, speakers corresponding to speaker models the number of which is smaller than the predetermined number and the long-time likelihood of which is high are selected.
  - 33. The speaker selecting method according to claim 31, wherein in selecting the speakers corresponding to the speaker models the long-time likelihood of which is high, the speakers are selected using the likelihoods calculated when the first likelihood and the second likelihood are calculated, and a first threshold relating to a predetermined likelihood, and in selecting the speakers corresponding to the predetermined number of speaker models the short-time likelihood of which is high, the speakers are selected using the likelihoods calculated when the first likelihood and the second likelihood are calculated, and a second threshold which is a threshold relating to a predetermined likelihood and which is a value equal to or different from the first threshold.
  - 34. The speaker selecting method according to claim 31, wherein data indicating a temporal dependence between utterances is stored in advance, and in calculating the first likelihood and the second likelihood, the likelihoods are calculated by reflecting the stored data indicating the temporal dependence between utterances.
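The likelihood calculation step of the method claims, which scores each stored speaker model over two different time lengths, can be sketched as follows. This is a sketch under assumptions, not the method as specified: each speaker model is assumed to be a single one-dimensional Gaussian `(mean, variance)`, the features are assumed to be scalar frames, the scores are averaged per frame, and the names `speaker_scores` and `short_len` are hypothetical.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log likelihood of a scalar x under a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def speaker_scores(frames, models, short_len):
    """Return (long_scores, short_scores) for every speaker model.

    frames: feature values observed so far (scalars, an assumption).
    models: speaker ID -> (mean, var), an assumed stand-in for a speaker model.
    short_len: number of most recent frames used for the short-time score.
    """
    if short_len <= 0 or not frames:
        raise ValueError("need at least one frame and a positive short_len")
    recent = frames[-short_len:]  # short-time window: the latest frames only
    long_scores, short_scores = {}, {}
    for spk, (mean, var) in models.items():
        # Long-time score: average log likelihood over all frames.
        long_scores[spk] = sum(gaussian_loglik(x, mean, var) for x in frames) / len(frames)
        # Short-time score: average log likelihood over the recent window.
        short_scores[spk] = sum(gaussian_loglik(x, mean, var) for x in recent) / len(recent)
    return long_scores, short_scores


# Hypothetical input: the voice drifts from around 0.0 to around 2.0 at the end.
frames = [0.0] * 8 + [2.0] * 2
models = {"low": (0.0, 1.0), "high": (2.0, 1.0)}
long_scores, short_scores = speaker_scores(frames, models, short_len=2)
```

Here the long-time score favors `low` while the short-time score favors `high`, which illustrates why the two scores carry different information about a speaker whose acoustic features change over time.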

35. A storage medium for recording a speaker selecting program for causing a computer which performs a speaker selection processing for selecting speakers using speaker models stored in a speaker model storage means that stores a plurality of speaker models, to execute:
- a speaker score calculation processing for calculating a first likelihood and a second likelihood based on the voice signals of two relatively different time lengths;
  
  a first selection processing for selecting speakers corresponding to a predetermined number of speakers the first likelihood of which is high;
  
  a second selection processing for narrowing the speakers selected in the first selection processing down to speaker models the number of which is smaller than the predetermined number and the second likelihood of which is high; and
  
  a processing for sequentially outputting information corresponding to the speakers selected in the second selection processing.
- Dependent claims (36, 37, 38, 39, 40, 41, 42)
  - 36. The storage medium for recording a speaker selecting program according to claim 35, wherein the speaker score calculation processing includes causing the computer to calculate, as the first likelihood, a long-time likelihood based on a voice signal of a relatively long time, and to calculate, as the second likelihood, a short-time likelihood based on a voice signal of a relatively short time, the first selection processing includes causing the computer to execute a long-time selection processing for selecting speakers corresponding to a predetermined number of speaker models the long-time likelihood of which is high, and the second selection processing includes causing the computer to execute a short-time selection processing for selecting speakers corresponding to speaker models the number of which is smaller than the predetermined number and the short-time likelihood of which is high.
  - 37. The storage medium for recording a speaker selecting program according to claim 35, wherein the speaker score calculation processing includes causing the computer to calculate, as the first likelihood, a short-time likelihood based on a voice signal of a relatively short time, and to calculate, as the second likelihood, a long-time likelihood based on a voice signal of a relatively long time, the first selection processing includes causing the computer to execute a short-time selection processing for selecting speakers corresponding to a predetermined number of speaker models the short-time likelihood of which is high, and the second selection processing includes causing the computer to execute a long-time selection processing for selecting speakers corresponding to speaker models the number of which is smaller than the predetermined number and the long-time likelihood of which is high.
  - 38. The storage medium for recording a speaker selecting program according to claim 36, wherein the long-time speaker selection processing includes causing the computer to select speakers using the likelihoods calculated in the speaker score calculation processing, and a first threshold relating to a predetermined likelihood, and the short-time speaker selection processing includes causing the computer to select speakers using the likelihoods calculated in the speaker score calculation processing, and a second threshold which is a threshold relating to a predetermined likelihood and which is a value equal to or different from the first threshold.
  - 39. The storage medium for recording a speaker selecting program according to claim 36, wherein the speaker score calculation processing includes causing a computer accessible to an utterance dependence storage means that stores data indicating a temporal dependence between utterances, to calculate a likelihood by reflecting the stored data indicating the temporal dependence between utterances.
  - 40. A storage medium for recording a speaker adaptive model creating program for causing a computer to execute:
    - processings in a speaker selecting program according to claim 35; and
      
      an adaptive model creating processing for creating a speaker adaptive model by statistical calculation based on sufficient statistics corresponding to speakers selected in a second selection processing.
  - 41. A storage medium for recording a speaker adaptive model creating program for causing a computer to execute:
    - processings in a speaker selecting program according to claim 36;
      
      a processing for creating one sufficient statistic relating to a long-time speaker by statistical calculation, from sufficient statistics respectively corresponding to a plurality of speakers selected in a long-time speaker selection processing;
      
      a processing for creating one sufficient statistic relating to a short-time speaker by statistical calculation, from sufficient statistics respectively corresponding to a plurality of speakers selected in a short-time speaker selection processing; and
      
      an adaptive model creating processing for creating a speaker adaptive model by integrating the sufficient statistics calculated in the processings for creating the sufficient statistics by statistical calculation.
  - 42. A storage medium for recording a speaker adaptive model creating program for causing a computer to execute:
    - processings in a speaker selecting program according to claim 36;
      
      a short-time speaker integrating processing for calculating a frequency of occurrence of speakers selected in a short-time speaker selection processing; and
      
      an adaptive model creating processing for creating one speaker adaptive model by weighting and integrating sufficient statistics based on the frequency of occurrence of speakers.
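The threshold-based variant recited in claims 25, 33, and 38 can be sketched as follows. The claims do not state how the thresholds are applied, so this sketch assumes the simplest reading: a speaker is kept when its likelihood is at or above the threshold. The function name `select_by_threshold`, the threshold values, and the scores are hypothetical.

```python
def select_by_threshold(scores, threshold, candidates=None):
    """Keep speakers whose score is at or above the threshold.

    scores: speaker ID -> likelihood score (higher is better).
    candidates: optional list restricting which speakers are considered,
    for example the output of an earlier selection stage.
    """
    pool = scores if candidates is None else candidates
    return [spk for spk in pool if scores[spk] >= threshold]


# Hypothetical per-frame log likelihoods for three speaker models.
long_scores = {"a": -1.2, "b": -0.8, "c": -2.5}
short_scores = {"a": -0.5, "b": -1.9, "c": -0.1}

# First threshold applied to the long-time likelihoods.
long_sel = select_by_threshold(long_scores, threshold=-1.5)
# Second threshold (here a different value) applied to the short-time
# likelihoods, restricted to the speakers that passed the first stage.
short_sel = select_by_threshold(short_scores, threshold=-1.0, candidates=long_sel)
```

With these values the first stage keeps `a` and `b`, and the second stage narrows that set to `a`. A threshold alone does not guarantee that the second stage returns fewer speakers than the first, so a practical version would likely combine it with a count limit.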

Current Assignee: NEC Corporation
Original Assignee: NEC Corporation
Inventors: Emori, Tadashi; Tani, Masahiro; Onishi, Yoshifumi
Granted Patent: US 8,452,596 B2
US Class Current: 704/247
CPC Class Codes: G10L 17/08 Use of distortion metrics o...
