Blind clustering of data with application to speech processing systems

US 5,862,519 A
Filed: 04/01/1997
Issued: 01/19/1999
Est. Priority Date: 04/02/1996
Status: Expired due to Term

7 Assignments

0 Petitions

Abstract

The present invention relates to a method for segmenting speech into subword speech segments. Optimal boundary locations for each estimate of a number of segments are determined within an estimated range of the number of segments. In addition, an optimality criterion is found for each estimate of the number of segments within the range. Using the optimality criterion, the optimal number of subwords is determined. From the location of the boundaries and the optimal number of segments, data can be clustered or speech can be segmented. The method can be used in data processing systems, speaker verification, medium size vocabulary speech recognition systems, language identification systems and coarse subword level speech segmentation processes.
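
The selection loop described in the abstract can be sketched roughly as follows. This is an illustrative outline only, not the patented implementation: `select_segment_count`, `uniform_boundaries`, and `penalized_fit` are hypothetical names, the boundary finder and criterion are toy stand-ins, and treating the largest Q_K as the best one is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Boundaries = List[int]  # frame indices at which each segment after the first starts


def select_segment_count(
    frames: Sequence[float],
    k_min: int,
    k_max: int,
    find_boundaries: Callable[[Sequence[float], int], Boundaries],
    criterion: Callable[[Sequence[float], Boundaries], float],
) -> Tuple[int, Boundaries, float]:
    """Try every K in [k_min, k_max]; keep the K whose criterion Q_K is largest."""
    if not 1 <= k_min <= k_max:
        raise ValueError("need 1 <= k_min <= k_max")
    best = None
    for k in range(k_min, k_max + 1):
        boundaries = find_boundaries(frames, k)  # boundary locations for this K
        q_k = criterion(frames, boundaries)      # optimality criterion Q_K
        if best is None or q_k > best[2]:
            best = (k, boundaries, q_k)
    return best


def uniform_boundaries(frames: Sequence[float], k: int) -> Boundaries:
    """Toy stand-in (assumption): split the frames into K equal-length pieces."""
    n = len(frames)
    return [round(i * n / k) for i in range(1, k)]


def penalized_fit(frames: Sequence[float], boundaries: Boundaries,
                  penalty: float = 1.0) -> float:
    """Toy stand-in (assumption): negative squared error minus a per-segment penalty."""
    edges = [0] + list(boundaries) + [len(frames)]
    sse = 0.0
    for start, end in zip(edges, edges[1:]):
        seg = frames[start:end]
        if seg:
            mean = sum(seg) / len(seg)
            sse += sum((x - mean) ** 2 for x in seg)
    return -sse - penalty * (len(boundaries) + 1)


frames = [0.0] * 4 + [5.0] * 4 + [10.0] * 4
k0, bounds, q = select_segment_count(frames, 1, 4, uniform_boundaries, penalized_fit)
print(k0, bounds, q)
```

On this toy input the three-segment split fits the data exactly, so the loop settles on K₀ = 3 with boundaries at frames 4 and 8.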

26 Claims

1. A method for segmenting speech without knowledge of linguistic information into a plurality of segments comprising the steps of:
- estimating a range of a number of said segments in said speech;
  
  dynamically determining locations of boundaries for each estimate of a number of said segments K within said range of said number of said segments;
  
  determining an optimality criterion Q_k for each of said estimate of said number of segments K from said location of said boundaries;
  
  determining an optimal number of segments K₀ in said speech from said optimality criterion Q_k ;
  
  segmenting said speech into said optimal number of segments K₀ ; and
  
  storing said optimal number of segments K₀.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1 wherein said range of said number of segments is between a minimum number of segments K_min and a maximum number of segments K_max.
  - 3. The method of claim 2 wherein said minimum number of segments K_min is determined by using a convex hull method on a loudness function derived from said speech.
  - 4. The method of claim 3 wherein said maximum number of segments K_max is determined by a spectral variation function formed of the Euclidian norm of the delta cepstral coefficient.
  - 5. The method of claim 4 wherein the spectral variation function is represented by:
    - ##EQU12## wherein Δc_n is the delta cepstral coefficient of the m^th cepstral coefficient at the time frame n and p is the order of the cepstral coefficient vector.
  - 6. The method of claim 5 wherein said minimum number of segments K_min and said maximum number of segments K_max is determined from a subword histogram of a predetermined phonetically labeled database.
  - 7. The method of claim 1 wherein said step of dynamically determining locations of boundaries is determined with modified level building dynamic programming with the steps of:
    - generating reference vectors related to subword segments dynamically during said level building;
      
      calculating the accumulated distance based on said reference vectors in a segment for frames i through n of said speech for a plurality of levels representing the number of boundaries corresponding to said estimate of said number of segments K; and
      
      determining a backtrack pointer for each of said frames for each said level corresponding to a best path through said frames between adjacent levels.
  - 8. The method of claim 7 wherein said optimality criterion Q_k is determined from a normal decomposition method as the summation of a plurality of normal distribution of the form:
    - ##EQU13## wherein P_i is the prior probability and P_i (x) defined as μ(X,M,Σ_i) is normal with expected vector M_i and covariance matrix Σ_i and a log likelihood criteria of the form ##EQU14##
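
Claims 4 and 5 describe the spectral variation function as the Euclidean norm of the delta cepstral coefficients at each frame. The equation itself is not reproduced on this page, so the sketch below follows only the verbal description. The names `spectral_variation` and `count_local_maxima` are hypothetical, and counting peaks of the function to bound the number of segments is an assumption about how K_max could be derived, not something the claim text states.

```python
import math
from typing import List, Sequence


def spectral_variation(delta_cepstra: Sequence[Sequence[float]]) -> List[float]:
    """Euclidean norm of the p delta cepstral coefficients at each time frame n."""
    return [math.sqrt(sum(c * c for c in frame)) for frame in delta_cepstra]


def count_local_maxima(values: Sequence[float], threshold: float = 0.0) -> int:
    """Count interior peaks above a threshold (assumed proxy for K_max)."""
    peaks = 0
    for n in range(1, len(values) - 1):
        if values[n] > threshold and values[n] > values[n - 1] and values[n] >= values[n + 1]:
            peaks += 1
    return peaks


# Two coefficients per frame (p = 2), four frames of made-up delta cepstra.
svf = spectral_variation([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [0.6, 0.8]])
print(svf)
print(count_local_maxima([0.0, 1.0, 0.0, 2.0, 0.0]))
```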

9. A system for segmenting speech without knowledge of linguistic information into a plurality of segments comprising:
- means for estimating a range of a number of said segments in said speech;
  
  means for dynamically determining locations of boundaries for each estimate of a number of said segments K within said range of said number of said segments;
  
  means for determining an optimality criterion Q_k for each of said estimate of said number of segments K from said location of said boundaries;
  
  means for determining an optimal number of segments K₀ in said speech from said optimality criterion Q_k ;
  
  means for segmenting said speech into said optimal number of segments K₀ ;
  
  means for storing said optimal number of segments K₀ ; and
  
  means for modeling said speech based on said optimal number of segments K₀.
- Dependent claims (10, 11, 12, 13, 14, 15):
  - 10. The system of claim 9 wherein said range of said number of segments is between a minimum number of segments K_min and a maximum number of segments K_max.
  - 11. The system of claim 10 wherein said minimum number of segments K_min is determined by using a convex hull method on a loudness function derived from said speech.
  - 12. The system of claim 11 wherein said maximum number of segments K_max is determined by a spectral variation function formed of the Euclidian norm of the delta cepstral coefficient.
  - 13. The system of claim 12 wherein the spectral variation function is represented by:
    - ##EQU15## wherein Δc_n is the delta cepstral coefficient of the m^th cepstral coefficient at the time frame n and p is the order of the cepstral coefficient vector.
  - 14. The system of claim 10 wherein said means for dynamically determining locations of boundaries comprises:
    - means for generating reference vectors related to subword segments dynamically during said level building;
      
      means for calculating the accumulated distance based on said reference vectors in a segment for frames i through n of said speech for a plurality of levels representing the number of boundaries corresponding to said estimate of said number of segments K; and
      
      means for determining a backtrack pointer for each of said frames for each said level corresponding to a best path through said speech frames between adjacent levels.
  - 15. The system of claim 14 wherein said optimality criterion Q_k is determined as the summation of a plurality of normal distribution of the form:
    - ##EQU16## wherein P_i is the prior probability and P_i (x) defined as μ(X,M,Σ_i) is normal with expected vector M_i and covariance matrix Σ and a log likelihood criteria of the form ##EQU17##
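
The boundary search in claims 7 and 14 (accumulated distance per level, with a backtrack pointer per frame and level) can be illustrated with a simplified dynamic program. This is a sketch under stated assumptions rather than the patent's modified level building: frames are scalars, the reference vector of a segment is taken to be its mean, distance is squared Euclidean, and `level_building` and `make_segment_cost` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple


def make_segment_cost(frames: Sequence[float]) -> Callable[[int, int], float]:
    """Return cost(i, n): squared distance of frames[i:n] to their own mean."""
    s, s2 = [0.0], [0.0]
    for x in frames:  # prefix sums make each segment cost O(1)
        s.append(s[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i: int, n: int) -> float:
        total = s[n] - s[i]
        return (s2[n] - s2[i]) - total * total / (n - i)

    return cost


def level_building(frames: Sequence[float], k: int) -> Tuple[List[int], float]:
    """Split frames into k segments with minimum total distance; return boundaries."""
    n_frames = len(frames)
    if not 1 <= k <= n_frames:
        raise ValueError("need 1 <= k <= len(frames)")
    cost = make_segment_cost(frames)
    inf = float("inf")
    # acc[level][n]: best accumulated distance covering frames[0:n] with `level` segments
    acc = [[inf] * (n_frames + 1) for _ in range(k + 1)]
    # back[level][n]: start frame of the last segment on that best path
    back = [[0] * (n_frames + 1) for _ in range(k + 1)]
    acc[0][0] = 0.0
    for level in range(1, k + 1):
        for n in range(level, n_frames + 1):
            for i in range(level - 1, n):
                d = acc[level - 1][i] + cost(i, n)
                if d < acc[level][n]:
                    acc[level][n] = d
                    back[level][n] = i
    # Follow the backtrack pointers from the last level down to recover boundaries.
    boundaries: List[int] = []
    n = n_frames
    for level in range(k, 1, -1):
        n = back[level][n]
        boundaries.append(n)
    boundaries.reverse()
    return boundaries, acc[k][n_frames]


boundaries, distance = level_building([0, 0, 0, 5, 5, 5, 9, 9], 3)
print(boundaries, distance)
```

Because the reference for each candidate segment is recomputed from the frames it spans, no pre-trained templates are needed, which is the sense in which the sketch generates its references during the search.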

16. A system for speaker verification without knowledge of linguistic information for speech spoken by said speaker comprising:
- means for extracting at least one spectral feature vector from first speech;
  
  means for segmenting said extracted feature vector by estimating a range of a number of said segments in said extracted feature vector;
  
  means for dynamically determining locations of boundaries for each estimate of numbers of said segments K within said range of said number of said segments;
  
  means for determining an optimality criterion Q_k for each of said estimate of said number of segments K from said location of said boundaries for determining an optimal number of segments K₀ in said first speech from said optimal criterion Q_k ;
  
  means for segmenting said first speech into said optimal number of segments;
  
  means for storing said boundaries and said optimal number of segments as segmentation parameters;
  
  means for determining a first subword model from said segmentation parameters of said first speech;
  
  means for determining a second subword model from said optimal number of segments;
  
  means for storing said first subword model and said second subword model;
  
  means for extracting at least one second feature vector from a second speech sample;
  
  means for segmenting said second feature vector into said optimal number of segments using said stored segmentation parameters;
  
  means for recognizing the segmented second speech sample from said stored first subword model and said second subword model to produce recognized output; and
  
  means for determining from said recognized output whether to accept or reject said speaker.
- Dependent claims (17, 18, 19, 20, 21):
  - 17. The system of claim 16 wherein said means for determining from said recognized output whether to accept or reject said speaker further comprises:
    - determining a first score from said first subword model;
      
      determining a second score from said second subword model; and
      
      combining said first and second scores.
  - 18. The system of claim 17 wherein said means for recognizing further comprises:
    - determining a third score for said second speech sample from said stored first subword model;
      
      determining a fourth score for said second speech sample from said stored second subword model;
      
      combining said third and fourth scores; and
      
      determining the similarity of said combined first and second score with said combined third and fourth score.
  - 19. The system of claim 16 wherein the maximum number of segments K_max of said range of said number of segments is determined by a spectral variation formed of the Euclidian norm of the delta cepstral coefficient wherein the spectral variation function is represented by:
    - ##EQU18## wherein Δc_n is the delta cepstral coefficient of the m^th cepstral coefficient at the time frame n and p is the order of the cepstral coefficient vector.
  - 20. The system of claim 19 wherein said means for dynamically determining locations of boundaries comprises:
    - means for generating reference vectors related to subword segments dynamically during said level building;
      
      means for calculating the accumulated distance based on said reference vectors in a segment for frames i through n of said speech for a plurality of levels representing the number of boundaries corresponding to said estimate of said number of segments K; and
      
      means for determining a backtrack pointer for each of said frames for each said level corresponding to a best path through said speech frames between adjacent levels.
  - 21. The system of claim 20 wherein said optimality criterion Q_k is determined as the summation of a plurality of normal distribution of the form:
    - ##EQU19## wherein P_i is the prior probability and P_i (x) defined as μ(X,M,Σ_i) is normal with expected vector M_i and covariance matrix Σ_i and a log likelihood criteria of the form ##EQU20##
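
Claims 8, 15 and 21 describe the optimality criterion as a sum of normal distributions weighted by prior probabilities, scored with a log likelihood. The equations are not reproduced on this page, so the following is only a generic mixture log likelihood consistent with that wording. Scalar data with per-component variances is a simplifying assumption (the claims refer to vectors and covariance matrices), and `normal_pdf` and `mixture_log_likelihood` are hypothetical names.

```python
import math
from typing import Sequence


def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of a one-dimensional normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_log_likelihood(
    samples: Sequence[float],
    priors: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
) -> float:
    """Sum over samples of log( sum_i P_i * p_i(x) )."""
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1")
    total = 0.0
    for x in samples:
        density = sum(p * normal_pdf(x, m, v) for p, m, v in zip(priors, means, variances))
        total += math.log(density)  # raises ValueError if the density underflows to 0
    return total


data = [-2.0, -2.0, 2.0, 2.0]
one_cluster = mixture_log_likelihood(data, [1.0], [0.0], [4.0])
two_clusters = mixture_log_likelihood(data, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
print(one_cluster, two_clusters)
```

On this bimodal toy data the two-component description scores higher than the single-component one, which is the kind of comparison a criterion Q_K would need to support when choosing among candidate segment counts.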

22. A system for speech recognition of user defined vocabulary words comprising:
- means for estimating a range of a number of said subwords in a first vocabulary word;
  
  means for dynamically determining locations of boundaries for each estimate of a number of said subwords within said range of said number of said subwords;
  
  means for determining an optimality criterion Q_k for each of said estimate of said number of subwords from said location of said boundaries;
  
  means for determining an optimal number of subwords in said vocabulary word from said optimality criterion Q_k ;
  
  means for modeling said subwords with a classifier to determine a plurality of word models for said vocabulary word;
  
  means for storing said word models; and
  
  means for recognizing a second vocabulary word from said stored word models.
- Dependent claims (23):
  - 23. The system of claim 22 wherein said means for recognizing determines a score for each of said word models and further comprising:
    - means for determining a maximum value of said scores from said word models; and
      
      means for assigning a recognized word label to said subwords corresponding to said word model having said maximum value.
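
The decision rule in claim 23 (score every stored model, take the maximum, assign that model's label) amounts to an argmax over scores. A minimal sketch follows, assuming scores are already available as a mapping from label to score; `recognize` and the example labels are hypothetical.

```python
from typing import Dict, Tuple


def recognize(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the label whose model produced the maximum score, with that score."""
    if not scores:
        raise ValueError("at least one model score is required")
    label = max(scores, key=scores.get)
    return label, scores[label]


label, score = recognize({"yes": 0.2, "no": 0.7, "maybe": 0.1})
print(label, score)
```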

24. A system for recognizing a language comprising:
- means for estimating a range of a number of said subwords in first speech of said language;
  
  means for dynamically determining locations of boundaries for each estimate of a number of said subwords within said range of said number of said subwords;
  
  means for determining an optimal criterion Q_k for each of said estimate of said number of subwords from said location of said boundaries;
  
  means for determining an optimal number of subwords in said first speech from said optimality criterion Q_k ;
  
  means for modeling said subwords with a classifier to determine a language model;
  
  means for storing said language model; and
  
  means for recognizing a language of a second speech sample from said stored language model.
- Dependent claims (25):
  - 25. The system of claim 24 wherein said means for recognizing determines a score for each of a plurality of said language models and further comprising:
    - means for determining a maximum value of said scores from said language models; and
      
      means for assigning a recognized language label to said subwords corresponding to said language model having said maximum value.

26. A system for phonetic transcription comprising:
- means for estimating a range of a number of said subwords in a first speech sample;
  
  means for dynamically determining locations of boundaries for each estimate of a number of said subwords within said range of said number of said subwords;
  
  means for determining an optimal criterion Q_k for each of said estimate of said number of subwords from said location of said boundaries;
  
  means for determining an optimal number of subwords in said speech from said optimality criterion Q_k ; and
  
  means for storing said boundary locations, wherein said boundary locations are used in subsequent phonetic transcription.

Current Assignee
SpeechWorks International, Inc. (Microsoft Corporation)
Original Assignee
T-Netix, Inc. (Cognizant Technology Solutions Corp.)
Inventors
Sharma, Manish; Mammone, Richard J.
Primary Examiner(s)
Hudspeth, David R.
Assistant Examiner(s)
SAX, ROBERT L

Application Number

US08/827,562
Time in Patent Office

658 Days
Field of Search

704/231, 704/243, 704/245, 704/246, 704/247
US Class Current

704/231
CPC Class Codes

G10L 15/04 Segmentation; Word boundary...

G10L 25/48 specially adapted for parti...
