APPARATUS FOR SPEECH RECOGNITION USING MULTIPLE ACOUSTIC MODEL AND METHOD THEREOF

US 20140180689A1
Filed: 03/18/2013
Published: 06/26/2014
Est. Priority Date: 12/24/2012
Status: Active Grant

Abstract

Disclosed are an apparatus for recognizing voice using multiple acoustic models according to the present invention and a method thereof. An apparatus for recognizing voice using multiple acoustic models includes a voice data database (DB) configured to store voice data collected in various noise environments; a model generating means configured to perform classification for each speaker and environment based on the collected voice data, and to generate an acoustic model of a binary tree structure as the classification result; and a voice recognizing means configured to extract feature data of voice data when the voice data is received from a user, to select multiple models from the generated acoustic model based on the extracted feature data, to parallel recognize the voice data based on the selected multiple models, and to output a word string corresponding to the voice data as the recognition result.

14 Claims

1. An apparatus for recognizing voice using multiple acoustic models, the apparatus comprising:
- a voice data database (DB) configured to store voice data collected in various noise environments;
  
  a model generating means configured to perform classification for each speaker and environment based on the collected voice data, and to generate an acoustic model of a binary tree structure as the classification result; and
  
  a voice recognizing means configured to extract feature data of voice data when the voice data is received from a user, to select multiple models from the generated acoustic model based on the extracted feature data, to parallel recognize the voice data based on the selected multiple models, and to output a word string corresponding to the voice data as the recognition result.
  - 2. The apparatus of claim 1, wherein the model generating means comprises:
    - a data constituting unit configured to extract, from the collected voice data, feature vector data to be two types of spectral data and cepstral data;
      
      a speaker classifying unit configured to classify the extracted feature vector data based on a speaker, and to generate a binary tree-based speaker centric hierarchical model including a speaker classification hidden Markov model (HMM) group, a speaker classification Gaussian mixture model (GMM) group, and a speaker classification data group as the classification result;
      
      an environment classifying unit configured to classify the generated speaker classification HMM group and speaker classification data group based on an environment, and to generate an environment classification data group as the classification result; and
      
      an acoustic model generating unit configured to perform environmental adaptation with respect to the generated environment classification data group and speaker classification HMM group, and to generate a binary tree-based environment centric hierarchical model including an environment classification HMM group and an environment classification GMM group as the performance result.
  - 3. The apparatus of claim 2, wherein the speaker classifying unit generates a speaker-independent GMM and a speaker-independent HMM based on the extracted cepstral data, performs speaker adaptation with respect to the generated speaker-independent GMM and speaker-independent HMM, and generates a binary tree-based cepstral speaker classification HMM group as the performance result, classifies the cepstral data based on a speaker and generates a cepstral speaker classification data group as the classification result, and generates a spectral speaker classification data group that is classified in correspondence to the spectral data extracted from the same voice data as the generated cepstral speaker classification data group, generates a cepstral speaker classification GMM group by directly learning a speaker classification data group, and generates a cepstral speaker classification data group by speaker-adapting cepstral speaker classification data to a speaker-independent model.
  - 4. The apparatus of claim 3, wherein the environment classifying unit classifies the generated cepstral speaker classification HMM group and spectral speaker classification data group based on an environment and generates a cepstral environment classification data group as the classification result.
  - 5. The apparatus of claim 4, wherein the acoustic model generating unit performs environmental adaptation with respect to the generated cepstral environment classification data group and cepstral speaker classification HMM group and generates a binary tree-based environment centric hierarchical model including a cepstral environment classification GMM group and a cepstral environment classification HMM group as the performance result.
  - 6. The apparatus of claim 1, wherein the voice recognizing means comprises:
    - a feature extracting unit configured to extract the feature data of voice data received from the user;
      
      a model selecting unit configured to calculate a similarity between the extracted feature data and pre-stored acoustic model, and to select the multiple models based on the calculation result;
      
      a parallel recognizing unit configured to perform Viterbi-based parallel recognition with respect to the voice data based on the selected multiple models, a pre-stored pronunciation model, and a language model; and
      
      a recognition selecting unit configured to output a word string having the highest scores among multiple word strings that are output as the performance result.
  - 7. The apparatus of claim 6, wherein the model selecting unit calculates the similarity while performing traversal of a root node to a lower node of the binary tree-based acoustic model, and repeats a process of deleting a model having a relatively low similarity and adding a model having a relatively high similarity until final N models are obtained in a descending order of the similarity as the calculation result.
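
The model selection of claims 6 and 7 can be illustrated with a minimal Python sketch. It is not the patented implementation and rests on several assumptions that the claims do not state: each tree node holds a single diagonal Gaussian in place of a GMM, the similarity is the average per-frame log-likelihood, and only the children of nodes that survive pruning are visited. The names `ModelNode`, `similarity`, and `select_models` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ModelNode:
    """One node of a binary tree of acoustic models (hypothetical layout)."""
    name: str
    mean: Sequence[float]   # assumption: one diagonal Gaussian stands in for a GMM
    var: Sequence[float]
    left: Optional["ModelNode"] = None
    right: Optional["ModelNode"] = None


def similarity(node: ModelNode, frames: Sequence[Sequence[float]]) -> float:
    """Average per-frame log-likelihood of the feature frames under the node."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, node.mean, node.var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)


def select_models(root: ModelNode, frames, n: int) -> List[ModelNode]:
    """Walk from the root toward lower nodes, keeping the n most similar models."""
    scored = [(similarity(root, frames), root)]
    frontier = [root]
    while frontier:
        next_frontier = []
        for node in frontier:
            for child in (node.left, node.right):
                if child is not None:
                    # Add the child model as a candidate.
                    scored.append((similarity(child, frames), child))
                    next_frontier.append(child)
        # Delete low-similarity candidates: keep the top n in descending order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        scored = scored[:n]
        kept = {id(node) for _, node in scored}
        # Only descend below children that survived the pruning step.
        frontier = [child for child in next_frontier if id(child) in kept]
    return [node for _, node in scored]
```

Pruning after every level keeps the number of similarity evaluations proportional to the tree depth times N rather than to the total number of models, which is one plausible reason for organizing the models as a binary tree.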

8. A method of recognizing voice using multiple acoustic models, the method comprising:
- storing voice data collected in various noise environments in a voice data DB;
  
  performing classification for each speaker and environment based on the collected voice data, and generating an acoustic model of a binary tree structure as the classification result; and
  
  extracting feature data of voice data when the voice data is received from a user, selecting multiple models from the generated acoustic model based on the extracted feature data, parallel recognizing the voice data based on the selected multiple models, and outputting a word string corresponding to the voice data as the recognition result.
  - 9. The method of claim 8, wherein the performing comprises:
    - extracting the feature vector data to be two types of spectral data and cepstral data from the collected voice data;
      
      classifying the extracted feature vector data based on a speaker, and generating a binary tree-based speaker centric hierarchical model including a speaker classification HMM group, a speaker classification GMM group, and a speaker classification data group as the classification result;
      
      classifying the generated speaker classification HMM group and speaker classification data group based on an environment, and generating an environment classification data group as the classification result; and
      
      performing environmental adaptation with respect to the generated environment classification data group and speaker classification HMM group, and generating a binary tree-based environment centric hierarchical model including an environment classification HMM group and an environment classification GMM group as the performance result.
  - 10. The method of claim 9, wherein the generating the binary tree-based speaker centric hierarchical model comprises:
    - generating a speaker-independent GMM and a speaker-independent HMM based on the extracted cepstral data, performing speaker adaptation with respect to the generated speaker-independent GMM and speaker-independent HMM, and generating a binary tree-based cepstral speaker classification HMM group as the performance result, classifying the cepstral data based on a speaker and generating a cepstral speaker classification data group as the classification result, and generating a spectral speaker classification data group that is classified in correspondence to the spectral data extracted from the same voice data as the generated cepstral speaker classification data group, generating a cepstral speaker classification GMM group by directly learning a speaker classification data group, and generating a cepstral speaker classification data group by speaker-adapting cepstral speaker classification data to a speaker-independent model.
  - 11. The method of claim 10, wherein the generating the environment classification data group comprises:
    - classifying the generated cepstral speaker classification HMM group and spectral speaker classification data group based on an environment and generating a cepstral environment classification data group as the classification result.
  - 12. The method of claim 11, wherein the generating the binary tree-based environment centric hierarchical model comprises:
    - performing environmental adaptation with respect to the generated cepstral environment classification data group and cepstral speaker classification HMM group and generating a binary tree-based environment centric hierarchical model including a cepstral environment classification GMM group and a cepstral environment classification HMM group as the performance result.
  - 13. The method of claim 8, wherein the outputting comprises:
    - extracting the feature data of voice data received from the user;
      
      calculating a similarity between the extracted feature data and pre-stored acoustic model, and selecting the multiple models based on the calculation result;
      
      performing Viterbi-based parallel recognition with respect to the voice data based on the selected multiple models, a pre-stored pronunciation model, and a language model; and
      
      outputting a word string having the highest scores among multiple word strings that are output as the performance result.
  - 14. The method of claim 13, wherein the selecting comprises:
    - calculating the similarity while performing traversal of a root node to a lower node of the binary tree-based acoustic model, and repeating a process of deleting a model having a relatively low similarity and adding a model having a relatively high similarity until final N models are obtained in a descending order of the similarity as the calculation result.
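
The parallel recognition and result selection steps of claim 13 can be sketched in Python as follows. This is a toy under stated assumptions, not the claimed method: each selected acoustic model is reduced to a small discrete HMM whose states are words, the pronunciation model and language model are omitted, all probabilities are assumed nonzero, and a thread pool stands in for whatever parallel hardware a real recognizer would use. The function names and the model dictionary layout are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def viterbi(obs, model):
    """Return (best log score, best state path) for a discrete HMM.

    `model` is a dict with keys "states", "start", "trans", "emit" holding
    plain (nonzero) probabilities; this layout is an assumption of the sketch.
    """
    states = model["states"]
    start, trans, emit = model["start"], model["trans"], model["emit"]
    best = {
        s: (math.log(start[s]) + math.log(emit[s][obs[0]]), [s]) for s in states
    }
    for o in obs[1:]:
        step = {}
        for s in states:
            # Best predecessor state for arriving in s at this time step.
            prev = max(states, key=lambda p: best[p][0] + math.log(trans[p][s]))
            score = best[prev][0] + math.log(trans[prev][s]) + math.log(emit[s][o])
            step[s] = (score, best[prev][1] + [s])
        best = step
    return max(best.values(), key=lambda t: t[0])


def recognize_parallel(obs, models):
    """Decode `obs` with every model concurrently and keep the top-scoring result."""
    def decode(item):
        name, model = item
        score, path = viterbi(obs, model)
        return name, score, path

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(decode, models.items()))
    # Output the word string with the highest score among all models.
    return max(results, key=lambda r: r[1])
```

Comparing raw Viterbi scores across models is only meaningful when the models share the same observation space and scoring scale, which this sketch assumes.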

Current Assignee
Electronics and Telecommunications Research Institute
Original Assignee
Electronics and Telecommunications Research Institute
Inventors
KIM, Dong Hyun

Granted Patent

US 9,378,742 B2
US Class Current

704/246
CPC Class Codes

G10L 15/065 Adaptation

G10L 15/32 Multiple recognisers used i...
