Speech recognition device and the operation method thereof
1 Assignment
0 Petitions
Abstract
Described herein is a speech recognition device comprising: a communication module receiving, from a speech recognition terminal, speech data corresponding to a speech input and multi-sensor data corresponding to the input environment of the speech; a model selection module selecting a language and acoustic model corresponding to the multi-sensor data from among a plurality of language and acoustic models classified according to the speech input environment, on the basis of previous multi-sensor data; and a speech recognition module controlling the communication module to apply a feature vector extracted from the speech data to the selected language and acoustic model and to transmit a speech recognition result for the speech data to the speech recognition terminal.
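The abstract describes a three-stage flow: receive speech and sensor data, select environment-matched language and acoustic models, then decode. The following is a minimal sketch of that selection flow; all names, environments, and thresholds here are invented for illustration and are not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class SensorData:
    environment: str   # e.g. "car", "office" -- inferred from image/sensor data
    distance_m: float  # estimated terminal-to-speaker distance

# Hypothetical model registries, keyed by input environment and, for the
# acoustic models, by an SNR band derived from speaker distance.
LANGUAGE_MODELS = {"car": "lm_car", "office": "lm_office"}
ACOUSTIC_MODELS = {
    ("car", "high"): "am_car_clean",
    ("car", "low"): "am_car_noisy",
    ("office", "high"): "am_office_clean",
    ("office", "low"): "am_office_noisy",
}

def snr_band(distance_m: float) -> str:
    # A nearer speaker yields a stronger signal relative to ambient noise.
    return "high" if distance_m < 1.0 else "low"

def select_models(current: SensorData) -> tuple[str, str]:
    """Pick the language/acoustic model pair matching the sensed environment."""
    lm = LANGUAGE_MODELS[current.environment]
    am = ACOUSTIC_MODELS[(current.environment, snr_band(current.distance_m))]
    return lm, am

def recognize(features: list[float], current: SensorData) -> str:
    # Stand-in for applying the extracted feature vector to the selected models.
    lm, am = select_models(current)
    return f"decoded with {lm} + {am}"
```

In this sketch the selection is a simple lookup; the claims instead describe selection via a learned correspondence between current and previous multi-sensor data.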
13 Citations
9 Claims
1. A speech recognition device comprising:

at least one hardware processor configured to:

receive, from a speech recognition terminal, speech data corresponding to a speech input by a speaking person and multi-sensor data corresponding to an environment in which the speech is input by the speaking person, the multi-sensor data being useable as additional information to the speech input for performing speech recognition and the multi-sensor data including an image of the speaking person and estimated location and position of the speech recognition terminal to the speaking person while the speech is input;

select a language model from a plurality of language models for the speech input, the language model being selected as representing a correspondence between a plurality of data among the multi-sensor data including the image of the speaking person of the speech input, the environment in which the speech is input by the speaking person, and the estimated location and position of the speech recognition terminal to the speaking person and previous multi-sensor data including a plurality of data among previous images of speaking persons and corresponding environments in which previous speeches are input;

select an acoustic model from among a plurality of acoustic models for the speech input, the acoustic model being selected as representing a correspondence between a plurality of data among the multi-sensor data including the image of the speaking person of the speech input, the environment in which the speech is input by the speaking person, the estimated location and position of the speech recognition terminal to the speaking person, and an estimated signal to noise ratio (SNR) for the speech data and the previous multi-sensor data including the plurality of data among previous images of speaking persons and the corresponding environments in which previous speeches are input; and

control the speech recognition of the speech input to be performed according to the selected language model and the selected acoustic model which varies in consideration of the plurality of data among the multi-sensor data obtained while the speech is input through application of a feature vector extracted from the speech data to the selected language model and the selected acoustic model, and transmit a result of the speech recognition of the speech data to the speech recognition terminal,

wherein the estimated SNR for the speech varies according to a relationship determined between the speech input and proximity of a distance between the speech recognition terminal and the speaking person obtained through the estimated location and position of the speech recognition terminal to the speaking person while the speech is being input.

View Dependent Claims (2, 3, 4, 5)
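The wherein clause of claim 1 ties the estimated SNR to the speaker's proximity to the terminal: the closer the speaker, the higher the estimated SNR. As a rough illustration of such a relationship (not the patent's method), the sketch below assumes free-field inverse-square spreading against a fixed noise floor; all dB figures and function names are invented placeholders.

```python
import math

def estimated_snr_db(distance_m: float,
                     source_level_db: float = 70.0,
                     noise_floor_db: float = 40.0) -> float:
    """SNR estimate that falls as the speaker moves away from the terminal.

    Assumes free-field inverse-square spreading: each doubling of distance
    costs about 6 dB of signal level. Figures are illustrative only.
    """
    # Signal level at the microphone relative to a 1 m reference distance.
    level_db = source_level_db - 20.0 * math.log10(max(distance_m, 0.01))
    return level_db - noise_floor_db

def pick_acoustic_model(distance_m: float, threshold_db: float = 20.0) -> str:
    # Selection logic can then switch to a "noisy"-condition acoustic model
    # once the distance-derived SNR estimate drops below a chosen threshold.
    return "clean" if estimated_snr_db(distance_m) >= threshold_db else "noisy"
```

The monotonic distance-to-SNR mapping is the key property: a single proximity estimate from the multi-sensor data is enough to steer acoustic model selection before any decoding takes place.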
6. A method of operating a speech recognition device via at least one hardware processor, the method comprising:
receiving, from a speech recognition terminal, speech data corresponding to a speech input by a speaking person and multi-sensor data corresponding to an environment in which the speech is input by the speaking person, the multi-sensor data being useable as additional information to the speech input for performing speech recognition and the multi-sensor data including an image of the speaking person of the speech and estimated location and position of the speech recognition terminal to the speaking person while the speech is input;

selecting a language model from a plurality of language models for the speech input, the language model being selected as representing a correspondence between a plurality of data among the multi-sensor data including the image of the speaking person of the speech input, the environment in which the speech is input by the speaking person and the estimated location and position of the speech recognition terminal to the speaking person and previous multi-sensor data including a plurality of data among previous images of speaking persons and corresponding environments in which previous speeches are input;

selecting an acoustic model from among a plurality of acoustic models for the speech input, the acoustic model being selected as representing a correspondence between a plurality of data among the multi-sensor data including the image of the speaking person of the speech input, the environment in which the speech is input by the speaking person and the estimated location and position of the speech recognition terminal to the speaking person, and an estimated signal to noise ratio (SNR) for the speech data, the previous multi-sensor data including the plurality of data among previous images of speaking persons and the corresponding environments in which previous speeches are input;

controlling the speech recognition of the speech input to be performed according to the selected language model and the selected acoustic model which varies in consideration of the plurality of data among the multi-sensor data obtained while the speech is input through application of a feature vector extracted from the speech data to the selected language model and the selected acoustic model; and

transmitting a result of the speech recognition of the speech data to the speech recognition terminal,

wherein the estimated SNR for the speech varies according to a relationship determined between the speech input and proximity of a distance between the speech recognition terminal and the speaking person obtained through the estimated location and position of the speech recognition terminal to the speaking person while the speech is being input.

View Dependent Claims (7, 8, 9)
Specification