Method and device for training acoustic model, computer device and storage medium

US 10,522,136 B2
Filed: 12/28/2017
Issued: 12/31/2019
Est. Priority Date: 06/16/2017
Status: Active Grant

Abstract

Embodiments of the present disclosure provide a method and a device for training an acoustic model, a computer device and a storage medium. The method includes: obtaining supervised speech data and unsupervised speech data, in which the supervised speech data is speech data with manual annotation and the unsupervised speech data is speech data with machine annotation; extracting speech features from the supervised speech data and the unsupervised speech data; and performing multi-task learning, comprising a supervised learning task and an unsupervised learning task, on the speech features of the supervised speech data and the unsupervised speech data by using a deep learning network, to train and obtain the acoustic model.
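The multi-task network described here (a shared input path feeding a supervised output layer and an unsupervised output layer, with both outputs merged at inference as recited in claim 1) might be sketched as follows. This is only an illustrative sketch, not the patented implementation: the class name `MultiTaskAcousticNet`, the layer sizes, the `tanh` activation and the weighted-average merging rule are all assumptions, since the record does not state how the outputs are merged.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiTaskAcousticNet:
    """Shared input and hidden layer with one output layer per task."""

    def __init__(self, n_features, n_hidden, n_states, seed=0):
        rng = random.Random(seed)
        self.hidden = init_layer(n_features, n_hidden, rng)    # shared by both tasks
        self.sup_head = init_layer(n_hidden, n_states, rng)    # supervised output layer
        self.unsup_head = init_layer(n_hidden, n_states, rng)  # unsupervised output layer

    def forward(self, features):
        """Return (supervised posteriors, unsupervised posteriors) for one frame."""
        h = [math.tanh(v) for v in linear(features, *self.hidden)]
        return softmax(linear(h, *self.sup_head)), softmax(linear(h, *self.unsup_head))

    def infer(self, features, sup_weight=0.5):
        """Merge both task outputs; the weighted average is an assumed merging rule."""
        p_sup, p_unsup = self.forward(features)
        return [sup_weight * a + (1.0 - sup_weight) * b for a, b in zip(p_sup, p_unsup)]


net = MultiTaskAcousticNet(n_features=4, n_hidden=8, n_states=3)
merged = net.infer([0.1, 0.2, 0.3, 0.4])
```

Because each head produces a probability distribution and the merge weights sum to one, the merged output is itself a distribution over the (hypothetical) acoustic states.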

52 Citations

20 Claims

1. A method for training an acoustic model, comprising:
- obtaining supervised speech data and unsupervised speech data, wherein the supervised speech data is speech data with manual annotation and the unsupervised speech data is speech data with machine annotation;
  
  extracting speech features from the supervised speech data, and extracting speech features from the unsupervised speech data; and
  
  performing a supervised learning task on the speech features of the supervised speech data, and performing an unsupervised learning task on the speech features of the unsupervised speech data, by using a deep learning network, to train and obtain the acoustic model;
  
  wherein the deep learning network comprises an input layer, at least one hidden layer and an output layer;
  
  wherein the input layer is shared by the supervised learning task and the unsupervised learning task, such that the supervised learning task and the unsupervised learning task are performed in parallel; and
  
  after training the model, a final acoustic model is that of obtained by retaining all the parameters of the model, to retain both outputs of the supervised learning task and outputs of the unsupervised learning task in the reasoning phase, and merging the outputs as a final output.
  - 2. The method according to claim 1, wherein the at least one hidden layer is shared by the supervised learning task and the unsupervised learning task and trained commonly by the supervised speech data and the unsupervised speech data;
    - and the output layer comprises a supervised learning task output layer and an unsupervised learning task output layer.
  - 3. The method according to claim 1, wherein a first part of the at least one hidden layer is shared by the supervised learning task and the unsupervised learning task, and a second part of the at least one hidden layer is separately trained and adjusted by the supervised learning task and the unsupervised learning task;
    - and the output layer comprises a supervised learning task output layer and an unsupervised learning task output layer.
  - 4. The method according to claim 2, wherein, after training the model, a final acoustic model is that of obtained by discarding parameters of the at least one hidden layer and/or parameters of the output layer trained and adjusted by the unsupervised learning task, to only retain outputs of the supervised learning task of the acoustic model in a reasoning phase.
  - 5. The method according to claim 3, wherein, after training the model, a final acoustic model is that of obtained by discarding parameters of the at least one hidden layer and/or parameters of the output layer trained and adjusted by the unsupervised learning task, to only retain outputs of the supervised learning task of the acoustic model in a reasoning phase.
  - 6. The method according to claim 1, wherein performing the supervised learning task on the speech features of the supervised speech data, and performing the unsupervised learning task on the speech features of the unsupervised speech data, using the deep learning network, to train and obtain the acoustic model comprises:
    - performing the supervised learning task on the speech features of the supervised speech data, and performing the unsupervised learning task on the speech features of the unsupervised speech data, to train and obtain the acoustic model according to respective weights set in advance for the supervised learning task and the unsupervised learning task.
  - 7. The method according to claim 1, after obtaining the supervised speech data and the unsupervised speech data, and before extracting the speech features, further comprising:
    - filtering and screening the unsupervised speech data by a confidence filtering.
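The confidence filtering of claim 7 could look roughly like the sketch below. The `Utterance` record, its field names and the `0.9` threshold are hypothetical; the claim only states that the machine-annotated data is filtered and screened by confidence before feature extraction, without saying how confidence is computed or which cutoff is used.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    audio_id: str
    transcript: str    # machine-generated annotation
    confidence: float  # recognizer confidence in [0, 1] (assumed to be available)


def confidence_filter(unsupervised, threshold=0.9):
    """Keep only machine-annotated utterances whose confidence meets the threshold."""
    return [u for u in unsupervised if u.confidence >= threshold]


pool = [
    Utterance("utt-001", "turn on the light", 0.97),
    Utterance("utt-002", "turn of the lied", 0.41),
    Utterance("utt-003", "play some music", 0.90),
]
kept = confidence_filter(pool)
```

Only the utterances that pass the filter would then go on to speech feature extraction alongside the manually annotated data.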

8. A computer device, comprising:
- one or more processors;
  
  a storage device, configured to store one or more programs;
  
  wherein the one or more processors are configured to read the one or more programs from the storage device to perform acts of:
  
  obtaining supervised speech data and unsupervised speech data, wherein the supervised speech data is speech data with manual annotation and the unsupervised speech data is speech data with machine annotation;
  
  extracting speech features from the supervised speech data, and extracting speech features from the unsupervised speech data; and
  
  performing a supervised learning task on the speech features of the supervised speech data, and performing an unsupervised learning task on the speech features of the unsupervised speech data by using a deep learning network, to train and obtain the acoustic model;
  
  wherein the deep learning network comprises an input layer, at least one hidden layer and an output layer;
  
  wherein the input layer is shared by the supervised learning task and the unsupervised learning task, such that the supervised learning task and the unsupervised learning task are performed in parallel; and
  
  after training the model, a final acoustic model is that of obtained by retaining all the parameters of the model, to retain both outputs of the supervised learning task and outputs of the unsupervised learning task in the reasoning phase, and merging the outputs as a final output.
  - 9. The computer device according to claim 8, wherein the at least one hidden layer is shared by the supervised learning task and the unsupervised learning task and trained commonly by the supervised speech data and the unsupervised speech data;
    - and the output layer comprises a supervised learning task output layer and an unsupervised learning task output layer.
  - 10. The computer device according to claim 8, wherein a first part of the at least one hidden layer is shared by the supervised learning task and the unsupervised learning task, and a second part of the at least one hidden layer is separately trained and adjusted by the supervised learning task and the unsupervised learning task;
    - and the output layer comprises a supervised learning task output layer and an unsupervised learning task output layer.
  - 11. The computer device according to claim 9, wherein, after training the model, a final acoustic model is that of obtained by discarding parameters of the at least one hidden layer and/or parameters of the output layer trained and adjusted by the unsupervised learning task, to only retain outputs of the supervised learning task of the acoustic model in a reasoning phase.
  - 12. The computer device according to claim 10, wherein, after training the model, a final acoustic model is that of obtained by discarding parameters of the at least one hidden layer and/or parameters of the output layer trained and adjusted by the unsupervised learning task, to only retain outputs of the supervised learning task of the acoustic model in a reasoning phase.
  - 13. The computer device according to claim 8, wherein the one or more processors are configured to perform the supervised learning task on the speech features of the supervised speech data, and performing the unsupervised learning task on the speech features of the unsupervised speech data, using the deep learning network, to train and obtain the acoustic model by acts of:
    - performing the supervised learning task on the speech features of the supervised speech data, and performing the unsupervised learning task on the speech features of the unsupervised speech data, to train and obtain the acoustic model according to respective weights set in advance for the supervised learning task and the unsupervised learning task.
  - 14. The computer device according to claim 8, wherein the one or more processors are further configured to read the one or more programs from the storage device to perform acts of:
    - after obtaining the supervised speech data and the unsupervised speech data and before extracting the speech features, filtering and screening the unsupervised speech data by a confidence filtering.
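The alternative recited in claims 4, 5, 11 and 12, where parameters trained only by the unsupervised task are discarded so that just the supervised outputs remain at inference, might be sketched as below. The flat parameter dictionary and the `shared.` / `sup.` / `unsup.` naming scheme are hypothetical conventions chosen for illustration, not something the record specifies.

```python
def prune_unsupervised_branch(params, unsup_prefix="unsup."):
    """Return a copy of `params` without entries owned by the unsupervised task."""
    return {
        name: value
        for name, value in params.items()
        if not name.startswith(unsup_prefix)
    }


# Toy parameter set after multi-task training (values are placeholders).
trained = {
    "shared.hidden0.weight": [[0.1, 0.2], [0.3, 0.4]],
    "sup.hidden1.weight": [[0.5, 0.6]],
    "sup.output.weight": [[0.7]],
    "unsup.hidden1.weight": [[0.8, 0.9]],
    "unsup.output.weight": [[1.0]],
}

deployed = prune_unsupervised_branch(trained)
```

Shared layers are kept because the supervised path still depends on them; only the task-specific hidden and output layers of the unsupervised branch are dropped.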

15. A non-transitory computer readable storage medium, configured to store computer instructions, wherein when the instructions are executed by a processor, a method for training an acoustic model is implemented and the method comprises:
- obtaining supervised speech data and unsupervised speech data, wherein the supervised speech data is speech data with manual annotation and the unsupervised speech data is speech data with machine annotation;
  
  extracting speech features from the supervised speech data and extracting speech features from the unsupervised speech data; and
  
  performing a supervised learning task on the speech features of the supervised speech data, and performing an unsupervised learning task on the speech features of the unsupervised speech data by using a deep learning network, to train and obtain the acoustic model;
  
  wherein the deep learning network comprises an input layer, at least one hidden layer and an output layer;
  
  wherein the input layer is shared by the supervised learning task and the unsupervised learning task, such that the supervised learning task and the unsupervised learning task are performed in parallel; and
  
  after training the model, a final acoustic model is that of obtained by retaining all the parameters of the model, to retain both outputs of the supervised learning task and outputs of the unsupervised learning task in the reasoning phase, and merging the outputs as a final output.
  - 16. The non-transitory computer readable storage medium according to claim 15, wherein the at least one hidden layer is shared by the supervised learning task and the unsupervised learning task and trained commonly by the supervised speech data and the unsupervised speech data;
    - and the output layer comprises a supervised learning task output layer and an unsupervised learning task output layer.
  - 17. The non-transitory computer readable storage medium according to claim 15, wherein a first part of the at least one hidden layer is shared by the supervised learning task and the unsupervised learning task, and a second part of the at least one hidden layer is separately trained and adjusted by the supervised learning task and the unsupervised learning task;
    - and the output layer comprises a supervised learning task output layer and an unsupervised learning task output layer.
  - 18. The non-transitory computer readable storage medium according to claim 16, wherein, after training the model, a final acoustic model is that of obtained by discarding parameters of the at least one hidden layer and/or parameters of the output layer trained and adjusted by the unsupervised learning task, to only retain outputs of the supervised learning task of the acoustic model in a reasoning phase.
  - 19. The non-transitory computer readable storage medium according to claim 15, wherein performing the supervised learning task on the speech features of the supervised speech data, and performing the unsupervised learning task on the speech features of the unsupervised speech data using the deep learning network, to train and obtain the acoustic model comprises:
    - performing the supervised learning task on the speech features of the supervised speech data, and performing the unsupervised learning task on the speech features of the unsupervised speech data, to train and obtain the acoustic model according to respective weights set in advance for the supervised learning task and the unsupervised learning task.
  - 20. The non-transitory computer readable storage medium according to claim 15, wherein the method further comprises:
    - after obtaining the supervised speech data and the unsupervised speech data and before extracting the speech features, filtering and screening the unsupervised speech data by a confidence filtering.
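Claims 6, 13 and 19 recite training according to weights set in advance for the two tasks. One plausible reading, sketched below, is a weighted sum of per-task losses. The cross-entropy loss, the `0.7` / `0.3` weights and the function names are assumptions made for illustration; the claims do not name a loss function or give weight values.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-probability of the target class (clipped to avoid log(0))."""
    return -math.log(max(probs[target_index], 1e-12))


def mean_loss(batch):
    """Average cross-entropy over (posteriors, target_index) pairs; 0.0 if empty."""
    if not batch:
        return 0.0
    return sum(cross_entropy(p, t) for p, t in batch) / len(batch)


def multitask_loss(sup_batch, unsup_batch, sup_weight=0.7, unsup_weight=0.3):
    """Weighted sum of the supervised and unsupervised task losses."""
    return sup_weight * mean_loss(sup_batch) + unsup_weight * mean_loss(unsup_batch)


# Each item pairs the model's posteriors for a frame with its label index:
# manual labels for the supervised batch, machine labels for the unsupervised one.
sup_batch = [([0.5, 0.5], 0)]
unsup_batch = [([0.5, 0.5], 1)]
loss = multitask_loss(sup_batch, unsup_batch)
```

Setting `unsup_weight` lower than `sup_weight` would reflect less trust in machine annotation than in manual annotation, though the appropriate ratio is not stated in the record.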

Current Assignee: Baidu Online Network Technology (Beijing) Co., Ltd (Baidu Incorporated)
Original Assignee: Baidu Online Network Technology (Beijing) Co., Ltd (Baidu Incorporated)
Inventors: Huang, Bin; Peng, Yiping; Li, Xiangang
Primary Examiner(s): Islam, Mohammad K
Application Number: US15/856,165
Publication Number: US 20180366107A1
Time in Patent Office: 733 Days
CPC Class Codes:
G10L 15/063 Training
G10L 15/16 using artificial neural net...
