Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers

US 20190318725A1
Filed: 04/13/2018
Published: 10/17/2019
Est. Priority Date: 04/13/2018
Status: Active Grant


Abstract

Systems and methods are provided for recognizing speech that includes overlapping speech by multiple speakers. The speech recognition system includes a hardware processor and a computer storage memory that stores data and computer-executable instructions which, when executed by the processor, implement a stored speech recognition network. An input interface receives an acoustic signal that includes a mixture of speech signals by multiple speakers, wherein the multiple speakers include target speakers. An encoder network and a decoder network of the stored speech recognition network are trained to transform the received acoustic signal into a text for each target speaker: the encoder network outputs a set of recognition encodings, and the decoder network uses the set of recognition encodings to output the text for each target speaker. An output interface transmits the text for each target speaker.

21 Claims

1. A speech recognition system for recognizing speech including overlapping speech by multiple speakers, comprising:
- a hardware processor;
  
  computer storage memory to store data along with having computer-executable instructions stored thereon that, when executed by the processor, is to implement a stored speech recognition network;
  
  an input interface to receive an acoustic signal, the received acoustic signal including a mixture of speech signals by multiple speakers, wherein the multiple speakers include target speakers;
  
  an encoder network and a decoder network of the stored speech recognition network are trained to transform the received acoustic signal into a text for each target speaker, such that the encoder network outputs a set of recognition encodings, and the decoder network uses the set of recognition encodings to output the text for each target speaker; and
  
  an output interface to transmit the text for each target speaker.
- Dependent claims (2 to 17):
  - 2. The speech recognition system of claim 1, wherein the set of recognition encodings includes a recognition encoding for each target speaker, and the decoder network uses the recognition encoding for each target speaker to output the text for that target speaker.
  - 3. The speech recognition system of claim 1, wherein the encoder network includes a mixture encoder network, a set of speaker-differentiating encoder networks, and a recognition encoder network, such that a number of speaker-differentiating encoder networks is equal to, or larger than, a number of target speakers, wherein the mixture encoder network outputs a mixture encoding for the received acoustic signal, each speaker-differentiating encoder network outputs a speaker-differentiated encoding from the mixture encoding, and the recognition encoder network outputs a recognition encoding from each speaker-differentiated encoding.
  - 4. The speech recognition system of claim 3, wherein the stored speech recognition network is pre-trained with an initial speaker-differentiating encoder network using datasets including acoustic signals with speech by a single speaker and corresponding text labels.
  - 5. The speech recognition system of claim 4, wherein some of the speaker-differentiating encoder networks in the set of speaker-differentiating encoder networks are initialized based on the initial speaker-differentiating encoder network.
  - 6. The speech recognition system of claim 5, wherein the initialization includes random perturbations.
  - 7. The speech recognition system of claim 1, wherein the encoder network includes a speaker separation network and an acoustic encoder network, such that the speaker separation network outputs a set of separation encodings, wherein a number of separation encodings is equal to, or larger than, a number of target speakers, and the acoustic encoder network uses the set of separation encodings to output a set of recognition encodings.
  - 8. The speech recognition system of claim 7, wherein each recognition encoding of the set of recognition encodings corresponds to each separation encoding in the set of separation encodings, such that the acoustic encoder network outputs a recognition encoding for each separation encoding.
  - 9. The speech recognition system of claim 7, wherein the set of separation encodings includes a single separation encoding for each target speaker, and the set of recognition encodings includes a single recognition encoding for each target speaker, such that the acoustic encoder network uses the single separation encoding for each target speaker to output the single recognition encoding for that target speaker.
  - 10. The speech recognition system of claim 7, wherein the set of separation encodings and the received acoustic signal are used to output separated signals for each target speaker.
  - 11. The speech recognition system of claim 7, wherein the at least one speaker separation network is trained to output separation encodings using datasets including acoustic signals from multiple speakers and their corresponding mixture.
  - 12. The speech recognition system of claim 7, wherein the acoustic encoder network and the decoder network are trained to output text using datasets including acoustic signals with speech by at least one speaker and corresponding text labels.
  - 13. The speech recognition system of claim 7, wherein the at least one speaker separation network, the acoustic encoder network, and the decoder network are jointly trained using datasets including acoustic signals with speech by multiple overlapping speakers and corresponding text labels.
  - 14. The speech recognition system of claim 7, wherein the stored speech recognition network is trained using datasets including acoustic signals with speech by multiple overlapping speakers and corresponding text labels, such that the training involves minimizing an objective function using a weighted combination of decoding costs and separation costs.
  - 15. The speech recognition system of claim 1, wherein speech from the target speakers includes speech from one or more languages.
  - 16. The speech recognition system of claim 15, wherein the text for at least one target speaker includes information about the language of the speech of that at least one target speaker.
  - 17. The speech recognition system of claim 1, wherein the stored speech recognition network is trained using datasets including acoustic signals with speech by multiple overlapping speakers and corresponding text labels.
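
Claim 14 describes training that minimizes an objective built from a weighted combination of decoding costs and separation costs. The following is a minimal sketch of one such objective, assuming one cost value per target speaker and a single scalar weight on the separation term. The claim does not fix the weighting scheme, so the function name, the parameter names, and this particular form are hypothetical illustrations rather than the patented formulation.

```python
from typing import Sequence


def combined_objective(
    decoding_costs: Sequence[float],
    separation_costs: Sequence[float],
    separation_weight: float = 0.1,
) -> float:
    """Weighted combination of decoding and separation costs.

    Assumption: costs are summed over target speakers and a single
    scalar weight scales the separation term. The claim only says
    "weighted combination"; this form is illustrative.
    """
    if separation_weight < 0:
        raise ValueError("separation_weight must be non-negative")
    decoding_total = sum(decoding_costs)
    separation_total = sum(separation_costs)
    return decoding_total + separation_weight * separation_total
```

Setting `separation_weight` to zero reduces the sketch to a decoding-only objective, which corresponds loosely to the text-label-only training of claim 17.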

18. A speech recognition system for recognizing speech including overlapping speech by multiple speakers, comprising:
- a hardware processor;
  
  computer storage memory to store data along with having computer-executable instructions stored thereon that, when executed by the processor, is to implement a stored speech recognition network;
  
  an input interface to receive an acoustic signal, the received acoustic signal includes a mixture of speech signals by multiple speakers, wherein the multiple speakers include target speakers;
  
  an encoder network and a decoder network of the stored speech recognition network are trained to transform the received acoustic signal into a text for each target speaker, such that the encoder network outputs a set of recognition encodings, and the decoder network uses the set of recognition encodings to output the text for each target speaker, such that the encoder network also includes a mixture encoder network, a set of speaker-differentiating encoder networks, and a recognition encoder network; and
  
  an output interface to transmit the text for each target speaker.
- Dependent claims (19, 20):
  - 19. The speech recognition system of claim 18, wherein a number of speaker-differentiating encoder networks is equal to, or larger than, a number of target speakers, such that the mixture encoder network outputs a mixture encoding for the received acoustic signal, each speaker-differentiating encoder network outputs a speaker-differentiated encoding from the mixture encoding, and the recognition encoder network outputs a recognition encoding from each preliminary recognition encoding.
  - 20. The speech recognition system of claim 19, wherein the stored speech recognition network is pretrained with an initial speaker-differentiating encoder network using datasets including acoustic signals with speech by a single speaker and corresponding text labels, wherein some of the speaker-differentiating encoder networks in the set of speaker-differentiating encoder networks are initialized based on the initial speaker-differentiating encoder network, such that the initialization includes random perturbations.
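
The data flow recited in claims 18 to 20 (a mixture encoder, a set of speaker-differentiating encoders initialized from one pre-trained encoder with random perturbations, and a shared recognition encoder) can be sketched as follows. This is a toy illustration only: each "network" is reduced to a single scalar multiplier, and every name here (`scale_layer`, `perturbed_weights`, `encode`, the `noise` and `seed` parameters) is a hypothetical stand-in, not the networks or initialization procedure of the patent.

```python
import random
from typing import Callable, List

Frame = List[float]
Encoding = List[Frame]
Layer = Callable[[Encoding], Encoding]


def scale_layer(weight: float) -> Layer:
    """Toy stand-in for a network: multiplies every feature by `weight`."""
    return lambda enc: [[weight * x for x in frame] for frame in enc]


def perturbed_weights(initial: float, count: int, noise: float, seed: int = 0) -> List[float]:
    """Derive `count` weights from one pre-trained weight plus random perturbations."""
    rng = random.Random(seed)
    return [initial + rng.uniform(-noise, noise) for _ in range(count)]


def encode(
    signal: Encoding,
    mixture_encoder: Layer,
    sd_encoders: List[Layer],
    recognition_encoder: Layer,
) -> List[Encoding]:
    """Return one recognition encoding per speaker-differentiating encoder."""
    # One mixture encoding is computed for the whole received signal.
    mixture_encoding = mixture_encoder(signal)
    # Each speaker-differentiating encoder reads the same mixture encoding.
    differentiated = [sd(mixture_encoding) for sd in sd_encoders]
    # The same recognition encoder is applied to every differentiated encoding.
    return [recognition_encoder(enc) for enc in differentiated]
```

A possible use, with two target speakers, would build the speaker-differentiating encoders as `[scale_layer(w) for w in perturbed_weights(1.0, 2, 0.1)]` and pass them to `encode`. The perturbation keeps the copies from starting out identical, which is one plausible reason for the randomization in claims 6 and 20.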

21. A method using a speech recognition system to recognize separate speaker signals within an audio signal having overlapping speech by multiple speakers, comprising:
- receiving an acoustic signal including a mixture of speech signals by multiple speakers via an input interface, wherein the multiple speakers include target speakers;
  
  inputting the received audio signal using a hardware processor into a pre-trained speech recognition network stored in a computer readable memory, such that the pre-trained speech recognition network is configured for transforming the received acoustic signal into a text for each target speaker using an encoder network and a decoder network of the pre-trained speech recognition network by, using the encoder network to output a set of recognition encodings, and the decoder network uses the set of recognition encodings to output the text for each target speaker; and
  
  transmitting the text for each target speaker using an output interface.
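
The three steps of claim 21 (receive, transform through the encoder and decoder, transmit) can be sketched as a single function. This is a hedged illustration, assuming one recognition encoding per target speaker as in claim 2, and treating the encoder, decoder, and output interface as caller-supplied callables. The function name and its parameters are hypothetical.

```python
from typing import Callable, List, Sequence

Signal = Sequence[float]


def recognize_overlapping_speech(
    acoustic_signal: Signal,
    encoder: Callable[[Signal], List[object]],
    decoder: Callable[[object], str],
    transmit: Callable[[List[str]], None],
) -> List[str]:
    """Turn a mixed acoustic signal into one text per target speaker."""
    # The encoder network outputs a set of recognition encodings.
    recognition_encodings = encoder(acoustic_signal)
    # Assumption: each encoding is decoded on its own, one per speaker.
    texts = [decoder(encoding) for encoding in recognition_encodings]
    # The output interface transmits the text for each target speaker.
    transmit(texts)
    return texts
```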

Current Assignee
Mitsubishi Electric Research Laboratories, Inc. (Mitsubishi Electric Corporation)
Original Assignee
Mitsubishi Electric Research Laboratories, Inc. (Mitsubishi Electric Corporation)
Inventors
Le Roux, Jonathan; Hori, Takaaki; Settle, Shane; Seki, Hiroshi; Watanabe, Shinji; Hershey, John

Granted Patent

US 10,811,000 B2
CPC Class Codes

G10L 15/063   Training

G10L 15/16   using artificial neural net...

G10L 15/22   Procedures used during a sp...

G10L 17/00   Speaker identification or v...

G10L 25/30   using neural networks
