LOW-LATENCY MULTI-SPEAKER SPEECH RECOGNITION
First Claim
1. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs including instructions, which when executed by one or more processors of an electronic device, cause the electronic device to:
receive mixed speech data representing utterances of a target speaker and utterances of one or more interfering audio sources, wherein the utterances of the target speaker and the utterances of the one or more interfering audio sources at least partially overlap;
obtain a target speaker representation representing speech characteristics of the target speaker, wherein the target speaker representation is generated by a first learning network pre-trained for speaker verification;
determine, using a second learning network, probability distributions of phonetic elements directly from the mixed speech data, wherein inputs of the second learning network include the mixed speech data and the target speaker representation, wherein an output of the learning network includes the probability distributions of phonetic elements, and wherein the first learning network and the second learning network are different learning networks;
generate text corresponding to the utterances of the target speaker based on the probability distributions of the phonetic elements; and
provide a response based on the text corresponding to the utterances of the target speaker.
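The "determine" step above can be illustrated with a minimal sketch. It is not the patented implementation: the single linear layer, the concatenation of frame features with the speaker representation, the toy dimensions, and every name in the code (`phonetic_posteriors`, `weights`, `bias`) are assumptions made for illustration. The only points taken from the claim are that the network receives both the mixed speech data and the target speaker representation, and that it outputs probability distributions over phonetic elements.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def phonetic_posteriors(mixed_frames, speaker_embedding, weights, bias):
    """Return one distribution over phonetic elements per mixed-speech frame.

    Hypothetical stand-in for the second learning network: one linear layer
    applied to the frame features concatenated with the speaker embedding.
    """
    posteriors = []
    for frame in mixed_frames:
        # Both claimed inputs feed the network: mixed speech and speaker representation.
        features = list(frame) + list(speaker_embedding)
        scores = [
            sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)
        ]
        posteriors.append(softmax(scores))
    return posteriors


# Toy dimensions, chosen only for illustration.
rng = random.Random(0)
n_frames, frame_dim, embed_dim, n_phones = 5, 4, 3, 6
mixed_frames = [[rng.gauss(0, 1) for _ in range(frame_dim)] for _ in range(n_frames)]
# In the claim, this vector would come from the first (speaker-verification) network.
speaker_embedding = [rng.gauss(0, 1) for _ in range(embed_dim)]
weights = [
    [rng.gauss(0, 0.5) for _ in range(frame_dim + embed_dim)]
    for _ in range(n_phones)
]
bias = [0.0] * n_phones

posteriors = phonetic_posteriors(mixed_frames, speaker_embedding, weights, bias)
```

In this sketch no separated or enhanced waveform is produced as an intermediate result. The distributions are computed from the mixed frames themselves, which is one way to read "directly from the mixed speech data".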
Abstract
Systems and processes for operating an intelligent automated assistant are provided. In one example, a method includes receiving mixed speech data representing utterances of a target speaker and utterances of one or more interfering audio sources. The method further includes obtaining a target speaker representation, which represents speech characteristics of the target speaker; and determining, using a learning network, probability distributions of phonetic elements directly from the mixed speech data. The inputs of the learning network include the mixed speech data and the target speaker representation. An output of the learning network includes the probability distributions of phonetic elements. The method further includes generating text corresponding to the utterances of the target speaker based on the probability distributions of the phonetic elements; and providing a response to the target speaker based on the text corresponding to the utterances of the target speaker.
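As a rough illustration of the final steps in the abstract, the sketch below turns per-frame probability distributions into text. The abstract does not say how text is generated from the distributions. The greedy per-frame choice, the blank symbol, the merging of repeated symbols, and the tiny lexicon are all assumptions, as are the names `greedy_decode`, `symbols`, and `lexicon`.

```python
def greedy_decode(posteriors, symbols, blank="_"):
    """Pick the most probable symbol per frame, merge repeats, drop blanks."""
    decoded = []
    previous = None
    for dist in posteriors:
        best = symbols[max(range(len(dist)), key=dist.__getitem__)]
        if best != previous and best != blank:
            decoded.append(best)
        previous = best
    return decoded


# Hypothetical phonetic inventory and hand-written frame distributions.
symbols = ["_", "HH", "AY"]
posteriors = [
    [0.10, 0.80, 0.10],  # HH
    [0.20, 0.70, 0.10],  # HH again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # AY
    [0.10, 0.20, 0.70],  # AY again, merged
]

# Hypothetical pronunciation lexicon mapping phone sequences to words.
lexicon = {("HH", "AY"): "hi"}

phones = greedy_decode(posteriors, symbols)
text = lexicon.get(tuple(phones), "<unk>")
```

Here `phones` is `["HH", "AY"]` and `text` is `"hi"`. A real assistant would more plausibly combine the distributions with a language model and a search procedure before producing a response.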
17 Claims
1. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs including instructions, which when executed by one or more processors of an electronic device, cause the electronic device to:
receive mixed speech data representing utterances of a target speaker and utterances of one or more interfering audio sources, wherein the utterances of the target speaker and the utterances of the one or more interfering audio sources at least partially overlap;
obtain a target speaker representation representing speech characteristics of the target speaker, wherein the target speaker representation is generated by a first learning network pre-trained for speaker verification;
determine, using a second learning network, probability distributions of phonetic elements directly from the mixed speech data, wherein inputs of the second learning network include the mixed speech data and the target speaker representation, wherein an output of the learning network includes the probability distributions of phonetic elements, and wherein the first learning network and the second learning network are different learning networks;
generate text corresponding to the utterances of the target speaker based on the probability distributions of the phonetic elements; and
provide a response based on the text corresponding to the utterances of the target speaker.
Dependent claims: 2-15.
16. A method for performing speech-to-text conversion in a multi-speaker environment by a virtual assistant, comprising:
receiving mixed speech data representing utterances of a target speaker and utterances of one or more interfering audio sources, wherein the utterances of the target speaker and the utterances of the one or more interfering audio sources at least partially overlap;
obtaining a target speaker representation representing speech characteristics of the target speaker, wherein the target speaker representation is generated by a first learning network pre-trained for speaker verification;
determining, using a second learning network, probability distributions of phonetic elements directly from the mixed speech data, wherein inputs of the second learning network include the mixed speech data and the target speaker representation, wherein an output of the learning network includes the probability distributions of phonetic elements, and wherein the first learning network and the second learning network are different learning networks;
generating text corresponding to the utterances of the target speaker based on the probability distributions of the phonetic elements; and
providing a response based on the text corresponding to the utterances of the target speaker.
17. An electronic device, comprising:
one or more processors;
memory; and
one or more programs stored in memory, the one or more programs including instructions for:
receiving mixed speech data representing utterances of a target speaker and utterances of one or more interfering audio sources, wherein the utterances of the target speaker and the utterances of the one or more interfering audio sources at least partially overlap;
obtaining a target speaker representation representing speech characteristics of the target speaker, wherein the target speaker representation is generated by a first learning network pre-trained for speaker verification;
determining, using a second learning network, probability distributions of phonetic elements directly from the mixed speech data, wherein inputs of the second learning network include the mixed speech data and the target speaker representation, wherein an output of the learning network includes the probability distributions of phonetic elements, and wherein the first learning network and the second learning network are different learning networks;
generating text corresponding to the utterances of the target speaker based on the probability distributions of the phonetic elements; and
providing a response based on the text corresponding to the utterances of the target speaker.
Specification