LOW-LATENCY MULTI-SPEAKER SPEECH RECOGNITION

US 20200135209A1
Filed: 08/07/2019
Published: 04/30/2020
Est. Priority Date: 10/26/2018
Status: Active Grant

Abstract

Systems and processes for operating an intelligent automated assistant are provided. In one example, a method includes receiving mixed speech data representing utterances of a target speaker and utterances of one or more interfering audio sources. The method further includes obtaining a target speaker representation, which represents speech characteristics of the target speaker; and determining, using a learning network, probability distributions of phonetic elements directly from the mixed speech data. The inputs of the learning network include the mixed speech data and the target speaker representation. An output of the learning network includes the probability distributions of phonetic elements. The method further includes generating text corresponding to the utterances of the target speaker based on the probability distributions of the phonetic elements; and providing a response to the target speaker based on the text corresponding to the utterances of the target speaker.
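
The order of operations in the abstract can be sketched as a small pipeline. This is only an illustration of the data flow, not the patented implementation: the class `AssistantPipeline`, its four callables, and every `toy_*` function are hypothetical stand-ins, and the use of a separate reference recording to derive the speaker representation is an assumption borrowed from the enrollment wording in the claims.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = Sequence[float]          # acoustic features for one audio frame
Posteriors = List[List[float]]   # one probability distribution per frame


@dataclass
class AssistantPipeline:
    """Hypothetical wiring of the steps named in the abstract."""
    speaker_encoder: Callable[[Sequence[Frame]], List[float]]
    acoustic_model: Callable[[Sequence[Frame], Sequence[float]], Posteriors]
    decoder: Callable[[Posteriors], str]
    responder: Callable[[str], str]

    def run(self, mixed: Sequence[Frame], reference: Sequence[Frame]) -> str:
        # 1. Obtain the target speaker representation.
        speaker_vec = self.speaker_encoder(reference)
        # 2. Map the mixed speech directly to per-frame distributions;
        #    the model sees both the mixture and the speaker vector.
        posteriors = self.acoustic_model(mixed, speaker_vec)
        # 3. Turn the distributions into text, then 4. respond.
        return self.responder(self.decoder(posteriors))


# Toy stand-ins (assumptions) so the wiring can be exercised end to end.
def toy_speaker_encoder(frames: Sequence[Frame]) -> List[float]:
    # Mean of the reference frames, feature by feature.
    return [sum(col) / len(frames) for col in zip(*frames)]


def toy_acoustic_model(mixed: Sequence[Frame], speaker_vec: Sequence[float]) -> Posteriors:
    rows = []
    for frame in mixed:
        score = sum(f * s for f, s in zip(frame, speaker_vec))
        p = 1.0 / (1.0 + math.exp(-score))   # two-symbol distribution
        rows.append([p, 1.0 - p])
    return rows


def toy_decoder(posteriors: Posteriors) -> str:
    # Pick the most probable of the two toy symbols for each frame.
    return "".join("ab"[row.index(max(row))] for row in posteriors)


def toy_responder(text: str) -> str:
    return f"heard: {text}"
```

Passing the four stages in as callables keeps the sketch neutral about how each stage is actually built; a real system would replace the toy functions with trained networks and a proper decoder.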

17 Claims

1. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs including instructions, which when executed by one or more processors of an electronic device, cause the electronic device to:
- receive mixed speech data representing utterances of a target speaker and utterances of one or more interfering audio sources, wherein the utterances of the target speaker and the utterances of the one or more interfering audio sources at least partially overlap;
  
  obtain a target speaker representation representing speech characteristics of the target speaker, wherein the target speaker representation is generated by a first learning network pre-trained for speaker verification;
  
  determine, using a second learning network, probability distributions of phonetic elements directly from the mixed speech data, wherein inputs of the second learning network include the mixed speech data and the target speaker representation, wherein an output of the learning network includes the probability distributions of phonetic elements, and wherein the first learning network and the second learning network are different learning networks;
  
  generate text corresponding to the utterances of the target speaker based on the probability distributions of the phonetic elements; and
  
  provide a response based on the text corresponding to the utterances of the target speaker.
  - 2. The non-transitory computer-readable storage medium of claim 1, wherein the mixed speech data include acoustic representations of a plurality of audio frames corresponding to the utterances of the target speaker and the utterances of the one or more interfering audio sources, wherein each audio frame is associated with a predetermined period of time.
  - 3. The non-transitory computer-readable storage medium of claim 1, wherein obtaining the target speaker representation representing speech characteristics of the target speaker comprises:
    - receiving enrollment utterances from the target speaker before receiving mixed speech data; and
      
      determining a target speaker vector based on the enrollment utterances from the target speaker.
  - 4. The non-transitory computer-readable storage medium of claim 1, wherein obtaining the target speaker representation representing speech characteristics of the target speaker comprises:
    - receiving a trigger phrase uttered by the target speaker, wherein the trigger phrase is part of the utterances represented by the mixed speech data; and
      
      determining a target speaker vector based on the trigger phrase uttered by the target speaker.
  - 5. The non-transitory computer-readable storage medium of claim 1, wherein determining the probability distributions of phonetic elements directly from the mixed speech data comprises:
    - generating an intermediate representation of the mixed speech data; and
      
      determining, using the second learning network, the probability distributions of phonetic elements, wherein the second learning network is a single learning network comprising a first portion and a second portion.
  - 6. The non-transitory computer-readable storage medium of claim 5, wherein generating the intermediate representation of the mixed speech data comprises:
    - extracting feature vectors from a plurality of audio frames corresponding to the utterances represented by the mixed speech data; and
      
      generating, based on the extracted feature vectors, the intermediate representation of the mixed speech data using the first portion of the second learning network.
  - 7. The non-transitory computer-readable storage medium of claim 6, wherein generating, based on the extracted feature vectors, the intermediate representation of the mixed speech data comprises:
    - processing the extracted feature vectors using the first portion of the second learning network, wherein the first portion of the second learning network includes a convolutional layer and a pooling layer; and
      
      obtaining the intermediate representation of the mixed speech data based on the processing results of the first portion of the second learning network.
  - 8. The non-transitory computer-readable storage medium of claim 5, wherein the second portion of the second learning network includes a first hidden layer of a first type, and wherein determining the probability distributions of phonetic elements comprises:
    - generating a first hidden layer input including a concatenation of the intermediate representation of the mixed speech data with the target speaker representation; and
      
      determining, using a first hidden layer of the plurality of hidden layers of the first type, a first hidden layer output based on the first hidden layer input.
  - 9. The non-transitory computer-readable storage medium of claim 8, wherein the second portion of the second learning network further includes a plurality of subsequent hidden layers of the first type, further comprising, for each subsequent hidden layer of the plurality of hidden layers of the first type:
    - generating a subsequent hidden layer input including a concatenation of a preceding hidden layer output with the target speaker representation; and
      
      determining a subsequent hidden layer output based on the concatenation of a preceding hidden layer output with the target speaker representation.
  - 10. The non-transitory computer-readable storage medium of claim 9, wherein the one or more programs further include instructions, which when executed by the one or more processors of the electronic device, cause the electronic device to:
    - generate the probability distributions of phonetic elements based on a last hidden layer output associated with a last hidden layer of the first type, wherein the phonetic elements include senones.
  - 11. The non-transitory computer-readable storage medium of claim 5, wherein the second portion of the second learning network includes a first hidden layer of a second type, and wherein determining the probability distributions of phonetic elements comprises, using the first hidden layer of the second type:
    - projecting the intermediate representation of the mixed speech data to pairs of embedding vectors, wherein the pairs of embedding vectors include acoustic embeddings and keys;
      
      determining, for each acoustic embedding, a scalar coefficient based on the target speaker representation and a key corresponding to the acoustic embedding; and
      
      determining a first hidden layer output based on each acoustic embedding and the corresponding scalar coefficient for each acoustic embedding.
  - 12. The non-transitory computer-readable storage medium of claim 11, wherein the second portion of the second learning network further includes a plurality of subsequent hidden layers of the second type, and wherein the one or more programs further include instructions, which when executed by the one or more processors of the electronic device, cause the electronic device to:
    - using each subsequent hidden layer of the plurality of hidden layers of the second type:
      
      project a preceding hidden layer output to pairs of additional embedding vectors, wherein the pairs of additional embedding vectors include additional acoustic embeddings and additional keys;
      
      determine, for each additional acoustic embedding, an additional scalar coefficient based on the target speaker representation and an additional key corresponding to the additional acoustic embedding; and
      
      determine a subsequent hidden layer output based on each additional acoustic embedding and the corresponding additional scalar coefficient.
  - 13. The non-transitory computer-readable storage medium of claim 12, wherein the one or more programs further include instructions, which when executed by the one or more processors of the electronic device, cause the electronic device to:
    - generate the probability distributions of phonetic elements based on a last hidden layer output associated with a last hidden layer of the second type, wherein the phonetic elements include senones.
  - 14. The non-transitory computer-readable storage medium of claim 1, wherein generating text corresponding to the utterances of the target speaker based on the probability distributions of the phonetic elements comprises:
    - performing beam-search decoding based on the probability distributions of the phonetic elements.
  - 15. The non-transitory computer-readable storage medium of claim 1, wherein providing a response based on the text corresponding to the utterances of the target speaker comprises:
    - determining a user intent based on the text corresponding to the utterances of the target speaker; and
      
      performing one or more tasks based on the user intent.
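
The two hidden-layer variants in the dependent claims above can be sketched side by side: layers of the first type (claims 8 to 10) take the preceding output concatenated with the target speaker representation, while layers of the second type (claims 11 to 13) project their input to (acoustic embedding, key) pairs and weight each embedding by a scalar derived from the speaker representation and the matching key. The following is a minimal sketch under stated assumptions, not the patented network: the ReLU nonlinearity, the dot-product score, the softmax normalisation of the coefficients, the weighted sum, the softmax output layer, and all function names are choices made here for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]   # row-major: one row per output unit


def matvec(w: Matrix, x: Sequence[float]) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(z: Sequence[float]) -> Vector:
    m = max(z)                               # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def concat_stack(h: Sequence[float], speaker: Sequence[float],
                 layers: Sequence[Matrix]) -> Vector:
    """First type: re-append the speaker vector before every layer."""
    out = list(h)
    for w in layers:
        # Each weight matrix must accept len(out) + len(speaker) inputs.
        out = [max(0.0, v) for v in matvec(w, out + list(speaker))]  # ReLU (assumed)
    return out


def gated_layer(h: Sequence[float], speaker: Sequence[float],
                pairs: Sequence[Tuple[Matrix, Matrix]]) -> Vector:
    """Second type: one (embedding projection, key projection) per pair."""
    embeddings = [matvec(w_emb, h) for w_emb, _ in pairs]
    keys = [matvec(w_key, h) for _, w_key in pairs]
    # One scalar per pair from the speaker vector and the key (dot product assumed).
    scores = [sum(k * s for k, s in zip(key, speaker)) for key in keys]
    coeffs = softmax(scores)                 # normalisation is an assumption
    dim = len(embeddings[0])
    return [sum(c * e[i] for c, e in zip(coeffs, embeddings)) for i in range(dim)]


def gated_stack(h: Sequence[float], speaker: Sequence[float],
                layers: Sequence[Sequence[Tuple[Matrix, Matrix]]]) -> Vector:
    out = list(h)
    for pairs in layers:
        out = gated_layer(out, speaker, pairs)
    return out


def senone_posteriors(last_hidden: Sequence[float], w_out: Matrix) -> Vector:
    """Distribution over phonetic elements from the last hidden output."""
    return softmax(matvec(w_out, last_hidden))
```

In the second variant the speaker vector never enters the embeddings themselves; it only decides how much each embedding contributes, which is why the key dimension has to match the speaker-vector dimension in this sketch.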

16. A method for performing speech-to-text conversion in a multi-speaker environment by a virtual assistant, comprising:
- receiving mixed speech data representing utterances of a target speaker and utterances of one or more interfering audio sources, wherein the utterances of the target speaker and the utterances of the one or more interfering audio sources at least partially overlap;
  
  obtaining a target speaker representation representing speech characteristics of the target speaker, wherein the target speaker representation is generated by a first learning network pre-trained for speaker verification;
  
  determining, using a second learning network, probability distributions of phonetic elements directly from the mixed speech data, wherein inputs of the second learning network include the mixed speech data and the target speaker representation, wherein an output of the learning network includes the probability distributions of phonetic elements, and wherein the first learning network and the second learning network are different learning networks;
  
  generating text corresponding to the utterances of the target speaker based on the probability distributions of the phonetic elements; and
  
  providing a response based on the text corresponding to the utterances of the target speaker.
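
Claim 16 leaves open how text is generated from the probability distributions; claim 14 names beam-search decoding as one option. The sketch below shows a generic beam search over per-frame distributions and is not taken from the specification. The function `beam_search`, the `transition_logp` hook (a stand-in for whatever language or transition model a real decoder would apply), and the symbol inventory are all hypothetical; a production decoder for senones would typically also involve a lexicon and a search graph, which are omitted here.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]   # (symbol sequence, log score)


def beam_search(
    posteriors: Sequence[Sequence[float]],
    symbols: Sequence[str],
    beam_width: int = 3,
    transition_logp: Callable[[Optional[str], str], float] = lambda prev, cur: 0.0,
) -> List[Hypothesis]:
    """Return the surviving hypotheses, best first."""
    beams: List[Hypothesis] = [((), 0.0)]
    for frame in posteriors:
        candidates: List[Hypothesis] = []
        for seq, score in beams:
            prev = seq[-1] if seq else None
            for sym, p in zip(symbols, frame):
                if p <= 0.0:
                    continue                 # log(0) is undefined; skip
                new_score = score + math.log(p) + transition_logp(prev, sym)
                candidates.append((seq + (sym,), new_score))
        # Keep only the beam_width highest-scoring extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams
```

With the default `transition_logp` the frames are scored independently, so the best hypothesis equals the per-frame argmax; the beam only matters once the hook penalises or rewards particular symbol transitions, because a locally weaker symbol can then lead to a better overall sequence.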

17. An electronic device, comprising:
- one or more processors;
  
  memory; and
  
  one or more programs stored in memory, the one or more programs including instructions for:
  
  receiving mixed speech data representing utterances of a target speaker and utterances of one or more interfering audio sources, wherein the utterances of the target speaker and the utterances of the one or more interfering audio sources at least partially overlap;
  
  obtaining a target speaker representation representing speech characteristics of the target speaker, wherein the target speaker representation is generated by a first learning network pre-trained for speaker verification;
  
  determining, using a second learning network, probability distributions of phonetic elements directly from the mixed speech data, wherein inputs of the second learning network include the mixed speech data and the target speaker representation, wherein an output of the learning network includes the probability distributions of phonetic elements, and wherein the first learning network and the second learning network are different learning networks;
  
  generating text corresponding to the utterances of the target speaker based on the probability distributions of the phonetic elements; and
  
  providing a response based on the text corresponding to the utterances of the target speaker.

Current Assignee
Apple Inc.
Original Assignee
Apple Inc.
Inventors
DELFARAH, Masood; ABDELHAMID, Ossama A.; HWANG, Kyuyeon; MCALLASTER, Donald R.; SINISCALCHI, Sabato Marco

Granted Patent

US 11,475,898 B2

CPC Class Codes

G10L 15/20   Speech recognition techniqu...

G10L 17/00   Speaker identification or v...

G10L 17/02   Preprocessing operations, e...

G10L 17/04   Training, enrolment or mode...

G10L 17/18   Artificial neural networks;...

G10L 21/0272   Voice signal separating
