Recognizing speech in the presence of additional audio

US 9,601,116 B2
Filed: 04/07/2016
Issued: 03/21/2017
Est. Priority Date: 02/14/2014
Status: Active Grant

First Claim

1. A computer-implemented method comprising:

receiving, by a mobile device, an audio signal;

determining, by the mobile device and using a model that is trained to detect a presence of a synthesized voice and a model that is trained to detect a presence of a user's voice, that the audio signal likely includes both the synthesized voice and the user's voice;

in response to determining, by the mobile device and using a model that is trained to detect a presence of a synthesized voice and a model that is trained to detect a presence of a user's voice, that the audio signal likely includes both the synthesized voice and the user's voice, suppressing, by the mobile device, operation of a speech synthesis module implemented by the mobile device;

after suppressing operation of the speech synthesis module, obtaining, by the mobile device, a transcription corresponding to the audio signal from an automated speech recognizer; and

providing, by the mobile device, the transcription for output.
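
The order of operations in claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the four callables, the `BargeInPipeline` name, and the 0.5 threshold are all hypothetical stand-ins, since the claim does not name any API or numeric value.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Audio = Sequence[float]


@dataclass
class BargeInPipeline:
    # All four callables are hypothetical stand-ins, not APIs named in the claim.
    synthesized_voice_score: Callable[[Audio], float]  # model for the synthesized voice
    user_voice_score: Callable[[Audio], float]         # model for the user's voice
    suppress_synthesis: Callable[[], None]             # quiets or stops the speech synthesizer
    recognize: Callable[[Audio], str]                  # automated speech recognizer
    threshold: float = 0.5                             # assumed decision threshold

    def handle(self, audio: Audio) -> Optional[str]:
        """Return a transcription if both voices are likely present, else None."""
        both_present = (
            self.synthesized_voice_score(audio) >= self.threshold
            and self.user_voice_score(audio) >= self.threshold
        )
        if not both_present:
            return None
        # Suppression happens before the transcription is obtained,
        # matching the ordering recited in the claim.
        self.suppress_synthesis()
        return self.recognize(audio)
```

In this sketch the caller would pass the returned transcription on for output; a `None` result means the two detectors did not both fire.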

2 Assignments

0 Petitions

Abstract

The technology described in this document can be embodied in a computer-implemented method that includes receiving, at a processing system, a first signal including an output of a speaker device and an additional audio signal. The method also includes determining, by the processing system, based at least in part on a model trained to identify the output of the speaker device, that the additional audio signal corresponds to an utterance of a user. The method further includes initiating a reduction in an audio output level of the speaker device based on determining that the additional audio signal corresponds to the utterance of the user.
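
The level reduction described in the abstract might look like the sketch below. It assumes a hypothetical `match_score` in [0, 1] that says how well the captured signal matches the speaker device's output alone, and it treats a low score as evidence of a user utterance. The function name, the 0.8 threshold, and the 0.25 ducking factor are illustrative assumptions, not values from the patent.

```python
def reduced_output_level(
    current_level: float,
    match_score: float,
    match_threshold: float = 0.8,  # assumed: below this, extra audio is treated as a user utterance
    duck_factor: float = 0.25,     # assumed: fraction of the level kept while the user speaks
) -> float:
    """Return the speaker output level after an optional reduction."""
    if not 0.0 <= duck_factor <= 1.0:
        raise ValueError("duck_factor must be between 0 and 1")
    user_utterance_detected = match_score < match_threshold
    if user_utterance_detected:
        return current_level * duck_factor
    return current_level
```

A `duck_factor` of 0 would model a full interruption of the output rather than a partial reduction.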

183 Citations

20 Claims

1. A computer-implemented method comprising:
- receiving, by a mobile device, an audio signal;
  
  determining, by the mobile device and using a model that is trained to detect a presence of a synthesized voice and a model that is trained to detect a presence of a user's voice, that the audio signal likely includes both the synthesized voice and the user's voice;
  
  in response to determining, by the mobile device and using a model that is trained to detect a presence of a synthesized voice and a model that is trained to detect a presence of a user's voice, that the audio signal likely includes both the synthesized voice and the user's voice, suppressing, by the mobile device, operation of a speech synthesis module implemented by the mobile device;
  
  after suppressing operation of the speech synthesis module, obtaining, by the mobile device, a transcription corresponding to the audio signal from an automated speech recognizer; and
  
  providing, by the mobile device, the transcription for output.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, wherein suppressing, by the mobile device, operation of a speech synthesis module implemented by the mobile device comprises initiating a reduction in an audio output level of the speech synthesis module.
  - 3. The method of claim 2, wherein initiating a reduction in an audio output level of the speech synthesis module comprises interrupting output of the speech synthesis module.
  - 4. The method of claim 1, further comprising:
    - obtaining a first vector corresponding to at least a portion of the audio signal;
      
      comparing the first vector to a second vector corresponding to the model that is trained to detect a presence of a synthesized voice; and
      
      determining that the audio signal comprises additional audio other than the synthesized voice based on a result of the comparison satisfying a threshold.
  - 5. The method of claim 1, further comprising:
    - obtaining a first vector corresponding to at least a portion of the audio signal; and
      
      determining that the audio signal comprises additional audio other than the synthesized voice based on the first vector satisfying a threshold.
  - 6. The method of claim 1, wherein each of the model that is trained to detect a presence of a synthesized voice and the model that is trained to detect a presence of a user's voice is an i-vector based model.
  - 7. The method of claim 1, wherein each of the model that is trained to detect a presence of a synthesized voice and the model that is trained to detect a presence of a user's voice is a neural network based model.
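
The vector comparison in claim 4 can be illustrated with a short sketch. The claim does not say which comparison is used or in which direction the threshold is satisfied, so cosine similarity, the "below threshold" rule, the 0.7 value, and both function names are assumptions made for illustration only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def has_additional_audio(
    first_vector: Sequence[float],
    synthesized_voice_vector: Sequence[float],
    threshold: float = 0.7,  # assumed value; the claim only recites "a threshold"
) -> bool:
    """Flag audio whose vector is not close enough to the synthesized-voice vector."""
    # Assumption: low similarity to the synthesized-voice model means that
    # something other than the synthesized voice is present in the signal.
    return cosine_similarity(first_vector, synthesized_voice_vector) < threshold
```

Claim 5 differs in that the first vector itself is tested against a threshold, without the comparison to a second vector.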

8. A non-transitory computer readable storage device storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising:
- receiving, by a mobile device, an audio signal;
  
  determining, by the mobile device and using a model that is trained to detect a presence of a synthesized voice and a model that is trained to detect a presence of a user's voice, that the audio signal likely includes both the synthesized voice and the user's voice;
  
  in response to determining, by the mobile device and using a model that is trained to detect a presence of a synthesized voice and a model that is trained to detect a presence of a user's voice, that the audio signal likely includes both the synthesized voice and the user's voice, suppressing, by the mobile device, operation of a speech synthesis module implemented by the mobile device;
  
  after suppressing operation of the speech synthesis module, obtaining, by the mobile device, a transcription corresponding to the audio signal from an automated speech recognizer; and
  
  providing, by the mobile device, the transcription for output.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The computer readable storage device of claim 8, wherein suppressing, by the mobile device, operation of a speech synthesis module implemented by the mobile device comprises initiating a reduction in an audio output level of the speech synthesis module.
  - 10. The computer readable storage device of claim 9, wherein initiating a reduction in an audio output level of the speech synthesis module comprises interrupting output of the speech synthesis module.
  - 11. The computer readable storage device of claim 8, further comprising:
    - obtaining a first vector corresponding to at least a portion of the audio signal;
      
      comparing the first vector to a second vector corresponding to the model that is trained to detect a presence of a synthesized voice; and
      
      determining that the audio signal comprises additional audio other than the synthesized voice based on a result of the comparison satisfying a threshold.
  - 12. The computer readable storage device of claim 8, further comprising:
    - obtaining a first vector corresponding to at least a portion of the audio signal; and
      
      determining that the audio signal comprises additional audio other than the synthesized voice based on the first vector satisfying a threshold.
  - 13. The computer readable storage device of claim 8, wherein each of the model that is trained to detect a presence of a synthesized voice and the model that is trained to detect a presence of a user's voice is an i-vector based model.
  - 14. The computer readable storage device of claim 8, wherein each of the model that is trained to detect a presence of a synthesized voice and the model that is trained to detect a presence of a user's voice is a neural network based model.

15. A system comprising:
- one or more computers; and
  
  one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving, by a mobile device, an audio signal;
  
  determining, by the mobile device and using a model that is trained to detect a presence of a synthesized voice and a model that is trained to detect a presence of a user's voice, that the audio signal likely includes both the synthesized voice and the user's voice;
  
  in response to determining, by the mobile device and using a model that is trained to detect a presence of a synthesized voice and a model that is trained to detect a presence of a user's voice, that the audio signal likely includes both the synthesized voice and the user's voice, suppressing, by the mobile device, operation of a speech synthesis module implemented by the mobile device;
  
  after suppressing operation of the speech synthesis module, obtaining, by the mobile device, a transcription corresponding to the audio signal from an automated speech recognizer; and
  
  providing, by the mobile device, the transcription for output.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The system of claim 15, wherein suppressing, by the mobile device, operation of a speech synthesis module implemented by the mobile device comprises initiating a reduction in an audio output level of the speech synthesis module.
  - 17. The system of claim 16, wherein initiating a reduction in an audio output level of the speech synthesis module comprises interrupting output of the speech synthesis module.
  - 18. The system of claim 15, further comprising:
    - obtaining a first vector corresponding to at least a portion of the audio signal;
      
      comparing the first vector to a second vector corresponding to the model that is trained to detect a presence of a synthesized voice; and
      
      determining that the audio signal comprises additional audio other than the synthesized voice based on a result of the comparison satisfying a threshold.
  - 19. The system of claim 15, further comprising:
    - obtaining a first vector corresponding to at least a portion of the audio signal; and
      
      determining that the audio signal comprises additional audio other than the synthesized voice based on the first vector satisfying a threshold.
  - 20. The system of claim 15, wherein each of the model that is trained to detect a presence of a synthesized voice and the model that is trained to detect a presence of a user's voice is one of: an i-vector based model and a neural network based model.

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Google Inc. (Alphabet Inc.)
Inventors
Melendo Casado, Diego; Moreno, Ignacio Lopez; Gonzalez-Dominguez, Javier
Primary Examiner(s)
SINGH, SATWANT K

Application Number
US15/093,309
Publication Number
US 20160225373A1
Time in Patent Office
348 Days
Field of Search
None
US Class Current
1/1
CPC Class Codes

G06F 3/165   Management of the audio str...

G06F 3/167   Audio in a user interface, ...

G10L 15/20   Speech recognition techniqu...

G10L 15/222   Barge in, i.e. overridable ...

G10L 15/26   Speech to text systems G10L...

G10L 17/00   Speaker identification or v...

G10L 17/06   Decision making techniques;...

G10L 21/034   Automatic adjustment

G10L 25/84   for discriminating voice fr...

H03G 3/3005   in amplifiers suitable for ...
