INPUT SPEECH QUALITY MATCHING

US 20160379638A1
Filed: 06/26/2015
Published: 12/29/2016
Est. Priority Date: 06/26/2015
Status: Abandoned Application

2 Assignments

0 Petitions

Abstract

A system matches text-to-speech (TTS) or other output to a quality of an input spoken utterance. The system uses trained models to detect a speech quality and generates an indicator of the speech quality. The speech quality may be determined from audio or non-audio data. The indicator is sent to downstream components of the system such as a command processor or TTS system. The output of the system is then determined using the indicator of speech quality, thus customizing an output of the system to the manner in which the utterance was spoken.
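
The flow described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the `SpeechQuality` enum, the `detect_quality`, `command_processor`, and `respond` functions, and the 0.2 loudness cutoff do not come from the application, and a single cutoff stands in for the trained models it describes.

```python
from dataclasses import dataclass
from enum import Enum


class SpeechQuality(Enum):
    NORMAL = "normal"
    WHISPER = "whisper"


@dataclass(frozen=True)
class Response:
    text: str
    voice: SpeechQuality  # quality the TTS stage is asked to match


def detect_quality(volume: float, threshold: float = 0.2) -> SpeechQuality:
    """Stand-in for the trained model: one hypothetical loudness cutoff."""
    return SpeechQuality.WHISPER if volume < threshold else SpeechQuality.NORMAL


def command_processor(input_text: str, quality: SpeechQuality) -> str:
    """Pick output content; a whispered request gets the shorter answer."""
    answers = {
        SpeechQuality.NORMAL: f"Here is the full answer to: {input_text}",
        SpeechQuality.WHISPER: f"Short answer to: {input_text}",
    }
    return answers[quality]


def respond(input_text: str, volume: float) -> Response:
    quality = detect_quality(volume)                  # generate the indicator
    content = command_processor(input_text, quality)  # indicator reaches the command processor
    return Response(text=content, voice=quality)      # indicator also reaches the TTS stage
```

The point of the sketch is only that the indicator is computed once and then passed to each downstream component, rather than each component inspecting the audio itself.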

132 Citations

23 Claims

1. A computer-implemented method for processing a whispered utterance and responding in whispered synthesized speech, the method comprising:
- receiving input audio data comprising an input utterance;
- processing the input audio data with at least one trained model to determine that the input utterance was whispered;
- performing automatic speech recognition (ASR) on the input audio data to determine input text corresponding to the input utterance;
- performing natural language understanding processing on the input text to identify a query;
- determining content responding to the query based on the input utterance being whispered; and
- causing the content to be output.
- Dependent claims (2, 3):
  - 2. The computer-implemented method of claim 1, further comprising:
    - performing text-to-speech (TTS) processing on output text based on a speech quality indicator to generate output audio data, wherein the output audio data comprises synthesized speech responding to the query, wherein the synthesized speech is configured to sound like a whispered voice, wherein performing TTS processing further comprises:
    - performing unit selection using a voice corpus to select a plurality of stored audio data segments of recorded whispered speech, the stored audio data segments corresponding to the output text; and
    - concatenating the plurality of stored audio segments to determine the output audio data.
  - 3. The computer-implemented method of claim 1, wherein the trained model comprises a support vector machine (SVM) configured to process audio feature vectors to determine that speech associated with the audio feature vectors has a resonance below a resonance threshold and has a volume below a volume threshold.
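
The two-threshold condition in claim 3 can be sketched as follows. This is not the claimed SVM: a plain threshold rule is swapped in to show the decision the classifier is meant to reach. The `FrameFeatures` type, the normalized feature scales, both threshold values, and the `min_fraction` voting rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FrameFeatures:
    resonance: float  # hypothetical resonance score in [0, 1]
    volume: float     # hypothetical normalized loudness in [0, 1]


RESONANCE_THRESHOLD = 0.4  # illustrative value, not from the application
VOLUME_THRESHOLD = 0.3     # illustrative value, not from the application


def is_whispered(frames: Iterable[FrameFeatures], min_fraction: float = 0.8) -> bool:
    """Return True when enough frames fall below both thresholds."""
    frames = list(frames)
    if not frames:
        return False  # no audio, nothing to classify
    low = sum(
        1
        for f in frames
        if f.resonance < RESONANCE_THRESHOLD and f.volume < VOLUME_THRESHOLD
    )
    return low / len(frames) >= min_fraction
```

A trained SVM would learn the decision boundary from labeled feature vectors instead of using fixed cutoffs; the rule above only mirrors the "below both thresholds" outcome the claim recites.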

4. A computer-implemented method comprising:
- determining an input speech quality corresponding to input audio data;
- performing automatic speech recognition on the input audio data to determine input text;
- determining content based on the input text and the input speech quality; and
- causing the content to be output.
- Dependent claims (5, 6, 7, 8, 9, 10, 11, 12, 22, 23):
  - 5. The computer-implemented method of claim 4, wherein determining the input speech quality comprises processing the input audio data using at least one trained classifier configured to classify the audio data as either corresponding to the speech quality or not corresponding to the speech quality.
  - 6. The computer-implemented method of claim 4, further comprising:
    - performing natural language understanding processing on the input text to identify a search query; and
    - processing the query with a search engine to obtain a search result;
    - wherein determining the content comprises selecting, based on the input speech quality, a portion of the search result as the content.
  - 7. The computer-implemented method of claim 4, further comprising determining the input speech quality indicates that the audio data corresponds to whispered speech.
  - 8. The computer-implemented method of claim 7, wherein determining the input speech quality comprises processing the input audio data with a trained classifier configured to process audio feature vectors to determine that the input audio data has a resonance below a resonance threshold and has a volume below a volume threshold.
  - 9. The computer-implemented method of claim 8, further comprising processing input non-audio data to determine the input speech quality.
  - 10. The computer-implemented method of claim 9, wherein processing the input non-audio data comprises:
    - receiving light data from a light sensor;
    - determining that the light data is below a light threshold; and
    - inputting an indication that the light data is below the light threshold into the trained classifier.
  - 11. The computer-implemented method of claim 7, further comprising:
    - performing text-to-speech (TTS) processing on output text to generate output audio data, wherein the TTS processing is based on the input speech quality, and wherein performing TTS processing further comprises:
    - performing unit selection using a voice corpus to select a plurality of stored audio data segments of recorded whispered speech, the stored audio data segments corresponding to the output text; and
    - concatenating the plurality of stored audio segments to determine the output audio data, wherein the output audio data corresponds to an output utterance that responds to the query in a whispered voice.
  - 12. The computer-implemented method of claim 11, further comprising selecting the output text from a plurality of prepared text samples based on the speech quality.
  - 22. The computer-implemented method of claim 4, further comprising:
    - performing natural language understanding processing on the input text to identify a query;
    - determining first content and second content that are responsive to the query; and
    - selecting the first content as the content for output based on the input speech quality.
  - 23. The computer-implemented method of claim 4, further comprising:
    - performing natural language understanding processing on the input text to determine the input text corresponds to a request to play music;
    - determining first music content and second music content that are responsive to the request; and
    - determining the first music content includes an audio quality corresponding to the input speech quality; and
    - selecting the first music content as the content for output.
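
The content-selection steps in claims 6 and 23 can be sketched roughly as below. The claims do not say how a "portion" of a search result is chosen or how music is labeled with an audio quality, so the first-sentence rule, the `Track` type, the string labels such as "whisper" and "quiet", and both function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Track:
    title: str
    audio_quality: str  # hypothetical label such as "quiet" or "loud"


# Hypothetical mapping from input speech quality to a matching audio quality.
MATCHING_AUDIO_QUALITY = {"whisper": "quiet", "shout": "loud"}


def select_portion(search_result: str, speech_quality: str) -> str:
    """Claim 6 sketch: a whispered query gets only the first sentence."""
    if speech_quality == "whisper":
        return search_result.split(". ")[0].rstrip(".") + "."
    return search_result


def select_track(candidates: Sequence[Track], speech_quality: str) -> Optional[Track]:
    """Claim 23 sketch: prefer the track whose audio quality matches the speech."""
    wanted = MATCHING_AUDIO_QUALITY.get(speech_quality)
    for track in candidates:
        if track.audio_quality == wanted:
            return track
    # No match (or no mapping for this quality): fall back to the first candidate.
    return candidates[0] if candidates else None
```

In both functions the input text alone would yield the same candidates; the speech quality only decides which of them is output.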

13. A computing system comprising:
- at least one processor;
- a memory including instructions operable to be executed by the at least one processor to cause the system to perform a set of actions comprising:
- determining an input speech quality corresponding to input audio data;
- performing automatic speech recognition on the input audio data to determine input text;
- determining content based on the input text and the input speech quality; and
- causing the content to be output.
- Dependent claims (14, 15, 16, 17, 18, 19, 20, 21):
  - 14. The computing system of claim 13, wherein determining the input speech quality comprises processing the input audio data using at least one trained classifier configured to classify the audio data as either corresponding to the speech quality or not corresponding to the speech quality.
  - 15. The computing system of claim 13, the set of actions further comprising:
    - performing natural language understanding processing on the input text to identify a search query; and
    - processing the query with a search engine to obtain a search result;
    - wherein determining the content comprises selecting, based on the input speech quality, a portion of the search result as the content.
  - 16. The computing system of claim 13, the set of actions further comprising determining the input speech quality indicates that the audio data corresponds to whispered speech.
  - 17. The computing system of claim 16, wherein determining the input speech quality comprises processing the input audio data with a trained classifier configured to process audio feature vectors to determine that the input audio data has a resonance below a resonance threshold and has a volume below a volume threshold.
  - 18. The computing system of claim 17, the set of actions further comprising processing input non-audio data to determine the input speech quality.
  - 19. The computing system of claim 18, wherein processing the input non-audio data comprises:
    - receiving light data from a light sensor;
    - determining that the light data is below a light threshold; and
    - inputting an indication that the light data is below the light threshold into the trained classifier.
  - 20. The computing system of claim 16, the set of actions further comprising:
    - performing text-to-speech (TTS) processing on output text to generate output audio data, wherein the TTS processing is based on the input speech quality, and wherein performing TTS processing further comprises:
    - performing unit selection using a voice corpus to select a plurality of stored audio data segments of recorded whispered speech, the stored audio data segments corresponding to the output text; and
    - concatenating the plurality of stored audio segments to determine the output audio data, wherein the output audio data corresponds to an output utterance that responds to the query in a whispered voice.
  - 21. The computing system of claim 20, the set of actions further comprising selecting the output text from a plurality of prepared text samples based on the speech quality.
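
The unit selection and concatenation recited in claim 20 can be illustrated with a heavily simplified sketch. It assumes word-sized units, a greedy choice, and a single boundary feature (`start_energy` / `end_energy`) as the join cost; production unit-selection synthesizers typically use smaller units and search over combined target and join costs. The `Segment` type, the corpus layout, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Segment:
    unit: str            # text unit this recording covers (a word here)
    audio: bytes         # stand-in for recorded whispered audio samples
    start_energy: float  # hypothetical boundary feature used for the join cost
    end_energy: float


def select_units(words: Sequence[str], corpus: Dict[str, List[Segment]]) -> List[Segment]:
    """Greedily pick, for each word, the candidate that joins most smoothly."""
    chosen: List[Segment] = []
    for word in words:
        candidates = corpus[word]  # raises KeyError if the corpus lacks the unit
        if chosen:
            prev = chosen[-1]
            best = min(candidates, key=lambda c: abs(prev.end_energy - c.start_energy))
        else:
            best = candidates[0]  # no left neighbor yet, so take the first recording
        chosen.append(best)
    return chosen


def synthesize(text: str, corpus: Dict[str, List[Segment]]) -> bytes:
    """Concatenate the selected whispered segments into output audio data."""
    segments = select_units(text.lower().split(), corpus)
    return b"".join(segment.audio for segment in segments)
```

Because the corpus in this sketch holds only whispered recordings, the concatenated output is whispered by construction, which is the effect the claim describes.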

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Toth, Arthur Richard; Basye, Kenneth John; Barton, William Folwell
Application Number: US14/752,128
Publication Number: US 20160379638A1
US Class Current: 1/1
CPC Class Codes
- G10L 13/02: Methods for producing synth...
- G10L 13/033: Voice editing, e.g. manipul...
- G10L 15/18: using natural language mode...
- G10L 15/22: Procedures used during a sp...
- G10L 15/26: Speech to text systems G10L...
- G10L 17/26: Recognition of special voic...
- G10L 2015/223: Execution procedure of a sp...
- G10L 2015/225: Feedback of the input speech
- G10L 25/54: for retrieval
