Methods for reconstructing an audio signal

US 10,127,918 B1
Filed: 05/03/2017
Issued: 11/13/2018
Est. Priority Date: 05/03/2017
Status: Active Grant

Abstract

A system configured to reconstruct audio signals. The system may identify missing audio samples due to packet loss or detect distortion caused by audio clipping and may reconstruct the audio data. The system may employ a forward-looking neural network that recursively predicts audio samples based on previous audio samples and/or a backward-looking neural network that recursively predicts audio samples based on subsequent audio samples. The system may generate audio data using only the forward-looking neural network for low latency applications or may generate audio data using both neural networks for mid to high latency applications. To reduce distortion in output audio data, the system may generate the audio data by cross-fading between outputs of the neural networks and/or may cross-fade between the generated audio data and the input audio data.
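As a rough illustration of the detection step the abstract describes, the following Python sketch flags runs of samples that are either missing or saturated. The function name `find_distorted_runs`, the use of `None` to mark a lost sample, and the default saturation value of 1.0 are assumptions made for illustration; the patent text does not specify any of them.

```python
def find_distorted_runs(samples, saturation=1.0):
    """Return (start, end) index pairs, end exclusive, for each distorted run.

    A sample is treated as distorted when it is missing (None, an assumed
    encoding for lost packets) or when its magnitude reaches the saturation
    value. The claims say "equal to" the threshold; >= is used here so that
    out-of-range values are also caught.
    """
    runs = []
    start = None
    for i, s in enumerate(samples):
        bad = s is None or abs(s) >= saturation
        if bad and start is None:
            start = i                      # a distorted run begins
        elif not bad and start is not None:
            runs.append((start, i))        # the run ended at the first clean sample
            start = None
    if start is not None:
        runs.append((start, len(samples)))  # run extends to the end of the buffer
    return runs


audio = [0.1, 0.5, 1.0, 1.0, 0.4, None, None, -0.2, -1.0]
print(find_distorted_runs(audio))  # [(2, 4), (5, 7), (8, 9)]
```

In this sketch, the end index of each run is also the position of the first undistorted sample that follows it, which corresponds to the "first time" referred to in the claims.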

14 Citations

23 Claims

1. A computer-implemented method, comprising:
- receiving input audio data comprising a plurality of audio samples;
  
  detecting distortion in a first portion of the input audio data associated with a first period of time, the distortion caused by at least one of the plurality of audio samples missing from the input audio data or a magnitude value of one or more of the plurality of audio samples being equal to a saturation threshold value;
  
  determining that a second portion of the input audio data following the first portion is not distorted, the second portion corresponding to a second period of time that begins at a first time;
  
  performing, based on a magnitude of signal values of the input audio data, a quantization process to generate first audio data by mapping the signal values of the input audio data to discrete states corresponding to respective quantization intervals;
  
  generating, based on the first audio data, two or more first audio data predictions corresponding to at least part of the first period of time, the two or more first audio data predictions determined using a first generative model that receives the first audio data as input features and predicts a magnitude of signal values for audio samples recursively in a first direction in time;
  
  determining a first audio sample in the first audio data corresponding to the first time;
  
  determining a magnitude value associated with the first audio sample;
  
  selecting, based on at least the magnitude value associated with the first audio sample, a first data prediction of the two or more first audio data predictions;
  
  generating, based on the first data prediction, second audio data corresponding to at least part of the first period of time;
  
  generating, based on at least the first audio data and the second audio data, output audio data, the output audio data including the second audio data followed by a third portion of the first audio data that includes the first audio sample; and
  
  doing at least one of (a) causing audio corresponding to the output audio data to be output by at least one speaker, or (b) causing a function corresponding to a voice command represented by the output audio data to be executed.
- Dependent claims (2, 3, 4):
  - 2. The computer-implemented method of claim 1, further comprising:
    - determining the third portion of the first audio data, the third portion including the first audio sample and corresponding to the second period of time;
      
      selecting the first data prediction based on the third portion and the magnitude value associated with the first audio sample;
      
      generating the output audio data, the output audio data cross-fading from the second audio data to the third portion of the first audio data; and
      
      training, using the output audio data, a neural network included in the first generative model.
  - 3. The computer-implemented method of claim 1, further comprising:
    - generating, based on the first audio data, a second data prediction corresponding to at least part of the first period of time, the second data prediction determined using a second generative model that predicts a magnitude of signal values for audio samples recursively in a second direction in time opposite the first direction beginning at the first time; and
      
      generating, based on the first data prediction and the second data prediction, the second audio data, the second audio data cross-fading between the first data prediction and the second data prediction, the second audio data corresponding to the first period of time.
  - 4. The computer-implemented method of claim 1, further comprising:
    - performing, based on a magnitude of signal values for the input audio data, the quantization process to generate the first audio data, the quantization process having nonuniform quantization intervals, wherein a first quantization interval corresponds to a first range of signal values and a second quantization interval corresponds to a second range of signal values that is smaller than the first range.
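One way to picture the nonuniform quantization of claim 4 is mu-law companding, in which intervals near zero cover a smaller range of signal values than intervals near full scale. The claims do not name a specific quantizer, so the choice of mu-law, the constant `MU = 255`, and the function names below are assumptions used only as a minimal sketch.

```python
import math

MU = 255  # assumed companding constant (256 states); not stated in the claims


def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to one of mu + 1 discrete states (0..mu)."""
    x = max(-1.0, min(1.0, x))  # clamp to the valid input range
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((compressed + 1.0) / 2.0 * mu + 0.5)  # round to nearest state


def mu_law_decode(state, mu=MU):
    """Map a discrete state back to a representative sample value."""
    compressed = 2.0 * state / mu - 1.0
    return math.copysign(((1.0 + mu) ** abs(compressed) - 1.0) / mu, compressed)


# Interval widths differ: states near zero are much narrower than states near 1.
narrow = mu_law_decode(129) - mu_law_decode(128)
wide = mu_law_decode(255) - mu_law_decode(254)
print(narrow < wide)  # True
```

A generative model that predicts one of these discrete states per sample, rather than a continuous value, can treat prediction as classification over the quantization intervals.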

5. A computer-implemented method, comprising:
- receiving input audio data comprising a plurality of audio samples;
  
  detecting distortion in a first portion of the input audio data associated with a first period of time;
  
  determining that a second portion of the input audio data following the first portion is not distorted, the second portion corresponding to a second period of time that begins at a first time;
  
  performing a quantization process on the input audio data to generate first audio data by mapping signal values of the input audio data to discrete states corresponding to respective quantization intervals;
  
  generating, based on the first audio data, two or more first audio data predictions corresponding to at least part of the first period of time, the two or more first audio data predictions determined using a first generative model that receives the first audio data as input features and predicts audio samples recursively in a first direction in time;
  
  generating, based on the two or more first audio data predictions, second audio data corresponding to at least part of the first period of time;
  
  generating, based on at least the first audio data and the second audio data, output audio data; and
  
  doing at least one of (a) causing audio corresponding to the output audio data to be output by at least one speaker, or (b) causing a function corresponding to a voice command represented by the output audio data to be executed.
- Dependent claims (6, 7, 8, 9, 10, 11, 12, 13):
  - 6. The computer-implemented method of claim 5, further comprising:
    - determining a first audio sample in the first audio data, the first audio sample corresponding to the first time;
      
      determining a magnitude value associated with the first audio sample;
      
      generating the second audio data by selecting, based on at least the magnitude value associated with the first audio sample, a first data prediction of the two or more first audio data predictions, the second audio data corresponding to the first period of time; and
      
      generating the output audio data, the output audio data including the second audio data followed by a third portion of the first audio data that includes the first audio sample.
  - 7. The computer-implemented method of claim 5, further comprising:
    - determining a third portion of the first audio data, the third portion corresponding to at least part of the second period of time;
      
      generating the second audio data by selecting, based on the third portion of the first audio data, a first data prediction of the two or more first audio data predictions, the second audio data corresponding to the first period of time and at least part of the second period of time;
      
      generating, based on at least part of the second audio data and the third portion of the first audio data, third audio data, the third audio data cross-fading from the second audio data to the third portion of the first audio data; and
      
      generating the output audio data, the output audio data including a part of the second audio data followed by the third audio data and part of the third portion of the first audio data.
  - 8. The computer-implemented method of claim 5, further comprising:
    - determining a third portion of the first audio data, the third portion corresponding to at least part of the second period of time;
      
      generating the second audio data by averaging signal values of audio samples included in the two or more first audio data predictions, the second audio data corresponding to the first period of time and at least part of the second period of time;
      
      generating, based on at least part of the second audio data and the third portion of the first audio data, third audio data, the third audio data cross-fading from the second audio data to the third portion of the first audio data; and
      
      generating the output audio data, the output audio data including part of the second audio data followed by the third audio data and part of the third portion of the first audio data.
  - 9. The computer-implemented method of claim 5, further comprising:
    - selecting a first data prediction of the two or more first audio data predictions;
      
      generating, based on the first audio data, a second data prediction corresponding to at least part of the first period of time, the second data prediction determined using a second generative model that predicts audio samples recursively in a second direction in time opposite the first direction beginning at the first time; and
      
      generating, based on the first data prediction and the second data prediction, the second audio data, the second audio data cross-fading between the first data prediction and the second data prediction, the second audio data corresponding to the first period of time.
  - 10. The computer-implemented method of claim 5, further comprising:
    - selecting a first data prediction of the two or more first audio data predictions;
      
      generating, based on the first audio data, two or more second audio data predictions corresponding to at least part of the first period of time, the two or more second audio data predictions determined using a second generative model that predicts audio samples recursively in a second direction in time opposite the first direction beginning at the first time;
      
      selecting a second data prediction of the two or more second audio data predictions; and
      
      generating, based on the first data prediction and the second data prediction, the second audio data, the second audio data cross-fading between the first data prediction and the second data prediction, the second audio data corresponding to the first period of time.
  - 11. The computer-implemented method of claim 5, further comprising:
    - generating, based on the first audio data, two or more second audio data predictions corresponding to at least part of the first period of time, the two or more second audio data predictions determined using a second generative model that predicts audio samples recursively in a second direction in time opposite the first direction beginning at the first time;
      
      determining a plurality of similarity metrics, wherein the determining the plurality of similarity metrics further comprises:
      
      determining a first similarity metric between a first data prediction of the two or more first audio data predictions and a second data prediction of the two or more second audio data predictions, and
      
      determining a second similarity metric between the first data prediction of the two or more first audio data predictions and a third data prediction of the two or more second audio data predictions;
      
      determining that the second similarity metric is a highest similarity metric of the plurality of similarity metrics; and
      
      generating the second audio data, the second audio data cross-fading between the first data prediction and the third data prediction, the second audio data corresponding to the first period of time.
  - 12. The computer-implemented method of claim 5, further comprising:
    - performing, based on a magnitude of signal values for the input audio data, the quantization process to generate the first audio data, the quantization process having nonuniform quantization intervals, wherein a first quantization interval corresponds to a first range of signal values and a second quantization interval corresponds to a second range of signal values that is smaller than the first range.
  - 13. The computer-implemented method of claim 5, wherein:
    - the distortion is caused by at least one of the plurality of audio samples missing from the input audio data or a magnitude value of one or more of the plurality of audio samples being equal to a saturation threshold value.
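Claims 6 through 8 describe three ways of turning several candidate predictions into replacement audio: selecting one candidate based on the first undistorted sample, averaging the candidates, and cross-fading from the generated audio into the undistorted audio that follows. The sketch below shows one plausible reading of each. The selection rule (closest final sample), the linear fade shape, and all function names are assumptions; the claims do not fix them.

```python
def select_prediction(candidates, first_clean_sample):
    """Pick the candidate whose final sample is closest to the first clean sample.

    Assumed criterion: a small jump at the boundary suggests a smooth join.
    """
    return min(candidates, key=lambda c: abs(c[-1] - first_clean_sample))


def average_predictions(candidates):
    """Average the candidates sample by sample (all assumed to be equal length)."""
    return [sum(column) / len(column) for column in zip(*candidates)]


def crossfade(a, b):
    """Fade linearly from sequence a to sequence b over their common length."""
    n = min(len(a), len(b))
    out = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of b rises from near 0 to near 1
        out.append((1.0 - w) * a[i] + w * b[i])
    return out


candidates = [[0.1, 0.2, 0.9], [0.1, 0.3, 0.4], [0.0, -0.2, -0.5]]
print(select_prediction(candidates, first_clean_sample=0.5))  # [0.1, 0.3, 0.4]
print(crossfade([1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))            # [0.75, 0.5, 0.25]
```

For the cross-fade of claims 7 and 8, `a` would be the part of the generated audio that overlaps the second period of time and `b` the corresponding undistorted samples.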

14. A system comprising:
- at least one processor; and
  
  memory including instructions operable to be executed by the at least one processor to perform a set of actions to configure the system device to:
  
  receive input audio data comprising a plurality of audio samples;
  
  detect distortion in a first portion of the input audio data associated with a first period of time;
  
  determine that a second portion of the input audio data following the first portion is not distorted, the second portion corresponding to a second period of time that begins at a first time;
  
  perform a quantization process on the input audio data to generate first audio data by mapping signal values of the input audio data to discrete states corresponding to respective quantization intervals;
  
  generate, based on the first audio data, two or more first audio data predictions corresponding to at least part of the first period of time, the two or more first audio data predictions determined using a first generative model that receives the first audio data as input features and predicts audio samples recursively in a first direction in time;
  
  generate, based on the two or more first audio data predictions, second audio data corresponding to at least part of the first period of time;
  
  generate, based on at least the first audio data and the second audio data, output audio data; and
  
  do at least one of (a) cause audio corresponding to the output audio data to be output by at least one speaker, or (b) cause a function corresponding to a voice command represented by the output audio data to be executed.
- Dependent claims (15, 16, 17, 18, 19, 20, 21, 22, 23):
  - 15. The system of claim 14, wherein the memory includes additional instructions operable to be executed by the at least one processor to further configure the system to:
    - determine a first audio sample in the first audio data, the first audio sample corresponding to the first time;
      
      determine a magnitude value associated with the first audio sample;
      
      generate the second audio data by selecting, based on at least the magnitude value associated with the first audio sample, a first data prediction of the two or more first audio data predictions, the second audio data corresponding to the first period of time; and
      
      generate the output audio data, the output audio data including the second audio data followed by a third portion of the first audio data that includes the first audio sample.
  - 16. The system of claim 14, wherein the memory includes additional instructions operable to be executed by the at least one processor to further configure the system to:
    - determine a third portion of the first audio data, the third portion corresponding to at least part of the second period of time;
      
      generate the second audio data by selecting, based on the third portion of the first audio data, a first data prediction of the two or more first audio data predictions, the second audio data corresponding to the first period of time and at least part of the second period of time;
      
      generate, based on at least part of the second audio data and the third portion of the first audio data, third audio data, the third audio data cross-fading from the second audio data to the third portion of the first audio data; and
      
      generate the output audio data, the output audio data including a part of the second audio data followed by the third audio data and part of the third portion of the first audio data.
  - 17. The system of claim 14, wherein the memory includes additional instructions operable to be executed by the at least one processor to further configure the system to:
    - determine a third portion of the first audio data, the third portion corresponding to at least part of the second period of time;
      
      generate the second audio data by averaging signal values of audio samples included in the two or more first audio data predictions, the second audio data corresponding to the first period of time and at least part of the second period of time;
      
      generate, based on at least part of the second audio data and the third portion of the first audio data, third audio data, the third audio data cross-fading from the second audio data to the third portion of the first audio data; and
      
      generate the output audio data, the output audio data including part of the second audio data followed by the third audio data and part of the third portion of the first audio data.
  - 18. The system of claim 14, wherein the memory includes additional instructions operable to be executed by the at least one processor to further configure the system to:
    - select a first data prediction of the two or more first audio data predictions;
      
      generate, based on the first audio data, a second data prediction corresponding to at least part of the first period of time, the second data prediction determined using a second generative model that predicts audio samples recursively in a second direction in time opposite the first direction beginning at the first time; and
      
      generate, based on the first data prediction and the second data prediction, the second audio data, the second audio data cross-fading between the first data prediction and the second data prediction, the second audio data corresponding to the first period of time.
  - 19. The system of claim 14, wherein the memory includes additional instructions operable to be executed by the at least one processor to further configure the system to:
    - select a first data prediction of the two or more first audio data predictions;
      
      generate, based on the first audio data, two or more second audio data predictions corresponding to at least part of the first period of time, the two or more second audio data predictions determined using a second generative model that predicts audio samples recursively in a second direction in time opposite the first direction beginning at the first time;
      
      select a second audio prediction of the two or more second audio data predictions; and
      
      generate, based on the first data prediction and the second data prediction, the second audio data, the second audio data cross-fading between the first data prediction and the second data prediction, the second audio data corresponding to the first period of time.
  - 20. The system of claim 14, wherein the memory includes additional instructions operable to be executed by the at least one processor to further configure the system to:
    - generate, based on the first audio data, two or more second audio data predictions corresponding to at least part of the first period of time, the two or more second audio data predictions determined using a second generative model that predicts audio samples recursively in a second direction in time opposite the first direction beginning at the first time;
      
      determine a plurality of similarity metrics, wherein determining the plurality of similarity metrics further comprises:
      
      determining a first similarity metric between a first data prediction of the two or more first audio data predictions and a second data prediction of the two or more second audio data predictions, and
      
      determining a second similarity metric between the first data prediction of the two or more first audio data predictions and a third data prediction of the two or more second audio data predictions;
      
      determine that the second similarity metric is a highest similarity metric of the plurality of similarity metrics; and
      
      generate the second audio data, the second audio data cross-fading between the first data prediction and the third data prediction, the second audio data corresponding to the first period of time.
  - 21. The system of claim 14, wherein the memory includes additional instructions operable to be executed by the at least one processor to further configure the system to:
    - perform, based on a magnitude of signal values for the input audio data, the quantization process to generate the first audio data, the quantization process having nonuniform quantization intervals, wherein a first quantization interval corresponds to a first range of signal values and a second quantization interval corresponds to a second range of signal values that is smaller than the first range.
  - 22. The system of claim 14, wherein the input audio data corresponds to an utterance, and the memory includes additional instructions operable to be executed by the at least one processor to further configure the system to:
    - cause a voice command represented by the utterance to be determined; and
      
      cause the function to be performed based at least in part on the voice command.
  - 23. The system of claim 14, wherein the memory includes instructions operable to be executed by the at least one processor to configure the system to:
    - cause the audio corresponding to the output audio data to be output by the at least one speaker.
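Claims 11 and 20 pair a forward prediction with whichever backward prediction has the highest similarity metric, then cross-fade between the two. The following sketch generalizes this to all forward and backward pairs. The metric (negative mean squared difference), the linear fade, and the function names are assumptions chosen for illustration; the claims leave the metric unspecified.

```python
def similarity(a, b):
    """Negative mean squared difference; higher means more alike (assumed metric)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_pair(forward_preds, backward_preds):
    """Return the (forward, backward) pair with the highest similarity."""
    pairs = [(f, b) for f in forward_preds for b in backward_preds]
    return max(pairs, key=lambda pair: similarity(pair[0], pair[1]))


def crossfade_fill(forward, backward):
    """Blend across the gap, favoring forward early and backward late.

    Both inputs are assumed to be in chronological order and of equal length.
    """
    n = len(forward)
    out = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of the backward prediction
        out.append((1.0 - w) * forward[i] + w * backward[i])
    return out


forward_preds = [[0.0, 0.1, 0.2, 0.3]]
backward_preds = [[0.9, 0.8, 0.7, 0.6], [0.0, 0.1, 0.2, 0.4]]
fwd, bwd = best_pair(forward_preds, backward_preds)
filled = crossfade_fill(fwd, bwd)
```

The weighting reflects the idea that a recursive prediction tends to drift the further it runs from the audio it started from, so each model is trusted most near its own starting edge of the gap.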

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Kamath Koteshwara, Krishna; Kristjansson, Trausti Thor
Primary Examiner(s): Azad, Abul
Application Number: US15/585,458
Time in Patent Office: 559 Days
Field of Search: None
CPC Class Codes:
G10L 15/20   Speech recognition techniqu...
G10L 21/02   Speech enhancement, e.g. no...
G10L 25/51   for comparison or discrimin...
G11B 27/038   Cross-faders therefor
