Text-to-speech processing using previously speech processed data

US 10,140,973 B1
Filed: 09/15/2016
Issued: 11/27/2018
Est. Priority Date: 09/15/2016
Status: Active Grant

Abstract

Systems, methods, and devices for generating text-to-speech output using previously captured speech are described. Spoken audio is obtained and undergoes speech processing to create text. The resulting text is stored with the spoken audio, with both the text and the spoken audio being associated with the individual that spoke the audio. Various spoken audio and corresponding text are stored over time to create a library of speech units. When the individual sends a text message to a recipient, the text message is processed to determine portions of text, and the portions of text are compared to the library of text associated with the individual. When text in the library is identified, the system selects the spoken audio units associated with the identified stored text. The selected spoken audio units are then used to generate output audio data corresponding to the original text message, with the output audio data being sent to a device of the message recipient.
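
The library lookup described in the abstract can be illustrated with a minimal sketch. It assumes an in-memory store and a greedy longest-phrase match; the class `SpeechLibrary`, its methods, and the byte strings standing in for audio are hypothetical names chosen for illustration, not anything specified by the patent.

```python
from collections import defaultdict


class SpeechLibrary:
    """Per-user store mapping recognized text to the audio it came from (hypothetical)."""

    def __init__(self):
        # user_id -> {normalized phrase -> audio bytes}
        self._units = defaultdict(dict)

    @staticmethod
    def _normalize(text):
        # Lowercase and collapse whitespace so lookups ignore casing and spacing.
        return " ".join(text.lower().split())

    def add(self, user_id, text, audio):
        """Store ASR text together with the spoken audio for one user."""
        self._units[user_id][self._normalize(text)] = audio

    def cover(self, user_id, message):
        """Split a message into the longest stored phrases, left to right.

        Returns (phrase, audio) pairs; audio is None when nothing matches.
        """
        units = self._units[user_id]
        words = self._normalize(message).split()
        result, i = [], 0
        while i < len(words):
            # Try the longest candidate phrase starting at word i first.
            for j in range(len(words), i, -1):
                phrase = " ".join(words[i:j])
                if phrase in units:
                    result.append((phrase, units[phrase]))
                    i = j
                    break
            else:
                # No stored unit starts here; report the single word as unmatched.
                result.append((words[i], None))
                i += 1
        return result


library = SpeechLibrary()
library.add("user-1", "See you", b"<audio A>")
library.add("user-1", "tomorrow", b"<audio B>")
segments = library.cover("user-1", "See you tomorrow morning")
```

Here `segments` pairs "see you" and "tomorrow" with their stored audio and leaves "morning" unmatched, which is the point at which a system of this kind would need another source of audio.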

57 Citations

18 Claims

1. A computer-implemented method, comprising:
- receiving first input audio data corresponding to an utterance;
  
  performing automatic speech recognition processing on the first input audio data to create first reference text data;
  
  in a database associated with a first user profile, storing a first association between the first input audio data and the first reference text data;
  
  receiving, from a first device associated with the first user profile, a message intended for a second device, the message including first text data;
  
  determining the first text data corresponds to the first reference text data;
  
  identifying the first input audio data in the database based at least in part on the first association;
  
  causing the first device to output a visual indication representing that the first text data corresponds to the first reference text data;
  
  generating, after causing the first device to output the visual indication, output audio data including the first input audio data; and
  
  sending, to the second device, the output audio data.
- - 2. The computer-implemented method of claim 1, further comprising:
    - receiving second input audio data corresponding to a second utterance;
      
      performing automatic speech recognition processing on the second input audio data to create second reference text data;
      
      in the database, storing a second association between the second input audio data and the second reference text data;
      
      determining a pronunciation of the first text data;
      
      determining a first diphone identifier associated with the first reference text data;
      
      determining a second diphone identifier associated with the second reference text data; and
      
      determining the first diphone identifier and the second diphone identifier correspond to the pronunciation,
      
      wherein generating the output audio data comprises concatenating the first input audio data to the second input audio data.
  - 3. The computer-implemented method of claim 2, further comprising:
    - associating the first reference text data with first pronunciation data;
      
      associating the second reference text data with second pronunciation data;
      
      receiving, from the first device, a third message intended for the second device, the third message including second text data;
      
      determining the second text data corresponds to a first word of the first reference text data and a second word of the second reference text data, the first word being identical to the second word;
      
      performing prosodic analysis processing on the second text data to determine third pronunciation data; and
      
      identifying the first word for generating second output audio data based at least in part on the first pronunciation data being at least similar to the third pronunciation data.
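
The diphone matching and concatenation recited in claim 2 can be sketched as follows. This is an illustration under stated assumptions: the `"a-b"` identifier format, the function names, and the plain byte concatenation are hypothetical, and a practical synthesizer would likely also smooth the joins between units.

```python
def to_diphones(phonemes):
    """Turn a phoneme sequence into diphone identifiers (hypothetical 'a-b' format)."""
    return [f"{a}-{b}" for a, b in zip(phonemes, phonemes[1:])]


def concatenate_units(pronunciation, diphone_audio):
    """Build output audio by joining the stored audio for each needed diphone.

    pronunciation: list of phoneme symbols for the text to be spoken.
    diphone_audio: dict mapping diphone identifier -> stored audio bytes.
    """
    needed = to_diphones(pronunciation)
    # Check that every diphone of the pronunciation has stored audio.
    missing = [d for d in needed if d not in diphone_audio]
    if missing:
        raise KeyError(f"no stored audio for diphones: {missing}")
    # Raw concatenation stands in for real waveform joining.
    return b"".join(diphone_audio[d] for d in needed)


stored = {"h-ax": b"\x01", "ax-l": b"\x02"}
output_audio = concatenate_units(["h", "ax", "l"], stored)
```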

4. A system, comprising:
- at least one processor; and
  
  at least one memory including instructions that, when executed by the at least one processor, cause the system to:
  
  receive first input audio data corresponding to at least one first utterance associated with user profile data;
  
  perform automatic speech recognition processing on the first input audio data to create first text data;
  
  associate the first text data with the user profile data, the first text data being associated with at least a portion of the first input audio data, the at least a portion of the first input audio data having a first prosodic characteristic;
  
  receive second text data;
  
  determine the second text data is associated with the user profile data;
  
  determine the first text data corresponds to at least a portion of the second text data;
  
  determine the at least a portion of the second text data is associated with third text data, the third text data being associated with second input audio data having a second prosodic characteristic;
  
  perform prosodic analysis processing on the second text data to determine a third prosodic characteristic;
  
  determine the third prosodic characteristic at least substantially matches the first prosodic characteristic;
  
  generate, after determining the third prosodic characteristic at least substantially matches the first prosodic characteristic, output audio data using the at least a portion of the first input audio data; and
  
  send the output audio data to a first device.
- - 5. The system of claim 4, wherein a first portion of the first input audio data represents a diphone and wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to:
    - determine a pronunciation of the second text data;
      
      determine the diphone corresponds to the pronunciation; and
      
      generate the output audio data based at least in part on the diphone corresponding to the pronunciation.
  - 6. The system of claim 4, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to:
    - cause a second device to output a first visual indication representing a first portion of the second text data corresponds to the first text data; and
      
      cause the second device to output a second visual indication representing a second portion of the second text data does not correspond to the first text data, the first visual indication and the second visual indication being different with respect to at least color.
  - 7. The system of claim 4, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to:
    - receive third text data;
      
      determine the third text data does not correspond to the first text data;
      
      cause a second device to request further input audio corresponding to at least one second utterance corresponding to the third text data;
      
      receive, from the second device, second input audio data; and
      
      associate the third text data with the second input audio data and the user profile data.
  - 8. The system of claim 4, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to:
    - receive third text data representing a first word;
      
      determine the first word does not correspond to the first text data;
      
      perform natural language understanding processing on the third text data;
      
      determine a second word having a similar meaning as the first word;
      
      determine the second word corresponds to the first text data; and
      
      send, to a second device, first data representing the second word.
  - 9. The system of claim 4, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to:
    - receive third text data including a first portion and a second portion;
      
      determine a first portion of the first text data corresponding to the first portion of the third text data;
      
      determine a second portion of the first text data corresponding to the second portion of the third text data; and
      
      generate second output audio data at least partially corresponding to the first portion of the first text data and the second portion of the first text data.
  - 10. The system of claim 9, wherein the first portion of the first text data is a first sequence of words and the second portion of the first text data is a second sequence of words.
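
Claim 4 selects between stored recordings of the same text by comparing prosodic characteristics. The sketch below shows one way such a comparison could work. The `Prosody` fields, the distance scaling, and the `tolerance` threshold used to model "at least substantially matches" are all assumptions made for illustration; the claims do not define them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    """Hypothetical prosodic characteristic of a speech unit."""
    pitch_hz: float
    duration_s: float
    energy: float


def distance(a, b):
    # Pitch is divided by 100 so it is on a scale comparable to the other fields.
    return math.sqrt(
        ((a.pitch_hz - b.pitch_hz) / 100.0) ** 2
        + (a.duration_s - b.duration_s) ** 2
        + (a.energy - b.energy) ** 2
    )


def select_unit(target, candidates, tolerance=0.5):
    """Pick the candidate audio whose prosody is closest to the target.

    candidates: list of (audio, Prosody) pairs stored for the same text.
    Returns None when no candidate is within the tolerance.
    """
    if not candidates:
        return None
    audio, prosody = min(candidates, key=lambda c: distance(target, c[1]))
    return audio if distance(target, prosody) <= tolerance else None


target = Prosody(pitch_hz=200.0, duration_s=0.40, energy=0.6)
candidates = [
    (b"<flat reading>", Prosody(120.0, 0.40, 0.6)),
    (b"<rising reading>", Prosody(210.0, 0.45, 0.6)),
]
chosen = select_unit(target, candidates)
```

With these made-up numbers, the rising reading is chosen because its pitch and duration sit close to the target, while the flat reading falls outside the tolerance.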

11. A computer-implemented method, comprising:
- receiving first input audio data corresponding to at least one first utterance associated with user profile data;
  
  performing automatic speech recognition processing on the first input audio data to create first text data;
  
  associating the first text data with the user profile data, the first text data being associated with at least a portion of the first input audio data;
  
  receiving second text data;
  
  determining the second text data is associated with the user profile data;
  
  determining the first text data corresponds to a first portion of the second text data;
  
  causing a first device to request further input audio data corresponding to at least one second utterance corresponding to a second portion of the second text data;
  
  receiving, from the first device and after causing the first device to request further input audio data, second input audio data;
  
  associating the second portion of the second text data with the second input audio data and the user profile data;
  
  generating output audio data using the at least a portion of the first input audio data and the second input audio data; and
  
  sending the output audio data to a second device.
- - 12. The computer-implemented method of claim 11, further comprising:
    - receiving recipient information; and
      
      determining the second device using at least one of the user profile data or the recipient information.
  - 13. The computer-implemented method of claim 11, further comprising:
    - receiving third text data;
      
      determining the third text data corresponds to fourth text data, the fourth text data being associated with first audio data having a first prosodic characteristic;
      
      determining the third text data corresponds to fifth text data, the fifth text data being associated with second audio data having a second prosodic characteristic;
      
      performing prosodic analysis processing on the third text data to determine a third prosodic characteristic; and
      
      selecting the first audio data for text-to-speech processing based at least in part on the third prosodic characteristic at least substantially matching the first prosodic characteristic.
  - 14. The computer-implemented method of claim 11, further comprising:
    - causing the first device to output a first visual indication representing a first portion of the second text data corresponds to the first text data; and
      
      causing the first device to output a second visual indication representing a second portion of the second text data does not correspond to the first text data, the first visual indication and the second visual indication being different with respect to at least color.
  - 15. The computer-implemented method of claim 11, further comprising:
    - receiving third text data including a first word;
      
      determining the first word does not correspond to the first text data;
      
      performing natural language understanding processing on the third text data;
      
      determining a second word having a similar meaning as the first word;
      
      determining the second word corresponds to the first text data; and
      
      sending, to the first device, first data representing the second word.
  - 16. The computer-implemented method of claim 11, further comprising:
    - receiving third text data including a first portion and a second portion;
      
      determining a first portion of the first text data corresponding to the first portion of the third text data;
      
      determining a second portion of the first text data corresponding to the second portion of the third text data; and
      
      generating second output audio data at least partially corresponding to the first portion of the first text data and the second portion of the first text data.
  - 17. The computer-implemented method of claim 16, wherein the first portion of the first text data is a first sequence of words and the second portion of the first text data is a second sequence of words.
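
Claims 11 and 14 together describe marking which parts of a message already have stored audio and requesting recordings for the rest. A minimal sketch of that flow, assuming word-level matching and using hypothetical function names and color labels, might look like this:

```python
def annotate(message, known_words, match_color="green", miss_color="red"):
    """Tag each word with a color showing whether stored audio exists for it."""
    annotated = []
    for word in message.split():
        # Compare without case or trailing punctuation.
        key = word.lower().strip(".,!?")
        color = match_color if key in known_words else miss_color
        annotated.append((word, color))
    return annotated


def recording_prompts(annotated, miss_color="red"):
    """Group consecutive unmatched words into phrases the user is asked to record."""
    prompts, run = [], []
    for word, color in annotated:
        if color == miss_color:
            run.append(word)
        elif run:
            prompts.append(" ".join(run))
            run = []
    if run:
        prompts.append(" ".join(run))
    return prompts


known = {"see", "you"}
marked = annotate("See you next week", known)
prompts = recording_prompts(marked)
```

Grouping adjacent unmatched words into one prompt is a design choice of this sketch: it asks the user for a single natural utterance ("next week") rather than isolated words.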

18. A system, comprising:
- at least one processor; and
  
  at least one memory including instructions that, when executed by the at least one processor, cause the system to:
  
  receive input audio data corresponding to at least one utterance associated with user profile data;
  
  perform automatic speech recognition processing on the input audio data to create first text data;
  
  associate the first text data with the user profile data and at least a portion of the input audio data;
  
  receive second text data representing at least a first word;
  
  determine the second text data is associated with the user profile data;
  
  determine the first word does not correspond to the first text data;
  
  perform natural language understanding processing on the second text data;
  
  determine a second word having a similar meaning as the first word;
  
  determine the second word corresponds to the first text data;
  
  send, to a first device, first data representing the second word;
  
  receive, from the first device, an indication representing the second word is to be used;
  
  generate output audio data using the second word; and
  
  send the output audio data to a second device.
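
The synonym substitution in claim 18 can be sketched as below. A plain dictionary stands in for the natural language understanding step, and the `confirm` callback stands in for the indication received from the first device; both, along with the function names, are hypothetical simplifications.

```python
def suggest_replacement(word, library_words, thesaurus):
    """Return a similar-meaning word that has stored audio, or None.

    library_words: set of words with stored audio for the user.
    thesaurus: dict mapping a word to candidate synonyms (stand-in for NLU).
    """
    key = word.lower()
    if key in library_words:
        return None  # the word itself is available, so nothing to suggest
    for candidate in thesaurus.get(key, ()):
        if candidate in library_words:
            return candidate
    return None


def resolve_word(word, library_words, thesaurus, confirm):
    """Use the suggested word only if the user confirms the substitution."""
    suggestion = suggest_replacement(word, library_words, thesaurus)
    if suggestion is not None and confirm(word, suggestion):
        return suggestion
    return word


library_words = {"happy", "today"}
thesaurus = {"glad": ["pleased", "happy"]}
accepted = resolve_word("glad", library_words, thesaurus, lambda old, new: True)
declined = resolve_word("glad", library_words, thesaurus, lambda old, new: False)
```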

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Dalmia, Manish Kumar; Kuklinski, Rafal
Primary Examiner(s): Lerner, Martin
Application Number: US15/266,116
Time in Patent Office: 803 Days
Field of Search: 704/258, 704/260, 704/261, 704/266, 704/269, 379/88.16
CPC Class Codes

G06F 40/247   Thesauruses; Synonyms

G06F 40/30   Semantic analysis

G06N 7/01   Probabilistic graphical mod...

G10L 13/06   Elementary speech units use...

G10L 13/07   Concatenation rules

G10L 13/10   Prosody rules derived from ...

G10L 15/26   Speech to text systems G10L...
