HYBRID COMPRESSION OF TEXT-TO-SPEECH VOICE DATA

US 20140122060A1
Filed: 12/19/2012
Published: 05/01/2014
Est. Priority Date: 10/26/2012
Status: Active Grant

Abstract

Recorded or synthesized speech segments of text-to-speech (TTS) systems may be compressed through the use of both time domain compression and perceptual compression techniques. The twice-compressed recording may be separated into speech segments corresponding to words or subword units for use in a TTS system. The compression rate of time domain compression, and the ratio of time domain compression to perceptual compression, may be modified for any speech segment. The compression amount or ratio may be determined based on linguistic or acoustic features of the word or subword unit that the speech segment represents. Differing compression amounts and ratios may be applied to portions of a single speech segment.
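The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: plain windowed overlap-add stands in for the time domain stage (the claims name TD-PSOLA and WSOLA), and 8-bit mu-law companding stands in for the perceptual stage (the claims name CELP, ACELP, LPC, and RELPC). The function names, the frame size, and the order of the two stages are assumptions made for illustration.

```python
import math

MU = 255.0  # companding constant for 8-bit mu-law (stand-in perceptual stage)


def time_compress(samples, rate, frame=256):
    """Shorten samples to roughly len(samples) / rate with plain overlap-add."""
    if rate < 1.0:
        raise ValueError("rate must be >= 1.0")
    if rate == 1.0 or len(samples) <= frame:
        return list(samples)
    synth_hop = frame // 2                   # 50% overlap in the output
    ana_hop = int(round(synth_hop * rate))   # larger hop in the input
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame)
              for n in range(frame)]
    n_frames = 1 + (len(samples) - frame) // ana_hop
    out_len = (n_frames - 1) * synth_hop + frame
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for i in range(n_frames):
        a, s = i * ana_hop, i * synth_hop
        for n in range(frame):
            out[s + n] += samples[a + n] * window[n]
            norm[s + n] += window[n]
    # Divide by the summed window so overlapping frames do not change the gain.
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]


def mu_law_encode(samples):
    """Quantize floats in [-1, 1] to one byte each on a logarithmic scale."""
    encoded = bytearray()
    for x in samples:
        x = max(-1.0, min(1.0, x))
        y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
        encoded.append(int(round((y + 1.0) / 2.0 * 255)))
    return bytes(encoded)


def hybrid_compress(samples, rate):
    """Apply the time domain stage first, then the perceptual stand-in."""
    return mu_law_encode(time_compress(samples, rate))


# Hypothetical input: half a second of a 200 Hz tone sampled at 16 kHz.
tone = [0.5 * math.sin(2.0 * math.pi * 200 * t / 16000) for t in range(8000)]
packed = hybrid_compress(tone, rate=2.0)
print(len(tone), "samples ->", len(packed), "bytes")
```

The two stages multiply: the first roughly halves the number of samples at a rate of 2.0, and the second stores each remaining sample in a single byte.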

26 Claims

1. A system comprising:
- one or more processors;
  
  a computer-readable memory; and
  
  a module comprising executable instructions stored in the computer-readable memory, the module, when executed by the one or more processors, configured to:
  
  obtain a voice recording and a corresponding sequence of speech units;
  
  select a first speech segment, wherein the first speech segment corresponds to a portion of the voice recording and wherein the first speech segment corresponds to a first speech unit;
  
  apply a first compression technique to the first speech segment to create a first compressed speech segment, wherein the first compression technique comprises one of time domain compression or perceptual compression;
  
  apply a second compression technique to the first compressed speech segment to create a second compressed speech segment, wherein the second compression technique comprises one of time domain compression or perceptual compression, and wherein the second compression technique is different from the first compression technique;
  
  distribute the second compressed speech segment to a client computing device for use in a text-to-speech system.
- Dependent claims (2, 3, 4, 5):
  - 2. The system of claim 1, wherein time domain compression is based at least in part on Time Domain Pitch Synchronous Overlap and Add (TD-PSOLA) compression or Waveform Similarity Overlap and Add (WSOLA) compression.
  - 3. The system of claim 1, wherein perceptual compression is based at least in part on Code-Excited Linear Prediction (CELP), Algebraic Code-Excited Linear Prediction (ACELP), Linear Predictive Coding (LPC), or Residual Excited Linear Predictive Coding (RELPC).
  - 4. The system of claim 1, wherein the first speech unit comprises one of a diphone, a phoneme, or a triphone.
  - 5. The system of claim 1, wherein the first compression technique is time domain compression and a compression rate is based at least in part on the speech unit.
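The first two steps of claim 1 (obtaining a recording with its sequence of speech units, then selecting the segment for one unit) might look like the following sketch. The `UnitAlignment` record and `select_segment` helper are hypothetical names, and the sample indices are made-up values; the claims do not specify how the alignment is stored.

```python
from dataclasses import dataclass


@dataclass
class UnitAlignment:
    unit: str   # speech unit label, e.g. a diphone such as "h-e"
    start: int  # index of the first sample of the unit in the recording
    end: int    # index one past the last sample of the unit


def select_segment(recording, alignment, index):
    """Return (unit label, samples) for the unit at position `index`."""
    entry = alignment[index]
    if not 0 <= entry.start < entry.end <= len(recording):
        raise ValueError(f"alignment for {entry.unit!r} is outside the recording")
    return entry.unit, recording[entry.start:entry.end]


# Hypothetical ten-sample recording covering two diphones.
recording = list(range(10))
alignment = [UnitAlignment("h-e", 0, 4), UnitAlignment("e-l", 4, 10)]
unit, segment = select_segment(recording, alignment, 1)
print(unit, segment)
```

Keeping the unit label next to the samples lets a later step choose a compression rate per unit, as in claim 5.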

6. A computer-implemented method comprising:
- applying, by a text-to-speech voice development system comprising one or more computing devices, a first compression technique to a portion of a voice recording to create a first compressed portion; and
  
  applying, by the voice development system, a second compression technique to the first compressed portion to create a second compressed portion;
  
  wherein the second compression technique is different from the first compression technique, and wherein at least one of the first compression technique or the second compression technique comprises time-domain compression.
- Dependent claims (7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19):
  - 7. The computer-implemented method of claim 6, wherein time domain compression is based at least in part on Time Domain Pitch Synchronous Overlap and Add (TD-PSOLA) compression or Waveform Similarity Overlap and Add (WSOLA) compression.
  - 8. The computer-implemented method of claim 6, wherein at least one of the first compression technique or the second compression technique is based at least in part on Code-Excited Linear Prediction (CELP), Algebraic Code-Excited Linear Prediction (ACELP), Linear Predictive Coding (LPC), or Residual Excited Linear Predictive Coding (RELPC).
  - 9. The computer-implemented method of claim 6, further comprising storing position data regarding a position of a first speech segment within the second compressed portion based at least in part on a text associated with the voice recording.
  - 10. The computer-implemented method of claim 6, wherein the portion corresponds to one of a phoneme, a diphone, or a word.
  - 11. The computer-implemented method of claim 6, wherein applying the first compression technique comprises applying a different level of time domain compression to a first subportion and a second subportion of the portion.
  - 12. The computer-implemented method of claim 11, wherein the first subportion corresponds to one of a phoneme, a diphone, or a word.
  - 13. The computer-implemented method of claim 6, further comprising determining a level of compression to apply to the portion based at least in part on a linguistic feature of a text corresponding to the portion.
  - 14. The computer-implemented method of claim 13, wherein the linguistic feature comprises an identification of a phoneme.
  - 15. The computer-implemented method of claim 13, wherein the linguistic feature comprises an indication of a phoneme class, wherein the phoneme class is one of a voiced phoneme, an unvoiced phoneme, a plosive, a vowel, a consonant, a liquid, or a fricative.
  - 16. The computer-implemented method of claim 6, wherein applying the first compression technique comprises applying a different level of compression to a first subportion and a second subportion of the portion.
  - 17. The computer-implemented method of claim 6, further comprising determining a level of compression to apply to the first compressed portion based at least in part on a linguistic feature of a text corresponding to the portion.
  - 18. The computer-implemented method of claim 17, wherein the linguistic feature comprises an identification of a phoneme.
  - 19. The computer-implemented method of claim 17, wherein the acoustic feature comprises an indication of a phoneme class, wherein the phoneme class is one of a voiced phoneme, an unvoiced phoneme, a plosive, a vowel, a consonant, a liquid, or a fricative.
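Claims 11 and 13 to 15 describe choosing a compression level from a linguistic feature such as the phoneme class, and applying different levels to subportions of one portion. A minimal sketch of such a lookup follows. The phoneme-to-class table, the rates, and the default are illustrative assumptions, not values taken from the patent.

```python
# Hypothetical mapping from phoneme to one of the classes named in claim 15.
PHONEME_CLASS = {"a": "vowel", "s": "fricative", "l": "liquid", "t": "plosive"}

# Hypothetical time domain compression rates per class (1.0 = no compression).
RATE_BY_CLASS = {"vowel": 2.0, "liquid": 1.5, "fricative": 1.5, "plosive": 1.0}

DEFAULT_RATE = 1.25  # assumed fallback for phonemes missing from the table


def rate_for_phoneme(phoneme):
    """Pick a time domain compression rate from the phoneme's class."""
    phoneme_class = PHONEME_CLASS.get(phoneme)
    return RATE_BY_CLASS.get(phoneme_class, DEFAULT_RATE)


def plan_subportions(aligned_phonemes):
    """Map (phoneme, start, end) triples to (start, end, rate) triples."""
    return [(start, end, rate_for_phoneme(phoneme))
            for phoneme, start, end in aligned_phonemes]


# One portion (the word "sat") split into three subportions with made-up bounds.
plan = plan_subportions([("s", 0, 800), ("a", 800, 2400), ("t", 2400, 2800)])
print(plan)
```

Each subportion can then be passed to the time domain stage with its own rate, which is one way to realize the differing levels recited in claim 11.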

20. A non-transitory computer readable medium which stores a text-to-speech component comprising executable code that directs a client computing device to perform a process comprising:
- receiving text comprising a sequence of words; and
  
  assembling an audio presentation corresponding to the text, the audio presentation comprising a sequence of speech segments, wherein the sequence of speech segments is based at least in part on the sequence of words, and wherein assembling the audio presentation comprises:
  
  retrieving a first compressed speech segment;
  
  applying two decompression techniques to the first compressed speech segment to obtain a first speech segment;
  
  retrieving a second compressed speech segment;
  
  applying two decompression techniques to the second compressed speech segment to obtain a second speech segment;
  
  concatenating the first speech segment and the second speech segment.
- Dependent claims (21, 22, 23, 24, 25, 26):
  - 21. The non-transitory computer readable medium of claim 20, wherein the first speech segment corresponds to a word or subword unit.
  - 22. The non-transitory computer readable medium of claim 21, wherein a subword unit comprises one of a phoneme or diphone.
  - 23. The non-transitory computer readable medium of claim 20, wherein applying two decompression techniques to the first compressed speech segment comprises determining a level of time domain compression applied to the first compressed speech segment.
  - 24. The non-transitory computer readable medium of claim 23, wherein determining the level of time domain compression comprises one of querying a database or inspecting metadata associated with the first compressed speech segment.
  - 25. The non-transitory computer readable medium of claim 20, wherein applying two decompression techniques to the first compressed speech segment comprises determining a level of perceptual compression applied to the first compressed speech segment.
  - 26. The non-transitory computer readable medium of claim 25, wherein determining the level of perceptual compression comprises one of querying a database or inspecting metadata associated with the first speech segment.
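The client-side process of claims 20 to 26 can be sketched as follows: retrieve each compressed segment, inspect its metadata for the time domain level, undo both compression stages, and concatenate the results. As assumptions for illustration, 8-bit mu-law decoding stands in for perceptual decompression, plain overlap-add stretching stands in for time domain decompression, and the `store` layout with its `payload` and `td_rate` keys is hypothetical.

```python
import math

MU = 255.0  # companding constant for 8-bit mu-law (stand-in perceptual stage)


def mu_law_decode(data):
    """Expand one byte per sample back to floats in [-1, 1]."""
    samples = []
    for b in data:
        y = b / 255.0 * 2.0 - 1.0
        x = math.expm1(abs(y) * math.log1p(MU)) / MU
        samples.append(math.copysign(x, y))
    return samples


def time_expand(samples, rate, frame=256):
    """Lengthen samples by roughly `rate` with plain overlap-add."""
    if rate == 1.0 or len(samples) <= frame:
        return list(samples)
    synth_hop = frame // 2
    ana_hop = max(1, int(round(synth_hop / rate)))  # smaller hop in the input
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame)
              for n in range(frame)]
    n_frames = 1 + (len(samples) - frame) // ana_hop
    out_len = (n_frames - 1) * synth_hop + frame
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for i in range(n_frames):
        a, s = i * ana_hop, i * synth_hop
        for n in range(frame):
            out[s + n] += samples[a + n] * window[n]
            norm[s + n] += window[n]
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]


def decode_segment(entry):
    """Undo the perceptual stand-in first, then the time domain stage."""
    return time_expand(mu_law_decode(entry["payload"]), entry["td_rate"])


def assemble(unit_ids, store):
    """Decode the segments for unit_ids in order and concatenate them."""
    audio = []
    for unit_id in unit_ids:
        audio.extend(decode_segment(store[unit_id]))
    return audio


# Hypothetical store of near-silent segments with per-segment metadata.
store = {
    "h-e": {"payload": bytes([128] * 4096), "td_rate": 2.0},
    "e-l": {"payload": bytes([128] * 300), "td_rate": 1.0},
}
audio = assemble(["h-e", "e-l"], store)
print(len(audio), "samples")
```

The decoder reverses the stages in the opposite order from the one assumed for compression, which is why each segment must carry, or be able to look up, the levels that were applied to it.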

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: IVONA Software Sp. z o.o. (Amazon.com, Inc.)
Inventors: Kaszczuk, Michal T.; Osowski, Lukasz M.
Granted Patent: US 9,064,489 B2
US Class Current: 704/9
CPC Class Codes:
- G10L 13/04   Details of speech synthesis...
- G10L 19/00   Speech or audio signals ana...
- G10L 21/04   Time compression or expansion
