Memory usage in a text-to-speech system

US 20060229877A1
Filed: 04/06/2005
Published: 10/12/2006
Est. Priority Date: 04/06/2005
Status: Abandoned Application

Abstract

In a concatenative text-to-speech system, a high compression rate for the duration data in the prosodic template is achieved by extracting statistical parameters that describe the behavior of the actual duration values of the instances of each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed, and storing only the extracted statistical parameters instead of the original duration values. The entries of each given basic unit in the prosodic template are sorted and indexed in order of increasing duration value. Consequently, the amount of duration data can be significantly reduced while keeping the error statistically within an acceptable range.
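
The compression step described in the abstract can be sketched roughly as follows. This is only an illustration: the abstract speaks of "statistical parameters" in general, and the choice of mean and standard deviation, the `DurationStats` record, the `compress_durations` function and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class DurationStats:
    """Hypothetical compressed record for one basic speech unit."""
    count: int      # number of instances of the unit
    mean_ms: float  # mean duration of the instances
    std_ms: float   # population standard deviation of the durations


def compress_durations(template):
    """Sort each unit's instances by duration and keep only statistics.

    `template` maps a unit label to a list of (instance_id, duration_ms).
    Returns (sorted_ids, stats): the instance ids in order of increasing
    duration, and the statistics that stand in for the raw durations.
    """
    sorted_ids, stats = {}, {}
    for unit, entries in template.items():
        ordered = sorted(entries, key=lambda entry: entry[1])
        durations = [duration for _, duration in ordered]
        sorted_ids[unit] = [instance_id for instance_id, _ in ordered]
        stats[unit] = DurationStats(len(durations), mean(durations), pstdev(durations))
    return sorted_ids, stats


# Illustrative values only, not taken from the source.
template = {"ka": [("ka_0", 120.0), ("ka_1", 80.0), ("ka_2", 100.0)]}
sorted_ids, stats = compress_durations(template)
print(sorted_ids["ka"])
print(stats["ka"])
```

Because the entries are kept in sorted order, an instance's position in the list carries the information about its relative duration, which is what lets the raw values be dropped.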

27 Claims

1. A method of creating prosodic information for a concatenative text-to-speech synthesis system, comprising analysing training speech samples and generating acoustic units and associated prosodic information for selection of said acoustic units, said prosodic information including first duration information, compressing the first duration information by producing statistical data describing the behavior of the first duration information, storing said prosodic information wherein the first duration information is replaced by said statistical data, thereby reducing a memory capacity required for storing said prosodic information.
- Dependent claims (2, 3, 4, 5, 10, 21):
  - 2. A method according to claim 1, wherein said statistical data include statistical parameters of durations for each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed among the acoustic units.
  - 3. A method according to claim 1, wherein said statistical data describe behavior of duration value entries of all instances within each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed.
  - 4. A method according to claim 1, wherein said statistical data include at least one of a mean value and a deviation of durations for each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed.
  - 5. A method according to claim 1, comprising sorting entries of each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed in the order of increasing duration values.
  - 10. A method according to claim 1, wherein entries of each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed in the acoustic data database are in the order of increasing duration values.
  - 21. An executable program code that, when run on a computing device, cause the device to perform the method steps of claim 1.
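
As a rough illustration of the memory reduction that claim 1 refers to, the following sketch compares the storage for raw duration values with the storage for per-unit statistics. The byte width, the number of parameters per unit, the instance counts and the function name `duration_memory` are assumptions for the example, not figures from the source.

```python
def duration_memory(instances_per_unit, bytes_per_value=2, params_per_unit=2):
    """Return (raw_bytes, compressed_bytes) for the duration data alone.

    raw: one stored duration value per instance.
    compressed: a fixed number of statistical parameters per unit.
    """
    raw = sum(instances_per_unit.values()) * bytes_per_value
    compressed = len(instances_per_unit) * params_per_unit * bytes_per_value
    return raw, compressed


# Hypothetical instance counts for three basic speech units.
counts = {"ka": 50, "ta": 30, "mi": 20}
raw, compressed = duration_memory(counts)
print(raw, compressed, raw / compressed)
```

Under these assumptions the saving grows with the number of instances per unit, since the compressed size depends only on the number of distinct units.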

6. A method for concatenative text-to-speech synthesis, comprising inputting a text, analyzing the text and producing phonetic presentation of the text, selecting from a memory, based on said phonetic presentation, prestored prosodic information including compressed duration information in form of statistical data that describes behavior of first duration information of a given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed, decompressing said compressed duration information by producing from said statistical data an estimation of said first duration information of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed by means of a statistical function, selecting, based on the estimation of said first duration information, a stored acoustic unit of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed from an acoustic data database to be concatenated to form synthetic speech.
- Dependent claims (7, 8, 9):
  - 7. A method according to claim 6, wherein said statistical function includes one of:
    - a probability model;
    - uniform probability model;
    - Gaussian probability model;
    - curve fitting to a sorted duration curve;
    - polynomial approximation;
    - spline-based approximation; and
    - vector quantization.
  - 8. A method according to claim 6, wherein said statistical data describe behavior of duration value entries of all instances within each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed.
  - 9. A method according to claim 6, wherein said statistical data include at least one of:
    - statistical parameters of durations for each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed among the acoustic units;
    - a mean value of durations for each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed; and
    - a deviation of durations for each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed.
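
The decompression and selection steps of claim 6 might look something like the sketch below, using the Gaussian and uniform probability models named in claim 7. The quantile placement `(k + 0.5) / count`, the nearest-estimate selection rule and the function names are assumptions; the claims do not specify them.

```python
from math import sqrt
from statistics import NormalDist


def estimate_durations(count, mean_ms, std_ms, model="gaussian"):
    """Estimate the sorted duration values of `count` instances.

    Instance k (0-based, in increasing-duration order) is assigned the
    model quantile at p = (k + 0.5) / count. This placement is an assumption.
    """
    probs = [(k + 0.5) / count for k in range(count)]
    if model == "gaussian":
        if std_ms == 0:
            return [mean_ms] * count
        dist = NormalDist(mean_ms, std_ms)
        return [dist.inv_cdf(p) for p in probs]
    if model == "uniform":
        # A uniform distribution with the same mean and standard deviation
        # spans mean +/- sqrt(3) * std.
        half_width = sqrt(3.0) * std_ms
        return [mean_ms - half_width + 2.0 * half_width * p for p in probs]
    raise ValueError(f"unknown model: {model}")


def select_instance(target_ms, estimates):
    """Return the sorted-order index whose estimate is closest to the target."""
    return min(range(len(estimates)), key=lambda k: abs(estimates[k] - target_ms))


# Illustrative statistics for one unit with three stored instances.
estimates = estimate_durations(3, 100.0, 10.0)
print(estimates)
print(select_instance(100.0, estimates))
```

The returned index refers to the position in the duration-sorted entry list, so it identifies a stored acoustic unit without any raw duration value being kept.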

11. A device for a concatenative text-to-speech synthesis, comprising a text analyzer producing phonetic presentation of a text input;
- a memory storing a lexicon for the text analyzer, voice data including acoustic units, and associated prosodic information for selection of said acoustic units, said prosodic information including compressed duration information in form of statistical data that describes behavior of first duration information of each syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed, decompressor decompressing said compressed duration information by a predetermined statistical function producing an estimation of said first duration information of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed based on the statistical data;
- a selector selecting, based on the estimation of said first duration information and other prosodic information, a stored acoustic unit of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed from an acoustic data database to be concatenated to form synthetic speech.
- Dependent claims (12, 13, 14, 15):
  - 12. A device according to claim 11, wherein said statistical function includes one of:
    - a probability model;
    - uniform probability model;
    - Gaussian probability model;
    - curve fitting to a sorted duration curve;
    - polynomial quantization;
    - spline quantization; and
    - vector quantization.
  - 13. A device according to claim 11, wherein said statistical data describe behavior of duration value entries of all instances within each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed.
  - 14. A device according to claim 11, wherein said statistical data include at least one of:
    - statistical parameters of durations for each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed among the acoustic units;
    - a mean value of durations for each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed; and
    - a deviation of durations for each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed.
  - 15. A device according to claim 11, wherein said device is a mobile device comprising an executable program code configured to implement the text analyzer, the decompressor and the selector.
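
Claim 12 also lists curve fitting to a sorted duration curve among the possible statistical functions. A minimal sketch of that option, assuming a first-order polynomial fitted by ordinary least squares over the sorted positions, is shown below. The degree, the function names and the sample durations are assumptions for illustration only.

```python
def fit_sorted_curve(sorted_durations):
    """Fit d(k) ~ intercept + slope * k over sorted positions k = 0..n-1.

    Returns (intercept, slope), the two values that would be stored
    instead of the n raw durations.
    """
    n = len(sorted_durations)
    if n < 2:
        raise ValueError("need at least two instances to fit a line")
    k_mean = (n - 1) / 2.0
    d_mean = sum(sorted_durations) / n
    sxx = sum((k - k_mean) ** 2 for k in range(n))
    sxy = sum((k - k_mean) * (d - d_mean) for k, d in enumerate(sorted_durations))
    slope = sxy / sxx
    return d_mean - slope * k_mean, slope


def estimate_from_curve(intercept, slope, k):
    """Estimate the duration of the instance at sorted position k."""
    return intercept + slope * k


# Illustrative durations in milliseconds, not taken from the source.
durations = sorted([80.0, 95.0, 100.0, 110.0, 125.0])
intercept, slope = fit_sorted_curve(durations)
errors = [abs(estimate_from_curve(intercept, slope, k) - d)
          for k, d in enumerate(durations)]
print(intercept, slope, max(errors))
```

Sorting makes the duration curve monotonic, which is why a low-order fit can be a reasonable approximation; a higher-order polynomial or a spline would trade more stored coefficients for a smaller error.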

16. A mobile communication device, comprising a data processing unit;
- a memory storing a lexicon for text analysis, voice data including acoustic units, and associated prosodic information for selection of said acoustic units, said prosodic information including compressed duration information in form of statistical data that describes behavior of first duration information of each syllable, and a program code that causes the data processing unit to analyze the text and producing phonetic presentation of a text input, to select from said memory, based on said phonetic presentation, compressed duration information of a given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed, to decompress said compressed duration information by producing from said statistical data an estimation of said first duration information of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed by means of a statistical function, and to select, based on the estimation of said first duration information, a stored acoustic unit of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed from an acoustic data database to be concatenated to form synthetic speech.
- Dependent claims (17, 18, 19):
  - 17. A device according to claim 16, wherein said statistical function includes one of:
    - a probability model;
    - uniform probability model;
    - Gaussian probability model;
    - curve fitting to a sorted duration curve;
    - polynomial quantization;
    - spline quantization; and
    - vector quantization.
  - 18. A device according to claim 16, wherein said statistical data describe behavior of duration value entries of all instances within each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed.
  - 19. A device according to claim 16, wherein said statistical data include at least one of:
    - statistical parameters of durations for each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed among the acoustic units;
    - a mean value of durations for each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed; and
    - a deviation of durations for each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed.

20. A data storage encoded with an executable program that, when run on a computing device, cause the device to analyze the text and producing phonetic presentation of a text input, to select from said memory, based on said phonetic presentation, compressed duration information of a given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed, to decompress said compressed duration information by producing from said statistical data an estimation of said first duration information of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed by means of a statistical function, and to select, based on the estimation of said first duration information, a stored acoustic unit of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed from an acoustic data database to be concatenated to form synthetic speech.

22. A device for creating prosodic information for a concatenative text-to-speech synthesis system, comprising analyzer analysing training speech samples and generating acoustic units and associated prosodic information for selection of said acoustic units, said prosodic information including first duration information, compressor compressing the first duration information by producing statistical data describing the behavior of the first duration information, memory storing said prosodic information wherein the first duration information is replaced by said statistical data, thereby reducing a memory capacity required for storing said prosodic information.
- Dependent claims (23, 24, 25, 26):
  - 23. A device according to claim 22, wherein said statistical data include statistical parameters of durations for each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed among the acoustic units.
  - 24. A device according to claim 22, wherein said statistical data describe behavior of duration value entries of all instances within each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed.
  - 25. A device according to claim 22, wherein said statistical data include at least one of a mean value and a deviation of durations for each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed.
  - 26. A device according to claim 22, comprising sorting entries of each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed in the order of increasing duration values.

27. A concatenative text-to-speech synthesis system, comprising means analysing training speech samples and generating acoustic units and associated prosodic information for selection of said acoustic units, said prosodic information including first duration information, means compressing the first duration information by producing statistical data describing the behavior of the first duration information of each syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed, means storing a lexicon for the text analyzer, voice data including said acoustic units, and said associated prosodic information containing said compressed duration information, means producing phonetic presentation of a text input;
- means decompressing said compressed duration information by a predetermined statistical function producing an estimation of said first duration information of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed based on the statistical data;
- means selecting, based on the estimation of said first duration information and other prosodic information, a stored acoustic unit of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed from an acoustic data database to be concatenated to form synthetic speech.

Current Assignee: Nokia Corporation
Original Assignee: Nokia Corporation
Inventors: Tian, Jilei; Nurminen, Jani

Application Number: US11/100,001
Publication Number: US 20060229877A1
US Class Current: 704/267
CPC Class Codes: G10L 13/06 Elementary speech units use...
