Text to speech synthesis

US 7,979,280 B2
Filed: 02/22/2007
Issued: 07/12/2011
Est. Priority Date: 03/17/2006
Status: Active Grant

Abstract

An input linguistic description is converted into a speech waveform by deriving at least one target unit sequence corresponding to the linguistic description, selecting from a waveform unit database for the target unit sequences a plurality of alternative unit sequences approximating the target unit sequences, concatenating the alternative unit sequences to alternative speech waveforms and presenting the alternative speech waveforms to an operating person and enabling the choice of one of the presented alternative speech waveforms. There are no iterative cycles of manual modification and automatic selection, which enables a fast way of working. The operator does not need knowledge of units, targets, and costs, but chooses from a set of given alternatives. The fine-tuning of TTS prompts therefore becomes accessible to non-experts.
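The flow described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `Unit` fields, the pitch-only cost function, the exhaustive search, and the `choose` callback that stands in for the operating person.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Unit:
    unit_id: str     # hypothetical database key
    phone: str       # phonetic label the unit realises
    pitch_hz: float  # average pitch of the recorded unit
    samples: tuple   # toy stand-in for waveform samples


def derive_targets(phones, target_pitch_hz=120.0):
    """Toy target unit sequence: one (phone, pitch) target per input phone."""
    return [(p, target_pitch_hz) for p in phones]


def assign_candidates(targets, database):
    """For each target, collect the database units with a matching phone."""
    return [[u for u in database if u.phone == phone] for phone, _ in targets]


def sequence_cost(targets, seq):
    """Pitch mismatch against the targets plus pitch jumps at the joins."""
    target = sum(abs(u.pitch_hz - pitch) for u, (_, pitch) in zip(seq, targets))
    join = sum(abs(a.pitch_hz - b.pitch_hz) for a, b in zip(seq, seq[1:]))
    return target + join


def select_alternatives(targets, candidates, n=3):
    """Return the n cheapest unit sequences (exhaustive search, toy sizes only)."""
    ranked = sorted(product(*candidates), key=lambda s: sequence_cost(targets, s))
    return ranked[:n]


def concatenate(seq):
    """Join the samples of the units in order."""
    return [x for u in seq for x in u.samples]


def present_and_choose(alternatives, choose):
    """Build one waveform per alternative and return the one the operator picks."""
    waveforms = [concatenate(s) for s in alternatives]
    return waveforms[choose(waveforms)]


# Toy database and a scripted "operator" that always picks the second alternative.
database = [
    Unit("h1", "h", 110.0, (1, 2)),
    Unit("h2", "h", 125.0, (3, 4)),
    Unit("i1", "i", 118.0, (5,)),
    Unit("i2", "i", 140.0, (6,)),
]
targets = derive_targets(["h", "i"])
alternatives = select_alternatives(targets, assign_candidates(targets, database))
print(present_and_choose(alternatives, lambda waveforms: 1))
```

The operator only ever sees finished waveforms and returns an index, which mirrors the abstract's point that no knowledge of units, targets, or costs is needed to make the choice.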

18 Claims

1. A method for converting an input linguistic description into a speech waveform comprising:
- deriving at least one target unit sequence corresponding to the input linguistic description;
- assigning in a waveform unit database one or more waveform units to each target unit of the at least one target unit sequence;
- selecting for the at least one target unit sequence a plurality of alternative waveform unit sequences approximating the at least one target unit sequence, using the one or more waveform units assigned to each target unit of the at least one target unit sequence;
- concatenating the alternative waveform unit sequences to form alternative speech waveforms; and
- presenting the alternative speech waveforms to an operating person and enabling the choice of one of the presented alternative speech waveforms.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16):
  - 2. Method as in claim 1, wherein said plurality of alternative waveform unit sequences is generated in a predetermined way, by deriving at least one further target unit sequence using feedback from a previously selected waveform unit sequence.
  - 3. Method as claimed in claim 1, wherein at least one unit of at least one target unit sequence has a target pitch that is higher or lower by a predetermined minimal amount than the pitch of the corresponding unit of a previously selected waveform unit sequence.
  - 4. Method as claimed in claim 1, wherein at least one unit of at least one target unit sequence has a target duration that is longer or shorter by a predetermined minimal amount than the duration of the corresponding unit of a previously selected waveform unit sequence.
  - 5. Method as claimed in claim 1, wherein at least one unit of at least one target unit sequence imposes a predetermined difference in a voice quality or recording parameter or in other features, for example the unit identity, compared to a corresponding unit of at least one previously selected waveform unit sequence.
  - 6. Method as claimed in claim 1, wherein at least one unit of at least one target unit sequence imposes a predetermined minimum distance to a corresponding unit of at least one previously selected waveform unit sequence, measured by using an objective distance metric based on a speech parameterization.
  - 7. Method as claimed in claim 1, wherein alternative unit sequences are generated by varying at least one parameter of the unit selection cost functions by a predetermined minimal amount, wherein the at least one varied parameter is preferably the pitch mismatch weight or the phonetic context mismatch weight.
  - 8. Method as claimed in claim 1, wherein the linguistic description is partitioned into at least two subsets for which alternative waveform unit sequences are created and presented to the operator.
  - 9. Method as claimed in claim 8, wherein for at least one subset a predefined default choice of a waveform unit sequence is used instead of choosing a waveform unit sequence by the operating person, wherein said default choice is preferably predefined in a cache storing the operator's choice for a subset in a given context.
  - 10. Method as claimed in claim 8, wherein at least one subset is further partitioned into subcategories for which alternative waveform unit sequences are generated and presented to the operator.
  - 11. Method as claimed in claim 8, wherein the optimisation of subsets is done with a graphical editor, which can display the linguistic entities associated with subsets and at least one set of alternative waveform unit sequences for at least one subset, wherein the alternative waveform unit sequences are referenced by descriptors, allowing the operator to evaluate only those alternatives where an improvement is expected.
  - 12. Method as claimed in claim 1, wherein an operator's choice is stored in the form of unit sequence information, so that the speech waveform can be re-created at a later time, wherein the optimisation of speech waveforms is done on a first system and the storing of unit sequence information as well as the re-creation of speech waveforms is done on a second system, preferably an in-car navigation system.
  - 13. Method as claimed in claim 1, wherein the waveform unit sequences corresponding to waveforms chosen by the operator are used to improve the behaviour of the standard unit selection by updating the system parameters according to the target units or cost function variations preferred on average.
  - 14. Method as claimed in claim 1, wherein the waveform unit sequences corresponding to waveforms chosen by the operator are used to improve the behaviour of the standard waveform unit selection by adapting the unit selection parameters to increase overlap between the default unit sequences and a large set of manually optimized unit sequences.
  - 15. Method as claimed in claim 1, wherein the selecting includes selecting alternative waveform unit sequences with at least one minimal variation criteria.
  - 16. A non-transitory computer readable medium comprising program code for performing all the steps of claim 1 when said program is run on a computer.
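
Claims 2 and 3 describe deriving further targets from a previously selected unit sequence, with the target pitch moved up or down by a predetermined minimal amount. The following Python sketch shows one possible reading for a single target position. The dictionary layout, the `min_shift_hz` value, and the nearest-pitch selection rule are illustrative assumptions, not details taken from the patent.

```python
def select_unit(candidates, target_pitch_hz):
    """Pick the candidate whose pitch is closest to the target pitch."""
    return min(candidates, key=lambda u: abs(u["pitch_hz"] - target_pitch_hz))


def pitch_shifted_alternatives(candidates, target_pitch_hz, min_shift_hz=10.0):
    """Return the default unit plus one higher-pitched and one lower-pitched alternative.

    Each further target is placed min_shift_hz above or below the pitch of the
    previously selected (default) unit, and only candidates that differ from
    the default by at least that amount in that direction are considered.
    """
    default = select_unit(candidates, target_pitch_hz)
    alternatives = [default]
    for direction in (+1, -1):
        shifted_target = default["pitch_hz"] + direction * min_shift_hz
        pool = [
            u for u in candidates
            if direction * (u["pitch_hz"] - default["pitch_hz"]) >= min_shift_hz
        ]
        if pool:  # skip a direction when no unit is far enough away
            alternatives.append(select_unit(pool, shifted_target))
    return alternatives


# Hypothetical candidates for one target position.
candidates = [{"id": i, "pitch_hz": p}
              for i, p in enumerate([100.0, 112.0, 120.0, 131.0, 150.0])]
print([u["pitch_hz"] for u in pitch_shifted_alternatives(candidates, 118.0)])
```

The same pattern would apply to the duration variant of claim 4 by swapping the pitch field for a duration field.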

17. A text to speech processor for converting an input linguistic description into a speech waveform, said processor comprising:
- a deriving unit for deriving at least one target unit sequence corresponding to the input linguistic description;
- an assigning unit for assigning in a waveform unit database one or more waveform units to each target unit of the at least one target unit sequence;
- a selection unit for selecting the at least one target unit sequence a plurality of alternative unit sequences approximating the at least one target unit sequence, using the one or more waveform units assigned to each target unit of the at least one target unit sequence;
- a concatenating unit for concatenating the alternative waveform unit sequences to form alternative speech waveforms; and
- a presenting unit for presenting the alternative speech waveforms to an operating person and enabling the choice of one of the presented alternative speech waveforms.
- Dependent claims (18):
  - 18. The processor as claimed in claim 17, wherein the selecting unit is for selecting alternative waveform unit sequences with at least one minimal variation criteria.
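
Claims 6, 15, and 18 refer to a minimum distance or minimal variation between alternatives. One way to picture such a criterion is the greedy filter sketched below in Python. It walks a cost-ranked list and keeps a sequence only if it differs enough from every sequence already kept. The per-position pitch values, the distance measure, and the threshold are assumptions chosen for illustration; the claims do not specify them.

```python
def sequence_distance(a, b):
    """Largest per-position difference between two equal-length feature sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


def filter_by_min_variation(ranked, min_distance):
    """Keep sequences, in ranked order, that differ from all kept ones by min_distance."""
    kept = []
    for seq in ranked:
        if all(sequence_distance(seq, k) >= min_distance for k in kept):
            kept.append(seq)
    return kept


# Hypothetical per-unit pitch values (Hz) for four candidate sequences, best first.
ranked = [(120, 118), (121, 118), (120, 130), (110, 118)]
print(filter_by_min_variation(ranked, min_distance=5))
```

In this toy run the second sequence is dropped because it is nearly identical to the first, so the operator would not be asked to compare two alternatives that sound the same.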

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: SVOX AG (Microsoft Corporation)
Inventors: Wouters, Johan; Traber, Christof; Keller, Jürgen; Riedi, Marcel; Reber, Martin
Primary Examiner(s): AZAD, ABUL K
Application Number: US11/709,056
Publication Number: US 20090076819A1
Time in Patent Office: 1,601 Days
Field of Search: 704/258-269
US Class Current: 704/268
CPC Class Codes:
G10L 13/033 Voice editing, e.g. manipul...
G10L 13/07 Concatenation rules
