Speech synthesis

US 20040172249A1
Filed: 04/19/2004
Published: 09/02/2004
Est. Priority Date: 05/25/2001
Status: Abandoned Application

First Claim

1. A method of producing synthesised speech from a text, comprising:

(a) providing a database of diphones derived from samples of natural speech;

(b) analysing the text to render the text as a succession of target diphones;

(c) identifying, for each target diphone, the value of each of a number of predetermined diphone features;

(d) identifying in the database diphones which are potential matches to each target diphone;

(e) establishing a target cost for each of said predetermined features of each potential database diphone in relation to each target diphone;

(f) modifying the target cost of each feature in accordance with predetermined factors associated with said diphone features; and

(g) calculating the least-cost combination to achieve output speech corresponding to the text.
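
Steps (e) and (f) can be pictured with a small sketch. The feature names, the absolute-difference mismatch measure, and the weight values below are illustrative assumptions, not taken from the application. The "predetermined factors" of step (f) are modelled here as plain per-feature weights, which is only one of the options the claims mention.

```python
# Hypothetical feature names and weights; the application does not fix these.
FEATURE_WEIGHTS = {"stress": 2.0, "pitch": 1.0, "duration": 0.5}

def feature_costs(target, candidate):
    """Step (e): one raw cost per feature (assumed: absolute difference)."""
    return {name: abs(target[name] - candidate[name]) for name in FEATURE_WEIGHTS}

def weighted_target_cost(target, candidate, weights=FEATURE_WEIGHTS):
    """Step (f): scale each feature cost by its factor, then sum the results."""
    raw = feature_costs(target, candidate)
    return sum(weights[name] * cost for name, cost in raw.items())

target = {"stress": 1, "pitch": 120.0, "duration": 80.0}
candidate = {"stress": 0, "pitch": 110.0, "duration": 90.0}
total = weighted_target_cost(target, candidate)  # 2.0*1 + 1.0*10 + 0.5*10
```

Keeping the per-feature costs separate until the final sum is what allows each feature to be modified independently before the least-cost search of step (g).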

Abstract

The invention makes use of a database of diphones derived from natural speech. A text is rendered as a series of target diphones and for each of these a number of predetermined diphone features are identified. Potential matches from the database are identified and a target cost for each of these features is established. The target costs are modified before selecting a least-cost combination. The modification of the target costs may be done by weighting, or by use of distribution functions. The calculation of the least-cost combination may be performed by a dynamic search program such as a Viterbi search. In the preferred embodiments, diphone join costs are also included in the least-cost calculation, and are also modified before the calculation is made. In addition to, or instead of, modification of target costs, the potential matches may be pre-pruned to identify a predetermined number of potential matches in descending order of suitability.
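
The least-cost calculation described in the abstract can be sketched as a Viterbi-style dynamic search over candidate diphones. This is a minimal illustration under stated assumptions, not the applicants' implementation: the function name `viterbi_select`, the candidate identifiers, and the cost values are all hypothetical, and the target and join costs are assumed to have been modified already before the search runs.

```python
def viterbi_select(candidates, target_cost, join_cost):
    """Pick one candidate per position, minimising total target + join cost.

    candidates: non-empty list (one entry per target diphone) of non-empty
        lists of hashable candidate ids.
    target_cost(i, c): target cost of candidate c at position i.
    join_cost(p, c): cost of joining candidate p to its successor c.
    Returns (total_cost, path).
    """
    # best[c] = (cost of the cheapest path ending in c, that path)
    best = {c: (target_cost(0, c), [c]) for c in candidates[0]}
    for i in range(1, len(candidates)):
        new_best = {}
        for c in candidates[i]:
            # Cheapest way to arrive at c from any candidate at position i - 1.
            prev_cost, prev_path = min(
                ((cost + join_cost(p, c), path) for p, (cost, path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[c] = (prev_cost + target_cost(i, c), prev_path + [c])
        best = new_best
    return min(best.values(), key=lambda item: item[0])

# Hypothetical two-diphone example with two database candidates each.
tc = {"a1": 1.0, "a2": 0.5, "b1": 1.0, "b2": 0.2}
jc = {("a1", "b1"): 0.0, ("a1", "b2"): 2.0, ("a2", "b1"): 1.0, ("a2", "b2"): 3.0}
cost, path = viterbi_select(
    [["a1", "a2"], ["b1", "b2"]],
    target_cost=lambda i, c: tc[c],
    join_cost=lambda p, c: jc[(p, c)],
)
```

In this example the candidates with the lowest individual target costs are `a2` and `b2`, yet the search selects `a1` followed by `b1` because their join is free. That is the reason for including join costs in the least-cost calculation rather than choosing each diphone in isolation.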

21 Claims

1. A method of producing synthesised speech from a text, comprising:
- (a) providing a database of diphones derived from samples of natural speech;
  
  (b) analysing the text to render the text as a succession of target diphones;
  
  (c) identifying, for each target diphone, the value of each of a number of predetermined diphone features;
  
  (d) identifying in the database diphones which are potential matches to each target diphone;
  
  (e) establishing a target cost for each of said predetermined features of each potential database diphone in relation to each target diphone;
  
  (f) modifying the target cost of each feature in accordance with predetermined factors associated with said diphone features; and
  
  (g) calculating the least-cost combination to achieve output speech corresponding to the text.
- - 2. A method according to claim 1, including evaluating the join cost of joining each diphone to its successor, and including the join costs in the least-cost calculation.
  - 3. A method according to claim 2, in which the join costs are also modified in accordance with predetermined features of one or both of the target diphone and candidate diphone.
  - 4. A method according to claim 3, in which the modification of diphone feature costs and join costs is effected using a simple weighting procedure.
  - 5. A method according to claim 3, in which the modification of diphone feature costs and join costs makes use of distribution functions.
  - 6. A method according to claim 5, in which the cost is modified according to a cost function which is V-shaped, and the zero-cost point is located using the centroid of a pre-established probability distribution.
  - 7. A method according to claim 6, in which the slope of the V is modified in dependence on the variance of the probability distribution.
  - 8. A method according to claim 5, in which the cost is modified according to a cost function which is the inverse of a pre-established probability distribution.
  - 9. A method according to any preceding claim, in which calculation of the least-cost combination is performed by a dynamic search program.
  - 10. A method according to claim 9, in which the dynamic search program is a Viterbi search.
  - 11. A method according to any preceding claim and including the step of pre-pruning candidate diphones on the basis of categorical features.
  - 12. A method according to claim 11, in which the pre-pruning step makes use of a decision tree working on predetermined categorical features of the candidate diphones.
  - 13. A method according to claim 12, in which said diphone features are one or more of phonetic, prosodic, linguistic, and acoustic features.
  - 14. A method according to claim 13, in which said features are one or more of:
    - word
    - syllable
    - adjacent word pair
    - stress
    - duration
    - pitch
    - intonation contour
    - position in sentence
    - text type
    - text subject matter
  - 15. A method according to any of claims 11 to 14, in which the pre-pruning step assigns values based on suitability to the target diphones, and in which said pre-pruning values are used in assigning target costs.
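
The V-shaped cost function of claims 6 and 7 can be illustrated as follows. The claims say only that the zero-cost point is located using the centroid of a pre-established distribution and that the slope depends on its variance. The specific relationship used here (slope inversely proportional to the standard deviation), the helper name `make_v_cost`, and the sample values are assumptions made for the sketch.

```python
import statistics

def make_v_cost(samples, base_slope=1.0):
    """Build a V-shaped cost function from observed values of one feature.

    Zero cost sits at the centroid (mean) of the samples. The slope is
    base_slope divided by the population standard deviation: an assumed
    choice, so that a tightly clustered feature is penalised more steeply
    for deviating from its centroid.
    """
    centre = statistics.fmean(samples)
    spread = statistics.pstdev(samples)
    slope = base_slope / spread if spread > 0 else base_slope
    return lambda value: slope * abs(value - centre)

# Hypothetical pitch observations (Hz) for two contexts with the same centroid.
narrow = make_v_cost([98, 100, 102])   # small variance, steep V
wide = make_v_cost([80, 100, 120])     # large variance, shallow V
```

Both functions return zero at 100, but `narrow(110)` is larger than `wide(110)`: the same deviation costs more where the natural-speech data shows little variation.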

16. A method of producing synthesised speech from a text, comprising:
- (a) providing a database of diphones derived from samples of natural speech;
  
  (b) analysing the text to render the text as a succession of target diphones;
  
  (c) identifying, for each target diphone, the value of each of a number of predetermined diphone features;
  
  (d) identifying in the database diphones which are potential matches to each target diphone;
  
  (e) pre-pruning said potential matches by means of sorting by category to identify a predetermined number of potential matches of descending order of suitability;
  
  (f) establishing a target cost for each of said predetermined features of each potential database diphone in relation to each target diphone; and
  
  (g) calculating the least-cost combination to achieve output speech corresponding to the text.
- - 17. A method according to claim 16, in which said pre-pruning is effected by means of a decision tree.
  - 18. A method according to claim 16 or claim 17, in which said pre-pruning step assigns values based on suitability to the target diphones, and in which said pre-pruning values are used in assigning target costs.
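
The pre-pruning of step (e) in claim 16 can be sketched as a sort on categorical agreement followed by truncation to a fixed number of candidates. The suitability measure used here (the count of categorical features that match the target), the function name `pre_prune`, and the feature names are assumptions for illustration; claim 17 contemplates a decision tree in place of this simple count.

```python
def pre_prune(target, candidates, categories, keep=3):
    """Return the `keep` most suitable candidates, most suitable first."""
    def suitability(candidate):
        # Assumed measure: number of categorical features agreeing with the target.
        return sum(candidate[c] == target[c] for c in categories)
    # sorted() is stable, so tied candidates keep their database order.
    return sorted(candidates, key=suitability, reverse=True)[:keep]

# Hypothetical target diphone and database candidates.
target = {"stress": "primary", "position": "final"}
database = [
    {"id": "d1", "stress": "none", "position": "initial"},
    {"id": "d2", "stress": "primary", "position": "final"},
    {"id": "d3", "stress": "primary", "position": "medial"},
    {"id": "d4", "stress": "none", "position": "final"},
]
kept = pre_prune(target, database, ["stress", "position"], keep=2)
```

Here `kept` holds `d2` (two matching categories) followed by `d3` (one match, and earlier in the database than the equally scored `d4`). Only these survivors would then be passed to the target-cost and least-cost steps, which bounds the size of the search.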

19. A system for producing synthesised speech from text, the system comprising:
- memory means storing a database of diphones derived from natural speech;
  
  processing means arranged to:
  
  (a) analyse the text to render the text as a succession of target diphones;
  
  (b) identify, for each target diphone, the value of each of a number of predetermined diphone features;
  
  (c) identify in the database diphones which are potential matches to each target diphone;
  
  (d) establish a target cost for each of said predetermined features of each potential database diphone in relation to each target diphone;
  
  (e) modify the target cost of each feature in accordance with predetermined factors associated with said diphone features; and
  
  (f) calculate the least-cost combination to achieve output speech corresponding to the text; and
  
  speech synthesis means operable to retrieve and concatenate the diphones identified as constituting said least cost combination.
- - 21. A data carrier holding software adapted to cause a processing means to operate steps (a)-(f) of claim 19 or claim 20.

20. A system for producing synthesised speech from text, the system comprising:
- memory means storing a database of diphones derived from natural speech;
  
  processing means arranged to:
  
  (a) analyse the text to render the text as a succession of target diphones;
  
  (b) identify, for each target diphone, the value of each of a number of predetermined diphone features;
  
  (c) identify in the database diphones which are potential matches to each target diphone;
  
  (d) pre-prune said potential matches by means of sorting by category to identify a predetermined number of potential matches of descending order of suitability;
  
  (e) establish a target cost for each of said predetermined features of each potential database diphone in relation to each target diphone; and
  
  (f) calculate the least-cost combination to achieve output speech corresponding to the text; and
  
  speech synthesis means operable to retrieve and concatenate the diphones identified as constituting said least cost combination.

Current Assignee: Rhetorical Group Plc (The Registrant)
Original Assignee: Rhetorical Group Plc (The Registrant)
Inventors: Aylett, Matthew Peter; Fackrell, Justin Wynford Andrew; Taylor, Paul Alexander
Application Number: US10/478,348
Publication Number: US 20040172249A1
US Class Current: 704/260
CPC Class Codes:
G10L 13/04 Details of speech synthesis...
G10L 13/07 Concatenation rules
