Realistic Speech Synthesis System

US 20130289998A1
Filed: 03/15/2013
Published: 10/31/2013
Est. Priority Date: 04/30/2012
Status: Active Grant

Abstract

A system and method for realistic speech synthesis that converts text into synthetic human speech with qualities appropriate to the context, such as the language and dialect of the speaker, and that expands a speaker's phonetic inventory to produce more natural-sounding speech.

32 Claims

1. A method of synthesizing speech from text, comprising the steps of:
- selecting one or more scenario parameters;
  
  inputting text parsed into corresponding phonetic components;
  
  merging said phonetic components with breathing and non-speech effects to produce a transcript of phoneme segment strings;
  
  producing prosody contour data from said one or more scenario parameters and said transcript of phoneme segment strings;
  
  producing stitched filter data from said one or more scenario parameters and said transcript of phoneme segment strings;
  
  synthesizing speech from said stitched filter data and said prosody contour data; and
  
  outputting said synthesized speech from a playback device.
- - 2. The method according to claim 1, wherein the one or more scenario parameters are selected from the group consisting of language, dialect, accent, phonetic reduction, domain, context, and a single speaker.
  - 3. The method according to claim 1, wherein the one or more scenario parameters are selected by a user.
  - 4. The method according to claim 1, wherein a user provides the prosody contour.
  - 5. The method according to claim 1, wherein producing said stitched filter data comprises:
    - receiving said text parsed into corresponding phonetic components;
      
      matching each corresponding phonetic component with a corresponding signal feature candidate;
      
      identifying within the corresponding signal feature candidates each pair of adjacent signal feature candidates;
      
      modifying the formant features of each phonetic component within the pair of adjacent signal feature candidates such that the first candidate within the pair transitions smoothly to the second phonetic candidate within the pair.
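
The formant modification recited in claim 5 can be illustrated with a minimal sketch. It assumes each signal feature candidate is a list of frames and each frame is a list of formant frequencies in Hz; the names `smooth_pair` and `stitch`, the `span` parameter, and the linear blend toward the boundary midpoint are all hypothetical choices for illustration, not the method disclosed in the patent.

```python
from typing import List, Tuple

Frame = List[float]  # formant frequencies in Hz, e.g. [F1, F2]


def smooth_pair(first: List[Frame], second: List[Frame],
                span: int = 2) -> Tuple[List[Frame], List[Frame]]:
    """Return copies of two adjacent candidates with formants blended at the join."""
    a = [f[:] for f in first]
    b = [f[:] for f in second]
    n = min(span, len(a), len(b))
    if n == 0:
        return a, b
    # Assumed target at the join: midpoint of the two boundary frames.
    target = [(x + y) / 2.0 for x, y in zip(a[-1], b[0])]
    for i in range(n):
        w = (n - i) / n  # 1.0 at the join, shrinking with distance from it
        ia = len(a) - 1 - i
        a[ia] = [x + w * (t - x) for x, t in zip(a[ia], target)]
        b[i] = [x + w * (t - x) for x, t in zip(b[i], target)]
    return a, b


def stitch(candidates: List[List[Frame]], span: int = 2) -> List[List[Frame]]:
    """Apply smooth_pair to every pair of adjacent candidates in order."""
    out = [[f[:] for f in c] for c in candidates]
    for k in range(len(out) - 1):
        out[k], out[k + 1] = smooth_pair(out[k], out[k + 1], span)
    return out


first = [[500.0, 1500.0], [600.0, 1600.0]]
second = [[800.0, 1200.0], [900.0, 1100.0]]
a, b = smooth_pair(first, second)
print(a[-1], b[0])  # both boundary frames now share the same formant values
```

A practical system would likely weight the blend by phonetic context rather than use a fixed span; the fixed span here only keeps the sketch short.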

6. A method for synthesizing speech from text, comprising the steps of:
- providing a computer having a first database and a second database stored in the memory thereof and in which data is stored, said data in said first database representing a set of signal feature candidates representative of a single speaker, and said data in said second database representing a second set of signal feature candidates;
  
  receiving a target set of phonetic components representative of text;
  
  analyzing said single speaker signal feature candidates from said first database to determine whether a corresponding single speaker signal feature candidate exists for each target phonetic component;
  
  retrieving from said second database a replacement signal feature candidate from said second set of signal feature candidates for any target phonetic component that does not have a corresponding single speaker signal feature candidate;
  
  synthesizing speech from at least one of said corresponding single signal feature candidates and said replacement signal feature candidates.
- - 7. The method according to claim 6, further comprising the step of:
    - modifying said replacement signal feature candidates such that the synthesized speech from the replacement signal feature candidates resembles the synthesized speech of the single speaker signal feature candidates.
  - 8. The method according to claim 7, wherein modifying comprises:
    - constructing a map of said single speaker signal feature candidates and corresponding signal feature candidates from said second set of signal feature candidates;
      
      training a system on said map capable of generalizing difference between said single speaker signal feature candidates and corresponding signal feature candidates from said second stored set of signal feature candidates;
      
      modifying said replacement phonetic component according to generalized difference represented in said system.
  - 9. The method according to claim 6, wherein said single speaker signal feature candidates and said replacement components are signal feature candidates representing diphones, each candidate from said single speaker signal feature candidates and said replacement components having a first steady-state portion, a transition portion, and a second steady-state portion.
  - 10. The method according to claim 9, wherein modifying comprises:
    - identifying said transition portion of said replacement diphones and said single speaker candidate;
      
      training a system on said transition portion, capable of generalizing features of said transition portion;
      
      generating a new transition portion according to said system;
      
      replacing said transition portion of said first steady-portion with said new transition portion.
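
The two-database lookup of claim 6 can be sketched as follows. The sketch assumes both databases are plain dictionaries keyed by phonetic component; `select_candidates`, the `"speaker"`/`"replacement"` labels, and the error on a missing unit are hypothetical and not taken from the patent.

```python
from typing import Dict, List, Tuple

Features = List[float]


def select_candidates(targets: List[str],
                      speaker_db: Dict[str, Features],
                      fallback_db: Dict[str, Features]
                      ) -> List[Tuple[str, str, Features]]:
    """Prefer the single speaker's candidate; fall back to the second set."""
    selected = []
    for unit in targets:
        if unit in speaker_db:
            selected.append((unit, "speaker", speaker_db[unit]))
        elif unit in fallback_db:
            # No single-speaker candidate exists: retrieve a replacement.
            selected.append((unit, "replacement", fallback_db[unit]))
        else:
            raise KeyError(f"no candidate in either database for {unit!r}")
    return selected


speaker_db = {"a-b": [1.0]}
fallback_db = {"a-b": [9.0], "b-c": [2.0]}
for unit, source, _ in select_candidates(["a-b", "b-c"], speaker_db, fallback_db):
    print(unit, source)
```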

11. A method for synthesizing speech from text, comprising the steps of:
- providing a computer having a first database and a second database stored in the memory thereof and in which data is stored, said data in said first database representing a set of signal feature candidates representative of a single speaker, and said data in said second database representing a second set of signal feature candidates;
  
  receiving a target set of phonetic components representative of text;
  
  analyzing said single speaker signal feature candidates from said first database to determine whether a corresponding single speaker signal feature candidate of sufficient quality exists for each target phonetic component;
  
  retrieving from said second database a replacement signal feature candidate from said second set of signal feature candidates for any target phonetic component that does not have a corresponding single speaker signal feature candidate of sufficient quality; and
  
  synthesizing speech from at least one of the corresponding single speaker signal feature candidates and the replacement signal feature candidates.
- - 12. The method according to claim 11, wherein sufficient quality is determined by:
    - receiving said single speaker signal feature candidates;
      
      identifying within the corresponding signal feature candidates each pair of adjacent signal feature candidates;
      
      measuring the cost of joining each said pair of adjacent signal feature candidates; and
      
      determining whether said cost is too high.
  - 13. The method according to claim 11, further comprising the step of:
    - modifying said replacement signal feature candidates such that the resulting synthesized speech from the replacement signal feature candidates resembles the resulting synthesized speech of the single speaker signal feature candidates.
  - 14. The method according to claim 13, wherein modifying further comprises:
    - constructing a map of said single speaker signal feature candidates and corresponding signal feature candidates from said second set of signal feature candidates;
      
      training a system on said map capable of generalizing difference between said single speaker signal feature candidates and corresponding signal feature candidates from said second set of signal feature candidates;
      
      modifying said replacement signal feature candidate according to generalized difference represented in said system.
  - 15. The method according to claim 13, wherein said single speaker signal feature candidates and said replacement components are signal feature candidates representing diphones;
    - each candidate from said single speaker phonetic components and said replacement components having a first steady-state portion, a transition portion, and a second steady-state portion.
  - 16. The method according to claim 15, wherein modifying further comprises:
    - identifying said transition portion of said replacement diphones and said single speaker feature candidates;
      
      training a system on said transition portion, capable of generalizing features of said transition portion;
      
      generating a new transition portion according to said system;
      
      replacing said transition portion of said first steady-portion with said new transition portion.
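
The quality test of claim 12 (measure the cost of joining each adjacent pair, then decide whether it is too high) might look like the sketch below. The claims do not define the cost function or the threshold, so the Euclidean distance between boundary frames and the `threshold` argument are assumptions made for illustration, as are the function names.

```python
import math
from typing import List

Frame = List[float]


def join_cost(left: List[Frame], right: List[Frame]) -> float:
    """Assumed cost: distance between the last frame of `left` and the first of `right`."""
    return math.dist(left[-1], right[0])


def flag_poor_joins(candidates: List[List[Frame]], threshold: float) -> List[int]:
    """Return the index of each pair (i, i + 1) whose join cost exceeds the threshold."""
    return [i for i in range(len(candidates) - 1)
            if join_cost(candidates[i], candidates[i + 1]) > threshold]


units = [[[0.0, 0.0]], [[3.0, 4.0]], [[3.0, 4.5]]]
print(flag_poor_joins(units, threshold=1.0))  # indices of joins judged too costly
```

Under this reading, a flagged pair would be a candidate for replacement from the second set of signal feature candidates.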

17. A non-transitory computer-readable storage medium containing program code comprising:
- program code for selecting one or more scenario parameters;
  
  program code for inputting text parsed into corresponding phonetic components;
  
  program code for merging said phonetic components with breathing and non-speech effects to produce a transcript of phoneme segment strings;
  
  program code for producing prosody contour data from said one or more scenario parameters and said transcript of phoneme segment strings;
  
  program code for producing stitched filter data from said one or more scenario parameters and said transcript of phoneme segment strings;
  
  program code for synthesizing speech from said stitched filter data and said prosody contour data;
  
  program code for outputting said synthesized speech from a playback device.
- - 18. The storage medium according to claim 17, wherein the one or more scenario parameters are selected from the group consisting of language, dialect, accent, phonetic reduction, domain, context, and a single speaker.
  - 19. The storage medium according to claim 17, wherein the one or more scenario parameters are selected by a user.
  - 20. The storage medium according to claim 17, wherein the user provides the prosody contour.
  - 21. The storage medium according to claim 17, wherein producing said stitched filter data further comprises:
    - program code for receiving said text parsed into corresponding phonetic components;
      
      program code for matching each phonetic component with a corresponding signal feature candidate;
      
      program code for identifying within the corresponding signal feature candidates each pair of adjacent signal feature candidates;
      
      program code for modifying the formant features of each signal feature candidate within the pair of adjacent signal feature candidates such that the first phonetic component within the pair transitions smoothly to the second phonetic component within the pair.
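
The order of the program-code steps in claim 17 can be sketched as a small pipeline. Every stage below is a toy stand-in: the `Scenario` fields, the `<breath>` marker, the declining pitch contour, and the placeholder filter records are hypothetical, and the final playback step is omitted. The sketch shows only how the data flows between stages, not how the patent implements any of them.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Scenario:
    language: str = "en"
    dialect: str = "general"


def merge_effects(phonemes: List[str], breath_every: int = 4) -> List[str]:
    """Insert a hypothetical breath marker after every `breath_every` phonemes."""
    transcript = []
    for i, p in enumerate(phonemes, start=1):
        transcript.append(p)
        if i % breath_every == 0 and i < len(phonemes):
            transcript.append("<breath>")
    return transcript


def prosody_contour(scenario: Scenario, transcript: List[str],
                    base_hz: float = 120.0) -> List[float]:
    """Toy contour: pitch declines linearly; non-speech effects get 0 Hz."""
    n = max(len(transcript) - 1, 1)
    return [0.0 if s.startswith("<") else base_hz * (1.0 - 0.2 * i / n)
            for i, s in enumerate(transcript)]


def stitched_filter_data(scenario: Scenario,
                         transcript: List[str]) -> List[Dict[str, str]]:
    """Toy filter data: one placeholder record per segment."""
    return [{"segment": s, "dialect": scenario.dialect} for s in transcript]


def synthesize(filters: List[Dict[str, str]],
               contour: List[float]) -> List[Tuple[str, float]]:
    """Pair each filter record with its pitch; a real system would render audio."""
    return [(f["segment"], hz) for f, hz in zip(filters, contour)]


scenario = Scenario(dialect="southern")
transcript = merge_effects(["HH", "AH", "L", "OW", "W", "ER", "L", "D"])
frames = synthesize(stitched_filter_data(scenario, transcript),
                    prosody_contour(scenario, transcript))
print(transcript)
```

Note that both the prosody contour and the filter data take the same two inputs (scenario parameters and transcript), which mirrors the wording of the claim.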

22. A non-transitory computer-readable storage medium containing program code, comprising:
- program code for receiving a target set of phonetic components representative of text;
  
  program code for analyzing a single speaker's signal feature candidates, said single speaker's signal feature candidates stored in a database, to determine whether a corresponding single speaker signal feature candidate exists for each said target phonetic component;
  
  program code for retrieving from a second set of signal feature candidates, said second set of signal feature candidates stored in a database, a replacement signal feature candidate for any target phonetic component that does not have a corresponding single speaker signal feature candidate;
  
  program code for synthesizing speech from at least one of said corresponding single speaker signal feature candidates and said replacement signal feature candidates.
- - 23. The storage medium according to claim 22, further comprising program code for modifying said replacement signal feature candidates such that the synthesized speech from the replacement signal feature candidates resembles the synthesized speech of the single speaker signal feature candidates.
  - 24. The storage medium according to claim 23, wherein said program code for modifying comprises:
    - program code for constructing a map of said single speaker signal feature candidates and corresponding signal feature candidates from said second set of signal feature candidates;
      
      program code for training a system on said map capable of generalizing difference between said single speaker signal feature candidates and corresponding signal feature candidates from said second stored set of signal feature candidates;
      
      program code for modifying said replacement signal feature candidate according to generalized difference represented in said system.
  - 25. The storage medium according to claim 22, wherein said single speaker signal feature candidates and said replacement components are signal feature candidates representing diphones, each candidate from said single speaker signal feature candidates and said replacement components having a first steady-state portion, a transition portion, and a second steady-state portion.
  - 26. The storage medium according to claim 25, wherein program code for modifying comprises:
    - program code for identifying said transition portion of said replacement candidates and said single speaker candidate;
      
      program code for training a system on said transition portion, capable of generalizing features of said transition portion;
      
      program code for generating a new transition portion according to said system;
      
      program code for replacing said transition portion of said first steady-portion with said new transition portion.
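
Claims 8 and 24 recite building a map of paired candidates, training a system that generalizes the difference between them, and modifying the replacement accordingly. The claims leave the "system" unspecified, so the sketch below substitutes the simplest possible model, a mean per-dimension offset; `learn_offset`, `adapt`, and the two-element feature vectors are assumptions for illustration only.

```python
from typing import List, Tuple

Vector = List[float]


def learn_offset(pairs: List[Tuple[Vector, Vector]]) -> Vector:
    """Mean per-dimension difference (single speaker minus second set) over the map."""
    dims = len(pairs[0][0])
    return [sum(s[d] - o[d] for s, o in pairs) / len(pairs) for d in range(dims)]


def adapt(replacement: Vector, offset: Vector) -> Vector:
    """Shift a replacement candidate toward the single speaker's feature space."""
    return [x + d for x, d in zip(replacement, offset)]


# Map of (single-speaker features, second-set features) for shared units.
pairs = [([510.0, 1500.0], [500.0, 1450.0]),
         ([620.0, 1300.0], [600.0, 1250.0])]
offset = learn_offset(pairs)
print(adapt([700.0, 1100.0], offset))
```

A trained regression model or neural network could take the place of the mean offset without changing the surrounding flow.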

27. A non-transitory computer-readable storage medium containing program code, comprising:
- program code for receiving a target set of phonetic components representative of text;
  
  program code for analyzing a single speaker's signal feature candidates, said single speaker's signal feature candidates stored in a database, to determine whether a corresponding single speaker signal feature candidate of sufficient quality exists for each said target phonetic component;
  
  program code for retrieving from a second set of signal feature candidates, said second set of signal feature candidates stored in a database, a replacement signal feature candidate for any target phonetic component that does not have a corresponding single speaker signal feature candidate of sufficient quality;
  
  program code for synthesizing speech from at least one of said corresponding single speaker signal feature candidates and said replacement signal feature candidates.
- - 28. The storage medium according to claim 27, wherein sufficient quality is determined by program code comprising:
    - program code for receiving said single speaker signal feature candidates;
      
      program code for identifying within the corresponding signal feature candidates each pair of adjacent signal feature candidates;
      
      program code for measuring the cost of joining each said pair of adjacent signal feature candidates; and
      
      program code for determining whether said cost is too high.
  - 29. The storage medium according to claim 27, further comprising program code for modifying said replacement signal feature candidates such that the synthesized speech from the replacement signal feature candidates resembles the synthesized speech of the single speaker signal feature candidates.
  - 30. The storage medium according to claim 29, wherein program code for modifying further comprises:
    - program code for constructing a map of said single speaker signal feature candidates and corresponding signal feature candidates from said second set of signal feature candidates;
      
      program code for training a system on said map capable of generalizing difference between said single speaker signal feature candidates and corresponding signal feature candidates from said second set of signal feature candidates;
      
      program code for modifying said replacement signal feature candidate according to generalized difference represented in said system.
  - 31. The storage medium according to claim 29, wherein said single speaker signal feature candidates and said replacement components are diphones;
    - each diphone from said single speaker signal feature candidates and said replacement components having a first steady-state portion, a transition portion, and a second steady-state portion.
  - 32. The storage medium according to claim 31, wherein program code for modifying further comprises:
    - program code for identifying said transition portion of said replacement diphones and said single speaker feature candidates;
      
      program code for training a system on said transition portion, capable of generalizing features of said transition portion;
      
      program code for generating a new transition portion according to said system;
      
      program code for replacing said transition portion of said first steady-portion with said new transition portion.
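
The diphone structure in claims 31 and 32 (first steady-state portion, transition portion, second steady-state portion) can be illustrated by regenerating the transition between the two steady-state portions. The claims call for a trained system to generate the new transition; the sketch below deliberately replaces that trained system with plain linear interpolation, and the dictionary layout and function names are hypothetical.

```python
from typing import Dict, List

Frame = List[float]
Diphone = Dict[str, List[Frame]]


def regenerate_transition(first_steady: List[Frame], second_steady: List[Frame],
                          n_frames: int) -> List[Frame]:
    """Interpolate n_frames between the end of one steady state and the start of the next."""
    start, end = first_steady[-1], second_steady[0]
    return [[s + (e - s) * (k / (n_frames + 1)) for s, e in zip(start, end)]
            for k in range(1, n_frames + 1)]


def replace_transition(diphone: Diphone, n_frames: int) -> Diphone:
    """Return a copy of the diphone with its transition portion regenerated."""
    new_transition = regenerate_transition(diphone["first_steady"],
                                           diphone["second_steady"], n_frames)
    return {**diphone, "transition": new_transition}


diphone = {"first_steady": [[500.0]],
           "transition": [[510.0], [790.0]],
           "second_steady": [[800.0]]}
print(replace_transition(diphone, n_frames=3)["transition"])
```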

Current Assignee: SRC, Inc.
Original Assignee: SRC, Inc.
Inventors: Eller, David Donald; Morphet, Steven Brian; Boyett, Watson Brent
Granted Patent: US 9,368,104 B2
US Class Current: 704/260
CPC Class Codes: G10L 13/08 Text analysis or generation...