Method and system for generating synthesized speech based on human recording

US 20070033049A1
Filed: 06/27/2006
Published: 02/08/2007
Est. Priority Date: 06/28/2005
Status: Active Grant

Abstract

A method and system that combine human recordings with a text-to-speech (TTS) system to generate high-quality synthesized speech by: searching a database of pre-recorded utterances to select the utterance that best matches the text content to be synthesized into speech; dividing the best-matched utterance into a plurality of segments, namely remaining segments that are the same as corresponding parts of the text content and difference segments that are different from corresponding parts of the text content; synthesizing speech for the parts of the text content corresponding to the difference segments; and splicing the synthesized speech segments with the remaining segments of the best-matched utterance.

16 Claims

1. A computer-implemented method for generating synthesized speech, comprising the steps of:
- searching over a database that contains pre-recorded utterances to select a best-matched pre-recorded utterance that best matches text content to be synthesized into speech;
  
  dividing the best-matched pre-recorded utterance into a plurality of segments comprising remaining segments that are the same as corresponding parts of the text content and difference segments that are different from corresponding parts of the text content;
  
  synthesizing speech for the parts of the text content corresponding to the difference segments to generate synthesized speech segments; and
  
  splicing the synthesized speech segments of the parts of the text content corresponding to the difference segments with the remaining segments of the selected pre-recorded utterance.
- Dependent claims (2, 3, 4, 5):
  - 2. The method according to claim 1, wherein the step of searching over a database comprises the steps of:
    - calculating an edit-distance between the text content and each pre-recorded utterance in the database;
      
      selecting the pre-recorded utterance with a minimum edit-distance as the best-matched pre-recorded utterance; and
      
      determining edit operations for converting the best-matched pre-recorded utterance into speech for the text content.
  - 3. The method according to claim 2, wherein calculating an edit-distance is performed as follows:
  - 4. The method according to claim 2, wherein the step of determining edit operations comprises:
    - determining editing locations and corresponding editing types.
  - 5. The method according to claim 4, wherein the step of dividing the best-matched pre-recorded utterance into a plurality of segments comprises:
    - according to the determined editing locations, chopping out edit segments to be edited from the best-matched pre-recorded utterance, wherein the edit segments are the difference segments and non-edit segments are the remaining segments.
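
The searching, edit-operation, and dividing steps recited in claims 1 to 5 can be illustrated with a minimal sketch. It assumes word-level tokens and the standard Levenshtein distance with unit costs; the edit-distance formula of claim 3 is not reproduced in this listing, so this is a stand-in and not necessarily the claimed formula. The names `edit_ops`, `best_match`, and `divide`, and the sample sentences, are hypothetical.

```python
from typing import List, Tuple

# (editing type, index in the utterance, index in the target text)
Op = Tuple[str, int, int]


def edit_ops(source: List[str], target: List[str]) -> Tuple[int, List[Op]]:
    """Return the unit-cost Levenshtein distance and the edit operations
    that turn `source` (utterance transcript) into `target` (input text)."""
    m, n = len(source), len(target)
    # dist[i][j] = cost of converting source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub,  # keep or substitute
                dist[i - 1][j] + 1,        # delete a word from the utterance
                dist[i][j - 1] + 1,        # insert a word into the utterance
            )
    # Backtrace to recover editing locations and editing types.
    ops: List[Op] = []
    i, j = m, n
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and source[i - 1] == target[j - 1]
        if same and dist[i][j] == dist[i - 1][j - 1]:
            ops.append(("keep", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("substitute", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, j))
            i -= 1
        else:
            ops.append(("insert", i, j - 1))
            j -= 1
    ops.reverse()
    return dist[m][n], ops


def best_match(text: str, database: List[str]) -> Tuple[str, List[Op]]:
    """Select the utterance whose transcript has the minimum edit distance."""
    target = text.split()
    scored = [(edit_ops(u.split(), target), u) for u in database]
    (_, ops), utterance = min(scored, key=lambda item: item[0][0])
    return utterance, ops


def divide(ops: List[Op]) -> List[Tuple[str, List[Op]]]:
    """Group consecutive operations into 'remaining' (kept) segments
    and 'difference' (edited) segments."""
    segments: List[Tuple[str, List[Op]]] = []
    for op in ops:
        kind = "remaining" if op[0] == "keep" else "difference"
        if segments and segments[-1][0] == kind:
            segments[-1][1].append(op)
        else:
            segments.append((kind, [op]))
    return segments


# Hypothetical database of utterance transcripts and input text.
database = ["the flight to Boston departs at nine", "your order has shipped"]
utterance, ops = best_match("the flight to Chicago departs at nine", database)
segments = divide(ops)
```

In this example only the word "Boston" falls into a difference segment; the words on either side form remaining segments that would be reused from the recording.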

6. A system for generating synthesized speech, comprising:
- a speech database for storing pre-recorded utterances;
  
  a text input device for inputting text content to be synthesized into speech;
  
  a searching means for searching over the speech database to select a best-matched pre-recorded utterance that best matches the inputted text content;
  
  a speech splicing means for dividing the best-matched pre-recorded utterance into a plurality of segments to generate remaining segments that are the same as corresponding parts of the text content and difference segments that are different from corresponding parts of the text content;
  
  synthesizing speech for parts of the inputted text content corresponding to the difference segments to generate synthesized speech segments; and
  
  splicing the synthesized speech segments with the remaining segments to generate synthesized speech; and
  
  a speech output device for outputting the synthesized speech corresponding to the inputted text content.
- Dependent claims (7, 8, 9, 10, 11):
  - 7. The system according to claim 6, wherein the searching means further comprises:
    - a calculating unit for calculating edit-distances between text content and each pre-recorded utterance in the speech database;
      
      a selecting unit for selecting the pre-recorded utterance with minimum edit-distance as the best-matched utterance; and
      
      a determining unit for determining edit operations for converting the best-matched pre-recorded utterance into speech for the text content.
  - 8. The system according to claim 7, wherein the calculating unit calculates an edit-distance as follows:
  - 9. The system according to claim 7, wherein the determining unit comprises a unit for determining editing locations and corresponding editing types.
  - 10. The system according to claim 9, wherein the speech splicing means chops out segments to be edited from the best-matched pre-recorded utterance according to the determined editing locations, wherein edit segments to be edited are the difference segments and non-edit segments are the remaining segments.
  - 11. The system according to claim 6, wherein the speech splicing means further comprises:
    - a dividing unit for dividing the best-matched pre-recorded utterance into a plurality of remaining segments and difference segments;
      
      a speech synthesizing unit for synthesizing speech for the parts of the inputted text content corresponding to the difference segments to generate synthesized speech segments; and
      
      a splicing unit for splicing the synthesized speech segments with the remaining segments.
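
The data flow through the synthesizing and splicing units of claim 11 might be sketched as follows. The claims do not say how segment boundaries are located in the audio or how joins are smoothed, so this sketch assumes that sample offsets are already known and uses plain concatenation. The types `Remaining` and `Difference`, the function `splice`, and the placeholder synthesizer `fake_tts` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

Audio = List[float]  # mono samples; a stand-in for real waveform data


@dataclass
class Remaining:
    start: int  # first sample of the reused span in the recording
    end: int    # one past the last sample of the reused span


@dataclass
class Difference:
    text: str   # text that the recording does not cover


Segment = Union[Remaining, Difference]


def splice(recording: Audio, plan: List[Segment],
           synthesize: Callable[[str], Audio]) -> Audio:
    """Concatenate reused recording spans and synthesized spans in order."""
    output: Audio = []
    for segment in plan:
        if isinstance(segment, Remaining):
            output.extend(recording[segment.start:segment.end])
        else:
            output.extend(synthesize(segment.text))
    return output


def fake_tts(text: str) -> Audio:
    """Placeholder synthesizer: two samples of silence per character."""
    return [0.0] * (2 * len(text))


# Assumed inputs: a 10-sample recording and a plan from the dividing unit.
recording = [0.1] * 10
plan = [Remaining(0, 4), Difference("Chicago"), Remaining(7, 10)]
speech = splice(recording, plan, fake_tts)
```

Passing the synthesizer in as a callable keeps the splicing logic independent of any particular TTS engine.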

12. A program storage device readable by machine tangibly embodying a program of instructions executable by the machine for implementing a method for generating synthesized speech, wherein the method comprises the steps of:
- searching over a database that contains pre-recorded utterances to select a best-matched pre-recorded utterance that best matches text content to be synthesized into speech;
  
  dividing the best-matched pre-recorded utterance into a plurality of segments comprising remaining segments that are the same as corresponding parts of the text content and difference segments that are different from corresponding parts of the text content;
  
  synthesizing speech for the parts of the text content corresponding to the difference segments to generate synthesized speech segments; and
  
  splicing the synthesized speech segments of the parts of the text content corresponding to the difference segments with the remaining segments of the selected pre-recorded utterance.
- Dependent claims (13, 14, 15, 16):
  - 13. The device according to claim 12, wherein the step of searching over a database comprises the steps of:
    - calculating an edit-distance between the text content and each pre-recorded utterance in the database;
      
      selecting the pre-recorded utterance with a minimum edit-distance as the best-matched pre-recorded utterance; and
      
      determining edit operations for converting the best-matched pre-recorded utterance into speech for the text content.
  - 14. The device according to claim 13, wherein calculating an edit-distance is performed as follows:
  - 15. The device according to claim 13, wherein the step of determining edit operations comprises:
    - determining editing locations and corresponding editing types.
  - 16. The device according to claim 15, wherein the step of dividing the best-matched pre-recorded utterance into a plurality of segments comprises:
    - according to the determined editing locations, chopping out edit segments to be edited from the best-matched pre-recorded utterance, wherein the edit segments are the difference segments and non-edit segments are the remaining segments.

Current Assignee
Cerence Operating Company (Cerence Inc.)
Original Assignee
International Business Machines Corporation
Inventors
Qin, Yong; Zhu, Weibin; Zhang, Wei; Shen, Liqin

Granted Patent

US 7,899,672 B2
US Class Current

704/260
CPC Class Codes

G10L 13/04 Details of speech synthesis...