Method and system for generating synthesized speech based on human recording

US 7,899,672 B2
Filed: 06/27/2006
Issued: 03/01/2011
Est. Priority Date: 06/28/2005
Status: Active Grant

Abstract

A method and system that incorporates human recording with a TTS system to generate synthesized speech with high quality by searching over a database of pre-recorded utterances to select an utterance best matching text content to be synthesized into speech; dividing the best-matched utterance into a plurality of segments to generate remaining segments that are the same as corresponding parts of the text content and difference segments that are different from corresponding parts of the text content; synthesizing speech for the parts of the text content corresponding to the difference segments; and splicing the synthesized speech segments with the remaining segments of the best-matched utterance.
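
The four steps in the abstract (select, divide, synthesize, splice) can be illustrated with a minimal sketch. It works on text only and returns a plan that says which words come from the recording and which must be synthesized. The function name `plan_synthesis` and the plan labels are hypothetical, and the sketch uses Python's standard `difflib.SequenceMatcher` as a stand-in for the matching step; it is not the matching procedure of the patent.

```python
from difflib import SequenceMatcher


def plan_synthesis(input_text, recorded_texts):
    """Return (best_text, plan); plan is a list of (source, words) steps.

    Assumption: word-level similarity from difflib stands in for the
    "degree of matching" between the input text and the recorded texts.
    """
    target = input_text.split()
    # Select the recorded text whose word sequence is most similar to the input.
    best = max(
        recorded_texts,
        key=lambda t: SequenceMatcher(None, t.split(), target).ratio(),
    )
    source = best.split()
    plan = []
    # Divide the best match into matching ("remaining") and differing parts.
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, source, target).get_opcodes():
        if tag == "equal":
            plan.append(("recorded", source[i1:i2]))
        elif j2 > j1:
            # Replaced or inserted words have no recording and need TTS.
            plan.append(("synthesize", target[j1:j2]))
        # A pure deletion only drops recorded words, so nothing is emitted.
    return best, plan


recorded = [
    "the flight to Boston leaves at nine",
    "your balance is ten dollars",
]
best, plan = plan_synthesis("the flight to Denver leaves at nine", recorded)
for source, words in plan:
    print(source, " ".join(words))
```

In this example only the word "Denver" is marked for synthesis; the rest of the sentence is reused from the first recording.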

15 Claims

1. A computer-implemented method for generating synthesized speech from input text, the method comprising:
- selecting a best-matched pre-recorded utterance from a plurality of pre-recorded utterances, wherein the selecting is based, at least in part, on a degree of matching between the input text and texts associated with the plurality of pre-recorded utterances;
  
  dividing the best-matched pre-recorded utterance into a plurality of segments comprising remaining segments that match corresponding parts of the input text and difference segments that do not match corresponding parts of the input text;
  
  synthesizing speech for parts of the input text corresponding to the difference segments in the selected best-matched pre-recorded utterance to generate synthesized speech segments; and
  
  splicing the synthesized speech segments of the parts of the input text corresponding to the difference segments with the remaining segments of the selected best-matched pre-recorded utterance to generate the synthesized speech for the input text.
- Dependent claims (2, 3, 4, 5):
  - 2. The method according to claim 1, wherein selecting a best-matched pre-recorded utterance comprises:
    - calculating an edit-distance between the input text and each of the plurality of pre-recorded utterances;
      
      selecting the pre-recorded utterance with a minimum edit-distance as the best-matched pre-recorded utterance; and
      
      determining at least one edit operation for converting the best-matched pre-recorded utterance into the synthesized speech for the input text.
  - 3. The method according to claim 2, wherein calculating an edit-distance is performed as follows:
  - 4. The method according to claim 2, wherein determining at least one edit operation comprises:
    - determining at least one editing location and at least one corresponding editing type.
  - 5. The method according to claim 4, wherein dividing the best-matched pre-recorded utterance into a plurality of segments comprises:
    - according to the determined at least one editing location, chopping out at least one edit segment to be edited from the best-matched pre-recorded utterance, wherein the difference segments include the at least one edit segment.
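
As an illustration of the selection step in claim 2, the sketch below computes a word-level edit distance and picks the recorded text with the smallest value. The formula referenced in claim 3 is not reproduced in this text, so the code assumes the standard Levenshtein distance with unit costs for insertion, deletion and substitution. The names `edit_distance` and `select_best_match` are hypothetical.

```python
def edit_distance(source, target):
    """Levenshtein distance between two sequences (here, lists of words).

    Assumption: unit cost for insert, delete and substitute.
    """
    prev = list(range(len(target) + 1))  # distances from an empty source
    for i, s in enumerate(source, start=1):
        curr = [i]  # distance from source[:i] to an empty target
        for j, t in enumerate(target, start=1):
            cost = 0 if s == t else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete s
                    curr[j - 1] + 1,     # insert t
                    prev[j - 1] + cost,  # keep or substitute
                )
            )
        prev = curr
    return prev[-1]


def select_best_match(input_text, recorded_texts):
    """Return the recorded text with the minimum edit distance to the input."""
    target = input_text.split()
    return min(recorded_texts, key=lambda t: edit_distance(t.split(), target))


recorded = [
    "the flight to Boston leaves at nine",
    "your balance is ten dollars",
]
print(select_best_match("the flight to Denver leaves at nine", recorded))
```

Working at word level keeps each edit aligned with a unit that can be cut out of a recording; a real system might instead compare syllables or phonemes.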

6. A system for generating synthesized speech for input text, the system comprising:
- at least one storage device comprising a plurality of pre-recorded utterances; and
  
  at least one computer configured to:
  
  select a best-matched pre-recorded utterance from a plurality of pre-recorded utterances, wherein the selecting is based, at least in part, on a degree of matching between the input text and texts associated with the plurality of pre-recorded utterances;
  
  divide the best-matched pre-recorded utterance into a plurality of segments comprising remaining segments that match corresponding parts of the input text and difference segments that do not match corresponding parts of the input text;
  
  synthesize speech for parts of the input text corresponding to the difference segments in the selected best-matched pre-recorded utterance to generate synthesized speech segments; and
  
  splice the synthesized speech segments with the remaining segments to generate synthesized speech for the input text.
- Dependent claims (7, 8, 9, 10):
  - 7. The system according to claim 6, wherein the at least one computer is further configured to:
    - calculate an edit-distance between the input text and each of the plurality of pre-recorded utterances in the at least one storage device;
      
      select the pre-recorded utterance with minimum edit-distance as the best-matched utterance; and
      
      determine at least one edit operation for converting the best-matched pre-recorded utterance into the synthesized speech for the input text.
  - 8. The system according to claim 7, wherein the edit-distance is calculated as follows:
  - 9. The system according to claim 7, wherein determining at least one edit operation comprises determining at least one editing location and at least one corresponding editing type.
  - 10. The system according to claim 9, wherein the at least one computer is further configured to:
    - chop out at least one edit segment to be edited from the best-matched pre-recorded utterance according to the determined at least one editing location, wherein the difference segments include the at least one edit segment.
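
Claims 9 and 10 refer to editing locations and editing types. One way to obtain them, sketched below, is to fill the full edit-distance table and trace back through it. The representation is an assumption: a location is a word index in the recorded utterance, and a type is one of the strings "substitute", "delete" or "insert". The function name `edit_operations` is hypothetical.

```python
def edit_operations(source, target):
    """Return a list of (location, edit_type, source_word, target_word).

    Assumptions: unit-cost Levenshtein alignment; location is an index
    into `source` (the word list of the recorded utterance).
    """
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    # Trace back from the bottom-right cell to recover one optimal edit script.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1  # words match, no edit needed
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            ops.append((i - 1, "substitute", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append((i - 1, "delete", source[i - 1], None))
            i -= 1
        else:
            # Insert the target word before position i of the recording.
            ops.append((i, "insert", None, target[j - 1]))
            j -= 1
    ops.reverse()
    return ops


print(edit_operations(
    "the flight to Boston leaves at nine".split(),
    "the flight to Denver leaves at nine".split(),
))
```

Each returned location marks where an edit segment would be chopped out of the recording; the words between edits are the remaining segments.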

11. A machine-readable program storage device tangibly embodying a program of instructions that, when executed by the machine, perform a method for generating synthesized speech from input text, the method comprising:
- selecting a best-matched pre-recorded utterance from a plurality of pre-recorded utterances, wherein the selecting is based, at least in part, on a degree of matching between the input text and texts associated with the plurality of pre-recorded utterances;
  
  dividing the best-matched pre-recorded utterance into a plurality of segments comprising remaining segments that match corresponding parts of the input text and difference segments that do not match corresponding parts of the input text;
  
  synthesizing speech for parts of the input text corresponding to the difference segments in the selected best-matched pre-recorded utterance to generate synthesized speech segments; and
  
  splicing the synthesized speech segments of the parts of the input text corresponding to the difference segments with the remaining segments of the selected best-matched pre-recorded utterance to generate the synthesized speech for the input text.
- Dependent claims (12, 13, 14, 15):
  - 12. The device according to claim 11, wherein selecting a best-matched pre-recorded utterance comprises:
    - calculating an edit-distance between the input text and each of the plurality of pre-recorded utterances;
      
      selecting the pre-recorded utterance with a minimum edit-distance as the best-matched pre-recorded utterance; and
      
      determining at least one edit operation for converting the best-matched pre-recorded utterance into the synthesized speech for the input text.
  - 13. The device according to claim 12, wherein calculating an edit-distance is performed as follows:
  - 14. The device according to claim 12, wherein determining at least one edit operation comprises:
    - determining at least one editing location and at least one corresponding editing type.
  - 15. The device according to claim 14, wherein dividing the best-matched pre-recorded utterance into a plurality of segments comprises:
    - according to the determined at least one editing location, chopping out at least one edit segment to be edited from the best-matched pre-recorded utterance, wherein the difference segments include the at least one edit segment.
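
The splicing step recited in the independent claims joins recorded and synthesized segments into one waveform. The claims do not say how the joints are handled, so the sketch below is only one plausible approach: it treats each segment as a list of float samples and applies a short linear cross-fade at every joint. The function `splice` and its `overlap` parameter are hypothetical.

```python
def splice(segments, overlap=4):
    """Concatenate sample lists, cross-fading `overlap` samples at each joint.

    Assumption: a linear cross-fade is used to soften the joints; the
    claims themselves only require that the segments be spliced.
    """
    out = []
    for seg in segments:
        seg = list(seg)
        # Never overlap more samples than either side actually has.
        n = min(overlap, len(out), len(seg))
        start = len(out) - n
        for k in range(n):
            w = (k + 1) / (n + 1)  # fade-in weight of the incoming segment
            out[start + k] = (1 - w) * out[start + k] + w * seg[k]
        out.extend(seg[n:])
    return out


recorded_head = [0.5] * 6     # stands in for a recorded segment
synthesized = [-0.5] * 6      # stands in for a synthesized segment
recorded_tail = [0.5] * 6
audio = splice([recorded_head, synthesized, recorded_tail], overlap=2)
print(len(audio))
```

With `overlap=0` the function reduces to plain concatenation, which makes the effect of the cross-fade easy to compare.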

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Zhu, Weibin; Zhang, Wei; Shen, Liqin; Qin, Yong
Primary Examiner(s): Armstrong; Angela A
Application Number: US11/475,820
Publication Number: US 20070033049A1
Time in Patent Office: 1,708 Days
Field of Search: None
US Class Current: 704/260
CPC Class Codes: G10L 13/04 Details of speech synthesis...
