Methods and apparatus for formant-based voice systems

US 8,447,592 B2
Filed: 09/13/2005
Issued: 05/21/2013
Est. Priority Date: 09/13/2005
Status: Active Grant

First Claim

1. A method of processing a voice signal to extract information to facilitate training a speech synthesis model for use with a formant-based text-to-speech synthesizer, the method comprising acts of:

detecting a plurality of candidate features in the voice signal;

grouping different combinations of the plurality of candidate features into a plurality of candidate feature sets;

forming a plurality of voice waveforms, each of the plurality of voice waveforms formed, at least in part, by processing a respective one of the plurality of candidate feature sets;

performing at least one comparison between the voice signal and each of the plurality of voice waveforms;

selecting at least one of the plurality of candidate feature sets based, at least in part, on the at least one comparison with the voice signal; and

using the selected at least one of the plurality of candidate feature sets to assist in training the speech synthesis model by incorporating and/or modifying at least one rule in the speech synthesis model, the at least one rule specifying how features should transition over time when synthesizing speech from a given text, wherein the speech synthesis model, when trained, is configured to synthesize the speech from the given text without using pre-recorded voice fragments.

Abstract

In one aspect, a method of processing a voice signal to extract information to facilitate training a speech synthesis model is provided. The method comprises acts of detecting a plurality of candidate features in the voice signal, performing at least one comparison between one or more combinations of the plurality of candidate features and the voice signal, and selecting a set of features from the plurality of candidate features based, at least in part, on the at least one comparison. In another aspect, the method is performed by executing a program encoded on a computer readable medium. In another aspect, a speech synthesis model is provided by, at least in part, performing the method.
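The selection loop summarized in the abstract can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the toy `synthesize` function (a sum of sinusoids), the squared-error comparison, the 8 kHz sample rate, and all function names are assumptions introduced here, and the candidate formant values are hypothetical.

```python
import itertools
import math

SAMPLE_RATE = 8000  # Hz; assumed for this sketch


def synthesize(formants, n_samples=256):
    """Toy stand-in for a formant synthesizer: one sinusoid per formant frequency."""
    return [
        sum(math.sin(2 * math.pi * f * t / SAMPLE_RATE) for f in formants)
        for t in range(n_samples)
    ]


def squared_error(a, b):
    """One possible comparison criterion: sum of squared sample differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_feature_set(voice_signal, candidate_formants, set_size=3):
    """Group candidates into sets, form a waveform from each set,
    compare it with the voice signal, and keep the closest set."""
    best_set, best_err = None, float("inf")
    for feature_set in itertools.combinations(sorted(candidate_formants), set_size):
        waveform = synthesize(feature_set, len(voice_signal))
        err = squared_error(voice_signal, waveform)
        if err < best_err:
            best_set, best_err = feature_set, err
    return best_set, best_err


# Hypothetical data: the signal is built from 500, 1500 and 2500 Hz;
# 900 and 3100 Hz play the role of spurious detections.
signal = synthesize((500, 1500, 2500))
candidates = [500, 900, 1500, 2500, 3100]
best, err = select_feature_set(signal, candidates)
print(best, err)
```

In this toy setting the spurious candidates are rejected because any set containing them produces a waveform that differs from the signal. A real system would presumably use a proper formant synthesizer and a perceptually motivated comparison.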

27 Claims

1. A method of processing a voice signal to extract information to facilitate training a speech synthesis model for use with a formant-based text-to-speech synthesizer, the method comprising acts of:
- detecting a plurality of candidate features in the voice signal;
  
  grouping different combinations of the plurality of candidate features into a plurality of candidate feature sets;
  
  forming a plurality of voice waveforms, each of the plurality of voice waveforms formed, at least in part, by processing a respective one of the plurality of candidate feature sets;
  
  performing at least one comparison between the voice signal and each of the plurality of voice waveforms;
  
  selecting at least one of the plurality of candidate feature sets based, at least in part, on the at least one comparison with the voice signal; and
  
  using the selected at least one of the plurality of candidate feature sets to assist in training the speech synthesis model by incorporating and/or modifying at least one rule in the speech synthesis model, the at least one rule specifying how features should transition over time when synthesizing speech from a given text, wherein the speech synthesis model, when trained, is configured to synthesize the speech from the given text without using pre-recorded voice fragments.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The method of claim 1, further comprising an act of converting the voice signal into a same format as the plurality of voice waveforms prior to performing the at least one comparison.
  - 3. The method of claim 1, wherein forming the plurality of voice waveforms includes forming the plurality of voice waveforms in a same format as the voice signal, and wherein the act of selecting the at least one of the plurality of candidate feature sets includes an act of selecting at least one of the plurality of candidate feature sets corresponding to a respective at least one of the plurality of voice waveforms that is most similar to the voice signal according to a first criteria, the selected one of the plurality of candidate feature sets being used to train, at least in part, the voice synthesis model.
  - 4. The method of claim 1, further comprising an act of segmenting the voice signal into a plurality of frames, each of the plurality of frames corresponding to a respective interval of the voice signal, and wherein the acts of:
    - detecting a plurality of candidate features includes an act of detecting a plurality of candidate features in each of the plurality of frames; and
      
      grouping the plurality of candidate features includes an act of grouping different combinations of the plurality of candidate features detected in each of the plurality of frames into a respective plurality of candidate feature sets, each of the plurality of candidate feature sets associated with one of the plurality of frames from which the corresponding plurality of candidates features was detected, and further grouping different combinations of the plurality of candidate feature sets to form a respective plurality of candidate feature tracts.
  - 5. The method of claim 4, wherein forming the plurality of voice waveforms includes forming the plurality of voice waveforms, each of the plurality of voice waveforms being formed, at least in part, from a respective one of the plurality of candidate feature tracts, and wherein the act of selecting the at least one of the plurality of candidate feature sets includes an act of selecting one of the plurality of candidate feature tracts associated with a respective one of the plurality of voice waveforms that is most similar to the voice signal according to the first criteria, the selected one of the plurality of feature tracts being used to train, at least in part, the voice synthesis model.
  - 6. The method of claim 4, wherein each of the plurality of feature tracts includes an associated candidate feature set from each of the plurality of frames.
  - 7. The method of claim 4, wherein the acts of:
    - detecting a plurality of candidate features in each of the plurality of frames includes an act of detecting at least one candidate formant; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features such that each of the plurality of candidate feature sets includes at least one value representative of the at least one candidate formant detected in the respective frame.
  - 8. The method of claim 7, wherein the acts of:
    - detecting includes an act of detecting a plurality of candidate formants; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features into the plurality of candidate feature sets for each of the plurality of frames such that each of the plurality of candidate feature sets includes at least one value representative of each of a first formant, a second formant and a third formant detected in the respective frame.
  - 9. The method of claim 8, wherein the act of detecting includes an act of detecting at least one additional feature selected from the group consisting of:
    - pitch, timbre, energy and spectral slope.
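The frame-based grouping in claims 4 through 6 can be pictured with a short sketch. It assumes non-overlapping frames, three formants per candidate feature set, and exhaustive enumeration of tracts; none of these choices is stated in the claims, and the function names and formant values are hypothetical.

```python
import itertools


def segment(signal, frame_len):
    """Split a signal into consecutive, non-overlapping frames (a short tail is dropped)."""
    return [
        signal[i:i + frame_len]
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def candidate_sets(candidates, set_size=3):
    """All combinations of one frame's candidates, e.g. (F1, F2, F3) triples."""
    return list(itertools.combinations(sorted(candidates), set_size))


def candidate_tracts(per_frame_candidates, set_size=3):
    """Each tract takes one candidate feature set from every frame."""
    per_frame_sets = [candidate_sets(c, set_size) for c in per_frame_candidates]
    return list(itertools.product(*per_frame_sets))


frames = segment(list(range(10)), 4)  # two frames of four samples each

# Hypothetical formant candidates (Hz) detected in each of the two frames.
per_frame = [[500, 1500, 2500, 3100], [520, 1480, 2550]]
tracts = candidate_tracts(per_frame)
print(len(tracts))
```

Exhaustive enumeration grows multiplicatively with the number of frames, so a practical system would likely prune or search the space of tracts rather than list every one.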

10. A computer readable medium encoded with a program for execution on at least one processor, the program, when executed on the at least one processor, performing a method of processing a voice signal to extract information to facilitate training a speech synthesis model for use with a formant-based text-to-speech synthesizer, the method comprising acts of:
- detecting a plurality of candidate features in the voice signal;
  
  grouping different combinations of the plurality of candidate features into a plurality of candidate feature sets;
  
  forming a plurality of voice waveforms, each of the plurality of voice waveforms formed, at least in part, by processing a respective one of the plurality of candidate feature sets;
  
  performing at least one comparison between the voice signal and each of the plurality of voice waveforms;
  
  selecting at least one of the plurality of candidate feature sets based, at least in part, on the at least one comparison with the voice signal; and
  
  using the selected at least one of the plurality of candidate feature sets to assist in training the speech synthesis model by incorporating and/or modifying at least one rule in the speech synthesis model, the at least one rule specifying how features should transition over time when synthesizing speech from a given text, wherein the speech synthesis model, when trained, is configured to synthesize the speech from the given text without using pre-recorded voice fragments.
- Dependent claims (11, 12, 13, 14, 15, 16, 17, 18):
  - 11. The computer readable medium of claim 10, further comprising an act of converting the voice signal into a same format as the plurality of voice waveforms prior to performing the at least one comparison.
  - 12. The computer readable medium of claim 10, wherein forming the plurality of voice waveforms includes forming the plurality of voice waveforms in a same format as the voice signal, and wherein the act of selecting the at least one of the plurality of candidate feature sets includes an act of selecting at least one of the plurality of candidate feature sets corresponding to a respective at least one of the plurality of voice waveforms that is most similar to the voice signal according to a first criteria, the selected one of the plurality of candidate feature sets being used to train, at least in part, the voice synthesis model.
  - 13. The computer readable medium of claim 10, further comprising an act of segmenting the voice signal into a plurality of frames, each of the plurality of frames corresponding to a respective interval of the voice signal, and wherein the acts of:
    - detecting a plurality of candidate features includes an act of detecting a plurality of candidate features in each of the plurality of frames; and
      
      grouping the plurality of candidate features includes an act of grouping different combinations of the plurality of candidate features detected in each of the plurality of frames into a respective plurality of candidate feature sets, each of the plurality of candidate feature sets associated with one of the plurality of frames from which the corresponding plurality of candidates features was detected, and further grouping different combinations of the plurality of candidate feature sets to form a respective plurality of candidate feature tracts.
  - 14. The computer readable medium of claim 13, wherein forming the plurality of voice waveforms includes forming the plurality of voice waveforms, each of the plurality of voice waveforms being formed, at least in part, from a respective one of the plurality of candidate feature tracts, and wherein the act of selecting the at least one of the plurality of candidate feature sets includes an act of selecting one of the plurality of candidate feature tracts associated with a respective one of the plurality of voice waveforms that is most similar to the voice signal according to the first criteria, the selected one of the plurality of feature tracts being used to train, at least in part, the voice synthesis model.
  - 15. The computer readable medium of claim 13, wherein each of the plurality of feature tracts includes an associated candidate feature set from each of the plurality of frames.
  - 16. The computer readable medium of claim 13, wherein the acts of:
    - detecting a plurality of candidate features in each of the plurality of frames includes an act of detecting at least one formant; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features such that each of the plurality of candidate feature sets includes at least one value representative of at least one candidate formant detected in the respective frame.
  - 17. The computer readable medium of claim 16, wherein the acts of:
    - detecting includes an act of detecting a plurality of candidate formants; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features into the plurality of candidate feature sets for each of the plurality of frames such that each of the plurality of candidate feature sets includes at least one value representative of each of a first formant, a second formant and a third formant detected in the respective frame.
  - 18. The computer readable medium of claim 17, wherein the act of detecting includes an act of detecting at least one additional feature selected from the group consisting of:
    - pitch, timbre, energy and spectral slope.

19. A computer readable medium encoded with a speech synthesis model for use with a formant-based text-to-speech synthesizer adapted to, when operating, generate human recognizable speech, the speech synthesis model trained to generate the human recognizable speech, at least in part, by performing acts of:
- detecting a plurality of candidate features in the voice signal;
  
  grouping different combinations of the plurality of candidate features into a plurality of candidate feature sets;
  
  forming a plurality of voice waveforms, each of the plurality of voice waveforms formed, at least in part, by processing a respective one of the plurality of candidate feature sets;
  
  performing at least one comparison between the voice signal and each of the plurality of voice waveforms;
  
  selecting at least one of the plurality of candidate feature sets based, at least in part, on the at least one comparison with the voice signal; and
  
  using the selected at least one of the plurality of candidate feature sets to assist in training the speech synthesis model by incorporating and/or modifying at least one rule in the speech synthesis model, the at least one rule specifying how features should transition over time when synthesizing speech from a given text, wherein the speech synthesis model, when trained, is configured to synthesize the speech from the given text without using pre-recorded voice fragments.
- Dependent claims (20, 21, 22, 23, 24, 25, 26, 27):
  - 20. The computer readable medium of claim 19, further comprising an act of converting the voice signal into a same format as the plurality of voice waveforms prior to performing the at least one comparison.
  - 21. The computer readable medium of claim 19, wherein forming the plurality of voice waveforms includes forming the plurality of voice waveforms in a same format as the voice signal, and wherein the act of selecting the at least one of the plurality of candidate feature sets includes an act of selecting at least one of the plurality of candidate feature sets corresponding to a respective at least one of the plurality of voice waveforms that is most similar to the voice signal according to a first criteria, the selected one of the plurality of candidate feature sets being used to train, at least in part, the voice synthesis model.
  - 22. The computer readable medium of claim 19, further comprising an act of segmenting the voice signal into a plurality of frames, each of the plurality of frames corresponding to a respective interval of the voice signal, and wherein the acts of:
    - detecting a plurality of candidate features includes an act of detecting a plurality of candidate features in each of the plurality of frames; and
      
      grouping the plurality of candidate features includes an act of grouping different combinations of the plurality of candidate features detected in each of the plurality of frames into a respective plurality of candidate feature sets, each of the plurality of candidate feature sets associated with one of the plurality of frames from which the corresponding plurality of candidates features was detected, and further grouping different combinations of the plurality of candidate feature sets to form a respective plurality of candidate feature tracts.
  - 23. The computer readable medium of claim 22, wherein forming the plurality of voice waveforms includes forming the plurality of voice waveforms, each of the plurality of voice waveforms being formed, at least in part, from a respective one of the plurality of candidate feature tracts, and wherein the act of selecting the at least one of the plurality of candidate feature sets includes an act of selecting one of the plurality of candidate feature tracts associated with a respective one of the plurality of voice waveforms that is most similar to the voice signal according to the first criteria, the selected one of the plurality of feature tracts being used to train, at least in part, the voice synthesis model.
  - 24. The computer readable medium of claim 22, wherein each of the plurality of feature tracts includes an associated candidate feature set from each of the plurality of frames.
  - 25. The computer readable medium of claim 22, wherein the acts of:
    - detecting a plurality of candidate features in each of the plurality of frames includes an act of detecting at least one formant; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features such that each of the plurality of candidate feature sets includes at least one value representative of at least one candidate formant detected in the respective frame.
  - 26. The computer readable medium of claim 25, wherein the acts of:
    - detecting includes an act of detecting a plurality of candidate formants; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features into the plurality of candidate feature sets for each of the plurality of frames such that each of the plurality of candidate feature sets includes at least one value representative of each of a first formant, a second formant and a third formant detected in the respective frame.
  - 27. The computer readable medium of claim 26, wherein the act of detecting includes an act of detecting at least one additional feature selected from the group consisting of:
    - pitch, timbre, energy and spectral slope.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Edgington, Michael D.; Gillick, Laurence; Cohen, Jordan R.
Primary Examiner(s): COLUCCI, MICHAEL C
Application Number: US11/225,524
Publication Number: US 20070061145A1
Time in Patent Office: 2,807 Days
Field of Search: 704/216, 704/260, 704/221, 704/258, 704/208, 704/209, 704/219, 704/222, 704/230, 704/240, 704/246, 704/251, 704/261, 704/9, 709/206, 709/203, 715/767, 379/282, 379/283
US Class Current: 704/207
CPC Class Codes:
- G10L 13/027   Concept to speech synthesis...
- G10L 13/033   Voice editing, e.g. manipul...
- G10L 25/15   the extracted parameters be...
