Methods and apparatus for formant-based voice systems

US 20070061145A1
Filed: 09/13/2005
Published: 03/15/2007
Est. Priority Date: 09/13/2005
Status: Active Grant

Abstract

In one aspect, a method of processing a voice signal to extract information to facilitate training a speech synthesis model is provided. The method comprises acts of detecting a plurality of candidate features in the voice signal, performing at least one comparison between one or more combinations of the plurality of candidate features and the voice signal, and selecting a set of features from the plurality of candidate features based, at least in part, on the at least one comparison. In another aspect, the method is performed by executing a program encoded on a computer readable medium. In another aspect, a speech synthesis model is provided by, at least in part, performing the method.
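The detect, compare, and select steps summarized above can be illustrated with a minimal Python sketch. It is not taken from the patent: the sinusoid-sum `synthesize` function, the squared-error measure, the 8 kHz sample rate, and all function names are assumptions chosen only to make the idea concrete. Candidate detection is assumed to have already produced a list of candidate formant frequencies.

```python
import itertools
import math

SAMPLE_RATE = 8000  # Hz; assumed for illustration


def synthesize(formants, n_samples):
    """Toy stand-in for a formant synthesizer: a sum of unit sinusoids."""
    return [
        sum(math.sin(2 * math.pi * f * t / SAMPLE_RATE) for f in formants)
        for t in range(n_samples)
    ]


def squared_error(a, b):
    """Assumed similarity measure: smaller means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_features(signal, candidates, set_size=3):
    """Compare every combination of candidates with the signal; keep the closest."""
    return min(
        itertools.combinations(candidates, set_size),
        key=lambda combo: squared_error(signal, synthesize(combo, len(signal))),
    )


# Hypothetical input: a signal built from three known frequencies, plus
# two spurious candidates that a detector might also report.
signal = synthesize((500, 1500, 2500), 256)
print(select_features(signal, [500, 900, 1500, 2100, 2500]))  # (500, 1500, 2500)
```

Trying every combination is only practical for a handful of candidates; a real system would likely prune the search, which this sketch does not attempt.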

36 Claims

1. A method of processing a voice signal to extract information to facilitate training a speech synthesis model, the method comprising acts of:
- detecting a plurality of candidate features in the voice signal;
  
  performing at least one comparison between one or more combinations of the plurality of candidate features and the voice signal; and
  
  selecting a set of features from the plurality of candidate features based, at least in part, on the at least one comparison.
- - 2. The method of claim 1, further comprising an act of grouping the plurality of candidate features into a plurality of candidate sets, and wherein the act of selecting the set of features includes an act of selecting at least one of the plurality of candidate sets.
  - 3. The method of claim 2, further comprising an act of converting each of the plurality of candidate sets into a respective voice waveform provided in a same format as the voice signal, and wherein the act of selecting the at least one of the plurality of candidate sets includes an act of selecting at least one of the plurality of candidate sets that is most similar to the voice signal according to first criteria, the selected one of the plurality of candidate sets being used to train, at least in part, the voice synthesis model.
  - 4. The method of claim 2, further comprising an act of converting the voice signal and each of the plurality of candidate sets into a same format, and wherein the act of selecting the at least one of the plurality of candidate sets includes an act of selecting at least one of the plurality of candidate sets that is most similar to the voice signal according to a first criteria, the selected one of the plurality of candidate sets being used to train, at least in part, the voice synthesis model.
  - 5. The method of claim 2, further comprising an act of segmenting the voice signal into a plurality of frames, each of the plurality of frames corresponding to a respective interval of the voice signal, and wherein the acts of:
    - detecting a plurality of candidate features includes an act of detecting a plurality of candidate features in each of the plurality of frames; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features detected in each of the plurality of frames into a respective plurality of candidate sets, each of the plurality of candidate sets associated with one of the plurality of frames from which the corresponding plurality of candidate features was detected, each of the plurality of frames being associated with at least one of the plurality of candidate sets.
  - 6. The method of claim 5, wherein the act of selecting the at least one of the plurality of candidate sets includes an act of selecting, for each of the plurality of frames, one of the candidate sets associated with the respective frame, the selected candidate sets forming a feature tract that represents a description of the voice signal, the feature tract being used to train, at least in part, the voice synthesis model.
  - 7. The method of claim 5, wherein the act of grouping the plurality of candidate features includes an act of forming a plurality of feature tracts, each of the plurality of feature tracts including an associated candidate set for each of the plurality of frames.
  - 8. The method of claim 7, wherein the act of performing a comparison includes an act of performing a comparison between the voice signal and each of the plurality of feature tracts.
  - 9. The method of claim 8, wherein the act of selecting includes an act of selecting, for use in training the voice synthesis model, a first feature tract from the plurality of feature tracts that is most similar to the voice signal according to first criteria.
  - 10. The method of claim 5, wherein the acts of:
    - detecting a plurality of candidate features in each of the plurality of frames includes an act of detecting at least one formant; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features such that each of the plurality of candidate sets includes at least one value representative of at least one candidate formant detected in a respective frame.
  - 11. The method of claim 10, wherein the acts of:
    - detecting includes an act of detecting a plurality of formants; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features into a plurality of candidate sets for each of the plurality of frames, wherein each of the plurality of candidate sets includes at least one value representative of each of a first formant, a second formant and a third formant detected in the respective frame.
  - 12. The method of claim 11, wherein the act of detecting includes an act of detecting at least one feature selected from the group consisting of:
    - pitch, timbre, energy and spectral slope.
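The frame-based variant in claims 5 through 9 (segment the signal into frames, keep several candidate sets per frame, form feature tracts with one candidate set per frame, and select the tract most similar to the signal) might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the sinusoid renderer, the squared-error distance, the non-overlapping frames, and every function name are hypothetical.

```python
import itertools
import math

SAMPLE_RATE = 8000  # Hz; assumed for illustration


def split_into_frames(signal, frame_len):
    """Cut the signal into consecutive, non-overlapping frames."""
    return [
        signal[i:i + frame_len]
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def render(candidate_set, n_samples):
    """Toy renderer: one unit sinusoid per formant value in the set."""
    return [
        sum(math.sin(2 * math.pi * f * t / SAMPLE_RATE) for f in candidate_set)
        for t in range(n_samples)
    ]


def distance(a, b):
    """Assumed similarity criterion: sum of squared sample differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_tract(signal, candidate_sets_per_frame, frame_len):
    """Form every feature tract and return the one closest to the signal."""
    frames = split_into_frames(signal, frame_len)

    def tract_cost(tract):
        return sum(
            distance(frame, render(candidate_set, frame_len))
            for frame, candidate_set in zip(frames, tract)
        )

    # Each tract picks exactly one candidate set for each frame.
    return min(itertools.product(*candidate_sets_per_frame), key=tract_cost)


# Hypothetical two-frame signal with two candidate sets per frame.
frame_len = 128
signal = render((500, 1500, 2500), frame_len) + render((700, 1200, 2600), frame_len)
candidates = [
    [(500, 1500, 2500), (500, 1500, 2600)],
    [(700, 1300, 2600), (700, 1200, 2600)],
]
print(select_tract(signal, candidates, frame_len))
# ((500, 1500, 2500), (700, 1200, 2600))
```

Because this toy cost is a plain sum over frames, choosing the best candidate set frame by frame (as in claim 6) would give the same answer here. Scoring whole tracts only matters when the criterion couples neighboring frames, and the number of tracts grows exponentially with the number of frames.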

13. A computer readable medium encoded with a program for execution on at least one processor, the program, when executed on the at least one processor, performing a method of processing a voice signal to extract information to facilitate training a speech synthesis model, the method comprising acts of:
- detecting a plurality of candidate features in the voice signal;
  
  performing at least one comparison between one or more combinations of the plurality of candidate features and the voice signal; and
  
  selecting a set of features from the plurality of candidate features based, at least in part, on the at least one comparison.
- - 14. The computer readable medium of claim 13, further comprising an act of grouping the plurality of candidate features into a plurality of candidate sets, and wherein the act of selecting the set of features includes an act of selecting at least one of the plurality of candidate sets.
  - 15. The computer readable medium of claim 14, further comprising an act of converting each of the plurality of candidate sets into a respective voice waveform provided in a same format as the voice signal, and wherein the act of selecting the at least one of the plurality of candidate sets includes an act of selecting at least one of the plurality of candidate sets that is most similar to the voice signal according to first criteria, the selected one of the plurality of candidate sets being used to train, at least in part, the voice synthesis model.
  - 16. The computer readable medium of claim 14, further comprising an act of converting the voice signal and each of the plurality of candidate sets into a same format, and wherein the act of selecting the at least one of the plurality of candidate sets includes an act of selecting at least one of the plurality of candidate sets that is most similar to the voice signal according to a first criteria, the selected one of the plurality of candidate sets being used to train, at least in part, the voice synthesis model.
  - 17. The computer readable medium of claim 14, further comprising an act of segmenting the voice signal into a plurality of frames, each of the plurality of frames corresponding to a respective interval of the voice signal, and wherein the acts of:
    - detecting a plurality of candidate features includes an act of detecting a plurality of candidate features in each of the plurality of frames; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features detected in each of the plurality of frames into a respective plurality of candidate sets, each of the plurality of candidate sets associated with one of the plurality of frames from which the corresponding plurality of candidate features was detected, each of the plurality of frames being associated with at least one of the plurality of candidate sets.
  - 18. The computer readable medium of claim 17, wherein the act of selecting the at least one of the plurality of candidate sets includes an act of selecting, for each of the plurality of frames, one of the candidate sets associated with the respective frame, the selected candidate sets forming a feature tract that represents a description of the voice signal, the feature tract being used to train, at least in part, the voice synthesis model.
  - 19. The computer readable medium of claim 17, wherein the act of grouping the plurality of candidate features includes an act of forming a plurality of feature tracts, each of the plurality of feature tracts including an associated candidate set for each of the plurality of frames.
  - 20. The computer readable medium of claim 19, wherein the act of performing a comparison includes an act of performing a comparison between the voice signal and each of the plurality of feature tracts.
  - 21. The computer readable medium of claim 20, wherein the act of selecting includes an act of selecting, for use in training the voice synthesis model, a first feature tract from the plurality of feature tracts that is most similar to the voice signal according to first criteria.
  - 22. The computer readable medium of claim 17, wherein the acts of:
    - detecting a plurality of candidate features in each of the plurality of frames includes an act of detecting at least one formant; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features such that each of the plurality of candidate sets includes at least one value representative of at least one candidate formant detected in a respective frame.
  - 23. The computer readable medium of claim 22, wherein the acts of:
    - detecting includes an act of detecting a plurality of formants; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features into a plurality of candidate sets for each of the plurality of frames, wherein each of the plurality of candidate sets includes at least one value representative of each of a first formant, a second formant and a third formant detected in the respective frame.
  - 24. The computer readable medium of claim 23, wherein the act of detecting includes an act of detecting at least one feature selected from the group consisting of:
    - pitch, timbre, energy and spectral slope.
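Claims 4, 16, and 28 compare the voice signal and each candidate set after converting both into a same format, rather than rendering each candidate set as a waveform. One possible reading, sketched below, uses a magnitude spectrum as the common format. Everything specific here is an assumption for illustration only: the naive DFT, the resonance-shaped envelope with a 100 Hz bandwidth, the peak normalization, the squared-difference criterion, and all function names.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive DFT magnitudes for the bins below the Nyquist frequency."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)))
        for k in range(n // 2)
    ]


def formant_envelope(formants, n_bins, bin_hz, bandwidth=100.0):
    """Toy spectral envelope: one resonance-shaped bump per formant (assumed shape)."""
    return [
        sum(1.0 / (1.0 + ((k * bin_hz - f) / bandwidth) ** 2) for f in formants)
        for k in range(n_bins)
    ]


def peak_normalize(values):
    """Scale so the largest value is 1, making the two formats comparable."""
    peak = max(values) or 1.0
    return [v / peak for v in values]


def select_by_spectrum(signal, candidate_sets, sample_rate):
    """Pick the candidate set whose envelope is closest to the signal's spectrum."""
    spectrum = peak_normalize(magnitude_spectrum(signal))
    bin_hz = sample_rate / len(signal)

    def distance(candidate_set):
        envelope = peak_normalize(formant_envelope(candidate_set, len(spectrum), bin_hz))
        return sum((s - e) ** 2 for s, e in zip(spectrum, envelope))

    return min(candidate_sets, key=distance)


# Hypothetical test signal with energy at 500, 1500, and 2500 Hz.
sample_rate = 8000
signal = [
    sum(math.sin(2 * math.pi * f * t / sample_rate) for f in (500, 1500, 2500))
    for t in range(128)
]
best = select_by_spectrum(signal, [(500, 1500, 2100), (500, 1500, 2500)], sample_rate)
print(best)
```

Comparing in a spectral format avoids synthesizing a waveform per candidate set, at the cost of choosing an envelope model; the one used here is deliberately crude.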

25. A computer readable medium encoded with a speech synthesis model adapted to, when operating, generate human recognizable speech, the speech synthesis model trained to generate the human recognizable speech, at least in part, by performing acts of:
- detecting a plurality of candidate features in the voice signal;
  
  performing at least one comparison between one or more combinations of the plurality of candidate features and the voice signal; and
  
  selecting a set of features from the plurality of candidate features based, at least in part, on the at least one comparison.
- - 26. The computer readable medium of claim 25, further comprising an act of grouping the plurality of candidate features into a plurality of candidate sets, and wherein the act of selecting the set of features includes an act of selecting at least one of the plurality of candidate sets.
  - 27. The computer readable medium of claim 26, further comprising an act of converting each of the plurality of candidate sets into a respective voice waveform provided in a same format as the voice signal, and wherein the act of selecting the at least one of the plurality of candidate sets includes an act of selecting at least one of the plurality of candidate sets that is most similar to the voice signal according to first criteria, the selected one of the plurality of candidate sets being used to train, at least in part, the voice synthesis model.
  - 28. The computer readable medium of claim 26, further comprising an act of converting the voice signal and each of the plurality of candidate sets into a same format, and wherein the act of selecting the at least one of the plurality of candidate sets includes an act of selecting at least one of the plurality of candidate sets that is most similar to the voice signal according to a first criteria, the selected one of the plurality of candidate sets being used to train, at least in part, the voice synthesis model.
  - 29. The computer readable medium of claim 26, further comprising an act of segmenting the voice signal into a plurality of frames, each of the plurality of frames corresponding to a respective interval of the voice signal, and wherein the acts of:
    - detecting a plurality of candidate features includes an act of detecting a plurality of candidate features in each of the plurality of frames; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features detected in each of the plurality of frames into a respective plurality of candidate sets, each of the plurality of candidate sets associated with one of the plurality of frames from which the corresponding plurality of candidate features was detected, each of the plurality of frames being associated with at least one of the plurality of candidate sets.
  - 30. The computer readable medium of claim 29, wherein the act of selecting the at least one of the plurality of candidate sets includes an act of selecting, for each of the plurality of frames, one of the candidate sets associated with the respective frame, the selected candidate sets forming a feature tract that represents a description of the voice signal, the feature tract being used to train, at least in part, the voice synthesis model.
  - 31. The computer readable medium of claim 29, wherein the act of grouping the plurality of candidate features includes an act of forming a plurality of feature tracts, each of the plurality of feature tracts including an associated candidate set for each of the plurality of frames.
  - 32. The computer readable medium of claim 31, wherein the act of performing a comparison includes an act of performing a comparison between the voice signal and each of the plurality of feature tracts.
  - 33. The computer readable medium of claim 32, wherein the act of selecting includes an act of selecting, for use in training the voice synthesis model, a first feature tract from the plurality of feature tracts that is most similar to the voice signal according to first criteria.
  - 34. The computer readable medium of claim 29, wherein the acts of:
    - detecting a plurality of candidate features in each of the plurality of frames includes an act of detecting at least one formant; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features such that each of the plurality of candidate sets includes at least one value representative of at least one candidate formant detected in a respective frame.
  - 35. The computer readable medium of claim 34, wherein the acts of:
    - detecting includes an act of detecting a plurality of formants; and
      
      grouping the plurality of candidate features includes an act of grouping the plurality of candidate features into a plurality of candidate sets for each of the plurality of frames, wherein each of the plurality of candidate sets includes at least one value representative of each of a first formant, a second formant and a third formant detected in the respective frame.
  - 36. The computer readable medium of claim 35, wherein the act of detecting includes an act of detecting at least one feature selected from the group consisting of:
    - pitch, timbre, energy and spectral slope.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Voice Signal Technologies Incorporated (Microsoft Corporation)
Inventors: Jordan Cohen, Laurence Gillick, Michael Edgington

Granted Patent: US 8,447,592 B2
US Class Current: 704/262
CPC Class Codes:
G10L 13/027   Concept to speech synthesis...
G10L 13/033   Voice editing, e.g. manipul...
G10L 25/15   the extracted parameters be...