Method and apparatus for speech synthesis, program, recording medium, method and apparatus for generating constraint information and robot apparatus
Abstract
Emotion is added to the synthesized speech while the prosodic features of the language are maintained. In a speech synthesis device 200, a language processor 201 generates a string of pronunciation marks from the text, and a prosodic data generating unit 202 creates prosodic data, expressing parameters of the phonemes such as time duration, pitch and sound volume, based on the string of pronunciation marks. A constraint information generating unit 203, fed with the prosodic data and with the string of pronunciation marks, generates constraint information that limits changes in those parameters and adds the constraint information so generated to the prosodic data. An emotion filter 204, fed with the prosodic data to which the constraint information has been added, changes the parameters of the prosodic data, within the constraints, responsive to the emotion state information imparted to it. A waveform generating unit 205 synthesizes the speech waveform based on the prosodic data whose parameters have been changed.
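The abstract's pipeline can be illustrated with a minimal Python sketch. All function names, parameter ranges, and scaling factors below are illustrative assumptions, not the patent's actual implementation; the point is only the data flow: prosodic data is generated, constraint bounds are derived from it, and the emotion-driven parameter change is clamped to those bounds before synthesis.

```python
# Hypothetical sketch of the abstract's pipeline (units 202-204).
# Names and numbers are assumptions for illustration only.

def make_prosodic_data(pronunciation_marks):
    """Unit 202 (sketch): per-phoneme parameters (duration ms, pitch Hz, volume)."""
    return [{"phoneme": p, "duration": 100.0, "pitch": 200.0, "volume": 1.0}
            for p in pronunciation_marks]

def make_constraints(prosodic_data):
    """Unit 203 (sketch): bounds that keep the utterance's prosodic features
    intact when the emotion filter later changes the parameters."""
    return [{"pitch_min": d["pitch"] * 0.8, "pitch_max": d["pitch"] * 1.5}
            for d in prosodic_data]

def apply_emotion(prosodic_data, constraints, emotion):
    """Unit 204 (sketch): change parameters responsive to the emotion state,
    clamped within the constraint information."""
    scale = {"calm": 1.0, "happy": 1.3, "sad": 0.7}.get(emotion, 1.0)
    changed = []
    for d, c in zip(prosodic_data, constraints):
        pitch = min(max(d["pitch"] * scale, c["pitch_min"]), c["pitch_max"])
        changed.append({**d, "pitch": pitch})
    return changed
```

Note how "sad" would push the pitch below the lower bound, so the constraint clamps it: the emotional colouring is applied only as far as the prosodic feature of the language permits.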
63 Claims
1. A speech synthesis method for receiving information on the emotion to synthesize the speech, comprising:
a prosodic data forming step of forming prosodic data from a string of pronunciation marks which is based on an uttered text, uttered as speech;
a constraint information generating step of generating the constraint information used for maintaining prosodic features of the uttered text;
a parameter changing step of changing parameters of said prosodic data, in consideration of said constraint information, responsive to the information on the emotion; and
a speech synthesis step of synthesizing the speech based on said prosodic data the parameters of which have been changed in said parameter changing step. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
17. A speech synthesis method for receiving information on the emotion to synthesize the speech, comprising:
a data inputting step for inputting prosodic data which is based on the text uttered as speech and the constraint information for maintaining the prosodic feature of said uttered text;
a parameter changing step of changing parameters of said prosodic data, in consideration of said constraint information, responsive to the information on the emotion; and
a speech synthesis step of synthesizing the speech based on the prosodic data the parameters of which have been changed in said parameter changing step. - View Dependent Claims (18, 19)
20. A speech synthesis apparatus for receiving information on the emotion to synthesize the speech, comprising:
prosodic data generating means for generating prosodic data from a string of pronunciation marks which is based on a text uttered as speech;
constraint information generating means for generating the constraint information adapted for maintaining the prosodic feature of said uttered text;
parameter changing means for changing parameters of said prosodic data, in consideration of said constraint information, responsive to the information on the emotion; and
speech synthesis means for synthesizing the speech based on said prosodic data the parameters of which have been changed by said parameter changing means. - View Dependent Claims (21)
22. A speech synthesis apparatus for receiving information on the emotion to synthesize the speech, comprising:
data inputting means for inputting prosodic data which is based on the uttered text uttered as speech, and the constraint information for maintaining the prosodic feature of said uttered text;
parameter changing means for changing parameters of said prosodic data, in consideration of said constraint information, responsive to the information on the emotion; and
speech synthesis means for synthesizing the speech based on said prosodic data the parameters of which have been changed by said parameter changing means. - View Dependent Claims (23)
24. A program product for having a computer execute the processing for receiving information on the emotion to synthesize the speech, comprising:
a prosodic data forming step of forming prosodic data from a string of pronunciation marks which is based on an uttered text, uttered as speech;
a constraint information generating step of generating the constraint information used for maintaining prosodic features of the uttered text;
a parameter changing step of changing parameters of said prosodic data, in consideration of said constraint information, responsive to the information on the emotion; and
a speech synthesis step of synthesizing the speech based on said prosodic data the parameters of which have been changed in said parameter changing step. - View Dependent Claims (25)
26. A program product loadable into a computer for having the computer perform the processing of receiving information on the emotion to synthesize the speech, comprising:
a data inputting step for inputting prosodic data which is based on the text uttered as speech and the constraint information for maintaining the prosodic feature of said uttered text;
a parameter changing step of changing parameters of said prosodic data, in consideration of said constraint information, responsive to the information on the emotion; and
a speech synthesis step of synthesizing the speech based on the prosodic data the parameters of which have been changed in said parameter changing step. - View Dependent Claims (27)
28. A computer-readable recording medium on which there is recorded a program for having a computer execute the processing of receiving information on the emotion to synthesize the speech, comprising:
a prosodic data forming step of forming prosodic data from a string of pronunciation marks which is based on an uttered text, uttered as speech;
a constraint information generating step of generating the constraint information used for maintaining prosodic features of the uttered text;
a parameter changing step of changing parameters of said prosodic data, in consideration of said constraint information, responsive to the information on the emotion; and
a speech synthesis step of synthesizing the speech based on said prosodic data the parameters of which have been changed in said parameter changing step. - View Dependent Claims (29)
30. A recording medium on which there is recorded a program adapted for having a computer perform the processing of receiving information on the emotion to synthesize the speech, comprising:
a data inputting step for inputting prosodic data which is based on the text uttered as speech and the constraint information for maintaining the prosodic feature of said uttered text;
a parameter changing step of changing parameters of said prosodic data, in consideration of said constraint information, responsive to the information on the emotion; and
a speech synthesis step of synthesizing the speech based on the prosodic data, the parameters of which have been changed in said parameter changing step. - View Dependent Claims (31)
32. A method for generating the constraint information comprising:
a constraint information generating step of being fed with a string of pronunciation marks specifying an uttered text, uttered as speech, for generating the constraint information for maintaining the prosodic feature of said uttered text when changing parameters of prosodic data prepared from said string of pronunciation marks in accordance with the parameter change control information. - View Dependent Claims (33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44)
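The constraint-generation step of claim 32 can be sketched as follows, under the assumption (mine, for illustration) that the prosodic feature to be maintained is the relative pitch ordering around an accent mark in the string of pronunciation marks; the apostrophe accent notation and all function names are hypothetical.

```python
# Sketch of claim 32's constraint generation: derive, from the string of
# pronunciation marks, ordering constraints that any later parameter change
# must respect. The accent notation ("'" suffix) is an assumed convention.

def generate_constraints(pronunciation_marks, prosodic_data):
    """For each accented mark, record that its phoneme's pitch must remain
    above the following phoneme's pitch after parameters are changed."""
    constraints = []
    for i, mark in enumerate(pronunciation_marks):
        if mark.endswith("'") and i + 1 < len(prosodic_data):
            constraints.append(("pitch_order", i, i + 1))  # pitch[i] > pitch[i+1]
    return constraints

def satisfies(prosodic_data, constraints):
    """Check changed prosodic data against the constraint information."""
    return all(prosodic_data[a]["pitch"] > prosodic_data[b]["pitch"]
               for kind, a, b in constraints if kind == "pitch_order")
```

A parameter-changing step would then accept a candidate change only when `satisfies` still holds, which is one plausible reading of "maintaining the prosodic feature of said uttered text".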
45. An apparatus for generating the constraint information comprising:
constraint information generating means for being fed with a string of pronunciation marks specifying an uttered text, uttered as speech, for generating the constraint information for maintaining the prosodic feature of said uttered text when changing parameters of prosodic data prepared from said string of pronunciation marks in accordance with the parameter change control information. - View Dependent Claims (46, 47)
48. An autonomous robot apparatus performing a movement based on the input information supplied thereto, comprising:
an emotion model ascribable to said movement;
emotion discrimination means for discriminating the emotion state of said emotion model;
prosodic data creating means for creating prosodic data from a string of pronunciation marks which is based on the text uttered as speech;
constraint information generating means for generating the constraint information adapted for maintaining the prosodic feature of said uttered text;
parameter changing means for changing parameters of said prosodic data, in consideration of said constraint information, responsive to the emotion state discriminated by said discriminating means; and
speech synthesizing means for synthesizing the speech based on said prosodic data the parameters of which have been changed by the parameter changing means. - View Dependent Claims (49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60)
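The robot apparatus of claim 48 adds an emotion model and discrimination means in front of the speech path. A minimal sketch, assuming a toy emotion model whose dominant internal state selects per-parameter scaling factors (the emotion names and factor values are my illustrative assumptions, not the patent's):

```python
# Hypothetical sketch of claim 48's flow: emotion model -> discrimination ->
# parameter change -> synthesis. Factor values are illustrative only.

EMOTION_FACTORS = {            # per-emotion scaling of prosodic parameters
    "joy":     {"pitch": 1.2, "duration": 0.9, "volume": 1.1},
    "sadness": {"pitch": 0.9, "duration": 1.2, "volume": 0.8},
    "neutral": {"pitch": 1.0, "duration": 1.0, "volume": 1.0},
}

class EmotionModel:
    """Toy emotion model: internal state levels ascribable to movement."""
    def __init__(self, levels):
        self.levels = levels   # e.g. {"joy": 0.7, "sadness": 0.1}

    def discriminate(self):
        """Emotion discrimination means: return the dominant emotion state."""
        if not self.levels:
            return "neutral"
        return max(self.levels, key=self.levels.get)

def change_parameters(prosodic_data, emotion):
    """Scale each numeric prosodic parameter by the emotion's factor."""
    factors = EMOTION_FACTORS.get(emotion, EMOTION_FACTORS["neutral"])
    return [{k: (v * factors[k] if k in factors else v) for k, v in d.items()}
            for d in prosodic_data]
```

In the full claim, the output of `change_parameters` would additionally be checked against the constraint information before the speech synthesizing means produces the waveform.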
61. An autonomous robot apparatus performing a movement based on the input information supplied thereto, comprising:
an emotion model ascribable to said movement;
emotion discrimination means for discriminating the emotion state of said emotion model;
data inputting means for inputting prosodic data which is based on the text uttered as speech and the constraint information for maintaining the prosodic feature of said uttered text;
parameter changing means for changing parameters of said prosodic data, in consideration of said constraint information, responsive to the emotion state discriminated by said discriminating means; and
speech synthesizing means for synthesizing the speech based on said prosodic data, the parameters of which have been changed by the parameter changing means. - View Dependent Claims (62, 63)
Specification