Speech synthesis device, speech synthesis method, and program

US 20060136214A1
Filed: 06/03/2004
Published: 06/22/2006
Est. Priority Date: 06/05/2003
Status: Active Grant

Abstract

A simply configured speech synthesis device and the like for producing natural synthetic speech at high speed. When data representing a message template is supplied, a voice unit editor (5) searches a voice unit database (7) for voice unit data on a voice unit whose sound matches a voice unit in the message template. Further, the voice unit editor (5) predicts the cadence of the message template and selects, one at a time, the best match for each voice unit in the message template from the retrieved voice unit data, according to the cadence prediction result. For a voice unit for which no match can be selected, an acoustic processor (41) is instructed to supply waveform data representing the waveform of each unit voice. The selected voice unit data and the waveform data supplied by the acoustic processor (41) are combined to generate data representing synthetic speech.
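The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the names `VoiceUnit`, `synthesize_message`, and `rule_synth` are hypothetical, the database is a plain dictionary keyed by reading, and cadence prediction is omitted. It only shows the fallback structure, where recorded units are used when found and a rule-based synthesizer fills the gaps.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class VoiceUnit:
    reading: str          # phonetic reading of the recorded unit (assumed key)
    samples: List[float]  # waveform samples of the recording


def synthesize_message(
    readings: Sequence[str],
    database: Dict[str, VoiceUnit],
    rule_synth: Callable[[str], List[float]],
) -> List[float]:
    """Concatenate recorded units, falling back to rule synthesis on a miss."""
    output: List[float] = []
    for reading in readings:
        unit = database.get(reading)
        if unit is not None:
            output.extend(unit.samples)          # recorded voice unit found
        else:
            output.extend(rule_synth(reading))   # no match: synthesize the gap
    return output


# Toy data: two recorded units and a stand-in synthesizer that emits silence.
db = {
    "hello": VoiceUnit("hello", [0.1, 0.2]),
    "world": VoiceUnit("world", [0.3]),
}


def silence(reading: str) -> List[float]:
    return [0.0] * len(reading)


result = synthesize_message(["hello", "to", "world"], db, silence)
```

Here the word "to" has no recorded unit, so its samples come from the stand-in synthesizer, while "hello" and "world" are taken from the database.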

41 Claims

1-22. (canceled)

23. A speech synthesis device, the device comprising:
- a first storage means for storing a plurality of pieces of voice unit data representative of one or more speech words;
  
  a selection means for selecting voice unit data whose reading is common with a speech word composing inputted sentence information from the plurality of pieces of voice unit data stored in the first storage means;
  
  a missing part synthesis means, for a speech word among the sentence information for which the selection means could not select the voice unit data, for synthesizing speech data representative of a desired speech waveform; and
  
  a synthesis means for combining the voice unit data selected from the selection means and the speech data synthesized by the missing part synthesis means to create data representative of a synthesis speech corresponding to the sentence information, wherein the missing part synthesis means has a second storage means for storing a plurality of pieces of data representative of one or more pitches of voice waveform fragments; and
  
  wherein data representative of voice waveform fragments composing the speech word whose voice unit data could not be selected is acquired from the second storage means and the acquired data is mutually combined to synthesize the speech data representative of the desired speech waveform.
- Dependent claims (24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 39, 40, 41)
  - 24. The speech synthesis device according to claim 23, further comprising a cadence prediction means for predicting a cadence of the speech word composing the inputted sentence information, wherein the selection means selects voice unit data whose cadence matches with a cadence prediction result under predetermined conditions.
  - 25. The speech synthesis device according to claim 24, wherein the selection means operates to exclude from the objects of selection voice unit data whose cadence does not match with the cadence prediction result under the predetermined conditions.
  - 26. The speech synthesis device according to claim 24, wherein the missing part synthesis means comprises a missing part cadence prediction means that predicts the cadence of the speech word for which the selection means could not select voice unit data, and wherein the synthesis means identifies a phoneme and acquires data representative of the voice unit data composing the speech word, for which the selection means could not select voice unit data and acquires from the second storage means, converts the acquired data such that the phoneme or the speech waveform fragment represented by the data matches with the cadence result predicted by the missing part cadence prediction means, and combines the converted data to synthesize speech data representative of the desired speech waveform.
  - 27. The speech synthesis device according to claim 25, wherein the missing part synthesis means comprises a missing part cadence prediction means that predicts the cadence of the speech word for which the selection means could not select voice unit data, and wherein the synthesis means identifies a phoneme and acquires data representative of the voice unit data composing the speech word, for which the selection means could not select voice unit data and acquires from the second storage means, converts the acquired data such that the phoneme or the speech waveform fragment represented by the data matches with the cadence result predicted by the missing part cadence prediction means, and combines the converted data to synthesize speech data representative of the desired speech waveform.
  - 28. The speech synthesis device according to claim 24, wherein the first storage means stores cadence data representative of time variations in a pitch of a voice unit represented by voice unit data with the cadence data being associated with the voice unit data, and wherein the selection means selects, from the respective voice unit data, voice unit data whose reading is common with the speech word composing the sentence information and for which a time variation in the pitch represented by the associated cadence data is closest to the cadence prediction result.
  - 29. The speech synthesis device according to claim 25, wherein the first storage means stores cadence data representative of time variations in a pitch of a voice unit represented by voice unit data with the cadence data being associated with the voice unit data, and wherein the selection means selects, from the respective voice unit data, voice unit data whose reading is common with the speech word composing the sentence information and for which a time variation in the pitch represented by the associated cadence data is closest to the cadence prediction result.
  - 30. The speech synthesis device according to claim 23, wherein the device further comprises utterance speed conversion means for acquiring utterance speed data specifying conditions of a speed for uttering the synthetic speech and selects or converts speech data and/or voice unit data composing data representative of the synthetic speech such that the speech data and/or voice unit data represents speech that is uttered at a speed fulfilling the conditions specified by the utterance speed data.
  - 31. The speech synthesis device according to claim 24, wherein the device further comprises utterance speed conversion means for acquiring utterance speed data specifying conditions of a speed for uttering the synthetic speech and selects or converts speech data and/or voice unit data composing data representative of the synthetic speech such that the speech data and/or voice unit data represents speech that is uttered at a speed fulfilling the conditions specified by the utterance speed data.
  - 32. The speech synthesis device according to claim 25, wherein the device further comprises utterance speed conversion means for acquiring utterance speed data specifying conditions of a speed for uttering the synthetic speech and selects or converts speech data and/or voice unit data composing data representative of the synthetic speech such that the speech data and/or voice unit data represents speech that is uttered at a speed fulfilling the conditions specified by the utterance speed data.
  - 33. The speech synthesis device according to claim 30, wherein the utterance speed conversion means, by eliminating a segment representing a speech waveform fragment from speech data and/or voice unit data composing data representative of the synthetic speech or adding a segment representative of a speech waveform fragment to the voice unit data and/or speech data, converts the voice unit data and/or speech data such that the voice unit data and/or speech data represents speech that is uttered at a speed fulfilling the conditions specified by the utterance speed data.
  - 39. The speech synthesis device according to claim 24, wherein the missing part synthesis means comprises a missing part cadence prediction means that predicts the cadence of the speech word for which the selection means could not select voice unit data, and wherein the synthesis means identifies a phoneme and acquires data representative of the voice unit data composing the speech word, for which the selection means could not select voice unit data and acquires from the second storage means, converts the acquired data such that the phoneme or the speech waveform fragment represented by the data matches with the cadence result predicted by the missing part cadence prediction means, and combines the converted data to synthesize speech data representative of the desired speech waveform.
  - 40. The speech synthesis device according to claim 23, wherein the first storage means stores cadence data representative of time variations in a pitch of a voice unit represented by voice unit data with the cadence data being associated with the voice unit data, and wherein the selection means selects, from the respective voice unit data, voice unit data whose reading is common with the speech word composing the sentence information and for which a time variation in the pitch represented by the associated cadence data is closest to the cadence prediction result.
  - 41. The speech synthesis device according to claim 25, wherein the first storage means stores cadence data representative of time variations in a pitch of a voice unit represented by voice unit data with the cadence data being associated with the voice unit data, and wherein the selection means selects, from the respective voice unit data, voice unit data whose reading is common with the speech word composing the sentence information and for which a time variation in the pitch represented by the associated cadence data is closest to the cadence prediction result.
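Two of the mechanisms in the dependent claims lend themselves to a short sketch: choosing the candidate whose pitch contour is closest to the predicted cadence while excluding poor matches (claims 24, 25, and 28), and changing utterance speed by removing or adding waveform segments (claim 33). The code below is an illustration under assumptions, not the claimed method: the distance measure (mean absolute pitch difference), the threshold `max_distance`, and the index-mapping rule in `change_speed` are all choices made for this example, and every name is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Candidate = Tuple[str, Sequence[float]]  # (unit id, pitch contour in Hz)


def contour_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute pitch difference between two equal-length contours."""
    if len(a) != len(b) or not a:
        raise ValueError("contours must be non-empty and equal in length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_by_cadence(
    candidates: Sequence[Candidate],
    predicted: Sequence[float],
    max_distance: float,
) -> Optional[str]:
    """Return the id of the closest candidate, or None if none is close enough.

    A None result corresponds to a unit that must be synthesized instead.
    """
    best_id: Optional[str] = None
    best_dist = max_distance
    for unit_id, contour in candidates:
        dist = contour_distance(contour, predicted)
        if dist <= best_dist:
            best_id, best_dist = unit_id, dist
    return best_id


def change_speed(segments: List[List[float]], factor: float) -> List[List[float]]:
    """Drop segments (factor > 1) or repeat them (factor < 1) to change speed."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not segments:
        return []
    n_out = max(1, round(len(segments) / factor))
    last = len(segments) - 1
    return [segments[min(last, int(i * factor))] for i in range(n_out)]


candidates = [("a", [100, 110, 120]), ("b", [100, 105, 111])]
predicted = [100, 105, 110]
chosen = select_by_cadence(candidates, predicted, max_distance=3.0)
faster = change_speed([[1.0], [2.0], [3.0], [4.0]], 2.0)
slower = change_speed([[1.0], [2.0]], 0.5)
```

With a tighter `max_distance`, `select_by_cadence` returns `None`, which is the case where the missing part synthesis would take over.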

34. A speech synthesis device, the device comprising:
- a first storage means for storing a plurality of pieces of voice unit data representative of one or more speech words;
  
  a selection means for selecting voice unit data whose reading is common with a speech word composing inputted sentence information from the plurality of pieces of voice unit data stored in the first storage means;
  
  a missing part synthesis means, for a speech word among the sentence information for which the selection means could not select the voice unit data, for synthesizing speech data representative of a desired speech waveform; and
  
  a synthesis means for combining the voice unit data selected from the selection means and the speech data synthesized by the missing part synthesis means to create data representative of a synthesis speech corresponding to the sentence information, wherein the first storage means stores phonetic data representative of a reading of the voice unit data with the phonetic data being associated with the voice unit data, and wherein the selection means operates to handle voice unit data which is associated with phonetic data representative of a reading matching with the reading of the speech word composing the sentence information as voice unit data whose reading is common with the speech word.
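Claim 34 associates phonetic data with each piece of voice unit data and treats a unit as matching when its stored reading equals the reading of the word. One way to picture that association is an index from reading to unit ids, sketched below. The function name, the unit ids, and the romanized readings are hypothetical examples, not taken from the patent.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_reading_index(
    entries: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Map each phonetic reading to the ids of voice units stored with it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for unit_id, reading in entries:
        index[reading].append(unit_id)
    return dict(index)


# Hypothetical stored units: (unit id, phonetic reading).
index = build_reading_index([("u1", "kyou"), ("u2", "ashita"), ("u3", "kyou")])
matches = index.get("kyou", [])   # units whose reading matches the word
missing = index.get("kinou", [])  # empty: this word needs synthesis instead
```

An empty lookup result marks a word for which no voice unit data can be selected.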

35. A speech synthesis method, the method comprising the steps of:
- storing a plurality of pieces of voice unit data representative of one or more speech words in a first memory;
  
  selecting voice unit data whose reading is common with a speech word composing inputted sentence information from the plurality of pieces of voice unit data stored in the first memory;
  
  synthesizing a missing part, for a speech word among the sentence information for which the voice unit data could not be selected in the selecting step, by synthesizing speech data representative of a desired speech waveform; and
  
  combining the voice unit data selected from the selection means and the speech data synthesized in the missing part synthesizing step to create data representative of a synthesis speech corresponding to the sentence information, wherein the missing part synthesizing step stores a plurality of pieces of data representative of one or more pitches of voice waveform fragments using a second memory; and
  
  wherein data representative of voice waveform fragments composing the speech word whose voice unit data could not be selected is acquired from the second memory and the acquired data is combined to synthesize the speech data representative of the desired speech waveform.

36. A speech synthesis method, the method comprising the steps of:
- storing a plurality of pieces of voice unit data representative of one or more speech words in a first memory;
  
  selecting voice unit data whose reading is common with a speech word composing inputted sentence information from the plurality of pieces of voice unit data stored in the first memory;
  
  synthesizing a missing part, for a speech word among the sentence information for which the selection means could not select the voice unit data, by synthesizing speech data representative of a desired speech waveform; and
  
  combining the voice unit data selected from the selection means and the speech data synthesized in the missing part synthesis step to create data representative of a synthesis speech corresponding to the sentence information, wherein the first memory stores phonetic data representative of a reading of the voice unit data with the phonetic data being associated with the voice unit data, and wherein the selecting step handles voice unit data which is associated with phonetic data representative of a reading matching with the reading of the speech word composing the sentence information as voice unit data whose reading is common with the speech word.

37. A computer program causing a computer to operate as:
- a first storage means for storing a plurality of pieces of voice unit data representative of one or more speech words;
  
  a selection means for selecting voice unit data whose reading is common with a speech word composing inputted sentence information from the plurality of pieces of voice unit data stored in the first storage means;
  
  a missing part synthesis means, for a speech word among the sentence information for which the selection means could not select the voice unit data, for synthesizing speech data representative of a desired speech waveform; and
  
  a synthesis means for combining the voice unit data selected from the selection means and the speech data synthesized by the missing part synthesis means to create data representative of a synthesis speech corresponding to the sentence information, wherein the missing part synthesis means has a second storage means for storing a plurality of pieces of data representative of one or more pitches of voice waveform fragments; and
  
  wherein data representative of voice waveform fragments composing the speech word whose voice unit data could not be selected is acquired from the second storage means and the acquired data is mutually combined to synthesize the speech data representative of the desired speech waveform.

38. A computer program causing a computer to operate as:
- a first storage means for storing a plurality of pieces of voice unit data representative of one or more speech words;
  
  a selection means for selecting voice unit data whose reading is common with a speech word composing inputted sentence information from the plurality of pieces of voice unit data stored in the first storage means;
  
  a missing part synthesis means, for a speech word among the sentence information for which the selection means could not select the voice unit data, for synthesizing speech data representative of a desired speech waveform; and
  
  a synthesis means for combining the voice unit data selected from the selection means and the speech data synthesized by the missing part synthesis means to create data representative of a synthesis speech corresponding to the sentence information, wherein the first storage means stores phonetic data representative of a reading of the voice unit data with the phonetic data being associated with the voice unit data, and wherein the selection means operates to handle voice unit data which is associated with phonetic data representative of a reading matching with the reading of the speech word composing the sentence information as voice unit data whose reading is common with the speech word.

Current Assignee: Rakuten Group, Inc.
Original Assignee: Kabushiki Kaisha Kenwood (JVC Kenwood Corporation)
Inventors: Sato, Yasushi
Granted Patent: US 8,214,216 B2
US Class Current: 704/265
CPC Class Codes: G10L 13/027 Concept to speech synthesis...
