Apparatus and method for voice conversion using attribute information

US 7,580,839 B2
Filed: 09/19/2006
Issued: 08/25/2009
Est. Priority Date: 01/19/2006
Status: Active Grant

Abstract

A speech processing apparatus according to an embodiment of the invention includes a conversion-source-speaker speech-unit database; a voice-conversion-rule-learning-data generating means; and a voice-conversion-rule learning means, with which it makes voice conversion rules. The voice-conversion-rule-learning-data generating means includes a conversion-target-speaker speech-unit extracting means; an attribute-information generating means; a conversion-source-speaker speech-unit database; and a conversion-source-speaker speech-unit selection means. The conversion-source-speaker speech-unit selection means selects conversion-source-speaker speech units corresponding to conversion-target-speaker speech units based on the mismatch between the attribute information of the conversion-target-speaker speech units and that of the conversion-source-speaker speech units, whereby the voice conversion rules are made from the selected pair of the conversion-target-speaker speech units and the conversion-source-speaker speech units.
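
The selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Unit` class, the choice of F0 and duration as attributes, the weights, and the `/ 100.0` scaling are all assumptions made for the example. Only the overall idea (score the attribute mismatch with a cost function, then pick same-phoneme source units with the lowest cost) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """Hypothetical speech unit carrying two attribute values."""
    phoneme: str
    f0: float        # mean fundamental frequency in Hz (assumed attribute)
    duration: float  # duration in seconds (assumed attribute)
    label: str = ""


def cost(target: Unit, source: Unit, w_f0: float = 1.0, w_dur: float = 1.0) -> float:
    """Weighted sum of attribute sub-costs; weights and scaling are illustrative."""
    f0_cost = abs(target.f0 - source.f0) / 100.0  # hypothetical normalisation
    dur_cost = abs(target.duration - source.duration)
    return w_f0 * f0_cost + w_dur * dur_cost


def select_units(target: Unit, storage: list, n: int = 1) -> list:
    """Return the n lowest-cost source units that share the target's phoneme."""
    candidates = [u for u in storage if u.phoneme == target.phoneme]
    return sorted(candidates, key=lambda u: cost(target, u))[:n]


storage = [
    Unit("a", 120.0, 0.10, "src-a-1"),
    Unit("a", 210.0, 0.12, "src-a-2"),
    Unit("i", 205.0, 0.11, "src-i-1"),
]
target = Unit("a", 200.0, 0.11, "tgt-a")
best = select_units(target, storage, n=1)
print(best[0].label)
```

With `n=1` this corresponds to picking the single minimum-cost unit; a larger `n` returns several candidates for the same target unit.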

287 Citations

13 Claims

1. A speech processing apparatus comprising:
- a speech storage configured to store a plurality of speech units of a conversion-source speaker and source-speaker attribute information corresponding to the speech units;
  
  a speech-unit extractor configured to divide the speech of a conversion-target speaker into a predetermined type of a speech unit to form target-speaker speech units;
  
  an attribute-information generator configured to generate target-speaker attribute information corresponding to the target-speaker speech units from the speech of the conversion-target speaker or linguistic information of the speech;
  
  a speech-unit selector configured to calculate costs on the target-speaker attribute information and the source-speaker attribute information using cost functions, and selects one or a plurality of speech units with the same phoneme from the speech storage according to the costs to form a source-speaker speech unit; and
  
  a voice-conversion-rule generator configured to generate speech conversion functions for converting the one or the plurality of source-speaker speech units to the target-speaker speech units based on the target-speaker speech units and the one or the plurality of source-speaker speech units.
- Dependent claims (2 to 11):
  - 2. The apparatus according to claim 1, wherein the speech-unit selector selects a speech unit corresponding to source-speaker attribute information in which the cost of the cost functions is the minimum from the speech storage into the source-speaker speech unit.
  - 3. The apparatus according to claim 1, wherein the attribute information is at least one of fundamental frequency information, duration information, phoneme environment information, and spectrum information.
  - 4. The apparatus according to claim 1, wherein the attribute-information generator comprises:
    - an attribute-conversion-rule generator configured to generate an attribute conversion function for converting the attribute information of the conversion-target speaker to the attribute information of the conversion-source speaker;
      
      an attribute-information extractor configured to extract attribute information corresponding to the target-speaker speech units from the speech of the conversion-target speaker or the linguistic information of the speech of the conversion-target speaker; and
      
      an attribute-information converter configured to convert the attribute information corresponding to the target-speaker speech units using the attribute conversion function to use the converted attribute information as target-speaker attribute information corresponding to the target-speaker speech units.
  - 5. The apparatus according to claim 4, wherein the attribute-conversion-rule generator comprises:
    - an analyzer configured to find an average of the fundamental frequency information of the conversion-target speaker and an average of the fundamental frequency information of the conversion-source speaker; and
      
      a difference generator configured to determine difference between the average of the fundamental frequency information of the conversion-target speaker and the average of the fundamental frequency information of the conversion-source speaker, and generates an attribute conversion function in which the difference is added to the fundamental frequency information of the conversion-source speaker.
  - 6. The apparatus according to claim 1, wherein the voice-conversion-rule generator comprises:
    - a speech-parameter extractor configured to extract target-speaker speech parameters indicative of the voice quality of the target-speaker speech units and source-speaker speech parameters indicative of the voice quality of the source-speaker speech units; and
      
      a regression analyzer configured to obtain a regression matrix for estimating the target-speaker speech parameters from the source-speaker speech parameters, the regression matrix being the voice conversion function.
  - 7. The apparatus according to claim 1, further comprising:
    - a voice converter configured to convert the voice quality of the speech of the conversion-source speaker using the voice conversion function.
  - 8. The apparatus according to claim 1, further comprising:
    - a speech-unit storage configured to store conversion-target-speaker speech units obtained by converting the conversion-source-speaker speech units with the voice conversion function;
      
      a speech-unit selector configured to select speech units from the speech-unit storage to obtain representative speech units; and
      
      a speech-waveform generator configured to generate a speech waveform by concatenating the representative speech units.
  - 9. The apparatus according to claim 1, further comprising:
    - a speech-unit selector configured to select speech units from the speech-unit storage to obtain representative conversion-source-speaker speech units;
      
      a voice converter configured to convert the representative conversion-source-speaker speech units using the voice conversion function to obtain representative conversion-target-speaker speech units; and
      
      a speech-waveform generator configured to concatenate the representative conversion-target-speaker speech units to generate a speech waveform.
  - 10. The apparatus according to claim 1, further comprising:
    - a speech-unit storage configured to store conversion-target-speaker speech units obtained by converting the conversion-source-speaker speech units with the voice conversion function;
      
      a plural-speech-units selector configured to select a plurality of speech units for each synthesis unit from the speech-unit storage;
      
      a fusion unit configured to fuse the selected plurality of speech units to form fused speech units; and
      
      a speech-waveform generator configured to concatenate the fused speech units to generate a speech waveform.
  - 11. The apparatus according to claim 1, further comprising:
    - a plural-speech-units selector configured to select a plurality of speech units for each synthesis unit from the speech-unit storage;
      
      a voice converter configured to convert the selected plurality of speech units using the voice conversion function to obtain a plurality of conversion-target-speaker speech units;
      
      a fusion unit configured to fuse the selected plurality of conversion-target-speaker speech units to form fused speech units; and
      
      a speech-waveform generator configured to concatenate the fused speech units to generate a speech waveform.
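
The attribute conversion of claim 5 can be illustrated with a short sketch. It follows the claim wording literally (the difference of the two average fundamental frequencies is added to the source speaker's value); the function name `make_f0_conversion`, the use of plain Hz values, and the sample numbers are assumptions for the example, since the claim does not say whether F0 is handled in Hz or on a log scale.

```python
from statistics import mean


def make_f0_conversion(target_f0, source_f0):
    """Build an F0 conversion function from the two speakers' average F0.

    diff = mean(target) - mean(source); the returned function adds diff
    to a source-speaker F0 value, as claim 5 is worded.
    """
    diff = mean(target_f0) - mean(source_f0)

    def convert(f0):
        return f0 + diff

    return convert, diff


# Hypothetical per-unit F0 values in Hz for each speaker.
target_f0 = [200.0, 220.0, 210.0]
source_f0 = [110.0, 130.0, 120.0]

convert, diff = make_f0_conversion(target_f0, source_f0)
print(diff, convert(100.0))
```

Shifting by a mean difference puts both speakers' F0 values in a comparable range before any F0 sub-cost is computed, so a pitch offset between the speakers does not dominate the selection cost.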

12. A method of processing speech, the method comprising:
- storing in a storing means a plurality of speech units of a conversion-source speaker and source-speaker attribute information corresponding to the speech units;
  
  dividing the speech of a conversion-target speaker into a predetermined type of a speech unit to form target-speaker speech units;
  
  generating target-speaker attribute information corresponding to the target-speaker speech units from information on the speech of the conversion-target speaker or linguistic information of the speech;
  
  calculating costs on the target-speaker attribute information and the source-speaker attribute information using cost functions, and selecting one or a plurality of speech units with the same phoneme from the storing means according to the costs to form a source-speaker speech unit; and
  
  generating voice conversion functions for converting the one or the plurality of source-speaker speech units to the target-speaker speech units based on the target-speaker speech units and the one or a plurality of source-speaker speech units.
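
The final step, generating a voice conversion function from paired units, is described in claim 6 as obtaining a regression matrix that estimates target-speaker speech parameters from source-speaker speech parameters. The sketch below fits such a matrix by ordinary least squares in plain Python. The parameter vectors, the added bias column, and the helper names (`solve`, `fit_regression_matrix`, `apply_conversion`) are assumptions for illustration; the claims do not specify the estimator or the parameter type.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.

    a is n x n and b is n x m, both as lists of lists.
    """
    n = len(a)
    m = [ra[:] + rb[:] for ra, rb in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; training pairs are not varied enough")
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def fit_regression_matrix(source, target):
    """Least-squares W such that target ≈ W @ [source, 1] (bias term assumed)."""
    xs = [x + [1.0] for x in source]
    d_in, d_out = len(xs[0]), len(target[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(d_in)] for i in range(d_in)]
    xty = [[sum(x[i] * y[k] for x, y in zip(xs, target)) for k in range(d_out)]
           for i in range(d_in)]
    wt = solve(xtx, xty)  # normal equations: (X^T X) W^T = X^T Y
    return [[wt[i][k] for i in range(d_in)] for k in range(d_out)]


def apply_conversion(w, x):
    """Apply the regression matrix to one source parameter vector."""
    xa = x + [1.0]
    return [sum(wi * xi for wi, xi in zip(row, xa)) for row in w]


# Hypothetical paired parameter vectors (source unit -> selected target unit).
source_params = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]]
target_params = [[1.0, -1.0], [3.0, -1.0], [1.0, -0.5], [3.0, -0.5], [5.0, 0.5]]

w = fit_regression_matrix(source_params, target_params)
print(apply_conversion(w, [4.0, 2.0]))
```

In practice the paired vectors would come from the selected source-speaker units and their matching target-speaker units, which is why the quality of the cost-based selection directly affects the fitted conversion function.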

13. A computer-readable storage medium having stored therein a program for processing speech, the program causing a computer to implement a process comprising:
- storing a plurality of speech units of a conversion-source speaker and source-speaker attribute information corresponding to the speech units;
  
  dividing the speech of a conversion-target speaker into a predetermined type of a speech unit to form target-speaker speech units;
  
  generating target-speaker attribute information corresponding to the target-speaker speech units from information on the speech of the conversion-target speaker or linguistic information of the speech;
  
  calculating costs on the target-speaker attribute information and the source-speaker attribute information using cost functions, and selecting one or a plurality of speech units with the same phoneme from the conversion-source-speaker speech units according to the costs to form a source-speaker speech unit; and
  
  generating voice conversion functions for converting the one or a plurality of source-speaker speech units to the target-speaker speech units based on the target-speaker speech units and the one or the plurality of source-speaker speech units.

Current Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation); Toshiba Digital Solutions Corporation (Toshiba Corporation)
Original Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors: Tamura, Masatsune; Kagoshima, Takehiko
Primary Examiner(s): Chawan, Vijay B
Application Number: US11/533,122
Publication Number: US 20070168189A1
Time in Patent Office: 1,071 Days
Field of Search: 704/258, 704/270, 704/254, 704/257, 704/220, 704/222, 704/246, 704/247, 704/232, 379/88.02
US Class Current: 704/258
CPC Class Codes:
G10L 13/033 Voice editing, e.g. manipul...
G10L 2021/0135 Voice conversion or morphing
