Codebook-less speech conversion method and system

US 20070213987A1
Filed: 03/08/2006
Published: 09/13/2007
Est. Priority Date: 03/08/2006
Status: Abandoned Application

Abstract

Speech conversion can be used to transform an utterance by a source speaker to match the speech characteristics of a target speaker, for applications such as dubbing a motion picture. During a training phase, utterances of the same sentences by both the target speaker and the source speaker are force-aligned according to the phonemes within the sentences. A transformation or mapping is trained so that each frame of the source utterances is mapped to a corresponding frame of the target utterances. After the training phase is complete, a source utterance is divided into frames, which are transformed into target frames. Once all target frames have been created from the sequence of source frames, a target utterance is assembled that carries the speech of the source speaker but the vocal characteristics of the target speaker.
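
The abstract does not say what form the trained mapping takes. As a minimal sketch, assuming the force-aligned frames are already available as equal-length feature vectors, and assuming a per-dimension affine map fitted by least squares (both choices are illustrative and not taken from the application), training and conversion could look like the following. The names `fit_frame_mapping` and `convert_frames` are hypothetical.

```python
def fit_frame_mapping(src_frames, tgt_frames):
    """Fit y = slope * x + offset independently for each feature dimension.

    src_frames[i] and tgt_frames[i] are assumed to be an aligned pair
    (same phoneme, same position), as produced by the training alignment.
    """
    n = len(src_frames)
    dims = len(src_frames[0])
    params = []
    for d in range(dims):
        xs = [frame[d] for frame in src_frames]
        ys = [frame[d] for frame in tgt_frames]
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        var = sum((x - mean_x) ** 2 for x in xs)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        # Fall back to a pure offset when the source dimension is constant.
        slope = cov / var if var > 0 else 1.0
        params.append((slope, mean_y - slope * mean_x))
    return params


def convert_frames(frames, params):
    """Apply the fitted per-dimension map to every source frame."""
    return [[slope * x + offset for x, (slope, offset) in zip(frame, params)]
            for frame in frames]


# Toy aligned training pairs (illustrative values only).
source = [[0.0, 1.0], [1.0, 2.0], [2.0, 4.0]]
target = [[1.0, 0.5], [3.0, 1.0], [5.0, 2.0]]
mapping = fit_frame_mapping(source, target)
converted = convert_frames([[3.0, 6.0]], mapping)
```

A single global map like this is the simplest possible choice; a practical system would more plausibly train separate mappings per phoneme or per model state.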

14 Claims

1. A method of speech conversion comprising the steps of:
- dividing a source signal into multiple source frames;
  
  for each source frame, deriving at least one line spectral frequency (LSF) vector, and mapping said at least one LSF vector to a LSF vector of a respective target frame; and
  
  assembling said respective target frames into a target source signal.
- - 2. The method of claim 1, wherein said step of dividing said source signal comprises the step of recognizing phonemes in said source signal.
  - 3. The method of claim 2, wherein said source signal comprises speech of a person, and said step of recognizing phonemes is performed independent of a particular language and speaker of said speech.
  - 4. The method of claim 1, wherein at least one of said multiple source frames comprises a single phoneme.
  - 5. The method of claim 1, wherein said step of deriving at least one LSF vector comprises the step of deriving at least one Hidden Markov Model (HMM) state of a source frame.
  - 6. The method of claim 1, wherein said mapping is performed without the implementation of a codebook.
  - 7. The method of claim 1, further comprising the steps of:
    - applying a phoneme recognizer to speech of a source speaker and speech of a target speaker for the same template sentence, dividing said speech of a target speaker into target frames, and force aligning said source frames to said target frames.
  - 8. The method of claim 7, wherein said source and target frames each comprise only a single phoneme.
  - 9. The method of claim 1, wherein said source signal comprises speech from a source speaker and said target source signal includes vocal characteristics of a target speaker.
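
Claims 1 and 5 recite deriving an LSF vector for each frame without saying how. LSFs are a standard re-parameterization of linear prediction (LPC) coefficients, so a minimal sketch, assuming each frame has already been reduced to LPC coefficients `[1, a1, ..., ap]` of a stable filter, is to locate the unit-circle roots of the sum and difference polynomials. The function name `lpc_to_lsf` and the grid-plus-bisection root search are illustrative choices, not something the claims specify.

```python
import math


def lpc_to_lsf(a, grid=2048):
    """Return the p line spectral frequencies (radians, ascending) for
    LPC coefficients a = [1, a1, ..., ap].

    Assumes A(z) is minimum phase and that no root lands exactly on a
    grid point (both are assumptions of this sketch).
    """
    ext = list(a) + [0.0]
    rev = [0.0] + list(reversed(a))
    p_poly = [x + y for x, y in zip(ext, rev)]  # symmetric coefficients
    q_poly = [x - y for x, y in zip(ext, rev)]  # antisymmetric coefficients
    half = len(a) / 2.0  # centre index of the degree-(p+1) polynomials

    # On the unit circle both polynomials reduce to real-valued functions
    # of w once the linear-phase factor is removed.
    def p_real(w):
        return sum(c * math.cos((k - half) * w) for k, c in enumerate(p_poly))

    def q_real(w):
        return sum(c * math.sin((k - half) * w) for k, c in enumerate(q_poly))

    def crossings(f):
        roots = []
        w0 = math.pi / grid
        f0 = f(w0)
        # Scan the open interval (0, pi); the trivial roots at 0 and pi
        # are excluded because the grid never touches the endpoints.
        for i in range(2, grid):
            w1 = math.pi * i / grid
            f1 = f(w1)
            if f0 * f1 < 0:
                lo, hi, f_lo = w0, w1, f0
                for _ in range(50):  # bisection refinement
                    mid = 0.5 * (lo + hi)
                    f_mid = f(mid)
                    if f_lo * f_mid <= 0:
                        hi = mid
                    else:
                        lo, f_lo = mid, f_mid
                roots.append(0.5 * (lo + hi))
            w0, f0 = w1, f1
        return roots

    return sorted(crossings(p_real) + crossings(q_real))


# Order-2 example: A(z) = 1 - 0.9 z^-1 + 0.2 z^-2 (roots at 0.5 and 0.4).
lsf = lpc_to_lsf([1.0, -0.9, 0.2])
```

For this example the sum polynomial factors as (1 + z^-1)(1 - 1.7 z^-1 + z^-2) and the difference polynomial as (1 - z^-1)(1 - 0.1 z^-1 + z^-2), so the two LSFs are arccos(0.85) and arccos(0.05).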

10. A method of speech conversion comprising the steps of:
- training a source to target frame transformation using a source training set of source utterances and a target training set of target utterances that transforms frames with vocal characteristics of the source speaker to frames with vocal characteristics of the target speaker;
  
  recognizing phonemes in a source utterance spoken by a source speaker having source speaker vocal characteristics;
  
  subdividing the source utterance into at least one source frame comprising only one phoneme;
  
  transforming each of said at least one source frame into a target frame based on a source to target frame transformation that transforms frames with vocal characteristics of the source speaker to frames with vocal characteristics of the target speaker; and
  
  assembling the target frames transformed from each of said at least one source frame into a target utterance.
- - 11. The method of claim 10, said step of recognizing phonemes further comprises the step of training a phonemic recognizer.
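
The subdividing, transforming, and assembling steps of claim 10 can be sketched as below. This is only an illustration under stated assumptions: the phoneme recognizer output is assumed to be a list of `(label, start, end)` sample ranges, frames are cut so that none crosses a phoneme boundary, and the assembled utterance is a plain concatenation of the transformed frames. The helper names and the `transform` callback are hypothetical.

```python
def frames_within_phonemes(segments, frame_len):
    """Cut each phoneme segment into frames of at most frame_len samples.

    segments: list of (label, start, end) with end exclusive. Because a
    frame never extends past its segment, each frame holds one phoneme.
    """
    frames = []
    for label, start, end in segments:
        pos = start
        while pos < end:
            stop = min(pos + frame_len, end)
            frames.append((label, pos, stop))
            pos = stop
    return frames


def convert_utterance(signal, segments, frame_len, transform):
    """Transform every single-phoneme frame and concatenate the results."""
    out = []
    for _label, start, stop in frames_within_phonemes(segments, frame_len):
        out.extend(transform(signal[start:stop]))
    return out


# Toy input: 400 samples covering two recognized phonemes.
signal = list(range(400))
segments = [("HH", 0, 250), ("AY", 250, 400)]
frames = frames_within_phonemes(segments, 100)
unchanged = convert_utterance(signal, segments, 100, lambda frame: frame)
doubled = convert_utterance(signal, segments, 100,
                            lambda frame: [2 * x for x in frame])
```

With the identity transform the output equals the input, which is a quick check that framing and reassembly lose no samples.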

12. A system for speech conversion comprising:
- a processor;
  
  a communication bus coupled to the processor;
  
  a main memory coupled to the communication bus;
  
  an audio input coupled to the communication bus;
  
  an audio output coupled to the communication bus;
  
  wherein the processor receives a source utterance spoken by a source speaker having source speaker vocal characteristics from the audio input;
  
  the processor receives instructions from the main memory which cause the processor to:
  
  recognize phonemes in a source utterance spoken by a source speaker having source speaker vocal characteristics;
  
  subdivide the source utterance into at least one source frame comprising only one phoneme;
  
  transform each of said at least one source frame into a target frame based on a frame transformation that transforms frames with vocal characteristics of the source speaker to frames with vocal characteristics of the target speaker; and
  
  assemble the target frames transformed from each of said at least one source frame into a target utterance.

13. A method of creating a dubbed soundtrack, the method comprising the steps:
- receiving a first soundtrack comprising a first vocal track of a first speaker's speech, wherein said first vocal track includes vocal characteristics of said first speaker's speech;
  
  receiving a second soundtrack comprising a second vocal track of a second speaker's speech, wherein said second vocal track includes vocal characteristics of said second speaker's speech; and
  
  converting said second soundtrack into a dubbed soundtrack, wherein said dubbed soundtrack includes a third vocal track of said second speaker's speech, wherein said third vocal track includes vocal characteristics of said first speaker's speech.
- - 14. The method of claim 13, wherein said first vocal speaker's speech is in one language and said second vocal speaker's speech is in a different language.

Current Assignee: Voxonic Incorporated
Original Assignee: Voxonic Incorporated
Inventors: Turk, Oytun; Deutsch, Fred; Arslan, Levent Mustafa
Application Number: US11/370,682
Publication Number: US 20070213987A1
US Class Current: 704/268
CPC Class Codes:
G10L 13/033 Voice editing, e.g. manipul...
G10L 2021/0135 Voice conversion or morphing
