Audio-visual dialogue system and method

US 9,837,091 B2
Filed: 08/19/2014
Issued: 12/05/2017
Est. Priority Date: 08/23/2013
Status: Active Grant

Abstract

The present invention provides an audio-visual dialogue system that allows a user to create an ‘avatar’ which may be customised to look and sound a particular way. The avatar may be created to resemble, for example, a person, animal or mythical creature, and generated to have a variable voice which may be female or male. The system then employs a real-time voice conversion in order to transform any audio input, for example, spoken word, into a target voice that is selected and customised by the user. The system is arranged to facially animate the avatar using a real-time lip-synching algorithm such that the generated avatar and the target voice are synchronised.

15 Claims

1. An audio-visual dialogue system, comprising:
- an audio input device;
  
  an audio output device;
  
  a visual output device; and
  
  a processor, the processor being arranged to:
  
  receive an input audio signal representing a source voice from the audio input device;
  
  perform substantially real-time voice conversion on the input audio signal to produce an output audio signal representing a target voice, wherein the output audio signal is provided to the audio output device, and wherein the real-time voice conversion process includes:
  
  i) decomposing the input audio signal into a set of time-varying filter characteristics and a residual excitation signal;
  
  ii) spectrally transforming the time-varying filter characteristics, and/or modifying a pitch of the residual excitation signal; and
  
  iii) synthesising the output audio signal in dependence on the transformed time-varying filter characteristics and/or the pitch modified residual excitation signal;
  
  generate an avatar, wherein the avatar is visually displayed on the visual output device; and
  
  facially animate the generated avatar, wherein the animation is synchronised with the output audio signal,
  
  wherein the processor is further arranged to customise the real-time voice conversion including:
  
  1) selecting one of a plurality of predefined target voices, wherein the predefined target voices are represented by a set of respective linear transformations which include a set of time-varying filter characteristics and a pitch scaling factor; and
  
  2) adjusting the transformation time-varying filter characteristics and/or the pitch scaling factor of the selected predefined target voice to give customised target voice parameters,
  
  wherein transform vectors of the set of linear transformations are reduced to a mean transform vector and a plurality of orthogonal change vectors and a user interface control is used to adjust a change amount by which a change vector is added into the mean transform vector such that the time-varying filter characteristics are adjusted.
- - 2. A system according to claim 1, wherein the time-varying filter characteristics are estimated over short windowed sections of the input audio signal, and wherein the short windowed sections of the input audio signal are 20 to 40 ms in duration and overlapping by 5 to 15 ms.
  - 3. A system according to claim 1, wherein the time-varying filter characteristics are Fourier transformed into a multiple point amplitude response prior to spectral transformation of the time-varying filter characteristics, and wherein the spectrally transformed amplitude response is inverse Fourier transformed back into time-varying filter characteristics.
  - 4. A system according to claim 1, wherein a set of at least 8 linear transforms is generated between the input audio signal and a predefined target voice.
  - 5. A system according to claim 4, wherein at least 82 predefined target voices are used, including at least 40 male target speakers and at least 42 female target speakers.
  - 6. A system according to claim 1, wherein for a plurality of signal sections within the temporally aligned plurality of sentences:
    - a) the prediction coefficients and linear prediction coding spectrum are calculated; and
      
      b) the optimum frequency mapping is found using a dynamic programming algorithm.
  - 7. A system according to claim 1, wherein the selected predefined target voice is associated with a set of linear transformations.
  - 8. A system according to claim 1, wherein the processor is further arranged to facially customise the generated avatar, wherein facially customising the generated avatar includes providing a visual array of distinct faces for selection.
  - 9. A system according to claim 8, wherein the visual array of distinct faces includes at least 250 distinct faces.
  - 10. A system according to claim 8, wherein the visual array of distinct faces vary in gender, age, ethnicity and hairstyle.
  - 11. A system according to claim 8, wherein a range of accessories and further hairstyles are available for selection.
  - 12. A system according to claim 1, wherein the audio input device and audio output device are connectable to form a two-way audio channel.
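
The decomposition step of claim 1 and the windowing of claim 2 can be illustrated with a small sketch. It assumes linear predictive coding by the autocorrelation method with a Levinson-Durbin recursion, which is one common way to obtain filter characteristics and a residual; the claims do not name a specific algorithm. The 30 ms window, 10 ms overlap, Hamming taper, predictor order of 4, 8 kHz sample rate, test signal, and all function names are assumptions chosen for illustration.

```python
import math

def frame_signal(samples, sample_rate, window_ms=30.0, overlap_ms=10.0):
    """Split samples into overlapping windows; durations are in milliseconds."""
    window = int(round(sample_rate * window_ms / 1000.0))
    overlap = int(round(sample_rate * overlap_ms / 1000.0))
    hop = window - overlap  # distance between the starts of consecutive windows
    return [samples[s:s + window]
            for s in range(0, len(samples) - window + 1, hop)]

def hamming(frame):
    """Taper the frame edges before autocorrelation analysis (an assumption)."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]

def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    return [sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return predictor polynomial a (a[0] == 1) and the final prediction error."""
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:  # silent or degenerate frame: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error  # reflection coefficient for this order
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        error *= 1.0 - k * k
    return a, error

def inverse_filter(frame, a):
    """Residual excitation e[n] = sum_k a[k] * x[n-k], with zero history."""
    return [sum(a[k] * frame[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(frame))]

# Hypothetical input: 0.2 s of two sinusoids sampled at 8 kHz.
SAMPLE_RATE = 8000
signal = [math.sin(2.0 * math.pi * 200.0 * n / SAMPLE_RATE)
          + 0.5 * math.sin(2.0 * math.pi * 1200.0 * n / SAMPLE_RATE)
          for n in range(1600)]
frames = frame_signal(signal, SAMPLE_RATE)
windowed = hamming(frames[0])
coeffs, _ = levinson_durbin(autocorrelation(windowed, 4), 4)
excitation = inverse_filter(windowed, coeffs)
```

Each frame then yields one coefficient set (the filter characteristics at that instant) and one residual. Collecting the coefficient sets across frames gives the time-varying filter characteristics referred to in the claims.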

13. A method of audio-visual dialogue, comprising:
- receiving an input audio signal representing a source voice from an audio input device;
  
  performing substantially real-time voice conversion on the input audio signal to produce an output audio signal representing a target voice, wherein the output audio signal is provided to an audio output device, and wherein the substantially real-time voice conversion includes:
  
  i) decomposing the input audio signal into a set of time-varying filter characteristics and a residual excitation signal;
  
  ii) spectrally transforming the time-varying filter characteristics, and/or modifying a pitch of the residual excitation signal; and
  
  iii) synthesising the output audio signal in dependence on the transformed time-varying filter characteristics and/or the pitch modified residual excitation signal;
  
  generating an avatar, wherein the avatar is visually displayed on a visual output device;
  
  facially animating the generated avatar, wherein the animation is synchronised with the output audio signal; and
  
  customising the real-time voice conversion including:
  
  1) selecting one of a plurality of predefined target voices, wherein the predefined target voices are represented by a set of respective linear transformations which include a set of time-varying filter characteristics and a pitch scaling factor; and
  
  2) adjusting the transformation time-varying filter characteristics and/or the pitch scaling factor of the selected predefined target voice to give customised target voice parameters,
  
  wherein transform vectors of the set of linear transformations are reduced to a mean transform vector and a plurality of orthogonal change vectors and a user interface control is used to adjust a change amount by which a change vector is added into the mean transform vector such that the time-varying filter characteristics are adjusted.
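
The closing limitation of claim 13 (a mean transform vector plus orthogonal change vectors, each added in by a user-controlled amount) can be sketched as follows. The claims do not say how the reduction is computed; principal component analysis would be a natural choice, but to stay dependency-free this sketch substitutes Gram-Schmidt orthonormalisation of the deviations from the mean, which produces orthogonal change vectors without ranking them by variance. The example vectors and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length transform vectors."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]

def orthonormal_change_vectors(vectors, mean, tol=1e-9):
    """Gram-Schmidt over deviations from the mean (a stand-in for PCA)."""
    basis = []
    for v in vectors:
        d = [x - m for x, m in zip(v, mean)]
        for b in basis:
            proj = dot(d, b)  # component already covered by an earlier vector
            d = [x - proj * y for x, y in zip(d, b)]
        norm = math.sqrt(dot(d, d))
        if norm > tol:  # skip deviations that add no new direction
            basis.append([x / norm for x in d])
    return basis

def customise(mean, change_vectors, amounts):
    """Add each change vector into the mean, scaled by its change amount."""
    out = list(mean)
    for amount, c in zip(amounts, change_vectors):
        out = [x + amount * y for x, y in zip(out, c)]
    return out

# Hypothetical transform vectors for four predefined target voices.
target_transforms = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 4.0, 2.0],
    [2.0, 0.0, 2.0],
]
mean = mean_vector(target_transforms)
changes = orthonormal_change_vectors(target_transforms, mean)
# One change amount per change vector, as a user interface control might set.
custom = customise(mean, changes, [0.0, 1.5])
```

Because the change vectors are mutually orthogonal, moving one control does not alter the contribution of another, which is presumably why the claim specifies orthogonality.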

14. A method of audio-visual dialogue, comprising:
- receiving an input audio signal representing a source voice from an audio input device;
  
  performing substantially real-time voice conversion on the input audio signal to produce an output audio signal representing a target voice, wherein the output audio signal is provided to an audio output device, and wherein the substantially real-time voice conversion includes:
  
  i) decomposing the input audio signal into a set of time-varying filter characteristics and a residual excitation signal;
  
  ii) spectrally transforming the time-varying filter characteristics, and/or modifying a pitch of the residual excitation signal; and
  
  iii) synthesising the output audio signal in dependence on the transformed time-varying filter characteristics and/or the pitch modified residual excitation signal;
  
  generating an avatar, wherein the avatar is visually displayed on a visual output device;
  
  facially animating the generated avatar, wherein the animation is synchronised with the output audio signal; and
  
  customising the real-time voice conversion including:
  
  1) selecting one of a plurality of predefined target voices, wherein the predefined target voices are represented by a set of respective linear transformations which include a set of time-varying filter characteristics and a pitch scaling factor; and
  
  2) adjusting the transformation time-varying filter characteristics and/or the pitch scaling factor of the selected predefined target voice to give customised target voice parameters,
  
  wherein the time-varying filter characteristics and/or the pitch scaling factor of the selected predefined target voice are adjusted using a plurality of sliders displayed on a user interface which when activated by the user set a change amount by which the time-varying filter characteristics and/or the pitch scaling factor are adjusted and
  
  wherein transform vectors of the set of linear transforms are reduced to a mean transform vector and a plurality of orthogonal change vectors and a user interface control is used to adjust a change amount by which a change vector is added into the mean transform vector such that the time-varying filter characteristics are adjusted.
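
The slider limitation of claim 14 can be pictured as a mapping from a slider position to a change amount applied to the pitch scaling factor. The claim does not define that mapping, so everything here is an assumption: a slider range of -1 to 1, a change expressed in octaves, a limit of half an octave either way, and a pitch contour in which 0.0 marks an unvoiced frame. Function names are hypothetical.

```python
def slider_to_change(position, max_change):
    """Map a slider position in [-1, 1] to a change in [-max_change, max_change]."""
    clamped = max(-1.0, min(1.0, position))  # ignore out-of-range input
    return clamped * max_change

def adjust_pitch_factor(base_factor, slider_position, max_octaves=0.5):
    """Scale the predefined voice's pitch factor by a slider-controlled amount."""
    change_octaves = slider_to_change(slider_position, max_octaves)
    return base_factor * 2.0 ** change_octaves

def scale_pitch_contour(f0_hz, factor):
    """Apply the factor to voiced frames; unvoiced frames (0.0) are left alone."""
    return [f * factor if f > 0.0 else 0.0 for f in f0_hz]

# A predefined target voice with pitch factor 1.5, slider left at the centre.
factor = adjust_pitch_factor(1.5, 0.0)
contour = scale_pitch_contour([100.0, 0.0, 120.0], 2.0)
```

Working in octaves makes equal slider movements sound like equal pitch steps. A linear change to the factor would serve equally well as an illustration.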

15. An audio-visual dialogue system, comprising:
- an audio input device;
  
  an audio output device;
  
  a visual output device; and
  
  a processor, the processor being arranged to:
  
  receive an input audio signal representing a source voice from the audio input device;
  
  perform substantially real-time voice conversion on the input audio signal to produce an output audio signal representing a target voice, wherein the output audio signal is provided to the audio output device, and wherein the real-time voice conversion process includes:
  
  i) decomposing the input audio signal into a set of time-varying filter characteristics and a residual excitation signal;
  
  ii) spectrally transforming the time-varying filter characteristics, and/or modifying a pitch of the residual excitation signal; and
  
  iii) synthesising the output audio signal in dependence on the transformed time-varying filter characteristics and/or the pitch modified residual excitation signal;
  
  generate an avatar, wherein the avatar is visually displayed on the visual output device; and
  
  facially animate the generated avatar, wherein the animation is synchronised with the output audio signal,
  
  wherein the processor is further arranged to customise the real-time voice conversion including:
  
  1) selecting one of a plurality of predefined target voices, wherein the predefined target voices are represented by a set of respective linear transformations which include a set of time-varying filter characteristics and a pitch scaling factor; and
  
  2) adjusting the transformation time-varying filter characteristics and/or the pitch scaling factor of the selected predefined target voice to give customised target voice parameters,
  
  wherein the time-varying filter characteristics and/or the pitch scaling factor of the selected predefined target voice are adjusted using a plurality of sliders displayed on a user interface which when activated by the user set a change amount by which the time-varying filter characteristics and/or the pitch scaling factor are adjusted and
  
  wherein transform vectors of the set of linear transforms are reduced to a mean transform vector and a plurality of orthogonal change vectors and a user interface control is used to adjust a change amount by which a change vector is added into the mean transform vector such that the time-varying filter characteristics are adjusted.
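
Steps i) and iii) of the conversion process fit together as an analysis/synthesis pair, which the sketch below tries to make concrete. It assumes an all-pole (linear prediction) filter model: inverse filtering with A(z) gives the residual excitation, and filtering that residual through 1/A(z) rebuilds the signal. The coefficient sets are hypothetical and fixed for the whole signal. Swapping in a second coefficient set stands in for the spectral transformation of step ii), which the claims describe differently (as a transformation of the filter characteristics themselves).

```python
import math

def inverse_filter(signal, a):
    """Residual e[n] = x[n] + a[1]*x[n-1] + ... with zero history (a[0] == 1)."""
    return [sum(a[k] * signal[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(signal))]

def synthesis_filter(excitation, a):
    """All-pole synthesis y[n] = e[n] - a[1]*y[n-1] - ... (the filter 1/A(z))."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * out[n - k]
        out.append(acc)
    return out

# Hypothetical source filter (poles at 0.5 and 0.4) and transformed filter
# (poles at 0.7 and 0.5); both are stable, so the recursion stays bounded.
source_filter = [1.0, -0.9, 0.2]
target_filter = [1.0, -1.2, 0.35]

source = [math.sin(0.3 * n) for n in range(50)]
residual = inverse_filter(source, source_filter)

# Same filter on the way back: the source signal is recovered.
rebuilt = synthesis_filter(residual, source_filter)
# Different filter on the way back: same excitation, different spectral shape.
converted = synthesis_filter(residual, target_filter)
```

In a frame-based system the coefficients would change from window to window and the residual could also be pitch-modified before synthesis. Neither refinement is shown here.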

Current Assignee: UCL Business Ltd. (University of London)
Original Assignee: UCL Business PLC (University of London)
Inventors: Leff, Julian; Williams, Geoffrey; Huckvale, Mark
Primary Examiner(s): He, Jialong
Application Number: US14/913,876
Publication Number: US 20160203827A1
Time in Patent Office: 1,204 Days
CPC Class Codes

G06T 13/205   driven by audio data

G06T 13/40   of characters, e.g. humans,...

G10L 13/10   Prosody rules derived from ...

G10L 19/06   Determination or coding of ...

G10L 19/125   Pitch excitation, e.g. pitc...

G10L 2021/105   Synthesis of the lips movem...

G10L 25/24   the extracted parameters be...
