Audio-visual dialogue system and method
First Claim
1. An audio-visual dialogue system, comprising:
an audio input device;
an audio output device;
a visual output device; and
a processor, the processor being arranged to:
receive an input audio signal representing a source voice from the audio input device;
perform substantially real-time voice conversion on the input audio signal to produce an output audio signal representing a target voice, wherein the output audio signal is provided to the audio output device, and wherein the real-time voice conversion process includes:
i) decomposing the input audio signal into a set of time-varying filter characteristics and a residual excitation signal;
ii) spectrally transforming the time-varying filter characteristics, and/or modifying a pitch of the residual excitation signal; and
iii) synthesising the output audio signal in dependence on the transformed time-varying filter characteristics and/or the pitch modified residual excitation signal;
generate an avatar, wherein the avatar is visually displayed on the visual output device; and
facially animate the generated avatar, wherein the animation is synchronised with the output audio signal,
wherein the processor is further arranged to customise the real-time voice conversion including
1) selecting one of a plurality of predefined target voices, wherein the predefined target voices are represented by a set of respective linear transformations which include a set of time-varying filter characteristics and a pitch scaling factor and
2) adjusting the transformation time-varying filter characteristics and/or the pitch scaling factor of the selected predefined target voice to give customised target voice parameters,
wherein transform vectors of the set of linear transformations are reduced to a mean transform vector and a plurality of orthogonal change vectors and a user interface control is used to adjust a change amount by which a change vector is added into the mean transform vector such that the time-varying filter characteristics are adjusted.
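Steps i) to iii) of the claim above describe a source-filter decomposition followed by resynthesis. The sketch below illustrates one way such a pipeline could look for a single frame, assuming linear predictive coding (LPC) as the decomposition; the claim does not name LPC, and the function names, the frame-by-frame structure and the `bandwidth_expand` placeholder transform are all hypothetical choices made for illustration, not the patented method. Pitch modification of the residual is not shown.

```python
import math

def autocorrelation(frame, max_lag):
    """Autocorrelation of one frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for the prediction polynomial a[0..order], with a[0] == 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a

def analyse(frame, order):
    """Step i): split a frame into filter coefficients and a residual."""
    a = levinson_durbin(autocorrelation(frame, order), order)
    residual = [sum(a[j] * frame[n - j] for j in range(order + 1) if n - j >= 0)
                for n in range(len(frame))]
    return a, residual

def synthesise(a, residual):
    """Step iii): drive the all-pole filter 1/A(z) with the residual."""
    out = []
    for n, e in enumerate(residual):
        y = e - sum(a[j] * out[n - j] for j in range(1, len(a)) if n - j >= 0)
        out.append(y)
    return out

def bandwidth_expand(a, gamma=0.9):
    """Hypothetical stand-in for step ii): a simple spectral-envelope change."""
    return [c * gamma ** j for j, c in enumerate(a)]

def convert_frame(frame, order, transform=None):
    """Analyse, optionally transform the filter, then resynthesise."""
    a, residual = analyse(frame, order)
    if transform is not None:
        a = transform(a)
    return synthesise(a, residual)
```

With no transform supplied, `convert_frame` reproduces the input frame up to rounding error, which is a convenient sanity check before plugging in a real spectral transformation.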
Abstract
The present invention provides an audio-visual dialogue system that allows a user to create an ‘avatar’ which may be customised to look and sound a particular way. The avatar may be created to resemble, for example, a person, animal or mythical creature, and generated to have a variable voice which may be female or male. The system then employs a real-time voice conversion in order to transform any audio input, for example, spoken word, into a target voice that is selected and customised by the user. The system is arranged to facially animate the avatar using a real-time lip-synching algorithm such that the generated avatar and the target voice are synchronised.
15 Claims
Claims dependent on claim 1: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A method of audio-visual dialogue, comprising:
receiving an input audio signal representing a source voice from an audio input device;
performing substantially real-time voice conversion on the input audio signal to produce an output audio signal representing a target voice, wherein the output audio signal is provided to an audio output device, and wherein the substantially real-time voice conversion includes:
i) decomposing the input audio signal into a set of time-varying filter characteristics and a residual excitation signal;
ii) spectrally transforming the time-varying filter characteristics, and/or modifying a pitch of the residual excitation signal; and
iii) synthesising the output audio signal in dependence on the transformed time-varying filter characteristics and/or the pitch modified residual excitation signal;
generating an avatar, wherein the avatar is visually displayed on a visual output device;
facially animating the generated avatar, wherein the animation is synchronised with the output audio signal; and
customising the real-time voice conversion including
1) selecting one of a plurality of predefined target voices, wherein the predefined target voices are represented by a set of respective linear transformations which include a set of time-varying filter characteristics and a pitch scaling factor; and
2) adjusting the transformation time-varying filter characteristics and/or the pitch scaling factor of the selected predefined target voice to give customised target voice parameters,
wherein transform vectors of the set of linear transformations are reduced to a mean transform vector and a plurality of orthogonal change vectors and a user interface control is used to adjust a change amount by which a change vector is added into the mean transform vector such that the time-varying filter characteristics are adjusted.
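The final clause of claim 13 reduces a set of transform vectors to a mean vector plus orthogonal change vectors. The claim text does not say how that reduction is performed; the sketch below uses Gram-Schmidt orthogonalisation of the mean-centred vectors as a simple stand-in (principal component analysis would be a common alternative). The function names and the tolerance are assumptions made for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]

def reduce_transforms(vectors, tol=1e-9):
    """Return (mean, change_vectors) with orthonormal change vectors.

    Each input vector can be rebuilt as the mean plus a weighted sum of
    the change vectors. Deviations already explained by earlier change
    vectors (norm below tol) are dropped.
    """
    mean = mean_vector(vectors)
    basis = []
    for v in vectors:
        d = [x - m for x, m in zip(v, mean)]
        # Modified Gram-Schmidt: remove components along existing vectors.
        for b in basis:
            proj = dot(d, b)
            d = [x - proj * y for x, y in zip(d, b)]
        norm = math.sqrt(dot(d, d))
        if norm > tol:
            basis.append([x / norm for x in d])
    return mean, basis

def reconstruct(v, mean, basis):
    """Rebuild v from the mean and its weights on the change vectors."""
    d = [x - m for x, m in zip(v, mean)]
    out = list(mean)
    for b in basis:
        w = dot(d, b)
        out = [o + w * y for o, y in zip(out, b)]
    return out
```

Because the mean-centred vectors sum to zero, a set of N transform vectors yields at most N - 1 change vectors under this scheme.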
14. A method of audio-visual dialogue, comprising:
receiving an input audio signal representing a source voice from an audio input device;
performing substantially real-time voice conversion on the input audio signal to produce an output audio signal representing a target voice, wherein the output audio signal is provided to an audio output device, and wherein the substantially real-time voice conversion includes:
i) decomposing the input audio signal into a set of time-varying filter characteristics and a residual excitation signal;
ii) spectrally transforming the time-varying filter characteristics, and/or modifying a pitch of the residual excitation signal; and
iii) synthesising the output audio signal in dependence on the transformed time-varying filter characteristics and/or the pitch modified residual excitation signal;
generating an avatar, wherein the avatar is visually displayed on a visual output device;
facially animating the generated avatar, wherein the animation is synchronised with the output audio signal; and
customising the real-time voice conversion including
1) selecting one of a plurality of predefined target voices, wherein the predefined target voices are represented by a set of respective linear transformations which include a set of time-varying filter characteristics and a pitch scaling factor; and
2) adjusting the transformation time-varying filter characteristics and/or the pitch scaling factor of the selected predefined target voice to give customised target voice parameters,
wherein the time-varying filter characteristics and/or the pitch scaling factor of the selected predefined target voice are adjusted using a plurality of sliders displayed on a user interface which when activated by the user set a change amount by which the time-varying filter characteristics and/or the pitch scaling factor are adjusted and
wherein transform vectors of the set of linear transforms are reduced to a mean transform vector and a plurality of orthogonal change vectors and a user interface control is used to adjust a change amount by which a change vector is added into the mean transform vector such that the time-varying filter characteristics are adjusted.
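The slider limitation in claim 14 amounts to adding each change vector into the mean transform vector, scaled by a user-set change amount. A minimal sketch of that arithmetic follows, assuming each slider reports a position in the range -1 to 1 and that `max_amount` scales it to a change amount; the slider range, the clamping and the function name are illustrative assumptions, not details taken from the claim.

```python
def customise(mean, change_vectors, slider_positions, max_amount=1.0):
    """Return mean + sum(amount_k * change_vector_k).

    Each slider position is clamped to [-1, 1] and multiplied by
    max_amount to give the change amount for its change vector.
    """
    if len(slider_positions) != len(change_vectors):
        raise ValueError("one slider position is required per change vector")
    result = list(mean)
    for position, change in zip(slider_positions, change_vectors):
        amount = max(-1.0, min(1.0, position)) * max_amount
        result = [r + amount * c for r, c in zip(result, change)]
    return result
```

With every slider at zero the result is the mean transform vector itself, so the centre position of the controls corresponds to an unmodified average voice under these assumptions.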
15. An audio-visual dialogue system, comprising:
an audio input device;
an audio output device;
a visual output device; and
a processor, the processor being arranged to:
receive an input audio signal representing a source voice from the audio input device;
perform substantially real-time voice conversion on the input audio signal to produce an output audio signal representing a target voice, wherein the output audio signal is provided to the audio output device, and wherein the real-time voice conversion process includes:
i) decomposing the input audio signal into a set of time-varying filter characteristics and a residual excitation signal;
ii) spectrally transforming the time-varying filter characteristics, and/or modifying a pitch of the residual excitation signal; and
iii) synthesising the output audio signal in dependence on the transformed time-varying filter characteristics and/or the pitch modified residual excitation signal;
generate an avatar, wherein the avatar is visually displayed on the visual output device; and
facially animate the generated avatar, wherein the animation is synchronised with the output audio signal,
wherein the processor is further arranged to customise the real-time voice conversion including
1) selecting one of a plurality of predefined target voices, wherein the predefined target voices are represented by a set of respective linear transformations which include a set of time-varying filter characteristics and a pitch scaling factor; and
2) adjusting the transformation time-varying filter characteristics and/or the pitch scaling factor of the selected predefined target voice to give customised target voice parameters,
wherein the time-varying filter characteristics and/or the pitch scaling factor of the selected predefined target voice are adjusted using a plurality of sliders displayed on a user interface which when activated by the user set a change amount by which the time-varying filter characteristics and/or the pitch scaling factor are adjusted and
wherein transform vectors of the set of linear transforms are reduced to a mean transform vector and a plurality of orthogonal change vectors and a user interface control is used to adjust a change amount by which a change vector is added into the mean transform vector such that the time-varying filter characteristics are adjusted.
Specification