Audio-Visual Dialogue System and Method

US 20160203827A1
Filed: 08/19/2014
Published: 07/14/2016
Est. Priority Date: 08/23/2013
Status: Active Grant

Abstract

The present invention provides an audio-visual dialogue system that allows a user to create an ‘avatar’ which may be customised to look and sound a particular way. The avatar may be created to resemble, for example, a person, animal or mythical creature, and generated to have a variable voice which may be female or male. The system then employs real-time voice conversion to transform any audio input, for example spoken word, into a target voice that is selected and customised by the user. The system is arranged to facially animate the avatar using a real-time lip-synching algorithm such that the generated avatar and the target voice are synchronised.

98 Claims

1-78. (canceled)

79. An audio-visual dialogue system, comprising:
- an audio input device;
- an audio output device;
- a visual output device; and
- a processor, the processor being arranged to:
  - receive an input audio signal representing a source voice from the audio input device;
  - perform substantially real-time voice conversion on the input audio signal to produce an output audio signal representing a target voice, wherein the output audio signal is provided to the audio output device, and wherein the real-time voice conversion process includes:
    - i) decomposing the input audio signal into a set of time-varying filter characteristics and a residual excitation signal;
    - ii) spectrally transforming the time-varying filter characteristics, and/or modifying the pitch of the residual excitation signal; and
    - iii) synthesising the output audio signal in dependence on the transformed time-varying filter characteristics and/or the pitch modified residual excitation signal;
  - generate an avatar, wherein the avatar is visually displayed on the visual output device; and
  - facially animate the generated avatar, wherein the animation is synchronised with the output audio signal.
- Dependent claims (80 to 97):
  - 80. A system according to claim 79, wherein the time-varying filter characteristics are estimated over short windowed sections of the input audio signal, and wherein the short windowed sections of the input audio signal are 20 to 40 ms in duration and overlapping by 5 to 15 ms.
  - 81. A system according to claim 79, wherein the time-varying filter characteristics are Fourier transformed into a multiple point amplitude response prior to spectral transformation of the time-varying filter characteristics, and wherein the spectrally transformed amplitude response is inverse Fourier transformed back into time-varying filter characteristics.
  - 82. A system according to claim 79, wherein sets of linear transformations are generated between the input audio signal and a plurality of predefined target voices.
  - 83. A system according to claim 82, wherein a set of at least 8 linear transforms is generated between the input audio signal and a predefined target voice.
  - 84. A system according to claim 83, wherein at least 82 predefined target voices are used, including at least 40 male target speakers and at least 42 female target speakers.
  - 85. A system according to claim 82, wherein a plurality of sentences are spoken by both the plurality of predefined target voices and the input audio signal, and wherein at least 20 sentences are spoken.
  - 86. A system according to claim 85, wherein the plurality of sentences spoken by the input audio signal and the plurality of sentences spoken by the plurality of predefined target voices are temporally aligned using the mel-frequency cepstrum coefficients of the input audio signal in combination with a dynamic programming algorithm.
  - 87. A system according to claim 86, wherein for a plurality of signal sections within the temporally aligned plurality of sentences:
    - a) the prediction coefficients and linear prediction coding spectrum are calculated; and
    - b) the optimum frequency mapping is found using a dynamic programming algorithm.
  - 88. A system according to claim 82, wherein the processor is further arranged to customise the real-time voice conversion, the customisation comprising:
    - 1) selecting one of the plurality of predefined target voices, wherein the predefined target voices are represented by a set of time-varying filter characteristics and a pitch scaling factor; and
    - 2) adjusting the transformed time-varying filter characteristics and/or the pitch scaling factor of the selected predefined target voice to give customised target voice parameters.
  - 89. A system according to claim 88, wherein the parameters of the selected predefined target voice are adjusted using a plurality of sliders displayed on the user interface.
  - 90. A system according to claim 88, wherein the selected predefined target voice is associated with a set of linear transformations.
  - 91. A system according to claim 88, wherein the transform vectors of the set of linear transforms are reduced to a mean transform vector and a plurality of orthogonal change vectors.
  - 92. A system according to claim 91, wherein a slider is used to adjust the amount by which a change vector is added into the mean transform vector such that the time-varying filter characteristics are adjusted.
  - 93. A system according to claim 79, wherein the processor is further arranged to facially customise the generated avatar, wherein facially customising the generated avatar includes providing a visual array of distinct faces for selection.
  - 94. A system according to claim 93, wherein the visual array of distinct faces includes at least 250 distinct faces.
  - 95. A system according to claim 93, wherein the visual array of distinct faces vary in gender, age, ethnicity and hairstyle.
  - 96. A system according to claim 93, wherein a range of accessories and further hairstyles are available for selection.
  - 97. A system according to claim 79, wherein the audio input device and audio output device are connectable to form a two-way audio channel.
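
Claims 91 and 92 describe reducing the transform vectors to a mean transform vector plus orthogonal change vectors, with a slider controlling how much of a change vector is added to the mean. The claims do not say how that reduction is computed. The sketch below assumes a principal-component style decomposition via SVD, which is one common way to obtain orthogonal directions; the function names, the 8 x 12 random data and the slider values are all hypothetical and chosen only for illustration.

```python
import numpy as np

def reduce_transforms(transforms, n_change=3):
    """Reduce a stack of transform vectors (one per row) to a mean vector
    plus n_change orthonormal change vectors (assumed method: SVD/PCA)."""
    mean = transforms.mean(axis=0)
    _, s, vt = np.linalg.svd(transforms - mean, full_matrices=False)
    # Standard deviation along each change vector, so a slider value of 1.0
    # moves roughly one standard deviation away from the mean.
    scale = s[:n_change] / np.sqrt(max(len(transforms) - 1, 1))
    return mean, vt[:n_change], scale

def apply_sliders(mean, change, scale, sliders):
    """Add each change vector into the mean, weighted by its slider value."""
    sliders = np.asarray(sliders, dtype=float)
    return mean + (sliders * scale) @ change

# Hypothetical data: 8 transform vectors of dimension 12.
rng = np.random.default_rng(1)
transforms = rng.standard_normal((8, 12))

mean, change, scale = reduce_transforms(transforms)
custom = apply_sliders(mean, change, scale, [1.0, -0.5, 0.0])
```

With all sliders at zero the result is the mean transform vector itself, so the sliders act as offsets from the selected predefined voice rather than as absolute settings.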

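Claim 86 describes temporally aligning the source and target sentences using mel-frequency cepstrum coefficients together with a dynamic programming algorithm. A minimal sketch of that idea is dynamic time warping over two feature sequences. It assumes the MFCC frames have already been extracted elsewhere, uses Euclidean distance between frames, and substitutes random 13-dimensional vectors for real features; none of these choices is stated in the claims.

```python
import numpy as np

def dtw_path(a, b):
    """Align two feature sequences (frames x coefficients) with dynamic
    time warping. Returns the list of (index_a, index_b) pairs and the
    total alignment cost."""
    n, m = len(a), len(b)
    # Pairwise Euclidean distance between every frame of a and of b.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    # Backtrack from the end of both sequences to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path, float(cost[n, m])

# Hypothetical features: 5 source frames, and a target that says the
# "same thing" at half speed (every frame repeated twice).
rng = np.random.default_rng(0)
source = rng.standard_normal((5, 13))
target = np.repeat(source, 2, axis=0)

path, total = dtw_path(source, target)
```

In this toy case every source frame is paired with the two target frames copied from it, which is the kind of frame-to-frame correspondence needed before per-frame mappings between speakers can be estimated.
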
98. A method of audio-visual dialogue, comprising:
- receiving an input audio signal representing a source voice from an audio input device;
- performing substantially real-time voice conversion on the input audio signal to produce an output audio signal representing a target voice, wherein the output audio signal is provided to an audio output device, and wherein the substantially real-time voice conversion includes:
  - i) decomposing the input audio signal into a set of time-varying filter characteristics and a residual excitation signal;
  - ii) spectrally transforming the time-varying filter characteristics, and/or modifying the pitch of the residual excitation signal; and
  - iii) synthesising the output audio signal in dependence on the transformed time-varying filter characteristics and/or the pitch modified residual excitation signal;
- generating an avatar, wherein the avatar is visually displayed on a visual output device; and
- facially animating the generated avatar, wherein the animation is synchronised with the output audio signal.
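
Steps i) to iii) of the voice conversion follow a source-filter pattern: estimate a filter per short frame, keep what the filter cannot explain as a residual, optionally change the filter, then resynthesise. The sketch below illustrates that pattern with linear prediction, which is an assumption here (claim 87 mentions linear prediction coding, but the independent claims do not fix the decomposition method). The 16 kHz sample rate, order 16, the 30 ms window with 10 ms overlap (inside the ranges given in claim 80), the synthetic test signal and all function names are illustrative. Pitch modification of the residual is left out.

```python
import numpy as np

def split_frames(signal, fs, win_ms=30.0, overlap_ms=10.0):
    """Cut a signal into overlapping frames of win_ms milliseconds."""
    win = int(round(fs * win_ms / 1000.0))
    hop = win - int(round(fs * overlap_ms / 1000.0))
    return [signal[s:s + win] for s in range(0, len(signal) - win + 1, hop)]

def lpc(frame, order):
    """Autocorrelation-method linear prediction (Levinson-Durbin).
    Returns the inverse-filter coefficients [1, a1, ..., a_order]."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:
            break
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def residual(frame, a):
    """Step i): inverse-filter the frame to get the excitation signal."""
    return np.convolve(frame, a)[:len(frame)]

def all_pole(excitation, a):
    """Step iii): drive the all-pole filter 1/A(z) with the excitation."""
    y = np.zeros(len(excitation))
    p = len(a) - 1
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, min(p, n) + 1):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y

def convert_frame(frame, order=16, transform=lambda a: a):
    """Decompose, transform the filter (step ii, identity by default),
    and resynthesise one frame."""
    a = lpc(frame, order)
    return all_pole(residual(frame, a), transform(a))

# Synthetic stand-in for half a second of speech at 16 kHz.
fs = 16000
rng = np.random.default_rng(0)
t = np.arange(fs // 2) / fs
signal = (np.sin(2 * np.pi * 150 * t) + 0.5 * np.sin(2 * np.pi * 900 * t)
          + 0.05 * rng.standard_normal(t.size))

frames = split_frames(signal, fs)
frame = frames[0] * np.hamming(len(frames[0]))
rebuilt = convert_frame(frame)  # identity transform: should reproduce frame
```

With the identity transform the resynthesised frame matches the input up to rounding error, which is a useful sanity check before plugging in a real spectral transform. A complete system would also overlap-add the processed frames back into a continuous signal.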

Current Assignee: UCL Business Ltd.
Original Assignee: UCL Business PLC (University of London)
Inventors: LEFF, Julian; WILLIAMS, Geoffrey; HUCKVALE, Mark
Granted Patent: US 9,837,091 B2

CPC Class Codes

G06T 13/205   driven by audio data

G06T 13/40   of characters, e.g. humans,...

G10L 13/10   Prosody rules derived from ...

G10L 19/06   Determination or coding of ...

G10L 19/125   Pitch excitation, e.g. pitc...

G10L 2021/105   Synthesis of the lips movem...

G10L 25/24   the extracted parameters be...
