Synchronizing the moveable mouths of animated characters with recorded speech

US 6,181,351 B1
Filed: 04/13/1998
Issued: 01/30/2001
Est. Priority Date: 04/13/1998
Status: Expired due to Term

First Claim

Patent Images

1. A computer-implemented method for creating annotated sound data, the method comprising:

acquiring speech sound data comprising an utterance and a textual representation of the utterance of the speech sound data;

supplying a data structure specifying the contents of the textual representation of the utterance of the speech sound data to a speech recognition engine;

with the speech recognition engine, analyzing the speech sound data comprising the utterance and the data structure specifying the contents of the textual representation of the utterance of the speech sound data to determine linguistic event values indicative of linguistic events in the speech sound data comprising the utterance and time values indicative of when within the speech sound data comprising the utterance the linguistic events occur; and

annotating the speech sound data comprising the utterance with the linguistic event values and the time values to create annotated sound data for synchronizing speech output with other computer output or processing.

View all claims

2 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

The animation of a speaking character is synchronized with recorded speech by creating and playing a linguistically enhanced sound file. A sound editing tool employs a speech recognition engine to create the linguistically enhanced sound file from recorded speech and a text of the speech. The speech recognition engine provides timing information related to word breaks and phonemes that is used by the sound editing tool to annotate the speech sound data when creating the linguistically enhanced sound file. When the linguistically enhanced sound file is played to produce sound output, the timing information is retrieved to control the animated character'"'"'s mouth movement and word pacing in the character'"'"'s word balloon. The sound editing tool additionally provides editing functions for manipulating the timing information. A text to speech engine can use the same programming interface as the linguistically enhanced sound file player to send notifications to the animation, providing prototyping without recorded speech. Since both use the same interface, recorded speech can be incorporated at a later time with minimal modifications.

Citations

30 Claims

1. A computer-implemented method for creating annotated sound data, the method comprising:
- acquiring speech sound data comprising an utterance and a textual representation of the utterance of the speech sound data;
  
  supplying a data structure specifying the contents of the textual representation of the utterance of the speech sound data to a speech recognition engine;
  
  with the speech recognition engine, analyzing the speech sound data comprising the utterance and the data structure specifying the contents of the textual representation of the utterance of the speech sound data to determine linguistic event values indicative of linguistic events in the speech sound data comprising the utterance and time values indicative of when within the speech sound data comprising the utterance the linguistic events occur; and
  
  annotating the speech sound data comprising the utterance with the linguistic event values and the time values to create annotated sound data for synchronizing speech output with other computer output or processing.
- View Dependent Claims (3, 4, 5, 6, 7, 8, 9, 10, 11)
- - 3. The method of claim 1, wherein a linguistic event value is indicative of a spoken phoneme.
  - 4. The method of claim 1, wherein a linguistic event value is indicative of a human mouth position.
  - 5. The method of claim 1, wherein the linguistic event value is indicative of a spoken word boundary.
  - 6. The method of claim 1, wherein a first linguistic event value is indicative of a spoken word boundary and a second linguistic event value is indicative of a spoken phoneme.
  - 7. The method of claim 1 further comprising:
8. The method of claim 1 wherein the sound file is created in a format that is the same as or compatible with the speech sound data, whereby the sound file can be played on a sound player that plays the speech sound data or compatible files.
9. The method of claim 1, further comprising:
- playing the speech sound data from the annotated sound data to present sound output;
  
  retrieving from the annotated sound data a linguistic event value and a time value; and
  
  performing an action in an animation indicative of the linguistic event at a time indicated by the time value, whereby the animation is synchronized with the linguistic event.
10. The method of claim 9, wherein the linguistic event value is indicative of a spoken phoneme and the action in the animation is the presentation of a mouth shape associated with the spoken phoneme.
11. The method of claim 9, wherein the linguistic event value is indicative of a spoken word and the action in the animation is a text presentation of the spoken word in a word balloon.

2. A computer-readable medium having computer-executable instructions for creating annotated sound data by performing the following:
- acquiring speech sound data comprising an utterance and a textual representation of the utterance of the speech sound data;
  
  supplying a data structure specifying the contents of the textual representation of the utterance of the speech sound data to a speech recognition engine;
  
  with the speech recognition engine, analyzing the speech sound data comprising the utterance and the data structure specifying the contents of the textual representation of the utterance of the speech sound data to determine linguistic event values indicative of linguistic events in the speech sound data comprising the utterance and time values indicative of when within the speech sound data comprising the utterance the linguistic events occur, and annotating the speech sound data comprising the utterance with the linguistic event values and the time values to create annotated sound data for synchronizing speech output with other computer output or processing.

12. A computer-implemented method for synchronizing a word balloon animation of an animated character with speech sound data via linguistic enhancement data specifying spoken word boundaries, the method comprising:
- playing the speech sound data to present sound output for the animated character;
  
  retrieving from the linguistic enhancement data a linguistic event value indicative of a spoken word boundary, and a time value indicative of when within the speech sound data the spoken word boundary occurs; and
  
  in the word balloon animation of the animated character, presenting an additional word at the time indicated by the time value whereby the word balloon animation is synchronized with the spoken word boundary.

13. A computer-readable medium having computer-executable instructions for synchronizing a word balloon animation of an animated character with speech sound data via linguistic enhancement data specifying spoken word boundaries by performing the following:
- playing the speech sound data to present sound output for the animated character;
  
  retrieving from the linguistic enhancement data a linguistic event value indicative of a spoken word boundary and a time value indicative of when within the speech sound data the spoken word boundary occurs; and
  
  in the word balloon animation of the animated character, presenting an additional word at the time indicated by the time value whereby the word balloon animation is synchronized with the spoken word boundary.

14. A computer-implemented method for synchronizing mouth animation of a character with speech sound data comprising an utterance by employing a speech recognition engine to determine when phonemes occur within the utterance of the speech sound data, the method comprising:
- providing a grammar based on a textual representation of the utterance of the speech sound data and the speech sound data to the speech recognition engine to produce an event list indicating when phonemes occur within the speech sound data, the event list comprising at least one phoneme event, the phoneme event comprising a phoneme type value indicative of a phoneme and a phoneme time value indicative of when within the utterance the phoneme occurs;
  
  annotating the speech sound data with the event list to produce a linguistically enhanced sound file;
  
  playing sound data from the linguistically enhanced sound file to produce sound output;
  
  reading the event list from the linguistically enhanced sound file;
  
  selecting a phoneme event in the list; and
  
  while playing the sound data, displaying a mouth shape associated with the phoneme indicated by the phoneme type value of the selected phoneme event at a time indicated by the phoneme time value of the selected phoneme event.

15. A computer-implemented method for synchronizing an animation of a character with speech sound data, the method comprising:
- building a grammar from a text of the speech sound data;
  
  providing the grammar and the speech sound data to a speech recognition engine to determine a phoneme value indicative of a member of the International Phonetic Alphabet occurring in the speech sound data, a phoneme time value indicative of when within the speech sound data the member occurs, and a word break time value indicative of when within the speech sound data a recognized word occurs;
  
  annotating the speech sound data with the phoneme value, the phoneme time value, and the word break time value to create a linguistically enhanced sound file;
  
  retrieving from the linguistically enhanced sound file the phoneme value, the phoneme time value, and the word break time value;
  
  dividing the speech sound data from the linguistically enhanced sound file into a plurality of segments according to the phoneme time value and the word break time value;
  
  sending the segments of the speech sound data from the linguistically enhanced sound file in an audio stream to an audio player to present sound output;
  
  sending between two segments in the audio stream to the audio player a notification item indicative of a phoneme value notification;
  
  sending between two segments in the audio stream to the audio player a notification item indicative of a word break;
  
  presenting in the character animation a mouth shape associated with the phoneme value when the audio player encounters the phoneme value notification item in the audio stream, whereby the character animation is synchronized with the sound output; and
  
  presenting in the character animation a text presentation of a word in a word balloon of the character when the audio player encounters the word break notification item in the audio stream, whereby the character animation is synchronized with the sound output.

16. A computer-readable medium having computer-executable instructions for for synchronizing an animation of a character with speech sound data by perforrming the following:
- building a grammar from a text of thc speech sound data;
  
  providing the grammar and the speech sound data to a speech recognition engine to determine a phoneme value indicative of a member of the International Phonetic Alphabet occurring in the speech sound data, a phoneme time value indicative of when within the speech sound data the member occurs, and a word break time value indicative of when within the speech sound data a recognized word occurs;
  
  annotating the speech sound data with the phoneme value, the phoneme time value, and the word break time value to create a linguistically enhanced sound file;
  
  retrieving from the linguistically enhanced sound file the phoneme value, the phoneme time value, and the word break time value;
  
  dividing the speech sound data from the linguistically enhanced sound file into a plurality of segments according to the phoneme time value and the word break time value;
  
  sending the segments of the speech sound data from the linguistically enhanced sound file in an audio stream to an audio player to present sound output;
  
  sending between two segments in the audio stream to the audio player a notification item indicative of a phoneme value notification;
  
  sending between two segments in the audio stream to the audio player a notification item indicative of a word break;
  
  presenting in the character animation a mouth shape associated with the phoneme value when the audio player encounters the phoneme value notification item in the audio stream, whereby the character animation is synchronized with the sound output; and
  
  presenting in the character animation a text presentation of a word in a word balloon of the character when the audio player encounters the word break notification item in the audio stream. whereby the character animation is synchronized with the sound output.

17. A computer-implemented system for synchronizing a character animation with speech sound data comprising an utterance, the system comprising:
- a speech recognition engine operable for receiving the speech sound data comprising the utterance and a list of one or more possibilities of the contents of the utterance of the speech sound data to provide a phoneme type value indicative of a phoneme occurring in the speech sound data and a phoneme time value indicative of when within the speech sound data the phoneme occurs;
  
  a linguistic information and sound editing tool operable for acquiring the speech sound data comprising the utterance and a textual representation of the contents of the utterance of the speech sound data, the linguistic information and sound editing tool operable for providing the sound data comprising the utterance to the speech recognition engine and the textual representation of the contents of the utterance of the speech sound data to the speech recognition engine as the list of one or more possibilities of the contents of the utterance of the speech sound data and further operable for annotating the speech sound data with the phoneme type value provided by the speech recognition engine and the phoneme time value provided by the speech recognition engine to create a linguistically enhanced sound file;
  
  a linguistically enhanced sound file player for playing the linguistically enhanced sound file to produce sound output from the sound data and operable to output the phoneme type value at a time indicated by the phoneme time value; and
  
  an animation server responsive to the phoneme type value output by the linguistically enhanced sound file player and operable to present in the character animation a mouth shape associated with the phoneme type value, whereby the character animation is synchronized with the sound output.
- View Dependent Claims (18, 19, 20, 21, 22)
- - 18. The system of claim 17 wherein,
19. The system of claim 17 further comprising:
- a text to speech engine operable to output synthetic speech and a phoneme type value at a time when a phoneme associated with the phoneme type value occurs in the synthetic speech;
  
  wherein the animation server is responsive to the phoneme type value output by the text to speech engine to present a mouth shape associated with the phoneme type value; and
  
  wherein a programming interface presented by the animation server to the linguistically enhanced sound file player for receiving a phoneme type value and a programming interface presented by the animation server to the text to speech engine for receiving a phoneme type value are the same or compatible.
20. The system of claim 17 further comprising:
- a text to speech engine operable to output synthetic speech and a phoneme type value at a time when a phoneme associated with the phoneme type value occurs in the synthetic speech;
  
  wherein the animation server is responsive to the phoneme type value output by the text to speech engine to present a mouth shape associated with the phoneme type value; and
  
  wherein the linguistically enhanced sound file player and the text to speech engine send a phoneme type value to the animation server in the same way.
21. The system of claim 17 wherein,the speech recognition engine is further operable to provide a word break time value indicative of when within the speech sound data a next word in the text of the speech sound data occurs;
- the linguistic information and sound editing tool is further operable to annotate the speech sound data with the word break time value provided by the speech recognition engine;
  
  the linguistically enhanced sound file player is further operable to output a next word notification at a time indicated by the word break time value from the linguistically enhanced sound file; and
  
  the animation server is further responsive to the next word notification output by the linguistically enhanced sound file player to present in the animation a next word in the text of the speech sound data, whereby the animation is synchronized with the sound output.
22. The system of claim 21 wherein,the linguistic information and sound editing tool presents the speech sound data as a graphical representation of sound waves;
- the word break time value is represented by the location of a graphical marker on the graphical representation of sound waves; and
  
  the linguistic information and sound editing tool is operable for modifying the word break time value when an edge of the graphical marker is manipulated.

23. A computer-readable medium having stored thereon a data structure for synchronizing speech sound data with a character animation, the data structure comprising at least two non-overlapping sections:
- a first section comprising continuous speech sound data comprising digitized recorded speech for use with an animated character, wherein the first section is positioned to be played by a sound player following a format not having linguistic enhancement data; and
  
  a second section not overlapping the first section comprising continuous speech sound data, the second section comprising a phoneme marking list comprising a list of phoneme events, wherein a phoneme event is indicative of a phoneme type and indicative of a time when within the speech sound data the phoneme type occurs, whereby the phoneme event can be used by a player to synchronize mouth movement of the animated character with the speech sound data.
- View Dependent Claims (24)
- - 24. The computer-readable medium of claim 23 wherein the data structure further comprises:

25. A computer-implemented system for synchronizing a character mouth animation with speech sound data comprising an utterance, the system comprising:
- a speech recognition means operable for receiving the speech sound data comprising the utterance and a grammar of the utterance of the speech sound data to provide a phoneme type value indicative of a phoneme occurring in the speech sound data and a phoneme time value indicative of when within the speech sound data the phoneme occurs;
  
  a linguistic information and sound editing means for acquiring the speech sound data and a textual representation of the utterance of the speech sound data, the linguistic information and sound editing means operable for providing the speech sound data to the speech recognition means and a grammar based on the textual representation of the utterance of the speech sound data to the speech recognition means as the grammar of the utterance and further operable for annotating the speech sound data with the phoneme type value provided by the speech recognition means and the phoneme time value provided by the speech recognition means to create a linguistically enhanced sound file;
  
  a linguistically enhanced sound file playing means for playing the linguistically enhanced sound file to produce sound output from the speech sound data and operable to output a phoneme type value at a time indicated by the phoneme time value; and
  
  an animation means responsive to the phoneme type value output by the linguistically enhanced sound file playing means and operable to present in a character animation a mouth shape associated with the phoneme type value, whereby the character mouth animation is synchronized with the sound output.

26. A computer-implemented method for creating an annotated file for synchronizing the mouth animation of an animated character with sound data comprising a recorded spoken utterance via a speech recognition engine, wherein the speech recognition engine is operable to accept a data structure specifying what to look for in the recorded spoken utterance, the method comprising:
- acquiring from a user a textual representation of the recorded spoken utterance;
  
  based on the textual representation of the recorded spoken utterance, constructing a data structure instructing the speech recognition engine to look in the recorded spoken utterance for phonemes corresponding to the textual representation;
  
  submitting to the speech recognition engine the sound data comprising the recorded spoken utterance and the data structure instructing the speech recognition engine to look in the recorded spoken utterance for phonemes corresponding to the textual representation;
  
  activating the speech recognition engine to identify times at which phonemes occur within the recorded spoken utterance; and
  
  creating a file comprising the sound data and annotations indicating the times at which phonemes occur within the recorded spoken utterance.
- View Dependent Claims (27, 28, 29, 30)
- - 27. The method of claim 26 wherein
28. The method of claim 26 wherein the file comprising the sound data and annotations indicating the times at which phonemes occur within the recorded spoken utterance is of a format in which the sound data is separate from and not intermingled with the annotations indicating the times at which phonemes occur within the recorded spoken utterance.
29. The method of claim 26 wherein the sound data in the file is of the same format as the recorded spoken utterance.
30. The method of claim 26 wherein the activating instructs the speech recognition engine to recognize a single possibility:
- linguistic content corresponding to the textual representation of the recorded spoken utterance.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Merrill, John Wickens Lamb, Weinberg, Mark Jeffrey, Trower, Tandy W. II
Primary Examiner(s)
Nguyen, Phu K.

Application Number

US09/059,681
Time in Patent Office

1,023 Days
Field of Search

345/435, 345/473, 345/474, 434/112, 434/118, 434/156, 434/167, 434/169, 434/185
US Class Current

345/473
CPC Class Codes

G10L 15/26 Speech to text systems G10L...

G10L 21/06 Transformation of speech in...

Synchronizing the moveable mouths of animated characters with recorded speech

First Claim

2 Assignments

0 Petitions

Accused Products

Abstract

Citations

30 Claims

Specification

Solutions

Use Cases

Quick Links

Synchronizing the moveable mouths of animated characters with recorded speech

First Claim

2 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

30 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links