Method and system for converting text to lip-synchronized speech in real time
First Claim
1. A method for presenting information in real time, the method comprising:
- providing a plurality of rules for controlling modification of words of a sequence of words, the rules including rules to add a sound after a phrase, to replace words with words of different complexity, to remove certain verbs without replacing the verbs, and to modify words based on identification of a current expression derived from comparison of words of the sequence to be spoken;
- providing an expression store with images of a character representing different expressions of emotion for that character;
- receiving a sequence of words;
- modifying the words of the received sequence by, for each of a plurality of rules, determining whether the rule applies to words of the received sequence and, when it is determined that the rule applies, modifying the words of the received sequence in accordance with the rule;
- generating speech for the character corresponding to the modified words, the speech represented by a sequence of phonemes, including replacing phonemes with other phonemes to achieve regional effects;
- identifying expressions of emotion from the words of the received sequence;
- mapping the phonemes of the speech and the identified expressions for the character to the words of the received sequence;
- generating a sequence of images based on the images of the expression store to represent the character speaking the generated speech and having the identified expressions of emotion and to represent hands of the character moved to effect output of the modified words in a sign language, wherein the mapping to words of the received sequence is used to synchronize the movement of the lips representing the character enunciating the phonemes of the words with the image of the character exhibiting the identified expressions of emotion mapped to those words, so that the speaking of a word is synchronized with the image of the character exhibiting the expression of emotion identified from that word; and
- outputting the generated speech represented by the sequence of phonemes and the sequence of generated images to portray the character speaking the words of the modified received sequence and having the identified expressions.
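The rule-application step of claim 1 (determine whether each rule applies, then modify the word sequence accordingly) could be sketched as follows. This is an illustrative sketch only: the rule representation, the example rules, and all function names are assumptions, not taken from the patent.

```python
# Hypothetical sketch of the rule-driven word-modification step of claim 1.
# Each rule is a (predicate, transform) pair: the predicate decides whether
# the rule applies to the word sequence; the transform rewrites the sequence.

def _insert_after_phrase(words, phrase, sound):
    """Insert `sound` immediately after every occurrence of `phrase`."""
    out, i = [], 0
    while i < len(words):
        if words[i:i + len(phrase)] == phrase:
            out.extend(phrase)
            out.append(sound)
            i += len(phrase)
        else:
            out.append(words[i])
            i += 1
    return out

RULES = [
    # Add a sound after a phrase (here: an "uh" after "you know").
    (lambda words: "you know" in " ".join(words),
     lambda words: _insert_after_phrase(words, ["you", "know"], "uh")),
    # Replace a word with a word of different complexity.
    (lambda words: "utilize" in words,
     lambda words: ["use" if w == "utilize" else w for w in words]),
    # Remove a certain verb without replacing it.
    (lambda words: "is" in words,
     lambda words: [w for w in words if w != "is"]),
]

def modify_words(words):
    """Apply each rule whose predicate matches, as in the claimed method."""
    for applies, transform in RULES:
        if applies(words):
            words = transform(words)
    return words
```

For example, `modify_words("you know I utilize it".split())` triggers the first two rules, inserting the sound and simplifying the verb.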
Abstract
A method and system for presenting lip-synchronized speech corresponding to text received in real time is provided. A lip synchronization system provides an image of a character that is to be portrayed as speaking the received text. The lip synchronization system receives a sequence of text corresponding to the speech of the character and may modify the received text in various ways before synchronizing the lips. It may generate phonemes for the modified text that are adapted to certain idioms, and it identifies expressions of emotion from the received text. The lip synchronization system then generates the lip-synchronized images based on the phonemes generated from the modified text and based on the identified expressions.
24 Claims
9. A system for presenting a lip-syncing character, comprising:
- a rules store containing rules for controlling modification of words of a sequence of words, the rules including rules to add a sound after a phrase and to remove certain verbs;
- an expression store containing images of a character representing different expressions of emotion for that character;
- a modify word component that receives a sequence of words in real time and modifies the words of the sequence in accordance with the rules of the rules store;
- an identify expressions component that identifies expressions of emotion from the words of the sequence and maps the expressions of emotion to the words; and
- a lip synchronization component that inputs the modified words of the sequence, the map of expressions of emotion to the words, and the images of the character representing different expressions of emotion, and outputs in real time, as the words are received, speech corresponding to the modified words of the sequence, images of the character speaking the output speech and having the identified expressions of emotion synchronized to the speech as indicated by the map, and images of hands of the character moving to effect output of the modified words in a sign language.
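The component structure of claim 9 (rules store, expression store, and the modify, identify, and synchronization components) could be wired together as in the sketch below. All class names, the toy emotion lexicon, and the stand-in image values are hypothetical; the patent does not prescribe an implementation.

```python
# Illustrative sketch of the claimed system's components.
class RulesStore:
    def __init__(self, rules):
        self.rules = rules            # (predicate, transform) pairs

class ExpressionStore:
    def __init__(self, images):
        self.images = images          # emotion -> image of the character

class ModifyWordComponent:
    def __init__(self, rules_store):
        self.rules_store = rules_store
    def modify(self, words):
        # Apply each rule of the rules store that applies to the sequence.
        for applies, transform in self.rules_store.rules:
            if applies(words):
                words = transform(words)
        return words

class IdentifyExpressionsComponent:
    EMOTION_WORDS = {"great": "happy", "terrible": "sad"}  # toy lexicon
    def identify(self, words):
        # Map each word position to an expression of emotion (or neutral).
        return {i: self.EMOTION_WORDS.get(w, "neutral")
                for i, w in enumerate(words)}

class LipSynchronizationComponent:
    def __init__(self, expression_store):
        self.expression_store = expression_store
    def render(self, words, expression_map):
        # Emit (word, expression image) pairs standing in for the
        # synchronized speech audio and character frames.
        return [(w, self.expression_store.images[expression_map[i]])
                for i, w in enumerate(words)]
```

A usage pass: modified words flow from `ModifyWordComponent` through `IdentifyExpressionsComponent` into `LipSynchronizationComponent`, which pairs each word with the image for its mapped emotion.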
-
-
16. A computer-readable storage medium containing instructions for controlling a computer to present images of a character speaking, by a method comprising:
- providing a plurality of rules for controlling modification of words of a sequence of words, the rules including rules to add a sound after a phrase and to replace words with words of different complexity;
- providing images of a character representing different expressions of emotion of the character;
- receiving a sequence of words in real time;
- modifying the words of the sequence in accordance with the provided rules;
- after modifying the words, generating speech corresponding to the received sequence of words as modified;
- identifying expressions of emotion from the words of the received sequence of words;
- generating a sequence of images based on the provided images to represent the character speaking the generated speech and exhibiting the identified expressions of emotion, so that the speaking of a word is synchronized with an expression of emotion identified from that word, and to represent the character using a sign language to effect the output of modified words of the sequence; and
- outputting the generated speech and sequence of images to portray the character speaking the text with the identified expression of emotion.
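The speech-generation step, together with the regional phoneme substitution recited in claim 1 and the per-word mapping the claims use for synchronization, could look like the sketch below. The toy lexicon, phoneme symbols, and the non-rhotic substitution rule are all illustrative assumptions.

```python
# Hypothetical sketch of phoneme generation with a regional effect and the
# word-to-phoneme mapping used to synchronize lips with expression images.
TOY_LEXICON = {
    "car":  ["k", "aa", "r"],
    "park": ["p", "aa", "r", "k"],
    "here": ["h", "ih", "r"],
}

def to_phonemes(word):
    # Look the word up in the toy lexicon; fall back to its letters.
    return list(TOY_LEXICON.get(word, word))

def apply_regional_effect(phonemes):
    """Replace post-vocalic "r" with "ah" (a toy non-rhotic accent rule),
    i.e. replace phonemes with other phonemes to achieve a regional effect."""
    out = []
    for i, p in enumerate(phonemes):
        if p == "r" and i > 0 and phonemes[i - 1] in ("aa", "ih"):
            out.append("ah")
        else:
            out.append(p)
    return out

def map_words_to_phonemes(words):
    """Return a word -> phoneme-list map; keying phonemes by word is what
    lets lip movement for a word line up with that word's expression image."""
    return {w: apply_regional_effect(to_phonemes(w)) for w in words}
```

Because the map is keyed by word, the renderer can look up both the phonemes to enunciate and the expression image for the same word at the same time.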
Specification