System and method for triphone-based unit selection for visual speech synthesis

US 7,933,772 B1
Filed: 03/19/2008
Issued: 04/26/2011
Est. Priority Date: 05/10/2002
Status: Expired due to Fees


10 Assignments

0 Petitions

Abstract

A system and method for generating a video sequence having mouth movements synchronized with speech sounds are disclosed. The system utilizes a database of n-phones as the smallest selectable unit, wherein n is larger than 1 and preferably 3. The system calculates a target cost for each candidate n-phone for a target frame using a phonetic distance, coarticulation parameter, and speech rate. For each n-phone in a target sequence, the system searches for candidate n-phones that are visually similar according to the target cost. The system samples each candidate n-phone to get a same number of frames as in the target sequence and builds a video frame lattice of candidate video frames. The system assigns a joint cost to each pair of adjacent frames and searches the video frame lattice to construct the video sequence by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
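
The lattice search summarized in the abstract can be illustrated with a minimal Python sketch. It assumes the target cost of every candidate frame is already available and that the joint cost is supplied as a function; the names `viterbi_path`, `target_costs`, and `joint_cost` are hypothetical and do not come from the patent. The sketch shows only the general dynamic-programming idea of minimizing the sum of target and joint costs, not the patented implementation.

```python
from typing import Callable, List, Sequence, Tuple


def viterbi_path(
    target_costs: Sequence[Sequence[float]],
    joint_cost: Callable[[int, int, int], float],
) -> Tuple[List[int], float]:
    """Pick one candidate per frame so that total cost is minimal.

    target_costs[t][i] is the target cost of candidate i at frame t.
    joint_cost(t, i, j) is the cost of placing candidate i at frame
    t - 1 directly before candidate j at frame t.
    Every frame is assumed to have at least one candidate.
    """
    if not target_costs:
        return [], 0.0

    # best[i]: cheapest cost of any path ending in candidate i so far.
    best = list(target_costs[0])
    back: List[List[int]] = []  # back-pointers, one list per frame after the first

    for t in range(1, len(target_costs)):
        new_best: List[float] = []
        pointers: List[int] = []
        for j, tc in enumerate(target_costs[t]):
            # Cheapest predecessor for candidate j at frame t.
            prev, cost = min(
                ((i, best[i] + joint_cost(t, i, j)) for i in range(len(best))),
                key=lambda pair: pair[1],
            )
            new_best.append(cost + tc)
            pointers.append(prev)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    end = min(range(len(best)), key=best.__getitem__)
    total = best[end]
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, total


# Toy lattice: 3 frames, 2 candidates each; switching candidates costs 0.5.
costs = [[1.0, 0.0], [0.0, 2.0], [1.0, 0.0]]
path, total = viterbi_path(costs, lambda t, i, j: 0.0 if i == j else 0.5)
```

The running time grows with the number of frames times the square of the candidates per frame, which is one reason a practical system would want to bound the size of each candidate list.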

28 Claims

1. A non-transitory computer-readable storage medium storing instructions which, when executed by a computing device, cause the computing device to generate a video sequence having mouth movements synchronized with speech sounds, the instructions comprising:
- calculating a target cost for each candidate n-phone from a database of n-phones for a target sequence;
- building a video frame lattice of candidate video frames based on the candidate n-phones;
- assigning a joint cost to each pair of adjacent video frames; and
- constructing the video sequence according to a Viterbi search on the video frame lattice by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.

2. The non-transitory computer-readable storage medium of claim 1, the instructions further comprising:
- for each target frame in the target sequence, searching for candidate n-phones that are phonetically and/or visually similar according to the target cost; and
- sampling each candidate n-phone to get a same number of candidate phone frames as in the target sequence.

3. The non-transitory computer-readable storage medium of claim 1, wherein where n-phone candidates cannot be selected, the instructions further comprise selecting candidate (n−1)-phones from a database of (n−1)-phones.

4. The non-transitory computer-readable storage medium of claim 1, wherein if the number of n-phone candidates selected for a target frame is below a threshold, the instructions further comprise selecting candidate (n−1)-phones from a database of (n−1)-phones.

5. The non-transitory computer-readable storage medium of claim 4, wherein the threshold number of n-phone candidates selected for a target frame is approximately 30.

6. The non-transitory computer-readable storage medium of claim 1, wherein the database of n-phones further comprises a plurality of n-visemes.

7. The non-transitory computer-readable storage medium of claim 6, wherein each n-viseme represents at least two n-phones sharing similar characteristics.

8. The non-transitory computer-readable storage medium of claim 7, wherein each n-viseme is a tri-viseme.

9. The non-transitory computer-readable storage medium of claim 1, wherein an n-phone is a smallest selectable unit in the database and n is larger than 1.

10. The non-transitory computer-readable storage medium of claim 1, wherein calculating the target cost is based in a phonetic distance, coarticulation parameter, and speech rate.
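
The sampling step recited in claim 2 (bringing each candidate to the same number of frames as the target) could look roughly like the Python sketch below. The claim does not say how frames are chosen, so the evenly spaced, nearest-index selection used here is an assumption, and `resample_frames` is a hypothetical helper name.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def resample_frames(frames: Sequence[Frame], target_len: int) -> List[Frame]:
    """Return target_len frames picked at evenly spaced positions.

    Assumption: frames are repeated when stretching and skipped when
    shrinking; the source does not specify the sampling rule.
    """
    if not frames or target_len <= 0:
        return []
    if target_len == 1:
        # A single target frame: take the middle of the candidate.
        return [frames[len(frames) // 2]]
    last = len(frames) - 1
    span = target_len - 1
    # Integer arithmetic for round-half-up of k * last / span.
    return [frames[(2 * k * last + span) // (2 * span)] for k in range(target_len)]


shrunk = resample_frames(list("abcde"), 3)
stretched = resample_frames(list("abc"), 5)
```

After this step every candidate contributes exactly one frame per target frame, which is what lets the candidates be stacked into a frame lattice.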

11. A system for generating a bitstream having mouth movements synchronized with speech sounds, the system comprising:
- a processor;
- a first module configured to control the processor to calculate a target cost for each candidate n-phones from a database of n-phones for a target sequence;
- a second module configured to control the processor to build a video frame lattice of candidate video frames according to the candidate n-phones;
- a third module configured to control the processor to assign a joint cost to each pair of adjacent video frames; and
- a fourth module configured to control the processor to construct the video sequence according to a Viterbi search on the video frame lattice by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.

12. The system of claim 11, further comprising:
- a fifth module configured to control the processor to, for each target frame in a target sequence, search for candidate n-phones that are phonetically and/or visually similar according to the target cost; and
- a sixth module configured to control the processor to sample each candidate n-phone to get a same number of candidate n-phone frames as in the target sequence.

13. The system of claim 11, wherein when n-phone candidates cannot be selected, a fifth module controls the processor to select candidates (n−1)-phones from a database of (n−1)-phones.

14. The system of claim 11, wherein if the number of n-phone candidates selected for a target frame is below a threshold, a fifth module controls the processor to select candidate (n−1)-phones from a database of (n−1)-phones.

15. The system of claim 14, wherein the threshold number of n-phone candidates selected for a target frame is approximately 30.

16. The system of claim 11, wherein the database of n-phones further comprises a plurality of n-visemes.

17. The system of claim 16, wherein each n-viseme represents at least two n-phones sharing similar characteristics.

18. The system of claim 17, wherein each n-viseme is a tri-viseme.

19. The system of claim 11, wherein an n-phone is a small selectable unit in the database and n is larger than 1.

20. The system of claim 11, wherein the first module is further configured to control the processor to calculate that target cost based on a phonetic distance, coarticulation parameter and speech rate.
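
The fallback in claims 13 to 15 (use (n−1)-phones when too few n-phone candidates are found, with a threshold of approximately 30) can be sketched in Python as below. The database layout (a dictionary from phone tuples to unit ids), the function name `select_candidates`, and the choice to look up every contiguous (n−1)-phone window of the target are all assumptions made for illustration; the claims only state that (n−1)-phone candidates are selected from a database of (n−1)-phones.

```python
from typing import Dict, List, Sequence, Tuple

Phones = Tuple[str, ...]
UnitDb = Dict[Phones, Sequence[int]]  # hypothetical: phone tuple -> unit ids

MIN_CANDIDATES = 30  # the claims give the threshold as "approximately 30"


def select_candidates(
    target: Phones,
    n_phone_db: UnitDb,
    lower_order_db: UnitDb,
    min_candidates: int = MIN_CANDIDATES,
) -> List[Tuple[int, int]]:
    """Return (order, unit_id) pairs for one target n-phone.

    n-phone matches are always kept. If there are fewer than
    min_candidates of them, (n-1)-phone matches are appended.
    """
    candidates = [(len(target), uid) for uid in n_phone_db.get(target, [])]
    if len(candidates) >= min_candidates:
        return candidates

    width = len(target) - 1
    if width < 1:
        return candidates  # nothing shorter to fall back to

    # Assumption: every contiguous (n-1)-phone window of the target is tried.
    for start in range(len(target) - width + 1):
        window = target[start:start + width]
        candidates.extend((width, uid) for uid in lower_order_db.get(window, []))
    return candidates


triphones = {("s", "ih", "t"): [1, 2]}
diphones = {("s", "ih"): [10], ("ih", "t"): [11, 12]}
with_fallback = select_candidates(("s", "ih", "t"), triphones, diphones, min_candidates=3)
without_fallback = select_candidates(("s", "ih", "t"), triphones, diphones, min_candidates=2)
```

Returning the order alongside each unit id lets a later cost function treat shorter-context matches differently, although the claims do not say whether that is done.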

21. A method of generating a video sequence having synchronized movement and sound, the method comprising:
- building a video frame lattice of candidate video frames based on a calculated target cost for each candidate n-phone associated with a target sequence;
- assigning, via a processor, a joint cost to each pair of adjacent video frames in the video frame lattice; and
- generating the video sequence by finding the optimal path through the video frame lattice based at least in part on the target cost and the joint cost over the video sequence.

22. The method of claim 21, further comprising:
- for each target frame in the target sequence, searching for candidate n-phones that are phonetically and/or visually similar according to the target cost; and
- sampling each candidate n-phone to get a same number of candidate n-phone frames as in the target sequence.

23. The method of claim 21, wherein where candidate n-phone cannot be selected, the method further comprises selecting candidate (n−1)-phones from a database of (n−1)-phones.

24. The method of claim 21, wherein if the number of candidate n-phone selected for a target frame is below a threshold, then method further comprises selecting candidate (n−1)-phones from a database of (n−1)-phones.

25. The method of claim 24, wherein the threshold number of candidate n-phone selected for a target frame is approximately 30.

26. The method of claim 21, wherein the database of n-phones further comprises a plurality of n-visemes.

27. The method of claim 26, wherein each n-viseme represents at least two n-phones sharing similar characteristics.

28. The method of claim 21, wherein rendering the video sequence is based on a minimum of the sum of the target cost and joint cost.
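
Claims 26 and 27 describe n-visemes that each stand for at least two n-phones with similar characteristics. One way to picture this is to map every phone to a visual class and group n-phones whose class sequences match, as in the Python sketch below. The phone-to-class table is a small illustrative assumption (the source does not list its viseme classes), and `to_n_viseme` and `group_by_n_viseme` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Phones = Tuple[str, ...]

# Illustrative assumption: a tiny phone-to-viseme-class table.
PHONE_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
}


def to_n_viseme(n_phone: Phones) -> Phones:
    """Replace each phone by its viseme class; unmapped phones stay as-is."""
    return tuple(PHONE_TO_VISEME.get(phone, phone) for phone in n_phone)


def group_by_n_viseme(n_phones: Iterable[Phones]) -> Dict[Phones, List[Phones]]:
    """Group n-phones that share the same sequence of viseme classes."""
    groups: Dict[Phones, List[Phones]] = defaultdict(list)
    for n_phone in n_phones:
        groups[to_n_viseme(n_phone)].append(n_phone)
    return dict(groups)


groups = group_by_n_viseme([("p", "aa", "t"), ("b", "aa", "t"), ("f", "aa", "t")])
```

In this toy input the triphones p-aa-t and b-aa-t fall into one tri-viseme because both begin with a bilabial closure, so either could serve as a visual candidate for the other.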

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: AT&T Intellectual Property I LP (AT&T, Inc.)
Inventors: Cosatto, Eric; Graf, Hans Peter; Huang, Fu Jie
Primary Examiner(s): Sked, Matthew J
Application Number: US12/051,311
Time in Patent Office: 1,133 Days
Field of Search: None
US Class Current: 704/235
CPC Class Codes:
- G10L 13/07   Concatenation rules
- G10L 15/08   Speech classification or se...
- G10L 15/26   Speech to text systems G10L...
- G10L 2021/105   Synthesis of the lips movem...
- H04N 19/00   Methods or arrangements for...
