System and method for triphone-based unit selection for visual speech synthesis

US 7,369,992 B1
Filed: 02/16/2007
Issued: 05/06/2008
Est. Priority Date: 05/10/2002
Status: Expired due to Fees

Abstract

A system and method for generating a video sequence having mouth movements synchronized with speech sounds are disclosed. The system utilizes a database of n-phones as the smallest selectable unit, wherein n is larger than 1 and preferably 3. The system calculates a target cost for each candidate n-phone for a target frame using a phonetic distance, a coarticulation parameter, and a speech rate. For each n-phone in a target sequence, the system searches for candidate n-phones that are visually similar according to the target cost. The system samples each candidate n-phone to obtain the same number of frames as in the target sequence and builds a video frame lattice of candidate video frames. The system assigns a joint cost to each pair of adjacent frames and searches the video frame lattice to construct the video sequence by finding the optimal path through the lattice, that is, the path that minimizes the sum of the target cost and the joint cost over the sequence.
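
The lattice search described in the abstract can be illustrated with a minimal Python sketch. It assumes the target costs have already been computed for every candidate at every frame and that the joint cost is supplied as a function; the names `viterbi_path`, `target_costs`, and `joint_cost` are hypothetical and do not come from the patent, which does not prescribe data structures or cost formulas.

```python
from typing import Callable, List, Sequence, Tuple


def viterbi_path(
    target_costs: Sequence[Sequence[float]],
    joint_cost: Callable[[int, int, int], float],
) -> Tuple[float, List[int]]:
    """Return (total cost, chosen candidate index per frame).

    target_costs[t][i] is the target cost of candidate i at frame t.
    joint_cost(t, i, j) is the cost of following candidate i at frame t-1
    with candidate j at frame t (hypothetical signature).
    """
    if not target_costs:
        return 0.0, []

    # best[i] = cheapest cost of any path ending in candidate i at the current frame
    best = list(target_costs[0])
    back: List[List[int]] = []  # back[t-1][j] = best predecessor of j at frame t

    for t in range(1, len(target_costs)):
        new_best: List[float] = []
        pointers: List[int] = []
        for j, tc in enumerate(target_costs[t]):
            prev = min(range(len(best)), key=lambda i: best[i] + joint_cost(t, i, j))
            new_best.append(best[prev] + joint_cost(t, prev, j) + tc)
            pointers.append(prev)
        best = new_best
        back.append(pointers)

    # Pick the cheapest final candidate, then follow the back-pointers.
    last = min(range(len(best)), key=best.__getitem__)
    total = best[last]
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return total, path
```

With three frames, two candidates per frame, and a joint cost that penalizes switching candidates, the search can prefer a candidate with a worse target cost at one frame because it avoids a costly transition later; a per-frame greedy choice would miss that trade-off.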

21 Claims

1. A method of generating a video sequence having mouth movements synchronized with speech sounds, the method utilizing a database of n-phones, the method comprising:
- calculating a target cost for each candidate n-phone for a target sequence;
  
  building a video frame lattice of candidate video frames based on the candidate n-phones;
  
  assigning a joint cost to each pair of adjacent video frames; and
  
  constructing the video sequence according to a Viterbi search on the video frame lattice by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
- - 2. The method of claim 1, further comprising:
    - for each target frame in the target sequence, searching for candidate n-phones that are phonetically and/or visually similar according to the target cost; and
      
      sampling each candidate n-phone to get a same number of candidate phone frames as in the target sequence.
  - 3. The method of claim 1, wherein where n-phone candidates cannot be selected, the method further comprises selecting candidate (n−1)-phones from a database of (n−1)-phones.
  - 4. The method of claim 1, wherein if the number of n-phone candidates selected for a target frame is below a threshold, then the method further comprises selecting candidate (n−1)-phones from a database of (n−1)-phones.
  - 5. The method of claim 4, wherein the threshold number of n-phone candidates selected for a target frame is approximately 30.
  - 6. The method of claim 1, wherein the database of n-phones further comprises a plurality of n-visemes.
  - 7. The method of claim 6, wherein each n-viseme represents at least two n-phones sharing similar characteristics.
  - 8. The method of claim 7, wherein each n-viseme is a tri-viseme.
  - 9. The method of claim 1, wherein an n-phone is a smallest selectable unit in the database and n is larger than 1.
  - 10. The method of claim 1, wherein calculating the target cost is based on a phonetic distance, coarticulation parameter, and speech rate.
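
The back-off behavior recited in claims 3 to 5 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the databases are modeled as plain dictionaries from phone tuples to unit identifiers, the function name `select_candidates` is hypothetical, and the choice of which (n−1)-phones to try (the two contiguous sub-sequences of the target) is an assumption, since the claims do not say how they are derived. The default threshold of 30 follows claim 5.

```python
from typing import Dict, List, Tuple

Phones = Tuple[str, ...]


def select_candidates(
    target: Phones,
    nphone_db: Dict[Phones, List[str]],
    backoff_db: Dict[Phones, List[str]],
    threshold: int = 30,
) -> List[str]:
    """Return candidate unit ids for `target`, backing off when too few exist."""
    candidates = list(nphone_db.get(target, []))
    if len(candidates) >= threshold or len(target) < 2:
        return candidates

    # Assumption: the (n-1)-phones to try are the two contiguous
    # sub-sequences of the target (drop the last phone, drop the first).
    for shorter in (target[:-1], target[1:]):
        for unit in backoff_db.get(shorter, []):
            if unit not in candidates:
                candidates.append(unit)
    return candidates
```

For a triphone target such as `("s", "ih", "t")` with a single matching unit and a threshold of 2, the sketch would add the units stored under the diphones `("s", "ih")` and `("ih", "t")` to the candidate list.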

11. A computing device for generating a video sequence having mouth movements synchronized with speech sounds, the computing device utilizing a database of n-phones, the computing device comprising:
- a module configured to calculate a target cost for each candidate n-phone for a target sequence;
  
  a module configured to build a video frame lattice of candidate video frames according to the candidate n-phones;
  
  a module configured to assign a joint cost to each pair of adjacent video frames; and
  
  a module configured to construct the video sequence according to a Viterbi search on the video frame lattice by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
- - 12. The computing device of claim 11, further comprising:
    - a module configured, for each target frame in a target sequence, to search for candidate n-phones that are phonetically and/or visually similar according to the target cost; and
      
      a module configured to sample each candidate n-phone to get a same number of candidate n-phone frames as in the target sequence.
  - 13. The computing device of claim 11, wherein when n-phone candidates cannot be selected, the computing device further selects candidate (n−1)-phones from a database of (n−1)-phones.
  - 14. The computing device of claim 11, wherein if the number of n-phone candidates selected for a target frame is below a threshold, then the computing device further selects candidate (n−1)-phones from a database of (n−1)-phones.
  - 15. The computing device of claim 14, wherein the threshold number of n-phone candidates selected for a target frame is approximately 30.
  - 16. The computing device of claim 11, wherein the database of n-phones further comprises a plurality of n-visemes.
  - 17. The computing device of claim 16, wherein each n-viseme represents at least two n-phones sharing similar characteristics.
  - 18. The computing device of claim 17, wherein each n-viseme is a tri-viseme.
  - 19. The computing device of claim 11, wherein an n-phone is a small selectable unit in the database and n is larger than 1.
  - 20. The computing device of claim 11, wherein the module configured to calculate further calculates the target cost based on a phonetic distance, coarticulation parameter, and speech rate.
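
The sampling step of claims 2 and 12, in which each candidate n-phone is sampled to yield the same number of frames as the target, could look like the following Python sketch. The claims do not specify a sampling scheme; nearest-neighbor index selection is assumed here purely for illustration, and `resample_frames` is a hypothetical name.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def resample_frames(frames: Sequence[T], target_len: int) -> List[T]:
    """Pick `target_len` frames from `frames`, evenly spaced by index.

    Frames are repeated when the candidate is shorter than the target
    and skipped when it is longer (assumed nearest-neighbor scheme).
    """
    if target_len <= 0:
        return []
    if not frames:
        raise ValueError("cannot resample an empty candidate")
    n = len(frames)
    # Integer arithmetic maps output slot k to source index floor(k * n / target_len).
    return [frames[k * n // target_len] for k in range(target_len)]
```

After this step every candidate contributes exactly one frame per target frame, which is what allows the candidates to be stacked into columns of a frame lattice.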

21. A computer-readable medium storing a computer program having instructions for controlling a computing device to generate a video sequence having mouth movements synchronized with speech sounds, the computer program utilizing a database of n-phones, the instructions comprising:
- calculating a target cost for each candidate n-phone for a target sequence;
  
  building a video frame lattice of candidate video frames based on the candidate n-phones;
  
  assigning joint cost to each pair of adjacent video frames; and
  
  constructing video sequences according to a Viterbi search on the video frame lattice by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: AT&T Corporation (AT&T, Inc.)
Inventors: Cosatto, Eric; Graf, Hans Peter; Huang, Fu Jie
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Sked, Matthew J.
Application Number: US11/675,813
Time in Patent Office: 445 Days
Field of Search: None
US Class Current: 704/235
CPC Class Codes:
G10L 13/07   Concatenation rules
G10L 15/08   Speech classification or se...
G10L 15/26   Speech to text systems G10L...
G10L 2021/105   Synthesis of the lips movem...
H04N 19/00   Methods or arrangements for...