System and method for triphone-based unit selection for visual speech synthesis

US 7,209,882 B1
Filed: 05/10/2002
Issued: 04/24/2007
Est. Priority Date: 05/10/2002
Status: Active Grant

Abstract

A system and method for generating a video sequence having mouth movements synchronized with speech sounds are disclosed. The system utilizes a database of n-phones as the smallest selectable unit, wherein n is larger than 1 and preferably 3. The system calculates a target cost for each candidate n-phone for a target frame using a phonetic distance, coarticulation parameter, and speech rate. For each n-phone in a target sequence, the system searches for candidate n-phones that are visually similar according to the target cost. The system samples each candidate n-phone to get a same number of frames as in the target sequence and builds a video frame lattice of candidate video frames. The system assigns a joint cost to each pair of adjacent frames and searches the video frame lattice to construct the video sequence by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
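
The final search step described in the abstract can be illustrated with a small dynamic-programming sketch. This is not the patented implementation; it only shows how a Viterbi pass might pick one candidate per target frame so that the summed target and joint costs are minimal. The function name `viterbi_path`, the `target_costs` table, and the `joint_cost(t, i, j)` callback signature are assumptions made for this example, and every frame is assumed to have at least one candidate.

```python
from typing import Callable, List, Sequence, Tuple


def viterbi_path(
    target_costs: Sequence[Sequence[float]],
    joint_cost: Callable[[int, int, int], float],
) -> Tuple[List[int], float]:
    """Return (chosen candidate index per frame, total cost).

    target_costs[t][j]: target cost of candidate j at target frame t.
    joint_cost(t, i, j): cost of joining candidate i at frame t-1 to
    candidate j at frame t (hypothetical callback for this sketch).
    """
    if not target_costs:
        return [], 0.0
    best = list(target_costs[0])      # cheapest cost of reaching each candidate
    back: List[List[int]] = []        # back-pointers, one list per frame >= 1
    for t in range(1, len(target_costs)):
        new_best, pointers = [], []
        for j, tc in enumerate(target_costs[t]):
            # Cheapest predecessor for candidate j at frame t.
            prev = min(range(len(best)),
                       key=lambda i: best[i] + joint_cost(t, i, j))
            new_best.append(best[prev] + joint_cost(t, prev, j) + tc)
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Trace the optimal path backwards from the cheapest final candidate.
    last = min(range(len(best)), key=best.__getitem__)
    total = best[last]
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, total
```

With a joint cost that penalizes switching candidates, the search prefers to stay on one candidate even when a neighbor has a lower target cost at a single frame, which is the trade-off the summed cost is meant to capture.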

28 Claims

1. A method of generating a video sequence having mouth movements synchronized with speech sounds, the method utilizing a database of n-phones as a smallest selectable unit, where n is larger than 1, the method comprising:
- calculating a target cost for each candidate n-phone for a target sequence using a phonetic distance, coarticulation parameter, and speech rate;
  
  for each target frame in the target sequence, searching for candidate n-phones that are phonetically and/or visually similar according to the target cost;
  
  sampling each candidate n-phone to get a same number of candidate phone frames as in the target sequence;
  
  building a video frame lattice of candidate video frames based on the candidate n-phones;
  
  assigning a joint cost to each pair of adjacent video frames; and
  
  constructing the video sequence according to a Viterbi search on the video frame lattice by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
- - 2. The method of claim 1, wherein where n-phone candidates cannot be selected, the method further comprises selecting candidate (n−1)-phones from a database of (n−1)-phones.
  - 3. The method of claim 1, wherein if the number of n-phone candidates selected for a target frame is below a threshold, then method further comprises selecting candidate (n−1)-phones from a database of (n−1)-phones.
  - 4. The method of claim 3, wherein the threshold number of n-phone candidates selected for a target frame is approximately 30.
  - 5. The method of claim 1, wherein the database of n-phones further comprises a plurality of n-visemes.
  - 6. The method of claim 5, wherein each n-viseme represents at least two n-phones sharing similar characteristics.
  - 7. The method of claim 6, wherein each n-viseme is a tri-viseme.
  - 8. The method of claim 1, wherein n=3.
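
The back-off in claims 2 to 4 can be sketched as follows. This is a hedged illustration only: the claims do not say how the (n−1)-phone is derived from the target n-phone, so the choice here to try both the left and right sub-sequences is an assumption, as are the names `select_candidates`, `nphone_db`, and `backoff_db`. The default threshold of 30 comes from claim 4.

```python
from typing import Dict, List, Tuple

Phones = Tuple[str, ...]


def select_candidates(
    target: Phones,
    nphone_db: Dict[Phones, List[str]],
    backoff_db: Dict[Phones, List[str]],
    threshold: int = 30,
) -> List[str]:
    """Collect candidate unit ids for a target n-phone.

    If fewer than `threshold` n-phone candidates are found (including
    none at all), also collect (n-1)-phone candidates.
    """
    candidates = list(nphone_db.get(target, []))
    if len(candidates) < threshold:
        # Assumption: back off to both sub-sequences of length n-1,
        # e.g. triphone (s, ih, t) -> diphones (s, ih) and (ih, t).
        for sub in (target[:-1], target[1:]):
            candidates.extend(backoff_db.get(sub, []))
    return candidates
```

For example, with a triphone database holding one unit for `("s", "ih", "t")` and a threshold of 2, the function also returns the diphone units stored under `("s", "ih")` and `("ih", "t")`.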

9. A method of generating a video sequence having mouth movements synchronized with speech sounds, the method utilizing a database of n-phones as a smallest selectable unit, where n is larger than 1, the method comprising:
- searching the database of n-phones for a plurality of candidate n-phones for each target frame of a target sequence;
  
  building a video frame lattice of candidate video frames according to the candidate n-phones; and
  
  searching the video frame lattice using a Viterbi search to construct the video sequence by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.

10. A system for generating a video sequence having mouth movements synchronized with speech sounds, the system utilizing a database of n-phones as a smallest selectable unit, where n is larger than 1, the system comprising:
- (1) a unit selection module that performs the steps of:
  
  (a) calculating a target cost for each candidate n-phone for a target frame in a target sequence using a phonetic distance, coarticulation parameter, and speech rate;
  
  (b) for each n-phone in the target sequence, searching for candidate n-phones that are visually and/or phonetically similar according to the target costs;
  
  (c) sampling each candidate n-phone to get a same number of frames as in the target sequence;
  
  (d) building a video frame lattice of candidate video frames; and
  
  (e) assigning a joint cost to each pair of adjacent candidate video frames; and
  
  (2) a search module that performs the step of:
  
  (a) searching the video frame lattice to construct the video sequence by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
- View Dependent Claims (11)
- - 11. The system of claim 10, wherein n=3.
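
Step (c) of the unit selection module requires every candidate n-phone to supply the same number of frames as the target. The claims do not specify the sampling scheme, so the following is only a sketch that assumes nearest-neighbor sampling at evenly spaced positions; `resample_frames` is a hypothetical helper name.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def resample_frames(frames: Sequence[T], n_target: int) -> List[T]:
    """Pick n_target frames from `frames` at evenly spaced positions.

    Frames are repeated when the candidate is shorter than the target
    and skipped when it is longer (nearest-neighbor sampling).
    """
    if n_target <= 0 or not frames:
        return []
    if n_target == 1:
        return [frames[len(frames) // 2]]  # take the middle frame
    step = (len(frames) - 1) / (n_target - 1)
    # int(x + 0.5) rounds half up, so ties behave predictably.
    return [frames[int(i * step + 0.5)] for i in range(n_target)]
```

A five-frame candidate sampled to three frames keeps its first, middle, and last frame; a three-frame candidate stretched to five frames repeats some of its frames.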

12. An apparatus for generating a video sequence having mouth movements synchronized with speech sounds, the system utilizing a database of n-phones as a smallest selectable unit, where n is larger than 1, the apparatus being controlled by a program implementing a series of steps comprising:
- calculating a target cost for each candidate n-phone for a target sequence using a phonetic distance, coarticulation parameter, and speech rate;
  
  for each target frame in the target sequence, searching for candidate n-phones that are visually and/or phonetically similar according to the target cost;
  
  sampling each candidate n-phone to get a same number of candidate frames as in the target sequence;
  
  building a video frame lattice of candidate video frames;
  
  assigning a joint cost to each pair of adjacent video frames; and
  
  constructing the video sequence using a Viterbi search on the video frame lattice by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
- View Dependent Claims (13)
- - 13. The computer-readable medium of claim 12, wherein the database of n-phones comprises a plurality of n-visemes.

14. A computer-readable medium storing a program for controlling a computer device to perform a set of steps for generating a video sequence having mouth movements synchronized with speech sounds, the method utilizing a database of n-phones as a smallest selectable unit, where n is larger than 1, the set of steps comprising:
- calculating a target cost for each candidate n-phone for a target sequence using a phonetic distance, coarticulation parameter, and speech rate;
  
  for each target frame in the target sequence, searching for candidate n-phones that are phonetically and/or visually similar according to the target cost;
  
  sampling each candidate n-phone to get a same number of candidate frames as in the target sequence;
  
  building a video frame lattice of candidate video frames;
  
  assigning a joint cost to each pair of adjacent video frames; and
  
  generating the video sequence using a Viterbi search on the video frame lattice by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
- View Dependent Claims (15, 16, 17, 18, 19, 20)
- - 15. The computer-readable medium of claim 14, wherein when n-phone candidates cannot be selected, the set of steps further comprises selecting candidate (n−1)-phones from a database of (n−1)-phones.
  - 16. The computer-readable medium of claim 14, wherein if the number of n-phone candidates selected for a target frame is below a threshold, then the set of steps further comprises selecting candidate (n−1)-phones from a database of (n−1)-phones.
  - 17. The computer-readable medium of claim 16, wherein the threshold number of n-phone candidates selected for a target frame is approximately 30.
  - 18. The computer-readable medium of claim 17, wherein each n-viseme represents at least two n-phones sharing similar characteristics.
  - 19. The computer-readable medium of claim 18, wherein each n-viseme is a tri-viseme.
  - 20. The computer-readable medium of claim 14, wherein n=3.
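
The n-viseme idea in the claims (each n-viseme represents at least two n-phones sharing similar characteristics) can be illustrated by mapping each phone to a visual class and grouping n-phones by the resulting class tuple. The mapping table below is a small illustrative assumption, not the patent's viseme inventory, and the names `VISEME_OF`, `to_nviseme`, and `group_by_nviseme` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Phones = Tuple[str, ...]

# Illustrative phone-to-viseme table (assumption): phones that look
# alike on the lips share one class. Unlisted phones map to themselves.
VISEME_OF = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "aa": "open", "ae": "open",
}


def to_nviseme(nphone: Phones) -> Phones:
    """Map an n-phone to its n-viseme, phone by phone."""
    return tuple(VISEME_OF.get(p, p) for p in nphone)


def group_by_nviseme(nphones: Iterable[Phones]) -> Dict[Phones, List[Phones]]:
    """Group n-phones that collapse to the same n-viseme."""
    groups: Dict[Phones, List[Phones]] = defaultdict(list)
    for nphone in nphones:
        groups[to_nviseme(nphone)].append(nphone)
    return dict(groups)
```

Under this table the triphones `("p", "aa", "m")` and `("b", "ae", "m")` fall into the same tri-viseme, so either could serve as a candidate for the other when searching by visual similarity.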

21. A method of generating a video sequence having mouth movements synchronized with speech sounds, the method comprising:
- receiving data associated with speech generated from text;
  
  calculating a target cost for each candidate n-phone for a target sequence;
  
  for each target frame in the target sequence, searching for candidate n-phones that are phonetically and/or visually similar according to the target cost in a database where the n-phones are the smallest selectable unit;
  
  building a video frame lattice of candidate video frames; and
  
  constructing the video sequence according to a Viterbi search on the video frame lattice by finding the optimal path through the video frame lattice according to the minimum of the sum of the target cost and a joint cost over the sequence.
- View Dependent Claims (22, 23, 24, 25)
- - 22. The method of claim 21, wherein the joint cost is assigned to each pair of adjacent video frames.
  - 23. The method of claim 22, further comprising:
    - sampling each candidate n-phone to get a same number of candidate frames as in the target sequence.
  - 24. The method of claim 23, wherein calculating the target cost for each candidate n-phone is performed using a phonetic distance, coarticulation parameter, and speech rate.
  - 25. The method of claim 24, wherein n=3.
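
Claim 24 names three inputs to the target cost (a phonetic distance, a coarticulation parameter, and speech rate) without giving a formula in this record. The sketch below is one possible combination and is purely an assumption: the coarticulation parameter `coart` weights each context phone by its distance from the center phone, and the speech-rate term is the relative duration mismatch. All names and the weighting scheme are hypothetical.

```python
from typing import Callable, Sequence


def target_cost(
    target: Sequence[str],
    candidate: Sequence[str],
    phone_distance: Callable[[str, str], float],
    target_dur: float,
    cand_dur: float,
    coart: float = 0.5,
    rate_weight: float = 1.0,
) -> float:
    """Weighted phonetic distance plus a duration-mismatch penalty."""
    if len(target) != len(candidate):
        raise ValueError("target and candidate must have the same n")
    if target_dur <= 0 or cand_dur <= 0:
        raise ValueError("durations must be positive")
    center = len(target) // 2
    # Context phones count less the farther they are from the center.
    weights = [coart ** abs(k - center) for k in range(len(target))]
    phonetic = sum(
        w * phone_distance(t, c)
        for w, t, c in zip(weights, target, candidate)
    ) / sum(weights)
    # Relative duration difference stands in for the speech-rate term.
    rate = abs(target_dur - cand_dur) / max(target_dur, cand_dur)
    return phonetic + rate_weight * rate
```

With a 0/1 phone distance, a candidate that differs from the target only in its left context phone and lasts twice as long scores 0.25 for the phonetic term and 0.5 for the rate term.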

26. A virtual agent that interacts in real time with a user, the virtual agent being controlled according to a method of generating a video sequence utilizing a database of n-phones as a smallest selectable unit, where n is larger than 1, the method comprising:
- calculating a target cost for each candidate n-phone for a target sequence using a phonetic distance, coarticulation parameter, and speech rate;
  
  for each target frame in the target sequence, searching for candidate n-phones that are phonetically and/or visually similar according to the target cost;
  
  sampling each candidate n-phone to get a same number of candidate frames as in the target sequence;
  
  building a video frame lattice of candidate video frames;
  
  assigning joint cost to each pair of adjacent video frames; and
  
  generating movements for the virtual agent by constructing video sequences according to a Viterbi search on the video frame lattice by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
- View Dependent Claims (27)
- - 27. The virtual agent of claim 26, wherein n=3.
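
The claims assign a joint cost to each pair of adjacent video frames but this record does not define it. As a hedged sketch, the function below assumes a common unit-selection convention: two frames that were consecutive in the same recording join for free, and any other pair costs the Euclidean distance between their visual feature vectors. The `Frame` fields and the feature representation are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Frame:
    clip_id: str                  # recording the frame was taken from
    index: int                    # frame number within that recording
    features: Tuple[float, ...]   # visual features (hypothetical)


def joint_cost(prev: Frame, nxt: Frame) -> float:
    """Cost of showing `nxt` immediately after `prev`."""
    # Frames recorded back to back need no smoothing at the join.
    if prev.clip_id == nxt.clip_id and nxt.index == prev.index + 1:
        return 0.0
    return math.dist(prev.features, nxt.features)
```

A cost of this shape encourages the search to reuse long runs of originally consecutive frames, which tends to reduce visible jumps in the mouth region.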

28. A method of generating a video sequence having mouth movements synchronized with speech sounds the method utilizing a database of n-phones as a smallest selectable unit where n is larger than 1, the method comprising:
- receiving text-to-speech data;
  
  selecting a plurality of n-phones from the database of n-phones;
  
  building a video frame lattice of candidate video frames based on the selected plurality of n-phones, wherein each candidate video frame is indexed to an n-phone in the database of n-phones; and
  
  constructing the video sequence from the video frame lattice using a Viterbi search on the video frame lattice by finding the optimal path through the lattice according to the minimum of the sum of a target cost and a joint cost over the video sequence.
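
Claim 28 requires each candidate video frame in the lattice to be indexed to an n-phone in the database. A minimal sketch of such a lattice, assuming the candidates have already been sampled to the target length, is a list of columns (one per target frame) whose nodes carry the n-phone key alongside the frame number. `LatticeNode`, `build_lattice`, and the `candidates` layout are hypothetical names for this example.

```python
from typing import Dict, List, NamedTuple, Sequence


class LatticeNode(NamedTuple):
    unit_id: str   # key of the n-phone in the database
    frame: int     # frame number within the database recording


def build_lattice(
    candidates: Dict[str, Sequence[int]], n_frames: int
) -> List[List[LatticeNode]]:
    """Build one lattice column per target frame.

    `candidates` maps an n-phone key to its frame numbers, already
    sampled so that each candidate has exactly n_frames entries.
    """
    lattice: List[List[LatticeNode]] = [[] for _ in range(n_frames)]
    for unit_id, frames in candidates.items():
        if len(frames) != n_frames:
            raise ValueError(f"{unit_id} has {len(frames)} frames, expected {n_frames}")
        for t, frame in enumerate(frames):
            lattice[t].append(LatticeNode(unit_id, frame))
    return lattice
```

Because every node keeps its `unit_id`, the path chosen by the search can be traced back to the database units that supplied each frame.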

Current Assignee
Cerence Inc., Cerence Operating Company (Cerence Inc.)
Original Assignee
AT&T Corporation (AT&T, Inc.)
Inventors
Graf, Hans Peter; Huang, Fu Jie; Cosatto, Eric
Primary Examiner(s)
Hudspeth, David
Assistant Examiner(s)
Sked, Matthew J

Application Number

US10/143,717
Time in Patent Office

1,810 Days
Field of Search

704/260, 704/235, 704/256
US Class Current

704/235
CPC Class Codes

G10L 13/07   Concatenation rules

G10L 15/08   Speech classification or se...

G10L 15/26   Speech to text systems G10L...

G10L 2021/105   Synthesis of the lips movem...

H04N 19/00   Methods or arrangements for...
