System and method for triphone-based unit selection for visual speech synthesis
First Claim
1. A method comprising:
receiving text for conversion to speech;
calculating a target cost of a target sequence of tri-phones associated with the speech based on a phonetic distance, a coarticulation parameter, and a speech rate of the speech, to yield a calculated target cost;
identifying, based on the calculated target cost and following phonemes associated with a plurality of tri-phones, a plurality of candidate tri-phones;
sampling each candidate tri-phone in the plurality of candidate tri-phones to identify how many frames are associated with the each candidate tri-phone;
adding, where necessary, at least one frame to frames in the each candidate tri-phone of the plurality of candidate tri-phones to reach a same number of frames as in a corresponding tri-phone in the target sequence of tri-phones, to yield an updated candidate tri-phone;
building a video frame lattice of candidate video frames, wherein each candidate video frame in the candidate video frames is associated with a tri-phone comprising one of the updated candidate tri-phone or another tri-phone from the plurality of candidate tri-phones;
determining image coefficients for each frame in the video frame lattice of candidate video frames, wherein the image coefficients for the each frame are based on a turning point of the updated candidate tri-phone, the turning point being a change of direction in a mouth of a speaker pronouncing the updated candidate tri-phone;
assigning a joint cost to each pair of adjacent video frames in the video frame lattice, where the joint cost is based on the image coefficients and geometric features of the each pair of adjacent video frames in the video frame lattice; and
constructing a video sequence of the mouth of the speaker moving in synchronization with the speech by finding, using a Viterbi search, a path through the video frame lattice based on a minimum of a sum of the calculated target cost and the joint cost over the video sequence.
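The lattice search in the final step can be sketched as a standard Viterbi-style dynamic program: at each time step, every candidate frame carries a target cost, and transitions between adjacent frames carry a joint cost; the cheapest path through the lattice is kept. This is a minimal illustration, not the patented implementation; the frame representation and the toy joint-cost function are assumptions.

```python
def viterbi_path(lattice, joint_cost):
    """lattice: list of time steps; each step is a list of
    (frame_id, target_cost) candidates.
    joint_cost: function (frame_a, frame_b) -> float.
    Returns (total_cost, [frame_id, ...]) for the cheapest path."""
    # Initialize with each first-step candidate's own target cost.
    best = [(tc, [fid]) for fid, tc in lattice[0]]
    for step in lattice[1:]:
        new_best = []
        for fid, tc in step:
            # Cheapest way to reach this candidate from the previous step:
            # previous path cost + transition (joint) cost + this target cost.
            cost, prev = min(
                (prev_cost + joint_cost(prev_path[-1], fid) + tc, prev_path)
                for prev_cost, prev_path in best
            )
            new_best.append((cost, prev + [fid]))
        best = new_best
    return min(best)

# Toy example: numeric frame ids, with |a - b| standing in for the
# image-coefficient and geometric-feature distance between frames.
cost, path = viterbi_path(
    [[(0, 1.0), (5, 0.2)], [(1, 0.5), (6, 0.5)], [(2, 0.1)]],
    lambda a, b: abs(a - b),
)
# path -> [0, 1, 2], cost -> 3.6
```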
Abstract
A system and method for generating a video sequence having mouth movements synchronized with speech sounds are disclosed. The system utilizes a database of n-phones as the smallest selectable unit, wherein n is larger than 1 and preferably 3. The system calculates a target cost for each candidate n-phone for a target frame using a phonetic distance, coarticulation parameter, and speech rate. For each n-phone in a target sequence, the system searches for candidate n-phones that are visually similar according to the target cost. The system samples each candidate n-phone to get a same number of frames as in the target sequence and builds a video frame lattice of candidate video frames. The system assigns a joint cost to each pair of adjacent frames and searches the video frame lattice to construct the video sequence by finding the optimal path through the lattice according to the minimum of the sum of the target cost and the joint cost over the sequence.
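The abstract names three ingredients of the target cost: phonetic distance, a coarticulation parameter, and speech rate. A minimal sketch, assuming a simple weighted linear combination (the weights and the linear form are illustrative assumptions; the patent does not fix a specific formula in the abstract):

```python
def target_cost(phonetic_distance, coarticulation, rate_mismatch,
                w_p=1.0, w_c=0.5, w_r=0.25):
    """Lower is better: a candidate n-phone that is phonetically close,
    coarticulates similarly, and matches the target speech rate is cheap.
    The weights w_p, w_c, w_r are hypothetical tuning parameters."""
    return w_p * phonetic_distance + w_c * coarticulation + w_r * rate_mismatch

# A phonetically identical candidate at the right rate costs nothing.
c0 = target_cost(0.0, 0.0, 0.0)   # -> 0.0
c1 = target_cost(1.0, 0.4, 0.8)   # -> 1.0 + 0.2 + 0.2 = 1.4
```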
17 Claims
1. A method comprising: (claim text reproduced above under "First Claim")
Dependent claims: 2, 3, 4, 5, 6, 7
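The sampling and padding step recited in claim 1 only requires that each candidate tri-phone end up with the same number of frames as its target counterpart. A sketch under the assumption that padding repeats the final frame (the claim does not mandate this particular policy):

```python
def pad_frames(candidate_frames, target_len):
    """Return the candidate's frames, extended by duplicating the last
    frame until the count matches the target tri-phone's frame count."""
    frames = list(candidate_frames)
    while len(frames) < target_len:
        frames.append(frames[-1])  # duplicate the final frame
    return frames

padded = pad_frames(["f1", "f2"], 4)   # -> ["f1", "f2", "f2", "f2"]
```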
8. A computing device comprising:
a processor; and
a computer-readable storage medium having instructions stored which, when executed by the processor, perform operations comprising:
receiving text for conversion to speech;
calculating a target cost of a target sequence of tri-phones associated with the speech based on a phonetic distance, a coarticulation parameter, and a speech rate of the speech, to yield a calculated target cost;
identifying, based on the calculated target cost and following phonemes associated with a plurality of tri-phones, a plurality of candidate tri-phones;
sampling each candidate tri-phone in the plurality of candidate tri-phones to identify how many frames are associated with the each candidate tri-phone;
adding, where necessary, at least one frame to frames in the each candidate tri-phone of the plurality of candidate tri-phones to reach a same number of frames as in a corresponding tri-phone in the target sequence of tri-phones, to yield an updated candidate tri-phone;
building a video frame lattice of candidate video frames, wherein each candidate video frame in the candidate video frames is associated with a tri-phone comprising one of the updated candidate tri-phone or another tri-phone from the plurality of candidate tri-phones;
determining image coefficients for each frame in the video frame lattice of candidate video frames, wherein the image coefficients for the each frame are based on a turning point of the updated candidate tri-phone, the turning point being a change of direction in a mouth of a speaker pronouncing the updated candidate tri-phone;
assigning a joint cost to each pair of adjacent video frames in the video frame lattice, where the joint cost is based on the image coefficients and geometric features of the each pair of adjacent video frames in the video frame lattice; and
constructing a video sequence of the mouth of the speaker moving in synchronization with the speech by finding, using a Viterbi search, a path through the video frame lattice based on a minimum of a sum of the calculated target cost and the joint cost over the video sequence.
Dependent claims: 9, 10, 11, 12, 13, 14
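The joint cost assigned to adjacent lattice frames in the claims above is based on image coefficients and geometric features. A sketch assuming Euclidean distance over each feature group with equal weighting (both assumptions for illustration; the claims do not specify the distance measure):

```python
import math

def joint_cost(frame_a, frame_b):
    """frame_a / frame_b: dicts with 'coeffs' (image coefficients) and
    'geometry' (geometric features, e.g. mouth width and height).
    Visually smooth transitions between similar frames receive a low cost."""
    dist = 0.0
    for key in ("coeffs", "geometry"):
        dist += math.sqrt(sum((a - b) ** 2
                              for a, b in zip(frame_a[key], frame_b[key])))
    return dist

a = {"coeffs": [0.0, 0.0], "geometry": [1.0]}
b = {"coeffs": [3.0, 4.0], "geometry": [1.0]}
# coefficient distance 5.0, geometry distance 0.0 -> joint cost 5.0
jc = joint_cost(a, b)
```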
15. A computer-readable storage device having instructions stored, which, when executed by a computing device, cause the computing device to perform operations comprising:
receiving text for conversion to speech;
calculating a target cost of a target sequence of tri-phones associated with the speech based on a phonetic distance, a coarticulation parameter, and a speech rate of the speech, to yield a calculated target cost;
identifying, based on the calculated target cost and following phonemes associated with a plurality of tri-phones, a plurality of candidate tri-phones;
sampling each candidate tri-phone in the plurality of candidate tri-phones to identify how many frames are associated with the each candidate tri-phone;
adding at least one frame to at least one candidate tri-phone to reach a same number of frames as in a corresponding tri-phone in the target sequence of tri-phones, to yield updated candidate tri-phones;
building a video frame lattice of candidate video frames, wherein each candidate video frame in the candidate video frames is associated with an updated candidate tri-phone in the updated candidate tri-phones;
determining image coefficients for each frame in the video frame lattice, wherein the image coefficients for the each frame are based on a turning point of the updated candidate tri-phone associated with the each frame, the turning point being a change of direction in a mouth of a speaker pronouncing the updated candidate tri-phone;
assigning a joint cost to each pair of adjacent video frames in the video frame lattice, where the joint cost is based on the image coefficients and geometric features of the each pair of adjacent video frames; and
constructing a video sequence of a mouth moving in synchronization with the speech by finding, using a Viterbi search, a path through the video frame lattice based on a minimum of a sum of the calculated target cost and the joint cost over the video sequence.
Dependent claims: 16, 17
Specification