Method, apparatus and computer program product providing prosodic-categorical enhancement to phrase-spliced text-to-speech synthesis

US 20070055526A1
Filed: 08/25/2005
Published: 03/08/2007
Est. Priority Date: 08/25/2005
Status: Abandoned Application

Abstract

Disclosed is a method, a system and a computer program product for text-to-speech synthesis. The computer program product comprises a computer useable medium including a computer readable program, where the computer readable program when executed on the computer causes the computer to operate in accordance with a text-to-speech synthesis function by operations that include, responsive to at least one phrase represented as recorded human speech to be employed in synthesizing speech, labeling the phrase according to a symbolic categorization of prosodic phenomena; and constructing a data structure that includes word/prosody-categories and word/prosody-category sequences for the phrase, and that further includes information pertaining to a phone sequence associated with the constituent word or word sequence for the phrase.
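
The data structure described in the abstract can be pictured as an index from word/prosody-category sequences to phone sequences. The following is only a minimal sketch under stated assumptions: the names `LabeledWord` and `PhraseIndex`, the category strings, and the choice to index every contiguous sub-sequence of a phrase are hypothetical and are not taken from the application.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LabeledWord:
    word: str
    category: str   # hypothetical symbolic prosody label, e.g. "final"
    phones: tuple   # phone sequence recorded for this word


@dataclass
class PhraseIndex:
    # Maps a tuple of (word, category) pairs to the phone sequences
    # of every recorded occurrence of that sequence.
    entries: dict = field(default_factory=lambda: defaultdict(list))

    def add_phrase(self, labeled_words):
        """Index every contiguous word sequence of one labeled phrase."""
        n = len(labeled_words)
        for start in range(n):
            for end in range(start + 1, n + 1):
                span = labeled_words[start:end]
                key = tuple((w.word, w.category) for w in span)
                phones = tuple(p for w in span for p in w.phones)
                self.entries[key].append(phones)

    def lookup(self, key):
        """Return the phone sequences stored for a (word, category) sequence."""
        return self.entries.get(tuple(key), [])


index = PhraseIndex()
index.add_phrase([
    LabeledWord("thank", "initial", ("TH", "AE", "NG", "K")),
    LabeledWord("you", "final", ("Y", "UW")),
])
print(index.lookup([("thank", "initial"), ("you", "final")]))
```

Keying on the (word, category) pair rather than on the word alone means the same word recorded in two prosodic contexts is stored as two distinct entries.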

78 Citations

20 Claims

1. A computer program product comprising a computer useable medium including a computer readable program, wherein the computer readable program when executed on the computer causes the computer to operate in accordance with a text-to-speech synthesis function by operations comprising:
- labeling a phrase according to a symbolic categorization of prosodic phenomena; and
  
  constructing a data structure that comprises word/prosody-categories and word/prosody-category sequences for the phrase, and that further provides a phone sequence associated with the phrase.
  - 2. The computer program product as in claim 1, where the data structure is constructed to enable a search of word/prosody categories and word/prosody-category sequences for phrases in a corpus of recordings, and which further comprises a sequence of concatenation units associated with a constituent word or word sequence for the phrase.
  - 3. The computer program product as in claim 1, further comprising:
    - in response to input text to be converted to speech, labeling at least one phrase of the input text with a target prosodic category;
      
      comparing the input text to data in the data structure to identify individual occurrences of a phrase labeled with prosody categories corresponding to the input text for constructing a phone sequence; and
      
      constructing output speech according to the phone sequence.
  - 4. The computer program product as in claim 3, where if comparing the input text to data in the data structure does not identify an occurrence of a phrase, the operations comprise instead comparing the input text to a pronunciation dictionary.
  - 5. The computer program product as in claim 1, where the symbolic categorization of the prosodic phenomena comprises considering a presence or absence of silence that at least one of precedes or follows a current word.
  - 6. The computer program product as in claim 1, where the symbolic categorization of the prosodic phenomena comprises considering a number of words since at least one of a beginning of a current utterance, phrase or silence-delimited speech, or a number of words until the end of the utterance, phrase or silence-delimited speech.
  - 7. The computer program product as in claim 1, where the symbolic categorization of the prosodic phenomena comprises considering at least one of a last punctuation mark preceding at least one of the word and/or the number of words since the punctuation mark, or a next punctuation mark following at least one of the word and/or the number of words until that punctuation mark.
  - 8. The computer program product as in claim 1, where the symbolic categorization of the prosodic phenomena comprises a prosodic phonology.
  - 9. The computer program product as in claim 3, where the operation of comparing the input text to the data in the data structure comprises testing for an exact match of prosodic categories.
  - 10. The computer program product as in claim 3, where the operation of comparing the input text to the data in the data structure comprises applying a cost function of various category mismatches to a search process involving at least one other matching criterion.
  - 11. The computer program product as in claim 1, where labeling a constituent word or word sequence of a phrase according to a symbolic categorization of prosodic phenomena comprises using a Tones and Break Indices (ToBI) analysis.
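
The cues named in claims 5 through 7 (adjacent silence, word counts from the utterance boundaries, and the nearest punctuation marks) could be gathered into one symbolic label per word. The sketch below is illustrative only: the `ProsodyCategory` fields, the `<sil>` marker, the punctuation set, and the decision to count words over the whole utterance are assumptions, not details from the application.

```python
from dataclasses import dataclass

PUNCT = {",", ".", "?", "!", ";", ":"}  # assumed punctuation inventory
SIL = "<sil>"                            # assumed marker for a silence


@dataclass(frozen=True)
class ProsodyCategory:
    silence_before: bool
    silence_after: bool
    words_from_start: int
    words_to_end: int
    prev_punct: str  # "" when no punctuation mark precedes the word
    next_punct: str  # "" when no punctuation mark follows the word


def _nearest_non_punct(tokens):
    """First token in the iterable that is not a punctuation mark."""
    return next((t for t in tokens if t not in PUNCT), None)


def categorize(tokens):
    """Label each word of a token list with a ProsodyCategory."""
    word_positions = [i for i, t in enumerate(tokens)
                      if t not in PUNCT and t != SIL]
    total = len(word_positions)
    labeled = []
    for rank, i in enumerate(word_positions):
        before, after = tokens[:i], tokens[i + 1:]
        labeled.append((tokens[i], ProsodyCategory(
            silence_before=_nearest_non_punct(reversed(before)) == SIL,
            silence_after=_nearest_non_punct(after) == SIL,
            words_from_start=rank,
            words_to_end=total - rank - 1,
            prev_punct=next((t for t in reversed(before) if t in PUNCT), ""),
            next_punct=next((t for t in after if t in PUNCT), ""),
        )))
    return labeled


for word, category in categorize(["hello", ",", SIL, "how", "are", "you", "?"]):
    print(word, category)
```

Because the label is a frozen dataclass, it is hashable and can serve directly as part of a lookup key.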

12. A text-to-speech synthesis system comprising:
- means, responsive to at least one phrase represented as recorded human speech to be employed in synthesizing speech, for labeling a constituent word or word sequence of the phrase according to a symbolic categorization of prosodic phenomena; and
  
  means for constructing a data structure comprising word/prosody-categories and word/prosody-category sequences for the phrase, and that further comprises information pertaining to a phone sequence associated with the constituent word or word sequence for the phrase.
  - 13. The system as in claim 12, further comprising:
    - means, responsive to input text to be converted to speech, for labeling words of the input text with a target prosodic category;
      
      means for comparing the input text to data in the data structure to identify individual occurrences of a word or word sequence labeled with prosody categories corresponding to the input text for constructing a phone sequence; and
      
      means for constructing output speech according to the phone sequence.
  - 14. The system as in claim 13, where if said means for comparing the input text to data in the data structure does not identify individual occurrences of a word or word sequence, comparing instead the input text to a pronunciation dictionary.
  - 15. The system as in claim 12, where the symbolic categorization of the prosodic phenomena comprises considering at least one of a presence or absence of silence that at least one of precedes or follows a current word;
    - a number of words since at least one of a beginning of a current utterance, phrase or silence-delimited speech, or a number of words until the end of the utterance, phrase or silence-delimited speech;
      
      at least one of a last punctuation mark preceding at least one of the word or the number of words since the punctuation mark, or a next punctuation mark following at least one of the word or the number of words until that punctuation mark.
  - 16. The system as in claim 12, where the symbolic categorization of the prosodic phenomena comprises a prosodic phonology.
  - 17. The system as in claim 13, where said comparing means operates to at least one of test for an exact match of prosodic categories, and apply a cost function of various category mismatches to a search process involving at least one other matching criterion.
  - 18. The system as in claim 12, where said labeling means uses a Tones and Break Indices (ToBI) analysis.
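
Claim 17 distinguishes two ways of comparing prosodic categories: an exact-match test, and a cost function over category mismatches combined with at least one other matching criterion. A rough sketch of both modes follows. The feature names, the weights, and the `join` value standing in for the other criterion are hypothetical; the application does not specify them.

```python
def mismatch_cost(target, candidate, weights):
    """Sum the weights of the features on which two category dicts differ."""
    return sum(w for feature, w in weights.items()
               if target.get(feature) != candidate.get(feature))


def select_candidate(target, candidates, weights,
                     other_cost=lambda cand: 0.0, exact_only=False):
    """Pick the candidate with the lowest total cost.

    With exact_only=True, only candidates whose category equals the target
    are considered. Returns (None, inf) when nothing qualifies.
    """
    best, best_cost = None, float("inf")
    for cand in candidates:
        if exact_only and cand["category"] != target:
            continue
        total = mismatch_cost(target, cand["category"], weights) + other_cost(cand)
        if total < best_cost:
            best, best_cost = cand, total
    return best, best_cost


weights = {"boundary": 2.0, "position": 1.0}  # assumed mismatch penalties
target = {"boundary": "final", "position": "late"}
candidates = [
    {"id": "a", "category": {"boundary": "medial", "position": "late"}, "join": 0.1},
    {"id": "b", "category": {"boundary": "final", "position": "early"}, "join": 0.5},
]
best, cost = select_candidate(target, candidates, weights,
                              other_cost=lambda cand: cand["join"])
print(best["id"], cost)
```

In this toy data neither candidate matches exactly, so the exact-match mode finds nothing, while the cost-based mode still returns the least-bad candidate.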

19. A method to operate a text-to-speech synthesis system, comprising:
- responsive to at least one phrase represented as recorded human speech to be employed in synthesizing speech, labeling the phrase in accordance with a symbolic categorization of prosodic phenomena;
  
  constructing a data structure that comprises word/prosody-categories and word/prosody-category sequences for the phrase, and that further includes information pertaining to a phone sequence associated with the constituent word or word sequence for the phrase;
  
  responsive to input text to be converted to speech, labeling phrases of the input text with a target prosodic category;
  
  comparing the input text to data in the data structure to identify an occurrence of a phrase labeled with prosody categories corresponding to the input text for constructing a phone sequence; and
  
  constructing output speech according to the phone sequence, where if comparing the input text to data in the data structure does not identify an occurrence of a phrase, obtaining instead a phonetic or sub-phonetic representation.
  - 20. The method as in claim 19, where the symbolic categorization of the prosodic phenomena comprises considering at least one of a presence or absence of silence that at least one of precedes or follows a current word;
    - a number of words since at least one of a beginning of a current utterance, phrase or silence-delimited speech, or a number of words until the end of the utterance, phrase or silence-delimited speech;
      
      at least one of a last punctuation mark preceding at least one of the word or the number of words since the punctuation mark, or a next punctuation mark following at least one of the word or the number of words until that punctuation mark, and where the symbolic categorization of the prosodic phenomena comprises a prosodic phonology, where comparing means operates to at least one of test for an exact match of prosodic categories and apply a cost function of various category mismatches to a search process involving at least one other matching criterion, and where labeling comprises using a Tones and Break Indices (ToBI) analysis, further comprising allowing for at least one of hand or automatic labeling of a corpus, as well as for the use of one of hand-generated or automatically generated labels at run-time.
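
The run-time steps of claim 19 (look up prosody-labeled input text in the data structure, and fall back to a phonetic representation when no phrase is found) might be arranged as below. This is a sketch under assumptions: the greedy longest-match strategy, the plain-dict index, and the toy pronunciation dictionary are illustrative choices and are not stated in the claims.

```python
def synthesize_phones(labeled_words, phrase_index, dictionary):
    """Build a phone sequence for prosody-labeled input text.

    labeled_words: list of (word, category) pairs for the input text.
    phrase_index:  dict mapping tuples of (word, category) pairs to phones.
    dictionary:    dict mapping a word to its phones (fallback path);
                   a word missing from it raises KeyError in this sketch.
    """
    phones, sources = [], []
    i = 0
    while i < len(labeled_words):
        # Try the longest span starting at position i first (assumed strategy).
        for j in range(len(labeled_words), i, -1):
            key = tuple(labeled_words[i:j])
            if key in phrase_index:
                phones.extend(phrase_index[key])
                sources.append(("spliced", j - i))
                i = j
                break
        else:
            # No recorded phrase matched: fall back to the dictionary.
            word, _category = labeled_words[i]
            phones.extend(dictionary[word])
            sources.append(("fallback", 1))
            i += 1
    return phones, sources


phrase_index = {
    (("your", "initial"), ("flight", "medial")): ("Y", "AO", "R", "F", "L", "AY", "T"),
}
dictionary = {"departs": ("D", "IH", "P", "AA", "R", "T", "S")}
text = [("your", "initial"), ("flight", "medial"), ("departs", "final")]
print(synthesize_phones(text, phrase_index, dictionary))
```

The `sources` list records which stretches came from recorded phrases and which from the fallback, which is useful when inspecting how much of an utterance was spliced.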

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Viswanathan, Mahesh; Fernandez, Raul; Eide, Ellen; Pitrelli, John
Application Number: US11/212,432
Publication Number: US 20070055526A1
US Class Current: 704/260
CPC Class Codes: G10L 13/10 Prosody rules derived from ...
