Methods and apparatus for predicting prosody in speech synthesis

US 9,286,886 B2
Filed: 01/24/2011
Issued: 03/15/2016
Est. Priority Date: 01/24/2011
Status: Active Grant

Abstract

Techniques for predicting prosody in speech synthesis may make use of a data set of example text fragments with corresponding aligned spoken audio. To predict prosody for synthesizing an input text, the input text may be compared with the data set of example text fragments to select a best matching sequence of one or more example text fragments, each example text fragment in the sequence being paired with a portion of the input text. The selected example text fragment sequence may be aligned with the input text, e.g., at the word level, such that prosody may be extracted from the audio aligned with the example text fragments, and the extracted prosody may be applied to the synthesis of the input text using the alignment between the input text and the example text fragments.
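
The word-level alignment described in the abstract can be illustrated with a minimal sketch. It assumes each example word already carries prosody measured from its aligned audio. The `WordProsody` record, the proportional alignment rule in `align_words`, and the helper `transfer_prosody` are hypothetical names chosen for illustration; the patent does not prescribe this particular alignment rule.

```python
from dataclasses import dataclass


@dataclass
class WordProsody:
    word: str
    f0_hz: float        # assumed: mean fundamental frequency over the word
    duration_ms: float  # assumed: spoken duration of the word


def align_words(input_words, example_words):
    """Map each input word index to an example word index.

    Assumption: a simple proportional (nearest-position) mapping; a real
    system could use any word-level alignment.
    """
    n, m = len(input_words), len(example_words)
    if n == 0 or m == 0:
        return []
    if n == 1:
        return [0]
    # Integer arithmetic gives round-half-up without floating point.
    return [(i * (m - 1) + (n - 1) // 2) // (n - 1) for i in range(n)]


def transfer_prosody(input_words, example):
    """Pair each input word with prosody taken from its aligned example word."""
    mapping = align_words(input_words, [p.word for p in example])
    return [
        WordProsody(word, example[j].f0_hz, example[j].duration_ms)
        for word, j in zip(input_words, mapping)
    ]


if __name__ == "__main__":
    example = [
        WordProsody("in", 180.0, 90.0),
        WordProsody("a", 170.0, 60.0),
        WordProsody("garden", 150.0, 400.0),
    ]
    for p in transfer_prosody(["on", "the", "table"], example):
        print(p)
```

Only the prosodic values are carried over; the words themselves always come from the input text.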

60 Claims

1. A method comprising:
- comparing an input text to a data set of text fragments to select a corresponding text fragment for at least a portion of the input text, wherein selecting the corresponding text fragment comprises
  - identifying within the at least a portion of the input text a first sequence of words beginning with a first function word and including one or more words following the first function word,
    
    identifying a grammatical type of the first function word beginning the first sequence of words,
    
    constraining the identified first sequence of words within the at least a portion of the input text to be matched as a unit to a contiguous sequence of words in a text fragment in the data set, and
    
    selecting as the corresponding text fragment a text fragment including as the contiguous sequence of words a second sequence of words beginning with a second function word that is a different word from the first function word but is of the same grammatical type as the first function word, the corresponding text fragment being associated with spoken audio of at least the second sequence of words, wherein the second sequence of words within the corresponding text fragment includes at least one word not present in the first sequence of words within the at least a portion of the input text;
  
  determining an alignment of the corresponding text fragment with the at least a portion of the input text; and
  
  using a computer, synthesizing speech from the at least a portion of the input text, wherein the synthesizing comprises extracting prosody from the spoken audio of the second sequence of words, including from the at least one word not present in the first sequence of words, and applying the extracted prosody in synthesizing the speech using the alignment of the corresponding text fragment with the at least a portion of the input text.
- - 2. The method of claim 1, further comprising selecting a second corresponding text fragment for a second portion of the input text, wherein selecting the second corresponding text fragment comprises:
    - identifying a first marker included in the second portion of the input text;
      
      identifying a class of the first marker; and
      
      selecting the second corresponding text fragment based at least in part on the second corresponding text fragment comprising a second marker of the same class as the first marker.
  - 3. The method of claim 2, wherein the class of the first marker is selected from the group consisting of one or more punctuation classes, one or more context markup classes and one or more filler classes.
  - 4. The method of claim 2, wherein determining the alignment comprises aligning the second marker with the first marker.
  - 5. The method of claim 1, wherein identifying the grammatical type of the first function word comprises identifying the first function word as an auxiliary, a conjunction, a subordinate conjunction, a determiner, an interrogative pronoun, a preposition, a pronoun, or a personal pronoun.
  - 6. The method of claim 1, wherein the comparing comprises selecting the corresponding text fragment based at least in part on a similarity measure between one or more linguistic features of the at least a portion of the input text and the corresponding text fragment.
  - 7. The method of claim 6, wherein the similarity measure is determined based at least in part on a ratio of words that appear in both the at least a portion of the input text and the corresponding text fragment.
  - 8. The method of claim 6, wherein the similarity measure is determined based at least in part on a ratio of words having matching parts of speech between the at least a portion of the input text and the corresponding text fragment.
  - 9. The method of claim 6, wherein the one or more linguistic features comprise one or more features selected from the group consisting of a named entity feature, a verb semantics feature, a noun semantics feature, an adjective semantics feature, an adverb semantics feature, and a syllable structure feature.
  - 10. The method of claim 1, wherein the comparing comprises selecting a sequence of corresponding text fragments for the input text.
  - 11. The method of claim 10, wherein the comparing further comprises:
    - analyzing the input text to identify a sequence of markers in the input text; and
      
      selecting the sequence of corresponding text fragments from one or more candidate sequences matching the sequence of markers.
  - 12. The method of claim 11, wherein determining the alignment comprises aligning the sequence of markers in the input text with markers in the sequence of corresponding text fragments.
  - 13. The method of claim 11, wherein the comparing further comprises:
    - computing a join cost for each of the one or more candidate sequences; and
      
      selecting the sequence of corresponding text fragments from the one or more candidate sequences based at least in part on the join cost.
  - 14. The method of claim 10, wherein the comparing further comprises:
    - inputting the input text to a statistical model to divide the input text into a sequence of input text fragments; and
      
      selecting the sequence of corresponding text fragments from one or more candidate sequences matching the sequence of input text fragments.
  - 15. The method of claim 10, wherein at least a first text fragment is adjacent in the sequence of corresponding text fragments to a second text fragment, the first text fragment being associated with first spoken audio and the second text fragment being associated with second spoken audio, wherein the first spoken audio was not spoken consecutively with the second spoken audio.
  - 16. The method of claim 1, wherein the spoken audio is aligned with the corresponding text fragment, and the synthesizing comprises extracting prosody from the spoken audio using the alignment of the spoken audio with the corresponding text fragment.
  - 17. The method of claim 1, wherein the synthesizing comprises extracting at least one prosodic feature from the spoken audio of the at least one word present in the second sequence of the corresponding text fragment and not in the first sequence of the at least a portion of the input text, and incorporating into the synthesized speech the at least one prosodic feature extracted from the at least one word, without incorporating any phonemes of the spoken audio of the at least one word into the synthesized speech.
  - 18. The method of claim 1, wherein the extracting comprises specifying prosody for synthesizing the at least a portion of the input text by inputting the corresponding text fragment to a statistical model trained at least partly on the spoken audio.
  - 19. The method of claim 1, wherein the synthesizing comprises specifying at least one prosodic contour for synthesizing the at least a portion of the input text, wherein the at least one prosodic contour is selected from the group consisting of a fundamental frequency contour, an amplitude contour and a duration contour.
  - 20. The method of claim 1, wherein the data set is specific to a domain to which the input text belongs.
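
The fragment selection recited in claims 1, 5 and 7 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the tiny `FUNCTION_WORD_TYPES` lexicon, the rule that a unit runs from a function word up to the next function word, and the use of a Jaccard ratio for the "ratio of words that appear in both" are all hypothetical choices.

```python
# Hypothetical lexicon mapping function words to grammatical types.
FUNCTION_WORD_TYPES = {
    "a": "determiner", "an": "determiner", "the": "determiner",
    "in": "preposition", "on": "preposition", "at": "preposition",
    "to": "preposition",
    "and": "conjunction", "but": "conjunction",
}


def function_word_units(words):
    """Return word sequences that begin with a function word and include
    at least one following word (assumed to end at the next function word)."""
    units, current = [], None
    for w in words:
        if w.lower() in FUNCTION_WORD_TYPES:
            if current and len(current) > 1:
                units.append(current)
            current = [w]
        elif current is not None:
            current.append(w)
    if current and len(current) > 1:
        units.append(current)
    return units


def matching_unit(input_unit, fragment_words):
    """Find a unit in the fragment whose leading function word is a
    different word of the same grammatical type as the input unit's."""
    first = input_unit[0].lower()
    for unit in function_word_units(fragment_words):
        second = unit[0].lower()
        if second != first and FUNCTION_WORD_TYPES[second] == FUNCTION_WORD_TYPES[first]:
            return unit
    return None


def overlap_ratio(a, b):
    """Assumed similarity measure: shared words over all distinct words."""
    sa, sb = {w.lower() for w in a}, {w.lower() for w in b}
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def select_fragment(input_words, data_set):
    """Return (score, fragment, matched_unit) for the best fragment, or None."""
    units = function_word_units(input_words)
    if not units:
        return None
    best = None
    for fragment in data_set:
        unit = matching_unit(units[0], fragment)
        if unit is None:
            continue
        score = overlap_ratio(input_words, fragment)
        if best is None or score > best[0]:
            best = (score, fragment, unit)
    return best


if __name__ == "__main__":
    data_set = [
        "reserve the table at noon".split(),
        "book the flight at noon".split(),
        "a flight to Paris".split(),
    ]
    print(select_fragment("book a flight to Boston".split(), data_set))
```

In this sketch the input unit "a flight" is matched as a whole against "the flight", which begins with a different determiner. The third fragment is rejected because its only determiner is the same word as the input's.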

21. A system comprising:
- at least one memory storing processor-executable instructions; and
  
  at least one processor operatively coupled to the at least one memory, the at least one processor being configured to execute the processor-executable instructions to perform a method comprising:
  
  comparing an input text to a data set of text fragments to select a corresponding text fragment for at least a portion of the input text, wherein selecting the corresponding text fragment comprises
  - identifying within the at least a portion of the input text a first sequence of words beginning with a first function word and including one or more words following the first function word,
    
    identifying a grammatical type of the first function word beginning the first sequence of words,
    
    constraining the identified first sequence of words within the at least a portion of the input text to be matched as a unit to a contiguous sequence of words in a text fragment in the data set, and
    
    selecting as the corresponding text fragment a text fragment including as the contiguous sequence of words a second sequence of words beginning with a second function word that is a different word from the first function word but is of the same grammatical type as the first function word, the corresponding text fragment being associated with spoken audio of at least the second sequence of words, wherein the second sequence of words within the corresponding text fragment includes at least one word not present in the first sequence of words within the at least a portion of the input text;
  
  determining an alignment of the corresponding text fragment with the at least a portion of the input text; and
  
  synthesizing speech from the at least a portion of the input text, wherein the synthesizing comprises extracting prosody from the spoken audio of the second sequence of words, including from the at least one word not present in the first sequence of words, and applying the extracted prosody in synthesizing the speech using the alignment of the corresponding text fragment with the at least a portion of the input text.
- - 22. The system of claim 21, wherein the method further comprises selecting a second corresponding text fragment for a second portion of the input text, wherein selecting the second corresponding text fragment comprises:
    - identifying a first marker included in the second portion of the input text;
      
      identifying a class of the first marker; and
      
      selecting the second corresponding text fragment based at least in part on the second corresponding text fragment comprising a second marker of the same class as the first marker.
  - 23. The system of claim 22, wherein the class of the first marker is selected from the group consisting of one or more punctuation classes, one or more context markup classes and one or more filler classes.
  - 24. The system of claim 22, wherein determining the alignment comprises aligning the second marker with the first marker.
  - 25. The system of claim 21, wherein identifying the grammatical type of the first function word comprises identifying the first function word as an auxiliary, a conjunction, a subordinate conjunction, a determiner, an interrogative pronoun, a preposition, a pronoun, or a personal pronoun.
  - 26. The system of claim 21, wherein the comparing comprises selecting the corresponding text fragment based at least in part on a similarity measure between one or more linguistic features of the at least a portion of the input text and the corresponding text fragment.
  - 27. The system of claim 26, wherein the similarity measure is determined based at least in part on a ratio of words that appear in both the at least a portion of the input text and the corresponding text fragment.
  - 28. The system of claim 26, wherein the similarity measure is determined based at least in part on a ratio of words having matching parts of speech between the at least a portion of the input text and the corresponding text fragment.
  - 29. The system of claim 26, wherein the one or more linguistic features comprise one or more features selected from the group consisting of a named entity feature, a verb semantics feature, a noun semantics feature, an adjective semantics feature, an adverb semantics feature, and a syllable structure feature.
  - 30. The system of claim 21, wherein the comparing comprises selecting a sequence of corresponding text fragments for the input text.
  - 31. The system of claim 30, wherein the comparing further comprises:
    - analyzing the input text to identify a sequence of markers in the input text; and
      
      selecting the sequence of corresponding text fragments from one or more candidate sequences matching the sequence of markers.
  - 32. The system of claim 31, wherein determining the alignment comprises aligning the sequence of markers in the input text with markers in the sequence of corresponding text fragments.
  - 33. The system of claim 31, wherein the comparing further comprises:
    - computing a join cost for each of the one or more candidate sequences; and
      
      selecting the sequence of corresponding text fragments from the one or more candidate sequences based at least in part on the join cost.
  - 34. The system of claim 30, wherein the comparing further comprises:
    - inputting the input text to a statistical model to divide the input text into a sequence of input text fragments; and
      
      selecting the sequence of corresponding text fragments from one or more candidate sequences matching the sequence of input text fragments.
  - 35. The system of claim 30, wherein at least a first text fragment is adjacent in the sequence of corresponding text fragments to a second text fragment, the first text fragment being associated with first spoken audio and the second text fragment being associated with second spoken audio, wherein the first spoken audio was not spoken consecutively with the second spoken audio.
  - 36. The system of claim 21, wherein the spoken audio is aligned with the corresponding text fragment, and the synthesizing comprises extracting prosody from the spoken audio using the alignment of the spoken audio with the corresponding text fragment.
  - 37. The system of claim 21, wherein the synthesizing comprises extracting at least one prosodic feature from the spoken audio of the at least one word present in the second sequence of the corresponding text fragment and not in the first sequence of the at least a portion of the input text, and incorporating into the synthesized speech the at least one prosodic feature extracted from the at least one word, without incorporating any phonemes of the spoken audio of the at least one word into the synthesized speech.
  - 38. The system of claim 21, wherein the extracting comprises specifying prosody for synthesizing the at least a portion of the input text by inputting the corresponding text fragment to a statistical model trained at least partly on the spoken audio.
  - 39. The system of claim 21, wherein the synthesizing comprises specifying at least one prosodic contour for synthesizing the at least a portion of the input text, wherein the at least one prosodic contour is selected from the group consisting of a fundamental frequency contour, an amplitude contour and a duration contour.
  - 40. The system of claim 21, wherein the data set is specific to a domain to which the input text belongs.
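
Claims 31 to 33 describe choosing a sequence of fragments whose markers match those of the input text and ranking candidates by a join cost. The sketch below is one possible reading. The `MARKER_CLASSES` table, the `Fragment` fields, and the definition of join cost as the pitch jump across each fragment boundary are assumptions made for illustration; the claims do not define how the join cost is computed.

```python
from dataclasses import dataclass

# Hypothetical grouping of punctuation markers into classes.
MARKER_CLASSES = {
    ",": "minor_break", ";": "minor_break",
    ".": "major_break", "!": "major_break",
    "?": "question",
}


@dataclass(frozen=True)
class Fragment:
    text: str
    marker_class: str  # class of the marker that ends the fragment
    f0_start: float    # assumed: pitch (Hz) at the start of the spoken audio
    f0_end: float      # assumed: pitch (Hz) at the end of the spoken audio


def marker_sequence(text):
    """Return the classes of the markers found in the text, in order."""
    return [MARKER_CLASSES[ch] for ch in text if ch in MARKER_CLASSES]


def join_cost(sequence):
    """Assumed join cost: total pitch discontinuity between adjacent fragments."""
    return sum(abs(a.f0_end - b.f0_start) for a, b in zip(sequence, sequence[1:]))


def select_sequence(input_text, candidates):
    """Keep candidates whose marker classes match the input, then pick
    the one with the lowest join cost (None if nothing matches)."""
    wanted = marker_sequence(input_text)
    matching = [c for c in candidates if [f.marker_class for f in c] == wanted]
    return min(matching, key=join_cost, default=None)


if __name__ == "__main__":
    candidates = [
        [Fragment("Yes,", "minor_break", 120.0, 110.0),
         Fragment("it does.", "major_break", 150.0, 90.0)],
        [Fragment("So,", "minor_break", 118.0, 112.0),
         Fragment("we go.", "major_break", 115.0, 85.0)],
        [Fragment("Is it?", "question", 130.0, 200.0)],
    ]
    best = select_sequence("Well, it works.", candidates)
    print([f.text for f in best], join_cost(best))
```

Because adjacent fragments need not have been spoken consecutively (claim 35), a boundary cost of this kind is one way to prefer sequences whose prosody joins smoothly.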

41. At least one non-transitory computer-readable storage medium encoded with a plurality of computer-executable instructions that, when executed, perform a method comprising:
- comparing an input text to a data set of text fragments to select a corresponding text fragment for at least a portion of the input text, wherein selecting the corresponding text fragment comprises
  - identifying within the at least a portion of the input text a first sequence of words beginning with a first function word and including one or more words following the first function word,
    
    identifying a grammatical type of the first function word beginning the first sequence of words,
    
    constraining the identified first sequence of words within the at least a portion of the input text to be matched as a unit to a contiguous sequence of words in a text fragment in the data set, and
    
    selecting as the corresponding text fragment a text fragment including as the contiguous sequence of words a second sequence of words beginning with a second function word that is a different word from the first function word but is of the same grammatical type as the first function word, the corresponding text fragment being associated with spoken audio of at least the second sequence of words, wherein the second sequence of words within the corresponding text fragment includes at least one word not present in the first sequence of words within the at least a portion of the input text;
  
  determining an alignment of the corresponding text fragment with the at least a portion of the input text; and
  
  synthesizing speech from the at least a portion of the input text, wherein the synthesizing comprises extracting prosody from the spoken audio of the second sequence of words, including from the at least one word not present in the first sequence of words, and applying the extracted prosody in synthesizing the speech using the alignment of the corresponding text fragment with the at least a portion of the input text.
- - 42. The at least one computer-readable storage medium of claim 41, wherein the method further comprises selecting a second corresponding text fragment for a second portion of the input text, wherein selecting the second corresponding text fragment comprises:
    - identifying a first marker included in the second portion of the input text;
      
      identifying a class of the first marker; and
      
      selecting the second corresponding text fragment based at least in part on the second corresponding text fragment comprising a second marker of the same class as the first marker.
  - 43. The at least one computer-readable storage medium of claim 42, wherein the class of the first marker is selected from the group consisting of one or more punctuation classes, one or more context markup classes and one or more filler classes.
  - 44. The at least one computer-readable storage medium of claim 42, wherein determining the alignment comprises aligning the second marker with the first marker.
  - 45. The at least one computer-readable storage medium of claim 41, wherein identifying the grammatical type of the first function word comprises identifying the first function word as an auxiliary, a conjunction, a subordinate conjunction, a determiner, an interrogative pronoun, a preposition, a pronoun, or a personal pronoun.
  - 46. The at least one computer-readable storage medium of claim 41, wherein the comparing comprises selecting the corresponding text fragment based at least in part on a similarity measure between one or more linguistic features of the at least a portion of the input text and the corresponding text fragment.
  - 47. The at least one computer-readable storage medium of claim 46, wherein the similarity measure is determined based at least in part on a ratio of words that appear in both the at least a portion of the input text and the corresponding text fragment.
  - 48. The at least one computer-readable storage medium of claim 46, wherein the similarity measure is determined based at least in part on a ratio of words having matching parts of speech between the at least a portion of the input text and the corresponding text fragment.
  - 49. The at least one computer-readable storage medium of claim 46, wherein the one or more linguistic features comprise one or more features selected from the group consisting of a named entity feature, a verb semantics feature, a noun semantics feature, an adjective semantics feature, an adverb semantics feature, and a syllable structure feature.
  - 50. The at least one computer-readable storage medium of claim 41, wherein the comparing comprises selecting a sequence of corresponding text fragments for the input text.
  - 51. The at least one computer-readable storage medium of claim 50, wherein the comparing further comprises:
    - analyzing the input text to identify a sequence of markers in the input text; and
      
      selecting the sequence of corresponding text fragments from one or more candidate sequences matching the sequence of markers.
  - 52. The at least one computer-readable storage medium of claim 51, wherein determining the alignment comprises aligning the sequence of markers in the input text with markers in the sequence of corresponding text fragments.
  - 53. The at least one computer-readable storage medium of claim 51, wherein the comparing further comprises:
    - computing a join cost for each of the one or more candidate sequences; and
      
      selecting the sequence of corresponding text fragments from the one or more candidate sequences based at least in part on the join cost.
  - 54. The at least one computer-readable storage medium of claim 50, wherein the comparing further comprises:
    - inputting the input text to a statistical model to divide the input text into a sequence of input text fragments; and
      
      selecting the sequence of corresponding text fragments from one or more candidate sequences matching the sequence of input text fragments.
  - 55. The at least one computer-readable storage medium of claim 50, wherein at least a first text fragment is adjacent in the sequence of corresponding text fragments to a second text fragment, the first text fragment being associated with first spoken audio and the second text fragment being associated with second spoken audio, wherein the first spoken audio was not spoken consecutively with the second spoken audio.
  - 56. The at least one computer-readable storage medium of claim 41, wherein the spoken audio is aligned with the corresponding text fragment, and the synthesizing comprises extracting prosody from the spoken audio using the alignment of the spoken audio with the corresponding text fragment.
  - 57. The at least one computer-readable storage medium of claim 41, wherein the synthesizing comprises extracting at least one prosodic feature from the spoken audio of the at least one word present in the second sequence of the corresponding text fragment and not in the first sequence of the at least a portion of the input text, and incorporating into the synthesized speech the at least one prosodic feature extracted from the at least one word, without incorporating any phonemes of the spoken audio of the at least one word into the synthesized speech.
  - 58. The at least one computer-readable storage medium of claim 41, wherein the extracting comprises specifying prosody for synthesizing the at least a portion of the input text by inputting the corresponding text fragment to a statistical model trained at least partly on the spoken audio.
  - 59. The at least one computer-readable storage medium of claim 41, wherein the synthesizing comprises specifying at least one prosodic contour for synthesizing the at least a portion of the input text, wherein the at least one prosodic contour is selected from the group consisting of a fundamental frequency contour, an amplitude contour and a duration contour.
  - 60. The at least one computer-readable storage medium of claim 41, wherein the data set is specific to a domain to which the input text belongs.
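
Claims 57 and 59 describe carrying prosodic contours (fundamental frequency, amplitude, duration) over from an example word to the synthesized speech without reusing the example word's phonemes. A minimal sketch, assuming contours are stored as plain lists of samples and that linear interpolation is an acceptable way to stretch them, might look like this. The function names and the dictionary layout are hypothetical.

```python
def resample_contour(contour, n_out):
    """Linearly interpolate a contour to n_out evenly spaced points."""
    if n_out <= 0 or not contour:
        return []
    if len(contour) == 1 or n_out == 1:
        return [contour[0]] * n_out
    scale = (len(contour) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out


def apply_contours(input_phonemes, example_contours):
    """Build a synthesis specification for one input word.

    The phonemes come only from the input word; the example word
    contributes only its contours, resampled to one value per phoneme.
    """
    n = len(input_phonemes)
    return {
        "phonemes": list(input_phonemes),
        "f0_hz": resample_contour(example_contours["f0_hz"], n),
        "amplitude": resample_contour(example_contours["amplitude"], n),
        "duration_ms": example_contours["duration_ms"],
    }


if __name__ == "__main__":
    # Assumed contours measured from the spoken audio of an example word.
    example = {"f0_hz": [100.0, 200.0], "amplitude": [0.2, 0.8], "duration_ms": 300.0}
    print(apply_contours(["t", "ey", "b"], example))
```

The resulting specification contains the input word's phonemes together with the example word's pitch, amplitude and duration pattern, so nothing segmental from the example audio reaches the output.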

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Minnis, Stephen; Breen, Andrew P.
Primary Examiner(s): WOZNIAK, JAMES S
Application Number: US13/012,740
Publication Number: US 20120191457A1
Time in Patent Office: 1,877 Days
Field of Search: 704/9, 704/258, 704/260-261, 704/266, 704/268
US Class Current: 1/1
CPC Class Codes:
G10L 13/08 Text analysis or generation...
G10L 13/10 Prosody rules derived from ...
