Using non-speech sounds during text-to-speech synthesis
First Claim
1. A method, comprising:
parsing text into speech units and non-speech units at a first speech unit level;
attempting to match a non-speech unit with a first audio segment;
determining that there are unmatched non-speech units at the first speech unit level;
parsing speech units adjacent to unmatched non-speech units into speech units at a second speech unit level;
attempting to match an unmatched non-speech unit having an adjacent speech unit at the second speech unit level with a second audio segment; and
creating a portion of speech by synthesizing a portion of the text string containing speech units into speech and augmenting the portion of synthesized speech with the first or second audio segment.
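The two-level matching recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Unit` class, the `PHRASE_INVENTORY` and `WORD_INVENTORY` tables, the audio file names, the comma-based parser, and the fallback to silence are all hypothetical choices made for illustration. Phrases are assumed as the first speech unit level and words as the second.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    text: str
    is_speech: bool
    audio: Optional[str] = None  # matched audio segment (non-speech units only)


# Hypothetical inventories: (preceding unit, following unit) -> audio segment.
PHRASE_INVENTORY = {("we slept", "finally"): "breath_a.wav"}
WORD_INVENTORY = {("home", "we"): "breath_b.wav"}


def parse_first_level(text: str) -> list[Unit]:
    """Assumed first level: phrases split on commas, one non-speech unit per comma."""
    units: list[Unit] = []
    for i, phrase in enumerate(p.strip().lower() for p in text.split(",")):
        if i > 0:
            units.append(Unit(",", is_speech=False))
        units.append(Unit(phrase, is_speech=True))
    return units


def plan_speech(text: str) -> list[tuple[str, str]]:
    units = parse_first_level(text)
    pauses = [i for i, u in enumerate(units) if not u.is_speech]

    # First level: try to match each non-speech unit using its adjacent phrases.
    for i in pauses:
        key = (units[i - 1].text, units[i + 1].text)
        units[i].audio = PHRASE_INVENTORY.get(key)

    # Second level: for unmatched units only, parse the adjacent phrases
    # into words and retry with the nearest word on each side.
    for i in pauses:
        if units[i].audio is None:
            prev_words = units[i - 1].text.split()
            next_words = units[i + 1].text.split()
            if prev_words and next_words:
                key = (prev_words[-1], next_words[0])
                units[i].audio = WORD_INVENTORY.get(key)

    # Build an output plan: speech to synthesize, augmented with matched audio.
    # Falling back to silence when nothing matches is an assumption of this sketch.
    plan: list[tuple[str, str]] = []
    for u in units:
        if u.is_speech:
            plan.append(("speech", u.text))
        else:
            plan.append(("audio", u.audio) if u.audio else ("silence", ""))
    return plan
```

Calling `plan_speech("When we got home, we slept, finally")` matches the second comma at the phrase level and the first comma only after the adjacent phrases are re-parsed into words.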
2 Assignments
0 Petitions
Abstract
Systems, apparatus, methods and computer program products are described for producing text-to-speech synthesis with non-speech sounds. In general, some of the pauses or silences that would otherwise be generated in synthesized speech are instead synthesized as non-speech sounds such as breaths. Non-speech sounds can be identified from pre-recorded speech that can include meta-data such as the grammatical and phrasal structure of words and sounds that precede and succeed non-speech sounds. A non-speech sound can be selected for use in synthesized speech based on the words, punctuation, grammatical and phrasal structure of text from which the speech is being synthesized, or other characteristics.
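As a rough illustration of the selection step described in the abstract, the following Python sketch indexes non-speech sounds by the metadata recorded around them and picks one for a given position in new text. The record fields (`segment`, `prev_word`, `next_word`, `punct`), the file names, and the fallback order (exact word context first, then punctuation) are assumptions for this example; the abstract does not specify them.

```python
from collections import defaultdict
from typing import Optional

# Hypothetical annotations extracted from pre-recorded speech.
RECORDED = [
    {"segment": "breath_01.wav", "prev_word": "home", "next_word": "we", "punct": ","},
    {"segment": "breath_02.wav", "prev_word": "done", "next_word": "then", "punct": "."},
    {"segment": "breath_03.wav", "prev_word": "yes", "next_word": "but", "punct": ","},
]


def build_index(records):
    """Index segments by surrounding words and, more coarsely, by punctuation."""
    by_context = {}
    by_punct = defaultdict(list)
    for r in records:
        by_context[(r["prev_word"], r["next_word"])] = r["segment"]
        by_punct[r["punct"]].append(r["segment"])
    return by_context, by_punct


def select_sound(prev_word, next_word, punct, by_context, by_punct) -> Optional[str]:
    """Prefer an exact word-context match, then any sound recorded at the same
    punctuation mark; return None so the caller can synthesize a plain pause."""
    exact = by_context.get((prev_word, next_word))
    if exact is not None:
        return exact
    candidates = by_punct.get(punct)
    return candidates[0] if candidates else None


BY_CONTEXT, BY_PUNCT = build_index(RECORDED)
```

A real system would likely score candidates on richer features such as phrasal structure; the two-step lookup here only shows the idea of choosing a sound from context.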
39 Citations
13 Claims
1. A method, comprising:
parsing text into speech units and non-speech units at a first speech unit level;
attempting to match a non-speech unit with a first audio segment;
determining that there are unmatched non-speech units at the first speech unit level;
parsing speech units adjacent to unmatched non-speech units into speech units at a second speech unit level;
attempting to match an unmatched non-speech unit having an adjacent speech unit at the second speech unit level with a second audio segment; and
creating a portion of speech by synthesizing a portion of the text string containing speech units into speech and augmenting the portion of synthesized speech with the first or second audio segment.
Dependent claims: 2
3. A computer-readable, non-transitory storage medium having instructions stored thereon, which, when executed by a processor, causes the processor to perform operations, comprising:
parsing text into speech units and non-speech units at a first speech unit level;
attempting to match a non-speech unit with a first audio segment;
determining that there are unmatched non-speech units at the first speech unit level;
parsing speech units adjacent to unmatched non-speech units into speech units at a second speech unit level;
attempting to match an unmatched non-speech unit having an adjacent speech unit at the second speech unit level with a second audio segment; and
creating a portion of speech by synthesizing a portion of the text string containing speech units into speech; and
augmenting the portion of synthesized speech with the first or second audio segment.
4. A system comprising:
a processor;
memory having instructions stored thereon, which, when executed by the processor, cause the processor to perform operations, comprising;
parsing text into speech units and non-speech units at a first speech unit level;
attempting to match a non-speech unit with a first audio segment;
determining that there are unmatched non-speech units at the first speech unit level;
parsing speech units adjacent to unmatched non-speech units into speech units at a second speech unit level;
attempting to match an unmatched non-speech unit having an adjacent speech unit at the second speech unit level with a second audio segment; and
creating a portion of speech by synthesizing a portion of the text string containing speech units into speech and augmenting the portion of synthesized speech with the first or second audio segment.
5. A method comprising:
parsing a text string into phrase units and non-speech units;
attempting to match a non-speech unit to a first audio segment;
determining that there are unmatched non-speech units;
parsing phrase units adjacent to unmatched non-speech units into word units;
attempting to match an unmatched non-speech unit having an adjacent word unit to a second audio segment; and
creating a portion of speech by synthesizing a portion of the text string containing speech units into speech and augmenting the portion of synthesized speech with the first or second audio segment.
Dependent claims: 6, 7
8. A computer-readable, non-transitory storage medium having instructions stored thereon, which, when executed by a processor, causes the processor to perform operations, comprising:
parsing a text string into phrase units and non-speech units;
attempting to match a non-speech unit to a first audio segment;
determining that there are unmatched non-speech units;
parsing phrase units adjacent to unmatched non-speech units into word units;
attempting to match an unmatched non-speech unit having an adjacent word unit to a second audio segment; and
creating a portion of speech by synthesizing a portion of the text string containing speech units into speech and augmenting the portion of synthesized speech with the first or second audio segment.
Dependent claims: 9, 10
11. A system comprising:
a processor;
memory having instructions stored thereon, which, when executed by the processor, cause the processor to perform operations, comprising;
parsing a text string into phrase units and non-speech units;
attempting to match a non-speech unit to a first audio segment;
determining that there are unmatched non-speech units;
parsing phrase units adjacent to unmatched non-speech units into word units;
attempting to match an unmatched non-speech unit having an adjacent word unit to a second audio segment; and
creating a portion of speech by synthesizing a portion of the text string containing speech units into speech and augmenting the portion of synthesized speech with the first or second audio segment.
Dependent claims: 12, 13
Specification