Multi-unit approach to text-to-speech synthesis

US 8,036,894 B2
Filed: 02/16/2006
Issued: 10/11/2011
Est. Priority Date: 02/16/2006
Status: Expired due to Fees

Abstract

Methods, apparatus, systems, and computer program products are provided for synthesizing speech. One method includes matching a first level of units of a received input string to audio segments from a plurality of audio segments including using properties of or between first level units to locate matching audio segments from a plurality of selections, parsing unmatched first level units into second level units, matching the second level units to audio segments using properties of or between the units to locate matching audio segments from a plurality of selections and synthesizing the input string, including combining the audio segments associated with the first and second units.

33 Claims

1. A method, including:
- matching, using a processor, phrase units of a received input string to audio segments from a plurality of audio segments including, for each phrase unit, matching the phrase unit to an audio segment using properties of the phrase unit in comparison to properties of other units in a hierarchy of units including phrases and words and using articulation relationships between the phrase unit and the other units to locate matching audio segments from a plurality of selections including determining concatenation costs and unit costs, where concatenation costs measure how well a unit fits with a neighbor unit and unit costs measure a similarity or difference from an ideal model;
  
  identifying unmatched phrase units that are not matched to audio segments;
  
  parsing the unmatched phrase units into word units;
  
  matching the word units to audio segments including, for each word unit, matching a word unit to an audio segment using properties of the word unit in comparison to properties of other units in the hierarchy of units including the phrases and the words and using articulation relationships between the word and the other units to locate matching audio segments from the plurality of selections; and
  
  synthesizing the received input string including combining the audio segments associated with the matching phrase units and the matching word units.
- - 2. The method of claim 1, wherein matching the phrase units further comprises:
    - searching metadata associated with the plurality of audio segments and that describes the properties of or between the plurality of audio segments.
  - 3. The method of claim 1, wherein matching the word units further comprises:
    - searching metadata associated with the plurality of audio segments and that describes the properties of or between the plurality of audio segments.
  - 4. The method of claim 1, further comprising:
    - parsing unmatched word units into sub-word units; and
      
      matching the sub-word units to audio segments including searching metadata associated with the plurality of audio segments and that describes properties of or between the plurality of audio segments.
  - 5. The method of claim 1, further comprising:
    - parsing unmatched word units into phonetic segment units; and
      
      matching the phonetic segment units to audio segments including searching metadata associated with the plurality of audio segments and that describes properties of or between the plurality of audio segments.
  - 6. The method of claim 5, wherein synthesizing the input string includes:
    - combining the audio segments associated with phrase, word, sub-word and phonetic segment units.
  - 7. The method of claim 1, further comprising:
    - providing an index to the plurality of audio segments.
  - 8. The method of claim 1, further comprising:
    - generating metadata associated with the plurality of audio segments.
  - 9. The method of claim 8, wherein generating the metadata comprises:
    - receiving a voice sample;
      
      determining two or more portions of the voice sample having properties; and
      
      generating a portion of the metadata associated with a first portion of the voice sample to associate a second portion of the voice sample, and a portion of the metadata associated with the second portion of the voice sample to associate the first portion of the voice sample.
  - 10. The method of claim 8, wherein generating the metadata comprises:
    - receiving a voice sample;
      
      delimiting a portion of the voice sample in which articulation relationships are substantially self-contained; and
      
      generating a portion of the metadata to describe the portion of the voice sample.
  - 11. The method of claim 1, wherein the phrase units each comprise one or more of one or more sentences, one or more phrases, one or more word pairs, or one or more words.
  - 12. The method of claim 1, wherein the input string is received from an application or an operating system.
  - 13. The method of claim 1, further comprising:
    - transforming unmatched portions of the input string to uncorrelated phonemes.
  - 14. The method of claim 1, wherein the input string comprises ASCII or Unicode characters.
  - 15. The method of claim 1, further comprising:
    - outputting amplified speech comprising the combined audio segments.
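
Claim 1 selects among several candidate audio segments by weighing unit costs (distance from an ideal model) against concatenation costs (fit with a neighbor unit). The claims do not say how the two costs are combined, so the following is only a minimal sketch that assumes a dynamic-programming search over the candidates. The `Candidate` type, the pitch properties, and both cost functions are hypothetical stand-ins, not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate audio segment for one unit."""
    segment_id: str
    start_pitch: float  # illustrative property at the segment's start (Hz)
    end_pitch: float    # illustrative property at the segment's end (Hz)


def unit_cost(cand, ideal_pitch):
    """Distance of the candidate's mean pitch from an ideal model value."""
    return abs((cand.start_pitch + cand.end_pitch) / 2 - ideal_pitch)


def concat_cost(left, right):
    """How poorly two neighbors join: the pitch jump at the boundary."""
    return abs(left.end_pitch - right.start_pitch)


def select_segments(candidates_per_unit, ideal_pitches):
    """Pick one candidate per unit, minimizing unit + concatenation cost.

    Assumes every unit has at least one candidate.
    Returns (total_cost, [Candidate, ...]).
    """
    if not candidates_per_unit:
        return 0.0, []
    # For each candidate of the current unit, keep the cheapest
    # (cost, path) that ends in that candidate.
    best = [(unit_cost(c, ideal_pitches[0]), [c]) for c in candidates_per_unit[0]]
    for cands, ideal in zip(candidates_per_unit[1:], ideal_pitches[1:]):
        step = []
        for c in cands:
            prev_cost, prev_path = min(
                ((cost + concat_cost(path[-1], c), path) for cost, path in best),
                key=lambda item: item[0],
            )
            step.append((prev_cost + unit_cost(c, ideal), prev_path + [c]))
        best = step
    return min(best, key=lambda item: item[0])


units = [
    [Candidate("a", 100, 100), Candidate("b", 100, 120)],
    [Candidate("c", 120, 120), Candidate("d", 100, 100)],
]
total, path = select_segments(units, [100, 120])
print(total, [c.segment_id for c in path])
```

In this toy input, candidate `a` has the lowest unit cost for the first unit, yet the search picks `b` because it joins the next segment without a pitch jump. That trade-off is the reason for scoring both costs together rather than choosing each unit in isolation.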

16. A method, including:
- receiving a stream of textual input;
  
  matching portions of the stream of textual input to audio segments derived from one or more voice samples at multiple levels including:
  
  matching the portions based on properties of individual portions and articulation relationships between the portions; and
  
  comparing the portions to target voice samples at different hierarchical levels to identify matched portions where the hierarchical levels are selected from a group comprising target matching phrases, words, sub-words and phonetic segments;
  
  identifying unmatched portions that are not matched to the audio segments;
  
  parsing portions including the unmatched portions using a hierarchical level different from that used for matching the target voice samples;
  
  comparing the parsed portions to the audio segments to identify additional matched portions; and
  
  synthesizing audio segments corresponding to the matched portions and the additional matched portions into speech output.
- - 17. The method of claim 16, wherein synthesizing comprises:
    - synthesizing both matching audio segments for successfully matched portions of the stream of textual input and uncorrelated phonemes for unmatched portions of the stream of textual input.
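
The flow in claims 16 and 17 (match at one hierarchical level, re-parse whatever is unmatched at a finer level, and fall back to uncorrelated phonemes for what remains) can be sketched as a short recursion. This is an illustration under stated assumptions: matching is reduced to exact dictionary lookup, and the `library` layout, the `splitters` table, and the `"uncorrelated"` tag are hypothetical names introduced here.

```python
def match_unit(text, level_index, levels, library, splitters):
    """Return a list of (level, text, segment_id) triples for `text`.

    levels:    ordered level names, coarsest first, e.g. ["phrase", "word"]
    library:   level name -> {unit text -> audio segment id} (hypothetical)
    splitters: level name -> function that parses a unit into finer units
    """
    level = levels[level_index]
    segment = library.get(level, {}).get(text)
    if segment is not None:
        return [(level, text, segment)]
    if level_index + 1 == len(levels):
        # No finer level is left: mark the unit for the
        # uncorrelated-phoneme fallback.
        return [("uncorrelated", text, None)]
    plan = []
    for part in splitters[level](text):
        plan.extend(match_unit(part, level_index + 1, levels, library, splitters))
    return plan


levels = ["phrase", "word"]
splitters = {"phrase": str.split}  # naive whitespace parsing, an assumption
library = {
    "phrase": {"good morning": "p1"},
    "word": {"hello": "w1"},
}
print(match_unit("good morning", 0, levels, library, splitters))
print(match_unit("hello zorp", 0, levels, library, splitters))
```

Adding sub-word or phonetic-segment levels in this sketch only requires extending `levels`, `library`, and `splitters`; the recursion itself does not change.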

18. A non-transitory computer program product including instructions tangibly stored on a computer-readable medium, the product including instructions for causing a computing device to:
- match, by a processor, phrase units of an input string to audio segments from a plurality of audio segments;
  
  identify unmatched phrase units that are not matched to the audio segments;
  
  parse the unmatched phrase units into word units;
  
  match the word units to audio segments including, for word units adjacent to matched phrase units in the input string, matching the word units based on properties of the word units and the adjacent matched phrase units; and
  
  synthesize the input string including combining the audio segments associated with the matching phrase units and the matching word units.
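
Claim 18 matches a word unit using both its own properties and those of an adjacent matched phrase unit. One way to picture this, purely as an assumption, is to store with each word recording the kind of sound that preceded it in the voice sample, and to prefer the recording whose recorded context agrees with the neighboring phrase. The `WordRecording` type and the sound-class labels below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordRecording:
    """Hypothetical stored word unit with one articulation property."""
    text: str
    segment_id: str
    recorded_after: str  # sound class that preceded the word when recorded


def pick_word_recording(word, neighbor_ending, recordings):
    """Choose a recording of `word` suited to the adjacent matched phrase.

    neighbor_ending is the sound class at the end of the matched phrase
    to the left. Returns None if the word was never recorded.
    """
    options = [r for r in recordings if r.text == word]
    if not options:
        return None
    # False sorts before True, so a recording with matching context wins;
    # otherwise the first available recording is used.
    return min(options, key=lambda r: r.recorded_after != neighbor_ending)


recordings = [
    WordRecording("street", "s1", "vowel"),
    WordRecording("street", "s2", "nasal"),
]
print(pick_word_recording("street", "nasal", recordings))
```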

19. A system, including:
- an input capture routine to receive an input string that includes phrase units;
  
  a unit matching engine, in communication with the input capture routine, to match the phrase units to audio segments from a plurality of audio segments including using properties of or between the audio segments for matching the phrase units, and to identify unmatched phrase units that are not matched to the audio segments;
  
  a parsing engine, in communication with the unit matching engine, to parse unmatched phrase units into word units, the unit matching engine configured to match the word units to audio segments including using properties of the audio segments associated with the word units and audio segments associated with adjacent phrase units;
  
  a synthesis block, in communication with the unit matching engine, to synthesize the input string including combining the audio segments associated with the phrase units and the word units; and
  
  a storage unit to store the plurality of audio segments and properties of or between the plurality of audio segments.

20. A method including:
- providing, by a processor, a library of audio segments and associated metadata defining properties of or between a given segment and another segment, the library including one or more levels of units in accordance with a hierarchy;
  
  matching, at a first level of the hierarchy, units of a received input string to audio segments, the received input string having one or more units at a first level;
  
  identifying unmatched units that are not matched to the audio segments;
  
  parsing the unmatched units to units at a second level in the hierarchy;
  
  matching one or more units at the second level of the hierarchy to audio segments including matching the one or more units at the second level based on properties of a given unit at the second level and adjacent units at the first level where the levels are selected from the group comprising phrases, words, sub-words and phonetic segments; and
  
  synthesizing the input string including combining the audio segments associated with the first and second levels.

21. A method including:
- receiving, by a processor, audio segments;
  
  parsing the audio segments into units of a first level in a hierarchy of levels;
  
  defining properties of and between the units at different levels, the properties between the units including articulation relationships;
  
  storing the units and the properties associated with the units;
  
  parsing the units into sub-units;
  
  defining properties of and between the sub-units and between adjacent units and the sub-units; and
  
  storing the sub-units and the properties associated with the sub-units.
- - 22. The method of claim 21 further comprising:
    - parsing a received input string to units;
      
      determining properties of and between the units if any;
      
      matching the units to the stored units using the properties;
      
      parsing unmatched units to sub-units;
      
      determining properties of and between the sub-units and the units if any;
      
      matching one or more sub-units to the stored sub-units; and
      
      synthesizing the input string including combining the audio segments associated with the units and the sub-units.
  - 23. The method of claim 21 where the units are phrase units and the sub-units are word units.
  - 24. The method of claim 21 where the units are word units and the sub-units are phonetic segments.
  - 25. The method of claim 21 further comprising:
    - defining properties between the units and the sub-units; and
      
      storing the properties associated with the units and the sub-units.
  - 26. The method of claim 21 further comprising:
    - continuing to parse the sub-units to phonetic segments;
      
      determining properties of and between phonetic segments if any; and
      
      storing the phonetic segments including the properties associated with the phonetic segments.
  - 27. The method of claim 26 further comprising storing the phonetic segments without the properties associated with the phonetic segments.
  - 28. The method of claim 27 further comprising:
    - parsing a received input string to units;
      
      determining properties of and between the units if any;
      
      matching the units to the stored units using the properties;
      
      parsing unmatched units to sub-units;
      
      determining properties of and between the sub-units and units if any;
      
      matching one or more sub-units to stored sub-units;
      
      parsing unmatched sub-units;
      
      determining properties of and between the parsed sub-units, the sub-units and units if any;
      
      matching the parsed sub-units to stored parsed units; and
      
      synthesizing the input string including combining the audio segments associated with the units, sub-units and parsed sub-units.
  - 29. The method of claim 27 further comprising storing the parsed sub-units without the properties.
  - 30. The method of claim 29 further comprising:
    - parsing a received input string to units;
      
      determining properties between the units if any;
      
      matching the units to the stored units using the properties;
      
      parsing unmatched units to sub-units;
      
      determining properties between the sub-units if any;
      
      matching one or more sub-units to the stored sub-units;
      
      parsing the unmatched sub-units;
      
      determining properties between the parsed sub-units if any;
      
      matching the parsed sub-units to the stored parsed units using the properties; and
      
      synthesizing the input string including combining the audio segments associated with the units, sub-units and parsed sub-units.
  - 31. The method of claim 21 further comprising:
    - parsing the sub-units into parsed sub-units;
      
      defining properties of and between the parsed sub-units, the units and sub-units; and
      
      storing the parsed sub-units and the properties associated with the parsed sub-units.
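
The library-building steps of claim 21 (parse the received audio into units, record properties of and between them, then repeat for sub-units) can be sketched as follows. Transcripts stand in for audio here, and neighbor text stands in for articulation relationships; the dictionary layout and the key names `prev`, `next`, `parent`, and `left_unit` are assumptions made for illustration.

```python
from collections import defaultdict


def build_library(phrases):
    """Build a two-level store from phrase transcripts in recorded order.

    Returns {"phrase": {text: [props, ...]}, "word": {text: [props, ...]}}.
    """
    library = {"phrase": defaultdict(list), "word": defaultdict(list)}
    for i, phrase in enumerate(phrases):
        # Properties between units: which phrases were spoken on either side.
        prev_phrase = phrases[i - 1] if i > 0 else None
        next_phrase = phrases[i + 1] if i + 1 < len(phrases) else None
        library["phrase"][phrase].append({"prev": prev_phrase, "next": next_phrase})

        # Parse the unit into sub-units and record their relationships,
        # including the adjacent first-level unit at the phrase boundary.
        words = phrase.split()
        for j, word in enumerate(words):
            library["word"][word].append({
                "parent": phrase,
                "prev": words[j - 1] if j > 0 else None,
                "next": words[j + 1] if j + 1 < len(words) else None,
                "left_unit": prev_phrase if j == 0 else None,
            })
    return library


lib = build_library(["turn left", "on main street"])
print(lib["word"]["on"][0])
```

Each stored entry keeps its own context, so a word recorded several times appears as several entries that a matcher can later tell apart by their properties.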

32. A method including:
- receiving audio segments;
  
  parsing the audio segments into units of a first level in a hierarchy of levels;
  
  defining properties between the units;
  
  storing the units and the properties;
  
  parsing the units into units of a next level in the hierarchy of levels;
  
  defining properties between the units in the next level and units in the first level;
  
  storing the units of the next level and the properties between the units in the next level and the units in the first level;
  
  continuing to parse units at a given level into units at a next level in the hierarchy until a final parsing is performed;
  
  at each level, defining properties between units and adjacent units at a different level and storing the units and the properties between the units and the adjacent units at the different level; and
  
  at a final level in the hierarchy, storing units where the hierarchy of levels are selected from a group comprising phrases, words, sub-words and phonetic segments.
- - 33. The method of claim 32 further comprising:
    - parsing a received input string to units;
      
      determining properties for the units if any;
      
      matching units having properties to stored units at a first level in the hierarchy;
      
      parsing unmatched units in a given level of the hierarchy to units at a next level in the hierarchy;
      
      determining properties for the parsed units if any;
      
      matching one or more parsed units to stored units at a given level in the hierarchy; and
      
      synthesizing the input string including combining the audio segments associated with the units and parsed units.

Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Silverman, Kim E.A.; Aitken, Kevin B.; Neeracher, Matthias; Naik, Devang K.; Bellegarda, Jerome R.
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Godbold, Douglas
Application Number: US11/357,736
Publication Number: US 20070192105A1
Time in Patent Office: 2,063 Days
Field of Search: 704/10, 704/243, 704/258, 704/260, 704/266, 704/267
US Class Current: 704/267
CPC Class Codes: G10L 13/08 Text analysis or generation...
