USING NON-SPEECH SOUNDS DURING TEXT-TO-SPEECH SYNTHESIS

US 20080071529A1
Filed: 09/15/2006
Published: 03/20/2008
Est. Priority Date: 09/15/2006
Status: Active Grant

Abstract

Systems, apparatus, methods and computer program products are described for producing text-to-speech synthesis with non-speech sounds. In general, some of the pauses or silences that would otherwise be generated in synthesized speech are instead synthesized as non-speech sounds such as breaths. Non-speech sounds can be identified from pre-recorded speech that can include meta-data such as the grammatical and phrasal structure of words and sounds that precede and succeed non-speech sounds. A non-speech sound can be selected for use in synthesized speech based on the words, punctuation, grammatical and phrasal structure of text from which the speech is being synthesized, or other characteristics.
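
The idea in the abstract, that some pauses are rendered as non-speech sounds rather than silence, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `PAUSE_MS` table, the `plan_utterance` function, the event names, and the 300 ms threshold are hypothetical and do not come from the patent.

```python
import re

# Hypothetical pause lengths (milliseconds) per punctuation mark.
# These numbers are illustrative assumptions, not values from the patent.
PAUSE_MS = {",": 150, ";": 250, ":": 250, ".": 400, "?": 400, "!": 400}


def plan_utterance(text, breath_threshold_ms=300):
    """Turn text into (kind, value) events for an imagined synthesizer.

    kind is "word", "pause" or "breath". A pause whose length reaches the
    threshold is emitted as a breath instead of silence; shorter pauses
    stay silent.
    """
    events = []
    for token in re.findall(r"[A-Za-z']+|[,;:.?!]", text):
        if token in PAUSE_MS:
            ms = PAUSE_MS[token]
            kind = "breath" if ms >= breath_threshold_ms else "pause"
            events.append((kind, ms))
        else:
            events.append(("word", token))
    return events


if __name__ == "__main__":
    for event in plan_utterance("Well, yes. Go on"):
        print(event)
```

In this sketch only the longer, sentence-final pause becomes a breath, which mirrors the abstract's statement that some, not all, pauses are replaced.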

27 Claims

1. A method, including:
- augmenting a portion of synthesized speech with a non-speech sound other than silence, the augmentation based on characteristics of the synthesized speech.
  - 2. The method of claim 1, further comprising:
    - replacing a pause in the portion of synthesized speech with a non-speech sound.
  - 3. The method of claim 2, where augmentation further comprises:
    - identifying the non-speech sound based on punctuation, grammatical or phrasal structure of text associated with the portion of synthesized speech.
  - 4. The method of claim 1, where a non-speech sound includes the sound of one or more of:
    - inhalation;
    - exhalation;
    - mouth clicks;
    - lip smacks;
    - tongue flicks; and
    - salivation.
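
Claims 2 to 4 can be read as a selection step: given the punctuation at a pause, pick one of the listed non-speech sounds, or keep silence. The sketch below assumes a simple lookup table; the `NonSpeechSound` enum, the `RULES` mapping, and the `choose_sound` function are hypothetical, and the particular pairing of punctuation with sounds is an arbitrary illustration rather than anything the claims specify.

```python
from enum import Enum


class NonSpeechSound(Enum):
    """The sound types enumerated in claim 4."""
    INHALATION = "inhalation"
    EXHALATION = "exhalation"
    MOUTH_CLICK = "mouth click"
    LIP_SMACK = "lip smack"
    TONGUE_FLICK = "tongue flick"
    SALIVATION = "salivation"


# Assumed, illustrative rules: which sound replaces the pause at each mark.
RULES = {
    ".": NonSpeechSound.INHALATION,
    "?": NonSpeechSound.INHALATION,
    "!": NonSpeechSound.INHALATION,
    ";": NonSpeechSound.EXHALATION,
    ":": NonSpeechSound.EXHALATION,
    ",": NonSpeechSound.LIP_SMACK,
}


def choose_sound(punctuation):
    """Return the sound for a pause, or None to leave the pause silent."""
    return RULES.get(punctuation)


if __name__ == "__main__":
    for mark in [".", ",", "-"]:
        sound = choose_sound(mark)
        print(repr(mark), "->", sound.value if sound else "silence")
```

A real system would presumably also weigh grammatical and phrasal structure, as claim 3 states; a table keyed on punctuation alone is the simplest stand-in.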

5. A method, including:
- identifying a non-speech unit in a received input string, the non-speech unit not having an associated specific textual reference in the input string;
- matching the non-speech unit to an audio segment, the audio segment a voice sample of a non-speech sound; and
- synthesizing the input string, including combining the audio segments matched with the non-speech unit.
  - 6. The method of claim 5, further comprising:
    - identifying the non-speech unit based on punctuation, grammatical and phrasal structure of the input string.
  - 7. The method of claim 5, further comprising:
    - determining the duration of the non-speech unit.
  - 8. The method of claim 7, further comprising:
    - matching the non-speech unit with non-speech sounds based on duration of the non-speech unit.
  - 9. The method of claim 5, further comprising:
    - generating metadata associated with the plurality of audio segments.
  - 10. The method of claim 9, wherein generating the metadata comprises:
    - receiving a voice sample;
    - determining two or more portions of the voice sample having properties;
    - generating a portion of the metadata associated with a first portion of the voice sample to associate a second portion of the voice sample with the first portion of the voice sample; and
    - generating a portion of the metadata associated with the second portion of the voice sample to associate the first portion of the voice sample with the second portion of the voice sample.
  - 11. The method of claim 9, wherein generating the metadata comprises:
    - receiving a voice sample;
    - delimiting a portion of the voice sample in which articulation relationships are substantially self-contained; and
    - generating a portion of the metadata to describe the portion of the voice sample.
  - 12. The method of claim 5, further comprising:
    - identifying a speech unit in a received input string, the speech unit preceding or following the non-speech unit; and
    - matching the non-speech unit with the non-speech sound based on the speech unit.
  - 13. The method of claim 12, further comprising:
    - parsing the speech unit into sub units, at least one sub unit preceding or following the non-speech unit; and
    - matching the non-speech unit with non-speech sounds based on the at least one sub unit.
  - 14. The method of claim 12, where speech units are phrases, words, or sub-words in the input string.
  - 15. The method of claim 5, further comprising:
    - limiting synthesizing non-speech units based on a proximity to preceding synthesized pauses.
  - 16. The method of claim 5, wherein the input string comprises ASCII or Unicode characters.
  - 17. The method of claim 5, further comprising:
    - outputting amplified speech comprising the combined audio segments.
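
Two of the dependent claims above lend themselves to a short sketch: matching a non-speech unit to a recorded sound by duration and neighbouring speech unit (claims 7, 8 and 12), and limiting non-speech units by proximity to earlier ones (claim 15). The `Segment` type, the cost function with its fixed context penalty, and the `min_gap_words` parameter are assumptions chosen for illustration; the claims do not specify how matching or limiting is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A recorded non-speech sample with minimal metadata (hypothetical)."""
    name: str
    duration_ms: int
    prev_word: str  # word that preceded the sound in the recording


def match_segment(target_ms, prev_word, inventory, context_penalty=100):
    """Pick the segment with the lowest assumed cost.

    Cost is the duration mismatch in milliseconds, plus a fixed penalty
    when the recorded preceding word differs from the one in the input.
    """
    def cost(seg):
        mismatch = 0 if seg.prev_word == prev_word.lower() else context_penalty
        return abs(seg.duration_ms - target_ms) + mismatch

    return min(inventory, key=cost)


def limit_by_proximity(positions, min_gap_words=4):
    """Keep only non-speech positions at least min_gap_words apart.

    positions are word indices at which a non-speech unit was identified.
    """
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_gap_words:
            kept.append(pos)
    return kept


if __name__ == "__main__":
    inventory = [
        Segment("breath_short", 200, "and"),
        Segment("breath_long", 450, "said"),
        Segment("click", 60, "but"),
    ]
    print(match_segment(400, "said", inventory).name)
    print(limit_by_proximity([2, 3, 9, 11, 20]))
```

The proximity filter is greedy: it always keeps the earliest candidate and drops any that follow too closely, which is one simple way to avoid a run of breaths in quick succession.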

18. A method, including:
- receiving audio segments;
- parsing the audio segments into speech units and non-speech units;
- defining properties of or between speech units and non-speech units; and
- storing the units and the properties.
  - 19. The method of claim 18 further comprising:
    - parsing the speech units into sub units;
    - defining properties of or between the sub units; and
    - storing the sub units and properties.
  - 20. The method of claim 18 further comprising:
    - parsing a received input string into speech units and non-speech units;
    - determining properties of or between the speech units and non-speech units if any;
    - matching units to stored units using the properties; and
    - synthesizing the input string including combining the audio segments matched with the speech units and non-speech units.
  - 21. The method of claim 18 further comprising:
    - defining properties between speech units and non-speech units.
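
The steps of claim 18 (receive, parse, define properties, store) can be sketched as building a small unit store from an annotated recording. This assumes the recording has already been labeled as a list of `(label, is_speech, duration_ms)` tuples; that input format, the `build_unit_store` function, and the choice of neighbouring labels as the "properties between" units are all hypothetical.

```python
def build_unit_store(labeled_segments):
    """Group units into speech and non-speech lists with simple properties.

    labeled_segments is a list of (label, is_speech, duration_ms) tuples in
    recording order. Each stored unit records its own duration and the
    labels of its neighbours, standing in for properties of and between
    units.
    """
    store = {"speech": [], "non-speech": []}
    last = len(labeled_segments) - 1
    for i, (label, is_speech, duration_ms) in enumerate(labeled_segments):
        unit = {
            "label": label,
            "duration_ms": duration_ms,
            "prev": labeled_segments[i - 1][0] if i > 0 else None,
            "next": labeled_segments[i + 1][0] if i < last else None,
        }
        store["speech" if is_speech else "non-speech"].append(unit)
    return store


if __name__ == "__main__":
    recording = [
        ("hello", True, 320),
        ("inhale", False, 410),
        ("world", True, 350),
    ]
    for kind, units in build_unit_store(recording).items():
        print(kind, units)
```

A store of this shape would let a later matching step, such as the one in claim 20, look up non-speech units whose recorded neighbours resemble the context in the input string.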

22. A computer program product, encoded on a computer-readable medium, operable to cause a data processing apparatus to:
- augment a portion of synthesized speech with a non-speech sound other than silence, the augmentation based on characteristics of the synthesized speech.

23. A computer program product, encoded on a computer-readable medium, operable to cause a data processing apparatus to:
- identify a non-speech unit in a received input string, the non-speech unit not having an associated specific textual reference in the input string;
- match the non-speech unit to an audio segment, the audio segment a voice sample of a non-speech sound; and
- synthesize the input string, including combining the audio segments matched with the non-speech unit.

24. A computer program product, encoded on a computer-readable medium, operable to cause a data processing apparatus to:
- receive audio segments;
- parse the audio segments into speech units and non-speech units;
- define properties of or between speech units and non-speech units; and
- store the units and the properties.

25. A system comprising:
- augmenting a portion of synthesized speech with a non-speech sound other than silence, the augmentation based on characteristics of the synthesized speech.

26. A system comprising:
- means for identifying a non-speech unit in a received input string, the non-speech unit not having an associated specific textual reference in the input string;
- means for matching the non-speech unit to an audio segment, the audio segment a voice sample of a non-speech sound; and
- means for synthesizing the input string, including combining the audio segments matched with the non-speech unit.

27. A system comprising:
- means for receiving audio segments;
- means for parsing the audio segments into speech units and non-speech units;
- means for defining properties of or between speech units and non-speech units; and
- means for storing the units and the properties.

Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Silverman, Kim E.A.; Neeracher, Matthias
Granted Patent: US 8,027,837 B2
US Class Current: 704/220
CPC Class Codes:
- G10L 13/02 Methods for producing synth...
- G10L 21/0364 for improving intelligibility
