Systems and methods for concatenation of words in text to speech synthesis

US 8,396,714 B2
Filed: 09/29/2008
Issued: 03/12/2013
Est. Priority Date: 09/29/2008
Status: Active Grant

Abstract

Algorithms for synthesizing speech used to identify media assets are provided. Speech may be selectively synthesized from text strings associated with media assets. A text string may be normalized and its native language determined for obtaining a target phoneme for providing human-sounding speech in a language (e.g., dialect or accent) that is familiar to a user. The algorithms may be implemented on a system including several dedicated render engines. The system may be part of a back end coupled to a front end including storage for media assets and associated synthesized speech, and a request processor for receiving and processing requests that result in providing the synthesized speech. The front end may communicate media assets and associated synthesized speech content over a network to host devices coupled to portable electronic devices on which the media assets and synthesized speech are played back.

619 Citations

36 Claims

1. A method for concatenating words in a text string, performed at an electronic device having one or more processors and memory storing one or more programs for execution by the one or more processors, the method comprising:
- obtaining phonemes for a text string, the text string comprising at least a preceding word and a succeeding word to be concatenated;
  
  identifying a last letter of the preceding word to be concatenated, and identifying a first letter of the succeeding word to be concatenated;
  
  selecting a connector term and a connector term type based on the identified last letter and the identified first letter; and
  
  creating a modified text string for speech synthesis including the selected connector term and the selected connector type.
- - 2. The method of claim 1, wherein the text string is generated based on metadata associated with or identifying a media asset.
  - 3. The method of claim 2, further comprising:
    - synthesizing a speech segment based on the modified text string; and
      
      providing the speech segment to a user device for playback with the media asset on the user device.
  - 4. The method of claim 3, wherein the connector term type specifies a respective pronunciation version for the connector term, and wherein synthesizing the speech segment based on the modified text string further comprises:
    - selecting a particular pronunciation for the connector term based on the respective pronunciation version; and
      
      synthesizing the speech segment in accordance with the particular pronunciation for the connector term and the phonemes obtained for the text string.
  - 5. The method of claim 2, wherein the text string includes one or more fields of information extracted from the metadata and omits at least one field of information available in the metadata.
  - 6. The method of claim 1, wherein the text string includes information identifying one or more of an artist, performer, composer, title, genre, personal preference rating, playlist name, album name, and compilation name pertaining to the media asset.
  - 7. The method of claim 1, further comprising:
    - synthesizing a speech segment based on the modified text string; and
      
      combining the media asset with the synthesized speech segment into a single file.
  - 8. The method of claim 1, further comprising:
    - determining a target language for the speech synthesis; and
      
      obtaining the phonemes for the text string in the determined target language.
  - 9. The method of claim 8, wherein the target language is selected from languages different from a respective language in which the text string was written.
  - 10. The method of claim 8, wherein the target language is a regional dialect of a respective language in which the text string is written.
  - 11. The method of claim 8, wherein the target language is a first language spoken in an accent of a second language different from the first language.
  - 12. The method of claim 11, wherein the second language is a respective language in which the text string was written.
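
The last three steps of claim 1 (identifying the edge letters, selecting a connector term and type, and building the modified text string) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the patent: the connector word "by", the type labels `plain`, `linked`, and `glided`, the vowel-based rule table, and the `<term:type>` markup. The phoneme-obtaining step is omitted, and each word is assumed to contain at least one letter.

```python
from dataclasses import dataclass

VOWELS = set("aeiou")


@dataclass(frozen=True)
class Connector:
    term: str       # word inserted between the two words
    term_type: str  # label a synthesizer could map to a pronunciation version


# Hypothetical rule table keyed by
# (last letter is a vowel, first letter is a vowel).
RULES = {
    (False, False): Connector("by", "plain"),
    (False, True): Connector("by", "linked"),
    (True, False): Connector("by", "plain"),
    (True, True): Connector("by", "glided"),
}


def edge_letters(preceding: str, succeeding: str) -> tuple[str, str]:
    """Return the last letter of `preceding` and the first letter of
    `succeeding`, lowercased, skipping punctuation and spaces."""
    last = next(c for c in reversed(preceding) if c.isalpha()).lower()
    first = next(c for c in succeeding if c.isalpha()).lower()
    return last, first


def select_connector(preceding: str, succeeding: str) -> Connector:
    """Pick a connector term and type from the two edge letters."""
    last, first = edge_letters(preceding, succeeding)
    return RULES[(last in VOWELS, first in VOWELS)]


def build_modified_text(preceding: str, succeeding: str) -> str:
    """Create the modified text string, carrying both the connector term
    and its type in a made-up inline markup."""
    c = select_connector(preceding, succeeding)
    return f"{preceding} <{c.term}:{c.term_type}> {succeeding}"


print(build_modified_text("Yesterday", "The Beatles"))
print(build_modified_text("Aria", "Enya"))
```

Keeping the type alongside the term in the output string mirrors the dependent claims, where the connector term type specifies a pronunciation version for the later synthesis step. A real rule table would depend on the target language.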

13. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors, cause the one or more processors to perform operations comprising:
- obtaining phonemes for a text string, the text string comprising at least a preceding word and a succeeding word to be concatenated;
  
  identifying a last letter of the preceding word to be concatenated, and identifying a first letter of the succeeding word to be concatenated;
  
  selecting a connector term and a connector term type based on the identified last letter and the identified first letter; and
  
  creating a modified text string for speech synthesis including the selected connector term and the selected connector type.
- - 14. The computer-readable medium of claim 13, wherein the text string is generated based on metadata associated with or identifying a media asset.
  - 15. The computer-readable medium of claim 14, wherein the operations further comprise:
    - synthesizing a speech segment based on the modified text string; and
      
      providing the speech segment to a user device for playback with the media asset on the user device.
  - 16. The computer-readable medium of claim 15, wherein the connector term type specifies a respective pronunciation version for the connector term, and wherein synthesizing the speech segment based on the modified text string further comprises:
    - selecting a particular pronunciation for the connector term based on the respective pronunciation version; and
      
      synthesizing the speech segment in accordance with the particular pronunciation for the connector term and the phonemes obtained for the text string.
  - 17. The computer-readable medium of claim 14, wherein the text string includes one or more fields of information extracted from the metadata and omits at least one field of information available in the metadata.
  - 18. The computer-readable medium of claim 13, wherein the text string includes information identifying one or more of an artist, performer, composer, title, genre, personal preference rating, playlist name, album name, and compilation name pertaining to the media asset.
  - 19. The computer-readable medium of claim 13, wherein the operations further comprise:
    - synthesizing a speech segment based on the modified text string; and
      
      combining the media asset with the synthesized speech segment into a single file.
  - 20. The computer-readable medium of claim 13, wherein the operations further comprise:
    - determining a target language for the speech synthesis; and
      
      obtaining the phonemes for the text string in the determined target language.
  - 21. The computer-readable medium of claim 20, wherein the target language is selected from languages different from a respective language in which the text string was written.
  - 22. The computer-readable medium of claim 20, wherein the target language is a regional dialect of a respective language in which the text string is written.
  - 23. The computer-readable medium of claim 20, wherein the target language is a first language spoken in an accent of a second language different from the first language.
  - 24. The computer-readable medium of claim 23, wherein the second language is a respective language in which the text string was written.
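
The dependent claims on metadata describe a text string that includes some fields of a media asset's metadata and omits others. A minimal Python sketch of that selection follows. The dictionary layout, the field names, and the choice to speak only the title and artist are assumptions for illustration, not details from the patent.

```python
# Hypothetical choice of which metadata fields are spoken, in order.
SPOKEN_FIELDS = ("title", "artist")


def select_fields(metadata: dict, fields=SPOKEN_FIELDS) -> list[str]:
    """Return the non-empty values of the chosen fields, in order.
    Fields not listed in `fields` (e.g. genre, album) are omitted."""
    values = (str(metadata.get(name, "")).strip() for name in fields)
    return [value for value in values if value]


def build_text_string(metadata: dict, fields=SPOKEN_FIELDS) -> str:
    """Join the selected field values into one text string whose
    adjacent words can later be concatenated with a connector term."""
    return " ".join(select_fields(metadata, fields))


asset = {
    "title": "Yesterday",
    "artist": "The Beatles",
    "album": "Help!",
    "genre": "Rock",
}
print(build_text_string(asset))
```

Returning the selected values as a list before joining them keeps the field boundaries available, which is where a connector term would be inserted between a preceding and a succeeding word.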

25. A system, comprising:
- one or more processors; and
  
  memory, the memory storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors, cause the one or more processors to perform operations comprising:
  
  obtaining phonemes for a text string, the text string comprising at least a preceding word and a succeeding word to be concatenated;
  
  identifying a last letter of the preceding word to be concatenated, and identifying a first letter of the succeeding word to be concatenated;
  
  selecting a connector term and a connector term type based on the identified last letter and the identified first letter; and
  
  creating a modified text string for speech synthesis including the selected connector term and the selected connector type.
- - 26. The system of claim 25, wherein the text string is generated based on metadata associated with or identifying a media asset.
  - 27. The system of claim 26, wherein the operations further comprise:
    - synthesizing a speech segment based on the modified text string; and
      
      providing the speech segment to a user device for playback with the media asset on the user device.
  - 28. The system of claim 27, wherein the connector term type specifies a respective pronunciation version for the connector term, and wherein synthesizing the speech segment based on the modified text string further comprises:
    - selecting a particular pronunciation for the connector term based on the respective pronunciation version; and
      
      synthesizing the speech segment in accordance with the particular pronunciation for the connector term and the phonemes obtained for the text string.
  - 29. The system of claim 26, wherein the text string includes one or more fields of information extracted from the metadata and omits at least one field of information available in the metadata.
  - 30. The system of claim 25, wherein the text string includes information identifying one or more of an artist, performer, composer, title, genre, personal preference rating, playlist name, album name, and compilation name pertaining to the media asset.
  - 31. The system of claim 25, wherein the operations further comprise:
    - synthesizing a speech segment based on the modified text string; and
      
      combining the media asset with the synthesized speech segment into a single file.
  - 32. The system of claim 25, wherein the operations further comprise:
    - determining a target language for the speech synthesis; and
      
      obtaining the phonemes for the text string in the determined target language.
  - 33. The system of claim 32, wherein the target language is selected from languages different from a respective language in which the text string was written.
  - 34. The system of claim 32, wherein the target language is a regional dialect of a respective language in which the text string is written.
  - 35. The system of claim 32, wherein the target language is a first language spoken in an accent of a second language different from the first language.
  - 36. The system of claim 35, wherein the second language is a respective language in which the text string was written.

Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Rogers, Matthew; Silverman, Kim; Naik, Devang; Rottler, Benjamin
Primary Examiner(s): AZAD, ABUL K
Application Number: US12/240,433
Publication Number: US 20100082347A1
Time in Patent Office: 1,625 Days
Field of Search: 704/251-257, 704/260
US Class Current: 704/260
CPC Class Codes: G10L 13/08 Text analysis or generation...
