SINGLE INTERFACE FOR LOCAL AND REMOTE SPEECH SYNTHESIS

US 20150262571A1
Filed: 02/13/2015
Published: 09/17/2015
Est. Priority Date: 10/25/2012
Status: Active Grant

Abstract

Features are disclosed for providing a consistent interface for local and distributed text-to-speech (TTS) systems. Some portions of the TTS system, such as voices and TTS engine components, may be installed on a client device, and some may be present on a remote system accessible via a network link. Determinations can be made regarding which TTS system components to implement on the client device and which to implement on the remote server. The consistent interface facilitates connecting to or otherwise employing the TTS system through use of the same methods and techniques regardless of which TTS system configuration is implemented.
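
The idea of a consistent interface can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `TtsBackend`, `LocalBackend`, `RemoteBackend`, and `TtsClient` are hypothetical, and both backends return placeholder bytes rather than real audio. The only point shown is that the caller uses the same method whichever configuration is active.

```python
from abc import ABC, abstractmethod


class TtsBackend(ABC):
    """Hypothetical contract shared by every TTS configuration."""

    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        ...


class LocalBackend(TtsBackend):
    def synthesize(self, text: str) -> bytes:
        # Placeholder for an engine and voice installed on the client device.
        return b"local:" + text.encode("utf-8")


class RemoteBackend(TtsBackend):
    def synthesize(self, text: str) -> bytes:
        # Placeholder for a request to a remote system over a network link.
        return b"remote:" + text.encode("utf-8")


class TtsClient:
    """Single entry point; callers never see which backend is in use."""

    def __init__(self, backend: TtsBackend) -> None:
        self._backend = backend

    def use_backend(self, backend: TtsBackend) -> None:
        # Swap the configuration without changing the calling code.
        self._backend = backend

    def speak(self, text: str) -> bytes:
        return self._backend.synthesize(text)
```

An application would call `TtsClient.speak` in both cases; switching from `LocalBackend` to `RemoteBackend` changes where synthesis happens but not how it is requested.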

50 Claims

1-30. (canceled)

31. A system comprising:
- a computer-readable memory storing executable instructions; and
  
  one or more computer processors in communication with the computer-readable memory, wherein the one or more computer processors are programmed by the executable instructions to at least:
  
  receive, from a remote storage location, voice recordings of subword units;
  
  generate a text-to-speech presentation by concatenating two or more of the voice recordings, wherein individual voice recordings of the two or more voice recordings correspond to subword units for individual words in a text to be presented audibly;
  
  determine system performance in generating the text-to-speech presentation;
  
  determine, based at least partly on the system performance, that accessing the voice recordings at a local storage location will likely improve system performance in generating a subsequent text-to-speech presentation;
  
  store at least the portion of the voice recordings in the local storage location;
  
  access at least the portion of the voice recordings at the local storage location; and
  
  generate the subsequent text-to-speech presentation using the portion of voice recordings accessed at the local storage location.
- Dependent claims (32, 33, 34):
  - 32. The system of claim 31, wherein the executable instructions to determine that accessing the voice recordings at the local storage location will likely improve system performance comprise instructions to determine that a latency of a network connection to the remote storage location exceeds a threshold.
  - 33. The system of claim 31, wherein the executable instructions to determine that accessing the voice recordings at the local storage location will likely improve system performance comprise instructions to determine that a frequency of use of the voice recordings exceeds a threshold.
  - 34. The system of claim 31, wherein the executable instructions further comprise instructions to:
    - determine that accessing additional voice recordings at the remote storage location will likely not reduce system performance in generating an additional text-to-speech presentation;
      
      remove at least a portion of the additional voice recordings from the local storage location;
      
      access at least the portion of the additional voice recordings at the remote storage location; and
      
      generate the additional text-to-speech presentation using the portion of additional voice recordings accessed at the remote storage location.
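
The flow recited in claims 31 to 33 can be sketched roughly as follows. This is an illustrative reading, not the patented implementation: `VoiceStore`, `synthesize`, `should_cache_locally`, and `cache_units` are hypothetical names, latency is simulated with a fixed per-fetch number, and the threshold values are arbitrary assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceStore:
    """Hypothetical store mapping subword-unit names to recorded audio bytes."""
    recordings: dict[str, bytes] = field(default_factory=dict)
    latency_ms: float = 0.0  # simulated cost of fetching one recording


def synthesize(units: list[str], store: VoiceStore) -> tuple[bytes, float]:
    """Concatenate the recordings for `units`; return audio and total fetch cost."""
    audio = b"".join(store.recordings[u] for u in units)
    return audio, store.latency_ms * len(units)


def should_cache_locally(latency_ms: float, use_count: int,
                         latency_threshold_ms: float = 200.0,
                         use_threshold: int = 3) -> bool:
    """Either signal exceeding its (assumed) threshold favors local storage."""
    return latency_ms > latency_threshold_ms or use_count > use_threshold


def cache_units(units: list[str], remote: VoiceStore, local: VoiceStore) -> None:
    """Copy the named recordings from the remote store into the local store."""
    for u in units:
        local.recordings[u] = remote.recordings[u]


# Usage: first presentation from the remote store, later ones from local storage.
remote = VoiceStore({"hel": b"\x01", "lo": b"\x02"}, latency_ms=150.0)
local = VoiceStore()
units = ["hel", "lo"]

audio, cost = synthesize(units, remote)
if should_cache_locally(cost, use_count=1):
    cache_units(units, remote, local)
    audio_again, cost_again = synthesize(units, local)
```

Claim 34 describes the reverse move; under the same assumptions it would amount to deleting entries from the local store when remote access is not expected to hurt performance.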

35. A computer-implemented method comprising:
- as implemented by one or more computing devices configured to execute specific instructions,
  
  accessing first voice data at a current storage location;
  
  generating a plurality of text-to-speech presentations using the first voice data accessed at the current storage location;
  
  generating usage data regarding generation of the plurality of text-to-speech presentations;
  
  determining a preferred storage location for the first voice data based at least partly on the usage data, wherein the preferred storage location corresponds to one of a local storage location or a remote storage location, and wherein the preferred storage location is different than the current storage location;
  
  accessing first voice data at the preferred storage location; and
  
  generating a subsequent text-to-speech presentation using the first voice data accessed at the preferred storage location.
- Dependent claims (36, 37, 38, 39, 40, 41, 42):
  - 36. The computer-implemented method of claim 35, wherein the usage data relates to at least one of:
    - network latency in accessing the first voice data at the current storage location;
      
      bandwidth of a network connection used to access the first voice data at the current storage location;
      
      an identity of an application that causes generation of a text-to-speech presentation;
      
      text used to generate a text-to-speech presentation;
      
      or frequency with which the first voice data is used to generate text-to-speech presentations.
  - 37. The computer-implemented method of claim 35, wherein determining the preferred storage location for the first voice data comprises determining that the first voice data is to be stored at the local storage location based at least partly on a latency of a network connection to the remote storage location exceeding a threshold.
  - 38. The computer-implemented method of claim 35, wherein determining the preferred storage location for the first voice data comprises determining that the first voice data is to be stored at the remote storage location based at least partly on a latency of a network connection to the remote storage location failing to exceed a threshold.
  - 39. The computer-implemented method of claim 35, wherein generation of at least a first text-to-speech presentation of the one or more text-to-speech presentations using the first voice data comprises concatenating voice recordings of subword units for individual words in a text to be presented audibly, wherein the first voice data comprises the voice recordings.
  - 40. The computer-implemented method of claim 35, wherein determining the preferred storage location for the first voice data comprises determining that the first voice data is to be stored at the remote storage location based at least partly on usage data indicating that frequency of use of the first voice data falls below a threshold.
  - 41. The computer-implemented method of claim 35, wherein determining the preferred storage location for the first voice data comprises determining that the first voice data is to be stored at the local storage location based at least partly on usage data indicating that frequency of use of the first voice data exceeds a threshold.
  - 42. The computer-implemented method of claim 35, wherein determining the preferred storage location for the first voice data is performed by a server computing device separate from a client computing device on which the subsequent text-to-speech presentation is to be presented.
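
The decision described in claims 35 to 41 can be sketched as a function from usage data to a storage location. The sketch below is an assumption-laden illustration: `Location`, `UsageData`, and `preferred_location` are hypothetical names, only two of the usage signals listed in claim 36 (latency and frequency) are modeled, and the threshold values are arbitrary.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class Location(Enum):
    LOCAL = "local"
    REMOTE = "remote"


@dataclass
class UsageData:
    """Hypothetical usage record gathered while generating presentations."""
    remote_latencies_ms: list[float]  # observed latency to the remote location
    uses_per_day: float               # how often the voice data is used


def preferred_location(current: Location, usage: UsageData,
                       latency_threshold_ms: float = 200.0,
                       frequency_threshold: float = 5.0) -> Location:
    """Return where the voice data should live given the usage data."""
    avg = mean(usage.remote_latencies_ms) if usage.remote_latencies_ms else 0.0
    if current is Location.REMOTE:
        # Claims 37 and 41: high latency or frequent use favors local storage.
        if avg > latency_threshold_ms or usage.uses_per_day > frequency_threshold:
            return Location.LOCAL
    else:
        # Claims 38 and 40: low latency and infrequent use favor remote storage.
        if avg <= latency_threshold_ms and usage.uses_per_day < frequency_threshold:
            return Location.REMOTE
    # No move indicated; the claims cover only the case where the result differs.
    return current
```

Requiring both conditions before moving data back to the remote location is a design choice of this sketch, made to avoid evicting frequently used voice data; the claims recite the latency and frequency conditions separately. Per claim 42, a function like this could run on a server rather than on the client that presents the speech.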

43. A non-transitory computer storage medium which stores an executable code module that directs a client computing device to perform a process comprising:
- accessing first voice data at a current storage location;
  
  generating a plurality of text-to-speech presentations using the first voice data accessed at the current storage location;
  
  generating usage data regarding generation of the plurality of text-to-speech presentations;
  
  determining a preferred storage location for the first voice data based at least partly on the usage data, wherein the preferred storage location corresponds to one of a local storage location or a remote storage location, and wherein the preferred storage location is different than the current storage location;
  
  accessing first voice data at the preferred storage location; and
  
  generating a subsequent text-to-speech presentation using the first voice data accessed at the preferred storage location.
- Dependent claims (44, 45, 46, 47, 48, 49, 50):
  - 44. The non-transitory computer storage medium of claim 43, wherein the usage data relates to at least one of:
    - network latency in accessing the first voice data at the current storage location;
      
      bandwidth of a network connection used to access the first voice data at the current storage location;
      
      an identity of an application that causes generation of a text-to-speech presentation;
      
      text used to generate a text-to-speech presentation;
      
      or frequency with which the first voice data is used to generate text-to-speech presentations.
  - 45. The non-transitory computer storage medium of claim 43, wherein determining the preferred storage location for the first voice data comprises determining that the first voice data is to be stored at the local storage location based at least partly on a latency of a network connection to the remote storage location exceeding a threshold.
  - 46. The non-transitory computer storage medium of claim 43, wherein determining the preferred storage location for the first voice data comprises determining that the first voice data is to be stored at the remote storage location based at least partly on a latency of a network connection to the remote storage location failing to exceed a threshold.
  - 47. The non-transitory computer storage medium of claim 43, wherein generation of at least a first text-to-speech presentation of the one or more text-to-speech presentations using the first voice data comprises concatenating voice recordings of subword units for individual words in a text to be presented audibly, wherein the first voice data comprises the voice recordings.
  - 48. The non-transitory computer storage medium of claim 43, wherein determining the preferred storage location for the first voice data comprises determining that the first voice data is to be stored at the remote storage location based at least partly on usage data indicating that frequency of use of the first voice data falls below a threshold.
  - 49. The non-transitory computer storage medium of claim 43, wherein determining the preferred storage location for the first voice data comprises determining that the first voice data is to be stored at the local storage location based at least partly on usage data indicating that frequency of use of the first voice data exceeds a threshold.
  - 50. The non-transitory computer storage medium of claim 43, wherein generating the subsequent text-to-speech presentation comprises employing a remote text-to-speech system to generate the subsequent text-to-speech presentation.

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
IVONA Software Sp zoo (Amazon.com, Inc.)
Inventors
Kaszczuk, Michal T., Osowski, Lukasz M.

Granted Patent

US 9,595,255 B2
US Class Current

1/1
CPC Class Codes

G10L 13/02 Methods for producing synth...

G10L 13/04 Details of speech synthesis...
