Automated text to speech voice development

US 9,196,240 B2
Filed: 12/19/2012
Issued: 11/24/2015
Est. Priority Date: 10/26/2012
Status: Active Grant


Abstract

A group of users may be presented with text and a synthesized speech recording of the text. The users can listen to the synthesized speech recording and submit feedback regarding errors or other issues with the synthesized speech. A system of one or more computing devices can analyze the feedback, modify the voice or language rules, and recursively test the modifications. The modifications may be determined through the use of machine learning algorithms or other automated processes.
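
The feedback loop described in the abstract can be illustrated with a minimal Python sketch. It assumes feedback arrives as (listener, segment, issue) records and that a segment is dropped once at least two distinct listeners flag it. The `SegmentFlag` type, the function names, the segment identifiers, and the threshold are all hypothetical and do not come from the patent.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentFlag:
    """One listener's report about one speech segment (hypothetical record shape)."""
    listener_id: str
    segment_id: str
    issue: str  # e.g. "mispronunciation"


def segments_to_exclude(flags, min_listeners=2):
    """Return ids of segments flagged by at least `min_listeners` distinct listeners."""
    # Deduplicate so that one listener flagging a segment twice counts once.
    distinct = {(f.listener_id, f.segment_id) for f in flags}
    counts = Counter(segment_id for _, segment_id in distinct)
    return {seg for seg, n in counts.items() if n >= min_listeners}


def revise_inventory(inventory, flags, min_listeners=2):
    """Return a copy of the segment inventory without the excluded segments."""
    bad = segments_to_exclude(flags, min_listeners)
    return {seg_id: rec for seg_id, rec in inventory.items() if seg_id not in bad}
```

The revised inventory could then be used to synthesize the same text again and sent out for another round of listening, which is one way to read "recursively test the modifications".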

31 Claims

1. A system comprising:
- one or more processors;
  
  a computer-readable memory; and
  
  a module comprising executable instructions stored in the computer-readable memory, the module, when executed by the one or more processors, configured to:

  generate an audio representation of a text,
  wherein the audio representation comprises a sequence of speech segments selected from a plurality of speech segments,
  wherein the selection of the sequence of speech segments is based at least in part on a plurality of conversion rules, and
  wherein each speech segment of the sequence of speech segments corresponds to a subword unit of the text;
  
  transmit, to a plurality of client devices, the text and the audio representation;
  
  receive, from a first client device of the plurality of client devices, first feedback data associated with the audio representation;
  
  receive, from a second client device of the plurality of client devices, second feedback data associated with the audio representation; and
  
  use the first feedback data and the second feedback data to modify, at least in part, the plurality of speech segments or the plurality of conversion rules.
- - 2. The system of claim 1, wherein a speech segment of the plurality of speech segments comprises a recording of one of a phoneme, a diphone, or a triphone.
  - 3. The system of claim 1, wherein the plurality of speech segments is modified to exclude a speech segment.
  - 4. The system of claim 1, wherein the module, when executed, is further configured to:
    - generate a notification to the first client device indicating a difference between the first feedback data and the second feedback data; and
      
      receive, from the first client device, third feedback data, wherein the third feedback data is different from the first feedback data.
  - 5. The system of claim 1, wherein the module, when executed, is further configured to:
    - transmit, to the plurality of client devices, a control text and a corresponding control recording of a human reading the control text;
      
      receive, from the first client device:
      
      a first quality score of the audio representation; and
      
      a second quality score of the control recording; and
      
      use the first quality score and the second quality score to modify, at least in part, the plurality of speech segments or the plurality of conversion rules.
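
The generation step in claim 1, together with the exclusion in claim 3, can be sketched as follows. This is a minimal illustration only: it assumes the conversion rules are a plain word-to-phoneme lookup and that the first non-excluded recording is chosen for each subword unit. The names `text_to_units` and `select_segments`, the rule table, and the segment ids are hypothetical. A production unit-selection synthesizer would typically rank candidates with cost functions rather than take the first one.

```python
def text_to_units(text, rules):
    """Apply conversion rules (here: a word -> phoneme list lookup) to a text."""
    units = []
    for word in text.lower().split():
        units.extend(rules[word])  # raises KeyError for words without a rule
    return units


def select_segments(units, inventory, excluded=frozenset()):
    """Pick one recorded segment per subword unit, skipping excluded segments."""
    sequence = []
    for unit in units:
        candidates = [s for s in inventory.get(unit, []) if s not in excluded]
        if not candidates:
            raise LookupError(f"no usable segment for unit {unit!r}")
        sequence.append(candidates[0])  # simplification: first candidate wins
    return sequence
```

For example, with `rules = {"cat": ["k", "ae", "t"]}` and two recordings available for `"ae"`, excluding the first recording makes the selector fall back to the second one while the rest of the sequence is unchanged.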

6. A computer-implemented method comprising:
- under control of one or more computing devices configured with specific computer-executable instructions,
  generating an audio representation of a text,
  wherein the text comprises a word,
  wherein the audio representation comprises a sequence of speech segments of a plurality of speech segments, and
  wherein selection of the sequence of speech segments is based at least in part on a plurality of conversion rules;
  
  transmitting the audio representation and the text to a first client device and a second client device of a plurality of client devices;
  
  receiving first feedback data from the first client device, the first feedback data relating to the audio representation;
  
  receiving second feedback data from the second client device, the second feedback data relating to the audio representation; and
  
  determining, based at least in part on the first feedback data and the second feedback data, whether to modify at least one of (i) the plurality of speech segments or (ii) the plurality of conversion rules.
- - 7. The computer-implemented method of claim 6, wherein the plurality of conversion rules comprises rules for determining pronunciation, accentuation, or prosody.
  - 8. The computer-implemented method of claim 6, further comprising:
    - modifying the plurality of speech segments.
  - 9. The computer-implemented method of claim 6, further comprising:
    - modifying the plurality of conversion rules.
  - 10. The computer-implemented method of claim 8, wherein modifying the plurality of speech segments comprises excluding one of the plurality of speech segments.
  - 11. The computer-implemented method of claim 9, wherein modifying the plurality of conversion rules comprises adding a new conversion rule to the plurality of conversion rules.
  - 12. The computer-implemented method of claim 6, further comprising:
    - generating a second audio representation of the text comprising a second sequence of speech segments of the plurality of speech segments, the second sequence based at least in part on the plurality of conversion rules; and
      
      transmitting the second audio representation and the text to a third client device of the plurality of client devices.
  - 13. The computer-implemented method of claim 12, wherein the third client device comprises one of the first client device or the second client device.
  - 14. The computer-implemented method of claim 6, wherein a speech segment of the plurality of speech segments comprises a recording of one of a phoneme, a diphone, or a triphone.
  - 15. The computer-implemented method of claim 6, wherein the text is selected from a plurality of texts associated with a common characteristic.
  - 16. The computer-implemented method of claim 15, wherein the common characteristic comprises one of a language, vocabulary, or subject matter.
  - 17. The computer-implemented method of claim 6, wherein the first feedback data comprises one of an incorrect homograph disambiguation, a mispronunciation, a prosody issue, a text-expansion issue, a discontinuity, or an inaudibility.
  - 18. The computer-implemented method of claim 6, wherein the determining comprises determining whether the first feedback data is substantially equivalent to the second feedback data.
  - 19. The computer-implemented method of claim 6, further comprising generating a notification to the first client device comprising an indication of a difference between the first feedback data and the second feedback data.
  - 20. The computer-implemented method of claim 6, further comprising:
    - transmitting, to the first client device, a control text and a control recording of a human reading the control text;
      
      receiving, from the first client device:
      
      a first quality of the audio representation; and
      
      a second quality score of the control recording; and
      
      using the first quality score and the second quality score to modify at least one of (i) the plurality of speech segments or (ii) the plurality of conversion rules.
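
Claims 18 and 19 describe comparing two listeners' feedback and notifying the first listener of a difference. A minimal sketch, under the assumption that feedback is a (word position, issue label) pair and that "substantially equivalent" means both fields match exactly; the `Feedback` type, the `decide` function, and the notification wording are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """Hypothetical feedback record: which word was flagged, and why."""
    word_index: int
    issue: str  # e.g. "mispronunciation", "prosody issue"


def substantially_equivalent(a, b):
    """Assumed equivalence test: same word position and same issue label."""
    return a.word_index == b.word_index and a.issue == b.issue


def decide(first, second):
    """Return (modify, notification) for two listeners' feedback.

    When the reports agree, a modification is warranted and no notification
    is needed. When they differ, no modification is made yet and a message
    for the first listener is returned instead.
    """
    if substantially_equivalent(first, second):
        return True, None
    note = (
        f"Another listener reported {second.issue!r} at word {second.word_index}; "
        f"you reported {first.issue!r} at word {first.word_index}."
    )
    return False, note
```

A looser equivalence test, for instance one that tolerates neighbouring word positions, would slot into `substantially_equivalent` without changing `decide`.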

21. A system comprising:
- one or more processors;
  
  a computer-readable memory; and
  
  a module comprising executable instructions stored in the computer-readable memory, the module, when executed by the one or more processors, configured to:

  generate an audio representation of a text,
  wherein the audio representation comprises a sequence of speech segments of a plurality of speech segments, and
  wherein the sequence is based at least in part on a plurality of conversion rules;
  
  transmit the audio representation to a first client device and a second client device of a plurality of client devices;
  
  receive first feedback data from the first client device, wherein the first feedback data relates to the audio representation;
  
  receive second feedback data from the second client device, wherein the second feedback data relates to the audio representation; and
  
  determine whether to modify at least one of (i) the plurality of conversion rules or (ii) the plurality of speech segments based at least in part on at least one of the first feedback data and the second feedback data.
- - 22. The system of claim 21, wherein the plurality of conversion rules comprises rules for determining pronunciation, accentuation, or prosody.
  - 23. The system of claim 21, wherein a speech segment of the plurality of speech segments comprises a recording of one of a phoneme, a diphone, or a triphone.
  - 24. The system of claim 21, wherein the text is selected from a plurality of texts associated with a common characteristic.
  - 25. The system of claim 24, wherein the common characteristic comprises one of a language, a vocabulary, or a subject matter.
  - 26. The system of claim 21, wherein the text comprises a sequence of words, wherein a portion of the audio representation corresponds to a first word of the sequence of words, and wherein the first feedback data indicates a conversion issue associated with the portion of the audio representation.
  - 27. The system of claim 26, wherein the conversion issue comprises one of the following:
    - an incorrect homograph disambiguation;
      
      a mispronunciation;
      
      a prosody issue;
      
      a text-expansion issue;
      
      a discontinuity;
      
      or an inaudibility.
  - 28. The system of claim 21, wherein the first feedback data comprises an indication of a quality of the audio representation.
  - 29. The system of claim 21, wherein the module, when executed by the one or more processors, is further configured to:
    - generate a second audio representation of a second text,
      wherein the second audio representation comprises a second sequence of speech segments of the plurality of speech segments, and
      wherein the second sequence is based at least in part on the plurality of conversion rules;
      
      transmit the second audio representation to the first client device;
      
      receive third feedback data from the first client device, wherein the third feedback data relates to the second audio representation; and
      
      determine whether to modify at least one of (i) the plurality of conversion rules or (ii) the plurality of speech segments based at least in part on the third feedback data.
  - 30. The system of claim 21, wherein the module, when executed by the one or more processors, is further configured to:
    - transmit the first audio representation to a third client device of the plurality of client devices;
      
      receive third feedback data from the third client device, wherein the third feedback data relates to the first audio representation;
      
      determine whether to modify at least one of (i) the plurality of conversion rules or (ii) the plurality of speech segments based at least in part on the third feedback data.
  - 31. The system of claim 21, wherein the module, when executed, is further configured to:
    - transmit a control recording comprising a recording of a human reading a control text to the first client device;
      
      receive, from the first client device:
      
      a first quality score of the audio representation; and
      
      a second quality score of the control recording; and
      
      use the first quality score and the second quality score to modify at least one of (i) the plurality of conversion rules or (ii) the plurality of speech segments.
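
Claims 5, 20, and 31 use a quality score for the synthesized audio together with a quality score for a human control recording, but the claims do not say how the two are combined. One possible reading, sketched here purely as an assumption, normalizes each listener's synthesized-speech score by that listener's control score and triggers a modification when the mean ratio falls below a threshold. The function names and the 0.8 threshold are hypothetical.

```python
from statistics import mean


def relative_quality(score_pairs):
    """Mean ratio of synthesized-speech score to control-recording score.

    `score_pairs` holds one (synth_score, control_score) tuple per listener.
    Pairs with a non-positive control score are skipped to avoid dividing by zero.
    """
    ratios = [synth / control for synth, control in score_pairs if control > 0]
    if not ratios:
        raise ValueError("no usable score pairs")
    return mean(ratios)


def should_modify(score_pairs, threshold=0.8):
    """Assumed decision rule: modify when synthesized speech trails the control."""
    return relative_quality(score_pairs) < threshold
```

Dividing by the control score is meant to cancel out listeners who rate everything harshly or generously, so that only the gap between synthesized and human speech drives the decision.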


Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: IVONA Software Sp zoo (Amazon.com, Inc.)
Inventors: Kaszczuk, Michal T.; Osowski, Lukasz M.
Primary Examiner(s): GUERRA-ERAZO, EDGAR X
Application Number: US13/720,925
Publication Number: US 20140122081A1
Time in Patent Office: 1,070 Days
Field of Search: 704/258, 704/260, 704/261, 704/270, 704/270.1, 704/275, 704/277
US Class Current: 1/1
CPC Class Codes:
G10L 13/033 Voice editing, e.g. manipul...
G10L 13/08 Text analysis or generation...
