Synthesizing an aggregate voice

US 9,613,616 B2
Filed: 05/31/2016
Issued: 04/04/2017
Est. Priority Date: 09/30/2014
Status: Expired due to Fees


1 Assignment

0 Petitions

Abstract

A system and computer-implemented method for synthesizing multi-person speech into an aggregate voice is disclosed. The method may include crowd-sourcing a data message configured to include a textual passage. The method may include collecting, from a plurality of speakers, a set of vocal data for the textual passage. The method may also include mapping a source voice profile to a subset of the set of vocal data to synthesize the aggregate voice.
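
The collecting step can be pictured as grouping recordings by which portions of the passage they cover; claim 1 below distinguishes a first set of enunciation data for a first portion, a second set for a second portion, and a third set covering both. The Python sketch that follows assumes, purely for illustration, that each sentence of the passage is one portion. `Recording`, `split_passage`, and `group_by_coverage` are hypothetical names, not anything the patent defines.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    """One speaker's reading of one or more passage portions (hypothetical type)."""
    speaker: str
    portions: frozenset[int]  # indices of the passage portions this recording covers


def split_passage(passage: str) -> list[str]:
    """Hypothetical portioning rule: each sentence of the passage is one portion."""
    return [s.strip() + "." for s in passage.split(".") if s.strip()]


def group_by_coverage(recordings: list[Recording]) -> dict[frozenset[int], list[Recording]]:
    """Group recordings into enunciation-data sets keyed by the portions they cover."""
    groups: dict[frozenset[int], list[Recording]] = defaultdict(list)
    for rec in recordings:
        groups[rec.portions].append(rec)
    return dict(groups)


passage = "The quick fox ran. It jumped the fence."
portions = split_passage(passage)
recordings = [
    Recording("alice", frozenset({0})),     # first portion only
    Recording("bob", frozenset({1})),       # second portion only
    Recording("carol", frozenset({0, 1})),  # both portions
]
sets = group_by_coverage(recordings)
```

Keying each set by a `frozenset` of portion indices means a speaker who reads portions 1 and 0 lands in the same set as one who reads portions 0 and 1.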

15 Claims

1. A computer implemented method for synthesizing multi-person speech into an aggregate voice, the method comprising:
- crowd-sourcing a data message configured to include a textual passage;
  
  collecting, from a plurality of speakers, a set of vocal data for the textual passage, wherein the set of vocal data includes a first set of enunciation data corresponding to a first portion of the textual passage, a second set of enunciation data corresponding to a second portion of the textual passage, and a third set of enunciation data corresponding to both the first and second portions of the textual passage;
  
  mapping a source voice profile to a subset of the set of vocal data to synthesize the aggregate voice;
  
  wherein mapping the source voice profile includes:
  
  extracting phonological data from the set of vocal data, wherein the phonological data includes pronunciation tags, intonation tags, and syllable rates;
  
  converting, based on the phonological data including pronunciation tags, intonation tags and syllable rates, the set of vocal data into a set of phoneme strings; and
  
  applying, to the set of phoneme strings, the source voice profile;
  
  assigning, based on evaluating the phonological data from the set of vocal data, a first quality score to the first set of enunciation data; and
  
  transmitting, in response to determining that the first quality score is greater than a first quality threshold, bonus credits to a first speaker of the first set of enunciation data.
  - 2. The method of claim 1, wherein the source voice profile includes a predetermined set of phonological and prosodic characteristics corresponding to a voice of a first individual.
  - 3. The method of claim 2, wherein the phonological and prosodic characteristics include rhythm, stress, tone, and intonation.
  - 4. The method of claim 1, further comprising:
    - detecting, by an incentive system, a transition phase of an entertainment content sequence;
      
      presenting, during the transition phase of the entertainment content sequence, a speech sample collection module configured to record enunciation data for the textual passage; and
      
      advancing, in response to recording enunciation data for the textual passage, the entertainment content sequence.
  - 5. The method of claim 1, wherein transmitting bonus credits is in further response to determining the first set of enunciation data has a usage above a usage threshold.
  - 6. The method of claim 1, wherein collecting a set of vocal data further comprises:
    - prompting a respective speaker of the plurality of speakers to read the first portion of the textual passage; and
      
      recording the respective speaker reading the first portion of the textual passage.
  - 7. The method of claim 6, wherein collecting a set of vocal data further comprises:
    - determining, based on the first set of enunciation data, that the first portion of the textual passage needs to be recorded again; and
      
      indicating to the respective user that the first portion of the textual passage needs to be recorded again.
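
Claims 1, 5, and 7 together describe a scoring-and-reward loop. The claims do not say how the phonological data is evaluated, so the sketch below assumes a hypothetical score: the share of extracted pronunciation tags that match an expected pronunciation, halved when the syllable rate falls outside an assumed range. Bonus credits follow claim 1 (score strictly greater than the quality threshold) and, optionally, claim 5 (usage strictly above a usage threshold). The re-record rule for claim 7 (score below a second threshold) is likewise an assumption. All names, thresholds, and the credit amount are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EnunciationData:
    """A speaker's recording of one passage portion (hypothetical fields)."""
    speaker: str
    pronunciation_tags: list[str]  # phoneme-like tags extracted from the audio
    syllable_rate: float           # syllables per second
    usage: int = 0                 # times the data has been used in synthesis


def quality_score(data: EnunciationData, expected_tags: list[str],
                  rate_range: tuple[float, float] = (3.0, 6.0)) -> float:
    """Hypothetical score in [0, 1]: share of tags matching the expected
    pronunciation, halved when the syllable rate falls outside rate_range."""
    if not expected_tags:
        return 0.0
    matches = sum(a == b for a, b in zip(data.pronunciation_tags, expected_tags))
    score = matches / len(expected_tags)
    low, high = rate_range
    if not low <= data.syllable_rate <= high:
        score *= 0.5
    return score


def settle(data: EnunciationData, expected_tags: list[str],
           quality_threshold: float = 0.8,
           usage_threshold: Optional[int] = None,
           rerecord_threshold: float = 0.5,
           bonus: int = 10) -> tuple[int, bool]:
    """Return (bonus credits to transmit, whether to request a re-recording)."""
    score = quality_score(data, expected_tags)
    eligible = score > quality_threshold                      # claim 1: strictly greater
    if usage_threshold is not None:
        eligible = eligible and data.usage > usage_threshold  # claim 5: usage gate
    credits = bonus if eligible else 0
    return credits, score < rerecord_threshold                # claim 7: assumed rule
```

Keeping the usage gate optional mirrors the claim structure: claim 1 alone rewards on quality, and claim 5 adds usage as a further condition.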

8. A system for synthesizing multi-person speech into an aggregate voice, the system comprising:
- a crowd-sourcing module configured to crowd-source a data message including a textual passage;
  
  a collecting module configured to collect, from a plurality of speakers, a set of vocal data for the textual passage, wherein the set of vocal data includes a first set of enunciation data corresponding to a first portion of the textual passage, a second set of enunciation data corresponding to a second portion of the textual passage, and a third set of enunciation data corresponding to both the first and second portions of the textual passage;
  
  a mapping module configured to map a source voice profile to a subset of the set of vocal data to synthesize the aggregate voice, wherein mapping the source voice profile to a subset of the set of vocal data to synthesize the aggregate voice includes:
  
  an extracting module configured to extract phonological data from the set of vocal data, wherein the phonological data includes pronunciation tags, intonation tags, and syllable rates;
  
  a converting module configured to convert, based on the phonological data including pronunciation tags, intonation tags and syllable rates, the set of vocal data into a set of phoneme strings; and
  
  an applying module configured to apply, to the set of phoneme strings, the source voice profile;
  
  an assigning module configured to assign, based on evaluating the phonological data from the set of vocal data, a first quality score to the first set of enunciation data; and
  
  a transmitting module configured to transmit, in response to determining that the first quality score is greater than a first quality threshold, bonus credits to a first speaker of the first set of enunciation data.
  - 9. The system of claim 8, wherein the source voice profile includes a predetermined set of phonological and prosodic characteristics corresponding to a voice of a first individual.
  - 10. The system of claim 9, wherein the phonological and prosodic characteristics include rhythm, stress, tone, and intonation.
  - 11. The system of claim 8, further comprising:
    - a detecting module configured to detect, using an incentive system, a transition phase of an entertainment content sequence;
      
      a presenting module configured to present, during the transition phase of the entertainment content sequence, a speech sample collection module configured to record enunciation data for the textual passage; and
      
      an advancing module configured to advance, in response to recording enunciation data for the textual passage, the entertainment content sequence.
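
One way to read the converting and applying modules of claim 8, together with the rhythm, stress, tone, and intonation characteristics of claim 10, is sketched below. It assumes ARPAbet-style pronunciation tags (vowels carry a stress digit, with `1` for primary stress), a three-level intonation tag set (`L`, `M`, `H`), and a `VoiceProfile` whose four fields stand in for the claimed characteristics. `Phoneme`, `VoiceProfile`, `to_phoneme_string`, and `apply_profile` are hypothetical names, and spreading the utterance time evenly across phonemes is a simplification, not the patent's method.

```python
from dataclasses import dataclass, replace

# Hypothetical intonation tag set mapped to relative pitch offsets.
INTONATION_OFFSETS = {"L": -1.0, "M": 0.0, "H": 1.0}


@dataclass(frozen=True)
class Phoneme:
    symbol: str        # ARPAbet-style; a trailing digit marks a vowel's stress level
    duration: float    # seconds
    pitch: float       # relative offset before a profile is applied, Hz afterwards
    gain: float = 1.0  # relative loudness


@dataclass(frozen=True)
class VoiceProfile:
    """Hypothetical source voice profile covering rhythm, stress, tone, intonation."""
    rhythm: float         # duration multiplier (> 1 slows the speech down)
    stress_gain: float    # loudness multiplier for primary-stressed vowels
    base_pitch_hz: float  # tone: the individual's baseline pitch
    intonation_hz: float  # pitch swing per unit of intonation offset


def to_phoneme_string(pron_tags: list[str], inton_tags: list[str],
                      syllable_rate: float) -> list[Phoneme]:
    """Convert tags to phonemes, spreading the utterance time implied by the
    syllable rate evenly over the phonemes (a simplifying assumption)."""
    if not pron_tags:
        return []
    syllables = sum(tag[-1].isdigit() for tag in pron_tags) or 1  # vowels carry stress digits
    per_phoneme = syllables / syllable_rate / len(pron_tags)
    return [Phoneme(p, per_phoneme, INTONATION_OFFSETS[i])
            for p, i in zip(pron_tags, inton_tags)]


def apply_profile(phonemes: list[Phoneme], profile: VoiceProfile) -> list[Phoneme]:
    """Re-time, re-pitch, and re-stress each phoneme according to the profile."""
    return [
        replace(
            ph,
            duration=ph.duration * profile.rhythm,
            pitch=profile.base_pitch_hz + ph.pitch * profile.intonation_hz,
            gain=ph.gain * (profile.stress_gain if ph.symbol.endswith("1") else 1.0),
        )
        for ph in phonemes
    ]
```

Keeping `Phoneme` frozen and producing new instances with `dataclasses.replace` leaves the speaker-derived phoneme string intact, so the same string can be rendered with several different source voice profiles.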

12. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable storage medium does not comprise a transitory signal per se, wherein the computer readable program, when executed on a first computing device, causes the first computing device to:
- crowd-source a data message configured to include a textual passage;
  
  collect, from a plurality of speakers, a set of vocal data for the textual passage, wherein the set of vocal data includes a first set of enunciation data corresponding to a first portion of the textual passage, a second set of enunciation data corresponding to a second portion of the textual passage, and a third set of enunciation data corresponding to both the first and second portions of the textual passage;
  
  map a source voice profile to a subset of the set of vocal data to synthesize the aggregate voice;
  
  extract phonological data from the set of vocal data, wherein the phonological data includes pronunciation tags, intonation tags, and syllable rates;
  
  convert, based on the phonological data including pronunciation tags, intonation tags and syllable rates, the set of vocal data into a set of phoneme strings;
  
  apply, to the set of phoneme strings, the source voice profile;
  
  assign, based on evaluating the phonological data from the set of vocal data, a first quality score to the first set of enunciation data; and
  
  transmit, in response to determining that the first quality score is greater than a first quality threshold, bonus credits to a first speaker of the first set of enunciation data.
  - 13. The computer program product of claim 12, wherein the source voice profile includes a predetermined set of phonological and prosodic characteristics corresponding to a voice of a first individual.
  - 14. The computer program product of claim 13, wherein the phonological and prosodic characteristics include rhythm, stress, tone, and intonation.
  - 15. The computer program product of claim 12, further comprising computer readable program code configured to:
    - detect, by an incentive system, a transition phase of an entertainment content sequence;
      
      present, during the transition phase of the entertainment content sequence, a speech sample collection module configured to record enunciation data for the textual passage; and
      
      advance, in response to recording enunciation data for the textual passage, the entertainment content sequence.
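
Claims 4, 11, and 15 gate progress through entertainment content on speech collection: a transition phase is detected, a recording prompt is shown, and the content advances once enunciation data is recorded. The sketch below models that as a small state machine. `ContentSequence`, `run_transition`, and the `record` callback (standing in for the speech sample collection module) are hypothetical, and treating a level boundary in a game as the transition phase is only one possible reading.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ContentSequence:
    """Hypothetical entertainment content: ordered segments such as game levels."""
    segments: list[str]
    position: int = 0
    in_transition: bool = False

    def finish_segment(self) -> None:
        # Finishing a segment opens a transition phase, unless it was the last one.
        self.in_transition = self.position < len(self.segments) - 1

    def advance(self) -> None:
        self.position += 1
        self.in_transition = False


def run_transition(seq: ContentSequence, passage_portion: str,
                   record: Callable[[str], Optional[bytes]],
                   collected: list[bytes]) -> bool:
    """Detect a transition, prompt for one reading, and advance only once
    enunciation data has been recorded. Returns True if the sequence advanced."""
    if not seq.in_transition:
        return False
    audio = record(passage_portion)  # stands in for the speech sample collection module
    if audio is None:                # nothing recorded: the sequence stays paused
        return False
    collected.append(audio)
    seq.advance()
    return True
```

Passing `record` in as a callback keeps the sketch independent of any particular audio API; a callback that returns `None` models a player who skips or fails the prompt, and the content simply stays paused.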

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
de Freitas, Jose A. G., Hindle, Guy P., Taylor, James S.
Primary Examiner(s)
SINGH, SATWANT K

Application Number

US15/168,599
Publication Number

US 20160275935A1
Time in Patent Office

308 Days
Field of Search

None
CPC Class Codes

G10L 13/00   Speech synthesis; Text to s...

G10L 13/027   Concept to speech synthesis...

G10L 13/033   Voice editing, e.g. manipul...

G10L 13/10   Prosody rules derived from ...
