Automatic conversion of speech into song, rap or other audible expression having target meter or rhythm

US 9,324,330 B2
Filed: 03/29/2013
Issued: 04/26/2016
Est. Priority Date: 03/29/2012
Status: Active Grant

Abstract

Captured vocals may be automatically transformed using advanced digital signal processing techniques that provide captivating applications, and even purpose-built devices, in which mere novice user-musicians may generate, audibly render and share musical performances. In some cases, the automated transformations allow spoken vocals to be segmented, arranged, temporally aligned with a target rhythm, meter or accompanying backing tracks and pitch corrected in accord with a score or note sequence. Speech-to-song music applications are one such example. In some cases, spoken vocals may be transformed in accord with musical genres such as rap using automated segmentation and temporal alignment techniques, often without pitch correction. Such applications, which may employ different signal processing and different automated transformations, may nonetheless be understood as speech-to-rap variations on the theme.
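
The segmentation step mentioned in the abstract is described in claim 5 as a spectral difference function followed by peak picking and agglomeration. The following is a minimal sketch of that idea, not the patented implementation: it assumes a plain magnitude spectrum (rather than the psychoacoustically based representation of claim 6), a fixed threshold, and a frame-gap rule for merging. All function names, frame sizes and thresholds are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for short illustrative frames)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_difference(samples, frame_size=64, hop=32):
    """One value per frame: summed positive change in magnitude versus the previous frame."""
    sdf, prev = [], None
    for start in range(0, len(samples) - frame_size + 1, hop):
        mags = magnitude_spectrum(samples[start:start + frame_size])
        if prev is None:
            sdf.append(0.0)
        else:
            sdf.append(sum(max(m - p, 0.0) for m, p in zip(mags, prev)))
        prev = mags
    return sdf


def pick_onsets(sdf, threshold):
    """Frame indices of local maxima that exceed an assumed fixed threshold."""
    return [
        i for i in range(1, len(sdf) - 1)
        if sdf[i] > threshold and sdf[i] >= sdf[i - 1] and sdf[i] > sdf[i + 1]
    ]


def agglomerate(onsets, sdf, min_gap):
    """Merge onset candidates closer than min_gap frames, keeping the stronger one."""
    kept = []
    for i in onsets:
        if kept and i - kept[-1] < min_gap:
            if sdf[i] > sdf[kept[-1]]:
                kept[-1] = i
        else:
            kept.append(i)
    return kept
```

Raising `min_gap` until the number of surviving onsets falls within a desired range would be one simple way to approximate the iteration described in claim 8.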

28 Claims

1. A computational method for transforming an input audio encoding of speech into an output that is rhythmically consistent with a target song, the method comprising:
- segmenting the input audio encoding of the speech into plural segments, the segments corresponding to successive sequences of samples of the audio encoding and delimited by onsets identified therein;
  
  mapping individual ones of the plural segments to respective sub-phrase portions of a phrase template for the target song, the mapping establishing one or more phrase candidates;
  
  temporally aligning at least one of the phrase candidates with a rhythmic skeleton for the target song; and
  
  preparing a resultant audio encoding of the speech in correspondence with the temporally aligned phrase candidate-mapped from onset-delimited segments of the input audio encoding.
- - 2. The computational method of claim 1, further comprising:
    - mixing the resultant audio encoding with an audio encoding of a backing track for the target song; and
      
      audibly rendering the mixed audio.
  - 3. The computational method of claim 1, further comprising:
    - from a microphone input of a portable handheld device, capturing speech voiced by a user thereof as the input audio encoding; and
      
      responsive to a selection of the target song by the user, retrieving a computer readable encoding of at least one of the phrase template and the rhythmic skeleton.
  - 4. The computational method of claim 3, wherein the retrieving responsive to user selection includes obtaining, from a remote store and via a communication interface of the portable handheld device, at least the phrase template.
  - 5. The computational method of claim 1, wherein the segmenting includes:
    - applying a spectral difference type (SDF-type) function to the audio encoding of the speech and picking temporally indexed peaks in a result thereof as onset candidates within the speech encoding; and
      
      agglomerating adjacent onset candidate-delimited sub-portions of the speech encoding into segments based, at least in part, on comparative strength of onset candidates.
  - 6. The computational method of claim 5, wherein the SDF-type function operates on a psychoacoustically-based representation of power spectrum for the speech encoding.
  - 7. The computational method of claim 5, wherein the agglomerating is performed, at least in part, based on a minimum segment length threshold.
  - 8. The computational method of claim 5, further comprising:
    - iterating on the agglomerating to achieve a total number of segments within a target range.
  - 9. The computational method of claim 1, wherein the mapping includes:
    - enumerating a set of onset-delimited, N-part, partitionings of the speech encoding based on groupings of adjacent ones of the segments, wherein N corresponds to the number of sub-phrase portions of the phrase template;
      
      for each of the partitionings, constructing a corresponding mapping of the speech encoding segment groupings to sub-phrase portions, the mappings providing plural of the phrase candidates.
  - 10. The computational method of claim 1, wherein the mapping provides plural phrase candidates;
    - wherein the temporal aligning is performed for each of the plural phrase candidates; and
      
      further comprising selecting from amongst the plural phrase candidates based upon degree of rhythmic alignment with the rhythmic skeleton for the target song.
  - 11. The computational method of claim 1, wherein the rhythmic skeleton corresponds to a pulse train encoding of tempo of the target song.
  - 12. The computational method of claim 11, wherein the target song includes plural constituent rhythms, and wherein the pulse train encoding includes respective pulses scaled in accord with relative strengths of the constituent rhythms.
  - 13. The computational method of claim 1, further comprising:
    - performing beat detection for a backing track of the target song to produce the rhythmic skeleton.
  - 14. The computational method of claim 1, further comprising:
    - pitch shifting the resultant audio encoding in accord with a note sequence for the target song.
  - 15. The computational method of claim 14, wherein the pitch shifting employs cross synthesis of a glottal pulse.
  - 16. The computational method of claim 15, wherein the cross synthesis uses a glottal pulse as source excitation and spectrum of the input speech as target spectrum.
  - 17. The computational method of claim 14, further comprising:
    - retrieving a computer readable encoding of the note sequence.
  - 18. The computational method of claim 17, wherein the retrieving is responsive to user selection at a user interface of a portable handheld device and obtains at least the phrase template and the note sequence for the target song from a remote store via a communication interface of the portable handheld device.
  - 19. The computational method of claim 1, further comprising:
    - mapping onsets of notes for the target song to temporally-proximate, segment delimiting onsets in the speech encoding; and
      
      for respective portions of the speech encoding that correspond to the mapped note onsets, temporally stretching or compressing the respective portion to fill duration of the mapped note.
  - 20. The computational method of claim 19, further comprising:
    - characterizing frames of the speech encoding based, at least in part, on spectral roll-off, wherein generally greater roll-off of high frequency content is indicative of voiced vowels; and
      
      dynamically varying magnitude of the temporal stretching applied to a respective portion of the speech encoding based on the characterized vowel-indicative spectral roll-off for the corresponding frame.
  - 21. The computational method of claim 20, wherein the dynamic varying employs a composition of a melodic density vector for the target song and a spectral roll-off vector for the speech encoding.
  - 22. The computational method of claim 1, performed on a portable computing device selected from the group of:
    - a computing pad;
      
      a personal digital assistant or book reader; and
      
      a mobile phone or media player.
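
Claims 9 through 12 describe enumerating N-part partitionings of adjacent segments, mapping each to the sub-phrase portions of a phrase template, and choosing among the resulting candidates by their rhythmic alignment with a pulse-train skeleton. The sketch below illustrates one possible reading. The scoring rule (sum the weights of pulses that have a segment onset within a tolerance) and the placement rule (each group starts at its sub-phrase slot) are assumptions made for illustration and are not taken from the claims; all names are hypothetical.

```python
from itertools import combinations


def enumerate_partitions(num_segments, n_parts):
    """Yield every split of segments 0..num_segments-1 into n_parts contiguous, non-empty groups."""
    for cuts in combinations(range(1, num_segments), n_parts - 1):
        bounds = (0, *cuts, num_segments)
        yield [list(range(bounds[i], bounds[i + 1])) for i in range(n_parts)]


def alignment_score(onset_times, pulses, tolerance=0.05):
    """Sum the weights of pulses that have an onset within `tolerance` seconds (assumed rule)."""
    return sum(
        weight for t, weight in pulses
        if any(abs(t - o) <= tolerance for o in onset_times)
    )


def best_candidate(segment_durations, slot_starts, pulses):
    """Place each group at the start of its sub-phrase slot and keep the best-scoring candidate."""
    best, best_score = None, float("-inf")
    for groups in enumerate_partitions(len(segment_durations), len(slot_starts)):
        onsets = []
        for group, start in zip(groups, slot_starts):
            t = start
            for seg in group:
                onsets.append(t)  # each segment begins where the previous one ended
                t += segment_durations[seg]
        score = alignment_score(onsets, pulses)
        if score > best_score:
            best, best_score = groups, score
    return best, best_score


# Four half-second segments, a two-part phrase template, and a weighted pulse train
# given as (time in seconds, relative strength) pairs.
pulses = [(0.0, 1.0), (0.5, 0.5), (1.0, 1.0), (2.0, 1.0), (2.5, 0.5)]
groups, score = best_candidate([0.5, 0.5, 0.5, 0.5], [0.0, 2.0], pulses)
print(groups, score)
```

For S segments and N sub-phrase portions there are C(S-1, N-1) contiguous partitionings, so exhaustive enumeration stays cheap only while the segment count is modest, which is consistent with agglomerating segments toward a target range first.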

23. An apparatus comprising:
- a portable computing device; and
  
  machine readable code embodied in a non-transitory medium and executable on the portable computing device to transform an input audio encoding of speech into an output that is rhythmically consistent with a target song, the machine readable code including instructions executable to segment the input audio encoding of the speech into plural segments, the segments corresponding to successive sequences of samples of the audio encoding and delimited by onsets identified therein;
  
  the machine readable code further executable to map individual ones of the plural segments to respective sub-phrase portions of a phrase template for the target song, the mapping establishing one or more phrase candidates;
  
  the machine readable code further executable to temporally align at least one of the phrase candidates with a rhythmic skeleton for the target song; and
  
  the machine readable code further executable to prepare a resultant audio encoding of the speech in correspondence with the temporally aligned phrase candidate-mapped from onset-delimited segments of the input audio encoding.
- - 24. The apparatus of claim 23, embodied as one or more of a computing pad, a handheld mobile device, a mobile phone, a personal digital assistant, a smart phone, a media player and a book reader.
  - 25. The computer program product of claim 23, wherein the media are readable by the portable computing device or readable incident to a computer program product conveying transmission to the portable computing device.

26. A computer program product encoded in non-transitory media and including instructions executable to transform an input audio encoding of speech into an output that is rhythmically consistent with a target song, the computer program product encoding and comprising:
- instructions executable to segment the input audio encoding of the speech into plural segments, the segments corresponding to successive sequences of samples of the audio encoding and delimited by onsets identified therein;
  
  instructions executable to map individual ones of the plural segments to respective sub-phrase portions of a phrase template for the target song, the mapping establishing a one or more phrase candidates;
  
  instructions executable to temporally align at least one of the phrase candidates with a rhythmic skeleton for the target song; and
  
  instructions executable to prepare a resultant audio encoding of the speech in correspondence with the temporally aligned phrase candidate-mapped from onset delimited segments of the input audio encoding.
- - 27. The computer program product of claim 26, wherein the computer program product is executable on a processor of a portable computing device.
  - 28. The computer program product of claim 27, wherein the one or more media are readable by the portable computing device or readable incident to a computer program product conveying transmission to the portable computing device.
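
Claims 19 and 20 describe stretching or compressing a portion of speech to fill a mapped note, with the amount of stretch varied according to a vowel-indicative spectral roll-off. A rough sketch of that idea follows. It assumes an 85% energy roll-off point, a linear weight derived from the roll-off bin, and uniform scaling when the portion must be compressed; none of these specifics come from the claims, and the melodic density vector of claim 21 is not modeled. All names are hypothetical.

```python
def spectral_rolloff_bin(magnitudes, fraction=0.85):
    """Lowest bin index at which the cumulative energy reaches `fraction` of the total."""
    energies = [m * m for m in magnitudes]
    target = fraction * sum(energies)
    running = 0.0
    for k, e in enumerate(energies):
        running += e
        if running >= target:
            return k
    return len(energies) - 1


def stretch_weights(rolloff_bins, num_bins):
    """Higher weight for frames with a low roll-off bin (little high-frequency energy, vowel-like)."""
    return [1.0 - k / (num_bins - 1) for k in rolloff_bins]


def allocate_durations(frame_len, weights, note_duration):
    """Return a target duration per frame so that the frames together fill the note."""
    natural = frame_len * len(weights)
    extra = note_duration - natural
    total = sum(weights)
    if extra < 0 or total == 0:
        # Assumed fallback: scale every frame uniformly when compressing
        # or when no frame looks vowel-like.
        return [note_duration / len(weights)] * len(weights)
    # Stretching: hand out the extra time in proportion to the vowel weights.
    return [frame_len + extra * w / total for w in weights]


# One vowel-like frame (energy in low bins) and one fricative-like frame (energy in high bins).
bins = [spectral_rolloff_bin([10, 1, 0, 0]), spectral_rolloff_bin([0, 0, 1, 10])]
weights = stretch_weights(bins, num_bins=4)
durations = allocate_durations(0.1, weights, note_duration=0.5)
print(bins, weights, durations)
```

In this toy case all of the added time goes to the vowel-like frame, which mirrors the intuition that sustained vowels tolerate stretching better than consonants.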

Current Assignee
Smule, Inc.
Original Assignee
Smule, Inc.
Inventors
Chordia, Parag; Godfrey, Mark; Rae, Alexander; Gupta, Prerna; Cook, Perry R.
Primary Examiner(s)
Lerner, Martin

Application Number

US13/853,759
Publication Number

US 20140074459A1
Time in Patent Office

1,124 Days
Field of Search

704/236, 704/241, 704/242, 704/261, 704/264, 704/369, 704/269, 704/230, 704/503, 846/09, 846/10, 846/11, 846/34, 846/35, 846/49, 846/50, 846/59, 846/66, 846/67
US Class Current

1/1
CPC Class Codes

G10H 1/366   with means for modifying or...

G10H 2210/051   for extraction or detection...

G10H 2240/141   Library retrieval matching,...

G10H 2250/235   Fourier transform; Discrete...

G10L 19/00   Speech or audio signals ana...

G10L 19/02   using spectral analysis, e....

G10L 21/055   for synchronising with othe...
