Prosody Conversion

US 20080082333A1
Filed: 09/29/2006
Published: 04/03/2008
Est. Priority Date: 09/29/2006
Status: Active Grant

Abstract

A contour for a syllable (or other speech segment) in a voice undergoing conversion is transformed. The transform of that contour is then used to identify one or more source syllable transforms in a codebook. Information regarding the context and/or linguistic features of the contour being converted can also be compared to similar information in the codebook when identifying an appropriate source transform. Once a codebook source transform is selected, an inverse transformation is performed on a corresponding codebook target transform to yield an output contour. The corresponding codebook target transform represents a target voice version of the same syllable represented by the selected codebook source transform. The output contour may be further processed to improve conversion quality.
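The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes a discrete cosine transform (named only in dependent claims 13 and 25), a squared Euclidean distance for matching, and a codebook stored as (source coefficients, target coefficients) pairs. The function names `dct`, `idct`, and `convert_contour` are hypothetical, and the coefficient scaling was chosen here so that a contour can be rebuilt at any length.

```python
import math


def dct(contour, num_coeffs):
    """DCT-II of a contour, scaled so coefficient 0 equals the contour mean.

    The scaling is an assumption of this sketch; it lets idct() rebuild a
    contour at any length without rescaling.
    """
    n = len(contour)
    coeffs = []
    for k in range(num_coeffs):
        total = sum(x * math.cos(math.pi * (t + 0.5) * k / n)
                    for t, x in enumerate(contour))
        coeffs.append(total / n if k == 0 else 2.0 * total / n)
    return coeffs


def idct(coeffs, length):
    """Inverse of dct(): rebuild a contour with `length` samples."""
    return [
        coeffs[0] + sum(c * math.cos(math.pi * (t + 0.5) * k / length)
                        for k, c in enumerate(coeffs) if k > 0)
        for t in range(length)
    ]


def squared_distance(a, b):
    """Squared Euclidean distance between two coefficient lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def convert_contour(contour, codebook, num_coeffs):
    """Transform the contour, pick the closest codebook source transform,
    and inverse-transform the paired target transform.

    `codebook` is a list of (source_coeffs, target_coeffs) pairs
    (a hypothetical layout chosen for this sketch).
    """
    query = dct(contour, num_coeffs)
    _, target_coeffs = min(
        codebook, key=lambda entry: squared_distance(query, entry[0]))
    return idct(target_coeffs, len(contour))
```

Calling `convert_contour` with a slightly perturbed rising contour and a codebook holding one rising and one falling source entry returns the target contour paired with the rising entry. Context and linguistic features, which the abstract also mentions, are left out of this sketch.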

35 Claims

1. A method comprising:
- (a) receiving data for a plurality of segments of a passage in a source voice, wherein the data for each segment of the plurality models a prosodic component of the source voice for that segment;
  
  (b) identifying a target voice entry in a codebook for each of the source voice passage segments, wherein each of the identified target voice entries models a prosodic component of a target voice for a different segment of codebook training material, and wherein the codebook training material is substantially different from the passage; and
  
  (c) generating a target voice version of the plurality of passage segments by altering the modeled source voice prosodic component for each segment to replicate the target voice prosodic component modeled by the target voice entry identified for that segment in (b).
- - 2. The method of claim 1, wherein operation (a) includes receiving data for one or more additional segments of the passage in a source voice, and wherein the method further comprises:
    - (d) generating a target voice version of each of the one or more additional source voice passage segments according to $x_{i} (n) |_{MV} = \frac{x_{i}^{SRC} (n) - μ_{SRC}}{σ_{SRC}} * σ_{TGT} + μ_{TGT}$, wherein $μ_{SRC}$ is a mean of all F0 values for source voice versions of segments in the codebook training material, $σ_{SRC}$ is a standard deviation of all F0 values for source voice versions of segments in the codebook training material, $μ_{TGT}$ is a mean of all F0 values for target voice versions of segments in the codebook training material, $σ_{TGT}$ is a standard deviation of all F0 values for target voice versions of segments in the codebook training material, $x_{i}^{SRC} (n)$ is a value for F0 at time n in an F0 contour for segment i of the additional segments, and $x_{i} (n) |_{MV}$ is a value for F0 at time n in an F0 contour for a target voice version of segment i of the additional segments.
  - 3. The method of claim 1, wherein the codebook includes multiple source voice entries, each of the multiple source voice entries models a prosodic component of the source voice for a different segment of the codebook training material, each of the multiple source voice entries corresponds to a target voice entry modeling a prosodic component of the target voice for the segment of the codebook training material for which the corresponding source voice entry models the prosodic component of the source voice, and operation (b) includes, for each source voice passage segment, identifying a target voice entry by comparing data for the source voice passage segment to one or more of the multiple source voice entries.
  - 4. The method of claim 3, wherein each of the multiple source voice entries and its corresponding target voice entry includes a plurality of transform coefficients representing a contour for the modeled prosodic component, and operation (b) includes, for each source voice passage segment, identifying a target voice entry by comparing transform coefficients representing a contour for the prosodic component of the source voice passage segment to the transform coefficients for one or more of the multiple source voice entries.
  - 5. The method of claim 4, wherein each of the multiple source voice entries is associated with a different feature vector, each of the associated feature vectors includes values of a set of linguistic features for the codebook training speech segment for which the associated source voice entry models the prosodic component of the source voice, data for each of the source voice passage segments includes a feature vector that includes values of the set of linguistic features for that source voice passage segment, and operation (b) includes, for each source voice passage segment, (b1) identifying multiple candidate source voice entries based on the transform coefficient comparisons, and (b2) selecting the identified target voice entry based on a comparison of the feature vector for the source voice passage segment with each of the feature vectors associated with the multiple candidate source voice entries identified in (b1).
  - 6. The method of claim 5, wherein the selecting in operation (b2) is also based on comparison of a duration of the source voice passage segment with durations of each of the candidate source voice entries identified in (b1).
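As a worked illustration of the mean-variance formula recited in claim 2, the sketch below scales a source F0 contour into the target voice's range. It is a minimal example under stated assumptions: the function name `mean_variance_convert` is hypothetical, the training F0 values are passed in as flat lists, and the population standard deviation is used because the claim does not say whether a population or sample estimate is meant.

```python
import statistics


def mean_variance_convert(source_contour, source_training_f0, target_training_f0):
    """Apply x_MV(n) = (x_SRC(n) - mu_SRC) / sigma_SRC * sigma_TGT + mu_TGT.

    The means and standard deviations are taken over all F0 values of the
    source and target versions of the codebook training material.
    Population standard deviation is an assumption of this sketch.
    """
    mu_src = statistics.fmean(source_training_f0)
    sigma_src = statistics.pstdev(source_training_f0)
    mu_tgt = statistics.fmean(target_training_f0)
    sigma_tgt = statistics.pstdev(target_training_f0)
    if sigma_src == 0:
        raise ValueError("source training F0 values have zero variance")
    return [(x - mu_src) / sigma_src * sigma_tgt + mu_tgt
            for x in source_contour]
```

For example, with source training values of 100 and 120 Hz (mean 110, deviation 10) and target training values of 200 and 240 Hz (mean 220, deviation 20), a source sample of 120 Hz maps to (120 - 110) / 10 * 20 + 220 = 240 Hz.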

7. A machine-readable medium having machine-executable instructions for performing a method comprising:
- (a) receiving data for a plurality of segments of a passage in a source voice, wherein the data for each segment of the plurality models a prosodic component of the source voice for that segment;
  
  (b) identifying a target voice entry in a codebook for each of the source voice passage segments, wherein each of the identified target voice entries models a prosodic component of a target voice for a different segment of codebook training material, and wherein the codebook training material is substantially different from the passage; and
  
  (c) generating a target voice version of the plurality of passage segments by altering the modeled source voice prosodic component for each segment to replicate the target voice prosodic component modeled by the target voice entry identified for that segment in (b).
- - 8. The machine-readable medium of claim 7, wherein operation (a) includes receiving data for one or more additional segments of the passage in a source voice, and comprising additional machine-executable instructions for:
    - (d) generating a target voice version of each of the one or more additional source voice passage segments according to $x_{i} (n) |_{MV} = \frac{x_{i}^{SRC} (n) - μ_{SRC}}{σ_{SRC}} * σ_{TGT} + μ_{TGT}$, wherein $μ_{SRC}$ is a mean of all F0 values for source voice versions of segments in the codebook training material, $σ_{SRC}$ is a standard deviation of all F0 values for source voice versions of segments in the codebook training material, $μ_{TGT}$ is a mean of all F0 values for target voice versions of segments in the codebook training material, $σ_{TGT}$ is a standard deviation of all F0 values for target voice versions of segments in the codebook training material, $x_{i}^{SRC} (n)$ is a value for F0 at time n in an F0 contour for segment i of the additional segments, and $x_{i} (n) |_{MV}$ is a value for F0 at time n in an F0 contour for a target voice version of segment i of the additional segments.
  - 9. The machine-readable medium of claim 8, wherein the data for the passage segments in the source voice is generated by a text-to-speech system.
  - 10. The machine-readable medium of claim 7, wherein the modeled prosodic components are pitch contours.
  - 11. The machine-readable medium of claim 7, wherein the codebook includes multiple source voice entries, each of the multiple source voice entries models a prosodic component of the source voice for a different segment of the codebook training material, each of the multiple source voice entries corresponds to a target voice entry modeling a prosodic component of the target voice for the segment of the codebook training material for which the corresponding source voice entry models the prosodic component of the source voice, and operation (b) includes, for each source voice passage segment, identifying a target voice entry by comparing data for the source voice passage segment to one or more of the multiple source voice entries.
  - 12. The machine-readable medium of claim 11, wherein each of the multiple source voice entries and its corresponding target voice entry includes a plurality of transform coefficients representing a contour for the modeled prosodic component, and operation (b) includes, for each source voice passage segment, identifying a target voice entry by comparing transform coefficients representing a contour for the prosodic component of the source voice passage segment to the transform coefficients for one or more of the multiple source voice entries.
  - 13. The machine-readable medium of claim 12, wherein the transform is a discrete cosine transform.
  - 14. The machine-readable medium of claim 12, wherein each of the multiple source voice entries is associated with a different feature vector, each of the associated feature vectors includes values of a set of linguistic features for the codebook training speech segment for which the associated source voice entry models the prosodic component of the source voice, data for each of the source voice passage segments includes a feature vector that includes values of the set of linguistic features for that source voice passage segment, and operation (b) includes, for each source voice passage segment, (b1) identifying multiple candidate source voice entries based on the transform coefficient comparisons, and (b2) selecting the identified target voice entry based on a comparison of the feature vector for the source voice passage segment with each of the feature vectors associated with the multiple candidate source voice entries identified in (b1).
  - 15. The machine-readable medium of claim 14, wherein the selecting in operation (b2) is also based on comparison of a duration of the source voice passage segment with durations of each of the candidate source voice entries identified in (b1).
  - 16. The machine-readable medium of claim 12, wherein operation (c) includes, (c1) performing an inverse transform on the target voice entry identified for one of the source voice passage segments, (c2) adjusting the result of (c1) according to $x_{i}^{TGT} (n) |_{a} = x_{i}^{TGT} (n) + x_{i}^{SRC} (n) - z_{j}^{SRC} (n)$, wherein $x_{i}^{TGT} (n)$ is a value for pitch at time n and is the result of (c1), $x_{i}^{SRC} (n)$ is a value for pitch at time n from a pitch contour for the source voice passage segment for which the inverse transform was performed in (c1), $z_{j}^{SRC} (n)$ is a value for pitch at time n obtained from the inverse transform of the source voice entry corresponding to the identified target voice entry of (c1), and $x_{i}^{TGT} (n) |_{a}$ is an adjusted pitch value at time n.
  - 17. The machine-readable medium of claim 16, wherein operation (c) includes (c3) further adjusting the result of (c2) according to $x_{i}^{TGT} (n) |_{a, μ} = x_{i}^{TGT} (n) |_{a} + x_{i} (n) |_{MV}$, wherein $x_{i} (n) |_{MV} = \frac{x_{i}^{SRC} (n) - μ_{SRC}}{σ_{SRC}} * σ_{TGT} + μ_{TGT}$ and wherein $μ_{SRC}$ is a mean of all F0 values for source voice versions of segments in the codebook training material, $σ_{SRC}$ is a standard deviation of all F0 values for source voice versions of segments in the codebook training material, $μ_{TGT}$ is a mean of all F0 values for target voice versions of segments in the codebook training material, and $σ_{TGT}$ is a standard deviation of all F0 values for target voice versions of segments in the codebook training material.
  - 18. The machine-readable medium of claim 17, wherein operation (c) includes (c4) determining whether a boundary between the source voice passage segment for which the inverse transform was performed in (c1) and an adjacent source voice passage segment is continuous in voicing energy level, and (c5) upon determining in (c4) that the boundary is continuous in voicing energy level, adding a bias value to the result of (c3) to preserve a continuous pitch level.
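The post-processing chain of claims 16 to 18 can be illustrated with the sketch below. It follows the claim text literally and is not a definitive implementation. All function names are hypothetical, the contours are assumed to be equal-length lists of pitch values, and the voicing-energy check of (c4) is reduced to a boolean input. The claims do not say how the bias of (c5) is computed; the sketch assumes it is the offset that makes the segment start at the previous segment's final pitch value.

```python
def adjust_by_residual(x_tgt, x_src, z_src):
    """(c2): x_TGT(n)|a = x_TGT(n) + x_SRC(n) - z_SRC(n).

    x_tgt is the inverse-transformed target entry, x_src the actual source
    contour, and z_src the inverse-transformed codebook source entry.
    """
    return [t + s - z for t, s, z in zip(x_tgt, x_src, z_src)]


def add_mean_variance_term(x_adjusted, x_mv):
    """(c3): x_TGT(n)|a,mu = x_TGT(n)|a + x(n)|MV, as recited in claim 17."""
    return [a + m for a, m in zip(x_adjusted, x_mv)]


def apply_boundary_bias(contour, previous_end_pitch, boundary_is_continuous):
    """(c4)/(c5): shift the contour when the boundary is continuous.

    Assumption of this sketch: the bias is previous_end_pitch minus the
    first value of the contour, so the shifted contour starts where the
    previous segment ended.
    """
    if not boundary_is_continuous or not contour:
        return list(contour)
    bias = previous_end_pitch - contour[0]
    return [value + bias for value in contour]
```

Keeping the three steps as separate functions mirrors the claim structure, where each dependent claim adds one adjustment to the result of the one before it.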

19. A device, comprising:
- one or more processors configured to perform a method, the method including (a) receiving data for a plurality of segments of a passage in a source voice, wherein the data for each segment of the plurality models a prosodic component of the source voice for that segment, (b) identifying a target voice entry in a codebook for each of the source voice passage segments, wherein each of the identified target voice entries models a prosodic component of a target voice for a different segment of codebook training material, and wherein the codebook training material is substantially different from the passage, and (c) generating a target voice version of the plurality of passage segments by altering the modeled source voice prosodic component for each segment to replicate the target voice prosodic component modeled by the target voice entry identified for that segment in (b).
- - 20. The device of claim 19, wherein operation (a) includes receiving data for one or more additional segments of the passage in a source voice, and wherein the one or more processors are configured to generate a target voice version of each of the one or more additional source voice passage segments according to $x_{i} (n) |_{MV} = \frac{x_{i}^{SRC} (n) - μ_{SRC}}{σ_{SRC}} * σ_{TGT} + μ_{TGT}$, wherein $μ_{SRC}$ is a mean of all F0 values for source voice versions of segments in the codebook training material, $σ_{SRC}$ is a standard deviation of all F0 values for source voice versions of segments in the codebook training material, $μ_{TGT}$ is a mean of all F0 values for target voice versions of segments in the codebook training material, $σ_{TGT}$ is a standard deviation of all F0 values for target voice versions of segments in the codebook training material, $x_{i}^{SRC} (n)$ is a value for F0 at time n in an F0 contour for segment i of the additional segments, and $x_{i} (n) |_{MV}$ is a value for F0 at time n in an F0 contour for a target voice version of segment i of the additional segments.
  - 21. The device of claim 20, wherein the data for the passage segments in the source voice is generated by a text-to-speech system.
  - 22. The device of claim 19, wherein the modeled prosodic components are pitch contours.
  - 23. The device of claim 19, wherein the codebook includes multiple source voice entries, each of the multiple source voice entries models a prosodic component of the source voice for a different segment of the codebook training material, each of the multiple source voice entries corresponds to a target voice entry modeling a prosodic component of the target voice for the segment of the codebook training material for which the corresponding source voice entry models the prosodic component of the source voice, and operation (b) includes, for each source voice passage segment, identifying a target voice entry by comparing data for the source voice passage segment to one or more of the multiple source voice entries.
  - 24. The device of claim 23, wherein each of the multiple source voice entries and its corresponding target voice entry includes a plurality of transform coefficients representing a contour for the modeled prosodic component, and operation (b) includes, for each source voice passage segment, identifying a target voice entry by comparing transform coefficients representing a contour for the prosodic component of the source voice passage segment to the transform coefficients for one or more of the multiple source voice entries.
  - 25. The device of claim 24, wherein the transform is a discrete cosine transform.
  - 26. The device of claim 24, wherein each of the multiple source voice entries is associated with a different feature vector, each of the associated feature vectors includes values of a set of linguistic features for the codebook training speech segment for which the associated source voice entry models the prosodic component of the source voice, data for each of the source voice passage segments includes a feature vector that includes values of the set of linguistic features for that source voice passage segment, and operation (b) includes, for each source voice passage segment, (b1) identifying multiple candidate source voice entries based on the transform coefficient comparisons, and (b2) selecting the identified target voice entry based on a comparison of the feature vector for the source voice passage segment with each of the feature vectors associated with the multiple candidate source voice entries identified in (b1).
  - 27. The device of claim 26, wherein the selecting in operation (b2) is also based on comparison of a duration of the source voice passage segment with durations of each of the candidate source voice entries identified in (b1).
  - 28. The device of claim 24, wherein operation (c) includes, (c1) performing an inverse transform on the target voice entry identified for one of the source voice passage segments, (c2) adjusting the result of (c1) according to $x_{i}^{TGT} (n) |_{a} = x_{i}^{TGT} (n) + x_{i}^{SRC} (n) - z_{j}^{SRC} (n)$, wherein $x_{i}^{TGT} (n)$ is a value for pitch at time n and is the result of (c1), $x_{i}^{SRC} (n)$ is a value for pitch at time n from a pitch contour for the source voice passage segment for which the inverse transform was performed in (c1), $z_{j}^{SRC} (n)$ is a value for pitch at time n obtained from the inverse transform of the source voice entry corresponding to the identified target voice entry of (c1), and $x_{i}^{TGT} (n) |_{a}$ is an adjusted pitch value at time n.
  - 29. The device of claim 28, wherein operation (c) includes (c3) further adjusting the result of (c2) according to $x_{i}^{TGT} (n) |_{a, μ} = x_{i}^{TGT} (n) |_{a} + x_{i} (n) |_{MV}$, wherein $x_{i} (n) |_{MV} = \frac{x_{i}^{SRC} (n) - μ_{SRC}}{σ_{SRC}} * σ_{TGT} + μ_{TGT}$ and wherein $μ_{SRC}$ is a mean of all F0 values for source voice versions of segments in the codebook training material, $σ_{SRC}$ is a standard deviation of all F0 values for source voice versions of segments in the codebook training material, $μ_{TGT}$ is a mean of all F0 values for target voice versions of segments in the codebook training material, and $σ_{TGT}$ is a standard deviation of all F0 values for target voice versions of segments in the codebook training material.
  - 30. The device of claim 29, wherein operation (c) includes (c4) determining whether a boundary between the source voice passage segment for which the inverse transform was performed in (c1) and an adjacent source voice passage segment is continuous in voicing energy level, and (c5) upon determining in (c4) that the boundary is continuous in voicing energy level, adding a bias value to the result of (c3) to preserve a continuous pitch level.
  - 31. The device of claim 19, wherein the device is a mobile communication device.
  - 32. The device of claim 19, wherein the device is a computer.
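The two-stage selection recited in claims 26 and 27 (and in the parallel method and medium claims) might look roughly like the sketch below. It is an illustration under stated assumptions, not the claimed implementation: the `CodebookEntry` layout, the mismatch-count comparison of feature vectors, the squared Euclidean coefficient distance, and the `num_candidates` and `duration_weight` parameters are all hypothetical choices, since the claims do not fix any of them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CodebookEntry:
    """One paired codebook entry (hypothetical layout for this sketch)."""
    source_coeffs: List[float]   # transform coefficients, source voice
    target_coeffs: List[float]   # transform coefficients, target voice
    features: List[str]          # linguistic feature values for the segment
    duration: float              # segment duration in seconds


def squared_distance(a, b):
    """Squared Euclidean distance between two coefficient lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def select_target_entry(query_coeffs, query_features, query_duration,
                        codebook, num_candidates=3, duration_weight=1.0):
    """Pick a target entry in two stages.

    (b1) keep the num_candidates source entries closest in coefficients;
    (b2) among them, choose the one whose feature vector and duration
         best match the query segment.
    """
    candidates = sorted(
        codebook,
        key=lambda e: squared_distance(query_coeffs, e.source_coeffs),
    )[:num_candidates]

    def cost(entry):
        mismatches = sum(
            1 for q, f in zip(query_features, entry.features) if q != f)
        return mismatches + duration_weight * abs(query_duration - entry.duration)

    return min(candidates, key=cost).target_coeffs
```

Filtering by contour shape first and by linguistic context second means an entry with a perfect feature match but a very different contour is never considered, which is one reading of why the claims order the steps as (b1) then (b2).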

33. A device, comprising:
- a voice converter, the voice converter including means for receiving data for a plurality of segments of a passage in a source voice, means for identifying target voice data entries in a codebook for segments of the source voice passage, and means for generating a target voice version of the passage segments based on identified target voice data entries.
- - 34. The device of claim 33, wherein the identification means include means for comparing transformed representations of source passage pitch contours to transformed representations of codebook training material pitch contours.
  - 35. The device of claim 33, wherein the identification means include means for comparing feature vectors of source passage segments to feature vectors of codebook training material segments.

Current Assignee
WSOU Investments, LLC (WSOU Holdings, LLC)
Original Assignee
Nokia Corporation
Inventors
Helander, Elina; Nurminen, Jani K.

Granted Patent

US 7,996,222 B2
US Class Current

704/250
CPC Class Codes

G10L 13/04   Details of speech synthesis...

G10L 2021/0135   Voice conversion or morphing

G10L 21/00   Speech or voice signal proc...