Speech processing device and method

US 9,672,809 B2
Filed: 04/24/2014
Issued: 06/06/2017
Est. Priority Date: 06/17/2013
Status: Active Grant

1 Assignment

0 Petitions

Abstract

A speech processing device includes a processor; and a memory which stores a plurality of instructions, which when executed by the processor, cause the processor to execute: obtaining input speech, detecting a vowel segment contained in the input speech, estimating an accent segment contained in the input speech, calculating a first vowel segment length containing the accent segment and a second vowel segment length excluding the accent segment, and controlling at least one of the first vowel segment length and the second vowel segment length.
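The accent (stress) estimation step can be illustrated with a minimal sketch. Claim 1 describes the stress segment as the vowel segment showing a strong decreasing trend in pitch or power per unit time; the sketch below assumes this means the vowel whose pitch contour has the most negative least-squares slope. The function names, the 10 ms frame step, and the slope criterion are all assumptions for illustration, not details taken from the patent.

```python
from typing import List, Sequence


def slope_per_second(values: Sequence[float], frame_sec: float) -> float:
    """Least-squares slope of a per-frame contour, in units per second."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_t = (n - 1) / 2.0
    mean_v = sum(values) / n
    num = sum((i - mean_t) * (v - mean_v) for i, v in enumerate(values))
    den = sum((i - mean_t) ** 2 for i in range(n))
    return num / den / frame_sec


def estimate_stress_index(pitch_by_vowel: List[Sequence[float]],
                          frame_sec: float = 0.01) -> int:
    """Index of the vowel segment whose pitch falls fastest (assumed criterion).

    pitch_by_vowel must be non-empty; each entry holds one pitch value (Hz)
    per analysis frame of that vowel segment.
    """
    slopes = [slope_per_second(p, frame_sec) for p in pitch_by_vowel]
    return min(range(len(slopes)), key=lambda i: slopes[i])
```

The same function could be applied to per-frame power values instead of pitch, since the claims allow either feature.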

17 Citations

14 Claims

1. A speech processing device comprising:
- a processor; and
  
  a memory which stores a plurality of instructions, which when executed by the processor, cause the processor to execute:
  
  obtaining input speech, the input speech including a plurality of vowel segments and a plurality of consonant segments,
  
  detecting the vowel segments contained in the input speech,
  
  estimating a stress segment among the plurality of vowel segments by comparing pitch variation rate or power variation rate per unit time of the plurality of vowel segments, respectively, the stress segment being a segment that has a strong trend of decrease in the pitch variation rate or the power variation rate per unit time,
  
  detecting sound lengths of each of the plurality of vowel segments,
  
  transforming the input speech so that a first sound length becomes longer than each of second sound lengths when the input speech includes at least one of the second sound lengths that is longer than the first sound length, the first sound length being a sound length of a vowel segment containing the stress segment, the second sound lengths being sound lengths of vowel segments excluding the stress segment, the transforming including extending the first sound length or shortening at least one of the second sound lengths, the first sound length being extended by inserting a part of segment obtained based on the vowel segment containing the stress segment into the vowel segment containing the stress segment, the at least one of the second sound lengths being shortened by deleting a part of segment from the at least one of the second sound lengths, a length to be inserted or to be shortened being determined based on the detected first sound length and the detected second sound length and a prescribed target scaling factor, and
  
  outputting the transformed input speech in which the first sound length is extended or in which the at least one of the second sound lengths is shortened.
- 2. The device according to claim 1, wherein the estimating comprises estimating the stress segment based on an amount of change of a pitch frequency or power of the input speech per unit time.
- 3. The device according to claim 1, wherein the memory further causes the processor to execute detecting a fundamental period for a vowel segment to be extended or to be shortened in the transforming, wherein, in the transforming, the length to be inserted or to be shortened is determined based on the fundamental period.
- 4. The device according to claim 3, wherein the detecting the fundamental period comprises further detecting an amount of acoustic feature that includes at least one of pitch frequency, formant frequency, and autocorrelation of the vowel segment to be extended or to be shortened in the transforming, and wherein the transforming comprises extending the first sound length or shortening the second sound length, when it is determined that the vowel segment to be extended or to be shortened in the transforming is a segment for which the amount of change of the amount of acoustic feature per unit time is less than a predetermined first threshold.
- 5. The device according to claim 1, wherein the transforming comprises transforming the first sound length or the second sound length when it is determined that the vowel segment to be extended or to be shortened in the transforming is a segment for which the autocorrelation value is equal to or greater than a predetermined threshold or when it is determined that the vowel segment to be extended or to be shortened in the transforming is a segment for which amplitude is less than a predetermined threshold.
- 6. The device according to claim 1, wherein the transforming comprises extending the first sound length or shortening the second sound length by adding a signal in which a weighting factor that decreases over time is applied to a segment preceding the part of the segment to be inserted or shortened, and a signal in which a weighting factor that increases over time is applied to a frame following the part of segment to be inserted or shortened.
- 7. The device according to claim 1, wherein the memory further causes the processor to execute recognizing the input speech as text information, wherein the recognizing comprises detecting the first sound length or the second sound length based on the text information.
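
The weighted addition in claim 6 resembles a cross-fade (overlap-add) splice. The following is a minimal sketch of one common arrangement, assuming linear weights and a splice length of exactly one pitch period; the function names and the choice of which samples are blended are assumptions for illustration and may differ from the patented implementation.

```python
from typing import List


def _crossfade(fade_out: List[float], fade_in: List[float]) -> List[float]:
    """Blend two equal-length frames: the first fades out, the second fades in."""
    n = len(fade_out)
    blended = []
    for i in range(n):
        w_in = i / (n - 1)    # weight rising from 0 to 1 over the frame
        w_out = 1.0 - w_in    # weight falling from 1 to 0 over the frame
        blended.append(w_out * fade_out[i] + w_in * fade_in[i])
    return blended


def insert_period(x: List[float], pos: int, period: int) -> List[float]:
    """Lengthen x by `period` samples by splicing a blended frame in at `pos`."""
    if period < 2 or pos < period or pos + period > len(x):
        raise ValueError("need one full period on each side of pos")
    # The patch starts like x[pos:] and ends like x[:pos], so both joins are smooth.
    patch = _crossfade(x[pos:pos + period], x[pos - period:pos])
    return x[:pos] + patch + x[pos:]


def delete_period(x: List[float], pos: int, period: int) -> List[float]:
    """Shorten x by `period` samples by merging two adjacent frames into one."""
    if period < 2 or pos < 0 or pos + 2 * period > len(x):
        raise ValueError("need two full periods starting at pos")
    # The preceding frame fades out while the following frame fades in.
    patch = _crossfade(x[pos:pos + period], x[pos + period:pos + 2 * period])
    return x[:pos] + patch + x[pos + 2 * period:]
```

On a perfectly periodic signal both operations reproduce the waveform exactly, which is why splicing in whole pitch periods tends to avoid audible discontinuities in steady vowels.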

8. A speech processing method comprising:
- obtaining input speech, the input speech including a plurality of vowel segments and a plurality of consonant segments,
  
  detecting the vowel segments contained in the input speech,
  
  estimating a stress segment among the plurality of vowel segments by comparing pitch variation rate or power variation rate per unit time of the plurality of vowel segments, respectively, the stress segment being a segment that has a strong trend of decrease in the pitch variation rate or the power variation rate per unit time,
  
  detecting sound lengths of each of the plurality of vowel segments,
  
  transforming the input speech so that a first sound length becomes longer than each of second sound lengths when the input speech includes at least one of the second sound lengths that is longer than the first sound length, the first sound length being a sound length of a vowel segment containing the stress segment, the second sound lengths being sound lengths of vowel segments excluding the stress segment, the transforming including extending the first sound length or shortening at least one of the second sound lengths, the first sound length being extended by inserting a part of segment obtained based on the vowel segment containing the stress segment into the vowel segment containing the stress segment, the at least one of the second sound lengths being shortened by deleting a part of segment from the at least one of the second sound lengths, a length to be inserted or to be shortened being determined based on the detected first sound length and the detected second sound length and a prescribed target scaling factor, and
  
  outputting the transformed input speech in which the first sound length is extended or in which the at least one of the second sound lengths is shortened.
- 9. The method according to claim 8, wherein the estimating comprises estimating the stress segment based on an amount of change of a pitch frequency or power of the input speech per unit time.
- 10. The method according to claim 8, further comprising:
  - detecting a fundamental period for a vowel segment to be extended or to be shortened in the transforming, wherein, in the transforming, the length to be inserted or to be shortened is determined based on the fundamental period.
- 11. The method according to claim 10, wherein the detecting the fundamental period comprises further detecting an amount of acoustic feature that includes at least one of pitch frequency, formant frequency, and autocorrelation of the vowel segment to be extended or to be shortened in the transforming, and wherein the transforming comprises extending the first sound length or shortening the second sound length, when it is determined that the vowel segment to be extended or to be shortened in the transforming is a segment for which the amount of change of the amount of acoustic feature per unit time is less than a predetermined first threshold.
- 12. The method according to claim 8, wherein the transforming comprises transforming the first sound length or the second sound length when it is determined that the vowel segment to be extended or to be shortened in the transforming is a segment for which the autocorrelation value is equal to or greater than a predetermined threshold or when it is determined that the vowel segment to be extended or to be shortened in the transforming is a segment for which amplitude is less than a predetermined threshold.
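
Claims 10 and 12 tie the edit length to a detected fundamental period and gate the edit on an autocorrelation threshold. A minimal sketch of that idea follows, assuming the period is found as the lag that maximizes normalized autocorrelation and that the edit length is rounded to a whole number of periods. The function names, the lag search range, the rounding rule, and the 0.7 threshold are hypothetical choices for illustration.

```python
import math
from typing import List, Optional, Tuple


def normalized_autocorr(x: List[float], lag: int) -> float:
    """Normalized correlation between x and x shifted by `lag` samples."""
    n = len(x) - lag
    num = sum(x[i] * x[i + lag] for i in range(n))
    e0 = sum(x[i] ** 2 for i in range(n))
    e1 = sum(x[i + lag] ** 2 for i in range(n))
    den = math.sqrt(e0 * e1)
    return num / den if den > 0 else 0.0


def fundamental_period(x: List[float], min_lag: int,
                       max_lag: int) -> Tuple[int, float]:
    """Return (lag, correlation) for the best-correlating lag in the range."""
    best_lag, best_r = min_lag, -1.0
    for lag in range(min_lag, min(max_lag, len(x) - 1) + 1):
        r = normalized_autocorr(x, lag)
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r


def editable_length(x: List[float], needed: int, min_lag: int, max_lag: int,
                    r_threshold: float = 0.7) -> Optional[int]:
    """Round `needed` samples to whole pitch periods, or None if not periodic enough."""
    period, r = fundamental_period(x, min_lag, max_lag)
    if r < r_threshold:
        return None  # frame is not steady enough to edit safely
    return max(1, round(needed / period)) * period
```

Returning `None` for weakly periodic frames mirrors the gating in claim 12: only frames that look like steady voiced vowels are candidates for insertion or deletion.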

13. A non-transitory computer-readable storage medium storing a speech processing program that causes a computer to execute a process comprising:
- obtaining input speech, the input speech including a plurality of vowel segments and a plurality of consonant segments,
  
  detecting the vowel segments contained in the input speech,
  
  estimating a stress segment among the plurality of vowel segments by comparing pitch variation rate or power variation rate per unit time of the plurality of vowel segments, respectively, the stress segment being a segment that has a strong trend of decrease in the pitch variation rate or the power variation rate per unit time,
  
  detecting sound lengths of each of the plurality of vowel segments,
  
  transforming the input speech so that a first sound length becomes longer than each of second sound lengths when the input speech includes at least one of the second sound lengths that is longer than the first sound length, the first sound length being a sound length of a vowel segment containing the stress segment, the second sound lengths being sound lengths of vowel segments excluding the stress segment, the transforming including extending the first sound length or shortening at least one of the second sound lengths, the first sound length being extended by inserting a part of segment obtained based on the vowel segment containing the stress segment into the vowel segment containing the stress segment, the at least one of the second sound lengths being shortened by deleting a part of segment from the at least one of the second sound lengths, a length to be inserted or to be shortened being determined based on the detected first sound length and the detected second sound length and a prescribed target scaling factor, and
  
  outputting the transformed input speech in which the first sound length is extended or in which the at least one of the second sound lengths is shortened.
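
The independent claims state that the length to insert or delete depends on the detected first and second sound lengths and a prescribed target scaling factor, without giving a formula. The sketch below shows one plausible reading, assuming the scaling factor is the desired ratio of the stressed vowel length to the other vowel lengths. The names `plan_extension`, `plan_shortening`, and `target_ratio`, and the formulas themselves, are assumptions for illustration only.

```python
from typing import Dict, List


def _needs_transform(lengths_ms: List[float], stress_idx: int) -> bool:
    """True when some unstressed vowel is longer than the stressed vowel."""
    first = lengths_ms[stress_idx]
    return any(l > first for i, l in enumerate(lengths_ms) if i != stress_idx)


def plan_extension(lengths_ms: List[float], stress_idx: int,
                   target_ratio: float = 1.2) -> float:
    """Milliseconds to add to the stressed vowel (0.0 if no change is needed)."""
    if not _needs_transform(lengths_ms, stress_idx):
        return 0.0
    longest_other = max(l for i, l in enumerate(lengths_ms) if i != stress_idx)
    return target_ratio * longest_other - lengths_ms[stress_idx]


def plan_shortening(lengths_ms: List[float], stress_idx: int,
                    target_ratio: float = 1.2) -> Dict[int, float]:
    """Milliseconds to cut from each unstressed vowel, keyed by segment index."""
    if not _needs_transform(lengths_ms, stress_idx):
        return {}
    limit = lengths_ms[stress_idx] / target_ratio  # longest allowed unstressed vowel
    return {i: l - limit for i, l in enumerate(lengths_ms)
            if i != stress_idx and l > limit}
```

With `target_ratio` greater than 1, either plan leaves the stressed vowel strictly longer than every other vowel, which is the outcome the transforming step requires.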

14. A portable terminal device comprising:
- a microphone that inputs a speaker's voice as input speech;
  
  a processor; and
  
  a memory which stores a plurality of instructions, which when executed by the processor, cause the processor to execute:
  
  obtaining the input speech, the input speech including a plurality of vowel segments and a plurality of consonant segments,
  
  detecting the vowel segments contained in the input speech,
  
  estimating a stress segment among the plurality of vowel segments by comparing pitch variation rate or power variation rate per unit time of the plurality of vowel segments, respectively, the stress segment being a segment that has a strong trend of decrease in the pitch variation rate or the power variation rate per unit time,
  
  detecting sound lengths of each of the plurality of vowel segments,
  
  transforming the input speech so that a first sound length becomes longer than each of second sound lengths when the input speech includes at least one of the second sound lengths that is longer than the first sound length, the first sound length being a sound length of a vowel segment containing the stress segment, the second sound lengths being sound lengths of vowel segments excluding the stress segment, the transforming including extending the first sound length or shortening at least one of the second sound lengths, the first sound length being extended by inserting a part of segment obtained based on the vowel segment containing the stress segment into the vowel segment containing the stress segment, the at least one of the second sound lengths being shortened by deleting a part of segment from the at least one of the second sound lengths, a length to be inserted or to be shortened being determined based on the detected first sound length and the detected second sound length and a prescribed target scaling factor, and
  
  outputting the transformed input speech in which the first sound length is extended or in which the at least one of the second sound lengths is shortened,
  
  a speaker configured to output an output speech generated by controlling the input speech.

Current Assignee: Fujitsu Limited
Original Assignee: Fujitsu Limited
Inventors: Shioda, Chisato; Otani, Takeshi; Togawa, Taro
Primary Examiner(s): Poon, King
Assistant Examiner(s): Siddo, Ibrahim
Application Number: US14/260,449
Publication Number: US 20140372121A1
Time in Patent Office: 1,139 Days
CPC Class Codes

G10L 13/027   Concept to speech synthesis...

G10L 15/02   Feature extraction for spee...

G10L 15/04   Segmentation; Word boundary...

G10L 15/08   Speech classification or se...

G10L 21/02   Speech enhancement, e.g. no...

G10L 21/0364   for improving intelligibility

G10L 21/057   for improving intelligibility
