Speech processing method and apparatus for deciding emphasized portions of speech, and program therefor
Abstract
A scheme for judging emphasized speech portions, wherein the judgment is made by statistical processing of a set of speech parameters including a fundamental frequency, power and a temporal variation of a dynamic measure and/or their derivatives. The emphasized speech portions are used as clues for summarizing audio content or video content that contains speech.
22 Citations
28 Claims
1. A speech processing method performed using a processor for deciding whether a portion of input speech is emphasized or not based on a set of speech parameters for each frame, comprising the steps of:
(a) obtaining from a codebook a plurality of speech parameter vectors each corresponding to a respective set of speech parameters obtained from respective ones of a plurality of frames in the portion of the input speech, said codebook storing, for each of a plural number of predetermined speech parameter vectors, a corresponding pair of a normal-state appearance probability and an emphasized-state appearance probability both predetermined using a training speech signal, each of said plural number of predetermined speech parameter vectors being composed of a set of speech parameters including at least one of a fundamental frequency, power and a temporal variation of dynamic-measure and/or an inter-frame difference in at least one of those speech parameters, and obtaining from said codebook a pair of an emphasized-state appearance probability and a normal-state appearance probability both corresponding to each speech parameter vector obtained for the respective ones of the plurality of frames in the portion of the input speech;
(b) using the processor, calculating an emphasized-state likelihood of the portion of the input speech by multiplying together emphasized-state appearance probabilities corresponding to the respective speech parameter vectors for the plurality of frames in the portion of the input speech, and calculating a normal-state likelihood of the portion of the input speech by multiplying together normal-state appearance probabilities corresponding to the respective speech parameter vectors for the plurality of frames in the portion of the input speech; and
(c) deciding whether the portion of the input speech is emphasized or not based on said calculated emphasized-state likelihood and said calculated normal-state likelihood, and outputting a decision result of said deciding, the decision result indicating whether the portion of the input speech is emphasized or not,
wherein the codebook stores, for each of the plural predetermined speech parameter vectors, a respective independent emphasized-state appearance probability and a respective set of conditional emphasized-state appearance probabilities, both used as respective said emphasized-state appearance probability, and stores, for each of the plural predetermined speech parameter vectors, a respective independent normal-state appearance probability and a set of conditional normal-state appearance probabilities, both used as respective said normal-state appearance probability, such that there is at least stored a separate conditional emphasized-state appearance probability and a separate conditional normal-state appearance probability for a possible speech parameter vector that immediately follows the respective speech parameter vector in the codebook, and
wherein the step of calculating the emphasized-state likelihood in said step (b) is implemented by multiplying together the independent emphasized-state appearance probability and the conditional emphasized-state appearance probabilities corresponding to the speech parameter vectors of respective first frame and subsequent frames in said portion of the input speech, and the step of calculating the normal-state likelihood in said step (b) is implemented by multiplying together the independent normal-state appearance probability and the conditional normal-state appearance probabilities corresponding to the speech parameter vectors of respective said first frame and said subsequent frames in said portion of the input speech.
Dependent claims: 2 to 18.
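As an illustration only, steps (b) and (c) of claim 1 can be sketched in a few lines of Python. Everything concrete here is an assumption made for the example and is not taken from the patent: the toy probability values, the names `CODEBOOK`, `FLOOR`, `log_likelihood` and `is_emphasized`, the floor applied to code pairs missing from the table, and the simple "greater than" decision rule. The sketch sums logarithms rather than multiplying raw probabilities, which gives the same comparison while avoiding numerical underflow on long portions.

```python
import math

# Hypothetical toy codebook. Integer keys stand for indices of quantized
# speech parameter vectors. For each state the table holds:
#   "indep": independent appearance probability of a code (used for the first frame)
#   "cond":  conditional appearance probability of a code given the preceding code
CODEBOOK = {
    "emphasized": {
        "indep": {0: 0.2, 1: 0.5, 2: 0.3},
        "cond": {(0, 1): 0.6, (1, 1): 0.5, (1, 2): 0.4, (2, 1): 0.7},
    },
    "normal": {
        "indep": {0: 0.5, 1: 0.2, 2: 0.3},
        "cond": {(0, 1): 0.2, (1, 1): 0.3, (1, 2): 0.3, (2, 1): 0.2},
    },
}

FLOOR = 1e-6  # assumed fallback probability for entries absent from the toy table


def log_likelihood(codes, table):
    """Log of (independent prob. of first code) x (conditional probs. of later codes)."""
    if not codes:
        raise ValueError("the speech portion must contain at least one frame")
    total = math.log(table["indep"].get(codes[0], FLOOR))
    for prev, cur in zip(codes, codes[1:]):
        total += math.log(table["cond"].get((prev, cur), FLOOR))
    return total


def is_emphasized(codes, codebook=CODEBOOK):
    """Assumed decision rule: emphasized if its likelihood exceeds the normal one."""
    emphasized = log_likelihood(codes, codebook["emphasized"])
    normal = log_likelihood(codes, codebook["normal"])
    return emphasized > normal
```

For the code sequence `[0, 1, 1, 2]` the toy tables give an emphasized-state likelihood of 0.2 x 0.6 x 0.5 x 0.4 = 0.024 against a normal-state likelihood of 0.5 x 0.2 x 0.3 x 0.3 = 0.009, so `is_emphasized([0, 1, 1, 2])` returns `True`.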
19. A speech processing method performed using a processor for deciding whether a portion of input speech is emphasized or not based on a set of speech parameters for each frame using an acoustical model including a codebook,
wherein said codebook stores, as a normal initial-state appearance probability and an emphasized initial-state appearance probability, both for each of a plural number of predetermined speech parameter vectors, a corresponding pair of normal-state appearance probability and an emphasized-state appearance probability, both predetermined using a training speech signal, a predetermined number of states including an initial state and a final state, state transitions each defining a transition from each state to itself or another state, an output probability table storing emphasized-state output probabilities and normal-state output probabilities both for each of the plural number of speech parameter vectors at the respective states and a transition probability table storing an emphasized-state transition probability and a normal-state transition probability both for each of the state transitions, and
wherein each of said speech parameter vectors is composed of a set of speech parameters including at least one of a fundamental frequency, power and a temporal variation of dynamic-measure and/or an inter-frame difference in at least one of those parameters, the method comprising the steps of:
judging each frame as voiced or unvoiced;
judging, as a speech sub-block, a portion which includes a voiced portion of at least one frame and which is laid between unvoiced portions longer than a predetermined number of frames;
obtaining from the codebook an emphasized initial-state probability and a normal initial-state probability both corresponding to a speech parameter vector which is a quantized set of speech parameters for an initial frame in said speech sub-block;
obtaining from the output probability table emphasized-state output probabilities and normal-state output probabilities both for respective state transitions corresponding to respective speech parameter vectors each of which is a quantized set of speech parameters obtained for respective one of frames after said initial frame in said speech sub-block, and obtaining from the transition probability table emphasized-state transition probabilities and normal-state transition probabilities both corresponding to state transitions for respective frames after said initial frame in said speech sub-block;
calculating, using the processor, a probability of emphasized-state by multiplying together said emphasized initial-state probability, said emphasized-state output probabilities and said emphasized-state transition probabilities both along every path of state transitions via the predetermined number of states and calculating, using the processor, a probability of normal-state by multiplying together said normal initial-state probability, said output probability and said normal-state transition probability both along every state transition path;
deciding a largest one or total sum of the probabilities of emphasized-state for all the state transition paths as an emphasized-state likelihood and a largest one or total sum of the probabilities of normal-state for all the state transition paths as a normal-state likelihood; and
comparing said emphasized-state likelihood with said normal-state likelihood to decide whether the speech sub-block is emphasized state or normal state.
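The likelihood steps of claim 19 resemble evaluating a hidden Markov model, and the following Python sketch shows one way they might be computed. It is a minimal illustration under stated assumptions, not the patented implementation: the two-state topology, all probability values, and the names `EMPH`, `NORMAL`, `path_likelihood` and `decide` are hypothetical, output probabilities are attached to transitions, and paths are allowed to end in any state. The `use_max` flag switches between the "total sum" and "largest one" alternatives named in the claim.

```python
# Hypothetical two-state models (state 0 = initial, state 1 = final).
# "init":  initial-state probability for each code of the first frame
# "trans": transition probability for each allowed transition (i, j)
# "out":   output probability of each code on each transition (i, j)
EMPH = {
    "n_states": 2,
    "init": {0: 0.3, 1: 0.7},
    "trans": {(0, 0): 0.5, (0, 1): 0.5, (1, 1): 1.0},
    "out": {(0, 0): {0: 0.4, 1: 0.6}, (0, 1): {0: 0.2, 1: 0.8}, (1, 1): {0: 0.1, 1: 0.9}},
}
NORMAL = {
    "n_states": 2,
    "init": {0: 0.7, 1: 0.3},
    "trans": {(0, 0): 0.5, (0, 1): 0.5, (1, 1): 1.0},
    "out": {(0, 0): {0: 0.6, 1: 0.4}, (0, 1): {0: 0.8, 1: 0.2}, (1, 1): {0: 0.9, 1: 0.1}},
}


def path_likelihood(codes, model, use_max=False):
    """Total sum over all state paths, or the largest single path if use_max is set."""
    if not codes:
        raise ValueError("the sub-block must contain at least one frame")
    combine = max if use_max else sum
    n = model["n_states"]
    # The first frame contributes the initial-state probability in state 0.
    alpha = [0.0] * n
    alpha[0] = model["init"][codes[0]]
    # Each later frame contributes one transition and one output probability.
    for code in codes[1:]:
        nxt = [0.0] * n
        for j in range(n):
            terms = [
                alpha[i] * p * model["out"][(i, dest)][code]
                for (i, dest), p in model["trans"].items()
                if dest == j
            ]
            nxt[j] = combine(terms) if terms else 0.0
        alpha = nxt
    # Assumption: a path may end in any state, not only the final one.
    return combine(alpha)


def decide(codes, use_max=False):
    """Compare the two likelihoods and label the sub-block."""
    emphasized = path_likelihood(codes, EMPH, use_max)
    normal = path_likelihood(codes, NORMAL, use_max)
    return "emphasized" if emphasized > normal else "normal"
```

With these toy tables, `decide([1, 1, 1])` yields `"emphasized"` (summed likelihood 0.399 against 0.021) and `decide([0, 0, 0])` yields `"normal"`. A production system would normally work with log probabilities or scaling to avoid underflow on long sub-blocks.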
20. A speech processing apparatus for deciding whether a portion of input speech is emphasized or not based on a set of speech parameters for each frame of said input speech, said apparatus comprising:
a codebook which stores, for each of a plural number of predetermined speech parameter vectors, a corresponding pair of a normal-state appearance probability and an emphasized-state appearance probability, both predetermined using a training speech signal, each of said predetermined speech parameter vectors being composed of a set of speech parameters including at least two of a fundamental frequency, power and temporal variation of dynamic measure and/or an inter-frame difference in at least one of those speech parameters;
means for obtaining from said codebook a plurality of speech parameter vectors each corresponding to a respective set of speech parameters obtained from each of a plurality of frames in the portion of the input speech;
a normal-state likelihood calculating part that calculates a normal-state likelihood of the portion of the input speech by multiplying together normal-state appearance probabilities corresponding to the respective speech parameter vectors for the plurality of frames in the portion of the input speech;
an emphasized-state likelihood calculating part that calculates an emphasized-state likelihood of the portion of the input speech by multiplying together emphasized-state appearance probabilities corresponding to the respective speech parameter vectors for the plurality of frames in the portion of the input speech;
an emphasized state deciding part that decides whether the portion of the input speech is emphasized or not based on a comparison of said calculated emphasized-state likelihood to said calculated normal-state likelihood; and
an outputting unit that outputs the decision result representing whether the portion of the input speech is emphasized or not,
wherein the codebook further stores, for each of the plural predetermined speech parameter vectors, a respective independent emphasized-state appearance probability and a respective independent normal-state appearance probability, both predetermined using the training speech signal, and stores for each of the plural predetermined speech parameter vectors, a respective set of conditional emphasized-state appearance probabilities and a respective set of conditional normal-state appearance probabilities, both predetermined using the training speech signal, such that there is at least stored a separate conditional emphasized-state appearance probability and a separate conditional normal-state appearance probability for a possible instance speech parameter vector that immediately follows the respective speech parameter vector in the codebook,
wherein said emphasized-state likelihood calculating part is configured to calculate the emphasized-state likelihood by multiplying together an independent emphasized-state appearance probability and conditional emphasized-state appearance probabilities corresponding to the speech parameter vectors of respective first frame and subsequent frames in the portion of the input speech, and
wherein said normal-state likelihood calculating part is configured to calculate the normal-state likelihood by multiplying together an independent normal-state appearance probability and conditional normal-state appearance probabilities corresponding to the speech parameter vectors of respective first frame and subsequent frames in the portion of the input speech.
Dependent claims: 21 to 28.
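The "means for obtaining from said codebook" a speech parameter vector for each frame can be pictured as vector quantization: each frame's measured parameter set is mapped to the nearest stored vector. The Python sketch below illustrates that reading. It rests on assumptions that the claim does not state: Euclidean distance as the nearness measure, three-component vectors (normalized fundamental frequency, normalized power, inter-frame difference of the fundamental frequency), and the hypothetical names `CODE_VECTORS`, `quantize` and `quantize_frames`.

```python
import math

# Hypothetical codebook vectors: (normalized f0, normalized power, delta f0).
CODE_VECTORS = [
    (0.0, 0.0, 0.0),
    (1.0, 0.5, 0.2),
    (-1.0, -0.5, -0.2),
]


def quantize(frame_params, code_vectors=CODE_VECTORS):
    """Return the index of the codebook vector closest to one frame's parameters."""
    return min(
        range(len(code_vectors)),
        key=lambda k: math.dist(frame_params, code_vectors[k]),
    )


def quantize_frames(frames, code_vectors=CODE_VECTORS):
    """Map every frame in a speech portion to its codebook index."""
    return [quantize(frame, code_vectors) for frame in frames]
```

For instance, `quantize((0.9, 0.4, 0.1))` selects index 1, and the resulting index sequence is what the likelihood calculating parts would then look up in the probability tables.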
Specification