Speech processing method and apparatus and program therefor

US 20030055634A1
Filed: 08/08/2002
Published: 03/20/2003
Est. Priority Date: 08/08/2001
Status: Abandoned Application

Abstract

A scheme for judging emphasized speech portions, in which the judgment is made by statistical processing of a set of speech parameters including a fundamental frequency, power, and a temporal variation of a dynamic measure, and/or their derivatives. The emphasized speech portions are used as clues for summarizing audio content, or video content that contains speech.
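As an illustration of the statistical judgment, the following is a minimal Python sketch that assumes the codebook arrangement recited in claims 1 and 5: each frame is quantized to the nearest code, the stored emphasized-state and normal-state appearance probabilities are looked up, and the two likelihoods are compared. The codebook values, the function names, the use of a plain product of per-frame probabilities (computed in the log domain), and the Euclidean nearest-neighbor quantizer are all assumptions made for illustration; they are not taken from the specification.

```python
import math

# Hypothetical codebook: one entry per code, holding
# (speech parameter vector, P(emphasized), P(normal)).
# All numbers are invented for illustration.
CODEBOOK = [
    ((0.2, 0.1, 0.0), 0.10, 0.40),
    ((0.8, 0.9, 0.5), 0.60, 0.10),
    ((0.5, 0.4, 0.2), 0.30, 0.50),
]


def quantize(frame):
    """Return the index of the code whose vector is nearest to `frame`."""
    return min(
        range(len(CODEBOOK)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(frame, CODEBOOK[c][0])),
    )


def log_likelihoods(frames):
    """Accumulate log appearance probabilities over the frames of a portion."""
    log_emph = 0.0
    log_norm = 0.0
    for frame in frames:
        _, p_emph, p_norm = CODEBOOK[quantize(frame)]
        log_emph += math.log(p_emph)
        log_norm += math.log(p_norm)
    return log_emph, log_norm


def is_emphasized(frames, log_ratio_threshold=0.0):
    """Decide 'emphasized' when the log-likelihood ratio exceeds the threshold."""
    log_emph, log_norm = log_likelihoods(frames)
    return (log_emph - log_norm) > log_ratio_threshold
```

With a threshold of zero this reduces to checking whether the emphasized-state likelihood is larger than the normal-state likelihood; a non-zero threshold corresponds to a decision based on the likelihood ratio.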

42 Citations

26 Claims

1. A speech processing method for deciding an emphasized portion based on a set of speech parameters for each frame, comprising the steps of:
- (a) obtaining an emphasized-state appearance probability for a speech parameter vector, which is a quantized set of speech parameters for a current frame, by using a codebook which stores, for each code, a speech parameter vector and an emphasized-state appearance probability, each of said speech parameter vectors including at least one of a fundamental frequency, power and a temporal variation of a dynamic measure and/or an inter-frame difference in at least one of those parameters;
  
  (b) calculating an emphasized-state likelihood based on said emphasized-state appearance probability; and
  
  (c) deciding whether a portion including said current frame is emphasized or not based on said calculated emphasized-state likelihood.
- - 2. The method of claim 1, wherein each of said speech parameter vectors includes at least a temporal variation of dynamic measure.
  - 3. The method of claim 1, wherein each of said speech parameter vectors includes at least a fundamental frequency, a power and a temporal variation of dynamic measure.
  - 4. The method of claim 1, wherein each of said speech parameter vectors includes at least a fundamental frequency, power and a temporal variation of a dynamic-measure or an inter-frame difference in each of the parameters.
  - 5. The method of any one of claims 1 to 4, wherein said codebook further includes a normal-state appearance probability for each of said speech parameter vectors;
    - said step (a) comprises a step of obtaining a normal-state appearance probability for said speech parameter vector;
      
      said step (b) comprises a step for calculating a normal-state likelihood based on said normal-state appearance probability; and
      
      said step (c) comprises a step for comparing said emphasized-state likelihood with said normal-state likelihood.
  - 6. The method of claim 5, wherein said comparing step (c) is based on said emphasized-state likelihood being larger than said normal-state likelihood.
  - 7. The method of claim 5, wherein said step (c) is based on a ratio of said emphasized-state likelihood to said normal-state likelihood.
  - 8. The method of any one of claims 1 to 4, wherein said emphasized-state appearance probability stored in said codebook includes an independent emphasized-state appearance probability for the respective code and conditional emphasized-state appearance probabilities for the respective code subsequent to a predetermined number of previous codes, and said step (b) comprises a step for calculating the emphasized-state likelihood by multiplying said independent emphasized-state appearance probability by said conditional emphasized-state appearance probabilities.
  - 9. The method of claim 6, wherein said normal-state appearance probability stored in said codebook includes an independent normal-state appearance probability for the respective code and conditional normal-state probabilities for the respective code subsequent to a predetermined number of previous codes;
    - and said step (b) comprises a step for calculating the normal-state likelihood by multiplying said independent normal-state appearance probability by said conditional normal-state probabilities.
  - 10. The method of any one of claims 1 to 4, wherein said step (a) is characterized by normalizing said speech parameters by each one of said speech parameters for calculating a portion including said current frame, and quantizing a set of said normalized speech parameters.
  - 11. The method of claim 8, wherein said step (b) includes a step for calculating a conditional probability of emphasized-state by linear interpolation of said independent and conditional appearance probabilities.
  - 12. The method of any one of claims 1 to 4, wherein an emphasized initial-state probability is stored in said codebook as said emphasized-state appearance probability, using an acoustical model comprising an output probability for each state transition corresponding to each speech parameter vector and an emphasized-state transition probability for each state transition;
    - said step (a) comprises the steps of:
      
      (a-1) judging whether each frame is voiced or unvoiced;
      
      (a-2) judging, as a speech sub-block, a portion that includes a voiced portion of at least one frame and lies between unvoiced portions longer than a predetermined number of frames;
      
      (a-3) obtaining an emphasized initial-state probability for a speech parameter vector, which is a quantized set of speech parameters for an initial frame in said speech sub-block; and
      
      (a-4) obtaining an output probability for each state transition corresponding to a speech parameter vector, which is a quantized set of speech parameters for each frame after said initial frame in said speech sub-block; and
      
      said step (b) comprises a step for calculating a likelihood as said emphasized-state likelihood based on said emphasized initial-state probability, said output probability and said emphasized-state transition probability respectively for each state transition path.
  - 13. The method of claim 12, wherein a normal initial-state probability is stored in said codebook as said normal-state appearance probability, said acoustical model including a normal-state transition probability for each state transition;
    - said step (a) comprises a step for obtaining a normal initial-state probability for a speech parameter vector, which is a quantized set of speech parameters for an initial frame in said speech sub-block;
      
      said step (b) comprises a step for calculating a likelihood as said normal-state likelihood based on said normal initial-state probability, said output probability and said normal-state transition probability respectively for each state transition path; and
      
      said step (c) comprises a step for comparing said emphasized-state likelihood with said normal-state likelihood.
  - 14. The method of claim 12, wherein said step (a) comprises a step for deciding, as a speech block, a series of at least one speech sub-block having a final sub-block in which an average power in a voiced portion in said final sub-block is smaller than an average power in said speech sub-block or a level obtained by multiplying that average power by a constant;
    - and said step (c) comprises a step for deciding, as a portion to be summarized, a speech block including a speech sub-block which is decided to be an emphasized sub-block.
  - 15. The method of claim 13, wherein said step (a) comprises a step for deciding, as a speech block, a series of at least one speech sub-block having a final sub-block in which an average power in a voiced portion in said final sub-block is smaller than an average power in said speech sub-block or a level obtained by multiplying that average power by a constant;
    - and said step (c) comprises:
      
      (c-1) a step for calculating likelihood ratio of the emphasized state likelihood to normal state likelihood;
      
      (c-2) a step for deciding the speech sub-block to be in an emphasized state if said likelihood ratio is greater than a threshold value; and
      
      (c-3) a step for deciding a speech block including the emphasized speech sub-block as a portion to be summarized.
  - 16. The method of claim 15, wherein said step (c) further comprises a step for varying the threshold value and repeating the steps (c-2) and (c-3) to obtain portions to be summarized with a desired summarization ratio.
  - 17. The method of any one of claims 1 to 4, wherein said step (a) comprises the steps of:
    - (a-1) judging whether each frame is voiced or unvoiced;
      
      (a-2) judging, as a speech sub-block, a portion that includes a voiced portion of at least one frame and lies between unvoiced portions longer than a predetermined number of frames; and
      
      (a-3) judging, as a speech block, a series of at least one speech sub-block with a final sub-block in which an average power in a voiced portion is smaller than an average power in the whole portion or a level obtained by multiplying that average power by a constant; and
      
      said step (c) comprises a step for judging each of said speech sub-blocks as said portion including said current frame and judging a speech block including an emphasized speech sub-block as a portion to be summarized.
  - 18. The method of claim 17, wherein said codebook further stores a normal-state appearance probability for each speech parameter vector;
    - said step (a) comprises a step for obtaining a normal-state appearance probability for said speech parameter vector;
      
      said step (b) comprises a step of calculating a normal-state likelihood for each speech sub-block based on said normal-state appearance probability;
      
      said step (c) comprises the steps of:
      
      (c-1) judging a speech block including a speech sub-block, for which a likelihood ratio of said emphasized-state likelihood to said normal-state likelihood is larger than a threshold, as a provisional portion;
      
      (c-2) calculating a total duration of provisional portions or a ratio of a total duration of whole portions to said total duration of provisional portions as a summarization ratio; and
      
      (c-3) deciding said provisional portions as portions to be summarized by calculating said threshold, at which a total duration of provisional portions is equal or approximate to a predetermined summarization time or said summarization ratio is equal or approximate to a predetermined summarization ratio.
  - 19. The method of claim 18 wherein said step (c-3) comprises:
    - (c-3-1) increasing said threshold, when said total duration of provisional portions is longer than said predetermined summarization time or said summarization ratio is smaller than said predetermined summarization ratio, and repeating said steps (c-1), (c-2) and (c-3); and
      
      (c-3-2) decreasing said threshold, when said total duration of provisional portions is shorter than said predetermined summarization time or said summarization ratio is larger than said predetermined summarization ratio, and repeating said steps (c-1), (c-2) and (c-3).
  - 20. The method of claim 17, wherein said codebook further stores a normal-state appearance probability for each speech parameter vector;
    - said step (a) comprises a step for obtaining a normal-state appearance probability for said speech parameter vector;
      
      said step (b) comprises a step of calculating a normal-state likelihood for each speech sub-block based on said normal-state appearance probability;
      
      said step (c) comprises the steps of:
      
      (c-1) calculating a likelihood ratio of said emphasized-state likelihood to said normal-state likelihood for each speech sub-block;
      
      (c-2) calculating a total duration by accumulating durations of each speech block including one of said speech sub-blocks, in decreasing order of said likelihood ratio; and
      
      (c-3) deciding said speech blocks as portions to be summarized, at which a total duration of provisional portions is equal or approximate to a predetermined summarization time or said summarization ratio is equal or approximate to a predetermined summarization ratio.
  - 21. A speech processing program for executing the method of any one of claims 1 to 18.
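The threshold adjustment recited in claims 18 and 19 can be sketched as a simple search: blocks that contain a sub-block whose likelihood ratio exceeds the threshold become provisional portions, and the threshold is raised when their total duration is too long and lowered when it is too short. The sketch below is only one possible reading. The `SpeechBlock` type, the `fit_threshold` name, the tolerance, and the interval-halving step are assumptions; the claims state only that the threshold is increased or decreased and the steps repeated.

```python
from dataclasses import dataclass


@dataclass
class SpeechBlock:
    duration: float          # length of the speech block in seconds
    sub_block_ratios: list   # (log-)likelihood ratio of each speech sub-block


def provisional_portions(blocks, threshold):
    """Blocks containing at least one sub-block whose ratio exceeds the threshold."""
    return [b for b in blocks if any(r > threshold for r in b.sub_block_ratios)]


def fit_threshold(blocks, target_time, tolerance=0.5, max_iter=50):
    """Adjust the threshold until the selected duration is close to target_time.

    Assumes `blocks` is non-empty and every block has at least one sub-block.
    """
    ratios = [r for b in blocks for r in b.sub_block_ratios]
    low = min(ratios) - 1.0   # at this threshold every block is selected
    high = max(ratios)        # at this threshold no block is selected
    threshold = 0.5 * (low + high)
    selected = provisional_portions(blocks, threshold)
    for _ in range(max_iter):
        total = sum(b.duration for b in selected)
        if abs(total - target_time) <= tolerance:
            break
        if total > target_time:
            low = threshold    # summary too long: increase the threshold
        else:
            high = threshold   # summary too short: decrease the threshold
        threshold = 0.5 * (low + high)
        selected = provisional_portions(blocks, threshold)
    return threshold, selected
```

Because the selected duration can only shrink as the threshold grows, halving the search interval converges on a threshold whose summary length is as close to the target as the block durations allow; an exact match is not guaranteed, which is consistent with the "equal or approximate" wording of claim 18.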

22. A speech processing apparatus for deciding whether input speech is emphasized or not based on a set of speech parameters for each frame of said input speech, said apparatus comprising:
- a codebook which stores, for each code, a speech parameter vector and an emphasized-state appearance probability, each of said speech parameter vectors including at least a fundamental frequency, a power and a temporal variation of a dynamic-measure or an inter-frame difference in each of the parameters;
  
  an emphasized-state likelihood calculating part for calculating an emphasized-state likelihood of a portion including a current frame based on said emphasized-state appearance probability; and
  
  an emphasized state deciding part for deciding whether said portion including said current frame is emphasized or not based on said calculated emphasized-state likelihood.
- - 23. The apparatus of claim 22, wherein said emphasized-state deciding part includes emphasized state deciding means for determining whether said emphasized-state likelihood is higher than a predetermined value, and if so, deciding that said portion including said current frame is emphasized.
  - 24. The apparatus of claim 23, further comprising:
    - an unvoiced portion deciding part for deciding whether each frame of said input speech is an unvoiced portion;
      
      a voiced portion deciding part for deciding whether each frame of said input speech is a voiced portion;
      
      a speech sub-block deciding part for deciding that said portion including said current frame preceded and succeeded by more than a predetermined number of unvoiced portions and including said voiced portion is a speech sub-block;
      
      a speech block deciding part for deciding that when the average power of said voiced portion of one or more frames included in said speech sub-block is smaller than a constant-multiplied value of the average power of said speech sub-block, a speech sub-block group which ends with said speech sub-block is a speech block; and
      
      a summarized portion output part for deciding that a speech block including said speech sub-block decided as emphasized by said emphasized state deciding part is a summarized portion and outputting said speech block as a summarized portion.
  - 25. The apparatus of claim 24, wherein said codebook has further stored therein a normal-state appearance probability of the speech parameter vector corresponding to said each code, said apparatus further comprising:
    - a normal-state likelihood calculating part for calculating the normal-state likelihood of each speech sub-block based on the normal-state appearance probability of the corresponding speech parameter vectors each obtained by quantizing a set of speech parameters of each frame in said speech sub-blocks; and
      
      said emphasized state deciding part including:
      
      a provisionally summarized portion deciding part for deciding that a speech block including a speech sub-block is a provisionally summarized portion if a likelihood ratio of the emphasized-state likelihood of said speech sub-block to its normal-state likelihood is higher than a reference value; and
      
      a summarized portion deciding part for calculating the total amount of time of said provisionally summarized portions, or as the summarization rate, the overall time of the entire portion of said input speech to said total amount of time of said provisionally summarized portions, for calculating said reference value on the basis of which the total amount of time of said provisionally summarized portions becomes substantially equal to a predetermined value or said summarization rate becomes substantially equal to a predetermined value, and for determining said provisionally summarized portions as summarized portions.
  - 26. The apparatus of claim 24, wherein said codebook has further stored therein a normal-state appearance probability of the speech parameter vector corresponding to said each code, said apparatus further comprising:
    - a normal-state likelihood calculating part for calculating a normal-state likelihood of said each speech sub-block based on the normal-state appearance probability of the corresponding speech parameter vector obtained by quantizing a set of speech parameters of each frame in each of said speech sub-blocks; and
      
      said emphasized state deciding part including:
      
      a provisionally summarized portion deciding part for calculating the likelihood ratio of said emphasized-state likelihood of each speech sub-block to its normal-state likelihood and for provisionally deciding that each speech block including speech sub-blocks of likelihood ratios down to a predetermined likelihood ratio in descending order is a provisionally summarized portion; and
      
      a summarized portion deciding part for calculating the total amount of time of provisionally summarized portions, or as the summarization rate, said total amount of time of said provisionally summarized portions to the overall time of the entire portion of said input speech, for calculating said predetermined likelihood ratio on the basis of which the total amount of time of said provisionally summarized portions becomes substantially equal to a predetermined value or said summarization rate becomes substantially equal to a predetermined value, and for determining a summarization portion.
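The speech sub-block deciding part of claim 24 can be illustrated with a short Python sketch that groups frames into sub-blocks from per-frame voiced/unvoiced flags. The function name `split_sub_blocks`, the `min_unvoiced` parameter, and the choice to trim each sub-block to its first and last voiced frame are assumptions for illustration. As a simplification, a sub-block at the very start or end of the input is accepted even when it is not bounded by a long unvoiced run on that side.

```python
def split_sub_blocks(voiced, min_unvoiced=3):
    """Return (start, end) frame ranges, end exclusive, of speech sub-blocks.

    `voiced` is a sequence of truthy/falsy per-frame flags. A run of more than
    `min_unvoiced` consecutive unvoiced frames closes the open sub-block;
    shorter unvoiced gaps stay inside the sub-block.
    """
    sub_blocks = []
    start = None         # first voiced frame of the open sub-block, if any
    last_voiced = None   # most recent voiced frame seen in the open sub-block
    unvoiced_run = 0     # length of the current run of unvoiced frames
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = i
            last_voiced = i
            unvoiced_run = 0
        else:
            unvoiced_run += 1
            if start is not None and unvoiced_run > min_unvoiced:
                sub_blocks.append((start, last_voiced + 1))
                start = None
    if start is not None:
        # Input ended while a sub-block was still open.
        sub_blocks.append((start, last_voiced + 1))
    return sub_blocks
```

Each returned range could then be scored as a unit, for example by accumulating per-frame appearance probabilities over the frames it covers, before speech sub-blocks are grouped into speech blocks by the average-power test.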

Current Assignee: Nippon Telegraph and Telephone Corporation
Original Assignee: Nippon Telegraph and Telephone Corporation
Inventors: Mizuno, Osamu; Nakajima, Shinya; Hidaka, Kota; Kojima, Haruhiko; Kuwano, Hidetaka
Application Number: US10/214,232
Publication Number: US 20030055634A1
US Class Current: 704/222
CPC Class Codes: G10L 25/00 Speech or voice analysis te...
