Speech processing system and method

  • US 8,620,655 B2
  • Filed: 08/10/2011
  • Issued: 12/31/2013
  • Est. Priority Date: 08/16/2010
  • Status: Expired due to Fees
First Claim

1. A speech processing method, comprising:

  • receiving a speech input which comprises a sequence of feature vectors;

    determining a likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:

    providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and

    adapting the acoustic model to the mismatched speech input, the speech processing method further comprising determining a likelihood of a sequence of features occurring in a given language using a language model; and

    combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal, wherein adapting the acoustic model to the mismatched speaker input comprises:

    relating speech from the mismatched speaker input to the speech used to train the acoustic model using:

    a mismatch function f for primarily modeling differences between the environment of the speaker and the environment under which the acoustic model was trained;

    and a speaker transform F for primarily modeling differences between the speaker of the mismatched speaker input, such that:

    y = f(F(x, v), u), where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modeling changes in the environment, and v represents at least one parameter used for mapping differences between speakers; and

    jointly estimating u and v, wherein said joint estimation of u and v is performed using the expectation maximization algorithm and comprises optimizing u and v in a single maximization step of said algorithm, wherein said at least one parameter u comprises parameters n and h, where n is used to model additive noise and h is used to model convolutional noise, and wherein said mismatch function f is of the form:
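The composition y = f(F(x, v), u) described in the claim can be illustrated with a minimal sketch. Note the assumptions: the claim's specific form of f is not reproduced above, so the code below uses the standard log-spectral-domain (VTS-style) mismatch function common in noise-robust speech recognition, and a CMLLR-style affine transform as the speaker transform F with v = (A, b). These concrete forms, and all function names, are illustrative, not the patented formulation.

```python
import numpy as np

def speaker_transform(x, A, b):
    """F(x, v): assumed CMLLR-style affine speaker transform, v = (A, b)."""
    return A @ x + b

def mismatch_function(x, n, h):
    """f(x, u): assumed log-spectral-domain mismatch with u = (n, h).
    n models additive noise, h models convolutional (channel) noise."""
    return x + h + np.log1p(np.exp(n - x - h))

def observed_speech(x, A, b, n, h):
    """y = f(F(x, v), u): map clean training speech x to mismatched y."""
    return mismatch_function(speaker_transform(x, A, b), n, h)
```

With this form, when the additive-noise term n is negligible the observation reduces to the speaker-transformed speech plus the channel term h, which matches the claim's division of labor: f primarily models the environment, F primarily models the speaker.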
