×

Apparatus for generating a statistical sequence model called class bi-multigram model with bigram dependencies assumed between adjacent sequences

  • US 6,314,399 B1
  • Filed: 04/13/1999
  • Issued: 11/06/2001
  • Est. Priority Date: 06/12/1998
  • Status: Active Grant
First Claim
Patent Images

1. An apparatus for generating a statistical class sequence model called a class bi-multigram model from input strings of discrete-valued units, where bigram dependencies are assumed between adjacent multigrams, where each said multigram is a variable length sequence of maximum length N units, and where class labels are assigned to the sequences, said apparatus comprising:

  • initialization means for taking as an input a training string of units, registering in an inventory of sequences all the combinations of 1 to N units occurring in the input training string, counting the number of times all sequences of units occur and the number of times all pairs of sequences of units co-occur in the input training strings of units, computing an initial bigram probability distribution of all the pairs of sequences as the counted number of times the two sequences co-occur divided by the counted number of times the first sequence occurs in the input training string, and outputting the inventory of sequences and the initial bigram probability distribution of the sequences in the inventory;

    classification means for taking as an input the inventory of sequences and the bigram probability distribution of the sequences in the inventory, classifying the input sequences into a pre-specified desired number of classes, by first assigning each sequence to its own class, and then repeatedly updating a class conditional probability distribution of the sequences and the bigram probability distribution of the classes and merging the pairs of classes for which a loss in mutual information computed with the current class probability distributions is minimal, until a desired number of classes is obtained, and outputting the inventory of sequences with the class label assigned to each sequence, the class conditional probability distribution of the sequences, and the bigram probability distribution of the classes;

    reestimation means for taking as an input the training string of units, the inventory of sequences with the class label assigned to each sequence, the current class conditional probability distribution of the sequences, and the current bigram probability distribution of the classes which are outputted from said classification means, calculating an estimate of the bigram probability distribution of the sequences by using an EM algorithm to maximize the likelihood of the input training string computed with associated input probability distributions, and outputting the inventory of sequences with the bigram probability distribution of the sequences, the process of said reestimation means being performed with a forward-backward algorithm, by using an equation where a bigram probability between a sequence to be processed and a preceding sequence is calculated from a forward likelihood of the input training string which can be taken forward in a time series, the class conditional probability of the sequences to be processed, the probability of the class of the sequence to be processed knowing the class of the preceding sequence, and a backward likelihood of the input training string which can be taken backward in the time series; and

    control means for controlling said classification means and said reestimation means to iteratively execute the process of said classification means and said reestimation means, the input of said classification means being, at a first iteration, the output of the said initialization means, and, during subsequent iterations, the output of said reestimation means, and the input of said reestimation means being the output of said classification means, until a predetermined ending condition is satisfied, thereby generating a statistical class sequence model.

View all claims
  • 3 Assignments
Timeline View
Assignment View
    ×
    ×