System and method for pruning a set of symbol-based sequences by relaxing an independence assumption of the sequences
First Claim
1. A sequence pruning method, comprising:
- representing a set of sequences in a data structure, each sequence in the set of sequences including a first symbol and a context of at least one symbol, a subset of the sequences in the set of sequences each being associated with a respective conditional probability that is based on observations of the respective sequence in training data;
computing a value of a scoring function for each sequence in the set of represented sequences, the scoring function taking into account the conditional probability for the sequence and a probability distribution of each symbol in the sequence if the respective sequence is removed from the set of sequences, the conditional probability for each sequence in the set of sequences that is not in the subset of sequences being computed as a function of the probability of the respective symbol in a back-off context;
iteratively pruning the set of sequences, comprising;
selecting one of the represented sequences to be removed, the selection being based on the computed scoring function values; and
updating the scoring function values of at least one of the remaining ones of the sequences;
outputting a set of remaining sequences; and
wherein at least one of the representing, computing, pruning, and updating is performed with a processor.
6 Assignments
0 Petitions
Accused Products
Abstract
A pruning method includes representing a set of sequences in a data structure. Each sequence s includes a first symbol w and a context c of at least one symbol. Some of the sequences are associated with a conditional probability p(w|c), based on observations of cw in training data. For others, p(w|c) is computed as a function of the probability p(w|ĉ) of the respective symbol w in a back-off context ĉ, p(w|ĉ) being based on observations of sequence ĉw in the training data. A scoring function ƒ(cw) value is computed for each sequence in the set, based on p(w|c) for the sequence and a probability distribution p(s) of each symbol in the sequence if it is removed from the set of sequences. Iteratively, one of the represented sequences is selected to be removed, based on the computed scoring function values, and the scoring function values of remaining sequences are updated.
54 Citations
21 Claims
-
1. A sequence pruning method, comprising:
-
representing a set of sequences in a data structure, each sequence in the set of sequences including a first symbol and a context of at least one symbol, a subset of the sequences in the set of sequences each being associated with a respective conditional probability that is based on observations of the respective sequence in training data; computing a value of a scoring function for each sequence in the set of represented sequences, the scoring function taking into account the conditional probability for the sequence and a probability distribution of each symbol in the sequence if the respective sequence is removed from the set of sequences, the conditional probability for each sequence in the set of sequences that is not in the subset of sequences being computed as a function of the probability of the respective symbol in a back-off context; iteratively pruning the set of sequences, comprising; selecting one of the represented sequences to be removed, the selection being based on the computed scoring function values; and updating the scoring function values of at least one of the remaining ones of the sequences; outputting a set of remaining sequences; and wherein at least one of the representing, computing, pruning, and updating is performed with a processor. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
-
-
19. A sequence pruning system, comprising:
-
a data structure representing a set of sequences, each sequence in the set of sequences including a first symbol and a context of at least one symbol, a subset of the sequences in the set of sequences each being associated with a respective conditional probability that is based on observations of the respective sequence in training data; a scoring function computation component which computes a value of a scoring function for each sequence in the set of represented sequences, the scoring function taking into account the conditional probability for the sequence and a probability distribution of each symbol in the sequence if the respective sequence is removed from the set of sequences, the conditional probability for each sequence in the set of sequences that is not in the subset of sequences being computed as a function of the probability of the respective symbol in a back-off context; a sequence selecting component which selects a next one of the represented sequences to be removed, the selection being based on the computed scoring function values; and an update component which updates the scoring function values of remaining ones of the sequences prior to the sequence selecting component selecting another next one of the represented sequences to be removed; an output component which outputs a set of remaining sequences; and a processor which implements the components. - View Dependent Claims (20)
-
-
21. A sequence pruning method, comprising:
-
receiving a set of sequences, each received sequence in the set of sequences including at least one symbol and a context, wherein for at least some of the received sequences, the context comprises at least one respective preceding symbol, each of the received sequences in the set of sequences being associated with a respective conditional probability that is based on observations of the sequence in a corpus; representing each of the received sequences as a respective node in a tree structure of nodes in which each of the nodes in the tree structure is directly linked to no more than one node in the tree structure that represents a direct ancestor and wherein at least some of the nodes in the tree structure represent back-off sequences in which the context of a linked node in the tree structure is reduced by at least one symbol; computing a value of a scoring function for at least some of the sequences in the tree structure based on respective conditional probabilities, a conditional probability for sequences represented in the tree structure that are not in the set of received sequences being computed as a function of a respective symbol in its back-off context; selecting one of the represented sequences to be removed, based on the computed scoring function values, the one of the represented sequences being selected from represented sequences for which there are no remaining sequences represented in the tree structure that consist of at least one symbol and the same sequence as the selected sequence, which precedes the at least one symbol; updating a value of a scoring function for at least one remaining one of the sequences, whose scoring function value is changed by the removal of the selecting one of the represented sequences; after the updating, repeating the selecting of one of the represented sequences from a remaining set of represented sequences; outputting a set of the remaining sequences; and wherein at least one of the representing, computing, selecting, and updating is performed with a processor.
-
Specification