System and method for joint optimization of language model performance and size
Abstract
A method for the joint optimization of language model performance and size is presented, comprising: developing a language model from a tuning set of information; segmenting at least a subset of a received textual corpus and calculating a perplexity value for each segment; and refining the language model with one or more segments of the received corpus based, at least in part, on the calculated perplexity value for the one or more segments.
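The selection criterion throughout is perplexity against a seed model: segments the seed model predicts well (low perplexity) are kept. A minimal sketch, assuming a unigram seed model with additive smoothing; the `perplexity` function, its `alpha` parameter, and the toy counts are illustrative assumptions, not the patent's implementation:

```python
import math

def perplexity(segment, model, vocab_size, alpha=0.5):
    """Per-token perplexity of `segment` under a unigram seed model.

    `model` maps token -> count; additive smoothing with `alpha`
    handles unseen tokens. Lower perplexity means the segment is
    closer to the seed model's distribution."""
    total = sum(model.values())
    log_prob = 0.0
    for token in segment:
        p = (model.get(token, 0) + alpha) / (total + alpha * vocab_size)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / len(segment))

# A segment sharing the seed model's vocabulary scores lower than one that doesn't.
seed = {"speech": 4, "recognition": 3, "model": 5}
in_domain = perplexity(["speech", "model", "recognition"], seed, vocab_size=1000)
out_domain = perplexity(["quarterly", "earnings", "report"], seed, vocab_size=1000)
assert in_domain < out_domain
```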
74 Citations
32 Claims
1. A method of using a tuning set of information to jointly optimize the performance and size of a language model, comprising:

- providing a textual corpus comprising subsets, wherein each subset comprises a plurality of items;
- creating a Dynamic Order Markov Model data structure by assigning each item of the plurality of items to a node in the data structure, wherein the nodes are logically coupled to denote dependencies of the items, and calculating a frequency of occurrence for each item of the plurality of items;
- segmenting at least a subset of the received textual corpus into segments by clustering every N items of the received corpus into a training unit, wherein the resultant training units are separated by gaps, and wherein N is an empirically derived value based, at least in part, on the size of the received corpus;
- creating the tuning set from application-specific information;
- (a) training a seed model via the tuning set;
- (b) calculating a similarity within a sequence of the training units on either side of each of the gaps;
- (c) selecting segment boundaries that maximize intra-segment similarity and inter-segment disparity;
- (d) calculating a perplexity value for each segment based on a comparison with the seed model;
- (e) selecting some of the segments based on their respective perplexity values to augment the tuning set;
- iteratively refining the tuning set and the seed model by repeating steps (a) through (e) until a threshold is reached;
- refining the language model based on the seed model;
- generating the language model that is representative of the textual corpus for use by a host of applications; and
- providing recognition of the textual corpus based on the language model.

Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16.
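Steps (a) through (e) above form a bootstrap loop: a seed model trained on the tuning set ranks corpus segments by perplexity, the lowest-perplexity segments augment the tuning set, and the seed model is retrained. A minimal sketch, where `train`, `score`, `batch`, and `max_iters` (standing in for the claim's threshold) are hypothetical stand-ins, not the patent's routines:

```python
def refine_tuning_set(tuning_set, segments, train, score, batch=2, max_iters=5):
    """Iterative tuning-set refinement, per steps (a)-(e) of claim 1.

    `train(texts)` builds a seed model; `score(model, seg)` returns a
    perplexity-like value (lower = closer to the tuning set).
    `max_iters` stands in for the claim's stopping threshold."""
    pool = list(segments)
    seed = train(tuning_set)                          # (a) train seed model
    for _ in range(max_iters):
        if not pool:
            break
        pool.sort(key=lambda seg: score(seed, seg))   # (d) perplexity per segment
        picked, pool = pool[:batch], pool[batch:]     # (e) lowest-perplexity segments
        tuning_set = tuning_set + picked              # augment the tuning set
        seed = train(tuning_set)                      # retrain, then repeat
    return tuning_set, seed

# Toy stand-ins: the seed "model" is a vocabulary set; score = out-of-vocabulary rate.
train = lambda texts: {w for t in texts for w in t.split()}
score = lambda model, seg: sum(w not in model for w in seg.split()) / len(seg.split())

tuning, seed = refine_tuning_set(
    ["speech model"],
    ["speech model training", "stock market news"],
    train, score, batch=1, max_iters=1)
# The in-domain segment is absorbed; the out-of-domain one is left in the pool.
assert tuning == ["speech model", "speech model training"]
```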
17. A modeling agent comprising:

- a controller, to receive invocation requests to develop a language model from a textual corpus comprising subsets, wherein each subset comprises a plurality of items, and to calculate a frequency of occurrence for each item of the plurality of items; and
- a data structure generator, responsive to the controller, to:
  - create a Dynamic Order Markov Model data structure by assigning each item of the plurality of items to a node in the data structure, wherein the nodes are logically coupled to denote dependencies of the items;
  - develop a seed model from a tuning set of information;
  - segment at least a subset of a received corpus, wherein the segments of the received corpus are a clustering of every N items of the received corpus into a training unit, wherein N is an empirically derived value based, at least in part, on the size of the received corpus, and the training units are separated by gaps;
  - calculate the similarity within a sequence of training units on either side of each of the gaps;
  - select segment boundaries that improve intra-segment similarity and inter-segment disparity;
  - calculate a perplexity value for each segment;
  - refine the seed model with one or more segments of the received corpus based, at least in part, on the calculated perplexity values;
  - iteratively refine the tuning set with segments ranked by the seed model and in turn iteratively update the seed model via the refined tuning set;
  - filter the received corpus via the seed model to find low-perplexity segments;
  - train the language model via the low-perplexity segments;
  - generate the language model that is representative of the textual corpus for use by a host of applications; and
  - provide recognition of the textual corpus based on the language model.

Dependent claims: 18, 19, 20, 21, 22, 23, 24, 25.
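The claimed data structure assigns each item to a node, couples nodes to record dependencies among items, and stores a per-item frequency of occurrence. One speculative way to render that is a trie of variable-length contexts; `DOMMNode`, `build`, and `max_order` are assumptions for illustration, not the patent's definitions:

```python
from collections import defaultdict

class DOMMNode:
    """One node per item; `children` encodes the dependency coupling
    between items, and `count` holds the frequency of occurrence."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = defaultdict(DOMMNode)

def build(corpus, max_order=3):
    """Insert every context of up to `max_order` items, incrementing
    the frequency count at each node along the path."""
    root = DOMMNode()
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(len(tokens)):
            node = root
            for tok in tokens[i:i + max_order]:
                node = node.children[tok]
                node.count += 1
    return root

root = build(["the cat sat", "the cat ran"])
# "the" -> "cat" occurs in both sentences, so the coupled node counts 2.
assert root.children["the"].children["cat"].count == 2
```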
26. A method of jointly optimizing the performance and size of a language model, comprising:

- providing a textual corpus comprising subsets, wherein each subset comprises a plurality of items;
- creating a Dynamic Order Markov Model data structure by assigning each item of the plurality of items to a node in the data structure, wherein the nodes are logically coupled to denote dependencies of the items, and calculating a frequency of occurrence for each item of the plurality of items;
- segmenting one or more relatively large language corpora into multiple segments of N items, wherein N is an empirically derived value based, at least in part, on the size of the received corpus;
- selecting an initial tuning sample of application-specific data, the initial tuning sample being relatively small in comparison to the one or more relatively large language corpora, wherein the initial tuning sample is used for training a seed model, the seed model to be used for ranking the multiple segments from the language corpora;
- iteratively training the seed model to obtain a mature seed model, wherein the iterative training proceeds until a threshold is reached, each iteration of the training including:
  - updating the seed model according to the tuning sample;
  - ranking each of the multiple segments according to a perplexity comparison with the seed model;
  - selecting some of the multiple segments that possess a low perplexity; and
  - augmenting the tuning sample with the selected segments;
- once the threshold is reached, filtering the language corpora according to the mature seed model to select low-perplexity segments;
- combining data from the low-perplexity segments;
- training the language model according to the combined data;
- generating the language model that is representative of the textual corpus for use by a host of applications; and
- providing recognition of the textual corpus based on the language model.

Dependent claims: 27, 28, 29, 30, 31, 32.
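The tail of claim 26 (filtering the corpora through the mature seed model, combining the surviving low-perplexity segments, and training the final model on the result) reduces to a threshold filter over segment scores. A sketch with hypothetical `score` and `threshold` stand-ins:

```python
def filter_corpus(segments, mature_seed, score, threshold):
    """Keep only segments scoring below `threshold` under the mature
    seed model, then combine them as training data for the final
    language model. `score` and `threshold` are illustrative."""
    kept = [seg for seg in segments if score(mature_seed, seg) < threshold]
    return " ".join(kept)

# Toy usage: the seed "model" is a vocabulary; score = out-of-vocabulary rate.
vocab = {"speech", "model", "training"}
oov = lambda model, seg: sum(w not in model for w in seg.split()) / len(seg.split())
data = filter_corpus(["speech model training", "stock market news"], vocab, oov, 0.5)
assert data == "speech model training"
```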
Specification