Efficient determination of sample size to facilitate building a statistical model
Abstract
A model is constructed for an initial subset of the data using a first parameter estimation algorithm. The model may be evaluated, for example, by applying the model to a holdout data set of the data. If the model is not acceptable, additional data is added to the data subset and the first parameter estimation algorithm is repeated for the aggregate data subset. An appropriate subset of the data exists when the first parameter estimation algorithm produces an acceptable model. The appropriate subset of the data may then be employed by a second parameter estimation algorithm, which may be a more accurate version of the first algorithm or a different algorithm altogether, to build a statistical model to characterize the data.
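The procedure in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: one-dimensional k-means stands in for both parameter estimation algorithms (a few iterations for the rough model, many for the refined one), the holdout score is mean squared distance to the nearest center, and the stopping rule is a simple relative-gain threshold. All function names, parameters, and default values are hypothetical.

```python
import random

def kmeans_1d(points, k, max_iter, init=None):
    """Lloyd's algorithm in one dimension; returns the k centers, sorted."""
    if init is not None:
        centers = list(init)
    else:
        # Assumed initialization: evenly spaced order statistics of the data.
        centers = sorted(points)[:: max(1, len(points) // k)][:k]
    for _ in range(max_iter):
        sums, counts = [0.0] * k, [0] * k
        for x in points:
            j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            sums[j] += x
            counts[j] += 1
        # An empty cluster keeps its previous center.
        new = [sums[j] / counts[j] if counts[j] else centers[j] for j in range(k)]
        converged = new == centers
        centers = new
        if converged:
            break
    return sorted(centers)

def mean_sq_dist(centers, points):
    """Mean squared distance to the nearest center (lower is better)."""
    return sum(min((x - c) ** 2 for c in centers) for x in points) / len(points)

def pick_subset_size(train, holdout, k=2, start=20, growth=2,
                     min_gain=0.01, rough_iters=2):
    """Grow the training subset until the rough model stops improving on holdout."""
    n = min(start, len(train))
    prev_score = None
    while True:
        rough = kmeans_1d(train[:n], k, max_iter=rough_iters)   # first algorithm
        score = mean_sq_dist(rough, holdout)                     # evaluate on holdout
        enough = prev_score is not None and prev_score - score < min_gain * prev_score
        if enough or n == len(train):
            return n, rough
        prev_score = score
        n = min(n * growth, len(train))                          # add more data

# Synthetic data: two well-separated groups, split into holdout and training sets.
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(300)] + [rng.gauss(10, 1) for _ in range(300)]
rng.shuffle(data)
holdout, train = data[:100], data[100:]

n, rough = pick_subset_size(train, holdout)
# Second algorithm: same estimator run much longer, only on the chosen subset.
refined = kmeans_1d(train[:n], 2, max_iter=200, init=rough)
print("subset size:", n, "rough:", rough, "refined:", refined)
```

The expensive refinement runs once, on the subset size selected by the cheap rough fits, which is the cost saving the abstract describes.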
58 Claims
1. A computer implemented system that facilitates building a statistical model for a computer readable data set, comprising:
a first training method that efficiently builds a rough statistical model from a subset of the computer readable data set capable of statistical characterization;
an evaluation component that evaluates the rough statistical model to determine whether the subset of the computer readable data set is an appropriate subset to be utilized to build a refined statistical model for the computer readable data set based at least in part on stopping criterion to facilitate reducing cost of clustering data relative to the computer readable data set;
a second training method that builds the refined statistical model for the computer readable data set from the subset if the subset is deemed appropriate by the evaluation component, the refined statistical model provides a more accurate modeling of the subset than the rough statistical model and facilitates determining good clustering of data for a fixed number of clusters based at least in part on predefined accuracy criteria to facilitate clustering of data relative to the computer readable data set, wherein the clustered data is provided; and
a data scheduler that, based at least in part on a data policy, adaptively controls the size of subsets for which the first training method is applied to facilitate building the refined statistical model.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17

18. A computer implemented system programmed to facilitate building a statistical model, comprising:
a first parameter estimation protocol that efficiently builds a rough statistical model from a subset of a computer readable data set based at least in part on a training policy associated therewith, the computer readable data set is statistically characterizable;
an evaluation component that determines whether the subset of data from which the rough statistical model was built is an acceptable size for building the statistical model to characterize the data set, the evaluation component utilizes a stopping criterion that is functionally related to an expected incremental benefit and an expected incremental cost associated with increasing the size of the subset of data to facilitate determining whether the rough statistical model is an acceptable size and to facilitate reducing cost of clustering data relative to the computer readable data set; and
a second parameter estimation protocol that builds a refined statistical model for the data set from the subset if determined to have the acceptable size, the second parameter estimation protocol having an associated training policy, which enables the second parameter estimation protocol to build the refined statistical model to be a more accurate statistical model than the first parameter estimation protocol, the refined statistical model employed to identify clusters of data within the computer readable data set to facilitate clustering data relative to the computer readable data set, wherein the clustered data is provided.
Dependent claims: 19, 20, 21, 22, 23, 24, 25, 26, 27

28. A computer implemented learning curve method to facilitate building a statistical model, comprising:
choosing a subset of a computer readable data set that can be characterized statistically;
employing a first training method to build a rough statistical model to characterize the subset;
evaluating the rough statistical model for acceptability;
if the rough statistical model is unacceptable, repeatedly increasing the size of the subset of data to provide an aggregate data set, building another rough statistical model to characterize the aggregate subset, and reevaluating the other rough statistical model, the acceptability of each rough statistical model based at least in part on a stopping criterion functionally related to an expected incremental benefit and an expected incremental cost associated with increasing the size of the aggregate subset in order to facilitate reducing cost associated with clustering data relative to the computer readable data set; and
if the rough statistical model is acceptable, employing a second training method to build a refined statistical model based at least in part on the aggregate data set, the second training method being different from the first training method, the refined statistical model identifies data clusters contained in the computer readable data set to facilitate clustering of data relative to the computer readable data set, wherein the clustered data is provided.
Dependent claims: 29, 30, 31, 32, 33, 34, 35, 36, 37, 38

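Several of the claims tie the stopping criterion to an expected incremental benefit weighed against an expected incremental cost of enlarging the subset. The claims do not fix a formula, so the sketch below is only one hypothetical reading: it extrapolates the most recent slope of the holdout score to estimate the benefit of the next increment and charges a fixed cost per added record. The function name, its parameters, and the linear extrapolation are all assumptions made for illustration.

```python
def should_stop(scores, sizes, next_size, cost_per_record, value_per_score_unit):
    """Return True when the expected cost of growing the subset meets or
    exceeds the expected benefit.

    scores: holdout scores so far (higher is better), one per subset size.
    sizes:  the subset sizes that produced those scores, in increasing order.
    """
    if len(scores) < 2:
        return False  # no learning-curve slope to extrapolate yet
    # Assumed benefit model: the last observed improvement per record
    # continues over the next increment (never negative).
    slope = (scores[-1] - scores[-2]) / (sizes[-1] - sizes[-2])
    added = next_size - sizes[-1]
    expected_benefit = value_per_score_unit * max(slope, 0.0) * added
    # Assumed cost model: cost grows linearly with the records added.
    expected_cost = cost_per_record * added
    return expected_cost >= expected_benefit

# Early on the learning curve is steep, so growing the subset is worthwhile.
early = should_stop([-2.0, -1.5], [100, 200], 400, 0.001, 1.0)
# Later the curve flattens and the added cost dominates.
late = should_stop([-1.5, -1.499], [200, 400], 800, 0.001, 1.0)
print(early, late)
```

A steeper learning curve or a cheaper per-record cost delays the stop, which matches the intent of trading model quality against training expense.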
39. A computer-readable medium having computer-executable instructions for:
choosing a subset of a computer readable data set;
building a rough statistical model to characterize the subset based at least in part on an associated training policy;
evaluating the rough statistical model for acceptability;
if the rough statistical model is unacceptable, repeatedly increasing the size of the subset of data to provide an aggregate data set, building a rough statistical model to characterize the aggregate subset based at least in part on an associated training policy, and reevaluating the rough statistical model;
building a refined statistical model for the computer readable data set from the aggregate data set if the rough statistical model is determined to be acceptable based at least in part on an associated training policy that includes determining acceptability based at least in part on an expected incremental benefit relative to an expected incremental cost associated with increasing the size of the aggregate data set in order to facilitate reducing cost associated with clustering data relative to the computer readable data set, the refined statistical model more accurately characterizes the aggregate data set; and
utilizing the refined statistical model to identify identifiable clusters in the computer readable data set to facilitate clustering data relative to the computer readable data set wherein the clustered data is provided.

40. A computer implemented method to facilitate constructing a statistical model, comprising:
separating computer readable data on a computer readable medium into holdout data set and training data set;
determining a data subset from the training data set by estimating statistical model parameters according to a first training policy and evaluating the estimated statistical model parameters relative to the holdout data set and repeating the estimation and evaluation of statistical model parameters with a larger subset of the training data set until an acceptable quality of the estimated statistical model is established to facilitate reducing cost associated with characterizing clusters relative to the computer readable data;
controlling parameter initialization employed in each estimation of statistical model parameters repeatedly until an acceptable size for the determined data subset is achieved; and
subsequent to establishing the acceptable quality of the estimated statistical model, using the determined data subset to improve the estimated statistical model parameters by employing a second training policy that is more accurate than the first training policy, the estimated model parameters obtained from employment of the second training policy utilized to characterize at least one cluster within the computer readable data to facilitate clustering data relative to the computer readable data, wherein the clustered data is provided.
Dependent claims: 41, 42, 43, 44, 45, 46, 47

48. A computer-readable medium having computer-executable instructions for:
separating computer readable data into a holdout data set and a training data set, the computer readable data is statistically characterizable;
determining a data subset from the training data set by estimating model parameters and controlling model parameter initialization according to a first training policy and evaluating the estimated model parameters relative to the holdout data set and repeating the estimation, initialization, and evaluation of model parameters with a next successively larger subset of the training data set until an acceptable quality of the estimated model is established to facilitate reducing cost associated with clustering data relative to the computer readable data;
subsequent to establishing the acceptable quality of the estimated model, using the determined data subset to improve the estimated model parameters by employing a second training policy that is more accurate than the first training policy; and
utilizing the estimated model parameters determined by utilization of the second training policy to identify a cluster in the computer readable data to facilitate clustering data relative to the computer readable data, wherein the clustered data is provided.

49. A computer implemented method to facilitate constructing a statistical model, comprising:
separating computer readable data into a holdout data set and a training data set, the computer readable data is statistically characterizable;
iteratively estimating statistical model parameters for a subset of the training data set over a fixed number of iterations and evaluating the estimated statistical model parameters relative to the holdout data set;
repeating the estimation and evaluation of statistical model parameters obtained with successively larger subsets of the training data set until an acceptable model quality is established, acceptable model quality determined based at least in part on an expected incremental benefit relative to an expected incremental detriment associated with an increase in size of each larger training subset of the data set in order to facilitate reducing cost associated with clustering data relative to the computer readable data;
after the acceptable model quality is established, iteratively estimating statistical model parameters for the data subset, which provided the acceptable model quality, until a better quality of model is provided relative to a preceding estimation performed over the fixed number of iterations; and
using the better quality model relative to the computer readable data to identify at least a cluster of data within the computer readable data to facilitate clustering data relative to the computer readable data, wherein the at least a cluster of data is provided.
Dependent claims: 50, 51, 52, 53, 54, 55

56. A computer implemented method to facilitate constructing a statistical model, comprising:
separating computer readable data into a holdout data set and a training data set, the computer readable data is statistically characterizable;
iteratively estimating statistical model parameters for a subset of the training data set until a first convergence threshold is satisfied and evaluating the estimated statistical model parameters relative to the holdout data set;
repeating the estimation and evaluation of statistical model parameters obtained with successively larger subsets of the training data set until determining a size of data subset that provides acceptable statistical model parameters, acceptable statistical model parameters attained where the expected marginal cost outweighs the expected marginal benefit associated with successively larger subsets in order to facilitate reducing cost associated with clustering data relative to the computer readable data;
after determining the size of data subset that provides acceptable statistical model parameters, iteratively estimating statistical model parameters for a data subset of the acceptable size until a second convergence threshold is satisfied, the second convergence threshold being less than the first convergence threshold; and
based at least in part on the estimated statistical model parameters identified at the second convergence threshold, identifying a good clustering data relative to the computer readable data to facilitate clustering data, wherein the clustered data is provided.

57. A computer implemented system to facilitate building a statistical model for a computer readable data set, comprising:
first means for building a rough statistical model to characterize a subset of the computer readable data set;
means for evaluating the acceptability of the rough statistical model based at least in part on an expectational cost-benefit analysis to facilitate reducing cost associated with clustering data relative to the computer readable data set, the first means building another rough statistical model for a larger subset of the data set if the evaluation means determines that a prior rough statistical model is unacceptable;
second means, which is different from the first means, for building a refined statistical model from an aggregate subset of data that yielded the rough statistical model deemed acceptable by the evaluation means; and
means for identifying a cluster of data within the computer readable data set based at least in part on the refined statistical model to facilitate clustering data relative to the computer readable data set, wherein the clustered data is provided.

58. A computer implemented system to facilitate building a statistical model for a computer readable data set, comprising:
first means for estimating statistical model parameters from a subset of the computer readable data set, the data set is statistically characterizable;
means for evaluating the estimated statistical model parameters relative to a holdout data set of the data set;
means for determining a data subset from the training data set by causing the first means and the means for evaluating to respectively repeat estimation and evaluation of statistical model parameters with a next successively larger subset of the training data set until an acceptable quality of the statistical model parameters is established, the quality of the statistical model parameters established when the expected cost of generating the next successively larger subset outweighs the expected benefit in accuracy of utilizing the next successively larger subset in order to facilitate reducing cost associated with clustering data relative to the computer readable data set;
second means for estimating statistical model parameters based at least in part on the determined data subset to provide a more accurate estimation of model parameters than the first means;
means for setting parameters associated with cluster weights of a cluster of data; and
means for determining the cluster of data contained in the computer readable data set based at least in part on the more accurate estimation of statistical model parameters to facilitate clustering data relative to the computer readable data set, wherein the clustered data is provided.

Specification