Efficient determination of sample size to facilitate building a statistical model

US 7,409,371 B1
Filed: 06/04/2001
Issued: 08/05/2008
Est. Priority Date: 06/04/2001
Status: Expired due to Fees



Abstract

A model is constructed for an initial subset of the data using a first parameter estimation algorithm. The model may be evaluated, for example, by applying the model to a holdout data set of the data. If the model is not acceptable, additional data is added to the data subset and the first parameter estimation algorithm is repeated for the aggregate data subset. An appropriate subset of the data exists when the first parameter estimation algorithm produces an acceptable model. The appropriate subset of the data may then be employed by a second parameter estimation algorithm, which may be a more accurate version of the first algorithm or a different algorithm altogether, to build a statistical model to characterize the data.
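The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the one-dimensional Gaussian model, the function names (`fit_rough`, `fit_refined`, `holdout_score`, `choose_subset_size`), the doubling schedule, and the `min_gain` acceptance rule are all hypothetical choices made for the example.

```python
import math
import random

def fit_rough(sample):
    """Cheap first estimator (assumption): fit the mean only, fix variance at 1."""
    return sum(sample) / len(sample), 1.0

def fit_refined(sample):
    """More accurate second estimator (assumption): fit mean and variance."""
    mean = sum(sample) / len(sample)
    var = sum((x - mean) ** 2 for x in sample) / len(sample)
    return mean, max(var, 1e-9)

def holdout_score(model, holdout):
    """Average Gaussian log-likelihood of the holdout points under the model."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in holdout) / len(holdout)

def choose_subset_size(train, holdout, start=50, growth=2, min_gain=1e-3):
    """Grow the subset until the rough model's holdout score gains < min_gain."""
    size = min(start, len(train))
    prev = None
    while True:
        current = holdout_score(fit_rough(train[:size]), holdout)
        if prev is not None and current - prev < min_gain:
            return size                      # rough model deemed acceptable
        if size == len(train):
            return size                      # no more data to add
        prev = current
        size = min(size * growth, len(train))

# Synthetic data, purely for demonstration.
rng = random.Random(0)
data = [rng.gauss(3.0, 2.0) for _ in range(5000)]
holdout, train = data[:1000], data[1000:]

size = choose_subset_size(train, holdout)    # first algorithm, repeated
model = fit_refined(train[:size])            # second algorithm, run once
```

Only the cheap estimator runs inside the loop; the more expensive one is applied once, to the subset the loop settles on.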


58 Claims

1. A computer implemented system that facilitates building a statistical model for a computer readable data set, comprising:
- a first training method that efficiently builds a rough statistical model from a subset of the computer readable data set capable of statistical characterization;
  
  an evaluation component that evaluates the rough statistical model to determine whether the subset of the computer readable data set is an appropriate subset to be utilized to build a refined statistical model for the computer readable data set based at least in part on stopping criterion to facilitate reducing cost of clustering data relative to the computer readable data set;
  
  a second training method that builds the refined statistical model for the computer readable data set from the subset if the subset is deemed appropriate by the evaluation component, the refined statistical model provides a more accurate modeling of the subset than the rough statistical model and facilitates determining good clustering of data for a fixed number of clusters based at least in part on predefined accuracy criteria to facilitate clustering of data relative to the computer readable data set, wherein the clustered data is provided; and
  
  a data scheduler that, based at least in part on a data policy, adaptively controls the size of subsets for which the first training method is applied to facilitate building the refined statistical model.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
- - 2. The system of claim 1, the data scheduler increases the size of the subset to provide a larger aggregate subset of the data set if the rough statistical model is unacceptable, the first training method efficiently builds the rough statistical model for each larger aggregate subset of the data until the evaluation component determines the resulting rough statistical model to be acceptable.
  - 3. The system of claim 2, the acceptability of each rough statistical model is determined based at least in part on a stopping criterion functionally related to an expected incremental benefit and a cost associated with increasing the size of the aggregate subset of the data set.
  - 4. The system of claim 3, the cost of the stopping criterion is functionally related to at least one of time associated with evaluating an aggregate data subset of increased size or size of the aggregated subset of the data.
  - 5. The system of claim 3, the stopping criterion is defined by
  - 6. The system of claim 3, the stopping criterion is defined by
  - 7. The system of claim 1, the first training method further comprises an iterative method, which builds the rough statistical model for the subset of the data set according to an associated training policy.
  - 8. The system of claim 7, the first training method further comprises an associated training policy that defines parameter initialization of the first training method for each subset of the data set.
  - 9. The system of claim 8, the training policy associated with the first training method further controls parameter initialization of the first training method, such that at least some of the parameters computed for a previous subset of the data are employed to initialize the first training method for a subsequent larger aggregate subset of the data.
  - 10. The system of claim 8, the first training method is initialized by the same parameter values for each subset of the data subset.
  - 11. The system of claim 8, the training policy sets the iterative method to perform a fixed number of at least one iteration.
  - 12. The system of claim 11, the training policy sets the iterative method to perform a single iteration.
  - 13. The system of claim 11, the second training method further comprises an iterative method that operates according to an associated training policy, so as to produce a more accurate statistical model for the appropriate subset of the data set than the first training method.
  - 14. The system of claim 13, the iterative method associated with at least one of the first or second training methods is an Expectation and Maximization method.
  - 15. The system of claim 7, the training policy associated with the iterative method of the first training method controls the iterative method to run until an associated convergence criterion is satisfied.
  - 16. The system of claim 15, the second training method further comprises an iterative method, which builds the refined statistical model for the appropriate subset of the data set according to an associated training policy.
  - 17. The system of claim 16, the training policy associated with the iterative method of the second training method controls the respective iterative method to run until an associated convergence criterion is satisfied, the convergence criterion associated with the second training method provides improved model quality relative to the convergence criterion associated with the first training method.
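Claims 7 through 17 describe training policies for an iterative method such as Expectation and Maximization: a fixed number of iterations or a convergence criterion, plus a choice of parameter initialization. A minimal sketch of that idea follows, assuming a one-dimensional Gaussian mixture; the function names, the `max_iters` and `tol` parameters, and the synthetic data are illustrative assumptions, not taken from the patent.

```python
import math
import random

def density(x, w, m, v):
    """Weighted Gaussian density of one mixture component."""
    return w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)

def log_likelihood(data, params):
    weights, means, variances = params
    return sum(math.log(sum(density(x, w, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in data)

def em_fit(data, init, max_iters, tol=0.0):
    """Run EM from `init`; stop after max_iters or when the gain is below tol."""
    weights, means, variances = init
    k = len(weights)
    prev_ll = log_likelihood(data, init)
    for _ in range(max_iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [density(x, w, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        nk = [sum(r[j] for r in resp) for j in range(k)]
        weights = [n / len(data) for n in nk]
        means = [sum(r[j] * x for r, x in zip(resp, data)) / nk[j]
                 for j in range(k)]
        variances = [max(sum(r[j] * (x - means[j]) ** 2
                             for r, x in zip(resp, data)) / nk[j], 1e-6)
                     for j in range(k)]
        ll = log_likelihood(data, (weights, means, variances))
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, variances

# Synthetic two-cluster data, for demonstration only.
rng = random.Random(1)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(6.0, 1.0) for _ in range(200)])
init = ([0.5, 0.5], [min(data), max(data)], [4.0, 4.0])

rough = em_fit(data, init, max_iters=1)                 # single-iteration policy
refined = em_fit(data, init, max_iters=200, tol=1e-8)   # run to convergence
```

Passing the parameters returned for a previous subset as `init` would correspond to the reuse described in claim 9, while passing the same `init` every time corresponds to claim 10.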

18. A computer implemented system programmed to facilitate building a statistical model, comprising:
- a first parameter estimation protocol that efficiently builds a rough statistical model from a subset of a computer readable data set based at least in part on a training policy associated therewith, the computer readable data set is statistically characterizable;
  
  an evaluation component that determines whether the subset of data from which the rough statistical model was built is an acceptable size for building the statistical model to characterize the data set, the evaluation component utilizes a stopping criterion that is functionally related to an expected incremental benefit and an expected incremental cost associated with increasing the size of the subset of data to facilitate determining whether the rough statistical model is an acceptable size and to facilitate reducing cost of clustering data relative to the computer readable data set; and
  
  a second parameter estimation protocol that builds a refined statistical model for the data set from the subset if determined to have the acceptable size, the second parameter estimation protocol having an associated training policy, which enables the second parameter estimation protocol to build the refined statistical model to be a more accurate statistical model than the first parameter estimation protocol, the refined statistical model employed to identify clusters of data within the computer readable data set to facilitate clustering data relative to the computer readable data set, wherein the clustered data is provided.
- View Dependent Claims (19, 20, 21, 22, 23, 24, 25, 26, 27)
- - 19. The system of claim 18, further comprising a data scheduler that increases the size of the subset of the data set to provide a larger aggregate subset of the data set if the rough statistical model is unacceptable, the first parameter estimation protocol efficiently builds a rough statistical model for each larger aggregate subset until a resulting rough statistical model built therefrom is determined to be acceptable.
  - 20. The system of claim 18, the first parameter estimation protocol further comprises an iterative protocol that builds the rough statistical model for each subset of the data set according to the associated training policy.
  - 21. The system of claim 20, the training policy for the first parameter estimation protocol is operative to control parameter initialization for the first parameter estimation protocol, such that at least some of the parameters computed for a previous subset of the data are employed to initialize the first parameter estimation protocol for a subsequent larger aggregate subset of the data set.
  - 22. The system of claim 20, the first parameter estimation protocol is initialized by the same parameter values for each subset of the data subset.
  - 23. The system of claim 20, the training policy associated with first parameter estimation protocol controls the iterative protocol of the first parameter estimation protocol to perform a fixed number of at least one iteration, the second training protocol further comprising an iterative protocol, which is operative to perform a greater number of iterations than the iterative protocol of the first training protocol based at least in part on a training policy associated with the second parameter estimation protocol.
  - 24. The system of claim 20, the training policy associated with the iterative protocol of the first parameter estimation protocol controls the iterative protocol to run until an associated convergence threshold is satisfied, the second training protocol further comprises an iterative protocol, the training policy associated with the iterative protocol of the second parameter estimation protocol being operative to control the respective iterative protocol to run until an associated convergence threshold is satisfied, the convergence threshold associated with the second parameter estimation protocol is less than the convergence threshold associated with the first parameter estimation algorithm protocol.
  - 25. The system of claim 18, the cost of the stopping criterion is functionally related to at least one of time associated with evaluating the model for a larger subset of data or size of the larger subset of the data.
  - 26. The system of claim 18, the stopping criterion is defined by
  - 27. The system of claim 18, the stopping criterion is defined by
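Claim 18 ties the stopping criterion to an expected incremental benefit and an expected incremental cost, and claim 25 relates the cost to time or subset size. The defining formulas of claims 26 and 27 are not reproduced in this text, so the ratio form below is purely an illustrative assumption: benefit is taken as the gain in holdout score, cost as the number of added records, and `min_ratio` is a hypothetical threshold.

```python
def should_stop(prev_score, curr_score, prev_size, curr_size, min_ratio=1e-4):
    """Return True when the score gain per added record falls below min_ratio.

    Assumed form only: benefit = gain in holdout score,
    cost = increase in subset size.
    """
    cost = curr_size - prev_size
    if cost <= 0:
        raise ValueError("the subset must grow between evaluations")
    benefit = curr_score - prev_score
    return benefit / cost < min_ratio

# Early on, doubling the subset still pays off, so growth continues.
keep_growing = not should_stop(-2.50, -2.10, prev_size=100, curr_size=200)

# Later the gain per added record is negligible, so growth stops.
stop_now = should_stop(-2.1000, -2.0995, prev_size=200, curr_size=400)
```

A time-based cost, as claim 25 also allows, could be substituted by passing elapsed seconds in place of the size difference.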

28. A computer implemented learning curve method to facilitate building a statistical model, comprising:
- choosing a subset of a computer readable data set that can be characterized statistically;
  
  employing a first training method to build a rough statistical model to characterize the subset;
  
  evaluating the rough statistical model for acceptability;
  
  if the rough statistical model is unacceptable, repeatedly increasing the size of the subset of data to provide an aggregate data set, building another rough statistical model to characterize the aggregate subset, and reevaluating the other rough statistical model, the acceptability of each rough statistical model based at least in part on a stopping criterion functionally related to an expected incremental benefit and an expected incremental cost associated with increasing the size of the aggregate subset in order to facilitate reducing cost associated with clustering data relative to the computer readable data set; and
  
  if the rough statistical model is acceptable, employing a second training method to build a refined statistical model based at least in part on the aggregate data set, the second training method being different from the first training method, the refined statistical model identifies data clusters contained in the computer readable data set to facilitate clustering of data relative to the computer readable data set, wherein the clustered data is provided.
- View Dependent Claims (29, 30, 31, 32, 33, 34, 35, 36, 37, 38)
- - 29. The system of claim 28, the cost of the stopping criterion is functionally related to at least one of time associated with evaluating an aggregate data subset of increased size or size of the aggregate subset of the data.
  - 30. The system of claim 28, the stopping criterion is defined by
  - 31. The system of claim 28, the stopping criterion is defined by
  - 32. The method of claim 28, the first training method is more computationally efficient than the second training method.
  - 33. The method of claim 28, each instance of model building repeated until obtaining an acceptable rough statistical model by the first training method employs more efficient and less accurate model building than model building employed by the second training method that occurs after obtaining the acceptable rough statistical model.
  - 34. The method of claim 33, each instance of model building repeated until obtaining an acceptable rough statistical model employs the first training method as an iterative method that is run to a first convergence criterion, the second training method employing an iterative method that is run to a second convergence criterion, which demands more iterations than the first convergence criterion in order to obtain convergence, so that the refined statistical model is more accurate than the rough statistical model built by the first training method.
  - 35. The method of claim 33, each instance of model building repeated until obtaining an acceptable rough statistical model employs an iterative method having a fixed number of at least one iteration, the second training method employing an iterative method having a greater number of iterations than the fixed number.
  - 36. The method of claim 28, further comprising controlling parameter initialization employed in each instance of building a model for the aggregate data set prior to obtaining an acceptable rough statistical model.
  - 37. The method of claim 36, further comprising initializing the first training method by the same parameter values for each subset.
  - 38. The method of claim 36, the controlling further comprises reusing at least some of the parameters computed from a previous instance of model building to initialize a subsequent instance of model building for a subsequent larger aggregate data set prior to obtaining an acceptable rough statistical model.
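Claim 28 calls for repeatedly increasing the size of the subset, but it does not fix how fast the subset grows. One plausible schedule is geometric growth, sketched below; the generator name `subset_sizes` and the `growth` factor are assumptions made for illustration.

```python
import math

def subset_sizes(initial, total, growth=2):
    """Yield increasing subset sizes, capped at the full data set size."""
    if initial <= 0 or total <= 0 or growth <= 1:
        raise ValueError("need initial > 0, total > 0, and growth > 1")
    size = min(initial, total)
    while True:
        yield size
        if size >= total:
            return
        # Round up so fractional growth factors still make progress.
        size = min(int(math.ceil(size * growth)), total)

sizes = list(subset_sizes(100, 1000))
# sizes == [100, 200, 400, 800, 1000]
```

With geometric growth the total work across all rough fits stays within a constant factor of the work for the final subset, which is one reason such schedules are commonly used for learning-curve sampling.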

39. A computer-readable medium having computer-executable instructions for:
- choosing a subset of a computer readable data set;
  
  building a rough statistical model to characterize the subset based at least in part on an associated training policy;
  
  evaluating the rough statistical model for acceptability;
  
  if the rough statistical model is unacceptable, repeatedly increasing the size of the subset of data to provide an aggregate data set, building a rough statistical model to characterize the aggregate subset based at least in part on an associated training policy, and reevaluating the rough statistical model;
  
  building a refined statistical model for the computer readable data set from the aggregate data set if the rough statistical model is determined to be acceptable based at least in part on an associated training policy that includes determining acceptability based at least in part on an expected incremental benefit relative to an expected incremental cost associated with increasing the size of the aggregate data set in order to facilitate reducing cost associated with clustering data relative to the computer readable data set, the refined statistical model more accurately characterizes the aggregate data set; and
  
  utilizing the refined statistical model to identify identifiable clusters in the computer readable data set to facilitate clustering data relative to the computer readable data set wherein the clustered data is provided.

40. A computer implemented method to facilitate constructing a statistical model, comprising:
- separating computer readable data on a computer readable medium into a holdout data set and a training data set;
  
  determining a data subset from the training data set by estimating statistical model parameters according to a first training policy and evaluating the estimated statistical model parameters relative to the holdout data set and repeating the estimation and evaluation of statistical model parameters with a larger subset of the training data set until an acceptable quality of the estimated statistical model is established to facilitate reducing cost associated with characterizing clusters relative to the computer readable data;
  
  controlling parameter initialization employed in each estimation of statistical model parameters repeatedly until an acceptable size for the determined data subset is achieved; and
  
  subsequent to establishing the acceptable quality of the estimated statistical model, using the determined data subset to improve the estimated statistical model parameters by employing a second training policy that is more accurate than the first training policy, the estimated model parameters obtained from employment of the second training policy utilized to characterize at least one cluster within the computer readable data to facilitate clustering data relative to the computer readable data, wherein the clustered data is provided.
- View Dependent Claims (41, 42, 43, 44, 45, 46, 47)
- - 41. The method of claim 40, each estimation of model parameters repeated until the acceptable quality of the estimated model is established further comprises employing an iterative method that is run until a first convergence criterion is satisfied, the estimation of model parameters using the determined data subset further comprising an iterative method that is run until a second convergence criterion is satisfied, which is operative to provide a better quality of model than the first convergence criterion.
  - 42. The system of claim 41, the first convergence criterion causes the associated iterative method to run until a first convergence threshold is satisfied, the second convergence criterion causes the associated iterative method to run until a second convergence threshold is satisfied, the second convergence threshold being less than the first convergence threshold.
  - 43. The method of claim 41, at least one of the iterative method run to the first convergence criterion or the iterative method run to the second convergence criterion is an Expectation and Maximization method.
  - 44. The method of claim 40, each estimation of model parameters repeated until the acceptable quality of the estimated model is established employs an iterative method having a fixed number of at least one iteration, the estimation of model parameters using the determined data subset further employing an iterative method having a greater number of iterations than the fixed number.
  - 45. The method of claim 40, the controlling further comprises reusing at least some of the parameters computed from a previous estimation of model parameters to initialize a subsequent estimation of model parameters for a next larger subset of the training set.
  - 46. The method of claim 40, each estimation of model parameters repeated until the acceptable quality of the estimated model is established further comprises initializing the first training method by the same parameter values.
  - 47. The method of claim 40, further comprising determining the acceptability of the estimated model based at least in part on an expected incremental benefit relative to a cost associated with increasing the size of the subset of the data set.
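The first step of claim 40 separates the data into a holdout set and a training set. A minimal sketch, assuming a random split; the function name `split_holdout`, the 20% default fraction, and the fixed seed are hypothetical choices, not specified by the claim.

```python
import random

def split_holdout(records, holdout_fraction=0.2, seed=0):
    """Shuffle the records and return (training set, holdout set)."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

train, holdout = split_holdout(range(100))
```

Because the training set is already shuffled, each prefix `train[:n]` is itself a random sample, so successively larger subsets can be taken as growing prefixes without resampling.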

48. A computer-readable medium having computer-executable instructions for:
- separating computer readable data into a holdout data set and a training data set, the computer readable data is statistically characterizable;
  
  determining a data subset from the training data set by estimating model parameters and controlling model parameter initialization according to a first training policy and evaluating the estimated model parameters relative to the holdout data set and repeating the estimation, initialization, and evaluation of model parameters with a next successively larger subset of the training data set until an acceptable quality of the estimated model is established to facilitate reducing cost associated with clustering data relative to the computer readable data;
  
  subsequent to establishing the acceptable quality of the estimated model, using the determined data subset to improve the estimated model parameters by employing a second training policy that is more accurate than the first training policy; and
  
  utilizing the estimated model parameters determined by utilization of the second training policy to identify a cluster in the computer readable data to facilitate clustering data relative to the computer readable data, wherein the clustered data is provided.

49. A computer implemented method to facilitate constructing a statistical model, comprising:
- separating computer readable data into a holdout data set and a training data set, the computer readable data is statistically characterizable;
  
  iteratively estimating statistical model parameters for a subset of the training data set over a fixed number of iterations and evaluating the estimated statistical model parameters relative to the holdout data set;
  
  repeating the estimation and evaluation of statistical model parameters obtained with successively larger subsets of the training data set until an acceptable model quality is established, acceptable model quality determined based at least in part on an expected incremental benefit relative to an expected incremental detriment associated with an increase in size of each larger training subset of the data set in order to facilitate reducing cost associated with clustering data relative to the computer readable data;
  
  after the acceptable model quality is established, iteratively estimating statistical model parameters for the data subset, which provided the acceptable model quality, until a better quality of model is provided relative to a preceding estimation performed over the fixed number of iterations; and
  
  using the better quality model relative to the computer readable data to identify at least a cluster of data within the computer readable data to facilitate clustering data relative to the computer readable data, wherein the at least a cluster of data is provided.
- View Dependent Claims (50, 51, 52, 53, 54, 55)
- - 50. The method of claim 49, at least one of the iterative estimations employs an Expectation and Maximization method.
  - 51. The method of claim 49, the estimation that occurs after the acceptable model quality is established, further comprises employing an iterative method having a greater number of iterations than the fixed number.
  - 52. The method of claim 49, the estimation of model parameters after the acceptable model quality has been established further comprises employing an iterative method that is run until a convergence criterion is satisfied, which is operative to provide a better quality of model with the data subset than a preceding estimation employing the fixed number of iterations.
  - 53. The method of claim 49, further comprising controlling parameter initialization for each estimation of model parameters that occurs before the acceptable model quality has been established.
  - 54. The method of claim 53, each iterative estimation until the acceptable model quality is established further comprises initializing the first training method by the same parameter values.
  - 55. The method of claim 53, the controlling further comprises reusing at least some of the parameters obtained in a previous estimation of model parameters to initialize a subsequent estimation of model parameters for a next larger subset of the training data set.

56. A computer implemented method to facilitate constructing a statistical model, comprising:
- separating computer readable data into a holdout data set and a training data set, the computer readable data is statistically characterizable;
  
  iteratively estimating statistical model parameters for a subset of the training data set until a first convergence threshold is satisfied and evaluating the estimated statistical model parameters relative to the holdout data set;
  
  repeating the estimation and evaluation of statistical model parameters obtained with successively larger subsets of the training data set until determining a size of data subset that provides acceptable statistical model parameters, acceptable statistical model parameters attained where the expected marginal cost outweighs the expected marginal benefit associated with successively larger subsets in order to facilitate reducing cost associated with clustering data relative to the computer readable data;
  
  after determining the size of data subset that provides acceptable statistical model parameters, iteratively estimating statistical model parameters for a data subset of the acceptable size until a second convergence threshold is satisfied, the second convergence threshold being less than the first convergence threshold; and
  
  based at least in part on the estimated statistical model parameters identified at the second convergence threshold, identifying a good clustering of data relative to the computer readable data to facilitate clustering data, wherein the clustered data is provided.

57. A computer implemented system to facilitate building a statistical model for a computer readable data set, comprising:
- first means for building a rough statistical model to characterize a subset of the computer readable data set;
  
  means for evaluating the acceptability of the rough statistical model based at least in part on an expectational cost-benefit analysis to facilitate reducing cost associated with clustering data relative to the computer readable data set, the first means building another rough statistical model for a larger subset of the data set if the evaluation means determines that a prior rough statistical model is unacceptable;
  
  second means, which is different from the first means, for building a refined statistical model from an aggregate subset of data that yielded the rough statistical model deemed acceptable by the evaluation means; and
  
  means for identifying a cluster of data within the computer readable data set based at least in part on the refined statistical model to facilitate clustering data relative to the computer readable data set, wherein the clustered data is provided.

58. A computer implemented system to facilitate building a statistical model for a computer readable data set, comprising:
- first means for estimating statistical model parameters from a subset of the computer readable data set, the data set is statistically characterizable;
  
  means for evaluating the estimated statistical model parameters relative to a holdout data set of the data set;
  
  means for determining a data subset from the training data set by causing the first means and the means for evaluating to respectively repeat estimation and evaluation of statistical model parameters with a next successively larger subset of the training data set until an acceptable quality of the statistical model parameters is established, the quality of the statistical model parameters established when the expected cost of generating the next successively larger subset outweighs the expected benefit in accuracy of utilizing the next successively larger subset in order to facilitate reducing cost associated with clustering data relative to the computer readable data set;
  
  second means for estimating statistical model parameters based at least in part on the determined data subset to provide a more accurate estimation of model parameters than the first means;
  
  means for setting parameters associated with cluster weights of a cluster of data; and
  
  means for determining the cluster of data contained in the computer readable data set based at least in part on the more accurate estimation of statistical model parameters to facilitate clustering data relative to the computer readable data set, wherein the clustered data is provided.


Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Meek, Christopher A.; Heckerman, David E.; Thiesson, Bo
Primary Examiner(s): STARKS, WILBERT L
Application Number: US09/873,719
Time in Patent Office: 2,619 Days
Field of Search: 706/12, 706/13, 706/20, 706/45
US Class Current: 706/12
CPC Class Codes: G06N 20/00 Machine learning
