Single-pass low-storage arbitrary probabilistic location estimation for massive data sets
Abstract
The present invention includes a method and system for providing an estimate of a summary of a data set generated by an unknown distribution. The method includes selecting a subset of data points from the data set, applying a scoring rule to each data point of the subset of data points based on an estimated relative location and an assigned weight for each data point to provide a score for each data point, selectively retaining data points to track based on the score for each data point, and determining an estimate of the summary of the data set based on the retained data points.
127 Citations
66 Claims
1. A method for providing an estimate of a summary of a data set generated by an unknown distribution, comprising:
selecting a subset of data points from the data set;
applying a scoring rule to each data point of the subset of data points based on a summary of a set of estimated relative locations and an assigned weight for each data point to provide a score for each data point;
selectively retaining data points to track based on the score for each data point; and
determining an estimate of the summary of the data set based on the retained data points.
Dependent claims: 2-10
11. A method for providing an estimate of a summary of a data set generated by an unknown distribution, comprising:
(a) inputting m data points from a data set;
(b) assigning a relative location to each said m data points;
(c) assigning a weight to each said m data points;
(d) inputting a subset, n, of the remaining data points;
(e) estimating a relative location for each said m and n data points;
(f) assigning a weight to each said m and n data points;
(g) scoring each said m and n data points;
(h) retaining a subset of said m and n data points, their associated estimated relative locations and weights, the retained data points becoming the m data points;
(i) repeating steps (d) through (h) until all data points have been analyzed;
(j) providing the estimate of the summary of the data set based on said m data points.
Dependent claims: 12-48
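Steps (a) through (j) of claim 11 read as a batch loop, and a short sketch may make the flow easier to follow. The Python below is one possible reading for the quantile case, not the patented implementation. The claim does not say how locations, weights, or scores are computed, so the following are all assumptions made for illustration: the function name `estimate_quantile`, the defaults for `m` and `n`, treating a point's weight as the number of observations it stands for, treating its relative location as an estimated rank, and scoring by rank distance from the target rank `p * total`.

```python
from itertools import islice
from typing import Iterable, List


def estimate_quantile(data: Iterable[float], p: float, m: int = 200, n: int = 200) -> float:
    """Single-pass estimate of the p-quantile that tracks at most m points.

    Assumed conventions (hypothetical, not from the claims): the lowest
    tracked point stands for itself and everything at or below it, so its
    rank equals its weight; every other point's rank is the weight below
    it plus one.
    """
    if not 0.0 < p < 1.0 or m < 2 or n < 1:
        raise ValueError("need 0 < p < 1, m >= 2, n >= 1")
    stream = iter(data)

    # (a)-(c): read m points, order them, give each a weight of 1.
    pts: List[list] = [[v, 1] for v in sorted(islice(stream, m))]
    if not pts:
        raise ValueError("data set is empty")
    total = len(pts)

    while True:
        batch = list(islice(stream, n))  # (d): next n points
        if not batch:
            break  # (i): every point has been seen
        total += len(batch)

        # (e)-(f): a point outside the tracked range only adds weight to the
        # nearest end point; a point inside is tracked with weight 1.
        lo, hi = pts[0][0], pts[-1][0]
        inside = []
        for v in batch:
            if v <= lo:
                pts[0][1] += 1
            elif v >= hi:
                pts[-1][1] += 1
            else:
                inside.append([v, 1])
        pts = [pts[0]] + sorted(pts[1:-1] + inside) + [pts[-1]]

        # (g)-(h): ranks increase along the list, so the worst-scoring point
        # is always at one end. Drop the worse end until m points remain; a
        # dropped point hands its weight to its neighbour.
        target = p * total
        while len(pts) > m:
            low_gap = abs(pts[0][1] - target)
            high_gap = abs(total - pts[-1][1] + 1 - target)
            if low_gap >= high_gap:
                pts[1][1] += pts[0][1]
                del pts[0]
            else:
                pts[-2][1] += pts[-1][1]
                del pts[-1]

    # (j): report the tracked point whose estimated rank is nearest the target.
    target = p * total
    best_value, best_gap = pts[0][0], abs(pts[0][1] - target)
    below = pts[0][1]
    for value, weight in pts[1:]:
        gap = abs(below + 1 - target)
        if gap < best_gap:
            best_value, best_gap = value, gap
        below += weight
    return best_value
```

A call such as `estimate_quantile(values, 0.5, m=200, n=200)` would return a tracked data point near the median. The sketch assumes the data arrive in no particular order: the tracked range can only narrow, so on sorted or strongly trending input it cannot follow the quantile, and a different retention rule would be needed.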
49. A method for estimating an arbitrary quantile for an unknown distribution wherein the improvement comprises tracking a small fraction of the original data set using a scoring rule based on an estimated rank and an assigned weight for each data point to determine which data points to track and which data points to ignore.
52. A computer system for estimating unknown quantiles for data sets, comprising:
(a) computer processor means for processing data;
(b) storage means for storing data on a storage medium;
(c) means for inputting and sorting m data points from a data set;
(d) means for estimating rank of each said m data points;
(e) means for assigning weight of each said m data points;
(f) means for scoring said m data points;
(g) means for comparing scores of said m data points with points previously observed for inclusion or exclusion on two-dimensional grids;
(h) repeating steps (d) through (g) until all m data points have been analyzed; and
(i) means for choosing a data point from the two-dimensional grid to estimate an unknown quantile.
Dependent claims: 53-64
65. A computer system for estimating a summary of a data set generated by an unknown distribution, comprising:
a processor for processing data;
a data storage for storing data operatively connected to the processor;
an input operatively connected to the processor for receiving data points from the data set;
the processor adapted to receive data points from the data set, apply a scoring rule to each data point received from the data set based on an estimated relative location and an assigned weight for each data point, retain data points to track based on the score for each data point, and determine an estimate of the summary of the data set based on the retained data points.
Dependent claims: 66
Specification