Single-pass low-storage arbitrary probabilistic location estimation for massive data sets

US 20030078924A1
Filed: 04/11/2002
Published: 04/24/2003
Est. Priority Date: 04/11/2001
Status: Active Grant

Abstract

The present invention includes a method and system for providing an estimate of a summary of a data set generated by an unknown distribution. The method includes selecting a subset of data points from the data set, applying a scoring rule to each data point of the subset of data points based on an estimated relative location and an assigned weight for each data point to provide a score for each data point, selectively retaining data points to track based on the score for each data point, and determining an estimate of the summary of the data set based on the retained data points.

66 Claims

1. A method for providing an estimate of a summary of a data set generated by an unknown distribution, comprising:
- selecting a subset of data points from the data set;
  
  applying a scoring rule to each data point of the subset of data points based on a summary of a set of estimated relative locations and an assigned weight for each data point to provide a score for each data point;
  
  selectively retaining data points to track based on the score for each data point; and
  
  determining an estimate of the summary of the data set based on the retained data points.
- - 2. The method of claim 1 wherein the summary of the data set is selected from the set comprising a cumulative density function, a probability density function, a parametric summary, a semi-parametric summary, and a non-parametric summary of the data set.
  - 3. The method of claim 1 wherein the summary of a set of estimated relative locations is a single rank estimate.
  - 4. The method of claim 1 wherein between 20 and 100 data points are tracked.
  - 5. The method of claim 1 wherein the estimate is a set of probabilistic locations.
  - 6. The method of claim 5 wherein the estimate is a single quantile estimate.
  - 7. The method of claim 1 wherein the estimated relative location for each data point is a function of the previous and current relative location and weights for each of the data points.
  - 8. The method of claim 1 further comprising determining the estimated relative location for each point by determining the point's relative location to retained data points, applying a linear interpolation to determine the point's relative location when the point is not at a boundary, and applying a nonlinear interpolation to determine the point's relative location when the point is at a boundary.
  - 9. The method of claim 8 wherein the step of selectively retaining data points includes retaining data points having the smallest scores and discarding data points having the largest scores.
  - 10. The method of claim 8 wherein 100 or less data points are retained.
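
The interior case of claim 8 can be sketched as a linear interpolation between the two retained points that bracket a new point. This is a minimal illustration only: the function name `interpolate_rank` and the `(value, estimated_rank)` pair layout are assumptions made here, and the nonlinear boundary case is omitted because the claims do not specify its form.

```python
def interpolate_rank(x, lower, upper):
    """Estimate the rank of x from the retained points just below and above it.

    lower and upper are hypothetical (value, estimated_rank) pairs for the
    retained neighbours of x; x must lie strictly between their values.
    """
    (v_lo, r_lo), (v_hi, r_hi) = lower, upper
    if not v_lo < x < v_hi:
        raise ValueError("x must lie strictly between the two neighbours")
    # Move along the rank gap in proportion to x's position in the value gap.
    return r_lo + (r_hi - r_lo) * (x - v_lo) / (v_hi - v_lo)


# A point halfway between its neighbours in value lands halfway in rank.
print(interpolate_rank(15.0, (10.0, 100.0), (20.0, 140.0)))
```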

11. A method for providing an estimate of a summary of a data set generated by an unknown distribution, comprising:
- (a) inputting m data points from a data set;
  
  (b) assigning a relative location to each said m data points;
  
  (c) assigning a weight to each said m data points;
  
  (d) inputting a subset, n, of the remaining data points;
  
  (e) estimating a relative location for each said m and n data points;
  
  (f) assigning a weight to each said m and n data points;
  
  (g) scoring each said m and n data points;
  
  (h) retaining a subset of said m and n data points, their associated estimated relative locations and weights, the retained data points becoming the m data points;
  
  (i) repeating steps (d) through (h) until all data points have been analyzed;
  
  (j) providing the estimate of the summary of the data set based on said m data points.
- - 12. The method of claim 11 wherein the summary of the data set is selected from the set comprising a cumulative density function, a probability density function, a parametric summary, a semi-parametric summary of the data set.
  - 13. The method of claim 11 wherein the estimated relative location is a rank estimate.
  - 14. The method of claim 13, wherein the rank assigned to each said m data point is a function of an actual rank of the m data points after said points have been partially or fully sorted.
  - 15. The method for claim 14, wherein the rank assigned to each said m data point is the actual rank of the m data points after said points have been sorted.
  - 16. The method of claim 13, wherein the estimated rank for each said m data point is a function that uses in part or in its entirety any or all of the following as arguments:
    - said n data points, any of the rank estimates for said m data points, the total number of data points that have been inputted and the total number of data points in the data set.
  - 17. The method of claim 16, wherein the estimated rank for each said m data point is the previous rank estimate for the data point plus the number of said n data points with a value lower than the said data point.
  - 18. The method of claim 13, wherein the estimated rank for each said n data point is a function that uses in part or in its entirety any or all of the following as arguments:
    - the previous and current estimated ranks and weights for the said m data points, the actual values of said m data points, the estimated ranks and assigned weights and actual value of any or all of the remaining said n data points, the total number of data points that have been inputted and the total number of data points in the data set.
  - 19. The method of claim 18, wherein the estimated rank for one of the said n data points where said point is not a new maximum or minimum with regards to all of the data points considered up to this point, is a function of the current rank estimate and value of the said m data points which are immediately above and below said point.
  - 20. The method of claim 19, wherein the estimated rank for one of n data points where said point is not immediately adjacent to the largest or smallest of the m data points, is a linear interpolation of the current rank estimate of the m data points which are immediately above and below the said point.
  - 21. The method of claim 19, wherein the estimated rank for one of n data points where said point is immediately adjacent to either the largest or smallest of the m data points is a non-linear interpolation of the current rank estimate of the m data points which are immediately above and below said point.
  - 22. The method of claim 18, wherein the estimated rank for each said n data point is equal to the value 1 where that point is smaller than all of the m data points and the remaining n data points and wherein the estimated rank for each said n data point where that point is smaller than all of the said m data points and larger than at least one of the remaining n data points, is a function of the value of the estimated rank of the smallest of the m data points and the n data points.
  - 23. The method of claim 18, wherein the estimated rank for each n data point where that point is larger than all of the m data points and the remaining n data points, is equal to the total number of data points that have been inputted and wherein the estimated rank for each n data point where the point is larger than all of the m data points and smaller than at least one of the remaining n data points, is a function of the value of the estimated rank of the largest of the m data points and the n data points.
  - 24. The method of claim 13, wherein the initial assigned weight to the m data points is a function that uses in part or in its entirety any or all of the following as arguments:
    - the total number of data points in the data set, the total number of data points in the initial group, the number of data points that are to be inputted during each iteration, and the estimated rank of the m data points.
  - 25. The method of claim 24, wherein the initial assigned weight to said m data points is equal to a constant.
  - 26. The method of claim 13, wherein the weight assigned to the m and n data points, for all but the initial assigning of weights for the m data points, is a function that uses in part or in its entirety any or all of the following as arguments:
    - any or all of the previously assigned weights for the m data points, the current weights for the n data points, any of the estimated ranks for the m and n data points, the actual values of the m and n data points, the total number of data points that have been inputted and the total number of data points in the data set.
  - 27. The method of claim 26, wherein the weight assigned to each of said m data points is equal to the weight initially assigned to that said point.
  - 28. The method of claim 26, wherein the weight assigned to each of said n data points is equal to a function of the distances, as defined by any metric, between any of the estimated ranks of said m data points and any of the estimated ranks of said remaining n data points and the actual values of said m and n data points.
  - 29. The method of claim 28, wherein the weight assigned to each of n data points is equal to the smaller of the two distances, as defined by the absolute value of the distance derived by standard subtraction, between the estimated rank of said data point and the estimated rank of m data points which are immediately above and below said point.
  - 30. The method of claim 13, wherein the subset of inputted data points are determined by comparing elements of the proposed inputted data points and said m data points being tracked.
  - 31. The method of claim 30, wherein some or all data points in the subset that have a value exactly equal to the value of one of said m data points being tracked are used to calculate the estimated rank and assigned weight for said m data points being tracked and then discarded.
  - 32. The method of claim 31, wherein all data points to be discarded are discarded by assigning a score of minus infinity to said data points.
  - 33. The method of claim 30, wherein two or more of the subset of inputted data points have the same value, all of the data points with equal value are used to calculate the estimated rank and assigned weight for m data points being tracked and all but one of said group are discarded, unless the group of data points with equal value are exactly equal to the value of one of said m data points being tracked in which case all of the data points in the group of equal data points will be discarded.
  - 34. The method of claim 33, wherein all data points to be discarded are discarded by assigning a score of minus infinity to said data points.
  - 35. The method of claim 13, wherein the subset of data points selected are the next n data points in the data set, as determined by the order in which they were recorded, unless there is less than n data points left in the data set, where the subset will be the remaining data points.
  - 36. The method of claim 35, wherein the size of the subset of data points being selected is equal to one.
  - 37. The method of claim 30, wherein the score for each m and n data points is a function that uses in part or in its entirety any or all of the following as arguments:
    - any or all previous weights, ranks or scores, any or all previously assigned weights for said m data points, the current weight for said n data points, any of estimated ranks for m and n data points, actual values of the m and n data points, the total number of data points inputted and the total number of data points in the data set.
  - 38. The method of claim 37, wherein the score for each m and n data points is a function of the estimated ranks and assigned weights for said m and n data points and the number of data points inputted.
  - 39. The method of claim 38, wherein the score for each m and n data points is a function of estimated rank, assigned weight and a target rank, the target rank being a fixed proportion of the number of data points inputted, assigned to each said data point where said data point is not the largest or smallest of said m and n data points.
  - 40. The method of claim 39, wherein the score for each m and n data points, where the said data point does not have the largest or smallest value of the m and n data points is equal to the distance, as defined by any metric, between the estimated rank and the target rank multiplied by any function of the assigned weight for said data point.
  - 41. The method of claim 40, wherein the score for each m and n data points, where the said data point does not have the largest or smallest value of the m and n said data points is equal to the absolute value of the estimated rank minus the target rank divided by the assigned weight for said data point.
  - 42. The method of claim 38, wherein the score for each m and n data points is equal to zero, where the data point has the largest or smallest value of the m and n data points.
  - 43. The method of claim 13, wherein a subset of m and n data points and their associated estimated ranks and weights are retained based on a comparison of the score calculated for each said data point.
  - 44. The method of claim 43, wherein the m data points with the smallest score of m and n data points are retained along with their associated estimated ranks and assigned weights.
  - 45. The method for claim 13, wherein the unknown cumulative density function is estimated as a function of the value and the estimated rank of said m data points.
  - 46. The method for claim 45, wherein an unknown quantile is estimated as a function of the value and the estimated rank of said m data points.
  - 47. The method for claim 46, wherein the estimate of an unknown quantile is equal to the value of the data point of said m data points with the smallest distance, as defined by any metric, between the estimated rank and the target rank associated with said quantile.
  - 48. The method for claim 47, wherein the estimate of an unknown quantile is equal to the value of the data point of said m data points with the smallest distance, using the absolute value of the difference between the estimated rank and the target rank associated with this quantile, as defined by the proportion of the total data set associated with this quantile.
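
One possible reading of steps (a) through (j) of claim 11, combined with a handful of the dependent claims, is sketched below for a single quantile with one new point per iteration (claim 36). It is not the patentees' implementation. The function name `estimate_quantile`, the `[value, rank, weight]` record layout, the use of linear interpolation for every interior point (the nonlinear form of claim 21 is not specified), the weight given to a new extreme point, and the tie convention (a tracked point equal to the new value also moves up one rank) are all assumptions made for illustration.

```python
import bisect
import itertools
import random


def estimate_quantile(stream, q, m=30):
    """Single-pass estimate of the q-th quantile that tracks at most m points."""
    if m < 3 or not 0.0 <= q <= 1.0:
        raise ValueError("need m >= 3 and 0 <= q <= 1")
    it = iter(stream)
    first = sorted(itertools.islice(it, m))
    if not first:
        raise ValueError("empty stream")
    # Steps (a)-(c): actual ranks after sorting (claim 15), constant weight (claim 25).
    pts = [[v, float(r), 1.0] for r, v in enumerate(first, start=1)]
    n_seen = len(pts)
    for x in it:  # step (d) with n = 1
        n_seen += 1
        i = bisect.bisect_left([p[0] for p in pts], x)
        # Step (e) for tracked points: each point at or above x moves up one rank.
        for p in pts[i:]:
            p[1] += 1.0
        if i < len(pts) and pts[i][0] == x:
            continue  # tie with a tracked value: used for ranks, then discarded (claim 31)
        if i == 0:  # new minimum gets rank 1 (claim 22)
            rank = 1.0
            weight = pts[0][1] - rank  # assumed: distance to the only neighbour
        elif i == len(pts):  # new maximum gets rank n_seen (claim 23)
            rank = float(n_seen)
            weight = rank - pts[-1][1]  # assumed: distance to the only neighbour
        else:  # interior: linear interpolation between neighbours (claim 20)
            (v_lo, r_lo, _), (v_hi, r_hi, _) = pts[i - 1], pts[i]
            rank = r_lo + (r_hi - r_lo) * (x - v_lo) / (v_hi - v_lo)
            # Step (f): smaller rank distance to the two neighbours (claim 29).
            weight = min(rank - r_lo, r_hi - rank)
        pts.insert(i, [x, rank, weight])  # previously tracked weights stay fixed (claim 27)
        target = q * n_seen  # target rank (claim 39)

        def score(j):
            # Step (g): |rank - target| / weight (claim 41).
            _, r, w = pts[j]
            return abs(r - target) / w if w > 0 else float("inf")

        # Step (h): the two extremes score zero (claim 42) and are always kept,
        # so only interior points compete; drop the largest score (claim 44).
        del pts[max(range(1, len(pts) - 1), key=score)]
    # Step (j): tracked value whose estimated rank is closest to the target (claim 47).
    target = q * n_seen
    return min(pts, key=lambda p: abs(p[1] - target))[0]


data = list(range(1, 10001))
random.Random(0).shuffle(data)
print(estimate_quantile(data, 0.5, m=30))
```

The tracked list is rescanned for every new point, which costs O(m) per observation. That is adequate for the 20 to 100 tracked points mentioned in claim 4, and storage stays at m records regardless of the stream length.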

49. A method for estimating an arbitrary quantile for an unknown distribution wherein the improvement comprises tracking a small fraction of the original data set using a scoring rule based on an estimated rank and an assigned weight for each data point to determine which data points to track and which data points to ignore.
- - 50. The method of claim 49 further comprising the steps of:
    - (a) inputting and sorting m data points from a data set;
      
      (b) estimating rank of each said m data points;
      
      (c) assigning weight of each said m data points;
      
      (d) scoring said m data points;
      
      (e) comparing scores of said m data points with points previously observed for inclusion or exclusion on two-dimensional grids;
      
      (f) repeating steps (b) through (e) until all m data points have been analyzed; and
      
      (g) choosing a data point from the two-dimensional grid to estimate an unknown quantile.
  - 51. The method of claim 50 wherein scoring said m data points further comprises:
    - (a) calculating a target by multiplying the quantile percentile by the number of data points observed;
      
      (b) calculating the absolute value of the difference between the target and the estimated rank of said data point;
      
      (c) dividing by the weight of said data point.
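
Steps (a) through (c) of claim 51 amount to a one-line formula. The sketch below writes it out; the function and parameter names are hypothetical, and the quantile is assumed to be given as a proportion in [0, 1].

```python
def score(estimated_rank, weight, quantile, n_observed):
    """Score of one data point under steps (a)-(c); lower scores are better."""
    target = quantile * n_observed           # (a) target rank for the quantile
    distance = abs(target - estimated_rank)  # (b) absolute rank difference
    return distance / weight                 # (c) divide by the point's weight


# Median after 100 observations: the target rank is 50, so a point with
# estimated rank 40 and weight 2 is 10 ranks away and scores 10 / 2.
print(score(40.0, 2.0, 0.5, 100))
```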

52. A computer system for estimating unknown quantiles for data sets, comprising:
- (a) computer processor means for processing data;
  
  (b) storage means for storing data on a storage medium;
  
  (c) means for inputting and sorting m data points from a data set;
  
  (d) means for estimating rank of each said m data points;
  
  (e) means for assigning weight of each said m data points;
  
  (f) means for scoring said m data points;
  
  (g) means for comparing scores of said m data points with points previously observed for inclusion or exclusion on two-dimensional grids;
  
  (h) repeating steps (d) through (g) until all m data points have been analyzed; and
  
  (i) means for choosing a data point from the two-dimensional grid to estimate an unknown quantile.
- - 53. The system of claim 52, wherein the means for estimating the rank of each said m data point comprises means for determining the position of said data point in relation to data points previously observed and calculating the distance.
  - 54. The system of claim 53 wherein the means of calculating the distance comprises maintaining the same rank of a previously observed data point where said data point is a new minimum or maximum.
  - 55. The system of claim 53 wherein the means for calculating the distance uses linear interpolation.
  - 56. The system of claim 53 wherein the means for calculating the distance uses non-linear interpolation when said data point is close to a minimum or maximum.
  - 57. The system of claim 52 wherein the means for assigning weight of each said data point comprises means for comparing the rank of said data point to the ranks of adjacent points by metric calculation.
  - 58. The system of claim 52 wherein the means of assigning weight of each said data point comprises calculating the minimum difference between the rank of said data point and ranks of adjacent points.
  - 59. The system of claim 52 wherein means for scoring said m data points comprises comparing the true rank for a targeted quantile with the estimated rank for each said data point.
  - 60. The system of claim 52 wherein means for scoring said m data points comprises:
    - (a) calculating a target by multiplying the quantile percentile by the number of data points observed;
      
      (b) calculating the absolute value of the difference between the target and the estimated rank of said data point;
      
      (c) dividing by the weight of said data point.
  - 61. The system of claim 52 wherein the means for comparing scores comprises inserting said data point in the two-dimensional grid.
  - 62. The system of claim 52 wherein the means for choosing a data point from the two-dimensional grid to estimate an unknown quantile comprises taking the point with the rank closest to the target.
  - 63. The system of claim 52 wherein the m data points of a data set are observed sequentially.
  - 64. The system of claim 52 wherein the m data points of a data set are observed more than one at a time.
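
Claims 58 and 62 describe two small computations: the weight as the minimum rank difference to the adjacent points, and the final estimate as the tracked point whose rank is closest to the target. A minimal sketch follows; the function names and the parallel `values` and `ranks` lists are assumptions for illustration, and the target is taken to be the quantile proportion times the number of points observed, as in claim 60(a).

```python
def adjacent_rank_weight(rank, rank_below, rank_above):
    """Weight of a point: smaller rank distance to its two neighbours (claim 58)."""
    return min(abs(rank - rank_below), abs(rank_above - rank))


def choose_estimate(values, ranks, quantile, n_observed):
    """Value of the tracked point whose estimated rank is closest to the target (claim 62)."""
    target = quantile * n_observed
    best = min(range(len(values)), key=lambda i: abs(ranks[i] - target))
    return values[best]


print(adjacent_rank_weight(13.0, 10.0, 20.0))
print(choose_estimate([1.0, 4.0, 9.0], [1.0, 48.0, 100.0], 0.5, 100))
```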

65. A computer system for estimating a summary of a data set generated by an unknown distribution, comprising:
- a processor for processing data;
  
  a data storage for storing data operatively connected to the processor;
  
  an input operatively connected to the processor for receiving data points from the data set;
  
  the processor adapted to receive data points from the data set, apply a scoring rule to each data point received from the data set based on an estimated relative location and an assigned weight for each data point, retain data points to track based on the score for each data point, and determine an estimate of the summary of the data set based on the retained data points.
- - 66. The computer system of claim 65 wherein the processor is further adapted to estimate multiple summaries of the data set at the same time.

Current Assignee: Penn State Research Foundation
Original Assignee: Penn State Research Foundation
Inventors: McDermott, James P.; Lin, Dennis K.J.; Liechty, John C.
Granted Patent: US 7,076,487 B2
US Class Current: 707/7

CPC Class Codes

G06F 16/2462   Approximate or statistical ...

G06F 17/18   for evaluating statistical ...

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99937   Sorting

Y10S 707/99945   Object-oriented database st...
