Ranking system
First Claim
1. A computer-implemented method of identifying a top k most frequently occurring items in a data set, the method comprising:
- arranging a processor to access a value of a confidence measure and a value of an error tolerance parameter;
arranging the processor to sample items from the data set at least as many times as prescribed by a function of the confidence measure, the error tolerance parameter and other parameters which exclude a parameter m being a number of distinct items in the data set, and to continue sampling items until a stopping rule is met, the stopping rule being based in part on a posterior probability of error conditional on an observed empirical frequency of items in the data set;
arranging a memory to form a sample sketch from the sampled items; and
arranging the processor to identify the top k items from the sample sketch;
wherein the stopping rule is selected from one of;
stopping sampling when a sample counter exceeds the maximum value of the sum of an empirical frequency of a k-th largest frequency item in a sample set and a largest frequency that is smaller than a threshold or relative threshold for frequencies of false items, divided by the empirical frequency of the k-th largest frequency item in the sample set minus the largest frequency that is smaller than the threshold or relative threshold for frequencies of false items, multiplied by a Δ
quantile of a normal distribution; and
stopping sampling when a sample counter exceeds a sum of an empirical frequency of a k-th largest frequency item in a sample set and a largest frequency that is smaller than a threshold or relative threshold for frequencies of false items, divided by the empirical frequency of the k-th largest frequency item in the sample set minus the largest frequency that is smaller than the threshold or relative threshold for frequencies of false items squared, and multiplied by a Δ
quantile of a normal distribution.
2 Assignments
0 Petitions
Accused Products
Abstract
Ranking systems are described. In an embodiment a large scale data center has peta bytes of items and a query engine is provided to find the top k most frequently occurring items. In embodiments, samples are taken from the data center at least until a specified number of samplings is met, or until a stopping rule is met. In examples, the samples form a sample sketch which is used to find the top k most frequently occurring items without the need to examine every item in the data center. In other examples, the number of samplings or stopping rule is varied to provide ranks or frequencies. In other embodiments the ranking system operates on items having values to find separators which divide the items into bins such that the proportion of the items in each bin is different. For example, a data set may be apportioned to different types of processor.
13 Citations
15 Claims
-
1. A computer-implemented method of identifying a top k most frequently occurring items in a data set, the method comprising:
-
arranging a processor to access a value of a confidence measure and a value of an error tolerance parameter; arranging the processor to sample items from the data set at least as many times as prescribed by a function of the confidence measure, the error tolerance parameter and other parameters which exclude a parameter m being a number of distinct items in the data set, and to continue sampling items until a stopping rule is met, the stopping rule being based in part on a posterior probability of error conditional on an observed empirical frequency of items in the data set; arranging a memory to form a sample sketch from the sampled items; and arranging the processor to identify the top k items from the sample sketch; wherein the stopping rule is selected from one of; stopping sampling when a sample counter exceeds the maximum value of the sum of an empirical frequency of a k-th largest frequency item in a sample set and a largest frequency that is smaller than a threshold or relative threshold for frequencies of false items, divided by the empirical frequency of the k-th largest frequency item in the sample set minus the largest frequency that is smaller than the threshold or relative threshold for frequencies of false items, multiplied by a Δ
quantile of a normal distribution; andstopping sampling when a sample counter exceeds a sum of an empirical frequency of a k-th largest frequency item in a sample set and a largest frequency that is smaller than a threshold or relative threshold for frequencies of false items, divided by the empirical frequency of the k-th largest frequency item in the sample set minus the largest frequency that is smaller than the threshold or relative threshold for frequencies of false items squared, and multiplied by a Δ
quantile of a normal distribution. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A query engine arranged to query a data center storing on an order of peta bytes of items in a data set, to find a top k most frequently occurring items in the data center, the query engine comprising:
-
a processor arranged to access a value of a confidence measure, a value of an error tolerance parameter, and an estimate of a frequency of the top kth most frequent item in the data set; the processor also being arranged to continue sampling items from the data set at least as many times as prescribed by a function of the confidence measure, the error tolerance parameter, the estimate of the frequency of the top kth most frequent item in the data set, and excluding a parameter m being a number of unique items in the data set, the number of unique items being different than a total number of items in the data set; a memory arranged to store a sample sketch comprising the sampled items; and the processor also being arranged to identify the top k most frequently occurring items in the sample sketch; wherein the function of the confidence measure is selected from; four times an estimate of the frequency of the top kth most frequent item in the data set divided by an error tolerance parameter squared, multiplied by, a logarithm of 1 over a prescribed upper bound on a probability of error and a logarithm of k items with frequencies greater than or equal to the estimate of the frequency of the top kth most frequent item multiplied by the smaller of;
two divided by, the estimate of the frequency of the top kth most frequent item in the data set minus the error tolerance parameter; and
the number of items in the data set minus k;four times an estimate of the top kth most frequent item in the data set divided by an error tolerance parameter squared multiplied by the sum of a logarithm of 1 over a prescribed upper bound on a probability of error and a logarithm of k items greater than or equal to the estimate of the top kth most frequent item multiplied by the smaller of two divided by, the estimate of the frequency of the top kth most frequent item in the data set minus the error tolerance parameter, plus the estimate of the frequency of the top kth most frequent item in the data set; and
the number of items in the data set; andeight times an estimate of the frequency of the top most frequent item in the data set multiplied by 1 minus the estimate of the frequency of the top most frequent item in the data set, divided by an error parameter squared and multiplied by a logarithm of 1 over a prescribed upper bound on a probability of error and a logarithm of two times M where M is the smaller of two divided by, the estimate of the frequency of the top kth most frequent item in the data set minus the error tolerance parameter, plus 1; and
the number of items in the data set.
-
-
10. A ranking system comprising:
-
an input arranged to access a data set comprising items having values; a memory arranged to store a required number of bins into which the values are to be partitioned; a processor arranged to find separator values for allocating the items into the bins in order that a specified proportion of the items is allocated to each bin, that specified proportion being different for at least two bins; the processor being arranged to sample items from the data set at least until a specified function is met, that function being of a confidence measure, an error tolerance parameter, and a width of the smallest bin; a memory arranged to store a sample sketch comprising the sampled items; and wherein the processor is also arranged to find the separator values from the sample sketch; wherein the function of the confidence measure is selected from; four times an estimate of the frequency of the top kth most frequent item in the data set divided by an error tolerance parameter squared, multiplied by, a logarithm of 1 over a prescribed upper bound on a probability of error and a logarithm of k items with frequencies greater than or equal to the estimate of the frequency of the top kth most frequent item multiplied by the smaller of;
two divided by, the estimate of the frequency of the top kth most frequent item in the data set minus the error tolerance parameter; and
the number of items in the data set minus k;four times an estimate of the top kth most frequent item in the data set divided by an error tolerance parameter squared multiplied by the sum of a logarithm of 1 over a prescribed upper bound on a probability of error and a logarithm of k items greater than or equal to the estimate of the top kth most frequent item multiplied by the smaller of two divided by, the estimate of the frequency of the top kth most frequent item in the data set minus the error tolerance parameter, plus the estimate of the frequency of the top kth most frequent item in the data set; and
the number of items in the data set; andeight times an estimate of the frequency of the top most frequent item in the data set multiplied by 1 minus the estimate of the frequency of the top most frequent item in the data set, divided by an error parameter squared and multiplied by a logarithm of 1 over a prescribed upper bound on a probability of error and a logarithm of two times M where M is the smaller of two divided by, the estimate of the frequency of the top kth most frequent item in the data set minus the error tolerance parameter, plus 1; and
the number of items in the data set. - View Dependent Claims (11, 12, 13, 14, 15)
-
Specification