Ranking system

US 8,478,762 B2
Filed: 05/01/2009
Issued: 07/02/2013
Est. Priority Date: 05/01/2009
Status: Active Grant

Abstract

Ranking systems are described. In an embodiment a large scale data center has peta bytes of items and a query engine is provided to find the top k most frequently occurring items. In embodiments, samples are taken from the data center at least until a specified number of samplings is met, or until a stopping rule is met. In examples, the samples form a sample sketch which is used to find the top k most frequently occurring items without the need to examine every item in the data center. In other examples, the number of samplings or stopping rule is varied to provide ranks or frequencies. In other embodiments the ranking system operates on items having values to find separators which divide the items into bins such that the proportion of the items in each bin is different. For example, a data set may be apportioned to different types of processor.
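
As a rough illustration of the sampling idea in the abstract, the following Python sketch draws items uniformly at random, keeps a counter as the sample sketch, and reads the top k from that sketch without scanning every item. The function name `top_k_from_sample`, the hand-picked `num_samples` argument, and sampling with replacement are assumptions made for illustration only; the patent prescribes the number of samplings through a function of the confidence and error parameters or through a stopping rule.

```python
import random
from collections import Counter


def top_k_from_sample(data, k, num_samples, seed=0):
    """Estimate the k most frequent items from a random sample of `data`.

    Assumption: items are drawn uniformly at random with replacement.
    """
    rng = random.Random(seed)
    # The "sample sketch" here is simply a count of each sampled item.
    sketch = Counter(rng.choice(data) for _ in range(num_samples))
    # Rank items by how often they appeared in the sketch.
    return [item for item, _ in sketch.most_common(k)]


if __name__ == "__main__":
    data = ["a"] * 700 + ["b"] * 200 + ["c"] * 100
    print(top_k_from_sample(data, k=2, num_samples=2000))
```

The sketch only ever holds counts for sampled items, so its size depends on the number of samplings rather than on the size of the data set.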

15 Claims

1. A computer-implemented method of identifying a top k most frequently occurring items in a data set, the method comprising:
- arranging a processor to access a value of a confidence measure and a value of an error tolerance parameter;
  
  arranging the processor to sample items from the data set at least as many times as prescribed by a function of the confidence measure, the error tolerance parameter and other parameters which exclude a parameter m being a number of distinct items in the data set, and to continue sampling items until a stopping rule is met, the stopping rule being based in part on a posterior probability of error conditional on an observed empirical frequency of items in the data set;
  
  arranging a memory to form a sample sketch from the sampled items; and
  
  arranging the processor to identify the top k items from the sample sketch;
  
  wherein the stopping rule is selected from one of;
  
  stopping sampling when a sample counter exceeds the maximum value of the sum of an empirical frequency of a k-th largest frequency item in a sample set and a largest frequency that is smaller than a threshold or relative threshold for frequencies of false items, divided by the empirical frequency of the k-th largest frequency item in the sample set minus the largest frequency that is smaller than the threshold or relative threshold for frequencies of false items, multiplied by a Δ quantile of a normal distribution; and
  
  stopping sampling when a sample counter exceeds a sum of an empirical frequency of a k-th largest frequency item in a sample set and a largest frequency that is smaller than a threshold or relative threshold for frequencies of false items, divided by the empirical frequency of the k-th largest frequency item in the sample set minus the largest frequency that is smaller than the threshold or relative threshold for frequencies of false items squared, and multiplied by a Δ quantile of a normal distribution.
  - 2. A method as claimed in claim 1 which also provides the frequencies of the top k most frequently occurring items in the data set and wherein the processor is arranged to identify the top k most frequently occurring items in the sample sketch and the frequencies of occurrence of those items in the sample sketch.
  - 3. A method as claimed in claim 1 wherein the processor is arranged to access an absolute value of the error tolerance parameter.
  - 4. A method as claimed in claim 1 wherein the processor is arranged to access a relative value of the error tolerance parameter.
  - 5. A method as claimed in claim 1 wherein the items are sensor readings of error types in a manufacturing process and wherein the top k most frequent error types are identified and provided to a control system arranged to control the manufacturing process.
  - 6. A method as claimed in claim 1 wherein the items are keywords observed at an information retrieval system and wherein the top k most frequently occurring keywords are identified and provided to the information retrieval system as feedback.
  - 7. A method as claimed in claim 1 wherein the items are nodes in a communications network and wherein the top k most frequently occurring nodes are identified and provided as input to a management node of the communications network.
  - 8. A method as claimed in claim 1 wherein the data set comprises peta bytes of items.
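
The second stopping rule recited in claim 1 can be read, quite literally, as "stop once the sample counter exceeds (p_k + q) / (p_k - q)^2 times a normal quantile", where p_k is the empirical frequency of the k-th most frequent sampled item and q is the largest empirical frequency below the threshold for false items. The Python sketch below follows that reading under several assumptions that are not in the claim: the threshold is taken as p_k minus an absolute error tolerance `eps`, the quantile is the upper (1 - `delta`) quantile of the standard normal distribution, and the name `should_stop` and its parameters are hypothetical.

```python
from collections import Counter
from statistics import NormalDist


def should_stop(counts, n, k, eps, delta):
    """Return True when n samples satisfy a literal reading of the rule.

    counts: Counter of sampled items (the sample sketch)
    n:      number of samples drawn so far (the sample counter)
    """
    if n == 0 or len(counts) < k:
        return False
    freqs = sorted((c / n for c in counts.values()), reverse=True)
    p_k = freqs[k - 1]            # empirical frequency of the k-th item
    threshold = p_k - eps         # assumed absolute threshold for false items
    # Largest empirical frequency that falls below the threshold (0 if none).
    q = max((f for f in freqs[k:] if f < threshold), default=0.0)
    gap = p_k - q
    if gap <= 0:
        return False
    z = NormalDist().inv_cdf(1 - delta)  # assumed quantile convention
    return n > z * (p_k + q) / gap ** 2


if __name__ == "__main__":
    clear = Counter({"a": 60, "b": 30, "c": 5, "d": 5})
    close = Counter({"a": 26, "b": 25, "c": 25, "d": 24})
    print(should_stop(clear, 100, k=2, eps=0.1, delta=0.05))
    print(should_stop(close, 100, k=2, eps=0.001, delta=0.05))
```

Under this reading, a wide gap between the k-th item and the false items lets sampling stop early, while near ties push the required sample count up sharply because the gap is squared in the denominator.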

9. A query engine arranged to query a data center storing on an order of peta bytes of items in a data set, to find a top k most frequently occurring items in the data center, the query engine comprising:
- a processor arranged to access a value of a confidence measure, a value of an error tolerance parameter, and an estimate of a frequency of the top kth most frequent item in the data set;
  
  the processor also being arranged to continue sampling items from the data set at least as many times as prescribed by a function of the confidence measure, the error tolerance parameter, the estimate of the frequency of the top kth most frequent item in the data set, and excluding a parameter m being a number of unique items in the data set, the number of unique items being different than a total number of items in the data set;
  
  a memory arranged to store a sample sketch comprising the sampled items; and
  
  the processor also being arranged to identify the top k most frequently occurring items in the sample sketch;
  
  wherein the function of the confidence measure is selected from;
  
  four times an estimate of the frequency of the top kth most frequent item in the data set divided by an error tolerance parameter squared, multiplied by, a logarithm of 1 over a prescribed upper bound on a probability of error and a logarithm of k items with frequencies greater than or equal to the estimate of the frequency of the top kth most frequent item multiplied by the smaller of;
  
  two divided by, the estimate of the frequency of the top kth most frequent item in the data set minus the error tolerance parameter; and
  
  the number of items in the data set minus k;
  
  four times an estimate of the top kth most frequent item in the data set divided by an error tolerance parameter squared multiplied by the sum of a logarithm of 1 over a prescribed upper bound on a probability of error and a logarithm of k items greater than or equal to the estimate of the top kth most frequent item multiplied by the smaller of two divided by, the estimate of the frequency of the top kth most frequent item in the data set minus the error tolerance parameter, plus the estimate of the frequency of the top kth most frequent item in the data set; and
  
  the number of items in the data set; and
  
  eight times an estimate of the frequency of the top most frequent item in the data set multiplied by 1 minus the estimate of the frequency of the top most frequent item in the data set, divided by an error parameter squared and multiplied by a logarithm of 1 over a prescribed upper bound on a probability of error and a logarithm of two times M where M is the smaller of two divided by, the estimate of the frequency of the top kth most frequent item in the data set minus the error tolerance parameter, plus 1; and
  
  the number of items in the data set.
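
The first alternative for the sampling function in claim 9 is worded ambiguously, so the following Python sketch shows only one possible reading of it: (4 p_k / ε²) · [ln(1/δ) + ln(k · min(2 / (p_k − ε), n − k))], where p_k is the estimated frequency of the k-th most frequent item, ε the error tolerance, δ the upper bound on the probability of error, and n the number of items in the data set. The grouping of terms, the use of natural logarithms, and the names `min_samples_first_bound`, `p_k`, `eps`, `delta`, and `n_items` are all assumptions for illustration.

```python
import math


def min_samples_first_bound(p_k, eps, delta, k, n_items):
    """Minimum number of samplings under one reading of the first bound.

    Note that the number of distinct items in the data set is not an input.
    """
    if not 0 < eps < p_k:
        raise ValueError("eps must lie strictly between 0 and p_k")
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    if n_items <= k:
        raise ValueError("n_items must exceed k")
    # "the smaller of" two divided by (p_k - eps), and n_items - k.
    cap = min(2.0 / (p_k - eps), n_items - k)
    log_term = math.log(1.0 / delta) + math.log(k * cap)
    return math.ceil(4.0 * p_k / eps ** 2 * log_term)


if __name__ == "__main__":
    print(min_samples_first_bound(0.1, 0.05, 0.01, k=10, n_items=10 ** 9))
```

For large data sets the `min` is governed by 2 / (p_k − ε), so under this reading the required number of samplings stops growing with the size of the data set.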

10. A ranking system comprising:
- an input arranged to access a data set comprising items having values;
  
  a memory arranged to store a required number of bins into which the values are to be partitioned;
  
  a processor arranged to find separator values for allocating the items into the bins in order that a specified proportion of the items is allocated to each bin, that specified proportion being different for at least two bins;
  
  the processor being arranged to sample items from the data set at least until a specified function is met, that function being of a confidence measure, an error tolerance parameter, and a width of the smallest bin;
  
  a memory arranged to store a sample sketch comprising the sampled items; and
  
  wherein the processor is also arranged to find the separator values from the sample sketch;
  
  wherein the function of the confidence measure is selected from;
  
  four times an estimate of the frequency of the top kth most frequent item in the data set divided by an error tolerance parameter squared, multiplied by, a logarithm of 1 over a prescribed upper bound on a probability of error and a logarithm of k items with frequencies greater than or equal to the estimate of the frequency of the top kth most frequent item multiplied by the smaller of;
  
  two divided by, the estimate of the frequency of the top kth most frequent item in the data set minus the error tolerance parameter; and
  
  the number of items in the data set minus k;
  
  four times an estimate of the top kth most frequent item in the data set divided by an error tolerance parameter squared multiplied by the sum of a logarithm of 1 over a prescribed upper bound on a probability of error and a logarithm of k items greater than or equal to the estimate of the top kth most frequent item multiplied by the smaller of two divided by, the estimate of the frequency of the top kth most frequent item in the data set minus the error tolerance parameter, plus the estimate of the frequency of the top kth most frequent item in the data set; and
  
  the number of items in the data set; and
  
  eight times an estimate of the frequency of the top most frequent item in the data set multiplied by 1 minus the estimate of the frequency of the top most frequent item in the data set, divided by an error parameter squared and multiplied by a logarithm of 1 over a prescribed upper bound on a probability of error and a logarithm of two times M where M is the smaller of two divided by, the estimate of the frequency of the top kth most frequent item in the data set minus the error tolerance parameter, plus 1; and
  
  the number of items in the data set.
  - 11. A ranking system as claimed in claim 10 which further comprises a plurality of second processors and an output arranged to output items from the data set to the second processors on the basis of the separator values.
  - 12. A ranking system as claimed in claim 10 which is arranged to operate where the data set is comprises on the order of peta bytes of items.
  - 13. A ranking system as claimed in claim 11 wherein the second processors are connected to the ranking system over a communications network.
  - 14. A ranking system as claimed in claim 11 wherein the second processors are arranged to each process values at different scales.
  - 15. A ranking system as claimed in claim 10 wherein the values are sensor readings sensed from a mechanical system to be controlled.
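
The separator-finding step of claim 10 can be pictured as reading quantiles off a sorted sample. The Python sketch below assumes the sample sketch is a plain list of sampled values and uses a nearest-rank quantile; the name `separators_from_sample`, the `proportions` argument, and the rounding convention are illustrative assumptions, not details taken from the patent.

```python
def separators_from_sample(sample, proportions):
    """Return separator values that split `sample` into bins.

    proportions: target share of items per bin, summing to 1
                 (shares may differ between bins).
    """
    if not sample:
        raise ValueError("sample must not be empty")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    ordered = sorted(sample)
    n = len(ordered)
    separators = []
    cumulative = 0.0
    # One separator per boundary between adjacent bins.
    for share in proportions[:-1]:
        cumulative += share
        rank = int(round(cumulative * n))      # nearest-rank position
        index = min(n - 1, max(0, rank - 1))   # clamp to a valid index
        separators.append(ordered[index])
    return separators


if __name__ == "__main__":
    sample = list(range(1, 101))
    print(separators_from_sample(sample, [0.5, 0.25, 0.25]))
```

Items with values up to the first separator would go to the first bin, values up to the second separator to the second bin, and so on, which is one way the output could be routed to different processors as in claim 11.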

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Vasudevan, Dinkar; Vojnović, Milan
Primary Examiner(s): Trujillo, James
Assistant Examiner(s): Davanlou, Soheila
Application Number: US12/434,329
Publication Number: US 20100281033A1
Time in Patent Office: 1,523 Days
Field of Search: 707/748
US Class Current: 707/748
CPC Class Codes: G06F 16/24578 using ranking
