Horizon histogram optimizations
First Claim
1. A method of identifying distinct values that occur at or above a threshold frequency within a data set, the method comprising:
- allocating a number of count storage buckets, wherein a function of the threshold frequency determines how many count storage buckets to allocate;
performing a counting operation on a particular subset of the data set by, for each item of the particular subset;
when the item corresponds to a distinct value that is currently associated with a particular bucket of the count storage buckets, incrementing a count of the particular bucket;
when the item corresponds to a distinct value that is not currently associated with any of the count storage buckets and at least one of the count storage buckets is available to store a count for the distinct value, associating the distinct value with an available bucket and initializing a count of the available bucket;
when the item corresponds to a distinct value that is not currently associated with any of the count storage buckets and none of the count storage buckets is available to store a count for the distinct value, decrementing at least each count of the count storage buckets that is positive;
selecting, for the particular subset of the data set, a candidate set of distinct values and associated counts from the distinct values that are associated with the count storage buckets after the counting operation;
selecting a plurality of candidate sets of distinct values and associated counts by repeating at least the counting operation and the selecting of the candidate set for each particular subset of a plurality of non-overlapping subsets of the data set;
forming a merged candidate set of distinct values by merging the plurality of candidate sets based on the associated counts;
for each distinct value in the merged candidate set, determining a frequency of occurrence of the distinct value within the data set;
identifying a set of high-frequency distinct values by comparing the frequency of occurrence of each distinct values in the merged candidate set with the threshold frequency;
wherein the method is performed by one or more computing devices.
8 Assignments
0 Petitions
Accused Products
Abstract
Values that occur above a threshold frequency for certain characteristic(s) of a data set are identified. A limited number of count buckets are allocated based on the threshold. Buckets store proxy counts for identifying candidate sets of values rather than actual counts. The data set is divided and each portion is analyzed separately, by iterating through each item in that portion. During each iteration, depending on an item'"'"'s value(s), a bucket is incremented, all buckets are decremented, or a bucket is assigned or reassigned to count different value(s). A candidate set of values and associated counts is selected for a portion based on the buckets. The candidate sets for each portion are merged and, in some embodiments, filtered based on the associated counts. Actual frequencies are then determined for the values that remain in the merged candidate set.
109 Citations
33 Claims
-
1. A method of identifying distinct values that occur at or above a threshold frequency within a data set, the method comprising:
-
allocating a number of count storage buckets, wherein a function of the threshold frequency determines how many count storage buckets to allocate; performing a counting operation on a particular subset of the data set by, for each item of the particular subset; when the item corresponds to a distinct value that is currently associated with a particular bucket of the count storage buckets, incrementing a count of the particular bucket; when the item corresponds to a distinct value that is not currently associated with any of the count storage buckets and at least one of the count storage buckets is available to store a count for the distinct value, associating the distinct value with an available bucket and initializing a count of the available bucket; when the item corresponds to a distinct value that is not currently associated with any of the count storage buckets and none of the count storage buckets is available to store a count for the distinct value, decrementing at least each count of the count storage buckets that is positive; selecting, for the particular subset of the data set, a candidate set of distinct values and associated counts from the distinct values that are associated with the count storage buckets after the counting operation; selecting a plurality of candidate sets of distinct values and associated counts by repeating at least the counting operation and the selecting of the candidate set for each particular subset of a plurality of non-overlapping subsets of the data set; forming a merged candidate set of distinct values by merging the plurality of candidate sets based on the associated counts; for each distinct value in the merged candidate set, determining a frequency of occurrence of the distinct value within the data set; identifying a set of high-frequency distinct values by comparing the frequency of occurrence of each distinct values in the merged candidate set with the threshold frequency; wherein the method is performed by one or more computing devices. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
-
-
12. One or more non-transitory computer-readable storage media storing instructions for identifying distinct values that occur at or above a threshold frequency within a data set
wherein the instructions, when executed by one or more computing devices, cause performance of: -
allocating a number of count storage buckets, wherein a function of the threshold frequency determines how many count storage buckets to allocate; performing a counting operation on a particular subset of the data set by, for each item of the particular subset; when the item corresponds to a distinct value that is currently associated with a particular bucket of the count storage buckets, incrementing a count of the particular bucket; when the item corresponds to a distinct value that is not currently associated with any of the count storage buckets and at least one of the count storage buckets is available to store a count for the distinct value, associating the distinct value with an available bucket and initializing a count of the available bucket; when the item corresponds to a distinct value that is not currently associated with any of the count storage buckets and none of the count storage buckets is available to store a count for the distinct value, decrementing at least each count of the count storage buckets that is positive; selecting, for the particular subset of the data set, a candidate set of distinct values and associated counts from the distinct values that are associated with the count storage buckets after the counting operation; selecting a plurality of candidate sets of distinct values and associated counts by repeating at least the counting operation and the selecting of the candidate set for each particular subset of a plurality of non-overlapping subsets of the data set; forming a merged candidate set of distinct values by merging the plurality of candidate sets based on the associated counts; for each distinct value in the merged candidate set, determining a frequency of occurrence of the distinct value within the data set; identifying a set of high-frequency distinct values by comparing the frequencies frequency of occurrence of the each distinct values in the merged candidate set with the threshold frequency. - View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20, 21, 22)
-
-
23. A system for identifying distinct values that occur at or above a threshold frequency within a data set, comprising:
-
a plurality of processors; and one or more memories comprising a plurality of sets of count storage buckets; one or more computer-readable storage media storing the data set; wherein the plurality of processors is configured to allocate the plurality of sets of count storage buckets by allocating, for each particular set of the plurality of sets, a number of buckets, wherein a function of the threshold frequency determines how many count storage buckets to allocate; wherein a first subset of the plurality of processors is configured to instruct each of a plurality of non-overlapping subsets of the plurality of processors to identify candidate sets of high frequency items for different portions of the data set; wherein each of the plurality of non-overlapping subsets of the plurality of processors is configured to identify a candidate set of high frequency items and associated proxy counts using a different particular set of count storage buckets by, for each item of a portion of the data set; when the item corresponds to a distinct value that is currently associated with a particular bucket of the particular set of count storage buckets, incrementing a proxy count of the particular bucket; when the item corresponds to a distinct value that is not currently associated with any bucket in the particular set of count storage buckets and at least one bucket in the particular set of count storage buckets is available to store a proxy count for the distinct value, associating the distinct value with an available bucket and initializing a proxy count of the available bucket; when the item corresponds to a distinct value that is not currently associated with any bucket in the particular set of count storage buckets and no bucket in the particular set of count storage buckets is available to store a proxy count for the distinct value, decrementing at least each proxy count of the particular set of count storage buckets that is positive; wherein the first subset of the plurality of processors is further configured to; form a merged candidate set of distinct values by merging the candidate sets based on the associated proxy counts; for each distinct value in the merged candidate set, determine a frequency of occurrence of the distinct value within the data set; identify a set of high-frequency distinct values by comparing the frequency of occurrence of each merged candidate set with the threshold frequency. - View Dependent Claims (24, 25, 26, 27, 28, 29, 30, 31, 32, 33)
-
Specification