×

Horizon histogram optimizations

  • US 8,433,702 B1
  • Filed: 09/28/2011
  • Issued: 04/30/2013
  • Est. Priority Date: 09/28/2011
  • Status: Active Grant
First Claim
Patent Images

1. A method of identifying distinct values that occur at or above a threshold frequency within a data set, the method comprising:

  • allocating a number of count storage buckets, wherein a function of the threshold frequency determines how many count storage buckets to allocate;

    performing a counting operation on a particular subset of the data set by, for each item of the particular subset;

    when the item corresponds to a distinct value that is currently associated with a particular bucket of the count storage buckets, incrementing a count of the particular bucket;

    when the item corresponds to a distinct value that is not currently associated with any of the count storage buckets and at least one of the count storage buckets is available to store a count for the distinct value, associating the distinct value with an available bucket and initializing a count of the available bucket;

    when the item corresponds to a distinct value that is not currently associated with any of the count storage buckets and none of the count storage buckets is available to store a count for the distinct value, decrementing at least each count of the count storage buckets that is positive;

    selecting, for the particular subset of the data set, a candidate set of distinct values and associated counts from the distinct values that are associated with the count storage buckets after the counting operation;

    selecting a plurality of candidate sets of distinct values and associated counts by repeating at least the counting operation and the selecting of the candidate set for each particular subset of a plurality of non-overlapping subsets of the data set;

    forming a merged candidate set of distinct values by merging the plurality of candidate sets based on the associated counts;

    for each distinct value in the merged candidate set, determining a frequency of occurrence of the distinct value within the data set;

    identifying a set of high-frequency distinct values by comparing the frequency of occurrence of each distinct values in the merged candidate set with the threshold frequency;

    wherein the method is performed by one or more computing devices.

View all claims
  • 8 Assignments
Timeline View
Assignment View
    ×
    ×