Horizon histogram optimizations

US 8,433,702 B1
Filed: 09/28/2011
Issued: 04/30/2013
Est. Priority Date: 09/28/2011
Status: Active Grant

8 Assignments

0 Petitions

Abstract

Values that occur above a threshold frequency for certain characteristic(s) of a data set are identified. A limited number of count buckets are allocated based on the threshold. Buckets store proxy counts for identifying candidate sets of values rather than actual counts. The data set is divided and each portion is analyzed separately, by iterating through each item in that portion. During each iteration, depending on an item's value(s), a bucket is incremented, all buckets are decremented, or a bucket is assigned or reassigned to count different value(s). A candidate set of values and associated counts is selected for a portion based on the buckets. The candidate sets for each portion are merged and, in some embodiments, filtered based on the associated counts. Actual frequencies are then determined for the values that remain in the merged candidate set.
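
The per-portion counting pass described in the abstract can be sketched roughly as follows. This is only an illustration, not the patented implementation: the function name `count_portion`, the `num_buckets` parameter, initializing a new bucket to 1, and freeing a bucket once its proxy count reaches zero are all assumptions made for the example.

```python
def count_portion(items, num_buckets):
    """Return proxy counts for one portion of the data set.

    Hypothetical helper: the returned dict maps each surviving
    distinct value to a proxy count, not to its true count.
    """
    buckets = {}  # distinct value -> proxy count
    for value in items:
        if value in buckets:
            # Value already owns a bucket: increment it.
            buckets[value] += 1
        elif len(buckets) < num_buckets:
            # A bucket is available: assign it and initialize the count.
            buckets[value] = 1
        else:
            # No bucket available: decrement every positive count.
            # Assumption: a bucket that drops to zero becomes available.
            for key in list(buckets):
                buckets[key] -= 1
                if buckets[key] == 0:
                    del buckets[key]
    return buckets


candidates = count_portion(["a", "b", "a", "c", "a", "d", "a"], num_buckets=2)
```

With two buckets, the value `"a"` keeps its bucket throughout, while `"b"` is evicted when `"c"` arrives and finds no available bucket.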

109 Citations

33 Claims

1. A method of identifying distinct values that occur at or above a threshold frequency within a data set, the method comprising:
- allocating a number of count storage buckets, wherein a function of the threshold frequency determines how many count storage buckets to allocate;
  
  performing a counting operation on a particular subset of the data set by, for each item of the particular subset:
  
  when the item corresponds to a distinct value that is currently associated with a particular bucket of the count storage buckets, incrementing a count of the particular bucket;
  
  when the item corresponds to a distinct value that is not currently associated with any of the count storage buckets and at least one of the count storage buckets is available to store a count for the distinct value, associating the distinct value with an available bucket and initializing a count of the available bucket;
  
  when the item corresponds to a distinct value that is not currently associated with any of the count storage buckets and none of the count storage buckets is available to store a count for the distinct value, decrementing at least each count of the count storage buckets that is positive;
  
  selecting, for the particular subset of the data set, a candidate set of distinct values and associated counts from the distinct values that are associated with the count storage buckets after the counting operation;
  
  selecting a plurality of candidate sets of distinct values and associated counts by repeating at least the counting operation and the selecting of the candidate set for each particular subset of a plurality of non-overlapping subsets of the data set;
  
  forming a merged candidate set of distinct values by merging the plurality of candidate sets based on the associated counts;
  
  for each distinct value in the merged candidate set, determining a frequency of occurrence of the distinct value within the data set;
  
  identifying a set of high-frequency distinct values by comparing the frequency of occurrence of each distinct value in the merged candidate set with the threshold frequency;
  
  wherein the method is performed by one or more computing devices.
- Dependent claims (2 to 11):
  - 2. The method of claim 1, wherein the number of count storage buckets allocated for each counting operation is less than the number of distinct values to which the items in the data set correspond.
  - 3. The method of claim 1, wherein the number of count storage buckets allocated for each counting operation is the inverse of the threshold frequency, rounded.
  - 4. The method of claim 1, wherein determining that an item corresponds to a distinct value comprises determining that the distinct value appears for a target characteristic of the item.
  - 5. The method of claim 1, wherein merging the plurality of candidate sets comprises:
    - for each distinct value found in the plurality of candidate sets, summing all counts associated with the distinct value;
      
      accepting into the merged candidate set only those distinct values having the highest summed counts, wherein the number of values in the merged candidate set is based on the inverse of the threshold frequency.
  - 6. The method of claim 1, wherein merging the plurality of candidate sets comprises:
    - generating a proxy data set comprising one or more members for each distinct value in the plurality of candidate sets, wherein the one or more members correspond, in number, to the associated counts for that distinct value;
      
      performing a proxy counting operation on the proxy data set by, for each member of the proxy data set:
      
      when the member corresponds to a distinct value that is currently associated with a particular bucket of the count storage buckets, incrementing a count of the particular bucket;
      
      when the member corresponds to a distinct value that is not associated with any of the count storage buckets and at least one of the count storage buckets is available to store a count for the distinct value, associating the distinct value with an available bucket and initializing a count of the available bucket;
      
      when the member corresponds to a distinct value that is not associated with any of the count storage buckets and none of the count storage buckets is available to store a count for the distinct value, decrementing at least each count of the count storage buckets that is positive;
      
      identifying the merged candidate set based on which distinct values are associated with the count storage buckets upon completion of the proxy counting operation.
  - 7. The method of claim 1, wherein a count storage bucket is an available bucket when the count storage bucket is empty or when the count storage bucket stores a count that is less than a count to which buckets are initialized.
  - 8. The method of claim 1, wherein repeating the counting operation for each particular subset of the data set comprises performing each counting operation at a different node of a plurality of nodes.
  - 9. The method of claim 8, wherein at least two or more nodes in the plurality of nodes perform the counting operation in parallel using different sets of count storage buckets.
  - 10. The method of claim 8, wherein each node has exclusive access to the particular subset of the data set for which the node performs the counting operation.
  - 11. The method of claim 1, wherein decrementing at least each count of the count storage buckets that is positive comprises decrementing each count by one.
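
Claims 3 and 5 together suggest one way to size the buckets and merge candidate sets by summed counts. The sketch below is a hedged illustration: `buckets_for_threshold` and `merge_by_sum` are hypothetical names, the claims do not say which rounding rule applies (Python's built-in `round` is used here as an assumption), and keeping exactly that many values in the merged set is one reading of "based on the inverse of the threshold frequency".

```python
from collections import Counter


def buckets_for_threshold(threshold):
    # Inverse of the threshold frequency, rounded (rounding rule assumed).
    return max(1, round(1 / threshold))


def merge_by_sum(candidate_sets, threshold):
    """Sum the counts per distinct value, then keep the highest totals."""
    totals = Counter()
    for candidates in candidate_sets:
        totals.update(candidates)  # adds counts for values seen before
    keep = buckets_for_threshold(threshold)
    return dict(totals.most_common(keep))


merged = merge_by_sum(
    [{"a": 3, "d": 1}, {"a": 2, "b": 2}, {"c": 1}],
    threshold=0.5,
)
```

The merged set still holds proxy counts only, so a separate exact count over the data set is needed before comparing against the threshold.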

12. One or more non-transitory computer-readable storage media storing instructions for identifying distinct values that occur at or above a threshold frequency within a data set, wherein the instructions, when executed by one or more computing devices, cause performance of:
- allocating a number of count storage buckets, wherein a function of the threshold frequency determines how many count storage buckets to allocate;
  
  performing a counting operation on a particular subset of the data set by, for each item of the particular subset:
  
  when the item corresponds to a distinct value that is currently associated with a particular bucket of the count storage buckets, incrementing a count of the particular bucket;
  
  when the item corresponds to a distinct value that is not currently associated with any of the count storage buckets and at least one of the count storage buckets is available to store a count for the distinct value, associating the distinct value with an available bucket and initializing a count of the available bucket;
  
  when the item corresponds to a distinct value that is not currently associated with any of the count storage buckets and none of the count storage buckets is available to store a count for the distinct value, decrementing at least each count of the count storage buckets that is positive;
  
  selecting, for the particular subset of the data set, a candidate set of distinct values and associated counts from the distinct values that are associated with the count storage buckets after the counting operation;
  
  selecting a plurality of candidate sets of distinct values and associated counts by repeating at least the counting operation and the selecting of the candidate set for each particular subset of a plurality of non-overlapping subsets of the data set;
  
  forming a merged candidate set of distinct values by merging the plurality of candidate sets based on the associated counts;
  
  for each distinct value in the merged candidate set, determining a frequency of occurrence of the distinct value within the data set;
  
  identifying a set of high-frequency distinct values by comparing the frequency of occurrence of each distinct value in the merged candidate set with the threshold frequency.
- Dependent claims (13 to 22):
  - 13. The one or more non-transitory computer-readable storage media of claim 12, wherein the number of count storage buckets allocated for each counting operation is less than the number of distinct values to which the items in the data set correspond.
  - 14. The one or more non-transitory computer-readable storage media of claim 12, wherein the number of count storage buckets allocated for each counting operation is the inverse of the threshold frequency, rounded.
  - 15. The one or more non-transitory computer-readable storage media of claim 12, wherein determining that an item corresponds to a distinct value comprises determining that the distinct value appears for a target characteristic of the item.
  - 16. The one or more non-transitory computer-readable storage media of claim 12, wherein merging the plurality of candidate sets comprises:
    - for each distinct value found in the plurality of candidate sets, summing all counts associated with the distinct value;
      
      accepting into the merged candidate set only those distinct values having the highest summed counts, wherein the number of values in the merged candidate set is based on the inverse of the threshold frequency.
  - 17. The one or more non-transitory computer-readable storage media of claim 12, wherein merging the plurality of candidate sets comprises:
    - generating a proxy data set comprising one or more members for each distinct value in the plurality of candidate sets, wherein the one or more members correspond, in number, to the associated counts for that distinct value;
      
      performing a proxy counting operation on the proxy data set by, for each member of the proxy data set:
      
      when the member corresponds to a distinct value that is currently associated with a particular bucket of the count storage buckets, incrementing a count of the particular bucket;
      
      when the member corresponds to a distinct value that is not associated with any of the count storage buckets and at least one of the count storage buckets is available to store a count for the distinct value, associating the distinct value with an available bucket and initializing a count of the available bucket;
      
      when the member corresponds to a distinct value that is not associated with any of the count storage buckets and none of the count storage buckets is available to store a count for the distinct value, decrementing at least each count of the count storage buckets that is positive;
      
      identifying the merged candidate set based on which distinct values are associated with the count storage buckets upon completion of the proxy counting operation.
  - 18. The one or more non-transitory computer-readable storage media of claim 12, wherein a count storage bucket is an available bucket when the count storage bucket is empty or when the count storage bucket stores a count that is less than a count to which buckets are initialized.
  - 19. The one or more non-transitory computer-readable storage media of claim 12, wherein repeating the counting operation for each particular subset of the data set comprises performing each counting operation at a different node of a plurality of nodes.
  - 20. The one or more non-transitory computer-readable storage media of claim 19, wherein at least two or more nodes in the plurality of nodes perform the counting operation in parallel using different sets of count storage buckets.
  - 21. The one or more non-transitory computer-readable storage media of claim 19, wherein each node has exclusive access to the particular subset of the data set for which the node performs the counting operation.
  - 22. The one or more computer-readable storage media of claim 12, wherein decrementing at least each count of the count storage buckets that is positive comprises decrementing each count by one.
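
The proxy-data-set merge of claims 6 and 17 could look something like the sketch below: each candidate value is repeated as many times as its associated count, and the same counting pass is run over that expanded sequence. The names `count_items` and `merge_by_proxy`, the order in which members are generated, and the rule that a bucket is freed at zero are assumptions for illustration only.

```python
from itertools import chain, repeat


def count_items(items, num_buckets):
    # Same counting pass as for a data-set portion (assumed details:
    # counts start at 1 and a bucket is freed when it reaches zero).
    buckets = {}
    for value in items:
        if value in buckets:
            buckets[value] += 1
        elif len(buckets) < num_buckets:
            buckets[value] = 1
        else:
            for key in list(buckets):
                buckets[key] -= 1
                if buckets[key] == 0:
                    del buckets[key]
    return buckets


def merge_by_proxy(candidate_sets, num_buckets):
    # Proxy data set: one member per unit of associated count.
    proxy = chain.from_iterable(
        repeat(value, count)
        for candidates in candidate_sets
        for value, count in candidates.items()
    )
    # The merged candidate set is whatever still owns a bucket at the end.
    return set(count_items(proxy, num_buckets))


merged = merge_by_proxy([{"a": 3, "d": 1}, {"a": 2, "b": 1}], num_buckets=2)
```

Because the result depends on the order of the proxy members, a real system would need to fix that order; here it simply follows the order of the candidate sets.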

23. A system for identifying distinct values that occur at or above a threshold frequency within a data set, comprising:
- a plurality of processors; and
  
  one or more memories comprising a plurality of sets of count storage buckets;
  
  one or more computer-readable storage media storing the data set;
  
  wherein the plurality of processors is configured to allocate the plurality of sets of count storage buckets by allocating, for each particular set of the plurality of sets, a number of buckets, wherein a function of the threshold frequency determines how many count storage buckets to allocate;
  
  wherein a first subset of the plurality of processors is configured to instruct each of a plurality of non-overlapping subsets of the plurality of processors to identify candidate sets of high frequency items for different portions of the data set;
  
  wherein each of the plurality of non-overlapping subsets of the plurality of processors is configured to identify a candidate set of high frequency items and associated proxy counts using a different particular set of count storage buckets by, for each item of a portion of the data set:
  
  when the item corresponds to a distinct value that is currently associated with a particular bucket of the particular set of count storage buckets, incrementing a proxy count of the particular bucket;
  
  when the item corresponds to a distinct value that is not currently associated with any bucket in the particular set of count storage buckets and at least one bucket in the particular set of count storage buckets is available to store a proxy count for the distinct value, associating the distinct value with an available bucket and initializing a proxy count of the available bucket;
  
  when the item corresponds to a distinct value that is not currently associated with any bucket in the particular set of count storage buckets and no bucket in the particular set of count storage buckets is available to store a proxy count for the distinct value, decrementing at least each proxy count of the particular set of count storage buckets that is positive;
  
  wherein the first subset of the plurality of processors is further configured to:
  
  form a merged candidate set of distinct values by merging the candidate sets based on the associated proxy counts;
  
  for each distinct value in the merged candidate set, determine a frequency of occurrence of the distinct value within the data set;
  
  identify a set of high-frequency distinct values by comparing the frequency of occurrence of each merged candidate set with the threshold frequency.
- Dependent claims (24 to 33):
  - 24. The system of claim 23, wherein the number of count storage buckets in each set of count storage buckets is less than the number of distinct values to which the items in the data set correspond.
  - 25. The system of claim 23, wherein the number of count storage buckets allocated in each set of count storage buckets is the inverse of the threshold frequency, rounded.
  - 26. The system of claim 23, wherein determining that an item corresponds to a distinct value comprises determining that the distinct value appears for a target characteristic of the item.
  - 27. The system of claim 23, wherein merging the candidate sets comprises:
    - for each distinct value found in the candidate sets, summing all proxy counts associated with the distinct value;
      
      accepting into the merged candidate set only those distinct values having the highest summed proxy counts, wherein the number of values in the merged candidate set is based on the inverse of the threshold frequency.
  - 28. The system of claim 23, wherein merging the candidate sets comprises:
    - generating a proxy data set comprising one or more members for each distinct value in the candidate sets, wherein the one or more members correspond, in number, to the associated proxy counts for that distinct value;
      
      performing a proxy counting operation on the proxy data set by, for each member of the proxy data set:
      
      when the member corresponds to a distinct value that is currently associated with a particular bucket of a first set of count storage buckets, incrementing a proxy count of the particular bucket;
      
      when the member corresponds to a distinct value that is not associated with any bucket in the first set of count storage buckets and at least one bucket in the first set of count storage buckets is available to store a proxy count for the distinct value, associating the distinct value with an available bucket and initializing a proxy count of the available bucket;
      
      when the member corresponds to a distinct value that is not associated with any bucket in the first set of count storage buckets and no bucket in the first set of count storage buckets is available to store a proxy count for the distinct value, decrementing at least each proxy count of the first set of count storage buckets that is positive;
      
      identifying the merged candidate set based on which distinct values are associated with the first set of count storage buckets upon completion of the proxy counting operation.
  - 29. The system of claim 23, wherein a count storage bucket is an available bucket when the count storage bucket is empty or when the count storage bucket stores a proxy count that is less than a proxy count to which count storage buckets are initialized.
  - 30. The system of claim 23, wherein each different non-overlapping subset of the plurality of processors resides at a different set of one or more computing devices.
  - 31. The system of claim 23, wherein at least two or more of the non-overlapping subsets of the plurality of processors identify candidate sets in parallel.
  - 32. The system of claim 23, wherein each different non-overlapping subset of the plurality of processors has exclusive access to the particular portion of the data set for which the non-overlapping subset of the plurality of processors identifies candidate sets.
  - 33. The system of claim 23, wherein decrementing at least each count of the count storage buckets that is positive comprises decrementing each count by one.
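
An end-to-end flow in the spirit of claim 23 (parallel counting per portion, merge, exact verification) might be sketched as below. Everything here is an assumption for illustration: worker threads stand in for the separate processor subsets or nodes, the merge is a plain union of candidate values rather than a count-based filter, and `high_frequency_values` is a hypothetical name.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_portion(items, num_buckets):
    # Proxy counting pass over one portion (bucket freed at zero: assumed).
    buckets = {}
    for value in items:
        if value in buckets:
            buckets[value] += 1
        elif len(buckets) < num_buckets:
            buckets[value] = 1
        else:
            for key in list(buckets):
                buckets[key] -= 1
                if buckets[key] == 0:
                    del buckets[key]
    return buckets


def high_frequency_values(portions, threshold):
    num_buckets = max(1, round(1 / threshold))
    # Each portion gets its own set of buckets; threads stand in for nodes.
    with ThreadPoolExecutor() as pool:
        candidate_sets = list(
            pool.map(lambda portion: count_portion(portion, num_buckets), portions)
        )
    # Simplified merge: keep every candidate from every portion.
    merged = set().union(*candidate_sets)
    # Exact counts are computed only for the merged candidates.
    exact = Counter(v for portion in portions for v in portion if v in merged)
    total = sum(len(portion) for portion in portions)
    return {v: n for v, n in exact.items() if n >= threshold * total}


portions = [
    ["a", "b", "a", "c"],
    ["a", "d", "e", "a"],
    ["b", "b", "a", "f"],
]
result = high_frequency_values(portions, threshold=0.25)
```

The final comparison uses true counts, so a value that survives the proxy stage only by chance is still rejected if it falls below the threshold.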

Current Assignee: Palantir Technologies Incorporated
Original Assignee: Palantir Technologies Incorporated
Inventors: Carrino, John A.; Harris, Michael
Primary Examiner(s): Lu, Kuen S
Assistant Examiner(s): Obisesan, Augustine Kunle
Application Number: US13/246,867
Time in Patent Office: 580 Days
Field of Search: None
US Class Current: 707/717
CPC Class Codes: G06F 16/903 Querying for retrieval from...
