Preview data aggregation
Abstract
In one respect, there is provided a method. The method can include processing a first data chunk to generate a first intermediate result. A key map can be generated based on a determination that a quantity of the key-value pairs in the first intermediate result exceeds a threshold. The key map can be generated to include keys in the first intermediate result. A second data chunk can be processed to generate a second intermediate result. The second data chunk can be processed based on the key map. The processing of the second data chunk can include omitting a key-value pair in the second data chunk from being inserted into the second intermediate result based on a key associated with the key-value pair being absent from the key map. A preview of the processing of the dataset can be generated based on the first intermediate result and the second intermediate result.
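The flow summarized in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the function names, the sample data, the choice of summation as the aggregation, and the unfiltered fallback when no key map exists are all assumptions made for the example.

```python
def process_chunk(chunk):
    """Build an intermediate result from (key, value) pairs.

    Assumption: values that share a key are summed. The source only says
    that key-value pairs are inserted into the intermediate result.
    """
    result = {}
    for key, value in chunk:
        result[key] = result.get(key, 0) + value
    return result


def build_key_map(intermediate, threshold):
    """Return a key -> position map once the result exceeds the threshold.

    Returns None when the intermediate result is still small enough that
    no key map is generated.
    """
    if len(intermediate) <= threshold:
        return None
    # Python dicts preserve insertion order, so positions follow the order
    # in which keys first appeared in the first intermediate result.
    return {key: position for position, key in enumerate(intermediate)}


def process_with_key_map(chunk, key_map):
    """Process a later chunk, keeping only keys present in the key map."""
    if key_map is None:
        # Assumption: without a key map the chunk is processed unfiltered.
        return process_chunk(chunk)
    totals = {}
    for key, value in chunk:
        if key not in key_map:
            continue  # key absent from the key map: the pair is omitted
        totals[key] = totals.get(key, 0) + value
    # Emit the surviving pairs in the same order as the keys in the key map.
    return {key: totals[key] for key in key_map if key in totals}


# Hypothetical sample data: (country, amount) pairs split into two chunks.
first_chunk = [("DE", 10), ("US", 20), ("FR", 5), ("DE", 1)]
second_chunk = [("FR", 2), ("JP", 7), ("DE", 3)]

first_result = process_chunk(first_chunk)
key_map = build_key_map(first_result, threshold=2)
second_result = process_with_key_map(second_chunk, key_map)
print(second_result)  # {'DE': 3, 'FR': 2}
```

In this example the pair with key "JP" is dropped because "JP" never appeared in the first intermediate result, and the remaining pairs come out in key map order ("DE" before "FR") even though the second chunk listed "FR" first.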
14 Claims
1. A computer implemented method, comprising:
processing, at a first worker node, a first data chunk of a dataset to generate a first intermediate result, the processing of the first data chunk comprising inserting a first plurality of key-value pairs from the first data chunk into the first intermediate result, the dataset being partitioned into the first data chunk and a second data chunk;
generating, at a merger node, a key map based at least on a determination that a quantity of the first plurality of key-value pairs in the first intermediate result exceeds a threshold value, the key map being generated to include one or more keys of the key-value pairs in the first intermediate result;
processing, at a second worker node, the second data chunk to generate a second intermediate result, the processing of the second data chunk includes inserting, into the second intermediate result, a first key-value pair and a second key-value pair based at least on a first key associated with the first key-value pair and a second key associated with the second key-value pair being present in the key map, the processing of the second data chunk further includes omitting, from the second intermediate result, a third key-value pair based at least on a third key associated with the third key-value pair being absent from the key map, the first key-value pair and the second key-value pair being inserted in a same order as an order of the first key and the second key in the key map; and
generating a preview of the processing of the dataset, the preview being generated by at least merging the first intermediate result and the second intermediate result without identifying one or more key-value pairs from each of the first intermediate result and the second intermediate result that share a same key.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system, comprising:
at least one data processor; and
at least one memory storing instructions which, when executed by the at least one data processor, result in operations comprising:
processing, at a first worker node, a first data chunk of a dataset to generate a first intermediate result, the processing of the first data chunk comprising inserting a first plurality of key-value pairs from the first data chunk into the first intermediate result, the dataset being partitioned into the first data chunk and a second data chunk;
generating, at a merger node, a key map based at least on a determination that a quantity of the first plurality of key-value pairs in the first intermediate result exceeds a threshold value, the key map being generated to include one or more keys of the key-value pairs in the first intermediate result;
processing, at a second worker node, the second data chunk to generate a second intermediate result, the processing of the second data chunk includes inserting, into the second intermediate result, a first key-value pair and a second key-value pair based at least on a first key associated with the first key-value pair and a second key associated with the second key-value pair being present in the key map, the processing of the second data chunk further includes omitting, from the second intermediate result, a third key-value pair based at least on a third key associated with the third key-value pair being absent from the key map, the first key-value pair and the second key-value pair being inserted in a same order as an order of the first key and the second key in the key map; and
generating a preview of the processing of the dataset, the preview being generated by at least merging the first intermediate result and the second intermediate result without identifying one or more key-value pairs from each of the first intermediate result and the second intermediate result that share a same key.
Dependent claims: 9, 10, 11, 12, 13
14. A non-transitory computer-readable storage medium including program code, which when executed by at least one data processor, cause operations comprising:
processing, at a first worker node, a first data chunk of a dataset to generate a first intermediate result, the processing of the first data chunk comprising inserting a first plurality of key-value pairs from the first data chunk into the first intermediate result, the dataset being partitioned into the first data chunk and a second data chunk;
generating, at a merger node, a key map based at least on a determination that a quantity of the first plurality of key-value pairs in the first intermediate result exceeds a threshold value, the key map being generated to include one or more keys of the key-value pairs in the first intermediate result;
processing, at a second worker node, the second data chunk to generate a second intermediate result, the processing of the second data chunk includes inserting, into the second intermediate result, a first key-value pair and a second key-value pair based at least on a first key associated with the first key-value pair and a second key associated with the second key-value pair being present in the key map, the processing of the second data chunk further includes omitting, from the second intermediate result, a third key-value pair based at least on a third key associated with the third key-value pair being absent from the key map, the first key-value pair and the second key-value pair being inserted in a same order as an order of the first key and the second key in the key map; and
generating a preview of the processing of the dataset, the preview being generated by at least merging the first intermediate result and the second intermediate result without identifying one or more key-value pairs from each of the first intermediate result and the second intermediate result that share a same key.
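One possible reading of the final merging step recited in the independent claims is that, because later intermediate results follow the key map order, the merger can combine entries by position rather than searching for matching keys. The sketch below illustrates that reading only. It assumes each intermediate result is stored as a list of values aligned to key map positions, with None marking a key that a chunk did not contain; the claims do not state this layout, and the names merge_by_position and combine are hypothetical.

```python
from operator import add


def merge_by_position(first, second, combine):
    """Merge two intermediate results slot by slot.

    Both lists are assumed to be aligned to the same key map, so slot i in
    each list refers to the same key. No key comparison or lookup happens
    here; the pairing comes from the position alone.
    """
    merged = []
    for left, right in zip(first, second):
        if left is None:
            merged.append(right)
        elif right is None:
            merged.append(left)
        else:
            merged.append(combine(left, right))
    return merged


# Hypothetical key map order: position 0 -> "DE", 1 -> "US", 2 -> "FR".
key_order = ["DE", "US", "FR"]
first_values = [11, 20, 5]     # first intermediate result
second_values = [3, None, 2]   # second intermediate result, no "US" entry

preview_values = merge_by_position(first_values, second_values, add)
preview = list(zip(key_order, preview_values))
print(preview)  # [('DE', 14), ('US', 20), ('FR', 7)]
```

The design point is that the merge loop touches only values: the keys are attached afterwards from the shared key order, so the cost of the merge does not depend on hashing or comparing keys.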
Specification