Method for shard assignment in a large-scale data processing job
First Claim
Patent Images
1. A system for shard assignment in a distributed data processing system, the system comprising:
- one or more processing devices;
one or more storage devices storing instructions that, when executed by the one or more processing devices, cause the one or more processing devices to implement;
a plurality of worker processes anda master process for coordinating a data processing job that;
divides an input dataset into a plurality of shards;
indexes the plurality of shards;
aggregates the plurality of shards into one or more groups based on the shards'"'"' indices;
initially assigns an indexed shard from each group to a worker process; and
in response to a worker having processed its initially assigned indexed shard, assigns subsequent shards from the same group as the initially assigned shard to the worker process based on the index of the previously-assigned shard.
3 Assignments
0 Petitions
Accused Products
Abstract
A method for shard assignment in a large-scale data processing job is provided. Datasets are divided into a plurality of shards and the shards are indexed and aggregated into one or more groups. A worker process is initially assigned an indexed shard from a group. The initial assignment can assigned based on a simple algorithm. The worker'"'"'s subsequent shard assignment is based on the index of the initially assigned shard.
-
Citations
21 Claims
-
1. A system for shard assignment in a distributed data processing system, the system comprising:
-
one or more processing devices; one or more storage devices storing instructions that, when executed by the one or more processing devices, cause the one or more processing devices to implement; a plurality of worker processes and a master process for coordinating a data processing job that; divides an input dataset into a plurality of shards; indexes the plurality of shards; aggregates the plurality of shards into one or more groups based on the shards'"'"' indices; initially assigns an indexed shard from each group to a worker process; and in response to a worker having processed its initially assigned indexed shard, assigns subsequent shards from the same group as the initially assigned shard to the worker process based on the index of the previously-assigned shard. - View Dependent Claims (2, 3, 4, 5, 6, 19)
-
-
7. A computer-implemented method for shard assignment in a distributed data processing system, comprising:
-
dividing an input dataset into a plurality of shards; indexing the plurality of shards; aggregating the plurality of shards into one or more groups based on the shards'"'"' indices; initially assigning an indexed shard from each group to a worker process; and in response to a worker having processed its initially assigned indexed shard, assigning subsequent shards from the same group as the initially assigned shard to the worker based on the index of the previously-assigned shard. - View Dependent Claims (8, 9, 10, 11, 12, 20)
-
-
13. A non-transitory computer-readable medium having stored therein computer executable code that causes one or more processors to execute the steps of:
-
dividing an input dataset into a plurality of shards; indexing the plurality of shards; aggregating the plurality of shards into one or more groups based on the shards'"'"' indices; initially assigning an indexed shard from each group to a worker process; and in response to a worker having processed its initially assigned indexed shard, assigning subsequent shards from the same group as the initially assigned shard to the to a-worker based on the index of the previously-assigned shard. - View Dependent Claims (14, 15, 16, 17, 18, 21)
-
Specification