Data placement control for distributed computing environment
First Claim
1. A method comprising:
dividing a first dataset including a first plurality of elements into partitions by hashing a key for each of the first plurality of elements to generate a hash value for the key of the element, wherein each element of the first plurality of elements is stored in a partition corresponding to the hash value for the key of the element;
selecting a set of distributed storage system nodes as a first primary node group for storage of the partitions of the first dataset;
causing a primary copy of the partitions of the first dataset to be stored on the first primary node group by a distributed storage system file server based on the respective hash values such that the storage system node on which each element of the first plurality of elements is stored is associated with the hash value for the key of the element;
dividing at least one additional dataset into partitions by hashing a key for each element of the at least one additional dataset to generate a hash value for the key of the element, wherein the datasets comprise tables; and
causing a primary copy of the partitions of each additional dataset to be stored on corresponding primary node groups by the distributed storage system file server as a function of hash values such that the storage system node of each partition in the corresponding primary node group is known by hashing of the key, wherein a number of partitions that store each of the tables is a power of two, and wherein at least one partition is striped across multiple nodes of the primary node group.
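The two closing limitations of claim 1 (a power-of-two partition count and a partition striped across several nodes) can be illustrated with a minimal sketch. It is not the patented implementation: the names `stable_hash`, `partition_index`, and `stripe`, the use of SHA-256, and the round-robin chunk layout are all assumptions made for illustration.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Hypothetical hash function: first 8 bytes of SHA-256 as an integer."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def partition_index(key: str, num_partitions: int) -> int:
    """Map a key to a partition; requires a power-of-two partition count."""
    if num_partitions <= 0 or num_partitions & (num_partitions - 1):
        raise ValueError("num_partitions must be a power of two")
    # With a power-of-two count, masking the low bits equals hash % count.
    return stable_hash(key) & (num_partitions - 1)


def stripe(partition_bytes: bytes, nodes: list, chunk_size: int) -> dict:
    """Spread one partition over several nodes in fixed-size chunks.

    Round-robin assignment is an assumed layout, chosen only for simplicity.
    """
    layout = {node: [] for node in nodes}
    for offset in range(0, len(partition_bytes), chunk_size):
        node = nodes[(offset // chunk_size) % len(nodes)]
        layout[node].append(partition_bytes[offset:offset + chunk_size])
    return layout
```

One convenient property of a power-of-two count is that the partition index can be taken with a bit mask instead of a modulo operation.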
Abstract
A method includes dividing a dataset into partitions by hashing a specified key, selecting a set of distributed file system nodes as a primary node group for storage of the partitions, and causing a primary copy of the partitions to be stored on the primary node group by a distributed storage system file server such that the location of each partition is known by hashing of the specified key.
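The core idea of the abstract, that a partition's location can be recovered simply by hashing the key again, can be sketched as follows. This is an illustrative sketch only: `stable_hash`, `partition_of`, `node_of`, `place`, the SHA-256 hash, and the modulo mapping from partitions to nodes are assumptions, not details taken from the patent.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Hypothetical hash function: first 8 bytes of SHA-256 as an integer."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def partition_of(key: str, num_partitions: int) -> int:
    """Partition that holds the element with this key."""
    return stable_hash(key) % num_partitions


def node_of(key: str, node_group: list, num_partitions: int) -> str:
    """Node of the primary node group that stores the key's partition.

    The partition-to-node rule (partition index modulo group size) is assumed.
    """
    return node_group[partition_of(key, num_partitions) % len(node_group)]


def place(rows: list, key_field: str, node_group: list, num_partitions: int) -> dict:
    """Store each row on the node derived from the hash of its key."""
    placement = {node: [] for node in node_group}
    for row in rows:
        node = node_of(row[key_field], node_group, num_partitions)
        placement[node].append(row)
    return placement
```

Because `node_of` depends only on the key, the partition count, and the node group, a reader can find a row without consulting any location catalog: it recomputes the same hash that was used at write time.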
12 Claims
1. A method comprising the steps set out under First Claim above. Dependent claims: 2, 3, 4, 5, 6.
7. A system comprising:
a memory having instructions stored thereon; and
a processor in communication with the memory, wherein the processor executes the instructions to:
divide each of multiple datasets including a plurality of elements into partitions by hashing a key for each of the plurality of elements to generate a hash value for the key of the element, wherein each element of the plurality of elements is stored in a partition corresponding to the hash value for the key of the element;
select sets of distributed storage system nodes as primary node groups for storage of the partitions; and
cause a primary copy of the partitions of each dataset to be stored on corresponding primary node groups by a distributed storage system file server based on the respective hash values such that the storage system node on which each element of the plurality of elements is stored is associated with the hash value for the key of the element, wherein a number of partitions that store each of the multiple datasets is a power of two, and wherein at least one partition is striped across multiple nodes.
Dependent claims: 8, 9, 10, 11, 12.
Specification