Distributed data set storage and retrieval
First Claim
1. An apparatus comprising a processor component and a storage to store instructions that, when executed by the processor component, cause the processor component to perform operations comprising:
retrieve, from one or more storage devices through a network, metadata indicative of organization of data within a data set, and map data indicative of organization of multiple data blocks within a data file maintained by the one or more storage devices, wherein:
the map data comprises multiple map entries; and
each map entry of the multiple map entries corresponds to one or more data blocks of the multiple data blocks;
receive, from multiple node devices, indications of which node devices among the multiple node devices are available node devices that are each able to perform a processing task with at least one data set portion of the one or more data set portions; and
in response to an indication within the metadata or the map data that the data set comprises partitioned data wherein the data within the data set is organized into multiple partitions that are each distributable to a single node device, and each map entry corresponds to a single data block:
determine a first quantity of the available node devices based on the indications of which node devices are available node devices;
retrieve a second quantity of node devices last involved in storage of the data set within the data file from the metadata or the map data;
compare the first and second quantities of node devices to detect a match between the first and second quantities;
assign each of the available node devices one of a series of positive integer values as a designation value, wherein the series extends from an integer value of 0 to a positive integer value equal to the first quantity minus the integer value of 1; and
in response to detection of a match between the first and second quantities, for each map entry of the map data:
retrieve, from the map entry, a hashed identifier for one data sub-block indicated in the map entry as within the corresponding data block, and a data sub-block size for each of the data sub-blocks indicated in the map entry as within the corresponding data block, wherein:
the hashed identifier is derived from a partition label of a partition of the multiple partitions; and
the data sub-block comprises a data set portion of the one or more data set portions;
determine a location of the corresponding data block within the data file;
divide the hashed identifier by the first quantity to obtain a modulo value;
compare the modulo value to the designation value assigned to each of the available node devices to identify an available node device assigned a designation value that matches the modulo value; and
provide a pointer to the available node device assigned the designation value that matches the modulo value, the pointer comprising:
an indication of the location of the corresponding data block; and
a sum of the data sub-block sizes of all of the data sub-blocks within the corresponding data block.
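The per-map-entry steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `MapEntry`, `Pointer`, `assign_pointers` and `payload_start` are hypothetical, and the way a block's location is found (blocks stored contiguously in map-entry order, starting at `payload_start`) is an assumption made only for illustration; the claim says only that a location is determined.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MapEntry:
    """Hypothetical map entry describing exactly one data block."""
    hashed_identifier: int      # derived from the partition label
    sub_block_sizes: List[int]  # size in bytes of each sub-block in the block


@dataclass
class Pointer:
    offset: int  # location of the data block within the data file
    size: int    # sum of the sub-block sizes within that block


def assign_pointers(entries: List[MapEntry],
                    available_nodes: List[str],
                    payload_start: int = 0) -> Dict[str, List[Pointer]]:
    """Route each data block to the node whose designation equals hash mod N."""
    first_quantity = len(available_nodes)
    if first_quantity == 0:
        raise ValueError("no available node devices")

    # Designation values 0 .. first_quantity - 1, one per available node.
    designation = dict(enumerate(available_nodes))
    pointers: Dict[str, List[Pointer]] = {node: [] for node in available_nodes}

    offset = payload_start
    for entry in entries:
        block_size = sum(entry.sub_block_sizes)
        modulo_value = entry.hashed_identifier % first_quantity
        pointers[designation[modulo_value]].append(Pointer(offset, block_size))
        # Assumption: blocks sit back to back in the file, in map order.
        offset += block_size
    return pointers


entries = [MapEntry(17, [100, 50]), MapEntry(4, [30]), MapEntry(8, [20, 20, 20])]
result = assign_pointers(entries, ["node-a", "node-b", "node-c"])
```

With three available nodes, the identifiers 17 and 8 both give a modulo value of 2, so both blocks go to the third node, while 4 gives 1 and goes to the second. Because the hashed identifier depends only on the partition label, every block of a given partition lands on the same node.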
Abstract
An apparatus includes a processor component caused to: retrieve metadata of organization of data within a data set, and map data of organization of data blocks within a data file; receive indications of which node devices are available to perform a processing task with a data set portion; and in response to the data set including partitioned data, compare the quantities of available node devices and of the node devices last involved in storing the data set. In response to a match, for each map entry of the map data: retrieve a hashed identifier for a data sub-block, and a size for each of the data sub-blocks within the corresponding data block; divide the hashed identifier by the quantity of available node devices to obtain a modulo value; compare the modulo value to a designation assigned to each of the available node devices; and provide a pointer to the available node device assigned the matching designation.
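The gating condition in the abstract (partitioned data, plus a match between the node counts) can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function name `check_node_match` and the metadata keys `partitioned` and `stored_node_count` are hypothetical, and what happens when the counts differ is not stated in this section, so the sketch only reports the outcome.

```python
from typing import Dict, List, Tuple


def check_node_match(metadata: Dict[str, object],
                     available_nodes: List[str]) -> Tuple[bool, Dict[str, int]]:
    """Return (match, designations) for the available node devices.

    match is True only when the data set is flagged as partitioned and the
    number of available nodes equals the number of nodes last involved in
    storing the data set.
    """
    # Designation values run from 0 to (number of available nodes - 1).
    designations = {node: value for value, node in enumerate(available_nodes)}

    partitioned = bool(metadata.get("partitioned", False))  # hypothetical key
    first_quantity = len(available_nodes)
    second_quantity = metadata.get("stored_node_count")     # hypothetical key
    match = partitioned and first_quantity == second_quantity
    return match, designations


same = check_node_match({"partitioned": True, "stored_node_count": 3},
                        ["a", "b", "c"])
fewer = check_node_match({"partitioned": True, "stored_node_count": 3},
                         ["a", "b"])
```

Here `same` reports a match because three nodes are available and three stored the data set, while `fewer` does not.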
30 Claims
1. An apparatus comprising a processor component and a storage to store instructions that, when executed by the processor component, cause the processor component to perform operations comprising:
(claim 1 is recited in full under First Claim above)
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, the computer-program product including instructions operable to cause a processor component to perform operations comprising:
retrieve, from one or more storage devices through a network, metadata indicative of organization of data within a data set, and map data indicative of organization of multiple data blocks within a data file maintained by the one or more storage devices, wherein:
the map data comprises multiple map entries; and
each map entry of the multiple map entries corresponds to one or more data blocks of the multiple data blocks;
receive, from multiple node devices, indications of which node devices among the multiple node devices are available node devices that are each able to perform a processing task with at least one data set portion of the one or more data set portions; and
in response to an indication within the metadata or the map data that the data set comprises partitioned data wherein the data within the data set is organized into multiple partitions that are each distributable to a single node device, and each map entry corresponds to a single data block:
determine a first quantity of the available node devices based on the indications of which node devices are available node devices;
retrieve a second quantity of node devices last involved in storage of the data set within the data file from the metadata or the map data;
compare the first and second quantities of node devices to detect a match between the first and second quantities;
assign each of the available node devices one of a series of positive integer values as a designation value, wherein the series extends from an integer value of 0 to a positive integer value equal to the first quantity minus the integer value of 1; and
in response to detection of a match between the first and second quantities, for each map entry of the map data:
retrieve, from the map entry, a hashed identifier for one data sub-block indicated in the map entry as within the corresponding data block, and a data sub-block size for each of the data sub-blocks indicated in the map entry as within the corresponding data block, wherein:
the hashed identifier is derived from a partition label of a partition of the multiple partitions; and
the data sub-block comprises a data set portion of the one or more data set portions;
determine a location of the corresponding data block within the data file;
divide the hashed identifier by the first quantity to obtain a modulo value;
compare the modulo value to the designation value assigned to each of the available node devices to identify an available node device assigned a designation value that matches the modulo value; and
provide a pointer to the available node device assigned the designation value that matches the modulo value, the pointer comprising:
an indication of the location of the corresponding data block; and
a sum of the data sub-block sizes of all of the data sub-blocks within the corresponding data block.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
21. A computer-implemented method comprising:
retrieving, from one or more storage devices through a network, metadata indicative of organization of data within a data set, and map data indicative of organization of multiple data blocks within a data file maintained by the one or more storage devices, wherein:
the map data comprises multiple map entries; and
each map entry of the multiple map entries corresponds to one or more data blocks of the multiple data blocks;
receiving, from multiple node devices, indications of which node devices among the multiple node devices are available node devices that are each able to perform a processing task with at least one data set portion of the one or more data set portions; and
in response to an indication within the metadata or the map data that the data set comprises partitioned data wherein the data within the data set is organized into multiple partitions that are each distributable to a single node device, and each map entry corresponds to a single data block:
determining a first quantity of the available node devices based on the indications of which node devices are available node devices;
retrieving a second quantity of node devices last involved in storage of the data set within the data file from the metadata or the map data;
comparing the first and second quantities of node devices to detect a match between the first and second quantities; and
assigning each of the available node devices one of a series of positive integer values as a designation value, wherein the series extends from an integer value of 0 to a positive integer value equal to the first quantity minus the integer value of 1; and
in response to detection of a match between the first and second quantities, for each map entry of the map data:
retrieving, from the map entry, a hashed identifier for one data sub-block indicated in the map entry as within the corresponding data block, and a data sub-block size for each of the data sub-blocks indicated in the map entry as within the corresponding data block, wherein:
the hashed identifier is derived from a partition label of a partition of the multiple partitions; and
the data sub-block comprises a data set portion of the one or more data set portions;
determining a location of the corresponding data block within the data file;
dividing the hashed identifier by the first quantity to obtain a modulo value;
comparing the modulo value to the designation value assigned to each of the available node devices to identify an available node device assigned a designation value that matches the modulo value; and
providing a pointer to the available node device assigned the designation value that matches the modulo value, the pointer comprising:
an indication of the location of the corresponding data block; and
a sum of the data sub-block sizes of all of the data sub-blocks within the corresponding data block.
Dependent claims: 22, 23, 24, 25, 26, 27, 28, 29, 30
Specification