Distributed data set storage and retrieval
First Claim
1. An apparatus comprising a processor component and a storage to store instructions that, when executed by the processor component, cause the processor component to perform operations comprising:
analyze metadata indicative of an organization of data within a data set or map data indicative of an organization of multiple data blocks of the data set within a data file to determine whether the data set comprises partitioned data;
determine whether the analyzed metadata indicates whether the data set comprises partitioned data, wherein when the data set comprises partitioned data the data set is organized into multiple partitions and each whole partition is distributable to a single node device to be processed;
in response to a determination that the data set does not comprise partitioned data, wherein the map data comprises multiple map entries in which each map entry corresponds to one or more data blocks of the multiple data blocks, perform operations comprising:
for each map entry of the map data, retrieve, from the map entry, a data block size and a data block quantity, wherein the data block quantity indicates a quantity of adjacent data blocks of the multiple data blocks within the data file that correspond to the map entry; and
for each data block that corresponds to the map entry:
determine a location of the corresponding data block within the data file;
select one of a first quantity of node devices that are currently available to perform a processing task; and
provide a pointer to the selected one of the first quantity of node devices, wherein the pointer comprises:
an indication of the location of the corresponding data block within the data file; and
the data block size; and
in response to a determination that the data set comprises partitioned data, perform operations comprising:
compare the first quantity of node devices to a second quantity of node devices that were last involved in storage of the data set within the data file;
assign a numerical designation value to each node device of the first quantity of node devices; and
in response to a match between the first and second quantities of node devices, for each data block of the multiple data blocks, perform operations comprising:
retrieve, from the map data, a hashed identifier for one data sub-block indicated by the map data to be within the data block, wherein:
the data sub-block comprises data of the data set that belongs to a partition of the multiple partitions; and
the hashed identifier is derived from a partition label of the partition;
retrieve, from the map data, a data sub-block size for each of the data sub-blocks indicated by the map data to be within the data block;
determine a location of the data block within the data file;
divide the hashed identifier by the first quantity of node devices to obtain a modulo value; and
transmit a pointer to a node device of the first quantity of node devices that is assigned a designation value that matches the modulo value, wherein the pointer comprises:
an indication of the location of the data block within the data file; and
a sum of data sub-block sizes of the data sub-blocks within the data block.
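The non-partitioned branch of the claim above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `MapEntry`, `Pointer`, and `assign_blocks` are hypothetical, block locations are assumed to follow from laying the blocks out contiguously in map order starting at `payload_start`, and round-robin selection is assumed because the claim says only that one available node device is selected.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class MapEntry:
    """Hypothetical map entry: one size shared by a run of adjacent blocks."""
    block_size: int      # size of each data block covered by this entry
    block_quantity: int  # number of adjacent blocks of that size


@dataclass(frozen=True)
class Pointer:
    """Hypothetical pointer: where a block starts in the data file and its size."""
    offset: int
    size: int


def assign_blocks(
    map_entries: Sequence[MapEntry],
    available_nodes: Sequence[str],
    payload_start: int = 0,
) -> List[Tuple[str, Pointer]]:
    """Pair every data block with one available node device.

    Assumption: blocks are stored back to back in map order, so each block's
    location is the running sum of the sizes of the blocks before it.
    Assumption: nodes are chosen round-robin.
    """
    if not available_nodes:
        raise ValueError("at least one node device must be available")

    node_cycle = cycle(available_nodes)
    offset = payload_start
    assignments: List[Tuple[str, Pointer]] = []
    for entry in map_entries:
        # One map entry may describe several adjacent blocks of equal size.
        for _ in range(entry.block_quantity):
            assignments.append((next(node_cycle), Pointer(offset, entry.block_size)))
            offset += entry.block_size
    return assignments
```

Because each pointer carries only an offset and a size, a node device can read its block directly from the data file without consulting the map data itself.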
Abstract
An apparatus includes a processor component caused to: retrieve metadata of organization of data within a data set, and map data of organization of data blocks within a data file; receive indications of which node devices are available to perform a processing task with a data set portion; and in response to the data set including partitioned data, compare the quantities of available node devices and of the node devices last involved in storing the data set. In response to a match, for each map data map entry: retrieve a hashed identifier for a data sub-block, and a size for each of the data sub-blocks within the corresponding data block; divide the hashed identifier by the quantity of available node devices; compare the modulo value to a designation assigned to each of the available node devices; and provide a pointer to the available node device assigned the matching designation.
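The partitioned path summarized in the abstract can be sketched as below. It is an illustration under stated assumptions rather than the patented implementation: `BlockMapEntry`, `Pointer`, and `route_partitioned_blocks` are hypothetical names, designation values are assumed to be 0 through N-1 in the order the available nodes are listed, block locations are assumed to be running sums of block sizes, and the sketch returns `None` when the node counts differ because the abstract describes only the matching case.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class BlockMapEntry:
    """Hypothetical map entry for one data block of a partitioned data set."""
    hashed_identifier: int            # hash of one sub-block's partition label
    sub_block_sizes: Tuple[int, ...]  # sizes of all sub-blocks in the block


@dataclass(frozen=True)
class Pointer:
    """Hypothetical pointer: block location in the data file and total size."""
    offset: int
    size: int


def route_partitioned_blocks(
    map_entries: Sequence[BlockMapEntry],
    available_nodes: Sequence[str],
    stored_node_count: int,
    payload_start: int = 0,
) -> Optional[Dict[str, List[Pointer]]]:
    """Route each data block to the node whose designation equals hash mod N."""
    node_count = len(available_nodes)
    if node_count == 0:
        raise ValueError("at least one node device must be available")
    if node_count != stored_node_count:
        # The mismatch case is outside what this sketch covers.
        return None

    # Assumption: designation values are 0..N-1 in listing order.
    designations = dict(enumerate(available_nodes))
    routed: Dict[str, List[Pointer]] = {node: [] for node in available_nodes}

    offset = payload_start
    for entry in map_entries:
        block_size = sum(entry.sub_block_sizes)  # pointer carries the summed size
        modulo = entry.hashed_identifier % node_count
        routed[designations[modulo]].append(Pointer(offset, block_size))
        offset += block_size
    return routed
```

Under these assumptions, every block whose hashed identifier leaves the same remainder lands on the same node device, which is what allows a whole partition to be processed in one place.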
30 Claims
1. An apparatus comprising a processor component and a storage to store instructions that, when executed by the processor component, cause the processor component to perform operations comprising:
analyze metadata indicative of an organization of data within a data set or map data indicative of an organization of multiple data blocks of the data set within a data file to determine whether the data set comprises partitioned data;
determine whether the analyzed metadata indicates whether the data set comprises partitioned data, wherein when the data set comprises partitioned data the data set is organized into multiple partitions and each whole partition is distributable to a single node device to be processed;
in response to a determination that the data set does not comprise partitioned data, wherein the map data comprises multiple map entries in which each map entry corresponds to one or more data blocks of the multiple data blocks, perform operations comprising:
for each map entry of the map data, retrieve, from the map entry, a data block size and a data block quantity, wherein the data block quantity indicates a quantity of adjacent data blocks of the multiple data blocks within the data file that correspond to the map entry; and
for each data block that corresponds to the map entry:
determine a location of the corresponding data block within the data file;
select one of a first quantity of node devices that are currently available to perform a processing task; and
provide a pointer to the selected one of the first quantity of node devices, wherein the pointer comprises:
an indication of the location of the corresponding data block within the data file; and
the data block size; and
in response to a determination that the data set comprises partitioned data, perform operations comprising:
compare the first quantity of node devices to a second quantity of node devices that were last involved in storage of the data set within the data file;
assign a numerical designation value to each node device of the first quantity of node devices; and
in response to a match between the first and second quantities of node devices, for each data block of the multiple data blocks, perform operations comprising:
retrieve, from the map data, a hashed identifier for one data sub-block indicated by the map data to be within the data block, wherein:
the data sub-block comprises data of the data set that belongs to a partition of the multiple partitions; and
the hashed identifier is derived from a partition label of the partition;
retrieve, from the map data, a data sub-block size for each of the data sub-blocks indicated by the map data to be within the data block;
determine a location of the data block within the data file;
divide the hashed identifier by the first quantity of node devices to obtain a modulo value; and
transmit a pointer to a node device of the first quantity of node devices that is assigned a designation value that matches the modulo value, wherein the pointer comprises:
an indication of the location of the data block within the data file; and
a sum of data sub-block sizes of the data sub-blocks within the data block.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, the computer-program product including instructions operable to cause a processor component to perform operations comprising:
analyze metadata indicative of an organization of data within a data set or map data indicative of an organization of multiple data blocks of the data set within a data file to determine whether the data set comprises partitioned data;
determine whether the analyzed metadata indicates whether the data set comprises partitioned data, wherein when the data set comprises partitioned data the data set is organized into multiple partitions and each whole partition is distributable to a single node device to be processed;
in response to a determination that the data set does not comprise partitioned data, wherein the map data comprises multiple map entries in which each map entry corresponds to one or more data blocks of the multiple data blocks, perform operations comprising:
for each map entry of the map data, retrieve, from the map entry, a data block size and a data block quantity, wherein the data block quantity indicates a quantity of adjacent data blocks of the multiple data blocks within the data file that correspond to the map entry; and
for each data block that corresponds to the map entry:
determine a location of the corresponding data block within the data file;
select one of a first quantity of node devices that are currently available to perform a processing task; and
provide a pointer to the selected one of the first quantity of node devices, wherein the pointer comprises:
an indication of the location of the corresponding data block within the data file; and
the data block size; and
in response to a determination that the data set comprises partitioned data, perform operations comprising:
compare the first quantity of node devices to a second quantity of node devices that were last involved in storage of the data set within the data file;
assign a numerical designation value to each node device of the first quantity of node devices; and
in response to a match between the first and second quantities of node devices, for each data block of the multiple data blocks, perform operations comprising:
retrieve, from the map data, a hashed identifier for one data sub-block indicated by the map data to be within the data block, wherein:
the data sub-block comprises data of the data set that belongs to a partition of the multiple partitions; and
the hashed identifier is derived from a partition label of the partition;
retrieve, from the map data, a data sub-block size for each of the data sub-blocks indicated by the map data to be within the data block;
determine a location of the data block within the data file;
divide the hashed identifier by the first quantity of node devices to obtain a modulo value; and
transmit a pointer to a node device of the first quantity of node devices that is assigned a designation value that matches the modulo value, wherein the pointer comprises:
an indication of the location of the data block within the data file; and
a sum of data sub-block sizes of the data sub-blocks within the data block.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
21. A computer-implemented method comprising:
analyzing, by a processor component, metadata indicative of an organization of data within a data set or map data indicative of an organization of multiple data blocks of the data set within a data file to determine whether the data set comprises partitioned data;
determining whether the analyzed metadata indicates whether the data set comprises partitioned data, wherein when the data set comprises partitioned data the data set is organized into multiple partitions and each whole partition is distributable to a single node device to be processed;
in response to a determination that the data set does not comprise partitioned data, wherein the map data comprises multiple map entries in which each map entry corresponds to one or more data blocks of the multiple data blocks, performing operations comprising:
for each map entry of the map data, retrieving, from the map entry, a data block size and a data block quantity, wherein the data block quantity indicates a quantity of adjacent data blocks of the multiple data blocks within the data file that correspond to the map entry; and
for each data block that corresponds to the map entry:
determining a location of the corresponding data block within the data file;
selecting one of a first quantity of node devices that are currently available to perform a processing task; and
providing a pointer to the selected one of the first quantity of node devices, wherein the pointer comprises:
an indication of the location of the corresponding data block within the data file; and
the data block size; or
in response to a determination that the data set comprises partitioned data, performing operations comprising:
comparing, by the processor component, the first quantity of node devices to a second quantity of node devices that were last involved in storage of the data set within the data file;
assigning, by the processor component, a numerical designation value to each node device of the first quantity of node devices; and
in response to a match between the first and second quantities of node devices, for each data block of the multiple data blocks, performing operations comprising:
retrieving, from the map data, a hashed identifier for one data sub-block indicated by the map data to be within the data block, wherein:
the data sub-block comprises data of the data set that belongs to a partition of the multiple partitions; and
the hashed identifier is derived from a partition label of the partition;
retrieving, from the map data, a data sub-block size for each of the data sub-blocks indicated by the map data to be within the data block;
determining a location of the data block within the data file;
dividing the hashed identifier by the first quantity of node devices to obtain a modulo value; and
transmitting a pointer to a node device of the first quantity of node devices that is assigned a designation value that matches the modulo value, wherein the pointer comprises:
an indication of the location of the data block within the data file; and
a sum of data sub-block sizes of the data sub-blocks within the data block.
Dependent claims: 22, 23, 24, 25, 26, 27, 28, 29, 30
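The claims state that the hashed identifier is derived from a partition label but do not name a hash function. As one possible illustration, the sketch below assumes SHA-256 truncated to 64 bits; the function names `hashed_identifier` and `node_for_partition` are hypothetical. It shows why the modulo step keeps a partition together: the same label always yields the same identifier, and therefore the same designation value for a fixed quantity of node devices.

```python
import hashlib


def hashed_identifier(partition_label: str) -> int:
    """Derive an integer identifier from a partition label.

    Assumption: SHA-256 truncated to its first 8 bytes; the claims do not
    specify which hash function is used.
    """
    digest = hashlib.sha256(partition_label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def node_for_partition(partition_label: str, node_count: int) -> int:
    """Return the designation value (0..node_count-1) for a partition label."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return hashed_identifier(partition_label) % node_count
```

A stable hash matters here: Python's built-in `hash()` for strings is randomized per process by default, so it would not give the same designation across separate runs.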
Specification