
Distributed data set storage and retrieval

  • US 10,185,721 B2
  • Filed: 11/06/2017
  • Issued: 01/22/2019
  • Est. Priority Date: 07/27/2015
  • Status: Active Grant
First Claim

1. An apparatus comprising a processor component and a storage to store instructions that, when executed by the processor component, cause the processor component to perform operations comprising:

  • analyze metadata indicative of an organization of data within a data set or map data indicative of an organization of multiple data blocks of the data set within a data file to determine whether the data set comprises partitioned data;

    determine whether the analyzed metadata indicates whether the data set comprises partitioned data, wherein when the data set comprises partitioned data the data set is organized into multiple partitions and each whole partition is distributable to a single node device to be processed;

    in response to a determination that the data set does not comprise partitioned data, wherein the map data comprises multiple map entries in which each map entry corresponds to one or more data blocks of the multiple data blocks, perform operations comprising:

    for each map entry of the map data, retrieve, from the map entry, a data block size and a data block quantity, wherein the data block quantity indicates a quantity of adjacent data blocks of the multiple data blocks within the data file that correspond to the map entry; and

    for each data block that corresponds to the map entry:

    determine a location of the corresponding data block within the data file;

    select one of a first quantity of node devices that are currently available to perform a processing task; and

    provide a pointer to the selected one of the first quantity of node devices, wherein the pointer comprises:

    an indication of the location of the corresponding data block within the data file; and

    the data block size; and

    in response to a determination that the data set comprises partitioned data, perform operations comprising:

    compare the first quantity of node devices to a second quantity of node devices that were last involved in storage of the data set within the data file;

    assign a numerical designation value to each node device of the first quantity of node devices; and

    in response to a match between the first and second quantities of node devices, for each data block of the multiple data blocks, perform operations comprising:

    retrieve, from the map data, a hashed identifier for one data sub-block indicated by the map data to be within the data block, wherein:

    the data sub-block comprises data of the data set that belongs to a partition of the multiple partitions; and

    the hashed identifier is derived from a partition label of the partition;

    retrieve, from the map data, a data sub-block size for each of the data sub-blocks indicated by the map data to be within the data block;

    determine a location of the data block within the data file;

    divide the hashed identifier by the first quantity of node devices to obtain a modulo value; and

    transmit a pointer to a node device of the first quantity of node devices that is assigned a designation value that matches the modulo value, wherein the pointer comprises:

    an indication of the location of the data block within the data file; and

    a sum of data sub-block sizes of the data sub-blocks within the data block.
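
The two branches of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `Pointer`, `MapEntry`, `PartitionedBlock`, `assign_unpartitioned`, `assign_partitioned`, and the `base_offset` parameter are all hypothetical. The sketch assumes that data blocks are laid out back to back starting at `base_offset`, that node designation values run from 0 to one less than the node count, and that round-robin is an acceptable way to "select one" available node in the non-partitioned case (the claim does not say how the node is selected). The claim also does not recite what happens when the two node quantities differ, so the sketch simply refuses that case.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import List, Tuple


@dataclass
class Pointer:
    offset: int  # location of the data block within the data file
    size: int    # number of bytes the node should read from that offset


@dataclass
class MapEntry:
    """Non-partitioned case: one entry describes a run of adjacent blocks."""
    block_size: int
    block_count: int


@dataclass
class PartitionedBlock:
    """Partitioned case: one data block made up of data sub-blocks."""
    hashed_id: int              # derived from a partition label
    sub_block_sizes: List[int]  # one size per data sub-block in the block


def assign_unpartitioned(entries: List[MapEntry], node_count: int,
                         base_offset: int = 0) -> List[Tuple[int, Pointer]]:
    """Give every block to some available node (round-robin is an assumption)."""
    nodes = cycle(range(node_count))
    offset = base_offset
    assignments = []
    for entry in entries:
        for _ in range(entry.block_count):
            assignments.append((next(nodes), Pointer(offset, entry.block_size)))
            offset += entry.block_size  # adjacent blocks: advance by block size
    return assignments


def assign_partitioned(blocks: List[PartitionedBlock], node_count: int,
                       stored_node_count: int,
                       base_offset: int = 0) -> List[Tuple[int, Pointer]]:
    """Route each block to the node whose designation equals hash mod node count."""
    if node_count != stored_node_count:
        # Claim 1 only recites the matching case; anything else is out of scope here.
        raise NotImplementedError("node counts differ; not covered by this sketch")
    offset = base_offset
    assignments = []
    for block in blocks:
        size = sum(block.sub_block_sizes)     # pointer carries the summed size
        node = block.hashed_id % node_count   # modulo value selects the node
        assignments.append((node, Pointer(offset, size)))
        offset += size
    return assignments
```

For example, with three available nodes and a data set that was last stored by three nodes, a block whose hashed identifier is 7 goes to the node designated 1, because 7 mod 3 is 1, and its pointer carries the sum of that block's sub-block sizes.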
