Distributed data set storage and retrieval

US 10,185,721 B2
Filed: 11/06/2017
Issued: 01/22/2019
Est. Priority Date: 07/27/2015
Status: Active Grant

Abstract

An apparatus includes a processor component caused to: retrieve metadata of organization of data within a data set, and map data of organization of data blocks within a data file; receive indications of which node devices are available to perform a processing task with a data set portion; and in response to the data set including partitioned data, compare the quantities of available node devices and of the node devices last involved in storing the data set. In response to a match, for each map data map entry: retrieve a hashed identifier for a data sub-block, and a size for each of the data sub-blocks within the corresponding data block; divide the hashed identifier by the quantity of available node devices; compare the modulo value to a designation assigned to each of the available node devices; and provide a pointer to the available node device assigned the matching designation.
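
The matching-quantity case described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `SubBlock`, `Pointer`, and `assign_blocks` are hypothetical, designation values are assumed to be the integers 0 through N-1, and data blocks are assumed to be stored contiguously in map order starting at `payload_start`.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SubBlock:
    hashed_id: int  # hash derived from the partition label (assumed to be an integer)
    size: int       # size of the data sub-block in bytes


@dataclass(frozen=True)
class Pointer:
    offset: int  # location of the data block within the data file
    size: int    # sum of the sub-block sizes within the data block


def assign_blocks(blocks: List[List[SubBlock]], node_count: int,
                  payload_start: int = 0) -> Dict[int, List[Pointer]]:
    """Assign each whole data block to the node whose designation equals
    hashed_id modulo node_count (assumed layout: blocks are contiguous)."""
    if node_count <= 0:
        raise ValueError("at least one available node device is required")
    assignments: Dict[int, List[Pointer]] = {n: [] for n in range(node_count)}
    offset = payload_start
    for sub_blocks in blocks:
        block_size = sum(sb.size for sb in sub_blocks)
        # One hashed identifier per block is enough when the node quantities match.
        node = sub_blocks[0].hashed_id % node_count
        assignments[node].append(Pointer(offset, block_size))
        offset += block_size
    return assignments


blocks = [[SubBlock(7, 100), SubBlock(10, 50)], [SubBlock(5, 30)]]
result = assign_blocks(blocks, node_count=3)
```

With three available nodes, the first block (hashed identifier 7) goes to designation 1 and the second (hashed identifier 5) goes to designation 2.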

30 Claims

1. An apparatus comprising a processor component and a storage to store instructions that, when executed by the processor component, cause the processor component to perform operations comprising:
- analyze metadata indicative of an organization of data within a data set or map data indicative of an organization of multiple data blocks of the data set within a data file to determine whether the data set comprises partitioned data;
  
  determine whether the analyzed metadata indicates whether the data set comprises partitioned data, wherein when the data set comprises partitioned data the data set is organized into multiple partitions and each whole partition is distributable to a single node device to be processed;
  
  in response to a determination that data set does not comprise partitioned data, wherein the map data comprises multiple map entries in which each map entry corresponds to one or more data blocks of the multiple data blocks, perform operations comprising;
  
  for each map entry of the map data, retrieve, from the map entry, a data block size and a data block quantity, wherein the data block quantity indicates a quantity of adjacent data blocks of the multiple data blocks within the data file that correspond to the map entry; and
  
  for each data block that corresponds to the map entry;
  
  determine a location of the corresponding data block within the data file;
  
  select one of a first quantity of node devices that are currently available to perform a processing task; and
  
  provide a pointer to the selected one of the first quantity of node devices, wherein the pointer comprises;
  
  an indication of the location of the corresponding data block within the data file; and
  
  the data block size; and
  
  in response to a determination that the data set comprises partitioned data, perform operations comprising;
  
  compare the first quantity of node devices to a second quantity of node devices that were last involved in storage of the data set within the data file;
  
  assign a numerical designation value to each node device of the first quantity of node devices; and
  
  in response to a match between the first and second quantities of node devices, for each data block of the multiple data blocks, perform operations comprising;
  
  retrieve, from the map data, a hashed identifier for one data sub-block indicated by the map data to be within the data block, wherein;
  
  the data sub-block comprises data of the data set that belongs to a partition of the multiple partitions; and
  
  the hashed identifier is derived from a partition label of the partition;
  
  retrieve, from the map data, a data sub-block size for each of the data sub-blocks indicated by the map data to be within the data block;
  
  determine a location of the data block within the data file;
  
  divide the hashed identifier by the first quantity of node devices to obtain a modulo value; and
  
  transmit a pointer to a node device of the first quantity of node devices that is assigned a designation value that matches the modulo value, wherein the pointer comprises;
  
  an indication of the location of the data block within the data file; and
  
  a sum of data sub-block sizes of the data sub-blocks within the data block.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The apparatus of claim 1, wherein the selection of one of the first quantity node devices comprises a round robin selection of one of the first quantity of node devices.
  - 3. The apparatus of claim 1, wherein in response to the determination that the data set comprises partitioned data and in response to a lack of a match between the first and second quantities of node devices, the processor component is caused to perform operations comprising:
    - for each data sub-block within each data block of the multiple data blocks, perform operations comprising;
      
      retrieve, from the map data, the data sub-block size and hashed identifier of the data sub-block;
      
      determine a location of the data sub-block within the data file;
      
      divide the hashed identifier by the first quantity of node devices to obtain a modulo value; and
      
      provide a pointer to the node of the first quantity of node devices that is assigned a designation value that matches the modulo value, wherein the pointer comprises;
      
      an indication of the location of the data sub-block within the data file; and
      
      the data sub-block size of the data sub-block.
  - 4. The apparatus of claim 1, wherein:
    - the apparatus comprises a node device of the first quantity of node devices; and
      
      the processor component is caused to retrieve, via a network, a portion of the data set from the data file as one of the first quantity of node devices at least partially in parallel with at least one other node device of the first quantity of node devices.
  - 5. The apparatus of claim 4, wherein the processor component is caused to perform the processing task with a portion of the data set retrieved from the data file as the one of the first quantity of node devices at least partially in parallel with at least one other node device of the first quantity of node devices.
  - 6. The apparatus of claim 4, wherein the processor component is caused to perform operations comprising:
    - in response to the determination that the data set does not comprise partitioned data, the processor component is caused to retrieve a whole data block of the multiple data blocks as the portion of the data set retrieved;
      
      in response to the determination that the data set does comprise partitioned data and in response to a match between the first and second quantities of node devices, the processor component is caused to retrieve a whole data block of the multiple data blocks as the portion of the data set retrieved; and
      
      in response to the determination that the data set does comprise partitioned data and in response to a lack of a match between the first and second quantities of node devices, the processor component is caused to retrieve a data sub-block of a data block of the multiple data blocks as the portion of the data set retrieved.
  - 7. The apparatus of claim 1, wherein:
    - the data file is maintained by one or more storage devices; and
      
      the processor component is caused to retrieve, from the one or more storage devices through a network, the map data and the metadata.
  - 8. The apparatus of claim 7, wherein, to retrieve the map data from the data file, the processor component is caused to perform operations comprising:
    - transmit, via the network, a first command to the one or more storage devices to provide a map base of the data map from the data file;
      
      receive, via the network, a map base from the data file;
      
      analyze the map base to determine whether at least a portion of the map data is stored within one or more map extensions within the data file; and
      
      in response to a determination that at least a portion of the map data is stored within one or more map extensions, perform operations comprising;
      
      transmit, via the network, a second command to the one or more storage devices to provide at least one map extension of the one or more map extensions; and
      
      receive, via the network, the at least one map extension from the data file.
  - 9. The apparatus of claim 1, comprising a control device coupled to a grid of multiple node devices, wherein the processor component is caused to perform operations comprising:
    - recurringly receive, via a network, indications of availability to perform processing tasks from each node device of the multiple node devices;
      
      recurringly update a stored indication of availability of each node device of the multiple node devices; and
      
      analyze the stored indications of availability to identify the first quantity of node devices.
  - 10. The apparatus of claim 9, wherein the processor component is caused to perform operations comprising provide an indication of the processing task to the first quantity of node devices to enable more than one node device of the first quantity of node devices to each perform the processing task with a portion of the data set at least partially in parallel.
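
The non-partitioned branch of claim 1, combined with the round robin selection of claim 2, might look like the sketch below. `MapEntry` and `distribute_round_robin` are hypothetical names, and the sketch assumes that the data blocks of successive map entries sit back to back in the data file starting at `payload_start`, which the claims do not state.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class MapEntry:
    block_size: int      # size shared by every data block covered by this entry
    block_quantity: int  # number of adjacent data blocks covered by this entry


def distribute_round_robin(entries: Sequence[MapEntry], nodes: Sequence[str],
                           payload_start: int = 0) -> List[Tuple[str, int, int]]:
    """Return (node, offset, size) pointers, one per data block,
    cycling through the available nodes in order."""
    if not nodes:
        raise ValueError("no node devices are currently available")
    picker = cycle(nodes)
    offset = payload_start
    pointers: List[Tuple[str, int, int]] = []
    for entry in entries:
        for _ in range(entry.block_quantity):
            pointers.append((next(picker), offset, entry.block_size))
            offset += entry.block_size  # assumed contiguous layout
    return pointers


pointers = distribute_round_robin(
    [MapEntry(100, 2), MapEntry(40, 1)], ["node-a", "node-b"])
```

Storing a size and a quantity per map entry lets a run of equally sized adjacent blocks share one entry, so each block's location follows from a running sum rather than a stored offset.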

11. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, the computer-program product including instructions operable to cause a processor component to perform operations comprising:
- analyze metadata indicative of an organization of data within a data set or map data indicative of an organization of multiple data blocks of the data set within a data file to determine whether the data set comprises partitioned data;
  
  determine whether the analyzed metadata indicates whether the data set comprises partitioned data, wherein when the data set comprises partitioned data the data set is organized into multiple partitions and each whole partition is distributable to a single node device to be processed;
  
  in response to a determination that data set does not comprise partitioned data, wherein the map data comprises multiple map entries in which each map entry corresponds to one or more data blocks of the multiple data blocks, perform operations comprising;
  
  for each map entry of the map data, retrieve, from the map entry, a data block size and a data block quantity, wherein the data block quantity indicates a quantity of adjacent data blocks of the multiple data blocks within the data file that correspond to the map entry; and
  
  for each data block that corresponds to the map entry;
  
  determine a location of the corresponding data block within the data file;
  
  select one of a first quantity of node devices that are currently available to perform a processing task; and
  
  provide a pointer to the selected one of the first quantity of node devices, wherein the pointer comprises;
  
  an indication of the location of the corresponding data block within the data file; and
  
  the data block size; and
  
  in response to a determination that the data set comprises partitioned data, perform operations comprising;
  
  compare the first quantity of node devices to a second quantity of node devices that were last involved in storage of the data set within the data file;
  
  assign a numerical designation value to each node device of the first quantity of node devices; and
  
  in response to a match between the first and second quantities of node devices, for each data block of the multiple data blocks, perform operations comprising;
  
  retrieve, from the map data, a hashed identifier for one data sub-block indicated by the map data to be within the data block, wherein;
  
  the data sub-block comprises data of the data set that belongs to a partition of the multiple partitions; and
  
  the hashed identifier is derived from a partition label of the partition;
  
  retrieve, from the map data, a data sub-block size for each of the data sub-blocks indicated by the map data to be within the data block;
  
  determine a location of the data block within the data file;
  
  divide the hashed identifier by the first quantity of node devices to obtain a modulo value; and
  
  transmit a pointer to a node device of the first quantity of node devices that is assigned a designation value that matches the modulo value, wherein the pointer comprises;
  
  an indication of the location of the data block within the data file; and
  
  a sum of data sub-block sizes of the data sub-blocks within the data block.
- Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
  - 12. The computer-program product of claim 11, wherein the selection of one of the first quantity node devices comprises a round robin selection of one of the first quantity of node devices.
  - 13. The computer-program product of claim 11, wherein in response to the determination that the data set comprises partitioned data and in response to a lack of a match between the first and second quantities of node devices, the processor component is caused to perform operations comprising:
    - for each data sub-block within each data block of the multiple data blocks, perform operations comprising;
      
      retrieve, from the map data, the data sub-block size and hashed identifier of the data sub-block;
      
      determine a location of the data sub-block within the data file;
      
      divide the hashed identifier by the first quantity of node devices to obtain a modulo value; and
      
      provide a pointer to the node of the first quantity of node devices that is assigned a designation value that matches the modulo value, wherein the pointer comprises;
      
      an indication of the location of the data sub-block within the data file; and
      
      the data sub-block size of the data sub-block.
  - 14. The computer-program product of claim 11, wherein:
    - the processor component is incorporated into a node device of the first quantity of node devices; and
      
      the processor component is caused to retrieve, via a network, a portion of the data set from the data file as one of the first quantity of node devices at least partially in parallel with at least one other node device of the first quantity of node devices.
  - 15. The computer-program product of claim 14, wherein the processor component is caused to perform the processing task with a portion of the data set retrieved from the data file as the one of the first quantity of node devices at least partially in parallel with at least one other node device of the first quantity of node devices.
  - 16. The computer-program product of claim 11, wherein:
    - the data file is maintained by one or more storage devices; and
      
      the processor component is caused to retrieve, from the one or more storage devices through a network, the map data and the metadata.
  - 17. The computer-program product of claim 16, wherein, to retrieve the map data from the data file, the processor component is caused to perform operations comprising:
    - transmit, via the network, a first command to the one or more storage devices to provide a map base of the data map from the data file;
      
      receive, via the network, a map base from the data file;
      
      analyze the map base to determine whether at least a portion of the map data is stored within one or more map extensions within the data file; and
      
      in response to a determination that at least a portion of the map data is stored within one or more map extensions, perform operations comprising;
      
      transmit, via the network, a second command to the one or more storage devices to provide at least one map extension of the one or more map extensions; and
      
      receive, via the network, the at least one map extension from the data file.
  - 18. The computer-program product of claim 17, wherein the processor component is caused to perform operations comprising:
    - in response to the determination that at least a portion of the map data is stored within one or more map extensions, retrieve an extension pointer to each map extension of the at least one map extension from the map base, and retrieve at least one map entry of the map data from each map extension of the one or more map extensions; and
      
      in response to a determination that no portion of the map data is stored within one or more map extensions, retrieve each map entry of the map data from the map base.
  - 19. The computer-program product of claim 11, wherein:
    - the processor component is incorporated into a control device coupled to a grid of multiple node devices; and
      
      the processor component is caused to perform operations comprising;
      
      recurringly receive, via a network, indications of availability to perform processing tasks from each node device of the multiple node devices;
      
      recurringly update a stored indication of availability of each node device of the multiple node devices; and
      
      analyze the stored indications of availability to identify the first quantity of node devices.
  - 20. The computer-program product of claim 19, wherein the processor component is caused to perform operations comprising provide an indication of the processing task to the first quantity of node devices to enable more than one node device of the first quantity of node devices to each perform the processing task with a portion of the data set at least partially in parallel.
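
The two-step map retrieval of claims 17 and 18 can be illustrated with a small sketch. The dictionary layout of the map base (`"extension_pointers"` and `"entries"` keys), the `load_map_entries` function, and the `fetch_extension` callback standing in for the second command to the storage devices are all assumptions made for illustration; the claims do not specify a format.

```python
from typing import Callable, Dict, List, Tuple

Entry = Tuple[int, int]  # (data block size, data block quantity), hypothetical shape


def load_map_entries(map_base: dict,
                     fetch_extension: Callable[[int], List[Entry]]) -> List[Entry]:
    """Collect map entries from the map base alone, or from each
    map extension that the map base points to."""
    extension_pointers = map_base.get("extension_pointers", [])
    if not extension_pointers:
        # No map extensions: every entry is held in the map base itself.
        return list(map_base.get("entries", []))
    entries: List[Entry] = []
    for pointer in extension_pointers:
        # Stands in for the second command sent over the network.
        entries.extend(fetch_extension(pointer))
    return entries


# In-memory stand-in for map extensions stored within the data file.
extensions: Dict[int, List[Entry]] = {1024: [(100, 2)], 2048: [(40, 1)]}

from_extensions = load_map_entries(
    {"extension_pointers": [1024, 2048]}, extensions.__getitem__)
from_base = load_map_entries({"entries": [(8, 3)]}, extensions.__getitem__)
```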

21. A computer-implemented method comprising:
- analyzing, by a processor component, metadata indicative of an organization of data within a data set or map data indicative of an organization of multiple data blocks of the data set within a data file to determine whether the data set comprises partitioned data;
  
  determining whether the analyzed metadata indicates whether the data set comprises partitioned data, wherein when the data set comprises partitioned data the data set is organized into multiple partitions and each whole partition is distributable to a single node device to be processed;
  
  in response to a determination that data set does not comprise partitioned data, wherein the map data comprises multiple map entries in which each map entry corresponds to one or more data blocks of the multiple data blocks, performing operations comprising;
  
  for each map entry of the map data, retrieving, from the map entry, a data block size and a data block quantity, wherein the data block quantity indicates a quantity of adjacent data blocks of the multiple data blocks within the data file that correspond to the map entry; and
  
  for each data block that corresponds to the map entry;
  
  determining a location of the corresponding data block within the data file;
  
  selecting one of a first quantity of node devices that are currently available to perform a processing task; and
  
  providing a pointer to the selected one of the first quantity of node devices, wherein the pointer comprises;
  
  an indication of the location of the corresponding data block within the data file; and
  
  the data block size;
  
  or in response to a determination that the data set comprises partitioned data, performing operations comprising;
  
  comparing, by the processor component, the first quantity of node devices to a second quantity of node devices that were last involved in storage of the data set within the data file;
  
  assigning, by the processor component, a numerical designation value to each node device of the first quantity of node devices; and
  
  in response to a match between the first and second quantities of node devices, for each data block of the multiple data blocks, performing operations comprising;
  
  retrieving, from the map data, a hashed identifier for one data sub-block indicated by the map data to be within the data block, wherein;
  
  the data sub-block comprises data of the data set that belongs to a partition of the multiple partitions; and
  
  the hashed identifier is derived from a partition label of the partition;
  
  retrieving, from the map data, a data sub-block size for each of the data sub-blocks indicated by the map data to be within the data block;
  
  determining a location of the data block within the data file;
  
  dividing the hashed identifier by the first quantity of node devices to obtain a modulo value; and
  
  transmitting a pointer to a node device of the first quantity of node devices that is assigned a designation value that matches the modulo value, wherein the pointer comprises;
  
  an indication of the location of the data block within the data file; and
  
  a sum of data sub-block sizes of the data sub-blocks within the data block.
- Dependent Claims (22, 23, 24, 25, 26, 27, 28, 29, 30)
  - 22. The computer-implemented method of claim 21, wherein the selection of one of the first quantity node devices comprises a round robin selection of one of the first quantity of node devices.
  - 23. The computer-implemented method of claim 21, wherein in response to the determination that the data set comprises partitioned data and in response to a lack of a match between the first and second quantities of node devices, performing operations comprising:
    - for each data sub-block within each data block of the multiple data blocks, performing operations comprising;
      
      retrieving, from the map data, the data sub-block size and hashed identifier of the data sub-block;
      
      determining a location of the data sub-block within the data file;
      
      dividing the hashed identifier by the first quantity of node devices to obtain a modulo value; and
      
      providing a pointer to the node of the first quantity of node devices that is assigned a designation value that matches the modulo value, wherein the pointer comprises;
      
      an indication of the location of the data sub-block within the data file; and
      
      the data sub-block size of the data sub-block.
  - 24. The computer-implemented method of claim 21, wherein:
    - the processor component is incorporated into a node device of the first quantity of node devices; and
      
      the method comprises retrieving, via a network, a portion of the data set from the data file as one of the first quantity of node devices at least partially in parallel with at least one other node device of the first quantity of node devices.
  - 25. The computer-implemented method of claim 24, comprising performing, by the processor component, the processing task with a portion of the data set retrieved from the data file as the one of the first quantity of node devices at least partially in parallel with at least one other node device of the first quantity of node devices.
  - 26. The computer-implemented method of claim 21, wherein:
    - the data file is maintained by one or more storage devices; and
      
      the method comprises retrieving, from the one or more storage devices through a network, the map data and the metadata.
  - 27. The computer-implemented method of claim 26, wherein retrieving the map data from the data file comprises:
    - transmitting, via the network, a first command to the one or more storage devices to provide a map base of the data map from the data file;
      
      receiving, via the network, a map base from the data file;
      
      analyzing, by the processor component, the map base to determine whether at least a portion of the map data is stored within one or more map extensions within the data file; and
      
      in response to a determination that at least a portion of the map data is stored within one or more map extensions, performing operations comprising;
      
      transmitting, via the network, a second command to the one or more storage devices to provide at least one map extension of the one or more map extensions; and
      
      receiving, via the network, the at least one map extension from the data file.
  - 28. The computer-implemented method of claim 21, wherein:
    - the processor component is incorporated into a control device coupled to a grid of multiple node devices; and
      
      the method comprises;
      
      recurringly receiving, via a network, indications of availability to perform processing tasks from each node device of the multiple node devices;
      
      recurringly updating, by the processor component, a stored indication of availability of each node device of the multiple node devices; and
      
      analyzing, by the processor component, the stored indications of availability to identify the first quantity of node devices.
  - 29. The computer-implemented method of claim 28, comprising providing an indication of the processing task to the first quantity of node devices to enable more than one node device of the first quantity of node devices to each perform the processing task with a portion of the data set at least partially in parallel.
  - 30. The computer-implemented method of claim 29, comprising:
    - receiving, via the network, a request from a node device of the first quantity of node devices for a pointer to a location at which to store a portion of data generated by the performance of the processing task by the first quantity of node devices, wherein the request comprises an indication of size of the portion of the generated data;
      
      deriving, by the processor component, a location within another data file at which to store the portion of generated data based on a sum of sizes of portions of the generated data associated with previously received requests for pointers; and
      
      transmitting, via the network, an indication of the requested pointer to the node device.
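
The location derivation of claim 30 amounts to a running sum over the sizes named in earlier requests. A minimal sketch, assuming requests are handled one at a time and that generated data is packed contiguously from `payload_start` (the `OutputAllocator` class and its method name are hypothetical):

```python
class OutputAllocator:
    """Hands out storage locations within the output data file."""

    def __init__(self, payload_start: int = 0) -> None:
        # Sum of sizes from all previously received pointer requests,
        # offset by where the payload begins.
        self._next_offset = payload_start

    def request_pointer(self, size: int) -> int:
        """Return the location for a portion of `size` bytes."""
        if size < 0:
            raise ValueError("size must be non-negative")
        offset = self._next_offset
        self._next_offset += size
        return offset


allocator = OutputAllocator()
first = allocator.request_pointer(100)
second = allocator.request_pointer(50)
third = allocator.request_pointer(10)
```

Because each node reports only the size of its portion, the control device can place portions without seeing the data itself; a concurrent implementation would need to serialize access to the running sum.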

Current Assignee: SAS Institute Incorporated
Original Assignee: SAS Institute Incorporated
Inventors: Bowman, Brian Payton; Krueger, Steven E.; Knight, Richard Todd; Ho, Chih-Wei
Primary Examiner(s): Woolwine, Shane

Application Number: US15/804,570
Publication Number: US 20180075051A1
Time in Patent Office: 442 Days
Field of Search: 711135, 711170-173

CPC Class Codes
G06F 12/0292   using tables or multilevel ...
G06F 16/137   Hash-based content-based in...
G06F 16/1827   Management specifically ada...
G06F 16/22   Indexing; Data structures t...
G06F 16/278   Data partitioning, e.g. hor...
G06F 2212/1016   Performance improvement
G06F 2212/1056   Simplification
G06F 2212/154   Networked environment
G06F 2212/262   configured as RAID
G06F 2212/263   Network storage, e.g. SAN o...
G06F 3/0604   Improving or facilitating a...
G06F 3/0607   by facilitating the process...
G06F 3/061   Improving I/O performance
G06F 3/064   Management of blocks
G06F 3/0643   Management of files
G06F 3/0644   Management of space entitie...
G06F 3/067   Distributed or networked st...
G06F 9/5072   Grid computing
G06F 9/5077   Logical partitioning of res...
