Method and apparatus for offloading compute resources to a flash co-processing appliance

US 9,158,540 B1
Filed: 11/13/2012
Issued: 10/13/2015
Est. Priority Date: 11/14/2011
Status: Active Grant


12 Assignments

0 Petitions

Abstract

Solid-State Drive (SSD) burst buffer nodes are interposed into a parallel supercomputing cluster to enable fast burst checkpointing of cluster memory to or from nearby interconnected solid-state storage, with asynchronous migration between the burst buffer nodes and slower, more distant disk storage. The SSD nodes also perform tasks offloaded from the compute nodes or associated with the checkpoint data. For example, the data for the next job is preloaded into the SSD node and uploaded very quickly to the respective compute node just before the next job starts. During a job, the SSD nodes perform fast visualization and statistical analysis of the checkpoint data. The SSD nodes can also perform data reduction and encryption of the checkpoint data.
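The abstract's central mechanism, absorbing a checkpoint burst into nearby flash and draining it to disk in the background, can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the patented implementation: the class `BurstBufferNode` and its methods are hypothetical, two Python dicts stand in for the SSD and disk tiers, and a single background thread plays the role of asynchronous migration.

```python
import queue
import threading


class BurstBufferNode:
    """Toy model of an SSD burst-buffer node (hypothetical API).

    Two dicts stand in for the flash tier and the disk tier. A background
    thread migrates buffered checkpoints to "disk", so checkpoint() returns
    as soon as the data is held in "flash".
    """

    def __init__(self):
        self.flash = {}              # fast, nearby tier (stand-in for SSD)
        self.disk = {}               # slow, distant tier (stand-in for disk)
        self._pending = queue.Queue()
        self._lock = threading.Lock()
        self._drainer = threading.Thread(target=self._drain, daemon=True)
        self._drainer.start()

    def checkpoint(self, name, data):
        """Absorb a checkpoint burst into flash and queue it for migration."""
        with self._lock:
            self.flash[name] = bytes(data)
        self._pending.put(name)

    def _drain(self):
        # Runs concurrently with the compute job: move each buffered
        # checkpoint to disk, then free its space in flash.
        while True:
            name = self._pending.get()
            try:
                if name is None:     # shutdown sentinel
                    return
                with self._lock:
                    data = self.flash.pop(name, None)
                # None means an earlier queue entry already migrated the latest data.
                if data is not None:
                    self.disk[name] = data
            finally:
                self._pending.task_done()

    def wait_for_migration(self):
        """Block until every queued checkpoint has reached disk."""
        self._pending.join()

    def close(self):
        """Stop the background drainer thread."""
        self._pending.put(None)
        self._drainer.join()


node = BurstBufferNode()
node.checkpoint("step-0001", b"\x00" * 1024)
node.checkpoint("step-0002", b"\x01" * 1024)
node.wait_for_migration()
print(sorted(node.disk))   # ['step-0001', 'step-0002']
node.close()
```

The compute side only pays for the copy into the fast tier; migration cost is hidden behind the queue, which is the property the abstract attributes to the burst buffer.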

54 Citations

20 Claims

1. A parallel supercomputing cluster comprising:
- compute nodes interconnected in a mesh of data links for executing a Message Passing Interface (MPI) job using MPI data transfer between the compute nodes over the mesh of data links; and
  
  solid-state storage nodes each linked to a respective group of the compute nodes for receiving checkpoint data from the respective compute nodes, and magnetic disk storage linked to each of the solid-state storage nodes for asynchronous migration of the checkpoint data from the solid-state storage nodes to the magnetic disk storage;
  
  wherein each solid-state storage node includes a data processor coupled to the respective group of compute nodes for receiving the checkpoint data from the respective group of compute nodes and coupled to the magnetic disk storage for transmitting the checkpoint data to the magnetic disk storage, solid state storage coupled to the data processor for buffering the checkpoint data, and non-transitory computer readable storage medium storing computer instructions that, when executed by the data processor, perform the steps of:
  
  (a) presenting a file system interface to the MPI job, and multiple MPI processes of the MPI job writing the checkpoint data to a shared file in the solid-state storage in a strided fashion in a first data layout;
  
  (b) asynchronously migrating the checkpoint data from the shared file in the solid-state storage to the magnetic disk storage and writing the checkpoint data to the magnetic disk storage in a sequential fashion in a second data layout; and
  
  (c) performing additional tasks offloaded from the compute nodes or associated with the checkpoint data.
2. The parallel supercomputing cluster as claimed in claim 1, wherein the additional tasks include preloading data into said each solid-state storage node for processing by the MPI job in the respective group of compute nodes.
3. The parallel supercomputing cluster as claimed in claim 2, which further includes a control station computer and a network coupling the control station computer to each of the solid-state storage nodes, and the pre-loaded data is transmitted from the control station computer to the solid-state storage nodes over the network coupling the control station computer to each of the solid-state storage nodes.
4. The parallel supercomputing cluster as claimed in claim 3, wherein the pre-loaded data is transmitted from the control station computer to the solid-state storage nodes over the network coupling the control station computer to each of the solid-state storage nodes and preloaded into said each of the solid-state storage nodes without any impact on MPI traffic on the mesh of data links.
5. The parallel supercomputing cluster as claimed in claim 3, wherein all of the pre-loaded data is transmitted from the control station computer to the solid-state storage nodes over the network coupling the control station computer to each of the solid-state storage nodes for the MPI job before completion of execution of a previous MPI job.
6. The parallel supercomputing cluster as claimed in claim 1, wherein the additional tasks include processing of the checkpoint data to produce a visualization of the checkpoint data presented in real time to a user.
7. The parallel supercomputing cluster as claimed in claim 1, wherein the additional tasks include performing a statistical analysis of the checkpoint data presented in real time to a user.
8. The parallel supercomputing cluster as claimed in claim 1, wherein the additional tasks include performing an analysis of the checkpoint data in order to terminate a simulation upon detection of a simulation error.
9. The parallel supercomputing cluster as claimed in claim 1, wherein the additional tasks include performing data reduction operations upon the checkpoint data to reduce the magnetic disk storage capacity needed to store the checkpoint data.
10. The parallel supercomputing cluster as claimed in claim 1, wherein the additional tasks include encrypting the checkpoint data so that the encrypted checkpoint data is stored in the magnetic disk storage.
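Steps (a) and (b) of claim 1 describe a change of data layout between the two storage tiers. The sketch assumes a common N-to-1 strided pattern, in which record `k` of rank `r` lands at offset `(k * P + r) * B` for `P` ranks and block size `B`, and assumes the second layout places each rank's records contiguously; neither formula is stated in the claims. The function names are hypothetical, a `bytearray` stands in for the shared file on flash, and `zlib` compression is used only as one example of the data reduction named in claim 9.

```python
import zlib


def strided_offset(rank, k, nprocs, block):
    """Offset of record k of `rank` in the shared file (assumed N-to-1 strided pattern)."""
    return (k * nprocs + rank) * block


def write_strided(shared, rank, records, nprocs, block):
    """Step (a): one MPI rank writes its fixed-size records into the shared file."""
    for k, rec in enumerate(records):
        if len(rec) != block:
            raise ValueError("every record must be exactly `block` bytes")
        off = strided_offset(rank, k, nprocs, block)
        shared[off:off + block] = rec


def migrate_sequential(shared, nprocs, nrecords, block, compress=False):
    """Step (b): rewrite the strided file as one sequential stream for disk.

    Assumed second layout: all records of rank 0, then all of rank 1, and
    so on. With compress=True, zlib stands in for a data-reduction task.
    """
    out = bytearray()
    for rank in range(nprocs):
        for k in range(nrecords):
            off = strided_offset(rank, k, nprocs, block)
            out += shared[off:off + block]
    data = bytes(out)
    return zlib.compress(data) if compress else data


# Three ranks, two 4-byte records each; rank r's record k is filled with r*10 + k.
NPROCS, NRECORDS, BLOCK = 3, 2, 4
shared = bytearray(NPROCS * NRECORDS * BLOCK)   # stands in for the shared file on flash
for rank in range(NPROCS):
    records = [bytes([rank * 10 + k]) * BLOCK for k in range(NRECORDS)]
    write_strided(shared, rank, records, NPROCS, BLOCK)

disk_image = migrate_sequential(shared, NPROCS, NRECORDS, BLOCK)
print(list(shared[::BLOCK]))      # [0, 10, 20, 1, 11, 21]  (interleaved by rank)
print(list(disk_image[::BLOCK]))  # [0, 1, 10, 11, 20, 21]  (grouped by rank)
```

The flash tier tolerates the small interleaved writes from many ranks, while the disk receives one long sequential stream, which is the access pattern magnetic disks handle best.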

11. A method of operating a parallel supercomputing cluster, the parallel supercomputing cluster including compute nodes, solid-state storage nodes, and magnetic disk storage, the compute nodes being interconnected in a mesh of data links for executing a Message Passing Interface (MPI) job using MPI data transfer between the compute nodes over the mesh of data links, each of the solid-state storage nodes being linked to a respective group of the compute nodes for receiving checkpoint data from the respective compute nodes, and the magnetic disk storage being linked to each of the solid-state storage nodes for asynchronous migration of the checkpoint data from the solid-state storage nodes to the magnetic disk storage, and each of the solid-state storage nodes including a data processor coupled to the respective group of compute nodes for receiving the checkpoint data from the respective group of compute nodes and coupled to the magnetic disk storage for transmitting the checkpoint data to the magnetic disk storage, and solid-state storage coupled to the data processor for buffering the checkpoint data, and non-transitory computer readable storage medium storing computer instructions, said method comprising the data processor executing the computer instructions to perform the steps of:
- (a) presenting a file system interface to the MPI job, and multiple MPI processes of the MPI job writing the checkpoint data to a shared file in the solid-state storage in a strided fashion in a first data layout;
  
  (b) asynchronously migrating the checkpoint data from the shared file in the solid-state storage to the magnetic disk storage and writing the checkpoint data to the magnetic disk storage in a sequential fashion in a second data layout; and
  
  (c) performing additional tasks offloaded from the compute nodes or associated with the checkpoint data.
12. The method as claimed in claim 11, wherein the additional tasks include preloading data into said each solid-state storage node for processing by the MPI job in the respective group of compute nodes.
13. The method as claimed in claim 12, wherein the parallel supercomputing cluster further includes a control station computer and a network coupling the control station computer to each of the solid-state storage nodes, and the method further includes transmitting the pre-loaded data from the control station computer to the solid-state storage nodes over the network coupling the control station computer to each of the solid-state storage nodes.
14. The method as claimed in claim 13, wherein the pre-loaded data is transmitted from the control station computer to the solid-state storage nodes over the network coupling the control station computer to each of the solid-state storage nodes and preloaded into said each of the solid-state storage nodes without any impact on MPI traffic on the mesh of data links.
15. The method as claimed in claim 13, wherein all of the pre-loaded data is transmitted from the control station computer to the solid-state storage nodes over the network coupling the control station computer to each of the solid-state storage nodes for the MPI job before completion of execution of a previous MPI job.
16. The method as claimed in claim 11, wherein the additional tasks include processing of the checkpoint data to produce a visualization of the checkpoint data presented in real time to a user.
17. The method as claimed in claim 11, wherein the additional tasks include performing a statistical analysis of the checkpoint data presented in real time to a user.
18. The method as claimed in claim 11, wherein the additional tasks include performing an analysis of the checkpoint data in order to terminate a simulation upon detection of a simulation error.
19. The method as claimed in claim 11, wherein the additional tasks include performing data reduction operations upon the checkpoint data to reduce the magnetic disk storage capacity needed to store the checkpoint data.
20. The method as claimed in claim 11, wherein the additional tasks include encrypting the checkpoint data so that the encrypted checkpoint data is stored in the magnetic disk storage.
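Claims 17 and 18 (mirroring claims 7 and 8) have the storage node analyze checkpoint data and stop a simulation that has gone wrong. This is a minimal sketch, assuming the checkpoint is a flat list of floats and using hypothetical error criteria (any NaN or infinity, or a magnitude above a caller-supplied limit). The function names are illustrative, and the `terminate` callback stands in for whatever mechanism would actually cancel the MPI job.

```python
import math
import statistics


def summarize(values):
    """Summary statistics a storage node could report while the job runs."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.pstdev(values),
    }


def detect_error(values, limit):
    """Return a reason string if the checkpoint looks wrong, else None.

    Hypothetical criteria: any NaN or infinity, or |value| > limit.
    """
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return f"non-finite value at index {i}"
        if abs(v) > limit:
            return f"value {v} at index {i} exceeds limit {limit}"
    return None


def analyze_checkpoint(values, limit, terminate):
    """Check for errors first; call `terminate` and stop if one is found."""
    reason = detect_error(values, limit)
    if reason is not None:
        terminate(reason)
        return None
    return summarize(values)


stopped = []   # collects termination reasons in place of a real job-cancel call
print(analyze_checkpoint([1.0, 2.0, 3.0], 1e6, stopped.append)["mean"])   # 2.0
print(analyze_checkpoint([1.0, float("nan")], 1e6, stopped.append))       # None
print(stopped)   # ['non-finite value at index 1']
```

Running the error check before the statistics keeps a diverged checkpoint from producing misleading summaries and lets the node signal termination as early as possible.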

Current Assignee
EMC IP Holding Company LLC (Dell Technologies Inc.), TRIAD National Security, LLC
Original Assignee
EMC Corporation (Dell Technologies Inc.), Los Alamos National Security LLC (Government of the United States of America)
Inventors
Tzelnic, Percy; Faibish, Sorin; Gupta, Uday K.; Bent, John; Grider, Gary Alan; Chen, Hsing-bung
Primary Examiner(s)
Nam, Hyun

Application Number
US13/676,019
Time in Patent Office
1,064 Days
US Class Current
1/1
CPC Class Codes

G06F 11/108   Parity data distribution in...

G06F 11/1438   Restarting or rejuvenating

G06F 2211/1028   Distributed, i.e. distribut...
