Architecture and method for a burst buffer using flash technology
Abstract
A parallel supercomputing cluster includes compute nodes interconnected in a mesh of data links for executing an MPI job, and solid-state storage nodes each linked to a respective group of the compute nodes for receiving checkpoint data from the respective compute nodes, and magnetic disk storage linked to each of the solid-state storage nodes for asynchronous migration of the checkpoint data from the solid-state storage nodes to the magnetic disk storage. Each solid-state storage node presents a file system interface to the MPI job, and multiple MPI processes of the MPI job write the checkpoint data to a shared file in the solid-state storage in a strided fashion, and the solid-state storage node asynchronously migrates the checkpoint data from the shared file in the solid-state storage to the magnetic disk storage and writes the checkpoint data to the magnetic disk storage in a sequential fashion.
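The difference between the strided layout of the shared file and the sequential layout on disk can be illustrated with a small offset calculation. The sketch below is only one possible reading: it assumes fixed-size blocks, an equal number of blocks per MPI process, and a sequential layout in which each process's blocks end up contiguous. None of these details, nor the names `strided_offset`, `sequential_offset`, `write_strided`, and `to_sequential`, come from the patent text; they are hypothetical and chosen for illustration.

```python
def strided_offset(rank, block, n_ranks, block_size):
    """Byte offset of a block in the shared file (assumed first layout).

    Blocks from different ranks are interleaved: round 0 holds block 0 of
    every rank, round 1 holds block 1 of every rank, and so on.
    """
    return (block * n_ranks + rank) * block_size


def sequential_offset(rank, block, blocks_per_rank, block_size):
    """Byte offset of the same block in the assumed second layout.

    All blocks of one rank are stored back to back.
    """
    return (rank * blocks_per_rank + block) * block_size


def write_strided(n_ranks, blocks_per_rank, block_size):
    """Simulate every rank writing its blocks into one shared buffer."""
    shared = bytearray(n_ranks * blocks_per_rank * block_size)
    for rank in range(n_ranks):
        for block in range(blocks_per_rank):
            # Payload is just the rank number repeated (assumes n_ranks <= 256).
            payload = bytes([rank]) * block_size
            off = strided_offset(rank, block, n_ranks, block_size)
            shared[off:off + block_size] = payload
    return bytes(shared)


def to_sequential(shared, n_ranks, blocks_per_rank, block_size):
    """Copy each block from its strided offset to its sequential offset."""
    out = bytearray(len(shared))
    for rank in range(n_ranks):
        for block in range(blocks_per_rank):
            src = strided_offset(rank, block, n_ranks, block_size)
            dst = sequential_offset(rank, block, blocks_per_rank, block_size)
            out[dst:dst + block_size] = shared[src:src + block_size]
    return bytes(out)
```

With three ranks, two blocks per rank, and two-byte blocks, `write_strided(3, 2, 2)` interleaves the ranks as `0 0 1 1 2 2 0 0 1 1 2 2`, and `to_sequential` regroups the same bytes as `0 0 0 0 1 1 1 1 2 2 2 2`.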
20 Claims
1. A parallel supercomputing cluster system comprising:
hardware compute nodes interconnected in a mesh of data links for executing a Message Passing Interface (MPI) job stored in memory and for using MPI data transfer between the compute nodes over the mesh of data links; and
hardware solid-state storage nodes each linked to a respective group of the compute nodes for receiving checkpoint data from the respective compute nodes, and magnetic disk storage linked to each of the solid-state storage nodes for asynchronous migration of the checkpoint data from the solid-state storage nodes to the magnetic disk storage;
wherein each solid-state storage node includes a hardware data processor coupled to the respective group of compute nodes for receiving the checkpoint data from the respective group of compute nodes and coupled to the magnetic disk storage for transmitting the checkpoint data to the magnetic disk storage, solid-state storage coupled to the data processor for buffering the checkpoint data, and non-transitory computer readable storage medium storing computer instructions that, when executed by the data processor, perform the steps of;
(a) presenting a file system interface to the MPI job, and multiple MPI processes of the MPI job writing the checkpoint data to a shared file in the solid-state storage in a strided fashion in a first data layout; and
(b) asynchronously migrating the checkpoint data from the shared file in the solid-state storage to the magnetic disk storage and writing the checkpoint data to the magnetic disk storage in a sequential fashion in a second data layout.
Dependent claims: 2, 3, 4, 5, 6, 7, 8.
9. A parallel supercomputing cluster system comprising:
hardware compute nodes interconnected in a mesh of data links for executing a Message Passing Interface (MPI) job stored in memory and for using MPI data transfer between the compute nodes over the mesh of data links; and
hardware solid-state storage nodes each linked to a respective group of the compute nodes for receiving checkpoint data from the respective compute nodes, and magnetic disk storage linked to each of the solid-state storage nodes for asynchronous migration of the checkpoint data from the solid-state storage nodes to the magnetic disk storage;
wherein each solid-state storage node includes a data processor coupled to the respective group of compute nodes for receiving the checkpoint data from the respective group of compute nodes and coupled to the magnetic disk storage for transmitting the checkpoint data to the magnetic disk storage, solid-state storage coupled to the data processor for buffering the checkpoint data, and non-transitory computer readable storage medium storing computer instructions that, when executed by the data processor, perform the steps of;
(a) presenting a file system interface to the MPI job, and multiple MPI processes of the MPI job writing the checkpoint data to a shared file in the solid-state storage in a strided fashion in a first data layout, the checkpoint data being migrated from each compute node in the respective group of compute nodes to said each solid-state storage node by using remote direct memory access (RDMA); and
(b) asynchronously migrating the checkpoint data from the shared file in the solid-state storage to the magnetic disk storage and writing the checkpoint data to the magnetic disk storage in a sequential fashion in a second data layout.
Dependent claims: 10, 11, 12, 13.
14. A method of operating a parallel supercomputing cluster including hardware compute nodes interconnected in a mesh of data links for executing a Message Passing Interface (MPI) job stored in memory and for using MPI data transfer between the compute nodes over the mesh of data links, hardware solid-state storage nodes each linked to a respective group of the compute nodes for receiving checkpoint data from the respective compute nodes, and magnetic disk storage linked to each of the solid-state storage nodes for asynchronous migration of the checkpoint data from the solid-state storage nodes to the magnetic disk storage, wherein each solid-state storage node includes a hardware data processor coupled to the respective group of compute nodes for receiving the checkpoint data from the respective group of compute nodes and coupled to the magnetic disk storage for transmitting the checkpoint data to the magnetic disk storage, and solid-state storage coupled to the data processor for buffering the checkpoint data, and non-transitory computer readable storage medium storing computer instructions, said method comprising the data processor executing the computer instructions to perform the steps of:
(a) presenting a file system interface to the MPI job, and multiple MPI processes of the MPI job writing the checkpoint data to a shared file in the solid-state storage in a strided fashion in a first data layout; and
(b) asynchronously migrating the checkpoint data from the shared file in the solid-state storage to the magnetic disk storage and writing the checkpoint data to the magnetic disk storage in a sequential fashion in a second data layout.
Dependent claims: 15, 16, 17, 18, 19, 20.
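Steps (a) and (b) recited in the independent claims can be mimicked on a single machine to show how the checkpoint write is decoupled from the later migration. The following is a minimal sketch under stated assumptions, not the claimed implementation: ordinary files stand in for the solid-state and magnetic disk tiers, a background thread stands in for the storage node's data processor, plain function calls stand in for MPI processes, and the sequential layout is assumed to group each process's blocks contiguously. The class `BurstBufferSketch`, its methods, and `run_demo` are hypothetical names. The code relies on `os.pread` and `os.pwrite`, which are available on Unix-like systems.

```python
import os
import queue
import tempfile
import threading


class BurstBufferSketch:
    """Hypothetical stand-in for one solid-state storage node."""

    def __init__(self, ssd_path, disk_path, n_ranks, blocks_per_rank, block_size):
        self.n_ranks = n_ranks
        self.blocks_per_rank = blocks_per_rank
        self.block_size = block_size
        self.disk_path = disk_path
        # Shared checkpoint file on the simulated solid-state tier.
        self._ssd_fd = os.open(ssd_path, os.O_RDWR | os.O_CREAT, 0o600)
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._migrate_loop, daemon=True)
        self._worker.start()

    def write(self, rank, block, payload):
        """Step (a): strided write of one block into the shared file."""
        assert len(payload) == self.block_size
        offset = (block * self.n_ranks + rank) * self.block_size
        os.pwrite(self._ssd_fd, payload, offset)

    def commit(self):
        """Queue the finished checkpoint for migration and return at once."""
        self._jobs.put("migrate")

    def _migrate_loop(self):
        """Step (b): runs in the background, independent of the writers."""
        while True:
            job = self._jobs.get()
            if job is None:
                self._jobs.task_done()
                return
            # The disk file is written front to back, one rank after another.
            with open(self.disk_path, "wb") as disk:
                for rank in range(self.n_ranks):
                    for block in range(self.blocks_per_rank):
                        src = (block * self.n_ranks + rank) * self.block_size
                        disk.write(os.pread(self._ssd_fd, self.block_size, src))
            self._jobs.task_done()

    def close(self):
        """Wait for pending migrations, then stop the worker thread."""
        self._jobs.join()
        self._jobs.put(None)
        self._worker.join()
        os.close(self._ssd_fd)


def run_demo():
    """Write a small checkpoint from 3 simulated ranks and return the disk copy."""
    with tempfile.TemporaryDirectory() as tmp:
        disk_path = os.path.join(tmp, "disk.ckpt")
        bb = BurstBufferSketch(os.path.join(tmp, "ssd.ckpt"), disk_path, 3, 2, 4)
        for block in range(2):
            for rank in range(3):
                bb.write(rank, block, bytes([65 + rank]) * 4)  # "AAAA", "BBBB", "CCCC"
        bb.commit()  # the simulated job could resume computing here
        bb.close()
        with open(disk_path, "rb") as f:
            return f.read()
```

In this sketch `commit()` returns as soon as the migration request is queued, so the simulated job is not held up by the slower disk tier; only `close()` blocks until the sequential copy has been written.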
Specification