Architecture and method for a burst buffer using flash technology

US 9,286,261 B1
Filed: 11/13/2012
Issued: 03/15/2016
Est. Priority Date: 11/14/2011
Status: Active Grant


Abstract

A parallel supercomputing cluster includes compute nodes interconnected in a mesh of data links for executing an MPI job, and solid-state storage nodes each linked to a respective group of the compute nodes for receiving checkpoint data from the respective compute nodes, and magnetic disk storage linked to each of the solid-state storage nodes for asynchronous migration of the checkpoint data from the solid-state storage nodes to the magnetic disk storage. Each solid-state storage node presents a file system interface to the MPI job, and multiple MPI processes of the MPI job write the checkpoint data to a shared file in the solid-state storage in a strided fashion, and the solid-state storage node asynchronously migrates the checkpoint data from the shared file in the solid-state storage to the magnetic disk storage and writes the checkpoint data to the magnetic disk storage in a sequential fashion.
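The two layouts named in the abstract can be illustrated with a small in-memory sketch. It is not taken from the patent: the round-robin block interleaving, the fixed block size, and the names `strided_offset`, `write_strided`, and `migrate_sequential` are all assumptions made for illustration. Each MPI process is modeled as a rank that writes fixed-size blocks into a shared buffer (the strided first layout), and migration regroups those blocks into one contiguous byte string per rank (one possible sequential second layout).

```python
from typing import Dict


def strided_offset(rank: int, block: int, n_ranks: int, block_size: int) -> int:
    """Byte offset of a rank's block-th block in the shared, strided file."""
    return (block * n_ranks + rank) * block_size


def write_strided(shared: bytearray, rank: int, data: bytes,
                  n_ranks: int, block_size: int) -> None:
    """Write one rank's checkpoint data into the shared buffer, block by block.

    Assumes len(data) is a multiple of block_size and that `shared` is
    already large enough (both are simplifications of this sketch).
    """
    assert len(data) % block_size == 0
    for block, start in enumerate(range(0, len(data), block_size)):
        off = strided_offset(rank, block, n_ranks, block_size)
        shared[off:off + block_size] = data[start:start + block_size]


def migrate_sequential(shared: bytes, n_ranks: int,
                       block_size: int) -> Dict[int, bytes]:
    """Regroup the strided buffer into one contiguous byte string per rank."""
    stride = n_ranks * block_size
    per_rank: Dict[int, bytes] = {}
    for rank in range(n_ranks):
        chunks = [
            shared[base + rank * block_size: base + (rank + 1) * block_size]
            for base in range(0, len(shared), stride)
        ]
        per_rank[rank] = b"".join(chunks)
    return per_rank


# Three hypothetical ranks, each writing two 4-byte blocks.
n_ranks, block_size = 3, 4
shared = bytearray(n_ranks * 2 * block_size)
for r in range(n_ranks):
    write_strided(shared, r, bytes([65 + r]) * 8, n_ranks, block_size)
files = migrate_sequential(bytes(shared), n_ranks, block_size)
```

In this sketch the shared buffer interleaves the ranks' blocks, while each entry of `files` holds a single rank's data contiguously, which is the kind of access pattern that suits magnetic disk.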

20 Claims

1. A parallel supercomputing cluster system comprising:
- hardware compute nodes interconnected in a mesh of data links for executing a Message Passing Interface (MPI) job stored in memory and for using MPI data transfer between the compute nodes over the mesh of data links; and
  
  hardware solid-state storage nodes each linked to a respective group of the compute nodes for receiving checkpoint data from the respective compute nodes, and magnetic disk storage linked to each of the solid-state storage nodes for asynchronous migration of the checkpoint data from the solid-state storage nodes to the magnetic disk storage;
  
  wherein each solid-state storage node includes a hardware data processor coupled to the respective group of compute nodes for receiving the checkpoint data from the respective group of compute nodes and coupled to the magnetic disk storage for transmitting the checkpoint data to the magnetic disk storage, solid-state storage coupled to the data processor for buffering the checkpoint data, and non-transitory computer readable storage medium storing computer instructions that, when executed by the data processor, perform the steps of:
  
  (a) presenting a file system interface to the MPI job, and multiple MPI processes of the MPI job writing the checkpoint data to a shared file in the solid-state storage in a strided fashion in a first data layout; and
  
  (b) asynchronously migrating the checkpoint data from the shared file in the solid-state storage to the magnetic disk storage and writing the checkpoint data to the magnetic disk storage in a sequential fashion in a second data layout.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The parallel supercomputing cluster as claimed in claim 1, wherein the solid-state storage includes flash memory in solid-state drives.
  - 3. The parallel supercomputer cluster as claimed in claim 1, wherein the second data layout includes a separate checkpoint file for each of the multiple MPI processes of the MPI job.
  - 4. The parallel supercomputing cluster as claimed in claim 1, wherein the computer instructions include a first file system server layer for presenting the file system interface to the MPI job and defining the first data layout, a writeback scheduler for scheduling the asynchronous migration of the checkpoint data from the shared file in the solid-state storage to the magnetic disk storage, a second file system layer for conversion between the first data layout and the second data layout, and a third file system layer for defining the second data layout.
  - 5. The parallel supercomputer cluster as claimed in claim 4, wherein the second file system layer provides a parallel log-structured file system.
  - 6. The parallel supercomputer cluster as claimed in claim 4, wherein the third file system layer provides a global file system.
  - 7. The parallel supercomputer cluster as claimed in claim 1, wherein the computer instructions, when executed by the data processor, respond to MPI invocations from the MPI job in the respective group of compute nodes to locate and specify data placement in the first data layout in the solid-state storage.
  - 8. The parallel supercomputer cluster as claimed in claim 7, wherein the MPI invocations include an invocation collective to all the compute nodes and the solid-state storage nodes to take file view information and read and distribute associated file data to the other nodes, so that there is a local copy in the solid-state storage of each solid-state storage node of what each MPI process will use on a subsequent MPI operation.
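The layering recited in claim 4 (a front-end interface, a writeback scheduler, a layout-conversion layer, and a back-end layout) can be sketched with a background thread. This is a minimal illustration, not the patented implementation: the class `BurstBufferNode`, its `convert` hook, and the dictionaries standing in for solid-state and magnetic disk storage are hypothetical.

```python
import queue
import threading
from typing import Callable, Dict


class BurstBufferNode:
    """Toy storage node: writes land in 'flash' and drain to 'disk' later."""

    def __init__(self, convert: Callable[[bytes], bytes] = lambda d: d) -> None:
        self.flash: Dict[str, bytes] = {}   # stand-in for solid-state storage
        self.disk: Dict[str, bytes] = {}    # stand-in for magnetic disk storage
        self._convert = convert             # stand-in for the layout-conversion layer
        self._pending: "queue.Queue[str]" = queue.Queue()
        # Writeback scheduler: a daemon thread that migrates queued files.
        worker = threading.Thread(target=self._writeback_loop, daemon=True)
        worker.start()

    def write(self, name: str, data: bytes) -> None:
        """Front-end write: returns as soon as the data is buffered."""
        self.flash[name] = data
        self._pending.put(name)

    def _writeback_loop(self) -> None:
        while True:
            name = self._pending.get()
            try:
                self.disk[name] = self._convert(self.flash[name])
            finally:
                self._pending.task_done()

    def drain(self) -> None:
        """Block until every queued file has been migrated."""
        self._pending.join()


node = BurstBufferNode(convert=bytes.upper)  # trivial stand-in conversion
node.write("ckpt.0", b"state")
node.drain()
```

The point of the design is that `write` never waits on the slow tier; only `drain` does, which a job would call (if at all) outside its compute phase.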

9. A parallel supercomputing cluster system comprising:
- hardware compute nodes interconnected in a mesh of data links for executing a Message Passing Interface (MPI) job stored in memory and for using MPI data transfer between the compute nodes over the mesh of data links; and
  
  hardware solid-state storage nodes each linked to a respective group of the compute nodes for receiving checkpoint data from the respective compute nodes, and magnetic disk storage linked to each of the solid-state storage nodes for asynchronous migration of the checkpoint data from the solid-state storage nodes to the magnetic disk storage;
  
  wherein each solid-state storage node includes a data processor coupled to the respective group of compute nodes for receiving the checkpoint data from the respective group of compute nodes and coupled to the magnetic disk storage for transmitting the checkpoint data to the magnetic disk storage, solid-state storage coupled to the data processor for buffering the checkpoint data, and non-transitory computer readable storage medium storing computer instructions that, when executed by the data processor, perform the steps of:
  
  (a) presenting a file system interface to the MPI job, and multiple MPI processes of the MPI job writing the checkpoint data to a shared file in the solid-state storage in a strided fashion in a first data layout, the checkpoint data being migrated from each compute node in the respective group of compute nodes to said each solid-state storage node by using remote direct memory access (RDMA); and
  
  (b) asynchronously migrating the checkpoint data from the shared file in the solid-state storage to the magnetic disk storage and writing the checkpoint data to the magnetic disk storage in a sequential fashion in a second data layout.
- Dependent claims (10, 11, 12, 13):
  - 10. The parallel supercomputing cluster as claimed in claim 9, wherein the solid-state storage includes flash memory in solid-state drives.
  - 11. The parallel supercomputer cluster as claimed in claim 9, wherein the second data layout includes a separate checkpoint file for each of the multiple MPI processes of the MPI job.
  - 12. The parallel supercomputer cluster as claimed in claim 9, wherein the computer instructions, when executed by the data processor, respond to MPI invocations from the MPI job in the respective group of compute nodes to locate and specify data placement in the first data layout in the solid-state storage, and the MPI invocations include an invocation collective to all the compute nodes and the solid-state storage nodes to take file view information and read and distribute associated file data to the other nodes, so that there is a local copy in the solid-state storage of each solid-state storage node of what each MPI process will use on a subsequent MPI operation.
  - 13. The parallel supercomputing cluster as claimed in claim 9, which further includes the Parallel Log-Structured File System (PLFS) writing metadata and logs to a Lustre File System to write the metadata and logs to backing storage, and the Parallel Log-Structured File System (PLFS) writing data to a Network File System (NFS) to write the data to backing storage.

14. A method of operating a parallel supercomputing cluster including hardware compute nodes interconnected in a mesh of data links for executing a Message Passing Interface (MPI) job stored in memory and for using MPI data transfer between the compute nodes over the mesh of data links, hardware solid-state storage nodes each linked to a respective group of the compute nodes for receiving checkpoint data from the respective compute nodes, and magnetic disk storage linked to each of the solid-state storage nodes for asynchronous migration of the checkpoint data from the solid-state storage nodes to the magnetic disk storage, wherein each solid-state storage node includes a hardware data processor coupled to the respective group of compute nodes for receiving the checkpoint data from the respective group of compute nodes and coupled to the magnetic disk storage for transmitting the checkpoint data to the magnetic disk storage, and solid-state storage coupled to the data processor for buffering the checkpoint data, and non-transitory computer readable storage medium storing computer instructions, said method comprising the data processor executing the computer instructions to perform the steps of:
- (a) presenting a file system interface to the MPI job, and multiple MPI processes of the MPI job writing the checkpoint data to a shared file in the solid-state storage in a strided fashion in a first data layout; and
  
  (b) asynchronously migrating the checkpoint data from the shared file in the solid-state storage to the magnetic disk storage and writing the checkpoint data to the magnetic disk storage in a sequential fashion in a second data layout.
- Dependent claims (15, 16, 17, 18, 19, 20):
  - 15. The method as claimed in claim 14, wherein the computer instructions include a first file system server layer presenting the file system interface to the MPI job and defining the first data layout, a writeback scheduler scheduling the asynchronous migration of the checkpoint data from the shared file in the solid-state storage to the magnetic disk storage, a second file system layer converting between the first data layout and the second data layout, and a third file system layer defining the second data layout.
  - 16. The method as claimed in claim 15, which further includes the second file system layer providing a parallel log-structured file system.
  - 17. The method as claimed in claim 15, which further includes the third file system layer providing a global file system.
  - 18. The method as claimed in claim 15, which further includes the data processor executing the computer instructions to respond to MPI invocations from the MPI job in the respective group of compute nodes to locate and specify data placement in the first data layout in the solid-state storage.
  - 19. The method as claimed in claim 18, wherein the MPI invocations include an invocation collective to all the compute nodes and the solid-state storage nodes to take file view information and read and distribute associated file data to the other nodes, so that there is a local copy in the solid-state storage of each solid-state storage node of what each MPI process will use on a subsequent MPI operation.
  - 20. The method as claimed in claim 14, which further includes migrating the checkpoint data from each compute node in the respective group of compute nodes to said each solid-state storage node by using remote direct memory access (RDMA).
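Claims 5 and 16 name a parallel log-structured file system as the layer that converts between layouts. The general log-structured idea can be sketched as follows, under stated assumptions: each writer appends to its own data log and records an index entry mapping a logical offset in the shared file to a position in that log. The `WriterLog` class, the index tuple format, and `read_logical` are hypothetical and do not reproduce the on-disk format of any real file system; overlapping writes are not handled, so a real implementation would need a rule for resolving them.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class WriterLog:
    """Per-writer append-only data log plus an index into the logical file."""
    data: bytearray = field(default_factory=bytearray)
    # Each entry is (logical_offset, length, log_offset).
    index: List[Tuple[int, int, int]] = field(default_factory=list)

    def write(self, logical_offset: int, payload: bytes) -> None:
        """Append the payload sequentially and remember where it belongs."""
        self.index.append((logical_offset, len(payload), len(self.data)))
        self.data += payload


def read_logical(logs: List[WriterLog], size: int) -> bytes:
    """Rebuild the logical shared file by replaying every writer's index."""
    out = bytearray(size)
    for log in logs:
        for logical_offset, length, log_offset in log.index:
            out[logical_offset:logical_offset + length] = \
                log.data[log_offset:log_offset + length]
    return bytes(out)


# Two hypothetical writers interleave 2-byte blocks in one logical file.
logs = [WriterLog(), WriterLog()]
logs[0].write(0, b"aa")
logs[1].write(2, b"bb")
logs[0].write(4, b"AA")
logs[1].write(6, b"BB")
logical = read_logical(logs, 8)
```

Each writer's log stays purely sequential (`aaAA` and `bbBB` here), while the index preserves enough information to present the interleaved view (`aabbAABB`) that the application expects.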

Current Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.), TRIAD National Security, LLC
Original Assignee: EMC Corporation (Dell Technologies Inc.), Los Alamos National Security LLC (Government of the United States of America)
Inventors: Tzelnic, Percy; Faibish, Sorin; Gupta, Uday K.; Bent, John; Grider, Gary Alan; Chen, Hsing-bung
Primary Examiner(s): Sison, June
Application Number: US13/676,000
Time in Patent Office: 1,218 Days
US Class Current: 1/1
CPC Class Codes

G06F 11/108   Parity data distribution in...

G06F 15/167   using a common memory, e.g....

G06F 16/1847   specifically adapted to sta...

G06F 2211/1028   Distributed, i.e. distribut...
