Ultrascalable petaflop parallel supercomputer
Abstract
A massively parallel supercomputer of petaOPS scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that maximize the throughput of packet communications between nodes while minimizing latency. The multiple networks may include three high-speed networks for parallel algorithm message passing: a torus network, a collective network, and a global asynchronous network that provides global barrier and notification functions. These multiple independent networks may be utilized collaboratively or independently, according to the needs or phases of an algorithm, to optimize algorithm processing performance. A DMA engine facilitates message passing among the nodes without expending processing resources at the node.
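The n-dimensional torus interconnect summarized above can be illustrated with a small sketch. The three-dimensional 4x4x4 shape used here is an illustrative assumption; the patent covers a torus of arbitrary dimension and size.

```python
# Sketch: nearest-neighbor links on an n-dimensional torus.
# The shape passed in is illustrative; any dimension count works.

def torus_neighbors(coord, shape):
    """Return the 2*n wraparound neighbors of a node on an n-D torus."""
    neighbors = []
    for dim, size in enumerate(shape):
        for step in (-1, +1):
            n = list(coord)
            n[dim] = (n[dim] + step) % size  # wraparound link closes the torus
            neighbors.append(tuple(n))
    return neighbors

# Even a "corner" node has a full set of 2*n neighbors thanks to wraparound.
corner_links = torus_neighbors((0, 0, 0), (4, 4, 4))
```

The wraparound modulo is what distinguishes a torus from a plain mesh: every node has the same link count, so point-to-point routing is uniform across the machine.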
44 Claims
1. A scalable, massively parallel computer system comprising:
a plurality of processing nodes interconnected by independent networks, each node including one or more processing elements for performing computation or communication activity as required when performing parallel algorithm operations, each of said processing nodes including a direct memory access (DMA) element operable for providing a plurality of functions for said processing node; and
a first of said multiple independent networks comprising an n-dimensional torus network including communication links interconnecting said nodes in a manner optimized for providing high-speed, low-latency point-to-point and multicast packet communications among said nodes or sub-sets of nodes, said processing node DMA element providing a communications message passing interface enabling communication of messages among said nodes;

a second of said multiple independent networks including a scalable collective network comprising nodal interconnections that facilitate simultaneous global operations among nodes or sub-sets of nodes of said network; and

partitioning means for dynamically configuring one or more combinations of independent processing networks according to needs of one or more algorithms, each independent network including a configurable sub-set of processing nodes interconnected by divisible portions of said first and second networks, wherein each of said configured independent processing networks is utilized to enable simultaneous collaborative processing for optimizing algorithm processing performance, and wherein each said DMA element at said nodes comprises:

a processor interface for interfacing with the at least one processor, a DMA controller logic device, a memory interface for interfacing with a memory structure for storing information, a DMA network interface for interfacing with the network, one or more injection and reception byte counters, and injection and reception FIFO metadata associated with an injection FIFO and a reception FIFO, respectively,

wherein said DMA element supports message-passing operation as controlled from an application via Injection FIFO Metadata describing multiple Injection FIFOs, where each Injection FIFO may contain an arbitrary number of message descriptors to process messages with a fixed processing overhead irrespective of the number of message descriptors comprising the Injection FIFO, said DMA element being operable for Direct Memory Access functions for point-to-point, multicast, and all-to-all communications amongst said nodes. (Dependent claims 2-23)
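A minimal sketch of the injection FIFO recited in the claim: the FIFO holds an arbitrary number of message descriptors, and the DMA side processes exactly one descriptor per step, so the per-descriptor overhead stays fixed regardless of queue depth. The descriptor fields and method names here are illustrative assumptions, not taken from the claim.

```python
# Sketch of one injection FIFO: a ring of message descriptors with
# head/tail metadata. Field names (dest, length) are illustrative.
from collections import deque

class InjectionFifo:
    def __init__(self):
        self.descriptors = deque()   # may hold an arbitrary number of descriptors

    def inject(self, descriptor):
        """Application side: append a message descriptor to the FIFO."""
        self.descriptors.append(descriptor)

    def advance(self):
        """DMA side: process exactly one descriptor per call, so the
        per-step cost is constant irrespective of queue depth."""
        if self.descriptors:
            return self.descriptors.popleft()
        return None
```

In a real DMA element the metadata would be head/tail pointers over a memory region rather than a Python deque, but the constant-overhead property is the same: each advance touches one descriptor, never the whole queue.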
24. A massively parallel computing system comprising:
a plurality of processing nodes interconnected by independent networks, each processing node comprising a system-on-chip Application Specific Integrated Circuit (ASIC) comprising four or more processing elements each capable of performing computation or message passing operations, each of said processing nodes including a direct memory access (DMA) element operable for providing a plurality of functions for said processing node, and each said processing element of a processing node including a shared memory storage structure, said shared memory storage structure being programmed into a plurality of memory bank structures, a processing node including programmable means for sacrificing an amount of memory to enable processing parallelism;

one or more first logic devices associated with a respective said processor element, each one or more first logic devices for receiving physical memory address signals and programmable for generating a respective memory storage structure select signal upon receipt of pre-determined address bit values at selected physical memory address bit locations; and

a second logic device responsive to each said respective select signal for generating an address signal used for selecting a memory storage structure for processor access, wherein each processor device of said computing system is enabled memory storage access distributed across said one or more memory storage structures; and

means receiving unselected bit values of said received physical memory address signal for generating an offset bit vector signal used to enable processor element access to memory locations within a selected memory storage structure; and

a first independent network comprising an n-dimensional torus network including communication links interconnecting said nodes in a manner optimized for providing high-speed, low-latency point-to-point and multicast packet communications among said nodes or sub-sets of nodes of said network;

a second of said multiple independent networks including a scalable global collective network comprising nodal interconnections that facilitate simultaneous global operations among nodes or sub-sets of nodes of said network; and

partitioning means for dynamically configuring one or more combinations of independent processing networks according to needs of one or more algorithms, each independent network including a configured sub-set of processing nodes interconnected by divisible portions of said first and second networks, and means enabling rapid coordination of processing and message passing activity at each said processing element in each independent processing network, wherein one or more of the processing elements performs calculations needed by the algorithm, while another one or more of the processing elements performs message passing activities for communicating with other nodes of said network, as required when performing particular classes of algorithms, a processing node DMA element providing a communications message passing interface enabling communication of messages among said nodes, wherein each of said configured independent processing networks and node processing elements thereof are dynamically utilized to enable collaborative processing for optimizing algorithm processing performance. (Dependent claims 25-34)
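The bank-select logic recited in claim 24, which derives a memory-bank select signal from pre-determined bit locations of a physical address and an offset vector from the unselected bits, might be sketched as follows. The bit positions and bank count are assumptions chosen for illustration, not values from the patent.

```python
# Sketch: bank select + offset extraction from a physical address.
# Assumed layout: bits [11:10] select one of 4 banks; all remaining
# (unselected) bits are concatenated into the in-bank offset.

BANK_SHIFT = 10          # assumed position of the select field
BANK_MASK = 0b11         # assumed 4 banks -> 2 select bits

def decode(addr):
    bank = (addr >> BANK_SHIFT) & BANK_MASK          # bank-select signal
    low = addr & ((1 << BANK_SHIFT) - 1)             # unselected bits below the field
    high = addr >> (BANK_SHIFT + 2)                  # unselected bits above the field
    offset = (high << BANK_SHIFT) | low              # offset bit vector
    return bank, offset
```

Selecting the bank from mid-address bits, as sketched, interleaves consecutive cache-line-sized regions across banks, which is one common reason to make the select-bit locations programmable rather than fixed.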
35. In a massively parallel computing structure comprising a plurality of processing nodes interconnected by multiple independent networks, each processing node comprising:
a system-on-chip Application Specific Integrated Circuit (ASIC) comprising four or more processing elements each capable of performing computation or message passing operations;

means enabling rapid coordination of processing and message passing activity at each said processing element, wherein one or more of the processing elements performs calculations needed by the algorithm, while another one or more of the processing elements performs message passing activities for communicating with other interconnected nodes of a network, as required when performing particular classes of algorithms; and

means for supporting overlap of communication and computation with non-blocking communication primitives, wherein each of said plurality of processing nodes includes a direct memory access (DMA) element operable for providing a communications message passing interface enabling communication of messages among said nodes, each said DMA element at said nodes comprising:

a processor interface for interfacing with the at least one processor, a DMA controller logic device, a memory interface for interfacing with a memory structure for storing information, a DMA network interface for interfacing with the network, one or more injection and reception byte counters, and injection and reception FIFO metadata associated with an injection FIFO and a reception FIFO, respectively,

wherein said DMA element supports message-passing operation as controlled from an application via Injection FIFO Metadata describing multiple Injection FIFOs, where each Injection FIFO may contain an arbitrary number of message descriptors to process messages with a fixed processing overhead irrespective of the number of message descriptors comprising the Injection FIFO, said DMA element being operable for Direct Memory Access functions for point-to-point, multicast, and all-to-all communications amongst said nodes. (Dependent claims 36-39)
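A rough sketch, under assumed thread-and-queue semantics, of the overlap claim 35 recites: one processing element is dedicated to message passing while another continues computing, with non-blocking sends. The queue here is an illustrative stand-in for the node's DMA message-passing interface.

```python
# Sketch: one "processing element" (a thread) drains outbound messages
# while the main element keeps computing. Names are illustrative.
import queue
import threading

outbox = queue.Queue()
sent = []

def comm_element():
    """Dedicated communication element: forward messages as they appear."""
    while True:
        msg = outbox.get()
        if msg is None:          # shutdown sentinel
            break
        sent.append(msg)

t = threading.Thread(target=comm_element)
t.start()

# Compute element: post non-blocking sends, then compute immediately.
total = 0
for i in range(4):
    outbox.put(("partial", i))   # put() returns at once; send overlaps compute
    total += i * i               # computation proceeds without waiting

outbox.put(None)                 # signal completion
t.join()                         # all messages drained after join
```

The point of the non-blocking primitive is visible in the loop: the compute element never waits on delivery, so communication latency is hidden behind the arithmetic.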
40. A system for providing an operating environment in a parallel computer system, comprising:
a system administration and management subsystem including at least a core monitoring and control database operable to store information associated with a plurality of nodes in a parallel computer system;

a partition and job management subsystem operable to allocate one or more nodes in the parallel computer system to one or more job partitions to provide job management in the parallel computer system;

an application development and debug tools subsystem including at least one or more debugging environments, one or more application performance monitoring and tuning tools, one or more compilers and a user interface, the application development and debug tools subsystem operable to provide one or more application development tools in the parallel computer system;

a compute node kernel and services subsystem operable to provide an environment for execution of one or more user processes; and

an input/output kernel and services subsystem operable to communicate with one or more file servers and other computer systems, the input/output kernel and services subsystem further operable to provide input and output functionality between said one or more user processes and said one or more file servers and other computer systems,

wherein the core monitoring and control database includes: a configuration database operable to provide a representation of a plurality of hardware on the parallel computer system; an operational database including information and status associated with one or more interactions in the plurality of hardware on the parallel computer system; an environmental database including current values for the plurality of hardware on the parallel computer system; or a reliability, availability, and serviceability database operable to collect information associated with reliability, availability, and serviceability of the parallel computer system; or combinations thereof,

wherein the system administration and management subsystem uses the configuration database, the operational database, the environmental database, or the reliability, availability, and serviceability database, or combinations thereof, in managing the parallel computer system. (Dependent claims 41-44)
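The four sub-databases of the core monitoring and control database might be modeled minimally as follows; the record layouts, field names, and helper functions are illustrative assumptions, not structures from the patent.

```python
# Sketch: the core monitoring and control database as four in-memory
# tables, one per sub-database named in the claim.

core_db = {
    "configuration": {},   # representation of the machine's hardware
    "operational": {},     # status of interactions among the hardware
    "environmental": {},   # current sensor values per component
    "ras": [],             # reliability/availability/serviceability events
}

def record_env(node, sensor, value):
    """Store a current environmental reading for one node."""
    core_db["environmental"].setdefault(node, {})[sensor] = value

def log_ras(node, event):
    """Append a reliability/availability/serviceability event."""
    core_db["ras"].append((node, event))
```

Keeping the environmental table as current values while the RAS table is an append-only event log mirrors the claim's distinction between "current values" and "collected" reliability information.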
Specification