Method and system for converting a single-threaded software program into an application-specific supercomputer
First Claim
1. A method to automatically convert a single-threaded software application into an application-specific supercomputer, the method comprising:
a. automatically converting a code fragment from the single-threaded software application into customized hardware of the application-specific supercomputer, whose hardware execution is functionally equivalent to software execution of the code fragment;
b. generating interfaces on hardware and software parts of the single-threaded software application, where the interfaces, at run time:
i. perform a software-to-hardware program state transfer upon entry to the code fragment;
ii. perform a hardware-to-software program state transfer upon exit from the code fragment; and
iii. maintain memory coherence between hardware and software memories of the single-threaded software application; and
c. partitioning the customized hardware of the application-specific supercomputer obtained in steps a and b into a plurality of modules.
Abstract
The invention comprises (i) a compilation method for automatically converting a single-threaded software program into an application-specific supercomputer, and (ii) the supercomputer system structure generated as a result of applying this method. The compilation method comprises: (a) converting an arbitrary code fragment from the application into customized hardware whose execution is functionally equivalent to the software execution of the code fragment; and (b) generating interfaces on the hardware and software parts of the application, which (i) perform a software-to-hardware program state transfer at the entries of the code fragment; (ii) perform a hardware-to-software program state transfer at the exits of the code fragment; and (iii) maintain memory coherence between the software and hardware memories. If the resulting hardware design is large, it is divided into partitions such that each partition can fit into a single chip. Then, a single union chip is created which can realize any of the partitions.
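As a rough illustration of the entry and exit state transfers described in the abstract, the following Python sketch hands a small code fragment to a simulated accelerator. Every name in it (Accelerator, run_fragment, sum_fragment_sw, the register names) is hypothetical and not taken from the patent. Memory coherence is modeled in the simplest possible way, as a write-back of changed locations on exit, which is an assumption made only for this sketch.

```python
from dataclasses import dataclass, field


def sum_fragment_sw(regs, mem):
    """Software version of the code fragment: sum n words starting at base."""
    total = 0
    for i in range(regs["n"]):
        total += mem[regs["base"] + i]
    regs["total"] = total
    mem[regs["out"]] = total


@dataclass
class Accelerator:
    """Stand-in for the customized hardware (assumption: simulated in Python)."""
    regs: dict = field(default_factory=dict)
    cache: dict = field(default_factory=dict)  # hardware-side copy of memory

    def enter(self, sw_regs, sw_mem):
        # (i) software-to-hardware state transfer at fragment entry
        self.regs = dict(sw_regs)
        self.cache = dict(sw_mem)

    def execute(self):
        # Functionally equivalent to sum_fragment_sw, but on hardware state
        base, n = self.regs["base"], self.regs["n"]
        total = sum(self.cache[base + i] for i in range(n))
        self.regs["total"] = total
        self.cache[self.regs["out"]] = total

    def leave(self, sw_regs, sw_mem):
        # (ii) hardware-to-software state transfer at fragment exit
        sw_regs.update(self.regs)
        # (iii) coherence, modeled here as write-back of changed locations
        for addr, value in self.cache.items():
            if sw_mem.get(addr) != value:
                sw_mem[addr] = value


def run_fragment(regs, mem):
    """Run the fragment on the simulated accelerator instead of in software."""
    acc = Accelerator()
    acc.enter(regs, mem)
    acc.execute()
    acc.leave(regs, mem)
```

Running sum_fragment_sw and run_fragment on identical copies of the same registers and memory should leave both copies in the same final state, which is the sense of "functionally equivalent" this sketch tries to capture.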
119 Citations
39 Claims
1. A method to automatically convert a single-threaded software application into an application-specific supercomputer, the method comprising:
a. automatically converting a code fragment from the single-threaded software application into customized hardware of the application-specific supercomputer, whose hardware execution is functionally equivalent to software execution of the code fragment;
b. generating interfaces on hardware and software parts of the single-threaded software application, where the interfaces, at run time:
i. perform a software-to-hardware program state transfer upon entry to the code fragment;
ii. perform a hardware-to-software program state transfer upon exit from the code fragment; and
iii. maintain memory coherence between hardware and software memories of the single-threaded software application; and
c. partitioning the customized hardware of the application-specific supercomputer obtained in steps a and b into a plurality of modules.
Dependent claims: 2-35.
36. A method to automatically convert a parallel software application comprising a plurality of processes communicating with messages into an application-specific supercomputer, the method comprising:
for each process x within the plurality of processes of the software application:
a. automatically converting a code fragment from the process x into customized hardware of an application-specific hardware accelerator, whose hardware execution is functionally equivalent to software execution of the code fragment;
b. generating interfaces on hardware and software parts of the process x, where the interfaces, at run time:
i. perform a software-to-hardware program state transfer upon entry to the code fragment;
ii. perform a hardware-to-software program state transfer upon exit from the code fragment; and
iii. maintain memory coherence between hardware and software memories of the process x;
c. converting a leaf region in a region hierarchy of the code fragment to a hardware finite state machine;
d. creating at least one copy of the hardware finite state machine for the leaf region, and combining a network for communication with the at least one copy of the hardware finite state machine, such that the combined at least one copy of the hardware finite state machine and network for communication behaves as a pipelined primitive operation for performing function of the leaf region;
e. recursively applying steps c and d to the region hierarchy of the code fragment, so that at each point where a parent region invokes a child region in the software execution of the code fragment, the hardware finite state machine for the parent region initiates the pipelined primitive operation for the child region in hardware execution of the code fragment;
f. creating customized hardware synchronization units to ensure that if a memory instruction instance I2 is dependent on a memory instruction instance I1 in the software execution of the code fragment, the memory instruction instance I2 is executed after the memory instruction instance I1 in the hardware execution of the code fragment;
g. creating a coherent memory hierarchy that:
i. supports a plurality of load/store ports that are accessed in parallel; and
ii. signals a completion of each memory instruction issued from each load/store port of the plurality of load/store ports, for supporting synchronization units;
h. connecting application-specific hardware accelerators created in steps a-g for each process within the plurality of processes of the parallel software application in a network, to create an application-specific supercomputer; and
i. partitioning the application-specific supercomputer of step h into a plurality of modules.
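The ordering requirement in steps f and g can be pictured with a small cycle-by-cycle simulation. The sketch below is only an assumption about how such a synchronization unit might behave, not the patent's design: SyncUnit, run, and the instruction labels are hypothetical names, each load/store port issues at most one instruction per cycle, and the memory hierarchy is assumed to signal completion at the end of the cycle in which an instruction was issued.

```python
from collections import deque


class SyncUnit:
    """Holds back a memory instruction until the ones it depends on complete."""

    def __init__(self, deps):
        self.deps = deps          # instruction id -> set of ids it must follow
        self.completed = set()

    def may_issue(self, instr):
        # Ready only when every predecessor has been reported complete
        return self.deps.get(instr, set()) <= self.completed

    def on_complete(self, instr):
        # Called when the memory hierarchy signals completion of instr
        self.completed.add(instr)


def run(ports, deps):
    """Simulate parallel load/store ports; return the completion order."""
    sync = SyncUnit(deps)
    queues = [deque(port) for port in ports]
    order = []
    while any(queues):
        # Each port issues its head instruction if the sync unit allows it
        issued = [q.popleft() for q in queues if q and sync.may_issue(q[0])]
        if not issued:
            raise RuntimeError("no port can issue: unsatisfiable dependence")
        # Completions are signaled only after all ports have issued this cycle
        for instr in issued:
            sync.on_complete(instr)
            order.append(instr)
    return order
```

For example, with ports [["I1", "A"], ["I2", "B"]] and deps {"I2": {"I1"}}, the second port stalls for one cycle so that I2 is issued only after the completion of I1 has been signaled.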
37. A method to partition a hardware design of an application-specific supercomputer comprising a plurality of components and a plurality of networks connecting the plurality of components into a plurality of modules, the method comprising:
a. creating a scalable network to enable cross-module communication;
b. partitioning the hardware design of the application-specific supercomputer into a plurality of modules, such that each component of the plurality of components is placed into a particular module of the plurality of modules;
c. placing an I/O controller in each module of the plurality of modules to enable cross-module communication;
d. in each module of the plurality of modules, for each network x that is originally attached to at least one component placed in the module, creating a local sub-network of the network x connected to a port of a local I/O controller dedicated to the network x, and also connected to local components originally connected to the network x;
e. for each network x, enabling delivery of a message from any component on the network x to any other component on the same network x, as if the hardware design of the application-specific supercomputer were not partitioned, by:
i. in a message source module, sending the message by a message source component to a local I/O controller of the message source module over the local sub-network of the network x;
ii. sending the message outside of the message source module by the local I/O controller of the message source module, and routing the message within the scalable network to enable cross-module communication until the message reaches a message destination module; and
iii. accepting the message by a local I/O controller of the message destination module and delivering the message to a message destination component within the message destination module over the local sub-network of the network x.
Dependent claims: 38.
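The message path in step e can be sketched as a toy model in Python. All names here (Fabric, IOController, place, send, net_x) are hypothetical, and routing inside the scalable network is reduced to a hop counter. The sketch also assumes that a message between two components in the same module stays on the local sub-network, which the claim text does not spell out.

```python
class IOController:
    """Per-module I/O controller with one port per application network."""

    def __init__(self, module_id):
        self.module_id = module_id
        self.local = {}   # network name -> {component name: inbox list}

    def attach(self, network, component):
        # Step d: connect the component to the local sub-network of `network`
        inbox = []
        self.local.setdefault(network, {})[component] = inbox
        return inbox

    def deliver(self, network, dst, payload):
        # Step e.iii: deliver over the local sub-network of `network`
        self.local[network][dst].append(payload)


class Fabric:
    """Stand-in for the scalable cross-module network (routing abstracted)."""

    def __init__(self):
        self.controllers = {}
        self.placement = {}   # component name -> module id
        self.hops = 0         # number of cross-module transfers

    def add_module(self, module_id):
        # Step c: every module gets its own I/O controller
        self.controllers[module_id] = IOController(module_id)

    def place(self, component, module_id, network):
        # Step b: each component is placed into exactly one module
        self.placement[component] = module_id
        return self.controllers[module_id].attach(network, component)

    def send(self, network, src, dst, payload):
        src_mod = self.placement[src]
        dst_mod = self.placement[dst]
        # Step e.i: the source hands the message to its local I/O controller
        if src_mod != dst_mod:
            # Step e.ii: the message leaves the source module and is routed
            self.hops += 1
        self.controllers[dst_mod].deliver(network, dst, payload)
```

From the point of view of a component, send looks the same whether the destination is in the same module or another one, which is the "as if the hardware design were not partitioned" property the claim asks for.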
39. An application-specific supercomputer whose hardware execution is functionally equivalent to software execution of a code fragment within a single-threaded software application, where the application-specific supercomputer comprises:
a. one or more levels of a hierarchy of pipelined primitive operations implementing hierarchical software pipelining derived from a region hierarchy of the code fragment;
b. at least one customized hardware synchronization unit to ensure that if a memory instruction instance I2 is dependent on a memory instruction instance I1 in the software execution of the code fragment, the memory instruction instance I2 is executed after the memory instruction instance I1 in hardware execution of the code fragment; and
c. at least one coherent memory hierarchy that:
i. supports a plurality of load/store ports that are accessed in parallel; and
ii. signals a completion of each memory instruction issued from each load/store port of the plurality of load/store ports, for supporting synchronization units;
and where the application-specific supercomputer is implemented as a plurality of copies of a union module implemented in ASIC technology, with scalable network connections, and where the union module implemented in ASIC technology is able to perform function of any of a plurality of modules resulting from partitioning a hardware design of the application-specific supercomputer.
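The union module idea in the closing clause of claim 39 can be pictured as one design that is instantiated several times and told which partition to act as. The Python sketch below is purely illustrative: UnionModule, configure, PARTITIONS, and the three toy partition functions are hypothetical, and selecting a partition through a configuration call is an assumption, since this section does not say how a union module is assigned its role.

```python
class UnionModule:
    """One design that can act as any partition, chosen at configuration time."""

    def __init__(self, partitions):
        self.partitions = partitions   # partition id -> behavior (a callable)
        self.active = None

    def configure(self, partition_id):
        # Assumption: the role is fixed once, before execution starts
        if partition_id not in self.partitions:
            raise KeyError(f"unknown partition {partition_id}")
        self.active = partition_id

    def step(self, value):
        if self.active is None:
            raise RuntimeError("module not configured")
        return self.partitions[self.active](value)


# Toy partitions standing in for pieces of a partitioned hardware design
PARTITIONS = {
    0: lambda x: x + 1,
    1: lambda x: x * 2,
    2: lambda x: x - 3,
}


def build_system():
    """Instantiate identical union modules, one per partition."""
    chips = []
    for partition_id in sorted(PARTITIONS):
        chip = UnionModule(PARTITIONS)
        chip.configure(partition_id)
        chips.append(chip)
    return chips


def run_pipeline(chips, value):
    """Pass a value through the chips in order, as a stand-in for the network."""
    for chip in chips:
        value = chip.step(value)
    return value
```

The design choice illustrated is that only one chip design needs to be built: every instance carries the logic of all partitions, and the system is assembled from copies that differ only in configuration.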
Specification