Method and system for converting a single-threaded software program into an application-specific supercomputer
First Claim
1. A method, implemented by a compiler, to create an incomplete butterfly sub-network, where r >
- =2 is a radix of the incomplete butterfly sub-network and is a power of two; and
where a number of input ports m >
=1 of the incomplete butterfly sub-network is not a power of r, and/or a number of output ports n >
=1 of the incomplete butterfly sub-network is not a power of r; and
where the incomplete butterfly sub-network is obtained from a corresponding complete butterfly sub-network consisting of multiplexers, buffers, and wires, where a number of input ports and a number of output ports of the corresponding complete butterfly sub-network are both equal to rd, where d is a smallest integer that makes rd greater than or equal to a maximum of m and n, by;
retaining, in the incomplete butterfly sub-network, only multiplexers, buffers, and wires of the corresponding complete butterfly sub-network required for routing packets from a first m input ports of the corresponding complete butterfly sub-network to a first n output ports of the corresponding complete butterfly sub-network, anddeleting any remaining multiplexers, buffers, and wires of the corresponding complete butterfly sub-network; and
where the compiler automatically translates a single-threaded software program code fragment into a partitioned application-specific supercomputer functionally equivalent to the single-threaded software program code fragment, in part by creating one or more customized incomplete butterfly sub-networks for scalable message communication between hardware components of the partitioned application-specific supercomputer, where each customized incomplete butterfly sub-network among the one or more customized incomplete butterfly sub-networks has a minimum number of input ports, a minimum number of output ports, and a minimum number of payload bits per port for reducing area, power, and message communication latency.
1 Assignment
0 Petitions
Accused Products
Abstract
The invention comprises (i) a compilation method for automatically converting a single-threaded software program into an application-specific supercomputer, and (ii) the supercomputer system structure generated as a result of applying this method. The compilation method comprises: (a) Converting an arbitrary code fragment from the application into customized hardware whose execution is functionally equivalent to the software execution of the code fragment; and (b) Generating interfaces on the hardware and software parts of the application, which (i) Perform a software-to-hardware program state transfer at the entries of the code fragment; (ii) Perform a hardware-to-software program state transfer at the exits of the code fragment; and (iii) Maintain memory coherence between the software and hardware memories. If the resulting hardware design is large, it is divided into partitions such that each partition can fit into a single chip. Then, a single union chip is created which can realize any of the partitions.
51 Citations
6 Claims
-
1. A method, implemented by a compiler, to create an incomplete butterfly sub-network, where r >
- =2 is a radix of the incomplete butterfly sub-network and is a power of two; and
where a number of input ports m >
=1 of the incomplete butterfly sub-network is not a power of r, and/or a number of output ports n >
=1 of the incomplete butterfly sub-network is not a power of r; andwhere the incomplete butterfly sub-network is obtained from a corresponding complete butterfly sub-network consisting of multiplexers, buffers, and wires, where a number of input ports and a number of output ports of the corresponding complete butterfly sub-network are both equal to rd, where d is a smallest integer that makes rd greater than or equal to a maximum of m and n, by; retaining, in the incomplete butterfly sub-network, only multiplexers, buffers, and wires of the corresponding complete butterfly sub-network required for routing packets from a first m input ports of the corresponding complete butterfly sub-network to a first n output ports of the corresponding complete butterfly sub-network, and deleting any remaining multiplexers, buffers, and wires of the corresponding complete butterfly sub-network; and where the compiler automatically translates a single-threaded software program code fragment into a partitioned application-specific supercomputer functionally equivalent to the single-threaded software program code fragment, in part by creating one or more customized incomplete butterfly sub-networks for scalable message communication between hardware components of the partitioned application-specific supercomputer, where each customized incomplete butterfly sub-network among the one or more customized incomplete butterfly sub-networks has a minimum number of input ports, a minimum number of output ports, and a minimum number of payload bits per port for reducing area, power, and message communication latency. - View Dependent Claims (2)
- =2 is a radix of the incomplete butterfly sub-network and is a power of two; and
-
3. A method, implemented by a compiler, to create a task sub-network,
where the task sub-network is scalable, is built entirely in hardware, does not include any processors executing software instructions, and serves to load-balance and distribute packets representing tasks to homogeneous hardware resources able to perform tasks; - and
where the task sub-network has one or more input ports and a plurality of output ports; and where a packet entering the task sub-network on an input port does not indicate any output port the packet should be routed to, but the packet is routed to any output port able to accept the packet; and where the compiler automatically translates a single-threaded software program code fragment into a partitioned application-specific supercomputer functionally equivalent to the single-threaded software program code fragment, in part by creating one or more customized task sub-networks for scalable message communication between hardware components of the partitioned application-specific supercomputer, where each customized task sub-network among the one or more customized task sub-networks has a minimum number of input ports, a minimum number of output ports, and a minimum number of payload bits per port for reducing area, power, and message communication latency. - View Dependent Claims (4, 5, 6)
- and
Specification