Method and system for converting a single-threaded software program into an application-specific supercomputer

US 8,966,457 B2
Filed: 11/15/2011
Issued: 02/24/2015
Est. Priority Date: 11/15/2011
Status: Active Grant

Abstract

The invention comprises (i) a compilation method for automatically converting a single-threaded software program into an application-specific supercomputer, and (ii) the supercomputer system structure generated as a result of applying this method. The compilation method comprises: (a) Converting an arbitrary code fragment from the application into customized hardware whose execution is functionally equivalent to the software execution of the code fragment; and (b) Generating interfaces on the hardware and software parts of the application, which (i) Perform a software-to-hardware program state transfer at the entries of the code fragment; (ii) Perform a hardware-to-software program state transfer at the exits of the code fragment; and (iii) Maintain memory coherence between the software and hardware memories. If the resulting hardware design is large, it is divided into partitions such that each partition can fit into a single chip. Then, a single union chip is created which can realize any of the partitions.
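
The interface behavior in steps (b)(i) to (b)(iii) can be illustrated with a small software model. This is only a sketch under stated assumptions: the `Accelerator` class, its private cache, the flush-on-exit coherence policy, and the example fragment (a summation loop) are all hypothetical and are not taken from the patent.

```python
def fragment_sw(regs, mem):
    """Software version of a hypothetical code fragment: sum n words, store the sum."""
    total = 0
    for i in range(regs["n"]):
        total += mem[regs["base"] + i]
    mem[regs["out"]] = total
    out = dict(regs)
    out["total"] = total
    return out


class Accelerator:
    """Stand-in for the customized hardware, with its own private memory (a cache)."""

    def __init__(self):
        self.cache = {}     # address -> value held on the hardware side
        self.dirty = set()  # addresses written by hardware, not yet visible to software

    def load(self, host_mem, addr):
        if addr not in self.cache:            # miss: fetch from software memory
            self.cache[addr] = host_mem[addr]
        return self.cache[addr]

    def store(self, addr, value):
        self.cache[addr] = value
        self.dirty.add(addr)

    def run(self, live_in, host_mem):
        regs = dict(live_in)                  # (i) software-to-hardware state transfer
        total = 0
        for i in range(regs["n"]):
            total += self.load(host_mem, regs["base"] + i)
        self.store(regs["out"], total)
        regs["total"] = total
        for addr in sorted(self.dirty):       # (iii) coherence: flush dirty data on exit
            host_mem[addr] = self.cache[addr]
        self.dirty.clear()
        self.cache.clear()                    # invalidate so later software writes are seen
        return regs                           # (ii) hardware-to-software state transfer
```

Running both versions on identical inputs and comparing the returned registers and the memory contents is one simple way to check the "functionally equivalent" requirement for a single fragment.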

39 Claims

1. A method to automatically convert a single-threaded software application into an application-specific supercomputer, the method comprising:
- a. automatically converting a code fragment from the single-threaded software application into customized hardware of the application-specific supercomputer, whose hardware execution is functionally equivalent to software execution of the code fragment;
  
  b. generating interfaces on hardware and software parts of the single-threaded software application, where the interfaces, at run time:
  
  i. perform a software-to-hardware program state transfer upon entry to the code fragment;
  
  ii. perform a hardware-to-software program state transfer upon exit from the code fragment; and
  
  iii. maintain memory coherence between hardware and software memories of the single-threaded software application; and
  
  c. partitioning the customized hardware of the application-specific supercomputer obtained in steps a and b into a plurality of modules.
- - 2. The method of claim 1, further comprising:
    - a. automatically converting a leaf region in a region hierarchy of the code fragment to a hardware finite state machine; and
      
      b. creating at least one copy of the hardware finite state machine for the leaf region, and combining a network for communication with the at least one copy of the hardware finite state machine, such that the combined at least one copy of the hardware finite state machine and network for communication behaves as a pipelined primitive operation for performing function of the leaf region.
  - 3. The method of claim 2, further comprising:
    - recursively applying the method of claim 2 to the region hierarchy of the code fragment, so that at each point where a parent region invokes a child region in the software execution of the code fragment, the hardware finite state machine for the parent region initiates the pipelined primitive operation for the child region in hardware execution of the code fragment.
  - 4. The method of claim 3, further comprising:
    - creating customized hardware synchronization units to ensure that if a memory instruction instance I₂ is dependent on a memory instruction instance I₁ in the software execution of the code fragment, the memory instruction instance I₂ is executed after the memory instruction instance I₁ in the hardware execution of the code fragment.
  - 5. The method of claim 4, further comprising:
    - creating a coherent memory hierarchy that:
      
      a. supports a plurality of load/store ports that are accessed in parallel; and
      
      b. signals a completion of each memory instruction issued from each load/store port of the plurality of load/store ports, for supporting synchronization units.
  - 6. The method of claim 5, further comprising:
    - achieving synchronization using a quiescence detection synchronization unit.
  - 7. The method of claim 5, further comprising:
    - achieving synchronization using a train crash synchronization unit.
  - 8. The method of claim 5, further comprising:
    - achieving synchronization using a serializing synchronization unit.
  - 9. The method of claim 5, further comprising:
    - achieving synchronization using a FIFO synchronization unit.
  - 10. The method of claim 5, further comprising:
    - given a region in a region hierarchy of the code fragment and a group of dependent loads and stores in the region, achieving synchronization between the group of dependent loads and stores by compiling a customized synchronization circuit, which emulates behavior of snoopy write-update caches within the region.
  - 11. The method of claim 5, further comprising:
    - creating application-specific customized hardware to speculate that a dependence from the memory instruction instance I₁ to the memory instruction instance I₂ will not be observed at runtime;
      
      executing the memory instruction instance I₂ speculatively without waiting for the memory instruction instance I₁ to finish;
      
      detecting an incorrect speculation if an incorrect speculative execution of the memory instruction instance I₂ occurs; and
      
      recovering from the incorrect speculation by finally re-executing the memory instruction instance I₂ after the memory instruction instance I₁.
  - 12. The method of claim 5, further comprising:
    - for a message being sent from a sending component to a receiving component, detecting and eliminating sending of bits that are constant, dead, or redundant; and
      
      recreating the bits that are constant or redundant at the receiving component.
  - 13. The method of claim 5, further comprising:
    - automatically identifying the code fragment in the single-threaded software application that will be converted into hardware.
  - 14. The method of claim 5, further comprising:
    - using software or hardware profiling feedback to assist in determining hardware subsystem parameters in future compilations.
  - 15. The method of claim 5, further comprising:
    - creating application-specific customized hardware such that while a thread is waiting for a result of a network operation, other instructions from the same thread or from a different thread are executed, for achieving improved network latency tolerance.
  - 16. The method of claim 5, further comprising:
    - merging two networks consisting of:
      
      a. a first network with a hardware finite state machine performing a function A, and
      
      b. a second network with a hardware finite state machine performing a function B distinct from the function A,
      
      by creating a common hardware finite state machine that performs both the functions A and B, for sharing resources.
  - 17. The method of claim 16, where:
    - a. the common hardware finite state machine performing both the functions A and B is replaced by a general purpose microprocessor;
      
      b. optionally, the hardware finite state machine performing the function A is replaced by a general purpose microprocessor; and
      
      c. optionally, the hardware finite state machine performing the function B is replaced by a general purpose microprocessor.
  - 18. The method of claim 5, further comprising:
    - a. compiling recursive region invocations within the code fragment, by making a hardware finite state machine invoke a pipelined primitive operation containing the hardware finite state machine itself; and
      
      b. if a hardware finite state machine is not able to initiate a region invocation over a network because of resource contention, making the hardware finite state machine perform the region invocation within itself, without using the network, for avoiding deadlock.
  - 19. The method of claim 5, further comprising:
    - applying frequency optimizations to hardware finite state machines.
  - 20. The method of claim 5, further comprising:
    - executing region invocations speculatively, and subsequently, when it is found that a speculative invocation was on an untaken path, canceling the speculative invocation.
  - 21. The method of claim 5, further comprising:
    - applying symbolic execution with pointer analysis support and using results of the symbolic execution to disambiguate dependencies.
  - 22. The method of claim 5, further comprising:
    - using a butterfly sub-network having at least one input port and at least one output port, where:
      
      a. number of input ports of the butterfly sub-network is not limited to a power of two, and
      
      b. number of output ports of the butterfly sub-network is not limited to a power of two.
  - 23. The method of claim 5, further comprising:
    - using a task crossbar switch having at least one input port and at least one output port, where any number of requesting input ports and any number of accepting output ports are matched in parallel and in a single step.
  - 24. The method of claim 5, further comprising:
    - maintaining precise exceptions in code fragments accelerated by the application-specific supercomputer.
  - 25. The method of claim 5, further comprising:
    - automatically converting a code fragment into an application-specific supercomputer, where the code fragment includes a requirement selected from the group consisting of:
      
      a. memory mapped I/O;
      
      b. sequential multiprocessor consistency;
      
      c. accesses to volatile variables; and
      
      d. operating system kernel code execution.
  - 26. The method of claim 5, further comprising:
    - applying symbolic execution with pointer analysis support to achieve hardware optimizations.
  - 27. The method of claim 5, further comprising:
    - creating a directory-based write-update cache whose design is simplified because of explicit synchronizations between dependent memory operations.
  - 28. The method of claim 5, further comprising:
    - applying application-specific memory partitioning to:
      
      a. reduce a number of cache ports,
      
      b. create specialized cache memories, and
      
      c. reduce cache coherence hardware.
  - 29. The method of claim 5, further comprising:
    - given that a region accesses any predictable sequence of addresses within a data structure, creating a streaming cache which computes a next address in a sequence of addresses locally on its own, and either supplies a next load data from the next address or accepts a next store data into the next address.
  - 30. The method of claim 5, further comprising:
    - accelerating more than one code fragment within the single-threaded software application by:
      
      a. a host computer identifies the code fragment and an entry point within the code fragment in a message sent to an accelerator; and
      
      b. the accelerator receives the message and forwards the message to a hardware sub-module to perform function of the code fragment and the entry point within the code fragment.
  - 31. The method of claim 5, further comprising:
    - converting:
      
      a. a hardware component pin specification, and
      
      b. a single-threaded sequential code fragment, where the single-threaded sequential code fragment comprises:
      
      i. instruction primitives to communicate with hardware component pins, and
      
      ii. locally declared data structures
      
      into an application-specific hardware component having the hardware component pins, such that the application-specific hardware component is functionally equivalent to the single-threaded sequential code fragment.
  - 32. An application-specific supercomputer obtained from a single-threaded software application utilizing the method of claim 5.
  - 33. The method of claim 5, further comprising:
    - partitioning a hardware design of the application-specific supercomputer comprising a plurality of components and a plurality of networks connecting the plurality of components into a plurality of modules, by:
      
      a. creating a scalable network to enable cross-module communication;
      
      b. partitioning the hardware design of the application-specific supercomputer into a plurality of modules, such that each component of the plurality of components is placed into a particular module of the plurality of modules;
      
      c. placing an I/O controller in each module of the plurality of modules to enable cross-module communication;
      
      d. in each module of the plurality of modules, for each network x that is originally attached to at least one component placed in the module, creating a local sub-network of the network x connected to a port of a local I/O controller dedicated to the network x, and also connected to local components originally connected to the network x;
      
      e. for each network x, enabling delivery of a message from any component on the network x to any other component on the same network x, as if the hardware design of the application-specific supercomputer were not partitioned, by:
      
      i. in a message source module, sending the message by a message source component to a local I/O controller of the message source module over the local sub-network of the network x;
      
      ii. sending the message outside of the message source module by the local I/O controller of the message source module and routing the message within the scalable network to enable cross-module communication until the message reaches a message destination module; and
      
      iii. accepting the message by a local I/O controller of the message destination module and delivering the message to a message destination component within the message destination module over the local sub-network of the network x.
  - 34. The method of claim 33, further comprising:
    - creating a union module that is able to realize any of the plurality of modules created by the partitioning of the hardware design of the application-specific supercomputer.
  - 35. An application-specific supercomputer obtained from a single-threaded software application utilizing the method of claim 34.
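
The ordering requirement of claims 4 and 5 (a dependent memory instruction I₂ must execute after I₁, with the memory hierarchy signaling completion of each instruction) can be sketched as a cycle-stepped simulation. Everything here is an assumption made for illustration: the `MemoryHierarchy` class, the `run` function, the fixed per-instruction latencies, and the `synchronize` flag are hypothetical names and do not describe any specific synchronization unit named in the claims.

```python
import heapq


class MemoryHierarchy:
    """Toy memory: a request issued at time t completes at t + latency."""

    def __init__(self, data):
        self.data = dict(data)
        self.pending = []  # min-heap of (finish_time, sequence_number, op)
        self.seq = 0

    def issue(self, now, op, latency):
        heapq.heappush(self.pending, (now + latency, self.seq, op))
        self.seq += 1

    def completions(self, now):
        """Perform every op whose latency has elapsed; return their names (the signals)."""
        done = []
        while self.pending and self.pending[0][0] <= now:
            _, _, op = heapq.heappop(self.pending)
            if op["kind"] == "store":
                self.data[op["addr"]] = op["value"]
            else:
                op["result"] = self.data[op["addr"]]
            done.append(op["name"])
        return done


def run(ops, deps, latency, data, synchronize=True):
    """Issue ops in parallel; with synchronize=True an op waits for its dependences."""
    mem = MemoryHierarchy(data)
    completed, waiting, now = set(), list(ops), 0
    while waiting or mem.pending:
        completed.update(mem.completions(now))
        still_waiting = []
        for op in waiting:
            blocked = synchronize and any(
                d not in completed for d in deps.get(op["name"], ()))
            if blocked:
                still_waiting.append(op)   # held back until the completion signal arrives
            else:
                mem.issue(now, op, latency[op["name"]])
        waiting = still_waiting
        now += 1
    return {op["name"]: op["result"] for op in ops if op["kind"] == "load"}


def make_ops():
    # I1 is a slow store; I2 is a fast load of the same address and depends on I1.
    return [{"name": "I1", "kind": "store", "addr": 0, "value": 7},
            {"name": "I2", "kind": "load", "addr": 0}]
```

With `deps = {"I2": ["I1"]}` and the store slower than the load, the synchronized run returns the stored value for I₂, while the unsynchronized run lets the load overtake the store and read stale data.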

36. A method to automatically convert a parallel software application comprising a plurality of processes communicating with messages into an application-specific supercomputer, the method comprising:
- for each process x within the plurality of processes of the software application:
  
  a. automatically converting a code fragment from the process x into customized hardware of an application-specific hardware accelerator, whose hardware execution is functionally equivalent to software execution of the code fragment;
  
  b. generating interfaces on hardware and software parts of the process x, where the interfaces, at run time:
  
  i. perform a software-to-hardware program state transfer upon entry to the code fragment;
  
  ii. perform a hardware-to-software program state transfer upon exit from the code fragment; and
  
  iii. maintain memory coherence between hardware and software memories of the process x;
  
  c. converting a leaf region in a region hierarchy of the code fragment to a hardware finite state machine;
  
  d. creating at least one copy of the hardware finite state machine for the leaf region, and combining a network for communication with the at least one copy of the hardware finite state machine, such that the combined at least one copy of the hardware finite state machine and network for communication behaves as a pipelined primitive operation for performing function of the leaf region;
  
  e. recursively applying steps c and d to the region hierarchy of the code fragment, so that at each point where a parent region invokes a child region in the software execution of the code fragment, the hardware finite state machine for the parent region initiates the pipelined primitive operation for the child region in hardware execution of the code fragment;
  
  f. creating customized hardware synchronization units to ensure that if a memory instruction instance I₂ is dependent on a memory instruction instance I₁ in the software execution of the code fragment, the memory instruction instance I₂ is executed after the memory instruction instance I₁ in the hardware execution of the code fragment;
  
  g. creating a coherent memory hierarchy that:
  
  i. supports a plurality of load/store ports that are accessed in parallel; and
  
  ii. signals a completion of each memory instruction issued from each load/store port of the plurality of load/store ports, for supporting synchronization units;
  
  h. connecting application-specific hardware accelerators created in steps a-g for each process within the plurality of processes of the parallel software application in a network, to create an application-specific supercomputer; and
  
  i. partitioning the application-specific supercomputer of step h into a plurality of modules.
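
Steps c through e (a region's finite state machine replicated behind a network so that the combination acts as a pipelined primitive operation, applied recursively from child to parent regions) can be modeled roughly as follows. This is a hedged sketch: `PipelinedPrimitive`, the round-robin dispatch policy, and the matrix-vector example regions are hypothetical choices for illustration, and the model is sequential Python rather than concurrent hardware.

```python
from collections import deque


class PipelinedPrimitive:
    """Several copies of one region's FSM behind a dispatch network (toy model)."""

    def __init__(self, region_fn, copies):
        self.region_fn = region_fn
        self.queues = [deque() for _ in range(copies)]  # one input queue per FSM copy
        self.next_copy = 0

    def send(self, tag, args):
        """Network side: route a tagged request to the next FSM copy (round-robin)."""
        self.queues[self.next_copy].append((tag, args))
        self.next_copy = (self.next_copy + 1) % len(self.queues)

    def drain(self):
        """Let every copy work through its queue; return results keyed by tag."""
        results = {}
        for queue in self.queues:
            while queue:
                tag, args = queue.popleft()
                results[tag] = self.region_fn(*args)
        return results


def inner_loop(row, vec):
    """Leaf region: dot product of one matrix row with a vector."""
    return sum(a * b for a, b in zip(row, vec))


inner_op = PipelinedPrimitive(inner_loop, copies=3)


def outer_loop(matrix, vec):
    """Parent region: initiates the child primitive once per row without waiting."""
    for i, row in enumerate(matrix):
        inner_op.send(i, (row, vec))
    results = inner_op.drain()
    return [results[i] for i in range(len(matrix))]


# Applying the same wrapping to the parent region gives the next level of the hierarchy.
outer_op = PipelinedPrimitive(outer_loop, copies=1)
```

The tags let results come back in any order, which is what allows the parent to issue all child invocations before collecting any of them.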

37. A method to partition a hardware design of an application-specific supercomputer comprising a plurality of components and a plurality of networks connecting the plurality of components into a plurality of modules, the method comprising:
- a. creating a scalable network to enable cross-module communication;
  
  b. partitioning the hardware design of the application-specific supercomputer into a plurality of modules, such that each component of the plurality of components is placed into a particular module of the plurality of modules;
  
  c. placing an I/O controller in each module of the plurality of modules to enable cross-module communication;
  
  d. in each module of the plurality of modules, for each network x that is originally attached to at least one component placed in the module, creating a local sub-network of the network x connected to a port of a local I/O controller dedicated to the network x, and also connected to local components originally connected to the network x;
  
  e. for each network x, enabling delivery of a message from any component on the network x to any other component on the same network x, as if the hardware design of the application-specific supercomputer were not partitioned, by:
  
  i. in a message source module, sending the message by a message source component to a local I/O controller of the message source module over the local sub-network of the network x;
  
  ii. sending the message outside of the message source module by the local I/O controller of the message source module, and routing the message within the scalable network to enable cross-module communication until the message reaches a message destination module; and
  
  iii. accepting the message by a local I/O controller of the message destination module and delivering the message to a message destination component within the message destination module over the local sub-network of the network x.
- - 38. The method of claim 37, further comprising:
    - creating a union module that is able to realize any of the plurality of modules created by the partitioning of the hardware design of the application-specific supercomputer.
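
The message path of claim 37, steps d and e, can be traced with a small model. It is a sketch under assumptions: the `PartitionedDesign` class, the hop labels, and the component names used in the usage example are hypothetical, and the scalable cross-module network is reduced to a single hop.

```python
class PartitionedDesign:
    """Toy model: components placed into modules, one I/O controller per module."""

    def __init__(self, placement, networks):
        # placement: component name -> module id (step b)
        # networks:  network name   -> list of attached component names
        self.placement = dict(placement)
        self.inbox = {comp: [] for comp in placement}
        # local_nets[module][network] = members of that network inside the module (step d)
        self.local_nets = {}
        for net, comps in networks.items():
            for comp in comps:
                module = self.placement[comp]
                self.local_nets.setdefault(module, {}).setdefault(net, set()).add(comp)

    def send(self, net, src, dst, payload):
        """Deliver payload from src to dst over network `net`; return the hops taken."""
        src_mod, dst_mod = self.placement[src], self.placement[dst]
        src_ok = src in self.local_nets.get(src_mod, {}).get(net, ())
        dst_ok = dst in self.local_nets.get(dst_mod, {}).get(net, ())
        if not (src_ok and dst_ok):
            raise ValueError(f"{src} and {dst} are not both attached to {net}")
        if src_mod == dst_mod:
            hops = [f"{net}@module{src_mod}"]        # stays on the local sub-network
        else:
            hops = [
                f"{net}@module{src_mod}",            # e.i: source to local I/O controller
                f"io_controller@module{src_mod}",    # e.ii: leaves the source module
                "scalable_network",                  #       routed between modules
                f"io_controller@module{dst_mod}",    # e.iii: accepted at the destination
                f"{net}@module{dst_mod}",            #        delivered locally
            ]
        self.inbox[dst].append((net, src, payload))
        return hops
```

From the sender's point of view the call is the same whether or not the destination shares its module, which mirrors the "as if the hardware design were not partitioned" wording of step e.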

39. An application-specific supercomputer whose hardware execution is functionally equivalent to software execution of a code fragment within a single-threaded software application, where the application-specific supercomputer comprises:
- a. one or more levels of a hierarchy of pipelined primitive operations implementing hierarchical software pipelining derived from a region hierarchy of the code fragment;
  
  b. at least one customized hardware synchronization unit to ensure that if a memory instruction instance I₂ is dependent on a memory instruction instance I₁ in the software execution of the code fragment, the memory instruction instance I₂ is executed after the memory instruction instance I₁ in hardware execution of the code fragment; and
  
  c. at least one coherent memory hierarchy that:
  
  i. supports a plurality of load/store ports that are accessed in parallel; and
  
  ii. signals a completion of each memory instruction issued from each load/store port of the plurality of load/store ports, for supporting synchronization units; and
  
  where the application-specific supercomputer is implemented as a plurality of copies of a union module implemented in ASIC technology, with scalable network connections, and where the union module implemented in ASIC technology is able to perform function of any of a plurality of modules resulting from partitioning a hardware design of the application-specific supercomputer.
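
One plausible way to think about a union module that "is able to perform function of any of a plurality of modules" is as a module holding, for each component type, as many instances as the most demanding partition needs. The sketch below assumes exactly that construction. The claims do not state how the union module is built, so `build_union_module`, `configure`, and the component type names are hypothetical.

```python
from collections import Counter


def build_union_module(partitions):
    """Per component type, keep the maximum count required by any one partition."""
    union = Counter()
    for part in partitions:
        for kind, count in Counter(part).items():
            union[kind] = max(union[kind], count)
    return union


def configure(union, partition):
    """Return {type: (active, idle)} for one copy of the union module playing `partition`."""
    need = Counter(partition)
    if any(need[kind] > union[kind] for kind in need):
        raise ValueError("partition does not fit in the union module")
    return {kind: (need[kind], union[kind] - need[kind]) for kind in union}
```

Under this assumption a single chip design can be manufactured and each copy configured as a different partition, at the cost of some idle components per copy.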

Current Assignee: Global Supercomputing Corporation
Original Assignee: Global Supercomputing Corporation
Inventors: Ebcioglu, Kemal; Kultursay, Emre; Kandemir, Mahmut Taylan
Primary Examiner(s): Chen, Qing
Application Number: US13/296,232
Publication Number: US 20130125097A1
Time in Patent Office: 1,197 Days
Field of Search: 717/136-161, 716/105, 716/116, 716/117, 716/124, 716/125, 716/128, 716/131
US Class Current: 717/136

CPC Class Codes

G06F 12/08   in hierarchically structure...

G06F 12/0862   with prefetch

G06F 12/0875   with dedicated cache, e.g. ...

G06F 12/0895   of parts of caches, e.g. di...

G06F 15/17381   Two dimensional, e.g. mesh,...

G06F 2115/10   Processors

G06F 2212/455   Image or video data

G06F 2212/6026   Prefetching based on access...

G06F 30/30   Circuit design

G06F 30/323   Translation or migration, e...

G06F 30/392   Floor-planning or layout, e...

G06F 8/40   Transformation of program code

G06F 8/4452   Software pipelining

G06F 8/452   Loops

G06F 9/52   Program synchronisation; Mu...

Y02D 10/00   Energy efficient computing,...
