Configurable Processor with Backside Look-Up Table

First Claim
1. A configurable processor, comprising:
 a semiconductor substrate comprising a front side and a backside;
a lookup table circuit (LUT) on said backside for storing data related to a desired function, wherein said LUT comprises at least a programmable memory array;
an arithmetic logic circuit (ALC) on said front side for performing arithmetic operations on said data;
a plurality of through-silicon vias (TSV) through said semiconductor substrate for communicatively coupling said LUT and said ALC.
Abstract
The present invention discloses a configurable processor with a backside lookup table. The configurable processor comprises a lookup table circuit (LUT) on the backside of the processor substrate and an arithmetic logic circuit (ALC) on the front side of the processor substrate. The LUT stores data related to a desired function. The ALC performs arithmetic operations on the data read out from the LUT.
1 Citation
Configurable processor with in-package lookup table
Patent #
US 10,445,067 B2
Filed 11/28/2018

Current Assignee
Hangzhou Haicun Information Technology Co. Ltd.

Sponsoring Entity
Hangzhou Haicun Information Technology Co. Ltd.

No References
20 Claims
 1. A configurable processor, comprising:
a semiconductor substrate comprising a front side and a backside; a lookup table circuit (LUT) on said backside for storing data related to a desired function, wherein said LUT comprises at least a programmable memory array; an arithmetic logic circuit (ALC) on said front side for performing arithmetic operations on said data; a plurality of through-silicon vias (TSV) through said semiconductor substrate for communicatively coupling said LUT and said ALC.
1 Specification
This application claims priority from Chinese Patent Application 201610300576.9, filed on May 7, 2016, and Chinese Patent Application 201710311013.4, filed on May 5, 2017, in the State Intellectual Property Office of the People's Republic of China (CN), the disclosures of which are incorporated herein by reference in their entireties.
The present invention relates to the field of integrated circuits, and more particularly to processors.
Conventional processors use logic-based computation (LBC), which realizes mathematical functions primarily with logic circuits (e.g. XOR circuits). Logic circuits are suitable for arithmetic operations (i.e. addition, subtraction and multiplication), but not for non-arithmetic functions (e.g. elementary functions, special functions). Non-arithmetic functions are computationally hard. Rapid and efficient realization of the non-arithmetic functions has been a major challenge.
In conventional processors, only a few basic non-arithmetic functions (e.g. basic algebraic functions and basic transcendental functions) are implemented in hardware; they are referred to as built-in functions. These built-in functions are realized by a combination of arithmetic operations and lookup tables. For example, U.S. Pat. No. 5,954,787 issued to Eun on Sep. 21, 1999 taught a method for generating sine/cosine functions using lookup tables; U.S. Pat. No. 9,207,910 issued to Azadet et al. on Dec. 8, 2015 taught a method for calculating a power function using lookup tables.
Realization of built-in functions is further illustrated in
The 2D integration puts stringent requirements on the manufacturing process. As is well known in the art, the memory transistors in the LUT 200X are vastly different from the logic transistors in the ALC 100X. The memory transistors have stringent requirements on leakage current, while the logic transistors have stringent requirements on drive current. Forming high-performance memory transistors and high-performance logic transistors on the same surface of the semiconductor substrate 00S at the same time is a challenge.
The 2D integration also limits computational density and computational complexity. Computation has been developing towards higher computational density and greater computational complexity. The computational density, i.e. the computational power (e.g. the number of floating-point operations per second) per die area, is a figure of merit for parallel computation. The computational complexity, i.e. the total number of built-in functions supported by a processor, is a figure of merit for scientific computation. For the 2D integration, inclusion of the LUT 200X increases the die size of the conventional processor 00X and lowers its computational density. This has an adverse effect on parallel computation. Moreover, because the ALU 100X, as the primary component of the conventional processor 00X, occupies a large die area, the LUT 200X, occupying only a small die area, supports few built-in functions.
The LBC-based processor 00X suffers a further drawback. Because different logic circuits are used to realize different built-in functions, the processor 00X is fully customized. In other words, once its design is complete, the processor 00X can only realize a fixed set of pre-defined built-in functions. Configurable computation is more desirable, where the same hardware can realize different mathematical functions under the control of a set of configuration signals.
In the past, configurable logic, i.e. the same hardware realizing different logic functions under the control of a set of configuration signals, was realized by a configurable gate array (e.g. a field-programmable gate array). U.S. Pat. No. 4,870,302 issued to Freeman on Sep. 26, 1989 (hereinafter Freeman) discloses a configurable gate array. It comprises an array of configurable logic elements and a hierarchy of configurable interconnects that allow the configurable logic elements to be wired together. In the prior-art configurable gate arrays, mathematical functions are still realized in fixed computing elements, which are part of hard blocks and not configurable, i.e. the circuits realizing these mathematical functions are fixedly connected and are not subject to change by programming. Fixed computing elements limit further applications of the configurable gate array. To overcome this difficulty, the present invention expands the original concept of the configurable gate array by making the fixed computing elements configurable.
It is a principal object of the present invention to realize configurable computation.
It is a further object of the present invention to realize field-configurable computation.
It is a further object of the present invention to realize reconfigurable computation.
It is a further object of the present invention to realize configurable computation for multi-variable functions.
It is a further object of the present invention to provide a configurable processor with a greater computational complexity.
It is a further object of the present invention to provide a configurable processor with a higher computational density.
It is a further object of the present invention to provide a field-programmable gate array (FPGA) with a greater computational flexibility.
In accordance with these and other objects of the present invention, the present invention discloses a configurable processor with a backside lookup table.
The present invention discloses a configurable processor with a backside lookup table (BSLUT), i.e. a BSLUT configurable processor. The BSLUT processor comprises a logic circuit and a memory circuit. The logic circuit is formed on the front side of the processor substrate and comprises at least an arithmetic logic circuit (ALC), whereas the memory circuit is formed on the backside of the processor substrate and comprises at least a lookup table circuit (LUT). The ALC and LUT are communicatively coupled by a plurality of through-silicon vias (TSV). Located on the backside of the processor substrate, the LUT is referred to as a backside LUT (BSLUT). Because it is programmable, the BSLUT can realize a desired function by writing the data related to the desired function (e.g. the lookup table for the desired function) into the BSLUT, thus realizing configurable computation.
The BSLUT configurable processor uses memory-based computation (MBC), which realizes mathematical functions primarily with the LUT. Compared with the LUT used by the conventional processor, the BSLUT used by the BSLUT configurable processor has a much larger capacity. Although arithmetic operations are still performed, the MBC only needs to calculate a polynomial to a lower order because it uses a larger BSLUT as a starting point for computation. For the MBC, the fraction of computation done by the BSLUT could exceed that done by the ALC.
Each usage cycle of the BSLUT configurable processor comprises two stages: a configuration stage and a computation stage. In the configuration stage, the data related to a desired function is written into the BSLUT. In the computation stage, the desired function is realized by reading the function-related data from the BSLUT. The BSLUT configurable processor can realize field-configurable computation and reconfigurable computation. For field-configurable computation, the BSLUT configurable processor can realize a desired function in the field of use by writing the data related to the desired function into the BSLUT in the field of use. For reconfigurable computation, the BSLUT comprises at least a re-programmable memory array and the BSLUT configurable processor can realize different functions by writing different data related to different functions (e.g. the lookup tables for different functions) into the BSLUT during different usage cycles. For example, during a first usage cycle, the BSLUT stores data related to a first function; during a second usage cycle, the BSLUT stores data related to a second function.
Because the ALC and the LUT are located on different sides of the processor substrate, this type of vertical integration is referred to as double-sided integration. The double-sided integration has a profound effect on the computational density and computational complexity. For the conventional 2D integration, the footprint of a conventional processor 00X is roughly equal to the sum of those of the ALU 100X and the LUT 200X. On the other hand, because the double-sided integration moves the LUT from beside the ALC to the backside, the BSLUT processor becomes smaller and computationally more powerful. In addition, the total LUT capacity of the conventional processor 00X is less than 100 kb, whereas the total BSLUT capacity for the BSLUT processor could reach 100 Gb. Consequently, a single BSLUT processor could support as many as 10,000 built-in functions (including various types of complex mathematical functions), far more than the conventional processor 00X. Furthermore, because the ALC and the LUT are on different sides of the processor substrate, the logic transistors in the ALC and the memory transistors in the LUT are formed in separate processing steps, which can be individually optimized.
To further improve programmability, the present invention further discloses a BSLUT configurable gate array. It comprises an array of configurable computing elements, an array of configurable logic elements and an array of configurable interconnects. The BSLUT comprises at least a programmable memory array which stores data related to a function (e.g. the lookup table for the function). Because it is programmable, the BSLUT can realize a desired function by writing the data related to the desired function into the BSLUT, thus realizing configurable computation. The configurable logic elements and configurable interconnects in the BSLUT configurable gate array are similar to those in the conventional configurable gate array. During computation, a complex function is first decomposed into a combination of basic functions. Each basic function is then realized by an associated configurable computing element. Finally, the complex function is realized by configuring the corresponding configurable logic elements and configurable interconnects.
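The three steps above (decompose, realize each basic function in a computing element, then wire the elements together) can be sketched in software. The sketch below is illustrative only and is not taken from the disclosure: the class name `ComputingElement`, the 10-bit table index, and the input range are assumptions, and ordinary Python arithmetic stands in for the configurable logic elements and configurable interconnects. It uses the four-variable function e=a·sin(b)+c·cos(d) that the description discusses later.

```python
import math

class ComputingElement:
    """Hypothetical model of one configurable computing element (a small LUT)."""

    def __init__(self, index_bits=10, lo=0.0, hi=2 * math.pi):
        self.entries = 1 << index_bits
        self.lo = lo
        self.step = (hi - lo) / self.entries
        self.table = None  # unconfigured until a basic function is loaded

    def configure(self, func):
        # Write the lookup table for one basic function into the element.
        self.table = [func(self.lo + i * self.step) for i in range(self.entries)]

    def lookup(self, x):
        # Read the stored sample at or just below x (clamped to the table).
        i = min(max(int((x - self.lo) / self.step), 0), self.entries - 1)
        return self.table[i]

# Step 1: decompose e = a*sin(b) + c*cos(d) into the basic functions sin and cos.
sin_elem, cos_elem = ComputingElement(), ComputingElement()

# Step 2: realize each basic function in its own computing element.
sin_elem.configure(math.sin)
cos_elem.configure(math.cos)

# Step 3: combine the elements (stand-in for logic elements and interconnects).
def e(a, b, c, d):
    return a * sin_elem.lookup(b) + c * cos_elem.lookup(d)

result = e(2.0, 1.0, 3.0, 0.5)
exact = 2.0 * math.sin(1.0) + 3.0 * math.cos(0.5)
```

With a table this coarse the result only approximates the exact value; a larger table, or interpolation between entries, would narrow the gap.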
Accordingly, the present invention discloses a configurable processor, comprising: a semiconductor substrate comprising a front side and a backside; a lookup table circuit (LUT) on said backside for storing data related to a desired function, wherein said LUT comprises at least a programmable memory array; an arithmetic logic circuit (ALC) on said front side for performing arithmetic operations on said data; a plurality of through-silicon vias (TSV) through said semiconductor substrate for communicatively coupling said LUT and said ALC.
It should be noted that all the drawings are schematic and not drawn to scale. Relative dimensions and proportions of parts of the device structures in the figures have been shown exaggerated or reduced in size for the sake of clarity and convenience in the drawings. The same reference symbols are generally used to refer to corresponding or similar features in the different embodiments. The symbol “/” means a relationship of “and” or “or”. Throughout the present invention, both “lookup table” and “lookup table circuit” are abbreviated to LUT. Based on context, the LUT may refer to a lookup table or a lookup table circuit.
Those of ordinary skill in the art will realize that the following description of the present invention is illustrative only and is not intended to be in any way limiting. Other embodiments of the invention will readily suggest themselves to such skilled persons from an examination of the within disclosure.
Referring now to
Referring now to
The BSLUT configurable processor 300 uses memory-based computation (MBC), which realizes mathematical functions primarily with the BSLUT 170. Compared with the LUT 200X used by the conventional processor 00X, the BSLUT 170 used by the BSLUT configurable processor 300 has a much larger capacity. Although arithmetic operations are still performed, the MBC only needs to calculate a polynomial to a lower order because it uses a larger BSLUT 170 as a starting point for computation. For the MBC, the fraction of computation done by the BSLUT 170 could exceed that done by the ALC 180.
Each usage cycle of the BSLUT configurable processor 300 comprises two stages: a configuration stage and a computation stage. In the configuration stage, the data related to a desired function is written into the BSLUT 170. In the computation stage, the desired function is realized by reading the function-related data from the BSLUT 170. The BSLUT configurable processor 300 can realize field-configurable computation and reconfigurable computation. For field-configurable computation, the BSLUT configurable processor 300 can realize a desired function in the field of use by writing the data related to the desired function into the BSLUT 170 in the field of use. For reconfigurable computation, the BSLUT 170 comprises at least a re-programmable memory array and the BSLUT configurable processor 300 can realize different functions by writing different data related to different functions (e.g. the lookup tables for different functions) into the BSLUT 170 during different usage cycles. For example, during a first usage cycle, the BSLUT 170 stores data related to a first function; during a second usage cycle, the BSLUT 170 stores data related to a second function.
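The two-stage usage cycle can be modeled in a few lines of Python. This is a minimal sketch under stated assumptions, not the disclosed hardware: the class name `BSLUTProcessorModel`, its `configure` and `compute` methods, the 12-bit table index, and the choice of exp and log as the first and second functions are all hypothetical.

```python
import math

class BSLUTProcessorModel:
    """Hypothetical software model of the two-stage usage cycle."""

    def __init__(self, index_bits=12):
        self.entries = 1 << index_bits
        self.lut = [0.0] * self.entries  # stands in for a re-programmable array
        self.lo, self.step = 0.0, 1.0

    def configure(self, func, lo, hi):
        # Configuration stage: write function-related data into the LUT.
        self.lo = lo
        self.step = (hi - lo) / self.entries
        for i in range(self.entries):
            self.lut[i] = func(lo + i * self.step)

    def compute(self, x):
        # Computation stage: read the data back; func itself is never called.
        i = min(max(int((x - self.lo) / self.step), 0), self.entries - 1)
        return self.lut[i]

proc = BSLUTProcessorModel()

# First usage cycle: the LUT holds data for a first function (here exp on [0, 1)).
proc.configure(math.exp, 0.0, 1.0)
first = proc.compute(0.5)

# Second usage cycle: the same array is rewritten for a second function (log on [1, 2)).
proc.configure(math.log, 1.0, 2.0)
second = proc.compute(1.5)
```

The same `lut` storage serves both cycles, which is the sense in which the computation is reconfigurable: only the stored data changes between cycles.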
The BSLUT 170 may use a RAM or a ROM. The RAM includes SRAM and DRAM. The ROM includes OTP, EPROM, EEPROM and flash memory. The flash memory can be categorized into NOR and NAND, and the NAND can be further categorized into horizontal NAND and vertical NAND. For reconfigurable computation, the BSLUT 170 uses a re-programmable memory. For field-configurable computation, besides the re-programmable memory, the BSLUT 170 may also use an OTP. On the other hand, the ALC 180 may comprise an adder, a multiplier, and/or a multiply-accumulator (MAC). It may perform integer operations, fixed-point operations, or floating-point operations.
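The difference between the two memory choices can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not a description of any real memory device: the class names `OTPArray` and `ReprogrammableArray` are hypothetical, and the only behavior modeled is whether a cell accepts a second write.

```python
class OTPArray:
    """Hypothetical one-time-programmable array: each cell accepts one write."""

    def __init__(self, size):
        self.cells = [None] * size

    def write(self, addr, value):
        if self.cells[addr] is not None:
            raise RuntimeError("OTP cell already programmed")
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]

class ReprogrammableArray:
    """Hypothetical re-programmable array (e.g. SRAM or flash): writes may repeat."""

    def __init__(self, size):
        self.cells = [None] * size

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]

otp = OTPArray(4)
otp.write(0, 1.0)          # field configuration: allowed once
try:
    otp.write(0, 2.0)      # a second configuration is rejected
    otp_rewritable = True
except RuntimeError:
    otp_rewritable = False

ram = ReprogrammableArray(4)
ram.write(0, 1.0)
ram.write(0, 2.0)          # reconfiguration in a later usage cycle: allowed
```

In this model an OTP-based LUT supports field-configurable computation (one configuration in the field), while only the re-programmable array supports reconfigurable computation across usage cycles.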
Because the ALC 180 and the LUT 170 are formed on different sides 0F, 0B of the processor substrate 0S, this type of vertical integration is referred to as double-sided integration. The double-sided integration has a profound effect on the computational density and computational complexity. For the conventional 2D integration, the footprint of a conventional processor 00X is roughly equal to the sum of those of the ALU 100X and the LUT 200X. On the other hand, because the double-sided integration moves the LUT from beside the ALC to the backside 0B, the BSLUT processor 300 becomes smaller and computationally more powerful. In addition, the total LUT capacity of the conventional processor 00X is less than 100 kb, whereas the total BSLUT capacity for the BSLUT processor 300 could reach 100 Gb. Consequently, a single BSLUT processor 300 could support as many as 10,000 built-in functions (including various types of complex mathematical functions), far more than the conventional processor 00X. Moreover, the double-sided integration can improve the communication throughput between the BSLUT 170 and the ALC 180. Because they are physically close and coupled by a large number of TSV 160, the BSLUT 170 and the ALC 180 have a larger communication throughput than the LUT 200X and the ALU 100X in the conventional processor 00X. Lastly, the double-sided integration benefits the manufacturing process. Because the ALC 180 and the LUT 170 are on different sides 0F, 0B of the processor substrate 0S, the logic transistors in the ALC 180 and the memory transistors in the LUT 170 are formed in separate processing steps, which can be individually optimized.
To further improve programmability, the present invention further discloses a BSLUT configurable gate array 700 (
When realizing a built-in function, combining the LUT with polynomial interpolation can achieve a high precision without using an excessively large LUT. For example, if only a LUT (without any polynomial interpolation) is used to realize a single-precision function (32-bit input and 32-bit output), it would have a capacity of 2^{32}*32=128 Gb. By including polynomial interpolation, significantly smaller LUTs can be used. In the above embodiment, a single-precision function can be realized using a total of 4 Mb LUT (2 Mb for the function values, and 2 Mb for the first-derivative values) in conjunction with a first-order Taylor series. This is significantly less than the LUT-only approach (4 Mb vs. 128 Gb).
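The capacity figures and the LUT-plus-Taylor scheme can be checked with a short Python sketch. Several details here are assumptions rather than statements from the disclosure: a 16-bit table index is inferred from 2 Mb divided by 32 bits per entry, sine on [0, 2π) is chosen as the example function, binary prefixes are used (1 Mb = 2^20 bits, 1 Gb = 2^30 bits), and Python's double-precision floats stand in for 32-bit words, so the word width enters only the capacity arithmetic.

```python
import math

INDEX_BITS = 16                 # assumed: 2 Mb / 32 bits per entry = 2**16 entries
ENTRIES = 1 << INDEX_BITS
WORD_BITS = 32                  # single-precision word width

LO, HI = 0.0, 2 * math.pi       # assumed input range for the example
STEP = (HI - LO) / ENTRIES

# Two tables: function values f(A) and first-derivative values f'(A).
value_lut = [math.sin(LO + i * STEP) for i in range(ENTRIES)]
slope_lut = [math.cos(LO + i * STEP) for i in range(ENTRIES)]

def lut_sin(x):
    # First-order Taylor series about the table point A at or just below x:
    # f(x) ~ f(A) + f'(A) * (x - A)
    i = min(max(int((x - LO) / STEP), 0), ENTRIES - 1)
    a = LO + i * STEP
    return value_lut[i] + slope_lut[i] * (x - a)

# Capacity comparison in bits.
GB = 2 ** 30
MB = 2 ** 20
lut_only_bits = 2 ** 32 * WORD_BITS          # one entry per 32-bit input
lut_taylor_bits = 2 * ENTRIES * WORD_BITS    # value table + derivative table

approx = lut_sin(1.0)
error = abs(approx - math.sin(1.0))
```

Here `lut_only_bits` equals 128 Gb and `lut_taylor_bits` equals 4 Mb, matching the figures in the text. The truncation error of the first-order series is bounded by STEP^2/2 times the largest second derivative, which for this example is below 1e-8.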
Besides elementary functions, the preferred embodiment of
The BSLUT configurable gate array 700 is particularly suitable for realizing multi-variable functions. If only a LUT is used to realize the above 4-variable function, i.e. e=a·sin(b)+c·cos(d), an enormous LUT is needed: 2^{16}*2^{16}*2^{16}*2^{16}*16=256 Eb even for half precision, which is impractical. Using the BSLUT configurable gate array 700, only 8 Mb of LUT (8 configurable computing elements, each with 1 Mb capacity) is needed to realize a 4-variable function. As will be apparent to those skilled in the art, the BSLUT configurable gate array 700 can also be used to realize other multi-variable functions.
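The two capacity figures can be reproduced with integer arithmetic. The sketch below assumes binary prefixes (1 Mb = 2^20 bits, 1 Eb = 2^60 bits), which is what makes the totals match the text; the variable names are illustrative.

```python
VARIABLES = 4
INPUT_BITS = 16      # half-precision input per variable
OUTPUT_BITS = 16     # half-precision output

MB = 2 ** 20         # binary prefixes assumed
EB = 2 ** 60

# LUT-only approach: one output word for every combination of the four inputs.
lut_only_bits = (2 ** INPUT_BITS) ** VARIABLES * OUTPUT_BITS   # 2**68 bits

# Gate-array approach: 8 configurable computing elements of 1 Mb each.
ELEMENTS = 8
ELEMENT_BITS = 1 * MB
gate_array_bits = ELEMENTS * ELEMENT_BITS

# For comparison: a full half-precision table for ONE variable.
single_variable_lut_bits = 2 ** INPUT_BITS * OUTPUT_BITS

ratio = lut_only_bits // gate_array_bits
```

`lut_only_bits` works out to 256 Eb and `gate_array_bits` to 8 Mb, a ratio of 2^45. A single-variable half-precision table is exactly 1 Mb, which is consistent with the 1 Mb element size, although the text does not state that this is how the element capacity was chosen.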
While illustrative embodiments have been shown and described, it would be apparent to those skilled in the art that many more modifications than those mentioned above are possible without departing from the inventive concepts set forth therein. For example, the processor could be a micro-controller, a central processing unit (CPU), a digital signal processor (DSP), a graphics processing unit (GPU), a network-security processor, an encryption/decryption processor, an encoding/decoding processor, a neural-network processor, or an artificial intelligence (AI) processor. These processors can be found in consumer electronic devices (e.g. personal computers, video game machines, smart phones) as well as engineering and scientific workstations and server machines. The invention, therefore, is not to be limited except in the spirit of the appended claims.