Processor Comprising Three-Dimensional Memory (3DM) Array

First Claim
1. A three-dimensional processor (3D-processor), comprising:
a semiconductor substrate including transistors thereon;
at least a computing element formed on said semiconductor substrate, said computing element comprising an arithmetic logic circuit (ALC) and a three-dimensional memory (3DM)-based lookup table (3DMLUT), wherein
said ALC is formed on said semiconductor substrate and configured to perform at least one arithmetic operation on data from said 3DMLUT;
said 3DMLUT is stored in at least a 3DM array, said 3DM array being stacked above said ALC;
said 3DM array and said ALC are communicatively coupled by a plurality of contact vias.
Abstract
The present invention discloses a processor comprising a three-dimensional memory (3DM) array (3D-processor). Instead of logic-based computation (LBC), the 3D-processor uses memory-based computation (MBC). It comprises an array of computing elements, with each computing element comprising an arithmetic logic circuit (ALC) and a 3DM-based lookup table (3DMLUT). The ALC performs arithmetic operations on the LUT data, while the 3DMLUT is stored in at least one 3DM array.
20 Claims
1. A three-dimensional processor (3D-processor), comprising:
a semiconductor substrate including transistors thereon;
at least a computing element formed on said semiconductor substrate, said computing element comprising an arithmetic logic circuit (ALC) and a three-dimensional memory (3DM)-based lookup table (3DMLUT), wherein
said ALC is formed on said semiconductor substrate and configured to perform at least one arithmetic operation on data from said 3DMLUT;
said 3DMLUT is stored in at least a 3DM array, said 3DM array being stacked above said ALC;
said 3DM array and said ALC are communicatively coupled by a plurality of contact vias.
Dependent claims: 2 through 20.
1 Specification
This application claims priority from Chinese Patent Application 201610083747.7, filed on Feb. 13, 2016; Chinese Patent Application 201610260845.3, filed on Apr. 22, 2016; Chinese Patent Application 201610289592.2, filed on May 2, 2016; and Chinese Patent Application 201710237780.5, filed on Apr. 12, 2017, in the State Intellectual Property Office of the People's Republic of China (CN), the disclosures of which are incorporated herein by reference in their entireties.
1. Technical Field of the Invention
The present invention relates to the field of integrated circuits, and more particularly to processors.
2. Prior Art
Conventional processors use logic-based computation (LBC), which carries out computation primarily with logic circuits (e.g. XOR circuit). Logic circuits are suitable for arithmetic operations (i.e. addition, subtraction and multiplication), but not for non-arithmetic functions (e.g. elementary functions, special functions). Non-arithmetic functions are computationally hard. Rapid and efficient realization thereof has been a major challenge.
For the conventional processors, only a few basic non-arithmetic functions (e.g. basic algebraic functions and basic transcendental functions) are implemented by hardware and they are referred to as built-in functions. These built-in functions are realized by a combination of logic circuits and lookup tables (LUT). For example, U.S. Pat. No. 5,954,787 issued to Eun on Sep. 21, 1999 taught a method for generating sine/cosine functions using LUTs; U.S. Pat. No. 9,207,910 issued to Azadet et al. on Dec. 8, 2015 taught a method for calculating a power function using LUTs.
Realization of built-in functions is further illustrated in
Computation has been developed along the directions of computational density and computational complexity. The computational density is a figure of merit for parallel computation and it refers to the computational power (e.g. the number of floating-point operations per second) per die area. The computational complexity is a figure of merit for scientific computation and it refers to the total number of built-in functions supported by a processor. The 2D integration severely limits computational density and computational complexity.
For the 2D integration, inclusion of the LUT 370 increases the die size of the conventional processor 300 and lowers its computational density. This has an adverse effect on parallel computation. Moreover, because the ALU 380 is the primary component of the conventional processor 300 and occupies a large die area, the LUT 370 is left with a small die area and supports only a few built-in functions.
This small set of built-in functions (~10 types, including arithmetic operations) is the foundation of scientific computation. Scientific computation uses advanced computing capabilities to advance human understanding and solve engineering problems. It has wide applications in computational mathematics, computational physics, computational chemistry, computational biology, computational engineering, computational economics, computational finance and other computational fields. The prevailing framework of scientific computation comprises three layers: a foundation layer, a function layer and a modeling layer. The foundation layer includes built-in functions that can be implemented by hardware. The function layer includes mathematical functions that cannot be implemented by hardware (e.g. non-basic non-arithmetic functions). The modeling layer includes mathematical models, which are the mathematical descriptions of the input-output characteristics of a system component.
The mathematical functions in the function layer and the mathematical models in the modeling layer are implemented by software. The function layer involves one software-decomposition step: mathematical functions are decomposed into combinations of built-in functions by software, before these built-in functions and the associated arithmetic operations are calculated by hardware. The modeling layer involves two software-decomposition steps: the mathematical models are first decomposed into combinations of mathematical functions; then the mathematical functions are further decomposed into combinations of built-in functions. Clearly, the software-implemented functions (e.g. mathematical functions, mathematical models) run much more slowly and less efficiently than the hardware-implemented functions (i.e. built-in functions), and the extra software-decomposition steps (e.g. for mathematical models) make these performance gaps even more pronounced.
To illustrate how computationally intensive a mathematical model could be,
It is a principal object of the present invention to provide a paradigm shift for scientific computation.
It is a further object of the present invention to provide a processor with improved computational complexity.
It is a further object of the present invention to provide a processor with a large set of built-in functions.
It is a further object of the present invention to realize non-arithmetic functions rapidly and efficiently.
It is a further object of the present invention to realize rapid and efficient modeling and simulation.
It is a further object of the present invention to provide a processor with improved computational density.
In accordance with these and other objects of the present invention, the present invention discloses a processor comprising three-dimensional memory (3DM) arrays (3D-processor). Instead of logic-based computation (LBC), the 3D-processor uses memory-based computation (MBC).
The present invention discloses a processor comprising a three-dimensional memory (3DM) array (3D-processor). It comprises an array of computing elements formed on a semiconductor substrate, with each computing element comprising an arithmetic logic circuit (ALC) and a lookup table (LUT) based on 3DM (3DMLUT). The ALC is formed on the substrate and performs arithmetic operations on the 3DMLUT data. The 3DMLUT is stored in at least one 3DM array. The 3DM array is stacked above the ALC and at least partially covers the ALC. The 3DM array is further communicatively coupled with the ALC through a plurality of contact vias. These contact vias are collectively referred to as 3D interconnects.
The present invention further discloses a memory-based computation (MBC), which carries out computation primarily with the 3DMLUT. Compared with the conventional logic-based computation (LBC), the 3DMLUT used by the MBC has a much larger capacity than the conventional LUT. Although arithmetic operations are still performed for most MBCs, the MBC starts from a larger LUT and therefore only needs to calculate a polynomial of a lower order. For the MBC, the fraction of computation done by the 3DMLUT could exceed that done by the ALC.
Because the 3DMLUT is stacked above the ALC, this type of vertical integration is referred to as three-dimensional (3D) integration. The 3D integration has a profound effect on the computational density. Because the 3DM array does not occupy any substrate area, the footprint of the computing element is roughly equal to that of the ALC. In contrast, the footprint of a conventional processor is roughly equal to the sum of the footprints of the LUT and the ALU. By moving the LUT from beside the logic to above it, the computing element becomes smaller. The 3D-processor can therefore contain more computing elements, become more computationally powerful and support massive parallelism.
The 3D integration also has a profound effect on the computational complexity of the 3D-processor. For a conventional processor, the total LUT capacity is less than 100 kb. In contrast, the total 3DMLUT capacity for a 3D-processor could reach 100 Gb (for example, a 3D XPoint die has a storage capacity of 128 Gb). Consequently, a single 3D-processor die could support as many as 10,000 built-in functions, which is three orders of magnitude more than the conventional processor.
Significantly more built-in functions shall flatten the prevailing framework of scientific computation (including the foundation, function and modeling layers). The hardware-implemented functions, which were only available to the foundation layer, now become available to the function and modeling layers. Not only can mathematical functions in the function layer be directly realized by hardware, but mathematical models in the modeling layer can also be directly described by hardware. In the function layer, mathematical functions can be realized by a function-by-LUT method, i.e. the function values are calculated by reading the 3DMLUT plus polynomial interpolation. In the modeling layer, mathematical models can be described by a model-by-LUT method, i.e. the input-output characteristics of a system component are modeled by reading the 3DMLUT plus polynomial interpolation. Rapid and efficient computation would lead to a paradigm shift for scientific computation.
Accordingly, the present invention discloses a three-dimensional processor (3D-processor), comprising: a semiconductor substrate including transistors thereon; at least a computing element formed on said semiconductor substrate, said computing element comprising an arithmetic logic circuit (ALC) and a three-dimensional memory (3DM)-based lookup table (3DMLUT), wherein said ALC is formed on said semiconductor substrate and configured to perform at least one arithmetic operation on data from said 3DMLUT; said 3DMLUT is stored in at least a 3DM array, said 3DM array being stacked above said ALC; said 3DM array and said ALC are communicatively coupled by a plurality of contact vias.
It should be noted that all the drawings are schematic and not drawn to scale. Relative dimensions and proportions of parts of the device structures in the figures have been shown exaggerated or reduced in size for the sake of clarity and convenience in the drawings. The same reference symbols are generally used to refer to corresponding or similar features in the different embodiments. The symbol “/” means a relationship of “and” or “or”.
Throughout the present invention, the phrase “memory” is used in its broadest sense to mean any semiconductor-based holding place for information, either permanent or temporary; the phrase “permanent” is used in its broadest sense to mean any long-term storage; the phrase “communicatively coupled” is used in its broadest sense to mean any coupling whereby information may be passed from one element to another element; the phrase “on the substrate” means the active elements of a circuit (e.g. transistors) are formed on the surface of the substrate, although the interconnects between these active elements are formed above the substrate and do not touch the substrate; the phrase “above the substrate” means the active elements (e.g. memory cells) are formed above the substrate and do not touch the substrate.
Those of ordinary skill in the art will realize that the following description of the present invention is illustrative only and is not intended to be in any way limiting. Other embodiments of the invention will readily suggest themselves to such skilled persons from an examination of the within disclosure.
Referring now to
The 3D-processor 100 uses memory-based computation (MBC), which carries out computation primarily with the 3DMLUT 170. Compared with the conventional logic-based computation (LBC), the 3DMLUT 170 used by the MBC has a much larger capacity than the conventional LUT 370. Although arithmetic operations are still performed for most MBCs, the MBC starts from a larger LUT and therefore only needs to calculate a polynomial of a lower order. For the MBC, the fraction of computation done by the 3DMLUT 170 could exceed that done by the ALC 180.
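The trade-off between LUT size and polynomial order can be illustrated with a small numerical sketch. It is not taken from the disclosure: the choice of exp(x) on [0, 1), the table sizes, the tolerance and the function names max_error and smallest_order are all assumptions made for illustration. The sketch searches for the lowest Taylor order that meets a given tolerance when the expansion point is the nearest LUT entry at or below the input.

```python
import math

def max_error(entries, order, samples_per_cell=4):
    """Worst sampled error of an order-`order` Taylor expansion of exp(x)
    on [0, 1], expanded about the LUT grid point at or below x.
    (Illustrative assumption: exp is used because all its derivatives
    equal exp, so one stored value per entry is enough.)"""
    h = 1.0 / entries
    worst = 0.0
    for k in range(entries):
        x0 = k * h
        f0 = math.exp(x0)  # value that would be read from the LUT
        for j in range(1, samples_per_cell + 1):
            dx = h * j / samples_per_cell
            # Polynomial part, as it would be evaluated by the arithmetic logic
            approx = f0 * sum(dx ** n / math.factorial(n) for n in range(order + 1))
            worst = max(worst, abs(math.exp(x0 + dx) - approx))
    return worst

def smallest_order(entries, tol, max_order=10):
    """Lowest polynomial order whose sampled error stays within `tol`."""
    for order in range(max_order + 1):
        if max_error(entries, order) <= tol:
            return order
    raise ValueError("tolerance not reached within max_order")

if __name__ == "__main__":
    for entries in (16, 4096):
        print(entries, "entries -> order", smallest_order(entries, 1e-6))
```

Under these assumptions the larger table reaches the same tolerance with a lower-order polynomial, which is the qualitative behavior the paragraph describes.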
Referring now to
3DM can be categorized into 3DRAM (random-access memory) and 3DROM (read-only memory). As used herein, the phrase “RAM” is used in its broadest sense to mean any memory for temporarily holding information, including but not limited to registers, SRAM, and DRAM; the phrase “ROM” is used in its broadest sense to mean any memory for permanently holding information, wherein the information being held could be either electrically alterable or unalterable. The most common 3DM is 3DROM. The 3DROM is further categorized into 3D writable memory (3DW) and 3D printed memory (3DP).
For the 3DW, data can be electrically written (i.e. the memory is electrically programmable). Based on the number of programming cycles allowed, a 3DW can be categorized into three-dimensional one-time-programmable memory (3DOTP) and three-dimensional multiple-time-programmable memory (3DMTP). The 3DOTP can be written once, while the 3DMTP is electrically re-programmable. An exemplary 3DMTP is 3D XPoint. Other types of 3DMTP include memristor, resistive random-access memory (RRAM or ReRAM), phase-change memory, programmable metallization cell (PMC), conductive-bridging random-access memory (CBRAM), and the like. For the 3DW, the 3DMLUT 170 can be configured in the field. This is even more advantageous when the 3DMTP is used, because the 3DMLUT 170 can then be re-configured.
For the 3DP, data are recorded thereto using a printing method during manufacturing. These data are fixedly recorded and cannot be changed after manufacturing. The printing methods include photolithography, nano-imprint, e-beam lithography, DUV lithography, laser programming, etc. An exemplary 3DP is three-dimensional mask-programmed read-only memory (3DMPROM), whose data are recorded by photolithography. Because electrical programming is not required, a memory cell in the 3DP can be biased at a larger voltage during read than one in the 3DW and therefore, the 3DP is faster than the 3DW.
The 3DW cell 5aa comprises a programmable layer 12 and a diode layer 14. The programmable layer 12 could be an antifuse layer (which can be programmed once and is used for the 3DOTP) or a re-programmable layer (which is used for the 3DMTP). The diode layer 14 is broadly interpreted as any layer whose resistance at the read voltage is substantially lower than when the applied voltage has a magnitude smaller than or polarity opposite to that of the read voltage. The diode could be a semiconductor diode (e.g. p-i-n silicon diode), or a metal-oxide (e.g. TiO2) diode.
In the preferred embodiment of
Referring now to
In the embodiment of
In the embodiment of
Because the 3DMLUT 170 is stacked above the ALC 180, this type of vertical integration is referred to as three-dimensional (3D) integration. The 3D integration has a profound effect on the computational density of the 3D-processor 100. Because the 3DMLUT 170 does not occupy any area on the substrate 0, the footprint of the computing element 110i is roughly equal to that of the ALC 180. This is much smaller than a conventional processor 300, whose footprint is roughly equal to the sum of the footprints of the LUT 370 and the ALU 380. By moving the LUT from beside the logic to above it, the computing element becomes smaller. The 3D-processor 100 can therefore contain more computing elements 1101, become more computationally powerful and support massive parallelism.
The 3D integration also has a profound effect on the computational complexity of the 3D-processor 100. For a conventional processor 300, the total LUT capacity is less than 100 kb. In contrast, the total 3DMLUT capacity for a 3D-processor 100 could reach 100 Gb (for example, a 3D XPoint die has a storage capacity of 128 Gb). Consequently, a single 3D-processor die 100 could support as many as 10,000 built-in functions, which is three orders of magnitude more than the conventional processor 300.
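The capacity figures above can be turned into a rough per-function budget. The sketch below is only back-of-the-envelope arithmetic: it assumes decimal prefixes (1 kb = 10^3 bits, 1 Gb = 10^9 bits), treats "less than 100 kb" as an upper bound, takes the roughly 10 built-in function types mentioned in the prior-art discussion as the conventional baseline, and assumes the capacity is split evenly across functions. The constant and function names are hypothetical.

```python
import math

# Figures quoted in the text; prefixes are assumed to be decimal.
CONVENTIONAL_LUT_BITS = 100 * 10**3    # "less than 100 kb", used as an upper bound
THREE_D_LUT_BITS = 100 * 10**9         # "could reach 100 Gb"
CONVENTIONAL_FUNCTIONS = 10            # "~10 types" of built-in functions
THREE_D_FUNCTIONS = 10_000             # "as many as 10,000 built-in functions"

def bits_per_function(total_bits, functions):
    """Average LUT budget per function, assuming an even split."""
    return total_bits // functions

def orders_of_magnitude(larger, smaller):
    """Base-10 logarithm of the ratio between two quantities."""
    return math.log10(larger / smaller)

if __name__ == "__main__":
    budget = bits_per_function(THREE_D_LUT_BITS, THREE_D_FUNCTIONS)
    print("average budget per function:", budget / 10**6, "Mb")
    print("function-count gain (orders of magnitude):",
          round(orders_of_magnitude(THREE_D_FUNCTIONS, CONVENTIONAL_FUNCTIONS)))
    print("capacity gain (orders of magnitude):",
          round(orders_of_magnitude(THREE_D_LUT_BITS, CONVENTIONAL_LUT_BITS)))
```

Under these assumptions each function receives about 10 Mb on average, and the function count grows by three orders of magnitude while the raw capacity grows by six.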
Significantly more built-in functions shall flatten the prevailing framework of scientific computation (including the foundation, function and modeling layers). The hardware-implemented built-in functions, which were only available to the foundation layer, now become available to the function and modeling layers. Not only mathematical functions in the function layer can be directly realized by hardware (
Referring now to
When calculating a built-in function, combining the LUT with polynomial interpolation can achieve a high precision without using an excessively large LUT. For example, if only a LUT (without any polynomial interpolation) is used to realize a single-precision function (32-bit input and 32-bit output), it would need a capacity of 2^32 × 32 bits = 128 Gb, which is impractical. By including polynomial interpolation, significantly smaller LUTs can be used. In the above embodiment, a single-precision function can be realized using a total of 4 Mb of LUT (2 Mb for function values, and 2 Mb for first-derivative values) in conjunction with a first-order Taylor series calculation. This is significantly less than the LUT-only approach (4 Mb vs. 128 Gb).
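A minimal software sketch of this function-by-LUT scheme is given below. It is an illustration under stated assumptions, not the circuit of the embodiment: the 16-bit table index (2^16 entries of 32 bits each, which reproduces the 2 Mb per table quoted above), the input range [0, 1), the use of sin(x) as the target function, and the names build_tables, lut_eval and table_bits are all assumptions. One table holds function values, the other holds first-derivative values, and the result is the first-order Taylor expansion about the selected entry.

```python
import math
from array import array

INDEX_BITS = 16              # assumed index width: 2**16 entries per table
ENTRIES = 1 << INDEX_BITS

def build_tables(f, df):
    """Tabulate f and its first derivative on [0, 1) as 32-bit floats."""
    values = array("f", (f(k / ENTRIES) for k in range(ENTRIES)))
    derivs = array("f", (df(k / ENTRIES) for k in range(ENTRIES)))
    return values, derivs

def lut_eval(x, values, derivs):
    """First-order Taylor evaluation: f(x0) + f'(x0) * (x - x0), 0 <= x < 1."""
    k = int(x * ENTRIES)         # upper bits of the input select the entry
    dx = x - k / ENTRIES         # remaining bits give the offset from the entry
    return values[k] + derivs[k] * dx

def table_bits(table):
    """Storage taken by one table, in bits."""
    return len(table) * table.itemsize * 8

if __name__ == "__main__":
    values, derivs = build_tables(math.sin, math.cos)
    total = table_bits(values) + table_bits(derivs)
    print("total LUT size:", total // 2**20, "Mb")
    print("LUT-only size:", (2**32 * 32) // 2**30, "Gb")
    print("sin(0.3) via LUT:", lut_eval(0.3, values, derivs))
```

The lookup costs one table read per table plus one multiply and one add, which is the division of labor between the lookup table and the arithmetic logic that the paragraph describes.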
Besides elementary functions, the preferred embodiment of
Referring now to
Referring now to
The 3DMLUT 170U stores different forms of mathematical models. In one case, the mathematical model data stored in the 3DMLUT 170U is raw measurement data, i.e. the measured input-output characteristics of the transistor 24. One example is the measured drain current vs. the applied gate-source voltage (I_D-V_GS) characteristics. In another case, the mathematical model data stored in the 3DMLUT 170U is the smoothed measurement data. The raw measurement data could be smoothed using a purely mathematical method (e.g. a best-fit model). Or, this smoothing process can be aided by a physical transistor model (e.g. a BSIM4 V3.0 transistor model). In a third case, the mathematical data stored in the 3DMLUT include not only the measured data, but also its derivative values. For example, the 3DMLUT data include not only the drain-current values of the transistor 24 (e.g. the I_D-V_GS characteristics), but also its transconductance values (e.g. the G_m-V_GS characteristics). With derivative values, polynomial interpolation can be used to improve the modeling precision using a reasonable-size 3DMLUT, as in the case of
Model-by-LUT offers many advantages. By skipping two software-decomposition steps (from mathematical models to mathematical functions, and from mathematical functions to built-in functions), it saves substantial modeling time and energy. Model-by-LUT may also need less LUT capacity than function-by-LUT. Because a transistor model (e.g. BSIM4 V3.0) has hundreds of model parameters, calculating the intermediate functions of the transistor model requires extremely large LUTs. However, if we skip function-by-LUT (namely, skipping the transistor models and the associated intermediate functions), the transistor behaviors can be described using only three parameters (the gate-source voltage V_GS, the drain-source voltage V_DS, and the body-source voltage V_BS). Describing the mathematical models of the transistor 24 therefore requires relatively small LUTs.
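The model-by-LUT idea can be sketched in software for the one-dimensional I_D-V_GS case. Everything numeric here is an assumption for illustration: a textbook square-law expression stands in for measured or smoothed data, and the threshold voltage, gain factor, 50 mV grid step and the names reference_id, reference_gm and model_by_lut are hypothetical. The table stores drain-current values and transconductance values on the grid, and a lookup applies a first-order correction from the stored derivative.

```python
STEP_MV = 50          # assumed grid step of the table, in millivolts
VMAX_MV = 1800        # assumed upper end of the V_GS range
VT = 0.5              # assumed threshold voltage (V); stand-in model only
K = 2e-4              # assumed gain factor (A/V^2); stand-in model only

def reference_id(vgs):
    """Stand-in for measured drain current: a simple square-law model."""
    over = max(vgs - VT, 0.0)
    return 0.5 * K * over * over

def reference_gm(vgs):
    """Stand-in for measured transconductance, dI_D/dV_GS of the model above."""
    return K * max(vgs - VT, 0.0)

# LUT contents: I_D and G_m tabulated on the V_GS grid.
GRID = [mv / 1000 for mv in range(0, VMAX_MV + STEP_MV, STEP_MV)]
ID_LUT = [reference_id(v) for v in GRID]
GM_LUT = [reference_gm(v) for v in GRID]

def model_by_lut(vgs, use_derivative=True):
    """Read the table entry at or below vgs and optionally apply a
    first-order correction using the stored transconductance."""
    k = min(int(vgs * 1000 // STEP_MV), len(GRID) - 1)
    dv = vgs - GRID[k]
    return ID_LUT[k] + (GM_LUT[k] * dv if use_derivative else 0.0)

if __name__ == "__main__":
    points = [i / 1000 for i in range(VMAX_MV)]
    for flag in (False, True):
        err = max(abs(model_by_lut(v, flag) - reference_id(v)) for v in points)
        print("with derivative" if flag else "values only", "-> max error", err, "A")
```

A full transistor table would be indexed by V_GS, V_DS and V_BS, so its entry count grows with the cube of the points per axis (for example, 64 points per axis give 64^3 = 262,144 entries); storing derivative values is one way to keep the number of points per axis modest.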
While illustrative embodiments have been shown and described, it would be apparent to those skilled in the art that many more modifications than have been mentioned above are possible without departing from the inventive concepts set forth therein. For example, the processor could be a micro-controller, a central processing unit (CPU), a digital signal processor (DSP), a graphics processing unit (GPU), a network-security processor, an encryption/decryption processor, an encoding/decoding processor, a neural-network processor, or an artificial intelligence (AI) processor. These processors can be found in consumer electronic devices (e.g. personal computers, video game machines, smart phones) as well as engineering and scientific workstations and server machines. The invention, therefore, is not to be limited except in the spirit of the appended claims.