Efficient linear detection implementation for massive MIMO

0Associated
Cases 
0Associated
Defendants 
0Accused
Products 
0Forward
Citations 
0
Petitions 
1
Assignment
First Claim
1. A method comprising:
 per given time instance, acquiring K samples b from a signal r, which is based at least on K transmitted symbols x and a transfer matrix H of a communication channel, and a linear detection matrix A of a size K×
K, which is based at least on the transfer matrix H;
iteratively calculating at most K(K−
1) tentative parameters b{tilde over (
)} and at most K(K−
1) tentative parameters A{tilde over (
)} for the K samples b and the linear detection matrix A;
checking whether or not the tentative parameters b{tilde over (
)} and A{tilde over (
)} have converged;
if b{tilde over (
)} and A{tilde over (
)} have converged, deciding K estimation values x{circumflex over (
)} for the K transmitted symbols x based on b{tilde over (
)} and A{tilde over (
)}; and
if b{tilde over (
)} and A{tilde over (
)} have not converged, returning to the iteratively calculating b{tilde over (
)} and A{tilde over (
)}for the K samples b.
1 Assignment
0 Petitions
Accused Products
Abstract
Per given time instance, K samples b are acquired from a signal r, which is based at least on K transmitted symbols x and a transfer matrix H of a communication channel, and a linear detection matrix A of a size K×K is acquired, which is based at least on the transfer matrix H (S101). For the K samples b and the linear detection matrix A, at most K(K−1) tentative parameters b{tilde over ( )} and at most K(K−1) tentative parameters A{tilde over ( )} are iteratively calculated (S102). It is checked whether or not the tentative parameters b{tilde over ( )} and A{tilde over ( )} have converged (S103). If b{tilde over ( )} and A{tilde over ( )} have converged, K estimation values x{circumflex over ( )} are decided for the K transmitted symbols x based on b{tilde over ( )} and A{tilde over ( )} (S104). If b{tilde over ( )} and A{tilde over ( )} have not converged, it is returned to the iteratively calculating b{tilde over ( )} and A{tilde over ( )} for the K samples b.
6 Citations
No References
MIMO DECODER AND MIMO DECODING METHOD  
Patent #
US 20090279644A1
Filed 03/28/2006

Current Assignee
NEC Corporation

Sponsoring Entity
NEC Corporation

Lower complexity computation of lattice reduction  
Patent #
US 20070268981A1
Filed 05/22/2006

Current Assignee
Nokia Corporation

Sponsoring Entity
Nokia Corporation

Method and apparatus for estimating a signal sequence in a MIMOOFDM mobile communication system  
Patent #
US 20040208254A1
Filed 01/12/2004

Current Assignee
Samsung Electronics Co. Ltd.

Sponsoring Entity
Samsung Electronics Co. Ltd.

Apparatus, method and computer program product providing iterative recursive least squares (RLS) algorithm for coded MIMO systems  
Patent #
US 20070286312A1
Filed 05/31/2007

Current Assignee
Nokia Corporation

Sponsoring Entity
Nokia Corporation

MULTIINPUT MULTIOUTPUT (MIMO) DETECTION SYSTEMS  
Patent #
US 20160254883A1
Filed 02/29/2016

Current Assignee
Georgia Tech Research Corporation

Sponsoring Entity
Georgia Tech Research Corporation

SIGNAL DETECTING METHOD AND DEVICE  
Patent #
US 20170170928A1
Filed 10/23/2016

Current Assignee
Tsinghua University

Sponsoring Entity
Tsinghua University

20 Claims
 1. A method comprising:
per given time instance, acquiring K samples b from a signal r, which is based at least on K transmitted symbols x and a transfer matrix H of a communication channel, and a linear detection matrix A of a size K×
K, which is based at least on the transfer matrix H;iteratively calculating at most K(K−
1) tentative parameters b{tilde over (
)} and at most K(K−
1) tentative parameters A{tilde over (
)} for the K samples b and the linear detection matrix A;checking whether or not the tentative parameters b{tilde over (
)} and A{tilde over (
)} have converged;if b{tilde over (
)} and A{tilde over (
)} have converged, deciding K estimation values x{circumflex over (
)} for the K transmitted symbols x based on b{tilde over (
)} and A{tilde over (
)}; andif b{tilde over (
)} and A{tilde over (
)} have not converged, returning to the iteratively calculating b{tilde over (
)} and A{tilde over (
)}for the K samples b. View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
 9. A nontransitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following:
per given time instance, acquiring K samples b from a signal r, which is based at least on K transmitted symbols x and a transfer matrix H of a communication channel, and a linear detection matrix A of a size K×
K, which is based at least on the transfer matrix H;iteratively calculating at most K(K−
1) tentative parameters b{tilde over (
)} and at most K(K−
1) tentative parameters A{tilde over (
)} for the K samples b and the linear detection matrix A;checking whether or not the tentative parameters b{tilde over (
)} and A{tilde over (
)} have converged;if b{tilde over (
)} and A{tilde over (
)} have converged, deciding K estimation values x{circumflex over (
)} for the K transmitted symbols x based on b{tilde over (
)} and A{tilde over (
)}; andif b{tilde over (
)} and A{tilde over (
)} have not converged, returning to the iteratively calculating b{tilde over (
)} and A{tilde over (
)} for the K samples b. View Dependent Claims (10, 11, 12, 13)
 14. An apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:
per given time instance, acquiring K samples b from a signal r, which is based at least on K transmitted symbols x and a transfer matrix H of a communication channel, and a linear detection matrix A of a size K×
K, which is based at least on the transfer matrix H;iteratively calculating at most K(K−
1) tentative parameters b{tilde over (
)} and at most K(K−
1) tentative parameters A{tilde over (
)} for the K samples b and the linear detection matrix A;checking whether or not the tentative parameters b{tilde over (
)} and A{tilde over (
)} have converged;if b{tilde over (
)} and A{tilde over (
)} have converged, deciding K estimation values x{circumflex over (
)} for the K transmitted symbols x based on b{tilde over (
)} and A{tilde over (
)}; andif b{tilde over (
)} and A{tilde over (
)} have not converged, returning to the iteratively calculating b{tilde over (
)} and A{tilde over (
)} for the K samples b. View Dependent Claims (15, 16, 17, 18, 19, 20)
1 Specification
Some embodiments relate to an efficient linear detection implementation for massive multiinput multioutput (MIMO).
Massive MIMO, also known as largescale MIMO, is a scalable version of pointtopoint MIMO, or multiuser MIMO, with many antennas at both link ends. Multiuser detection for massive MIMO relies on the implementation of linear signal processing schemes such as the zeroforcing (ZF) and the linear minimum meansquareerror (LMMSE) detectors.
An option for implementing the linear (e.g., LMMSE) receiver is via an exact implementation. The cheapest, computationally speaking, implementation of an exact linear procedure is based on the Cholesky decomposition, namely:
 1. First, decompose A≙LL^{T}, where L is a lower triangular matrix.
 2. Next, solve Ly=b via forward substitution to extract the vector y.
 3. Finally, solve L^{T}x=y by back substitution to extract the desired soft decision x*.
The main drawback of such an exact computation approach is in it being computationallyheavy, where the number of required operations is cubic with the number of transmitting antennas, which again may be large in massive MIMO applications. Another problem with this approach, deteriorating its hardware (HW) efficiency, is the fact that its pipeline latency is relatively large, since the Cholesky solver is based on a pipeline of 3 different stages (again the decomposition itself, forward and backward substitution) that must be performed sequentially, one after the other. Consequently, the supported data throughput, which is proportional to the latency reciprocal, is also limited.
A popular iterative alternative is based on the classical GaussSeidel (GS) method:
{circumflex over (x)}_{t+1}=(D+L)^{−1}(b−U{circumflex over (x)}_{t}),
where D, L, U are the diagonal, strictly lower triangular and strictly upper triangular matrices of the (LMMSE) matrix A, respectively.
Some embodiments aim at providing an approximate scheme for executing linear detection, which is enhanced in terms of hardware efficiency and latency.
Some embodiments provide for an iterative approximate scheme for executing linear detection, involving matrix inversion (e.g. like LMMSE), in a massive MIMO environment.
According to some embodiments, a method, a computer program, a computer readable medium, a nontransitory computer readable medium and apparatuses are provided as defined in the appended claims.
The iterative scheme for an approximate implementation of the computationallydemanding linear multiuser detection in the massive MIMO configuration, according to some embodiments, shows attractive merits of hardware efficiency.
In the following the invention will be described by way of embodiments thereof with reference to the accompanying drawings.
First, the problem with multiuser detection for massive MIMO is described, for exposition purposes, in terms of an LMMSE detector, but evidently it is also applicable to other forms of linear detection, such as the implementation of the ZF detector, and any other linear detector (implicitly) involving matrix inversion.
Furthermore, the iterative scheme according to some embodiments is applied in the context of beamforming in the downlink.
Consider a generic uplink wireless massive MIMO channel with K transmitting antennas (e.g., users, transmitting units) and M receiving antennas at the basestation (or alternatively M antenna receptions aggregated together at a central baseband unit (BBU) location).
The system load is defined as
and typically in underloaded massive MIMO β<<1. Assume a flat fading channel and Additive White Gaussian Noise (AWGN) (with variance σ^{2}):
where vector r represents a signal received at the M receiving antennas (receiving units), h_{i,j }represents entries of transfer matrix H, with i=1 . . . K and j=1 . . . M, vector x represents K transmitted symbols transmitted by the transmitting units, and vector n represents noise.
The LMMSE receiver in such a setup boils down to the linear operation:
{circumflex over (x)}=(H^{T}H+σ^{2}I_{K})^{−1}H^{T}r
where vector {circumflex over (x)} represents K estimation values for the K transmitted symbols, and I_{K }is an identity matrix.
Note that this LMMSE operation is equivalent to solving a linear system of the form:
where R≙H^{T}H is the channel'"'"'s Gram matrix, A is the (LMMSE) matrix, to be implicitly inverted, and b is the matchedfilter output. The vector x* is the desired solution to this set of linear equations.
In process 1, in step S101, at a time instance t, K samples b are acquired from a signal r, wherein r is based at least on K transmitted symbols x and a transfer matrix H of a communication channel. Further, in step S101, at time instance t, a linear detection matrix A of a size K×K is acquired, which is based at least on the transfer matrix H. According to an example implementation, the signal r is received from K transmitting units. Alternatively or in addition, the K samples b are received from a preceding matched filter module. It is noted that b, r and x may be vectors corresponding to those as described above.
In step S102, at most K(K−1) tentative parameters b{tilde over ( )} and at most K(K−1) tentative parameters A{tilde over ( )} are calculated for the K samples b and the linear detection matrix A. According to an example implementation, the linear detection matrix A comprises a linear minimum meansquareerror detection matrix as described above, zero forcing detection matrix, or any other linear detection matrix.
According to an example implementation, calculation of b{tilde over ( )} and A{tilde over ( )} implicitly involves approximating a matrix inversion of the linear transfer matrix A.
According to an example implementation, calculation of b{tilde over ( )} and A{tilde over ( )} comprises an initialization in which the K samples b and the K×K linear detection matrix A are used as inputs.
In step S103, it is checked whether or not the tentative parameters b{tilde over ( )} and A{tilde over ( )} have converged according to a predefined stopping criterion.
If b{tilde over ( )} and A{tilde over ( )} have converged (Yes in S103), process 1 proceeds to step S104 in which, based on b{tilde over ( )} and A{tilde over ( )}, K estimation values x{circumflex over ( )} are decided for the K transmitted symbols x. Then, process 1 advances to step S105 in which the time instance t is incremented, and then again to step S101 in which K samples b and a K×K linear detection matrix A of the next time instance are acquired.
On the other hand, if b{tilde over ( )} and A{tilde over ( )} have not converged (No in S103), process 1 returns to step S102.
According to an alternative example implementation, calculation of b{tilde over ( )} and A{tilde over ( )} comprises an initialization in which the K samples b are replaced with b−Ax{circumflex over ( )}_{t=0}, wherein the decision in step S104 comprises adding the K estimation values x{circumflex over ( )} to x{circumflex over ( )}_{t=0}, with x{circumflex over ( )}_{t=0}=(D^{−1}−D^{−1}ED^{−1})b, where D and E are the diagonal and complementary offdiagonal matrices of the linear transfer matrix A, respectively.
It is noted that process 1 may be terminated e.g. in case there are no more samples b to acquire, after a predetermined time period, etc.
The abovedescribed process 1 provides for a recipe for hardware (HW)efficient implementations of multiuser linear detection (as the above LMMSE receiver) in massive MIMO applications, which is of importance for the design of attractive and costeffective 5G and beyond systems.
Hardware efficiency refers, among others, to the following aspects:
 Significantly less FLOPs (floatingpoint operations) than a straightforward exact implementation which is by nature cubic with the number of transmitting antennas K;
 Less silicon area (reduced bill of materials (BoM));
 Reduced computation (pipeline) latency;
 Support of higher data throughput in the HW pipeline;
 Seamlessly lending itself to parallel computation resources implementation (e.g., either within the same HW entity or distributed across several HW entities connected via a fast bus);
 Having a minimal number of iterations which entail maximal effectiveness in detection;
 Enabling HW reuse via scheme iterations; and
 All the above, at the expense of negligible performance loss.
The approximate scheme illustrated in
 Faster convergence, typically 3 times faster in the examined practical configurations (as described later on);
 Moreover, for typical massive MIMO underloads, only 1, or 2 iterations of the proposed scheme are required in order to infer soft decisions which are sufficiently close, in their accuracy, to the output of an exact (and costly) implementation of the linear detector;
 The complexity per iteration, in FLOPs, of the scheme of process 1 is less than 3 times larger than that of other iterative scheme alternatives, thus yielding an overall computational gain;
 Following its accelerated convergence property, the scheme of process 1 suggests a reduced pipeline latency, and consequently a higher supported throughput, compared to the exact Choleskybased scheme and other iterative schemes, as the one based on GS;
 The scheme of process 1 inherently lends itself to parallel computation resources implementation;
 In terms of the hardware efficiency, measured via the metric of the product of FLOPs×latency, which is also proportional to the ratio area/throughput, the scheme of process 1 is found to be superior to the exact Cholesky implementation and the iterative implementation alternatives (the lesser this abovementioned product, or ratio, the more efficient the HW); and
 All the above is conveyed in practical massive MIMO configurations at the expense of only 0.1 dB loss in the SNR achieving 10% blockerrorrate (BLER).
Per K transmitted symbols at a given time instance, inputs into the proposed linear detector implementation (module 200) are:
(1) a K×1 vector of samples received from a preceding matched filter module; and
(2) a K×K linear detection matrix A (which is for instance in the LMMSE detector case A=(H′H+σ^{2}I_{K})). The module 200 outputs a K×1 vector of soft decisions, {circumflex over (x)}, estimating the desired K transmitted symbols.
In the module 200, first, in a step “initialize” for all i and j (i=1 . . . K, j=1 . . . K, and j≠i), A{tilde over ( )}_{ij }and b{tilde over ( )}_{ij }are initialized to zero. A{tilde over ( )}_{ij }are entries (tentative parameters) of A{tilde over ( )}, and b{tilde over ( )}_{ij }are entries (tentative parameters) of b{tilde over ( )}.
Then, in a step “iterate”, for all i and j corresponding to nonzero entries of the linear detection matrix A, calculations are performed as illustrated in
Then, in a step “check”, it is checked whether or not A{tilde over ( )}_{ij }and b{tilde over ( )}_{ij }have converged according to some predefined criterion. In case A{tilde over ( )}_{ij }and b{tilde over ( )}_{ij }have converged, the process in module 200 continues to a step “decision”. Otherwise, the process returns to “iterate”.
In the step “decision”, x{circumflex over ( )}_{t }(i) is calculated as:
x{circumflex over ( )}_{t}(i)=(b{tilde over ( )}_{i\j}+b{tilde over ( )}_{ji})/(A{tilde over ( )}_{i\j}+A{tilde over ( )}_{ji})
According to an example implementation, a first initialization option for the abovedescribed iterative scheme comprises a naïve (yet cheap) approach setting: {circumflex over (x)}_{t=0}=0. In other words, b and A are taken as inputs to the process of
According to another example implementation, a second initialization option for the abovedescribed iterative scheme comprises a calculated (thus “smarter”) initialization which is a bit pricy: {circumflex over (x)}_{t=0}=(D^{−1}−D^{−1}ED^{−1})b, where D, E are the diagonal and complementary offdiagonal matrices of the matrix A, respectively. This initialization means replacing the vector b with b−Ax{circumflex over ( )}_{t=0 }where x{circumflex over ( )}_{t=0 }is defined as indicated above, and in the step “decision” adding the estimated x{circumflex over ( )}_{t }to x{circumflex over ( )}_{t=0 }to get the approximated value. The second initialization option, which expedites convergence compared to the zero initialization, is based on a 2term Neumann series expansion (NSE).
According to an example implementation, the control unit 10 executes process 1 shown in
According to another example implementation, the control unit 10 implements the module 200 of
According to some embodiments, the control unit 10 is implemented in a base station of a communication system (comprising 3G, 4G and 5G systems) and/or is used for baseband processing, and/or is implemented in access points or handsets, and/or is used for wireless communication.
According to some embodiments, the processing resources (processing circuitry) 11, memory resources (memory circuitry) 12 and interfaces (interface circuitry) 13 function as acquiring means implementing step S101, calculating means implementing step S102, checking means and returning means, implementing step S102, and deciding means implementing step S104.
The memory resources (memory circuitry) 12 may store a program that is assumed to include program instructions that, when executed by the processing resources (processing circuitry) 11, enable the electronic device to operate in accordance with the exemplary embodiments. In general, the various embodiments may be implemented in hardware or special purpose circuits, software (a computer program comprising computer readable instructions, computer readable instructions embodied on a computer readable medium e.g. comprising a nontransitory computer readable medium), logic or any combination thereof.
Further, as used in this application, the term “circuitry” refers to all of the following:
(a) hardwareonly circuit implementations (such as implementations in only analog and/or digital circuitry) and
(b) to combinations of circuits and software (and/or firmware), such as (as applicable): (i) to a combination of processor(s) or (ii) to portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions) and
(c) to circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.
This definition of “circuitry” applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term “circuitry” would also cover an implementation of merely a processor (or multiple processors) or portion of a processor and its (or their) accompanying software and/or firmware. The term “circuitry” would also cover, for example and if applicable to the particular claim element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in server, a cellular network device, or other network device.
Now reference is made to
It is noted that, as for other offtheshelf iterative schemes, GSbased implementation is observed, in the massive MIMO setup simulations, to converge faster than a conjugate gradient leastsquares (CGLS) based scheme, thus it was chosen as the method of reference for iterative schemes in the comparison.
On the first row of the table there is the exact Cholesky implementation with its cubic order number of complex multiplications and additions. Note that this number sums up the operations for the 3 stages of the Cholesky scheme (decomposition, forward substitution and backward substitution).
Next, there is the GS iterative scheme with t_{max }iterations and a 0vector initialization. This setup has the least number of complex multiplications or additions per iteration which is K squared. The calculated initialization of GS naturally entails some overhead in operations, but also speeds up its convergence. In the next two rows there is the required number of complex multiplications and additions needed for the proposed iterative scheme with a naïve 0 initialization and the calculated initialization. Both are quadratic with the number of transmitting antennas or users K.
In order to evaluate and compare the HW efficiency of the proposed scheme to its alternatives, in the following analysis it is assumed that:
 Each complex multiplication yields 4 real multiplications and 2 real additions;
 Each complex addition boils down to 2 real additions; and
 The number of FLOPs is defined as the sum of the required number of real multiplications and additions.
A massive MIMO uplink with LMMSE detection in a Rayleigh channel where each user'"'"'s information data is encoded with a rate 1/2 LDPC code and modulated using a 64QAM constellation has been simulated. A perfect channel estimation at the receiver side is assumed and it is focused on the operational point of 10% BLER. In the comparison, only approximate detector setups are considered which yield less than 0.3 dB degradation at the 10% BLER reference point w.r.t. the exact LMMSE.
For evaluating the HW efficiency, the product out of the number of FLOPs and the latency is of interest, which is, again, also proportional to the ratio between the required silicon area (viz., number of lookup table (LUT) in field programmable gate array (FPGA)) for implementation and the supported throughput. Hence, the lesser this product, the more efficient is the HW required by the corresponding LMMSE implementation. That is, for a given silicon area higher throughput is supported or alternatively for a given throughput rate a lower silicon area is required.
First Simulation Case: Tx8Rx64
The first simulation setup examined is for the case of K=8 transmitting users or streams (transmitting units) and M=64 receiving antennas (receiving units). Simulation results are shown in
The HW efficiency is evaluated via the following exposition: The yaxis in
The black circle point denotes the latency and FLOPs required by the Cholesky exact LMMSE implementation. As stated in the previous MAC operations table, the number of FLOPs is cubic with the number of users, and also the latency is relatively large, since the Cholesky solver is based on a pipeline of 3 different consecutive stages. Next to each approximate scheme marker in
The most efficient scheme observed is the proposed iterative detector with a simple zero initialization (represented by “Δ”), which yields a performance loss of only 0.1 dB compared to the exact LMMSE performance. Note that this is achieved after only 1 iteration. The 2nd best HW efficient detector is the GS with the calculated initialization (the black circle), which yields a 0.3 dB degradation after 1 iteration. Then comes the GS with naïve 0 initialization (black square) at the cost of 0.3 dB, and next in line is the proposed detector with the calculated initialization, at a cost as minimal as 0.05 dB. The 5th efficient scheme is the proposed scheme with 0 initialization but allowing 2 iterations, which yield practically the same performance as the exact implementation. Next, is the GS with 0 initialization and 3 iterations (−0.05 dB), and finally the GS detector with 2 iterations and a calculated init (−0.1 dB).
The dashed lines in the figure illustrate the evolution in the FLOPsLatency plain of the different scheme when one increases or reduces the number of allowed iterations. Please note that in this particular massive MIMO configuration, a method can require more FLOPs (even than that of the 100% corresponding to the exact scheme), but still be considered as more HW efficient since its pipeline latency may be shorter, thus potentially enable more throughput.
To summarize and enumerate the plot shown in
As shown in
Second Simulation Case: Tx32Rx256
Now a massive MIMO configuration is considered which comprises the same load but more transmitting users, namely Tx32Rx256. Simulation results are shown in
Again, the proposed scheme is observed as the most efficient at the expense of only 0.1 dB degradation, with GS (calculated initialization) after 1 iteration as the 2nd best (but −0.25 dB loss). Note also that for K=32 users, as opposed to the previous K=8 users case, all the examined methods, achieving up to 0.3 dB performance loss, still exhibit a total number of FLOPs reduction, that is all the markers reside below the 100% yaxis line.
The table shown in
The proposed detector for 32 transmitters vs. 256 receiving antennas configuration, when compared to the exact alternative, reduces the required area by 70%, increases the supported throughput by 200% and overall enables a 10× improvement in the HW efficiency at the cost of only 0.1 dB SNR loss at the working point. Compared to the 2nd best scheme, 45% less FLOPs, or area, the same supported throughput and 2× the efficiency are achieved, including a 0.15 dB gain in SNR.
Third Simulation Case: Tx32Rx128
In the third examined setup being simulated, the number of transmitters is kept (that is K=32), but the load is doubled, that is having only M=128 receiving antennas. Simulation results are shown in
Again, the proposed detector with a naïve initialization yields the most efficient implementation followed in this case by a 0init GS which allows 3 iterations.
The table shown in
As shown in
It is to be understood that the above description is illustrative of the invention and is not to be construed as limiting the invention. Various modifications and applications may occur to those skilled in the art without departing from the true spirit and scope of the invention as defined by the appended claims.