Parallel deblocking filter for H.264 video codec
Abstract
A process and apparatus for implementing parallelization in the deblocking filter used in an H.264 codec are disclosed. In the preferred embodiment, the process is carried out on a parallel architecture consisting of a plurality of groups, each consisting of eight clusters, wherein each cluster is a separate processor capable of tensor operations in SIMD or MIMD mode on 4×4 matrix data. All eight clusters of one group are used to simultaneously deblock both luma and chroma vertical and horizontal edges between 4×4 blocks of pixels in a macroblock in a total of eight iterations, making the best use of the data dependency between the edges. Processes to deblock these same luma and chroma edges in more iterations on four-cluster and two-cluster parallel architectures are also disclosed. A comparison of the maximum parallelization achievable with the invention and the amount of parallelization with various species within the prior art is also disclosed.
38 Claims
-
1. A process for carrying out the deblocking filter defined by the H.264 video coding standard, operating by simultaneously deblocking edges in a luma macroblock and Cb and Cr chroma macroblocks, wherein an edge is a boundary between two blocks, and vertical edge filtering refers to changing the pixels in the blocks on the left and the right of the edge, and horizontal edge filtering refers to changing the pixels in the blocks above and below the edge, using a parallel processing architecture computer having a plurality of computational units (hereafter called clusters);
-
wherein the vertical luma edges form the first set of edges, the horizontal luma edges form the second set of edges, the vertical Cb chroma edges form the third set of edges, the horizontal Cb chroma edges form the fourth set of edges, the vertical Cr chroma edges form the fifth set of edges, the horizontal Cr chroma edges form the sixth set of edges; and wherein the processing of each set of edges is carried out on a plurality of computational units referred to as a set of clusters, in a set of iterations determined by the data dependency within the set of edges and with other sets of edges, such that the first set of edges is processed on the first set of clusters in the first set of iterations, and so on for the rest of the sets of edges, mutatis mutandis; and wherein said sets of clusters and sets of iterations may be partially or completely overlapping or completely disjoint, wherein overlap of the sets of iterations implies simultaneous processing of parts or entire sets of edges, and overlap of the sets of clusters implies that processing of different parts of sets of edges is allocated to the same computational units. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A process for deblocking in a parallel architecture computer, comprising the steps:
-
A) determining data independent calculations in the deblocking process for the vertical edges of a macroblock in a first column of blocks of said macroblock; B) loading blocks of pixels needed for at least some of the data independent calculations determined in step A into clusters of a parallel architecture computer, wherein the number of blocks that can be processed on each iteration is bounded by the number of available clusters and the data dependency; C) simultaneously calculating deblocking of said blocks in all clusters loaded in step B and storing said filtered pixel values; D) repeating steps A through D in multiple iterations until all vertical edges in all columns of a macroblock have been deblocked; E) if any clusters of said parallel architecture are unused during said iterations of deblocking of said vertical edges, determining ripeness of horizontal edges for deblocking by determining if any horizontal edge or edges in said macroblock can be deblocked because final values for pixels in the blocks that define said horizontal edge have been deblocked for the last time during deblocking of said macroblock, and if it is determined that one or more horizontal edges of said macroblock are ripe for deblocking, loading the blocks needed to deblock said horizontal edge or edges into unused clusters and deblocking said horizontal edge or edges simultaneously with deblocking of said vertical edges; and F) continuing to use clusters of said parallel architecture computer in a plurality of iterations to deblock filter horizontal edges of said macroblock as said blocks have their pixel values finally adjusted during the process of deblocking said vertical edges and continuing said iterations until all said horizontal edges have been deblocked.
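Steps A through F can be illustrated with a small greedy scheduler for the luma edges of one macroblock. This is only a sketch under a simplified dependency model that is assumed here, not taken from the claim: a block's pixels count as final for vertical filtering once the vertical edges on its left and (within the macroblock) on its right are done, neighbouring macroblocks are treated as already final, and horizontal edges in one column of blocks are filtered top to bottom. The function name `schedule_luma` and the `(row, col)` edge labels are hypothetical.

```python
def schedule_luma(num_clusters=8, n=4):
    """Return one (vertical_edges, horizontal_edges) pair per iteration.

    Edge (r, c) denotes the left edge (vertical) or top edge (horizontal)
    of the 4x4 block in row r, column c of an n x n grid of blocks.
    """
    v_done, h_done = set(), set()
    plan = []
    while len(h_done) < n * n:
        # Steps A-D: vertical edges whose left-hand neighbour edge is finished.
        v_ready = [(r, c) for c in range(n) for r in range(n)
                   if (r, c) not in v_done and (c == 0 or (r, c - 1) in v_done)]
        v_now = v_ready[:num_clusters]

        def ripe(r, c):
            # Step E: block (r, c) has received its last vertical filtering,
            # and the horizontal edge above it in the same column is done.
            final = (r, c) in v_done and (c == n - 1 or (r, c + 1) in v_done)
            above_ok = r == 0 or (r - 1, c) in h_done
            return final and above_ok and (r, c) not in h_done

        # Steps E-F: fill clusters left idle by the vertical edges.
        h_ready = [(r, c) for r in range(n) for c in range(n) if ripe(r, c)]
        h_now = h_ready[:num_clusters - len(v_now)]

        # Readiness was judged on the state before this iteration, so the
        # edges chosen above can be processed simultaneously.
        v_done.update(v_now)
        h_done.update(h_now)
        plan.append((v_now, h_now))
    return plan


counts = [(len(v), len(h)) for v, h in schedule_luma(num_clusters=8)]
print(counts)
```

With eight clusters, this model finishes the luma edges in eight iterations, with horizontal edges first becoming ripe in the third iteration.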
-
-
10. A computer-readable medium having stored therein computer-readable instructions which, when executed by a computer, cause said computer to carry out the following process:
-
A) determining data independent calculations in the deblocking process for the vertical edges of a macroblock in a first column of blocks of said macroblock; B) loading blocks of pixels needed for at least some of the data independent calculations determined in step A into clusters of a parallel architecture computer, wherein the number of blocks that can be processed on each iteration is bounded by the number of available clusters and the data dependency; C) simultaneously calculating deblocking of said blocks in all clusters loaded in step B and storing said filtered pixel values; D) repeating steps A through D in multiple iterations until all vertical edges in all columns of a macroblock have been deblocked; E) if any clusters of said parallel architecture are unused during said iterations of deblocking of said vertical edges, determining ripeness of horizontal edges for deblocking by determining if any horizontal edge or edges in said macroblock can be deblocked because final values for pixels in the blocks that define said horizontal edge have been deblocked for the last time during deblocking of said macroblock, and if it is determined that one or more horizontal edges of said macroblock are ripe for deblocking, loading the blocks needed to deblock said horizontal edge or edges into unused clusters and deblocking said horizontal edge or edges simultaneously with deblocking of said vertical edges; and F) continuing to use clusters of said parallel architecture computer in a plurality of iterations to deblock filter horizontal edges of said macroblock as said blocks have their pixel values finally adjusted during the process of deblocking said vertical edges and continuing said iterations until said horizontal edges have been deblocked.
-
-
11. A computer programmed to perform deblocking of horizontal and vertical edges of a luma macroblock, said computer having a plurality of calculation clusters, comprising:
-
first means for deblocking all the vertical edges in a luma macroblock in multiple iterations using a plurality of computing clusters of a parallel architecture computer, said first means for deblocking the columns of vertical edges in the order determined by their data dependencies with at least some of the data independent vertical edges in each iteration being simultaneously deblocked using multiple ones of said clusters operated simultaneously; and
second means for deblocking all the horizontal edges in a luma macroblock in multiple iterations using a plurality of computing clusters of a parallel architecture computer, said second means for deblocking the columns of horizontal edges in the order determined by their data dependencies with at least some of the data independent vertical edges in each iteration being simultaneously deblocked using multiple ones of said clusters operated simultaneously. - View Dependent Claims (12)
-
-
13. A process to calculate filtered pixel values to deblock an edge, using a computer capable of performing mathematical tensor operations to perform column-wise, row-wise and element-wise multiplication of 4×4 matrices, wherein one or more matrices of weights defined in the H.264 video coding standard is multiplied by one or more matrices of luma or chroma pixel values of a first block and a second block which define an edge between them which is to be deblocked, and the matrix multiplication results are combined so as to derive the deblocking filter output for said first and second blocks. - View Dependent Claims (14, 15)
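The idea of multiplying weight matrices by 4×4 pixel matrices and combining the products can be sketched as follows. This is an illustrative arrangement, not the matrices of the patent (which are not given in this section): the weights are the taps of the H.264 strong luma filter for the left-hand block, scaled to a common divisor of 8, and the names `W_P`, `W_Q`, `matmul` and `strong_filter_left` are hypothetical. The filter on/off decisions based on the alpha and beta thresholds are deliberately left out.

```python
def matmul(a, b):
    """Plain 4x4 (or any compatible size) integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Rows = input pixel, columns = output pixel (p3', p2', p1', p0').
# Each row of the left block is ordered [p3, p2, p1, p0] (edge on the right);
# each row of the right block is ordered [q0, q1, q2, q3] (edge on the left).
W_P = [
    [8, 2, 0, 0],  # contribution of p3
    [0, 3, 2, 1],  # contribution of p2
    [0, 1, 2, 2],  # contribution of p1
    [0, 1, 2, 2],  # contribution of p0
]
W_Q = [
    [0, 1, 2, 2],  # contribution of q0
    [0, 0, 0, 1],  # contribution of q1
    [0, 0, 0, 0],  # q2 is not used for the left block
    [0, 0, 0, 0],  # q3 is not used for the left block
]

def strong_filter_left(P, Q):
    """Filter all four rows of the left block at once: (P*W_P + Q*W_Q + 4) >> 3."""
    from_p = matmul(P, W_P)
    from_q = matmul(Q, W_Q)
    return [[(x + y + 4) >> 3 for x, y in zip(row_p, row_q)]
            for row_p, row_q in zip(from_p, from_q)]

P = [[100, 100, 100, 100] for _ in range(4)]  # left block, flat at 100
Q = [[110, 110, 110, 110] for _ in range(4)]  # right block, flat at 110
filtered = strong_filter_left(P, Q)
print(filtered[0])
```

Each output row ramps from 100 towards the edge, which is the smoothing expected across a step of 10 between the two blocks.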
-
16. A process for deblocking an edge defined by two 4×4 blocks of pixels comprising:
-
performing a long filtering process and a short filtering process on each row of pixels defined by said two 4×4 blocks; for each row of a deblocked 4×4 output block corresponding to each of said two 4×4 blocks of pixels which are input to the deblocking process, selecting either the results calculated by said long filter or the results calculated by said short filter or the original pixel data of the row based upon predetermined selection criteria. - View Dependent Claims (17, 18, 19, 20)
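The compute-both-then-select structure of claim 16 can be sketched for one row of the left-hand block. The sketch assumes, without confirmation from this section, that the "long" and "short" filters correspond to the two branches of the H.264 strong (bS = 4) luma filter and that the selection criteria are the standard's alpha and beta threshold tests. The function name `filter_row_p_side` is hypothetical.

```python
def filter_row_p_side(p, q, alpha, beta):
    """Filter one row of the left block.

    p = [p0, p1, p2, p3] and q = [q0, q1, q2, q3], both counted away
    from the edge. Returns the new [p0, p1, p2, p3].
    """
    p0, p1, p2, p3 = p
    q0, q1 = q[0], q[1]

    # Both candidate results are computed unconditionally, so the two
    # filters could run in parallel before any decision is made.
    long_out = [
        (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3,
        (p2 + p1 + p0 + q0 + 2) >> 2,
        (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3,
        p3,
    ]
    short_out = [(2 * p1 + p0 + q1 + 2) >> 2, p1, p2, p3]

    # Selection criteria (assumed to be the standard's threshold tests).
    edge_active = (abs(p0 - q0) < alpha and abs(p1 - p0) < beta
                   and abs(q1 - q0) < beta)
    use_long = abs(p2 - p0) < beta and abs(p0 - q0) < (alpha >> 2) + 2

    if not edge_active:
        return list(p)  # keep the original pixel data of the row
    return long_out if use_long else short_out


row_long = filter_row_p_side([100] * 4, [110] * 4, alpha=40, beta=5)
row_short = filter_row_p_side([100] * 4, [110] * 4, alpha=20, beta=5)
row_kept = filter_row_p_side([100] * 4, [200] * 4, alpha=40, beta=5)
print(row_long, row_short, row_kept)
```

The three calls exercise the three possible selections: long result, short result, and unmodified row.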
-
21. An apparatus comprising:
-
first means for simultaneously deblocking a first plurality of vertical luma edges and a first plurality of Cb and Cr chroma edges in a first two iterations, all edges being deblocked in the order required by data dependency; second means for simultaneously deblocking a second plurality of luma vertical edges and a first plurality of horizontal luma edges during third and fourth iterations, all edges being deblocked in the order required by data dependency; third means for simultaneously deblocking a first plurality of Cb and Cr chroma horizontal edges and a second plurality of horizontal luma edges during fifth and sixth iterations, all said edges being deblocked in the order required by data dependency; fourth means for simultaneously deblocking a third plurality of horizontal luma edges during seventh and eighth iterations, all edges being deblocked in the order required by data dependency.
-
-
22. A process comprising:
-
simultaneously deblocking a first plurality of vertical luma edges and a first plurality of Cb and Cr chroma edges in a first two iterations, all edges being deblocked in the order required by data dependency; simultaneously deblocking a second plurality of luma vertical edges and a first plurality of horizontal luma edges during third and fourth iterations, all edges being deblocked in the order required by data dependency; simultaneously deblocking a first plurality of Cb and Cr chroma horizontal edges and a second plurality of horizontal luma edges during fifth and sixth iterations, all said edges being deblocked in the order required by data dependency; simultaneously deblocking a third plurality of horizontal luma edges during seventh and eighth iterations, all edges being deblocked in the order required by data dependency.
-
-
23. A computer-readable medium having stored thereon computer-readable instructions which, when executed by one or more computational units, cause said one or more computational units to perform the following process:
-
simultaneously deblocking a first plurality of vertical luma edges and a first plurality of Cb and Cr chroma edges in a first two iterations, all edges being deblocked in the order required by data dependency; simultaneously deblocking a second plurality of luma vertical edges and a first plurality of horizontal luma edges during third and fourth iterations, all edges being deblocked in the order required by data dependency; simultaneously deblocking a first plurality of Cb and Cr chroma horizontal edges and a second plurality of horizontal luma edges during fifth and sixth iterations, all said edges being deblocked in the order required by data dependency; simultaneously deblocking a third plurality of horizontal luma edges during seventh and eighth iterations, all edges being deblocked in the order required by data dependency.
-
-
24. An apparatus for deblocking an edge defined by a left block of pixels and a right block of pixels comprising:
-
means for doing the long filter and short filter deblocking calculations simultaneously to deblock a left block of pixels; and means for doing the long filter and short filter deblocking calculations simultaneously to deblock a right block of pixels. - View Dependent Claims (25, 26)
-
-
27. A process for deblocking an edge defined by left and right blocks of pixels comprising the steps:
-
A) doing long filter and short filter deblocking calculations simultaneously to deblock a left block of pixels; and B) doing long filter and short filter deblocking calculations simultaneously to deblock a right block of pixels. - View Dependent Claims (28, 29)
-
-
30. A deblocking process carried out on a parallel architecture computer comprising a plurality of clusters, which are capable of operating simultaneously and independently of each other, wherein the deblocking process has three levels of parallelization, comprising simultaneously processing multiple edges of luma and/or chroma data defined by different blocks of the luma and/or chroma pixel data during predetermined iterations, with the number of edges being simultaneously processed and whether the edges are luma, chroma or both and whether the edges are horizontal or vertical determined by the number of clusters in said parallel architecture computer and by the inherent data dependency.
-
31. A deblocking process carried out in eight iterations on a parallel architecture computer comprising eight clusters, each of which is capable of operating simultaneously and independently of each other, wherein the deblocking has multiple levels of parallelization in that both horizontal and vertical edges are processed simultaneously during some iterations, both luma and chroma edges are processed simultaneously during some iterations and wherein multiple rows of pixels in each block are processed simultaneously, wherein said deblocking process comprises:
-
A) simultaneously processing multiple edges of luma and/or chroma data defined by different blocks of the luma and/or chroma pixel data during predetermined iterations, with the number of edges being simultaneously processed during any particular iteration and whether the edges are luma, chroma or both during any particular iteration and whether the edges are horizontal or vertical during any particular iteration being determined by the number of clusters in said parallel architecture computer and by the inherent data dependency; B) and wherein the horizontal and vertical luma and chroma edges are divided into 6 predetermined sets and wherein deblocking of set of edges 1 is carried out during iterations 1 through 4, the deblocking of set of edges 2 is carried out in iterations 3 through 8, the deblocking of set of edges 3 is carried out in iterations 1 and 2, the deblocking of set of edges 4 is carried out in iterations 5 and 6, the deblocking of set of edges 5 is carried out in iterations 1 and 2, and the deblocking of set of edges 6 is carried out in iterations 5 and 6, wherein when the iteration numbers of the various sets of edges overlap, it means the edges are being simultaneously deblocked.
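The set-to-iteration assignment in part B can be tabulated to see which sets are deblocked simultaneously. The following Python sketch is illustrative only: the names `SET_ITERATIONS` and `active_sets` are hypothetical, and the set labels in the comments assume that the numbering follows claim 1 (with the sixth set read as the horizontal Cr edges).

```python
# Inclusive iteration ranges per edge set, copied from part B of claim 31.
SET_ITERATIONS = {
    1: (1, 4),  # vertical luma (label assumed from claim 1)
    2: (3, 8),  # horizontal luma
    3: (1, 2),  # vertical Cb
    4: (5, 6),  # horizontal Cb
    5: (1, 2),  # vertical Cr
    6: (5, 6),  # horizontal Cr
}

def active_sets(iteration):
    """Return the sorted list of edge sets being deblocked in this iteration."""
    return sorted(s for s, (first, last) in SET_ITERATIONS.items()
                  if first <= iteration <= last)

timeline = {i: active_sets(i) for i in range(1, 9)}
for i, sets in timeline.items():
    print(f"iteration {i}: sets {sets}")
```

Any iteration that lists more than one set is an iteration in which those sets overlap, which the claim defines as simultaneous deblocking.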
-
-
32. A computer-readable medium having stored thereon computer-readable instructions which, when executed by a parallel processing computer having a plurality of computational units called clusters, cause said parallel processing computer to perform the following deblocking process to deblock all the vertical and horizontal luma and chroma edges of a macroblock in eight iterations, said process comprising:
-
1) simultaneously deblocking a predetermined plurality of vertical luma edges and multiple vertical Cb and Cr chroma edges on all eight clusters during a first two of eight iterations; and 2) simultaneously deblocking a plurality of luma vertical edges and one luma horizontal edge during a third iteration using five of said plurality of clusters; 3) simultaneously deblocking a plurality of luma vertical edges and two luma horizontal edges during a fourth iteration using six of said clusters; 4) simultaneously deblocking a plurality of Cb and Cr chroma horizontal edges and a plurality of luma horizontal edges during a fifth and sixth iteration using all eight clusters; 5) simultaneously deblocking a plurality of horizontal luma edges during a seventh iteration using three clusters; and 6) simultaneously deblocking a plurality of horizontal luma edges during an eighth iteration using two clusters; and wherein the order of deblocking of luma and chroma vertical and horizontal edges is determined by data dependency.
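The cluster counts recited in steps 1 through 6 can be summed to check how fully the eight clusters are used. The sketch below is a back-of-the-envelope check, not part of the claim: it assumes each busy cluster handles one edge between 4×4 blocks per iteration, and it assumes 4:2:0 sampling, so that the luma macroblock has 16 vertical and 16 horizontal block edges and each chroma macroblock has 4 of each. The variable names are hypothetical.

```python
CLUSTERS = 8

# Clusters in use during iterations 1..8, as recited in claim 32.
clusters_used = [8, 8, 5, 6, 8, 8, 3, 2]

busy_slots = sum(clusters_used)
utilization = busy_slots / (CLUSTERS * len(clusters_used))

# Assumed edge count for one 4:2:0 macroblock (block edges, not pixel rows):
luma_edges = 2 * (4 * 4)          # vertical + horizontal, 4x4 grid of blocks
chroma_edges = 2 * 2 * (2 * 2)    # Cb and Cr, vertical + horizontal, 2x2 grid
total_edges = luma_edges + chroma_edges

print(busy_slots, total_edges, utilization)
```

Under these assumptions the number of busy cluster slots equals the number of block edges in the macroblock, and the eight clusters are busy for three quarters of the available slots.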
-
-
33. A process for simultaneously deblocking luma and chroma macroblocks over a plurality of iterations using a parallel processing computer which has a plurality of computational units such that some are idle during some of said iterations, each computational unit optimized for tensor operations on matrix data, each computational unit called a cluster, said process deblocking a luma macroblock and Cb and Cr macroblocks simultaneously during a plurality of iterations, said process comprising the steps:
-
1) simultaneously deblocking a first set of vertical luma edges and a first set of vertical Cb and Cr chroma edges on a first set of clusters during a first set of said plurality of iterations; and 2) simultaneously deblocking a second set of vertical luma edges and a first set of one or more luma horizontal edges during a second set of one or more iterations using a second set of said plurality of clusters; 3) simultaneously deblocking a first set of Cb and Cr horizontal chroma edges and a second set of horizontal luma edges during a third set of iterations using a third set of said clusters; 4) simultaneously deblocking a third set of horizontal luma edges during a fourth set of iterations using a fourth set of clusters; and wherein the order of deblocking of luma and chroma vertical and horizontal edges is determined by data dependency.
-
-
34. A parallel processing architecture computer having a plurality of computational units called clusters, said computer programmed to perform the following process to deblock luma and chroma macroblocks of a video frame simultaneously over a plurality of iterations using a plurality of said clusters:
-
1) simultaneously deblocking a first set of vertical luma edges and a first set of vertical Cb and Cr chroma edges on a first set of clusters during a first set of said plurality of iterations; and 2) simultaneously deblocking a second set of vertical luma edges and a first set of one or more luma horizontal edges during a second set of one or more iterations using a second set of said plurality of clusters; 3) simultaneously deblocking a first set of Cb and Cr horizontal chroma edges and a second set of horizontal luma edges during a third set of iterations using a third set of said clusters; 4) simultaneously deblocking a third set of horizontal luma edges during a fourth set of iterations using a fourth set of clusters; and wherein the order of deblocking of luma and chroma vertical and horizontal edges is determined by data dependency.
-
-
35. A computer-readable medium having stored thereon computer-readable instructions which when executed by a parallel processing architecture computer having a plurality of computational units called clusters cause said clusters to perform the following process to deblock luma and chroma macroblocks of a video frame simultaneously over a plurality of iterations using a plurality of said clusters:
-
1) simultaneously deblocking a first set of vertical luma edges and a first set of vertical Cb and Cr chroma edges on a first set of clusters during a first set of said plurality of iterations; and 2) simultaneously deblocking a second set of vertical luma edges and a first set of one or more luma horizontal edges during a second set of one or more iterations using a second set of said plurality of clusters; 3) simultaneously deblocking a first set of Cb and Cr horizontal chroma edges and a second set of horizontal luma edges during a third set of iterations using a third set of said clusters; 4) simultaneously deblocking a third set of horizontal luma edges during a fourth set of iterations using a fourth set of clusters; and wherein the order of deblocking of luma and chroma vertical and horizontal edges is determined by data dependency.
-
-
36. An apparatus comprising:
-
first means for simultaneously deblocking a first set of vertical luma edges and a first set of vertical Cb and Cr chroma edges over a first set of iterations using a first set of clusters of a parallel processing architecture computer; second means for simultaneously deblocking a second set of vertical luma edges and a first set of horizontal luma edges over a second set of iterations using a second set of clusters; third means for simultaneously deblocking a first set of Cb and Cr horizontal chroma edges and a second set of horizontal luma edges over a third set of iterations using a third set of said clusters; fourth means for simultaneously deblocking a third set of horizontal luma edges over a fourth set of iterations using a fourth set of clusters; and wherein said first, second, third and fourth means are structured to deblock said vertical and horizontal luma and chroma edges in an order determined by data dependency.
-
-
37. A process to calculate filtered pixel values to deblock an edge, using a computer capable of performing mathematical tensor operations to perform column-wise, row-wise and element-wise multiplication of 4×4 matrices, wherein one or more matrices of weights defined in the H.264 video coding standard is multiplied by one or more matrices of luma or chroma pixel values of a first block and a second block which define an edge between them which is to be deblocked, and the matrix multiplication results are combined so as to derive the deblocking filter output for said first and second blocks.
-
38. A parallel processing computer having a plurality of computing clusters, and programmed to perform deblocking of vertical and horizontal edges of luma and/or chroma macroblocks, said program controlling said computer to perform the following process:
-
using one or more clusters of a parallel processing computer capable of performing matrix mathematical operations to multiply one or more matrices of weights defined in the H.264 video coding standard by one or more matrices of pixel values of a first block and a second block which define an edge between them which is to be deblocked, and to combine the matrix multiplication results so as to derive the deblocking filter output for said first and second blocks.
-
Specification