EFFICIENT MATRIX MULTIPLICATION ON A PARALLEL PROCESSING DEVICE
First Claim
1. A method for mapping one or more cooperative thread arrays (CTA) to different tiles of a result matrix to perform a matrix multiplication operation, the method comprising:
defining a tile size;
dividing the result matrix into one or more tiles based on the tile size;
determining a CTA size;
creating a CTA for each tile;
defining a CTA grid, wherein each tile is associated with a different location within the CTA grid; and
issuing a first CTA.
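The steps recited in this claim can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions, not the patented implementation: the function names (make_cta_grid, run_cta, tiled_matmul) are hypothetical, tiles are assumed square, the CTA size is assumed to be one thread per tile element, and CTAs are "issued" one after another in a loop rather than concurrently on streaming multiprocessors.

```python
import math


def make_cta_grid(rows, cols, tile_size):
    """Divide a rows x cols result matrix into square tiles and return the
    grid dimensions (one grid location per tile). Assumed helper."""
    grid_rows = math.ceil(rows / tile_size)
    grid_cols = math.ceil(cols / tile_size)
    return grid_rows, grid_cols


def run_cta(cta_row, cta_col, tile_size, a, b, c):
    """Simulate one CTA: compute the result tile at grid location
    (cta_row, cta_col). Each (ty, tx) pair stands in for one thread."""
    rows, cols, inner = len(a), len(b[0]), len(b)
    for ty in range(tile_size):
        for tx in range(tile_size):
            i = cta_row * tile_size + ty  # result row handled by this thread
            j = cta_col * tile_size + tx  # result column handled by this thread
            if i < rows and j < cols:  # skip threads that fall outside a partial edge tile
                c[i][j] = sum(a[i][k] * b[k][j] for k in range(inner))


def tiled_matmul(a, b, tile_size):
    """Create one simulated CTA per result tile and issue them in grid order."""
    rows, cols = len(a), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    grid_rows, grid_cols = make_cta_grid(rows, cols, tile_size)
    for cta_row in range(grid_rows):
        for cta_col in range(grid_cols):
            run_cta(cta_row, cta_col, tile_size, a, b, c)
    return c
```

For example, tiled_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]], 2) covers the 2 x 2 result with a single tile, while a 5 x 3 result with the same tile size would need a 3 x 2 grid with partially filled edge tiles.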
Abstract
The present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations. The present invention advantageously enables result matrix elements to be computed on a tile-by-tile basis using multiple CTAs executing concurrently on different streaming multiprocessors, enables source tiles to be copied to local memory to reduce the number of accesses to the global memory when computing a result tile, and enables coalesced read operations from the global memory as well as write operations to the local memory without bank conflicts.
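The benefit of copying source tiles to local memory can be made concrete by counting global memory reads. The sketch below is an illustrative model, not a measurement of any real device: the function name count_global_reads is hypothetical, and the model assumes square n x n matrices, a tile size that divides n evenly, and no hardware caching.

```python
def count_global_reads(n, tile_size, use_local_copy):
    """Count source-element reads from global memory for an n x n product
    computed tile by tile (simplified model, no caching assumed)."""
    assert n % tile_size == 0, "this sketch assumes the tile size divides n"
    tiles_per_side = n // tile_size
    reads = 0
    for _tile_row in range(tiles_per_side):
        for _tile_col in range(tiles_per_side):
            # Walk along the inner dimension one pair of source tiles at a time.
            for _step in range(tiles_per_side):
                if use_local_copy:
                    # Each element of the two source tiles is read from global
                    # memory once and then reused from local memory.
                    reads += 2 * tile_size * tile_size
                else:
                    # Each of the tile_size^2 result elements reads its own
                    # tile_size-long row slice and column slice from global memory.
                    reads += 2 * tile_size * tile_size * tile_size
    return reads
```

Under this model the total drops from 2n^3 reads to 2n^3 / tile_size, so a 16 x 16 tile cuts global reads by a factor of 16.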
20 Claims
1. A method for mapping one or more cooperative thread arrays (CTA) to different tiles of a result matrix to perform a matrix multiplication operation, the method comprising:
defining a tile size;
dividing the result matrix into one or more tiles based on the tile size;
determining a CTA size;
creating a CTA for each tile;
defining a CTA grid, wherein each tile is associated with a different location within the CTA grid; and
issuing a first CTA.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system for mapping a plurality of cooperative thread arrays (CTA) to different tiles of a result matrix to perform a matrix multiplication operation, the system comprising:
one or more memories configured to store one or more software processes; and
a processor coupled to the one or more memories and including:
one or more processing units, each processing unit configured to execute one or more CTAs, and
CTA issue logic coupled to the one or more processing units,
wherein the one or more software processes are configured to:
define a tile size,
divide the result matrix into one or more tiles based on the tile size,
determine a CTA size,
create a CTA for each tile, and
define a CTA grid, wherein each tile is associated with a different location within the CTA grid, and
wherein the CTA issue logic is configured to issue a first CTA.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
Specification