MECHANISMS TO IMPROVE DATA LOCALITY FOR DISTRIBUTED GPUS
Abstract
Systems, apparatuses, and methods for implementing mechanisms to improve data locality for distributed processing units are disclosed. A system includes a plurality of distributed processing units (e.g., GPUs) and memory devices. Each processing unit is coupled to one or more local memory devices. The system determines how to partition a workload into a plurality of workgroups based on maximizing data locality and data sharing. The system determines which subset of the plurality of workgroups to dispatch to each processing unit of the plurality of processing units based on maximizing local memory accesses and minimizing remote memory accesses. The system also determines how to partition data buffer(s) based on data sharing patterns of the workgroups. The system maps to each processing unit a separate portion of the data buffer(s) so as to maximize local memory accesses and minimize remote memory accesses.
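The placement decision described in the abstract can be illustrated with a small sketch. This is not the patented mechanism; it is a minimal greedy heuristic under stated assumptions: each processing unit has exactly one local memory device, the per-workgroup access pattern is known in advance, and load balance is ignored. The function names (`map_partitions`, `dispatch_workgroups`, `count_remote`) and the sample workload are hypothetical.

```python
from collections import Counter

def map_partitions(num_partitions, num_units):
    """Map contiguous blocks of data partitions to each unit's local memory."""
    per_unit = -(-num_partitions // num_units)  # ceiling division
    return {p: p // per_unit for p in range(num_partitions)}

def dispatch_workgroups(accesses, partition_home, num_units):
    """Send each workgroup to the unit whose local memory holds most of its data."""
    placement = {}
    for wg, parts in accesses.items():
        votes = Counter(partition_home[p] for p in parts)
        # The unit with the most local accesses wins; ties go to the lowest unit id.
        placement[wg] = max(range(num_units), key=lambda u: (votes[u], -u))
    return placement

def count_remote(accesses, partition_home, placement):
    """Count accesses that touch a memory device not local to the executing unit."""
    return sum(
        1
        for wg, parts in accesses.items()
        for p in parts
        if partition_home[p] != placement[wg]
    )

# Hypothetical workload: 8 workgroups on 4 units; workgroup i reads data
# partition i and its right-hand neighbour (clamped at the last partition).
NUM_UNITS = 4
accesses = {wg: [wg, min(wg + 1, 7)] for wg in range(8)}

home = map_partitions(8, NUM_UNITS)
locality_aware = dispatch_workgroups(accesses, home, NUM_UNITS)
round_robin = {wg: wg % NUM_UNITS for wg in accesses}  # locality-oblivious baseline

remote_locality_aware = count_remote(accesses, home, locality_aware)
remote_round_robin = count_remote(accesses, home, round_robin)
print(remote_locality_aware, remote_round_robin)
```

On this toy workload the locality-aware dispatch leaves 3 non-local accesses, against 11 for round-robin dispatch over the same data mapping. A real system would also have to weigh load balance and would typically derive the access pattern from the kernel rather than receive it as input.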
20 Claims
1. A system comprising:
a plurality of memory devices; and
a plurality of processing units, wherein each processing unit of the plurality of processing units is coupled to one or more local memory devices of the plurality of memory devices;
wherein the system is configured to:
partition a workload into a plurality of workgroups;
partition one or more data buffers into a plurality of data partitions; and
determine how to dispatch workgroups to the plurality of processing units and map data partitions to the plurality of memory devices based on minimizing accesses to non-local memory devices.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method comprising:
partitioning a workload into a plurality of workgroups;
partitioning one or more data buffers into a plurality of data partitions; and
determining how to dispatch workgroups to a plurality of processing units and map data partitions to the local memory devices of the plurality of processing units based on minimizing non-local memory accesses.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A non-transitory computer readable storage medium storing program instructions, wherein the program instructions are executable by a processor to:
partition a workload into a plurality of workgroups;
partition one or more data buffers into a plurality of data partitions; and
determine how to dispatch workgroups to a plurality of processing units and map data partitions to the local memory devices of the plurality of processing units based on minimizing non-local memory accesses.
Dependent claims: 16, 17, 18, 19, 20
Specification