HARDWARE ACCELERATOR ARCHITECTURE AND TEMPLATE FOR WEB-SCALE K-MEANS CLUSTERING
Abstract
Hardware accelerator architectures for clustering are described. A hardware accelerator includes sparse tiles and very/hyper sparse tiles. The sparse tile(s) execute operations for a clustering task involving a matrix. Each sparse tile includes a first plurality of processing units to operate upon a first plurality of blocks of the matrix that have been streamed to one or more random access memories of the sparse tiles over a high bandwidth interface from a first memory unit. Each of the very/hyper sparse tiles is also to execute operations for the clustering task involving the matrix. Each of the very/hyper sparse tiles includes a second plurality of processing units to operate upon a second plurality of blocks of the matrix that have been randomly accessed over a low-latency interface from a second memory unit.
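The division of work described in the abstract can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the claimed hardware: the density rule in `split_blocks`, the `density_threshold` and `block_size` parameters, and all function names are hypothetical, since the abstract does not say how blocks of the matrix are apportioned between the two tile types. The model runs the k-means assignment step (nearest centroid per row) over both groups of blocks; one group stands in for blocks streamed to the sparse tiles, the other for blocks randomly accessed by the very/hyper sparse tiles.

```python
from typing import Dict, List, Tuple

SparseRow = Dict[int, float]  # column index -> nonzero value


def split_blocks(rows: List[SparseRow], num_cols: int, block_size: int,
                 density_threshold: float) -> Tuple[List[List[int]], List[List[int]]]:
    """Group row indices into blocks and partition the blocks by density.

    Assumption: denser blocks go to the streamed ("sparse tile") path and
    the rest go to the random-access ("very/hyper sparse tile") path.
    """
    streamed, random_access = [], []
    for start in range(0, len(rows), block_size):
        block = list(range(start, min(start + block_size, len(rows))))
        nnz = sum(len(rows[i]) for i in block)
        density = nnz / (len(block) * num_cols)
        if density >= density_threshold:
            streamed.append(block)
        else:
            random_access.append(block)
    return streamed, random_access


def sq_dist(row: SparseRow, centroid: List[float]) -> float:
    """Squared Euclidean distance that only touches the row's nonzeros."""
    dist = sum(c * c for c in centroid)  # contribution if the row were all zeros
    for j, v in row.items():
        dist += (v - centroid[j]) ** 2 - centroid[j] ** 2
    return dist


def assign_blocks(rows: List[SparseRow], centroids: List[List[float]],
                  blocks: List[List[int]]) -> Dict[int, int]:
    """Nearest-centroid label for every row in the given blocks."""
    labels = {}
    for block in blocks:
        for i in block:
            labels[i] = min(range(len(centroids)),
                            key=lambda k: sq_dist(rows[i], centroids[k]))
    return labels


def kmeans_assign(rows: List[SparseRow], centroids: List[List[float]],
                  num_cols: int, block_size: int = 2,
                  density_threshold: float = 0.5) -> List[int]:
    """One k-means assignment step over both groups of blocks."""
    streamed, random_access = split_blocks(rows, num_cols, block_size,
                                           density_threshold)
    labels = assign_blocks(rows, centroids, streamed)               # models sparse tiles
    labels.update(assign_blocks(rows, centroids, random_access))    # models very/hyper sparse tiles
    return [labels[i] for i in range(len(rows))]


rows = [{0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}, {0: 1.0, 1: 0.9, 2: 1.1}, {3: 5.0}, {}]
centroids = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 5.0]]
print(kmeans_assign(rows, centroids, num_cols=4))
```

In this toy input the first block (rows 0 and 1) is dense enough to take the streamed path, while the second block (rows 2 and 3) takes the random-access path. In software both paths run the same loop; the distinction only models which memory interface would feed the block.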
66 Citations
20 Claims
1. A hardware accelerator comprising:
- one or more sparse tiles to execute operations for a clustering task involving a matrix, each of the sparse tiles comprising a first plurality of processing units to operate upon a first plurality of blocks of the matrix that have been streamed to one or more random access memories of the one or more sparse tiles over a high bandwidth interface from a first memory unit; and
- one or more very/hyper sparse tiles to execute operations for the clustering task involving the matrix, each of the very/hyper sparse tiles comprising a second plurality of processing units to operate upon a second plurality of blocks of the matrix that have been randomly accessed over a low-latency interface from a second memory unit.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10.
11. A method in a hardware accelerator for efficiently executing clustering comprising:
- executing, by one or more sparse tiles of the hardware accelerator, operations for a clustering task involving a matrix, each of the sparse tiles comprising a first plurality of processing units to operate upon a first plurality of blocks of the matrix that have been streamed to one or more random access memories of the one or more sparse tiles over a high bandwidth interface from a first memory unit; and
- executing, by one or more very/hyper sparse tiles of the hardware accelerator, operations for the clustering task involving the matrix, each of the very/hyper sparse tiles comprising a second plurality of processing units to operate upon a second plurality of blocks of the matrix that have been randomly accessed over a low-latency interface from a second memory unit.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19.
20. A system comprising:
- a first memory unit;
- a second memory unit;
- one or more sparse tiles to execute operations for a clustering task involving a matrix, each of the sparse tiles comprising a first plurality of processing units to operate upon a first plurality of blocks of the matrix that have been streamed to one or more random access memories of the one or more sparse tiles over a high bandwidth interface from the first memory unit; and
- one or more very/hyper sparse tiles to execute operations for the clustering task involving the matrix, each of the very/hyper sparse tiles comprising a second plurality of processing units to operate upon a second plurality of blocks of the matrix that have been randomly accessed over a low-latency interface from the second memory unit.
Specification