Method and apparatus for predicting GPU malfunctions
Abstract
A method of predicting GPU malfunctions includes installing a daemon program at a GPU node, where the daemon program collects GPU status parameters for the GPU node at intervals of a pre-determined time period. The method also includes obtaining the GPU status parameters from the GPU node and comparing them with mean status fault parameters to determine whether the GPU is about to malfunction, where the mean status fault parameters are obtained from a pre-configured statistical model. Before a GPU enters a malfunction state, the GPU can be replaced, or the programs executing on it can be migrated to other GPUs for execution, without affecting normal business operations.
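As a rough illustration of the collect-and-compare flow described in the abstract, the following Python sketch polls a status reader a fixed number of times and compares a reading against mean fault values. The function names, the `temperature_c` key, the injectable `sleep` argument, and the use of a plain arithmetic mean as the "statistical model" are all assumptions made for illustration; the abstract does not specify any of them.

```python
import time
from typing import Callable, Dict, List

Status = Dict[str, float]


def collect_status(read_status: Callable[[], Status],
                   interval_s: float,
                   samples: int,
                   sleep: Callable[[float], None] = time.sleep) -> List[Status]:
    """Poll read_status() `samples` times, pausing interval_s between polls."""
    readings = []
    for i in range(samples):
        readings.append(read_status())
        if i < samples - 1:
            sleep(interval_s)
    return readings


def mean_fault_parameters(fault_history: List[Status]) -> Status:
    """Assumed model: average each parameter over readings taken before past faults."""
    keys = fault_history[0].keys()
    return {k: sum(r[k] for r in fault_history) / len(fault_history) for k in keys}


def exceeds_fault_mean(current: Status, fault_mean: Status) -> List[str]:
    """Return the names of parameters whose current value is above the fault mean."""
    return [k for k, v in current.items() if k in fault_mean and v > fault_mean[k]]
```

In a real daemon, `read_status` would wrap whatever vendor tool or driver interface reports GPU status; here it is left as a caller-supplied function so the sketch stays independent of any particular hardware API.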
20 Claims
1. A method of predicting GPU malfunctions in a cluster of GPUs, the method comprising:
collecting a plurality of measurements of a condition of a first GPU in the cluster of GPUs during a pre-determined time period;
determining a GPU count that represents how many of the measurements of the condition exceeded a threshold value during the pre-determined time period, and obtaining a mean fault count;
determining a GPU standard deviation based on the plurality of measurements of the condition collected during the pre-determined time and a plurality of measurements of the condition previously collected from the first GPU;
detecting when the GPU count is greater than the mean fault count, and the GPU standard deviation is less than a fault standard deviation threshold; and
migrating an application program executing on the first GPU to a second GPU in the cluster of GPUs in response to detecting when the GPU count exceeds the mean fault count and the GPU standard deviation falls below the fault standard deviation threshold.
Dependent claims: 2, 3, 4, 5, 6, 7
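The detection condition recited in claim 1 can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: the function name `should_migrate`, the parameter names, and the choice of a population standard deviation (`statistics.pstdev`) over the combined current and previous measurements are assumptions, since the claim does not say how the standard deviation is computed.

```python
from statistics import pstdev
from typing import Sequence


def should_migrate(window: Sequence[float],
                   history: Sequence[float],
                   threshold: float,
                   mean_fault_count: float,
                   fault_std_threshold: float) -> bool:
    """Return True when both conditions of the claim hold.

    window  -- measurements collected during the current time period
    history -- measurements previously collected from the same GPU
    """
    # GPU count: how many measurements in the current period exceeded the threshold.
    gpu_count = sum(1 for m in window if m > threshold)

    # GPU standard deviation over current plus previous measurements
    # (population form is an assumption; the claim does not specify).
    gpu_std = pstdev(list(history) + list(window))

    # Migrate only when exceedances are frequent AND the readings are
    # consistently clustered, i.e. the deviation is below its threshold.
    return gpu_count > mean_fault_count and gpu_std < fault_std_threshold
```

A low standard deviation combined with a high exceedance count suggests the condition is persistently elevated rather than briefly spiking, which is one plausible reading of why the claim requires both tests.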
8. A method of predicting GPU malfunctions in a cluster of GPUs, the method comprising:
collecting a GPU usage duration of a first GPU in the cluster of GPUs during a pre-determined time period;
detecting when the GPU usage duration is greater than a fault usage duration; and
migrating an application program executing on the first GPU to a second GPU in the cluster of GPUs in response to detecting when the GPU usage duration is greater than the fault usage duration.
Dependent claims: 9, 10
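The usage-duration test of claim 8 reduces to a single comparison per GPU. The sketch below, in Python, applies it across a cluster and pairs each at-risk GPU with a migration target. The function name `plan_migrations`, the use of hours as the unit, and the policy of preferring the least-used healthy GPU as the target are assumptions for illustration; the claim only requires migration to "a second GPU".

```python
from typing import Dict


def plan_migrations(usage_hours: Dict[str, float],
                    fault_usage_hours: float) -> Dict[str, str]:
    """Map each GPU whose usage exceeds the fault duration to a target GPU."""
    # GPUs whose accumulated usage is greater than the fault usage duration.
    at_risk = sorted(g for g, h in usage_hours.items() if h > fault_usage_hours)

    # Remaining GPUs, least-used first (target selection policy is assumed).
    healthy = sorted(
        (g for g, h in usage_hours.items() if h <= fault_usage_hours),
        key=lambda g: usage_hours[g],
    )

    # Pair at-risk GPUs with distinct targets; if there are fewer healthy
    # GPUs than at-risk ones, the extra at-risk GPUs are left unmapped.
    return dict(zip(at_risk, healthy))
```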
11. An apparatus for predicting GPU malfunctions in a cluster of GPUs, the apparatus comprising:
a processor; and
a non-transitory computer-readable medium coupled to the processor, the non-transitory computer-readable medium having computer-readable instructions stored thereon to be executed when accessed by the processor, the instructions comprising:
collecting a plurality of measurements of a condition of a first GPU in the cluster of GPUs during a pre-determined time period;
determining a GPU count that represents how many of the measurements of the condition exceeded a threshold value during the pre-determined time period;
determining a GPU standard deviation based on the plurality of measurements of the condition collected during the pre-determined time and a plurality of measurements of the condition previously collected from the first GPU;
detecting when the GPU count is greater than a fault count, and the GPU standard deviation is less than a fault standard deviation threshold; and
migrating an application program executing on the first GPU to a second GPU in the cluster of GPUs in response to detecting when the GPU count exceeds the fault count and the GPU standard deviation falls below the fault standard deviation threshold.
Dependent claims: 12, 13, 14, 15, 16, 17
18. An apparatus for predicting GPU malfunctions in a cluster of GPUs, the apparatus comprising:
a processor; and
a non-transitory computer-readable medium coupled to the processor, the non-transitory computer-readable medium having computer-readable instructions stored thereon to be executed when accessed by the processor, the instructions comprising:
collecting a GPU usage duration of a first GPU in the cluster of GPUs during a pre-determined time period;
detecting when the GPU usage duration is greater than a fault usage duration; and
migrating an application program executing on the first GPU to a second GPU in the cluster of GPUs in response to detecting when the GPU usage duration is greater than the fault usage duration.
Dependent claims: 19, 20
Specification