METHOD AND APPARATUS FOR PREDICTING GPU MALFUNCTIONS
Abstract
A method of predicting GPU malfunctions includes installing a daemon program at a GPU node, the daemon program periodically collecting GPU status parameters corresponding to the GPU node at a pre-determined time period. The method also includes obtaining the GPU status parameters from the GPU node and comparing the obtained GPU status parameters with mean status fault parameters to determine whether the GPU is to malfunction, where the mean status fault parameters are obtained by use of a pre-configured statistical model. Before a GPU enters a malfunction state, the GPU can be replaced, or the programs executing on it can be migrated to other GPUs for execution, without affecting normal business operations.
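The comparison step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the statistical model, so this sketch assumes the simplest reading: the mean status fault parameters are per-parameter averages over samples recorded from GPUs that later malfunctioned, and a GPU is flagged when enough of its current parameters fall close to those averages. The function names, parameter names (temperature_c, power_w, ecc_errors), the tolerance, and the match count are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def mean_fault_parameters(fault_samples):
    """Average each status parameter over samples taken from GPUs that
    later malfunctioned (assumed stand-in for the statistical model)."""
    keys = fault_samples[0].keys()
    return {k: mean(s[k] for s in fault_samples) for k in keys}


def is_likely_to_malfunction(status, fault_means, tolerance=0.1, min_matches=2):
    """Return True when at least `min_matches` parameters lie within a
    relative `tolerance` of the corresponding mean fault value.
    Both thresholds are illustrative assumptions."""
    matches = 0
    for name, fault_value in fault_means.items():
        if name not in status:
            continue  # parameter not reported by this node
        if abs(status[name] - fault_value) <= tolerance * abs(fault_value):
            matches += 1
    return matches >= min_matches


# Hypothetical historical samples from GPUs shortly before failure.
fault_samples = [
    {"temperature_c": 92.0, "power_w": 310.0, "ecc_errors": 40.0},
    {"temperature_c": 96.0, "power_w": 290.0, "ecc_errors": 60.0},
]
fault_means = mean_fault_parameters(fault_samples)

healthy = {"temperature_c": 65.0, "power_w": 180.0, "ecc_errors": 0.0}
at_risk = {"temperature_c": 93.0, "power_w": 305.0, "ecc_errors": 12.0}

print(is_likely_to_malfunction(healthy, fault_means))
print(is_likely_to_malfunction(at_risk, fault_means))
```

A GPU flagged this way could then be scheduled for replacement or have its workloads migrated, as the abstract describes. A real deployment would likely use a richer model than a per-parameter mean with a fixed tolerance.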
19 Citations
20 Claims
1. A method of predicting GPU malfunctions, the method comprising:
installing a daemon program at a GPU node, the daemon program periodically collecting GPU status parameters corresponding to the GPU node at a pre-determined time period;
obtaining the GPU status parameters from the GPU node; and
comparing the obtained GPU status parameters with mean status fault parameters to determine whether the GPU is to malfunction, wherein the mean status fault parameters are obtained by use of a pre-configured statistical model.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
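The daemon recited in claim 1 collects status parameters at a fixed period. The following Python sketch shows one way such a collector could be structured, under stated assumptions: the class name StatusDaemon, the read_status callable, and the drain method are hypothetical, and read_status stands in for whatever vendor tool or driver interface actually reports GPU status on the node.

```python
import threading
import time


class StatusDaemon:
    """Background collector that samples GPU status every `period_s` seconds.

    `read_status` is a hypothetical callable returning a dict of status
    parameters; the real data source is not specified in the claim.
    """

    def __init__(self, read_status, period_s):
        self._read_status = read_status
        self._period_s = period_s
        self._samples = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            sample = self._read_status()
            with self._lock:
                self._samples.append(sample)
            # Wait for the next period, waking early if stop() is called.
            self._stop.wait(self._period_s)

    def drain(self):
        """Return all samples collected so far and clear the buffer
        (the 'obtaining' step, as seen from the predicting side)."""
        with self._lock:
            out, self._samples = self._samples, []
        return out


# Usage with a stubbed reader; a real reader would query the GPU.
daemon = StatusDaemon(lambda: {"temperature_c": 70.0}, period_s=0.01)
daemon.start()
time.sleep(0.1)
daemon.stop()
samples = daemon.drain()
print(len(samples) >= 1)
```

Using an Event for the wait keeps the sampling period configurable and lets the daemon shut down promptly instead of sleeping through a full period.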
11. An apparatus for predicting GPU malfunctions, the apparatus comprising:
a processor; and
a non-transitory computer-readable medium operably coupled to the processor, the non-transitory computer-readable medium having computer-readable instructions stored thereon to be executed when accessed by the processor, the instructions comprising:
an installation module configured to install a daemon program at a GPU node, the daemon program periodically collecting GPU status parameters corresponding to the GPU node at a pre-determined time period;
a collecting module configured to obtain the GPU status parameters from the GPU node; and
a processing module configured to compare the obtained GPU status parameters with mean status fault parameters to determine whether the GPU is to malfunction, wherein the mean status fault parameters are obtained by use of a pre-configured statistical model.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
Specification