
Method and apparatus for predicting GPU malfunctions

  • US 10,031,797 B2
  • Filed: 02/26/2016
  • Issued: 07/24/2018
  • Est. Priority Date: 02/26/2015
  • Status: Active Grant
First Claim

1. A method of predicting GPU malfunctions in a cluster of GPUs, the method comprising:

    collecting a plurality of measurements of a condition of a first GPU in the cluster of GPUs during a pre-determined time period;

    determining a GPU count that represents how many of the measurements of the condition exceeded a threshold value during the pre-determined time period, and obtaining a mean fault count;

    determining a GPU standard deviation based on the plurality of measurements of the condition collected during the pre-determined time and a plurality of measurements of the condition previously collected from the first GPU;

    detecting when the GPU count is greater than the mean fault count, and the GPU standard deviation is less than a fault standard deviation threshold; and

    migrating an application program executing on the first GPU to a second GPU in the cluster of GPUs in response to detecting when the GPU count exceeds the mean fault count and the GPU standard deviation falls below the fault standard deviation threshold.
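
The detection condition recited in claim 1 can be illustrated with a minimal Python sketch. This is not the patented implementation. The function name, the parameter names, and the use of a population standard deviation are assumptions made for illustration. The claim does not say how the mean fault count or the fault standard deviation threshold are obtained, so both are passed in as plain numbers, and the migration step is left to the caller.

```python
from statistics import pstdev
from typing import Sequence


def should_migrate(
    window: Sequence[float],        # measurements from the current time period
    history: Sequence[float],       # measurements previously collected from the same GPU
    threshold: float,               # per-measurement threshold value
    mean_fault_count: float,        # assumed to be supplied by the caller
    fault_std_threshold: float,     # assumed to be supplied by the caller
) -> bool:
    """Return True when both conditions of the claim's detecting step hold."""
    # GPU count: how many measurements in the period exceeded the threshold.
    gpu_count = sum(1 for m in window if m > threshold)

    # GPU standard deviation over current plus previously collected measurements.
    # Population standard deviation is an assumption; the claim does not specify.
    gpu_std = pstdev(list(history) + list(window))

    return gpu_count > mean_fault_count and gpu_std < fault_std_threshold


if __name__ == "__main__":
    # Hypothetical temperature-like readings, not data from the patent.
    window = [91, 92, 93, 70]
    history = [90, 91, 92, 93]
    if should_migrate(window, history, threshold=90,
                      mean_fault_count=2, fault_std_threshold=10):
        print("migrate application from first GPU to second GPU")
```

Both comparisons are strict, mirroring "greater than" and "less than" in the claim, so a GPU count equal to the mean fault count does not trigger migration.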
