Method and apparatus for predicting GPU malfunctions

US 10,031,797 B2
Filed: 02/26/2016
Issued: 07/24/2018
Est. Priority Date: 02/26/2015
Status: Active Grant

Abstract

A method of predicting GPU malfunctions includes installing a daemon program at a GPU node, the daemon program periodically collecting GPU status parameters corresponding to the GPU node at a pre-determined time period. The method also includes obtaining the GPU status parameters from the GPU node and comparing the obtained GPU status parameters with mean status fault parameters to determine whether the GPU is likely to malfunction, where the mean status fault parameters are obtained by use of a pre-configured statistical model. Before a GPU enters a malfunction state, the GPU can be replaced, or the programs executing on the GPU can be migrated to other GPUs for execution, without affecting normal business operations.
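
As a rough illustration of the collection step described in the abstract, the sketch below polls a status reader at a fixed interval and keeps the samples. The names `collect_status`, `read_status`, `fake_reader`, and the `temperature_c` / `power_w` fields are hypothetical; the patent does not specify an interface, and a real daemon would query the GPU driver instead of the stub reader used here.

```python
import time
from typing import Callable, Dict, List

Status = Dict[str, float]

def collect_status(read_status: Callable[[], Status],
                   samples: int,
                   interval_s: float,
                   sleep: Callable[[float], None] = time.sleep) -> List[Status]:
    """Poll read_status `samples` times, waiting interval_s between polls."""
    collected: List[Status] = []
    for i in range(samples):
        collected.append(read_status())
        if i < samples - 1:
            sleep(interval_s)  # no wait after the final sample
    return collected

# Stub reader standing in for a real driver query (assumption, not from the patent).
def fake_reader() -> Status:
    return {"temperature_c": 71.0, "power_w": 180.0}

readings = collect_status(fake_reader, samples=3, interval_s=0.0)
print(len(readings))
```

Passing the reader and the sleep function as parameters keeps the loop testable without hardware or real delays.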

20 Claims

1. A method of predicting GPU malfunctions in a cluster of GPUs, the method comprising:
- collecting a plurality of measurements of a condition of a first GPU in the cluster of GPUs during a pre-determined time period;
  
  determining a GPU count that represents how many of the measurements of the condition exceeded a threshold value during the pre-determined time period, and obtaining a mean fault count;
  
  determining a GPU standard deviation based on the plurality of measurements of the condition collected during the pre-determined time and a plurality of measurements of the condition previously collected from the first GPU;
  
  detecting when the GPU count is greater than the mean fault count, and the GPU standard deviation is less than a fault standard deviation threshold; and
  
  migrating an application program executing on the first GPU to a second GPU in the cluster of GPUs in response to detecting when the GPU count exceeds the mean fault count and the GPU standard deviation falls below the fault standard deviation threshold.
- - 2. The method of claim 1, wherein:
    - the condition is temperature;
      
      the plurality of measurements collected from the first GPU during the pre-determined time period include a plurality of currently-collected temperature measurements, and the plurality of measurements of the condition previously collected from the first GPU include a plurality of previously-collected temperature measurements;
      
      the GPU count represents how many temperature measurements exceeded a temperature threshold; and
      
      the GPU standard deviation is based on the plurality of currently-collected temperature measurements and the plurality of previously-collected temperature measurements.
  - 3. The method of claim 2, wherein the fault count is a mean count based on the plurality of currently-collected temperature measurements and the plurality of previously-collected temperature measurements.
  - 4. The method of claim 3, wherein the fault standard deviation threshold is an assigned value based on experience.
  - 5. The method of claim 1, wherein:
    - the condition is power consumption;
      
      the plurality of measurements collected from the first GPU during the pre-determined time period include a plurality of currently-collected power consumption measurements, and the plurality of measurements of the condition previously collected from the first GPU include a plurality of previously-collected power consumption measurements;
      
      the GPU count represents how many power consumption measurements exceeded a power consumption threshold; and
      
      the GPU standard deviation is based on the plurality of currently-collected power consumption measurements, and the plurality of previously-collected power consumption measurements.
  - 6. The method of claim 5, wherein the fault count is a mean count based on the plurality of currently-collected power consumption measurements collected during the pre-determined time and the plurality of previously-collected power consumption measurements.
  - 7. The method of claim 6, wherein the fault standard deviation threshold is an assigned value based on experience.
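
The detection condition recited in claim 1 can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the numeric values are invented for the example, and the use of the population standard deviation (`statistics.pstdev`) over the combined current and previous measurements is one possible reading, since the claim does not say which estimator is used.

```python
from statistics import pstdev
from typing import Sequence

def exceed_count(measurements: Sequence[float], threshold: float) -> int:
    """Number of measurements in the current period above the threshold."""
    return sum(1 for m in measurements if m > threshold)

def should_migrate(current: Sequence[float],
                   previous: Sequence[float],
                   threshold: float,
                   mean_fault_count: float,
                   fault_std_threshold: float) -> bool:
    """True when the count exceeds the mean fault count AND the standard
    deviation is below the fault standard deviation threshold."""
    gpu_count = exceed_count(current, threshold)
    # Assumption: population std dev over previous + current measurements.
    gpu_std = pstdev(list(previous) + list(current))
    return gpu_count > mean_fault_count and gpu_std < fault_std_threshold

# Illustrative temperature readings in degrees Celsius (made-up values).
current = [88.0, 91.0, 92.0, 90.0]
previous = [89.0, 90.0, 91.0, 90.0]
print(should_migrate(current, previous, threshold=85.0,
                     mean_fault_count=3, fault_std_threshold=2.0))
```

Both conditions must hold together: a high count with widely scattered readings does not trigger migration in this sketch, which matches the conjunctive wording of the claim.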

8. A method of predicting GPU malfunctions in a cluster of GPUs, the method comprising:
- collecting a GPU usage duration of a first GPU in the cluster of GPUs during a pre-determined time period;
  
  detecting when the GPU usage duration is greater than a fault usage duration; and
  
  migrating an application program executing on the first GPU to a second GPU in the cluster of GPUs in response to detecting when the GPU usage duration is greater than the fault usage duration.
- - 9. The method of claim 8, wherein the GPU usage duration is a mean duration based on the GPU usage duration collected during the pre-determined time and a plurality of GPU usage durations previously collected from the first GPU.
  - 10. The method of claim 9, wherein the plurality of GPU usage durations previously collected from the first GPU are stored in an information storage space as a plurality of stored measurements that correspond to the GPU.
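
A minimal sketch of the usage-duration check in claims 8 to 10, assuming durations are measured in hours and that the "information storage space" is an in-memory dictionary keyed by a GPU identifier. The names `stored_durations`, `mean_usage_duration`, and `usage_exceeds_fault`, the identifier `gpu-0`, and all numeric values are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence

# Hypothetical storage: previously collected usage durations per GPU, in hours.
stored_durations: Dict[str, List[float]] = {"gpu-0": [20.0, 22.0, 21.0]}

def mean_usage_duration(current_hours: float,
                        stored_hours: Sequence[float]) -> float:
    """Mean of the stored durations and the newly collected one (claim 9)."""
    return mean([*stored_hours, current_hours])

def usage_exceeds_fault(current_hours: float,
                        stored_hours: Sequence[float],
                        fault_usage_hours: float) -> bool:
    """True when the mean usage duration is greater than the fault usage duration."""
    return mean_usage_duration(current_hours, stored_hours) > fault_usage_hours

current = 23.0
print(mean_usage_duration(current, stored_durations["gpu-0"]))
print(usage_exceeds_fault(current, stored_durations["gpu-0"], fault_usage_hours=21.0))
```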

11. An apparatus for predicting GPU malfunctions in a cluster of GPUs, the apparatus comprising:
- a processor; and
  
  a non-transitory computer-readable medium coupled to the processor, the non-transitory computer-readable medium having computer-readable instructions stored thereon to be executed when accessed by the processor, the instructions comprising:
  
  collecting a plurality of measurements of a condition of a first GPU in the cluster of GPUs during a pre-determined time period;
  
  determining a GPU count that represents how many of the measurements of the condition exceeded a threshold value during the pre-determined time period;
  
  determining a GPU standard deviation based on the plurality of measurements of the condition collected during the pre-determined time and a plurality of measurements of the condition previously collected from the first GPU;
  
  detecting when the GPU count is greater than a fault count, and the GPU standard deviation is less than a fault standard deviation threshold; and
  
  migrating an application program executing on the first GPU to a second GPU in the cluster of GPUs in response to detecting when the GPU count exceeds the fault count and the GPU standard deviation falls below the fault standard deviation threshold.
- - 12. The apparatus of claim 11, wherein:
    - the condition is temperature;
      
      the plurality of measurements collected from the first GPU during the pre-determined time period include a plurality of currently-collected temperature measurements, and the plurality of measurements of the condition previously collected from the first GPU include a plurality of previously-collected temperature measurements;
      
      the GPU count represents how many temperature measurements exceeded a temperature threshold; and
      
      the GPU standard deviation is based on the plurality of currently-collected temperature measurements and the plurality of previously-collected temperature measurements.
  - 13. The apparatus of claim 12, wherein the fault count is a mean count based on the plurality of currently-collected temperature measurements and the plurality of previously-collected temperature measurements.
  - 14. The apparatus of claim 13, wherein the fault standard deviation threshold is an assigned value based on experience.
  - 15. The apparatus of claim 11, wherein:
    - the condition is power consumption;
      
      the plurality of measurements collected from the first GPU during the pre-determined time period include a plurality of currently-collected power consumption measurements, and the plurality of measurements of the condition previously collected from the first GPU include a plurality of previously-collected power consumption measurements;
      
      the GPU count represents how many power consumption measurements exceeded a power consumption threshold; and
      
      the GPU standard deviation is based on the plurality of currently-collected power consumption measurements, and the plurality of previously-collected power consumption measurements.
  - 16. The apparatus of claim 15, wherein the fault count is a mean count based on the plurality of currently-collected power consumption measurements collected during the pre-determined time and the plurality of previously-collected power consumption measurements.
  - 17. The apparatus of claim 16, wherein the fault standard deviation threshold is an assigned value based on experience.
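
Claims 13 and 16 describe the fault count as a mean count based on current and previous measurements, without giving a formula. One possible reading, shown below purely as an assumption, is to count threshold exceedances in each collection period and average those counts across periods. The same helper then serves both the temperature and the power consumption variants; all names and numbers are hypothetical.

```python
from statistics import mean
from typing import List, Sequence

def period_counts(periods: Sequence[Sequence[float]],
                  threshold: float) -> List[int]:
    """Per-period count of measurements above the threshold."""
    return [sum(1 for m in p if m > threshold) for p in periods]

def mean_fault_count(previous_periods: Sequence[Sequence[float]],
                     current_period: Sequence[float],
                     threshold: float) -> float:
    """Assumed reading: average exceedance count over previous and current periods."""
    return mean(period_counts([*previous_periods, current_period], threshold))

# Temperature in degrees Celsius (made-up values), threshold 85.
temp_prev = [[70.0, 86.0, 72.0], [71.0, 73.0, 74.0]]
temp_now = [88.0, 90.0, 91.0]
print(period_counts([*temp_prev, temp_now], 85.0))

# Power consumption in watts (made-up values), threshold 200.
power_prev = [[150.0, 160.0], [155.0, 165.0]]
power_now = [210.0, 220.0]
print(mean_fault_count(power_prev, power_now, 200.0))
```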

18. An apparatus for predicting GPU malfunctions in a cluster of GPUs, the apparatus comprising:
- a processor; and
  
  a non-transitory computer-readable medium coupled to the processor, the non-transitory computer-readable medium having computer-readable instructions stored thereon to be executed when accessed by the processor, the instructions comprising:
  
  collecting a GPU usage duration of a first GPU in the cluster of GPUs during a pre-determined time period;
  
  detecting when the GPU usage duration is greater than a fault usage duration; and
  
  migrating an application program executing on the first GPU to a second GPU in the cluster of GPUs in response to detecting when the GPU usage duration is greater than the fault usage duration.
- - 19. The apparatus of claim 18, wherein the GPU usage duration is a mean duration based on the GPU usage duration collected during the pre-determined time and a plurality of GPU usage durations previously collected from the first GPU.
  - 20. The apparatus of claim 19, wherein the plurality of GPU usage durations previously collected from the first GPU are stored in an information storage space as a plurality of stored measurements that correspond to the first GPU.


Current Assignee: Alibaba Group Holding Ltd.
Original Assignee: Alibaba Group Holding Ltd.
Inventors: Hui, Fei
Primary Examiner(s): Patel, Kamini B
Application Number: US15/054,948
Publication Number: US 20160253230A1
Time in Patent Office: 879 Days
Field of Search: 714 472
CPC Class Codes

G06F 11/008   Reliability or availability...

G06F 11/0721   within a central processing...

G06F 11/076   by exceeding a count or rat...

G06F 11/079   Root cause analysis, i.e. e...

G06T 1/20   Processor architectures; Pr...

G06T 2200/28   involving image processing ...
