Automatic repair of computing devices in a data center

US 10,691,528 B1
Filed: 01/29/2020
Issued: 06/23/2020
Est. Priority Date: 07/23/2019
Status: Active Grant

12 Assignments

0 Petitions

Abstract

A system and method for automating management and repair of a plurality of computing devices located in a data center is disclosed. Health status queries are issued for one or more of the computing devices. If responses not indicative of good device health are received, one or more repair instructions are automatically sent to the unhealthy computing device to repair the computing device by moving it to an acceptable state. If the repair instructions are not successful, a support ticket is automatically generated for the corresponding computing device or devices. Problematic statuses across areas of the data center may be detected and ticketed in addition to individual problematic devices. So-called repeat offender devices may be detected and ticketed even if the repair instructions are successful.
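As a rough illustration of the flow summarized above, the Python sketch below queries a device, sends repair instructions one at a time, re-queries after each, and produces a ticket only if no repair restores an acceptable status. The names `run_repair_ladder`, `Ticket`, `query_health`, and `is_acceptable` are hypothetical and do not come from the patent. Treating a missing response as `None` is likewise an assumption.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Ticket:
    """Hypothetical support ticket: which device failed and its last status."""
    device_id: str
    last_status: Optional[float]


def run_repair_ladder(
    device_id: str,
    query_health: Callable[[str], Optional[float]],   # returns None on timeout (assumed)
    is_acceptable: Callable[[Optional[float]], bool],
    repairs: List[Callable[[str], None]],             # ordered repair instructions
    settle_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Ticket]:
    """Query, repair, re-query; return a Ticket only if every repair fails."""
    status = query_health(device_id)
    if is_acceptable(status):
        return None
    for repair in repairs:
        repair(device_id)          # send the next repair instruction
        sleep(settle_seconds)      # wait for the repair to complete
        status = query_health(device_id)
        if is_acceptable(status):
            return None            # device recovered; no ticket needed
    return Ticket(device_id, status)
```

With two entries in `repairs` (for example, restarting an application and then restarting the device), the loop issues up to three health status queries before a ticket is created.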

29 Citations

20 Claims

1. A management device for managing a plurality of computing devices in a data center, wherein the management device comprises:
- a network interface for communicating with the plurality of computing devices,
  
  a first module that sends a first health status query for a selected one of the computing devices,
  
  a second module configured to receive and process any responses to the first health status query, and
  
  a third module configured to create support tickets,
  
  wherein the second module is configured to, in response to not receiving an acceptable response to the first health status query within a first predetermined time;
  
  (a) send a first repair instruction to the selected computing device,
  
  (b) wait at least enough time for the first repair instruction to complete,
  
  (c) cause the first module to send a second health status query to the selected computing device; and
  
  (d) in response to not receiving an acceptable response to the second health status query within a second predetermined time;
  
  (i) cause the first module to send a second repair instruction to the selected computing device,
  
  (ii) wait at least enough time for the second repair instruction to complete,
  
  (iii) send a third health status query to the selected computing device; and
  
  (iv) in response to not receiving an acceptable response to the third health status query, cause the third module to create a support ticket identifying the second computing device and the second computing device's health status.
- Dependent claims (2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15)
  - 2. The management device of claim 1, wherein the first health status query, second health status query, and third health status query are queries for the hash rate of the selected computing device, wherein the first repair instruction is to restart a mining application operating on the selected computing device, and wherein the second repair instruction is to restart the selected computing device.
  - 3. The management device of claim 1, wherein the first health status query, second health status query, and third health status query are queries for the temperature of the selected computing device, wherein the first repair instruction is to increase the fan speed of the selected computing device, and wherein the second repair instruction is to reduce the operating frequency of the selected computing device.
  - 4. The management device of claim 1, further comprising a fourth module comprising a deep neural network (DNN) implementation of a Cox proportional hazards (CPH) model, wherein the DNN and CPH are trained on historical status data from the plurality of computing devices, and wherein the fourth module is configured to output a predicted failure probability for the selected computing device, and wherein the third module is configured to create a support ticket for the selected computing device if the predicted failure probability is greater than a predetermined threshold.
  - 5. The management device of claim 1, wherein the plurality of computing devices are mounted in a plurality of racks, wherein the plurality of racks are positioned in a plurality of pods, wherein the second module is configured to detect:
    - (i) if more than a first predetermined percentage of said plurality of computing devices within a particular rack have not provided acceptable responses to the health status queries; or
      
      (ii) if more than a second predetermined percentage of said plurality of computing devices within a particular pod have not provided acceptable responses to the health status queries,
      
      and in response thereto cause the third module to create a support ticket for the particular rack or particular pod.
  - 6. The management device of claim 1, wherein the plurality of computing devices are mounted in a plurality of racks, wherein the health status queries are for temperature, wherein the second module is configured to detect if greater than a predetermined threshold of said plurality of computing devices within any of the plurality of racks have not provided acceptable responses to the health status queries, and in response turn on or increase the rate of active cooling for an area of the data center that includes the particular rack.
  - 7. The management device of claim 1, wherein the second module is configured to store health status query responses and detect repeat offender devices within the plurality of computing devices, wherein the repeat offender devices have, within a predetermined time period, multiple unacceptable health status query responses, even if repaired by the repair instructions.
  - 9. The method of claim 1, wherein the first repair instruction is to restart a mining application operating on the second computing device, and wherein the second repair instruction is to restart the second computing device.
  - 10. The method of claim 1, wherein the first health status query, second health status query, and third health status query are queries for the hash rate of the second computing device, wherein the first repair instruction is to restart a mining application operating on the second computing device, and wherein the second repair instruction is to restart the second computing device.
  - 11. The method of claim 1, wherein the first health status query, second health status query, and third health status query are queries for the temperature of the second computing device, wherein the first repair instruction is to restart a mining application operating on the second computing device, and wherein the second repair instruction is to restart the second computing device.
  - 12. The method of claim 1, wherein the first health status query, second health status query, and third health status query are queries for the temperature of the second computing device, wherein the first repair instruction is to increase the fan speed, and wherein the second repair instruction is to reduce the operating frequency of the second computing device.
  - 13. The method of claim 1, wherein the first health status query, second health status query, and third health status query are for the fan speed of the second computing device, wherein the first repair instruction is to change the fan speed, and wherein the second repair instruction is to restart the second computing device.
  - 14. The method of claim 1, further comprising generating a resolved ticket in response to the first or second repair instructions resulting in an acceptable health status query response.
  - 15. The method of claim 1, further comprising storing information on the number of repair instructions dispatched to the second device and refraining from submitting any more repair instructions to the second device if a predetermined threshold of repair attempts has been exceeded within a repair window.
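The rack-level and pod-level detection recited in claim 5 can be sketched as a simple aggregation over the latest health status results. This is a minimal sketch under stated assumptions: the function name `areas_to_ticket`, the tuple layout, and the default threshold values are hypothetical, and the patent does not specify how the percentages are computed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed record layout: (device_id, rack_id, pod_id, healthy)
DeviceStatus = Tuple[str, str, str, bool]


def areas_to_ticket(
    statuses: Iterable[DeviceStatus],
    rack_threshold: float = 0.5,    # assumed "first predetermined percentage"
    pod_threshold: float = 0.25,    # assumed "second predetermined percentage"
) -> List[Tuple[str, str]]:
    """Return ("rack", id) or ("pod", id) pairs whose unhealthy share exceeds its threshold."""
    rack_counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [unhealthy, total]
    pod_counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for _device, rack, pod, healthy in statuses:
        for counts in (rack_counts[rack], pod_counts[pod]):
            counts[1] += 1
            if not healthy:
                counts[0] += 1
    flagged: List[Tuple[str, str]] = []
    for kind, table, limit in (
        ("rack", rack_counts, rack_threshold),
        ("pod", pod_counts, pod_threshold),
    ):
        for area, (bad, total) in sorted(table.items()):
            if bad / total > limit:     # strictly "more than" the threshold
                flagged.append((kind, area))
    return flagged
```

Each flagged pair would then be handed to whatever component creates support tickets, so that one ticket covers the whole rack or pod rather than each device in it.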

8. A method for managing a plurality of computing devices in a data center, the method comprising:
- issuing from a first computing device a first health status query for a second computing device of said plurality of computing devices;
  
  in response to not receiving an acceptable response to the first health status query within a first predetermined time;
  
  (i) issuing from the first computing device to the second computing device, a first repair instruction,
  
  (ii) waiting at least enough time for the first repair instruction to complete,
  
  (iii) issuing from the first computing device a second health status query for the second computing device, and
  
  (iv) in response to not receiving an acceptable response to the second health status query within a second predetermined time;
  
  (i) issuing from the first computing device to the second computing device, a second repair instruction,
  
  (ii) waiting at least enough time for the second repair instruction to complete,
  
  (iii) issuing from the first computing device a third health status query for the second computing device, and
  
  (iv) in response to not receiving an acceptable response to the third health status query within a second predetermined time, issuing from the first computing device a repair ticket.

16. A non-transitory, computer-readable storage medium storing instructions executable by a processor of a computational device, which when executed cause the computational device to:
- send a first health status query from a first computing device for a second computing device, and in response to not receiving an acceptable response to the first health status query within a first predetermined time;
  
  (v) send from the first computing device to the second computing device, a first repair instruction,
  
  (vi) wait at least enough time for the first repair instruction to complete,
  
  (vii) send a second health status query from the first computing device to the second computing device; and
  
  (viii) in response to not receiving an acceptable response to the second health status query within a second predetermined time;
  
  (v) send from the first computing device to the second computing device, a second repair instruction,
  
  (vi) wait at least enough time for the second repair instruction to complete,
  
  (vii) send a third health status query from the first computing device to the second computing device,
  
  (viii) in response to not receiving an acceptable response to the third health status query from the second computing device within a second predetermined time, send from the first computing device a repair ticket.
- Dependent claims (17, 18, 19, 20)
  - 17. The storage medium of claim 16, wherein the first health status query, second health status query, and third health status query are queries for the hash rate of the second computing device, wherein the first repair instruction is to restart a mining application operating on the second computing device, and wherein the second repair instruction is to restart the second computing device.
  - 18. The storage medium of claim 16, wherein the first health status query, second health status query, and third health status query are queries for the temperature of the second computing device, wherein the first repair instruction is to increase the fan speed of the second computing device, and wherein the second repair instruction is to reduce the operating frequency of the second computing device.
  - 19. The storage medium of claim 16, further comprising creating a support ticket in response to detecting a number of unhealthy computing devices greater than a predetermined threshold within a rack.
  - 20. The storage medium of claim 16, further comprising storing information on the number of repair instructions dispatched to the second device and refraining from submitting any more repair instructions to the second device if a predetermined threshold of repair attempts has been exceeded within a repair window.
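Claims 15 and 20 describe counting repair instructions per device and holding back further repairs once a threshold is exceeded within a repair window. One possible reading is a sliding-window counter, sketched below. The class name `RepairBudget`, its parameters, and the choice of a sliding window rather than a fixed one are assumptions for illustration and are not taken from the patent.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RepairBudget:
    """Track repair instructions per device over a sliding repair window."""

    def __init__(self, max_attempts: int, window_seconds: float) -> None:
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._sent: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, device_id: str, now: float) -> bool:
        """Record and permit a repair unless the device has used up its budget."""
        sent = self._sent[device_id]
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()           # forget attempts older than the window
        if len(sent) >= self.max_attempts:
            return False             # refrain from sending more repairs
        sent.append(now)
        return True
```

A caller would check `allow(device_id, now)` before dispatching each repair instruction and, when it returns `False`, skip the repair and escalate instead (for example, by opening a ticket).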

Current Assignee: Core Scientific, Inc.
Original Assignee: Core Scientific, Inc.
Inventors: Ferreira, Ian; Balakrishnan, Ganesh; Adams, Evan; Cortez, Carla; Hullander, Eric
Primary Examiner(s): Truong, Loan L. T.
Application Number: US16/776,213
Time in Patent Office: 146 Days
CPC Class Codes

G06F 11/0709   in a distributed system con...

G06F 11/0721   within a central processing...

G06F 11/0793   Remedial or corrective acti...

G06F 11/1438   Restarting or rejuvenating

G06F 11/1441   Resetting or repowering

G06F 11/3006   where the computing system ...

G06F 11/3058   Monitoring arrangements for...

G06F 11/3409   for performance assessment

G06F 11/3447   Performance evaluation by m...
