MANAGEMENT OF A FAULT CONDITION IN A COMPUTING SYSTEM
Abstract
Systems, apparatuses, and/or methods may manage a fault condition in a computing system. An apparatus may dynamically publish a message over a publisher-subscriber system and dynamically subscribe to a message over the publisher-subscriber system, wherein at least one message may be used to address a fault condition in the computing system. The apparatus may predict a fault condition in a high performance computing (HPC) system, communicate fault information to a user, monitor health of the HPC system, respond to the fault condition in the HPC system, recover from the fault condition in the HPC system, maintain a rule for a fault management component, and/or communicate the fault information over the publisher-subscriber system in real time. Messages may also be aggregated to minimize fault information traffic. The publisher-subscriber system may facilitate dynamic and/or real-time coordinated, integrated (e.g., system-wide), and/or scalable fault management.
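The coordination pattern the abstract describes can be illustrated with a minimal in-process sketch. It is not the patented implementation: the `Broker` class, the topic names `fault/monitor` and `fault/response`, the `TEMP_LIMIT_C` threshold, and the `throttle` action are all hypothetical names chosen for illustration. The sketch only shows message determiners that subscribe and unsubscribe at run time, and a response that is triggered by a message carrying sensor data.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Handler = Callable[[str, dict], None]

TEMP_LIMIT_C = 90.0  # hypothetical threshold, not taken from the patent


class Broker:
    """Hypothetical in-process stand-in for the publisher-subscriber system."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def unsubscribe(self, topic: str, handler: Handler) -> None:
        if handler in self._subscribers[topic]:
            self._subscribers[topic].remove(handler)

    def publish(self, topic: str, payload: dict) -> None:
        # Iterate over a copy so handlers may change subscriptions while running.
        for handler in list(self._subscribers[topic]):
            handler(topic, payload)


class MessageDeterminer:
    """A component that can publish and subscribe at run time."""

    def __init__(self, name: str, broker: Broker) -> None:
        self.name = name
        self.broker = broker
        self.received: List[Tuple[str, dict]] = []

    def subscribe(self, topic: str) -> None:
        self.broker.subscribe(topic, self._on_message)

    def unsubscribe(self, topic: str) -> None:
        self.broker.unsubscribe(topic, self._on_message)

    def publish(self, topic: str, payload: dict) -> None:
        # Tag every message with its publisher.
        self.broker.publish(topic, {"source": self.name, **payload})

    def _on_message(self, topic: str, payload: dict) -> None:
        self.received.append((topic, payload))


class ResponseDeterminer(MessageDeterminer):
    """Reacts to monitor messages by publishing a response message."""

    def _on_message(self, topic: str, payload: dict) -> None:
        super()._on_message(topic, payload)
        if topic == "fault/monitor" and payload.get("temp_c", 0.0) > TEMP_LIMIT_C:
            self.publish(
                "fault/response",
                {"action": "throttle", "node": payload["node"]},
            )


if __name__ == "__main__":
    broker = Broker()
    monitor = MessageDeterminer("monitor", broker)
    responder = ResponseDeterminer("responder", broker)
    reporter = MessageDeterminer("reporter", broker)

    responder.subscribe("fault/monitor")
    reporter.subscribe("fault/response")

    # A sensor reading is wrapped in a monitor message.
    monitor.publish("fault/monitor", {"node": "n42", "temp_c": 97.0})
    print(reporter.received)
```

Because no component holds a direct reference to another, a determiner can be added or removed by changing its subscriptions alone, which is one way to read the "dynamic" and "scalable" properties claimed for the publisher-subscriber system.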
25 Claims
1. A system comprising:
a sensor to collect data in a high performance computing (HPC) system; and
a plurality of message determiners, wherein each of the message determiners are to dynamically publish a message over a publisher-subscriber system and are to dynamically subscribe to a message over the publisher-subscriber system, and wherein at least one message is to correspond to the data from the sensor and is to be used to coordinate actions to manage a fault condition in the HPC system.
Dependent claims: 2, 3, 4, 5, 6, 7
8. An apparatus comprising:
a message determiner to dynamically publish a message over a publisher-subscriber system and to dynamically subscribe to a message over the publisher-subscriber system, wherein at least one message is to be used to coordinate actions to manage a fault condition in a computer system.
Dependent claims: 9, 10, 11, 12, 13
14. A method comprising:
dynamically publishing a message over a publisher-subscriber system by a message determiner; and
dynamically subscribing to a message over the publisher-subscriber system by the message determiner, wherein at least one message is used to coordinate actions to manage a fault condition in a computer system.
Dependent claims: 15, 16, 17, 18
19. The method of claim, wherein at least one message is to include a fault monitor message, a fault response message, a fault report message, a fault policy message, or a fault prediction message.
20. At least one computer readable storage medium comprising a set of instructions which, when executed by a device, cause the device to:
dynamically publish a message over a publisher-subscriber system by a message determiner; and
dynamically subscribe to a message over the publisher-subscriber system by the message determiner, wherein at least one message is to be used to coordinate actions to manage a fault condition in a computer system.
Dependent claims: 21, 22, 23, 24, 25
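As a rough illustration of the message kinds listed in claim 19 and of the aggregation mentioned in the abstract, the sketch below models the five kinds as an enumeration and collapses messages that share a kind and a detail string into one summary. The type names, the fields, and the grouping key are assumptions made for this example; the claims do not specify a message format or an aggregation rule.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class FaultMessageKind(Enum):
    # The five kinds named in claim 19.
    MONITOR = "fault monitor"
    RESPONSE = "fault response"
    REPORT = "fault report"
    POLICY = "fault policy"
    PREDICTION = "fault prediction"


@dataclass(frozen=True)
class FaultMessage:
    kind: FaultMessageKind
    node: str    # hypothetical field: where the message originated
    detail: str  # hypothetical field: short description of the condition


@dataclass(frozen=True)
class AggregatedMessage:
    kind: FaultMessageKind
    detail: str
    nodes: Tuple[str, ...]  # distinct originating nodes, sorted
    count: int              # number of raw messages folded together


def aggregate(messages: Iterable[FaultMessage]) -> List[AggregatedMessage]:
    """Fold messages with the same (kind, detail) into one summary each.

    Grouping by (kind, detail) is an assumed rule chosen for illustration.
    """
    groups: Dict[Tuple[FaultMessageKind, str], List[str]] = {}
    for msg in messages:
        groups.setdefault((msg.kind, msg.detail), []).append(msg.node)
    return [
        AggregatedMessage(kind, detail, tuple(sorted(set(nodes))), len(nodes))
        for (kind, detail), nodes in groups.items()
    ]


if __name__ == "__main__":
    raw = [
        FaultMessage(FaultMessageKind.MONITOR, "n2", "fan failure"),
        FaultMessage(FaultMessageKind.MONITOR, "n1", "fan failure"),
        FaultMessage(FaultMessageKind.PREDICTION, "n7", "disk wear"),
    ]
    for summary in aggregate(raw):
        print(summary)
```

Publishing one summary per group instead of one message per node is a simple way to reduce fault information traffic when many nodes report the same condition.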
Specification