Management of a fault condition in a computing system
Abstract
Systems, apparatuses, and/or methods may manage a fault condition in a computer system. An apparatus may dynamically publish a message over a publisher-subscriber system and dynamically subscribe to a message over the publisher-subscriber system, wherein at least one message may be used to address a fault condition in the computing system. The apparatus may predict a fault condition in a high performance computing (HPC) system, communicate fault information to a user, monitor health of the HPC system, respond to the fault condition in the HPC system, recover from the fault condition in the HPC system, maintain a rule for a fault management component, and/or communicate the fault information over the publisher-subscriber system in real-time. Messages may also be aggregated to minimize fault information traffic. The publisher-subscriber system may facilitate dynamic and/or real-time coordinated, integrated (e.g., system-wide), and/or scalable fault management.
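As an illustration of the abstract's point that messages may be aggregated to minimize fault information traffic, the following is a minimal in-process sketch, not the patented implementation. The `Broker` and `Aggregator` classes, the topic names (`sensor/temperature`, `fault/summary`), the payload fields, and the batch size are all hypothetical and chosen only for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[str, dict], None]


class Broker:
    """Toy in-process publisher-subscriber system (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        # Subscriptions can be added at any time, i.e. dynamically.
        self._subs[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        # Deliver the message to every handler currently subscribed to the topic.
        for handler in list(self._subs[topic]):
            handler(topic, payload)


class Aggregator:
    """Subscribes to raw sensor messages and republishes one summary per batch."""

    def __init__(self, broker: Broker, source: str, target: str, batch_size: int) -> None:
        self.broker = broker
        self.target = target
        self.batch_size = batch_size
        self._batch: List[dict] = []
        broker.subscribe(source, self._on_message)

    def _on_message(self, topic: str, payload: dict) -> None:
        self._batch.append(payload)
        if len(self._batch) >= self.batch_size:
            summary = {
                "count": len(self._batch),
                "max_temp_c": max(p["temp_c"] for p in self._batch),
                "nodes": sorted({p["node"] for p in self._batch}),
            }
            self._batch = []
            # One summary message replaces batch_size raw messages downstream.
            self.broker.publish(self.target, summary)


def run_demo() -> List[dict]:
    broker = Broker()
    Aggregator(broker, "sensor/temperature", "fault/summary", batch_size=3)
    received: List[dict] = []
    broker.subscribe("fault/summary", lambda topic, payload: received.append(payload))
    for node, temp in [("n1", 71.0), ("n2", 95.5), ("n1", 80.0)]:
        broker.publish("sensor/temperature", {"node": node, "temp_c": temp})
    return received
```

In this sketch a fault-management component subscribed to `fault/summary` sees one message per three sensor readings, which is one simple way aggregation could reduce traffic.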
21 Claims
1. A system comprising:
a sensor to collect data in a high performance computing (HPC) system; and
a plurality of message determiners, wherein each of the message determiners are to dynamically publish a message over a publisher-subscriber system and are to dynamically subscribe to a message over the publisher-subscriber system, wherein at least one message is to correspond to the data from the sensor and is to be used to coordinate actions to manage a fault condition in the HPC system, and wherein a message determiner is to include:
a data determiner to dynamically determine a need for data of interest to the message determiner; and
a message generator to generate a request message to one or more of request subscription to a message that is to include the data of interest or prompt subscription to the request message to cause the data of interest to be published over the publication-subscription system.
Dependent claims: 2, 3, 4, 5, 6
7. An apparatus comprising:
a message determiner to dynamically publish a message over a publisher-subscriber system and to dynamically subscribe to a message over the publisher-subscriber system, wherein at least one message is to be used to coordinate actions to manage a fault condition in a computer system, and wherein the message determiner is to include:
a data determiner to dynamically determine a need for data of interest to the message determiner; and
a message generator to generate a request message to one or more of request subscription to a message that is to include the data of interest or prompt subscription to the request message to cause the data of interest to be published over the publication-subscription system.
Dependent claims: 8, 9, 10, 11
12. A method comprising:
dynamically publishing a message over a publisher-subscriber system by a message determiner;
dynamically subscribing to a message over the publisher-subscriber system by the message determiner, wherein at least one message is used to coordinate actions to manage a fault condition in a computer system, the method further including:
determining a need for data of interest to the message determiner; and
generating a request message to one or more of request subscription to a message including the data of interest or prompt subscription to the request message to cause the data of interest to be published over the publication-subscription system.
Dependent claims: 13, 14, 15, 16
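The steps of the method claim (determining a need for data of interest, then generating a request message that leads to the data being published) can be sketched as follows. This is one possible reading of the claim language under stated assumptions, not the claimed implementation: the `Broker`, `SensorPublisher`, and `MessageDeterminer` classes, the `request/data` topic, and the `want`/`value` payload fields are all hypothetical.

```python
from collections import defaultdict

REQUEST_TOPIC = "request/data"  # hypothetical topic that carries request messages


class Broker:
    """Toy in-process publisher-subscriber system (hypothetical)."""

    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subs[topic].append(handler)

    def publish(self, topic, payload):
        for handler in list(self._subs[topic]):
            handler(topic, payload)


class SensorPublisher:
    """Subscribes to request messages and publishes its reading when asked."""

    def __init__(self, broker, topic, read):
        self.broker, self.topic, self.read = broker, topic, read
        broker.subscribe(REQUEST_TOPIC, self._on_request)

    def _on_request(self, _topic, request):
        if request["want"] == self.topic:
            self.broker.publish(self.topic, {"value": self.read()})


class MessageDeterminer:
    """Combines a data determiner and a message generator (assumed structure)."""

    def __init__(self, broker, interests):
        self.broker = broker
        self.interests = set(interests)
        self.data = {}
        self._subscribed = set()

    def determine_need(self):
        # Data determiner: topics of interest for which no data has arrived yet.
        return sorted(self.interests - set(self.data))

    def generate_requests(self):
        # Message generator: subscribe to each missing topic, then publish a
        # request message so that a subscribed publisher emits the data.
        sent = []
        for topic in self.determine_need():
            if topic not in self._subscribed:
                self.broker.subscribe(topic, self._on_data)
                self._subscribed.add(topic)
            request = {"want": topic}
            self.broker.publish(REQUEST_TOPIC, request)
            sent.append(request)
        return sent

    def _on_data(self, topic, payload):
        self.data[topic] = payload["value"]


def run_demo():
    broker = Broker()
    SensorPublisher(broker, "sensor/fan_rpm", read=lambda: 4200)
    determiner = MessageDeterminer(broker, interests=["sensor/fan_rpm"])
    before = determiner.determine_need()
    determiner.generate_requests()
    after = determiner.determine_need()
    return before, after, determiner.data
```

Here the publisher stays silent until a request message arrives, so data flows only when some message determiner has determined a need for it.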
17. At least one non-transitory computer readable storage medium comprising a set of instructions which, when executed by a device, cause the device to:
dynamically publish a message over a publisher-subscriber by a message determiner; and
dynamically subscribe to a message over the publisher-subscriber system by the message determiner, wherein at least one message is to be used to coordinate actions to manage a fault condition in a computer system, and wherein the instructions, when executed, cause a device to:
determine a need for data of interest to the message determiner; and
generate a request message to one or more of request subscription to a message including the data of interest or prompt subscription to the request message to cause the data of interest to be published over the publication-subscription system.
Dependent claims: 18, 19, 20, 21
Specification