Preventing unnecessary data recovery
Abstract
A method that prevents unnecessary data recovery includes receiving, at a data processing device, a status of a resource of a distributed system. When the status of the resource indicates a resource failure, the method includes executing instructions on the data processing device to determine whether the resource failure is correlated to any other resource failures within the distributed system. When the resource failure is correlated to other resource failures within the distributed system, the method includes delaying execution on the data processing device of a remedial action associated with the resource. However, when the resource failure is uncorrelated to other resource failures within the distributed system, the method includes initiating execution on the data processing device of the remedial action associated with the resource.
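The decision flow described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Resource` and `Decision` types, the `on_status` and `is_correlated` functions, and the `min_fraction` cutoff are all hypothetical names and values introduced here, and the simple fraction test is only a stand-in for whatever correlation test a real system would use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional, Set, Tuple


class Decision(Enum):
    DELAY = "delay"        # failure looks correlated: hold off on recovery
    INITIATE = "initiate"  # failure looks isolated: start recovery


@dataclass(frozen=True)
class Resource:
    name: str
    domains: Tuple[str, ...]  # system domains the resource belongs to (labels are assumptions)


def is_correlated(resource: Resource, failed: Set[str],
                  all_resources: Iterable[Resource],
                  min_fraction: float = 0.5) -> bool:
    """Return True if, in any domain of `resource`, at least two members
    and at least `min_fraction` of all members are currently failed."""
    pool = list(all_resources)
    for domain in resource.domains:
        members = [r for r in pool if domain in r.domains]
        down = [r for r in members if r.name in failed]
        if len(down) >= 2 and len(down) / len(members) >= min_fraction:
            return True
    return False


def on_status(resource: Resource, healthy: bool, failed: Set[str],
              all_resources: Iterable[Resource]) -> Optional[Decision]:
    """Process one status report and decide what to do about a failure."""
    if healthy:
        failed.discard(resource.name)
        return None
    failed.add(resource.name)
    if is_correlated(resource, failed, all_resources):
        return Decision.DELAY
    return Decision.INITIATE
```

In this sketch a single failed disk in a rack triggers recovery, while a second failure that takes half the rack down is treated as a probable domain-level outage and recovery is deferred.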
21 Claims
1. A method comprising:
receiving, at a data processing device, a status of a resource of a distributed system;
when the status of the resource indicates a resource failure, executing instructions on the data processing device to determine whether the resource failure is correlated to any other resource failures within the distributed system based on a system hierarchy of the distributed system, the system hierarchy comprising system domains, each system domain having an active state or an inactive state, the resource belonging to at least one system domain, wherein the resource failure is correlated to other resource failures when a statistically significant number of resources having failures reside in a same system domain;
when the resource failure is correlated to other resource failures within the distributed system, delaying execution on the data processing device of a remedial action associated with the resource; and
when the resource failure is uncorrelated to other resource failures within the distributed system, initiating execution on the data processing device of the remedial action associated with the resource.
Dependent claims: 2, 3, 4, 5, 6
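Claim 1 does not say how a "statistically significant number" of failures in one system domain is determined. One possible reading, shown purely as an assumption, is a one-sided binomial test: given a baseline probability that any single resource is down, ask how unlikely it is to see at least the observed number of failures in a domain of that size. The function names, the baseline probability, and the significance level below are all hypothetical.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def failures_look_correlated(domain_size: int, failed_in_domain: int,
                             baseline_failure_prob: float = 0.01,
                             alpha: float = 0.001) -> bool:
    """Treat failures in a domain as correlated when seeing this many
    independent failures would be very unlikely (tail probability < alpha).
    Both defaults are illustrative assumptions, not values from the patent."""
    if failed_in_domain < 2:
        return False  # a lone failure has nothing to be correlated with
    tail = binomial_tail(domain_size, failed_in_domain, baseline_failure_prob)
    return tail < alpha
```

With these assumed defaults, two failures in a 40-resource domain are still plausible as independent events, while ten failures in the same domain are not.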

7. A method comprising:
receiving, at a data processing device, a status of a resource of a distributed system;
when the status of the resource indicates a resource failure, executing instructions on the data processing device to determine whether the resource failure is correlated to any other resource failures within the distributed system based on a system hierarchy of the distributed system, the system hierarchy comprising system domains, each system domain having an active state or an inactive state, the resource belonging to at least one system domain;
when the resource failure is correlated to other resource failures within the distributed system, delaying execution on the data processing device of a remedial action associated with the resource until after a first threshold period of time; and
when the resource failure is uncorrelated to the other resource failures within the distributed system, initiating execution on the data processing device of the remedial action associated with the resource after a second threshold period of time, wherein the first threshold period of time is greater than the second threshold period of time.
Dependent claims: 8

9. A recovery system for a distributed system, the recovery system comprising:
a data processing device in communication with resources of the distributed system, the data processing device receiving a status of a resource of the distributed system;
when the status of the resource indicates a resource failure, the data processing device executing instructions to determine whether the resource failure is correlated to any other resource failures within the distributed system based on a system hierarchy of the distributed system, the system hierarchy comprising system domains, each system domain having an active state or an inactive state, the resource belonging to at least one system domain, wherein the resource failure is correlated to other resource failures when a statistically significant number of resources having failures reside in a same system domain;
when the resource failure is correlated to other resource failures within the distributed system, the data processing device delaying execution of a remedial action associated with the resource; and
when the resource failure is uncorrelated to other resource failures within the distributed system, the data processing device initiating execution of the remedial action associated with the resource.
Dependent claims: 10, 11, 12, 13, 14

15. A recovery system for a distributed system, the recovery system comprising:
a data processing device in communication with resources of the distributed system, the data processing device receiving a status of a resource of the distributed system;
when the status of the resource indicates a resource failure, the data processing device executing instructions to determine whether the resource failure is correlated to any other resource failures within the distributed system based on a system hierarchy of the distributed system, the system hierarchy comprising system domains, each system domain having an active state or an inactive state, the resource belonging to at least one system domain;
when the resource failure is correlated to other resource failures within the distributed system, the data processing device delays execution of a remedial action associated with the resource for a first threshold period of time; and
when the resource failure is uncorrelated to the other resource failures within the distributed system, the data processing device initiates execution of the remedial action associated with the resource after a second threshold period of time, wherein the first threshold period of time is greater than the second threshold period of time.
Dependent claims: 16

17. A method comprising:
receiving, at a data processing device, a status of a resource of a distributed system;
when the status of the resource indicates a resource failure, executing instructions on the data processing device to determine:
    a correlation between the resource failure and any other resource failures within the distributed system based on a system hierarchy of the distributed system, the system hierarchy comprising system domains, each system domain having an active state or an inactive state, the resource belonging to at least one system domain; and
    a time duration of the resource failure;
when the resource failure is correlated to other resource failures within the distributed system and the time duration is greater than a first threshold period of time, executing on the data processing device a remedial action associated with the resource; and
when the resource failure is uncorrelated to other resource failures within the distributed system, and the time duration is greater than a second threshold period of time, executing on the data processing device the remedial action associated with the resource, wherein the first threshold period of time is greater than the second threshold period of time.
Dependent claims: 18, 19
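The two-threshold rule of claim 17 can be sketched as a single predicate over the correlation result and the failure duration. The function name and the default thresholds (one hour and fifteen minutes) are illustrative assumptions; the claim only requires that the first (correlated) threshold be greater than the second (uncorrelated) one.

```python
def should_remediate(correlated: bool, failure_duration_s: float,
                     correlated_threshold_s: float = 3600.0,
                     uncorrelated_threshold_s: float = 900.0) -> bool:
    """Return True once a failure has lasted longer than the threshold
    that applies to it. Threshold values are hypothetical."""
    if correlated_threshold_s <= uncorrelated_threshold_s:
        raise ValueError("correlated threshold must exceed uncorrelated threshold")
    threshold = correlated_threshold_s if correlated else uncorrelated_threshold_s
    return failure_duration_s > threshold
```

Under these assumed values, a failure that has lasted 1000 seconds is remediated if it is isolated but left alone if it is part of a correlated outage, which gives the shared cause time to clear.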

20. A method comprising:
receiving, at a data processing device, a status of a resource of a distributed system;
when the status of the resource indicates a resource failure, executing instructions on the data processing device to determine:
    a correlation between the resource failure and any other resource failures within the distributed system based on a system hierarchy of the distributed system, the system hierarchy comprising system domains, each system domain having an active state or an inactive state, the resource belonging to at least one system domain; and
    a time duration of the resource failure;
when the resource failure is correlated to other resource failures within the distributed system and the time duration is greater than a first threshold period of time, executing on the data processing device a remedial action associated with the resource;
when the resource failure is uncorrelated to other resource failures within the distributed system, and the time duration is greater than a second threshold period of time, executing on the data processing device the remedial action associated with the resource, wherein the first threshold period of time is greater than the second threshold period of time;
when the resource comprises non-transitory memory, initiating data reconstruction as the remedial action for any data stored on the non-transitory memory; and
when the resource comprises a computer processor, migrating or restarting a job previously executing on a failed computer processor to an operational computer processor.
Dependent claims: 21
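The last two limitations of claim 20 select the remedial action by resource type. A minimal sketch of that dispatch follows; the `ResourceKind` enum, the function name, and the action labels are hypothetical and only name the actions the claim describes.

```python
from enum import Enum, auto


class ResourceKind(Enum):
    MEMORY = auto()     # non-transitory memory holding stored data
    PROCESSOR = auto()  # computer processor running jobs


def choose_remedial_action(kind: ResourceKind) -> str:
    """Map a failed resource's kind to the remedial action named in the claim."""
    if kind is ResourceKind.MEMORY:
        return "reconstruct_data"
    if kind is ResourceKind.PROCESSOR:
        return "migrate_or_restart_job"
    raise ValueError(f"no remedial action defined for {kind!r}")
```

A caller would invoke this only after the correlation and duration checks have decided that remediation should proceed.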
Specification