System and method for detecting and isolating faults in a computer collaboration environment
Abstract
A method and system are provided for use in a computer collaboration environment. In one example, the method includes identifying that a resource should be failed over from one computer to another computer within the environment. A history of the resource's execution within the cluster is examined, and the resource is failed over only if a risk assessment based on the history indicates that a risk level of loading the resource does not exceed an acceptable risk threshold.
25 Citations
28 Claims
1. A method for minimizing breakdown in a computer cluster containing a plurality of computers, wherein a resource is to be failed over from a first computer of the plurality of computers, the method comprising:
identifying that the first computer has failed while running the resource;
tracing a failover history of the resource based on a log containing a history of the resource and the plurality of computers;
identifying the existence of mitigating factors associated with the failover history based on the log, wherein identifying the existence of mitigating factors includes at least one of:
identifying a number of times (L) that another resource was loading on each computer from which the resource was failed over; and
identifying a number of times (R) that the resource entered a running state; and
determining whether to load the resource onto a second computer of the plurality of computers based on the failover history and mitigating factors.
Dependent claims: 2, 3, 4, 5, 6, 7
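By way of illustration only, the log tracing recited in claim 1 might be sketched as follows. The log format, the `LogEvent` type, the state names, and the `trace_history` function are hypothetical and are not taken from the claims; the count F (number of failovers) is borrowed from claims 8 and 16. Here L is read as the number of failovers during which some other resource was loading on the same computer, which is one possible interpretation of the claim language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    """One entry of a hypothetical cluster log (format assumed for illustration)."""
    node: str      # computer on which the event occurred
    resource: str  # resource the event concerns
    state: str     # assumed states: "loading", "running", "failed_over"


def trace_history(log, resource):
    """Return (F, L, R) for `resource` from a log ordered oldest to newest."""
    failovers = 0      # F: times the resource was failed over
    other_loading = 0  # L: failovers during which another resource was loading
    running = 0        # R: times the resource entered a running state
    loading_now = {}   # node -> set of resources currently loading on it

    for ev in log:
        # Track which resources are mid-load on each computer.
        if ev.state == "loading":
            loading_now.setdefault(ev.node, set()).add(ev.resource)
        else:
            loading_now.get(ev.node, set()).discard(ev.resource)

        if ev.resource != resource:
            continue
        if ev.state == "running":
            running += 1
        elif ev.state == "failed_over":
            failovers += 1
            # Any resource still loading on this node is necessarily another one.
            if loading_now.get(ev.node):
                other_loading += 1

    return failovers, other_loading, running


example_log = [
    LogEvent("n1", "db", "loading"),
    LogEvent("n1", "db", "running"),
    LogEvent("n1", "web", "loading"),
    LogEvent("n1", "db", "failed_over"),
    LogEvent("n2", "db", "loading"),
    LogEvent("n2", "db", "running"),
    LogEvent("n2", "db", "failed_over"),
]
```

In this example log, the first failover of `db` coincides with `web` loading on the same computer and so counts toward L, while the second does not.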
8. A computer readable medium comprising a plurality of computer-executable instructions for use with a cluster containing at least first and second computers, the instructions including:
instructions for identifying that a resource should be failed over from the first computer to the second computer;
instructions for examining a history of the resource's execution within the cluster, wherein the instructions for examining the history include instructions for identifying at least one of:
a number of times (F) the resource has failed over;
a number of times (L) that another resource was loading on each computer from which the resource was failed over; and
a number of times (R) that the resource entered a running state; and
instructions for failing the resource over to the second computer only if a risk assessment based on the history indicates that a risk level of loading the resource does not exceed an acceptable risk threshold.
Dependent claims: 9, 10, 11, 12, 13, 14, 15
16. A method for preventing breakdown of a computer cluster, the method comprising:
identifying that a task should be failed over from a first computer of the cluster to a second computer of the cluster;
determining whether a risk assessment has been triggered;
retrieving a log containing a running history of the cluster if the risk assessment has been triggered;
identifying a number of times (F) the task has failed over based on the log;
examining the log to determine at least one of a number of times (L) that another task was loading on each computer when the task was failed over and a number of times (R) that the task has entered a running state;
calculating a risk level based on F, L, and R; and
loading the task onto the second computer only if the risk level is below a predefined risk threshold.
Dependent claims: 17, 18
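As a non-limiting sketch, the decision flow of claim 16 might look like the following. Claim 16 states only that the risk level is calculated "based on F, L, and R"; the formula below, the default threshold of 1.0, and the function names are assumptions made purely for illustration. The sketch also assumes that a failover proceeds without a log lookup when no risk assessment has been triggered, which the claim does not state explicitly.

```python
def risk_level(f, l, r):
    """Illustrative risk score only; the formula is an assumption.

    Failovers that coincided with another task loading (L) are treated as
    excused, and successful entries into a running state (R) damp the score.
    """
    unexcused = max(f - l, 0)
    return unexcused / (r + 1)


def should_fail_over(triggered, f, l, r, threshold=1.0):
    """Decide whether to load the task onto the second computer."""
    if not triggered:
        # Assumption: with no risk assessment triggered, failover proceeds.
        return True
    # Load only if the risk level is below the predefined threshold.
    return risk_level(f, l, r) < threshold
```

Treating L and R as mitigating factors follows claim 1, which describes them as such; any real weighting would come from the specification rather than from this sketch.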
19. A method for preventing breakdown of a computer cluster, the method comprising:
identifying that a task should be failed over from a first computer of the cluster to a second computer of the cluster;
examining a history of the cluster to determine whether the task is associated with a current attack on the cluster; and
failing over the task to the second computer only if the task is not associated with a current attack.
Dependent claims: 20, 21, 22, 23
24. A computer collaboration system comprising:
first and second computers;
a shared storage accessible by the first and second computers containing a log of the system; and
a plurality of computer-executable instructions for:
identifying that a task should be failed over from the first computer to the second computer;
examining the log to trace a history of the task's execution within the system, wherein the examining the log includes at least one of:
identifying a number of times (L) that another resource was loading on each computer from which the task was failed over;
identifying a number of times (R) that the task entered a running state; and
identifying a number of running computers (N) remaining in the system; and
failing the task over to the second computer only if a risk assessment based on the history indicates that a risk level of loading the task does not exceed an acceptable risk threshold.
Dependent claims: 25, 26, 27
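For illustration, the count N of running computers recited in claim 24 could be derived from status entries in the shared log along the following lines. The `(node, status)` entry shape, the `"up"`/`"down"` status values, and the `running_computers` function are hypothetical; claim 24 does not say how N enters the risk assessment, so the sketch stops at computing N.

```python
def running_computers(node_events):
    """Return N, the number of computers whose most recent status is "up".

    `node_events` is an iterable of (node, status) pairs ordered oldest to
    newest; this entry shape is assumed for illustration only.
    """
    latest = {}
    for node, status in node_events:
        latest[node] = status  # later entries overwrite earlier ones
    return sum(1 for status in latest.values() if status == "up")


status_log = [("n1", "up"), ("n2", "up"), ("n1", "down"), ("n3", "up")]
```

In `status_log`, computer `n1` went down after starting, so only `n2` and `n3` remain running.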
28. A computer cluster system comprising:
means for identifying that a first computer of the cluster has failed while running a task;
means for tracing a failover history of the task;
means for identifying the existence of mitigating factors associated with the task, wherein the means for identifying the existence of mitigating factors includes at least one of:
means for identifying a number of times (L) that another resource was loading on each computer from which the resource was failed over; and
means for identifying a number of times (R) that the resource entered a running state; and
means for determining whether to load the task onto a second computer of the cluster based on the failover history and mitigating factors.
Specification