System and method for detecting problematic data storage nodes
First Claim
1. A method implemented on a first data storage node for maintaining a data storage system, the method comprising:
the first data storage node monitoring for receipt of instances of a broadcast message from each of a plurality of data storage nodes included in a storage group, wherein a local node list stored at the first data storage node identifies the plurality of data storage nodes as being included in the storage group, the plurality of data storage nodes identified in the local node list includes at least a second data storage node, and instances of the broadcast message include an indication of a node list version being used by the data storage node that transmitted the instance of the broadcast message such that receipt of a given instance of the broadcast message by the first storage node indicates to the first storage node both that the data storage node that sent the given instance of the broadcast message is operating correctly and whether the storage group has been updated to add or remove a data storage node;
the first data storage node receiving a first instance of the broadcast message from the second data storage node;
the first data storage node updating the local node list based on the first instance of the broadcast message indicating that the second data storage node is using a different node list version than the version of the local node list stored at the first data storage node, wherein the first data storage adds at least a third data storage node to the local node list and begins monitoring for receipt of instances of the broadcast message from the third data storage node in response to the first instance of the broadcast message indicating the second data storage node is using the different node list version;
the first data storage node detecting that the second data storage node is malfunctioning based on failing to receive a subsequent instance of the broadcast message from the second data storage node for a predetermined period of time; and
the first data storage node initiating a data replication procedure based on detecting that the second data storage node is malfunctioning, wherein a file stored on the first data storage node is to be replicated in the data replication procedure, the file is associated with a host list stored on the first data storage node, the host list indicates a subset of data storage nodes in the storage group that also store the file, and the host list indicates that the second data storage node is in the subset of data storage nodes.
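The steps recited in this claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The class name `FirstNode`, the `fetch_node_list` callback (the claim does not say how the newer node list is obtained), the `started_at` field, and the 30-second default timeout are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical callback: asks a peer for its (version, node list).
NodeListFetcher = Callable[[str], Tuple[int, Set[str]]]


@dataclass
class FirstNode:
    """Illustrative model of the 'first data storage node' (names assumed)."""
    name: str
    node_list_version: int
    node_list: Set[str]                  # local node list: peers in the storage group
    host_lists: Dict[str, Set[str]]      # file id -> other nodes that also store the file
    timeout: float = 30.0                # the 'predetermined period'; value is assumed
    started_at: float = 0.0              # baseline for peers not yet heard from
    last_seen: Dict[str, float] = field(default_factory=dict)

    def on_broadcast(self, sender: str, version: int, now: float,
                     fetch_node_list: NodeListFetcher) -> None:
        # Receiving any instance shows the sender is operating correctly.
        self.last_seen[sender] = now
        if version == self.node_list_version:
            return
        # A different version signals that the storage group changed.
        new_version, new_nodes = fetch_node_list(sender)
        peers = set(new_nodes) - {self.name}
        for added in peers - self.node_list:
            self.last_seen.setdefault(added, now)   # begin monitoring new nodes
        for removed in self.node_list - peers:
            self.last_seen.pop(removed, None)       # stop monitoring removed nodes
        self.node_list = peers
        self.node_list_version = new_version

    def malfunctioning(self, now: float) -> List[str]:
        # A peer is flagged when no broadcast arrived within the timeout.
        return sorted(n for n in self.node_list
                      if now - self.last_seen.get(n, self.started_at) > self.timeout)

    def files_to_replicate(self, failed: str) -> List[str]:
        # Local files whose host list names the failed node need a new copy.
        return sorted(f for f, hosts in self.host_lists.items() if failed in hosts)
```

In this sketch, a node that stops broadcasting is reported by `malfunctioning`, and `files_to_replicate` then uses the host lists to select only the locally stored files that the failed node also held.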
Abstract
A method for maintaining a data storage system is disclosed. The method may include monitoring for receipt of a first broadcast message from a first data storage node, where the first broadcast message may indicate that the first data storage node is operating correctly. The method may also include detecting that the first data storage node is malfunctioning based on not receiving the first broadcast message for a predetermined period of time. The method may also include initiating a data replication procedure based on detecting that the first data storage node is malfunctioning. The data replication procedure may include sending a first multicast message to a plurality of data storage nodes requesting identification of a second data storage node that maintains a copy of a file stored on the first data storage node.
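As a rough illustration of the flow summarized in the abstract, the following Python sketch simulates heartbeat-based failure detection and the multicast query in memory, without real network traffic. The function names, the `inventories` mapping, and the timeout value are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Set


def detect_failed(last_heartbeat: Dict[str, float], now: float,
                  timeout: float) -> List[str]:
    """Return nodes whose last broadcast is older than the timeout."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout)


def multicast_find_copies(file_id: str, inventories: Dict[str, Set[str]],
                          failed: str) -> List[str]:
    """Simulated multicast: each reachable node answers if it holds file_id.

    `inventories` maps node name -> set of file ids stored there (assumed model).
    The failed node is skipped because it cannot answer.
    """
    return sorted(n for n, files in inventories.items()
                  if n != failed and file_id in files)


# Example: n1 has gone silent; ask the group who else holds file "a".
inventories = {"n1": {"a", "b"}, "n2": {"a"}, "n3": {"b"}}
last_heartbeat = {"n1": 0.0, "n2": 95.0, "n3": 99.0}
failed_nodes = detect_failed(last_heartbeat, now=100.0, timeout=30.0)
holders = multicast_find_copies("a", inventories, failed="n1")
```

The nodes returned by `multicast_find_copies` are candidates to serve as the source when the file is copied again.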
148 Citations
20 Claims
1. A method implemented on a first data storage node for maintaining a data storage system, the method comprising:
the first data storage node monitoring for receipt of instances of a broadcast message from each of a plurality of data storage nodes included in a storage group, wherein a local node list stored at the first data storage node identifies the plurality of data storage nodes as being included in the storage group, the plurality of data storage nodes identified in the local node list includes at least a second data storage node, and instances of the broadcast message include an indication of a node list version being used by the data storage node that transmitted the instance of the broadcast message such that receipt of a given instance of the broadcast message by the first storage node indicates to the first storage node both that the data storage node that sent the given instance of the broadcast message is operating correctly and whether the storage group has been updated to add or remove a data storage node;
the first data storage node receiving a first instance of the broadcast message from the second data storage node;
the first data storage node updating the local node list based on the first instance of the broadcast message indicating that the second data storage node is using a different node list version than the version of the local node list stored at the first data storage node, wherein the first data storage adds at least a third data storage node to the local node list and begins monitoring for receipt of instances of the broadcast message from the third data storage node in response to the first instance of the broadcast message indicating the second data storage node is using the different node list version;
the first data storage node detecting that the second data storage node is malfunctioning based on failing to receive a subsequent instance of the broadcast message from the second data storage node for a predetermined period of time; and
the first data storage node initiating a data replication procedure based on detecting that the second data storage node is malfunctioning, wherein a file stored on the first data storage node is to be replicated in the data replication procedure, the file is associated with a host list stored on the first data storage node, the host list indicates a subset of data storage nodes in the storage group that also store the file, and the host list indicates that the second data storage node is in the subset of data storage nodes.
View Dependent Claims (2, 3, 4, 5, 6, 7)
8. A system for maintaining data in a network, the system comprising a plurality of data storage nodes included in a storage group, the plurality of data storage nodes comprising:
a first data storage node configured to monitor for receipt of instances of a broadcast message from each of the plurality of data storage nodes included in a storage group, wherein a local node list stored at the first data storage node identifies the plurality of data storage nodes as being included in the storage group, the plurality of data storage nodes identified in the local node list includes at least a second data storage node, and instances of the broadcast message include an indication of a node list version being used by the data storage node that transmitted the instance of the broadcast message such that receipt of a given instance of the broadcast message by the first storage node indicates to the first storage node both that the data storage node that sent the given instance of the broadcast message is operating correctly and whether the storage group has been updated to add or remove a data storage node;
the second data storage node configured to send a first instance of the broadcast message, the first instance of the broadcast message comprising an indication of a node list version used by the second data storage node; and
the first data storage node further configured to;
receive the first instance of the broadcast message from the second data storage node,
update the local node list based on the first instance of the broadcast message indicating that the second data storage node is using a different node list version than the version of the local node list stored at the first data storage node,
add at least a third data storage node to the local node list and begin monitoring for receipt of instances of the broadcast message from the third data storage node in response to the first instance of the broadcast message indicating the second data storage node is using the different node list version,
detect that the second data storage node is malfunctioning based on failing to receive a subsequent instance of the broadcast message from the second data storage node for a predetermined period of time, and
initiate a data replication procedure based on detecting that the second data storage node is malfunctioning, wherein at least one file stored on the first data storage node is to be replicated in the data replication procedure, the at least one file is associated with a host list stored on the first data storage node, the host list indicates a subset of data storage nodes in the storage group that also store the at least one file, and the host list indicates that the second data storage node is in the subset of data storage nodes.
View Dependent Claims (9, 10, 11, 12, 13, 14, 15, 16)
17. A first data storage node configured to:
monitor for receipt of instances of a broadcast message from each of a plurality of data storage nodes included in a storage group, wherein a local node list stored at the first data storage node identifies the plurality of data storage nodes as being included in the storage group, the plurality of data storage nodes identified in the local node list includes at least a second data storage node, and instances of the broadcast message include an indication of a node list version being used by the data storage node that transmitted the instance of the broadcast message such that receipt of a given instance of the broadcast message by the first storage node indicates to the first storage node both that the data storage node that sent the given instance of the broadcast message is operating correctly and whether the storage group has been updated to add or remove a data storage node;
receive a first instance of the broadcast message from the second data storage node;
update the local node list based on the first instance of the broadcast message indicating that the second data storage node is using a different node list version than the version of the local node list stored at the first data storage node;
add at least a third data storage node to the local node list and begin monitoring for receipt of instances of the broadcast message from the third data storage node in response to the first instance of the broadcast message indicating the second data storage node is using the different node list version;
detect that the second data storage node is malfunctioning based on failing to receive a subsequent instance of the broadcast message from the second data storage node for a predetermined period of time; and
initiate a data replication procedure based on detecting that the second data storage node is malfunctioning, wherein a file stored on the first data storage node is to be replicated in the data replication procedure, the file is associated with a host list stored on the first data storage node, and the host list indicates a subset of data storage nodes in the storage group that also store the file, the host list indicates that the second data storage node is in the subset of data storage nodes.
View Dependent Claims (18, 19, 20)
Specification