System and method for detecting problematic data storage nodes

  • US 9,329,955 B2
  • Filed: 06/28/2011
  • Issued: 05/03/2016
  • Est. Priority Date: 10/24/2008
  • Status: Active Grant
First Claim

1. A method implemented on a first data storage node for maintaining a data storage system, the method comprising:

  • the first data storage node monitoring for receipt of instances of a broadcast message from each of a plurality of data storage nodes included in a storage group, wherein a local node list stored at the first data storage node identifies the plurality of data storage nodes as being included in the storage group, the plurality of data storage nodes identified in the local node list includes at least a second data storage node, and instances of the broadcast message include an indication of a node list version being used by the data storage node that transmitted the instance of the broadcast message such that receipt of a given instance of the broadcast message by the first storage node indicates to the first storage node both that the data storage node that sent the given instance of the broadcast message is operating correctly and whether the storage group has been updated to add or remove a data storage node;

    the first data storage node receiving a first instance of the broadcast message from the second data storage node;

    the first data storage node updating the local node list based on the first instance of the broadcast message indicating that the second data storage node is using a different node list version than the version of the local node list stored at the first data storage node, wherein the first data storage node adds at least a third data storage node to the local node list and begins monitoring for receipt of instances of the broadcast message from the third data storage node in response to the first instance of the broadcast message indicating the second data storage node is using the different node list version;

    the first data storage node detecting that the second data storage node is malfunctioning based on failing to receive a subsequent instance of the broadcast message from the second data storage node for a predetermined period of time; and

    the first data storage node initiating a data replication procedure based on detecting that the second data storage node is malfunctioning, wherein a file stored on the first data storage node is to be replicated in the data replication procedure, the file is associated with a host list stored on the first data storage node, the host list indicates a subset of data storage nodes in the storage group that also store the file, and the host list indicates that the second data storage node is in the subset of data storage nodes.
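
For illustration only, the steps recited in claim 1 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the patented implementation: the class names, field names, and the `fetch_node_list` callback (a stand-in for however the first node obtains the newer node list, which claim 1 does not specify) are all hypothetical. Time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Heartbeat:
    """One instance of the broadcast message (field names are assumed)."""
    sender: str
    node_list_version: int


@dataclass
class StorageNode:
    """Hypothetical model of the 'first data storage node'."""
    name: str
    node_list: set           # local node list: members of the storage group
    node_list_version: int
    timeout: float           # the "predetermined period of time", in seconds
    host_lists: dict = field(default_factory=dict)  # file -> set of nodes storing it
    last_seen: dict = field(default_factory=dict)   # peer -> time of last heartbeat

    def start_monitoring(self, now):
        # Begin monitoring every peer named in the local node list.
        for peer in self.node_list - {self.name}:
            self.last_seen.setdefault(peer, now)

    def on_heartbeat(self, msg, now, fetch_node_list):
        # A differing version signals that the storage group has changed.
        if msg.node_list_version != self.node_list_version:
            # fetch_node_list is an assumed callback returning (nodes, version).
            nodes, version = fetch_node_list(msg.sender)
            self.node_list = set(nodes)
            self.node_list_version = version
            # Start monitoring newly added peers; stop tracking removed ones.
            self.start_monitoring(now)
            for peer in list(self.last_seen):
                if peer not in self.node_list:
                    del self.last_seen[peer]
        # Receipt of the message also shows that the sender is operating.
        if msg.sender in self.node_list:
            self.last_seen[msg.sender] = now

    def detect_failures(self, now):
        # A peer is treated as malfunctioning once it has been silent too long.
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)

    def files_to_replicate(self, failed_node):
        # Only files whose host list names the failed node need replication.
        return sorted(f for f, hosts in self.host_lists.items() if failed_node in hosts)


# Usage: node A learns about node C from node B, then B goes silent.
node = StorageNode(
    name="A", node_list={"A", "B"}, node_list_version=1, timeout=10.0,
    host_lists={"f.txt": {"A", "B"}, "g.txt": {"A", "C"}},
)
node.start_monitoring(now=0.0)
newer_list = lambda sender: ({"A", "B", "C"}, 2)
node.on_heartbeat(Heartbeat("B", 2), now=1.0, fetch_node_list=newer_list)
node.on_heartbeat(Heartbeat("C", 2), now=9.0, fetch_node_list=newer_list)
failed = node.detect_failures(now=12.0)
print(failed)                               # ['B']
print(node.files_to_replicate(failed[0]))   # ['f.txt']
```

In this sketch a single message type serves both purposes named in the claim: its arrival refreshes the sender's liveness timestamp, and its version field triggers a node list update. The per-file host list keeps the replication step narrow, since only files that the failed node was hosting are selected.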
