System and method to monitor and isolate faults in a storage area network

US 7,043,663 B1
Filed: 06/28/2002
Issued: 05/09/2006
Est. Priority Date: 11/15/2001
Status: Active Grant



Abstract

A fiber channel storage area network (SAN) provides virtualized storage space for a number of servers to a number of virtual disks implemented on various virtual redundant array of inexpensive disks (RAID) devices striped across a plurality of physical disk drives. The SAN includes plural controllers and communication paths to allow for fail-safe and fail-over operation. The plural controllers can be loosely-coupled to provide n-way redundancy and have more than one independent channel for communicating with one another. In the event of a failure involving a controller or controller interface, the virtual disks that are accessed via the affected interfaces are re-mapped to another interface in order to continue to provide high data availability. In particular, deadman timers, heartbeat signals internal to each controller, and heartbeat signals between different controllers are used to detect controllers that are no longer communicating with other controllers in order to identify those controllers which are failing or have failed.
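The deadman timers mentioned in the abstract can be illustrated with a minimal sketch. The class name `DeadmanTimer`, its `kick`/`expired` methods, and the timeout value are hypothetical and chosen only for illustration; the abstract does not specify an implementation. Time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class DeadmanTimer:
    """Expires unless it is reset ("kicked") within timeout_s seconds.

    Hypothetical illustration; names and values are assumptions.
    """
    timeout_s: float
    last_kick: float = 0.0

    def kick(self, now: float) -> None:
        # Record the arrival time of the most recent heartbeat.
        self.last_kick = now

    def expired(self, now: float) -> bool:
        # The monitored party is presumed failed once the timeout has elapsed
        # without a new heartbeat.
        return now - self.last_kick > self.timeout_s


timer = DeadmanTimer(timeout_s=2.0)
timer.kick(now=10.0)
print(timer.expired(now=11.5))  # False: heartbeat arrived 1.5 s ago
print(timer.expired(now=12.5))  # True: 2.5 s without a heartbeat
```

A real controller would presumably read a monotonic clock rather than take `now` as an argument; the explicit parameter is used here only to keep the sketch testable.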


16 Claims

1. A storage area network comprising:
- a plurality of loosely-coupled storage controllers arranged in a redundant configuration to provide, to a plurality of servers, access to virtualized storage, wherein one of the storage controllers operates as a master storage controller and the other storage controller or controllers operate as slave storage controllers;
  
  a respective monitoring application executing on each of the storage controllers configured to determine whether or not the storage controllers are operating properly; and
  
  two or more communication channels coupling the storage controllers and wherein;
  
  the storage controllers are logically arranged in a binary tree having a root node and one or more child nodes such that the master storage controller is the root node of the tree and the slave storage controller or controllers are the child nodes, wherein the root node and each child node have, at most, two associated child nodes; and
  
  each particular node is configured to periodically send, over at least one of the two or more communications channels, a respective inquiry message to each of its associated child nodes and, in response to an inquiry message, each associated child node is configured to send, over at least one of the two or more communications channels, an acknowledgement message to its parent node.
- - 2. The storage area network according to claim 1, wherein each respective monitoring application, executing on one of the storage controllers, further comprises:
    - an internal monitoring routine configured to determine whether or not the one storage controller is operating properly; and
      
      an external monitoring routine configured to determine whether or not any storage controller other than the one storage controller is operating properly.
  - 3. The storage area network according to claim 2, wherein:
    - each storage controller further comprises an associated front-end processor, a back-end processor, and a control processor; and
      
      wherein the control processor is configured to periodically send an inquiry message to each of the front-end processor and back-end processor and each of the front-end processor and the back-end processor is configured to reply to the control processor in response to each inquiry message periodically sent.
  - 4. The storage area network according to claim 3, wherein:
    - if the front-end processor fails to receive a first inquiry message after a first predetermined period of time since an immediately previous inquiry message sent to the front-end processor, then the internal monitoring routine on the associated storage controller determines the associated storage controller is not operating properly;
      
      if the back-end processor fails to receive a second inquiry message after a second predetermined period of time since an immediately previous inquiry message sent to the back-end processor, then the internal monitoring routine on the associated storage controller determines the associated storage controller is not operating properly;
      
      if the control processor fails to receive a response to either one of the first or second inquiry messages within a third predetermined period of time, then the internal monitoring routine on the associated storage controller determines the associated controller is not operating properly.
  - 5. The storage area network according to claim 4, wherein:
    - the internal monitoring routine on the associated storage controller is configured to halt operation of the associated storage controller upon determining the associated storage controller is not operating properly.
  - 6. The storage area network according to claim 1, wherein each acknowledgement message sent from a particular slave storage controller includes an indication of operating status for that particular slave storage controller and for all slave storage controllers considered to be below that particular slave controller in the binary tree.
  - 7. The storage area network according to claim 1, wherein the external monitoring routine on a particular one of the storage controllers determines a failure condition has occurred in response to either:
    - failing to receive an expected inquiry message from a parent node of the particular one storage controller in the binary tree, or
      
      failing to receive a respective, expected acknowledgement message from any child nodes directly beneath the particular one storage controller in the binary tree.
  - 8. The storage area network according to claim 1, wherein:
    - each acknowledgement message sent from a particular slave storage controller includes a log of the operating statistics for that particular slave storage controller and for all slave storage controllers considered to be below that particular slave storage controller in the binary tree.
  - 9. The storage area network according to claim 1, wherein:
    - each monitoring application on a respective slave storage controller is further configured to determine if a failure has occurred in one or more of the storage controllers and to report the failure to the monitoring application on the master storage controller.
  - 10. The storage area network according to claim 9, wherein:
    - the monitoring application on the master storage controller is configured to determine a solution for the reported failure and to forward the solution to a resource management application executing on the master storage controller which is configured to reconfigure the virtualized storage according to the reported solution.
  - 11. The method according to claim 10, further comprising the steps of:
    - if that particular node that determines the failure is one of the slave storage controllers, then forwarding a message, relating to the failure, to the master storage controller; and
      
      if that particular node that determines the failure is the master storage controller, then forwarding an indication of the failure to a resource management application executing on the master storage controller.
  - 12. The method according to claim 11, further comprising the steps of:
    - in response to receiving the indication of the failure, redistributing resources within the storage area network based on the received indication.
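
The binary-tree arrangement of claim 1 and the aggregated acknowledgement of claim 6 can be sketched as follows. This is an illustrative simulation only: the heap-style index layout (node 0 as the master, children at `2*i + 1` and `2*i + 2`), the function names, and the `healthy` dictionary are assumptions, not details taken from the claims. In a real network a failed controller would simply not reply; here its status is looked up directly to keep the example short.

```python
from typing import Dict, List


def children(index: int, count: int) -> List[int]:
    """Child indices of node `index` in a heap-style binary tree of `count`
    nodes. Node 0 is the root (the master controller); each node has at
    most two children."""
    return [c for c in (2 * index + 1, 2 * index + 2) if c < count]


def acknowledge(index: int, healthy: Dict[int, bool], count: int) -> Dict[int, bool]:
    """Build the acknowledgement sent by node `index`: its own status plus
    the status of every node beneath it in the tree."""
    report = {index: healthy[index]}
    for child in children(index, count):
        # Each child's acknowledgement already covers its own subtree.
        report.update(acknowledge(child, healthy, count))
    return report


# Five controllers: 0 is the master, 1 to 4 are slaves; controller 3 is unhealthy.
healthy = {0: True, 1: True, 2: True, 3: False, 4: True}
print(children(0, 5))              # [1, 2]
print(children(1, 5))              # [3, 4]
print(acknowledge(1, healthy, 5))  # {1: True, 3: False, 4: True}
```

With this layout the master learns the status of the whole tree from at most two acknowledgements, one per immediate child.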

13. A method, in a storage area network comprising plural, loosely-coupled redundant storage controllers, for monitoring the operational status of the storage controllers, said method comprising the steps of:
- arranging the storage controllers logically into a binary tree structure having a root node and one or more child nodes such that a master controller from among the storage controllers is the root node of the tree and the other storage controllers, operating as slave controllers, are the child nodes, wherein the root node and each child node have, at most, two associated child nodes;
  
  monitoring at each particular node an internal operating status of that particular node;
  
  monitoring at each particular node an operating status of any immediate parent node and any immediate child nodes, wherein an immediate parent node is a node arranged in the binary tree above the particular node so as to have no intervening node, and wherein an immediate child node is a node arranged in the tree below the particular node so as to have no intervening node; and
  
  determining, at each particular node, if a failure has occurred based on either monitoring step.
- - 14. The method according to claim 13, wherein the step of monitoring at each particular node an internal operating status of that node, further includes the steps of:
    - periodically sending a first inquiry message from a control processor of that particular node to a front-end processor at that particular node;
      
      in response to the first inquiry message, the front-end processor sending a first acknowledgement message to the control processor;
      
      periodically sending a second inquiry message from the control processor to a back-end processor at that particular node;
      
      in response to the second inquiry message, the back-end processor sending a second acknowledgement message to the control processor; and
      
      determining that an error at that particular node has occurred if any of the first inquiry message, second inquiry message, first acknowledgment message, or second acknowledgment message are not received.
  - 15. The method according to claim 14, wherein the step of monitoring at each particular node an operating status of any immediate parent node and any immediate child nodes, further includes the steps of:
    - periodically sending a first inquiry message from that particular node to a first immediate child node, if any;
      
      in response to the first inquiry message, the first immediate child node sending a first acknowledgement message to that particular node;
      
      periodically sending a second inquiry message from that particular node to a second immediate child node, if any;
      
      in response to the second inquiry message, the second immediate child node sending a second acknowledgement message to that particular node;
      
      detecting at that particular node whether the first acknowledgement message has not been received within a first predetermined period of time since a most recently sent first inquiry message;
      
      detecting at that particular node whether the second acknowledgement message has not been received within a second predetermined period of time since a most recently sent second inquiry message; and
      
      determining a respective one of the immediate child nodes has failed based on the detecting steps.
  - 16. The method according to claim 15, wherein:
    - the first acknowledgement message includes operating statistics regarding the first immediate child node and any other nodes beneath the first immediate child node; and
      
      the second acknowledgement message, if any, includes operating statistics regarding the second immediate child node and any other nodes beneath the second immediate child node.


Current Assignee: Xiotech Corporation
Original Assignee: Xiotech Corporation
Inventors: Pittelkow, Michael Henry; Olson, Mark David
Primary Examiner(s): Beausoliel, Robert
Assistant Examiner(s): Wilson, Yolanda L
Application Number: US10/184,059
Time in Patent Office: 1,411 Days
Field of Search: 714/4, 714/6
US Class Current: 714/4.4
CPC Class Codes: G06F 11/2092 Techniques of failing over ...
