Software directed microcode state save for distributed storage controller
First Claim
1. In a storage system having distributed system components including a host processor running software applications therein generating record updates and having a data mover, said host processor coupled to a first storage controller, wherein an error condition occurs in said storage system, said data mover executing a machine effected method for coordinating problem determinations amongst said distributed system components, said machine effected method comprising steps of:
(a) issuing I/O operations for record updates generated by said software applications;
(b) storing said record updates in said first storage controller according to said issued I/O operations;
(c) maintaining control information associated with said record updates in said first storage controller;
(d) reading said record updates and associated control information into said data mover from said first storage controller in preparation for remotely copying said record updates;
(e) detecting the storage system error condition in said data mover;
(f) issuing a diagnostic state save channel command word (CCW) from said data mover to said host processor and said first storage controller;
(g) capturing failure information in said host processor and said first storage controller; and
(h) correlating said failure information in said host processor with said failure information in said first storage controller according to said storage system error condition, wherein said diagnostic state save channel command word temporarily suspends operations in said host processor and said storage controller until after said failure information is captured and correlated between said host processor and said first storage controller.
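Steps (f) through (h) can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the claimed implementation: the `Component` class, its `state_save` and `resume` methods, and the dictionary layout of the captured data are all hypothetical names chosen for this example, and the channel command word is modeled as an ordinary method call.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical stand-in for a host processor or storage controller."""
    name: str
    state: dict               # control structures worth capturing
    suspended: bool = False

    def state_save(self, error_code: str) -> dict:
        # Suspend normal processing, then snapshot state tagged with the error.
        self.suspended = True
        return {"component": self.name,
                "error_code": error_code,
                "snapshot": dict(self.state)}

    def resume(self) -> None:
        self.suspended = False


def diagnostic_state_save(error_code: str, components: list) -> dict:
    """Issue the state save, capture, correlate, and only then resume."""
    # (f), (g): every addressed component suspends and captures its state.
    captures = [c.state_save(error_code) for c in components]
    # (h): correlate by grouping all captures under the triggering error.
    correlated = {
        "error_code": error_code,
        "captures": {cap["component"]: cap["snapshot"] for cap in captures},
    }
    # Operations stay suspended until capture and correlation are complete.
    for c in components:
        c.resume()
    return correlated
```

Keying the merged result on the error code is one simple way to keep the host-side and controller-side captures tied to the same error condition for later analysis.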
Abstract
A storage system improves error debugging by directing distributed system components associated with an error condition to temporarily suspend data processing for collecting failure information. The collected failure information is correlated for later analysis according to an issued diagnostic state save channel command word (CCW) that is triggered by the detection of said error condition. The storage system includes a host processor running applications generating record updates. A data mover in the host processor issues the diagnostic state save CCW upon receiving an error code from one of the system components. The failure information includes software, hardware and microcode control structures of the distributed system components.
20 Claims
9. A computer readable storage medium for storing a data mover application for causing state save diagnostics to be merged across multiple system components in response to a storage system error, said system components including a host processor coupled to a storage controller for managing record updates generated by software applications to a direct access storage device (DASD), said data mover application comprising:
means for issuing I/O operations for record updates generated in said software applications;
means for storing said record updates in said first storage controller according to said issued I/O operations;
maintenance means for maintaining control information associated with said record updates in said storage controller;
reading means for reading said record updates and associated control information into said data mover in preparation for remotely copying said record updates;
detecting means for detecting a storage system error condition and communicating said error condition to said data mover;
state save means for issuing a diagnostic state save channel command word (CCW) from said data mover to said host processor and said storage controller;
capture means for capturing failure information in said host processor and said storage controller; and
correlating means for correlating said failure information in each system component of said multiple system components according to the detected error condition.
14. A data storage system for coordinating failure information amongst system components associated with an error condition occurring in said data storage system, said system components including one or more storage controllers coupled to non-volatile storage devices for storing record updates thereon, said data storage system comprising:
a host processor running software applications thereon, said applications generating the record updates and transmitting I/O operations to said one or more storage controllers for eventual storage on said non-volatile storage devices, said host processor further including a first data mover for reading said record updates from said one or more storage controllers and assembling said record updates into groups for transmission to a remote storage system for disaster recovery purposes, said data mover receiving an error code from one of said system components indicating a type of error condition that occurred, said data mover issuing a State Save command to those system components associated with said error condition, said State Save command causing said associated system components to temporarily suspend processing record updates for collecting failure information, said failure information being correlated amongst said system components according to the State Save command, said data mover further comprising;
a trace queue for storing failure information associated with said data mover;
a control section for managing record updates read into said data mover; and
a plurality of buffers for storing said record updates and header information associated with said record updates.
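The three parts of the data mover named in this claim (trace queue, control section, buffers) can be sketched as a small data structure. This is only an illustration under assumptions: the class and field names, the bounded queue length of 256, and the `seq` header key are hypothetical and do not come from the claim.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RecordUpdate:
    header: dict      # header information associated with the update
    data: bytes


@dataclass
class DataMover:
    # Trace queue: bounded history of data mover events (length is assumed).
    trace_queue: deque = field(default_factory=lambda: deque(maxlen=256))
    # Control section: bookkeeping for updates read into the data mover.
    control: dict = field(default_factory=dict)
    # Buffers: record updates, with their headers, awaiting grouping.
    buffers: list = field(default_factory=list)

    def read_update(self, update: RecordUpdate) -> None:
        self.buffers.append(update)
        self.control["updates_read"] = self.control.get("updates_read", 0) + 1
        self.trace_queue.append(("read", update.header.get("seq")))

    def assemble_group(self) -> list:
        # Hand off everything buffered so far as one group and start afresh.
        group, self.buffers = self.buffers, []
        self.trace_queue.append(("group", len(group)))
        return group
```

A bounded queue is used here so that tracing cannot grow without limit; whatever is in `trace_queue` at the time of a State Save command is the data mover's own contribution to the failure information.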
20. A disaster recovery data storage system having a primary site for processing data and having a secondary site receiving copies of the data for disaster recovery purposes, said disaster recovery data storage system having distributed system components with software, hardware and microcode control structures embedded therein, said disaster recovery data storage system aiding error debugging by triggering the co-ordination and capture of failure information amongst the distributed system components associated with a particular error condition occurring in said disaster recovery data storage system, said disaster recovery data storage system comprising:
a plurality of storage controllers, each storage controller having a cache memory, a control buffer and a trace buffer;
a primary host processor running applications thereon, said applications generating record updates and transmitting an I/O operation for each record update to said plurality of storage controllers for writing said record updates thereto, said host processor further including a primary data mover for reading said record updates from said plurality of storage controllers and assembling said record updates into groups of self describing record sets, said primary data mover receiving an error code from one of said distributed system components indicating a type of error condition that occurred, said primary data mover issuing a Diagnostic State Save channel command directed to those distributed system components associated with said error condition for causing said associated system components to temporarily suspend processing record updates for collecting failure information, such failure information being correlated according to the Diagnostic State Save channel command, said data mover further comprising;
a trace queue for storing failure information associated with said primary data mover;
a control section for managing record updates read into said primary data mover; and
a plurality of buffers for storing said record updates and their associated headers;
a plurality of primary direct access storage devices (DASDs) coupled to said plurality of primary storage controllers;
a secondary host processor coupled to the secondary data mover and responsive to said Diagnostic State Save channel command;
a secondary data mover coupled for receiving the groups of self describing record sets and responsive to said Diagnostic State Save channel command;
a plurality of secondary storage controllers coupled to said secondary host processor and responsive to said Diagnostic State Save channel command; and
a plurality of secondary DASDs for storing said groups of self describing record sets.
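The claim directs the Diagnostic State Save channel command only to the components associated with the reported error condition, which may sit at either site. A minimal sketch of that selection follows. The error codes, the component names, and the `ERROR_SCOPE` table are entirely hypothetical; the claim does not say how the association between an error code and its components is stored.

```python
# Hypothetical mapping from an error code to the components it implicates.
ERROR_SCOPE = {
    "PRIMARY_CACHE_FAULT": {"primary_data_mover", "primary_controller_1"},
    "REMOTE_COPY_FAULT": {"primary_data_mover", "secondary_data_mover",
                          "secondary_controller_1"},
}


def state_save_targets(error_code: str, installed: list) -> list:
    """Return the installed components that should receive the state save."""
    scope = ERROR_SCOPE.get(error_code, set())
    # Preserve the caller's ordering; unknown error codes select nothing.
    return [name for name in installed if name in scope]
```

With such a table, an error raised while copying to the secondary site would suspend and capture state on the secondary data mover and controller as well as the primary data mover, while leaving unrelated controllers running.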
Specification