Data management application programming interface failure recovery in a parallel file system
First Claim
1. In a cluster of computing nodes having shared access to one or more volumes of data storage using a parallel file system, a method for managing the data storage, comprising:
initiating a session of a data management application on a session node selected from among the nodes in the cluster;
receiving an event message in a session queue for processing by the data management application at the session node, responsive to a request submitted to the parallel file system by a user application on a source node among the nodes in the cluster to perform a file operation on a file in the data storage; and
following a failure at the session node, reconstructing the session queue so that processing of the event message by the data management application can continue after recovery from the failure, wherein reconstructing the session queue comprises selecting a new session node from among the nodes in the cluster, and assuming the data management session on the new session node, whereupon the session queue is reconstructed on the new session node, and wherein assuming the data management session comprises assuming the session on the same session node that was used before the failure.
Abstract
In a cluster of computing nodes having shared access to one or more volumes of data storage using a parallel file system, a method for managing the data storage includes initiating a session of a data management application on a session node selected from among the nodes in the cluster. The session node receives an event message in a session queue for processing by the data management application, responsive to a request submitted to the parallel file system by a source node among the nodes in the cluster to perform a file operation on a file in the data storage. Following a failure at the session node, the session queue is reconstructed so that processing of the event message by the data management application can continue after recovery from the failure, and the request can be fulfilled at the source node.
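As an illustration only, the recovery flow summarized in the abstract can be sketched as a small single-process Python simulation. The sketch assumes, following the reconstruction step recited in independent claim 17, that each source node remembers its unanswered events and re-submits them when prompted after the failure. The names `Cluster`, `Node`, `Event`, `submit`, `fail_session_node`, `recover`, and `respond` are hypothetical; they do not come from the patent or from any real data management API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    source_node: str  # node whose user application issued the request
    file_path: str
    operation: str    # for example "read" or "write"


@dataclass
class Node:
    name: str
    # Events this node has submitted that have not yet been answered.
    pending: list = field(default_factory=list)


class Cluster:
    """Hypothetical model: one session node holds the session queue in memory."""

    def __init__(self, node_names, session_node):
        self.nodes = {name: Node(name) for name in node_names}
        self.session_node = session_node
        self.session_queue = deque()

    def submit(self, source, file_path, operation):
        # A file operation on a source node generates an event message.
        event = Event(source, file_path, operation)
        self.nodes[source].pending.append(event)
        self.session_queue.append(event)
        return event

    def fail_session_node(self):
        # The queue is lost together with the failed session node.
        self.session_queue = None

    def recover(self, session_node):
        # The session is assumed on a session node (the same node or another
        # one), then every node is prompted to re-submit its pending events.
        self.session_node = session_node
        self.session_queue = deque()
        for node in self.nodes.values():
            for event in node.pending:
                self.session_queue.append(event)

    def respond(self):
        # The data management application answers the oldest queued event,
        # which releases it on the source node.
        event = self.session_queue.popleft()
        self.nodes[event.source_node].pending.remove(event)
        return event
```

In this model the session queue is never persisted. It is rebuilt purely from state that survives on the source nodes, which is why the failure of the session node does not lose any outstanding request.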
57 Claims
1. In a cluster of computing nodes having shared access to one or more volumes of data storage using a parallel file system, a method for managing the data storage, comprising:
initiating a session of a data management application on a session node selected from among the nodes in the cluster;
receiving an event message in a session queue for processing by the data management application at the session node, responsive to a request submitted to the parallel file system by a user application on a source node among the nodes in the cluster to perform a file operation on a file in the data storage; and
following a failure at the session node, reconstructing the session queue so that processing of the event message by the data management application can continue after recovery from the failure,
wherein reconstructing the session queue comprises selecting a new session node from among the nodes in the cluster, and assuming the data management session on the new session node, whereupon the session queue is reconstructed on the new session node, and
wherein assuming the data management session comprises assuming the session on the same session node that was used before the failure.
Dependent claims: 2-16.
17. In a cluster of computing nodes having shared access to one or more volumes of data storage using a parallel file system, a method for managing the data storage, comprising:
initiating a session of a data management application on a session node selected from among the nodes in the cluster;
receiving an event message in a session queue for processing by the data management application at the session node, responsive to a request submitted to the parallel file system by a user application on a source node among the nodes in the cluster to perform a file operation on a file in the data storage;
following a failure at the session node, reconstructing the session queue so that processing of the event message by the data management application can continue after recovery from the failure;
sending a response to the event message from the data management application on the session node to the source node following the recovery from the failure; and
performing the file operation requested by the source node subject to the response from the data management application,
wherein receiving the event message comprises receiving the message responsive to submission of the request by a file operation thread of a user application running on the source node, and blocking the thread until the response is received from the session node after the recovery from the failure, and
wherein reconstructing the session queue comprises sending a message from the session node to all of the nodes, so as to prompt the file operation thread on the source node to submit a new event message to the session node, whereby the event is placed in the reconstructed queue responsive to the new message, and
wherein prompting the file operation thread comprises instructing the file operation thread to submit the new event message with respect to an event that is defined as a synchronous event.
Dependent claims: 18, 19.
20. Computing apparatus, comprising:
one or more volumes of data storage, arranged to store data; and
a plurality of computing nodes, linked to access the volumes of data storage using a parallel file system, and arranged so as to enable a data management application to initiate a data management session on a session node selected among the nodes in the cluster, so that when a request is submitted to the parallel file system by a user application on a source node among the nodes in the cluster to perform a file operation on a file in the data storage, an event message is received at the session node responsive to the request, for processing by the data management application, and so that following a failure at the session node, the session queue is reconstructed so that processing of the event message by the data management application can continue after recovery from the failure,
wherein the nodes are arranged so that following the failure, a new session node is selected from among the nodes on which the failure has not occurred, and the data management session is assumed on the new session node, whereupon the session queue is reconstructed on the new session node, and
wherein the session is assumed on the same session node that was used before the failure.
Dependent claims: 21-35.
36. Computing apparatus, comprising:
one or more volumes of data storage, arranged to store data; and
a plurality of computing nodes, linked to access the volumes of data storage using a parallel file system, and arranged so as to enable a data management application to initiate a data management session on a session node selected among the nodes in the cluster, so that when a request is submitted to the parallel file system by a user application on a source node among the nodes in the cluster to perform a file operation on a file in the data storage, an event message is received at the session node responsive to the request, for processing by the data management application, and so that following a failure at the session node, the session queue is reconstructed so that processing of the event message by the data management application can continue after recovery from the failure,
wherein the nodes are arranged so that a response to the event message is sent from the data management application on the session node to the source node following the recovery from the failure, whereupon the file operation requested by the source node is carried out subject to the response from the data management application, and
wherein the event message is received responsive to submission of the request by a file operation thread of a user application running on the source node, and the thread is blocked until the response is received from the session node after the recovery from the failure, and
wherein to reconstruct the session queue, a message is sent from the session node to all of the nodes, so that the file operation thread on the source node is prompted to submit a new event message to the session node, whereby the event is placed in the reconstructed queue responsive to the new message, and
wherein the file operation thread is prompted to submit the new event message with respect to an event that is defined as a synchronous event.
Dependent claims: 37, 38.
39. A computer software product for use in a cluster of computing nodes having shared access to one or more volumes of data storage using a parallel file system, the product comprising a computer-readable medium in which program instructions are stored, which instructions, when read by the computing nodes, cause a session of a data management application to be initiated on a session node selected among the nodes in the cluster, such that when a user application on a source node among the nodes in the cluster submits a request to the parallel file system to perform a file operation on a file in the data storage, an event message is received at the session node, for processing by the data management application, and such that following a failure at the session node, the session queue is reconstructed so that processing of the event message by the data management application can continue after recovery from the failure,
wherein following the failure, the instructions cause a new session node to be selected from among the nodes on which the failure has not occurred, whereupon the data management session is assumed on the new session node, and the session queue is reconstructed on the new session node, and wherein the session is assumed on the same session node that was used before the failure.
55. A computer software product for use in a cluster of computing nodes having shared access to one or more volumes of data storage using a parallel file system, the product comprising a computer-readable medium in which program instructions are stored, which instructions, when read by the computing nodes, cause a session of a data management application to be initiated on a session node selected among the nodes in the cluster, such that when a user application on a source node among the nodes in the cluster submits a request to the parallel file system to perform a file operation on a file in the data storage, an event message is received at the session node, for processing by the data management application, and such that following a failure at the session node, the session queue is reconstructed so that processing of the event message by the data management application can continue after recovery from the failure,
wherein the instructions cause a response to the event message to be sent from the data management application on the session node to the source node following the recovery from the failure, whereupon the file operation requested by the source node is carried out subject to the response from the data management application, and wherein the event message is received responsive to submission of the request by a file operation thread of a user application running on the source node, and the thread is blocked until the response is received from the session node after the recovery from the failure, and wherein to reconstruct the session queue, a message is sent from the session node to all of the nodes, so that the file operation thread on the source node is prompted to submit a new event message to the session node, whereby the event is placed in the reconstructed queue responsive to the new message, and wherein the file operation thread is prompted to submit the new event message with respect to an event that is defined as a synchronous event.
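The blocking behavior recited in claims 17, 36 and 55 (a file operation thread that stays blocked across the failure and re-submits its synchronous event when prompted) can be sketched with real threads. This is a minimal sketch under stated assumptions, not the patented implementation: `SessionNode`, `PendingOperation`, `file_operation_thread`, `run_demo`, and the `"continue"` response value are all hypothetical names chosen for illustration, and the message sent to all of the nodes is reduced to waking a single waiting thread.

```python
import queue
import threading


class SessionNode:
    """Hypothetical session node; its queue exists only in memory."""

    def __init__(self):
        self.session_queue = queue.Queue()
        self.arrivals = threading.Semaphore(0)  # counts submitted event messages

    def submit(self, op):
        op.submissions += 1
        self.session_queue.put(op)
        self.arrivals.release()

    def fail_and_recover(self):
        # The failure discards the queue; recovery starts with an empty one.
        self.session_queue = queue.Queue()


class PendingOperation:
    """A synchronous event tied to one blocked file operation thread."""

    def __init__(self, file_path, operation):
        self.file_path = file_path
        self.operation = operation
        self.response = None
        self.submissions = 0
        self.wakeup = threading.Event()


def file_operation_thread(session, op, log):
    # Submit the event, then block. If woken without a response (the
    # recovery message), submit a new event message and block again.
    while op.response is None:
        session.submit(op)
        op.wakeup.wait()
        op.wakeup.clear()
    # The file operation is performed subject to the response.
    if op.response == "continue":
        log.append(f"{op.operation} {op.file_path}")


def run_demo():
    session = SessionNode()
    op = PendingOperation("/data/report", "read")
    log = []
    worker = threading.Thread(target=file_operation_thread, args=(session, op, log))
    worker.start()

    session.arrivals.acquire()   # the first event message has been queued
    session.fail_and_recover()   # the queued event is lost with the failure
    op.wakeup.set()              # stands in for the message sent to all nodes

    event = session.session_queue.get(timeout=5)  # the re-submitted event
    event.response = "continue"  # response from the data management application
    event.wakeup.set()           # unblocks the file operation thread
    worker.join(timeout=5)
    return log, op.submissions
```

Calling `run_demo()` drives one operation through a failure. The thread submits its event twice, once before the failure and once after the recovery message, and performs the file operation only after the response arrives.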
Specification