Fault tolerance for a distributed computing system
Abstract
In one embodiment, a method detects a failure of a container in a controller node, where the container includes a service that is being performed in isolation from other services being performed in other containers on the controller node. The controller node terminates the container including the service and determines a known state for the service. The known state is known to be operational and does not include a cause of the failure; while operating from the known state, the service saves its changes separately from the known state. The controller node restarts the service in a new container that replaces the terminated container, and the restarted service starts from the known state without using the changes.
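The recovery cycle described in the abstract can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the names `Container`, `ControllerNode`, `start`, and `recover` are hypothetical, and a read-only mapping stands in for the known state.

```python
from dataclasses import dataclass, field
from itertools import count
from types import MappingProxyType

_next_id = count(1)  # hypothetical container id generator


@dataclass
class Container:
    service: str
    known_state: MappingProxyType                 # read-only known-good state
    changes: dict = field(default_factory=dict)   # runtime changes, kept separately
    running: bool = True
    failed: bool = False
    id: int = field(default_factory=lambda: next(_next_id))

    def view(self) -> dict:
        # What the service sees: the known state overlaid with its own changes.
        return {**self.known_state, **self.changes}


class ControllerNode:
    def __init__(self, known_states: dict):
        # One known-good state per service, wrapped so it cannot be mutated.
        self._known = {name: MappingProxyType(dict(state))
                       for name, state in known_states.items()}
        self.containers = {}

    def start(self, service: str) -> Container:
        container = Container(service, self._known[service])
        self.containers[service] = container
        return container

    def recover(self, service: str) -> Container:
        old = self.containers[service]
        if not old.failed:
            return old
        old.running = False          # terminate the failed container
        # Restart in a new container from the known state; the old
        # container's changes are not carried over.
        return self.start(service)
```

In this sketch the changes that may have caused the failure live only in the terminated container's `changes` dictionary, so the replacement starts clean.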
16 Claims
1. A method comprising:
detecting a failure of a container, of a set of containers, in a controller node, the container executing a service being performed and isolated from at least one other service being performed in at least one other container on the controller node;
terminating, by the controller node, the container executing the service;
determining, by the controller node, a particular known state for the service, wherein the particular known state is known to be operational without including one or more changes that caused the failure, and wherein the service saves the changes to the particular known state during operation separately from the particular known state;
restarting, by the controller node, the service in a new container that replaces the terminated container, wherein the restarted service starts from the particular known state without using the changes;
wherein an orchestration service, configured to manage the set of containers, detects the failure;
wherein the orchestration service detects the failure via monitoring a communication service in which a status of the service is input; and
wherein the method is performed by at least one device including a hardware processor.
Dependent claims: 2, 3, 4, 5
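The detection step in claim 1, in which an orchestration service monitors a communication service that receives service status, might look like the following sketch. The `StatusBus` and `Orchestrator` classes, the `"ok"` status string, and the staleness timeout are all assumptions made for illustration; the claim does not specify how a failure is inferred from the status.

```python
import time


class StatusBus:
    """Hypothetical stand-in for the communication service: services post status here."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._latest = {}

    def post(self, service: str, status: str) -> None:
        # Record the most recent status together with the time it was posted.
        self._latest[service] = (status, self._clock())

    def latest(self, service: str):
        return self._latest.get(service)


class Orchestrator:
    """Hypothetical orchestration service that watches the bus for failures."""

    def __init__(self, bus, services, timeout=5.0, clock=time.monotonic):
        self._bus = bus
        self._services = list(services)
        self._timeout = timeout
        self._clock = clock

    def failed_services(self) -> list:
        now = self._clock()
        failed = []
        for name in self._services:
            entry = self._bus.latest(name)
            if entry is None:
                failed.append(name)   # assumption: never reporting counts as failed
                continue
            status, posted_at = entry
            # Assumption: a non-"ok" status or a stale status means failure.
            if status != "ok" or now - posted_at > self._timeout:
                failed.append(name)
        return failed
```

Passing the clock in as a parameter keeps the sketch deterministic under test; a real deployment would use whatever health-reporting mechanism the orchestration layer provides.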
6. A method comprising:
detecting a failure of a container, of a set of containers, in a controller node, the container executing a service being performed and isolated from at least one other service being performed in at least one other container on the controller node;
terminating, by the controller node, the container executing the service;
determining, by the controller node, a particular known state for the service, wherein the particular known state is known to be operational without including one or more changes that caused the failure, and wherein the service saves the changes to the particular known state during operation separately from the particular known state;
restarting, by the controller node, the service in a new container that replaces the terminated container, wherein the restarted service starts from the particular known state without using the changes;
upon restarting with the particular known state, determining, by the service, configuration data or state data for the service from storage; and
wherein the method is performed by at least one device including a hardware processor.
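The last operative step of claim 6, in which the restarted service determines its configuration data or state data from storage, can be sketched as follows. The `RestartedService` class, the JSON file layout, and the `"config"` and `"state"` keys are hypothetical; the claim only requires that the data come from storage rather than from the discarded changes.

```python
import json
from pathlib import Path


class RestartedService:
    """Hypothetical service that starts from a known state, then reads storage."""

    def __init__(self, known_state: dict, storage_path):
        self.known_state = dict(known_state)   # state the new container starts from
        self.storage_path = Path(storage_path) # assumed to live outside the container
        self.config = {}
        self.state_data = {}

    def load_from_storage(self) -> None:
        # If nothing has been persisted yet, keep the empty defaults.
        if not self.storage_path.exists():
            return
        stored = json.loads(self.storage_path.read_text(encoding="utf-8"))
        self.config = stored.get("config", {})
        self.state_data = stored.get("state", {})
```

Keeping the persisted data outside the container is what lets the service recover working configuration even though the failed container's own changes are thrown away.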
7. A method comprising:
detecting a failure of a container, of a set of containers, in a controller node, the container executing a service being performed and isolated from at least one other service being performed in at least one other container on the controller node;
terminating, by the controller node, the container executing the service;
determining, by the controller node, a particular known state for the service, wherein the particular known state is known to be operational without including one or more changes that caused the failure, and wherein the service saves the changes to the particular known state during operation separately from the particular known state;
restarting, by the controller node, the service in a new container that replaces the terminated container, wherein the restarted service starts from the particular known state without using the changes;
wherein:
the particular known state is included in a file system,
the service with the failure records differences to the file system without changing the file system,
the changes are not used in restarting the service in the new container, and
the method is performed by at least one device including a hardware processor.
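The file-system arrangement in claim 7, where differences are recorded without changing the underlying file system, resembles a copy-on-write overlay. The sketch below models that idea with Python's `collections.ChainMap`; the paths, file contents, and the `new_container_view` helper are hypothetical, and a real system would use a layered file system rather than dictionaries.

```python
from collections import ChainMap
from types import MappingProxyType

# Hypothetical known-good file system: path -> contents, read-only.
BASE_FS = MappingProxyType({
    "/etc/app.conf": "mode=safe",
    "/bin/app": "v1",
})


def new_container_view(base):
    """Return (view, diff): writes through the view land only in diff."""
    diff = {}
    # ChainMap looks up keys in diff first, then base; writes go to diff only.
    return ChainMap(diff, base), diff


# A running service changes a file; the base file system is untouched.
view, diff = new_container_view(BASE_FS)
view["/etc/app.conf"] = "mode=broken"

# After a failure, a fresh view over the same base discards the differences.
fresh_view, fresh_diff = new_container_view(BASE_FS)
```

Because every write lands in the per-container `diff`, restarting from the known state amounts to dropping that dictionary and building a new view over the same base.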
8. A system comprising:
at least one device including a hardware processor;
the system being configured to perform operations comprising:
detecting a failure of a container, in a set of containers, in a controller node, the container executing a service being performed and isolated from at least one other service being performed in at least one other container on the controller node;
terminating, by the controller node, the container executing the service;
determining, by the controller node, a particular known state for the service, wherein the particular known state is known to be operational without including one or more changes that caused the failure, and wherein the service saves the changes to the particular known state during operation separately from the particular known state;
restarting, by the controller node, the service in a new container that replaces the terminated container, wherein the restarted service starts from the known state without using the changes; and
wherein an orchestration service, configured to manage the set of containers, detects the failure;
wherein the orchestration service detects the failure via monitoring a communication service in which a status of the service is input.
Dependent claims: 9, 10, 11, 12
13. A system comprising:
at least one device including a hardware processor;
the system being configured to perform operations comprising:
detecting a failure of a container, in a set of containers, in a controller node, the container executing a service being performed and isolated from at least one other service being performed in at least one other container on the controller node;
terminating, by the controller node, the container executing the service;
determining, by the controller node, a particular known state for the service, wherein the particular known state is known to be operational without including one or more changes that caused the failure, and wherein the service saves the changes to the particular known state during operation separately from the particular known state;
restarting, by the controller node, the service in a new container that replaces the terminated container, wherein the restarted service starts from the particular known state without using the changes; and
upon restarting with the particular known state, determining, by the service, configuration data or state data for the service from storage.
14. A system comprising:
at least one device including a hardware processor;
the system being configured to perform operations comprising:
detecting a failure of a container, in a set of containers, in a controller node, the container executing a service being performed and isolated from at least one other service being performed in at least one other container on the controller node;
terminating, by the controller node, the container executing the service;
determining, by the controller node, a particular known state for the service, wherein the particular known state is known to be operational without including one or more changes that caused the failure, and wherein the service saves the changes to the particular known state during operation separately from the particular known state;
restarting, by the controller node, the service in a new container that replaces the terminated container, wherein the restarted service starts from the particular known state without using the changes;
wherein:
the particular known state is included in a file system,
the service with the failure records differences to the file system without changing the file system, and
the changes are not used in restarting the service in the new container.
15. A non-transitory computer-readable storage medium containing instructions, that when executed, control a computer system to be configured for:
detecting a failure of a container, in a set of containers, in a controller node, the container executing a service being performed and isolated from at least one other service being performed in at least one other container on the controller node;
terminating, by the controller node, the container executing the service;
determining, by the controller node, a particular known state for the service, wherein the particular known state is known to be operational without including one or more changes that caused the failure, and wherein the service saves changes to the particular known state during operation separately from the particular known state;
restarting, by the controller node, the service in a new container that replaces the terminated container, wherein the restarted service starts from the particular known state without using the changes; and
wherein an orchestration service, configured to manage the set of containers, detects the failure;
wherein the orchestration service detects the failure via monitoring a communication service in which a status of the service is input.
Dependent claims: 16
Specification