System and method for self-healing a database server in a cluster
First Claim
1. A system comprising:
- a plurality of database servers, each database server in the plurality of database servers hosting shards of a database, each shard of the shards of the database having been split from a partition of the database and each partition of the database having been split from the database, each database server in the plurality of database servers having a unique identifier such that a status of each database server in the plurality of database servers can be accessed by other servers in the plurality of database servers, wherein each database server in the plurality of database servers is configured to;
receive a triggering action comprising;
receiving an indication that a minimum timer has expired; and
receiving a pre-determined number of queries;
detect a suspicious observation;
discover that a particular server is underperforming;
compile a plurality of statistics regarding itself, wherein the plurality of statistics is chosen from one of the following;
memory usage, disk activity levels, CPU load, and error rates; and
store the plurality of statistics in a data store accessible by;
(1) each database server in the plurality of database servers; and
(2) a load balancer; and
the load balancer configured to;
allocate queries among the plurality of database servers using load balancing techniques;
determine when a condition has occurred by;
accessing the plurality of statistics in the data store; and
determining that a malfunctioning database server of the plurality of database servers is malfunctioning, comprising determining when one or more of the plurality of statistics stored in the data store by the malfunctioning database server does not meet performance thresholds;
initiate an automatic self-corrective action in a database server in the plurality of database servers, the automatic self-corrective action comprising the database server taking itself out of a rotation for a predetermined amount of time configured to allow the database server to catch up; and
perform a corrective action on the malfunctioning database server comprising;
determining that the malfunctioning database server cannot correct itself;
writing an entry in the data store indicating that the malfunctioning database server is not available;
causing the malfunctioning database server to no longer receive instructions; and
forwarding shard-level queries originally directed to the malfunctioning database server to one or more other database servers of the plurality of database servers.
3 Assignments
0 Petitions
Accused Products
Abstract
A system and method for implementing a database system is presented. A database cluster can comprise multiple database servers. Each database server is configured to regularly compile various statistics upon the occurrence of a triggering event. These statistics can be stored along with the statistics of each database server in the cluster of database servers. Upon the occurrence of various conditions, corrective actions can be implemented. The conditions can include the inability to achieve performance thresholds. The conditions also can include not meeting the performance of other database servers in the cluster. The corrective action can include removing a server temporarily from the cluster or rebooting the server. In addition, a database server can cause the corrective action on other database servers in the cluster. Other embodiments also are disclosed.
-
Citations
10 Claims
-
1. A system comprising:
-
a plurality of database servers, each database server in the plurality of database servers hosting shards of a database, each shard of the shards of the database having been split from a partition of the database and each partition of the database having been split from the database, each database server in the plurality of database servers having a unique identifier such that a status of each database server in the plurality of database servers can be accessed by other servers in the plurality of database servers, wherein each database server in the plurality of database servers is configured to; receive a triggering action comprising; receiving an indication that a minimum timer has expired; and receiving a pre-determined number of queries; detect a suspicious observation; discover that a particular server is underperforming; compile a plurality of statistics regarding itself, wherein the plurality of statistics is chosen from one of the following;
memory usage, disk activity levels, CPU load, and error rates; andstore the plurality of statistics in a data store accessible by; (1) each database server in the plurality of database servers; and (2) a load balancer; and the load balancer configured to; allocate queries among the plurality of database servers using load balancing techniques; determine when a condition has occurred by; accessing the plurality of statistics in the data store; and determining that a malfunctioning database server of the plurality of database servers is malfunctioning, comprising determining when one or more of the plurality of statistics stored in the data store by the malfunctioning database server does not meet performance thresholds; initiate an automatic self-corrective action in a database server in the plurality of database servers, the automatic self-corrective action comprising the database server taking itself out of a rotation for a predetermined amount of time configured to allow the database server to catch up; and perform a corrective action on the malfunctioning database server comprising; determining that the malfunctioning database server cannot correct itself; writing an entry in the data store indicating that the malfunctioning database server is not available; causing the malfunctioning database server to no longer receive instructions; and forwarding shard-level queries originally directed to the malfunctioning database server to one or more other database servers of the plurality of database servers. - View Dependent Claims (2, 3)
-
-
4. A method being implemented via execution of computing instructions configured to run at one or more processors and configured to be stored at non-transitory computer-readable media, the method comprising:
-
in a plurality of database servers, each database server in the plurality of database servers hosting shards of a database, each shard of the shards of the database having been split from a partition of the database and each partition of the database having been split from the database, each database server in the plurality of database servers having a unique identifier such that a status of each database server in the plurality of database servers can be accessed by other servers in the plurality of database servers, performing acts of; receiving a triggering action comprising; receiving an indication that a minimum timer has expired; and receiving a pre-determined number of queries; detecting a suspicious observation; discovering that a particular server is underperforming; compiling a plurality of statistics regarding itself, wherein the plurality of statistics is chosen from one of the following;
memory usage, disk activity levels, CPU load, and error rates; andstoring the plurality of statistics in a data store accessible by; (1) each database server in the plurality of database servers; and (2) a load balancer; and in the load balancer configured to allocate queries among the plurality of database servers using load balancing techniques, performing acts of; determining when a condition has occurred by; accessing the plurality of statistics in the data store; and determining that a malfunctioning database server of the plurality of database servers is malfunctioning comprising determining when one or more of the plurality of statistics stored in the data store by the malfunctioning database server does not meet performance thresholds; initiating an automatic self-corrective action in a database server in the plurality of database servers, the automatic self-corrective action comprising the database server taking itself out of a rotation for a predetermined amount of time configured to allow the database server to catch up; and performing a corrective action on the database server comprising; determining that the malfunctioning database server cannot correct itself; writing an entry in the data store indicating that the malfunctioning database server is not available; causing the malfunctioning database server to no longer receive instructions; and forwarding shard-level queries originally directed to the malfunctioning database server to one or more other database servers of the plurality of database servers. - View Dependent Claims (5, 6)
-
-
7. A method comprising:
-
sending a first incoming instruction to a database server selected from a first plurality of database servers or a second plurality of database servers, using load balancing techniques; retrieving server information for each database server in a cluster of database servers; processing the first incoming instruction to extract a first query in a database server belonging to a first server set and selected from the first plurality of database servers; sending the first query from the database server belonging to the first server set and selected from the first plurality of database servers to a database server belonging to the first server set and selected from the second plurality of database servers; sending the first query from the database server belonging to the first server set and selected from the first plurality of database servers to a database server belonging to the first server set and selected from a third plurality of database servers; executing the first query in the database server belonging to the first server set and selected from the first plurality of database servers, the database server belonging to the first server set and selected from the second plurality of database servers, and the database server belonging to the first server set and selected from the third plurality of database servers; sending a second incoming instruction to a database server selected from the first plurality of database servers or the second plurality of database servers, using the load balancing techniques; processing the second incoming instruction to extract a second query in a database server belonging to a second server set and selected from the first plurality of database servers; sending the second query from the database server belonging to the second server set and selected from the first plurality of database servers to a database server belonging to the second server set and selected from the second plurality of database servers; and executing the second query in the database server belonging to the second server set and selected from the first plurality of database servers, and the database server belonging to the second server set and selected from the second plurality of database servers, wherein; each database server in the first plurality of database servers hosts a copy of a first shard of a database; each database server in the second plurality of database servers hosts a copy of a second shard of the database; each database server in the first plurality of database servers belongs to either the first server set or the second server set; each database server in the second plurality of database servers belongs to either the first server set or the second server set; and each database server in the first plurality of database servers and each database server in the second plurality of database servers is configured to; receive a triggering action; compile a plurality of statistics regarding itself; store the plurality of statistics in a data store accessible by the first plurality of database servers and the second plurality of database servers; determine when a condition has occurred; and perform a corrective action on the database server. - View Dependent Claims (8)
-
-
9. A method being implemented via execution of computing instructions configured to run at one or more processors and configured to be stored at non-transitory computer-readable media, the method comprising:
in a distinct database server of a plurality of database servers, each database server in the plurality of database servers hosting shards of a database, each shard of the shards of the database having been split from a partition of the database, and each partition having been split from the database, each database server in the plurality of database servers having a unique identifier such that a status of each database server in the plurality of database servers can be accessed by other servers in the plurality of database servers, performing acts of; receiving a triggering action comprising; receiving an indication that a minimum timer has expired; and receiving a pre-determined number of queries; detecting a suspicious observation; discovering that a particular server is underperforming; compiling a plurality of statistics regarding itself, wherein the plurality of statistics is chosen from one of the following;
memory usage, disk activity levels, CPU load, and error rates;storing the plurality of statistics in a data store accessible by each database server in the plurality of database servers and a load balancer, wherein the load balancer is configured to; allocate queries among the plurality of database servers using load balancing techniques; determine when a condition has occurred by; accessing the plurality of statistics in the data store; and determining that a malfunctioning database server of the plurality of database servers is malfunctioning, comprising determining when one or more of the plurality of statistics stored in the data store by the malfunctioning database server does not meet performance thresholds; initiate an automatic self-corrective action in a database server in the plurality of database servers, the automatic self-corrective action comprising the database server taking itself out of a rotation for a predetermined amount of time configured to allow the database server to catch up; perform a corrective action on the malfunctioning database server comprising; determining that the malfunctioning database server cannot correct itself; writing an entry in the data store indicating that the malfunctioning database server is not available; causing the malfunctioning database server to no longer receive instructions; and forwarding shard-level queries originally directed to the malfunctioning database server to one or more other database servers of the plurality of database servers; determining when a condition has occurred by; accessing the plurality of statistics in the data store; and determining that the malfunctioning database server of the plurality of database servers is malfunctioning; and mitigating loss of the malfunctioning database server. - View Dependent Claims (10)
Specification