System and method for self-healing a database server in a cluster

US 10,169,138 B2
Filed: 01/29/2016
Issued: 01/01/2019
Est. Priority Date: 09/22/2015
Status: Active Grant

First Claim

Patent Images

1. A system comprising:

a plurality of database servers, each database server in the plurality of database servers hosting shards of a database, each shard of the shards of the database having been split from a partition of the database and each partition of the database having been split from the database, each database server in the plurality of database servers having a unique identifier such that a status of each database server in the plurality of database servers can be accessed by other servers in the plurality of database servers, wherein each database server in the plurality of database servers is configured to;

receive a triggering action comprising;

receiving an indication that a minimum timer has expired; and

receiving a pre-determined number of queries;

detect a suspicious observation;

discover that a particular server is underperforming;

compile a plurality of statistics regarding itself, wherein the plurality of statistics is chosen from one of the following;

memory usage, disk activity levels, CPU load, and error rates; and

store the plurality of statistics in a data store accessible by;

(1) each database server in the plurality of database servers; and

(2) a load balancer; and

the load balancer configured to;

allocate queries among the plurality of database servers using load balancing techniques;

determine when a condition has occurred by;

accessing the plurality of statistics in the data store; and

determining that a malfunctioning database server of the plurality of database servers is malfunctioning, comprising determining when one or more of the plurality of statistics stored in the data store by the malfunctioning database server does not meet performance thresholds;

initiate an automatic self-corrective action in a database server in the plurality of database servers, the automatic self-corrective action comprising the database server taking itself out of a rotation for a predetermined amount of time configured to allow the database server to catch up; and

perform a corrective action on the malfunctioning database server comprising;

determining that the malfunctioning database server cannot correct itself;

writing an entry in the data store indicating that the malfunctioning database server is not available;

causing the malfunctioning database server to no longer receive instructions; and

forwarding shard-level queries originally directed to the malfunctioning database server to one or more other database servers of the plurality of database servers.

View all claims

3 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A system and method for implementing a database system is presented. A database cluster can comprise multiple database servers. Each database server is configured to regularly compile various statistics upon the occurrence of a triggering event. These statistics can be stored along with the statistics of each database server in the cluster of database servers. Upon the occurrence of various conditions, corrective actions can be implemented. The conditions can include the inability to achieve performance thresholds. The conditions also can include not meeting the performance of other database servers in the cluster. The corrective action can include removing a server temporarily from the cluster or rebooting the server. In addition, a database server can cause the corrective action on other database servers in the cluster. Other embodiments also are disclosed.

Citations

10 Claims

1. A system comprising:
- a plurality of database servers, each database server in the plurality of database servers hosting shards of a database, each shard of the shards of the database having been split from a partition of the database and each partition of the database having been split from the database, each database server in the plurality of database servers having a unique identifier such that a status of each database server in the plurality of database servers can be accessed by other servers in the plurality of database servers, wherein each database server in the plurality of database servers is configured to;
  
  receive a triggering action comprising;
  
  receiving an indication that a minimum timer has expired; and
  
  receiving a pre-determined number of queries;
  
  detect a suspicious observation;
  
  discover that a particular server is underperforming;
  
  compile a plurality of statistics regarding itself, wherein the plurality of statistics is chosen from one of the following;
  
  memory usage, disk activity levels, CPU load, and error rates; and
  
  store the plurality of statistics in a data store accessible by;
  
  (1) each database server in the plurality of database servers; and
  
  (2) a load balancer; and
  
  the load balancer configured to;
  
  allocate queries among the plurality of database servers using load balancing techniques;
  
  determine when a condition has occurred by;
  
  accessing the plurality of statistics in the data store; and
  
  determining that a malfunctioning database server of the plurality of database servers is malfunctioning, comprising determining when one or more of the plurality of statistics stored in the data store by the malfunctioning database server does not meet performance thresholds;
  
  initiate an automatic self-corrective action in a database server in the plurality of database servers, the automatic self-corrective action comprising the database server taking itself out of a rotation for a predetermined amount of time configured to allow the database server to catch up; and
  
  perform a corrective action on the malfunctioning database server comprising;
  
  determining that the malfunctioning database server cannot correct itself;
  
  writing an entry in the data store indicating that the malfunctioning database server is not available;
  
  causing the malfunctioning database server to no longer receive instructions; and
  
  forwarding shard-level queries originally directed to the malfunctioning database server to one or more other database servers of the plurality of database servers.
- View Dependent Claims (2, 3)
- - 2. The system of claim 1, wherein:
    - determining that the malfunctioning database server of the plurality of database servers is malfunctioning comprises;
      
      comparing one or more of the plurality of statistics stored in the data store by the malfunctioning database server to an average of all of the plurality of statistics stored in the data store.
  - 3. The system of claim 1, wherein:
    - performing the corrective action comprises restarting the malfunctioning database server.

4. A method being implemented via execution of computing instructions configured to run at one or more processors and configured to be stored at non-transitory computer-readable media, the method comprising:
- in a plurality of database servers, each database server in the plurality of database servers hosting shards of a database, each shard of the shards of the database having been split from a partition of the database and each partition of the database having been split from the database, each database server in the plurality of database servers having a unique identifier such that a status of each database server in the plurality of database servers can be accessed by other servers in the plurality of database servers, performing acts of;
  
  receiving a triggering action comprising;
  
  receiving an indication that a minimum timer has expired; and
  
  receiving a pre-determined number of queries;
  
  detecting a suspicious observation;
  
  discovering that a particular server is underperforming;
  
  compiling a plurality of statistics regarding itself, wherein the plurality of statistics is chosen from one of the following;
  
  memory usage, disk activity levels, CPU load, and error rates; and
  
  storing the plurality of statistics in a data store accessible by;
  
  (1) each database server in the plurality of database servers; and
  
  (2) a load balancer; and
  
  in the load balancer configured to allocate queries among the plurality of database servers using load balancing techniques, performing acts of;
  
  determining when a condition has occurred by;
  
  accessing the plurality of statistics in the data store; and
  
  determining that a malfunctioning database server of the plurality of database servers is malfunctioning comprising determining when one or more of the plurality of statistics stored in the data store by the malfunctioning database server does not meet performance thresholds;
  
  initiating an automatic self-corrective action in a database server in the plurality of database servers, the automatic self-corrective action comprising the database server taking itself out of a rotation for a predetermined amount of time configured to allow the database server to catch up; and
  
  performing a corrective action on the database server comprising;
  
  determining that the malfunctioning database server cannot correct itself;
  
  writing an entry in the data store indicating that the malfunctioning database server is not available;
  
  causing the malfunctioning database server to no longer receive instructions; and
  
  forwarding shard-level queries originally directed to the malfunctioning database server to one or more other database servers of the plurality of database servers.
- View Dependent Claims (5, 6)
- - 5. The method of claim 4, wherein:
    - determining that the malfunctioning database server of the plurality of database servers is malfunctioning comprises;
      
      comparing one or more of the plurality of statistics stored in the data store by the malfunctioning database server to an average of all of the plurality of statistics stored in the data store.
  - 6. The method of claim 4, wherein:
    - performing the corrective action comprises restarting the malfunctioning database server.

7. A method comprising:
- sending a first incoming instruction to a database server selected from a first plurality of database servers or a second plurality of database servers, using load balancing techniques;
  
  retrieving server information for each database server in a cluster of database servers;
  
  processing the first incoming instruction to extract a first query in a database server belonging to a first server set and selected from the first plurality of database servers;
  
  sending the first query from the database server belonging to the first server set and selected from the first plurality of database servers to a database server belonging to the first server set and selected from the second plurality of database servers;
  
  sending the first query from the database server belonging to the first server set and selected from the first plurality of database servers to a database server belonging to the first server set and selected from a third plurality of database servers;
  
  executing the first query in the database server belonging to the first server set and selected from the first plurality of database servers, the database server belonging to the first server set and selected from the second plurality of database servers, and the database server belonging to the first server set and selected from the third plurality of database servers;
  
  sending a second incoming instruction to a database server selected from the first plurality of database servers or the second plurality of database servers, using the load balancing techniques;
  
  processing the second incoming instruction to extract a second query in a database server belonging to a second server set and selected from the first plurality of database servers;
  
  sending the second query from the database server belonging to the second server set and selected from the first plurality of database servers to a database server belonging to the second server set and selected from the second plurality of database servers; and
  
  executing the second query in the database server belonging to the second server set and selected from the first plurality of database servers, and the database server belonging to the second server set and selected from the second plurality of database servers, wherein;
  
  each database server in the first plurality of database servers hosts a copy of a first shard of a database;
  
  each database server in the second plurality of database servers hosts a copy of a second shard of the database;
  
  each database server in the first plurality of database servers belongs to either the first server set or the second server set;
  
  each database server in the second plurality of database servers belongs to either the first server set or the second server set; and
  
  each database server in the first plurality of database servers and each database server in the second plurality of database servers is configured to;
  
  receive a triggering action;
  
  compile a plurality of statistics regarding itself;
  
  store the plurality of statistics in a data store accessible by the first plurality of database servers and the second plurality of database servers;
  
  determine when a condition has occurred; and
  
  perform a corrective action on the database server.
- View Dependent Claims (8)
- - 8. The method of claim 7 further comprising:
    - receiving a first shard query result from the database server belonging to the first server set and selected from the first plurality of database servers;
      
      receiving a second shard query result from the database server belonging to the first server set and selected from the second plurality of database servers;
      
      aggregating the first shard query result with the second shard query result to form an aggregated query result; and
      
      presenting the aggregated query result to a requestor.

9. A method being implemented via execution of computing instructions configured to run at one or more processors and configured to be stored at non-transitory computer-readable media, the method comprising:
- in a distinct database server of a plurality of database servers, each database server in the plurality of database servers hosting shards of a database, each shard of the shards of the database having been split from a partition of the database, and each partition having been split from the database, each database server in the plurality of database servers having a unique identifier such that a status of each database server in the plurality of database servers can be accessed by other servers in the plurality of database servers, performing acts of;
  
  receiving a triggering action comprising;
  
  receiving an indication that a minimum timer has expired; and
  
  receiving a pre-determined number of queries;
  
  detecting a suspicious observation;
  
  discovering that a particular server is underperforming;
  
  compiling a plurality of statistics regarding itself, wherein the plurality of statistics is chosen from one of the following;
  
  memory usage, disk activity levels, CPU load, and error rates;
  
  storing the plurality of statistics in a data store accessible by each database server in the plurality of database servers and a load balancer, wherein the load balancer is configured to;
  
  allocate queries among the plurality of database servers using load balancing techniques;
  
  determine when a condition has occurred by;
  
  accessing the plurality of statistics in the data store; and
  
  determining that a malfunctioning database server of the plurality of database servers is malfunctioning, comprising determining when one or more of the plurality of statistics stored in the data store by the malfunctioning database server does not meet performance thresholds;
  
  initiate an automatic self-corrective action in a database server in the plurality of database servers, the automatic self-corrective action comprising the database server taking itself out of a rotation for a predetermined amount of time configured to allow the database server to catch up;
  
  perform a corrective action on the malfunctioning database server comprising;
  
  determining that the malfunctioning database server cannot correct itself;
  
  writing an entry in the data store indicating that the malfunctioning database server is not available;
  
  causing the malfunctioning database server to no longer receive instructions; and
  
  forwarding shard-level queries originally directed to the malfunctioning database server to one or more other database servers of the plurality of database servers;
  
  determining when a condition has occurred by;
  
  accessing the plurality of statistics in the data store; and
  
  determining that the malfunctioning database server of the plurality of database servers is malfunctioning; and
  
  mitigating loss of the malfunctioning database server.
- View Dependent Claims (10)
- - 10. The method of claim 9, wherein:
    - mitigating the loss of the malfunctioning database server comprises sending instructions causing the malfunctioning database server to restart.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Walmart Apollo, LLC (WalMart Inc.)
Original Assignee
Walmart Apollo, LLC (WalMart Inc.)
Inventors
Guney, Ergin, Zheng, Yan
Primary Examiner(s)
Uddin, Md I

Application Number

US15/010,502
Publication Number

US 20170083397A1
Time in Patent Office

1,068 Days
Field of Search

714 471, 714 472, 714 48, 707708, 707999103, 707999202
US Class Current
CPC Class Codes

G06F 11/0709   in a distributed system con...

G06F 11/0751   Error or fault detection no...

G06F 11/076   by exceeding a count or rat...

G06F 11/0787   Storage of error reports, e...

G06F 11/0793   Remedial or corrective acti...

G06F 11/1469   Backup restoration techniques

G06F 2201/805   Real-time

System and method for self-healing a database server in a cluster

First Claim

3 Assignments

0 Petitions

Accused Products

Abstract

Citations

10 Claims

Specification

Solutions

Use Cases

Quick Links

System and method for self-healing a database server in a cluster

First Claim

3 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

10 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links