METHOD AND SYSTEM FOR PROVIDING HIGH AVAILABILITY TO DISTRIBUTED COMPUTER APPLICATIONS

US 20070260733A1
Filed: 04/27/2007
Published: 11/08/2007
Est. Priority Date: 05/02/2006
Status: Active Grant

Abstract

Method, system, apparatus and/or computer program for achieving transparent integration of high-availability services for distributed application programs. Loss-less migration of sub-programs from their respective primary nodes to backup nodes is performed transparently to a client which is connected to the primary node. Migration is performed by high-availability services which are configured for injecting registration codes, registering distributed applications, detecting execution failures, executing from backup nodes in response to failure, and other services. High-availability application services can be utilized by distributed applications having any desired number of sub-programs without the need of modifying or recompiling the application program and without the need of a custom loader. In one example embodiment, a transport driver is responsible for receiving messages, halting and flushing of messages, and for issuing messages directing sub-programs to continue after checkpointing.

42 Claims

1. A method of achieving transparent integration of a distributed application program with a high availability protection program, comprising:
- injecting registration code, transparently and automatically, into all sub-programs during launch, without the need of modifying or recompiling the application program and without the need of a custom loader;
  
  registering the distributed application automatically with a high-availability protection program;
  
  detecting a failure in the execution of the distributed application program by said high-availability protection program; and
  
  executing the distributed application, subject to the detected failure, with one or more sub-programs being executed from their respective backup nodes automatically by said high-availability protection program in response to the failure.
- - 2. A method as recited in claim 1:
    - wherein said high-availability protection program is configured as an extension of an operating system;
      
      wherein recovery of application programs by said high-availability protection program is performed without the necessity of modifying programming within said application programs.
  - 3. A method as recited in claim 1, wherein said high-availability protection program is configured for protecting against node faults, network faults, and process faults.
  - 4. A method as recited in claim 1, wherein said high-availability protection program is configured to automatically coordinate transparent recovery of distributed applications.
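
The steps of claim 1 (register at launch, detect a failure, re-execute from a backup node) can be illustrated with a minimal single-process sketch. Everything here is an assumption made for illustration: `SubProgram`, `ProtectionProgram`, `launch`, and the node names are hypothetical, nodes are plain strings, and a failure is simulated by flipping a flag. The patent's actual injection mechanism operates at launch without touching application code and is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SubProgram:
    name: str
    primary: str          # node the sub-program starts on
    backup: str           # node it is re-executed from after a failure
    node: str = ""        # node it is currently running on
    alive: bool = False


@dataclass
class ProtectionProgram:
    """Hypothetical stand-in for the high-availability protection program."""
    registry: Dict[str, SubProgram] = field(default_factory=dict)

    def register(self, sub: SubProgram) -> None:
        self.registry[sub.name] = sub

    def detect_failures(self) -> List[SubProgram]:
        return [s for s in self.registry.values() if not s.alive]

    def recover(self) -> List[str]:
        """Re-execute every failed sub-program from its backup node."""
        recovered = []
        for sub in self.detect_failures():
            sub.node = sub.backup
            sub.alive = True
            recovered.append(sub.name)
        return recovered


def launch(sub: SubProgram, ha: ProtectionProgram) -> None:
    """Launch wrapper: registration happens here, not in application code."""
    sub.node, sub.alive = sub.primary, True
    ha.register(sub)


ha = ProtectionProgram()
web = SubProgram("web", primary="node-a", backup="node-b")
db = SubProgram("db", primary="node-c", backup="node-d")
for s in (web, db):
    launch(s, ha)

db.alive = False              # simulate a process fault on node-c
recovered = ha.recover()
```

In this sketch only the failed sub-program moves; `web` keeps running on its primary node, which mirrors the claim language "one or more sub-programs being executed from their respective backup nodes".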

5. A method of performing loss-less migration of a distributed application, comprising:
- migrating one or more sub-programs within an application, without loss, from their respective primary nodes to at least one backup node;
  
  maintaining transparency to a client connected to the primary node over a transport connection;
  
  flushing and halting said transport connection during the taking of checkpoints; and
  
  restoring said one or more sub-programs from said checkpoints in response to initiating recovery of the application.
- - 6. A method as recited in claim 5:
    - wherein said flushing and halting of the transport connection is performed in response to execution of a transport control layer interposed between the distributed application and the transport connection; and
      
      wherein the transport connection itself is not responsible for flushing and halting of transport traffic.
  - 7. A method as recited in claim 5, wherein said transparency is maintained by a high-availability protection program configured to automatically coordinate transparent recovery of distributed applications.
  - 8. A method as recited in claim 5, wherein said transparency is maintained by a high-availability protection program to said one or more sub-programs running on a primary node while at least one backup node stands ready in the event of a fault and subsequent recovery.
  - 9. A method as recited in claim 5:
    - wherein programming of said high-availability protection program is separate from application programming; and
      
      wherein application programming is configured for high-availability in response to establishing settings for the high-availability protection program, and without the need for program changes or recompilation of the associated applications.
  - 10. A method as recited in claim 5, further comprising coordinating execution of individual sub-programs within a coordinator program.
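
Claims 5 and 6 describe a transport control layer, interposed between the application and the transport connection, that halts and flushes traffic while a checkpoint is taken. The following is a rough sketch of that idea under simplifying assumptions: `Connection` and `TransportControlLayer` are hypothetical names, messages are strings, and everything runs in one thread. It is not the patent's transport driver.

```python
from collections import deque
from typing import Deque, List


class Connection:
    """Hypothetical raw transport: it only records what was delivered."""
    def __init__(self) -> None:
        self.delivered: List[str] = []

    def send(self, msg: str) -> None:
        self.delivered.append(msg)


class TransportControlLayer:
    """Sits between the application and the connection (claim 6)."""
    def __init__(self, conn: Connection) -> None:
        self.conn = conn
        self.in_flight: Deque[str] = deque()   # accepted, not yet delivered
        self.held: Deque[str] = deque()        # queued while halted
        self.halted = False

    def send(self, msg: str) -> None:
        (self.held if self.halted else self.in_flight).append(msg)

    def flush(self) -> None:
        while self.in_flight:
            self.conn.send(self.in_flight.popleft())

    def halt_and_flush(self) -> None:
        self.halted = True    # halt: new sends are held back
        self.flush()          # flush: drain everything already accepted

    def resume(self) -> None:
        self.halted = False
        self.in_flight.extend(self.held)
        self.held.clear()


conn = Connection()
layer = TransportControlLayer(conn)

layer.send("m1")
layer.halt_and_flush()
checkpoint = {"delivered": list(conn.delivered)}  # taken while traffic is quiet
layer.send("m2")              # arrives during the checkpoint, so it is held
layer.resume()
layer.flush()
```

Because the layer, not the connection, does the holding and draining, the checkpoint sees a quiescent connection and no message is lost: `m2` is delivered once the layer resumes.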

11. A method of fault protection for applications distributed across multiple computer nodes, comprising:
- providing high-availability application services for transparently loading applications, registering applications for protection, detecting faults in applications, and initiating recovery of applications;
  
  taking checkpoints, by said high-availability application services, of one or more sub-programs within applications executing across multiple computer nodes;
  
  restoring said one or more sub-programs from said checkpoints in response to initiating recovery of one or more said applications by said high-availability application services;
  
  wherein said high-availability application services are provided to said one or more sub-programs running on a primary node, while at least one backup node stands ready in the event of a fault and subsequent recovery; and
  
  coordinating execution of individual sub-programs within a coordinator program which is executed on a node accessible to the multiple computer nodes.
- - 12. A method as recited in claim 11, wherein said high-availability application services are configured as an extension of an operating system executing on said multiple computer nodes.
  - 13. A method as recited in claim 11, wherein said high-availability application services comprise services for supporting stateless applications, stateful applications, multi-tier enterprise applications, and large distributed applications.
  - 14. A method as recited in claim 11:
    - wherein programming for said high-availability application services are separate from application programming;
      
      wherein application programming can be configured for high-availability access and execution in response to settings established for said high-availability application services, and without the need for program changes or recompilation of the associated applications.
  - 15. A method as recited in claim 11, wherein said primary node and backup node are different nodes in the case of non-local recovery, or the same node in the case of local recovery.
  - 16. A method as recited in claim 11, wherein said high-availability application services are configured for protecting against node faults, network faults, and process faults.
  - 17. A method as recited in claim 11, wherein said restoring of one or more sub-programs is performed as either a stateless or stateful recovery of each sub-program within a distributed application and ensuring that each said sub-program is automatically recovered in a consistent state, without necessitating application or sub-program involvement.
  - 18. A method as recited in claim 17, wherein said stateful high availability for distributed applications is provided by said high-availability application services on a number of system types, comprising:
    - high performance computing, financial modeling, enterprise applications, web servers, application servers, databases, Voice Over IP (VOIP), Session Initiation Protocol (SIP), streaming media, Service Oriented Architectures (SOA).
  - 19. A method as recited in claim 11, wherein said high-availability application services have configurable protection levels.
  - 20. A method as recited in claim 11, wherein said high-availability application services provide coordinated restart and stateful restore for distributed applications.
  - 21. A method as recited in claim 11, wherein said high-availability application services provide coordinated and transparent checkpointing of distributed applications.
  - 22. A method as recited in claim 11, wherein said high-availability application services provide coordinated full and incremental checkpointing for distributed applications.
  - 23. A method as recited in claim 11, wherein said high-availability application services provide checkpointing to a local disk, a shared disk, or to a memory.
  - 24. A method as recited in claim 11, wherein said high-availability application services provide distributed application deadlock and hang protection through external health checks.
  - 25. A method as recited in claim 11, wherein said high-availability application services provide coordinated automatic and transparent recovery of distributed applications.
  - 26. A method as recited in claim 11, wherein said high-availability application services provide automatic startup of distributed applications.
  - 27. A method as recited in claim 11, wherein said high-availability application services provide script support for starting, stopping and restarting sub-programs.
  - 28. A method as recited in claim 11, wherein said high-availability application services are configured to respond to dynamic policy updates.
  - 29. A method as recited in claim 11, further comprising flushing and halting said transport connection when taking checkpoints.
  - 30. A method as recited in claim 29:
    - wherein said flushing and halting of the transport connection is performed in response to execution of a transport control layer interposed between the distributed application and the transport connection; and
      
      wherein the transport connection itself is not responsible for flushing and halting of transport traffic.
  - 31. A method as recited in claim 11, wherein said high-availability application services comprise:
    - injecting registration code, transparently and automatically, into all sub-programs during launch, without the need of modifying or recompiling the application program and without the need of a custom loader;
      
      registering a distributed application automatically within said high-availability application services; and
      
      detecting a failure in the execution of the distributed application program.
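
Claim 24 mentions deadlock and hang protection through external health checks. One simple way to picture such a check, offered only as an illustrative assumption, is a monitor that reads a progress counter for each sub-program from the outside and flags any sub-program whose counter has not advanced between two checks. `HangDetector` and the probe callables are hypothetical names, not taken from the patent.

```python
from typing import Callable, Dict, List


class HangDetector:
    """Flags sub-programs whose externally observed progress has stalled."""
    def __init__(self, probes: Dict[str, Callable[[], int]]) -> None:
        self.probes = probes                  # name -> progress counter reader
        self.previous: Dict[str, int] = {}    # counter seen at the last check

    def check(self) -> List[str]:
        stalled = []
        for name, probe in self.probes.items():
            current = probe()
            # A sub-program is suspect if it was seen before and made no progress.
            if name in self.previous and current <= self.previous[name]:
                stalled.append(name)
            self.previous[name] = current
        return stalled


counters = {"web": 0, "db": 0}
detector = HangDetector({n: (lambda n=n: counters[n]) for n in counters})

first = detector.check()      # baseline pass, nothing to compare against yet
counters["web"] += 3          # web makes progress, db does not
second = detector.check()
```

The sub-programs themselves take no part in the decision, which is the sense in which the check is external; a real monitor would also need a policy for how long a stall must last before recovery is initiated.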

32. A computer executable program for loss-less migration of a distributed application program, comprising:
- a high-availability services module configured for execution in conjunction with an operating system upon which at least one application can be executed on one or more computer nodes of a distributed system; and
  
  programming within said high-availability services module executable on said computer nodes for loss-less migration of sub-programs within said at least one application for,
    - checkpointing of all states in the transport connection,
    - coordinating checkpointing of the state of the transport connection across the distributed system,
    - restoring all states in the transport connection to the state they were in at the last checkpoint,
    - coordinating recovery within a restore procedure that is coupled to the transport connection.
- - 33. A computer executable program as recited in claim 32, further comprising flushing and halting said transport connection during checkpointing.
  - 34. A computer executable program as recited in claim 32, wherein said high-availability services module is configured as an extension of an operating system for one or more computer nodes.
  - 35. A computer executable program as recited in claim 34:
    - wherein programming of said high-availability services module is separate from application programming; and
      
      wherein said application programming can be configured for high-availability execution in response to settings established for the high-availability services, and without the need for program changes or recompilation of the associated applications.
  - 36. A computer executable program as recited in claim 32:
    - wherein said high-availability services module is configured for protecting against node faults, network faults, and process faults.

37. A system of multiple computer nodes over which distributed applications are protected against faults, comprising:
- a plurality of computer nodes upon which applications can be executed;
  
  an operating system configured for execution on each said computer node and upon which said applications are executed;
  
  a high-availability services module configured for protecting said applications from faults, and for executing in combination with said operating system; and
  
  programming within said high-availability services module configured for execution on each said computer node for,
    - providing transparent application functions for loading applications, registering applications for protection, detecting faults in applications, and initiating recovery of applications,
    - checkpointing of one or more sub-programs to create checkpoints for the application executing on at least one said computer node,
    - restoring said one or more sub-programs from said checkpoints during said initiating of recovery of the application,
    - executing said one or more sub-programs on a primary node while at least one backup node stands ready for executing the sub-programs in the event of a fault and subsequent recovery, and
    - coordinating execution of individual sub-programs within a coordinator program which runs on a node accessible to said plurality of computer nodes.
- - 38. A system as recited in claim 37, further comprising flushing and halting a transport connection during checkpointing.
  - 39. A system as recited in claim 38, wherein a transport driver is responsible for receiving messages, as well as said halting and flushing of messages, and for issuing messages directing sub-programs to continue after checkpointing.
  - 40. A system as recited in claim 37:
    - wherein said high-availability services module is configured as an extension of the operating system;
      
      wherein programming for said high-availability services module is separate from application programming; and
      
      wherein said application programming can be configured for high-availability execution in response to settings established for said high-availability application services, and without the need for program changes, or recompilation, of associated applications.
  - 41. A system as recited in claim 37, wherein said high-availability services module is configured for protecting against node faults, network faults, and process faults.
  - 42. A system as recited in claim 37, wherein said checkpointing comprises:
    - creating at least one incremental checkpoint;
      
      creating at least one full checkpoint; and
      
      automatically and asynchronously merging at least one said incremental checkpoint and at least one said full checkpoint at a checkpoint storage location.
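
Claim 42 combines full and incremental checkpoints by merging them at the checkpoint storage location. A minimal sketch of such a merge follows, assuming (purely for illustration) that a checkpoint is a mapping from page identifiers to page contents and that an incremental checkpoint records only changed pages, with `None` marking a page that was freed. The representation and the `merge` function are hypothetical, and the merge is shown synchronously although the claim describes it as asynchronous.

```python
from typing import Dict, Iterable, Optional

Checkpoint = Dict[str, Optional[bytes]]  # page id -> contents; None = page freed


def merge(full: Checkpoint, incrementals: Iterable[Checkpoint]) -> Checkpoint:
    """Fold incremental checkpoints (oldest first) into a copy of the full one."""
    merged = dict(full)
    for inc in incrementals:
        for page, data in inc.items():
            if data is None:
                merged.pop(page, None)   # page no longer exists
            else:
                merged[page] = data      # newer contents replace older ones
    return merged


full = {"p0": b"A", "p1": b"B", "p2": b"C"}
inc1 = {"p1": b"B2"}                     # p1 changed after the full checkpoint
inc2 = {"p2": None, "p3": b"D"}          # p2 freed, p3 newly allocated
merged = merge(full, [inc1, inc2])
```

After the merge, restoring needs only the single merged checkpoint rather than the full checkpoint plus a chain of increments, which is the practical reason to merge in the background.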

Current Assignee: Red Hat, Inc. (International Business Machines Corporation)
Original Assignee: Availigent, Inc.
Inventors: Havemose, Allan; Ngan, Ching-Yuk

Granted Patent: US 7,681,075 B2
US Class Current: 709/226
CPC Class Codes

G06F 11/1402   Saving, restoring, recoveri...

G06F 11/1451   by selection of backup cont...

G06F 11/1464   for networked environments

G06F 11/1469   Backup restoration techniques

G06F 11/1482   by means of middleware or O...

G06F 11/2028   eliminating a faulty proces...

G06F 11/2035   without idle spare hardware

G06F 2201/805   Real-time

G06F 2201/815   Virtual

G06F 2201/84   Using snapshots, i.e. a log...
