System and method for topology-aware job scheduling and backfilling in an HPC environment

US 9,189,278 B2
Filed: 10/11/2013
Issued: 11/17/2015
Est. Priority Date: 04/15/2004
Status: Active Grant


Abstract

A method for job management in an HPC environment includes determining an unallocated subset from a plurality of HPC nodes, with each of the unallocated HPC nodes comprising an integrated fabric. An HPC job is selected from a job queue and executed using at least a portion of the unallocated subset of nodes.
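
As a rough illustration of the abstract, the Python sketch below picks a job from a queue and assigns it part of the unallocated node set. The `Job` class, the `select_job` function, the node names, and the first-fit policy are all hypothetical; the abstract does not say how the job is chosen.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes_needed: int


def select_job(queue, unallocated):
    """Return (job, nodes) for the first queued job that fits, else None.

    Skipping a job that is too large and taking a later, smaller one is a
    simple form of backfilling; this policy is an assumption of the sketch.
    """
    free = sorted(unallocated)
    for index, job in enumerate(queue):
        if job.nodes_needed <= len(free):
            del queue[index]                     # the job leaves the queue
            return job, free[:job.nodes_needed]  # uses only part of the free set
    return None


queue = [Job("large", 8), Job("small", 2)]
picked = select_job(queue, {"n3", "n1", "n2"})
```

Here the eight-node job cannot run on three free nodes, so the two-node job is selected and the larger job stays queued.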

18 Claims

1. A method comprising:
- determining, using one or more computers, a failure of a node included in a virtual cluster of a plurality of communicatively coupled nodes of a computing environment, the virtual cluster comprising a logical grouping of nodes configured to process a job having a plurality of tasks;
  
  removing the failed node from the virtual cluster;
  
  determining whether the job is associated with the failed node;
  
  responsive to a determination that the job is associated with the failed node, determining other operational nodes in the virtual cluster associated with the job;
  
  terminating the plurality of tasks of the job on each of the other operational nodes in the virtual cluster currently executing the plurality of tasks of the job;
  
  deallocating the other operational nodes associated with the job;
  
  determining an optimum subset of nodes other than the deallocated nodes and the failed node from the virtual cluster to re-execute the job, wherein the optimum subset of nodes is determined according to a specific topology that allows cooperating tasks of the job to communicate with any other tasks by minimizing distance between any two nodes;
  
  allocating the optimum subset of nodes; and
  
  re-executing the job on the optimum subset of nodes.
- - 2. The method of claim 1, wherein determining the optimum subset of nodes further comprises:
    - retrieving one or more administrative policies for the job, the one or more policies storing processing and management information including one or more of:
      
      problem size, problem run time, timeslots, and users' allocation of nodes or virtual clusters; and
      
      selecting the subset of nodes based on the retrieved administrative policies and parameters of the job.
  - 3. The method of claim 2, wherein the determined optimum subset of nodes is a completely different subset of nodes than an original subset of nodes initially selected to execute the job.
  - 4. The method of claim 1, wherein the plurality of communicatively coupled nodes included in the computing environment is arranged in a three dimensional node structure.
  - 5. The method of claim 1, wherein each node comprises at least two first processors operable to communicate with each other, and a first switch, the first processors communicably coupled to the first switch, the first switch operable to communicably couple the first processors to at least six second nodes.
  - 6. The method of claim 1, wherein determining the failure of the node comprises one of:
    - determining that a repeating communication has not been received from the failed node; and
      
      polling of the failed node.
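
The recovery steps of claim 1 can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: the function names, the use of `(x, y, z)` tuples for nodes, the Manhattan hop metric, and the brute-force search for the subset with the smallest worst-case pairwise distance are all assumptions chosen to keep the example small.

```python
from itertools import combinations


def hops(a, b):
    """Manhattan distance between two (x, y, z) coordinates (assumed metric)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def spread(group):
    """Largest distance between any two nodes in the group (0 for one node)."""
    return max((hops(a, b) for a, b in combinations(group, 2)), default=0)


def most_compact(candidates, size):
    """Brute-force the subset of `size` nodes with the smallest spread."""
    best = min(combinations(sorted(candidates), size), key=spread, default=None)
    return set(best) if best is not None else None


def recover(cluster, allocations, failed):
    """Apply the claimed steps to plain containers (both arguments are mutated).

    cluster: set of node coordinates in the virtual cluster.
    allocations: dict mapping job name -> set of nodes running its tasks.
    Returns a list of (job, new_nodes) for every job that was restarted.
    """
    cluster.discard(failed)                      # remove the failed node
    restarted = []
    for job, nodes in list(allocations.items()):
        if failed not in nodes:                  # job not associated with the failure
            continue
        survivors = nodes - {failed}             # other operational nodes of the job
        del allocations[job]                     # models terminating and deallocating
        busy = set().union(*allocations.values())
        candidates = cluster - survivors - busy  # excludes deallocated and failed nodes
        chosen = most_compact(candidates, len(nodes))
        if chosen is not None:
            allocations[job] = chosen            # allocate, then re-execute
            restarted.append((job, chosen))
    return restarted


cluster = {(x, y, 0) for x in range(4) for y in range(2)}
allocations = {"sim": {(0, 0, 0), (1, 0, 0)}}
restarted = recover(cluster, allocations, failed=(0, 0, 0))
```

In this run the job moves to two adjacent nodes that were not part of its original allocation. The brute-force search is only practical for small clusters; a real scheduler would need a cheaper search.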

7. A system comprising:
- a plurality of communicatively coupled nodes of a computing environment; and
  
  a management node, comprising at least one processing device, configured to:
  
  determine a failure of a node included in a virtual cluster of the plurality of communicatively coupled nodes of the computing environment, the virtual cluster comprising a logical grouping of nodes configured to process a job having a plurality of tasks;
  
  remove the failed node from the virtual cluster;
  
  determine whether the job is associated with the failed node;
  
  responsive to a determination that the job is associated with the failed node, determine other operational nodes in the virtual cluster associated with the job;
  
  terminate the plurality of tasks of the job on each of the other operational nodes in the virtual cluster currently executing the plurality of tasks of the job;
  
  deallocate the other operational nodes associated with the job;
  
  determine an optimum subset of nodes other than the deallocated nodes and the failed node from the virtual cluster to re-execute the job, wherein the optimum subset of nodes is determined according to a specific topology that allows cooperating tasks of the job to communicate with any other tasks by minimizing distance between any two nodes;
  
  allocate the optimum subset of nodes; and
  
  re-execute the job on the optimum subset of nodes.
- - 8. The system of claim 7, wherein the management node is further configured to:
    - retrieve one or more administrative policies for the job, the one or more policies storing processing and management information including one or more of:
      
      problem size, problem run time, timeslots, and users' allocation of nodes or virtual clusters; and
      
      select the subset of nodes based on the retrieved administrative policies and parameters of the job.
  - 9. The system of claim 8, wherein the determined optimum subset of nodes is a completely different subset of nodes than an original subset of nodes initially selected to execute the job.
  - 10. The system of claim 7, wherein the plurality of communicatively coupled nodes included in the computing environment is arranged in a three dimensional node structure.
  - 11. The system of claim 7, wherein each node comprises at least two first processors operable to communicate with each other, and a first switch, the first processors communicably coupled to the first switch, the first switch operable to communicably couple the first processors to at least six second nodes.
  - 12. The system of claim 7, wherein determining the failure of the node comprises one of:
    - determining that a repeating communication has not been received from the failed node; and
      
      polling of the failed node.
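
The two detection options named in claim 12 (a missed repeating communication, or polling) might look like the following Python sketch. The function names, the timestamp dictionary, the timeout value, and the caller-supplied `probe` callback are hypothetical; the claim does not specify any of them.

```python
def missed_heartbeats(last_heartbeat, now, timeout):
    """Return nodes whose most recent heartbeat is older than `timeout` seconds."""
    return sorted(node for node, seen in last_heartbeat.items() if now - seen > timeout)


def unresponsive(nodes, probe):
    """Return nodes for which the caller-supplied `probe(node)` reports no reply."""
    return sorted(node for node in nodes if not probe(node))


# Heartbeat timestamps in seconds; n2 has been silent for 6 s.
silent = missed_heartbeats({"n1": 100.0, "n2": 95.0}, now=101.0, timeout=5.0)

# A stand-in probe that pretends only n1 fails to answer.
down = unresponsive(["n1", "n2"], probe=lambda node: node != "n1")
```

Passing the probe in as a callback keeps the sketch independent of any particular transport, such as a ping or a management-network query.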

13. A non-transitory computer readable storage medium having computer readable instructions stored thereon that, when executed by a computer, implement a method, the method comprising:
- determining, using one or more computers, a failure of a node included in a virtual cluster of a plurality of communicatively coupled nodes of a computing environment, the virtual cluster comprising a logical grouping of nodes configured to process a job having a plurality of tasks;
  
  removing the failed node from the virtual cluster;
  
  determining whether the job is associated with the failed node;
  
  responsive to a determination that the job is associated with the failed node, determining other operational nodes in the virtual cluster associated with the job;
  
  terminating the plurality of tasks of the job on each of the other operational nodes in the virtual cluster currently executing the plurality of tasks of the job;
  
  deallocating the other operational nodes associated with the job;
  
  determining an optimum subset of nodes other than the deallocated nodes and the failed node from the virtual cluster to re-execute the job, wherein the optimum subset of nodes is determined according to a specific topology that allows cooperating tasks of the job to communicate with any other tasks by minimizing distance between any two nodes;
  
  allocating the optimum subset of nodes; and
  
  re-executing the job on the optimum subset of nodes.
- - 14. The storage medium of claim 13, wherein determining the optimum subset of nodes further comprises:
    - retrieving one or more administrative policies for the job, the one or more policies storing processing and management information including one or more of:
      
      problem size, problem run time, timeslots, and users' allocation of nodes or virtual clusters; and
      
      selecting the subset of nodes based on the retrieved administrative policies and parameters of the job.
  - 15. The storage medium of claim 14, wherein the determined optimum subset of nodes is a completely different subset of nodes than an original subset of nodes initially selected to execute the job.
  - 16. The storage medium of claim 13, wherein the plurality of communicatively coupled nodes included in the computing environment is arranged in a three dimensional node structure.
  - 17. The storage medium of claim 13, wherein each node comprises at least two first processors operable to communicate with each other, and a first switch, the first processors communicably coupled to the first switch, the first switch operable to communicably couple the first processors to at least six second nodes.
  - 18. The storage medium of claim 13, wherein determining the failure of the node comprises one of:
    - determining that a repeating communication has not been received from the failed node; and
      
      polling of the failed node.
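
Claims 16 and 17 describe a three dimensional node structure in which each node's switch couples it to at least six other nodes. One arrangement with that property, assumed here purely for illustration, is a 3D grid with wraparound links (a torus). The `neighbours` function and the grid dimensions below are hypothetical.

```python
def neighbours(node, dims):
    """Six nearest neighbours of `node` in a 3D grid with wraparound links.

    node: (x, y, z) coordinate; dims: grid size along each axis.
    The six results are distinct when every dimension is at least 3.
    """
    result = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % dims[axis]  # wrap at the edges
            result.append(tuple(coord))
    return result


corner = neighbours((0, 0, 0), dims=(4, 4, 4))
```

Even a corner node such as `(0, 0, 0)` has six neighbours here, because the modulo wraps each axis around to the opposite face.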


Current Assignee: Raytheon Company (Rtx Corporation)
Original Assignee: Raytheon Company (Rtx Corporation)
Inventors: Davidson, Shannon V.; Richoux, Anthony N.
Primary Examiner(s): An, Meng
Assistant Examiner(s): Lee, James J
Application Number: US14/052,127
Publication Number: US 20140047092A1
Time in Patent Office: 767 Days
Field of Search
US Class Current: 1/1

CPC Class Codes

G06F 11/1482   by means of middleware or O...

G06F 11/2025   using centralised failover ...

G06F 2201/815   Virtual

G06F 9/4881   Scheduling strategies for d...

G06F 9/50   Allocation of resources, e....

G06F 9/5038   considering the execution o...

G06F 9/5066   Algorithms for mapping a pl...

G06F 9/5072   Grid computing

G06F 9/5077   Logical partitioning of res...

G06F 9/5083   Techniques for rebalancing ...
