Specifying a highly-resilient system in a disaggregated compute environment

US 20170295108A1
Filed: 04/07/2016
Published: 10/12/2017
Est. Priority Date: 04/07/2016
Status: Active Grant

Abstract

Server resources in a data center are disaggregated into shared server resource pools. Servers are constructed dynamically, on-demand and based on workload requirements and a tenant's resiliency requirements (e.g., as specified in an SLA), by allocating from these resource pools. A disaggregated compute system of this type keeps track of resources that are available in the shared server resource pools, and it manages those resources based on that information and the health of the resources. As a workload is processed by the server entity and component resources fail, the server entity composition is changed, e.g. by allocating other resources to the server entity, or by transitioning to other server entities, to ensure that a resiliency requirement is maintained.
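The allocation step described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `Resource`, `ServerEntity`, `free` and `compose` are hypothetical, and the resiliency requirement is modeled, purely as an assumption, as a number of spare resources held per pool type on top of the projected workload demand.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Resource:
    rid: str
    kind: str                      # pool type, e.g. "cpu" or "memory"
    healthy: bool = True
    owner: Optional[str] = None    # tenant whose server entity holds it, if any


@dataclass
class ServerEntity:
    tenant: str
    resources: List[Resource] = field(default_factory=list)


def free(pool: List[Resource]) -> List[Resource]:
    """Healthy resources in a pool that no server entity holds yet."""
    return [r for r in pool if r.healthy and r.owner is None]


def compose(pools: Dict[str, List[Resource]], tenant: str,
            demand: Dict[str, int], spares: int) -> ServerEntity:
    """Allocate demand[kind] + spares resources of each kind to a new entity.

    `spares` is an assumed, simplified encoding of the resiliency requirement.
    """
    # Check every pool first so a shortage leaves all pools untouched.
    for kind, count in demand.items():
        if len(free(pools[kind])) < count + spares:
            raise RuntimeError(f"pool {kind!r} cannot satisfy the request")
    entity = ServerEntity(tenant)
    for kind, count in demand.items():
        for res in free(pools[kind])[:count + spares]:
            res.owner = tenant
            entity.resources.append(res)
    return entity


pools = {
    "cpu": [Resource(f"cpu{i}", "cpu") for i in range(4)],
    "memory": [Resource(f"mem{i}", "memory") for i in range(4)],
}
entity = compose(pools, "tenant-a", {"cpu": 2, "memory": 1}, spares=1)
print(len(entity.resources), len(free(pools["cpu"])))
```

Here the entity receives three processors and two memory units (demand plus one spare each), leaving one processor and two memory units free in the shared pools.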

30 Citations

25 Claims

1. A method for assigning resources in a compute environment, comprising:
- providing a set of server resource pools, wherein a server resource pool comprises a set of resources of a common type;
  
  for a given tenant, defining a server entity composed of one or more resources selected from one or more of the server resource pools, wherein the one or more resources are selected from the one or more of the server resource pools based on a projected workload and a resiliency requirement;
  
  receiving information collected from monitoring health of the one or more resources in the server entity as the workload is processed; and
  
  based on the monitoring indicating a change in health of a resource in the server entity, adjusting a composition of the server entity to attempt to maintain the resiliency requirement.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method as described in claim 1 wherein the change in health of a resource is a component failure.
  - 3. The method as described in claim 2 wherein the component failure is one of:
    - a processor failure, a memory failure, an accelerator failure, a storage failure, and another component failure.
  - 4. The method as described in claim 1 wherein adjusting the composition of the server entity de-allocates a failed component and promotes a second component of a same type to assume responsibility for the failed component.
  - 5. The method as described in claim 4 wherein the second component that is promoted is associated with the server entity, or a different server entity.
  - 6. The method as described in claim 4 wherein the second component that is promoted is assigned based on its network locality relative to the failed component.
  - 7. The method as described in claim 4 further including de-associating another lower-priority workload that is running on the second component from the second component prior to promoting the second component to assume responsibility for the failed component.
  - 8. The method as described in claim 1 wherein resources are assigned for multiple tenants, and at least first and second of the multiple tenants have different resiliency requirements.
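Claims 4 to 7 describe de-allocating a failed component and promoting a second component of the same type, chosen by network locality and, if needed, after removing a lower-priority workload from it. The following is a minimal Python sketch of that sequence under stated assumptions: the `Component` type, the integer `rack` field used as a stand-in for network location, and the rule of preferring an idle component over preempting a busy one are all hypothetical and not specified by the claims.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Component:
    cid: str
    kind: str                        # "cpu", "memory", "accelerator", ...
    rack: int                        # assumed proxy for network location
    priority: Optional[int] = None   # priority of the workload it runs; None = idle


def pick_replacement(failed: Component, candidates: List[Component],
                     priority: int) -> Optional[Component]:
    """Choose a same-type component that is idle or runs lower-priority work."""
    eligible = [c for c in candidates
                if c.kind == failed.kind
                and (c.priority is None or c.priority < priority)]
    if not eligible:
        return None
    # Assumed policy: idle components first, then the closest rack.
    return min(eligible,
               key=lambda c: (c.priority is not None, abs(c.rack - failed.rack)))


def adjust_composition(entity: List[Component], failed: Component,
                       candidates: List[Component],
                       priority: int) -> Optional[Component]:
    """De-allocate the failed component and promote a replacement, if any."""
    entity.remove(failed)                 # de-allocate the failed component
    chosen = pick_replacement(failed, candidates, priority)
    if chosen is None:
        return None                       # requirement cannot be restored here
    chosen.priority = priority            # drops any lower-priority workload
    entity.append(chosen)                 # promoted component joins the entity
    return chosen


failed = Component("cpu0", "cpu", rack=1, priority=5)
entity = [failed, Component("mem0", "memory", rack=1, priority=5)]
candidates = [
    Component("cpu7", "cpu", rack=5),
    Component("cpu3", "cpu", rack=2),
    Component("mem9", "memory", rack=1),
    Component("cpu1", "cpu", rack=1, priority=1),
]
chosen = adjust_composition(entity, failed, candidates, priority=5)
print(chosen.cid)
```

In this run the idle processor on the nearest rack is promoted. The busy processor on the failed component's own rack would only be preempted if no idle candidate of the same type existed.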

9. Apparatus for assigning resources in a compute environment, comprising:
- one or more hardware processors;
  
  computer memory holding computer program instructions executed by the hardware processors and operative to:
  
  manage a set of server resource pools, wherein a server resource pool comprises a set of resources of a common type;
  
  for a given tenant, define a server entity composed of one or more resources selected from one or more of the server resource pools, wherein the one or more resources are selected from the one or more of the server resource pools based on a projected workload and a resiliency requirement;
  
  receive information collected from monitoring health of the one or more resources in the server entity as the workload is processed; and
  
  based on the monitoring indicating a change in health of a resource in the server entity, adjust a composition of the server entity to attempt to maintain the resiliency requirement.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
- - 10. The apparatus as described in claim 9 wherein the change in health of a resource is a component failure.
  - 11. The apparatus as described in claim 10 wherein the component failure is one of:
    - a processor failure, a memory failure, an accelerator failure, a storage failure, and another component failure.
  - 12. The apparatus as described in claim 9 wherein the computer program instructions to adjust the composition of the server entity are operative to de-allocate a failed component and promote a second component of a same type to assume responsibility for the failed component.
  - 13. The apparatus as described in claim 12 wherein the second component that is promoted is associated with the server entity, or a different server entity.
  - 14. The apparatus as described in claim 12 wherein the second component that is promoted is assigned based on its network locality relative to the failed component.
  - 15. The apparatus as described in claim 12 wherein the computer program instructions are further operative to de-associate another lower-priority workload that is running on the second component from the second component prior to promoting the second component to assume responsibility for the failed component.
  - 16. The apparatus as described in claim 9 wherein resources are assigned for multiple tenants, and at least first and second of the multiple tenants have different resiliency requirements.

17. A computer program product in a non-transitory computer readable medium for use in a data processing system for assigning resources in a compute environment, the computer program product holding computer program instructions executed in the data processing system and operative to:
- manage a set of server resource pools, wherein a server resource pool comprises a set of resources of a common type;
  
  for a given tenant, define a server entity composed of one or more resources selected from one or more of the server resource pools, wherein the one or more resources are selected from the one or more of the server resource pools based on a projected workload and a resiliency requirement;
  
  receive information collected from monitoring health of the one or more resources in the server entity as the workload is processed; and
  
  based on the monitoring indicating a change in health of a resource in the server entity, adjust a composition of the server entity to attempt to maintain the resiliency requirement.
- Dependent claims (18, 19, 20, 21, 22, 23, 24):
- - 18. The computer program product as described in claim 17 wherein the change in health of a resource is a component failure.
  - 19. The computer program product as described in claim 18 wherein the component failure is one of:
    - a processor failure, a memory failure, an accelerator failure, a storage failure, and another component failure.
  - 20. The computer program product as described in claim 17 wherein the computer program instructions to adjust the composition of the server entity are operative to de-allocate a failed component and promote a second component of a same type to assume responsibility for the failed component.
  - 21. The computer program product as described in claim 20 wherein the second component that is promoted is associated with the server entity, or a different server entity.
  - 22. The computer program product as described in claim 20 wherein the second component that is promoted is assigned based on its network locality relative to the failed component.
  - 23. The computer program product as described in claim 20 wherein the computer program instructions are further operative to de-associate another lower-priority workload that is running on the second component from the second component prior to promoting the second component to assume responsibility for the failed component.
  - 24. The computer program product as described in claim 17 wherein resources are assigned for multiple tenants, and at least first and second of the multiple tenants have different resiliency requirements.

25. A data center facility, comprising:
- a set of server resource pools that comprise a compute pool, and a memory pool;
  
  a disaggregated compute system comprising processors selected from the compute pool, computer memories selected from the memory pool, and an optical interconnect, the disaggregated compute system being configured to meet a resiliency requirement associated with a tenant, the resiliency requirement being associated with a tenant's service level agreement (SLA); and
  
  a resiliency manager executing in a hardware element and responsive to a failure in one or more resources in the disaggregated compute system as the tenant's workload is processed to selectively adjust a composition of the disaggregated compute system to maintain the resiliency requirement.
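
One way a resiliency manager might decide whether a composition still satisfies a tenant's SLA after failures is to compare healthy resource counts against the workload's needs plus an SLA-dependent margin. The Python sketch below illustrates this idea only. The tier names, the `SLA_SPARES` table and the `shortfall` function are assumptions introduced here and do not appear in the claims.

```python
from typing import Dict

# Hypothetical SLA tiers mapped to the number of spare resources kept per type.
SLA_SPARES: Dict[str, int] = {"gold": 2, "silver": 1, "bronze": 0}


def shortfall(required: Dict[str, int], healthy: Dict[str, int],
              sla_tier: str) -> Dict[str, int]:
    """Resources of each type that must be added to restore the SLA margin."""
    spares = SLA_SPARES[sla_tier]
    return {kind: max(0, need + spares - healthy.get(kind, 0))
            for kind, need in required.items()}


required = {"cpu": 4, "memory": 2}    # what the workload needs to run
healthy = {"cpu": 5, "memory": 4}     # healthy resources left after a failure

print(shortfall(required, healthy, "gold"))
print(shortfall(required, healthy, "bronze"))
```

With the same healthy counts, a tenant on the assumed "gold" tier needs one more processor allocated from the compute pool, while a "bronze" tenant needs nothing. This mirrors the case in claims 8, 16 and 24 where tenants have different resiliency requirements.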

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Mahindru, Ruchi; Bivens, John Alan; Das, Koushik K.; Li, Min; Ramasamy, Harigovind V.; Ruan, Yaoping; Salapura, Valentina; Schenfeld, Eugen

Granted Patent

US 10,129,169 B2
CPC Class Codes

H04L 47/70   Admission control; Resource...

H04L 47/805   QOS or priority aware

H04L 47/822   Collecting or measuring res...
