Data arrangement management in a distributed data cluster environment of a shared pool of configurable computing resources

US 10,387,415 B2
Filed: 06/28/2016
Issued: 08/20/2019
Est. Priority Date: 06/28/2016
Status: Active Grant

1 Assignment

0 Petitions

Abstract

Disclosed aspects relate to data arrangement management in a distributed data cluster environment of a shared pool of configurable computing resources. In the distributed data cluster environment, a set of data is monitored for a data redistribution candidate trigger. The data redistribution candidate trigger is detected with respect to the set of data. Based on the data redistribution candidate trigger, the set of data is analyzed with respect to a candidate data redistribution action. Using the candidate data redistribution action, a new data arrangement associated with the set of data is determined. Accordingly, the new data arrangement is established.

27 Citations

20 Claims

1. A computer-implemented method for data arrangement management in a distributed data cluster environment of a shared pool of configurable computing resources, the method comprising:
- monitoring, in the distributed data cluster environment, a set of data for a data redistribution candidate trigger;
  
  detecting, in the distributed data cluster environment, the data redistribution candidate trigger with respect to the set of data, wherein detecting the data redistribution candidate trigger comprises:
  
  detecting a data structure which indicates a workload pattern;
  
  building a new distribution key for the data structure to change the workload pattern to reduce data movement during a query operation;
  
  determining, based on the new distribution key, a new data arrangement associated with the set of data, and comparing the new data arrangement with a current data arrangement to determine which data arrangement is more efficient based on resource usage; and
  
  in response to determining that the new data arrangement is more efficient than the current data arrangement, establishing, based on the new distribution key, the new data arrangement in the distributed data cluster environment such that at least a portion of the set of data comprises a different physical location in the new data arrangement.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, wherein the new data arrangement positively impacts a query performance metric when running a query in the distributed data cluster environment.
  - 3. The method of claim 1, wherein the workload pattern includes one or more of a join on a common column and a set of query accesses which use a join key column other than a current distribution key column.
  - 4. The method of claim 1, wherein the new distribution key is range-based or hash-based.
  - 5. The method of claim 1, further comprising:
    - presenting, to a user in response to determining the new data arrangement, a selection with respect to whether to initiate establishment of the new data arrangement; and
      
      receiving, from the user in advance of establishing the new data arrangement, the selection to initiate establishment of the new data arrangement.
  - 6. The method of claim 1, wherein monitoring the set of data, detecting the data redistribution candidate trigger, building the new distribution key, determining the new data arrangement, and establishing the new data arrangement each occur in an automated fashion without user intervention.
  - 7. The method of claim 1, further comprising:
    - metering use of the data arrangement management; and
      
      generating an invoice based on the metered use.
  - 8. The method of claim 1, further comprising:
    - using a clustering technique to identify workload patterns involving one or more columns.
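
The steps of claim 1 can be pictured with a small sketch. This is an illustration only, not the patented implementation: the function names, the query-log format, the use of modulo on integer key values as a stand-in hash function, and the "rows moved" cost measure are all assumptions made for the example. It detects a workload pattern (the column most often used as a join key), builds a hash-based distribution key on that column, and adopts it only if it reduces data movement for the join.

```python
from collections import Counter


def partition_of(value, num_partitions):
    """Hash-based placement (assumption: integer keys, modulo as the hash)."""
    return value % num_partitions


def most_common_join_column(query_log):
    """Workload pattern: the column most often used as a join key."""
    counts = Counter(query["join_column"] for query in query_log)
    return counts.most_common(1)[0][0]


def rows_moved_for_join(rows, distribution_key, join_column, num_partitions):
    """Count rows stored on a different partition than the join needs them on."""
    moved = 0
    for row in rows:
        stored_on = partition_of(row[distribution_key], num_partitions)
        needed_on = partition_of(row[join_column], num_partitions)
        if stored_on != needed_on:
            moved += 1
    return moved


def propose_distribution_key(rows, query_log, current_key, num_partitions):
    """Return the candidate key if it moves fewer rows, else the current key."""
    candidate_key = most_common_join_column(query_log)
    current_cost = rows_moved_for_join(
        rows, current_key, candidate_key, num_partitions)
    # Distributing on the join column itself co-locates matching rows,
    # so this cost is zero here; a real comparison would also weigh the
    # one-time cost of redistributing the data (ignored in this sketch).
    new_cost = rows_moved_for_join(
        rows, candidate_key, candidate_key, num_partitions)
    return candidate_key if new_cost < current_cost else current_key


# Hypothetical table distributed on order_id but mostly joined on customer_id.
rows = [{"order_id": i, "customer_id": i // 2} for i in range(8)]
query_log = [{"join_column": "customer_id"}] * 3 + [{"join_column": "order_id"}]
print(propose_distribution_key(rows, query_log, "order_id", 4))
```

With these sample rows, six of the eight rows sit on the wrong partition for a join on `customer_id` under the current key, so the sketch proposes `customer_id` as the new distribution key.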

9. A computer-implemented method for data arrangement management in a distributed data cluster environment of a shared pool of configurable computing resources, the method comprising:
- detecting, in the distributed data cluster environment, a data redistribution candidate trigger with respect to a set of data, wherein detecting the data redistribution candidate trigger comprises:
  
  detecting a data skew of a data structure which exceeds a threshold data skew value, the data skew being a disproportionate distribution of the set of data across multiple partitions of the distributed data cluster environment;
  
  building a new distribution key for the data structure which exceeds the threshold data skew value to reduce the data skew of the data structure;
  
  determining, based on the new distribution key, a new data arrangement associated with the set of data, and comparing the new data arrangement with a current data arrangement to determine which data arrangement is more efficient based on resource usage; and
  
  in response to determining that the new data arrangement is more efficient than the current data arrangement, establishing, based on the new distribution key, the new data arrangement in the distributed data cluster environment such that at least a portion of the set of data comprises a different physical location in the new data arrangement.
- Dependent claims (10, 11, 12, 13, 14, 15):
  - 10. The method of claim 9, wherein the set of data includes a set of diagnostic metadata for the distributed data cluster environment.
  - 11. The method of claim 9, wherein the new data arrangement positively impacts a query performance metric when running a query in the distributed data cluster environment.
  - 12. The method of claim 9, wherein the new distribution key is range-based or hash-based.
  - 13. The method of claim 9, further comprising:
    - presenting, to a user in response to determining the new data arrangement, a selection with respect to whether to initiate establishment of the new data arrangement; and
      
      receiving, from the user in advance of establishing the new data arrangement, the selection to initiate establishment of the new data arrangement.
  - 14. The method of claim 9, wherein detecting the data redistribution candidate trigger, building the new distribution key, determining the new data arrangement, and establishing the new data arrangement each occur in an automated fashion without user intervention.
  - 15. The method of claim 9, further comprising:
    - metering use of the data arrangement management; and
      
      generating an invoice based on the metered use.
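
The skew trigger of claim 9 can be sketched as follows. The claim defines data skew only as a disproportionate distribution of data across partitions; the specific metric used here (largest partition size divided by the mean partition size), the function names, and the modulo placement on integer keys are assumptions chosen for illustration, not details taken from the patent.

```python
def partition_counts(rows, key, num_partitions):
    """Row count per partition when rows are placed by key % num_partitions."""
    counts = [0] * num_partitions
    for row in rows:
        counts[row[key] % num_partitions] += 1
    return counts


def skew_ratio(counts):
    """Assumed skew metric: largest partition divided by the mean partition."""
    total = sum(counts)
    if total == 0:
        return 1.0  # an empty table is treated as perfectly balanced
    return max(counts) / (total / len(counts))


def pick_key_if_skewed(rows, current_key, candidate_keys, num_partitions,
                       threshold):
    """Keep the current key unless its skew exceeds the threshold and a
    candidate key yields a lower skew."""
    current_skew = skew_ratio(partition_counts(rows, current_key, num_partitions))
    if current_skew <= threshold:
        return current_key  # no redistribution candidate trigger

    def skew_for(key):
        return skew_ratio(partition_counts(rows, key, num_partitions))

    best_key = min(candidate_keys, key=skew_for)
    return best_key if skew_for(best_key) < current_skew else current_key


# Hypothetical table: every row has region 0, so all rows land on one partition.
rows = [{"region": 0, "order_id": i} for i in range(8)]
print(skew_ratio(partition_counts(rows, "region", 4)))
print(pick_key_if_skewed(rows, "region", ["order_id"], 4, threshold=1.5))
```

In this sample, distributing on `region` puts all eight rows on one of four partitions (ratio 4.0), while `order_id` spreads them evenly (ratio 1.0), so the sketch selects `order_id`.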

16. A computer-implemented method for data arrangement management in a distributed data cluster environment of a shared pool of configurable computing resources, the method comprising:
- detecting, in the distributed data cluster environment, a data redistribution candidate trigger with respect to a set of data, wherein detecting the data redistribution candidate trigger comprises:
  
  detecting a data structure which exceeds a data transmission frequency threshold by detecting data structures using network bandwidth beyond a threshold amount during a particular temporal period; and
  
  establishing, in the distributed data cluster environment, a new data arrangement by deploying the data structure which exceeds the data transmission frequency threshold to at least a threshold number of partitions in the distributed data cluster environment to reduce overall computing resources utilization such that at least a portion of the set of data comprises a different physical location in the new data arrangement.
- Dependent claims (17, 18, 19, 20):
  - 17. The method of claim 16, wherein the set of data includes a set of diagnostic metadata for the distributed data cluster environment.
  - 18. The method of claim 16, wherein the new data arrangement positively impacts a query performance metric when running a query in the distributed data cluster environment.
  - 19. The method of claim 16, wherein detecting the data redistribution candidate trigger and establishing the new data arrangement each occur in an automated fashion without user intervention.
  - 20. The method of claim 16, further comprising:
    - metering use of the data arrangement management; and
      
      generating an invoice based on the metered use.
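
Claim 16 can be read as: find data structures whose network transfer volume within a time window exceeds a threshold, then deploy copies of them to at least a threshold number of partitions. The sketch below is one possible reading under stated assumptions; the transfer-log format, the placement dictionary, and all function names are hypothetical and do not come from the patent.

```python
def bytes_sent_in_window(transfer_log, start, end):
    """Total bytes transferred per table for events with start <= time < end."""
    usage = {}
    for event in transfer_log:
        if start <= event["time"] < end:
            usage[event["table"]] = usage.get(event["table"], 0) + event["bytes"]
    return usage


def hot_tables(transfer_log, start, end, bandwidth_threshold):
    """Tables whose transfer volume in the window exceeds the threshold."""
    usage = bytes_sent_in_window(transfer_log, start, end)
    return sorted(t for t, sent in usage.items() if sent > bandwidth_threshold)


def replicate(placement, table, all_partitions, min_partitions):
    """Return a new placement where `table` has copies on at least
    min_partitions partitions (or on all partitions if there are fewer)."""
    holders = set(placement.get(table, ()))
    for partition in all_partitions:
        if len(holders) >= min_partitions:
            break
        holders.add(partition)
    new_placement = dict(placement)
    new_placement[table] = sorted(holders)
    return new_placement


# Hypothetical log: dim_date is shipped over the network far more than facts.
log = [
    {"time": 1, "table": "dim_date", "bytes": 600},
    {"time": 2, "table": "dim_date", "bytes": 700},
    {"time": 3, "table": "facts", "bytes": 100},
    {"time": 99, "table": "facts", "bytes": 5000},  # outside the window
]
placement = {"dim_date": [2], "facts": [0, 1, 2, 3]}
for table in hot_tables(log, start=0, end=10, bandwidth_threshold=1000):
    placement = replicate(placement, table, [0, 1, 2, 3], min_partitions=3)
print(placement)
```

Copying a frequently transmitted table to more partitions lets queries read it locally instead of fetching it over the network, which is the resource saving the claim points to.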

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Chainani, Naresh K.; Cho, James H.
Primary Examiner(s): Channavajjala, Srirama
Application Number: US15/196,017
Publication Number: US 20170371928A1
Time in Patent Office: 1,148 Days
CPC Class Codes
- G06F 16/245   Query processing
- G06F 16/24545   Selectivity estimation or d...
- G06F 16/24549   Run-time optimisation
- G06F 16/24554   Unary operations; Data part...
- G06F 16/2471   Distributed queries
- G06F 16/248   Presentation of query results
- G06F 16/958   Organisation or management ...
