Model-based self-optimizing distributed information management

US 7,720,841 B2
Filed: 10/04/2006
Issued: 05/18/2010
Est. Priority Date: 10/04/2006
Status: Expired due to Fees

Abstract

Disclosed are a method, information processing system, and computer readable medium for managing data collection in a distributed processing system. The method includes dynamically collecting at least one statistical query pattern associated with a selected group of information processing nodes. The statistical query pattern is dynamically collected from a plurality of information processing nodes in a distributed processing system. At least one operating attribute distribution associated with an operating attribute that has been queried for the selected group is dynamically monitored. The selected group is dynamically configured, based on the query pattern and the operating attribute distribution, to periodically push a set of attributes associated with each information processing node in the selected group.
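The push/pull split described in the abstract and in claim 1 can be illustrated with a small sketch. It is not taken from the patent: the `Manager` class, its method names, and the staleness check are hypothetical, chosen only to show a managing node answering a query from pushed attributes when they are fresh enough and falling back to an on-demand pull otherwise.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Manager:
    """Hypothetical managing node (illustrative only, not the patented design)."""

    # On-demand fetch supplied by the caller: (node, attribute) -> value.
    pull: Callable[[str, str], float]
    # Last pushed value and its timestamp, keyed by (node, attribute).
    pushed: Dict[Tuple[str, str], Tuple[float, float]] = field(default_factory=dict)
    pulls: int = 0  # number of on-demand pulls performed so far

    def on_push(self, node: str, attr: str, value: float, now: float) -> None:
        """Record an attribute value that an overlay node pushed at time `now`."""
        self.pushed[(node, attr)] = (value, now)

    def resolve(self, node: str, attr: str, now: float, max_staleness: float) -> float:
        """Answer from pushed data if it is fresh enough, otherwise pull."""
        entry = self.pushed.get((node, attr))
        if entry is not None and now - entry[1] <= max_staleness:
            return entry[0]
        self.pulls += 1
        return self.pull(node, attr)


manager = Manager(pull=lambda node, attr: 99.0)  # stub pull for the example
manager.on_push("n1", "cpu", 0.5, now=10.0)
fresh = manager.resolve("n1", "cpu", now=12.0, max_staleness=5.0)   # served from push
stale = manager.resolve("n1", "cpu", now=20.0, max_staleness=5.0)   # falls back to pull
```

In this sketch the goal stated in claim 1, that most queries are resolved from pushed attributes, corresponds to keeping `pulls` small relative to the number of calls to `resolve`.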

20 Claims

1. A method for managing data collection in a distributed processing system, the method on an information processing system comprising:
- dynamically collecting at least one statistical query pattern associated with a plurality of queries received from a plurality of client nodes in a distributed processing system;
  
  dynamically monitoring at least one operating attribute distribution across a plurality of overlay nodes, wherein the at least one operating attribute distribution is associated with an operating attribute that has been queried by at least one of the client nodes for the plurality of overlay nodes in the distributed processing system, wherein an overlay node performs one or more data stream processing functions, and wherein an operating attribute is a distributed resource consumable by the at least one of the client nodes;
  
  dynamically, and without user intervention, selecting a first set of overlay nodes from the plurality of overlay nodes based on the at least one statistical query pattern and the at least one operating attribute distribution; and
  
  dynamically configuring without user intervention, based on the query pattern and the operating attribute distribution, the first group of overlay nodes to periodically push a first set of operating attributes associated with each overlay node in the selected group to a managing node associated with at least the first group of overlay nodes, wherein the first group of overlay nodes and the first set of operating attributes are selected so that a majority of queries received by client nodes are resolved by the first set of operating attributes that have been pushed, and wherein on-demand pull operations are performed on a second group overlay nodes within the distributed processing system to acquire a second set of operating attributes to resolve queries received from client nodes in which the first set of operating attributes that have been pushed have failed to resolve.
- - 2. The method of claim 1, wherein the operating attribute is at least one of:
    - an environmental attribute;
      
      a software attribute; and
      
      a hardware attribute.
  - 3. The method of claim 1, wherein the dynamically collecting the statistical query pattern further includes collecting statistical information on at least one of:
    - frequently queried attributes;
      
      frequently queried range values; and
      
      frequent staleness constraints.
  - 4. The method of claim 1, wherein the dynamically configuring further comprises at least one of:
    - dynamically selecting the first set of attributes associated with the first group from a set of available operating attributes associated with each overlay node within the first group;
      
      dynamically configuring a push triggering threshold associated with the operating attribute; and
      
      dynamically configuring an update interval for each attribute in the set of operating attributes.
  - 5. The method of claim 4, wherein the dynamically configuring of the push triggering threshold and the update interval is based on a set of data queries that are chosen from a historical list of data queries.
  - 6. The method of claim 4, wherein the dynamically configuring of the push triggering threshold and the update interval minimizes a system cost comprising at least one of:
    - a push cost for distributed hosts to periodically send information to a manager node;
      
      a pull cost for the manager node to dynamically retrieve information from at least one distributed host.
  - 7. The method of claim 4, wherein the push triggering threshold filters out any overlay node that fails to satisfy a data query.
  - 8. The method of claim 4, wherein the dynamically selecting the set of attributes further comprises:
    - grouping data queries based on operating attributes specified by the data queries as a push subset;
      
      identifying a collection of operating attribute subsets, wherein for each operating attribute subset in the collection of operating attribute subsets, determining a cumulative query frequency for each set of operating attributes;
      
      determining a cost reduction value for each set of operating attributes; and
      
      identifying an operating attribute subset comprising a largest cost reduction value;
      
      adding the operating attribute subset comprising the largest cost reduction value to the push subset;
      
      removing attributes included in the operating attribute subset comprising the largest cost reduction value from each operating attribute subset in the collection of operating attribute subsets; and
      
      merging together any operating attribute subsets comprising duplicate operating attributes.
  - 9. The method of claim 4, wherein the dynamically configuring a push triggering threshold further comprises:
    - performing query positioning based on the operating attribute distribution.
  - 10. The method of claim 4, wherein the dynamically configuring a push triggering threshold, further comprises:
    - initializing the push triggering threshold to its minimum value;
      
      selecting an attribute subset with a largest cost reduction value;
      
      increasing the push triggering threshold until the cost reduction value is above a given threshold; and
      
      identifying, based on the push triggering threshold, any information processing nodes that satisfy the push triggering threshold for calculating a push cost; and
      
      identifying, based on the push triggering threshold, each query in a historical list of data queries that fail to be satisfied by a push data operation for calculating a pull cost.
  - 11. The method of claim 10, wherein the dynamically configuring the update interval is further based on a query response time constraint, wherein the update interval minimizes monitoring traffic for satisfying a data query.
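The attribute selection steps recited in claim 8 read like a greedy loop, which the following sketch tries to make concrete. The claims do not define the cost reduction value, so the formula used here (query frequency times a per-query pull cost, minus a per-attribute push cost) is an assumption, as are the function name `select_push_attributes` and its parameters.

```python
from collections import Counter
from typing import Iterable, Set


def select_push_attributes(
    queries: Iterable[Set[str]],
    pull_cost: float = 1.0,   # assumed cost of answering one query by pulling
    push_cost: float = 2.0,   # assumed cost of pushing one attribute
) -> Set[str]:
    """Greedy sketch of the subset selection recited in claim 8."""
    # Group queries by the attributes they name; the count is the
    # cumulative query frequency of each attribute subset.
    freq = Counter(frozenset(q) for q in queries)
    push_set: Set[str] = set()

    while freq:
        # Assumed cost reduction: pulls avoided minus pushes added.
        def reduction(subset: frozenset) -> float:
            return freq[subset] * pull_cost - len(subset) * push_cost

        best = max(freq, key=reduction)
        if reduction(best) <= 0:
            break  # no remaining subset is worth pushing
        push_set |= best

        # Remove the chosen attributes from every remaining subset and
        # merge subsets that became identical.
        merged: Counter = Counter()
        for subset, count in freq.items():
            rest = subset - best
            if rest:
                merged[rest] += count
        freq = merged

    return push_set


history = [{"cpu"}] * 5 + [{"cpu", "mem"}] * 3 + [{"disk"}]
chosen = select_push_attributes(history)
```

With the example history, `cpu` is selected first, the `{cpu, mem}` group then shrinks to `{mem}` and is selected next, and `disk` is left to on-demand pulls because a single query does not offset the assumed push cost.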

12. An information processing system for managing data collection in a distributed processing system, the information processing system comprising:
- a memory;
  
  a processor communicatively coupled to the memory; and
  
  an information management system communicatively coupled to the memory and the processor, the information management system for:
  
  dynamically collecting at least one statistical query pattern associated with a plurality of queries received from a plurality of client nodes in a distributed processing system;
  
  dynamically monitoring at least one operating attribute distribution across a plurality of overlay nodes, wherein the at least one operating attribute distribution is associated with an operating attribute that has been queried by at least one of the client nodes for the plurality of overlay nodes in the distributed processing system, wherein an overlay node performs one or more data stream processing functions, and wherein the at least one operating attribute is a distributed resource consumable by the at least one of the client nodes;
  
  dynamically, and without user intervention, selecting a first set of overlay nodes from the plurality of overlay nodes based on the at least one statistical query pattern and the at least one operating attribute distribution; and
  
  dynamically configuring, based on the query pattern and the operating attribute distribution, the first group of overlay nodes to periodically push a first set of operating attributes associated with each overlay node in the selected group to a managing node associated with at least the first group of overlay nodes, wherein the first group of overlay nodes and the first set of operating attributes are selected so that a majority of queries received by client nodes are resolved by the first set of operating attributes that have been pushed, and wherein on-demand pull operations are performed on a second group overlay nodes within the distributed processing system to acquire a second set of operating attributes to resolve queries received from client nodes in which the first set of operating attributes that have been pushed have failed to resolve.
- - 13. The information processing system of claim 12, wherein the dynamically configuring further comprises at least one of:
    - dynamically selecting the first set of attributes associated with the first group from a set of available operating attributes associated with each information processing node within the first group;
      
      dynamically configuring a push triggering threshold associated with the operating attribute; and
      
      dynamically configuring an update interval for each attribute in the set of operating attributes.
  - 14. The information processing system of claim 13, wherein the dynamically selecting the set of attributes further comprises:
    - grouping data queries based on operating attributes specified by the data queries as a push subset;
      
      identifying a collection of operating attribute subsets, wherein for each operating attribute subset in the collection of operating attribute subsets, determining a cumulative query frequency for each set of operating attributes;
      
      determining a cost reduction value for each set of operating attributes; and
      
      identifying an operating attribute subset comprising a largest cost reduction value;
      
      adding the operating attribute subset comprising the largest cost reduction value to the push subset;
      
      removing attributes included in the operating attribute subset comprising the largest cost reduction value from each operating attribute subset in the collection of operating attribute subsets; and
      
      merging together any operating attribute subsets comprising duplicate operating attributes.
  - 15. The information processing system of claim 13, wherein the dynamically configuring a push triggering threshold, further comprises:
    - initializing the push triggering threshold to its minimum value;
      
      selecting an attribute subset with a largest cost reduction value;
      
      increasing the push triggering threshold until the cost reduction value is above a given threshold; and
      
      identifying, based on the push triggering threshold, any information processing nodes that satisfy the push triggering threshold for calculating a push cost; and
      
      identifying, based on the push triggering threshold, each query in a historical list of data queries that fail to be satisfied by a push data operation for calculating a pull cost.
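The threshold configuration recited in claims 10 and 15 can be sketched under some simplifying assumptions that are not in the patent: a single numeric attribute, historical queries of the form "value >= bound", a node that pushes only when its value meets the threshold, and a missed query that costs one pull per node. The names `total_cost` and `configure_threshold` are hypothetical.

```python
from typing import Sequence


def total_cost(
    threshold: float,
    node_values: Sequence[float],
    query_bounds: Sequence[float],
    push_cost: float = 1.0,
    pull_cost: float = 1.0,
) -> float:
    """Assumed cost model: push cost for pushing nodes plus pull cost for missed queries."""
    # Nodes whose attribute value satisfies the push triggering threshold.
    pushers = sum(1 for value in node_values if value >= threshold)
    # Historical queries asking below the threshold cannot be answered from
    # pushed data alone; each is assumed to require a pull from every node.
    misses = sum(1 for bound in query_bounds if bound < threshold)
    return pushers * push_cost + misses * len(node_values) * pull_cost


def configure_threshold(
    node_values: Sequence[float],
    query_bounds: Sequence[float],
    target_reduction: float,
) -> float:
    """Raise the threshold from its minimum until the cost reduction exceeds the target."""
    # Assumes at least one node value or query bound is available.
    candidates = sorted(set(node_values) | set(query_bounds))
    minimum = candidates[0]
    baseline = total_cost(minimum, node_values, query_bounds)
    for candidate in candidates:
        if baseline - total_cost(candidate, node_values, query_bounds) > target_reduction:
            return candidate
    return minimum  # target never reached: keep the initial (minimum) threshold


values = [10, 20, 30, 40, 50, 60, 70, 80]
bounds = [50, 60, 60, 70]
threshold = configure_threshold(values, bounds, target_reduction=3.0)
```

In this example, raising the threshold to 50 stops four nodes from pushing without missing any historical query, while raising it further would force pulls for the query bounded at 50, so the sketch settles on 50.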

16. A tangible computer readable medium for managing data collection in a distributed processing system, the computer readable medium comprising instructions for:
- dynamically monitoring at least one operating attribute distribution across a plurality of overlay nodes, wherein the at least one operating attribute distribution is associated with an operating attribute that has been queried by at least one of a plurality of client nodes for the plurality of overlay nodes in the distributed processing system, wherein an overlay node performs one or more data stream processing functions, and wherein the at least one operating attribute is a distributed resource consumable by the at least one of the client nodes;
  
  dynamically, and without user intervention, selecting a first set of overlay nodes from the plurality of overlay nodes based on the at least one statistical query pattern and the at least one operating attribute distribution; and
  
  dynamically configuring, based on the query pattern and the operating attribute distribution, the first group of overlay nodes to periodically push a first set of operating attributes associated with each overlay node in the selected group to a managing node associated with at least the first group of overlay nodes, wherein the first group of overlay nodes and the first set of operating attributes are selected so that a majority of queries received by client nodes are resolved by the first set of operating attributes that have been pushed, and wherein on-demand pull operations are performed on a second group overlay nodes within the distributed processing system to acquire a second set of operating attributes to resolve queries received from client nodes in which the first set of operating attributes that have been pushed have failed to resolve.
- - 17. The tangible computer readable medium of claim 16, wherein the instructions for dynamically configuring further comprise instructions for at least one of:
    - dynamically selecting the first set of attributes associated with the first group from a set of available operating attributes associated with each information processing node within the first group;
      
      dynamically configuring a push triggering threshold associated with the operating attribute; and
      
      dynamically configuring an update interval for each attribute in the set of operating attributes.
  - 18. The tangible computer readable medium of claim 17, wherein the instructions for dynamically selecting the set of attributes further comprise instructions for:
    - grouping data queries based on operating attributes specified by the data queries as a push subset;
      
      identifying a collection of operating attribute subsets, wherein for each operating attribute subset in the collection of operating attribute subsets, determining a cumulative query frequency for each set of operating attributes;
      
      determining a cost reduction value for each set of operating attributes; and
      
      identifying an operating attribute subset comprising a largest cost reduction value;
      
      adding the operating attribute subset comprising the largest cost reduction value to the push subset;
      
      removing attributes included in the operating attribute subset comprising the largest cost reduction value from each operating attribute subset in the collection of operating attribute subsets; and
      
      merging together any operating attribute subsets comprising duplicate operating attributes.
  - 19. The tangible computer readable medium of claim 17, wherein the instructions for dynamically configuring a push triggering threshold further comprise instructions for:
    - initializing the push triggering threshold to its minimum value;
      
      selecting an attribute subset with a largest cost reduction value;
      
      increasing the push triggering threshold until the cost reduction value is above a given threshold; and
      
      identifying, based on the push triggering threshold, any information processing nodes that satisfy the push triggering threshold for calculating a push cost; and
      
      identifying, based on the push triggering threshold, each query in a historical list of data queries that fail to be satisfied by a push data operation for calculating a pull cost.
  - 20. The tangible computer readable medium of claim 16, wherein the instructions for dynamically collecting the statistical query pattern further include instructions for collecting statistical information on at least one of:
    - frequently queried attributes;
      
      frequently queried range values; and
      
      frequent staleness constraints.
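Claims 3 and 20 name three kinds of query statistics: frequently queried attributes, frequently queried range values, and frequent staleness constraints. A minimal way to collect them is a set of counters, as sketched below. The `Query` and `QueryPatternStats` types and their fields are assumptions made for illustration; the claims do not specify a data layout.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Query:
    """Hypothetical range query on one operating attribute."""
    attribute: str
    low: float
    high: float
    max_staleness: float  # oldest acceptable age of the answer, in seconds


@dataclass
class QueryPatternStats:
    """Counters for the three statistics named in claims 3 and 20."""
    attributes: Counter = field(default_factory=Counter)
    ranges: Counter = field(default_factory=Counter)
    staleness: Counter = field(default_factory=Counter)

    def record(self, query: Query) -> None:
        """Update all three counters for one received query."""
        self.attributes[query.attribute] += 1
        self.ranges[(query.attribute, query.low, query.high)] += 1
        self.staleness[query.max_staleness] += 1

    def top_attributes(self, k: int) -> List[str]:
        """Return the k most frequently queried attributes."""
        return [name for name, _ in self.attributes.most_common(k)]


stats = QueryPatternStats()
for q in (
    Query("cpu", 0.0, 0.5, 10.0),
    Query("cpu", 0.0, 0.5, 10.0),
    Query("mem", 1.0, 4.0, 30.0),
):
    stats.record(q)
```

Counters like these could feed the attribute selection and threshold configuration steps recited in the dependent claims, for example by supplying the query history that those steps group and score.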

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Yu, Philip S.; Gu, Xiaohui; Chang, Shu-Ping
Primary Examiner(s): Vo; Tim T.
Assistant Examiner(s): Koo; Gary J
Application Number: US11/538,525
Publication Number: US 20080086469A1
Time in Patent Office: 1,322 Days
Field of Search: 707/6
US Class Current: 707/721
CPC Class Codes:
G06F 11/3447   Performance evaluation by m...
G06F 11/3495   for systems
G06F 16/24542   Plan optimisation
G06F 16/27   Replication, distribution o...
G06F 2201/81   Threshold
Y10S 707/966   Distributed
