Identifying task instance outliers based on metric data in a large scale parallel processing system

US 9,280,386 B1
Filed: 07/14/2011
Issued: 03/08/2016
Est. Priority Date: 07/14/2011
Status: Active Grant

Abstract

Among other disclosed subject matter, a method includes receiving metric data associated with an execution of each of a plurality of task instances. The plurality of task instances includes task instances associated with a task, and the metric data for each task instance relates to the execution performance of that task instance. For each task instance, the method includes determining a deviation of the metric data associated with the task instance relative to an overall deviation of the metric data for the plurality of task instances of the task during each of a plurality of intervals, and combining the deviation measurements for the task instance that exceed a threshold deviation to obtain a combined deviation value. Each deviation measurement corresponds to the deviation of the metric data for one of the plurality of intervals. The method also includes ranking the combined deviation values associated with at least a subset of the task instances.
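
The flow described in the abstract could be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: it assumes the threshold is the per-interval mean plus `k` standard deviations (the text only says a threshold deviation), that each deviation measurement is a z-score, and that measurements are combined by summing. The names `rank_outliers`, `samples`, and `k` are hypothetical.

```python
from statistics import mean, pstdev


def rank_outliers(samples, k=2.0):
    """Rank task instances by combined deviation, highest first.

    samples: {task_instance_id: [metric value per interval]}, with every
    instance reporting the same number of intervals (an assumption).
    """
    if not samples:
        return []
    n_intervals = len(next(iter(samples.values())))
    combined = {tid: 0.0 for tid in samples}
    for i in range(n_intervals):
        # Mean and standard deviation across all instances for this interval.
        values = [series[i] for series in samples.values()]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # every instance behaved identically in this interval
        for tid, series in samples.items():
            z = (series[i] - mu) / sigma
            if z > k:  # assumed threshold: mean + k * sigma
                combined[tid] += z  # assumed combination: sum of scores
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Five well-behaved instances and one that is consistently slower.
samples = {f"t{i}": [1.0, 1.0, 1.0] for i in range(5)}
samples["t5"] = [3.0, 3.0, 3.0]
ranked = rank_outliers(samples)
```

With a population standard deviation, a single outlier among `n` instances can reach a z-score of at most the square root of `n - 1`, so a cutoff of `k=2.0` only fires once there are at least six instances; that is a property of this sketch, not a statement from the patent.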

28 Claims

1. A computer-implemented method comprising:
- receiving, for each of a plurality of task instances that execute one or more computer-executable instructions to perform a task, a plurality of performance measures that each represent an execution performance of a property of the respective task instance for a particular time interval, wherein the plurality of task instances are executed in parallel on one or more computers;
  
  for each task instance:
  
  determining, for each performance measure of the respective task instance, whether the respective performance measure exceeds a threshold value that is based on a function of a mean and a standard deviation of the performance measure that represent the same property as the respective performance measure;
  
  determining, for each of the performance measures that exceeds the threshold value, a score using the respective performance measure and a mean and a standard deviation of the performance measures that represent the same property as the respective performance measure; and
  
  combining the scores for the performance measure that represent the execution performance measure of the same property of the respective task instance to obtain a combined score value;
  
  ranking the combined score values associated with at least a subset of the plurality of task instances to identify an outlier; and
  
  terminating an execution of a particular task instance on a first computer and executing the particular task instance on a second computer different from the first computer based on the ranking of the combined score values, the particular task instance from the plurality of task instances.
  - 2. The computer-implemented method of claim 1 wherein the property comprises cycles per instruction and the standard deviation is based on cycles per instruction values associated with the plurality of task instances.
  - 3. The computer-implemented method of claim 2 wherein the property comprises mean cycles per instruction and the threshold value is based on a function of a mean cycles per instruction value associated with the task performed on the one or more computers and the standard deviation.
  - 4. The computer-implemented method of claim 1 wherein the property comprises mean cycles per instruction and determining, for each of the performance measures that exceeds the threshold value, the score using the respective performance measure and the mean and the standard deviation of the performance measures that represent the same property as the respective performance measure comprises determining the score using a mean cycles per instruction value associated with the task performed on the one or more computers, the standard deviation associated with the task, and the respective performance measure associated with the task instance.
  - 5. The computer-implemented method of claim 1 wherein the one or more computers comprises one or more computers of the same hardware platform.
  - 6. The computer-implemented method of claim 1, further comprising:
    - modifying an execution of a second particular task instance executed on the first computer based on the ranking of the combined score values, the second particular task instance from the plurality of task instances.
  - 7. The computer-implemented method of claim 1, further comprising:
    - generating a report including the ranked combined score values;
      
      providing the report to a user; and
      
      receiving, in response to providing the report to the user, an input from the user, wherein the input causes an execution of a second particular task instance executed on a first computer to be modified, the second particular task instance from the plurality of task instances.
  - 8. The computer-implemented method of claim 1, wherein the property comprises a number of cycles per instruction and the performance measure includes data indicating a number of cycles per instruction for each task instance during the particular time interval.
  - 9. The computer-implemented method of claim 1, further comprising:
    - filtering, before determining whether the respective performance measure exceeds the threshold value that is based on the function of the mean and the standard deviation of the performance measure that represent the same property as the respective performance measure, the performance measures based on predetermined parameters that identify potentially unreliable measurements.
  - 10. The computer-implemented method of claim 9, wherein the filtering comprises removing performance measures that indicate low CPU usage.
  - 11. The method of claim 1, comprising normalizing each of the combined score values, wherein ranking the combined score values comprises ranking the normalized combined score values.
  - 12. The method of claim 11, wherein normalizing each of the combined score values comprises normalizing the combined score value using a quantity of the scores for the performance measure that represent the execution performance measure of the same property of the respective task instance.
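
Claims 9 through 12 add filtering and normalization around the scoring of claim 1. A hedged sketch of both steps follows. It assumes each measurement carries a CPU usage fraction, that "low CPU usage" means below a hypothetical cutoff `min_cpu`, and that normalizing by "a quantity of the scores" means dividing the summed score by the number of contributing scores. The claims do not fix any of these specifics, and the names `Measure`, `filter_reliable`, and `normalized_score` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Measure:
    cpi: float        # cycles per instruction for one interval
    cpu_usage: float  # fraction of a CPU used in that interval (assumed field)


def filter_reliable(measures, min_cpu=0.25):
    """Drop measurements taken at low CPU usage (hypothetical cutoff)."""
    return [m for m in measures if m.cpu_usage >= min_cpu]


def normalized_score(scores):
    """Sum the per-interval scores, then divide by how many there were."""
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


kept = filter_reliable([Measure(cpi=2.0, cpu_usage=0.9),
                        Measure(cpi=9.0, cpu_usage=0.01)])
```

Dividing by the count keeps an instance that was sampled over many intervals from outranking a worse instance that was sampled over only a few.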

13. A system, comprising:
- memory; and
  
  one or more processors coupled to the memory and configured to perform operations comprising:
  
  receiving, for each of a plurality of task instances that execute one or more computer-executable instructions to perform a task, a plurality of performance measures that each represent an execution performance measure of a property of the respective task instance for a particular time interval, wherein the plurality of task instances are executed in parallel on one or more computers;
  
  for each task instance:
  
  determining, for each performance measure of the respective task instance, whether the respective performance measure exceeds a threshold value that is based on a function of a mean and a standard deviation of the performance measure that represent the same property as the respective performance measure;
  
  determining, for each of the performance measures that exceeds the threshold value, a score using the respective performance measure and a mean and a standard deviation of the performance measures that represent the same property as the respective performance measure; and
  
  combining the scores for the performance measure that represent the execution performance measure of the same property of the respective task instance to obtain a combined score value;
  
  ranking the combined score values associated with at least a subset of the plurality of task instances to identify an outlier; and
  
  terminating an execution of a particular task instance on a first computer and executing the particular task instance on a second computer different from the first computer based on the ranking of the combined score values, the particular task instance from the plurality of task instances.
  - 14. The system of claim 13 wherein the property comprises cycles per instruction and the standard deviation is based on cycles per instruction values associated with the plurality of task instances.
  - 15. The system of claim 14 wherein the property comprises mean cycles per instruction and the threshold value is based on a function of a mean cycles per instruction value associated with the task performed on the one or more computers and the standard deviation.
  - 16. The system of claim 13 wherein the property comprises mean cycles per instruction and determining, for each of the performance measures that exceeds the threshold value, the score using the respective performance measure and the mean and the standard deviation of the performance measures that represent the same property as the respective performance measure comprises determining the score using a mean cycles per instruction value associated with the task performed on the one or more computers, the standard deviation associated with the task, and the respective performance measure associated with the task instance.
  - 17. The system of claim 13 wherein the one or more computers comprises one or more computers of the same hardware platform.
  - 18. The system of claim 13 wherein the one or more processors are configured to perform operations further comprising:
    - modifying an execution of a second particular task instance executed on the first computer based on the ranking of the combined score values, the second particular task instance from the plurality of task instances.
  - 19. The system of claim 13 wherein the one or more processors are configured to perform operations further comprising:
    - generating a report including the ranked combined score values and the outlier;
      
      providing the report to a user; and
      
      receiving, in response to providing the report to the user, an input from the user, wherein the input causes an execution of a second particular task instance executed on a first computer to be modified, the second particular task instance from the plurality of task instances.
  - 20. The system of claim 13 wherein the property comprises a number of cycles per instruction and the performance measure includes data indicating a number of cycles per instruction for each task instance during the particular time interval.
  - 21. The system of claim 13 wherein the one or more processors are configured to perform operations further comprising:
    - generating a report including the ranked combined score values and the outlier.
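
The report of claims 19 and 21 could take many forms. One minimal possibility is a plain-text table of ranked combined scores with the outlier flagged. The layout, the function name `format_report`, and the `top_n` parameter are assumptions made for illustration only.

```python
def format_report(ranked, top_n=1):
    """Render ranked scores as text.

    ranked: list of (instance_id, combined_score), highest score first.
    The first top_n entries with a positive score are flagged as outliers.
    """
    lines = ["rank  instance  combined_score"]
    for pos, (tid, score) in enumerate(ranked, start=1):
        flag = "  <- outlier" if pos <= top_n and score > 0 else ""
        lines.append(f"{pos:<4}  {tid:<8}  {score:.2f}{flag}")
    return "\n".join(lines)


report = format_report([("t5", 6.71), ("t0", 0.0)])
```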

22. A non-transitory computer readable medium encoded with a computer program comprising instructions that, when executed, operate to cause a computer to perform operations:
- receive, for each of a plurality of task instances that execute one or more computer-executable instructions to perform a task, a plurality of performance measures that each represent an execution performance measure of a property of the respective task instance for a particular time interval, wherein the plurality of task instances are executed in parallel on one or more computers;
  
  for each task instance:
  
  determine, for each performance measure of the respective task instance, whether the respective performance measure exceeds a threshold value that is based on a function of a mean and a standard deviation of the performance measure that represent the same property as the respective performance measure;
  
  determine, for each of the performance measures that exceeds the threshold value, a score using the respective performance measure and a mean and a standard deviation of the performance measures that represent the same property as the respective performance measure; and
  
  combine the scores for the performance measure that represent the execution performance measure of the same property of the respective task instance to obtain a combined score value;
  
  rank the combined score values associated with at least a subset of the plurality of task instances to identify an outlier; and
  
  terminating an execution of a particular task instance on a first computer and executing the particular task instance on a second computer different from the first computer based on the ranking of the combined score values, the particular task instance from the plurality of task instances.
  - 23. The computer readable medium of claim 22 wherein the property comprises cycles per instruction and the standard deviation is based on cycles per instruction values associated with the plurality of task instances.
  - 24. The computer readable medium of claim 23 wherein the property comprises mean cycles per instruction and the threshold value is based on a function of a mean cycles per instruction value associated with the task performed on the one or more computers and the standard deviation.
  - 25. The computer readable medium of claim 22 wherein the property comprises mean cycles per instruction and the operations to determine, for each of the performance measures that exceeds the threshold value, the score using the respective performance measure and the mean and the standard deviation of the performance measures that represent the same property as the respective performance measure comprises determining the score using a mean cycles per instruction value associated with the task performed on the one or more computers, the standard deviation associated with the task, and the respective performance measure associated with the task instance.
  - 26. The computer readable medium of claim 22 further comprising instructions that, when executed, operate to cause a computer to perform operations:
    - modify an execution of a second particular task instance executed on a first computer based on the ranking of the combined score values, the second particular task instance from the plurality of task instances.
  - 27. The computer readable medium of claim 22 further comprising instructions that, when executed, operate to cause a computer to perform operations:
    - generate a report including the ranked combined score values;
      
      provide the report to a user; and
      
      receive, in response to providing the report to the user, an input from the user, wherein the input causes an execution of a second particular task instance executed on a first computer to be modified, the second particular task instance from the plurality of task instances.
  - 28. The computer readable medium of claim 22, further comprising normalizing each of the combined score values, wherein ranking the combined score values comprises ranking the normalized combined score values.
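
The final step of each independent claim, terminating a task instance on a first computer and executing it on a second, could be driven by a decision function like the one below. This sketch only chooses what to move and where; it does not call any real scheduler. The function name `pick_migration`, the `placement` mapping, and the "first different machine" policy are all hypothetical.

```python
def pick_migration(ranked, placement, machines):
    """Return (instance_id, old_machine, new_machine) for the top outlier.

    ranked: (instance_id, combined_score) pairs, highest score first.
    placement: {instance_id: machine currently running it}.
    machines: candidate machine names.
    Returns None when nothing should be moved.
    """
    if not ranked or ranked[0][1] <= 0:
        return None  # no instance scored above the threshold
    tid = ranked[0][0]
    old = placement[tid]
    for m in machines:
        if m != old:  # the second computer must differ from the first
            return tid, old, m
    return None  # no other computer is available


decision = pick_migration([("t5", 6.7), ("t0", 0.0)],
                          {"t5": "m1", "t0": "m2"},
                          ["m1", "m2"])
```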

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Hagmann, Robert; Zhang, Xiao; Tune, Eric S.; Gokhale, Vrijendra
Primary Examiner(s): Puente, Emerson
Assistant Examiner(s): Kamran, Mehran
Application Number: US13/183,234
Time in Patent Office: 1,699 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 9/4881   Scheduling strategies for d...
G06F 9/5038   considering the execution o...
G06F 9/5088   involving task migration
