Analyzing large-scale data processing jobs

US 10,514,993 B2
Filed: 02/14/2017
Issued: 12/24/2019
Est. Priority Date: 02/14/2017
Status: Active Grant

Abstract

Methods, systems, and apparatus for data analysis in a distributed computing system by accessing data stored at a first processing zone associated with a distributed data processing job, detecting information identifying a particular child job associated with the distributed data processing job, comparing the identifying information to data stored at a second processing zone, and identifying an additional child job as associated with the distributed data processing job based on a result of the comparison. The methods, systems and apparatus are further for correlating particular output data associated with the particular child job and additional output data associated with the additional child job for the distributed data processing job, determining performance data for the distributed data processing job based on the output data associated with each of the particular child job and the additional child job, and providing for display the performance data for the distributed data processing job.
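
The prefix comparison and correlation described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `ChildJobRecord` type, its fields, the dot-separated job ID format, and the rule that everything before the last separator names the parent job are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildJobRecord:
    """Hypothetical record of one child job as stored in a processing zone."""
    job_id: str              # identifying information, e.g. "etl.run7.child-0" (assumed format)
    zone: str
    cpu_seconds: float
    running_seconds: float


def parent_prefix(job_id: str, sep: str = ".") -> str:
    """Assumed convention: the text before the last separator names the parent job."""
    head, _, _ = job_id.rpartition(sep)
    return head


def find_sibling_jobs(known, other_zone_records, sep="."):
    """Return records from another zone whose IDs share the known job's parent prefix."""
    prefix = parent_prefix(known.job_id, sep)
    if not prefix:
        return []
    return [
        r for r in other_zone_records
        if r.job_id != known.job_id and r.job_id.startswith(prefix + sep)
    ]


def summarize(parent, records):
    """Correlate per-child output into performance data for the parent job."""
    return {
        "parent_job": parent,
        "child_jobs": sorted(r.job_id for r in records),
        "total_cpu_seconds": sum(r.cpu_seconds for r in records),
        "max_running_seconds": max(r.running_seconds for r in records),
    }


# Usage: one child job is known from zone A; zone B is searched for siblings.
known = ChildJobRecord("etl.run7.child-0", "zone-a", 12.0, 30.0)
zone_b = [
    ChildJobRecord("etl.run7.child-1", "zone-b", 8.0, 45.0),
    ChildJobRecord("etl.run8.child-0", "zone-b", 5.0, 10.0),
    ChildJobRecord("other.run1.child-0", "zone-b", 1.0, 2.0),
]
siblings = find_sibling_jobs(known, zone_b)
report = summarize(parent_prefix(known.job_id), [known, *siblings])
```

Matching on `prefix + sep` rather than on the bare prefix keeps a job such as `etl.run70.child-0` from being mistaken for a child of `etl.run7`.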

33 Citations

18 Claims

1. A computer-implemented method for data analysis in a distributed computing system, the method comprising:
- accessing data, stored in a storage device of a first processing zone, that is associated with a particular child job created from a particular distributed data processing job that has been executed;
  
  detecting, from the data stored in the storage device, identifying information that identifies the particular child job created from the particular distributed data processing job;
  
  in response to detecting the identifying information that identifies the particular child job created from the particular distributed data processing job, determining that the identifying information that identifies the particular child job and second identifying information stored in a storage device of a second processing zone share a common prefix;
  
  in response to determining that the identifying information that identifies the particular child job and the second identifying information stored in the storage device of the second processing zone share a common prefix, identifying an additional child job as being created from the particular distributed data processing job;
  
  correlating particular output data associated with the particular child job and additional output data associated with the additional child job created from the particular distributed data processing job;
  
  determining performance data for the particular distributed data processing job based on the particular output data associated with the particular child job and the additional output data associated with the additional child job; and
  
  providing for display the performance data for the particular distributed data processing job based on the particular output data associated with the particular child job and the additional output data associated with the additional child job.
  - 2. The method of claim 1, further comprising:
    - comparing performance data for the particular distributed data processing job to a performance threshold; and
      
      providing a notification based on a result of comparing performance data for the particular distributed data processing job to the performance threshold.
  - 3. The method of claim 2, wherein the notification comprises one or more of:
    - an audible alert, a tactile alert, a visual alert, or an electronic message.
  - 4. The method of claim 1, wherein the performance data comprises one or more of:
    - a running time, memory usage, CPU time, disk usage, a relationship between each child job and the particular distributed data processing job, one or more counters associated with the particular distributed data processing job, or a processing status.
  - 5. The method of claim 1, further comprising:
    - displaying a user interface that includes display of the performance data, wherein the user interface comprises an interactive hierarchical structure.
  - 6. The method of claim 1, wherein the particular distributed data processing job is associated with a particular pipeline;
    - wherein correlating particular output data associated with the particular child job and additional output data associated with the additional child job for the particular distributed data processing job comprises associating the particular child job and the additional child job with the particular pipeline; and
      
      the method further comprising:
      
      determining pipeline performance data for a first run of the particular pipeline; and
      
      determining pipeline performance data for a second run of the particular pipeline.
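
The threshold comparison and notification of claims 2 and 3 could look like the sketch below. The function name, the shape of the `performance` dictionary, and the choice to notify when a value exceeds the threshold are assumptions for illustration; the claims only require a comparison and a notification based on its result.

```python
from typing import Callable


def check_threshold(
    performance: dict,
    metric: str,
    threshold: float,
    notify: Callable[[str], None],
) -> bool:
    """Compare one performance metric to a threshold and notify on a breach.

    `notify` stands in for any delivery channel (visual alert, electronic
    message, and so on); here it is just a callable that takes a string.
    """
    value = performance[metric]
    exceeded = value > threshold  # assumed direction: higher is worse
    if exceeded:
        notify(
            f"{performance['parent_job']}: {metric}={value} "
            f"exceeds threshold {threshold}"
        )
    return exceeded


# Usage: collect notifications in a list instead of sending them anywhere.
messages = []
perf = {"parent_job": "etl.run7", "running_seconds": 120.0, "cpu_seconds": 20.0}
slow = check_threshold(perf, "running_seconds", 60.0, messages.append)
busy = check_threshold(perf, "cpu_seconds", 60.0, messages.append)
```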

7. A system, comprising:
- one or more processors; and
  
  a memory storing instructions that are operable, when executed, to cause the one or more processors to perform operations comprising:
  
  accessing data, stored in a storage device of a first processing zone, that is associated with a particular child job created from a particular distributed data processing job that has been executed;
  
  detecting, from the data stored in the storage device, identifying information that identifies the particular child job created from the particular distributed data processing job;
  
  in response to detecting the identifying information that identifies the particular child job created from the particular distributed data processing job, determining that the identifying information that identifies the particular child job and second identifying information stored in a storage device of a second processing zone share a common prefix;
  
  in response to determining that the identifying information that identifies the particular child job and the second identifying information stored in the storage device of the second processing zone share a common prefix, identifying an additional child job as being created from the particular distributed data processing job;
  
  correlating particular output data associated with the particular child job and additional output data associated with the additional child job created from the particular distributed data processing job;
  
  determining performance data for the particular distributed data processing job based on the particular output data associated with the particular child job and the additional output data associated with the additional child job; and
  
  providing for display the performance data for the particular distributed data processing job based on the particular output data associated with the particular child job and the additional output data associated with the additional child job.
  - 8. The system of claim 7, the operations further comprising:
    - comparing performance data for the particular distributed data processing job to a performance threshold; and
      
      providing a notification based on a result of comparing performance data for the particular distributed data processing job to the performance threshold.
  - 9. The system of claim 8, wherein the notification comprises one or more of:
    - an audible alert, a tactile alert, a visual alert, or an electronic message.
  - 10. The system of claim 7, wherein the performance data comprises one or more of:
    - a running time, memory usage, CPU time, disk usage, a relationship between each child job and the particular distributed data processing job, one or more counters associated with the particular distributed data processing job, or a processing status.
  - 11. The system of claim 7, the operations further comprising:
    - displaying a user interface that includes display of the performance data, wherein the user interface comprises an interactive hierarchical structure.
  - 12. The system of claim 7, wherein the particular distributed data processing job is associated with a particular pipeline;
    - wherein correlating particular output data associated with the particular child job and additional output data associated with the additional child job for the particular distributed data processing job comprises associating the particular child job and the additional child job with the particular pipeline; and
      
      the operations further comprising:
      
      determining pipeline performance data for a first run of the particular pipeline; and
      
      determining pipeline performance data for a second run of the particular pipeline.

13. A non-transitory computer-readable storage device storing instructions executable by one or more processors which, upon such execution, cause the one or more processors to perform operations in a distributed computing system, the operations comprising:
- accessing data, stored in a storage device of a first processing zone, that is associated with a particular child job created from a particular distributed data processing job that has been executed;
  
  detecting, from the data stored in the storage device, identifying information that identifies the particular child job created from the particular distributed data processing job;
  
  in response to detecting the identifying information that identifies the particular child job created from the particular distributed data processing job, determining that the identifying information that identifies the particular child job and second identifying information stored in a storage device of a second processing zone share a common prefix;
  
  in response to determining that the identifying information that identifies the particular child job and the second identifying information stored in the storage device of the second processing zone share a common prefix, identifying an additional child job as being created from the particular distributed data processing job;
  
  correlating particular output data associated with the particular child job and additional output data associated with the additional child job created from the particular distributed data processing job;
  
  determining performance data for the particular distributed data processing job based on the particular output data associated with the particular child job and the additional output data associated with the additional child job; and
  
  providing for display the performance data for the particular distributed data processing job based on the particular output data associated with the particular child job and the additional output data associated with the additional child job.
  - 14. The computer-readable storage device of claim 13, the operations further comprising:
    - comparing performance data for the particular distributed data processing job to a performance threshold; and
      
      providing a notification based on a result of comparing performance data for the particular distributed data processing job to the performance threshold.
  - 15. The computer-readable storage device of claim 14, wherein the notification comprises one or more of:
    - an audible alert, a tactile alert, a visual alert, or an electronic message.
  - 16. The computer-readable storage device of claim 13, wherein the performance data comprises one or more of:
    - a running time, memory usage, CPU time, disk usage, a relationship between each child job and the particular distributed data processing job, one or more counters associated with the particular distributed data processing job, or a processing status.
  - 17. The computer-readable storage device of claim 13, further comprising:
    - displaying a user interface that includes display of the performance data, wherein the user interface comprises an interactive hierarchical structure.
  - 18. The computer-readable storage device of claim 13, wherein the particular distributed data processing job is associated with a particular pipeline;
    - wherein correlating particular output data associated with the particular child job and additional output data associated with the additional child job for the particular distributed data processing job comprises associating the particular child job and the additional child job with the particular pipeline; and
      
      the operations further comprising:
      
      determining pipeline performance data for a first run of the particular pipeline; and
      
      determining pipeline performance data for a second run of the particular pipeline.
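
Claims 6, 12, and 18 associate child jobs with a pipeline and determine performance data for two runs of it. A minimal sketch of that grouping is shown below, assuming each child job is available as a hypothetical `(pipeline, run_id, running_seconds)` tuple and that summed running time is the metric of interest; neither assumption comes from the patent text.

```python
from collections import defaultdict


def pipeline_run_totals(child_jobs):
    """Sum running time per (pipeline, run) from per-child-job tuples."""
    totals = defaultdict(float)
    for pipeline, run_id, seconds in child_jobs:
        totals[(pipeline, run_id)] += seconds
    return dict(totals)


def compare_runs(totals, pipeline, first_run, second_run):
    """Report performance for two runs of one pipeline and their difference."""
    first = totals[(pipeline, first_run)]
    second = totals[(pipeline, second_run)]
    return {"first": first, "second": second, "delta": second - first}


# Usage: two child jobs belong to run1 and one to run2 of the same pipeline.
jobs = [
    ("etl", "run1", 10.0),
    ("etl", "run1", 5.0),
    ("etl", "run2", 20.0),
]
totals = pipeline_run_totals(jobs)
comparison = compare_runs(totals, "etl", "run1", "run2")
```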

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Google LLC (Alphabet Inc.)
Inventors
Sukoco, Arif; Li, Yesheng; Korsky, Ross Vincent; Sharma, Loveena; Garcia de Souza, Carlos Alexandre
Primary Examiner(s)
Lyons, Andrew M.

Application Number
US15/432,375
Publication Number
US 20180232295A1
Time in Patent Office
1,043 Days
Field of Search
718100
US Class Current
CPC Class Codes

G06F 11/3006   where the computing system ...

G06F 11/302   where the computing system ...

G06F 11/3024   where the computing system ...

G06F 11/3404   for parallel or distributed...

G06F 11/3409   for performance assessment

G06F 11/3495   for systems

G06F 2201/865   Monitoring of software

G06F 9/4843   by program, e.g. task dispa...
