Parallel processing of ETL jobs involving extensible markup language documents

US 9,064,047 B2
Filed: 09/24/2009
Issued: 06/23/2015
Est. Priority Date: 09/24/2009
Status: Expired due to Fees


Abstract

Techniques for running an Extract Transform Load (ETL) job in parallel on one or more processors wherein the ETL job comprises use of an extensible markup language (XML) document are provided. The techniques include receiving an XML document input, identifying a node in the XML document at which partitioning of the XML document is to begin, sending partition information to each respective processor, performing a shallow parsing of the XML document in parallel on the one or more processors, wherein each processor performs shallow parsing using the identified partition node until it reaches its identified partition, using the shallow parsing to generate the partition of the input XML document, wherein each processor generates a different partition of the same XML document, and sending each partition in streaming format to an ETL job instance.
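The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names `generate_partition` and `shred`, the `<partition>` wrapper element, and the use of Python's standard `xml.etree.ElementTree` are all assumptions made for illustration, and the sketch assumes the partition node does not nest inside itself. Each call to `generate_partition` stands in for one processor, which streams through the document, counts occurrences of the partition node, and keeps only the occurrences in its assigned range.

```python
import io
import xml.etree.ElementTree as ET


def generate_partition(xml_text, partition_tag, start, size):
    """Stream through the document and keep only one worker's slice.

    Elements named `partition_tag` are counted in document order. The
    worker keeps occurrences start .. start+size-1 and skips the rest.
    The kept elements are wrapped in a hypothetical <partition> root so
    the result is a well-formed document on its own.
    """
    kept = []
    seen = 0
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != partition_tag:
            continue
        if start <= seen < start + size:
            elem.tail = None  # drop whitespace that follows the element
            kept.append(ET.tostring(elem, encoding="unicode"))
        seen += 1
        elem.clear()  # release the subtree we no longer need
        if seen >= start + size:
            break  # this worker has its whole partition; stop reading
    return "<partition>" + "".join(kept) + "</partition>"


def shred(partition_xml):
    """Relationalize one partition: one row (dict) per partition node."""
    root = ET.fromstring(partition_xml)
    return [{child.tag: child.text for child in record} for record in root]
```

Because every worker receives the same partition node and its own start and size, no worker needs any output from another worker, which is the independence property the claims rely on.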

17 Claims

1. A method for running an Extract Transform Load (ETL) job in parallel on multiple processors wherein the ETL job comprises use of an extensible markup language (XML) document, wherein the method comprises:
- receiving an ETL job definition and an input XML document;
  
  identifying a node in the input XML document at which partitioning of the input XML document is to begin based on the ETL job definition and the XML document;
  
  identifying a size of each partition to be created within the input XML document, wherein each partition is created on a different processor;
  
  sending partition information to each respective processor, wherein the partition information includes the node in the input XML document at which partitioning of the input XML document is to begin and the size parameter for each partition;
  
  performing a shallow parsing of the input XML document in parallel on the multiple processors, wherein each processor performs shallow parsing using the identified partition node until it reaches its identified partition, and wherein said performing the shallow parsing comprises;
  
  partitioning the input XML document in accordance with the portioning information; and
  
  performing a schema validation for each of one or more partition nodes corresponding to the partitioned input XML document;
  
  using the shallow parsing to generate the partition of the input XML document on each processor, wherein each processor independently generates a different partition of the input XML document without dependence on any part of the input XML document that is not part of the partition associated with that processor;
  
  sending each partition in streaming format to an ETL job instance; and
  
  running each ETL job instance, wherein said running each ETL job instance comprises shredding the XML document in parallel on multiple nodes, and wherein said shredding comprises using horizontal partitioning to relationalize different partitions of the input XML document in parallel on different nodes.
- Dependent claims (2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15):
  - 2. The method of claim 1, wherein the shallow parsing is performed on a single processor.
  - 3. The method of claim 2, wherein the shallow parsing sends start and end points of each partition to the one or more processors, wherein each processor seeks to the start of its partition and sends its partition to its instance of an ETL job definition.
  - 4. The method of claim 1, wherein the shallow parsing comprises parsing of only nodes that appear in an XML path language (XPATH) of the partition node and ignoring nodes not in the partition node XPATH.
  - 5. The method of claim 1, further comprising ensuring that each processor performs a single pass of the XML document.
  - 6. The method of claim 1, further comprising, if an XML schema file contains at least one of a minoccurs XML schema indicator and a maxoccurs XML schema indicator defined on a node on which partitioning is to be done:
    - removing the one or more indicators from the schema file;
      
      checking validity of the indicators during the shallow parsing on a node processing a last segment of the XML document; and
      
      generating an error if the validation fails during shallow parsing.
  - 8. The method of claim 1, further comprising generating output in the form of a modified extract, transform, and load (ETL) job definition that can run on multiple processors in parallel.
  - 9. The method of claim 1, wherein sending each partition in streaming format comprises adding a root node to the partition.
  - 10. The method of claim 1, further comprising passing the XML document beyond the identified partition size as a separate XML document in streaming format.
  - 11. The method of claim 1, further comprising partitioning the XML document for load balancing, comprising partitioning the XML document to keep an overall load on all parallel processors evenly distributed to achieve maximum performance gains.
  - 12. The method of claim 1, further comprising distributing XML partitions evenly in a single pass manner.
  - 13. The method of claim 1, further comprising, if a sub-tree rooted at the identified node does not contain all output nodes, adding each missing output node to the partition and modifying each XML path language (XPath) provided to the ETL job instance.
  - 14. The method of claim 1, further comprising, if there are multiple output nodes in the XML document that are not part of a repetition path, keeping only a first occurrence of such nodes in the XML document.
  - 15. The method of claim 1, further comprising providing a system, wherein the system comprises one or more distinct software modules, each of the one or more distinct software modules being embodied on a tangible computer-readable recordable storage medium, and wherein the one or more distinct software modules comprise a partition node identification module, a partition size computation module, a shallow parser module and an ETL job instance module executing on a hardware processor.
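As a rough illustration of claims 4, 11 and 12, the sketch below counts only the nodes that lie on the partition node's path and then divides that count evenly across workers. The names `count_partition_nodes` and `split_evenly` and the sample path syntax are hypothetical. Note one simplification: `ElementTree` still tokenizes the whole document, so this version is "shallow" only in the sense that nodes off the partition path are never inspected.

```python
import io
import xml.etree.ElementTree as ET


def count_partition_nodes(xml_text, partition_path):
    """Count elements whose full path equals `partition_path`.

    `partition_path` is a simple absolute path such as
    "/catalog/orders/order" (an assumed syntax, not full XPath).
    """
    target = partition_path.strip("/").split("/")
    stack = []  # tags from the root down to the current element
    count = 0
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == target:
                count += 1
            stack.pop()
            elem.clear()  # content is never examined, so discard it
    return count


def split_evenly(total, workers):
    """Return one (start, size) pair per worker, sizes differing by at most 1."""
    base, extra = divmod(total, workers)
    ranges = []
    start = 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        ranges.append((start, size))
        start += size
    return ranges
```

Each `(start, size)` pair, together with the partition path, corresponds to the partition information that claim 1 sends to a processor.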

7. The method of 1, further comprising, if an XML schema file contains at least one of an all XML schema indicator, sequence XML schema indicator, and choice XML schema indicator defined on the node on which partitioning is to be done:
- sending each node taking part in the one or more indicators but not in a repetition element path, only to a first node; and
  
  sending each remaining node a first value of one or more output nodes that are not in the repetition path, and a schema file that does not have the XML schema indicators.

16. A computer program product comprising a tangible non-transitory computer readable recordable storage medium including computer useable program code for running an Extract Transform Load (ETL) job in parallel on multiple processors wherein the ETL job comprises use of an extensible markup language (XML) document, the computer program product including:
- computer useable program code for receiving an ETL job definition and an input XML document;
  
  computer useable program code for identifying a node in the input XML document at which partitioning of the input XML document is to begin based on the ETL job definition and the XML document;
  
  computer useable program code for identifying a size of each partition to be created within the input XML document, wherein each partition is created on a different processor;
  
  computer useable program code for sending partition information to each respective processor, wherein the partition information includes the node in the input XML document at which partitioning of the input XML document is to begin and the size parameter for each partition;
  
  computer useable program code for performing a shallow parsing of the input XML document in parallel on the multiple processors, wherein each processor performs shallow parsing using the identified partition node until it reaches its identified partition, and wherein said performing the shallow parsing comprises;
  
  partitioning the input XML document in accordance with the portioning information; and
  
  performing a schema validation for each of one or more partition nodes corresponding to the partitioned input XML document;
  
  computer useable program code for using the shallow parsing to generate the partition of the input XML document on each processor, wherein each processor independently generates a different partition of the input XML document without dependence on any part of the input XML document that is not part of the partition associated with that processor;
  
  computer useable program code for sending each partition in streaming format to an ETL job instance; and
  
  computer useable program code for running each ETL job instance, wherein said running each ETL job instance comprises shredding the XML document in parallel on multiple nodes, and wherein said shredding comprises using horizontal partitioning to relationalize different partitions of the input XML document in parallel on different nodes.

17. A system for running an Extract Transform Load (ETL) job in parallel on multiple processors wherein the ETL job comprises use of an extensible markup language (XML) document, comprising:
- a memory; and
  
  at least one processor coupled to the memory and operative to;
  
  receive an ETL job definition and an input XML document;
  
  identify a node in the input XML document at which partitioning of the input XML document is to begin based on the ETL job definition and the XML document;
  
  identify a size of each partition to be created within the input XML document, wherein each partition is created on a different processor;
  
  send partition information to each respective processor, wherein the partition information includes the node in the input XML document at which partitioning of the input XML document is to begin and the size parameter for each partition;
  
  perform a shallow parsing of the input XML document in parallel on the multiple processors, wherein each processor performs shallow parsing using the identified partition node until it reaches its identified partition, and wherein said performing the shallow parsing comprises;
  
  partitioning the input XML document in accordance with the portioning information; and
  
  performing a schema validation for each of one or more partition nodes corresponding to the partitioned input XML document;
  
  use the shallow parsing to generate the partition of the input XML document on each processor, wherein each processor independently generates a different partition of the input XML document without dependence on any part of the input XML document that is not part of the partition associated with that processor;
  
  send each partition in streaming format to an ETL job instance; and
  
  run each ETL job instance, wherein said running each ETL job instance comprises shredding the XML document in parallel on multiple nodes, and wherein said shredding comprises using horizontal partitioning to relationalize different partitions of the input XML document in parallel on different nodes.

Current Assignee: GlobalFoundries, Inc.
Original Assignee: International Business Machines Corporation
Inventors: Agarwal, Manoj K.; Bhide, Manish A.; Kotwal, Srilakshmi; Mittapalli, Srinivas Kiran; Padmanabhan, Sriram
Primary Examiner(s): Alam, Hosain
Assistant Examiner(s): Harper, Eliyah Stone
Application Number: US12/566,255
Publication Number: US 20110072319A1
Time in Patent Office: 2,098 Days
Field of Search: 707/999.104, 707/102, 707/867, 707/713
US Class Current: 1/1
CPC Class Codes:
- G06F 11/3604 Software analysis for verif...
- G06F 16/86 Mapping to a database
