Parallel processing of ETL jobs involving extensible markup language documents
First Claim
1. A method for running an Extract Transform Load (ETL) job in parallel on multiple processors wherein the ETL job comprises use of an extensible markup language (XML) document, wherein the method comprises:
receiving an ETL job definition and an input XML document;
identifying a node in the input XML document at which partitioning of the input XML document is to begin based on the ETL job definition and the XML document;
identifying a size of each partition to be created within the input XML document, wherein each partition is created on a different processor;
sending partition information to each respective processor, wherein the partition information includes the node in the input XML document at which partitioning of the input XML document is to begin and the size parameter for each partition;
performing a shallow parsing of the input XML document in parallel on the multiple processors, wherein each processor performs shallow parsing using the identified partition node until it reaches its identified partition, and wherein said performing the shallow parsing comprises:
partitioning the input XML document in accordance with the partition information; and
performing a schema validation for each of one or more partition nodes corresponding to the partitioned input XML document;
using the shallow parsing to generate the partition of the input XML document on each processor, wherein each processor independently generates a different partition of the input XML document without dependence on any part of the input XML document that is not part of the partition associated with that processor;
sending each partition in streaming format to an ETL job instance; and
running each ETL job instance, wherein said running each ETL job instance comprises shredding the XML document in parallel on multiple nodes, and wherein said shredding comprises using horizontal partitioning to relationalize different partitions of the input XML document in parallel on different nodes.
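The partition-generation steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: `xml.etree.ElementTree.iterparse` stands in for the shallow parser (it builds each element in full rather than skipping over it), threads stand in for processors, and the `order` tag, the sample document, and the names `generate_partition` and `run_parallel` are hypothetical. It also assumes the partition node does not nest inside itself.

```python
import io
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input document; "order" plays the role of the partition node.
SAMPLE_XML = (
    b"<orders>"
    b"<order id='1'/><order id='2'/><order id='3'/>"
    b"<order id='4'/><order id='5'/>"
    b"</orders>"
)


def generate_partition(xml_bytes, partition_tag, start, size):
    """Scan the document and return the serialized partition nodes
    numbered start .. start + size - 1 (0-based).

    iterparse is only a stand-in for a shallow parser: nodes before the
    partition are parsed and then discarded rather than skipped.
    """
    selected = []
    seen = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag != partition_tag:
            continue
        if start <= seen < start + size:
            elem.tail = None  # drop whitespace that follows the element
            selected.append(ET.tostring(elem, encoding="unicode"))
        seen += 1
        elem.clear()  # release the element's content once it is handled
        if seen >= start + size:
            break  # this worker never reads past its own partition
    return selected


def run_parallel(xml_bytes, partition_tag, partition_info):
    """Each worker receives (start, size) and builds its own partition."""
    with ThreadPoolExecutor(max_workers=len(partition_info)) as pool:
        futures = [
            pool.submit(generate_partition, xml_bytes, partition_tag, s, n)
            for s, n in partition_info
        ]
        return [f.result() for f in futures]


partitions = run_parallel(SAMPLE_XML, "order", [(0, 3), (3, 2)])
```

Each worker scans the same document on its own and keeps only the nodes in its assigned range, so no worker depends on another worker's output.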
Abstract
Techniques for running an Extract Transform Load (ETL) job in parallel on one or more processors wherein the ETL job comprises use of an extensible markup language (XML) document are provided. The techniques include receiving an XML document input, identifying a node in the XML document at which partitioning of the XML document is to begin, sending partition information to each respective processor, performing a shallow parsing of the XML document in parallel on the one or more processors, wherein each processor performs shallow parsing using the identified partition node until it reaches its identified partition, using the shallow parsing to generate the partition of the input XML document, wherein each processor generates a different partition of the same XML document, and sending each partition in streaming format to an ETL job instance.
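As an illustration of the partition information sent to each processor, the sketch below builds one descriptor per processor. It rests on assumptions the text does not state: the number of partition nodes is known in advance, and the nodes are split as evenly as possible. The names `PartitionInfo` and `plan_partitions` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionInfo:
    processor: int       # index of the processor that builds this partition
    partition_node: str  # node at which partitioning begins
    start: int           # 0-based index of the first partition node
    size: int            # number of partition nodes in this partition


def plan_partitions(partition_node, total_nodes, processors):
    """Split total_nodes partition nodes across processors as evenly as
    possible; earlier processors absorb any remainder."""
    base, extra = divmod(total_nodes, processors)
    plan = []
    start = 0
    for p in range(processors):
        size = base + (1 if p < extra else 0)
        plan.append(PartitionInfo(p, partition_node, start, size))
        start += size
    return plan


plan = plan_partitions("order", total_nodes=5, processors=2)
```

With these descriptors, each processor knows where its partition begins and how many nodes it covers without consulting any other processor.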
17 Claims
7. The method of claim 1, further comprising, if an XML schema file contains at least one of an all XML schema indicator, sequence XML schema indicator, and choice XML schema indicator defined on the node on which partitioning is to be done:
sending each node taking part in the one or more indicators but not in a repetition element path, only to a first node; and
sending each remaining node a first value of one or more output nodes that are not in the repetition path, and a schema file that does not have the XML schema indicators.
16. A computer program product comprising a tangible non-transitory computer readable recordable storage medium including computer useable program code for running an Extract Transform Load (ETL) job in parallel on multiple processors wherein the ETL job comprises use of an extensible markup language (XML) document, the computer program product including:
computer useable program code for receiving an ETL job definition and an input XML document;
computer useable program code for identifying a node in the input XML document at which partitioning of the input XML document is to begin based on the ETL job definition and the XML document;
computer useable program code for identifying a size of each partition to be created within the input XML document, wherein each partition is created on a different processor;
computer useable program code for sending partition information to each respective processor, wherein the partition information includes the node in the input XML document at which partitioning of the input XML document is to begin and the size parameter for each partition;
computer useable program code for performing a shallow parsing of the input XML document in parallel on the multiple processors, wherein each processor performs shallow parsing using the identified partition node until it reaches its identified partition, and wherein said performing the shallow parsing comprises:
partitioning the input XML document in accordance with the partition information; and
performing a schema validation for each of one or more partition nodes corresponding to the partitioned input XML document;
computer useable program code for using the shallow parsing to generate the partition of the input XML document on each processor, wherein each processor independently generates a different partition of the input XML document without dependence on any part of the input XML document that is not part of the partition associated with that processor;
computer useable program code for sending each partition in streaming format to an ETL job instance; and
computer useable program code for running each ETL job instance, wherein said running each ETL job instance comprises shredding the XML document in parallel on multiple nodes, and wherein said shredding comprises using horizontal partitioning to relationalize different partitions of the input XML document in parallel on different nodes.
17. A system for running an Extract Transform Load (ETL) job in parallel on multiple processors wherein the ETL job comprises use of an extensible markup language (XML) document, comprising:
a memory; and
at least one processor coupled to the memory and operative to:
receive an ETL job definition and an input XML document;
identify a node in the input XML document at which partitioning of the input XML document is to begin based on the ETL job definition and the XML document;
identify a size of each partition to be created within the input XML document, wherein each partition is created on a different processor;
send partition information to each respective processor, wherein the partition information includes the node in the input XML document at which partitioning of the input XML document is to begin and the size parameter for each partition;
perform a shallow parsing of the input XML document in parallel on the multiple processors, wherein each processor performs shallow parsing using the identified partition node until it reaches its identified partition, and wherein said performing the shallow parsing comprises:
partitioning the input XML document in accordance with the partition information; and
performing a schema validation for each of one or more partition nodes corresponding to the partitioned input XML document;
use the shallow parsing to generate the partition of the input XML document on each processor, wherein each processor independently generates a different partition of the input XML document without dependence on any part of the input XML document that is not part of the partition associated with that processor;
send each partition in streaming format to an ETL job instance; and
run each ETL job instance, wherein said running each ETL job instance comprises shredding the XML document in parallel on multiple nodes, and wherein said shredding comprises using horizontal partitioning to relationalize different partitions of the input XML document in parallel on different nodes.
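The shredding step, in which each ETL job instance relationalizes its own partition, might look roughly like the following sketch. It assumes a hypothetical `order`/`item` document shape and a single target table of `(order_id, sku, qty)` rows; threads stand in for the separate nodes, and the names `shred_partition` and `shred_in_parallel` are illustrative only.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def shred_partition(fragments):
    """Turn one partition (a list of serialized <order> fragments) into
    relational rows of the form (order_id, sku, qty)."""
    rows = []
    for fragment in fragments:
        order = ET.fromstring(fragment)
        for item in order.findall("item"):
            rows.append((order.get("id"), item.get("sku"), int(item.get("qty"))))
    return rows


def shred_in_parallel(partitions):
    """Shred every partition independently, then concatenate the row sets.
    Each partition contributes a disjoint horizontal slice of one table."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        per_partition = list(pool.map(shred_partition, partitions))
    return [row for rows in per_partition for row in rows]


# Hypothetical partitions, as two different processors might have produced them.
partitions = [
    ['<order id="1"><item sku="A" qty="2"/><item sku="B" qty="1"/></order>'],
    ['<order id="2"><item sku="C" qty="5"/></order>'],
]
rows = shred_in_parallel(partitions)
```

Because every partition yields rows with the same columns, the per-partition results can simply be appended to the same target table.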
Specification