Parallel Processing of ETL Jobs Involving Extensible Markup Language Documents
Abstract
Techniques for running an Extract Transform Load (ETL) job in parallel on one or more processors wherein the ETL job comprises use of an extensible markup language (XML) document are provided. The techniques include receiving an XML document input, identifying a node in the XML document at which partitioning of the XML document is to begin, sending partition information to each respective processor, performing a shallow parsing of the XML document in parallel on the one or more processors, wherein each processor performs shallow parsing using the identified partition node until it reaches its identified partition, using the shallow parsing to generate the partition of the input XML document, wherein each processor generates a different partition of the same XML document, and sending each partition in streaming format to an ETL job instance.
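The per-processor step described above (scan the document for the identified partition node, skip ahead to the assigned partition, then emit it) can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name generate_partition is hypothetical, the partition information is assumed to be a half-open range (start, stop) of occurrence indices of the partition node, and the standard library's incremental iterparse stands in for the shallow parser.

```python
import io
import xml.etree.ElementTree as ET


def generate_partition(xml_bytes, partition_tag, start, stop):
    """Yield occurrences start..stop-1 of partition_tag as XML text.

    Assumption: partition information is a half-open range of occurrence
    indices of the partition node; the source does not specify its form.
    Nested occurrences of partition_tag are not handled in this sketch.
    """
    seen = 0
    # Incremental parse: elements are handed over one at a time as they close.
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag != partition_tag:
            continue
        if seen >= stop:
            break  # past this worker's partition, stop reading
        if seen >= start:
            elem.tail = None  # drop text that follows the element
            yield ET.tostring(elem, encoding="unicode")
        seen += 1
        elem.clear()  # discard content already scanned to keep memory flat


if __name__ == "__main__":
    sample = b"<orders><order id='1'/><order id='2'/><order id='3'/></orders>"
    for record in generate_partition(sample, "order", 1, 3):
        print(record)
```

Because the function is a generator, a caller receives each element of the partition as soon as it is parsed, which loosely mirrors the "streaming format" hand-off in the abstract. Two workers given the ranges (0, 2) and (2, 4) over the same document would produce different, non-overlapping partitions.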
20 Claims
1. A method for running an Extract Transform Load (ETL) job in parallel on one or more processors wherein the ETL job comprises use of an extensible markup language (XML) document, wherein the method comprises:
receiving an XML document input;
identifying a node in the XML document at which partitioning of the XML document is to begin;
sending partition information to each respective processor;
performing a shallow parsing of the XML document in parallel on the one or more processors, wherein each processor performs shallow parsing using the identified partition node until it reaches its identified partition;
using the shallow parsing to generate the partition of the input XML document, wherein each processor generates a different partition of the same XML document; and
sending each partition in streaming format to an ETL job instance.
Dependent claims: 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17
8. The method of claim 1, further comprising, if an XML schema file contains at least one of an all XML schema indicator, sequence XML schema indicator, and choice XML schema indicator defined on the node on which partitioning is to be done:
sending each node taking part in the one or more indicators but not in a repetition element path, only to a first node; and
sending each remaining node a first value of one or more output nodes that are not in the repetition path, and a schema file that does not have the XML schema indicators.
18. A computer program product comprising a tangible computer readable recordable storage medium including computer useable program code for running an Extract Transform Load (ETL) job in parallel on one or more processors wherein the ETL job comprises use of an extensible markup language (XML) document, the computer program product including:
computer useable program code for receiving an XML document input;
computer useable program code for identifying a node in the XML document at which partitioning of the XML document is to begin;
computer useable program code for sending partition information to each respective processor;
computer useable program code for performing a shallow parsing of the XML document in parallel on the one or more processors, wherein each processor performs shallow parsing using the identified partition node until it reaches its identified partition;
computer useable program code for using the shallow parsing to generate the partition of the input XML document, wherein each processor generates a different partition of the same XML document; and
computer useable program code for sending each partition in streaming format to an ETL job instance.
Dependent claims: 19
20. A system for running an Extract Transform Load (ETL) job in parallel on one or more processors wherein the ETL job comprises use of an extensible markup language (XML) document, comprising:
a memory; and
at least one processor coupled to the memory and operative to:
receive an XML document input;
identify a node in the XML document at which partitioning of the XML document is to begin;
send partition information to each respective processor;
perform a shallow parsing of the XML document in parallel on the one or more processors, wherein each processor performs shallow parsing using the identified partition node until it reaches its identified partition;
use the shallow parsing to generate the partition of the input XML document, wherein each processor generates a different partition of the same XML document; and
send each partition in streaming format to an ETL job instance.
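The system-level flow in the claims (send partition information to each processor, let each one scan the same document in parallel, and stream its partition to an ETL job instance) might be wired together as in the following sketch. Everything named here is an assumption for illustration: run_parallel, partition_stream and etl_job_instance are hypothetical, the partition information is assumed to be a list of (start, stop) occurrence ranges supplied by the caller, and a thread pool stands in for the "one or more processors".

```python
import io
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def partition_stream(xml_bytes, tag, start, stop):
    """Scan for `tag` and yield occurrences start..stop-1 as XML text."""
    seen = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag != tag:
            continue
        if seen >= stop:
            break
        if seen >= start:
            elem.tail = None
            yield ET.tostring(elem, encoding="unicode")
        seen += 1
        elem.clear()


def etl_job_instance(records):
    """Hypothetical ETL instance: extract the id attribute, convert it to
    an int, and load it into a list. It consumes records as they arrive."""
    return [int(ET.fromstring(record).get("id")) for record in records]


def run_parallel(xml_bytes, tag, ranges):
    """Run one worker per (start, stop) range over the same document.

    Assumption: threads stand in for separate processors, and the ranges
    are the partition information sent to each worker.
    """
    def worker(bounds):
        start, stop = bounds
        return etl_job_instance(partition_stream(xml_bytes, tag, start, stop))

    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        # map() returns results in the same order as `ranges`.
        return list(pool.map(worker, ranges))


if __name__ == "__main__":
    sample = b"<orders><order id='1'/><order id='2'/><order id='3'/></orders>"
    print(run_parallel(sample, "order", [(0, 2), (2, 3)]))
```

Each worker reads the whole input independently and keeps only its own range, so no worker waits on another to finish splitting the document. A real deployment would more likely use separate processes or machines and a network stream rather than an in-memory generator.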
Specification