Parallel processing of ETL jobs involving extensible markup language documents

  • US 9,064,047 B2
  • Filed: 09/24/2009
  • Issued: 06/23/2015
  • Est. Priority Date: 09/24/2009
  • Status: Expired due to Fees
First Claim

1. A method for running an Extract Transform Load (ETL) job in parallel on multiple processors wherein the ETL job comprises use of an extensible markup language (XML) document, wherein the method comprises:

  • receiving an ETL job definition and an input XML document;

    identifying a node in the input XML document at which partitioning of the input XML document is to begin based on the ETL job definition and the XML document;

    identifying a size of each partition to be created within the input XML document, wherein each partition is created on a different processor;

    sending partition information to each respective processor, wherein the partition information includes the node in the input XML document at which partitioning of the input XML document is to begin and the size parameter for each partition;

    performing a shallow parsing of the input XML document in parallel on the multiple processors, wherein each processor performs shallow parsing using the identified partition node until it reaches its identified partition, and wherein said performing the shallow parsing comprises:

    partitioning the input XML document in accordance with the partition information; and

    performing a schema validation for each of one or more partition nodes corresponding to the partitioned input XML document;

    using the shallow parsing to generate the partition of the input XML document on each processor, wherein each processor independently generates a different partition of the input XML document without dependence on any part of the input XML document that is not part of the partition associated with that processor;

    sending each partition in streaming format to an ETL job instance; and

    running each ETL job instance, wherein said running each ETL job instance comprises shredding the XML document in parallel on multiple nodes, and wherein said shredding comprises using horizontal partitioning to relationalize different partitions of the input XML document in parallel on different nodes.
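
The steps recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation: the function names (`generate_partition`, `shred`, `run_job`), the sample `<orders>` document, and the two-column row layout are all hypothetical. The sketch assumes the partition node is a repeating, non-nested element, uses threads to stand in for separate processors, uses the standard library's `iterparse` as a stand-in for a true shallow parser, and omits the schema validation step because the Python standard library has no XSD validator.

```python
import io
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input document; <order> plays the role of the partition node.
DOC = b"""<orders>
  <order id="1"><item>pen</item></order>
  <order id="2"><item>ink</item></order>
  <order id="3"><item>pad</item></order>
  <order id="4"><item>cap</item></order>
  <order id="5"><item>nib</item></order>
</orders>"""


def generate_partition(xml_bytes, node_tag, start, size):
    """Scan the document, skip `start` partition nodes, keep the next `size`.

    Each worker scans independently and keeps only its own slice, so it
    never depends on another worker's output.
    """
    fragments = []
    seen = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag != node_tag:
            continue
        if seen >= start + size:
            break  # past this worker's partition; stop scanning
        if seen >= start:
            fragments.append(ET.tostring(elem))  # serialized fragment to stream on
        seen += 1
        elem.clear()  # release the subtree once it has been handled
    return fragments


def shred(fragments):
    """Relationalize one partition: one (id, item) row per partition node."""
    rows = []
    for fragment in fragments:
        order = ET.fromstring(fragment)
        rows.append((order.get("id"), order.findtext("item")))
    return rows


def run_job(xml_bytes, node_tag, size, workers):
    """Send (node, offset, size) to each worker and collect its rows."""
    def worker(index):
        partition = generate_partition(xml_bytes, node_tag, index * size, size)
        return shred(partition)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, range(workers)))


if __name__ == "__main__":
    for rows in run_job(DOC, "order", size=2, workers=3):
        print(rows)
```

With five `<order>` elements, a partition size of 2, and three workers, the workers produce row sets of two, two, and one rows respectively. This corresponds to horizontal partitioning: every worker emits rows with the same columns, covering a different range of records.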
