PARALLEL PROCESSING FOR ETL PROCESSES
First Claim
1. A computer-implemented method for parallel processing of data from a plurality of data sources in conjunction with an Extract-Transform-Load (ETL) process, the data being part of a related data set, comprising:
staging a unit of extracted data from each of the plurality of data sources, thereby generating a plurality of units of staged data;
identifying a plurality of tasks for transforming the staged data;
assigning a subset of the tasks to each of a plurality of child processes being managed by a master process, such that dependent tasks are assigned to a same child process;
concurrently executing the subsets of tasks assigned to the child processes, thereby generating a plurality of units of transformed data from the plurality of units of staged data; and
publishing the transformed data to at least one data store after all of the plurality of tasks are completed, thereby ensuring that the published data represent the related data set.
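The assigning step can be illustrated with a minimal sketch. It assumes tasks are plain names, dependencies are given as (task, prerequisite) pairs, and the task list is already ordered so that prerequisites come before their dependents. The function names `group_dependent_tasks` and `assign_to_children`, the union-find grouping, and the lightest-child balancing rule are all illustrative assumptions; the claim does not prescribe any of them.

```python
from collections import defaultdict


def group_dependent_tasks(tasks, dependencies):
    """Partition a list of tasks so that dependent tasks share a group.

    tasks: list of hashable task names (hypothetical representation).
    dependencies: iterable of (task, prerequisite) pairs.
    """
    parent = {t: t for t in tasks}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for task, prereq in dependencies:
        root_a, root_b = find(task), find(prereq)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for t in tasks:
        # Appending in input order preserves the caller's task ordering.
        groups[find(t)].append(t)
    return list(groups.values())


def assign_to_children(groups, num_children):
    """Build one task subset per child, never splitting a group."""
    subsets = [[] for _ in range(num_children)]
    # Assumed balancing rule: largest group first, to the lightest child.
    for group in sorted(groups, key=len, reverse=True):
        min(subsets, key=len).extend(group)
    return subsets
```

Calling `assign_to_children(group_dependent_tasks(tasks, deps), n)` yields `n` subsets in which every task and its prerequisites land on the same child, so no child has to wait on another child's output.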
Abstract
A technique for parallel processing of data from a plurality of data sources in conjunction with an Extract-Transform-Load (ETL) process, the data being part of a related data set, which comprises the following: staging a unit of extracted data from each of the plurality of data sources, thereby generating a plurality of units of staged data; identifying a plurality of tasks relating to transforming the staged data; assigning a subset of the tasks to each of a plurality of child processes being managed by a master process, such that dependent tasks are assigned to a same child process; concurrently executing the subsets of tasks assigned to the child processes, thereby generating a plurality of units of transformed data from the plurality of units of staged data; and publishing the transformed data after all tasks are completely executed, thereby ensuring that the published data represent the related data set.
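The execute-then-publish ordering described in the abstract can be sketched as follows. This is a sketch under stated assumptions, not the patented implementation: Python's `concurrent.futures.ProcessPoolExecutor` stands in for the master process and its child processes, and `run_subset`, `run_etl`, `transform`, `publish`, and `executor_cls` are hypothetical names. With a process pool, `transform` must be a picklable, module-level function.

```python
from concurrent.futures import ProcessPoolExecutor


def run_subset(transform, subset):
    """Child-side work: transform one subset of tasks, in order."""
    return [transform(task) for task in subset]


def run_etl(subsets, transform, publish, executor_cls=ProcessPoolExecutor):
    """Master-side coordination: run all subsets, then publish once.

    executor_cls is an assumed hook so the same coordination can be
    exercised with another executor, such as a thread pool.
    """
    transformed = []
    with executor_cls(max_workers=max(1, len(subsets))) as pool:
        futures = [pool.submit(run_subset, transform, s) for s in subsets]
        for future in futures:
            # result() re-raises any failure from a child, so the
            # publish() call below is skipped unless every task completed.
            transformed.extend(future.result())
    # Reached only after all subsets finished successfully.
    publish(transformed)
    return transformed
```

Because `publish` is called exactly once, after every future has returned, a failure in any child leaves the data store untouched, which is one way to keep partially transformed data from being published.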
20 Claims
1. A computer-implemented method for parallel processing of data from a plurality of data sources in conjunction with an Extract-Transform-Load (ETL) process, the data being part of a related data set, comprising:
staging a unit of extracted data from each of the plurality of data sources, thereby generating a plurality of units of staged data;
identifying a plurality of tasks for transforming the staged data;
assigning a subset of the tasks to each of a plurality of child processes being managed by a master process, such that dependent tasks are assigned to a same child process;
concurrently executing the subsets of tasks assigned to the child processes, thereby generating a plurality of units of transformed data from the plurality of units of staged data; and
publishing the transformed data to at least one data store after all of the plurality of tasks are completed, thereby ensuring that the published data represent the related data set.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system for parallel processing of data in conjunction with an ETL process, comprising:
a plurality of data sources operable to generate the data, the data being part of a related data set;
at least one data store operable to store published data generated by the ETL process; and
at least one computing device configured to:
stage a unit of extracted data from each of the plurality of data sources, thereby generating a plurality of units of staged data;
identify a plurality of tasks for transforming the staged data;
assign a subset of the tasks to each of a plurality of child processes being managed by a master process, such that dependent tasks are assigned to a same child process;
concurrently execute the subsets of tasks assigned to the child processes, thereby generating a plurality of units of transformed data from the plurality of units of staged data; and
publish the transformed data to the at least one data store after all of the plurality of tasks are completed, thereby ensuring that the published data represent the related data set.
Dependent claims: 9, 10, 11, 12
13. A computer program product for parallel processing of data from a plurality of data sources in conjunction with an Extract-Transform-Load (ETL) process, the data being part of a related data set, the computer program product comprising a computer-readable medium having a plurality of computer program instructions stored therein, which are operable to cause at least one computer device to:
stage a unit of extracted data from each of the plurality of data sources, thereby generating a plurality of units of staged data;
identify a plurality of tasks for transforming the staged data;
assign a subset of the tasks to each of a plurality of child processes being managed by a master process, such that dependent tasks are assigned to a same child process;
concurrently execute the subsets of tasks assigned to the child processes, thereby generating a plurality of units of transformed data from the plurality of units of staged data; and
publish the transformed data to at least one data store after all of the plurality of tasks are completed, thereby ensuring that the published data represent the related data set.
Dependent claims: 14, 15, 16, 17, 18, 19, 20
Specification