Performing data analytics utilizing a user configurable group of reusable modules
First Claim
1. A system for performing analytics on a large quantity of data accommodated by an external mass storage device comprising:
- a computer system including at least one processor configured to:
divide the analytics into a plurality of analytic modules, wherein each of the analytic modules is selectively executed and comprises a script for a parallel processing engine to perform a corresponding atomic operation of the analytics, the plurality of analytic modules including one or more pre-processing modules, one or more statistical analytic modules and one or more post-processing modules;
receive an input from a user, the input including a user selection of one or more of the plurality of analytic modules to perform desired analytics on the large quantity of data from the external mass storage device;
responsive to the receiving the input including the user selection, automatically generate a master script designating the one or more of the plurality of analytic modules that are to be present in a module chain and an order of performing the designated one or more of the plurality analytic modules in the module chain, one or more pre-processing modules of the one or more of the plurality of analytic modules to be executed before one or more statistical analytic modules of the one or more of the plurality of analytic modules, and the one or more statistical analytic modules of the one or more of the plurality of analytic modules to be executed before one or more post-processing modules of the one or more of the plurality of analytic modules;
execute pre-processing scripts associated with the one or more pre-processing modules of the one or more of the plurality of analytic modules in the module chain to produce one or more partial solutions, the one or more pre-processing modules of the one or more of the plurality of analytic modules preparing and cleaning raw data to produce the one or more partial solutions to be provided to the one or more statistical analytic modules in the module chain;
accept one of the one or more partial solutions and automatically break down scripts associated with the one or more statistical modules of the one or more of the plurality of analytic modules in the module chain into map/reduce jobs and optimize execution of the map/reduce jobs;
execute the map/reduce jobs;
and automatically execute alternative statistical modules, based on scoring results of the one or more post-processing modules of the one or more of the plurality of analytic modules, the automatically executing reusing, as input, a partial solution of the one or more partial solutions produced by completing execution of at least one of the one or more pre-processing modules to avoid re-execution of the at least one of the one or more pre-processing modules.
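The final limitation of the claim (reusing a pre-processing partial solution when alternative statistical modules are tried) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `clean`, `mean_model`, `median_model`, `score` and `run_with_alternatives`, the mean-absolute-error scoring rule and the threshold are hypothetical, and plain Python functions stand in for the scripts that the claim runs on a parallel processing engine as map/reduce jobs.

```python
from typing import Callable, List, Optional, Tuple

# Counts how often the (hypothetical) pre-processing module actually runs.
calls = {"clean": 0}

def clean(raw: List[Optional[float]]) -> List[float]:
    """Pre-processing module: drop missing values to form a partial solution."""
    calls["clean"] += 1
    return [x for x in raw if x is not None]

def mean_model(data: List[float]) -> float:
    """First statistical module: arithmetic mean."""
    return sum(data) / len(data)

def median_model(data: List[float]) -> float:
    """Alternative statistical module: median."""
    s = sorted(data)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def score(data: List[float], estimate: float) -> float:
    """Post-processing module: mean absolute error (lower is better)."""
    return sum(abs(x - estimate) for x in data) / len(data)

def run_with_alternatives(
    raw: List[Optional[float]],
    candidates: List[Tuple[str, Callable[[List[float]], float]]],
    threshold: float,
) -> List[Tuple[str, float, float]]:
    partial = clean(raw)  # pre-processing executes once
    results = []
    for name, model in candidates:
        estimate = model(partial)  # every candidate reuses the same partial solution
        s = score(partial, estimate)
        results.append((name, estimate, s))
        if s <= threshold:  # score is acceptable: stop trying alternatives
            break
    return results

results = run_with_alternatives(
    [1.0, None, 2.0, 3.0, 100.0],
    [("mean", mean_model), ("median", median_model)],
    threshold=30.0,
)
```

In this sketch the mean scores above the threshold, so the median is tried next against the same cleaned data, and `calls["clean"]` stays at 1.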
Abstract
According to one embodiment of the present invention, a computer-implemented method of performing analytics on a large quantity of data accommodated by an external mass storage device is provided. The analytics may be divided into a set of modules, wherein each module is selectively executed and comprises a script for a parallel processing engine to perform a corresponding atomic operation on the analytics. A user selection is received of one or more modules to perform desired analytics on the large quantity of data from the external mass storage device, and the selected modules execute scripts for the parallel processing engine to perform the corresponding atomic operations of the desired analytics.
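As a rough illustration of how a user selection of modules could be turned into an ordered module chain, the sketch below sorts the selected modules by phase (pre-processing, then statistical, then post-processing) and runs them in sequence. The names `Module`, `REGISTRY`, `build_master_script` and `run_chain`, and the three toy modules, are hypothetical; in the described system each module is a script for a parallel processing engine, not an in-process Python function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Phase order taken from the claims: pre-processing, statistical, post-processing.
PHASE_ORDER = {"pre": 0, "stat": 1, "post": 2}

@dataclass(frozen=True)
class Module:
    name: str
    phase: str                   # one of "pre", "stat", "post"
    run: Callable[[list], list]  # stand-in for the module's script

# Hypothetical module library; each entry performs one atomic operation.
REGISTRY: Dict[str, Module] = {
    "drop_missing": Module("drop_missing", "pre",
                           lambda rows: [r for r in rows if r is not None]),
    "mean": Module("mean", "stat", lambda rows: [sum(rows) / len(rows)]),
    "round": Module("round", "post", lambda rows: [round(r, 1) for r in rows]),
}

def build_master_script(selection: List[str]) -> List[Module]:
    """Order the user's selection by phase; sorted() is stable within a phase."""
    chosen = [REGISTRY[name] for name in selection]
    return sorted(chosen, key=lambda m: PHASE_ORDER[m.phase])

def run_chain(chain: List[Module], data: list) -> list:
    """Feed each module's output to the next module in the chain."""
    for module in chain:
        data = module.run(data)
    return data

# The user may select modules in any order; the chain is reordered by phase.
chain = build_master_script(["round", "mean", "drop_missing"])
result = run_chain(chain, [1.0, None, 2.0, 4.0])
```

Keeping each module to a single atomic operation is what lets the same module be reused in different chains.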
9 Claims
1. A system for performing analytics on a large quantity of data accommodated by an external mass storage device comprising:
- a computer system including at least one processor configured to:
divide the analytics into a plurality of analytic modules, wherein each of the analytic modules is selectively executed and comprises a script for a parallel processing engine to perform a corresponding atomic operation of the analytics, the plurality of analytic modules including one or more pre-processing modules, one or more statistical analytic modules and one or more post-processing modules;
receive an input from a user, the input including a user selection of one or more of the plurality of analytic modules to perform desired analytics on the large quantity of data from the external mass storage device;
responsive to the receiving the input including the user selection, automatically generate a master script designating the one or more of the plurality of analytic modules that are to be present in a module chain and an order of performing the designated one or more of the plurality analytic modules in the module chain, one or more pre-processing modules of the one or more of the plurality of analytic modules to be executed before one or more statistical analytic modules of the one or more of the plurality of analytic modules, and the one or more statistical analytic modules of the one or more of the plurality of analytic modules to be executed before one or more post-processing modules of the one or more of the plurality of analytic modules;
execute pre-processing scripts associated with the one or more pre-processing modules of the one or more of the plurality of analytic modules in the module chain to produce one or more partial solutions, the one or more pre-processing modules of the one or more of the plurality of analytic modules preparing and cleaning raw data to produce the one or more partial solutions to be provided to the one or more statistical analytic modules in the module chain;
accept one of the one or more partial solutions and automatically break down scripts associated with the one or more statistical modules of the one or more of the plurality of analytic modules in the module chain into map/reduce jobs and optimize execution of the map/reduce jobs;
execute the map/reduce jobs;
and automatically execute alternative statistical modules, based on scoring results of the one or more post-processing modules of the one or more of the plurality of analytic modules, the automatically executing reusing, as input, a partial solution of the one or more partial solutions produced by completing execution of at least one of the one or more pre-processing modules to avoid re-execution of the at least one of the one or more pre-processing modules.
Dependent claims: 2, 3, 4, 5
6. A computer program product for performing analytics on a large quantity of data accommodated by an external mass storage device comprising:
- a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising computer readable program code configured to:
divide the analytics into a plurality of analytic modules, wherein each of the analytic modules is selectively executed and comprises a script for a parallel processing engine to perform a corresponding atomic operation of the analytics, the plurality of analytic modules including one or more pre-processing modules, one or more statistical modules and one or more post-processing modules;
receive an input from a user, the input including a user selection of one or more of the plurality of analytic modules to perform desired analytics on the large quantity of data from the external mass storage device;
responsive to the receiving the input including the user selection, automatically generate a master script designating the one or more of the plurality of analytic modules that are to be present in a module chain and an order of performing the one or more designated analytic modules in the module chain, one or more pre-processing modules of the designated one or more of the plurality of analytic modules to be executed before one or more statistical analytic modules of the one or more of the plurality of analytic modules, and the one or more statistical analytic modules of the one or more of the plurality of analytic modules to be executed before one or more post-processing modules of the one or more of the plurality of analytic modules;
execute pre-processing scripts associated with the one or more pre-processing modules of the one or more of the plurality of analytic modules in the module chain to produce one or more partial solutions, the one or more pre-processing modules of the one or more of the plurality of analytic modules preparing and cleaning raw data to produce the one or more partial solutions to be provided to the one or more statistical analytic modules in the module chain;
accept one of the one or more partial solutions and automatically break down scripts associated with the one or more statistical modules of the one or more of the plurality of analytic modules in the module chain into map/reduce jobs and optimize execution of the map/reduce jobs;
execute the map/reduce jobs;
and automatically execute alternative statistical modules, based on scoring results of the one or more post-processing modules of the one or more of the plurality of analytic modules, the automatically executing reusing, as input, one of the one or more partial solutions produced by completing execution of at least one of the one or more pre-processing modules to avoid re-execution of the at least one of the one or more pre-processing modules.
Dependent claims: 7, 8, 9
Specification