Multi-input SQL-MR
First Claim
1. A system comprising:
an array of storage devices configured to store data;
an array of processing nodes in communication with the array of storage devices, the array of processing nodes configured to:
receive a request to perform at least one task associated with the data, wherein the request includes a function call to a function configured to operate on a first data table and a second data table included in the data;
partition rows of the first data table into a plurality of row partitions among respective subsets of the processing nodes based on a partition key common to both the first data table and the second data table, wherein at least one row partition of the first data table comprises a plurality of rows;
for each partition key, generate a relation, wherein each relation is a data structure configured to include a plurality of columns in a single row, wherein each column of the relation is configured to maintain multiple column values from at least one row of a different data table;
for each relation, insert a plurality of row values from at least one row of the first data table from one of the row partitions into a single column of a row of the relation according to the partition key;
distribute row values from at least one row of the second data table into another single column of at least one relation according to the partition key, wherein at least one row value from at least one column of the second data table is distributed into a column of a row of a relation and, wherein the column of the row of the relation is in a row of the relation having at least one row value from the first data table in a different column of the relation; and
execute the function on each relation to generate at least one output data object.
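The relation described in the claim can be pictured with a small sketch. This is an illustration only, not the patented implementation: the table contents, the column names `first` and `second`, and the helpers `build_relations`, `run_function`, and `total_by_name` are all hypothetical. Each relation is modeled as a single row (a dict) whose two columns each hold the rows contributed by one table for a given partition key.

```python
from collections import defaultdict


def build_relations(first_table, second_table, key):
    """Group rows of both tables by `key` into one relation per key value.

    Each relation is a single row (a dict) with two columns, "first" and
    "second"; each column holds the list of rows taken from one table.
    Assumption: a key found only in the second table still gets a relation,
    with an empty "first" column.
    """
    relations = defaultdict(lambda: {"first": [], "second": []})
    for row in first_table:
        relations[row[key]]["first"].append(row)
    for row in second_table:
        relations[row[key]]["second"].append(row)
    return dict(relations)


def run_function(relations, func):
    """Execute `func` once per relation and collect one output per key."""
    return {k: func(rel["first"], rel["second"]) for k, rel in relations.items()}


# Hypothetical example data: two tables sharing the partition key "cust".
orders = [
    {"cust": 1, "amount": 10},
    {"cust": 1, "amount": 5},
    {"cust": 2, "amount": 7},
]
customers = [
    {"cust": 1, "name": "Ann"},
    {"cust": 2, "name": "Bob"},
]


def total_by_name(order_rows, customer_rows):
    """Hypothetical two-input function: total order amount per customer."""
    name = customer_rows[0]["name"] if customer_rows else None
    return {"name": name, "total": sum(r["amount"] for r in order_rows)}


relations = build_relations(orders, customers, "cust")
results = run_function(relations, total_by_name)
```

In this sketch the relation for key 1 carries two order rows in one column and one customer row in the other, so the function sees both inputs for a key side by side.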
Abstract
A system may include an array of storage devices configured to store data. The system may further include an array of processing nodes in communication with the array of storage devices. The array of processing nodes may receive a request to perform at least one task associated with the data. The request may include a function call to a function configured to operate on a first data table and a second data table included in the data. The array of processing nodes may partition the first data table among respective subsets of the processing nodes based on a partition key. The array of processing nodes may distribute the second data table among the partitions based on the partition key. The array of processing nodes may execute the function on the first data table and the second data table at each of the partitions. A method and computer-readable medium may also be implemented.
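As a rough sketch of the flow in the abstract, the snippet below routes rows of both tables to nodes by a stable hash of the partition key, so rows sharing a key land on the same node before the function runs there. Everything in it is an assumption made for illustration: nodes are modeled as plain integer indices, hashing with `zlib.crc32` is one possible placement rule rather than the one the patent uses, and `node_for`, `distribute`, `execute_on_nodes`, and `pages_with_country` are hypothetical names.

```python
import zlib
from collections import defaultdict


def node_for(key_value, num_nodes):
    """Map a partition-key value to a node index using a stable hash."""
    return zlib.crc32(repr(key_value).encode("utf-8")) % num_nodes


def distribute(table, key, num_nodes):
    """Return {node index: [rows]} for the rows of one table."""
    placement = defaultdict(list)
    for row in table:
        placement[node_for(row[key], num_nodes)].append(row)
    return dict(placement)


def execute_on_nodes(first_table, second_table, key, num_nodes, func):
    """Partition the first table, co-locate the second, run func per node."""
    first_parts = distribute(first_table, key, num_nodes)
    second_parts = distribute(second_table, key, num_nodes)
    outputs = {}
    for node in sorted(first_parts):
        outputs[node] = func(first_parts[node], second_parts.get(node, []))
    return outputs


# Hypothetical example data sharing the partition key "user".
clicks = [
    {"user": "a", "page": 1},
    {"user": "b", "page": 2},
    {"user": "a", "page": 3},
]
users = [
    {"user": "a", "country": "US"},
    {"user": "b", "country": "DE"},
]


def pages_with_country(click_rows, user_rows):
    """Hypothetical two-input function: tag each click with a country."""
    countries = {u["user"]: u["country"] for u in user_rows}
    return sorted((c["page"], countries.get(c["user"])) for c in click_rows)


outputs = execute_on_nodes(clicks, users, "user", 4, pages_with_country)
```

Because both tables use the same placement rule, every click finds its user row on the same node regardless of how many nodes are configured.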
17 Claims
1. A system comprising:
an array of storage devices configured to store data;
an array of processing nodes in communication with the array of storage devices, the array of processing nodes configured to:
receive a request to perform at least one task associated with the data, wherein the request includes a function call to a function configured to operate on a first data table and a second data table included in the data;
partition rows of the first data table into a plurality of row partitions among respective subsets of the processing nodes based on a partition key common to both the first data table and the second data table, wherein at least one row partition of the first data table comprises a plurality of rows;
for each partition key, generate a relation, wherein each relation is a data structure configured to include a plurality of columns in a single row, wherein each column of the relation is configured to maintain multiple column values from at least one row of a different data table;
for each relation, insert a plurality of row values from at least one row of the first data table from one of the row partitions into a single column of a row of the relation according to the partition key;
distribute row values from at least one row of the second data table into another single column of at least one relation according to the partition key, wherein at least one row value from at least one column of the second data table is distributed into a column of a row of a relation and, wherein the column of the row of the relation is in a row of the relation having at least one row value from the first data table in a different column of the relation; and
execute the function on each relation to generate at least one output data object.
Dependent claims: 2, 3, 4, 5, 6

7. A computer-implemented method executable by a plurality of processing nodes, the method comprising:
receiving a request to perform at least one task associated with data in a storage device, wherein the request includes a function call to a function configured to operate on a plurality of data tables included in the data;
retrieving the plurality of data tables from at least one storage device;
partitioning rows of a first one of the plurality of data tables into a plurality of row partitions among respective subsets of the processing nodes based on a partition key common to each data table, wherein at least one row partition of the first one of the plurality of data tables comprises a plurality of rows;
for each partition key, generating a relation, wherein each relation is a data structure configured to include a plurality of columns in a single row, wherein each column of the relation is configured to maintain multiple column values from at least one row of a different data table;
for each relation, inserting a plurality of row values from at least one row from a row partition of the first one of the plurality of data tables into a single column of the relation according to the partition key;
distribute row values from at least one row of the second data table into another single column of at least one relation according to the partition key, wherein at least one row value from at least one column of the second data table is distributed into a column of a row of a relation and, wherein the column of the row of the relation is in a row of the relation having at least one row value from the first data table in a different column of the relation;
distributing row values of other data tables into a respective column of at least one relation according to the partition key; and
executing the function on each partition to generate at least one output data object.
Dependent claims: 8, 9, 10, 11, 12

13. A non-transitory computer-readable medium encoded with a plurality of instructions executable by a processor, the plurality of instructions comprising:
instructions to receive a request to perform at least one task associated with data stored in at least one storage device, wherein the request includes a function call to a function configured to operate on a first data table and a second data table included in the data;
instructions to retrieve the first data table and the second data table from the at least one storage device;
instructions to partition rows of the first data table into a plurality of row partitions among respective subsets of the processing nodes based on a partition key common to both the first data table and the second data table, wherein at least one row partition of the first data table comprises a plurality of rows;
instructions to generate a relation for each partition key, wherein each relation is a data structure configured to include a plurality of columns in a single row, wherein each column of the relation is configured to maintain multiple column values from at least one row of a different data table;
instructions to insert, for each relation, a plurality of row values from at least one row of the first data table from one of the row partitions into a single column of a row of the relation according to the partition key;
instructions to distribute row values from at least one row of the second data table into another single column of at least one relation according to the partition key, wherein at least one row value from at least one column of the second data table is distributed into a column of a row of a relation and, wherein the column of the row of the relation is in a row of the relation having at least one row value from the first data table in a different column of the relation; and
instructions to execute the function on each relation to generate at least one output data object.
Dependent claims: 14, 15, 16, 17

Specification