Processing data from multiple sources
Abstract
In a first aspect, a method includes, at a node of a Hadoop cluster, the node storing a first portion of data in HDFS data storage, executing a first instance of a data processing engine capable of receiving data from a data source external to the Hadoop cluster, receiving a computer-executable program by the data processing engine, executing at least part of the program by the first instance of the data processing engine, receiving, by the data processing engine, a second portion of data from the external data source, storing the second portion of data other than in HDFS storage, and performing, by the data processing engine, a data processing operation identified by the program using at least the first portion of data and the second portion of data.
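The flow in the abstract can be pictured with a minimal Python sketch. Everything named here (`NodeEngine`, `receive_external`, `join_on_key`, the `customer_id` field) is hypothetical and not taken from the patent; plain lists stand in for the HDFS-resident portion and for the external source, and the join is just one possible data processing operation.

```python
from dataclasses import dataclass, field


@dataclass
class NodeEngine:
    """Hypothetical stand-in for one engine instance running on a cluster node."""

    # First portion: records already resident on this node (stands in for HDFS data).
    local_portion: list
    # Second portion: held in process memory, i.e. stored other than in HDFS.
    external_cache: list = field(default_factory=list)

    def receive_external(self, records):
        """Accept records pushed or pulled from a source outside the cluster."""
        self.external_cache.extend(records)

    def run(self, operation):
        """Apply an operation that uses both portions of data."""
        return operation(self.local_portion, self.external_cache)


def join_on_key(local, external, key="customer_id"):
    """Illustrative operation: inner join of the two portions on a shared key."""
    lookup = {row[key]: row for row in external}
    return [{**row, **lookup[row[key]]} for row in local if row[key] in lookup]


engine = NodeEngine([{"customer_id": 1, "amount": 10}, {"customer_id": 2, "amount": 5}])
engine.receive_external([{"customer_id": 1, "name": "Ada"}])
joined = engine.run(join_on_key)
```

Only the local record with a matching external record survives the join, which is the sense in which the operation uses both portions.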
41 Citations
43 Claims
1. A method including:
at a node of a Hadoop cluster, the node storing a first portion of data in HDFS data storage;
executing a first instance of a data processing engine capable of receiving data from a data source external to the Hadoop cluster;
receiving a dataflow graph by the data processing engine, the dataflow graph including a) at least one component representing the Hadoop cluster, b) at least one component representing the data source external to the Hadoop cluster, and c) at least one link that represents at least one dataflow associated with a data processing operation;
executing at least part of the dataflow graph by the first instance of the data processing engine;
receiving, by the data processing engine, a second portion of data from the external data source; and
performing, by the data processing engine, the data processing operation using at least the first portion of data and the second portion of data.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
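The dataflow graph recited in claim 1 has three kinds of element: a component for the Hadoop cluster, a component for the external source, and at least one link carrying a dataflow into an operation. A minimal sketch of such a structure follows; the class names, the `kind` labels, and the example component names are assumptions for illustration only, not the patent's representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    kind: str  # assumed labels: "hadoop_cluster", "external_source", "operation"


@dataclass(frozen=True)
class Link:
    source: str  # name of the upstream component
    target: str  # name of the downstream component


@dataclass
class DataflowGraph:
    components: list
    links: list

    def inputs_of(self, name):
        """Names of components whose output flows into the named component."""
        return [link.source for link in self.links if link.target == name]


def build_example_graph():
    """One cluster component and one external component feeding a join."""
    return DataflowGraph(
        components=[
            Component("hdfs_orders", "hadoop_cluster"),
            Component("crm_db", "external_source"),
            Component("join", "operation"),
        ],
        links=[Link("hdfs_orders", "join"), Link("crm_db", "join")],
    )


graph = build_example_graph()
join_inputs = graph.inputs_of("join")
```

An engine instance executing "at least part of" the graph could, under this sketch, walk `inputs_of` for the operation it owns and read from each upstream component in turn.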
12. A non-transitory computer-readable storage device including instructions for causing a node of a Hadoop cluster storing a first portion of data in HDFS data storage to carry out operations including:
executing a first instance of a data processing engine capable of receiving data from a data source external to the Hadoop cluster;
receiving a dataflow graph by the data processing engine, the dataflow graph including a) at least one component representing the Hadoop cluster, b) at least one component representing the data source external to the Hadoop cluster, and c) at least one link that represents at least one dataflow associated with a data processing operation;
executing at least part of the dataflow graph by the first instance of the data processing engine;
receiving, by the data processing engine, a second portion of data from the external data source; and
performing, by the data processing engine, the data processing operation using at least the first portion of data and the second portion of data.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
23. A node of a Hadoop cluster storing a first portion of data in HDFS storage and including a computer processing device configured to carry out operations including:
executing a first instance of a data processing engine capable of receiving data from a data source external to the Hadoop cluster;
receiving a dataflow graph by the data processing engine, the dataflow graph including a) at least one component representing the Hadoop cluster, b) at least one component representing the data source external to the Hadoop cluster, and c) at least one link that represents at least one dataflow associated with a data processing operation;
executing at least part of the dataflow graph by the first instance of the data processing engine;
receiving, by the data processing engine, a second portion of data from the external data source; and
performing, by the data processing engine, the data processing operation using at least the first portion of data and the second portion of data.
Dependent claims: 24, 25, 26, 27, 28, 29, 30, 31, 32, 33
34. A node of a Hadoop cluster storing a first portion of data in HDFS storage and including:
means for executing a first instance of a data processing engine capable of receiving data from a data source external to the Hadoop cluster;
means for receiving a dataflow graph by the data processing engine, the dataflow graph including a) at least one component representing the Hadoop cluster, b) at least one component representing the data source external to the Hadoop cluster, and c) at least one link that represents at least one dataflow associated with a data processing operation;
means for executing at least part of the dataflow graph by the first instance of the data processing engine;
means for receiving, by the data processing engine, a second portion of data from the external data source; and
means for performing, by the data processing engine, the data processing operation using at least the first portion of data and the second portion of data.
35. A method including:
at a node of a cluster of nodes, the node storing a first portion of data and configured to carry out one or more data processing operations in conjunction with the cluster of nodes, the cluster storing a collection of data across the nodes, the nodes being configured to operate on the collection of data in parallel, the collection of data being split into portions, each portion being operated on by a respective node of the cluster;
executing a first instance of a data processing engine capable of receiving data from a data source external to the cluster;
receiving a dataflow graph by the data processing engine, the dataflow graph including a) at least one component representing the Hadoop cluster, b) at least one component representing the data source external to the Hadoop cluster, and c) at least one link that represents at least one dataflow associated with a data processing operation;
executing at least part of the dataflow graph by the first instance of the data processing engine;
based on characteristics of the first portion of data, requesting, by the data processing engine, a second portion of data;
receiving, by the data processing engine, the second portion of data from the external data source;
storing the second portion of data in volatile memory of the node; and
performing, by the data processing engine, the data processing operation using at least the first portion of data and the second portion of data.
Dependent claims: 36, 37
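Claim 35 adds two steps: the request for the second portion is driven by characteristics of the first portion, and the received data is held in the node's volatile memory. One plausible reading, sketched below, is that the node collects the key values present in its local portion and asks the external source only for matching rows. The function names, the `customer_id` key, and the in-process fake source are all assumptions; the claim does not say which characteristics are used.

```python
def request_matching_rows(local_portion, fetch_by_keys, key="customer_id"):
    """Ask the external source only for rows whose key appears locally.

    Returns the keys requested and the rows received; the rows are kept in an
    ordinary in-memory list, standing in for the node's volatile memory.
    """
    # Characteristic of the first portion used here: its set of distinct keys.
    wanted = sorted({row[key] for row in local_portion})
    in_memory = list(fetch_by_keys(wanted))
    return wanted, in_memory


def make_fake_source(table, key="customer_id"):
    """Hypothetical external source backed by a Python list, for illustration."""

    def fetch_by_keys(keys):
        keyset = set(keys)
        return [row for row in table if row[key] in keyset]

    return fetch_by_keys


source = make_fake_source(
    [{"customer_id": 1, "name": "Ada"}, {"customer_id": 3, "name": "Lin"}]
)
local = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 1}]
wanted, second_portion = request_matching_rows(local, source)
```

Filtering at the source keeps the second portion small enough to hold in memory, since each node only needs rows relevant to its own share of the collection.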
38. A method including:
at a data processing engine of a node of a Hadoop cluster, performing a data processing operation identified by a dataflow graph being executed by the data processing engine, the dataflow graph including a) at least one component representing the Hadoop cluster, b) at least one component representing the data source external to the Hadoop cluster, and c) at least one link that represents at least one dataflow associated with a data processing operation, the data processing operation being performed using at least a first portion of data stored in HDFS data storage at the node and at least a second portion of data received from a data source external to the Hadoop cluster.
Dependent claims: 39, 40
41. A method including:
receiving a SQL query specifying sources of data including a Hadoop cluster and a relational database;
generating a dataflow graph that corresponds to the SQL query, the dataflow graph including a) at least one component representing the Hadoop cluster, b) at least one component representing the data source external to the Hadoop cluster, and c) at least one link that represents at least one dataflow associated with a data processing operation;
executing the dataflow graph at a data processing engine of a node of the Hadoop cluster; and
performing, by the data processing engine, the data processing operation using at least data of the Hadoop cluster and data of the relational database.
Dependent claims: 42, 43
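The step in claim 41 of generating a dataflow graph from a SQL query can be illustrated with a deliberately tiny translator. This is a toy under stated assumptions, not the patent's method: it recognizes only a single two-table `JOIN`, and the `catalog` mapping from table name to source kind, the table names, and the tuple-based graph encoding are all hypothetical.

```python
import re


def sql_to_graph(query, catalog):
    """Toy translation of 'SELECT ... FROM <t1> JOIN <t2> ON ...' into a graph.

    `catalog` maps each table name to the kind of source that holds it.
    Returns (components, links): components maps a component name to its kind,
    links is a list of (upstream, downstream) pairs.
    """
    match = re.search(r"\bFROM\s+(\w+)\s+JOIN\s+(\w+)\s+ON\b", query, re.IGNORECASE)
    if match is None:
        raise ValueError("this sketch supports only a single two-table JOIN")
    left, right = match.groups()
    components = {left: catalog[left], right: catalog[right], "join": "operation"}
    links = [(left, "join"), (right, "join")]
    return components, links


query = (
    "SELECT orders.id, customers.name "
    "FROM orders JOIN customers ON orders.cid = customers.id"
)
catalog = {"orders": "hadoop_cluster", "customers": "relational_database"}
components, links = sql_to_graph(query, catalog)
```

The result has one component per data source named in the query and one link per dataflow into the join, matching elements a), b), and c) of the claimed graph. A real translator would need a proper SQL parser and a query planner.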
Specification