Data processing over very large databases
Abstract
A system that facilitates data processing includes a receiver component that receives an SQL query. A partitioning component partitions the SQL query into multiple tasks and provides the tasks to multiple cluster nodes for processing. The system enables very large amounts of data (e.g., multiple terabytes) to be quickly prepared for analytical processing, such as for use in connection with a search engine, an advertisement provision system, etc.
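The partition-and-combine idea in the abstract can be illustrated with a small sketch. It is not the patented implementation: each "cluster node" is assumed to be an in-memory SQLite database holding one shard of a hypothetical `clicks` table, and the names `make_node` and `run_partitioned` are invented for illustration. The same per-shard task is run on every node and the partial counts and sums are merged.

```python
import sqlite3

def make_node(rows):
    """Hypothetical 'cluster node': an in-memory SQLite database holding one data shard."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clicks (ad_id INTEGER, cost REAL)")
    conn.executemany("INSERT INTO clicks VALUES (?, ?)", rows)
    return conn

def run_partitioned(nodes, task_sql):
    """Run the same per-shard task on every node and merge the partial results."""
    totals = {}
    for node in nodes:
        for ad_id, n, cost in node.execute(task_sql):
            count, total = totals.get(ad_id, (0, 0.0))
            # COUNT and SUM are decomposable: partial values can simply be added.
            totals[ad_id] = (count + n, total + cost)
    return totals

# Two shards of the assumed table, one per node.
nodes = [
    make_node([(1, 0.5), (2, 1.0)]),
    make_node([(1, 1.5), (3, 2.0)]),
]
task_sql = "SELECT ad_id, COUNT(*), SUM(cost) FROM clicks GROUP BY ad_id"
result = run_partitioned(nodes, task_sql)
print(result)
```

Only decomposable aggregates such as COUNT and SUM merge this simply; an average, for example, would have to be rebuilt from the merged count and sum rather than by averaging per-node averages.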
17 Claims
1. A system that facilitates data processing, comprising:
- a processor that executes the following computer executable components stored on a computer readable storage medium;
- a receiver component that receives a structured query language (SQL) query;
- a partitioning component that partitions the SQL query into multiple tasks and provides the tasks to multiple cluster nodes for processing, wherein the multiple cluster nodes include a hierarchical arrangement of sub-clusters of nodes, at least one of the cluster nodes includes a second partitioning component that partitions the received tasks into multiple sub-tasks, the at least one of the cluster nodes determine for one or more sub-tasks whether to execute the sub-task at the at least one cluster node or to provide the sub-task to a first sub-cluster for execution, and further wherein the multiple tasks that are provided to the multiple cluster nodes are assigned based on the association of the data content accessible by each of the multiple cluster nodes with the data content required by the one or more tasks; and
- a monitoring component that monitors the progress of a first task at a first cluster of nodes of the multiple clusters of nodes, wherein the monitoring component determines the first task is not completed within a first threshold of time, and further wherein the monitoring component reassigns the first task from the first cluster of nodes of the multiple clusters of nodes to a second cluster of nodes of the multiple clusters of nodes upon determining the first task was not completed in the first threshold of time.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
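As an illustration of the data-based assignment and the local-versus-sub-cluster decision recited in claim 1, the following is a minimal sketch under stated assumptions. The `Node` class, the `reachable`, `assign`, and `execute` helpers, and the rule "run a sub-task locally if this node holds the partition, otherwise hand it to the best-matching child" are all hypothetical; the claim does not prescribe a particular rule.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    data: set                                     # partitions held directly by this node
    children: list = field(default_factory=list)  # sub-cluster nodes beneath this node

def reachable(node):
    """Partitions accessible at this node or anywhere in its sub-cluster."""
    out = set(node.data)
    for child in node.children:
        out |= reachable(child)
    return out

def assign(task_data, nodes):
    """Pick the node whose accessible data overlaps most with what the task needs."""
    return max(nodes, key=lambda n: len(reachable(n) & task_data))

def execute(node, task_data, log):
    """Split a task into one sub-task per partition; run locally or delegate downward."""
    for part in sorted(task_data):
        if part in node.data or not node.children:
            log.append((node.name, part))          # executed at this node
        else:
            execute(assign({part}, node.children), {part}, log)

# Assumed two-level hierarchy: node A has a sub-cluster, node B is a leaf.
a = Node("A", {"p0"}, [Node("A1", {"p1"}), Node("A2", {"p2"})])
b = Node("B", {"p9"})

task = {"p0", "p2"}
target = assign(task, [a, b])
log = []
execute(target, task, log)
print(target.name, log)
```

Here the task goes to node A because its sub-cluster covers both required partitions; A runs the `p0` sub-task itself and provides the `p2` sub-task to child A2.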
13. A method for preparing large amounts of data for analytical processing, comprising:
- receiving a query;
- utilizing a processor to determine multiple tasks based on the query;
- providing the multiple tasks to a plurality of cluster nodes through usage of one-way messaging, wherein the plurality of cluster nodes comprises a hierarchical arrangement of multiple cluster nodes that are subservient to one or more parent cluster nodes, and further wherein the multiple tasks that are provided to the plurality of cluster nodes are assigned based on the association of the data content accessible by each of the plurality of cluster nodes with the data content required by the one or more tasks;
- partitioning the tasks into a plurality of sub-tasks at one or more of the plurality of cluster nodes;
- selecting one or more sub-tasks at the one or more of the plurality of cluster nodes;
- providing the selected subtasks to multiple cluster nodes that are subservient to the cluster node that is providing the selected subtasks;
- monitoring the progress of a first task at a first cluster node of the multiple cluster nodes, wherein the monitoring includes determining whether the first task is completed within a first threshold of time, and reassigning the first task from the first cluster node of the multiple cluster nodes to a second cluster node of the multiple cluster nodes if the first task is not completed within the first threshold of time;
- aggregating results provided from the plurality of cluster nodes with respect to the multiple tasks; and
- providing the aggregated results to an object linking and embedding database (OLE DB) client.
Dependent claims: 14, 15, 16
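The monitoring and reassignment step of claim 13 can be sketched as a timeout around each attempt. This is only an assumed illustration: cluster nodes are modeled as plain Python callables run on worker threads, and `run_with_reassignment`, `slow_node`, and `fast_node` are hypothetical names. A real cluster would also need to cancel or ignore the straggling attempt, which this sketch simply abandons.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_with_reassignment(task, nodes, threshold_s):
    """Try each node in turn; reassign if the task misses the time threshold."""
    pool = ThreadPoolExecutor(max_workers=len(nodes))
    try:
        for name, node in nodes:
            future = pool.submit(node, task)
            try:
                return name, future.result(timeout=threshold_s)
            except TimeoutError:
                continue  # threshold exceeded: hand the task to the next node
        raise RuntimeError("no node completed the task within the threshold")
    finally:
        pool.shutdown(wait=False)  # do not block on the abandoned attempt

def slow_node(task):
    time.sleep(1.0)   # simulates an overloaded or failing node
    return sum(task)

def fast_node(task):
    return sum(task)

winner, value = run_with_reassignment(
    [1, 2, 3],
    [("node-1", slow_node), ("node-2", fast_node)],
    threshold_s=0.2,
)
print(winner, value)
```

The first node misses the 0.2 second threshold, so the same task is submitted to the second node, whose result is returned.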
17. A data processing system, comprising:
- means for receiving a structured query language (SQL) query that is to be executed over multiple terabytes of data;
- means for determining multiple tasks associated with the received SQL query and providing the multiple tasks to a plurality of cluster nodes for processing, the plurality of cluster nodes comprises a hierarchical arrangement of multiple cluster nodes that are subservient to one or more parent cluster nodes;
- means for partitioning at least one of the tasks into a plurality of sub-tasks at one or more of the plurality of cluster nodes;
- means for determining one or more sub-tasks at the one or more of the plurality of cluster nodes;
- means for providing the determined sub-tasks to multiple cluster nodes that are subservient to the cluster node that is providing the determined sub-tasks;
- means for monitoring the progress of a first task at a first cluster of nodes of the multiple clusters of nodes, wherein the monitoring component determines the first task is not completed within a first threshold of time; and
- means for reassigning the first task from the first cluster of nodes of the multiple clusters of nodes to a second cluster of nodes of the multiple clusters of nodes if the first task is not completed within the first threshold of time.
Specification