Data processing over very large databases

US 7,624,118 B2
Filed: 07/26/2006
Issued: 11/24/2009
Est. Priority Date: 07/26/2006
Status: Active Grant

Abstract

A system that facilitates data processing includes a receiver component that receives an SQL query. A partitioning component partitions the SQL query into multiple tasks and provides the tasks to multiple cluster nodes for processing. The system enables very large amounts of data (e.g., multiple terabytes) to be quickly prepared for analytical processing, such as for use in connection with a search engine, an advertisement provision system, etc.
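The partition-then-combine idea in the abstract can be illustrated with a small sketch. This is not the patented implementation: it assumes each cluster node is an in-memory SQLite database holding one shard of a hypothetical `clicks` table, and that the query is a simple grouped aggregate whose per-node partial results can be merged by addition. The names `make_node`, `run_partitioned`, and `aggregate` are invented for illustration.

```python
import sqlite3

def make_node(rows):
    """Create one hypothetical cluster node holding a shard of a 'clicks' table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE clicks (ad_id INTEGER, cost REAL)")
    db.executemany("INSERT INTO clicks VALUES (?, ?)", rows)
    return db

def run_partitioned(nodes, task_sql):
    """Run the same per-shard task on every node and collect the partial results."""
    return [node.execute(task_sql).fetchall() for node in nodes]

def aggregate(partials):
    """Merge per-node (ad_id, count, total) rows into global totals per ad_id."""
    merged = {}
    for rows in partials:
        for ad_id, count, total in rows:
            prev_count, prev_total = merged.get(ad_id, (0, 0.0))
            merged[ad_id] = (prev_count + count, prev_total + total)
    return merged

# Two shards stand in for two cluster nodes.
nodes = [
    make_node([(1, 0.5), (2, 1.0)]),
    make_node([(1, 1.5), (3, 2.0)]),
]
task = "SELECT ad_id, COUNT(*), SUM(cost) FROM clicks GROUP BY ad_id"
totals = aggregate(run_partitioned(nodes, task))
```

Counts and sums merge by simple addition, which is why this sketch uses them; aggregates such as averages would need to carry both a sum and a count through the merge step.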

17 Claims

1. A system that facilitates data processing, comprising:
- a processor that executes the following computer executable components stored on a computer readable storage medium;
  
  a receiver component that receives a structured query language (SQL) query;
  
  a partitioning component that partitions the SQL query into multiple tasks and provides the tasks to multiple cluster nodes for processing, wherein the multiple cluster nodes include a hierarchical arrangement of sub-clusters of nodes, at least one of the cluster nodes includes a second partitioning component that partitions the received tasks into multiple sub-tasks, the at least one of the cluster nodes determine for one or more sub-tasks whether to execute the sub-task at the at least one cluster node or to provide the sub-task to a first sub-cluster for execution, and further wherein the multiple tasks that are provided to the multiple cluster nodes are assigned based on the association of the data content accessible by each of the multiple cluster nodes with the data content required by the one or more tasks; and
  
  a monitoring component that monitors the progress of a first task at a first cluster of nodes of the multiple clusters of nodes, wherein the monitoring component determines the first task is not completed within a first threshold of time, and further wherein the monitoring component reassigns the first task from the first cluster of nodes of the multiple clusters of nodes to a second cluster of nodes of the multiple clusters of nodes upon determining the first task was not completed in the first threshold of time.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
- - 2. The system of claim 1, further comprising an aggregation component that aggregates processed data received from the clusters.
  - 3. The system of claim 1, further comprising a rollback component that initiates a rollback of data to a known good state if a load of data into the system fails.
  - 4. The system of claim 1, a subset of the cluster nodes act as reader nodes that read web logs from a web server and provide a subset of the web logs to a particular cluster nodes that act as writer nodes and format the data in a suitable form for querying.
  - 5. The system of claim 1, further comprising a monitoring component that monitors the multiple cluster nodes to ensure that the multiple tasks are being performed.
  - 6. The system of claim 1, unreliable communications are undertaken between the partitioning component and the multiple cluster nodes.
  - 7. The system of claim 1, the cluster nodes communicate with one another by way of unreliable messaging.
  - 8. The system of claim 1, the at least one cluster node includes an aggregation component that aggregates data resultant from execution of the sub-tasks at sub-clusters associated with the cluster node.
  - 9. The system of claim 1, the plurality of clusters reside within a shared nothing storage architecture.
  - 10. The system of claim 1, further comprising a loading component that loads data into the plurality of cluster nodes from a web server, the loading component employs one or more distributed sort algorithms to assign one or more data partitions to one or more certain clusters.
  - 11. The system of claim 1, further comprising a search engine that utilizes results of the SQL query to selectively provide content to a user.
  - 12. The system of claim 1, further comprising an advertisement server that utilizes results of the SQL query to selectively provide advertisements to a user.
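The monitoring behavior recited in claim 1 (reassigning a task when it is not completed within a time threshold) might be sketched as follows. This is a simulation under stated assumptions, not the claimed system: each cluster's completion time is a fixed number supplied by the caller rather than a real measurement, and the `Cluster` class and `run_with_reassignment` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    seconds_needed: float  # simulated completion time, not a real measurement

def run_with_reassignment(clusters, threshold_seconds):
    """Return (name of the finishing cluster, names of clusters tried in order).

    A cluster whose simulated time exceeds the threshold is treated as having
    missed it, and the task is reassigned to the next cluster in the list.
    """
    tried = []
    for cluster in clusters:
        tried.append(cluster.name)
        if cluster.seconds_needed <= threshold_seconds:
            return cluster.name, tried
    raise TimeoutError("no cluster completed the task within the threshold")

# The first cluster is too slow, so the task moves to the second one.
winner, tried = run_with_reassignment(
    [Cluster("cluster-1", 12.0), Cluster("cluster-2", 3.0)],
    threshold_seconds=5.0,
)
```

A real monitor would observe progress asynchronously and would have to decide what to do with late results from the original cluster; the sketch leaves both questions out.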

13. A method for preparing large amounts of data for analytical processing, comprising:
- receiving a query;
  
  utilizing a processor to determine multiple tasks based on the query;
  
  providing the multiple tasks to a plurality of cluster nodes through usage of one-way messaging, wherein the plurality of cluster nodes comprises a hierarchical arrangement of multiple cluster nodes that are subservient to one or more parent cluster nodes, and further wherein the multiple tasks that are provided to the plurality of cluster nodes are assigned based on the association of the data content accessible by each of the plurality of cluster nodes with the data content required by the one or more tasks;
  
  partitioning the tasks into a plurality of sub-tasks at one or more of the plurality of cluster nodes;
  
  selecting one or more sub-tasks at the one or more of the plurality of cluster nodes;
  
  providing the selected subtasks to multiple cluster nodes that are subservient to the cluster node that is providing the selected subtasks;
  
  monitoring the progress of a first task at a first cluster node of the multiple cluster nodes, wherein the monitoring includes determining whether the first task is completed within a first threshold of time, and reassigning the first task from the first cluster node of the multiple cluster nodes to a second cluster node of the multiple cluster nodes if the first task is not completed within the first threshold of time;
  
  aggregating results provided from the plurality of cluster nodes with respect to the multiple tasks; and
  
  providing the aggregated results to an object linking and embedding database (OLE DB) client.
- Dependent Claims (14, 15, 16)
- - 14. The method of claim 13, further comprising:
    - performing data mining on the aggregated results; and
      
      providing at least one of search content and an advertisement based at least in part on the data mining.
  - 15. The method of claim 13, the received query is an structured query language (SQL) query.
  - 16. The method of claim 13, further comprising:
    - receiving an identity of a user; and
      
      generating the query based at least in part upon the received identity.
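Claim 13 assigns tasks according to how the data each node can access relates to the data each task requires. One possible reading, shown here purely as an assumption, is to send each task to the node whose accessible partitions overlap most with the partitions the task needs. The function name `assign_by_data_affinity`, the partition labels, and the overlap-count scoring rule are all hypothetical.

```python
def assign_by_data_affinity(tasks, node_partitions):
    """Map each task to the node that can access most of its required partitions.

    tasks: dict of task id -> set of required partition ids
    node_partitions: dict of node name -> set of accessible partition ids
    Ties go to the alphabetically first node so the result is deterministic.
    """
    assignment = {}
    for task_id, required in tasks.items():
        assignment[task_id] = max(
            sorted(node_partitions),
            key=lambda node: len(required & node_partitions[node]),
        )
    return assignment

tasks = {"t1": {"p1", "p2"}, "t2": {"p3"}}
node_partitions = {"node-a": {"p1", "p2"}, "node-b": {"p3", "p4"}}
assignment = assign_by_data_affinity(tasks, node_partitions)
```

Counting overlapping partitions is only one scoring rule; a scheduler could equally weigh partition sizes or current node load.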

17. A data processing system, comprising:
- means for receiving a structured query language (SQL) query that is to be executed over multiple terabytes of data;
  
  means for determining multiple tasks associated with the received SQL query and providing the multiple tasks to a plurality of cluster nodes for processing, the plurality of cluster nodes comprises a hierarchical arrangement of multiple cluster nodes that are subservient to one or more parent cluster nodes;
  
  means for partitioning at least one of the tasks into a plurality of sub-tasks at one or more of the plurality of cluster nodes;
  
  means for determining one or more sub-tasks at the one or more of the plurality of cluster nodes;
  
  means for providing the determined sub-tasks to multiple cluster nodes that are subservient to the cluster node that is providing the determined sub-tasks;
  
  means for monitoring the progress of a first task at a first cluster of nodes of the multiple clusters of nodes, wherein the monitoring component determines the first task is not completed within a first threshold of time; and
  
  means for reassigning the first task from the first cluster of nodes of the multiple clusters of nodes to a second cluster of nodes of the multiple clusters of nodes if the first task is not completed within the first threshold of time.
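The hierarchical arrangement in the claims, where a node splits a task into sub-tasks and decides for each one whether to run it locally or pass it to a subservient node, could look roughly like the sketch below. Everything here is an illustrative assumption: the work is a plain sum over a list, the `Node` class is invented, and the local-versus-delegate rule (small sub-tasks stay local) is only one of many policies a node might apply.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    chunk_size: int = 2   # hypothetical sub-task size
    local_limit: int = 1  # hypothetical: sub-tasks this small run locally

    def process(self, values):
        """Return sum(values), delegating larger sub-tasks to child nodes."""
        # Partition the received task into fixed-size sub-tasks.
        sub_tasks = [values[i:i + self.chunk_size]
                     for i in range(0, len(values), self.chunk_size)]
        partials = []
        for index, sub_task in enumerate(sub_tasks):
            if not self.children or len(sub_task) <= self.local_limit:
                # Execute the sub-task at this node.
                partials.append(sum(sub_task))
            else:
                # Hand the sub-task to a subservient node, round-robin.
                child = self.children[index % len(self.children)]
                partials.append(child.process(sub_task))
        # Aggregate the partial results before returning them upward.
        return sum(partials)

root = Node("root", children=[Node("leaf-a"), Node("leaf-b")])
total = root.process([1, 2, 3, 4, 5])
```

In this run the root delegates `[1, 2]` and `[3, 4]` to its two leaves and computes the trailing `[5]` itself, then adds the three partial sums.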

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Hargrove, Thomas H.; Schipunov, Vladimir; Prasad, Rajeev
Primary Examiner: Robinson, Greta L
Assistant Examiner: Wilcox, James J
Application Number: US11/460,070
Publication Number: US 20080027920A1
Time in Patent Office: 1,217 Days
Field of Search: 707/104
US Class Current: 1/1
CPC Class Codes:
  G06F 16/2465   Query processing support fo...
  G06F 16/2471   Distributed queries
  Y10S 707/99934   Query formulation, input pr...
  Y10S 707/99936   Pattern matching access
  Y10S 707/99942   Manipulating data structure...
  Y10S 707/99945   Object-oriented database st...
