PEER-TO-PEER ARCHITECTURE FOR PROCESSING BIG DATA
First Claim
1. A system for managing large datasets comprising:
a physical network comprising a plurality of computing devices and a plurality of processors;
a peer-to-peer (P2P) network, the P2P network comprising a plurality of nodes and a logical network derived from the physical network;
a distributed file system for distributing data and jobs randomly across the plurality of nodes in the P2P network, by:
receiving a file by an originating data node of the plurality of nodes, with a first processor of the plurality of processors assigned to the originating data node and being configured to:
divide the file into a plurality of pages,
assign a value to a first page of the plurality of pages, and
transfer the first page of the plurality of pages to an initial responsible data node, the initial responsible node including a name that closely matches the value of the first page of the plurality of pages;
replicating the first page of the plurality of pages from the initial responsible data node to a first responsible data node and a second responsible data node;
receiving a job at an originating job node and dividing the job into a plurality of tasks;
routing a first task of the plurality of tasks to an initial responsible job node; and
assigning the first task of the plurality of tasks to a first processing node using the initial responsible job node; and
a task scheduler for delegating the first task of the plurality of tasks as necessary to optimize load distribution, by:
assigning the first task of the plurality of tasks to a first queue of at least two queues in the first processing node, and
forwarding the first task of the plurality of tasks to a second processing node remote from the first processing node.
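The page-handling steps of the distributed file system recited above (divide a file into pages, assign each page a value, transfer the page to the node whose name most closely matches that value, then replicate it to two further nodes) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the claim: the SHA-1 hash used to derive a page value, the 16-bit ring-shaped identifier space, the use of ring distance as the meaning of "closely matches", the choice of the two ring neighbors as replica holders, and all function and variable names.

```python
import hashlib

ID_SPACE = 2**16   # assumed size of the identifier space (hypothetical)
PAGE_SIZE = 4      # bytes per page; tiny on purpose so the example is easy to follow


def split_into_pages(data: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Divide a file's bytes into fixed-size pages (the last page may be shorter)."""
    return [data[i:i + page_size] for i in range(0, len(data), page_size)]


def to_value(key: bytes, space: int = ID_SPACE) -> int:
    """Map a key to a value in the identifier space (SHA-1 is an assumed choice)."""
    return int.from_bytes(hashlib.sha1(key).digest(), "big") % space


def closest_node(value: int, node_names: dict[str, int], space: int = ID_SPACE) -> str:
    """Return the node whose numeric name is closest to `value` on the ring."""
    def ring_distance(name_value: int) -> int:
        d = abs(name_value - value)
        return min(d, space - d)
    # Ties are broken by node label so the result is deterministic.
    return min(node_names, key=lambda n: (ring_distance(node_names[n]), n))


def replica_set(value: int, node_names: dict[str, int], space: int = ID_SPACE) -> list[str]:
    """Return [initial responsible node, ring successor, ring predecessor]."""
    if len(node_names) < 3:
        raise ValueError("need at least three nodes to hold three copies")
    ring = sorted(node_names, key=lambda n: (node_names[n], n))
    i = ring.index(closest_node(value, node_names, space))
    return [ring[i], ring[(i + 1) % len(ring)], ring[i - 1]]


def store_file(file_name: str, data: bytes, node_names: dict[str, int]) -> dict:
    """Map each page index to (page bytes, list of nodes holding a copy)."""
    placement = {}
    for index, page in enumerate(split_into_pages(data)):
        # Assumed keying scheme: hash the file name together with the page index.
        value = to_value(f"{file_name}:{index}".encode())
        placement[index] = (page, replica_set(value, node_names))
    return placement
```

Calling `store_file("report.txt", data, {"a": 100, "b": 200, "c": 60000})` with hypothetical node names would return, for every page, the three nodes holding a copy. Because the page value comes from a hash, pages of one file tend to scatter across the nodes, which is one plausible reading of distributing data "randomly".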
Abstract
A system is disclosed for managing large datasets. The system comprises a physical network. The physical network comprises a plurality of computing devices with a plurality of processors. The system further comprises a logical peer-to-peer (P2P) network with a plurality of nodes. The system further comprises a distributed file system for distributing data and jobs received by the system randomly across the plurality of nodes in the P2P network. The system duplicates the data to neighboring nodes of the plurality of nodes. The nodes monitor each other to reduce loss of data. The system further comprises a task scheduler. The task scheduler balances load across the plurality of nodes as tasks, derived from jobs, are distributed to various nodes. The task scheduler redistributes and forwards tasks to ensure the nodes processing the tasks are best suited to process those tasks.
48 Citations
20 Claims
1. A system for managing large datasets, recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A computer program product for managing large datasets, comprising:
a peer-to-peer (P2P) network;
a distributed file system for:
dividing incoming files into a plurality of pages and distributing the pages randomly to a plurality of nodes of the P2P network;
replicating each of the plurality of pages to neighboring nodes;
assigning at least one task corresponding to the plurality of pages to a first processing node; and
a task scheduler for delegating the at least one task from the first processing node to a second processing node when a load value of the second processing node is less than a load value of the first processing node.
Dependent claims: 12, 13, 14
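The delegation condition in claim 11 (hand a task from the first processing node to a second processing node when the second node's load value is lower) can be sketched as follows. This is only an illustration under stated assumptions: the claim does not define how a load value is computed, so the sketch takes it to be the number of queued tasks, and the `ProcessingNode` class and `delegate` function are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ProcessingNode:
    name: str
    queue: deque = field(default_factory=deque)

    @property
    def load(self) -> int:
        # Assumed load value: the number of tasks waiting in the queue.
        return len(self.queue)


def delegate(task: str, first: ProcessingNode, second: ProcessingNode) -> ProcessingNode:
    """Queue `task` on `second` if it is strictly less loaded than `first`.

    Otherwise the task stays with `first`. Returns the node that received it.
    """
    target = second if second.load < first.load else first
    target.queue.append(task)
    return target


# Usage: node "a" already holds two tasks, node "b" is idle.
a = ProcessingNode("a", deque(["t1", "t2"]))
b = ProcessingNode("b")
receiver = delegate("t3", a, b)
print(receiver.name, a.load, b.load)
```

The comparison is made when the task arrives and is strict, so a task stays on the first node when the loads are equal. That avoids a task bouncing between two equally loaded nodes.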
15. A method for managing large datasets, comprising:
forming a peer-to-peer (P2P) network comprising a plurality of nodes, each node of the plurality of nodes including at least one processor;
distributing a page of a file to a first data node of the plurality of nodes;
replicating the page to a second data node of the plurality of nodes;
receiving a job on the P2P network;
distributing a task of the job to a first processing node; and
determining whether the task should be further distributed to a second processing node.
Dependent claims: 16, 17, 18, 19, 20
Specification