Peer-to-peer architecture for processing big data
First Claim
1. A system for managing large datasets comprising:
a physical network comprising a plurality of computing devices and a plurality of processors;
a peer-to-peer (P2P) network, the P2P network comprising a plurality of nodes and a logical network derived from the physical network;
a distributed file system for distributing data and jobs randomly across the plurality of nodes in the P2P network, by:
receiving a file by an originating data node of the plurality of nodes, with a first processor of the plurality of processors assigned to the originating data node and being configured to:
divide the file into a plurality of pages, assign a hash value to a first page of the plurality of pages, and transfer the first page of the plurality of pages to an initial responsible data node, the initial responsible data node including a name defining a string of characters that shares a predetermined number of values with the hash value of the first page of the plurality of pages;
replicating the first page of the plurality of pages from the initial responsible data node to a first responsible data node and a second responsible data node;
receiving a job at an originating job node and dividing the job into a plurality of tasks, wherein the job comprises an input file name, a map function to process the plurality of tasks, and a reduce function to generate a set of results for the plurality of tasks from values derived from the map function;
routing a first task of the plurality of tasks to an initial responsible job node; and
assigning the first task of the plurality of tasks to a first processing node using the initial responsible job node; and
a task scheduler for delegating the first task of the plurality of tasks as necessary to optimize load distribution, by:
assigning the first task of the plurality of tasks to a first queue of at least two queues in the first processing node, and
forwarding the first task of the plurality of tasks to a second processing node remote from the first processing node,
wherein each of the plurality of nodes in the P2P network is configured to perform data storage, task execution, and job delegation, and
wherein a distributed hash table is utilized to form the P2P network, and
wherein globally unique arbitrary keys are mapped to individual nodes of the plurality of nodes to allow for a node lookup by referencing the unique arbitrary keys to accommodate random distribution of node names within the P2P network.
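The page-placement rule above can be sketched in a few lines: a page is hashed, and the responsible data node is the one whose name shares the most leading values with that hash. This is an illustrative reading only; SHA-1 and longest-common-prefix matching are assumptions standing in for the claim's "predetermined number of values":

```python
import hashlib

def page_hash(page_bytes: bytes) -> str:
    # Hash the page contents to a fixed-length hex string (SHA-1 is illustrative).
    return hashlib.sha1(page_bytes).hexdigest()

def shared_prefix_len(a: str, b: str) -> int:
    # Count leading characters the two strings have in common.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def responsible_node(node_names: list[str], h: str) -> str:
    # The "initial responsible data node" is taken here to be the node whose
    # name shares the most leading values with the page's hash.
    return max(node_names, key=lambda name: shared_prefix_len(name, h))
```

With node names drawn from the same character space as the hash, this gives every page a deterministic home without any central directory, which is what lets the claimed system distribute pages "randomly" yet still find them later.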
2 Assignments
0 Petitions
Abstract
A system is disclosed for managing large datasets. The system comprises a physical network. The physical network comprises a plurality of computing devices with a plurality of processors. The system further comprises a logical peer-to-peer (P2P) network with a plurality of nodes. The system further comprises a distributed file system for distributing data and jobs received by the system randomly across the plurality of nodes in the P2P network. The system duplicates the data to neighboring nodes of the plurality of nodes. The nodes monitor each other to reduce loss of data. The system further comprises a task scheduler. The task scheduler balances load across the plurality of nodes as tasks, derived from jobs, are distributed to various nodes. The task scheduler redistributes and forwards tasks to ensure the nodes processing the tasks are best suited to process those tasks.
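The abstract's duplication of data to neighboring nodes can be sketched with a sorted ring of node names, where a page's copies go to the primary's successors. The ring layout and the copy count of two are assumptions; the text itself only requires replication to neighboring nodes:

```python
def replicate(primary: str, ring: list[str], copies: int = 2) -> list[str]:
    # Pick the `copies` nodes that follow the primary on a sorted ring of
    # node names, wrapping around at the end. These are the "neighboring
    # nodes" that receive duplicates of the primary's page.
    ring = sorted(ring)
    i = ring.index(primary)
    return [ring[(i + k) % len(ring)] for k in range(1, copies + 1)]
```

Because the replicas sit at known positions relative to the primary, each node can monitor its neighbors and re-replicate a page when one of them disappears, which is how the abstract's loss-reduction monitoring would operate.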
38 Citations
16 Claims
1. A system for managing large datasets comprising:
a physical network comprising a plurality of computing devices and a plurality of processors;
a peer-to-peer (P2P) network, the P2P network comprising a plurality of nodes and a logical network derived from the physical network;
a distributed file system for distributing data and jobs randomly across the plurality of nodes in the P2P network, by:
receiving a file by an originating data node of the plurality of nodes, with a first processor of the plurality of processors assigned to the originating data node and being configured to:
divide the file into a plurality of pages, assign a hash value to a first page of the plurality of pages, and transfer the first page of the plurality of pages to an initial responsible data node, the initial responsible data node including a name defining a string of characters that shares a predetermined number of values with the hash value of the first page of the plurality of pages;
replicating the first page of the plurality of pages from the initial responsible data node to a first responsible data node and a second responsible data node;
receiving a job at an originating job node and dividing the job into a plurality of tasks, wherein the job comprises an input file name, a map function to process the plurality of tasks, and a reduce function to generate a set of results for the plurality of tasks from values derived from the map function;
routing a first task of the plurality of tasks to an initial responsible job node; and
assigning the first task of the plurality of tasks to a first processing node using the initial responsible job node; and
a task scheduler for delegating the first task of the plurality of tasks as necessary to optimize load distribution, by:
assigning the first task of the plurality of tasks to a first queue of at least two queues in the first processing node, and
forwarding the first task of the plurality of tasks to a second processing node remote from the first processing node,
wherein each of the plurality of nodes in the P2P network is configured to perform data storage, task execution, and job delegation, and
wherein a distributed hash table is utilized to form the P2P network, and
wherein globally unique arbitrary keys are mapped to individual nodes of the plurality of nodes to allow for a node lookup by referencing the unique arbitrary keys to accommodate random distribution of node names within the P2P network.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
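Claim 1's scheduler element names a first queue of at least two queues in the first processing node, plus forwarding to a remote second node. A minimal sketch, assuming one local-execution queue and one forwarding queue, with a capacity threshold as the trigger (the trigger condition is not specified in the claim):

```python
from collections import deque

class ProcessingNode:
    # Minimal sketch of a processing node with "at least two queues":
    # one for tasks this node will execute itself, and one for tasks
    # queued to be forwarded to a remote processing node.
    def __init__(self, name: str, capacity: int):
        self.name = name
        self.capacity = capacity
        self.local = deque()    # first queue: tasks accepted for local execution
        self.forward = deque()  # second queue: tasks awaiting delegation

    def assign(self, task: str) -> None:
        # Accept the task locally while capacity remains; otherwise queue it
        # for forwarding (this overflow rule is an assumption).
        if len(self.local) < self.capacity:
            self.local.append(task)
        else:
            self.forward.append(task)
```

Separating the two queues lets the scheduler delegate overflow work without disturbing tasks the node has already committed to run.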
9. A computer program product comprising non-transitory media for managing large datasets and defining instructions, comprising:
forming a peer-to-peer (P2P) network from a physical network comprising a plurality of computing devices, the P2P network comprising a plurality of nodes and a logical network associated with the plurality of computing devices of the physical network;
generating a distributed file system, configured for:
receiving incoming files by an origination node of the plurality of nodes,
dividing the incoming files into a plurality of pages and distributing the pages to the plurality of nodes of the P2P network, by transferring a first page of the plurality of pages to an initial responsible data node, the initial responsible data node including a name defining a string of characters that shares a predetermined number of values with a hash value of the first page of the plurality of pages;
replicating each of the plurality of pages to neighboring nodes;
assigning at least one task corresponding to the plurality of pages to a first processing node; and
implementing a task scheduler for delegating the at least one task from the first processing node to a second processing node when a load value of the second processing node is less than a load value of the first processing node,
wherein the first processing node and second processing node comprise map nodes and reduce nodes to perform map tasks and reduce functions,
wherein each of the plurality of nodes in the P2P network is adapted to perform data storage, task execution, and job delegation, and
wherein a distributed hash table is implemented to assign values or names to peers of the P2P network.
- View Dependent Claims (10)
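Claim 9 states the delegation condition explicitly: the task moves to the second processing node only when its load value is less than the first node's. A minimal sketch of that comparison, generalized to choose the least-loaded peer (the peer-selection policy is an assumption; the claim only requires the load comparison):

```python
def delegate_target(loads: dict[str, float], first: str) -> str:
    # Forward the task to the least-loaded peer, but only if that peer's
    # load value is strictly less than the first processing node's
    # (claim 9's condition); otherwise the task stays where it is.
    best = min(loads, key=loads.get)
    return best if loads[best] < loads[first] else first
```

The strict inequality matters: with equal loads the task stays put, so the scheduler never pays a forwarding cost without an expected improvement.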
11. A method for managing large datasets, comprising:
forming a peer-to-peer (P2P) network comprising a plurality of nodes, each node of the plurality of nodes associated with at least one processor of a physical network;
distributing a page of a file to a first data node of the plurality of nodes;
assigning a hash value to the page;
transferring the page to an initial responsible data node, the initial responsible data node associated with a name defining a string of characters that shares a predetermined number of values with the hash value of the page;
replicating the page to at least a second data node of the plurality of nodes;
receiving a job at an originating job node associated with the P2P network;
distributing a task of the job to a first processing node;
determining whether the task should be further distributed to a second processing node; and
implementing a task scheduler for delegating the task as necessary to optimize load distribution, by:
assigning the task to a first queue of at least two queues in the first processing node, and
forwarding the task to a second processing node remote from the first processing node,
wherein the first processing node and second processing node comprise map nodes and reduce nodes to perform map tasks and reduce functions,
wherein each of the plurality of nodes in the P2P network is operable to perform data storage, task execution, and job delegation,
wherein a distributed hash table is utilized to form the P2P network, and
wherein globally unique arbitrary keys are mapped to individual nodes of the plurality of nodes to allow for a node lookup by referencing the unique arbitrary keys to accommodate random distribution of node names within the P2P network.
- View Dependent Claims (12, 13, 14, 15, 16)
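The job model running through claims 1, 9, and 11 (an input file split into pages, a map function, and a reduce function) follows the familiar MapReduce flow. A toy single-process sketch, using word count as the example map/reduce pair (the example functions are assumptions for illustration, not from the claims):

```python
from collections import defaultdict

def run_job(pages, map_fn, reduce_fn):
    # Toy end-to-end flow of the claimed job model: each page becomes one
    # map task; intermediate (key, value) pairs are grouped by key and
    # handed to the reduce function to produce the job's result set.
    intermediate = defaultdict(list)
    for page in pages:                      # one map task per page
        for key, value in map_fn(page):
            intermediate[key].append(value)
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

# Word count as the usual example map/reduce pair.
word_map = lambda page: [(w, 1) for w in page.split()]
word_reduce = lambda key, values: sum(values)
```

In the claimed system the map tasks would run on the nodes already holding the pages, so the page-placement hashing and the task routing cooperate to move computation to the data rather than data to the computation.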
Specification