Peer-to-peer architecture for processing big data
First Claim
1. A system for managing large datasets comprising:
a physical network comprising a plurality of computing devices and a plurality of processors;
a peer-to-peer (P2P) network, the P2P network comprising a plurality of nodes and a logical network derived from the physical network;
a distributed file system for distributing data and jobs randomly across the plurality of nodes in the P2P network, by:
receiving a file by an originating data node of the plurality of nodes, with a first processor of the plurality of processors assigned to the originating data node and being configured to:
divide the file into a plurality of pages, assign a hash value to a first page of the plurality of pages, and transfer the first page of the plurality of pages to an initial responsible data node, the initial responsible data node including a name defining a string of characters that shares a predetermined number of values with the hash value of the first page of the plurality of pages;
replicating the first page of the plurality of pages from the initial responsible data node to a first responsible data node and a second responsible data node;
receiving a job at an originating job node and dividing the job into a plurality of tasks, wherein the job comprises an input file name, a map function to process the plurality of tasks, and a reduce function to generate a set of results for the plurality of tasks from values derived from the map function;
routing a first task of the plurality of tasks to an initial responsible job node; and
assigning the first task of the plurality of tasks to a first processing node using the initial responsible job node; and
a task scheduler for delegating the first task of the plurality of tasks as necessary to optimize load distribution, by:
assigning the first task of the plurality of tasks to a first queue of at least two queues in the first processing node, and
forwarding the first task of the plurality of tasks to a second processing node remote from the first processing node,
wherein each of the plurality of nodes in the P2P network is configured to perform data storage, task execution, and job delegation, and
wherein a distributed hash table is utilized to form the P2P network, and
wherein globally unique arbitrary keys are mapped to individual nodes of the plurality of nodes to allow for a node lookup by referencing the unique arbitrary keys to accommodate random distribution of node names within the P2P network.
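The page-placement rule above can be sketched in a few lines: a page is hashed, and the responsible data node is the one whose name shares the most leading values with that hash. This is an illustrative reading only; SHA-1 and longest-common-prefix matching are assumptions standing in for the claim's "predetermined number of values":

```python
import hashlib

def page_hash(page_bytes: bytes) -> str:
    # Hash the page contents to a fixed-length hex string (SHA-1 is illustrative).
    return hashlib.sha1(page_bytes).hexdigest()

def shared_prefix_len(a: str, b: str) -> int:
    # Count leading characters the two strings have in common.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def responsible_node(node_names: list[str], h: str) -> str:
    # The "initial responsible data node" is taken here to be the node whose
    # name shares the most leading values with the page's hash.
    return max(node_names, key=lambda name: shared_prefix_len(name, h))
```

With node names drawn from the same character space as the hash, this gives every page a deterministic home without any central directory, which is what lets the claimed system distribute pages "randomly" yet still find them later.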
2 Assignments
0 Petitions
Abstract
A system is disclosed for managing large datasets. The system comprises a physical network. The physical network comprises a plurality of computing devices with a plurality of processors. The system further comprises a logical peer-to-peer (P2P) network with a plurality of nodes. The system further comprises a distributed file system for distributing data and jobs received by the system randomly across the plurality of nodes in the P2P network. The system duplicates the data to neighboring nodes of the plurality of nodes. The nodes monitor each other to reduce loss of data. The system further comprises a task scheduler. The task scheduler balances load across the plurality of nodes as tasks, derived from jobs, are distributed to various nodes. The task scheduler redistributes and forwards tasks to ensure the nodes processing the tasks are best suited to process those tasks.
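The abstract's duplication of data to neighboring nodes can be sketched with a sorted ring of node names, where a page's copies go to the primary's successors. The ring layout and the copy count of two are assumptions; the text itself only requires replication to neighboring nodes:

```python
def replicate(primary: str, ring: list[str], copies: int = 2) -> list[str]:
    # Pick the `copies` nodes that follow the primary on a sorted ring of
    # node names, wrapping around at the end. These are the "neighboring
    # nodes" that receive duplicates of the primary's page.
    ring = sorted(ring)
    i = ring.index(primary)
    return [ring[(i + k) % len(ring)] for k in range(1, copies + 1)]
```

Because the replicas sit at known positions relative to the primary, each node can monitor its neighbors and re-replicate a page when one of them disappears, which is how the abstract's loss-reduction monitoring would operate.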
38 Citations
16 Claims
1. A system for managing large datasets comprising:
a physical network comprising a plurality of computing devices and a plurality of processors;
a peer-to-peer (P2P) network, the P2P network comprising a plurality of nodes and a logical network derived from the physical network;
a distributed file system for distributing data and jobs randomly across the plurality of nodes in the P2P network, by:
receiving a file by an originating data node of the plurality of nodes, with a first processor of the plurality of processors assigned to the originating data node and being configured to:
divide the file into a plurality of pages, assign a hash value to a first page of the plurality of pages, and transfer the first page of the plurality of pages to an initial responsible data node, the initial responsible data node including a name defining a string of characters that shares a predetermined number of values with the hash value of the first page of the plurality of pages;
replicating the first page of the plurality of pages from the initial responsible data node to a first responsible data node and a second responsible data node;
receiving a job at an originating job node and dividing the job into a plurality of tasks, wherein the job comprises an input file name, a map function to process the plurality of tasks, and a reduce function to generate a set of results for the plurality of tasks from values derived from the map function;
routing a first task of the plurality of tasks to an initial responsible job node; and
assigning the first task of the plurality of tasks to a first processing node using the initial responsible job node; and
a task scheduler for delegating the first task of the plurality of tasks as necessary to optimize load distribution, by:
assigning the first task of the plurality of tasks to a first queue of at least two queues in the first processing node, and
forwarding the first task of the plurality of tasks to a second processing node remote from the first processing node,
wherein each of the plurality of nodes in the P2P network is configured to perform data storage, task execution, and job delegation, and
wherein a distributed hash table is utilized to form the P2P network, and
wherein globally unique arbitrary keys are mapped to individual nodes of the plurality of nodes to allow for a node lookup by referencing the unique arbitrary keys to accommodate random distribution of node names within the P2P network.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
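Claim 1's scheduler element names a first queue of at least two queues in the first processing node, plus forwarding to a remote second node. A minimal sketch, assuming one local-execution queue and one forwarding queue, with a capacity threshold as the trigger (the trigger condition is not specified in the claim):

```python
from collections import deque

class ProcessingNode:
    # Minimal sketch of a processing node with "at least two queues":
    # one for tasks this node will execute itself, and one for tasks
    # queued to be forwarded to a remote processing node.
    def __init__(self, name: str, capacity: int):
        self.name = name
        self.capacity = capacity
        self.local = deque()    # first queue: tasks accepted for local execution
        self.forward = deque()  # second queue: tasks awaiting delegation

    def assign(self, task: str) -> None:
        # Accept the task locally while capacity remains; otherwise queue it
        # for forwarding (this overflow rule is an assumption).
        if len(self.local) < self.capacity:
            self.local.append(task)
        else:
            self.forward.append(task)
```

Separating the two queues lets the scheduler delegate overflow work without disturbing tasks the node has already committed to run.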
9. A computer program product comprising non-transitory media for managing large datasets and defining instructions, comprising:
forming a peer-to-peer (P2P) network from a physical network comprising a plurality of computing devices, the P2P network comprising a plurality of nodes and a logical network associated with the plurality of computing devices of the physical network;
generating a distributed file system, configured for:
receiving incoming files by an origination node of the plurality of nodes,
dividing the incoming files into a plurality of pages and distributing the pages to the plurality of nodes of the P2P network, by transferring a first page of the plurality of pages to an initial responsible data node, the initial responsible data node including a name defining a string of characters that shares a predetermined number of values with a hash value of the first page of the plurality of pages;
replicating each of the plurality of pages to neighboring nodes;
assigning at least one task corresponding to the plurality of pages to a first processing node; and
implementing a task scheduler for delegating the at least one task from the first processing node to a second processing node when a load value of the second processing node is less than a load value of the first processing node,
wherein the first processing node and second processing node comprise map nodes and reduce nodes to perform map tasks and reduce functions,
wherein each of the plurality of nodes in the P2P network is adapted to perform data storage, task execution, and job delegation, and
wherein a distributed hash table is implemented to assign values or names to peers of the P2P network.
- View Dependent Claims (10)
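Claim 9 states the delegation condition explicitly: the task moves to the second processing node only when its load value is less than the first node's. A minimal sketch of that comparison, generalized to choose the least-loaded peer (the peer-selection policy is an assumption; the claim only requires the load comparison):

```python
def delegate_target(loads: dict[str, float], first: str) -> str:
    # Forward the task to the least-loaded peer, but only if that peer's
    # load value is strictly less than the first processing node's
    # (claim 9's condition); otherwise the task stays where it is.
    best = min(loads, key=loads.get)
    return best if loads[best] < loads[first] else first
```

The strict inequality matters: with equal loads the task stays put, so the scheduler never pays a forwarding cost without an expected improvement.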
11. A method for managing large datasets, comprising:
forming a peer-to-peer (P2P) network comprising a plurality of nodes, each node of the plurality of nodes associated with at least one processor of a physical network;
distributing a page of a file to a first data node of the plurality of nodes;
assigning a hash value to the page;
transferring the page to an initial responsible data node, the initial responsible data node associated with a name defining a string of characters that shares a predetermined number of values with the hash value of the page;
replicating the page to at least a second data node of the plurality of nodes;
receiving a job at an originating job node associated with the P2P network;
distributing a task of the job to a first processing node;
determining whether the task should be further distributed to a second processing node; and
implementing a task scheduler for delegating the task as necessary to optimize load distribution, by:
assigning the task to a first queue of at least two queues in the first processing node, and
forwarding the task to a second processing node remote from the first processing node,
wherein the first processing node and second processing node comprise map nodes and reduce nodes to perform map tasks and reduce functions,
wherein each of the plurality of nodes in the P2P network is operable to perform data storage, task execution, and job delegation,
wherein a distributed hash table is utilized to form the P2P network, and
wherein globally unique arbitrary keys are mapped to individual nodes of the plurality of nodes to allow for a node lookup by referencing the unique arbitrary keys to accommodate random distribution of node names within the P2P network.
- View Dependent Claims (12, 13, 14, 15, 16)
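The job model running through claims 1, 9, and 11 (an input file split into pages, a map function, and a reduce function) follows the familiar MapReduce flow. A toy single-process sketch, using word count as the example map/reduce pair (the example functions are assumptions for illustration, not from the claims):

```python
from collections import defaultdict

def run_job(pages, map_fn, reduce_fn):
    # Toy end-to-end flow of the claimed job model: each page becomes one
    # map task; intermediate (key, value) pairs are grouped by key and
    # handed to the reduce function to produce the job's result set.
    intermediate = defaultdict(list)
    for page in pages:                      # one map task per page
        for key, value in map_fn(page):
            intermediate[key].append(value)
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

# Word count as the usual example map/reduce pair.
word_map = lambda page: [(w, 1) for w in page.split()]
word_reduce = lambda key, values: sum(values)
```

In the claimed system the map tasks would run on the nodes already holding the pages, so the page-placement hashing and the task routing cooperate to move computation to the data rather than data to the computation.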
Specification