METHOD AND SYSTEM FOR DISTRIBUTED MACHINE LEARNING
Abstract
Method, system, and programs for distributed machine learning on a cluster including a plurality of nodes are disclosed. A machine learning process is performed in each of the plurality of nodes based on a respective subset of training data to calculate a local parameter. The training data is partitioned over the plurality of nodes. A plurality of operation nodes are determined from the plurality of nodes based on a status of the machine learning process performed in each of the plurality of nodes. The plurality of operation nodes are connected to form a network topology. An aggregated parameter is generated by merging local parameters calculated in each of the plurality of operation nodes in accordance with the network topology.
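The abstract's data flow can be sketched in a few lines. The sketch below is illustrative only: the per-node "machine learning process" is stood in for by a simple mean over scalar training data, and every function and variable name is invented here, not taken from the patent.

```python
# Illustrative stand-in for the per-node "machine learning process":
# here, simply the mean of the node's subset of scalar training data.
def local_parameter(subset):
    return sum(subset) / len(subset)

def distributed_learn(training_data, num_nodes):
    # Partition the training data over the plurality of nodes.
    size = -(-len(training_data) // num_nodes)  # ceiling division
    subsets = [training_data[i:i + size]
               for i in range(0, len(training_data), size)]
    # Each node calculates a local parameter from its own subset.
    local_params = [local_parameter(s) for s in subsets]
    # Merge the local parameters into a single aggregated parameter.
    return sum(local_params) / len(local_params)
```

For example, `distributed_learn([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3)` partitions the data into three two-element subsets and averages the three local means.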
24 Claims
1. A method, implemented on at least one machine each of which has at least one processor, storage, and a communication platform connected to a network for distributed machine learning on a cluster including a plurality of nodes, the method comprising the steps of:
performing a machine learning process in each of the plurality of nodes based on a respective subset of training data to calculate a local parameter, wherein the training data is partitioned over the plurality of nodes;
determining a plurality of operation nodes from the plurality of nodes based on a status of the machine learning process performed in each of the plurality of nodes;
connecting the plurality of operation nodes to form a network topology; and
generating an aggregated parameter by merging local parameters calculated in each of the plurality of operation nodes in accordance with the network topology.
Dependent claims: 2, 3, 4, 5, 6.
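Read as an algorithm, the four steps of claim 1 might be sketched as follows. The interpretation of "status" as a finished/unfinished flag, the choice of a ring as the network topology, and averaging as the merge operation are all assumptions made here for illustration, not specifics from the claim.

```python
def aggregate(nodes):
    # nodes: list of (local_parameter, process_finished) pairs, one per
    # node; process_finished stands in for the "status" of the machine
    # learning process on that node. Assumes at least one node finished.
    # Step 2: determine operation nodes based on each node's status.
    op_params = [param for param, finished in nodes if finished]
    # Step 3: connect the operation nodes to form a ring network topology.
    ring = [(i, (i + 1) % len(op_params)) for i in range(len(op_params))]
    # Step 4: merge local parameters in accordance with the topology by
    # passing a running sum along the ring, then averaging.
    total, current = 0.0, 0
    for _ in ring:
        total += op_params[current]
        current = ring[current][1]
    return total / len(op_params)
```

A ring is only one possible topology; the claim leaves the topology open, and a tree or butterfly would merge the same local parameters with fewer hops.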
7. A system for distributed machine learning, the system comprising:
a plurality of nodes, each node is configured to perform a machine learning process based on a respective subset of training data to calculate a local parameter, wherein the training data is partitioned over the plurality of nodes; and
a coordination node operatively coupled to the plurality of operation nodes, configured to:
determine a plurality of operation nodes from the plurality of nodes based on a status of the machine learning process performed in each of the plurality of nodes, and
connect the plurality of operation nodes to form a network topology,
wherein the plurality of operation nodes are configured to generate an aggregated parameter by merging local parameters calculated in each of the plurality of operation nodes in accordance with the network topology.
Dependent claims: 8, 9, 10, 11, 12.
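Claim 7's division of labor, a coordination node that selects and wires up the workers, while the operation nodes themselves do the merging, could look roughly like the sketch below. The class names, the finished-flag status check, and the ring wiring are all illustrative assumptions.

```python
class Worker:
    """One node of the cluster, holding its local parameter and status."""
    def __init__(self, local_param, finished):
        self.local_param = local_param
        self.finished = finished      # status of its ML process
        self.neighbor = None          # set by the coordinator

class Coordinator:
    """Stand-in for the coordination node of claim 7."""
    def select_and_connect(self, workers):
        # Determine operation nodes from each worker's process status.
        ops = [w for w in workers if w.finished]
        # Connect them to form a ring network topology.
        for i, w in enumerate(ops):
            w.neighbor = ops[(i + 1) % len(ops)]
        return ops

def aggregate(ops):
    # The operation nodes (not the coordinator) merge local parameters in
    # accordance with the topology: a running sum passed around the ring.
    total, node = 0.0, ops[0]
    for _ in range(len(ops)):
        total += node.local_param
        node = node.neighbor
    return total / len(ops)
```

Note the design point the claim makes explicit: the coordinator only determines and connects the operation nodes; the aggregated parameter is generated by the operation nodes themselves.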
13. A machine-readable tangible and non-transitory medium having information for distributed machine learning on a cluster including a plurality of nodes recorded thereon, wherein the information, when read by the machine, causes the machine to perform the following:
partitioning training data over the plurality of nodes such that each of the plurality of nodes stores a subset of the training data, wherein a machine learning process is performed in each of the plurality of nodes based on a respective subset of the training data to calculate a local parameter;
performing a machine learning process in each of the plurality of nodes based on a respective subset of training data to calculate a local parameter, wherein the training data is partitioned over the plurality of nodes;
determining a plurality of operation nodes from the plurality of nodes based on a status of the machine learning process performed in each of the plurality of nodes;
connecting the plurality of operation nodes to form a network topology; and
generating an aggregated parameter by merging local parameters calculated in each of the plurality of operation nodes in accordance with the network topology.
Dependent claims: 14, 15, 16, 17, 18.
19. A method, implemented on at least one machine each of which has at least one processor, storage, and a communication platform connected to a network for distributed machine learning on a cluster including a plurality of nodes, the method comprising the steps of:
storing a subset of training data that is partitioned over the plurality of nodes;
performing a stochastic gradient descent process based on the subset of the training data to calculate an initial local parameter;
transmitting the initial local parameter to at least one connected node in accordance with a network topology;
receiving an initial aggregated parameter from the at least one connected node, wherein the initial aggregated parameter is calculated by merging initial local parameters calculated by each of the plurality of nodes in accordance with the network topology;
performing a batch gradient descent process based on the received initial aggregated parameter and the subset of the training data to calculate an updated local parameter; and
transmitting the updated local parameter to the at least one connected node in accordance with the network topology for calculating an updated aggregated parameter.
Dependent claim: 20.
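Claim 19's two-phase scheme, local stochastic gradient descent to warm-start, an all-reduce to average the initial local parameters, then batch gradient descent from the shared starting point, can be simulated in-process as below. The 1-D least-squares model, the learning rates, the epoch counts, and averaging as the merge are all assumptions for illustration; the all-reduce here is simulated rather than sent over a real network.

```python
def sgd(subset, lr=0.1, epochs=20):
    # Stochastic phase: update after every (x, y) example.
    w = 0.0
    for _ in range(epochs):
        for x, y in subset:
            w -= lr * 2 * x * (w * x - y)   # gradient of (w*x - y)^2
    return w

def batch_gd(subset, w, lr=0.1, epochs=20):
    # Batch phase: one update per pass over the whole local subset.
    for _ in range(epochs):
        g = sum(2 * x * (w * x - y) for x, y in subset) / len(subset)
        w -= lr * g
    return w

def all_reduce(values):
    # Simulated all-reduce: every node receives the average.
    return sum(values) / len(values)

# Two nodes, each storing a subset of data drawn from y = 2x.
subsets = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
initial_locals = [sgd(s) for s in subsets]           # per-node SGD
w0 = all_reduce(initial_locals)                      # initial aggregated parameter
updated_locals = [batch_gd(s, w0) for s in subsets]  # batch GD from w0
```

With this toy data every node's updated local parameter converges toward the true slope of 2, which is the point of the scheme: cheap noisy SGD gets each node near a good region, and the synchronized batch phase refines from a common aggregated start.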
21. An apparatus comprising:
a storage configured to store a subset of training data that is partitioned over the plurality of nodes;
an AllReducing module configured to:
transmit a local parameter to at least one connected node in accordance with a network topology, and
receive an aggregated parameter from the at least one connected node, wherein an initial aggregated parameter is calculated by merging initial local parameters calculated by each of the plurality of nodes in accordance with the network topology; and
a machine learning module configured to:
perform a stochastic gradient descent process based on the subset of the training data to calculate the initial local parameter, and
perform a batch gradient descent process based on the initial aggregated parameter and the subset of the training data to calculate an updated local parameter, wherein the updated local parameter is transmitted to the at least one connected node for calculating an updated aggregated parameter.
Dependent claim: 22.
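The "AllReducing module" of claim 21 transmits local values and receives a merged result. One common way such a module works, shown here as an in-process simulation rather than anything the claim specifies, is over a binary-tree topology: partial sums flow up to the root, and the total is broadcast back down so every node holds the same aggregated parameter. The tree layout and function name are assumptions.

```python
def tree_all_reduce(values):
    # values: one local parameter per node; node i's parent in the
    # implicit binary tree is node (i - 1) // 2, with node 0 as the root.
    n = len(values)
    partial = list(values)
    # Reduce phase: each non-root node sends its partial sum to its
    # parent, processed leaves-first so children fold in before parents.
    for i in range(n - 1, 0, -1):
        partial[(i - 1) // 2] += partial[i]
    # Broadcast phase: the root's total flows back down to every node.
    return [partial[0]] * n
```

Compared with the ring used in earlier sketches, a tree finishes the merge in O(log n) communication rounds instead of O(n), which is why all-reduce implementations often favor tree- or butterfly-shaped topologies.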
23. A machine-readable tangible and non-transitory medium having information for distributed machine learning on a cluster including a plurality of nodes recorded thereon, wherein the information, when read by the machine, causes the machine to perform the following:
storing a subset of training data that is partitioned over the plurality of nodes;
performing a stochastic gradient descent process based on the subset of the training data to calculate an initial local parameter;
transmitting the initial local parameter to at least one connected node in accordance with a network topology;
receiving an initial aggregated parameter from the at least one connected node, wherein the initial aggregated parameter is calculated by merging initial local parameters calculated by each of the plurality of nodes in accordance with the network topology;
performing a batch gradient descent process based on the received initial aggregated parameter and the subset of the training data to calculate an updated local parameter; and
transmitting the updated local parameter to the at least one connected node in accordance with the network topology for calculating an updated aggregated parameter.
Dependent claim: 24.
Specification