Scalable transport system for multicast replication
First Claim
1. A distributed data storage system that allows for available storage capacity and workload of an individual storage server to impact distribution of chunks within a cluster of storage servers, the distributed data storage system comprising:
- a network of non-blocking switches communicatively interconnecting the storage servers of the cluster;
- an initiating client comprising a client system that communicatively interconnects to the network; and
- a transport protocol implemented by the storage servers of the cluster and the initiating client for distributing a chunk within the cluster of storage servers,
- wherein a rendezvous transfer, including a rendezvous group, a designated time and a designated rate, is negotiated by exchanging unreliable datagrams amongst the initiating client and a negotiating group,
- wherein, during negotiation of the rendezvous transfer, reservations offered by storage servers in the negotiating group grant permission to use reserved bandwidth on the storage servers, wherein the reservations are offered in response to a request from the initiating client, and
- wherein the chunk is encoded in a sequence of unreliable datagrams, and the chunk is multicast by transmitting the sequence of unreliable datagrams in the rendezvous transfer at the designated time and the designated rate to the rendezvous group, which is a multicast group, such that a single transmission of the sequence of the unreliable datagrams results in reception of the chunk by multiple members of the rendezvous group.
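The negotiation recited in the claim can be illustrated with a minimal sketch. In the Python fragment below, the `Reservation` record, the `negotiate_rendezvous` function, and the earliest-start selection policy are illustrative assumptions rather than the claimed implementation; the sketch only shows how reservations offered by a negotiating group might yield a rendezvous group, a designated time, and a designated rate.

```python
# Hypothetical sketch of the rendezvous negotiation; names and policy are assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class Reservation:
    """A storage server's offer made in response to the initiating client's request."""
    server_id: str
    earliest_start: float   # seconds from now at which the reserved bandwidth is free
    rate_mbps: float        # bandwidth the server is willing to reserve

def negotiate_rendezvous(offers: List[Reservation], replicas_needed: int):
    """Pick a rendezvous group, designated time, and designated rate
    from the reservations offered by the negotiating group."""
    if len(offers) < replicas_needed:
        raise ValueError("not enough reservations offered")
    # Prefer servers that can start soonest; take as many as we need replicas.
    chosen = sorted(offers, key=lambda r: r.earliest_start)[:replicas_needed]
    designated_time = max(r.earliest_start for r in chosen)   # all members must be ready
    designated_rate = min(r.rate_mbps for r in chosen)        # slowest member bounds the rate
    rendezvous_group = [r.server_id for r in chosen]
    return rendezvous_group, designated_time, designated_rate

# Example: three servers offer reservations; two replicas are required.
offers = [
    Reservation("srv-a", earliest_start=0.5, rate_mbps=800),
    Reservation("srv-b", earliest_start=0.2, rate_mbps=950),
    Reservation("srv-c", earliest_start=1.0, rate_mbps=600),
]
group, t0, rate = negotiate_rendezvous(offers, replicas_needed=2)
print(group, t0, rate)   # ['srv-b', 'srv-a'] 0.5 800
```

Taking the latest of the chosen start times and the smallest of the offered rates is one simple way to ensure every member of the rendezvous group can accept the transfer; other selection policies are equally consistent with the claim.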
Abstract
Embodiments disclosed herein provide advantageous methods and systems that use multicast communications via unreliable datagrams sent on a protected traffic class. These methods and systems provide effectively reliable multicast delivery while avoiding the overhead associated with point-to-point protocols. Rather than an exponential scaling of point-to-point connections (with expensive setup and teardown of the connections), the traffic from one server is bounded by linear scaling of multicast groups. In addition, the multicast rendezvous disclosed herein creates an edge-managed flow control that accounts for the dynamic state of the storage servers in the cluster, without needing centralized control, management or maintenance of state. This traffic shaping avoids the loss of data due to congestion during sustained oversubscription. Other embodiments, aspects and features are also disclosed.
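As a rough illustration of the multicast rendezvous described in the abstract, the sketch below transmits a chunk as a sequence of unreliable UDP datagrams to a multicast group at a negotiated start time and a paced rate. The group address, port, datagram size, and pacing loop are assumptions for illustration only; the protected traffic class, reliability mechanisms, and edge-managed flow control of the disclosed embodiments are not shown.

```python
# Hedged sketch of the rendezvous transfer itself; parameters are illustrative assumptions.
import socket
import time

GROUP_ADDR = ("239.1.2.3", 30000)   # hypothetical multicast group for the rendezvous
DATAGRAM_SIZE = 8192                # hypothetical payload size per unreliable datagram

def multicast_chunk(chunk: bytes, designated_time: float, rate_bytes_per_s: float):
    """Send one chunk as a sequence of UDP datagrams to the rendezvous group,
    starting at designated_time (absolute time.time() value) at the designated rate."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 4)

    # Wait until the negotiated start time, then pace datagrams at the designated rate.
    time.sleep(max(0.0, designated_time - time.time()))
    interval = DATAGRAM_SIZE / rate_bytes_per_s
    for offset in range(0, len(chunk), DATAGRAM_SIZE):
        sock.sendto(chunk[offset:offset + DATAGRAM_SIZE], GROUP_ADDR)
        time.sleep(interval)   # simple pacing; real traffic shaping would be finer-grained
    sock.close()

# Example usage (commented out to avoid sending traffic when this sketch is run):
# multicast_chunk(b"\x00" * 65536, time.time() + 0.5, 100e6)
```

Because the datagrams are addressed to a multicast group, a single transmission of the sequence can be received by every member of the rendezvous group, which is the linear-scaling property highlighted in the abstract.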
Specification