Apparatus and method for packet transmission over a high speed network supporting remote direct memory access operations
Abstract
A mechanism for performing remote direct memory access (RDMA) operations between a first server and a second server over an Ethernet fabric. The RDMA operations are initiated by execution of a verb according to a remote direct memory access protocol. The verb is executed by a CPU on the first server. The apparatus includes transaction logic that is configured to process a work queue element corresponding to the verb, and that is configured to accomplish the RDMA operations over a TCP/IP interface between the first and second servers, where the work queue element resides within first host memory corresponding to the first server. The transaction logic includes transmit history information stores and a protocol engine. The transmit history information stores maintains parameters associated with said work queue element. The protocol engine is coupled to the transmit history information stores and is configured to access the parameters to enable retransmission of one or more TCP segments corresponding to the RDMA operations.
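The role of the transmit history information stores described above can be illustrated with a minimal sketch. It is not the patented implementation: the class names, the `wqe_index` key, and the simplification that TCP sequence numbers do not wrap are all assumptions made for illustration. It shows only the idea that an adapter-local FIFO of per-message entries lets a protocol engine find which message a lost byte belongs to without re-reading the work queue element from host memory.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class HistoryEntry:
    """One hypothetical transmit-history entry held in adapter-local memory."""
    wqe_index: int   # assumed key identifying the mirrored work queue element
    start_seq: int   # TCP sequence number of the first byte of the message
    final_seq: int   # TCP sequence number of the last byte of the message


class TransmitHistory:
    """FIFO of entries, oldest message first (sequence wraparound ignored)."""

    def __init__(self):
        self.fifo = deque()

    def record(self, entry):
        # Called when the first segment of a message is built.
        self.fifo.append(entry)

    def find_for_retransmit(self, seq):
        # Locate the entry whose sequence range covers the lost byte.
        for entry in self.fifo:
            if entry.start_seq <= seq <= entry.final_seq:
                return entry
        return None


history = TransmitHistory()
history.record(HistoryEntry(wqe_index=0, start_seq=1000, final_seq=1999))
history.record(HistoryEntry(wqe_index=1, start_seq=2000, final_seq=2999))
lost = history.find_for_retransmit(2500)
```

Here `lost` refers to the entry for the second message, which is the one a retransmission would have to rebuild.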
27 Claims
1. An apparatus, for performing remote direct memory access (RDMA) operations between a first server and a second server over an Ethernet fabric, the RDMA operations being initiated by execution of a verb according to a remote direct memory access protocol, the verb being executed by a central processing unit (CPU) on the first server, the apparatus comprising:
a network adapter in the first server, the network adapter including transaction logic, the transaction logic being to process a work queue element corresponding to the verb, and also being to accomplish the RDMA operations over a Transmission Control Protocol/Internet Protocol (TCP/IP) interface between the first and second servers, wherein said work queue element resides within first host memory in the first server, the first host memory being coupled to the CPU via a memory controller, the network adapter being coupled to the first host memory via both a host interface that is comprised in the network adapter and the memory controller, the first host memory to store an adapter driver to provide control of the network adapter, said transaction logic comprising;
transmit history information stores, to maintain a local copy of a subset of parameters in said work queue element, the transmit history information stores including additional parameters in addition to the parameters in the work queue element, the transmit history stores being in a local memory that is comprised in the network adapter, the local memory being separate and distinct from the first host memory within which resides the work queue element, the transmit history information stores being to store the local copy and the additional parameters in one or more entries in one or more first-in-first-out (FIFO) buffers in the transmit history information stores, the one or more FIFO buffers being dynamically bound to the work queue element residing within the first host memory; and
a protocol engine, coupled to said transmit history information stores, to access said local copy of the subset of the parameters and the additional parameters, the subset being selected so as to enable the protocol engine to rebuild, based on the local copy, for retransmission one or more TCP segments corresponding to the RDMA operations in event of network transmission error, the subset also being selected so as to enable the protocol engine to determine, based on the local copy, if the RDMA operations have been completed;
the transaction logic in the network adapter also comprising IP address logic coupled both to a medium access controller (MAC) of the network adapter and to the protocol engine, the IP address logic to contain IP address entries to be used as source IP addresses in transmitted messages, the network adapter to compare with the IP address entries a destination IP address of an inbound datagram received by the MAC, the network adapter to process the inbound datagram in accordance with a RDMA connection processing pipeline only if the destination IP address matches one of the IP address entries, the network adapter to process the inbound datagram using a TCP/IP stack if no match to the destination IP address is in the IP address entries, the transaction logic including connection correlation logic to provide, for an outgoing transmission, mapping of a work queue number to TCP/IP routing parameters, the TCP/IP routing parameters including source and destination TCP ports and source and destination IP addresses, the one or more entries in the one or more FIFO buffers in the transmit history information stores including a plurality of such entries, each respective one of the plurality of such entries including a respective field set and corresponding with a respective corresponding one of entries in the work queue element, each respective field set including a respective sendmsn field, a respective readmsn field, a respective first flag field, a respective startseqnum field, a respective finalseqnum field, a respective sackpres field, a respective notifyoncomp field, and a respective maximum upper level protocol data unit (MULPDU) field, the respective sendmsn field maintaining a current send message sequence number, the respective readmsn field maintaining a current read message sequence number, the respective startseqnum field maintaining an initial TCP sequence number of the respective one of the entries in the work queue elements, the finalseqnum field maintaining a final TCP sequence number of a message corresponding to the respective one of the entries in the work queue elements, the startseqnum field and the finalseqnum field being provided to the respective one of the plurality of entries in the one or more FIFO buffers in the transmit history information stores during creation of a first TCP segment of the message, the respective first flag field indicating whether a TCP streaming mode, other than RDMA over TCP, is being employed to perform a TCP-offload related data transaction associated with the respective corresponding one of the entries in the work queue element, the respective MULPDU field being to record a size of a MULPDU, associated with the respective corresponding one of the entries in the work queue element, that was in effect at a previous transmission time of the MULPDU, the size recorded in the MULPDU field to be used to re-segment one or more framed protocol data units (FPDU) and to rebuild one or more TCP segments that were transmitted during the previous transmission time in event of either of the following limitations numbered (1) and (2);
(1) a network error associated with the one or more TCP segments, and (2) dynamic changing of the size of the MULPDU, the one or more TCP segments that are rebuilt consisting of a partial FPDU if the size of the MULPDU has been dynamically changed, the respective sackpres field being to indicate whether the respective MULPDU field has been reduced by allocation for a maximum sized SACK block, the respective notifyoncomp field being to indicate whether completion queue element generation is to occur for the adapter after outstanding TCP message segment acknowledgement.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
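The inbound address comparison and the outbound connection correlation recited in claim 1 can be pictured with a short sketch. The table contents, function names, and return labels below are hypothetical and chosen only for illustration; the addresses come from a documentation-only range. The sketch assumes a simple set lookup for the IP address entries and a dictionary for the work-queue-number mapping, which the claim does not prescribe.

```python
import ipaddress

# Hypothetical IP address entries held by the adapter's IP address logic.
RDMA_ADDRESSES = {ipaddress.ip_address("192.0.2.10")}

# Hypothetical connection correlation table:
# work queue number -> (source IP, source port, destination IP, destination port)
CONNECTIONS = {
    7: ("192.0.2.10", 49152, "192.0.2.20", 5000),
}


def classify_inbound(dst_ip):
    """Route a datagram to the RDMA pipeline only on a destination-address match."""
    if ipaddress.ip_address(dst_ip) in RDMA_ADDRESSES:
        return "rdma_pipeline"
    return "tcpip_stack"


def routing_parameters(work_queue_number):
    """Map a work queue number to its TCP/IP routing parameters."""
    return CONNECTIONS[work_queue_number]


matched = classify_inbound("192.0.2.10")
unmatched = classify_inbound("192.0.2.99")
route = routing_parameters(7)
```

A datagram addressed to an entry in the table takes the RDMA path, while any other destination falls through to the ordinary TCP/IP stack.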
13. An apparatus, for performing remote direct memory access (RDMA) operations between a first server and a second server over an Ethernet fabric, the RDMA operations being initiated by execution of a verb according to a remote direct memory access protocol, the verb being executed by a central processing unit (CPU) on the first server, the apparatus comprising:
a first network adapter in the first server, to access a work queue element responsive to execution of the verb, and to transmit framed protocol data units (FPDUs) corresponding to the RDMA operations over a Transmission Control Protocol/Internet Protocol (TCP/IP) interface between the first and second servers, wherein the RDMA operations are responsive to said work queue element, and wherein said work queue element is provided within first host memory in the first server, the first host memory being coupled to the CPU via a memory controller, the first network adapter being coupled to the first host memory via both a host interface that is comprised in the first network adapter and the memory controller, the first host memory to store an adapter driver to provide control of the first network adapter, said first network adapter comprising;
transmit history information stores, to maintain a local copy of a subset of parameters in said work queue element, the transmit history information stores including additional parameters in addition to the parameters in the work queue element, the transmit history stores being in a local memory that is comprised in the first network adapter, the local memory being separate and distinct from the first host memory within which resides the work queue element, the transmit history information stores being to store the local copy and the additional parameters in one or more entries in one or more first-in-first-out (FIFO) buffers in the transmit history information stores, the one or more FIFO buffers being dynamically bound to the work queue element residing within the first host memory, the one or more entries in the one or more FIFO buffers in the transmit history information stores including a plurality of such entries, each respective one of the plurality of such entries including a respective field set and corresponding with a respective corresponding one of entries in the work queue element, each respective field set including a respective sendmsn field, a respective readmsn field, a respective first flag field, a respective startseqnum field, a respective finalseqnum field, a respective sackpres field, a respective notifyoncomp field, and a respective maximum upper level protocol data unit (MULPDU) field, the respective sendmsn field maintaining a current send message sequence number, the respective readmsn field maintaining a current read message sequence number, the respective startseqnum field maintaining an initial TCP sequence number of the respective one of the entries in the work queue elements, the finalseqnum field maintaining a final TCP sequence number of a message corresponding to the respective one of the entries in the work queue elements, the startseqnum field and the finalseqnum field being provided to the respective one of the plurality of entries in the one or more FIFO buffers in the transmit history information stores during creation of a first TCP segment of the message, the respective first flag field indicating whether a TCP streaming mode, other than RDMA over TCP, is being employed to perform a TCP-offload related data transaction associated with the respective corresponding one of the entries in the work queue element, the respective MULPDU field being to record a size of a MULPDU, associated with the respective corresponding one of the entries in the work queue element, that was in effect at a previous transmission time of the MULPDU, the size recorded in the MULPDU field to be used to re-segment one or more framed protocol data units (FPDU) and to rebuild one or more TCP segments that were transmitted during the previous transmission time in event of either of the following limitations numbered (1) and (2);
(1) a network error associated with the one or more TCP segments, and (2) dynamic changing of the size of the MULPDU, the one or more TCP segments that are rebuilt consisting of a partial FPDU if the size of the MULPDU has been dynamically changed, the respective sackpres field being to indicate whether the respective MULPDU field has been reduced by allocation for a maximum sized SACK block, the respective notifyoncomp field being to indicate whether completion queue element generation is to occur for the first network adapter after outstanding TCP message segment acknowledgement; and
a protocol engine, coupled to said transmit history information stores, to access said local copy of the subset of the parameters and the additional parameters, the subset being selected so as to enable the protocol engine to rebuild, based on the local copy, for retransmission one or more TCP segments corresponding to a subset of said FPDUs in the event of network transmission error, the subset also being selected so as to enable the protocol engine to determine, based on the local copy, if the RDMA operations have been completed;
the first network adapter also comprising IP address logic coupled both to a medium access controller (MAC) of the first network adapter and to the protocol engine, the IP address logic to contain IP address entries to be used as source IP addresses in transmitted messages, the first network adapter to compare with the IP address entries a destination IP address of an inbound datagram received by the MAC, the first network adapter to process the inbound datagram in accordance with a RDMA connection processing pipeline only if the destination IP address matches one of the IP address entries, the first network adapter to process the inbound datagram using a TCP/IP stack if no match to the destination IP address is in the IP address entries, the first network adapter including connection correlation logic to provide, for an outgoing transmission, mapping of a work queue number to TCP/IP routing parameters, the TCP/IP routing parameters including source and destination TCP ports and source and destination IP addresses; and
a second network adapter, to receive said FPDUs, wherein reception of said FPDUs includes receiving said one or more TCP segments.
Dependent claims: 14, 15, 16, 17, 18, 19
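The purpose of recording the MULPDU size in each history entry, as recited above, can be shown with a small worked example. This is a simplified sketch under stated assumptions: the function name is hypothetical, boundaries are expressed as byte offsets within the message payload, and FPDU framing overhead (headers, markers, CRC) and the partial-FPDU case are deliberately left out. It only demonstrates that re-segmenting with the recorded size reproduces the original boundaries even after the current MULPDU has changed.

```python
def rebuild_boundaries(message_len, recorded_mulpdu):
    """Return (offset, length) pairs for each payload chunk of a message.

    recorded_mulpdu is the size that was in effect when the message was
    first transmitted, taken from the history entry rather than from the
    connection's current setting.
    """
    boundaries = []
    offset = 0
    while offset < message_len:
        length = min(recorded_mulpdu, message_len - offset)
        boundaries.append((offset, length))
        offset += length
    return boundaries


# The message was first sent while the MULPDU was 1024 bytes.
original = rebuild_boundaries(2500, 1024)

# If the current MULPDU has since shrunk to 512, using it would yield
# different boundaries than the segments already on the wire.
with_current = rebuild_boundaries(2500, 512)
```

With the recorded size, the rebuilt chunks line up with what the receiver may already hold; with the current size they would not, which is why the sketch reads the size from the history entry.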
20. An apparatus, for performing remote direct memory access (RDMA) operations between a first server and a second server over an Ethernet fabric, the RDMA operations being initiated by execution of a verb according to a remote direct memory access protocol, the verb being executed by a central processing unit (CPU) on the first server, the apparatus comprising:
a network adapter in the first server, the network adapter comprising transaction logic to process a work queue element corresponding to the verb, and to accomplish the RDMA operations over a Transmission Control Protocol/Internet Protocol (TCP/IP) interface between the first and second servers, wherein said work queue element resides within first host memory in the first server, the first host memory being coupled to the CPU via a memory controller, the network adapter being coupled to the first host memory via both a host interface that is comprised in the network adapter and the memory controller, the first host memory to store an adapter driver to provide control of the network adapter, said transaction logic comprising;
transmit history information stores to maintain a local copy of a subset of parameters in said work queue element, the transmit history information stores including additional parameters in addition to the parameters in the work queue element, the transmit history stores being in a local memory that is comprised in the network adapter, the local memory being separate and distinct from the first host memory within which resides the work queue element, the transmit history information stores being to store the local copy and the additional parameters in one or more entries in one or more first-in-first-out (FIFO) buffers in the transmit history information stores, the one or more FIFO buffers being dynamically bound to the work queue element residing within the first host memory, the one or more entries in the one or more FIFO buffers in the transmit history information stores including a plurality of such entries, each respective one of the plurality of such entries including a respective field set and corresponding with a respective corresponding one of entries in the work queue element, each respective field set including a respective sendmsn field, a respective readmsn field, a respective first flag field, a respective startseqnum field, a respective finalseqnum field, a respective sackpres field, a respective notifyoncomp field, and a respective maximum upper level protocol data unit (MULPDU) field, the respective sendmsn field maintaining a current send message sequence number, the respective readmsn field maintaining a current read message sequence number, the respective startseqnum field maintaining an initial TCP sequence number of the respective one of the entries in the work queue elements, the finalseqnum field maintaining a final TCP sequence number of a message corresponding to the respective one of the entries in the work queue elements, the startseqnum field and the finalseqnum field being provided to the respective one of the plurality of entries in the one or more FIFO buffers in the transmit history information stores during creation of a first TCP segment of the message, the respective first flag field indicating whether a TCP streaming mode, other than RDMA over TCP, is being employed to perform a TCP-offload related data transaction associated with the respective corresponding one of the entries in the work queue element, the respective MULPDU field being to record a size of a MULPDU, associated with the respective corresponding one of the entries in the work queue element, that was in effect at a previous transmission time of the MULPDU, the size recorded in the MULPDU field to be used to re-segment one or more framed protocol data units (FPDU) and to rebuild one or more TCP segments that were transmitted during the previous transmission time in event of either of the following limitations numbered (1) and (2);
(1) a network error associated with the one or more TCP segments, and (2) dynamic changing of the size of the MULPDU, the one or more TCP segments that are rebuilt consisting of a partial FPDU if the size of the MULPDU has been dynamically changed, the respective sackpres field being to indicate whether the respective MULPDU field has been reduced by allocation for a maximum sized SACK block, the respective notifyoncomp field being to indicate whether completion queue element generation is to occur for the adapter after outstanding TCP message segment acknowledgement; and
a protocol engine, coupled to said transmit history information stores, to access said MULPDU, the protocol engine accessing said local copy of the subset of the parameters and the additional parameters, the subset being selected so as to enable the protocol engine to rebuild, based on the local copy, for retransmission one or more TCP segments corresponding to the RDMA operations in event of network transmission error, the subset also being selected so as to enable the protocol engine to determine, based on the local copy, if the RDMA operations have been completed;
the transaction logic in the network adapter also comprising IP address logic coupled both to a medium access controller (MAC) of the network adapter and to the protocol engine, the IP address logic to contain IP address entries to be used as source IP addresses in transmitted messages, the network adapter to compare with the IP address entries a destination IP address of an inbound datagram received by the MAC, the network adapter to process the inbound datagram in accordance with a RDMA connection processing pipeline only if the destination IP address matches one of the IP address entries, the network adapter to process the inbound datagram using a TCP/IP stack if no match to the destination IP address is in the IP address entries, the transaction logic comprising connection correlation logic to provide, for an outgoing transmission, mapping of a work queue number to TCP/IP routing parameters, the TCP/IP routing parameters including source and destination TCP ports and source and destination IP addresses.
Dependent claims: 21
22. A method for performing remote direct memory access (RDMA) operations between a first server and a second server over an Ethernet fabric, the RDMA operations being initiated by execution of a verb according to a remote direct memory access protocol, the verb being executed by a central processing unit (CPU) on the first server, the method comprising:
processing by a network adapter in the first server a work queue element corresponding to the verb, wherein the work queue element resides within a work queue that is within first host memory in the first server, the first host memory being coupled to the CPU via a memory controller, the network adapter being coupled to the first host memory via both a host interface that is comprised in the network adapter and the memory controller, the first host memory to store an adapter driver to provide control of the network adapter; and
accomplishing by the network adapter the RDMA operations over a Transmission Control Protocol/Internet Protocol (TCP/IP) interface between the first and second servers, wherein said accomplishing comprises;
maintaining in transmission history information stores a local copy of a subset of parameters in the work queue element, the transmission history information stores including additional parameters in addition to the parameters in the work queue element, the transmit history stores being in a local memory that is comprised in the network adapter, the local memory being separate and distinct from the first host memory within which resides the work queue element, the transmit history information stores being to store the local copy and the additional parameters in one or more entries in one or more first-in-first-out (FIFO) buffers in the transmit history information stores, the one or more FIFO buffers being dynamically bound to the work queue element residing within the first host memory, the one or more entries in the one or more FIFO buffers in the transmit history information stores including a plurality of such entries, each respective one of the plurality of such entries including a respective field set and corresponding with a respective corresponding one of entries in the work queue element, each respective field set including a respective sendmsn field, a respective readmsn field, a respective first flag field, a respective startseqnum field, a respective finalseqnum field, a respective sackpres field, a respective notifyoncomp field, and a respective maximum upper level protocol data unit (MULPDU) field, the respective sendmsn field maintaining a current send message sequence number, the respective readmsn field maintaining a current read message sequence number, the respective startseqnum field maintaining an initial TCP sequence number of the respective one of the entries in the work queue elements, the finalseqnum field maintaining a final TCP sequence number of a message corresponding to the respective one of the entries in the work queue elements, the startseqnum field and the finalseqnum field being provided to the respective one of the plurality of entries in the one or more FIFO buffers in the transmit history information stores during creation of a first TCP segment of the message, the respective first flag field indicating whether a TCP streaming mode, other than RDMA over TCP, is being employed to perform a TCP-offload related data transaction associated with the respective corresponding one of the entries in the work queue element, the respective MULPDU field being to record a size of a MULPDU, associated with the respective corresponding one of the entries in the work queue element, that was in effect at a previous transmission time of the MULPDU, the size recorded in the MULPDU field to be used to re-segment one or more framed protocol data units (FPDU) and to rebuild one or more TCP segments that were transmitted during the previous transmission time in event of either of the following limitations numbered (1) and (2);
(1) a network error associated with the one or more TCP segments, and (2) dynamic changing of the size of the MULPDU, the one or more TCP segments that are rebuilt consisting of a partial FPDU if the size of the MULPDU has been dynamically changed, the respective sackpres field being to indicate whether the respective MULPDU field has been reduced by allocation for a maximum sized SACK block, the respective notifyoncomp field being to indicate whether completion queue element generation is to occur for the network adapter after outstanding TCP message segment acknowledgement; and
accessing the local copy of the subset of the parameters and the additional parameters, the subset being selected so as to enable the network adapter to rebuild, based on the local copy, for retransmission one or more TCP segments corresponding to the RDMA operations in event of network transmission error, the subset also being selected so as to enable the network adapter to determine, based on the local copy, if the RDMA operations have been completed;
comparing, by the network adapter, IP address entries with a destination IP address of an inbound datagram received by a medium access controller (MAC) of the network adapter, the IP address entries being contained in IP address logic coupled both to the MAC and to a protocol engine in the network adapter, the IP address entries to be used as source IP addresses in transmitted messages, the network adapter to process the inbound datagram in accordance with a RDMA connection processing pipeline only if the destination IP address matches one of the IP address entries, the network adapter to process the inbound datagram using a TCP/IP stack if no match to the destination IP address is in the IP address entries, the network adapter comprising connection correlation logic to provide, for an outgoing transmission, mapping of a work queue number to TCP/IP routing parameters, the TCP/IP routing parameters including source and destination TCP ports and source and destination IP addresses.
Dependent claims: 23, 24, 25, 26, 27
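The step of determining, from the local copy, whether RDMA operations have completed can be sketched as follows. The sketch makes several assumptions that the claim does not state: entries are plain dictionaries with a hypothetical `wqe` key, `finalseqnum` is taken to be the sequence number of the last byte of the message, the acknowledgement number is the next byte the peer expects, and sequence comparison uses modulo-2^32 arithmetic so that wraparound is handled.

```python
from collections import deque

MOD = 2 ** 32


def seq_after(a, b):
    """True if sequence number a is later than b in modulo-2**32 space."""
    return 0 < (a - b) % MOD < 2 ** 31


def retire_acked(fifo, ack):
    """Pop fully acknowledged entries; return the ones that need a completion."""
    completions = []
    while fifo and seq_after(ack, fifo[0]["finalseqnum"]):
        entry = fifo.popleft()
        # A completion queue element is produced only when the entry asks for it.
        if entry["notifyoncomp"]:
            completions.append(entry["wqe"])
    return completions


fifo = deque([
    {"wqe": 0, "finalseqnum": 0xFFFFFFF0, "notifyoncomp": True},
    {"wqe": 1, "finalseqnum": 0x00000010, "notifyoncomp": False},
    {"wqe": 2, "finalseqnum": 0x00000100, "notifyoncomp": True},
])
completed = retire_acked(fifo, ack=0x00000011)
```

In this run the acknowledgement covers the first two messages, one of which straddles the sequence wrap, so only the first yields a completion and the third entry stays queued.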
Specification