
RDMA over Converged Ethernet (RoCE), also called InfiniBand over Ethernet (IBoE), is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. It does this by encapsulating an InfiniBand (IB) transport packet over Ethernet. There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol, which means that RoCE v2 packets can be routed. Although the RoCE protocol benefits from the characteristics of a converged Ethernet network, it can also be used on a traditional or non-converged Ethernet network.


Background

Network-intensive applications like networked storage or cluster computing need a network infrastructure with high bandwidth and low latency. The advantages of RDMA over other network application programming interfaces such as Berkeley sockets are lower latency, lower CPU load and higher bandwidth. The RoCE protocol allows lower latencies than its predecessor, the iWARP protocol. There are RoCE host channel adapters (HCAs) with a latency as low as 1.3 microseconds, while the lowest known iWARP HCA latency in 2011 was 3 microseconds.


RoCE v1

The RoCE v1 protocol is an Ethernet link layer protocol with Ethertype 0x8915. This means that the frame length limits of the Ethernet protocol apply: 1500 bytes for a regular Ethernet frame and 9000 bytes for a jumbo frame.
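
Because RoCE v1 is identified purely by its Ethertype, a receiver can recognize it from the Ethernet header alone. The following is a minimal sketch of that check, not a reference implementation: the function names `ethertype_of` and `is_roce_v1` are hypothetical, the sample frames use zeroed MAC addresses and a dummy payload, and the handling of a single IEEE 802.1Q tag (TPID 0x8100) is an added assumption so that VLAN-tagged frames are also matched.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # Ethertype stated in the text
VLAN_TPID = 0x8100          # IEEE 802.1Q tag identifier (assumption: at most one tag)


def ethertype_of(frame: bytes) -> int:
    """Return the Ethertype of an Ethernet II frame, skipping one 802.1Q tag."""
    if len(frame) < 14:
        raise ValueError("frame is shorter than an Ethernet header")
    # Bytes 0-5: destination MAC, 6-11: source MAC, 12-13: Ethertype.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype == VLAN_TPID:
        if len(frame) < 18:
            raise ValueError("truncated 802.1Q tag")
        # The real Ethertype follows the 4-byte VLAN tag.
        (ethertype,) = struct.unpack_from("!H", frame, 16)
    return ethertype


def is_roce_v1(frame: bytes) -> bool:
    """True if the frame carries the RoCE v1 Ethertype."""
    return ethertype_of(frame) == ROCE_V1_ETHERTYPE


# Hypothetical frames: zeroed MAC addresses and a dummy payload.
untagged = bytes(12) + struct.pack("!H", ROCE_V1_ETHERTYPE) + b"payload"
tagged = bytes(12) + struct.pack("!HHH", VLAN_TPID, 100, ROCE_V1_ETHERTYPE) + b"payload"
ipv4 = bytes(12) + struct.pack("!H", 0x0800) + b"payload"

print(is_roce_v1(untagged), is_roce_v1(tagged), is_roce_v1(ipv4))
```
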


RoCE v1.5

RoCE v1.5 is an uncommon, experimental, non-standardized protocol that is based on IP. RoCE v1.5 uses the IP protocol field to differentiate its traffic from other IP protocols such as TCP and UDP. The value used for the protocol number is unspecified and is left to the deployment to select.


RoCE v2

The RoCE v2 protocol exists on top of either the UDP/IPv4 or the UDP/IPv6 protocol. The UDP destination port number 4791 has been reserved for RoCE v2. Since RoCE v2 packets are routable, the RoCE v2 protocol is sometimes called Routable RoCE or RRoCE. Although in general the delivery order of UDP packets is not guaranteed, the RoCE v2 specification requires that packets with the same UDP source port and the same destination address must not be reordered. In addition, RoCE v2 defines a congestion control mechanism that uses the IP ECN bits for marking and congestion notification packet (CNP) frames for the acknowledgment notification. Software support for RoCE v2 is still emerging. Mellanox OFED 2.3 and later support RoCE v2, as does Linux kernel v4.5.
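
The two header fields mentioned above, the UDP destination port and the IP ECN bits, can both be read from a raw packet. The sketch below assumes IPv4 only and ignores fragmentation and checksums; `inspect_ipv4` and `make_packet` are hypothetical helper names, and the addresses and source port in the sample packet are made up for illustration.

```python
import struct
from typing import Tuple

ROCE_V2_UDP_PORT = 4791  # UDP destination port reserved for RoCE v2 (from the text)
IPPROTO_UDP = 17         # IP protocol number for UDP
ECN_CE = 0b11            # ECN "congestion experienced" codepoint


def inspect_ipv4(packet: bytes) -> Tuple[bool, int]:
    """Return (looks_like_roce_v2, ecn_bits) for a raw IPv4 packet."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    header_len = (packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    ecn = packet[1] & 0b11               # ECN is the low two bits of the TOS byte
    if header_len < 20 or packet[9] != IPPROTO_UDP or len(packet) < header_len + 8:
        return False, ecn
    # UDP header: source port (2 bytes), then destination port (2 bytes).
    (dst_port,) = struct.unpack_from("!H", packet, header_len + 2)
    return dst_port == ROCE_V2_UDP_PORT, ecn


def make_packet(dst_port: int, ecn: int) -> bytes:
    """Build a hypothetical IPv4/UDP packet (checksums left as zero)."""
    udp = struct.pack("!HHHH", 49152, dst_port, 8, 0)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, ecn, 20 + len(udp), 0, 0, 64,
                     IPPROTO_UDP, 0, bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    return ip + udp


is_roce, ecn = inspect_ipv4(make_packet(ROCE_V2_UDP_PORT, ECN_CE))
print(is_roce, ecn == ECN_CE)
```

A packet that matches the port and carries the congestion-experienced codepoint is the case in which a receiver would be expected to answer with a CNP; generating the CNP itself is outside the scope of this sketch.
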


RoCE versus InfiniBand

RoCE defines how to perform RDMA over Ethernet, while the InfiniBand architecture specification defines how to perform RDMA over an InfiniBand network. RoCE was expected to bring InfiniBand applications, which are predominantly based on clusters, onto a common Ethernet converged fabric. Others expected that InfiniBand would keep offering higher bandwidth and lower latency than is possible over Ethernet.

The technical differences between the RoCE and InfiniBand protocols are:

* Link-level flow control: InfiniBand uses a credit-based algorithm to guarantee lossless HCA-to-HCA communication. RoCE runs on top of Ethernet, and implementations may require a lossless Ethernet network to reach performance characteristics similar to InfiniBand. Lossless Ethernet is typically configured via Ethernet flow control or priority flow control (PFC). Configuring a Data Center Bridging (DCB) Ethernet network can be more complex than configuring an InfiniBand network.
* Congestion control: InfiniBand defines congestion control based on FECN/BECN marking. RoCE v2 defines a congestion control protocol that uses ECN for marking, as implemented in standard switches, and CNP frames for acknowledgments.
* InfiniBand switches typically have lower latency than Ethernet switches. Port-to-port latency for one particular type of Ethernet switch is 230 ns, versus 100 ns for an InfiniBand switch with the same number of ports.


RoCE versus iWARP

While the RoCE protocols define how to perform RDMA using Ethernet and UDP/IP frames, the iWARP protocol defines how to perform RDMA over a connection-oriented transport like the Transmission Control Protocol (TCP). RoCE v1 is limited to a single Ethernet broadcast domain. RoCE v2 and iWARP packets are routable. The memory requirements of a large number of connections, along with TCP's flow and reliability controls, lead to scalability and performance issues when using iWARP in large-scale datacenters and for large-scale applications (e.g., large-scale enterprises, cloud computing, web 2.0 applications). Also, multicast is defined in the RoCE specification, while the current iWARP specification does not define how to perform multicast RDMA.

Reliability in iWARP is given by the protocol itself, as TCP is reliable. RoCE v2, on the other hand, uses UDP, which has a far smaller overhead and better performance but does not provide inherent reliability, so reliability must be implemented alongside RoCE v2. One solution is to use converged Ethernet switches to make the local area network reliable. This requires converged Ethernet support on all the switches in the local area network and prevents RoCE v2 packets from traveling through a wide area network such as the internet, which is not reliable. Another solution is to add reliability to the RoCE protocol (i.e., reliable RoCE), which adds handshaking to RoCE to provide reliability at the cost of performance.

The question of which protocol is better depends on the vendor. Chelsio recommends and exclusively supports iWARP. Mellanox, Xilinx, and Broadcom recommend and exclusively support RoCE/RoCE v2. Intel initially supported iWARP but now supports both iWARP and RoCE v2. Other vendors in the networking industry, such as Marvell, Microsoft, Linux and Kazan, support both protocols. Cisco supports both RoCE and its own VIC RDMA protocol. Both protocols are standardized, with iWARP being the standard for RDMA over TCP defined by the IETF and RoCE being the standard for RDMA over Ethernet defined by the IBTA.


Criticism

Some aspects that could have been defined in the RoCE specification have been left out. These are:

* How to translate between primary RoCE v1 GIDs and Ethernet MAC addresses.
* How to translate between secondary RoCE v1 GIDs and Ethernet MAC addresses. It is not clear whether it is possible to implement secondary GIDs in the RoCE v1 protocol without adding a RoCE-specific address resolution protocol.
* How to implement VLANs for the RoCE v1 protocol. Current RoCE v1 implementations store the VLAN ID in the twelfth and thirteenth byte of the sixteen-byte GID, although the RoCE v1 specification does not mention VLANs at all.
* How to translate between RoCE v1 multicast GIDs and Ethernet MAC addresses. Implementations in 2010 used the same address mapping that has been specified for mapping IPv6 multicast addresses to Ethernet MAC addresses.
* How to restrict RoCE v1 multicast traffic to a subset of the ports of an Ethernet switch. As of September 2013, an equivalent of the Multicast Listener Discovery protocol has not yet been defined for RoCE v1.

In addition, any protocol running over IP cannot assume the underlying network has guaranteed ordering, any more than it can assume congestion cannot occur. It is known that the use of PFC can lead to a network-wide deadlock.
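
Two of the implementation conventions described in the list above can be illustrated in a few lines. This is a sketch under stated assumptions rather than a description of any particular driver: the multicast mapping follows the IPv6-over-Ethernet rule (prefix 33:33 followed by the last four bytes of the address), the VLAN ID is assumed to be stored big-endian in the twelfth and thirteenth bytes (zero-based indices 11 and 12), and treating values above the 12-bit VLAN range as "no VLAN" is a further assumption. The function names are hypothetical.

```python
import ipaddress
from typing import Optional


def multicast_gid_to_mac(gid: bytes) -> str:
    """Map a 16-byte multicast GID to a MAC address using the IPv6 multicast rule."""
    if len(gid) != 16:
        raise ValueError("a GID is sixteen bytes long")
    if gid[0] != 0xFF:
        raise ValueError("not a multicast GID")
    # IPv6 multicast mapping: 33:33 followed by the low-order 32 bits of the address.
    mac = bytes([0x33, 0x33]) + gid[12:16]
    return ":".join(f"{b:02x}" for b in mac)


def vlan_id_from_gid(gid: bytes) -> Optional[int]:
    """Read the VLAN ID from the twelfth and thirteenth GID bytes (assumed big-endian)."""
    if len(gid) != 16:
        raise ValueError("a GID is sixteen bytes long")
    vid = (gid[11] << 8) | gid[12]
    # Assumption: values outside the 12-bit VLAN ID range mean "no VLAN encoded".
    return vid if vid < 0x1000 else None


# Hypothetical multicast GID, written in IPv6 notation for convenience.
mcast_gid = ipaddress.IPv6Address("ff02::1:ff00:1234").packed
print(multicast_gid_to_mac(mcast_gid))

# Hypothetical unicast GID with VLAN ID 100 placed in bytes 12 and 13 (1-based).
unicast_gid = bytearray(16)
unicast_gid[0:2] = b"\xfe\x80"
unicast_gid[11:13] = (100).to_bytes(2, "big")
print(vlan_id_from_gid(bytes(unicast_gid)))
```
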


Vendors

Some vendors of RoCE-enabled equipment include:

* Mellanox (acquired by Nvidia in 2020, brand retained)
* Emulex (acquired by Broadcom)
* Broadcom
* QLogic (acquired by Cavium, rebranded)
* Cavium (acquired by Marvell Technology Group, rebranded)
* Huawei
* ATTO Technology
* Dell Technologies
* Intel
* Bloombase
* Xilinx (via FPGA soft IP core)
* Grovf

