Wired magazine online ran a series of articles earlier in the year reporting that over 60% of the network ports in the data centers run by the internet giants are now 10GbE. The potential latency of Infiniband (a similar but slightly older technology) is known to be under 5 microseconds, while latencies on 1GbE typically hover between 30 and 150 microseconds (latency depends on the size of the packet payload).
Pushing a network interface to handle this much data requires a modern computer with a PCIe slot that has enough lanes (10GbE cards typically use an x8 slot, so the x16 slot intended for a graphics card is sufficient). Achieving the lower latencies this new hardware is capable of is challenging for the operating system (in my case Linux), since the TCP/IP stack and the Berkeley sockets API start to become a bottleneck. Almost every manufacturer has attempted to solve this problem in its own way, providing a software work-around that achieves higher performance than what is directly available via the kernel and the standard API.
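Before benchmarking, it is worth confirming that the card actually negotiated the x8 link it needs. A minimal check with lspci (the PCI address 03:00.0 below is only an example; use whatever address lspci reports for your card):
# lspci | grep -i ethernet              # find the card's PCI address
# lspci -vv -s 03:00.0 | grep -i lnk    # LnkCap/LnkSta should report Width x8
If LnkSta shows a narrower width than LnkCap, the card is sitting in a slot with fewer active lanes than it can use.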
Method
To test a pair of cards, I plugged them into the x16 slots on two cluster nodes and cabled them directly to each other (so no switch in between). I then configured them for ethernet, assigned IP addresses, and ran some benchmarks:
# modprobe
# service openib start
# ifconfig eth1 192.168.3.10
# iperf -s
# NPtcp
And on the other node:
# modprobe
# service openib start
# ifconfig eth1 192.168.3.11
# iperf -c 192.168.3.10
# NPtcp -h 192.168.3.10
The iperf test mostly checks bandwidth and for me is just a basic sanity test. The more interesting test is NetPIPE (NPtcp), which measures latency over a range of message sizes.
Testing RDMA latency on a card that provides it is a simple matter of running a bundled utility (-s indicates payload bytes and -t is the number of iterations):
# rdma_lat -c -s32 -t500 192.168.3.10
2935: Local address:  LID 0000, QPN 000000, PSN 0xb1c951 RKey 0x70001901 VAddr 0x00000001834020
2935: Remote address: LID 0000, QPN 000000, PSN 0xcbb0d3, RKey 0x001901 VAddr 0x0000000165b020
Latency typical: 0.984693 usec
Latency best   : 0.929267 usec
Latency worst  : 15.8892 usec
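The listing above is the client side; the peer node needs to have the matching server running first. With the perftest utilities this is normally the same command minus the peer's address, though the exact invocation can vary with the OFED release, so treat the line below as an assumption and check the utility's usage output:
# rdma_lat -c -s32 -t500    # server side: listen for the client shown above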
Mellanox
This card is widely used and very popular. Mellanox has considerable experience with Infiniband products, and has been able to produce cards capable of transporting TCP/IP traffic on top of Infiniband technology. For them this is mostly accomplished with kernel drivers that are part of the OFED software stack. I found that it is best to get a snapshot of this suite of packages from Mellanox directly for one of the particular distributions they support (all of them, ultimately, variations on Red Hat Linux). Although Debian Wheezy had OFED packages in its repository, they were not recent enough for one of the newer cards I was trying. For these reasons, I ended up dual booting my cluster into Oracle Enterprise Linux (OEL 6.1, specifically). Debian Wheezy was able to run this card as an ethernet interface (using the kernel TCP/IP stack); it's just that the fancier things like Infiniband and RDMA were not accessible.
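Once the OFED stack is up, a couple of the bundled utilities are handy for confirming that the right kernel modules are loaded and the port is actually active before trusting any benchmark numbers:
# lsmod | grep mlx4    # mlx4_core plus mlx4_en (ethernet) and/or mlx4_ib (Infiniband)
# ibv_devinfo          # port state and link layer as seen by libibverbs
# ibstat               # the same information, from the infiniband-diags package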
I also managed to test a Mellanox ConnectX3 card, but I found that its performance was not (statistically) discernible from that of the ConnectX2. If you asked me to figure out which card was in a box from its benchmarks, I would not be able to separate the ConnectX2 and ConnectX3 -- although presumably the newer revision has some advantage that I did not find.
Solarflare
Solarflare makes several models of 10GbE cards. The value Solarflare adds is mostly in its OpenOnload technology: an alternative network stack that runs mostly in user space. This software accesses a so-called virtual NIC interface on the card to speed up network interaction by bypassing the standard kernel TCP/IP stack. Just like the Mellanox cards, I found that Debian Wheezy could recognize the cards and run them with the Linux kernel TCP stack, but the special drivers (OpenOnload) needed to run on OEL (I hope to attempt to build the sfc kernel driver on Wheezy soon).
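Getting accelerated numbers out of the card is then mostly a matter of launching the same benchmark under the onload wrapper so that its socket calls are intercepted; presumably something along these lines (the wrapper name comes from Solarflare's OpenOnload documentation, not from anything special in my setup):
# onload NPtcp                     # on the first node
# onload NPtcp -h 192.168.3.10     # on the second node
Both ends should be accelerated, otherwise the unaccelerated side's kernel stack still contributes to the measured round trip.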
Measurements and Summary
Below is a summary of the measurements I made on these cards using the various TCP stacks (vanilla means the stock Linux 3.2.0 kernel stack) and RDMA:
Communication Type | Card | Distro | Kernel Module | Message Size (bytes) | Latency (usec) |
---|---|---|---|---|---|
vanilla TCP | Solarflare | OEL | sfc | 32 | 17 |
vanilla TCP | Solarflare | OEL | sfc | 1024 | 18 |
vanilla TCP | Mellanox ConnectX3 | OEL | mlx4_en | 32 | 12 |
vanilla TCP | Mellanox ConnectX3 | OEL | mlx4_en | 1024 | 16 |
vanilla TCP | Mellanox ConnectX2 | OEL | mlx4_en | 32 | 13 |
vanilla TCP | Mellanox ConnectX2 | OEL | mlx4_en | 1024 | 9 |
Onload userspace TCP | Solarflare | OEL | sfc | 32 | 2.4 |
Onload userspace TCP | Solarflare | OEL | sfc | 1024 | 3.6 |
RDMA | Mellanox ConnectX3 | OEL | mlx4_ib | 32 | 1.0 |
RDMA | Mellanox ConnectX3 | OEL | mlx4_ib | 1024 | 3.0 |
RDMA | Mellanox ConnectX2 | OEL | mlx4_ib | 32 | 1.0 |
RDMA | Mellanox ConnectX2 | OEL | mlx4_ib | 1024 | 3.0 |
The big surprise in this investigation is OpenOnload (more info can be had from this presentation). The driver is activated selectively through user-space system call interposition, so you can choose which applications run on it. It does not require the application to be rewritten, recompiled or rebuilt in any way. This means, in particular, that closed source third party software can use it. It is this extra flexibility that really has my attention. Without coding to a fancy/complicated API, a developer can use familiar programming tools to build systems with low networking latency.
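For a third-party binary, the interposition can be done with the same onload launcher, or by preloading the library directly; a sketch, assuming a stock OpenOnload install and its EF_POLL_USEC spin tunable (the application name here is made up):
# onload ./some_proprietary_app                      # the wrapper sets up the preload for you
# LD_PRELOAD=libonload.so ./some_proprietary_app     # the same thing done by hand
# EF_POLL_USEC=100000 onload ./some_proprietary_app  # busy-poll the socket instead of sleeping
Environment variables like EF_POLL_USEC are how onload is tuned; the trade-off is burning a core on polling in exchange for shaving wakeup latency off each receive.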