ZFSBuild2012 – InfiniBand Performance

We love InfiniBand.  But it is not enough to simply install InfiniBand.  We tested three popular connection options so we could better understand which method offers the best performance: IPoIB with connected mode enabled (IPoIB-CM), IPoIB with connected mode disabled (IPoIB-UD), and SRP.

IPoIB stands for IP over InfiniBand.  IPoIB is a TCP/IP networking stack that runs on top of the InfiniBand fabric; without IPoIB, you cannot carry IP-based network traffic over an InfiniBand network.  In Connected Mode, a reliable, TCP-style connection is used, which allows for packets as large as 64KB.

When Connected Mode is disabled, the interface runs in Unreliable Datagram mode.  In Unreliable Datagram mode, UDP-style (rather than TCP-style) packets are used.  The maximum packet size with IPoIB-UD is 2KB.
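On a Linux client, the IPoIB mode can be toggled per interface through sysfs. This is only a sketch assuming an interface named ib0; interface names and exact MTU limits vary by distro and driver:

```shell
# Check the current IPoIB mode (prints "connected" or "datagram")
cat /sys/class/net/ib0/mode

# Enable connected mode (IPoIB-CM) and raise the MTU toward 64KB
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520

# Disable connected mode (IPoIB-UD); MTU drops back to the 2KB range
echo datagram > /sys/class/net/ib0/mode
ip link set ib0 mtu 2044
```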

With both IPoIB-CM and IPoIB-UD, you need to use iSCSI to connect to the ZFSBuild2012 SAN.
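Connecting over IPoIB is ordinary iSCSI. On a Linux initiator it might look like the following; the 192.168.10.1 address and the IQN are placeholders for illustration, not values from our setup:

```shell
# Discover iSCSI targets over the IPoIB interface
iscsiadm -m discovery -t sendtargets -p 192.168.10.1:3260

# Log in to a discovered target (example IQN; substitute your own)
iscsiadm -m node -T iqn.2012-12.com.example:zfsbuild-vol1 \
    -p 192.168.10.1:3260 --login
```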

SRP stands for SCSI RDMA Protocol (SCSI over Remote Direct Memory Access).  If you are really serious about performance and your environment supports SRP, you definitely want to use SRP.  With SRP, you don’t use iSCSI or TCP/IP; SRP operates at the fabric level.  The ZFSBuild2012 server registers its available SRP targets with the InfiniBand subnet manager.  Your SRP-enabled clients then ask the subnet manager about the SRP targets.  From the client side, an SRP target looks like a local drive.  Xen admins can create LVM volumes on the drive.  Windows/Hyper-V admins can create a Cluster Shared Volume on the SRP drive.
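On a Linux client with the OFED stack, SRP target discovery and attachment is typically done with ibsrpdm and srp_daemon. A sketch, assuming a Mellanox HCA; device names and the target string fields depend on your fabric:

```shell
# List SRP targets visible through the first HCA port
ibsrpdm -c -d /dev/infiniband/umad0

# Attach a discovered target by writing its target string (taken
# from the ibsrpdm output above) into the SRP initiator's sysfs node
echo "id_ext=...,ioc_guid=...,dgid=...,pkey=ffff,service_id=..." \
    > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target

# Or let srp_daemon discover and connect targets automatically
srp_daemon -e -o
```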

SRP was the only way ZFSBuild2012 could deliver over 100,000 IOPS.  Neither IPoIB option could deliver that level of performance.  We expect this was due to the overhead of iSCSI and TCP/IP.  If you cannot get SRP working, you will still see very good performance, just not the extreme level that SRP can deliver.

If you cannot use SRP, we recommend using IPoIB-UD (connected mode disabled).  In our testing, connected mode seemed to hurt performance.

All of the following benchmarks were run using the ZFSBuild2012 server with Nexenta 3.1.3.  Write-back caching was enabled for the shared ZVol in each test.  Click here to read about our benchmark methods.  The 1Gbps performance of the ZFSBuild2012 is included as a point of reference within these graphs.

IOMeter 4k Benchmarks:
[IOMeter 4k benchmark graphs]

IOMeter 8k Benchmarks:
[IOMeter 8k benchmark graphs]

IOMeter 16k Benchmarks:
[IOMeter 16k benchmark graphs]

IOMeter 32k Benchmarks:
[IOMeter 32k benchmark graphs]

Saturday, December 15th, 2012 Benchmarks

12 Comments to ZFSBuild2012 – InfiniBand Performance

  • tomoiaga says:

    Can you please look at iSER (iSCSI Extensions for RDMA)?
    It should deliver the same performance as SRP with the added functionality of iSCSI.

  • expertaz says:

    Thanks for the great info about InfiniBand options.  Is iSCSI the only option, or is there any way to use NFS with IPoIB?  And a second question: is there any way to connect IPoIB directly, like an Ethernet crossover cable?

  • Jae-Hoon.Choi says:

    Thank you for your blog about InfiniBand information.

    I have experience with InfiniBand.

    IB-CM can support 64KB messages (MTU = 65520), but IB-DM can only support 4KB (MTU = 4096 - 4 = 4092).

    IB-CM’s throughput was much higher than IB-DM’s.

    I’ll also test with an IB QDR switch and HCA in a 40Gbps environment.

    I also think IB SRP is the most powerful IB storage protocol!

  • Tomoiaga – as we have stated previously, this system is now in production, and we cannot do additional testing on it. iSER was on the list of things we wanted to try, but we could not find a suitable iSER implementation for the Windows client that we were doing all of the testing with. Without an iSER implementation on Windows, our only option would be Linux, which would not have been an apples-to-apples comparison with the other systems.

  • You can use NFS with IPoIB, or you can use NFS over RDMA. We did not test this option, as we are not nearly as well versed with it. It is, however, very possible to use NFS with InfiniBand.
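For reference, NFS over IPoIB is just a standard NFS mount against the server's IPoIB address, while NFS over RDMA uses the rdma transport option. A sketch with placeholder addresses and export paths, not values from the ZFSBuild2012 setup:

```shell
# NFS over IPoIB: a normal NFS mount using the server's IPoIB address
mount -t nfs -o tcp 192.168.10.1:/volumes/tank /mnt/tank

# NFS over RDMA: needs the xprtrdma module on the client
# (and svcrdma on the server); 20049 is the standard NFS/RDMA port
modprobe xprtrdma
mount -t nfs -o rdma,port=20049 192.168.10.1:/volumes/tank /mnt/tank
```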

  • Pantagruel says:

    Nice data and great performance.

    Could you perhaps mention which OS was used to perform the IB tests. The ‘Benchmark methods’ page does mention you used Nexenta 3.1.3, ZFSGuru 0.2.0-beta 7 (FreeBSD 9.1), and FreeNAS 8.3 (FreeBSD 8.3), but these test results do not state which OS served as the SRP target.

    Furthermore, will you be sharing with us all the nitty gritty and sordid details on how you were able to persuade the distros mentioned to serve as an SRP target? This would make a handy how-to for the less technically educated.
    I personally would love to know if you were able to get ZFSguru (FreeBSD 9.x) to serve as an SRP target.

    Thanks for the info.

  • All of the data posted so far has been from Nexenta. We will be posting additional data as we have time. We will be going into the nitty gritty on setting up SRP targets, and it’s a lot of command line.
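Ahead of that writeup, here is a rough sketch of what a COMSTAR SRP target setup looks like on a Nexenta/OpenSolaris-based system. The pool and zvol names are placeholders, and the LU GUID shown is illustrative (use the one printed by sbdadm):

```shell
# Enable the COMSTAR SRP target service
svcadm enable -r ibsrp/target

# Create a zvol and register it as a SCSI logical unit
zfs create -V 100G tank/srpvol1
sbdadm create-lu /dev/zvol/rdsk/tank/srpvol1

# Expose the LU to all hosts (or restrict with host/target groups);
# the GUID is printed by the sbdadm create-lu command above
stmfadm add-view 600144f0...

# Verify the SRP target is online
stmfadm list-target -v
```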

  • Pantagruel says:

    Thanks for the heads up.

    I know, power to the CLI seems to be the motto regarding SRP. I have set up SRP/iSCSI before on Ubuntu, OpenIndiana and Solaris Express 11.
    Even though it all seems quite arcane and medieval, it tends to work better with regards to problem solving than the clicker-di-click interface.

  • lathiat says:

    Hi Guys,

    Just wondering how you are going with ARP on your new builds. Reading your old post here:

    We’re having the same issue: our CentOS machines are fine, but the Solaris machine is often quite slow to respond to ARP requests.

    We’re using NexentaStor 3.1 and CentOS 5.8 with the built-in OFED stack.

  • We have not seen these issues at all with ZFSBuild2012. We never figured out the issues with ZFSBuild2010 and abandoned InfiniBand for that system.

  • virtualstorage says:

    Thanks again Matt,

    Another set of questions are:
    1) Have you also tested without L2ARC and ZIL and with minimum RAM in ZFS?
    This would validate the theory that “with ZFS, you don’t get IOPS based on per-disk IOPS + workload + RAID penalties + frontend IOPS; you get overall IOPS based on the number of vdevs (RAID groups) x the slowest disk’s IOPS in the pool.”

    2) Also, have you tested performance with RAIDZ1 or RAIDZ2?

    3) If my L2ARC is large enough to accommodate the entire working set and the ZIL is sufficiently large ( 2 x5 X speed of SSD), theoretically I should be able to get even 100,000 IOPS with just around 10 mirrored 15K RPM SAS disks? This won’t serve capacity requirements but will meet performance requirements.

  • virtualstorage says:

    Please ignore above comment as I posted it in the wrong thread .. Correct thread is http://www.zfsbuild.com/2012/12/14/zfsbuild2012-benchmark-methods/comment-page-1/#comment-542
