Getting InfiniBand to work

The title of this post says it all: how to get InfiniBand to work properly. We have had a roller coaster of fun trying to get the InfiniBand network up and running. From the headache of installing the InfiniBand switch module, to finding the correct Mezzanine cards for our blades, to getting IPoIB (IP over InfiniBand) working on all systems, it’s been interesting, to say the least.

Let’s start with the InfiniBand Mezzanine cards. The first Mezzanine card we ordered was a SuperMicro AOC-IBH-001. The first problem we ran into was that this card wasn’t supported on our newest blades. We did have a few older blades in one of our bladecenters that supported it, though, so all was not lost. The second round of cards that we got for our blades was the AOC-IBH-XDD. After getting two of those up and running in our newest blades, we realized that we couldn’t actually use both ports on those cards. The final round of cards that we will be purchasing will be AOC-IBH-XDS single-port cards. Those will give us all of the connectivity that we need for each individual blade. The upside is that those are some of the least expensive Mezzanine cards that SuperMicro produces.

The next hurdle was getting InfiniBand up and running in OpenSolaris. This was more of a challenge than we expected. For starters, OpenSolaris 2009.06 had trouble getting our Mellanox InfiniHost III Ex card working. It saw the card, but every command that we issued reported that the card was waiting on the network. We decided to try a development build from Genunix.Org, hoping that it would include fixes for whatever bugs were hanging us up. We installed using the b134 installer for OpenSolaris. This build also detected the InfiniBand card, but it didn’t configure our IPoIB very well.

This led us to the next piece of the puzzle: the Subnet Manager (SM). InfiniBand is not like other switched protocols that we have encountered. To get InfiniBand working, you have to have a piece of software called a Subnet Manager running on one of the InfiniBand hosts.

OpenSM Command Line Howto

OpenSM is provided by the OpenFabrics Alliance and controls all of the activity on the InfiniBand network. Without a subnet manager, your InfiniBand setup will not work. We were completely unaware of this, and spent a day poking and prodding everything in our setup before we realized it. To complicate matters further, some InfiniBand switches have this software running on the switch itself, and some don’t. If you get a switch that has it built in, your setup will probably work significantly better from the get-go than ours did. That is not to say that having the software on the switch is a better idea: it is significantly easier to upgrade the software on a system running Windows or Linux than it is to flash new software on a switch.
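One quick way to spot a missing subnet manager from a Linux host is to look at the port state that the kernel exposes under /sys/class/infiniband. A port that has a physical link but has not yet been configured by a subnet manager typically sits in the INIT state instead of ACTIVE. The following is only a minimal sketch, assuming a Linux host with the InfiniBand drivers loaded; the helper names are made up for this example, and it does not apply to the OpenSolaris systems described in this post.

```python
from __future__ import annotations

from pathlib import Path

# Standard Linux sysfs location for InfiniBand devices (assumes drivers are loaded).
SYSFS_ROOT = Path("/sys/class/infiniband")


def parse_port_state(raw: str) -> str:
    """Turn a sysfs state string such as '4: ACTIVE' into 'ACTIVE'."""
    _, _, name = raw.strip().partition(":")
    return name.strip() or raw.strip()


def port_states(root: Path = SYSFS_ROOT) -> dict[str, str]:
    """Map 'device/port' to its state name for every port found under root."""
    states = {}
    for state_file in sorted(root.glob("*/ports/*/state")):
        device = state_file.parts[-4]  # e.g. the HCA device name
        port = state_file.parts[-2]    # e.g. '1'
        states[f"{device}/{port}"] = parse_port_state(state_file.read_text())
    return states


def needs_subnet_manager(states: dict[str, str]) -> list[str]:
    """Ports stuck in INIT have a link but have not been set up by an SM."""
    return [port for port, state in states.items() if state == "INIT"]


if __name__ == "__main__":
    found = port_states()
    for port, state in found.items():
        print(f"{port}: {state}")
    for port in needs_subnet_manager(found):
        print(f"{port} is in INIT; is a subnet manager running on the fabric?")
```

If every cabled port reports INIT, that points at the fabric lacking a subnet manager rather than at a bad card or cable.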

Once we got the Subnet Manager running on one of the blades, things started working better. We were able to ping from one machine to another, but only intermittently. We would lose the connection from one blade, then another, and then one would start working again. We were not sure what was causing this, and we spent another few days investigating. Everything worked great on the Ethernet side of the network: pings, transfers, and iSCSI all worked great, and NFS worked (albeit a little slowly). The IPoIB network was trouble with a capital T. We finally decided to install Wireshark on one of the systems to see if we could figure out why the network wasn’t working quite right.

Screenshot of the Wireshark application monitoring the IPoIB network.

After a little bit of packet inspection, we saw that some systems were sending out ARP requests but not getting a reply. We haven’t totally figured out why. As we understand it, when an ARP request goes out on the switch, the machine that owns the requested address should send a response, but for some reason this is not happening. To work around this, we built a CentOS system to act as both the Subnet Manager and a gateway device. We gave it an IP address of 10.0.1.254 and then told all of the systems in the IPoIB network that 10.0.1.254 was their gateway.
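The kind of check we did by eye in Wireshark can be sketched in a few lines: walk the captured ARP traffic in order and list the requests that never received a matching reply. This is a minimal sketch, not a Wireshark export parser. The ArpRecord structure and the sample addresses are made up for illustration, and it assumes the records have already been extracted from a capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArpRecord:
    """Simplified, hypothetical view of one captured ARP packet."""
    op: str         # "request" or "reply"
    sender_ip: str  # host that sent the packet
    target_ip: str  # address being asked about (request) or answered to (reply)


def unanswered_requests(records):
    """Return (asker, target) pairs that never saw a matching reply."""
    pending = []
    for rec in records:
        if rec.op == "request":
            key = (rec.sender_ip, rec.target_ip)
            if key not in pending:
                pending.append(key)
        elif rec.op == "reply":
            # A reply from X to Y answers Y's earlier request for X.
            key = (rec.target_ip, rec.sender_ip)
            if key in pending:
                pending.remove(key)
    return pending


if __name__ == "__main__":
    # Made-up sample traffic on an assumed 10.0.1.x IPoIB subnet.
    capture = [
        ArpRecord("request", "10.0.1.1", "10.0.1.2"),
        ArpRecord("reply", "10.0.1.2", "10.0.1.1"),
        ArpRecord("request", "10.0.1.3", "10.0.1.2"),
    ]
    for asker, target in unanswered_requests(capture):
        print(f"{asker} asked for {target} and got no reply")
```

Hosts that show up repeatedly as the target of unanswered requests are the ones worth inspecting first.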

We then set up some IP Masquerading rules to allow traffic from the IPoIB network to get from the IB interface on the CentOS system to the Ethernet network and out to the internet. This had the side effect of allowing that system to sympathetically answer ARP requests for all of the other systems on the network, which fixed all of our communication issues between our blades.

Currently the CentOS system is running on a blade that we had built for testing the InfiniBand network and then for use as a virtualization system (see Test Blade Configuration). It’s overkill, to say the least. We have ordered another blade outfitted with a Xeon 5506, 2GB RAM, and a pair of 250GB SATA boot drives. We do not expect either the Subnet Manager or the IP bridging to be very taxing on that system, and we hate to waste a blade with 48GB of RAM and some of the fastest processors you can buy on such simple tasks.
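For readers who want to try the IP Masquerading setup described above, the sketch below prints a typical set of Linux commands for it. These are not the exact rules from our CentOS system. The interface names ib0 and eth0 and the 10.0.1.0/24 subnet are assumptions (only the gateway address 10.0.1.254 comes from our setup), and the script only prints the commands instead of running them.

```python
import ipaddress
import shlex


def masquerade_commands(ib_if="ib0", eth_if="eth0", ipoib_net="10.0.1.0/24"):
    """Build a typical forwarding/NAT command set for an IPoIB gateway.

    Interface names and the subnet are assumptions; adjust for your system.
    """
    net = ipaddress.ip_network(ipoib_net)  # raises ValueError on a bad subnet
    return [
        # Let the kernel route packets between interfaces.
        ["sysctl", "-w", "net.ipv4.ip_forward=1"],
        # Rewrite the source address of IPoIB traffic leaving via Ethernet.
        ["iptables", "-t", "nat", "-A", "POSTROUTING",
         "-s", str(net), "-o", eth_if, "-j", "MASQUERADE"],
        # Allow traffic from the IPoIB side out to Ethernet.
        ["iptables", "-A", "FORWARD", "-i", ib_if, "-o", eth_if, "-j", "ACCEPT"],
        # Allow only return traffic back in.
        ["iptables", "-A", "FORWARD", "-i", eth_if, "-o", ib_if,
         "-m", "state", "--state", "RELATED,ESTABLISHED", "-j", "ACCEPT"],
    ]


if __name__ == "__main__":
    gateway = ipaddress.ip_address("10.0.1.254")
    assert gateway in ipaddress.ip_network("10.0.1.0/24")
    for cmd in masquerade_commands():
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before pasting them into a root shell or a startup script.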

Another side effect of adding the IP Masquerading to the CentOS system is that we can now totally isolate the ZFSHEADEND box from the rest of the network. Our initial setup had its dual gigabit ports connected to the internet. We needed this connection so that we could do software updates and driver updates for OpenSolaris. Now all of its internet traffic is routed through the CentOS system, which effectively firewalls the ZFSHEADEND box off from anything that is not directly connected to the InfiniBand network.

Thursday, May 6th, 2010 Configuration, InfiniBand

17 Comments to Getting InfiniBand to work

  • jdye says:

    thank you for documenting your experiments with IB and ZFS.

  • We have ordered another blade outfitted with a Xeon 5506, 2GB RAM, and a pair of 250GB SATA boot drives.

    Still a waste of a blade slot. You could pop in a cheap 1U with a sticker on it that says “storage network firewall” for about 1/4 the price of that blade. I will show you a snap of mine when I am done 🙂

  • admin says:

    screamingservers: I don’t doubt that a cheaper solution could be built, but the blade style environment still offers better management options and less cable clutter.

  • jron says:

    Thanks for sharing the information. My company is considering using InfiniBand for a new SAN.

    Could you share some more info on how stable the InfiniBand solution is?
    Have you had problems with IPoIB on different OSes (Linux, Windows, etc.)?
    Have you found a way to run the Subnet Manager on the SAN host system rather than on a separate server?
    Have you tested InfiniBand with OpenIndiana as the host OS? Any new findings with regard to InfiniBand?

  • admin says:

    jron: We have seen pretty good stability in general with InfiniBand when using OFED on Linux and Windows. The stability of InfiniBand on OpenSolaris, OpenIndiana, and Nexenta has been quite poor.

    We initially thought it might be a bad IB card in the ZFS server. However, the card is 100% stable when we run Linux or Windows on the same hardware. The IB card is only unstable with OpenSolaris or variants of OpenSolaris.

    You cannot run a subnet manager on OpenSolaris at this time, so you cannot run SM on the SAN box if the SAN box is OpenSolaris, OpenIndiana, or Nexenta. OpenSolaris does not use the standard OFED drivers that are used on Linux and Windows platforms, so some tools found in OFED are not present in any of the OpenSolaris builds.

    Yes, we have tested IB on OpenIndiana. IB was just as unstable on OpenIndiana as it was on OpenSolaris. We suspect there is a driver problem on OpenSolaris/OpenIndiana, but there is no easy way to confirm that.

    IPoIB seems very stable on Linux and Windows.

  • jron says:

    Thanks for the quick reply. I have been trying to read up on OpenSolaris/OpenIndiana and InfiniBand support.

    The latest info on the supported OFED version is:
    http://blogs.sun.com/cindi/entry/infiniband_for_q3_2009
    where OFED 1.4 is mentioned, but that’s regarding the Sun Storage 7410.

    There has been some mention of OFED 1.5.1 and in July 2010 this case was approved:
    http://arc.opensolaris.org/caselog/PSARC/2010/239/

    Do you know what OFED version (implementation) you are using?

    Is the OFED version configurable on Linux/Windows, or backwards-compatible with 1.4 (or earlier), which is “probably” still used in the OpenSolaris/OpenIndiana implementation?

  • admin says:

    We have used the latest OFED builds with excellent results on both Linux and Windows. With OpenSolaris/OpenIndiana, we are using the IB support included with the OS instead of using OFED.

  • So did the latest OpenIndiana build fix your stability issues on the Mellanox card? Do you think it’s the driver or the IPoIB?

  • admin says:

    screamingservers: No, the latest OpenIndiana build did not fix the stability issues with InfiniBand. IB is completely stable on the same hardware if we install Linux or Windows using OFED. IB is not stable under OpenSolaris, OpenIndiana, or Nexenta. I think the problem is driver related, but I don’t know for sure.

  • Have you tried different cards? I saw the Voltaire HCA 400 on the hardware compatibility list.

  • admin says:

    The card we have is in the HCL. We can try different cards if manufacturers want to send us cards to test, but we are not going to buy a bunch of different ones ourselves.

  • Good point, I did not see it there.

  • Daniel says:

    Why couldn’t you use both ports on the HBA cards? You know you need to run another instance of the subnet manager to get both ports to work.

    I had heaps of problems getting my InfiniBand network up and running a few months ago. I ran SRPT on Oracle Solaris 11 using COMSTAR; the client side was CentOS 5.5 64-bit.

    I had problems with HBA firmware at first but finally got SRP working. I haven’t played with IPoIB, but I’ll likely start working with it soon.

    See below the oracle forum post I started to get it working, turned out no one could really help me anyways…

    https://forums.oracle.com/forums/thread.jspa?messageID=9745520

  • shaneshort says:

    Now that you guys are using Nexenta, have you managed to get all your InfiniBand stuff working there too? I’m rather keen on going that route; I’m just a bit concerned it’s going to be a massive hassle (a medium hassle I can deal with).

    Thanks! 🙂

  • We do not have InfiniBand up and running on the ZFSBuild2010 system, but we are actively pursuing it with another build. Once we put the ZFSBuild2010 system into production, we left everything alone to prevent stability problems.

  • gandalf says:

    Hi,
    I’ve seen that you don’t have any kind of redundancy at the switch level.

    Did you try to get any HA with multiple switches?
    How will you use both ports on the HCA?

  • At this point we have only a single InfiniBand switch, which is a single point of failure. Any multi-pathing would still have a single switch or a single HCA as the failure point. Since the environment has these single points of failure that we cannot engineer around at this time, we have not done any multi-pathing. At some point, if we move past the SuperMicro systems or get dual 10Gbit/InfiniBand switches in new blade systems, we will investigate those options, but as of today those are not high-priority issues.
