So in my lab I’ve got a Cisco Nexus 5548 and a SuperMicro SuperServer 6026-6RFT+. I’ve put Nexenta on this, Windows, and several other things. One thing I hadn’t tried is running FCoE. Intel announced FCoE support for all Niantic-based chips several years ago, and I hadn’t tried it.
I figured this was as good a time as any to play with ESXi and FCoE, so I dug in. ESXi 5.1 installed flawlessly. It saw all of the NICs, all of the hard drives, everything. The Nexus 5548 worked great, I sailed along creating new VSANs for FCoE, and thought, “here we go!”
I followed the guide here for enabling FCoE http://www.intel.com/content/www/us/en/network-adapters/10-gigabit-network-adapters/ethernet-x520-configuring-fcoe-vmware-esxi-5-guide.html. It all looked splendid until I actually got to the part where you activate the new FCoE Storage adapter. Every time I tried to add the Software FCoE adapter, it acted like there was no available adapter that supported FCoE. I knew this wasn’t the case, as it was very clearly mentioned that it _was_ supported.
After several hours of poking, prodding, trying different versions of ESXi, updating the system board BIOS, tinkering with BIOS settings, trying Windows – thinking maybe, just maybe ESXi wasn’t going to work, I gave up and sent an email to Intel.
Intel responded very graciously that, since it was an integrated controller on the system board, there wasn’t much they could do for me, and that I would have to talk to my manufacturer. I followed their advice, contacted SuperMicro, and got a fantastic response:
Would you don’t mind flash the EEPROM firmware. The firmware release on 08/09/11 will allow Intel 82599EB to support FCoE.
Steps to flash onboard LAN EEPROM
1. Extract the files and copy them to a bootable USB stick or to a bootable floppy disk.
(If you don’t have a bootable USB stick you can make it using:
2. Boot up the system using the USB stick.
3. At the command prompt, type <filename>.bat
4. Enter the 12-digit LAN1 MAC address when prompted.
5. Power cycle the system.
6. Reinstall the LAN drivers after the EEPROM is flashed.
After flashing the new EEPROM onto the LAN controller, I was able to successfully enable FCoE for this system using ESXi (and subsequently Windows Server 2008 R2).
This article includes pictures taken while we were assembling the ZFSBuild2012 SAN.
Installing the motherboard:
We took a lot of pictures while we were building and testing the ZFSBuild2012 SAN.
The ZFSBuild 2012 system consists of the following:
SuperMicro SC846BE16-R920 chassis – 24 bays, single expander, 6Gbit SAS capable.
SuperMicro X9SRI-3F-B Motherboard – Single socket Xeon E5 compatible motherboard.
Intel Xeon E5 1620 – 3.6GHz latest generation Intel Xeon CPU.
20x Toshiba MK1001TRKB 1TB SAS 6Gbit HDD’s – 1TB SAS drives.
LSI 9211-8i SAS controller – Moving the SAS duties to a Nexenta HSL certified SAS controller.
Intel SSD’s all around
2x Intel 313 series 20GB SSD drives for ZIL
2x Intel 520 series 240GB SSD drives for L2ARC
2x Intel 330 series 60GB SSD drives for boot (installed into internal cage)
64GB RAM (8x 8GB) – Generic Kingston ValueRAM.
20Gbps ConnectX InfiniBand card
4x ICYDock 2.5″ to 3.5″ drive bay converters
Internal drive bay bracket for boot drives
Y-cable for the rear fans (not enough fan sockets without one Y-cable)
Here are some pictures of the first few parts that showed up:
Benchmarking for ZFSBuild 2012 has completed. We’ve got a bunch of articles in the pipeline about this build, and we’ll be releasing them over the next few weeks and months. Stay tuned!
We are still running benchmarks. We decided to share a Friday afternoon action shot of the server mounted in the rack.
It’s been two years since we built our last ZFS based server, and we decided that it was about time for us to build an updated system. The goal is to build something that exceeds the functionality of the previous system, while costing approximately the same amount. The original ZFSBuild 2010 system cost US $6765 to build, and for what we got back then, it was a heck of a system. The new ZFSBuild 2012 system is going to match the price point of the previous design, yet offer measurably better performance.
The new ZFSBuild 2012 system consists of the following:
SuperMicro SC846BE16-R920 chassis – 24 bays, single expander, 6Gbit SAS capable. Very similar to the ZFSBuild 2010 server, with a little more power, and a faster SAS interconnect.
SuperMicro X9SRI-3F-B Motherboard – Single socket Xeon E5 compatible motherboard. This board supports 256GB of RAM (over 10x the RAM we could support in the old system) and significantly faster/more powerful CPU’s.
Intel Xeon E5 1620 – 3.6GHz latest generation Intel Xeon CPU. More horsepower for better compression and faster workload processing. ZFSBuild 2010 was short on CPU, and we found it lacking in later NFS tests. We won’t make that mistake again.
20x Toshiba MK1001TRKB 1TB SAS 6Gbit HDDs – 1TB SAS drives. The 1TB SATA drives that we used in the previous build were OK, but SAS drives give much better information about their health and performance, and for an enterprise deployment, are absolutely necessary. These drives are only $5 more per drive than what we paid for the drives in ZFSBuild 2010. Obviously if you’d like to save more money, SATA drives are an option, but we strongly recommend using SAS drives whenever possible.
LSI 9211-8i SAS controller – Moving the SAS duties to a Nexenta HSL certified SAS controller. Newer chipset, better performance, and replaceability in case of failure.
Intel SSD’s all around – We went with a mix of 2x Intel 313 (ZIL), 2x 520 (L2ARC) and 2x 330 (boot – internal cage) SSD’s for this build. We have less ZIL space than the previous build (20GB vs 32GB) but rough math says that we shouldn’t ever need more than 10-12GB of ZIL. We will have more L2ARC (480GB vs 320GB) and the boot drives are roughly the same.
64GB RAM – Generic Kingston ValueRAM. The original ZFSBuild was based on 12GB of memory, which 2 years ago seemed like a lot of RAM for a storage server. Today we’re going with 64 GB right off the bat using 8GB DIMM’s. The motherboard has the capacity to go to 256GB with 32GB DIMM’s. With 64GB of RAM, we’re going to be able to cache a _lot_ of data. My suggestion is to not go super-overboard on RAM to start with, as you can run into issues as noted here : http://www.zfsbuild.com/2012/03/05/when-is-enough-memory-too-much-part-2/
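The SSD figures in the list above are simple arithmetic. Here is a quick Python sketch using only the drive counts and sizes from this post. It assumes the two ZIL devices are mirrored (which is what the 20GB figure implies), and it treats the 12GB ZIL requirement as the upper end of our rough estimate, not a measured number.

```python
# Drive counts and sizes (GB) taken from the parts list in this post.
zil_drives, zil_size_gb = 2, 20        # Intel 313 series
l2arc_drives, l2arc_size_gb = 2, 240   # Intel 520 series

# L2ARC devices are not mirrored, so their capacities add up.
l2arc_total_gb = l2arc_drives * l2arc_size_gb

# Assumption: the ZIL pair is mirrored, so usable space equals one drive.
zil_usable_gb = zil_size_gb

# Upper end of the rough "10-12GB" estimate for ZIL usage.
zil_needed_gb = 12
zil_headroom_gb = zil_usable_gb - zil_needed_gb

print(f"L2ARC total: {l2arc_total_gb} GB")
print(f"ZIL usable: {zil_usable_gb} GB, headroom: {zil_headroom_gb} GB")
```

Even with the smaller ZIL devices, this leaves several GB of headroom over the estimate.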
For the same price as our ZFSBuild 2010 project, the ZFSBuild 2012 project will include more CPU, much more RAM, more cache, better drives, and better chassis. It’s amazing what two years difference makes when building this stuff.
Expect that we’ll evaluate Nexenta Enterprise, OpenIndiana, and revisit FreeNAS’s ZFS implementation. We probably won’t go back over the Promise units, as we’ve already discussed them and they likely haven’t changed (and we don’t have any lying about not doing anything anymore).
We are planning to re-run the same battery of tests that we used in 2010 for the original ZFSBuild 2010 benchmarks. We still have the same test blade server available to reproduce the testing environment. We also plan to run additional tests using various sized working sets. InfiniBand will be benchmarked in addition to standard gigabit Ethernet this round.
So far, we have received nearly all of the hardware. We are still waiting on a cable for the rear fans and a few 3.5 to 2.5 drive bay converters for the ZIL and L2ARC SSD drives. As soon as those items arrive, we will place the ZFSBuild 2012 server in our server room and begin the benchmarking. We are excited to see how it performs relative to the ZFSBuild 2010 server design.
Here are a couple pictures we have taken so far on the ZFSBuild 2012 project:
Just got a note in my inbox that Nexenta has “Important” news. They’ve cancelled the OpenStorage Summit in San Jose, and they’ve refunded my registration fee. Great. Just what I wanted. I absolutely did not want to go to San Jose, meet and greet with Nexenta employees and users, and network with said people. I also didn’t want to get out of Iowa for a few days and enjoy the weather in San Jose. What I _am_ looking forward to is cancelling both my hotel reservation and my airfare. I’m betting the airfare probably doesn’t get refunded. Class act guys. It would have been really nice if you would have decided this a month ago when you opened registration. I wonder how many other people are holding the bag on airfare and room reservations. I would have to think that Nexenta had the forethought to check with the hotel and see how many rooms had been reserved, but maybe not. Maybe they expected everyone to book their rooms at the last minute, along with their airfare, because that’s what most conference attendees do.
Nexenta – you can go ahead and assume I won’t be registering for the virtual open storage summit. Also go ahead and assume that I’m probably not interested in coming next year either, because who knows how much notice I’ll get that you’ve cancelled that one too.
We’ve got a newfound lab to play with, and it’s a monster. Suffice to say there is going to be a lot of information coming out about this system in the next few weeks. One of the more interesting things that we’re trying with it is software FCoE. Obviously for the best performance, you’re going to want to use a hardware FCoE CNA (Converged Network Adapter). For testing purposes though, Intel X520 10GbE network cards seem to work just fine. Just wanted to throw this out there if anyone wanted/needed to play with FCoE.
Here are the steps to getting FCoE up and running on Intel X520 NICs (this is being done on NexentaStor Enterprise 3.1.3, which is officially unsupported). This should work on OpenIndiana also.
1. Install 10GBE Cards and connect them.
2. Configure interfaces – set MTU to 9000 (this is important) – test traffic with ping -s 9000 and iperf
3. Unconfigure network interfaces
setup network interface ixgbe0 unconfigure
4. Create FCoE Target – on iscsi target machine
fcadm create-fcoe-port -t -f ixgbe0 (-f enables Promiscuous Mode: On)
5. Create FCoE Initiator – on iscsi initiator machine
fcadm create-fcoe-port -i -f ixgbe0
6. Check state
7. List targets
8. Online your targets
9. Create volume on target side
a. setup volume create data
10. Create zvol on target side
setup zvol create data/test
11. Share zvol over iscsi
setup zvol data/test share
12. Make sure lun shows up
a. show lun
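For the jumbo-frame test in step 2, it can help to know how large an ICMP payload actually fits in a single 9000-byte frame. A minimal Python sketch, assuming plain IPv4 with no IP options (20-byte IP header, 8-byte ICMP header); the function name is just for illustration.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


print(max_icmp_payload(9000))  # jumbo frames
print(max_icmp_payload(1500))  # standard Ethernet
```

A ping with a payload larger than this gets fragmented at the IP layer, so a successful 9000-byte ping on its own does not prove that jumbo frames work end to end. The iperf run is the better sanity check.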
Here at ZFSBuild we have come across something unusual that we thought we would share. This isn’t necessarily related to ZFS, but we encountered it while working on a ZFS/Nexenta system.
We recently had to dig deeper into RAM ranks and voltage specifications. This stems from populating a SuperMicro SYS-6026-RFT+ barebones system full of RAM. The system in question has 18 DDR3 RAM slots. We ordered 8GB DIMMs from Wiredzone.com, and based on its specifications we were pretty sure it would work. We got the system in, populated the RAM slots, and started running tests. The weird thing about it was that the system only ever saw 120GB of RAM. We started reading….and reading….and reading… Finally came across some SuperMicro documents here and here. Turns out the RAM we ordered was substituted with RAM that was assumed to be compatible. The only difference between the DIMMs we ordered and the DIMMs we received was the voltage that they operated at. We ordered 1.5V DIMMs, and we were shipped 1.35V DIMMs. When using 1.35V DIMMs the system detected 120GB of usable RAM.
We fought for a few days between SuperMicro, WiredZone, and our own gut feelings and finally got it sorted out. Wiredzone shipped us new DIMM’s that were Dual Rank and 1.5V, and they worked flawlessly. We’d like to give a big shout out to the WiredZone staff and to the SuperMicro staff that helped us on this. It’s not terribly well understood black magic that goes on in these servers, and when working on the boundaries of what’s possible all sorts of odd things come up. The last week has been one of them.
A side note on this is that we would have seen similar behavior if the RAM had been quad-ranked. In a quad-ranked configuration, the server will apparently only see 12 DIMMs also. In all of our years of building systems and working with servers we had never encountered this, and are very happy that we had the folks at WiredZone and SuperMicro to help us sort this out.
So one of the Nexenta systems that I’ve been working on quadrupled memory, and ever since then has been having some issues (as detailed in the previous post – that was actually supposed to go live a few weeks ago). Lots of time spent on Skype with Nexenta support has led us in a few directions. Yesterday, we made a breakthrough.
We have been able to successfully correlate VMware activities with the general wackiness of our Nexenta system. This occurs at the end of a snapshot removal, or at the end of a storage vmotion. Yesterday, we stumbled across something that we hadn’t noticed before. After running the storage vmotion, the Nexenta freed up the same amount of RAM from the ARC cache as the size of the VMDK that just got moved. This told us something very interesting.
1 – There is no memory pressure at all. The entire VMDK got loaded into the ARC cache as it was being read out. And it wasn’t replaced.
2 – Even after tuning the arc_shrink_shift variables, we were still freeing up GOBS of memory. 50GB in this case.
3 – When we free up that much RAM, Nexenta performs some sort of cleanup, and gets _very_ busy.
After reviewing the facts of the case, we started running some dtrace scripts that I’ve come across. Arcstat.pl (from Mike Harsch) showed that as the data was being deleted from disk, arc usage was plummeting, and as soon as it settled down, the arc target size was reduced by the same amount. When that target size was reduced, bad things happened.
At the same time, I ran mpstat to show what was going on with the CPU. While this was going on, we consistently saw millions of cross-calls from one processor core to another, and 100% system time. The system was literally falling over trying to free up RAM.
Currently the solution that we have put into place is setting arc_c_min to arc_max minus 1GB. This has so far prevented arc_c (target size) from shrinking aggressively and causing severe outages.
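To make that setting concrete, here is a small Python sketch that works out the byte values. The 128GB ARC maximum is a made-up example, not the value from our system, and the /etc/system tunable names (zfs_arc_min, zfs_arc_max) are the usual Solaris/illumos ones, which we are assuming apply here. Check with support before applying anything like this on a production Nexenta box.

```python
GIB = 1024 ** 3


def arc_min_for(arc_max_bytes, gap_bytes=GIB):
    """Return an ARC minimum that sits gap_bytes below the ARC maximum."""
    if arc_max_bytes <= gap_bytes:
        raise ValueError("arc_max must be larger than the gap")
    return arc_max_bytes - gap_bytes


def etc_system_lines(arc_max_bytes):
    """Format both values as /etc/system style lines (assumed syntax)."""
    arc_min = arc_min_for(arc_max_bytes)
    return [
        f"set zfs:zfs_arc_max = {arc_max_bytes}",
        f"set zfs:zfs_arc_min = {arc_min}",
    ]


# Hypothetical example: a 128 GiB ARC maximum.
for line in etc_system_lines(128 * GIB):
    print(line)
```

Pinning the minimum that close to the maximum effectively stops the target size from moving, which is the whole point of the workaround.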
There still appears to be a little bit of a hiccup going on when we do those storage vmotions, but the settings that we are using now appear to at least be preventing the worst of the outages.
Good question. One would think that there’s never too much memory, but in some cases you’d be dead wrong (at least without tuning). I’m battling that exact issue today. On a system that I’m working with, we upgraded the RAM from 48GB to 192GB. The ZFS Evil Tuning Guide says don’t worry, we auto-tune better than Chris Brown. I’m starting to not believe that. We’ve been intermittently seeing the system go dark (literally dropping portchannels to Cisco Nexus 5010 switches), then roaring back to life. Standard logging doesn’t appear to be giving much insight, but after digging through ZenOSS logs and multiple dtrace scripts, I think we’ve found a pattern.
It appears as though by default, Nexenta will de-allocate a certain percentage of your memory when it does memory cleanup related to the ARC cache. When you get to larger memory systems, the amount of memory it frees grows. I monitored an event where it freed up something to the tune of 8GB of RAM. That happened to coincide with a portchannel dropping.
Through all of this, support has been great. We’ve been tuning the amount of memory it frees up. We’ve tuned the minimum amount of RAM to free up (in an effort to get it to free memory more often). We’ve allocated more memory to ARC metadata. Pretty much we’ve thrown the kitchen sink at it. The last tweak was done today, and I’m monitoring the system to see if we continue to see problems. Hopefully, once this is all done I can post some tunables for larger memory systems.
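Our understanding is that the ARC gives memory back in chunks sized as a fraction of its target, controlled by a shift value (arc_shrink_shift), so the chunk grows along with the ARC. A rough Python sketch of that relationship follows. The shift values and ARC sizes are illustrative assumptions, not measurements from this system, and the real ARC will not be as large as total RAM.

```python
GIB = 1024 ** 3


def shrink_chunk(arc_target_bytes, shrink_shift):
    """Bytes released in one shrink pass: target size / 2**shrink_shift."""
    return arc_target_bytes >> shrink_shift


# Compare a 48 GiB and a 192 GiB ARC at two example shift values.
for arc_gib in (48, 192):
    for shift in (5, 7):
        chunk = shrink_chunk(arc_gib * GIB, shift)
        print(f"ARC {arc_gib} GiB, shift {shift}: {chunk / GIB:.3f} GiB per pass")
```

With the same shift, quadrupling the ARC quadruples the amount freed in one pass, which is in the same ballpark as the multi-GB event described above.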
We’ve been asked about the SuperMicro 6036ST-6LR, affectionately known as the SuperMicro Storage Bridge Bay, and why we did not use that platform. I threw out a few reasons quick last night, but wanted to elaborate on those points a little bit, and add another one. First and foremost, when we started our build, the Storage Bridge Bay wasn’t available. If it had been, we probably would have gotten it just for the new/neat factor. Now, on to the other reasons I posted last night.
1 – We don’t _need_ HA. This sounds a bit silly, but for all intents and purposes, we don’t _need_ HA on a day to day basis. Our hardware has been stable enough to allow us to get by without having a full HA system for storage. Yes, for maintenance windows, it would be nice to be able to fail to another head, do maintenance, then fail back. We are a service provider though, and our SLA’s allow us to open maintenance windows. Everyone has them, everyone needs them, and sometimes downtime is an unavoidable consequence. Our customers are well aware of this and tolerate these maintenance windows quite well. While we’d love to have HA available, it’s not a requirement in our environment at this time.
2 – It’s expensive (Hardware) – For about the same price, you can get 2x 2U SuperMicro systems and cable them up to an external JBOD. If you don’t need or want HA, your costs go down substantially.
3 – It’s expensive (Software) – To get into HA with Nexenta, you have to run at least Gold level licensing, plus the HA cluster plugin. For example, if you’ve only got 8TB of raw storage, the difference is between $1,725 for an 8TB Silver license and $10,480 for two 8TB Gold licenses plus the HA Cluster plugin. Obviously going with more storage makes the cost differential smaller, but there is definitely a premium associated with going HA. Our budget simply doesn’t allow us to spend that much money on the storage platform. Our original build clocked in well under $10,000 (about $6,700, checking the records). Our next build has a budget of under $20,000. Spending half of the budget on HA software just breaks the bank.
4 – (Relatively) limited expansion – Our new build will likely be focused on a dedicated 2U server as a head node. This node has multiple expansion slots, integrated 10GbE, and support for up to 288GB of RAM (576GB if 32GB DIMMS get certified). It’s a much beefier system allowing for much more power in the head for compression and caching. Not that the Storage Bridge Bay doesn’t have expansion, but it’s nowhere near as expandable as a dedicated 2U head node.
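To put the software numbers from point 3 side by side, here is a quick Python sketch using only the prices and the budget quoted above (the variable names are ours):

```python
silver_8tb = 1725         # one 8TB Silver license
gold_ha_bundle = 10480    # two 8TB Gold licenses plus the HA Cluster plugin
next_build_budget = 20000

# Extra cost of going HA on the software side alone.
premium = gold_ha_bundle - silver_8tb

# Fraction of the next build's budget the HA software would consume.
budget_share = gold_ha_bundle / next_build_budget

print(f"HA premium: ${premium:,}")
print(f"HA software as share of budget: {budget_share:.0%}")
```

That works out to a bit more than half of the budget for the next build before buying any hardware.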
Now, after all of this, don’t go throwing in the towel on building an HA system, or even building an HA system using the Storage Bridge Bay. For many use cases, it’s the perfect solution. If you don’t need a ton of storage but still need High Availability and are constrained on space, this is a perfect solution. 3U and you can have a few dozen TB of storage, plus read and write caches. It’d also be the perfect solution for a VDI deployment requiring HA. Slap a bunch of SSDs in it, and it’s rocking. After using the Nexenta HA plugin, I can say that it’s definitely a great feature to have, and if you’ve got the requirement for HA I’d give it a look.