We are still running benchmarks. We decided to share a Friday afternoon action shot of the server mounted in the rack.
We are still running the IOMeter benchmarks using the new ZFSBuild2012 server and InfiniBand SRP. The numbers we are seeing with SRP are absolutely awesome. For example, right now it is running a 32k random read benchmark and it is getting nearly 60,000 IOPS and moving 1854MB/second. This is amazing because when we ran raw IB performance tests on the network, the max performance of the IB network was 1890MB/second. The ib_read_bw and ib_write_bw tools showed our IB network moving an average of 1869MB/second and a peak of 1890MB/second. It is really exciting to see the ZFSBuild2012 box delivering 1854MB/second through IOMeter, which is about 98% of the wirespeed for our IB network.
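For anyone who wants to sanity check those figures, here is a minimal sketch of the arithmetic. It assumes that "MB" in the IOMeter and ib_read_bw output means MiB (2^20 bytes) and that "32k" means 32 KiB; the function names are ours and are only for illustration.

```python
# Convert between IOPS and throughput for a fixed block size.
# Assumption: "MB" means MiB (2**20 bytes) and "32k" means 32 KiB.

KIB = 1024
MIB = 1024 * 1024


def throughput_mib_s(iops, block_kib):
    """Throughput in MiB/s for a given IOPS rate and block size in KiB."""
    return iops * block_kib * KIB / MIB


def iops_from_throughput(mib_s, block_kib):
    """IOPS needed to move mib_s MiB/s at a given block size in KiB."""
    return mib_s * MIB / (block_kib * KIB)


def link_utilization(measured_mib_s, peak_mib_s):
    """Fraction of the measured peak link bandwidth actually used."""
    return measured_mib_s / peak_mib_s


if __name__ == "__main__":
    # 1854 MB/s at 32k blocks works out to just under 60,000 IOPS.
    print(round(iops_from_throughput(1854, 32)))            # 59328
    # 1854 MB/s against the 1890 MB/s peak from the raw IB tests.
    print(round(link_utilization(1854, 1890) * 100, 1))     # 98.1
```

Under those assumptions the numbers line up: 1854MB/second at 32k is about 59,300 IOPS, and it is roughly 98% of the 1890MB/second peak we measured on the raw IB network.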
So – just a quick update – we got InfiniBand SRP (SCSI RDMA Protocol) working from our Windows system to our Nexenta system. Here’s a screengrab from IOMeter beating the crap out of our 20Gbit InfiniBand network. This is a 16k random read workload. Obviously it all fits into RAM, but compared to iSCSI w/ IPoIB, there’s no contest: iSCSI/IPoIB manages about 400MB/sec and 27,000 IOPS. It is very exciting to see this as a possibility moving forward.
We are re-running the original ZFSBuild2010 tests, and initial results are that this system is significantly faster than the old system. 4k Random Reads are peaking out at over 50,000 IOPS, delivering over 200MB/sec over Infiniband. 8k random reads delivering 40,000 IOPS and over 300MB/sec over Infiniband. These numbers are AWESOME!
Keep in mind though, this is with a 25GB working set. ZFSBuild 2010 only had 12GB of RAM, not nearly enough to cache even 25GB of data. ZFSBuild 2012 has 64GB of RAM, which lets all of this data sit in RAM. We’ll be tailoring the benchmarks after this run to more accurately reflect real-world workloads, as we know the 25GB working set size is giving us artificially high results.
Look for more info in the next week or so!
It’s been two years since we built our last ZFS based server, and we decided that it was about time for us to build an updated system. The goal is to build something that exceeds the functionality of the previous system, while costing approximately the same amount. The original ZFSBuild 2010 system cost US $6765 to build, and for what we got back then, it was a heck of a system. The new ZFSBuild 2012 system is going to match the price point of the previous design, yet offer measurably better performance.
The new ZFSBuild 2012 system consists of the following:
SuperMicro SC846BE16-R920 chassis – 24 bays, single expander, 6Gbit SAS capable. Very similar to the ZFSBuild 2010 server, with a little more power, and a faster SAS interconnect.
SuperMicro X9SRI-3F-B Motherboard – Single socket Xeon E5 compatible motherboard. This board supports 256GB of RAM (over 10x the RAM we could support in the old system) and significantly faster/more powerful CPUs.
Intel Xeon E5 1620 – 3.6GHz latest generation Intel Xeon CPU. More horsepower for better compression and faster workload processing. ZFSBuild 2010 was short on CPU, and we found it lacking in later NFS tests. We won’t make that mistake again.
20x Toshiba MK1001TRKB 1TB SAS 6Gbit HDDs – 1TB SAS drives. The 1TB SATA drives that we used in the previous build were OK, but SAS drives give much better information about their health and performance, and for an enterprise deployment they are absolutely necessary. These drives are only $5 more per drive than what we paid for the drives in ZFSBuild 2010. Obviously if you’d like to save more money, SATA drives are an option, but we strongly recommend using SAS drives whenever possible.
LSI 9211-8i SAS controller – Moving the SAS duties to a Nexenta HSL certified SAS controller. Newer chipset, better performance, and replaceability in case of failure.
Intel SSDs all around – We went with a mix of 2x Intel 313 (ZIL), 2x 520 (L2ARC) and 2x 330 (boot – internal cage) SSDs for this build. We have less ZIL space than the previous build (20GB vs 32GB), but rough math says that we shouldn’t ever need more than 10-12GB of ZIL. We will have more L2ARC (480GB vs 320GB), and the boot drives are roughly the same size.
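One way to arrive at a ballpark like that is to multiply the fastest rate at which writes can arrive by the number of seconds of writes the log has to hold before they are committed to the pool. The sketch below does that; it is not necessarily the exact math we used. Both inputs are assumptions: the ingest rate is taken to be capped by the 1890MB/second peak of our IB network, and 5 seconds is a commonly cited transaction group interval rather than a value measured on this box.

```python
# Rough upper bound on how much ZIL (slog) space can be in use at once.
# Assumptions (not measured on this system):
#   - incoming writes are capped by the network link speed
#   - the log only has to hold a few seconds of writes before they
#     are committed to the main pool


def slog_upper_bound_gib(ingest_mib_s, seconds_buffered):
    """Worst-case log usage in GiB for a given ingest rate and hold time."""
    return ingest_mib_s * seconds_buffered / 1024


if __name__ == "__main__":
    # 1890 MiB/s link peak, 5 seconds of buffered synchronous writes.
    print(round(slog_upper_bound_gib(1890, 5), 1))  # 9.2
```

With those inputs the bound comes out a little over 9GB, which is in the same neighborhood as the 10-12GB estimate and well under the 20GB we have.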
64GB RAM – Generic Kingston ValueRAM. The original ZFSBuild was based on 12GB of memory, which 2 years ago seemed like a lot of RAM for a storage server. Today we’re going with 64 GB right off the bat using 8GB DIMM’s. The motherboard has the capacity to go to 256GB with 32GB DIMM’s. With 64GB of RAM, we’re going to be able to cache a _lot_ of data. My suggestion is to not go super-overboard on RAM to start with, as you can run into issues as noted here : http://www.zfsbuild.com/2012/03/05/when-is-enough-memory-too-much-part-2/
For the same price as our ZFSBuild 2010 project, the ZFSBuild 2012 project will include more CPU, much more RAM, more cache, better drives, and better chassis. It’s amazing what two years difference makes when building this stuff.
Expect that we’ll evaluate Nexenta Enterprise, OpenIndiana, and revisit FreeNAS’s ZFS implementation. We probably won’t go back over the Promise units, as we’ve already discussed them and they likely haven’t changed (and we don’t have any lying about not doing anything anymore).
We are planning to re-run the same battery of tests that we used in 2010 for the original ZFSBuild 2010 benchmarks. We still have the same test blade server available to reproduce the testing environment. We also plan to run additional tests using various sized working sets. InfiniBand will be benchmarked in addition to standard gigabit Ethernet this round.
So far, we have received nearly all of the hardware. We are still waiting on a cable for the rear fans and a few 3.5 to 2.5 drive bay converters for the ZIL and L2ARC SSD drives. As soon as those items arrive, we will place the ZFSBuild 2012 server in our server room and begin the benchmarking. We are excited to see how it performs relative to the ZFSBuild 2010 server design.
Here are a couple pictures we have taken so far on the ZFSBuild 2012 project:
Just got a note in my inbox that Nexenta has “Important” news. They’ve cancelled the OpenStorage Summit in San Jose, and they’ve refunded my registration fee. Great. Just what I wanted. I absolutely did not want to go to San Jose, meet and greet with Nexenta employees and users, and network with said people. I also didn’t want to get out of Iowa for a few days and enjoy the weather in San Jose. What I _am_ looking forward to is cancelling both my hotel reservation and my airfare. I’m betting the airfare probably doesn’t get refunded. Class act guys. It would have been really nice if you would have decided this a month ago when you opened registration. I wonder how many other people are holding the bag on airfare and room reservations. I would have to think that Nexenta had the forethought to check with the hotel and see how many rooms had been reserved, but maybe not. Maybe they expected everyone to book their rooms at the last minute, along with their airfare, because that’s what most conference attendees do.
Nexenta – you can go ahead and assume I won’t be registering for the virtual open storage summit. Also go ahead and assume that I’m probably not interested in coming next year either, because who knows how much notice I’ll get that you’ve cancelled that one too.
Hotel and airfare booked for the OpenStorage Summit 2012 in San Jose. Last year was spectacular and was well worth the trip. Anyone else thinking about going should stop thinking and register!
It’s been noted over on the Nexentastor.org forums that there was a change in the Nexenta EULA that now prevents commercial usage of Nexenta Community Edition. Forum thread can be found here : http://www.nexentastor.org/boards/1/topics/7593
I found this discouraging, as the ability to use Nexenta Community Edition in our production environment was the reason that we selected Nexenta. All of our original testing was done with OpenSolaris, and it actually performed better than Nexenta. We went with Nexenta Community Edition because of the ease of use and the ability to upgrade to Enterprise Edition and purchase support if we needed it. Removing the option for small businesses to use Nexenta Community Edition in production is not something that I expected to see from Nexenta. I wondered why this happened.
I took some time to think about this and try to figure out why Nexenta’s stance might have changed. After browsing the forums, and seeing posts that say things like “I tried contacting Nexenta support” I stumbled upon the idea that support could be a big part of it. This is a free version of Nexenta, allowing you to use up to 18TB of space, with NO SUPPORT. People then come to the forums and complain that there’s no support, or they don’t get a response, or they got little help.
Support costs money. There have been a number of people that are using Nexenta Community Edition that have been contacting support (at least one noted here – http://www.nexentastor.org/boards/2/topics/5662#message-7672). Even if support doesn’t help you, you’re still tying up time on the phone with them, or forcing them to write an email response. This costs money. The EULA change isn’t going to change people’s behavior, but it does make it easier for Nexenta to send you to the forums for support, and use canned responses for Nexenta Community Edition questions.
The other possibility that I could see would be someone purchasing a single Nexenta Enterprise Edition Silver license, installing Nexenta Community Edition on 20 other devices, and then trying to get support on those devices as well. That’s pretty shady, but I can easily envision someone doing that. Saying that Nexenta Community Edition isn’t to be run on production workloads makes it much easier for Nexenta to punt questions that come from a non-supported system. This is similar to the problem that Cisco has with their SmartNet support. You buy 40 Cisco switches, put SmartNet on one of them, and voila, you’ve got support for every device that you own. Cisco is starting to get this worked out, and I can see a lot of shops hurting in the next few years when they have to buy SmartNet on every device they own.
My suggestion to potential Nexenta Community Edition users – if you’re considering running Nexenta Community Edition in production, go for it. From what I can tell, Nexenta is not planning to use the terms of the EULA to sue people for using Nexenta Community Edition in production. Nexenta IS likely going to give you grief if you call in to support, and you’ll likely not get any help. They’re a for-profit company, and I can’t fault them for wanting to remain in the black. If it’s a mission critical workload and you absolutely need support, buy Nexenta Enterprise Silver at a minimum instead of using Nexenta Community Edition. Nexenta Enterprise Silver is still cheaper than nearly any other support package you’ll find, and my experiences with support have been nothing less than stellar.
My suggestion to Nexenta – figure this out on the backend. By telling small businesses that they cannot use Nexenta Community Edition in production, you have opened the door for FreeNAS, OpenFiler and Napp-It to step in and grab these startups that desperately want to use your product. Your product is better than FreeNAS, OpenFiler and Napp-It, but FreeNAS, OpenFiler, and Napp-It don’t include draconian licensing limitations. Figure out how to allow these small businesses to use Nexenta Community Edition and flourish, and when they’re ready to go big time, let ’em buy Nexenta Enterprise Silver/Gold/Platinum licensing and support rather than figuring out how to pay Napp-It or one of the others for their support. If the EULA for Nexenta Community Edition had looked like this when we started using it, we would have thought long and hard about not using it or recommending it to other people. I don’t want to do that, and I don’t want anyone reading this site to do that. The Nexenta WebGUI is comfortable and I’ve gotten quite used to it; I’d hate to go back and have to create iSCSI targets from the command line.
At some point in time, somebody will sit down and write a good web based GUI that runs on OpenSolaris/OpenIndiana. By originally allowing production usage of Nexenta Community Edition, Nexenta took away the desire to code that alternative web GUI. After all, nobody wants to spend months coding something when Nexenta Community Edition existed for free, worked awesome, and allowed production usage. Now that Nexenta Community Edition is not allowed for production usage, there will likely be renewed interest in developing a good open source web GUI.
We’ve got a newfound lab to play with, and it’s a monster. Suffice to say there is going to be a lot of information coming out about this system in the next few weeks. One of the more interesting things that we’re trying with it is software FCoE. Obviously for the best performance, you’re going to want to use a hardware FCoE CNA (Converged Network Adapter). For testing purposes though, Intel X520 10GbE network cards seem to work just fine. Just wanted to throw this out there if anyone wanted/needed to play with FCoE.
Here are the steps for getting FCoE up and running on Intel X520 NICs (this is being done on NexentaStor Enterprise 3.1.3 and is officially unsupported). This should work on OpenIndiana also.
1. Install 10GBE Cards and connect them.
2. Configure interfaces – set MTU to 9000 (this is important) – test traffic with ping -s 9000 and iperf
3. Unconfigure network interfaces
setup network interface ixgbe0 unconfigure
4. Create FCoE Target – on iscsi target machine
fcadm create-fcoe-port -t -f ixgbe0 (-f enables Promiscuous Mode: On)
5. Create FCoE Initiator – on iscsi initiator machine
fcadm create-fcoe-port -i -f ixgbe0
6. Check state
7. List targets
8. Online your targets
9. Create volume on target side
setup volume create data
10. Create zvol on target side
setup zvol create data/test
11. Share zvol over iscsi
setup zvol data/test share
12. Make sure lun shows up
show lun
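A quick note on why the MTU in step 2 matters, with a small sketch of the arithmetic. A Fibre Channel frame can carry up to 2112 bytes of data, which already exceeds a standard 1500 byte Ethernet MTU before any FCoE or FC headers are added, so jumbo frames are needed end to end. The sketch also works out the largest ICMP payload that fits in one unfragmented packet, which is a handy size to use when testing the path with ping. It assumes IPv4 with no IP options; the helper names are ours, and the exact ping flags differ between operating systems, so check the local man page.

```python
# Frame-size arithmetic for the jumbo frame test in step 2.
# Assumption: IPv4 without options (20-byte header) and an 8-byte ICMP header.

IPV4_HEADER = 20       # bytes
ICMP_HEADER = 8        # bytes
FC_MAX_PAYLOAD = 2112  # maximum data field of a Fibre Channel frame, in bytes


def max_icmp_payload(mtu):
    """Largest ICMP echo payload that fits in a single packet at this MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def can_hold_full_fc_payload(mtu):
    """Necessary (not sufficient) check: FCoE/FC headers add more on top."""
    return mtu >= FC_MAX_PAYLOAD


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(mtu, max_icmp_payload(mtu), can_hold_full_fc_payload(mtu))
```

If a ping with the full-size payload and fragmentation disabled does not get through, something in the path is not passing jumbo frames, and FCoE is unlikely to behave.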
So we’ve been talking about RAM usage, ram problems, and pretty much everything related to RAM lately, so I figured I’d mention this one too.
Many of you, if you’ve got a large memory system, may notice that your system never uses all of its RAM. I’ve got some 192GB systems that routinely reported that they had 30GB of RAM free. Why is this? Everything that I’ve read says ZFS will use all of the RAM in your system that it can, so how can it leave 30GB just wasting away doing nothing?
The answer is both surprising and not surprising at the same time. When ZFS was written, a system with 8GB of RAM was a monster, and 16GB was nearly unheard of. Let’s not even talk about 32GB. Today having 192GB of RAM in a system isn’t difficult to achieve, and the new Xeon E5 platforms boast RAM capacities of 1TB and more. Back then there was a need to keep ZFS from encroaching on system processes. There is a Solaris “bug” (if you can call it that) that limits ARC cache to 7/8 of total system memory. This was intentional, so that the ARC cache couldn’t cause contention for swapfile space in RAM. That “bug” still exists in Nexenta 3.1.2. This is why I have 30GB of RAM free on a 192GB system. Holding back 1/8 of system memory made sense when monster systems were 8GB: that left 1GB of RAM that wouldn’t be touched for ARC caching to ensure proper system operation. Today the amount of RAM in systems dwarfs what most people would have used 10 years ago, and as such we need to make modifications. Fortunately, we can tune this variable.
From what I’ve read (based on information from Richard Elling (@richardelling on twitter)) the default for Nexenta 4.0 and Illumos distributions will be 256MB. I take a very cautious approach to this and I’m going to use 1GB. This is all discussed here.
For those who don’t want to check out the links – the pertinent info is here :
edit /etc/system for permanent fix – this reserves 1GB of RAM
To change it on the fly
echo swapfs_minfree/W0t262144 | mdb -kw
This change has allowed my ARC cache to grow from 151-155GB utilization to 176GB utilization, effectively allowing me to use all of the RAM that I should be able to use.
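For anyone wondering where 262144 comes from, here is a small sketch of the conversion. swapfs_minfree is expressed in memory pages, and the sketch assumes 4KiB pages (the usual size on x86; check yours before reusing the number). The /etc/system line it prints uses the generic `set name=value` form and is our assumption of what the permanent entry would look like, so verify it against your platform's documentation before using it.

```python
# Convert a RAM reservation into the page count that swapfs_minfree expects.
# Assumption: 4 KiB memory pages (typical on x86).

PAGE_SIZE = 4096
GIB = 1 << 30
MIB = 1 << 20


def pages_for(reserve_bytes, page_size=PAGE_SIZE):
    """Number of pages that make up reserve_bytes."""
    return reserve_bytes // page_size


def default_reserve_gib(ram_gib):
    """RAM held back under the old 1/8-of-memory behavior."""
    return ram_gib / 8


def mdb_command(pages):
    """On-the-fly change, in the same form as the command shown above."""
    return f"echo swapfs_minfree/W0t{pages} | mdb -kw"


def etc_system_line(pages):
    """Assumed persistent form, using the generic /etc/system set syntax."""
    return f"set swapfs_minfree={pages}"


if __name__ == "__main__":
    print(default_reserve_gib(192))           # 24.0 GiB held back on a 192GB box
    print(pages_for(1 * GIB))                 # 262144 pages = 1GB
    print(pages_for(256 * MIB))               # 65536 pages = 256MB
    print(mdb_command(pages_for(1 * GIB)))
    print(etc_system_line(pages_for(1 * GIB)))
```

So 262144 pages at 4KiB each is exactly the 1GB reservation, and a 256MB reservation would be 65536 pages.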
FYI – this is unsupported, and Nexenta support will probably give you grief if you do this without discussing it with them (if you have a support contract), so be forewarned. There may be unintended consequences from making this change that I am not aware of.
Here at ZFSBuild we have come across something unusual that we thought we would share. This isn’t necessarily related to ZFS, but we encountered it while working on a ZFS/Nexenta system.
We recently had to dig deeper into RAM ranks and voltage specifications. This stems from populating a SuperMicro SYS-6026-RFT+ barebones system full of RAM. The system in question has 18 DDR3 RAM slots. We ordered 8GB DIMMs from Wiredzone.com, and based on the system’s specifications we were pretty sure they would work. We got the system in, populated the RAM slots, and started running tests. The weird thing about it was that the system only ever saw 120GB of RAM. We started reading….and reading….and reading… Finally we came across some SuperMicro documents here and here. Turns out the RAM we ordered was substituted with RAM that was assumed to be compatible. The only difference between the DIMMs we ordered and the DIMMs we received was the voltage that they operated at. We ordered 1.5V DIMMs, and we were shipped 1.35V DIMMs. When using 1.35V DIMMs the system detected 120GB of usable RAM.
We fought for a few days between SuperMicro, WiredZone, and our own gut feelings and finally got it sorted out. Wiredzone shipped us new DIMM’s that were Dual Rank and 1.5V, and they worked flawlessly. We’d like to give a big shout out to the WiredZone staff and to the SuperMicro staff that helped us on this. It’s not terribly well understood black magic that goes on in these servers, and when working on the boundaries of what’s possible all sorts of odd things come up. The last week has been one of them.
A side note on this is that we would have seen similar behavior if the RAM had been quad-ranked. In a quad-ranked configuration, the server will apparently only see 12 DIMMs also. In all of our years of building systems and working with servers we had never encountered this, and we are very happy that we had the folks at WiredZone and SuperMicro to help us sort this out.
So one of the Nexenta systems that I’ve been working on had its memory quadrupled, and ever since then it has been having some issues (as detailed in the previous post – that was actually supposed to go live a few weeks ago). Lots of time spent on Skype with Nexenta support has led us in a few directions. Yesterday, we made a breakthrough.
We have been able to successfully correlate VMware activities with the general wackiness of our Nexenta system. This occurs at the end of a snapshot removal, or at the end of a storage vmotion. Yesterday, we stumbled across something that we hadn’t noticed before. After running the storage vmotion, the Nexenta freed up the same amount of RAM from the ARC cache as the size of the VMDK that just got moved. This told us something very interesting.
1 – There is no memory pressure at all. The entire VMDK got loaded into the ARC cache as it was being read out. And it wasn’t replaced.
2 – Even after tuning the arc_shrink_shift variables, we were still freeing up GOBS of memory. 50GB in this case.
3 – When we free up that much RAM, Nexenta performs some sort of cleanup, and gets _very_ busy.
After reviewing the facts of the case, we started running some dtrace scripts that I’ve come across. Arcstat.pl (from Mike Harsch) showed that as the data was being deleted from disk, arc usage was plummeting, and as soon as it settled down, the arc target size was reduced by the same amount. When that target size was reduced, bad things happened.
At the same time, I ran mpstat to show what was going on with the CPU. While this was going on, we consistently saw millions of cross-calls from one processor core to another, and 100% system time. The system was literally falling over trying to free up RAM.
Currently the solution that we have put into place is setting arc_c_min to arc_max minus 1GB. This has so far prevented arc_c (the target size) from shrinking aggressively and causing severe outages.
There still appears to be a little bit of a hiccup going on when we do those storage vmotions, but the settings that we are using now appear to at least be preventing the worst of the outages.
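For reference, here is a small sketch of how that minimum works out in bytes. The 64GB arc_max in the example is purely hypothetical and is not the value from our system. The `zfs_arc_min` and `zfs_arc_max` names are the standard Solaris/illumos tunables that feed arc_c_min and arc_c_max at boot; setting them through /etc/system is shown here only as an assumed way of making the change persistent, and it may not be how your support team wants it done.

```python
# Compute an ARC minimum that sits a fixed headroom below the ARC maximum.
# The example arc_max below is hypothetical, chosen only for illustration.

GIB = 1 << 30


def arc_min_bytes(arc_max_bytes, headroom_bytes=GIB):
    """ARC floor placed headroom_bytes below the ARC ceiling."""
    if arc_max_bytes <= headroom_bytes:
        raise ValueError("arc_max must be larger than the headroom")
    return arc_max_bytes - headroom_bytes


def etc_system_line(tunable, value):
    """Assumed persistent form, using the generic /etc/system set syntax."""
    return f"set zfs:{tunable}={value}"


if __name__ == "__main__":
    arc_max = 64 * GIB                      # hypothetical ceiling
    arc_min = arc_min_bytes(arc_max)        # 63 GiB floor
    print(etc_system_line("zfs_arc_max", arc_max))
    print(etc_system_line("zfs_arc_min", arc_min))
```

Pinning the floor that close to the ceiling means the ARC target has almost no room to shrink, which is the whole point here, but it also means the ARC will not give memory back easily if something else on the box needs it.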