Archive for November, 2011

SuperMicro 6036ST-6LR

We’ve been asked about the SuperMicro 6036ST-6LR, affectionately known as the SuperMicro Storage Bridge Bay, and why we did not use that platform.  I threw out a few reasons quickly last night, but wanted to elaborate on those points a little and add another one.  First and foremost, when we started our build, the Storage Bridge Bay wasn’t available.  If it had been, we probably would have gotten it just for the new/neat factor.  Now, on to the other reasons I posted last night.

1 – We don’t _need_ HA.  This sounds a bit silly, but for all intents and purposes, we don’t _need_ HA on a day-to-day basis.  Our hardware has been stable enough to allow us to get by without having a full HA system for storage.  Yes, for maintenance windows, it would be nice to be able to fail over to another head, do maintenance, then fail back.  We are a service provider though, and our SLAs allow us to open maintenance windows.  Everyone has them, everyone needs them, and sometimes downtime is an unavoidable consequence.  Our customers are well aware of this and tolerate these maintenance windows quite well.  While we’d love to have HA available, it’s not a requirement in our environment at this time.

2 – It’s expensive (Hardware) – For about the same price, you can get 2x 2U SuperMicro systems and cable them up to an external JBOD.  If you don’t need or want HA, your costs go down considerably.

3 – It’s expensive (Software) – To get into HA with Nexenta, you have to run at least Gold level licensure, plus the HA cluster plugin.  For example, if you’ve only got 8TB of RAW storage, the difference is between $1,725 for an 8TB Silver license, vs $10,480 for 2 8TB Gold licenses plus the HA Cluster plugin.  Obviously going with more storage makes the cost differential smaller, but there is definitely a premium associated with going HA.  Our budget simply doesn’t allow us to spend that much money on the storage platform.  Our original build clocked in well under $10,000 ($6,700 to be exact, checking the records).  Our next build has a budget of under $20,000.  Spending half of the budget on HA software just breaks the bank.

4 – (Relatively) limited expansion – Our new build will likely be focused on a dedicated 2U server as a head node.  This node has multiple expansion slots, integrated 10GbE, and support for up to 288GB of RAM (576GB if 32GB DIMMs get certified).  It’s a much beefier system allowing for much more power in the head for compression and caching.  Not that the Storage Bridge Bay doesn’t have expansion, but it’s nowhere near as expandable as a dedicated 2U head node.
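
As a rough check on the licensing numbers in point 3, here is a small Python sketch.  The prices and the budget ceiling are the ones quoted above, read as US dollars; the function names are made up for illustration.

```python
# Figures quoted in point 3 above, read as US dollars.
SILVER_8TB = 1725          # one 8TB Silver license
GOLD_HA_8TB = 10480        # two 8TB Gold licenses plus the HA Cluster plugin
NEXT_BUILD_BUDGET = 20000  # stated ceiling for the next build


def ha_premium(single_head: float, ha_pair: float) -> float:
    """Extra licensing cost of going HA instead of a single head."""
    return ha_pair - single_head


def budget_share(cost: float, budget: float) -> float:
    """Fraction of a budget consumed by a given cost."""
    return cost / budget


premium = ha_premium(SILVER_8TB, GOLD_HA_8TB)
share = budget_share(GOLD_HA_8TB, NEXT_BUILD_BUDGET)
print(f"HA licensing premium: ${premium:,.0f}")
print(f"HA licensing as a share of the budget: {share:.0%}")
```

With these inputs the HA licensing alone lands at a bit over half of a $20,000 budget, which is where the "half of the budget" remark comes from.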

Now, after all of this, don’t go throwing in the towel on building an HA system, or even building an HA system using the Storage Bridge Bay.  For many use cases, it’s the perfect solution.  If you don’t need a ton of storage but still need High Availability and are constrained on space, this is a perfect solution.  3U and you can have a few dozen TB of storage, plus read and write caches.  It’d also be the perfect solution for a VDI deployment requiring HA.  Slap a bunch of SSDs in it, and it’s rocking.  After using the Nexenta HA plugin, I can say that it’s definitely a great feature to have, and if you’ve got the requirement for HA I’d give it a look.

Tuesday, November 22nd, 2011 Hardware 7 Comments

Dedupe – be careful!

Just got done working on a friend’s Nexenta platform, and boy howdy was it tired.  NFS was slow, iSCSI was slow, and for all the slowness, we couldn’t really see a problem with the system.  The GUI was reporting there was free RAM, and iostat showed the disks not being completely thrashed.  We didn’t see anything really out of the ordinary at first glance.

After some digging, we figured out that we were running out of RAM for the Dedupe tables.  It goes a little something like this.

Nexenta by default allocates 1/4 of your ARC cache (RAM) to metadata caching.  Your L2ARC map is considered metadata.  When you turn on dedupe, all of that dedupe information is stored in metadata.  The more you dedupe, the more RAM you use, the more L2ARC you use, the more RAM you use.

The system in question is a 48GB system, and it reported that it had free memory, so we were baffled.  If it’s got free RAM, what’s the holdup?  It seems that between the dedupe tables and the L2ARC, we had outstripped the capabilities of the ARC to hold all of the metadata.  This caused _everything_ to be slow.  The solution?  You can either increase the percentage of RAM that can be used for metadata, increase the total RAM (thereby increasing the amount you can use for metadata caching), or you can turn off dedupe, copy everything off of the volume, then copy it back.  Since there’s currently no way to “undedupe” a volume, once that data has been created, you’re stuck with it until you remove the files.

So, without further ado, here’s how to figure out what’s going on in your system.

echo ::arc|mdb -k

This will display some interesting stats.  The most important in this situation is the last three lines :

arc_meta_used             =     11476 MB
arc_meta_limit            =     12014 MB
arc_meta_max              =     12351 MB

These numbers will change.  Things will get evicted, things will come back.  You don’t want to see the meta_used and meta_limit numbers this close.  You definitely don’t want to see the meta_max exceed the limit.  This is a great indicator that you’re out of RAM.
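
If you want to automate that check, something like the following Python sketch would do.  It parses the three arc_meta lines shown above and flags the two conditions just described.  The function names and the 90% warning threshold are arbitrary choices for illustration, not anything Nexenta ships.

```python
import re

# Sample lines copied from the `echo ::arc|mdb -k` output shown above.
SAMPLE = """\
arc_meta_used             =     11476 MB
arc_meta_limit            =     12014 MB
arc_meta_max              =     12351 MB
"""


def parse_arc_meta(text: str) -> dict:
    """Pull the arc_meta_* values (in MB) out of ::arc output."""
    pattern = re.compile(r"^(arc_meta_\w+)\s*=\s*(\d+)\s*MB", re.MULTILINE)
    return {name: int(value) for name, value in pattern.findall(text)}


def metadata_pressure(stats: dict, warn_ratio: float = 0.9) -> list:
    """Return warnings; warn_ratio is an arbitrary threshold, tune to taste."""
    used = stats["arc_meta_used"]
    limit = stats["arc_meta_limit"]
    peak = stats["arc_meta_max"]
    warnings = []
    if used >= warn_ratio * limit:
        warnings.append(f"arc_meta_used is at {used / limit:.0%} of arc_meta_limit")
    if peak > limit:
        warnings.append(f"arc_meta_max has exceeded arc_meta_limit by {peak - limit} MB")
    return warnings


for warning in metadata_pressure(parse_arc_meta(SAMPLE)):
    print(warning)
```

On a live system you would feed it the real command output instead of SAMPLE.  Keep in mind that arc_meta_max is a high-water mark, so that second warning sticks around even after usage drops.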

After quite a bit of futzing around, disabling dedupe, and shuffling data off of, then back onto the pool, things look better :

arc_meta_used             =      7442 MB
arc_meta_limit            =     12014 MB
arc_meta_max              =     12351 MB

Just disabling dedupe and blowing away the dedupe tables freed up roughly 4GB of RAM (arc_meta_used dropped from 11476 MB to 7442 MB).  Who knows how much was being swapped in and out of RAM.

Other things to check :

zpool status -D <volumename>

This gives you your standard volume status, but it also prints out the dedupe information.  This is good to figure out how much dedupe data there is.  Here’s an example :

DDT entries 7102900, size 997 on disk, 531 in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    6.41M    820G    818G    817G    6.41M    820G    818G    817G
     2     298K   37.3G   37.3G   37.3G     656K   82.0G   82.0G   81.9G
     4    30.5K   3.82G   3.82G   3.81G     140K   17.5G   17.5G   17.5G
     8    43.9K   5.49G   5.49G   5.49G     566K   70.7G   70.7G   70.6G
    16      968    121M    121M    121M    19.1K   2.38G   2.38G   2.38G
    32      765   95.6M   95.6M   95.5M    33.4K   4.17G   4.17G   4.17G
    64       33   4.12M   4.12M   4.12M    2.77K    354M    354M    354M
   128        5    640K    640K    639K      943    118M    118M    118M
   256        2    256K    256K    256K      676   84.5M   84.5M   84.4M
    1K        1    128K    128K    128K    1.29K    164M    164M    164M
    4K        1    128K    128K    128K    5.85K    749M    749M    749M
   32K        1    128K    128K    128K    37.0K   4.63G   4.63G   4.62G
 Total    6.77M    867G    865G    864G    7.84M   1003G   1001G   1000G


This tells us that there are 7 million entries, with each entry taking up 997 bytes on disk, and 531 bytes in memory.  Simple math tells us how much space that takes up.

7102900 * 531 = 3771639900 bytes
3771639900 / 1024 / 1024 = 3596MB used in RAM

The same math tells us that there’s 6753MB used on disk, just to hold the dedupe tables.
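
Here is the same math as a small Python sketch, so you can paste in the summary line from your own pool.  The function name is made up for illustration; the input format is the "DDT entries" line from the zpool status -D output above.

```python
import re

# Summary line copied from the `zpool status -D` output above.
DDT_LINE = "DDT entries 7102900, size 997 on disk, 531 in core"


def ddt_footprint_mb(line: str) -> tuple:
    """Return (on_disk_mb, in_core_mb) for a DDT summary line."""
    match = re.search(r"DDT entries (\d+), size (\d+) on disk, (\d+) in core", line)
    if match is None:
        raise ValueError("not a DDT summary line")
    entries, disk_bytes, core_bytes = map(int, match.groups())
    mib = 1024 * 1024
    # Integer division truncates, which matches the figures quoted in the text.
    return entries * disk_bytes // mib, entries * core_bytes // mib


on_disk_mb, in_core_mb = ddt_footprint_mb(DDT_LINE)
print(f"DDT on disk: {on_disk_mb} MB, in core: {in_core_mb} MB")
```

Compare the in-core figure against your arc_meta_limit to see how much of the metadata budget the dedupe table alone would eat if it were fully cached.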

The dedupe ratio on this system wasn’t even worth it.  Overall dedupe ratio was something like 1.15x.  Compression on that volume (which has nearly no overhead), after shuffling the data around, is at 1.42x.  So at the cost of CPU time (which there is plenty of), we get a better over-subscription ratio from compression vs deduplication.
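
You can get a ballpark dedupe ratio straight from the Total row of the table above by dividing referenced DSIZE by allocated DSIZE.  The Python sketch below does that with the rounded values from the table, so treat the result as approximate; the names are made up for illustration.

```python
# DSIZE values from the Total row of the DDT table above, in GB (rounded by zpool).
TOTAL_ALLOCATED_DSIZE_GB = 864    # space actually consumed on disk
TOTAL_REFERENCED_DSIZE_GB = 1000  # space the data would consume without dedupe
COMPRESSION_RATIO = 1.42          # compression ratio quoted in the text


def dedup_ratio(referenced_gb: float, allocated_gb: float) -> float:
    """Approximate dedupe ratio: referenced size over allocated size."""
    return referenced_gb / allocated_gb


ratio = dedup_ratio(TOTAL_REFERENCED_DSIZE_GB, TOTAL_ALLOCATED_DSIZE_GB)
saved_gb = TOTAL_REFERENCED_DSIZE_GB - TOTAL_ALLOCATED_DSIZE_GB
print(f"dedupe ratio: about {ratio:.2f}x, saving about {saved_gb} GB")
print(f"compression does better: {COMPRESSION_RATIO > ratio}")
```

That lines up with the "something like 1.15x" figure, and it makes the trade-off plain: a bit over a hundred GB saved in exchange for several GB of RAM spent on dedupe tables.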

There are definitely use-cases for deduplication, but his generic VM storage pool is not one of them.

Friday, November 18th, 2011 Configuration, ZFS 13 Comments

What we’ve learned – RAM

This post is possibly the most important lesson that we learned.  RAM is of MAJOR importance to Nexenta.  You can’t have enough of it.  Our original Nexenta deployment was 12GB of RAM.  It seemed like a silly amount of RAM just a year ago.  Today we’re looking at it as barely a starting point.  Consider these facts :

1 – RAM is an order of magnitude (or more) faster than Flash.

2 – RAM is getting cheaper every day.

3 – You can put silly amounts of RAM in a system today.

4 – Data ages, and goes cold, and doesn’t get accessed as it gets older, reducing your Hot data footprint.

Let’s go through these statements one by one.

1 – RAM is an order of magnitude (or more) faster than Flash.  Flash will deliver, on average, between 2,000 and 5,000 IOPS, depending on the type of SSD, the wear on the SSD, and garbage collection routines.  RAM has the capability to deliver hundreds of thousands of IOPS.  It doesn’t wear out, and there’s no garbage collection.

2 – RAM is getting cheaper every day.  When we built this platform last year, we paid over US $200 per 6GB of RAM.  Today you can buy 8GB Registered ECC DIMMs for under US $100.  16GB DIMMs are hovering around US $300-$400.  Given the trends, I’d expect those to drop significantly over the next year or two.

3 – You can put silly amounts of RAM in a system today.  Last year, we were looking at reasonably priced boards that could fit 24GB of RAM in them.  Today we’re looking at reasonably priced barebones systems that you can fit 288GB of RAM in.  Insane systems (8 socket Xeon) support 2TB of RAM.  Wow.

4 – Data ages, goes cold, and doesn’t get accessed as much.  Even with only 12GB of RAM and 320GB of SSD, much of our working set is cached.  With 288GB of RAM, you greatly expand your capability of adding L2ARC (remember, L2ARC uses some of main memory) and increase your ARC cache capacity.  If your working set was 500GB, on our old system you’d be running at least 200GB of it from spinning disk.  New systems configured with nearly 300GB of ARC and a reasonable (say 1TB) amount of L2ARC would cache that entire working set.  You’d see much of that working set cached in RAM (delivering hundreds of thousands of IOPS), part of it delivered from Flash (delivering maybe 10,000 IOPS), and only very old, cold data being served up from disk.  Talk about a difference in capabilities.  This also allows you to leverage larger, slower disks for older data.  If the data isn’t being accessed, who cares if it’s on slow 7200RPM disks?  That PowerPoint presentation from 4 years ago isn’t getting looked at every day, but you’ve still got to save it.  Why not put it on the slowest disk you can find?
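
To make the working set example in point 4 concrete, here is a back-of-the-envelope Python sketch.  It is deliberately idealized: it assumes all RAM is usable as ARC, that ARC and L2ARC hold different data, and that the hottest data always lands in the fastest tier.  None of that is strictly true in practice, and the function name is made up for illustration.

```python
def place_working_set(working_set_gb: int, arc_gb: int, l2arc_gb: int) -> dict:
    """Split a working set across RAM, flash, and disk, fastest tier first."""
    in_ram = min(working_set_gb, arc_gb)
    in_flash = min(working_set_gb - in_ram, l2arc_gb)
    on_disk = working_set_gb - in_ram - in_flash
    return {"ram": in_ram, "flash": in_flash, "disk": on_disk}


# Old build: 12GB of RAM and 320GB of SSD, against a 500GB working set.
old = place_working_set(500, arc_gb=12, l2arc_gb=320)
# New build from the example: roughly 300GB of ARC and 1TB of L2ARC.
new = place_working_set(500, arc_gb=300, l2arc_gb=1000)

print("old build:", old)
print("new build:", new)
```

The idealized number for the old build comes out somewhat under the "at least 200GB" from disk quoted above.  That is expected, since not all of the RAM is available to ARC and the L2ARC itself uses some main memory.  The new build fits the whole working set in RAM plus flash either way.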

This being said, our new Nexenta build is going to have boatloads of RAM.  Maybe not 288GB (16GB DIMMs are still expensive compared to 8GB DIMMs) but I would put 144GB out there as a high probability.


Tuesday, November 15th, 2011 Configuration, Hardware 6 Comments

What we’ve learned – Working Set

So let’s start talking about some of the things that we’ve learned over the last year, shall we?  The number one thing that we have learned is that your working set size dramatically impacts your performance on ZFS.  When you can keep that working set inside of RAM, your performance numbers are outrageous.  Hundreds of thousands of IOPS.  Latency is next to nothing, and the world is wonderful.  Step over that line though, and you had better have architected your pool to absorb it.

Flash forward to a year later, and we’ve found that our typical IOPS delivered are much higher than we expected, and response times (latency) are much lower.  Why is that, you may ask?  With large datasets and very random access patterns, you would expect close to worst-case scenarios.  What we found is that we had a lot of blocks that were never accessed.  We have thick-provisioned most of our VMs, which results in 100+GB of empty space in most of those VMs (nearly 50% of all allocated capacity).  With Nexenta and ZFS, all of the really active data was being moved into the ARC/L2ARC cache.  While we still had some reads and writes going to the disks, it was a much smaller percentage.  We quickly figured out that the algorithms and tuning Nexenta employs in ZFS for caching seem to be very, very intelligent, and our working set was much smaller than we ever really imagined.

Tuesday, November 15th, 2011 Configuration, Virtualization, ZFS 5 Comments

A quick note

First and foremost, welcome back to ZFSBuild.  We’ve missed you, and we hope you’ve missed us.  Over the last year we’ve found out numerous things about this blog, our readers, and ZFS.  Unfortunately the rigors of day-to-day operations have prevented us from blogging more.  We aim to rectify that, so thanks for sticking with us.  It’s been a while, and we’ve got a lot of new information to share with you.  Hopefully we can put it together in a format that you can follow.  If not, let us know in the comments!  Now, let’s get this party started.

What have we learned in the last year?  We found out we have a lot of people that have read this blog.  With hundreds of comments, and as many (or more) emails that have come in over the last year, there are certainly a lot of people that have checked out the site.  This makes us happy.  Not only because we’ve got big giant egos that love to be stroked, but because people are interested in ZFS and Nexenta.  With that, we’d like to say Thank You.  Without you we would just be talking to ourselves, and that’s just creepy.

We also found that we like evangelizing for Nexenta.  It’s not because we get paid by them (Nexenta, if you’re reading….hint hint) but because they put out a great product that puts a great SAN within the reach of small businesses like ours.  Most businesses can’t afford to spend fifty thousand dollars on EMC or NetApp gear just to get started.  Most of them can’t spend ten thousand.  ZFS, Nexenta, and Open Source software put a robust SAN/NAS into the hands of small businesses that need it the most.  Just because you can’t afford a fifty thousand dollar SAN doesn’t mean that your data is any less important than the MegaCorp down the street.  Nexenta and ZFS allow you to get into the game for a fraction of the cost, all while giving you features that the big boys try very hard to compete with.

Another thing that many people that manage enterprise storage may not know is that ZFS is becoming more popular.  I had the opportunity to attend the Open Storage Summit this year, which was hosted by Nexenta.  One word: amazing!  Nexenta knew how to put on a show, and they knew what people were interested in.  There wasn’t one keynote or breakout session that was boring.  I also expected it to be quite a bit smaller.  I think the final tally was somewhere between 400 and 500 registered attendees.  They ran the gamut from government employees to VMware nerds to hosting providers.  It seemed like all sectors were pretty well represented, and everyone was excited about ZFS, Open Storage, and Nexenta.  More about the Open Storage Summit later.

That about wraps it up for now.  Next time I’ll get into some of the technical things that we’ve learned over the last year with our build, ZFS, and Nexenta.

Friday, November 11th, 2011 ZFS 1 Comment

Updates, News, and Feedback

We’ve gotten loads of positive feedback from our Blog and from our Anandtech article over the last year.  Since we last posted, we’ve found quite a few things that we want to share with everyone about our first build.  Keep an eye here for updates and musings about what we have found over the last year, and what we plan on doing next!

Monday, November 7th, 2011 Configuration 5 Comments