Let ZFS use all of your RAM

So we’ve been talking about RAM usage, RAM problems, and pretty much everything related to RAM lately, so I figured I’d mention this one too.

Many of you, if you’ve got a large memory system, may notice that your system never uses all of its RAM.  I’ve got some 192GB systems that routinely reported that they had 30GB of RAM free.  Why is this?  Everything that I’ve read says ZFS will use all of the RAM in your system that it can, so how can it leave 30GB just wasting away doing nothing?

The answer is both surprising, and not surprising at the same time.  When ZFS was written if a system had 8GB of RAM it was a monster, and 16GB was nearly unheard of.  Let’s not even talk about 32GB.  Today having 192GB of RAM in a system isn’t difficult to achieve, and the new Xeon E5 platforms boast RAM capacities of 1TB and more.  Back then there was a need to limit ZFS a little bit from encroaching on system processes.  There is a Solaris “bug” (if you can call it that) that limits ARC cache to 7/8 of total system memory.  This was intentional so that the ARC cache couldn’t cause contention for swapfile space in RAM.  That “bug” still exists in Nexenta 3.1.2.  This is why I have 30GB of RAM free on a 192GB system.  1/8 system memory made sense when monster systems were 8GB.  That left 1GB of RAM that wouldn’t be touched for ARC caching to ensure proper system operation.  Today the amount of RAM in systems dwarfs what most people would have used 10 years ago, and as such we need to make modifications.  Fortunately, we can tune this variable.

From what I’ve read (based on information from Richard Elling (@richardelling on twitter)) the default for Nexenta 4.0 and Illumos distributions will be 256MB.  I take a very cautious approach to this and I’m going to use 1GB.  This is all discussed here.

For those who don’t want to check out the links – the pertinent info is here :

edit /etc/system for permanent fix – this reserves 1GB of RAM

set swapfs_minfree=262144

To change it on the fly

echo swapfs_minfree/W0t262144 | mdb -kw
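Note that 262144 is a page count, not a byte count.  Here is a minimal sketch of the conversion, assuming the 4KB page size found on x86 systems (check yours with the pagesize command).  The function name is only for illustration.

```python
PAGE_SIZE_BYTES = 4096  # assumption: 4KB pages (x86); verify with `pagesize`


def swapfs_minfree_pages(reserve_bytes, page_size=PAGE_SIZE_BYTES):
    """Convert a memory reserve in bytes into the page count swapfs_minfree expects."""
    return reserve_bytes // page_size


one_gb = 1024 ** 3
quarter_gb = 256 * 1024 ** 2

print(swapfs_minfree_pages(one_gb))      # the 1GB reserve used in this post
print(swapfs_minfree_pages(quarter_gb))  # a 256MB reserve
```

If your platform uses a different page size, pass it as the second argument rather than reusing the number from this post.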

This change has allowed my ARC cache to grow from 151-155GB utilization to 176GB utilization, effectively allowing me to use all of the RAM that I should be able to use.

FYI – this is unsupported, and Nexenta support will probably give you grief if you do this without discussing it with them (if you have a support contract), so be forewarned.  There may be unintended consequences from making this change that I am not aware of.

Wednesday, April 18th, 2012 Configuration, ZFS 2 Comments

RAM, Ranks, and Voltage

Here at ZFSBuild we have come across something unusual that we thought we would share.  This isn’t necessarily related to ZFS, but we encountered it while working on a ZFS/Nexenta system.

We recently had to dig deeper into RAM Ranks and voltage specifications.  This stems from populating a SuperMicro SYS-6026-RFT+ barebones system full of RAM.  The system in question has 18 DDR3 RAM slots.  We ordered 8GB DIMMs from Wiredzone.com, and based on their specifications we were pretty sure they would work.  We got the system in, populated the RAM slots, and started running tests.  The weird thing about it was that the system only ever saw 120GB of RAM.  We started reading….and reading….and reading…  We finally came across some SuperMicro documents here and here.  Turns out the RAM we ordered was substituted with RAM that was assumed to be compatible.  The only difference between the DIMMs we ordered and the DIMMs we received was the voltage that they operated at.  We ordered 1.5V DIMMs, and we were shipped 1.35V DIMMs.  When using 1.35V DIMMs the system detected 120GB of usable RAM.

We fought for a few days between SuperMicro, WiredZone, and our own gut feelings and finally got it sorted out.  Wiredzone shipped us new DIMMs that were Dual Rank and 1.5V, and they worked flawlessly.  We’d like to give a big shout out to the WiredZone staff and to the SuperMicro staff that helped us on this.  It’s not terribly well understood black magic that goes on in these servers, and when working on the boundaries of what’s possible all sorts of odd things come up.  The last week has been one of them.

A side note on this is that we would have seen similar behavior if the RAM had been quad-ranked.  In a quad-ranked configuration, the server will apparently only see 12 DIMMs also.  In all of our years of building systems and working with servers we had never encountered this, and are very happy that we had the folks at WiredZone and SuperMicro to help us sort this out.

Monday, April 9th, 2012 Hardware 2 Comments

When is enough memory too much? Part 2

So one of the Nexenta systems that I’ve been working on had its memory quadrupled, and ever since then has been having some issues (as detailed in the previous post – that was actually supposed to go live a few weeks ago).  Lots of time spent on Skype with Nexenta support has led us in a few directions.  Yesterday, we made a breakthrough.

We have been able to successfully correlate VMware activities with the general wackiness of our Nexenta system.  This occurs at the end of a snapshot removal, or at the end of a storage vmotion.  Yesterday, we stumbled across something that we hadn’t noticed before.  After running the storage vmotion, the Nexenta freed up the same amount of RAM from the ARC cache as the size of the VMDK that just got moved.  This told us something very interesting.

1 – There is no memory pressure at all.  The entire VMDK got loaded into the ARC cache as it was being read out.  And it wasn’t replaced.

2 – Even after tuning the arc_shrink_shift variables, we were still freeing up GOBS of memory.  50GB in this case.

3 – When we free up that much RAM, Nexenta performs some sort of cleanup, and gets _very_ busy.

After reviewing the facts of the case, we started running some dtrace scripts that I’ve come across. Arcstat.pl (from Mike Harsch) showed that as the data was being deleted from disk, arc usage was plummeting, and as soon as it settled down, the arc target size was reduced by the same amount.  When that target size was reduced, bad things happened.

At the same time, I ran mpstat to show what was going on with the CPU.  While this was going on, we consistently saw millions of cross-calls from one processor core to another, and 100% system time.  The system was literally falling over trying to free up RAM.

Currently the solution that we have put into place is setting arc_c_min to arc_max - 1GB.  This has so far prevented arc_c (target size) from shrinking aggressively and causing severe outages.
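For anyone who wants to reproduce the arithmetic, here is a rough sketch.  It assumes the limits are expressed in bytes and applied through the zfs:zfs_arc_max and zfs:zfs_arc_min tunables in /etc/system, which is an assumption on top of what is described above, and the 176GB arc_max is only an example figure.  Talk to support before applying anything like this.

```python
GIB = 1024 ** 3


def arc_min_for(arc_max_bytes, headroom_bytes=GIB):
    """Return an ARC floor that sits `headroom_bytes` below the ARC ceiling."""
    if arc_max_bytes <= headroom_bytes:
        raise ValueError("arc_max must be larger than the headroom")
    return arc_max_bytes - headroom_bytes


def etc_system_lines(arc_max_bytes):
    """Format hypothetical /etc/system entries for the two limits (assumed tunable names)."""
    arc_min = arc_min_for(arc_max_bytes)
    return [
        f"set zfs:zfs_arc_max={arc_max_bytes}",
        f"set zfs:zfs_arc_min={arc_min}",
    ]


# Example only: a 176GB ceiling with the floor 1GB below it.
for line in etc_system_lines(176 * GIB):
    print(line)
```

Keeping the floor that close to the ceiling means the ARC target can only ever shrink by about 1GB, which is the whole point of the workaround.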

There still appears to be a little bit of a hiccup going on when we do those storage vmotions, but the settings that we are using now appear to at least be preventing the worst of the outages.

Monday, March 5th, 2012 Hardware 2 Comments

When is enough memory too much?

Good question.  One would think that there’s never too much memory, but in some cases, you’d be dead wrong (at least, not without tuning).  I’m battling that exact issue today.  On a system that I’m working with, we upgraded the RAM from 48GB to 192GB of RAM.  ZFS Evil Tuning guide says don’t worry, we auto-tune better than Chris Brown.  I’m starting to not believe that.  We’ve been intermittently seeing the system go dark (literally dropping portchannels to Cisco Nexus 5010 switches), then roaring back to life.  Standard logging doesn’t appear to be giving much insight, but after digging through ZenOSS logs and multiple dtrace scripts, I think we’ve found a pattern.

It appears as though by default, Nexenta will de-allocate a certain percentage of your memory when it does memory cleanup related to the ARC cache.  When you get to larger memory systems, the amount of memory it frees grows.  I monitored an event where it freed up something to the tune of 8GB of RAM.  That happened to coincide with a portchannel dropping.

Through all of this, support has been great.  We’ve been tuning the amount of memory it frees up.  We’ve tuned the minimum amount of RAM to free up (in an effort to get it to free memory more often).  We’ve allocated more memory to ARC metadata.  Pretty much we’ve thrown the kitchen sink at it.  The last tweak was done today, and I’m monitoring the system to see if we continue to see problems.  Hopefully, once this is all done I can post some tunables for larger memory systems.

Friday, March 2nd, 2012 Hardware 2 Comments

SuperMicro 6036ST-6LR

We’ve been asked about the SuperMicro 6036ST-6LR, affectionately known as the SuperMicro Storage Bridge Bay, and why we did not use that platform.  I threw out a few reasons quick last night, but wanted to elaborate on those points a little bit, and add another one.  First and foremost, when we started our build, the Storage Bridge Bay wasn’t available.  If it had been, we probably would have gotten it just for the new/neat factor.  Now, on to the other reasons I posted last night.

1 – We don’t _need_ HA.  This sounds a bit silly, but for all intents and purposes, we don’t _need_ HA on a day to day basis.  Our hardware has been stable enough to allow us to get by without having a full HA system for storage.  Yes, for maintenance windows, it would be nice to be able to fail to another head, do maintenance, then fail back.  We are a service provider though, and our SLAs allow us to open maintenance windows.  Everyone has them, everyone needs them, and sometimes downtime is an unavoidable consequence.  Our customers are well aware of this and tolerate these maintenance windows quite well.  While we’d love to have HA available, it’s not a requirement in our environment at this time.

2 – It’s expensive (Hardware) – For about the same price, you can get 2x 2U SuperMicro systems and cable them up to an external JBOD.  If you don’t need or want HA, your costs go down substantially.

3 – It’s expensive (Software) – To get into HA with Nexenta, you have to run at least Gold level licensing, plus the HA cluster plugin.  For example, if you’ve only got 8TB of raw storage, the difference is between $1,725 for an 8TB Silver license, vs $10,480 for 2 8TB Gold licenses plus the HA Cluster plugin.  Obviously going with more storage makes the cost differential smaller, but there is definitely a premium associated with going HA.  Our budget simply doesn’t allow us to spend that much money on the storage platform.  Our original build clocked in well under $10,000 ($6,700 to be exact, checking the records).  Our next build has a budget of under $20,000.  Spending half of the budget on HA software just breaks the bank.

4 – (Relatively) limited expansion – Our new build will likely be focused on a dedicated 2U server as a head node.  This node has multiple expansion slots, integrated 10GbE, and support for up to 288GB of RAM (576GB if 32GB DIMMs get certified).  It’s a much beefier system allowing for much more power in the head for compression and caching.  Not that the Storage Bridge Bay doesn’t have expansion, but it’s nowhere near as expandable as a dedicated 2U head node.

Now, after all of this, don’t go throwing in the towel on building an HA system, or even building an HA system using the Storage Bridge Bay.  For many use cases, it’s the perfect solution.  If you don’t need a ton of storage but still need High Availability and are constrained on space, this is a perfect solution.  3U and you can have a few dozen TB of storage, plus read and write caches.  It’d also be the perfect solution for a VDI deployment requiring HA.  Slap a bunch of SSDs in it, and it’s rocking.  After using the Nexenta HA plugin, I can say that it’s definitely a great feature to have, and if you’ve got the requirement for HA I’d give it a look.

Tuesday, November 22nd, 2011 Hardware 5 Comments

Dedupe – be careful!

Syndicated from StorageAdventures.com

Just got done working on a friend’s Nexenta platform, and boy howdy was it tired.  NFS was slow, iSCSI was slow, and for all the slowness, we couldn’t really see a problem with the system.  The GUI was reporting there was free RAM, and iostat showed the disks not being completely thrashed.  We didn’t see anything really out of the ordinary at first glance.

After some digging, we figured out that we were running out of RAM for the Dedupe tables.  It goes a little something like this.

Nexenta by default allocates 1/4 of your ARC cache (RAM) to metadata caching.  Your L2ARC map is considered metadata.  When you turn on dedupe, all of that dedupe information is stored in metadata.  The more you dedupe, the more RAM you use, the more L2ARC you use, the more RAM you use.

The system in question is a 48GB system, and it reported that it had free memory, so we were baffled.  If it’s got free RAM, what’s the holdup?  Seems as though between the dedupe tables and the L2ARC, we had outstripped the capabilities of the ARC to hold all of the metadata.  This caused _everything_ to be slow.  The solution?  You can either increase the percentage of RAM that can be used for metadata, increase the total RAM (thereby increasing the amount you can use for metadata caching), or you can turn off dedupe, copy everything off of the volume, then copy it back.  Since there’s no way currently to “undedupe” a volume, once that data has been created, you’re stuck with it until you remove the files.

So, without further ado, here’s how to figure out what’s going on in your system.

echo ::arc|mdb -k

This will display some interesting stats.  The most important in this situation are the last three lines :

arc_meta_used             =     11476 MB
arc_meta_limit            =     12014 MB
arc_meta_max              =     12351 MB

These numbers will change.  Things will get evicted, things will come back.  You don’t want to see the meta_used and meta_limit numbers this close.  You definitely don’t want to see the meta_max exceed the limit.  This is a great indicator that you’re out of RAM.
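If you want to keep an eye on this without eyeballing the output every time, something like the following sketch could do the comparison for you.  It assumes the three lines look exactly like the sample above (values in MB), and the 90% warning threshold is an arbitrary number picked for illustration, not a Nexenta recommendation.

```python
import re


def parse_arc_meta(text):
    """Pull the arc_meta_* values (in MB) out of `echo ::arc | mdb -k` style output."""
    stats = {}
    pattern = r"^\s*(arc_meta_\w+)\s*=\s*(\d+)\s*MB"
    for match in re.finditer(pattern, text, re.MULTILINE):
        stats[match.group(1)] = int(match.group(2))
    return stats


def meta_pressure(stats, warn_ratio=0.9):
    """Return (used/limit ratio, warnings). warn_ratio is an illustrative threshold."""
    ratio = stats["arc_meta_used"] / stats["arc_meta_limit"]
    warnings = []
    if ratio >= warn_ratio:
        warnings.append("arc_meta_used is close to arc_meta_limit")
    if stats["arc_meta_max"] > stats["arc_meta_limit"]:
        warnings.append("arc_meta_max has exceeded arc_meta_limit")
    return ratio, warnings


sample = """\
arc_meta_used             =     11476 MB
arc_meta_limit            =     12014 MB
arc_meta_max              =     12351 MB
"""

ratio, warnings = meta_pressure(parse_arc_meta(sample))
print(f"metadata usage: {ratio:.0%} of limit")
for warning in warnings:
    print("WARNING:", warning)
```

Feed it the saved output of the mdb command and it flags both of the conditions described above.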

After quite a bit of futzing around, disabling dedupe, and shuffling data off of, then back on to the pool, things look better :

arc_meta_used             =      7442 MB
arc_meta_limit            =     12014 MB
arc_meta_max              =     12351 MB

Just by disabling dedupe, and blowing away the dedupe tables, it freed up almost 5GB of RAM.  Who knows how much was being swapped in and out of RAM.

Other things to check :

zpool status -D <volumename>

This gives you your standard volume status, but it also prints out the dedupe information.  This is good to figure out how much dedupe data there is.  Here’s an example :

DDT entries 7102900, size 997 on disk, 531 in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    6.41M    820G    818G    817G    6.41M    820G    818G    817G
     2     298K   37.3G   37.3G   37.3G     656K   82.0G   82.0G   81.9G
     4    30.5K   3.82G   3.82G   3.81G     140K   17.5G   17.5G   17.5G
     8    43.9K   5.49G   5.49G   5.49G     566K   70.7G   70.7G   70.6G
    16      968    121M    121M    121M    19.1K   2.38G   2.38G   2.38G
    32      765   95.6M   95.6M   95.5M    33.4K   4.17G   4.17G   4.17G
    64       33   4.12M   4.12M   4.12M    2.77K    354M    354M    354M
   128        5    640K    640K    639K      943    118M    118M    118M
   256        2    256K    256K    256K      676   84.5M   84.5M   84.4M
    1K        1    128K    128K    128K    1.29K    164M    164M    164M
    4K        1    128K    128K    128K    5.85K    749M    749M    749M
   32K        1    128K    128K    128K    37.0K   4.63G   4.63G   4.62G
 Total    6.77M    867G    865G    864G    7.84M   1003G   1001G   1000G

 

This tells us that there are 7 million entries, with each entry taking up 997 bytes on disk, and 531 bytes in memory.  Simple math tells us how much space that takes up.

7102900 * 531 = 3771639900 bytes, and 3771639900 / 1024 / 1024 = roughly 3596MB used in RAM

The same math tells us that there’s 6753MB used on disk, just to hold the dedupe tables.
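The math above is easy to script.  This is a small sketch that reads the "DDT entries" summary line and does the same multiplication.  The function name is made up for this example, and it assumes the line is formatted exactly like the one shown here.

```python
import re


def ddt_footprint(summary_line):
    """Estimate dedupe table size in MB from the 'DDT entries' line of `zpool status -D`."""
    match = re.search(
        r"DDT entries (\d+), size (\d+) on disk, (\d+) in core", summary_line
    )
    if match is None:
        raise ValueError("not a DDT summary line")
    entries, bytes_on_disk, bytes_in_core = (int(g) for g in match.groups())
    mb = 1024 * 1024
    return {
        "entries": entries,
        "ram_mb": entries * bytes_in_core // mb,   # per-entry bytes in core * entries
        "disk_mb": entries * bytes_on_disk // mb,  # per-entry bytes on disk * entries
    }


print(ddt_footprint("DDT entries 7102900, size 997 on disk, 531 in core"))
```

Compare ram_mb against your arc_meta_limit to get a feel for how much of your metadata budget dedupe is eating.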

The dedupe ratio on this system wasn’t even worth it.  Overall dedupe ratio was something like 1.15x.  Compression on that volume (which has nearly no overhead), after shuffling the data around, is at 1.42x.  So at the cost of CPU time (which there is plenty of), we get a better over-subscription ratio from compression vs deduplication.

There are definitely use-cases for deduplication, but this generic VM storage pool is not one of them.

Friday, November 18th, 2011 Configuration, ZFS 5 Comments

What we’ve learned – RAM

This post is possibly the most important lesson that we learned.  RAM is of MAJOR importance to Nexenta.  You can’t have enough of it.  Our original Nexenta deployment was 12GB of RAM.  It seemed like a silly amount of RAM just a year ago.  Today we’re looking at it as barely a starting point.  Consider these facts :

1 – RAM is an order of magnitude (or more) faster than Flash.

2 – RAM is getting cheaper every day.

3 – You can put silly amounts of RAM in a system today.

4 – Data ages, and goes cold, and doesn’t get accessed as it gets older, reducing your Hot data footprint.

Let’s go through these statements one by one.

1 – RAM is an order of magnitude (or more) faster than Flash.  Flash will deliver, on average, between 2,000 and 5,000 IOPS, depending on the type of SSD, the wear on the SSD, and garbage collection routines.  RAM has the capability to deliver hundreds of thousands of IOPS.  It doesn’t wear out, and there’s no garbage collection.

2 – RAM is getting cheaper every day.  When we built this platform last year, we paid over US $200 per 6GB of RAM.  Today you can buy 8GB Registered ECC DIMMs for under US $100.  16GB DIMMs are hovering around US $300-$400.  Given the trends, I’d expect those to drop significantly over the next year or two.

3 – You can put silly amounts of RAM in a system today.  Last year, we were looking at reasonably priced boards that could fit 24GB of RAM in them.  Today we’re looking at reasonably priced barebones systems that you can fit 288GB of RAM in.  Insane systems (8 socket Xeon) support 2TB of RAM.  Wow.

4 – Data ages, goes cold, and doesn’t get accessed as much.  Even with only 12GB of RAM and 320GB of SSD, much of our working set is cached.  With 288GB of RAM, you greatly expand your capability of adding L2ARC (remember, L2ARC uses some of main memory) and increase your ARC cache capacity.  If your working set was 500GB on our old system, you’d be running at least 200GB of it from spinning disk.  New systems configured with nearly 300GB of ARC and a reasonable (say 1TB) amount of L2ARC would cache that entire working set.  You’d see much of that working set cached in RAM (delivering hundreds of thousands of IOPS), part of it delivered from Flash (delivering maybe 10,000 IOPS), and only very old, cold data being served up from disk.  Talk about a difference in capabilities.  This also allows you to leverage larger, slower disks for older data.  If the data isn’t being accessed, who cares if it’s on slow 7200RPM disks?  That PowerPoint presentation from 4 years ago isn’t getting looked at every day, but you’ve still got to save it.  Why not put it on the slowest disk you can find?
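To make the tiering argument concrete, here is a back-of-the-envelope sketch.  It is an idealized model that assumes every byte of ARC and L2ARC holds working-set data, which never happens in practice (the L2ARC map alone eats some RAM), so treat the disk figure it prints for the old system as a best case.  The function and its parameters are made up for this illustration.

```python
def tier_working_set(working_set_gb, arc_gb, l2arc_gb):
    """Split a working set across RAM, flash, and disk, filling the fastest tier first.

    Idealized: ignores ARC metadata, L2ARC headers, and anything else using RAM.
    """
    in_ram = min(working_set_gb, arc_gb)
    in_flash = min(working_set_gb - in_ram, l2arc_gb)
    on_disk = working_set_gb - in_ram - in_flash
    return {"ram": in_ram, "flash": in_flash, "disk": on_disk}


# Old build: 12GB of RAM and 320GB of SSD.
print("old:", tier_working_set(500, arc_gb=12, l2arc_gb=320))

# Hypothetical new build: 288GB of RAM and 1TB of L2ARC.
print("new:", tier_working_set(500, arc_gb=288, l2arc_gb=1024))
```

Even in this best case the old system pushes a big chunk of a 500GB working set to spinning disk, while the larger build serves all of it from RAM and Flash.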

This being said, our new Nexenta build is going to have boatloads of RAM.  Maybe not 288GB (16GB DIMMs are still expensive compared to 8GB DIMMs) but I would put 144GB out there as a high probability.

 

Tuesday, November 15th, 2011 Configuration, Hardware 6 Comments

What we’ve learned – Working Set

So let’s start talking about some of the things that we’ve learned over the last year, shall we?  The number one thing that we have learned is that your working set size dramatically impacts your performance on ZFS.  When you can keep that working set inside of RAM, your performance numbers are outrageous.  Hundreds of thousands of IOPS.  Latency is next to nothing, and the world is wonderful.  Step over that line though, and you had better have architected your pool to absorb it.

Flash forward to a year later, and we’ve found that our typical response times and IOPS delivered are much higher than we expected, and latency is much lower.  Why is that, you may ask?  With large datasets and very random access patterns, you would expect close to worst-case scenarios.  What we found is that we had a lot of blocks that were never accessed.  We have thick-provisioned most of our VMs, which results in 100+GB of empty space in most of those VMs (nearly 50% of all allocated capacity).  With Nexenta and ZFS, all of the really active data was being moved into the ARC/L2ARC cache.  While we still had some reads and writes going to the disks, it was a much smaller percentage.  We quickly figured out that the algorithms and tuning Nexenta employs in ZFS for caching seem to be very intelligent, and our working set was much smaller than we ever really imagined.

Tuesday, November 15th, 2011 Configuration, Virtualization, ZFS 5 Comments

A quick note

First and foremost, welcome back to ZFSBuild.  We’ve missed you, and we hope you’ve missed us.  Over the last year we’ve found out numerous things about this blog, our readers, and ZFS.  Unfortunately the rigors of day to day operations have prevented us from blogging more.  We aim to rectify that, so thanks for sticking with us.  It’s been a while, and we’ve got a lot of new information to share with you.  Hopefully we can put it together in a format that you can follow.  If not, let us know in the comments!  Now, let’s get this party started.

What have we learned in the last year?  We found out we have a lot of people that have read this blog.  With hundreds of comments, and as many (or more) emails that have come in over the last year, there are certainly a lot of people that have checked out the site.  This makes us happy.  Not only because we’ve got big giant egos that love to be stroked, but because people are interested in ZFS and Nexenta.  With that, we’d like to say Thank You.  Without you we would just be talking to ourselves, and that’s just creepy.

We also found that we like evangelizing for Nexenta. It’s not because we get paid by them (Nexenta, if you’re reading….hint hint) but because they put out a great product that puts a great SAN within the reach of small businesses like ours.  Most businesses can’t afford to spend fifty thousand dollars on EMC or NetApp gear just to get started.  Most of them can’t spend ten thousand.  ZFS, Nexenta, and Open Source software put a robust SAN/NAS into the hands of small businesses that need it the most.  Just because you can’t afford a fifty thousand dollar SAN doesn’t mean that your data is any less important than the MegaCorp down the street.  Nexenta and ZFS allow you to get into the game for a fraction of the cost, all while giving you features that the big boys try very hard to compete with.

Another thing that many people that manage enterprise storage may not know is that ZFS is becoming more popular.  I had the opportunity to attend the Open Storage Summit this year, which was hosted by Nexenta.  One word: amazing!  Nexenta knew how to put on a show, and they knew what people were interested in.  There wasn’t one keynote or breakout session that was boring.  I also expected it to be quite a bit smaller.  I think the final tally was somewhere between 400 and 500 registered attendees.  They ran the gamut from government employees to VMware nerds to hosting providers.  It seemed like all sectors were pretty well represented, and everyone was excited about ZFS, Open Storage, and Nexenta.  More about the Open Storage Summit later.

That about wraps it up for now.  Next time I’ll get into some of the technical things that we’ve learned over the last year with our build, ZFS, and Nexenta.

Friday, November 11th, 2011 ZFS 1 Comment

Updates, News, and Feedback

We’ve gotten loads of positive feedback from our Blog and from our Anandtech article over the last year.  Since we last posted, we’ve found quite a few things that we want to share with everyone about our first build.  Keep an eye here for updates and musings about what we have found over the last year, and what we plan on doing next!

Monday, November 7th, 2011 Configuration 3 Comments

OpenIndiana Benchmarks

After Oracle decided to change the course of OpenSolaris (forum thread), the open source community reacted by forking the code base through a new project called Illumos. The first downloadable ISO from the Illumos project is OpenIndiana.

OpenIndiana is based on OpenSolaris b147. It is important to take a minute and look at build numbers of popular milestones within the OpenSolaris development process. Here are some major ones.
OpenSolaris 2008.11: b101
OpenSolaris 2009.06: b111
OpenSolaris 2010.03: b134
OpenSolaris b147 forks to create OpenIndiana b147

The b134 (2010.03) release was held back and never released as an official OpenSolaris release. If you go to the OpenSolaris site, the most recent official ISO is 2009.06. We have been using b134 in all of our tests anyway, because b134 has measurably better iSCSI performance than 2009.06 (b111). Additionally, the NexentaStor and Nexenta Core Platform distributions we benchmarked were both originally based on OpenSolaris b134.

OpenIndiana is built on OpenSolaris b147, which means it has a number of bug fixes since b134 and even more bug fixes since b111 (2009.06). At this point, you can think of OpenIndiana as the latest and greatest OpenSolaris code with OpenIndiana logos added to it. At this point, OpenIndiana is not significantly different from OpenSolaris b147 in any specific technical way.

› Continue reading

Monday, October 11th, 2010 Benchmarks 12 Comments

Nexenta Core Platform Benchmarks

In our benchmarks between OpenSolaris b134, NexentaStor Enterprise, and a Promise VTrak M610i box, we found that OpenSolaris consistently outperformed NexentaStor. We were never really quite sure why, since NexentaStor is based on Nexenta Core Platform and Nexenta Core Platform was based on OpenSolaris b134. We expected NexentaStor to match the performance of OpenSolaris, but it simply did not.

One theory we had for the performance difference was that the web GUI in NexentaStor used enough system memory that NexentaStor had significantly less ARC cache available and was therefore at a performance disadvantage to OpenSolaris. This got us curious about how Nexenta Core Platform would perform relative to OpenSolaris and NexentaStor.

We decided to benchmark Nexenta Core Platform using the same hardware and benchmarks that we have used for all of the previous benchmark runs. The results exceeded our expectations.
› Continue reading

Saturday, October 9th, 2010 Benchmarks 4 Comments

FreeNAS vs OpenSolaris ZFS Benchmarks

We have received a lot of feedback from members of the IT community since we published our benchmarks comparing OpenSolaris and Nexenta with an off the shelf Promise VTrak M610i. One question we received from several people was about FreeNAS. Several people asked “How does FreeNAS compare to OpenSolaris on the same hardware?” That was an excellent question, and we decided to run some tests to answer that question.
› Continue reading

Friday, September 10th, 2010 Benchmarks, ZFS 30 Comments

Benchmarks Comparing OpenSolaris, Nexenta, and a Promise VTrak

We ran some benchmarks using IOmeter (running on Windows 2008 R2 on our test blade) to compare OpenSolaris running on our ZFSBuild project hardware, Nexenta running on exactly the same hardware, and a Promise VTrak M610i box. We ran all of the benchmarks over gigabit Ethernet.

Here are screenshots of the IOmeter config:


› Continue reading

Tuesday, August 3rd, 2010 Benchmarks 15 Comments

Testing the L2ARC

We posted an article a while back that explained the cool L2ARC feature of ZFS.  We thought it would be fun to actually test the L2ARC and build a chart of the performance as a function of time.  To test and graph usefulness of L2ARC, we set up an iSCSI share on the ZFS server and then ran IOmeter from our test blade in our blade center.  We ran these tests over gigabit Ethernet.
› Continue reading

Friday, July 30th, 2010 ZFS 2 Comments

HowTo : Our Zpool configuration

We’ve decided to go with striped mirrored vdevs (similar to RAID10) for our ZFS configuration. It gives us the best performance and fault tolerance for what we use the system for. To reproduce our ZFS configuration, you would use all of the commands in the image below (assuming your drives were named the same way ours were) :
› Continue reading

Thursday, June 3rd, 2010 RAID, ZFS 12 Comments

HowTo : Add Spare Drives to Zpool

Spare drives in a ZFS or any RAID configuration are a must-have. Consider this – A winter storm hits and a snowplow piles 3 feet of snow in your driveway. 5 minutes later a drive in your system fails. You’re stuck until you can plow or snowblow your way out of your driveway. Without spare drives in your system, you run the risk of additional drives failing while that drive is offline. Depending on the ZFS or RAID level, a second drive failure could cause permanent data loss. If you have a spare drive (or multiple spare drives) in your system, the system can automatically start rebuilding the array as soon as it detects that a drive has failed. This limits the amount of time that your ZFS or RAID subsystem is unprotected.

To add spare drives to your system first run the format command to find the disks that you have in your system.

Format Command
› Continue reading

Thursday, June 3rd, 2010 RAID, ZFS 2 Comments

HowTo : Add Log Drives (ZIL) to Zpool

ZIL (ZFS Intent Log) drives can be added to a ZFS pool to speed up the write capabilities of any level of ZFS RAID. The intent log records synchronous writes on a very fast SSD drive so they can be acknowledged quickly, which increases the write throughput of the system. When the physical spindles have a moment, that data is then committed to the spinning media and the process starts over. We have observed significant performance increases by adding ZIL drives to our ZFS configuration. One thing to keep in mind is that the ZIL should be mirrored to protect the speed of the ZFS system. If the ZIL is not mirrored, and the drive that is being used as the ZIL drive fails, the system will revert to writing the data directly to the disk, severely hampering performance.

To add a ZIL drive to your ZFS system first run the format command to find the disks that you have available in your system.

Format Command
› Continue reading

Thursday, June 3rd, 2010 RAID, ZFS No Comments

HowTo : Add Cache drives to a Zpool

The Cache drives (or L2ARC Cache) are used for frequently accessed data. In our system we have configured it with 320GB of L2ARC cache. This cache resides on MLC SSD drives which have significantly faster access times than traditional spinning media. This means that up to 320GB of the most frequently accessed data can be kept in an SSD cache, and when requested does not have to be read from spinning media. This greatly speeds up access time for files that are frequently used.

To add caching drives to your Zpool first run the format command to find the disks that you have in your system.

Format Command
› Continue reading

Thursday, June 3rd, 2010 RAID, ZFS 4 Comments

HowTo : Create Striped RAIDZ Nested Zpool

A Striped RAIDZ Zpool is useful for a number of reasons. It gives you additional resiliency against drive failures and performs slightly better due to being striped across more drives. A Striped RAIDZ Zpool is very similar to RAID50.

To create a Striped RAIDZ Zpool first run the format command to find the disks that you have in your system.

Format Command
› Continue reading

Thursday, June 3rd, 2010 RAID, ZFS 4 Comments

HowTo : Create Three Way Mirror Zpool

A three way mirror is useful if you are very concerned about data integrity. Basically a three way mirror is similar to RAID1, except it mirrors its data across three drives instead of two drives. This effectively cuts your usable space to 1/3 of the total capacity of the drives, but it allows two drives to fail while maintaining data integrity.

To set up a three way mirror first run the format command to find the disks that you have in your system.
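
A short sketch of the command and the capacity arithmetic, assuming three hypothetical 1000GB disks and a pool called "tank". Nothing is executed against the disks; the script only prints the command and the expected usable space.

```shell
# Hypothetical pool and disk names; use the ones format shows on your system.
POOL="tank"
DISKS="c1t0d0 c1t1d0 c1t2d0"
DISK_COUNT=3
DISK_GB=1000   # assumed size of each disk

# One "mirror" keyword followed by three disks makes a three way mirror.
CMD="zpool create $POOL mirror $DISKS"

# Usable space is 1/3 of the raw total, i.e. the size of a single disk.
RAW_GB=$(( DISK_COUNT * DISK_GB ))
USABLE_GB=$(( RAW_GB / 3 ))

echo "$CMD"
echo "raw: ${RAW_GB}GB, usable: ${USABLE_GB}GB"
```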

Thursday, June 3rd, 2010 RAID, ZFS 2 Comments

HowTo : Create Striped Mirror Vdev ZPool

A Striped Mirrored Vdev Zpool is very similar to RAID10. It does have the additional feature of checksumming to prevent silent data corruption, but essentially it is the same as RAID10. It gives you great random read and random write performance, but it does decrease your available disk space to 50% of the physical capacity of your drives. Every time we have set up RAID for workloads, though, we have found that the additional IOPS available for random writes far offsets the penalty of losing half of your disk space.

To create a Striped Mirrored Vdev Zpool, first run the format command to find the disks that you have in your system.
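
As a hedged example, the pool below is built from two mirrored pairs using four hypothetical 1000GB disks. The 50% capacity figure from above is worked through as well. The command is printed, not run.

```shell
# Hypothetical pool and disk names; use the ones format shows on your system.
POOL="tank"
PAIR_A="c1t0d0 c1t1d0"
PAIR_B="c1t2d0 c1t3d0"
DISK_COUNT=4
DISK_GB=1000   # assumed size of each disk

# Each "mirror" keyword starts a new mirrored pair; ZFS stripes across pairs.
CMD="zpool create $POOL mirror $PAIR_A mirror $PAIR_B"

# Half of the raw capacity is spent on the mirror copies.
USABLE_GB=$(( DISK_COUNT * DISK_GB / 2 ))

echo "$CMD"
echo "usable: ${USABLE_GB}GB"
```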

Thursday, June 3rd, 2010 RAID, ZFS 1 Comment

HowTo : Create RAIDZ2 ZPool

A RAIDZ2 Zpool is very similar in function to a RAID6 array. You get two parity points to prevent array failure in case of drive failures. A RAIDZ2 Zpool can tolerate two drive failures before it becomes vulnerable to data loss.

To create a RAIDZ2 Zpool first run the format command to find the disks that you have in your system.
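
A minimal sketch, assuming six hypothetical 1000GB disks in one RAIDZ2 group. The usable figure is only an approximation (disk count minus two parity disks) and ignores filesystem overhead. The command is printed, not run.

```shell
# Hypothetical pool and disk names; use the ones format shows on your system.
POOL="tank"
DISKS="c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0"
DISK_COUNT=6
DISK_GB=1000   # assumed size of each disk

# "raidz2" builds a double-parity group from the listed disks.
CMD="zpool create $POOL raidz2 $DISKS"

# Roughly two disks' worth of space goes to parity.
APPROX_USABLE_GB=$(( (DISK_COUNT - 2) * DISK_GB ))

echo "$CMD"
echo "approximate usable: ${APPROX_USABLE_GB}GB"
```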

Thursday, June 3rd, 2010 RAID, ZFS No Comments

HowTo : Create RAIDZ Zpool

RAIDZ is ZFS’s implementation of RAID5. It uses a variable width stripe for its parity, which allows for better performance than traditional RAID5 implementations. RAIDZ is typically used when you want the most out of your physical storage and are willing to sacrifice a bit of performance to get it. You can have a single disk failure in a RAIDZ array and still maintain all of your data.

To create a RAIDZ Zpool first run the format command to find the disks that you have in your system.
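
For reference, a single-parity RAIDZ pool over three hypothetical disks would be created with a command along these lines. The names are placeholders and the script only prints the command.

```shell
# Hypothetical pool and disk names; use the ones format shows on your system.
POOL="tank"
DISKS="c1t0d0 c1t1d0 c1t2d0"

# "raidz" builds a single-parity group from the listed disks.
CMD="zpool create $POOL raidz $DISKS"

# Print the command for review instead of running it.
echo "$CMD"
```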

Thursday, June 3rd, 2010 RAID, ZFS 4 Comments

HowTo : Create Mirrored Vdev Zpool

Mirrored Vdevs are equivalent to a RAID1 array, with the added bonus of checksum data to prevent silent data corruption. The performance of a Mirrored Vdev Zpool will be very similar to a RAID1 array.

To create a Mirrored Vdev Zpool first run the format command to find the disks that you have in your system.
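
A sketch of the simplest case, assuming two hypothetical disks and a pool called "tank". The command is printed so it can be checked before being run by hand.

```shell
# Hypothetical pool and disk names; use the ones format shows on your system.
POOL="tank"
DISK_A="c1t0d0"
DISK_B="c1t1d0"

# "mirror" followed by two disks makes a two way mirror, like RAID1.
CMD="zpool create $POOL mirror $DISK_A $DISK_B"

# Print the command for review instead of running it.
echo "$CMD"
```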

Thursday, June 3rd, 2010 RAID, ZFS 1 Comment

HowTo : Create ZFS Striped Vdev ZPool

A ZFS Striped Vdev pool is very similar to RAID0. You get to keep all of the available storage that your drives offer, but you have no resiliency to hard drive failure. If one drive in a Striped Vdev Zpool fails you will lose all of your data. You do still have checksum data to detect silent data corruption, but any physical failure of a drive will result in data loss. We strongly recommend never using this level of ZFS, as there is no resiliency to drive failure.

To create a Striped Vdev Zpool first run the format command to find the disks that you have in your system.
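
Purely as an illustration, and given the warning above, the sketch below builds the command with zpool create's -n option, which shows the layout that would be used without actually creating the pool. Pool and disk names are hypothetical.

```shell
# Hypothetical pool and disk names; use the ones format shows on your system.
POOL="tank"
DISKS="c1t0d0 c1t1d0"

# Listing disks with no "mirror" or "raidz" keyword stripes across them.
# -n makes this a dry run: the configuration is displayed, nothing is created.
CMD="zpool create -n $POOL $DISKS"

# Print the command for review instead of running it.
echo "$CMD"
```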

Thursday, June 3rd, 2010 RAID, ZFS 1 Comment

ZFS RAID levels

When we evaluated ZFS for our storage needs, the immediate question became – what are these storage levels, and what do they do for us? ZFS uses odd (to someone familiar with hardware RAID) terminology like Vdevs, Zpools, RAIDZ, and so forth. These are simply Sun’s words for a form of RAID that is pretty familiar to most people that have used hardware RAID systems.

Wednesday, May 26th, 2010 RAID 10 Comments

Initial ZFS performance Stats

Some people have been asking for some performance stats from our Zpool. We’re still working on developing a rigorous testing mechanism to produce reliable and relevant numbers, but for those of you who cannot wait, here are a few tidbits!

Monday, May 24th, 2010 Benchmarks 21 Comments

Initial InfiniBand Performance Testing

We’ve gotten the InfiniBand network mostly up and running, and have been doing some performance tests with the included WinOF InfiniBand performance tools. Needless to say the results are nothing less than stunning. This is the fastest transport that we’ve ever had available to us to use in the DataCenter.

Friday, May 21st, 2010 Benchmarks, InfiniBand 7 Comments

Disk Drive Selection

Our search for an affordable yet high performance storage array has led us to use ZFS, OpenSolaris, and commodity hardware. To get the affordable part of the storage in hand, we had to investigate all of our options when it came to hard drives and available SATA technology. We finally settled on a combination of Western Digital RE3 1TB drives, Intel X25-M G2 SSDs, Intel X25-E SSDs, and Intel X25-V SSDs.

Wednesday, May 19th, 2010 Hardware 10 Comments

Why go with a BladeCenter instead of 1U systems?

The pros and cons of using a BladeCenter

Many people looking through this site may be saying to themselves: why would you not just build 1U systems and save yourself the cost of running a BladeCenter? Certainly 1U systems would be more cost effective than running a BladeCenter, right? We thought so too until we really dug into it and found out that running the BladeCenter was actually less expensive when you look at building more than a few servers. This post explores in depth how that shook out.

Monday, May 10th, 2010 Hardware 4 Comments

Getting InfiniBand to work

The title of this post says it all: how to get InfiniBand to work properly. We have had a roller coaster of fun trying to get the InfiniBand network up and running properly. From the headache of installing the InfiniBand switch module, to finding the correct Mezzanine cards for our blades, to getting IPoIB (IP over InfiniBand) working properly on all systems, it’s been interesting to say the least.

Thursday, May 6th, 2010 Configuration, InfiniBand 15 Comments

Test Blade Configuration

Our bladecenters are full of high performance blades that we use to run a virtualized hosting environment at this time. Since the blades that are in those systems are in production, we couldn’t very well use them to test the performance of our ZFS system. As such, we had to build another blade. We wanted the blade to be similar in spec to the blades that we were using, but we also wanted to utilize some of the new technology that has come out since we put many of our blades into production.

Wednesday, May 5th, 2010 Hardware 3 Comments

Explanation of ARC and L2ARC

ZFS includes two exciting features that dramatically improve the performance of read operations. I’m talking about ARC and L2ARC. ARC stands for adaptive replacement cache. ARC is a very fast cache located in the server’s memory (RAM). The amount of ARC available in a server is usually all of the memory except for 1GB.

Thursday, April 15th, 2010 ZFS 13 Comments

Why We Chose InfiniBand Instead of 10GigE

For years we have successfully connected all of our blade centers to our storage area networks using 1GigE. Each time we needed more bandwidth, we simply added more networking ports. For our ZFS Build project, we decided to break from this tradition and try out higher performance networking solutions in place of the 1GigE networking.

Thursday, April 15th, 2010 InfiniBand No Comments

Installing the Infiniband Switch

Installing Infiniband switch in a SuperMicro SBE-710E

Our current infrastructure relies completely on iSCSI for our storage solution. As such, we have dual gigabit switch modules in our bladecenter. While this has worked very well for us, we want to expand our bladecenter to accept the SuperMicro 4x DDR Infiniband switch.

Wednesday, April 14th, 2010 Hardware, InfiniBand 1 Comment

Installing OpenSolaris

When we took on this ZFS Build project, we decided to use OpenSolaris for our ZFS system rather than FreeBSD or a Linux variant. We chose OpenSolaris for the ZFS server because ZFS was originally built for Solaris/OpenSolaris, and we suspected OpenSolaris would therefore include better support for ZFS.
OpenSolaris feels very similar to FreeBSD or Linux, but specific commands may be different. One nice touch is that the installer is included on the LiveCD.

Wednesday, April 14th, 2010 Configuration 11 Comments

Important Considerations

Important items to remember :

While building up our ZFS SAN server, we ran into a few issues because we did not have the correct parts on hand. Once we identified these parts, we ordered them as needed. The following is a breakdown of what not to forget.

Monday, April 12th, 2010 Hardware No Comments

Motherboard, CPU, Heatsink, and RAM selection

Motherboard Selection – SuperMicro X8ST3-F

We are planning on deploying this server with OpenSolaris 2009.06. As such we had to be very careful about our component selection. OpenSolaris does not support every piece of hardware sitting on your shelf. We had several servers that we tested with that would not boot into OpenSolaris at all. Granted, some of these were older systems with somewhat odd configurations. In any event, component selection needed to be made very carefully to make sure that OpenSolaris would install and work properly.

Thursday, April 8th, 2010 Hardware 3 Comments

Chassis Selection

SuperMicro SC846E1-R900B

We host a lot of websites and need a lot of fast storage for those websites and our Cloud Infrastructure. We currently run a lot of individual iSCSI devices over Gigabit Ethernet. We want to consolidate those individual iSCSI devices into a centralized unit that consists of Hybrid storage (SSD+HDD) and is expandable to support a large number of drives with redundant connections to our Cloud Infrastructure.

Sunday, March 21st, 2010 Hardware 15 Comments

Welcome To ZFS Build!

Building a ZFS SAN to scale up to hundreds of drives and several enclosures takes a lot of forethought. Also important in that decision is figuring out what kind of RAID levels you expect to be using, what kind of drives you will be using, and what kind of performance you need.

We will discuss component selection including Chassis Selection, Motherboard Selection, Processor Selection, Cooling Selection, Memory Selection, SSD Selection, Hard Drive Selection, InfiniBand Selection, and HBA Selection.

We will also focus on the performance of the ZFS SAN using different ZFS RAID levels, and different failure modes.

Saturday, March 13th, 2010 Hardware No Comments