ZFS

DTrace broken with SRP targets?

Anyone who’s using InfiniBand, SRP targets, DTrace, and version 3.1.3.5 of Nexenta Community Edition, please raise your hand.

Nobody?  Not surprising :)  I pulled some DTrace scripts off of another system to evaluate performance on the ZFSBuild2012 system, and got a very weird error:

 

# ./arcreap.d

dtrace: failed to compile script ./arcreap.d: "/usr/lib/dtrace/srp.d", line 49: translator member ci_local definition uses incompatible types: "string" = "struct hwc_parse_mt"

 

I’ve never seen this before, and the exact same script works flawlessly on ZFSBuild2010.  My guess is it’s something in SRP; that’s the part throwing the error, and we aren’t using SRP on the ZFSBuild2010 system.  If anyone at Nexenta or anyone working on the Illumos project sees something here that makes sense, I’d love to hear about it.
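One thing that might be worth trying, purely as an untested guess on my part: since the script itself doesn’t use SRP at all, dtrace can be told to skip compiling the system D libraries in /usr/lib/dtrace entirely, which would sidestep srp.d:

dtrace -x nolibs -s ./arcreap.d   # untested guess: nolibs skips /usr/lib/dtrace, so srp.d never gets compiled

The obvious catch is that nolibs skips all of the system libraries, so if the script leans on any other translator in that directory it will fail in a different way.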

Thursday, December 27th, 2012 ZFS 4 Comments

OpenStorage Summit 2012

Hotel and airfare are booked for the OpenStorage Summit 2012 in San Jose.  Last year was spectacular and well worth the trip.  Anyone else thinking about going should stop thinking and register!

http://www.openstoragesummit.org/

Thursday, September 6th, 2012 ZFS No Comments

Nexenta Community Edition license changes

It’s been noted over on the Nexentastor.org forums that there has been a change in the Nexenta EULA that now prevents commercial usage of Nexenta Community Edition.  The forum thread can be found here: http://www.nexentastor.org/boards/1/topics/7593

I found this discouraging, as the ability to use Nexenta Community Edition in our production environment was the reason that we selected Nexenta.  All of our original testing was done with OpenSolaris, and it actually performed better than Nexenta.  We went with Nexenta Community Edition because of the ease of use and the ability to upgrade to Enterprise Edition and purchase support if we needed it.  Removing the option for small businesses to use Nexenta Community Edition in production is not something that I expected to see from Nexenta.  I wondered why this happened.

I took some time to think about this and try to figure out why Nexenta’s stance might have changed.  After browsing the forums and seeing posts that say things like “I tried contacting Nexenta support”, I stumbled upon the idea that support could be a big part of it.  This is a free version of Nexenta, allowing you to use up to 18TB of space, with NO SUPPORT.  People then come to the forums and complain that there’s no support, or that they didn’t get a response, or that they got little help.

Support costs money.  There have been a number of people using Nexenta Community Edition who have been contacting support (at least one noted here: http://www.nexentastor.org/boards/2/topics/5662#message-7672).  Even if support doesn’t help you, you’re still tying up time on the phone with them, or forcing them to write an email response.  This costs money.  The EULA change isn’t going to change people’s behavior, but it does make it easier for Nexenta to send you to the forums for support, and to use canned responses for Nexenta Community Edition questions.

The other possibility that I could see would be someone purchasing a single Nexenta Enterprise Edition Silver license, then installing Nexenta Community Edition on 20 other devices, and trying to get support on those devices as well.  That’s pretty shady, but I can easily envision someone doing it.  Saying that Nexenta Community Edition isn’t to be run on production workloads makes it much easier for Nexenta to punt questions that come from a non-supported system.  This is similar to the problem that Cisco has with their SmartNet support.  You buy 40 Cisco switches, put SmartNet on one of them, and voila, you’ve got support for every device that you own.  Cisco is starting to get this worked out, and I can see a lot of shops hurting in the next few years when they have to buy SmartNet on every device they own.

My suggestion to potential Nexenta Community Edition users – if you’re considering running Nexenta Community Edition in production, go for it.  From what I can tell, Nexenta is not planning to use the terms of the EULA to sue people for using Nexenta Community Edition in production.  Nexenta IS likely going to give you grief if you call in to support, and you’ll likely not get any help.  They’re a for-profit company, and I can’t fault them for wanting to remain in the black.  If it’s a mission critical workload and you absolutely need support, buy Nexenta Enterprise Silver at a minimum instead of using Nexenta Community Edition.  Nexenta Enterprise Silver is still cheaper than nearly any other support package you’ll find, and my experiences with support have been nothing less than stellar.

My suggestion to Nexenta: figure this out on the backend.  By telling small businesses that they cannot use Nexenta Community Edition in production, you have opened the door for FreeNAS, OpenFiler and Napp-It to step in and grab these startups that desperately want to use your product.  Your product is better than FreeNAS, OpenFiler and Napp-It, but those products don’t include draconian licensing limitations.  Figure out how to allow these small businesses to use Nexenta Community Edition and flourish, and when they’re ready to go big time, let ’em buy Nexenta Enterprise Silver/Gold/Platinum licensing and support rather than figuring out how to pay Napp-It or one of the others for their support.  If the EULA for Nexenta Community Edition had looked like this when we started using it, we would have thought long and hard before using it or recommending it to other people.  I don’t want to do that, and I don’t want anyone reading this site to do that.  The Nexenta WebGUI is comfortable and I’ve gotten quite used to it; I’d hate to go back to creating iSCSI targets from the command line.

At some point in time, somebody will sit down and write a good web-based GUI that runs on OpenSolaris/OpenIndiana.  By originally allowing production usage of Nexenta Community Edition, Nexenta took away the incentive to code that alternative web GUI.  After all, nobody wants to spend months coding something when Nexenta Community Edition was free, worked great, and allowed production usage.  Now that Nexenta Community Edition is not allowed for production usage, there will likely be renewed interest in developing a good open source web GUI.

 

Tuesday, September 4th, 2012 ZFS No Comments

Let ZFS use all of your RAM

We’ve been talking about RAM usage, RAM problems, and pretty much everything else related to RAM lately, so I figured I’d mention this one too.

Many of you, if you’ve got a large-memory system, may notice that your system never uses all of its RAM.  I’ve got some 192GB systems that routinely report 30GB of RAM free.  Why is this?  Everything that I’ve read says ZFS will use all of the RAM in your system that it can, so how can it leave 30GB just wasting away doing nothing?

The answer is both surprising and not surprising at the same time.  When ZFS was written, a system with 8GB of RAM was a monster, and 16GB was nearly unheard of.  Let’s not even talk about 32GB.  Today having 192GB of RAM in a system isn’t difficult to achieve, and the new Xeon E5 platforms boast RAM capacities of 1TB and more.  Back then there was a need to keep ZFS from encroaching on system processes.  There is a Solaris “bug” (if you can call it that) that limits the ARC cache to 7/8 of total system memory.  This was intentional, so that the ARC cache couldn’t cause contention for swapfile space in RAM.  That “bug” still exists in Nexenta 3.1.2, and it is why I have 30GB of RAM free on a 192GB system.  Reserving 1/8 of system memory made sense when monster systems had 8GB: that left 1GB of RAM that wouldn’t be touched by ARC caching, ensuring proper system operation.  Today the amount of RAM in systems dwarfs what most people would have used 10 years ago, so we need to make modifications.  Fortunately, we can tune this variable.

From what I’ve read (based on information from Richard Elling, @richardelling on Twitter), the default for Nexenta 4.0 and Illumos distributions will be 256MB.  I take a very cautious approach to this, so I’m going to use 1GB.  This is all discussed here.

For those who don’t want to check out the links – the pertinent info is here :

Edit /etc/system for a permanent fix (this reserves 1GB of RAM):

set swapfs_minfree=262144

To change it on the fly:

echo swapfs_minfree/W0t262144 | mdb -kw
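For the curious, swapfs_minfree is counted in pages, so the 262144 above assumes the usual 4KB x86 page size: 262144 pages × 4096 bytes = 1GB reserved.  You can sanity-check the page size and read the current value back with:

pagesize                          # page size in bytes, 4096 on typical x86 boxes
echo swapfs_minfree/D | mdb -k    # read back the current value, in pages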

This change has allowed my ARC cache to grow from 151-155GB utilization to 176GB utilization, effectively allowing me to use all of the RAM that I should be able to use.

FYI – this is unsupported, and Nexenta support will probably give you grief if you do this without discussing it with them (if you have a support contract), so be forewarned.  There may be unintended consequences from making this change that I am not aware of.

Wednesday, April 18th, 2012 Configuration, ZFS 5 Comments

Dedupe – be careful!

Syndicated from StorageAdventures.com

Just got done working on a friend’s Nexenta platform, and boy howdy was it tired.  NFS was slow, iSCSI was slow, and for all the slowness, we couldn’t really see a problem with the system.  The GUI was reporting free RAM, and iostat showed the disks weren’t being completely thrashed.  We didn’t see anything really out of the ordinary at first glance.

After some digging, we figured out that we were running out of RAM for the Dedupe tables.  It goes a little something like this.

Nexenta by default allocates 1/4 of your ARC cache (RAM) to metadata caching.  Your L2ARC map is considered metadata.  When you turn on dedupe, all of that dedupe information is stored in metadata too.  So the more you dedupe, the more RAM you use; and the more L2ARC you use, the more RAM you use.

The system in question is a 48GB system, and it reported that it had free memory, so we were baffled.  If it’s got free RAM, what’s the holdup?  It seems that between the dedupe tables and the L2ARC map, we had outstripped the ARC’s capacity to hold all of the metadata.  This caused _everything_ to be slow.  The solution?  You can increase the percentage of RAM that can be used for metadata, increase the total RAM (thereby increasing the amount available for metadata caching), or turn off dedupe, copy everything off of the volume, and then copy it back.  Since there’s currently no way to “undedupe” a volume, once that data has been created, you’re stuck with it until you remove the files.
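For what it’s worth, if you go the route of bumping the metadata percentage, the knob on Solaris-derived ZFS is the zfs_arc_meta_limit tunable in /etc/system.  This is only a sketch, assuming you wanted roughly 16GB of metadata headroom on a 48GB box; the value is in bytes, takes effect at the next reboot, and is every bit as unsupported as any other /etc/system tweak:

set zfs:zfs_arc_meta_limit=17179869184   # hypothetical 16GB example value; pick a number that fits your system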

So, without further ado, here’s how to figure out what’s going on in your system.

echo ::arc | mdb -k

This will display some interesting stats.  The most important in this situation are the last three lines:

arc_meta_used             =     11476 MB
arc_meta_limit            =     12014 MB
arc_meta_max              =     12351 MB

These numbers will change.  Things will get evicted, things will come back.  You don’t want to see the meta_used and meta_limit numbers this close together, and you definitely don’t want to see meta_max exceed the limit.  That is a great indicator that you’re out of RAM.

After quite a bit of futzing around, disabling dedupe, and shuffling data off of and then back onto the pool, things look better:

arc_meta_used             =      7442 MB
arc_meta_limit            =     12014 MB
arc_meta_max              =     12351 MB

Just disabling dedupe and blowing away the dedupe tables freed up roughly 4GB of ARC metadata.  Who knows how much was being shuffled in and out of RAM before that.

Other things to check :

zpool status -D <volumename>

This gives you the standard volume status, but it also prints out the dedupe table (DDT) information.  This is good for figuring out how much dedupe data there is.  Here’s an example:

DDT entries 7102900, size 997 on disk, 531 in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    6.41M    820G    818G    817G    6.41M    820G    818G    817G
     2     298K   37.3G   37.3G   37.3G     656K   82.0G   82.0G   81.9G
     4    30.5K   3.82G   3.82G   3.81G     140K   17.5G   17.5G   17.5G
     8    43.9K   5.49G   5.49G   5.49G     566K   70.7G   70.7G   70.6G
    16      968    121M    121M    121M    19.1K   2.38G   2.38G   2.38G
    32      765   95.6M   95.6M   95.5M    33.4K   4.17G   4.17G   4.17G
    64       33   4.12M   4.12M   4.12M    2.77K    354M    354M    354M
   128        5    640K    640K    639K      943    118M    118M    118M
   256        2    256K    256K    256K      676   84.5M   84.5M   84.4M
    1K        1    128K    128K    128K    1.29K    164M    164M    164M
    4K        1    128K    128K    128K    5.85K    749M    749M    749M
   32K        1    128K    128K    128K    37.0K   4.63G   4.63G   4.62G
 Total    6.77M    867G    865G    864G    7.84M   1003G   1001G   1000G

 

This tells us that there are 7 million entries, with each entry taking up 997 bytes on disk, and 531 bytes in memory.  Simple math tells us how much space that takes up.

7,102,900 entries × 531 bytes = 3,771,639,900 bytes ÷ 1024 ÷ 1024 ≈ 3,596MB used in RAM

The same math tells us that there’s 6753MB used on disk, just to hold the dedupe tables.
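If you want to re-run that arithmetic yourself, the shell handles it fine (the entry count and per-entry sizes come straight from the zpool status -D output above):

echo $((7102900 * 531 / 1024 / 1024))   # ~3596 MB of DDT held in core (RAM)
echo $((7102900 * 997 / 1024 / 1024))   # ~6753 MB of DDT on disk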

The dedupe ratio on this system wasn’t even worth it.  The overall dedupe ratio was something like 1.15x.  Compression on that volume (which has nearly no overhead), after shuffling the data around, is at 1.42x.  So at the cost of some CPU time (of which there is plenty), we get better space savings from compression than from deduplication.
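For reference, both knobs involved in that cleanup are ordinary dataset properties.  Here’s a sketch with a placeholder dataset name; keep in mind that flipping them only affects newly written data, which is exactly why the data had to be shuffled off of and back onto the pool:

zfs set dedup=off tank/vmstore         # placeholder dataset; existing DDT entries remain until the old blocks are rewritten
zfs set compression=on tank/vmstore    # placeholder dataset; only blocks written after the change get compressed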

There are definitely use-cases for deduplication, but this friend’s generic VM storage pool is not one of them.

Friday, November 18th, 2011 Configuration, ZFS 13 Comments

What we’ve learned – Working Set

So let’s start talking about some of the things that we’ve learned over the last year, shall we?  The number one thing we have learned is that your working set size dramatically impacts your performance on ZFS.  When you can keep that working set inside of RAM, your performance numbers are outrageous.  Hundreds of thousands of IOPS.  Latency is next to nothing, and the world is wonderful.  Step over that line, though, and you had better have architected your pool to absorb it.

Flash forward a year, and we’ve found that the IOPS we deliver are much higher than we expected, and our typical response times (latency) are much lower.  Why is that, you may ask?  With large datasets and very random access patterns, you would expect close to worst-case scenarios.  What we found is that we had a lot of blocks that were never accessed.  We have thick-provisioned most of our VMs, which results in 100+GB of empty space in most of those VMs (nearly 50% of all allocated capacity).  With Nexenta and ZFS, all of the really active data was being moved into the ARC/L2ARC cache.  While we still had some reads and writes going to the disks, it was a much smaller percentage.  We quickly figured out that the caching algorithms and tuning Nexenta employs in ZFS are very intelligent, and our working set was much smaller than we ever really imagined.

Tuesday, November 15th, 2011 Configuration, Virtualization, ZFS 5 Comments

A quick note

First and foremost, welcome back to ZFSBuild.  We’ve missed you, and we hope you’ve missed us.  Over the last year we’ve found out numerous things about this blog, our readers, and ZFS.  Unfortunately the rigors of day-to-day operations have prevented us from blogging more.  We aim to rectify that, so thanks for sticking with us.  It’s been a while, and we’ve got a lot of new information to share with you.  Hopefully we can put it together in a format that you can follow.  If not, let us know in the comments!  Now, let’s get this party started.

What have we learned in the last year?  We found out we have a lot of people who read this blog.  With hundreds of comments, and as many (or more) emails over the last year, there are certainly a lot of people who have checked out the site.  This makes us happy.  Not only because we’ve got big giant egos that love to be stroked, but because people are interested in ZFS and Nexenta.  With that, we’d like to say Thank You.  Without you we would just be talking to ourselves, and that’s just creepy.

We also found that we like evangelizing for Nexenta.  It’s not because we get paid by them (Nexenta, if you’re reading… hint hint) but because they put out a great product that puts a great SAN within the reach of small businesses like ours.  Most businesses can’t afford to spend fifty thousand dollars on EMC or NetApp gear just to get started.  Most of them can’t spend ten thousand.  ZFS, Nexenta, and open source software put a robust SAN/NAS into the hands of the small businesses that need it the most.  Just because you can’t afford a fifty thousand dollar SAN doesn’t mean that your data is any less important than the MegaCorp’s down the street.  Nexenta and ZFS allow you to get into the game for a fraction of the cost, all while giving you features that the big boys try very hard to compete with.

Another thing that many people who manage enterprise storage may not realize is that ZFS is becoming more popular.  I had the opportunity to attend the Open Storage Summit this year, which was hosted by Nexenta.  One word: amazing!  Nexenta knew how to put on a show, and they knew what people were interested in.  There wasn’t one keynote or breakout session that was boring.  I also expected it to be quite a bit smaller; I think the final tally was somewhere between 400 and 500 registered attendees.  They ran the gamut from government employees to VMware nerds to hosting providers.  It seemed like all sectors were pretty well represented, and everyone was excited about ZFS, Open Storage, and Nexenta.  More about the Open Storage Summit later.

That about wraps it up for now.  Next time I’ll get in to some of the technical things that we’ve learned over the last year with our build, zfs, and Nexenta.

Friday, November 11th, 2011 ZFS 1 Comment

FreeNAS vs OpenSolaris ZFS Benchmarks

We have received a lot of feedback from members of the IT community since we published our benchmarks comparing OpenSolaris and Nexenta with an off-the-shelf Promise VTrak M610i.  One question we received from several people was about FreeNAS.  Several people asked, “How does FreeNAS compare to OpenSolaris on the same hardware?”  That was an excellent question, and we decided to run some tests to answer it.

NOTE: This article was written in 2010 using FreeNAS 0.7.1.  We have a new article with much newer benchmarks from 2012 posted about FreeNAS 8.3 at http://www.zfsbuild.com/2013/01/25/zfsbuild2012-nexenta-vs-freenas-vs-zfsguru/. We strongly encourage you to read both this article and the newer article before passing judgement on FreeNAS, because there were significant performance improvements made in FreeNAS in recent versions.


Friday, September 10th, 2010 Benchmarks, ZFS 32 Comments

Testing the L2ARC

We posted an article a while back that explained the cool L2ARC feature of ZFS.  We thought it would be fun to actually test the L2ARC and build a chart of the performance as a function of time.  To test and graph usefulness of L2ARC, we set up an iSCSI share on the ZFS server and then ran IOmeter from our test blade in our blade center.  We ran these tests over gigabit Ethernet.

Friday, July 30th, 2010 ZFS 2 Comments

HowTo : Our Zpool configuration

We’ve decided to go with striped mirrored vdevs (similar to RAID10) for our ZFS configuration.  It gives us the best combination of performance and fault tolerance for what we use the system for.  To reproduce our ZFS configuration, you would use all of the commands in the image below (assuming your drives were named the same way ours were):
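The image spells out the exact commands we ran.  As a rough sketch of the shape of it, a striped-mirror pool is just multiple mirror groups handed to a single zpool create; the pool name and device names below are placeholders, not our actual drives:

zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0 mirror c0t4d0 c0t5d0   # placeholder names only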

Thursday, June 3rd, 2010 RAID, ZFS 12 Comments

HowTo : Add Spare Drives to Zpool

Spare drives in ZFS, or in any RAID configuration, are a must-have.  Consider this: a winter storm hits and a snowplow piles 3 feet of snow in your driveway.  Five minutes later a drive in your system fails.  You’re stuck until you can plow or snowblow your way out.  Without spare drives in your system, you run the risk of additional drives failing while that drive is offline.  Depending on the ZFS or RAID level, a second drive failure could cause permanent data loss.  If you have a spare drive (or multiple spare drives) in your system, the array can automatically start rebuilding as soon as a drive failure is detected.  This limits the amount of time that your ZFS or RAID subsystem is unprotected.

To add spare drives to your system, first run the format command to find the disks in your system.

Format Command
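Once format has told you the disk name, adding it as a hot spare is a one-liner.  Here’s a sketch with placeholder names; substitute your own pool and device:

zpool add tank spare c0t6d0   # "tank" and c0t6d0 are placeholders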

Thursday, June 3rd, 2010 RAID, ZFS 6 Comments

HowTo : Add Log Drives (ZIL) to Zpool

ZIL (ZFS Intent Log) drives can be added to a ZFS pool to speed up the write capabilities of any level of ZFS RAID.  The intent log records synchronous writes on a very fast SSD so they can be acknowledged immediately; when the physical spindles have a moment, that data is flushed to the spinning media and the process starts over.  We have observed significant performance increases by adding ZIL drives to our ZFS configuration.  One thing to keep in mind is that the ZIL should be mirrored to protect the performance of the ZFS system.  If the ZIL is not mirrored and the drive being used as the ZIL fails, the system will revert to writing that data directly to the main pool disks, severely hampering performance.

To add a ZIL drive to your ZFS system, first run the format command to find the disks available in your system.

Format Command
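After format identifies the SSDs, the mirrored log device goes in with a single command.  Again, a sketch with placeholder names:

zpool add tank log mirror c0t7d0 c0t8d0   # placeholders; use your pool name and your two SSD devices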

Thursday, June 3rd, 2010 RAID, ZFS No Comments