Dedupe – be careful!

Syndicated from StorageAdventures.com

Just got done working on a friend's Nexenta platform, and boy howdy was it tired.  NFS was slow, iSCSI was slow, and for all the slowness, we couldn't really see a problem with the system.  The GUI was reporting free RAM, and iostat showed the disks weren't being completely thrashed.  We didn't see anything really out of the ordinary at first glance.

After some digging, we figured out that we were running out of RAM for the Dedupe tables.  It goes a little something like this.

Nexenta by default allocates 1/4 of your ARC (RAM) to metadata caching.  Your L2ARC map is considered metadata.  When you turn on dedupe, all of that dedupe information is stored in metadata too.  So the more you dedupe, the more RAM you use; and the more L2ARC you have, the more RAM you use on top of that.

The system in question is a 48GB system, and it reported that it had free memory, so we were baffled.  If it's got free RAM, what's the holdup?  It seems that between the dedupe tables and the L2ARC map, we had outstripped the ability of the ARC to hold all of the metadata.  This caused _everything_ to be slow.  The solution?  You can either increase the percentage of RAM that can be used for metadata, increase the total RAM (thereby increasing the amount available for metadata caching), or turn off dedupe, copy everything off of the volume, then copy it back.  Since there's currently no way to "undedupe" a volume, once that data has been created you're stuck with it until you remove the files.
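
For reference, here's roughly what the first and last of those options look like from the command line.  Treat this as a sketch: zfs_arc_meta_limit is the standard Solaris/illumos tunable for the metadata cap (check that your Nexenta release honors it), the 16GB value is only an example, and the dataset name is made up.

# /etc/system -- raise the ARC metadata cap to 16GB (takes effect at next boot)
set zfs:zfs_arc_meta_limit = 17179869184

# stop deduplicating new writes on the affected dataset
zfs set dedup=off tank/vmstore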

So, without further ado, here’s how to figure out what’s going on in your system.

echo ::arc|mdb -k

This will display some interesting stats.  The most important in this situation are the last three lines:

arc_meta_used             =     11476 MB
arc_meta_limit            =     12014 MB
arc_meta_max              =     12351 MB

These numbers will change.  Things will get evicted, things will come back.  You don't want to see the meta_used and meta_limit numbers this close, and you definitely don't want to see meta_max exceed the limit.  That's a great indicator that you're out of RAM.
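
If you want to keep an eye on just those three lines while the box is under load, one option is to filter the same output shown above (nothing fancy, just grep on the mdb output):

echo ::arc|mdb -k | grep arc_meta

Run it every so often (or wrap it in a while/sleep loop) and watch whether meta_used keeps pressing up against meta_limit.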

After quite a bit of futzing around, disabling dedupe, and shuffling data off of, then back onto, the pool, things look better:

arc_meta_used             =      7442 MB
arc_meta_limit            =     12014 MB
arc_meta_max              =     12351 MB

Just by disabling dedupe, and blowing away the dedupe tables, we freed up almost 5GB of RAM.  Who knows how much was being swapped in and out of RAM.

Other things to check:

zpool status -D <volumename>

This gives you your standard volume status, but it also prints out the dedupe information, which is handy for figuring out how much dedupe data there is.  Here's an example:

DDT entries 7102900, size 997 on disk, 531 in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    6.41M    820G    818G    817G    6.41M    820G    818G    817G
     2     298K   37.3G   37.3G   37.3G     656K   82.0G   82.0G   81.9G
     4    30.5K   3.82G   3.82G   3.81G     140K   17.5G   17.5G   17.5G
     8    43.9K   5.49G   5.49G   5.49G     566K   70.7G   70.7G   70.6G
    16      968    121M    121M    121M    19.1K   2.38G   2.38G   2.38G
    32      765   95.6M   95.6M   95.5M    33.4K   4.17G   4.17G   4.17G
    64       33   4.12M   4.12M   4.12M    2.77K    354M    354M    354M
   128        5    640K    640K    639K      943    118M    118M    118M
   256        2    256K    256K    256K      676   84.5M   84.5M   84.4M
    1K        1    128K    128K    128K    1.29K    164M    164M    164M
    4K        1    128K    128K    128K    5.85K    749M    749M    749M
   32K        1    128K    128K    128K    37.0K   4.63G   4.63G   4.62G
 Total    6.77M    867G    865G    864G    7.84M   1003G   1001G   1000G

This tells us that there are 7 million entries, with each entry taking up 997 bytes on disk, and 531 bytes in memory.  Simple math tells us how much space that takes up.

7,102,900 entries × 531 bytes = 3,771,639,900 bytes, or about 3,596 MB used in RAM.

The same math tells us there are about 6,753 MB used on disk, just to hold the dedupe tables.
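
If you'd rather let the shell do that arithmetic, here's a quick sketch using the numbers from the DDT entries line above (plug in your own values from zpool status -D):

entries=7102900     # DDT entries
in_core=531         # bytes per entry in RAM
on_disk=997         # bytes per entry on disk
echo "DDT in RAM:  $(( entries * in_core / 1024 / 1024 )) MB"
echo "DDT on disk: $(( entries * on_disk / 1024 / 1024 )) MB"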

The dedupe ratio on this system wasn't even worth it.  The overall dedupe ratio was something like 1.15x.  Compression on that volume (which has nearly no overhead), after shuffling the data around, is at 1.42x.  So at the cost of CPU time (which there is plenty of), we get a better over-subscription ratio from compression than from deduplication.
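
Both ratios are exposed as properties if you want to compare them on your own system (the pool and dataset names below are placeholders):

zpool get dedupratio tank
zfs get compressratio tank/vmstore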

There are definitely use-cases for deduplication, but his generic VM storage pool is not one of them.

Friday, November 18th, 2011 Configuration, ZFS

13 Comments to Dedupe – be careful!

  • Tom Callahan says:

    Great find. I’ve used zdb before to determine whether dedupe was really worth it or not as well. Love the site, keep up the good work

  • ZDB works great, some of the time. I’ve had sporadic issues with ZDB on HA systems, and with ZDB not seeing a pool even though it is imported. zpool status -D seems to work every time though. Thanks for the comments!

  • rayvd says:

    Timely post. Just ran into a similar issue at work. Our meta cache is completely full and looks like we probably will need to disable deduplication (only getting 1.3 ratio anyways), increase the meta cache size and then juggle data around to free up some room.

    Our system has 24GB of RAM — probably could use more, but I want to know how the presence of a 300GB L2ARC SSD affects how the DDT table is distributed. How much of the DDT does Nexenta prefer to put in L2ARC (and thus need to be tracked in meta cache) and how much in L1ARC? Perhaps we are hurting ourselves by having such a large L2ARC, and I wonder if you can specify how much DDT should live in L2 vs L1.

    Ray

  • cnagele says:

    That’s interesting. On a few boxes we have with 98GB of RAM I see that we are going over the limit.

    arc_meta_used = 22051 MB
    arc_meta_limit = 24314 MB
    arc_meta_max = 25240 MB

    The thing is, we don’t use dedupe at all. Do you know what else could affect metadata usage?

    Thanks for posting this. I need to dig a little more.

  • admin says:

    cnagele: If you use L2ARC, then some of the ARC metadata will be used to manage the L2ARC. The more L2ARC you have, the more metadata will be needed to service it.

  • fryfrog says:

    “Just by disabling dedupe, and blowing away the dedupe tables…”

    How do you clear the ddt? I’ve been searching for a bit and just can’t see how one does this, besides destroying and re-creating the filesystem.

  • You have to do an import/export of the filesystem. There is no way to just “inflate” the filesystem.

  • fryfrog says:

    @Matt, can you just export it locally and then re-import it? That doesn’t sound very crazy, am I missing something?

  • I guess an “export/import” is a misnomer. You need to actually copy the data out of the current location, into a new location. A clone will work, but a simple zpool export or zfs export will not inflate the data.

  • naisanza says:

    What happened when you turned off dedup? I read somewhere that it took ages for the dedup to get removed because of all the processing it had to do. I think it was in a live web server environment, so it was already getting I/O saturated, and on top of that it was trying to remove dedup.

  • Unchecking dedupe doesn’t actually remove the deduplication; it just stops deduplicating any new data. To actually clear the deduplication you have to copy the data over to a new volume.
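
    A minimal sketch of that copy using zfs send/receive (dataset names are made up; make sure the new dataset ends up with dedup=off, for example by setting it on the parent so it's inherited):

    zfs set dedup=off tank
    zfs snapshot tank/vmstore@undedupe
    zfs send tank/vmstore@undedupe | zfs receive tank/vmstore-new

    The receive rewrites every block under the new dedup=off setting, and once the old dataset is destroyed, the DDT entries for its blocks go away.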
