So in my lab I’ve got a Cisco Nexus 5548 and a SuperMicro SuperServer 6026-6RFT+. I’ve put Nexenta on this, Windows, and several other things. One thing I hadn’t tried is running FCoE. Intel announced FCoE support for all Niantic-based chips several years ago, and I had never put it to the test.
I figured this was as good a time as any to play with ESXi and FCoE, so I dug in. ESXi 5.1 installed flawlessly. It saw all of the NICs, all of the hard drives, everything. The Nexus 5548 worked great, I sailed along creating new VSANs for FCoE, and thought, “here we go!”
I followed the guide here for enabling FCoE http://www.intel.com/content/www/us/en/network-adapters/10-gigabit-network-adapters/ethernet-x520-configuring-fcoe-vmware-esxi-5-guide.html. It all looked splendid until I actually got to the part where you activate the new FCoE Storage adapter. Every time I tried to add the Software FCoE adapter, it acted like there was no available adapter that supported FCoE. I knew this wasn’t the case, as it was very clearly mentioned that it _was_ supported.
After several hours of poking, prodding, trying different versions of ESXi, updating the system board BIOS, tinkering with BIOS settings, trying Windows – thinking maybe, just maybe ESXi wasn’t going to work, I gave up and sent an email to Intel.
Intel responded very graciously that since it was an integrated controller on the system board, there wasn’t much they could do for me, and that I would have to talk to my manufacturer. I followed their advice, contacted SuperMicro, and got a fantastic response.
Would you don’t mind flash the EEPROM firmware. The firmware release on 08/09/11 will allow Intel 82599EB to support FCoE.
Steps to flash onboard LAN EEPROM
1.Extract the files and Copy them to a bootable USB stick or to a bootable floppy disk.
(If you don’t have a bootable USB stick you can make it using:
2.Boot up the system using the USB stick.
3.At the command prompt type — <filename>.bat
4.Enter the 12 digit LAN1 MAC address, when prompted.
5.Power cycle the system.
6.Reinstall the LAN drivers after EEPROM is flashed.
After flashing the new EEPROM onto the LAN controller, I was able to successfully enable FCoE for this system using ESXi (and subsequently Windows Server 2008R2).
So we’ve been talking about RAM usage, RAM problems, and pretty much everything related to RAM lately, so I figured I’d mention this one too.
Many of you, if you’ve got a large memory system, may notice that your system never uses all of its RAM. I’ve got some 192GB systems that routinely reported that they had 30GB of RAM free. Why is this? Everything that I’ve read says ZFS will use all of the RAM in your system that it can, so how can it leave 30GB just wasting away doing nothing?
The answer is both surprising and not surprising at the same time. When ZFS was written, a system with 8GB of RAM was a monster, and 16GB was nearly unheard of. Let’s not even talk about 32GB. Today having 192GB of RAM in a system isn’t difficult to achieve, and the new Xeon E5 platforms boast RAM capacities of 1TB and more. Back then there was a need to keep ZFS from encroaching on system processes. There is a Solaris “bug” (if you can call it that) that limits ARC cache to 7/8 of total system memory. This was intentional, so that the ARC cache couldn’t cause contention for swapfile space in RAM. That “bug” still exists in Nexenta 3.1.2. This is why I have 30GB of RAM free on a 192GB system. Reserving 1/8 of system memory made sense when monster systems were 8GB. That left 1GB of RAM that wouldn’t be touched for ARC caching, to ensure proper system operation. Today the amount of RAM in systems dwarfs what most people would have used 10 years ago, and as such we need to make modifications. Fortunately, we can tune this variable.
From what I’ve read (based on information from Richard Elling (@richardelling on twitter)) the default for Nexenta 4.0 and Illumos distributions will be 256MB. I take a very cautious approach to this and I’m going to use 1GB. This is all discussed here.
For those who don’t want to check out the links – the pertinent info is here :
edit /etc/system for permanent fix – this reserves 1GB of RAM
To change it on the fly
echo swapfs_minfree/W0t262144 | mdb -kw
This change has allowed my ARC cache to grow from 151-155GB utilization to 176GB utilization, effectively allowing me to use all of the RAM that I should be able to use.
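For anyone wondering where 262144 comes from, here is a minimal sketch of the arithmetic. It assumes swapfs_minfree is counted in memory pages and that the page size is 4KB (my assumption for x86 systems; check yours before copying numbers). The function names are made up for illustration.

```python
# Sketch only: assumes swapfs_minfree is expressed in pages and that
# pages are 4 KiB (an assumption; verify the page size on your system).
PAGE_SIZE = 4096
MIB = 1024 ** 2
GIB = 1024 ** 3


def pages_for_bytes(nbytes, page_size=PAGE_SIZE):
    """Number of whole pages needed to reserve nbytes of RAM."""
    return nbytes // page_size


def one_eighth_reserve_bytes(physmem_bytes):
    """RAM held back under the 1/8 rule described above."""
    return physmem_bytes // 8


if __name__ == "__main__":
    print(pages_for_bytes(1 * GIB))                    # pages in 1GB
    print(pages_for_bytes(256 * MIB))                  # pages in 256MB
    print(one_eighth_reserve_bytes(192 * GIB) // GIB)  # GB held back on a 192GB box
```

Under those assumptions, 1GB works out to the 262144 used in the mdb command, and the 1/8 rule holds back 24GB on a 192GB system, which lines up reasonably well with the free RAM I was seeing.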
FYI – this is unsupported, and Nexenta support will probably give you grief if you do this without discussing it with them (if you have a support contract), so be forewarned. There may be unintended consequences from making this change that I am not aware of.
Syndicated from StorageAdventures.com
Just got done working on a friend’s Nexenta platform, and boy howdy was it tired. NFS was slow, iSCSI was slow, and for all the slowness, we couldn’t really see a problem with the system. The GUI was reporting there was free RAM, and iostat showed the disks not being completely thrashed. We didn’t see anything really out of the ordinary at first glance.
After some digging, we figured out that we were running out of RAM for the Dedupe tables. It goes a little something like this.
Nexenta by default allocates 1/4 of your ARC cache (RAM) to metadata caching. Your L2ARC map is considered metadata. When you turn on dedupe, all of that dedupe information is stored in metadata. The more you dedupe, the more RAM you use, the more L2ARC you use, the more RAM you use.
The system in question is a 48GB system, and it reported that it had free memory, so we were baffled. If it’s got free RAM, what’s the holdup? It seems that between the dedupe tables and the L2ARC, we had outstripped the capabilities of the ARC to hold all of the metadata. This caused _everything_ to be slow. The solution? You can either increase the percentage of RAM that can be used for metadata, increase the total RAM (thereby increasing the amount you can use for metadata caching), or you can turn off dedupe, copy everything off of the volume, then copy it back. Since there’s currently no way to “undedupe” a volume, once that data has been created, you’re stuck with it until you remove the files.
So, without further ado, here’s how to figure out what’s going on in your system.
echo ::arc|mdb -k
This will display some interesting stats. The most important in this situation are the last three lines :
arc_meta_used = 11476 MB
arc_meta_limit = 12014 MB
arc_meta_max = 12351 MB
These numbers will change. Things will get evicted, things will come back. You don’t want to see the meta_used and meta_limit numbers this close. You definitely don’t want to see the meta_max exceed the limit. This is a great indicator that you’re out of RAM.
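If you want to check this on a schedule rather than by eyeball, something like the following sketch could parse the arc_meta lines and flag trouble. It assumes you have captured the text output of the ::arc command yourself; the function names and the 90% warning threshold are my own choices, not anything Nexenta defines.

```python
import re


def parse_arc_meta(text):
    """Pull the arc_meta_* values (in MB) out of captured ::arc output."""
    pairs = re.findall(r"(arc_meta_\w+)\s*=\s*(\d+)\s*MB", text)
    return {name: int(value) for name, value in pairs}


def meta_pressure(stats, warn_ratio=0.9):
    """Summarise how close metadata usage is to its limit.

    warn_ratio is an arbitrary threshold picked for illustration.
    """
    ratio = stats["arc_meta_used"] / stats["arc_meta_limit"]
    return {
        "used_ratio": ratio,
        "near_limit": ratio >= warn_ratio,
        # arc_meta_max is a high-water mark, so this stays set once tripped.
        "peak_over_limit": stats["arc_meta_max"] > stats["arc_meta_limit"],
    }


if __name__ == "__main__":
    sample = """
    arc_meta_used = 11476 MB
    arc_meta_limit = 12014 MB
    arc_meta_max = 12351 MB
    """
    print(meta_pressure(parse_arc_meta(sample)))
```

With the numbers above, usage sits at roughly 96% of the limit and the peak has already exceeded it, which is exactly the pattern to worry about.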
After quite a bit of futzing around, disabling dedupe, and shuffling data off of, then back onto the pool, things look better :
arc_meta_used = 7442 MB
arc_meta_limit = 12014 MB
arc_meta_max = 12351 MB
Just disabling dedupe and blowing away the dedupe tables freed up about 4GB of RAM (arc_meta_used dropped from 11476MB to 7442MB). Who knows how much was being swapped in and out of RAM.
Other things to check :
zpool status -D <volumename>
This gives you your standard volume status, but it also prints out the dedupe information. This is good to figure out how much dedupe data there is. Here’s an example :
DDT entries 7102900, size 997 on disk, 531 in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    6.41M    820G    818G    817G    6.41M    820G    818G    817G
     2     298K   37.3G   37.3G   37.3G     656K   82.0G   82.0G   81.9G
     4    30.5K   3.82G   3.82G   3.81G     140K   17.5G   17.5G   17.5G
     8    43.9K   5.49G   5.49G   5.49G     566K   70.7G   70.7G   70.6G
    16      968    121M    121M    121M    19.1K   2.38G   2.38G   2.38G
    32      765   95.6M   95.6M   95.5M    33.4K   4.17G   4.17G   4.17G
    64       33   4.12M   4.12M   4.12M    2.77K    354M    354M    354M
   128        5    640K    640K    639K      943    118M    118M    118M
   256        2    256K    256K    256K      676   84.5M   84.5M   84.4M
    1K        1    128K    128K    128K    1.29K    164M    164M    164M
    4K        1    128K    128K    128K    5.85K    749M    749M    749M
   32K        1    128K    128K    128K    37.0K   4.63G   4.63G   4.62G
 Total    6.77M    867G    865G    864G    7.84M   1003G   1001G   1000G
This tells us that there are about 7.1 million entries, with each entry taking up 997 bytes on disk, and 531 bytes in memory. Simple math tells us how much space that takes up.
7102900 * 531 = 3771639900 bytes, and 3771639900 / 1024 / 1024 = 3596MB used in RAM
The same math tells us that there’s 6753MB used on disk, just to hold the dedupe tables.
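Here is that math as a small sketch you can plug your own numbers into. The function names are mine; the inputs are the entry count and per-entry sizes from the first line of the zpool status -D output, and the limit is the arc_meta_limit from the ::arc output earlier.

```python
def ddt_footprint_mb(entries, bytes_per_entry):
    """Size of the dedupe table in MB for a given per-entry cost."""
    return entries * bytes_per_entry / (1024 * 1024)


def share_of_meta_limit(ddt_mb, arc_meta_limit_mb):
    """Fraction of the ARC metadata limit the in-core DDT would occupy."""
    return ddt_mb / arc_meta_limit_mb


if __name__ == "__main__":
    entries = 7102900
    in_core_mb = ddt_footprint_mb(entries, 531)
    on_disk_mb = ddt_footprint_mb(entries, 997)
    print(int(in_core_mb), "MB in RAM")
    print(int(on_disk_mb), "MB on disk")
    print(round(share_of_meta_limit(in_core_mb, 12014) * 100), "% of arc_meta_limit")
```

On this system the in-core table alone comes to roughly 30% of the 12014MB metadata limit, before any L2ARC headers or ordinary filesystem metadata get a look in.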
The dedupe ratio on this system wasn’t even worth it. Overall dedupe ratio was something like 1.15x. Compression on that volume (which has nearly no overhead), after shuffling the data around, is at 1.42x. So at the cost of CPU time (which there is plenty of), we get a better over-subscription ratio from compression vs deduplication.
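To put those two ratios side by side, here is a rough sketch. It assumes the dedupe ratio can be approximated from the Total row of the DDT table as referenced DSIZE divided by allocated DSIZE (my reading of the output, so treat it as an approximation), and it converts any ratio into the fraction of space saved.

```python
def dedupe_ratio(referenced_dsize_gb, allocated_dsize_gb):
    """Approximate dedupe ratio from the Total row of zpool status -D."""
    return referenced_dsize_gb / allocated_dsize_gb


def space_saved(ratio):
    """Fraction of space saved for a given reduction ratio (1.0 = none)."""
    return 1 - 1 / ratio


if __name__ == "__main__":
    dedupe = dedupe_ratio(1000, 864)  # Total row: referenced vs allocated DSIZE
    compression = 1.42                # ratio reported after moving the data
    print(f"dedupe      {dedupe:.2f}x saves {space_saved(dedupe):.1%}")
    print(f"compression {compression:.2f}x saves {space_saved(compression):.1%}")
```

By this estimate dedupe was saving around 14% of the space while costing gigabytes of RAM, and compression saves close to 30% for little more than some CPU time.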
There are definitely use-cases for deduplication, but his generic VM storage pool is not one of them.
This post is possibly the most important lesson that we learned. RAM is of MAJOR importance to Nexenta. You can’t have enough of it. Our original Nexenta deployment had 12GB of RAM. It seemed like a silly amount of RAM just a year ago. Today we’re looking at it as barely a starting point. Consider these facts :
1 – RAM is an order of magnitude (or more) faster than Flash.
2 – RAM is getting cheaper every day.
3 – You can put silly amounts of RAM in a system today.
4 – Data ages, and goes cold, and doesn’t get accessed as it gets older, reducing your Hot data footprint.
Let’s go through these statements one by one.
1 – RAM is an order of magnitude (or more) faster than Flash. Flash will deliver, on average, between 2,000 and 5,000 IOPS, depending on the type of SSD, the wear on the SSD, and garbage collection routines. RAM has the capability to deliver hundreds of thousands of IOPS. It doesn’t wear out, and there’s no garbage collection.
2 – RAM is getting cheaper every day. When we built this platform last year, we paid over US $200 per 6GB of RAM. Today you can buy 8GB Registered ECC DIMMs for under US $100. 16GB DIMMs are hovering around US $300-$400. Given the trends, I’d expect those to drop significantly over the next year or two.
3 – You can put silly amounts of RAM in a system today. Last year, we were looking at reasonably priced boards that could fit 24GB of RAM in them. Today we’re looking at reasonably priced barebones systems that you can fit 288GB of RAM in. Insane systems (8 socket Xeon) support 2TB of RAM. Wow.
4 – Data ages, goes cold, and doesn’t get accessed as much. Even with only 12GB of RAM and 320GB of SSD, much of our working set is cached. With 288GB of RAM, you greatly expand your capability of adding L2ARC (remember, L2ARC uses some of main memory) and increase your ARC cache capacity. If your working set was 500GB, on our old system you’d be running at least 200GB of it from spinning disk. New systems configured with nearly 300GB of ARC and a reasonable (say 1TB) amount of L2ARC would cache that entire working set. You’d see much of that working set cached in RAM (delivering hundreds of thousands of IOPS), part of it delivered from Flash (delivering maybe 10,000 IOPS), and only very old, cold data being served up from disk. Talk about a difference in capabilities. This also allows you to leverage larger, slower disks for older data. If the data isn’t being accessed, who cares if it’s on slow 7200RPM disks? That PowerPoint presentation from 4 years ago isn’t getting looked at every day, but you’ve still got to save it. Why not put it on the slowest disk you can find?
This being said, our new Nexenta build is going to have boatloads of RAM. Maybe not 288GB (16GB DIMMs are still expensive compared to 8GB DIMMs) but I would put 144GB out there as a high probability.
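A back-of-the-envelope sketch of that tiering, under some generous assumptions: it treats every GB of ARC and L2ARC as usable for the working set and ignores the RAM that the OS and the L2ARC headers consume. That makes the disk figure it produces a floor, not a prediction, and the real number on the old system will be higher. The function name is made up for illustration.

```python
def split_working_set(working_set_gb, arc_gb, l2arc_gb):
    """Best-case split of a working set across RAM, flash and disk.

    Fills RAM first, then flash, and leaves the remainder on disk.
    Ignores OS memory use and L2ARC header overhead (optimistic).
    """
    in_ram = min(working_set_gb, arc_gb)
    in_flash = min(working_set_gb - in_ram, l2arc_gb)
    on_disk = working_set_gb - in_ram - in_flash
    return in_ram, in_flash, on_disk


if __name__ == "__main__":
    # Old build: 12GB of RAM and 320GB of SSD.
    print(split_working_set(500, 12, 320))
    # New-style build: roughly 300GB of ARC and 1TB of L2ARC.
    print(split_working_set(500, 300, 1000))
```

Even in this best case the old build leaves 168GB of a 500GB working set on spinning disk, while the larger configuration serves 300GB from RAM, 200GB from flash, and nothing from disk.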
So let’s start talking about some of the things that we’ve learned over the last year, shall we? The number one thing that we have learned is that your working set size dramatically impacts your performance on ZFS. When you can keep that working set inside of RAM, your performance numbers are outrageous. Hundreds of thousands of IOPS. Latency is next to nothing, and the world is wonderful. Step over that line though, and you had better have architected your pool to absorb it.
Flash forward to a year later, and we’ve found that the IOPS we deliver are much higher than we expected, and latency is much lower. Why is that, you may ask? With large datasets and very random access patterns, you would expect close to worst-case scenarios. What we found is that we had a lot of blocks that were never accessed. We have thick-provisioned most of our VMs, which results in 100+GB of empty space in most of those VMs (nearly 50% of all allocated capacity). With Nexenta and ZFS, all of the really active data was being moved into the ARC/L2ARC cache. While we still had some reads and writes going to the disks, it was a much smaller percentage. We quickly figured out that the algorithms and tuning Nexenta employs in ZFS for caching seem to be very, very intelligent, and our working set was much smaller than we ever really imagined.
We’ve gotten loads of positive feedback from our Blog and from our Anandtech article over the last year. Since we last posted, we’ve found quite a few things that we want to share with everyone about our first build. Keep an eye here for updates and musings about what we have found over the last year, and what we plan on doing next!
The title of this post says it all. How to get InfiniBand to work properly. We have had a roller coaster of fun trying to get the InfiniBand network up and running properly. From the headache of installing the InfiniBand switch module to finding the correct Mezzanine cards for our blades, to getting IPoIB (IP over InfiniBand) working properly on all systems, it’s been interesting to say the least.
When we took on this ZFS Build project, we decided to use OpenSolaris for our ZFS system rather than FreeBSD or a Linux variant. We chose OpenSolaris for the ZFS server because ZFS was originally built for Solaris/OpenSolaris, and we suspected OpenSolaris would therefore include better support for ZFS.
OpenSolaris feels very similar to FreeBSD or Linux, but specific commands may be different. One nice touch is that the installer is included on the LiveCD.