ZFS RAID levels

When we evaluated ZFS for our storage needs, the immediate question became – what are these storage levels, and what do they do for us? ZFS uses terminology that looks odd to someone familiar with hardware RAID: Vdevs, Zpools, RAIDZ, and so forth. These are simply Sun’s terms for forms of RAID that are pretty familiar to most people who have used hardware RAID systems.

Striped Vdevs (RAID0)
Striped Vdevs are the equivalent of RAID0. While ZFS does provide checksumming to detect silent data corruption, there is no parity and no mirror to rebuild your data from in the event of a physical disk failure. This configuration is not recommended, because losing even a single drive from a striped array means catastrophic loss of the entire pool.  How To Create Striped Vdev Zpool
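As a quick sketch, a striped pool is just a list of bare disks passed to `zpool create` (the device names here are hypothetical; on Solaris they would look like c0t0d0):

```shell
# Hypothetical disk names -- adjust for your system.
# Each disk becomes its own top-level Vdev, and ZFS stripes across them.
zpool create tank sda sdb sdc sdd

# Inspect the layout -- note there is no redundancy anywhere.
zpool status tank
```

Losing any one of the four disks above destroys the whole pool.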

Mirrored Vdevs (RAID1)
This is akin to RAID1. If you mirror a pair of disks (the mirror as a whole forms a single Vdev), it behaves just like RAID1, except you get the added bonus of automatic checksumming. This catches silent data corruption that is usually undetectable by most hardware RAID cards. Another bonus of mirrored Vdevs in ZFS is that you are not limited to two-way mirrors. If we wanted to mirror all 20 drives on our ZFS system into one giant mirror, we could. We would waste an inordinate amount of space, but we could sustain 19 drive failures with no loss of data.  How To Create Mirrored Vdev Zpool
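A mirror sketch with hypothetical device names; listing more disks after the `mirror` keyword simply widens the mirror:

```shell
# A single mirrored Vdev of two disks (RAID1 equivalent).
zpool create tank mirror sda sdb

# An extreme four-way mirror: one disk's worth of capacity,
# but it survives three simultaneous disk failures.
zpool create tank2 mirror sdc sdd sde sdf
```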

Striped Mirrored Vdevs (RAID10)
This is very similar to RAID10. You create a set of mirrored pairs, and then stripe data across those mirrors. Again, you get the added bonus of checksumming to catch silent data corruption. This is the best performing ZFS layout for small random reads.  How To Create Striped Mirrored Vdev Zpool
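Sketched with hypothetical device names, a RAID10-style pool is just several `mirror` groups in one `zpool create`; ZFS stripes writes across all top-level Vdevs automatically:

```shell
# Two mirrored pairs; ZFS stripes across the two mirror Vdevs.
zpool create tank mirror sda sdb mirror sdc sdd

# More pairs can be added later to grow the stripe.
zpool add tank mirror sde sdf
```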

RAIDZ (RAID5)
RAIDZ is very popular among many users because it gives you the best tradeoff of hardware failure protection versus usable storage. It is very similar to RAID5, but without the write hole that RAID5 suffers from. The drawback is that random reads are limited to roughly the IOPS of a single drive, since every block is spread across all the drives in the Vdev. This causes slowdowns when doing random reads of small chunks of data. RAIDZ is very popular for storage archives where the data is written once and accessed infrequently.  How To Create RAIDZ Zpool
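A minimal RAIDZ sketch (hypothetical device names); the double- and triple-parity variants described below use the same syntax with the `raidz2` and `raidz3` keywords:

```shell
# One RAIDZ Vdev of five disks: four disks of capacity, one of parity.
zpool create tank raidz sda sdb sdc sdd sde

# Same shape with double or triple parity:
#   zpool create tank raidz2 sda sdb sdc sdd sde sdf
#   zpool create tank raidz3 sda sdb sdc sdd sde sdf sdg
```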

RAIDZ2 (RAID6)
RAIDZ2 is like RAID6. You get double parity, so each RAIDZ2 Vdev can tolerate two disk failures. Performance is very similar to RAIDZ.  How To Create RAIDZ2 Zpool

RAIDZ3
This is like RAIDZ and RAIDZ2, but with triple parity. This allows you to tolerate three disk failures per Vdev before losing data. Again, performance is very similar to RAIDZ and RAIDZ2.

Nested RAID levels – You can also stripe across multiple RAIDZ Vdevs in a storage pool. This is akin to RAID50 or RAID60. It increases performance over a single RAIDZ Vdev while reducing the usable capacity of your physical storage.  How to create Striped RAIDZ Zpool
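A striped-RAIDZ sketch (hypothetical device names): listing multiple `raidz` groups in one command creates several RAIDZ Vdevs that ZFS stripes across, roughly analogous to RAID50:

```shell
# Two RAIDZ Vdevs of five disks each; ZFS stripes across the groups.
zpool create tank \
  raidz sda sdb sdc sdd sde \
  raidz sdf sdg sdh sdi sdj
```

Using `raidz2` groups instead gives the RAID60 analogue.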

We have decided to go with Striped Mirrored Vdevs (RAID10). This gives us the best performance in a scenario where we do a lot of writing and a lot of small random reads. It also gives us great fault tolerance. In the best case, we could lose 10 of our 20 disks (one from every mirrored pair) and have no data loss. Obviously we would replace drives immediately after a failure occurs to maintain optimum performance and reliability, but having that safety net of being able to lose that many drives is comforting at night while servers are humming away in the datacenter.

Wednesday, May 26th, 2010 RAID

12 Comments to ZFS RAID levels

  • intel says:

    I am just curious:

    Say you have 20 drives – are you creating a stripe (RAID0) between two Vdevs?

    So you have

    Virtual set of disks 0
    - Disk 0 to Disk 9 (these disks configured as RAID1)

    Virtual set of disks 1
    - Disk 10 to Disk 19 (these disks configured as RAID1)

    And then you stripe the data across those two Vdevs? I believe I read somewhere that for better performance you could create TWO RAIDZ arrays and then stripe them. In the example above you would create two RAIDZ groups of 10 drives each, and then stripe RAIDZ #1 and RAIDZ #2.

    It would be great if you could test/benchmark that to compare if true.

  • admin says:

    There are a lot of options for doing nested RAID with ZFS. The two most common ways are RAID10 and RAID50.

    For RAID10, you would create ten two-drive mirrors and then stripe across the groups. RAID10 is always my personal favorite, since it offers excellent performance and reliability. Some people are critical of the amount of disk space that gets used by RAID10 compared to RAID50.

    For RAID50, you would create several RAIDZ groups and then stripe across them. The RAIDZ groups would need at least three drives each. With 20 drives, you could make four groups of five or five groups of four. Or you could do six groups of three and leave two drives as hot spares. ZFS is really flexible.
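    As a sketch of the 20-drive layouts mentioned above (device names are hypothetical):

    ```shell
    # Four RAIDZ groups of five drives each, striped together.
    zpool create tank \
      raidz sda sdb sdc sdd sde \
      raidz sdf sdg sdh sdi sdj \
      raidz sdk sdl sdm sdn sdo \
      raidz sdp sdq sdr sds sdt

    # The six-groups-of-three variant would list six "raidz" groups of
    # three drives and end with "spare sdX sdY" for the two hot spares.
    ```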

    Back to your specific question, you could build a RAID10 that way, but it would greatly reduce the total disk space available. Your example used two RAID1 groups with 10 drives in each group. If you did that, disks 0 through 9 would all contain identical data. You would be able to tolerate nine drive failures per mirror group, and that much reliability is generally not needed. If you used 1TB drives, your example would result in a storage volume of only 2TB in total size. If you used my example of ten mirrored pairs striped together, your total storage would be 10TB and you would still be able to tolerate one drive failure per mirrored pair.
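    The capacity arithmetic above can be sketched in shell (sizes in TB, assuming the 1TB drives from the reply; each mirror Vdev contributes one drive's worth of usable space regardless of its width):

    ```shell
    drives=20
    size_tb=1

    # Two 10-way mirrors striped: two Vdevs, one drive of space each.
    wide_mirrors=$(( (drives / 10) * size_tb ))

    # Ten 2-way mirrors striped: ten Vdevs, one drive of space each.
    pairs=$(( (drives / 2) * size_tb ))

    echo "two 10-way mirrors: ${wide_mirrors} TB usable"
    echo "ten 2-way pairs: ${pairs} TB usable"
    ```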

  • ccolumbu says:

    Is it possible to use 2 servers for this for added redundancy? So you put 10 drives in serverA and 10 in serverB. Then for each mirrored pair (0-9), 1 drive from serverA is used and 1 from serverB is used.
    Then if you lost your power supply, or network, on serverA, serverB could continue to serve up the data.

  • admin says:

    Yes, you can do redundancy, but it is not as easy as using DRBD under Linux. With OpenSolaris, you can take snapshots and then push the changes since the last snapshot to the backup node using SSH.
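    A sketch of that snapshot-and-push workflow (pool, dataset, and host names here are hypothetical):

    ```shell
    # Take a baseline snapshot and copy the full stream to the backup node.
    zfs snapshot tank/data@base
    zfs send tank/data@base | ssh backup-node zfs receive tank/data

    # Later, send only the changes since the previous snapshot (-i = incremental).
    zfs snapshot tank/data@monday
    zfs send -i tank/data@base tank/data@monday | ssh backup-node zfs receive tank/data
    ```

    If the dataset on the backup node has been touched between receives, `zfs receive -F` forces a rollback to the last common snapshot first.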

  • gmorey1024 says:

    I’ve just been given the assignment of installing an application (Cisco Transport Manager) on a T5120 SPARC with two 300 gigabyte drives. I installed a UFS filesystem on Solaris 10 Update 10 and updated it with a fixpack and thought I was through; however the application, apart from the root, usr, swap and the partition representing the disk, wants 10 additional partitions (a.k.a. filesystems). In addition, the application wants striping and DMP (dynamic multipathing). I’m not sure how to proceed. I immediately started looking at SVM but in the process became aware of ZFS. I’m going to do a re-install this weekend but am uncertain how this is going to pan out. I’m pretty sure I will have to define a root, usr, and swap partition during the install but don’t have a clue how this will relate to ZFS (perhaps this will become clear during the install); then I need to create the partitions the application requires. Originally I was going to mirror the two disks, at least in my mind, but at this point I’m really uncertain how to proceed. If someone could rough this out for me it would be appreciated.

  • admin says:

    Are you installing the application directly on the hardware, or are you planning on using the ZFS system as a storage backend and connecting to it remotely? We don’t dive much into installing applications directly on the hardware in this blog, and don’t have a great amount of experience doing it. If you’re looking at creating additional filesystems or logical volumes with ZFS, it’s quite easy, but you’ll likely want more disks backing your datastore. ZFS really excels when you give it lots of spindles.

  • cryptz says:

    Hi, I have 12 2TB SATA drives as well as 2 PERC H700s with 1GB of cache each, available for a ZFS build. I plan to run RAID10 or the ZFS equivalent. Since these controllers don’t do JBOD, my plan was to split the drives into two groups of six, one group on each controller, and create the RAID1 pairs on the hardware RAID controllers. This would give me 2GB of cache from the controllers (1GB per three RAID1 groupings), and then use ZFS to create the striping groups.

    Would the cache benefit me in this situation? Would I be better off having ZFS do everything? I figured any of the speed benefits of ZFS would be trumped by the cache, but without testing I can’t be sure.

  • admin says:

    cryptz: You are better off using ZFS instead of a hardware card if you are building a SAN or NAS. Install as much RAM as possible, and ZFS will scale to outperform any hardware card.

  • byteharmony says:

    Rather than use DRBD, what about Gluster? I’ve read 2 interesting articles suggesting better results were obtained using it:

    Linux based ZFS approach:

    Solaris based ZFS approach:
    (later in the blog there is a post of lockup on high IO)

    The way I see it:
    Gluster is better in Linux
    ZFS is better in Solaris

    What’s that in the rear view mirror? Windows Server 8 with dedupe and VMware-like feature sets?

    So back to reality, right now for a simple, even relatively large SAN system your Solaris system looks great. As time progresses what do you see?


  • admin says:

    We ran some benchmarks with Gluster a while back. We obviously liked the idea of an open source HA solution. But honestly, the performance of Gluster was not impressive. This was with an older version of Gluster and maybe things have improved since then.

