[kwlug-disc] Old Man Yells at ZFS (was: Docker Host Appliance)
Chris Irwin
chris at chrisirwin.ca
Tue Jan 17 01:19:15 EST 2023
This is kind of derailing into "Old man yells at Cloud" territory here,
so I changed the subject line.
Also, I just want to reiterate: I don't hate ZFS, though I am annoyed
by some of it. I am actually using it on the NAS, and may stick with it,
at least for now, because the pros outweigh the cons (and despite Linus'
official stance of "Don't use it").
https://www.realworldtech.com/forum/?threadid=189711&curpostid=189841
Gripes aside, I'm not actually trying to start a filesystem fight
(otherwise I'd say "FAT16 ought to be enough for anybody" and run away).
On Mon, Jan 16, 2023 at 04:23:27PM -0800, Ronald Barnes wrote:
>Chris Irwin via kwlug-disc wrote on 16/01/2023 14.45:
>
>>But I'm a home user. And a cheap one, at that. I have a bunch of
>>data and a bunch of disks, and I want to be reasonably sure both the
>>data (and backups) are valid.
>
>Then ZFS is the only option - only it detects (and can *correct*)
>corrupt data (if I understand correctly).
Btrfs also does automatic error detection and correction.
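For example, a scrub will read everything, verify checksums, and repair
from the good copy wherever there's redundancy. A rough sketch (assuming
a filesystem mounted at /mnt/data):

    # read everything, verify checksums, repair from the redundant copy
    btrfs scrub start /mnt/data
    btrfs scrub status /mnt/data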
Arguably you could count bcachefs as well, but it's still out-of-tree,
and I haven't seen any reliability analysis on it. I'm not trusting it
with the family photos just yet.
>From listening to them, as I understand it, you'd take the 2 new disks,
>make a vdev from them, then add that vdev to your pool and storage space
>has expanded.
Yeah, that is an option. Two new drives would be limited to mirroring
each other, instead of expanding the existing raid-z vdev, reducing the
amount of storage gained. It also necessitates adding drives in at least
pairs (i.e., I can't go from 4 to 5 drives).
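(For the record, adding that mirror vdev is a one-liner; the pool and
device names here are made up:)

    # add the two new disks as a mirror vdev next to the existing raid-z vdev
    zpool add tank mirror /dev/sdx /dev/sdy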
Honestly, though, we shouldn't have to watch a podcast, or study vdev
calculators to figure out how to add drives effectively. We should be
able to say "Here's two new drives" and have the existing raid-z expand
to use them. We have this with mdadm, we have this with lvm, we have
this with btrfs. It should be in zfs.
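Roughly, with the others (device and mount point names are placeholders):

    # mdadm: add a fifth disk and grow the array onto it
    mdadm --add /dev/md0 /dev/sde
    mdadm --grow /dev/md0 --raid-devices=5

    # btrfs: add the disk, then rebalance existing data across it
    btrfs device add /dev/sde /mnt/data
    btrfs balance start /mnt/data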
>>I've been using mdadm (+lvm) and btrfs for a lot of years,
>I use mdadm + lvm myself, but only through inertia. Adding btrfs to that
>is never gonna happen; as Doug pointed out, it's not reliable.
BTRFS is the default filesystem in Fedora Workstation, and SUSE offers
commercial support for it.
The only caveat is: don't use the parity RAID modes (RAID5/6).
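(If you're not sure what profile a filesystem is actually using, it's
easy to check; mount point made up:)

    # shows the data/metadata profiles in use, e.g. "Data, RAID1: ..."
    btrfs filesystem df /mnt/data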
>And with layers upon layers (ext4 on lvm on mdadm), it still doesn't
>achieve the features of ZFS at a single layer.
LVM (and its snapshots) serves a different purpose than ZFS or BTRFS,
and uses a very different mechanism.
LVM snapshots are not designed to be long-lived. They cause write
amplification just by existing. The idea behind LVM snapshots was to
capture the state of a filesystem, use filesystem or generic tools
(dump/rsync/etc.) to stream it to tape/disk/cloud in a consistent state,
then delete the snapshot. (Although "snapshot, [dangerous task],
merge/rollback" is possible with LVM, too.)
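The classic workflow looks something like this (a sketch; VG/LV and
paths are made up):

    # short-lived snapshot: create, back it up, delete
    lvcreate --snapshot --size 10G --name data-snap /dev/vg0/data
    mount -o ro /dev/vg0/data-snap /mnt/snap
    rsync -a /mnt/snap/ /backup/data/
    umount /mnt/snap
    lvremove -y /dev/vg0/data-snap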
You can do long-lived LVM snapshots if you're using thin-provisioned
LVM, which also helps mitigate some of the write-amplification issues.
Other than Red Hat's Stratis, this doesn't seem to be an area of active
interest or development (which is probably fine).
(That said, thin-provisioned LVM is pretty neat, and I wish it had
existed years ago.)
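If anyone's curious, thin LVM looks roughly like this (names made up;
note the snapshot needs no pre-allocated size):

    # create a thin pool, a thin volume, and a (long-lived-capable) snapshot
    lvcreate --type thin-pool --size 100G --name pool0 vg0
    lvcreate --thin --virtualsize 50G --name data vg0/pool0
    lvcreate --snapshot --name data-snap vg0/data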
BTRFS & ZFS snapshots cause no write penalty for existing, because data
is never written in place anyway. So snapshots are "free" in terms of
performance, and can therefore be long-lived. While there is no
additional write amplification, you do potentially suffer from file
fragmentation over time (basically a non-issue on solid-state, but maybe
annoying on HDDs, depending on the environment).
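Which is also why taking them is a casual, instant operation (sketch;
dataset and path names made up):

    # both complete instantly, regardless of how much data is referenced
    btrfs subvolume snapshot -r /mnt/data /mnt/data/.snapshots/2023-01-17
    zfs snapshot tank/data@2023-01-17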
(Also, annoyingly, LVM and BTRFS/ZFS are all referred to as "COW", even
though they do very different things with very different impacts, and
only LVM actually copies on write. Don't get me started on qcow2...)
>Even rsync falls far short of ZFS's ability to detect a single changed
>block in a 1TB file and backup only that one block.
Yeah, filesystems like ZFS and BTRFS that break layer boundaries are
somewhat necessary to get that most efficient "minimum difference
between snapshot A and snapshot B" backup. Any after-the-fact comparison
tool, like rsync, will never be able to compete in terms of speed or
efficiency.
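That's what the incremental send streams are for, roughly (snapshot and
host names made up):

    # send only the blocks that changed between two snapshots
    zfs send -i tank/data@monday tank/data@tuesday | \
        ssh backuphost zfs receive backup/data

    # btrfs equivalent, using the earlier snapshot as the parent
    btrfs send -p /mnt/data/.snapshots/monday /mnt/data/.snapshots/tuesday | \
        ssh backuphost btrfs receive /backup/data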
>>if you wanted. With BTRFS if I had four disks and one failed (and I
>>have enough free space), I could rebalance the array to use one fewer
>>drive and recover a measure of redundancy while waiting for
>>stock/sale/shipping/payday.
>
>Sounds like a handy feature.
>
>But again, with btrfs RAID5/6 *should not be used in production*.
Agreed, never use the parity modes in BTRFS.
However, if you missed that warning when you created your BTRFS
filesystem and want to fix it, you can rebalance from RAID5 to RAID1 and
be comfortably safe again. No need to reboot or even umount.
If you have a plethora of drives and want extra redundancy? Rebalance to
RAID1c3, and you have three copies of everything.
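Both are online one-liners, something like (mount point made up):

    # convert data and metadata from raid5 to raid1, while mounted
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/data

    # or, with enough drives, keep three copies of everything
    btrfs balance start -dconvert=raid1c3 -mconvert=raid1c3 /mnt/data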
If you decided you used the wrong layout for your zfs vdev, you need to
destroy it and recreate it.
>>ZFS also can't fully utilize mismatched disks, apparently. My
>>4-drive array has 2x6TB and 2x8TB drives, which means there's 2x2TB
>>worth of unusable space on the 8TB drives. This worked fine with
>>btrfs.
>
>I believe you could have 1 vdev of 6TB drives and 1 vdev of the
>8TB drives together in a pool without losing the 2TB.
Correct, although using 2x mirrors to use that extra 2x2TB of disk ends
up with a smaller usable amount of space than just using raid-z and
losing that 2x2TB. It's not at capacity yet, so currently it's just
annoying. I'll be at the angry muttering stage if these disks fill up,
though.
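Back-of-the-envelope (single-parity raid-z, ignoring overhead): two
mirror vdevs give 6 + 8 = 14TB usable, while a 4-drive raid-z1 gives
3 x 6 = 18TB usable even though it leaves 2x2TB of the 8TB drives
sitting idle.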
--
Chris Irwin
email: chris at chrisirwin.ca
xmpp: chris at chrisirwin.ca
web: https://chrisirwin.ca