[kwlug-disc] So why not tar -cf tarball.tar a.xz b.xz c.xz, instead of tar -cJf tarball.tar.xz a b c ?
B. S.
bs27975 at gmail.com
Thu Nov 3 18:10:34 EDT 2016
On 11/03/2016 12:37 PM, bob+kwlug at softscape.ca wrote:
> To turn this on its head a bit.....
>
> Are you lamenting the shortcomings of tar and compression or are you
> trying to solve for bit-rot within archives that are in this format?
>
> If the objective is to factor and mitigate bit-rot within compressed
> tar files, perhaps we should look at the medium it's being stored on
> instead.
No choice - today's medium, particularly given the sizes involved, is
disk. [Tape is, and always was, even worse lifespan-wise. The same is
true for optical; but more important still, for both, is cost (given the
capacities involved).] ['Worse', here, for tape, means the number of
passes - not necessarily longevity per pass.]
The objective has always been: given that it can be a very long time
before it matters, and stuff happens - files go bad, let alone entire
disks - how to know a backup tar is good. And, an archive being many
files, one error does not necessarily mean all files are inaccessible.
Problem: how to determine which files within the 'bad' archive are
still good?
And remember the mantra ... test your backups!
> If the compressed tar sits on disk, then you have various options.
> ZFS and BTRFS have the notion of checksumming disk blocks plus
> redundancy and logic to "heal" bit-rotted sectors.
BTRFS has been mentioned throughout, for the reasons you again state.
(Let alone, deduplication possibilities.)
However ... btrfs does not necessarily - and in the home environment
probably doesn't - bring anything more than (possible) detection when
something has gone bad. And things still go bad - damaged is still
damaged, even with btrfs. Moreover, one still doesn't know which files
within the archive are damaged - or conversely, which files are still
entirely accurate.
Note to selves: SCRUB YOUR DISKS REGULARLY! Detect that something has
gone bad sooner rather than later, when you are more likely to be able
to do something about it. (Like have a backup, or source copy, still
around.)
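(For a btrfs filesystem mounted at, say, /mnt/backup - a made-up mount
point here - that's along the lines of:

   btrfs scrub start /mnt/backup    # re-read everything, verify checksums
   btrfs scrub status /mnt/backup   # progress, and any csum errors found

Run as root, and on a schedule - cron or whatever you prefer.)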
> So your compressed tar file _should_ never fault.
No. They will. Count on it. It's only a matter of when. The question is:
will it ever matter? e.g. the archived files are no longer important -
an outdated copy, or superseded by a subsequent backup.
When / if it does matter: how to know which files within the archive are
still intact?
> Granted, you've traded your space
> gains with compression for losses with redundant data having to be
> being stored (ie: mirrored or erasure coded blocks). (I guess you're
> still benefiting since you're not having to keep redundancy for your
> uncompressed data.)
ONLY if your btrfs is mirrored AND you scrub regularly. In that case it
will self-heal.
But ... what about that backup disk (unmirrored) you've put off-site, or
under the stairs in the basement? Perhaps in a slightly moist
environment, perhaps you see a touch of rust ...
When you DO have to haul it out, and it's being persnickety, WHICH files
within the tar are specifically bad, and thus conversely, which are OK?
(And prove it.)
Tape is no longer in the picture in today's environment. 'Worst case'
would be a robotic optical media library. I expect such are less and
less prevalent given the 'cheapness' of disk space, let alone in the
cloud - where I expect data redundancy and certainty -should- be built in.
[Again, though, how do you know the backup is good, wherever it resides?]
> ... RAID5
RAID 0 is for speed, RAID 1 is for safety (above), otherwise RAID is for
uptime. Not data integrity.
Thus off-line archives, let alone off-site backups.
> Ie: tar cv some_source | gzip -9v | erasure_encode > /dev/st0
> And: cat /dev/st0 | erasure_decode | gunzip | tar tv
Is there any difference here from the built-in de/compression options,
and/or de/compressing after the fact with 'xz file' -> file.xz?
Where, for the purposes of this thread, file is myfile.tar.
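(Setting the erasure-coding step aside, the built-in option and
compressing after the fact come to effectively the same thing. A sketch,
file names being placeholders:

   tar -cJf tarball.tar.xz a b c                 # tar pipes through xz itself
   tar -cf tarball.tar a b c && xz tarball.tar   # same net result: tarball.tar.xz

Either way the whole tar ends up as one compressed stream, so damage
early in that stream can make everything after it unrecoverable.)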
> Does pushing the problem down lower into the storage stack help?
No, because ultimately it is the integrity of the whole, from tar's
perspective, that matters (to be able to extract it with the tool used,
in this case, tar).
These other steps merely make it more likely that tar will always be
presented with an integral file to operate upon.
But btrfs bit rot demonstrates that such presentation, even today, isn't
100%. [Btrfs will, however, let one know when rot has happened, and stop
further writes from making it worse.]
As said in a prior message, the solution is --to-command='md5sum -' at
backup creation time. Periodically rerun it and diff the two outputs.
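(A sketch of one way to do that with GNU tar. The helper script and
manifest names are mine, nothing standard; GNU tar exports each member's
name in TAR_FILENAME to whatever --to-command runs, which lets the
checksum lines be labelled:

   #!/bin/sh
   # sum-member.sh - md5sum one tar member from stdin, labelled with the
   # member name GNU tar puts in TAR_FILENAME.
   printf '%s  %s\n' "$(md5sum | cut -d' ' -f1)" "$TAR_FILENAME"

then

   chmod +x sum-member.sh
   tar -cf tarball.tar a b c
   tar -xf tarball.tar --to-command=./sum-member.sh > tarball.md5          # at backup time
   tar -xf tarball.tar --to-command=./sum-member.sh | diff tarball.md5 -   # later

Any member that shows up in the diff no longer matches what was
originally archived.)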
In my recent exercising of these things ... I have seen some pretty
startling differences in compressed file sizes, even though btrfs is
already compressing transparently - e.g. file sizes before / after
running xz.
The problem, as noted, is that compressing the whole tar - and thus
putting various control data (e.g. checksums) -throughout- the file -
means that one sector of bit rot, instead of taking out one file in the
tar, likely takes out the entire tar.
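(Which is the point of the subject line: compress the members first,
then tar them with no -J, so rot lands in at most one member's
compressed data. Again a sketch, with placeholder names:

   xz -k a b c                           # a.xz b.xz c.xz; -k keeps the originals
   tar -cf tarball.tar a.xz b.xz c.xz    # plain, uncompressed tar

Damage within one member's data then spoils only that .xz; the rest
still extract and decompress cleanly.)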
The dearth of checksumming within tar feels predicated on a certainty
that storage, once written, stays intact. Experience has demonstrated
that that certainty absolutely does not exist.
So back to the start - detecting which files within a tar are bad.