Physical Corruption (aka "Bit Rot")

1.0 What and Why is Bit Rot?

Computers store the data they need to process in RAM chips temporarily; and on hard disks (or solid state drives) more permanently. Every time your computer writes data to one of these forms of data storage, it can make a mistake. Even when it writes it correctly, the data at rest in any of these storage locations can change over time -the magnetic material of which a hard disk is made can physically degrade, for example; or a power-spike can affect the contents of data temporarily at rest on a RAM chip. Finally, even if your computer writes the data correctly; and even if it is stored unchanged in or on its storage medium; the computer can mis-read the data when it next needs to access it: the charge levels in the cells of a solid state drive might have dropped below what can reliably be read, or the hard disk might simply mis-read a tiny magnetic 1 as a 0, given the enormous data densities and high speeds of modern hard disks.

In short: whether the data degrades because of mechanical or material failures, or because power grids occasionally supply electricity at voltages and frequencies outside their normal range, or simply because (and it has happened!) a cosmic ray happens to strike a storage device at the wrong time and in the wrong place... your digital data may not always remain in the state you want it to be in. That's called bit rot in the computer industry, and for anyone listening to classical music stored in digital form, its practical consequences can range anywhere between "wouldn't have known" and "glitching like old scratched CDs used to when they were having a fit".

Digital music collections, therefore, need protecting against bit rot. Niente has a part to play in detecting bit rot if it happens, but it can never fix it if it finds it. Prevention is therefore miles better than the cure -and I thought I'd briefly discuss here how to prevent bit rot from happening in the first place.

2.0 Bit Rot Prevention - Physical

To prevent bit rot, as much as possible, you want to be using reliable computer kit in a controlled environment. By this I mean, try not to stick your music storage devices in lofts that alternately freeze in Winter and cook in Summer (guilty as charged, though: sometimes you have no choice in such things!). Keep your computers in as dust-free an environment as you can possibly make it. If at all possible, run your computers from Uninterruptible Power Supplies (otherwise known as batteries), as these will have current surge protections built-in: if your domestic electricity supply fails completely, a good UPS can shut your servers down gracefully, so they don't just crash; and if your power supply is 'dirty', having lots of spikes and surges, the battery features of a UPS can even those fluctuations out. Incidentally, this is why an old laptop can make quite a good music server: the batteries that make laptops portable can also prove useful in cleaning up questionable power supplies. If your budget won't spring for a full-featured UPS, at the very least invest in one of those multi-socket power extenders that provide some element of surge protection.

Next on the prevention list: if at all possible, use ECC RAM for your computer's physical memory. ECC stands for 'Error Correcting Code' and RAM that is ECC capable uses nine chips rather than standard RAM's eight. The extra chip is used to store "parity" data that can be used to reconstruct the 'real' data, if part of it goes wrong. Imagine you wanted to store the numbers '2' and '3' in RAM: on conventional RAM, the 2 would be stored in one piece of RAM, the 3 in another. In ECC RAM, as well as storing the 2 and the 3, we also store a '1' on the extra parity chip. Why? Because if a cosmic ray wiped out the 2, but left the 3, I could take the 3, minus the 1, and reconstruct the 2: I get 3 and 2 back again. Or, if I lost the 3, I could take the 2 and add the 1, reconstructing the 3. That explanation is embarrassingly simplified, I hasten to add! But the principle is true: if I construct 'extra' data from the data you actually want to store, I can use that extra data to reconstruct any real data that goes missing or gets corrupted. ECC memory, therefore, helps prevent bit rot happening whilst data is temporarily at rest in your computer's working memory. Technically, ECC RAM can detect and correct 'single bit memory errors', thereby preventing already-corrupted-in-memory data from being written to disk.
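The '2, 3 and 1' example can be sketched in a few lines of Python. This is only a toy: it uses XOR as the parity operation (which happens to give the same '1' for 2 and 3), and the function names are my own inventions. Real ECC memory typically uses more elaborate Hamming-style codes across whole 64-bit words, so please don't read this as a description of what the hardware actually does.

```python
# Toy parity illustration only: real ECC DIMMs use Hamming-style codes,
# not a single XOR of two small numbers.

def make_parity(a: int, b: int) -> int:
    """Derive the 'extra' value stored alongside the two data values."""
    return a ^ b  # XOR: 2 ^ 3 == 1


def recover(survivor: int, parity: int) -> int:
    """Rebuild the lost value from the surviving value and the parity."""
    return survivor ^ parity


a, b = 2, 3
parity = make_parity(a, b)
print(parity)              # 1
print(recover(b, parity))  # the 2 was lost: rebuilt from 3 and 1 -> 2
print(recover(a, parity))  # the 3 was lost: rebuilt from 2 and 1 -> 3
```

The point is simply that the parity value carries enough information to rebuild either one of the two originals, provided the other survives.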

ECC RAM is slightly more expensive than ordinary RAM and not every CPU or motherboard can use it. In particular, Intel is notorious for actively preventing their consumer CPUs (things like i7, i5 and i3 chips, for example) from making use of it, though AMD's Ryzen chips are frequently said to be able to use it effectively, even when AMD don't officially endorse its use. Intel server and workstation chips (so, Xeons) usually do support the use of ECC RAM -and I would therefore encourage you to buy computer hardware which uses Xeons or Ryzen chips to use for your digital music storage server(s). An old server that uses a 2011-ish vintage Xeon can be purchased very cheaply these days.

The Xeon E5-2407, for example, was released in 2012, so we are talking pretty ancient by now... but it has more than enough power to run a storage server and it supports the use of ECC memory, so will physically help keep your music in pristine, not-rotting fashion! If your preferred storage server is a more modern prebuilt device like a QNAP or a Synology NAS, be aware that the 2- and 4-disk consumer appliances tend not to support ECC RAM, but the higher-end 8-disk units do. Shop around and prefer ECC-capable models, if you and your wallet can, anyway.

3.0 Bit Rot Prevention - Software

If you are not a computer-ese guru, the next bit of bit-rot prevention advice is likely to sound like complete gobbledygook, I realise. I can only urge you to try to get on top of this particular subject matter, however, because the longevity of your music collection really depends on it!

ECC RAM will help ensure that what your computer writes to disk is in good order. Once it has been written to disk (or SSD), though, it can degrade over time, for the reasons outlined above -and ECC RAM cannot help you detect that. Similarly, pristine data that sits on a hard disk which simply and completely mechanically fails is still lost, however pristine it was! No amount of ECC RAM can walk you back from that particular form of catastrophe!

What you need, therefore, is a bit-rot-detecting file system that your operating system can rely on to do hard disk bit-rot detection and hard disk failure recovery. For Linux and *BSD users, this is not a difficult requirement to meet: the ZFS file system has built-in bit-rot and hard disk failure detection and healing capabilities when used in a multi-disk storage array and it will work fine on Linux, even though its licence is technically incompatible with Linux's own GPLv2. An alternative to ZFS is Btrfs, whose licence is fully compatible with Linux's, but whose tendency to lose data in a multi-disk setup means I can't recommend it. Plenty of people do use it for its bit-rot protection, nonetheless.

The underlying principle of ZFS is that it does to data on a file system what ECC memory does to RAM chips: it writes extra data that it can use to re-construct data it detects to have been corrupted or lost. Using the previous example: I write 2 and 3 to the hard disk; ZFS writes a 1 on another part of the storage array. If a cosmic ray strikes the hard disk platter and alters the 3 into being a 9, then ZFS will be able to detect the error, since 2+1 doesn't equal 9; having detected the error, it can then correct it, because it knows that the 9 should actually be equal to 2+1, and therefore can over-write it with a replacement for the lost 3. To get this sort of error detection and correction going, you simply schedule regular 'scrubs' of your ZFS storage 'pools'.
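To make the 'detect, then repair' idea a little more concrete, here is a minimal Python sketch of a two-way mirror that records a checksum when data is written and uses it during a 'scrub' to spot and overwrite a bad copy. Everything in it (the ToyMirror class, the block name, the use of SHA-256) is an assumption made for illustration: it is not ZFS's on-disk format or its actual algorithm, just the general principle in miniature.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Checksum recorded at write time (SHA-256 chosen purely for convenience)."""
    return hashlib.sha256(block).hexdigest()


class ToyMirror:
    """Two copies of each block, plus a checksum of the known-good data."""

    def __init__(self) -> None:
        self.disks = [{}, {}]   # two hypothetical 'disks': block id -> bytes
        self.checksums = {}     # block id -> checksum taken when written

    def write(self, block_id: str, data: bytes) -> None:
        for disk in self.disks:
            disk[block_id] = data
        self.checksums[block_id] = checksum(data)

    def scrub(self) -> list:
        """Verify every copy and overwrite bad copies from a good one."""
        repaired = []
        for block_id, expected in self.checksums.items():
            good = [d[block_id] for d in self.disks
                    if checksum(d[block_id]) == expected]
            if not good:
                raise IOError(f"block {block_id!r}: no intact copy left")
            for index, disk in enumerate(self.disks):
                if checksum(disk[block_id]) != expected:
                    disk[block_id] = good[0]
                    repaired.append((block_id, index))
        return repaired


pool = ToyMirror()
pool.write("track01", b"\x02\x03")
pool.disks[0]["track01"] = b"\x02\x09"   # simulate the 3 turning into a 9
print(pool.scrub())                       # [('track01', 0)]
print(pool.disks[0]["track01"] == b"\x02\x03")  # True
```

Note that the repair only works because a second, intact copy exists: on a single disk with no redundancy, a checksum mismatch can be detected but not fixed.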

I won't go into too much more detail about all this here, other than to summarise by saying: you should store your digital music collection on a ZFS file system that writes to at least 2 physical hard disks, preferably 4. Personally, I use a zpool which is configured as a 'mirrored stripe': that is, I put two hard disks into a state where one mirrors the contents of the other. If one disk fails, a good copy of the data is still available from the surviving disk. Two more hard disks are then put into a second mirror, with the same result. Finally, both mirrors are put into a single 'stripe', so that a chunk of data is written not to one or the other, but to both mirror pairs, spread out across them. The net result is that I can have one hard disk in each mirror mechanically fail (so, two in total) -and still be able to recover all my data. Lose both disks from the same mirror, though, and the whole pool goes with them. The cost is that if my storage array uses four 4TB hard disks (so, 16TB raw storage in all), I end up creating two 4TB mirrors and then striping across both... meaning that I've only got 8TB of usable storage, not 16TB. Effectively, I lose two hard disks to 'data redundancy': two disks are there to protect my data, rather than store it by themselves.
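The arithmetic and the failure cases for a pool built from two 2-disk mirrors can be checked with a short Python sketch. The disk labels and function names are hypothetical, and the model is deliberately crude: it only asks whether each mirror still has at least one healthy disk.

```python
from itertools import combinations


def usable_tb(disk_tb: float, mirrors: int) -> float:
    """Usable space of striped mirrors: one disk's worth per mirror."""
    return disk_tb * mirrors


def survives(failed: set, layout: list) -> bool:
    """The pool survives while every mirror keeps at least one healthy disk."""
    return all(any(d not in failed for d in mirror) for mirror in layout)


layout = [("A1", "A2"), ("B1", "B2")]   # hypothetical disk labels
disks = [d for mirror in layout for d in mirror]

print(usable_tb(4, mirrors=2))          # 8 (TB usable, out of 16 TB raw)

for pair in combinations(disks, 2):
    outcome = "survives" if survives(set(pair), layout) else "pool lost"
    print(pair, outcome)
```

Of the six possible two-disk failures, four leave the pool intact and two (both halves of the same mirror) do not, which is why the second server described later in this article matters.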

There is another 'cost' to this approach, however: you will have to not use Windows on your storage server, because ZFS does not yet reliably work on Windows (though development efforts continue) and Windows itself has no equivalent 'self-healing' file system. Technically, the ReFS file system was supposed to be usable in this sort of way, but Microsoft has mucked about with it since it was first released in such a way as to make it look like abandonware at this stage. I wouldn't want to store anything I care about on an ReFS file system, anyway! Whilst it's only your storage server that needs to be running Linux or *BSD, leaving your PC free to continue to enjoy the delights of Windows 11, it does mean you need to learn how to set up a Linux or *BSD server of some sort. A relatively simple way to dip your toes in those waters is to experiment with setting up a TrueNAS or OpenMediaVault server (both involve installing an operating system which is more of a 'software appliance', making things quite easy to administer: both are completely free of charge to obtain and use, too).

4.0 Bit Rot Prevention - Duplication and Backup

In my (all too brief and over-simplified) explanation of ZFS, I explained how my own personal music server can survive the loss (or corruption) of two hard disks, provided they are not both from the same mirror. This is true: but it won't help if I suffer corruption or loss on three of my four hard disks! In other words, whilst ZFS is a powerful way of detecting and fixing limited data corruption and loss, it can't fix everything.

For that sort of capability, you need to make whole copies of your data, from your 'true server' to a duplicate server.

In my own setup, for example, I have a 4-disk server running ZFS duplicating to another 4-disk server, also running ZFS. If one server bites the dust, my music collection remains safe on the other. ZFS has a duplicating capability built-in, so no other software is really required. However, I choose to script my duplication runs using rsync software -other products are available which can achieve the same sort of results. The details of how you duplicate your music collection are not as important as the fact that you should duplicate it.

As it happens, I'd go one step further and recommend a third duplicate onto a third server. My third server only runs 4x2TB hard disks in a stripe: if I lose a single hard disk, I lose the entire array. So it's not a robust duplicate, but it's there as a fallback of last resort. Two other servers have to suffer fairly catastrophic failures before I need to rely on this third duplicate one, so it's a risk I'm prepared to take (and which my wallet wasn't prepared to properly mitigate!)

Finally, follow the 3-2-1 rule. That is, have (at least) three copies of your data; two of them on-site, but on different physical media; one of them off-site. I certainly meet the first two of those requirements: I have the 'source' music collection, a duplicate copy on a second server, and a third copy on a slightly-risky third server. So, I have two backups, plus the originals, at home, on different disks and different disk configurations. However, if my house burns down, it takes all three sets of my music with it, resulting in me having no more music! To meet the '1' part of the 3-2-1 rule, therefore, I continually back up my third server to 'The Cloud', in the form of a cheap, unlimited data plan with Backblaze. For US$65 a year, I can store as much data with them as I need to, on the other side of the world, which should come in handy should my house ever be destroyed by flood, fire or hurricane. Should I need my data restored from them at any point, they will ship it to me on a hard disk, which I can return to them once I've copied it onto my own storage... and thus not pay for.
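If it helps to see the 3-2-1 rule written down as a checklist, here is a tiny Python sketch. The Copy class, the labels and the way 'different media' is judged (by comparing a free-text medium label) are all my own simplifications, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    name: str
    medium: str     # free-text label, e.g. "zfs-mirror", "zfs-stripe", "cloud"
    offsite: bool


def meets_3_2_1(copies: list) -> bool:
    """At least 3 copies, on at least 2 kinds of media, at least 1 off-site."""
    return (len(copies) >= 3
            and len({c.medium for c in copies}) >= 2
            and any(c.offsite for c in copies))


at_home = [
    Copy("source server", "zfs-mirror", offsite=False),
    Copy("second server", "zfs-mirror", offsite=False),
    Copy("third server", "zfs-stripe", offsite=False),
]
print(meets_3_2_1(at_home))                                          # False
print(meets_3_2_1(at_home + [Copy("cloud backup", "cloud", True)]))  # True
```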

One drawback with Backblaze's 'personal' plans is: they only work when the source of the data is a Windows PC. So, I have a fourth computer in my loft, pulling data from my third server, running Windows 10. If the prospect of running Windows fills you with as much dread as it did me, you could always spring for one of Backblaze's more 'professional' products: their 'Business Backup' appears to cost $5 per TB per month, for example -and supports TrueNAS out-of-the-box, no intermediate Windows PC required! I don't have personal experience with these sorts of products, however.

If the cost of backups in the cloud puts you off, then at least spring for a USB hard drive onto which you take periodic backups -and then store that hard drive with a trusted neighbour or relative.

One final but related point to mention here, though it possibly descends to the depths of complete data paranoia: I have a fourth server in the house which I only power on on the first day of each month, and which then pulls an exact copy of the data from the third server before immediately shutting itself down. The reason for yet another in-house copy of the data is that it's offline for most of the time, whereas the other three servers (plus the Windows PC) are permanently running. If my other half's resolutely Windows PC were ever to pick up a piece of crypto ransomware, for example, then it's possible that would start encrypting data right across the home network -including on all three storage servers. No cryptoware can damage data that's not accessible because its host server is powered off, though! That USB drive you trusted your neighbour to store for you? That counts as offline storage, too, as well as being off-site 🙂

5.0 Niente's Purpose

Summing up so far, therefore: if you run your storage hardware in clean, controlled conditions; if you provide a relatively stable and clean power supply to it; if you use ECC RAM to prevent bit-flips in memory; if you use the ZFS file system across multiple hard disks; if you duplicate your data onto separate storage, also configured with ZFS across multiple hard disks; if you ensure that you copy your data off-site, at least periodically; and if you finally ensure that at least one copy of your data is not permanently accessible over your home network... then you are unlikely ever to suffer from bit rot, corruption, data losses caused by hard disk failures and so on.

In which case, the question arises: what's the point of Niente, then?!

I'll answer that in two ways. First, and most simply, Niente doesn't just check for the presence of physical corruption. It does logical inconsistency checking too, so it does things which ECC RAM, ZFS file systems and a thousand backups could never detect or resolve! See this other article for a description of that sort of functionality.

But the second part of my answer is that FLACs are written by software and that software can itself go wrong over time. A FLAC could degrade internally, for example, by you re-tagging the list of performers stored in the COMMENT tag. The software used to do that tagging might not be entirely clean in the way it writes back to the FLAC, and might cause internal corruption to the audio signal it contains which wouldn't appear to count as physical 'bit rot' from the point of view of ZFS or ECC RAM. It's not supposed to do that, of course... but software can contain bugs, so what's supposed to happen and what actually happens can quite often be two different things!

If you are doing everything 'right' as far as computer memory, robust file systems and multiple copies and backups of your data are concerned... there is nevertheless purpose in periodically running a Niente integrity check against your music collection, in order to detect such 'software errors' and to give you a chance to correct them (by restoring good copies of the affected files from all those duplicate and backup copies, of course!)

6.0 Niente's Functionality

6.1 General Principles

In case you haven't met this particular aspect of computing before, let's start axiomatically: it is possible to apply a mathematical function to any piece of data and thereby derive a single 'hash value' for that data which is, for all practical purposes, unique to it. The hash value cannot (usually) be converted back into the data itself, but the same data subject to the same mathematical function at a later date will always return the same hash value, so that the hash value acts as a 'fingerprint' for the data.

By way of example, let's take the phrase "This is some data!" and pass it through a particular mathematical function to generate what's called the MD5 hash. This is easy to do on most operating systems, at the command line:

Here we see that the phrase 'encodes' to a long string of letters and digits, starting "e914b8466...". If I ask for a new MD5 hash for that same data, I should always get the same result:

A key feature of the MD5 hash function, however, is that if you change the input data by even a tiny amount, the resulting hash will likely be wildly different from the original:

Here, you see me remove the final exclamation mark from the input phrase. So the phrase is now just one character different from before... but the resulting MD5 hash is nowhere close to the original. Now it starts "4b4cbba68adce2d..." and is therefore different from the original hash in almost every place.
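The same experiment can be repeated in Python using the standard hashlib module, as a rough equivalent of the command-line demonstration. The md5_hex helper is just a name I've made up, and the digests it prints may not match the ones quoted above, since a command-line tool may well hash a trailing newline along with the phrase.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hash of a text string as 32 hexadecimal digits."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


first = md5_hex("This is some data!")
again = md5_hex("This is some data!")
changed = md5_hex("This is some data")   # final exclamation mark removed

print(first == again)    # True: same input, same hash
print(first == changed)  # False: one character fewer, different hash

# Count how many of the 32 hex digits differ between the two hashes.
differing = sum(1 for x, y in zip(first, changed) if x != y)
print(f"{differing} of {len(first)} hex digits differ")
```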

Hashes don't just have to be generated from text strings, though: any data can be fed to the MD5 hashing function and return an MD5 hash, including FLAC files:

Here, you see me first checking that a FLAC exists in my Desktop folder and then passing that entire FLAC through the MD5 function, to once more return a simple hash value. Were I to change one bit within that FLAC, the entire hash value would change. Let me demonstrate that for you now.

First, I'll open that same FLAC file in a hexadecimal editor (that's just an editor which can alter the contents of binary files, just as Word or Notepad can alter the contents of text files):

Being binary data, it looks complete gibberish, of course. Let me alter one small piece of this data, though:

If you compare that screenshot to the previous one, you'll notice only a single change to the binary data: a "4F" has become a "6F" on line 00000168. So now let's see what that one change has made to the MD5 hash value for this FLAC file:

Now we are getting "4ac0c7c0c78ce0a83430bf36ce8fd92f" where before we had "e97d860551bc33ab9e7009259ecf69d0": one tiny change in the source data results in a huge change in the MD5 hash value, therefore.
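If you don't have a hexadecimal editor to hand, the single-byte experiment can be sketched in Python instead. This version uses a throwaway file of random bytes as a stand-in for a FLAC (an assumption made so the example runs anywhere), changes one byte from 4F to 6F and compares the whole-file MD5 before and after. The helper names are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def overwrite_byte(path: str, offset: int, new_value: int) -> None:
    """Replace the single byte at 'offset' with 'new_value'."""
    with open(path, "r+b") as handle:
        handle.seek(offset)
        handle.write(bytes([new_value]))


# Stand-in file of random bytes: any binary file would behave the same way.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(os.urandom(4096))
    tmp.write(b"\x4f")                 # the byte we are going to alter
    tmp.write(os.urandom(4096))
    sample_path = tmp.name

before = md5_of_file(sample_path)
overwrite_byte(sample_path, 4096, 0x6F)   # 4F becomes 6F
after = md5_of_file(sample_path)

print(before)
print(after)
print(before != after)   # True
os.remove(sample_path)
```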

We can work this in reverse, too, of course: if we keep getting the same MD5 hash as we got before, we can conclude, for all practical purposes, that no alteration to the underlying data has taken place (it is theoretically possible for two different pieces of data to share an MD5 hash, but the odds of random corruption producing such a match are vanishingly small). But if we ever see a change in the MD5 hash value, we know with certainty that, to one extent or another, the data we now have is not the same data we had originally. This is the underlying principle of the way Niente performs a physical integrity check of your FLAC files. Every time you ask it to perform a complete integrity check, Niente computes a new MD5 hash value for the FLAC file and stores it in its database. It can then compare that new MD5 hash value to the one it had before: if they are different, the FLAC file has changed internally (for whatever reason, and to whatever extent, Niente cannot know) and that needs looking into.
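The 'compare the new hash to the one stored earlier' idea can be sketched as follows. This is not Niente's actual code or database layout: the dictionaries, file names and compare_hashes function are assumptions made purely to show the principle, re-using the two hash values from the example above.

```python
def compare_hashes(stored: dict, current: dict) -> dict:
    """Sort files into unchanged, changed and new, by comparing hash values."""
    report = {"unchanged": [], "changed": [], "new": []}
    for path, digest in sorted(current.items()):
        if path not in stored:
            report["new"].append(path)
        elif stored[path] == digest:
            report["unchanged"].append(path)
        else:
            report["changed"].append(path)
    return report


# Hashes recorded at an earlier check (hypothetical file names).
stored = {
    "symphony1.flac": "e97d860551bc33ab9e7009259ecf69d0",
    "symphony2.flac": "a476c66b3f4d9b2273b403ca1e90e120",
}
# Hashes computed today: the first file no longer matches.
current = {
    "symphony1.flac": "4ac0c7c0c78ce0a83430bf36ce8fd92f",
    "symphony2.flac": "a476c66b3f4d9b2273b403ca1e90e120",
    "symphony3.flac": "aa14fca10894ce55e91ea186993aa54f",
}

report = compare_hashes(stored, current)
print(report["changed"])     # ['symphony1.flac']
print(report["unchanged"])   # ['symphony2.flac']
print(report["new"])         # ['symphony3.flac']
```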

6.2 Nuances

I pause at this point to explain a subtlety: a FLAC file contains more than just music! The FLAC can (and should, in my view) contain embedded album artwork, various metadata tag information telling us who the composer is, who is performing and what they're performing, for example. This tag data is usually supplied by you, typing it in (using a tool such as Semplice)... and with the best will in the world, you're going to make mistakes from time to time when typing it in. You'll therefore want to go back and correct such tag-typos... and at that point, you'll be modifying the internal contents of the FLAC file as a whole.

If Niente did what I just did in Section 6.1 above and calculated an MD5 hash value for the entire FLAC as a whole, therefore, all such tag amendments would trigger a change in the hash value... and Niente would forever be declaring the file potentially corrupt, despite the change being perfectly legitimate.

Fortunately, FLAC files store their internal bits and pieces in distinct 'blocks'. Album art is stored within a picture block, for example; tag data is stored within a 'vorbis comment' data block. Crucially, the audio bit of a FLAC (the bit we really care about, I suppose!) is stored within its own 'stream block'. Niente therefore applies the MD5 hash algorithm to only the audio stream part of the file. You can therefore modify metadata tags and add and remove album artwork until the cows come home -and Niente will not regard any such changes as being a sign of internal corruption. Only if the audio data part of the FLAC changes will the MD5 hash value change -and that can only happen if some sort of 'bit rot' is affecting that part of the file inappropriately (as might occur if, for example, a tagging program you use doesn't understand how to write back to FLACs properly).

Another subtlety I'll mention at this point is that when any FLAC file is first created, an MD5 hash value is computed for the audio stream component of the file and stored within the file itself, entirely automatically and without your knowledge or express permission. All FLAC encoders that meet the FLAC specifications should do this. This means that there's an MD5 hash we can compare to from the moment a FLAC file is created. You can see this 'native' MD5 hash by exposing it using the metaflac tool (that's included whenever you install the FLAC encoding/decoding software):

In this screenshot, we see I have a new FLAC file. I first use metaflac to expose its 'internal, original MD5' hash value: metaflac returns a value of a476c66b3f4d9b2273b403ca1e90e120. I then use the md5sum program to compute my own MD5: I get aa14fca10894ce55e91ea186993aa54f. Oh no: they're different! But, hang on... we expect them to be different: metaflac's result is the MD5 hash for only the audio component of the file. The value being returned by the md5sum program is the 'fingerprint' of that audio signal plus all the metadata and album art. They are naturally and inevitably not the same data, so the hashes won't agree.

Fortunately, Niente is fully aware of this. It certainly reads the 'internal, original MD5' that was embedded within the FLAC at the moment of its creation. When it computes a new MD5 hash value, however, it does so by using methods which are a good deal subtler than what md5sum is capable of! It thus computes the audio-only MD5 hash value, which can legitimately be compared to the 'birth MD5' value.
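For the curious, the 'birth MD5' lives in the last 16 bytes of the STREAMINFO block, which is the first metadata block after the 'fLaC' marker at the start of the file. The Python sketch below reads it directly and should report the same value as metaflac --show-md5sum. It is not how Niente does it (I'm only illustrating the file layout), the function name is my own, and it makes no attempt to cope with files that have other data, such as an ID3 tag, stuck on the front. The demonstration builds a minimal synthetic header rather than needing a real FLAC.

```python
import os
import tempfile


def read_embedded_md5(path: str) -> str:
    """Return the audio MD5 stored in a FLAC file's STREAMINFO block."""
    with open(path, "rb") as handle:
        if handle.read(4) != b"fLaC":
            raise ValueError("not a FLAC file (missing 'fLaC' marker)")
        header = handle.read(4)
        if len(header) != 4:
            raise ValueError("file is truncated")
        block_type = header[0] & 0x7F           # top bit is 'last block' flag
        length = int.from_bytes(header[1:4], "big")
        if block_type != 0 or length != 34:
            raise ValueError("first metadata block is not STREAMINFO")
        streaminfo = handle.read(34)
        if len(streaminfo) != 34:
            raise ValueError("file is truncated")
    return streaminfo[18:].hex()                # final 16 bytes are the MD5


# Synthetic stand-in: marker + STREAMINFO header + 18 filler bytes + MD5.
known_md5 = "a476c66b3f4d9b2273b403ca1e90e120"
fake_flac = (b"fLaC" + bytes([0x80]) + (34).to_bytes(3, "big")
             + bytes(18) + bytes.fromhex(known_md5))

with tempfile.NamedTemporaryFile(delete=False, suffix=".flac") as tmp:
    tmp.write(fake_flac)
    fake_path = tmp.name

embedded = read_embedded_md5(fake_path)
print(embedded == known_md5)   # True
os.remove(fake_path)
```

Reading the stored value is the easy half. Computing a fresh audio-only MD5 to compare against it means decoding the audio first, because the embedded hash is taken over the decoded audio samples rather than over the compressed bytes in the file; the flac program's -t (test) option performs that sort of decode-and-verify check.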

6.3 When are new hashes computed?

Once you run Niente for the first time, you are probably first going to take main menu Option 1: Create a database of music files which gets Niente to scan for FLACs in a folder hierarchy. If any are found, records are created for them within Niente's database. This 'discover, scan and populate' process does not compute nor create any MD5 hash information.

Only when you take main menu Option 3: Perform a full Integrity Check or main menu Option 4: Perform a differential Integrity Check are fresh MD5 hashes computed or original-at-the-time-of-creation hashes read from FLAC files. The difference between the two options is that the differential integrity check only reads/computes MD5 hashes for some FLACs, according to a list of criteria, whilst the comprehensive integrity check reads and computes MD5 hashes for every FLAC file that is listed as existing in the Niente database. Note that Option 5: Perform a fast Integrity Check specifically will not do any MD5 hash computation or checking: it's fast for a reason, and the reason is that it was specifically designed to skip performing any checks of physical integrity!

The differential integrity checks only apply to:

  • FLACs which have not been checked before (i.e., they are new to the collection and Niente knows about them because of a refresh operation, but has not yet integrity checked them)
  • FLACs which have inconsistent MD5 checksums (i.e., they are already suspected of being physically corrupt: the re-check is done to see if they've been fixed or the detected corruption was a one-off sort of thing)
  • FLACs which have any sort of logical inconsistency (such as missing tags, composer names that don't match what's in the ARTIST tag and so on)
  • FLACs whose last check was done at least CHECKDAYS days ago (30 days, in the examples that follow)

Assume a static music collection (i.e., no new music is being added to it). Assume further that you perform an initial database creation and complete integrity check on 1st January. Finally, assume you schedule a differential integrity check every evening. Then on 2nd January, nothing will be re-checked (because nothing new has been added, and all FLACs were checked only 1 day ago). The same will be true for 3rd January, 4th January and so on. On 31st January, however, the entire music collection will be subject to a fresh integrity check, because the CHECKDAYS=30 rule will kick in. On 1st February, once again nothing at all will be a candidate for an integrity check, since the last check was now only 1 day earlier.

Now let's say there's a supernova explosion in the constellation of Triangulum. The cosmic rays produced by that event strike your computer on the morning of February 2nd and corrupt multiple bits of data that affect 4 FLAC files. The Niente integrity check on the evening of February 2nd will not detect this corruption, because the four files were not corrupt at their last check and thus do not qualify on the 'previously detected as corrupt' criterion for a fresh check now. They were also checked less than 30 days ago, so they don't qualify for a fresh check under that rule either. In fact, it won't be until the differential scan on March 2nd that the entire collection once again falls due for a new check, because its last one was 30 days earlier. At that point, the four files will have new MD5 hashes computed that don't agree with their "assigned at creation" internal hashes and Niente will only then be able to see that they have been corrupted.
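The timeline above can be checked with a short Python sketch of the selection rules. It is a guess at the logic rather than Niente's real code: the FlacRecord class and its field names are hypothetical, and I've assumed the 30-day rule fires once at least CHECKDAYS days have elapsed, since that is what matches the 31st January and 2nd March dates (in a non-leap year).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

CHECKDAYS = 30   # assumed threshold, taken from the example in the text


@dataclass
class FlacRecord:
    path: str
    last_checked: Optional[date] = None   # None means never checked
    md5_mismatch: bool = False            # previously flagged as corrupt
    logical_problem: bool = False         # e.g. missing or inconsistent tags


def due_for_differential_check(record: FlacRecord, today: date,
                               checkdays: int = CHECKDAYS) -> bool:
    """Decide whether a nightly differential check would pick this file up."""
    if record.last_checked is None:
        return True
    if record.md5_mismatch or record.logical_problem:
        return True
    return (today - record.last_checked).days >= checkdays


track = FlacRecord("symphony1.flac", last_checked=date(2023, 1, 1))
print(due_for_differential_check(track, date(2023, 1, 2)))    # False
print(due_for_differential_check(track, date(2023, 1, 31)))   # True

track.last_checked = date(2023, 1, 31)   # re-checked on 31st January
print(due_for_differential_check(track, date(2023, 2, 2)))    # False
print(due_for_differential_check(track, date(2023, 3, 2)))    # True
```

The third result is the worrying one: a file silently corrupted on the morning of 2nd February sails past that evening's differential check, because nothing about its database record says it needs looking at.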

It's for this reason that I prefer to schedule a complete integrity check once a week. That would have spotted the February 2nd corruption much more quickly (and allowed you to restore from a not-already-overwritten backup to correct it). If you schedule complete integrity checks relatively frequently, of course, then every file's 'last checked' date will always be, at most, only a few days in the past, and the nightly differential check's 'recheck if last check was done more than 30 days ago' will never kick in.

In summary: I would recommend you do nightly differential checks to ensure new additions to your music collection are picked up in a timely fashion, but I would also schedule weekly complete integrity checks to ensure that 'silent physical corruption' is not left undetected for several weeks or months at a time, since a delay in detecting corruption might well mean the corruption is copied to your duplicates and backups, resulting in a complete set of corrupted backups and thus rendering restoring a good copy of the file difficult or impossible. You want a mix of frequent differential and less frequent complete integrity checks, in other words.

It is, of course, possible that a file system such as ZFS will be able to correct a cosmic ray-induced bit of physical corruption before Niente even gets a chance to find out about it. But I wouldn't rely on ZFS being a miracle worker -and you may not be running ZFS at all, of course.

7.0 Conclusion

In this article, I hope to have at least hinted at the need for 'defence in depth' when it comes to protecting a digital music collection from catastrophic, physical corruption. It starts with running your servers in clean conditions, with clean power supplies. It extends to using proper server-grade machines which can make use of ECC memory. It also involves storing your music on multiple multi-disk arrays using a bit rot-proof file system, such as ZFS -rather than bunging it on a single USB drive using NTFS and hoping for the best!

If your hardware foundations are sound, then most of Niente's functionality regarding the detection of unexpected physical alteration of the audio stream in your FLAC files is rendered a bit redundant. But not everyone will be using ECC RAM; Windows users may not want to be messing around with Linux and ZFS; software meant to help you tag your FLACs might end up accidentally messing with the audio stream itself because of bugs... and so, even in an ideal hardware and filesystem environment, there is still a purpose to scheduling nightly Niente differential integrity checks and weekly (or maybe fortnightly: it depends on your backup strategy and the size of your music collection, really) complete integrity checks.

