The Dizwell FLAC Checker

1.0 Introduction

You presumably rip CDs to FLAC files because you care about listening to good quality digital music files: instead of throwing away a significant proportion of the audio signal to create a small MP3 file, you chose to keep the entire audio signal in FLAC format. At least, that's the reason I ripped everything I own to FLAC!

When you care about quality above all else, then, it helps to make sure that your FLAC files don't degrade over time: introduce a little bit of corruption here, a smidgen of bit-rot there… and pretty soon your FLAC file will not be a bit-perfect copy of your original CD.

The FLAC encoder itself already has a means of testing a file for internal corruption: you run the command

flac -t <name of file>

…and the encoder will decode the file and report anything that goes wrong as it tries to do so.

The only problem with the “-t” test is that it's done on a per-file basis and consumes a relatively high amount of CPU, not to mention time. So I wanted to come up with a way of checking files in bulk and quickly -meaning, in particular, that if a file had been verified as being non-corrupt in the very recent past, it should not be subjected to another bout of CPU- and time-consuming verification.

The Dizwell FLAC Checker (DFC) is the result.

2.0 How DFC works

When the FLAC encoder creates a new FLAC file in the first place, it also computes (and stores) an MD5 hash of the audio signal component of the file. I emphasise the “audio signal” bit there: the MD5 is a hash purely of the music component of a FLAC file, not of any metadata associated with it. You can re-tag and de-tag a FLAC file a zillion times if you like, but the stored MD5 value for the actual sound part of the file will never change.

You can display the MD5 hash for the audio in a file, put there by the FLAC encoder itself, with the command:

metaflac --show-md5sum <name of file>

For example:

[hjr@britten Saint Nicolas (Best)]$ metaflac --show-md5sum 01\ -\ Introduction.flac 
328b55c5b74e5cf10dd21be4d87d6bf6

Now it's also possible to compute a fresh MD5 for a FLAC file using the ffmpeg program, like so:

[hjr@britten Saint Nicolas (Best)]$ ffmpeg -i "01 - Introduction.flac" -map 0:a -f md5 - 2>/dev/null | sed s/.*=//g
328b55c5b74e5cf10dd21be4d87d6bf6

When the hash returned by the ffmpeg command (which uses its own FLAC decoder, developed completely independently of the FLAC encoder) matches the hash value which the FLAC encoder computed at the time this CD was ripped, as you see in this example, then you've got good assurance that the music signal in that file today is exactly the same music signal that existed on the day the CD was ripped.

On the other hand, if a single bit of the audio signal has changed for any reason at all, the two MD5 hashes will not agree… and then you know some change to the signal has taken place over time.
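The comparison just described can be sketched as a few small shell functions. This is an illustration only, not DFC's actual code: the function names are invented for this page, and the metaflac and ffmpeg invocations are simply the ones shown above. It also assumes ordinary 16-bit CD rips, since ffmpeg's md5 output is understood to work on 16-bit samples unless told otherwise.

```shell
# Illustrative helpers (hypothetical names, not taken from the DFC script).

# Print the MD5 that the FLAC encoder stored in the file when it was created.
stored_md5() {
    metaflac --show-md5sum "$1"
}

# Print a freshly computed MD5 of the decoded audio, using ffmpeg.
# ffmpeg prints "MD5=<hash>", so everything up to the "=" is stripped.
fresh_md5() {
    ffmpeg -i "$1" -map 0:a -f md5 - 2>/dev/null | sed 's/.*=//'
}

# Succeed only when both hashes are non-empty and identical.
hashes_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Report on one file: prints "OK" or "MISMATCH" followed by the file name.
check_file() {
    if hashes_match "$(stored_md5 "$1")" "$(fresh_md5 "$1")"; then
        echo "OK $1"
    else
        echo "MISMATCH $1"
    fi
}
```

With those functions loaded into a shell, check_file "01 - Introduction.flac" would report on the file used in the example above.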

This is basically what DFC does for you. For each FLAC found in a directory structure, it checks to see when the file was last checked for “FLAC integrity”. If it was checked less than 30 days ago, it skips doing anything else with the file (thus saving CPU and time!). But if it finds the file was checked more than 30 days ago, it performs a fresh ffmpeg-based calculation of the MD5 hash for the audio in the file. If that matches the one stored in the file by FLAC when the file was first created, we're all good: the audio signal hasn't changed and the file contents must, therefore, be fine -in which case, DFC just updates the date of the last 'health check'.

If the comparison of the new MD5 hash with the original reveals that the hashes are not identical, however, then DFC generates an alert, so you know one of your files is potentially corrupt.

The results of its work are written out to a log file, so you can skip to the end of that to quickly find out how many of your FLACs were skipped (because they were checked not too long ago); validated (i.e., checked and found not to have changed audio signals); or failed (i.e., checked and their new MD5 hash doesn't match the original, so some sort of corruption has probably occurred).

DFC won't fix any corruption it finds: that's up to you to do, using whatever tools you have at your disposal (such as re-ripping a CD, restoring from a good backup, or some other approach of your own devising). But DFC will let you know whether your bit-perfect audio collection is becoming a little less bit-perfect with time!
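The skip-or-validate decision described in this section can be sketched as follows. It is a minimal illustration under stated assumptions, not the real script: the function name is hypothetical, times are taken to be Unix epoch seconds, and a file checked exactly 30 days ago is (arbitrarily) treated as still fresh.

```shell
# Hypothetical sketch of the skip-or-validate decision (not DFC's real code).
# Times are Unix epoch seconds, assumed purely for illustration.

MAX_AGE_SECONDS=$((30 * 24 * 60 * 60))   # 30 days

# Print "skip" if the last check is recent enough, otherwise "validate".
# $1 = time of the last check (empty if never checked), $2 = current time.
decide() {
    if [ -z "$1" ]; then
        echo "validate"            # never checked before
    elif [ $(( $2 - $1 )) -gt "$MAX_AGE_SECONDS" ]; then
        echo "validate"            # checked more than 30 days ago
    else
        echo "skip"                # checked recently
    fi
}
```

For instance, decide "" "$(date +%s)" prints "validate", because a file with no recorded check always needs one.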

3.0 Obtaining and Running DFC

DFC is supplied as a Bash shell script and can be downloaded from here. Since it is only a shell script, you can open it in a text editor of your choosing and make sure it's not going to do anything untoward.

Once you've downloaded the script (say to your own Downloads directory), you need to make the file executable and (I would recommend) easily run-able. To that end, issue the following commands:

sudo mv /home/hjr/Downloads/dfc.sh /usr/bin
sudo ln -s /usr/bin/dfc.sh /usr/bin/dfc
sudo chmod +x /usr/bin/dfc.sh

The first command moves the download into the /usr/bin directory, so that the file is then in your PATH and can be invoked from anywhere simply by typing its name (rather than having to type its full path and name every time). Instead of running the script by typing /home/hjr/Downloads/dfc.sh, therefore, you can now just type dfc.sh.

The second command creates a symbolic link to the file, using a name that lacks the “.sh” extension. So now you can invoke the script with the simple command dfc.

Only the third command is actually compulsory, though: it's the one that makes the shell script executable and thus runnable.

Once you've made the script executable, it can be run from anywhere in your file system: when you run it, you tell it the 'root' of the folder structure where you store your FLAC music files which you want checked. For example:

dfc "/multimedia/flac/hjr/classical/B/Benjamin Britten"

…which is a good example of how to invoke the program when your directories contain spaces: you wrap the entire directory name inside a pair of double-quotation marks. If you wanted to be less precise, and thus to check more files, you might instead do:

dfc /multimedia/flac/hjr/classical

…which starts 'higher up' in my storage tree hierarchy -and since there are no spaces in the directory names anywhere, no double quotes are needed.

Note that you can optionally and additionally specify a place where the log file for the run should be written to. If you miss this out, then the log file will be written to your $HOME directory. So, for example:

dfc /multimedia/flac/hjr/classical /home/hjr/logs

…shows two run-time parameters being specified. The first is the root of the FLAC file directory structure, as before: the second is where I want the log file written to. Again, if you are wanting the log to be written to somewhere that contains spaces or other special characters, wrap that second parameter in double-quotes. So, for example:

dfc /multimedia/flac/hjr/classical "/home/hjr/Logs/Flac Checker"
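How the two parameters and the $HOME fallback fit together can be sketched like this. The function name is hypothetical and the real script may handle its arguments quite differently; the point is only that each quoted parameter arrives as a single value, spaces and all, and that a missing second parameter falls back to $HOME.

```shell
# Hypothetical sketch of how the two parameters might be interpreted.
# Prints the music root on line 1 and the log directory on line 2.
resolve_dirs() {
    music_root=$1
    log_dir=${2:-$HOME}     # no second parameter: fall back to $HOME
    printf '%s\n%s\n' "$music_root" "$log_dir"
}
```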

4.0 Example Output

When you first run the checker, you may see output similar to this:

[hjr@britten ~]$ dfc "/multimedia/flac/hjr/classical/B/Benjamin Britten"
-----------------------------------------------------------------------------
          The Dizwell Flac Checker, Copyright © Howard Rogers 2019
                             Version: 1.0
  No log directory specified. Using /home/hjr instead...
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------

  Validating /Ballet/Plymouth Town (Llewellyn)/01 - Plymouth Town.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/01 - Act 1. Prelude.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/02 - The Fool and the Dwarf.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/03 - The Emperor-March.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/04 - Gavotte.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/05 - The Four Kings.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/06 - The King of the North.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/07 - The King of the East.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/08 - The King of the West.flac...
  Validating /Ballet/The Prince of the Pagodas (Britten)/09 - The King of the South.flac...

…and so on.

“Validating” means the program has noted that the file needs checking (that is, it hasn't previously been checked within the past 30 days or so). It also indicates that DFC is re-computing the MD5 hash of the audio signal in the listed files, and doing the comparison of that new hash value to the old one stored within the FLAC file by the FLAC encoder itself. In short, “validating” means real work is being done.

If you re-run the program regularly, then you may see this sort of output instead:

[hjr@britten Scripts]$ dfc "/multimedia/flac/hjr/classical/B/Benjamin Britten" 
-----------------------------------------------------------------------------
          The Dizwell Flac Checker, Copyright © Howard Rogers 2019
                             Version: 1.0
  No log directory specified. Using /home/hjr instead...
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------

  Skipping /Ballet/Plymouth Town (Llewellyn)/01 - Plymouth Town.flac... Last analysis done less than 30 days ago!
  Skipping /Ballet/The Prince of the Pagodas (Britten)/01 - Act 1. Prelude.flac... Last analysis done less than 30 days ago!
  Skipping /Ballet/The Prince of the Pagodas (Britten)/02 - The Fool and the Dwarf.flac... Last analysis done less than 30 days ago!
  Skipping /Ballet/The Prince of the Pagodas (Britten)/03 - The Emperor-March.flac... Last analysis done less than 30 days ago!
  Skipping /Ballet/The Prince of the Pagodas (Britten)/04 - Gavotte.flac... Last analysis done less than 30 days ago!
  Skipping /Ballet/The Prince of the Pagodas (Britten)/05 - The Four Kings.flac... Last analysis done less than 30 days ago!

“Skipping” in this context means that the tool has checked the value of the TAGDATE metadata for your file and has discovered it to be less than 30 days old. Therefore, DFC simply skips further processing for the listed file and moves on to the next; and so on. This saves decoding/testing files which were checked so recently that it's most unlikely that any of their contents have become corrupted since.

(If you ever wanted to force a re-check, regardless of when it was last done, see Section 4.1 below).

If you are very unfortunate, you may see this sort of output of the DFC tool instead:

[hjr@britten Eternity's Sunrise (Goodwin)]$ /home/hjr/Scripts/dfc.sh /home/hjr/Music/
-----------------------------------------------------------------------------
          The Dizwell Flac Checker, Copyright © Howard Rogers 2019
                             Version: 1.0
  No log directory specified. Using /home/hjr instead...
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------

  Validating /Eternity's Sunrise (Goodwin)/track01.flac...
   /Eternity's Sunrise (Goodwin)/track01.flac : Stored hash is not current hash!
  Skipping /Songs (Schreier)/42 - Urians Reise um die Welt.flac... Last analysis done less than 30 days ago!
  Skipping /Tavener/01 Track01.flac... Last analysis done less than 30 days ago!
  Skipping /Tavener/02 Track02.flac... Last analysis done less than 30 days ago!
  Skipping /Tavener/03 Track03.flac... Last analysis done less than 30 days ago!
  Skipping /Tavener2/01 - Choir and Orchestra of the Academy of Ancient Music, Paul Goodwin - Eternity's Sunrise.flac... Last analysis done less than 30 days ago!
  Skipping /Vaughan Williams - A Cambridge Mass/01 - Blest pair of Sirens.flac... Last analysis done less than 30 days ago!
  Validated Files:  1
  Skipped Files:    6
  Files in error:   1

  There were errors. Please check the log!

==========================================================================================================

Here, the validation process was started on the Eternity's Sunrise file, but the program has detected that its freshly-computed MD5 hash is not the same as that stored in the file by the FLAC encoder. This means that the 'current hash' is different to the 'stored hash'. It's a sign of 'change', potentially of corruption.

Note the final message is to 'check the log': DFC may check many thousands of files during a single run, so the “real time” error that scrolls up the screen when you run DFC directly would long since have scrolled off the screen into oblivion. No matter: the count summary is shown as the last thing DFC displays, so you can immediately see a count of any files that have errors. But it will be the log that contains the specifics of which files failed the verification check. In the log file for this run, for example, you'd see this:

[hjr@britten ~]$ cat DFC-201906071510.log

Starting the Dizwell Flac Checker in /home/hjr/Music/...
Validating files last checked more than 30 days ago
/Eternity's Sunrise (Goodwin)/track01.flac : Stored hash is not current hash!
Validated Files:  1
Skipped Files:    6
Files in error:   1

Note that the log file doesn't record details of which files were skipped or that correctly passed validation: it only lists the details of the files that have failed verification. The log is therefore the 'agenda' of things which need fixing (though how you go about resolving corruption in a FLAC file is beyond the scope of the DFC tool or this page!)
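If the log is long, the failed files can be pulled out of it with a small helper along these lines. The function name is made up for this page, and the sketch relies only on the failure-line format visible in the log shown above.

```shell
# List the files that a DFC log reports as failed (hypothetical helper).
# Relies on the failure line format shown in the log above:
#   <file> : Stored hash is not current hash!
failed_files() {
    grep ' : Stored hash is not current hash!$' "$1" |
        sed 's/ : Stored hash is not current hash!$//'
}
```

Running failed_files against the log file from a run then lists just the affected file names, one per line.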

4.1 Force Checking

By default, as you've seen, you supply two arguments to the dfc script: the directory where your music files are stored and the directory you want the log file written to. By default, too, DFC will only re-check MD5 hashes for files which last had their MD5s checked more than about 30 days ago.

Sometimes, though, you will want to force a re-check of audio files no matter how short a time has elapsed since they were last checked. That's why you can also supply a third run-time parameter to the dfc script, --force-check. If that's typed in after the other two parameters, then the MD5 check will take place for any file that was last checked more than 1 second ago (which, in all likelihood, will be all of them!)

For example:

[hjr@britten Desktop]$ dfc . .
-----------------------------------------------------------------------------
          The Dizwell Flac Checker, Copyright © Howard Rogers 2019
                             Version: 1.01
-----------------------------------------------------------------------------

  Skipping /01 - Rhapsody in Blue.flac... Last analysis done less than 30 days ago!
  Validated Files:  0
  Skipped Files:    1
  Files in error:   0


  All good!

=========================================================================================================

Here you see me invoke dfc with two full-stops (or periods) as the usual two inputs, with a space in between them, so that they are registered as two periods and not one double-period (..): in Linux, the period simply means “here”, so I'm telling it to check files in my current directory and write the log file to that same current directory, too. My current directory in this case happens to be the Desktop folder in my home directory. I could equally well have spelled out both directories, but happen not to have done so on this occasion.

You can see that dfc is happy to run with that sort of input -but that it has spotted that my “Rhapsody in Blue” file was checked for corruption less than 30 days ago. Therefore, it's shown as being 'Skipped Files: 1'.

But if I demand to check this file, I can re-run DFC using this sort of command:

dfc . . --force-check

Or perhaps this one:

dfc /home/hjr/Desktop /home/hjr/Logs --force-check

That time, I've spelled out what directory to check and where to write the log-file …but the important thing (for these purposes!) is that my third argument is the “force-check” one. It means this happens:

[hjr@britten Desktop]$ dfc . . --force-check
Forcing re-check of files...
-----------------------------------------------------------------------------
          The Dizwell Flac Checker, Copyright © Howard Rogers 2019
                             Version: 1.01
-----------------------------------------------------------------------------

  Validating /01 - Rhapsody in Blue.flac...
  Validated Files:  1
  Skipped Files:    0
  Files in error:   0


  All good!

=========================================================================================================

DFC runs as you'd hope -but this time, it's actually validating the Rhapsody in Blue file. Since it passes the check, the “Validated Files: 1” indication comes up… note that “Skipped Files” has dropped to zero. So, when you use --force-check as the third run-time parameter when invoking the DFC utility, the audio file will be re-validated, no matter when it was last validated.

Here's one more run, this time using spelled-out directory names, rather than mere full-stops:

[hjr@britten Desktop]$ dfc /home/hjr/Desktop /home/hjr/Logs --force-check
Forcing re-check of files...
-----------------------------------------------------------------------------
          The Dizwell Flac Checker, Copyright © Howard Rogers 2019
                             Version: 1.01
-----------------------------------------------------------------------------

  Validating /01 - Rhapsody in Blue.flac...
  Validated Files:  1
  Skipped Files:    0
  Files in error:   0


  All good!

=========================================================================================================
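The effect of the third parameter can be pictured as nothing more than a change of threshold. The sketch below is an assumption about how that might look, with a hypothetical function name; it is not lifted from the script itself.

```shell
# Hypothetical sketch: choose the re-check threshold from the third parameter.
# Prints the threshold in seconds.
recheck_threshold() {
    if [ "${1:-}" = "--force-check" ]; then
        echo 1                          # anything checked over 1 second ago
    else
        echo $((30 * 24 * 60 * 60))     # the usual 30 days
    fi
}
```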

5.0 Other Matters

5.1 Rsync/backup Consequences

When DFC finds a file that was last checked more than 30 days ago, it will re-verify it (as you'd hope, I think!) Part of that re-verification process, however, will result in DFC modifying the content of the TAGDATE metadata tag within the FLAC file, so that it correctly records the new date at which the freshest verification was performed.
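Reading and refreshing a tag of this kind can be sketched with metaflac's own tag options. The helper names are hypothetical, and the date format that DFC really stores in TAGDATE is not documented on this page, so a Unix epoch timestamp is assumed purely for illustration.

```shell
# Hypothetical helpers for reading and refreshing the TAGDATE tag.
# The date format is an assumption: a Unix epoch timestamp.

# Strip the "NAME=" prefix from a "NAME=value" line as printed by metaflac.
tag_value() {
    printf '%s\n' "${1#*=}"
}

# Print the stored TAGDATE value of a FLAC file (empty if the tag is absent).
read_tagdate() {
    tag_value "$(metaflac --show-tag=TAGDATE "$1")"
}

# Replace the TAGDATE tag with the current time. This rewrites the file's
# metadata, which is why a backup tool will then see the file as changed.
write_tagdate() {
    metaflac --remove-tag=TAGDATE --set-tag="TAGDATE=$(date +%s)" "$1"
}
```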

Changing the contents of a file in this way, however, will likely make your backup program think that new data needs to be copied to the backup location (I'm assuming you do take backups of your valuable music collection!)

If you only modified the contents of a single music file now and again by running DFC against it occasionally, that probably wouldn't be a problem. But if you schedule DFC (see below, Section 5.2) to run routinely so that once every thirty days, your entire music collection is updated... well, that could be a problem!

First, that 30th day backup will take a lot longer to complete than the one that runs on the 29th day or the 31st day. But more significantly, your backup on day 30 might end up consuming a lot more disk space than you expected. Imagine, for example, that you back up your music collection every night to a remote server's hard disk using a command such as:

rsync -avh /my/music/collection/ hjr@remote-server:/backup/my/music/collection

Now that's a command that tells rsync to copy new and changed files to a remote server... and on day 30, every music file has just been changed, so every one of them gets transferred again. Plain rsync replaces each changed file on the remote server with its new version, so the cost there is mostly time and bandwidth. But if your backup scheme keeps the previous versions of changed files (rsync's --backup option or a snapshot-style backup would do that, for example), then your remote server had better have disk space sufficient to accommodate two complete copies of your music collection!

If, instead, your backup command had been:

rsync -avh --delete /my/music/collection/ hjr@remote-server:/backup/my/music/collection

...then that command would copy changed files to the remote server and would also delete from the backup server any files that no longer exist at the source (ones you have renamed, moved or removed, say). The backup server then contains only what the source contains. This second variant of the rsync command, with the --delete switch, is called a 'synchronisation' operation, rather than a simple 'copy'.

So one way around the effect on your backup storage of your entire music collection being modified every 30 days by DFC is to make sure you synchronise your backup with your source more frequently. In particular, make sure you synchronise after every DFC run, so that files modified at source by DFC almost immediately replace their backups, rather than being copied as a second version of the file.

Another option (the one I use myself) is to make sure that your backup server has sufficient storage to be able to hold, effectively, twice your music collection. If it can cope with that, then having original+modified copies stored simultaneously on the backup server won't be a problem. An infrequent synchronisation operation can then be scheduled at leisure to make the 'multiple copies problem' go away in its entirety.

Finally, you might need to think about running DFC only against the backup server and never against the original "source" of your music. That way, you get to know if your backup copies are getting corrupted or suffering from bit-rot without that check affecting the quantity of data that gets transferred between 'source' and 'backup'. If you ever detect a corruption in a backup file, you could replace it with a known-good copy from your 'source' (or from a secondary backup, perhaps).

Anyway: the point is to realise that DFC will modify your music files -and that has consequences for your backup strategy which you need to think about.
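To get a feel for how much a DFC run has changed before the next backup kicks in, something like the following could be used. It is only a sketch: the helper name and the idea of a 'marker' file (a file you touch at the end of each backup run) are assumptions made for this example, not features of DFC.

```shell
# Hypothetical helper: list FLAC files modified more recently than a marker
# file (for example, a file you touch at the end of each backup run).
# $1 = root of the music collection, $2 = marker file.
changed_since() {
    find "$1" -type f -name '*.flac' -newer "$2"
}
```

Piping the output through wc -l gives a count of the files the next backup will have to transfer again. rsync's own -n (--dry-run) option offers a similar preview from the backup side.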

5.2 Scheduling

The whole point of DFC is to permit the frequent and regular integrity check of a large music collection without actually having to do massive amounts of actual work every time you run it. The TAGDATE metadata is there to allow DFC to work out that it can skip checking 'this' file whilst concentrating on actually re-calculating things for 'that' one, thus keeping the amount of work it has to do at each run to a minimum.

Accordingly, DFC lends itself to being scheduled as a routine task you fire off against your music collection on a fairly frequent basis. I personally run it nightly, via a crontab entry such as this one:

0 2 * * * /usr/bin/dfc.sh "/multimedia/flac/hjr/classical" "/home/hjr/Logs"

On most early mornings, the script will start and finish within minutes, because most of my music collection will be sat there with TAGDATEs within the last 30 days. Only new music that I've added to my collection will ever really be picked up by the nightly run, therefore. But around once a month, when the last set of TAGDATEs suddenly becomes more than 30 days old, nearly all of my music collection will be re-validated in one sitting. For me, with around 1TB of FLAC files to check, that takes around 4 hours -so it's not a huge job, but it's also non-trivial.

Of course, you could split the job up into smaller chunks by scheduling different parts of your music collection to be checked at different times. For example:

0 2 1 * * /usr/bin/dfc.sh "/multimedia/flac/hjr/classical/A" "/home/hjr/Logs"

0 2 2 * * /usr/bin/dfc.sh "/multimedia/flac/hjr/classical/B" "/home/hjr/Logs"

0 2 3 * * /usr/bin/dfc.sh "/multimedia/flac/hjr/classical/C" "/home/hjr/Logs"

...meaning that only music written by composers whose first names start with 'A' would be checked on the 1st of a month; only 'B' composers would be checked on the 2nd day of each month; only 'C' composers would be checked on the 3rd day of each month, and so on. Given there are only 26 letters of the alphabet and at least 28 days in even the shortest month, there is plenty of room to chop your music collection integrity checks up in a manner such as this!

It would mean that on the 2nd of each month, the entire set of music written by the likes of Benjamin Britten, Bedřich Smetana and Bohuslav Martinů would be re-checked -but at least there would be no work occasioned by Aaron Copland's music or Carl Nielsen's. In this way, therefore, you could keep the amount of checking/re-validation being performed to a minimum without triggering a wholesale change to absolutely everything in one evening. Keep this in mind when considering the effects on your backup strategy of running DFC regularly!
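Rather than typing out 26 near-identical crontab lines by hand, they can be generated. The function below is a hypothetical convenience, not part of DFC; it assumes one sub-directory per letter, as in the examples above, and a 2 a.m. start on the 1st to the 26th of each month.

```shell
# Print one crontab line per letter A-Z, each on its own day of the month.
# $1 = root of the collection, $2 = log directory (both absolute paths).
print_schedule() {
    day=1
    for letter in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z; do
        printf '0 2 %d * * /usr/bin/dfc.sh "%s/%s" "%s"\n' \
            "$day" "$1" "$letter" "$2"
        day=$((day + 1))
    done
}
```

The output of print_schedule /multimedia/flac/hjr/classical /home/hjr/Logs can be reviewed and then pasted into your crontab.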

Whatever your precise crontab schedule for DFC, though: don't forget to explicitly specify the two run-time parameters (source of music files, write-location of log files, respectively). You can't miss them off (as you might do when running the program interactively), because doing so means "current directory" -and crontab's environment is so different from your own that the concept of 'current directory' in that context will simply result in errors.

6.0 Dependencies and Limitations

DFC is intended to be run on Linux systems only. It will only work when applied to FLAC files: it does not work with MP3 files, for example (or, indeed, any other audio format you care to mention). It requires that the metaflac utility already exists on your system. Metaflac is part of the standard flac package, which many Linux distros install by default and which is otherwise readily installed from their standard repositories.

DFC also requires the pre-existence of the ffmpeg utility (used to calculate MD5 hashes for a file). Again, this is either a standard component of most distros or can be easily installed from a distro's standard repositories.

Finally, DFC is a Bash script …so if Bash isn't already installed, you will need to install it before DFC will run correctly. Distros which alias Bash to some other shell may not run it correctly.
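A quick way to confirm that these dependencies are present before a first run is sketched below. The function name is hypothetical; it simply asks the shell whether each named command can be found on the PATH.

```shell
# Print the name of each listed command that cannot be found on the PATH.
missing_deps() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}
```

Running missing_deps bash metaflac ffmpeg prints nothing at all when everything DFC needs is installed.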

Author

DFC was devised and written by Howard Rogers ([email protected]).

License

DFC is copyright © Howard Rogers 2019, but is made available freely under the GPL v2.0 only. That license may be downloaded here.

Bug Tracking, Feature Requests, Comments

There is no formal mechanism for reporting and tracking bugs, feature requests or general comments. But you are very welcome to email your comments to [email protected].