My last post explained that my music collection needed to be re-catalogued to some extent. In particular, I needed to make sure that every track I've ever ripped has an entry in its YEAR tag, identifying when a particular recording was made (because that little piece of information turns out to be a crucial component in classical music recordings' "primary key").
I wasn't going to check all 64,000+ ripped audio tracks by hand to achieve this! Instead, I needed to script something that could batch-check my entire collection in one go.
Here's the bash script I ended up writing to do that:
1 #!/bin/bash
2 clear
3 # Clear up previous runs
4 rm -f /home/hjr/Desktop/missingdates.txt 2> /dev/null
5 rm -f /home/hjr/Desktop/fixmissingdates.txt 2> /dev/null
6 a=0
7 i=0
8 # First read the metadata from a file
9 shopt -s globstar
10 for f in **/*.flac; do
11   i=$(( i+1 ))
12   remainder=$(( i % 100 ))
13   MUSICDIR=$(pwd)
14   DIRNAME=$(dirname "$f")
15   YEARNO=$(metaflac "--show-tag=Date" "$f" | grep -oP '=\K.+' )
16   if [ "$remainder" -eq 0 ]; then
17     echo " Checking... : $DIRNAME"
18   fi
19   if [ -z "$YEARNO" ]; then
20     a=$(( a+1 ))
21     echo "No date detected for... $MUSICDIR/$DIRNAME" >> /home/hjr/Desktop/missingdates.txt
22   fi
23 done
24 if [ "$a" = 0 ]; then
25   echo " --------------------------------------------------------------------------------"
26   echo " No tracks found with missing dates"
27 fi
28 # Now all per-file processing is complete, process the rename file
29 if [ -f "/home/hjr/Desktop/missingdates.txt" ]; then
30   awk '!a[$0]++' /home/hjr/Desktop/missingdates.txt > /home/hjr/Desktop/fixmissingdates.txt
31 fi
Going through this in some detail, then. Lines 3 to 7 delete some text files and initialise some variables that are going to be used to count records.
Line 9 switches on the ability to hunt for all FLAC files within the current directory, no matter how many sub-folders down they might be. Line 10 then starts a loop, so that each FLAC file found will be processed in turn by what's coming. As each file is found and processed, its path and filename (relative to the current directory) are assigned to the "$f" variable.
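To see what lines 9 and 10 do in isolation, here is a minimal sketch that builds a throwaway folder tree and lists the FLAC files found beneath it. The folder and file names are invented for illustration, and the nullglob option is an addition that isn't in the script above: it makes the loop run zero times, rather than once with the literal pattern, when nothing matches. It assumes bash 4 or later, which is where globstar first appeared.

```bash
#!/bin/bash
# Sketch: list every .flac file below a directory, however deeply nested.
# globstar needs bash 4 or later; nullglob is an extra safety option.
shopt -s globstar nullglob

list_flacs() {
    # Run in a subshell so the caller's working directory is left alone.
    (
        cd "$1" || exit 1
        for f in **/*.flac; do
            printf '%s\n' "$f"
        done
    )
}

# Build a small, made-up tree to search (hypothetical names).
demo=$(mktemp -d)
mkdir -p "$demo/Composer A/Opera/Work 1" "$demo/Composer B"
touch "$demo/Composer A/Opera/Work 1/01 - Act 1.flac" \
      "$demo/Composer B/01 - Prelude.flac" \
      "$demo/top-level.flac" \
      "$demo/Composer B/cover.jpg"

# Prints the three .flac paths, relative to $demo; cover.jpg is ignored.
list_flacs "$demo"
```

Note that the pattern also matches FLAC files sitting directly in the starting folder, because the "**/" part can stand for zero sub-folders as well as many.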
Line 11 simply increments the 'i' counter, so we know we're processing a FLAC file, and every other FLAC file we process as we loop through will also increment 'i' by 1. Line 12 takes i modulo 100 (the remainder left over when i is divided by 100) and assigns that to a variable called "remainder". This lets us count every FLAC file we process, but do something special on every 100th file, as you'll see.
Lines 13 and 14 set up some special variables. MUSICDIR is simply the current working directory's full path. I might invoke this shell script, for example, in '/sourcedata/music/classical/B' -and that's exactly what MUSICDIR will then be set to. The variable DIRNAME, however, is set to the path in which the music file is found, minus the MUSICDIR component. In other words, if I have a FLAC file called, in full,
/sourcedata/music/classical/B/Benjamin Britten/Opera/Albert Herring (Britten)/09 - Right! We'll have him! May King!.flac
...then MUSICDIR will be /sourcedata/music/classical/B and DIRNAME will be Benjamin Britten/Opera/Albert Herring (Britten), whereas $f will be Benjamin Britten/Opera/Albert Herring (Britten)/09 - Right! We'll have him! May King!.flac.
I therefore now know (a) where I'm running; (b) what the path from there to the music file's folder is; and (c) what the path and filename of the specific music file is, relative to where I'm running.
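The relationship between these values can be checked without any music files at all, because dirname only manipulates the text of a path. A small sketch using the example path from above; the starting folder is hard-coded here rather than taken from pwd, and the TRACK variable is an extra that the script itself doesn't use:

```bash
#!/bin/bash
# Sketch: how $f, DIRNAME and MUSICDIR relate for one example file.
# MUSICDIR is hard-coded here; the script gets it from $(pwd).
MUSICDIR="/sourcedata/music/classical/B"
f="Benjamin Britten/Opera/Albert Herring (Britten)/09 - Right! We'll have him! May King!.flac"

DIRNAME=$(dirname "$f")    # everything before the final slash
TRACK=$(basename "$f")     # everything after it (not used by the script)

echo "Folder : $DIRNAME"
echo "Track  : $TRACK"
echo "Full   : $MUSICDIR/$f"
```

Joining MUSICDIR and $f with a slash, as the last line does, is how the complete path of any one track can be rebuilt.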
Line 15 then uses the metaflac program (part of the standard FLAC utilities) to read the contents of the YEAR tag (which the script asks metaflac for under the name "Date") from each music file in turn, assigning it to a YEARNO variable. There is some character-stripping going on in that line. If you just read the tag value, you'd get something like this:
DATE=1964
But we want a usable year number, so the "DATE=" part of that needs to be stripped off -and that's what the "grep..." part of line 15 does. By the end of line 15, then, we'd have $YEARNO set to, in this case, the value 1964.
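The stripping step can be tried on its own by feeding grep a sample line instead of real metaflac output. The sketch below assumes GNU grep built with PCRE support, since the -P option isn't available in every grep. It also shows bash parameter expansion as an alternative way to get the same result; that second approach is not what the script uses.

```bash
#!/bin/bash
# Sketch: turn "DATE=1964" into "1964".
sample="DATE=1964"    # stand-in for what metaflac prints

# As on line 15: -o prints only the matched text, and \K drops the '='
# from the match, so only what follows it is output.
year_grep=$(printf '%s\n' "$sample" | grep -oP '=\K.+')

# Alternative (not from the script): delete the shortest prefix ending in '='.
year_expn=${sample#*=}

echo "grep      : $year_grep"
echo "expansion : $year_expn"

# With nothing after the '=', grep finds no match and the variable is empty.
# grep exits non-zero in that case, hence the '|| true'.
empty=$(printf '%s\n' "DATE=" | grep -oP '=\K.+' || true)
echo "empty     : '$empty'"
```

The last case matters for the rest of the script: an empty result is exactly what the later test for missing years looks for.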
Lines 16 to 18 are a simple progress meter: if the $remainder variable is zero (which it will be for every 100th music file processed), the script echoes to the console the name of the folder containing the music file it's currently processing. If it isn't zero (which will be true for the other 99 music files in every 100), it doesn't do anything. It's crude, but you will at least see the script progressing through your music library every so often.
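Here is the same counting idea reduced to a minimal sketch, with 250 made-up items standing in for the FLAC files. The reports counter is an addition for illustration only, to show how many progress lines get printed.

```bash
#!/bin/bash
# Sketch: report progress on every 100th pass through a loop.
i=0
reports=0
for item in $(seq 1 250); do    # 250 made-up items stand in for the FLAC files
    i=$(( i+1 ))
    remainder=$(( i % 100 ))
    if [ "$remainder" -eq 0 ]; then
        echo " Checking... : item $item"
        reports=$(( reports+1 ))
    fi
done
echo "Items processed: $i, progress lines printed: $reports"
```

With 250 items, the remainder is zero only at items 100 and 200, so two progress lines appear. Changing the 100 on the modulo line to a smaller number makes the meter chattier.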
Line 19 tests to see if the $YEARNO variable is null. This will be the case if, back on line 15, having read the YEAR tag and stripped off the 'DATE=' text, nothing remains. If it's true that $YEARNO is null, then that can only be because no tag value exists for the YEAR tag. These are the records I need to fix, so at line 20, I increment the "a" counter (so I can keep tabs on how many of these 'blank YEAR' files I've encountered) and at line 21, I write the path/folder name of the audio file that lacks a YEAR tag (but not its full filename) to a text file on my desktop. Line 23 then closes the processing loop, which means we go back to line 10, pick up the next audio file to check, and repeat the checks as before.
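The test-and-log logic of lines 19 to 22 can be sketched without metaflac by passing in year values by hand. The check_year function, the temporary log file and the sample values below are all hypothetical stand-ins, not part of the script above.

```bash
#!/bin/bash
# Sketch: log a folder when a track's year is empty, and count how often.
log=$(mktemp)    # temporary file standing in for missingdates.txt
a=0

check_year() {
    # $1 = year read from the tag (may be empty), $2 = folder holding the track
    if [ -z "$1" ]; then
        a=$(( a+1 ))
        echo "No date detected for... $2" >> "$log"
    fi
}

# Made-up inputs: one tagged track, then two untagged tracks in one folder.
check_year "1964" "Benjamin Britten/Opera/Albert Herring (Britten)"
check_year ""     "Benjamin Britten/Opera/Peter Grimes (Britten)"
check_year ""     "Benjamin Britten/Opera/Peter Grimes (Britten)"

echo "Tracks without a year: $a"
cat "$log"
```

Because the folder is logged once per untagged track, the same line appears twice in the log here, which is the duplication the final part of the script has to deal with.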
By the time we get to line 24, therefore, we will have read and checked every audio file found within the current working directory. If we get to line 24 and the "a" counter is still zero, then it can only be because every single file checked has a YEAR tag correctly set. We can, therefore, at line 26, display a simple message to this effect to the console.
If the "a" counter ever has a non-zero value, however, then we must at some point have encountered an audio file without a YEAR tag. Its folder name will have been written into the text file sitting on my desktop. Unfortunately, if I've missed a YEAR tag from the rip of an entire opera, for example, then for every one of the (say) 67 tracks making up that opera, the same path/folder name will have been written into the text file 67 times by line 21. So line 29 checks if there's a text file at all (there will be, if there are YEAR-less audio files). If so, line 30 then does some awk magic on the contents of that file to output a second file: the awk command isn't exactly obvious, but it simply reduces duplicate entries in the file to a single entry. Thus, if I had 67 records in the first file all reading "/Benjamin Britten/Opera/Peter Grimes (Britten)", because all 67 tracks of that opera were YEAR-less, the newly output file will contain a reference to that folder only once. In other words, lines 28 to 31 de-duplicate the file produced by the audio file processing and result in a second text file on my desktop that contains only a list of the "compositions" which need to be fixed up, one line per composition, no matter how many specific FLAC files a composition might be made of.
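The awk one-liner is easier to follow on a tiny input. In the sketch below, the file contents are made up and temporary files stand in for the two files on the desktop. The trick is that awk keeps an array, a, indexed by the whole text of each line ($0): the first time a line is seen its entry is zero, so the negated test is true and the line is printed, and the ++ then marks it as seen so that later copies are skipped.

```bash
#!/bin/bash
# Sketch: keep only the first occurrence of each line, preserving order.
input=$(mktemp)     # stands in for missingdates.txt
output=$(mktemp)    # stands in for fixmissingdates.txt

# Made-up log contents: one folder listed three times, another once.
printf '%s\n' \
    "No date detected for... /music/Opera A" \
    "No date detected for... /music/Opera A" \
    "No date detected for... /music/Symphony B" \
    "No date detected for... /music/Opera A" > "$input"

# a[$0] is 0 (false) the first time a line appears, so !a[$0] is true and
# the line is printed; the ++ then records it so repeats are dropped.
awk '!a[$0]++' "$input" > "$output"

cat "$output"
```

Only two lines come out, Opera A first and Symphony B second. Unlike sort -u, this keeps the lines in the order they were first encountered, and unlike uniq, it doesn't need duplicates to be next to each other.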
Here's a worked example: I travel to my /sourcedata/music/classical/B folder and invoke my new shell script from the command prompt there. This is the output:
As you can see, it outputs about 27 lines of 'progress' (courtesy of line 17 of the script). Note that at the end, it hasn't output a message about 'No tracks found with missing dates', which line 26 would have done if no YEAR-less audio files had been encountered, so that's my first clue that I have some that need fixing. Sure enough, on my desktop, I have two new files.
The first of them, called "missingdates.txt", thanks to line 21 looks like this:
It's repeating the same folder name over and over, because each audio file within that folder has been found to be missing a YEAR tag. So it looks rather repetitive, because there are lots of tracks that belong to that particular opera.
But the second file produced, thanks to line 30, called 'fixmissingdates.txt' looks like this:
...and as you can see, this only contains a single, de-duplicated, line. Every line in this file, therefore, tells you of a single folder, the contents of which need a YEAR applied to them.
That's done using whatever tagging software you like: in my case, I use Easytag:
All the files within the affected folder are highlighted and, over on the right-hand part of the screen, I'm supplying the correct recording year to them all in one hit.
Once I fix the specific problems listed, I can re-run my script to check that I've done so correctly. I travel to the 'B' section of my music collection and fire the shell script off like so:
...and this time, the output looks slightly different than before:
This time, I get the 'No tracks found...' message, which means my music files are all nicely tagged with YEAR values. The two text files previously produced also disappear from my desktop, because lines 4 and 5 delete them at the start of every run.
Out of about 64,000 music files, I probably had around 900 that were missing dates, due to poor ripping and tagging practices back in the early 2000s, back when I didn't quite appreciate the need to be as rigorously consistent in one's tagging practices as I am today!
So this script doesn't fix the problem of missing YEAR tags. It simply tells you where the problems exist in your music library. Only you can then go back to your library and fix the cataloguing errors by hand, after duly researching and discovering when the YEAR-less recordings were, in fact, recorded. When you think you've fixed everything, run the script one more time in the root of your entire catalogue. Hopefully, you'll see something like this:
That's me running the script on my entire classical music folder in one sitting -and, at the very end of it, getting the good news that no audio files were found to be lacking a YEAR tag. This was the first of the steps I needed to get right before I could ensure that my ALBUM tags all had a YEAR number included in them, as is required by logic.
The next step was to check whether, if a music file already had an ALBUM tag containing a year somewhere within it, that year matched the one stored in the YEAR tag. I'll explain how I checked my collection for that problem in my next post.