Fixing Mistakes #2: Checking the ALBUM tag

So, this is the next in a mini-series of posts explaining how I went about fixing things after discovering that I'd tagged my music files incorrectly all these years, despite knowing better!

The short version is that I always knew the recording date was an important factor in distinguishing between recordings of the same work by the same artist, but since I didn't often have duplicates, I assumed I'd get away without including it in the ALBUM tag for a composition. And then I realised that though I might well get away with it today, a new acquisition here or there could well mean that I wouldn't get away with it for ever: if the information is theoretically necessary to distinguish recordings, then it ought to be present, always.

Thus, I needed to go back to my music library and make sure that the recording date was present in the YEAR tag, which was designed for it. How I did that was the subject of my last post.

But a recording date stored in the YEAR tag isn't actually functionally useful for distinguishing between two recordings of the same work. That's because most music players will order and group recordings by what they call "Artist" and "Album Name" -which are the ARTIST and ALBUM tags in a FLAC file. If the recording date is stored in the YEAR tag, that's fine, but it won't usually be used from there to sort and group music properly. In other words, having made sure that I actually had a YEAR tag for every recording, my next task was to bring that YEAR data up into the ALBUM tag, where it could actually be useful.

I'll just pause here to say that I'm really doing something which I'd much rather not have to do: namely, repeating information already stored in one place in a second place. There's a reason for not liking to do this: the two pieces of information are physically independent of each other and there's no intrinsic mechanism in the audio player world to make sure they stay identical. That is, I might have an opera called Peter Grimes which was recorded in 1958. I could set YEAR=1958, and ALBUM=Peter Grimes - 1993 ...and nothing can stop me now having two completely different recording dates associated with the same recording. You are then in the position of not really knowing whether one is right and one is wrong, nor which one is right or wrong -or whether both are as bad as each other! In strict database practice and theory we avoid duplicating information like this precisely because it makes data maintenance so tricky for the future.

On this occasion, however, I don't have a lot of choice. The fundamental piece of data every music player sorts and groups by is ALBUM. If the recording year is not present in that, then it functionally cannot help to distinguish between different recordings of the same work by the same performer. So practicality trumps strict theory on this occasion -but it's still a good rule to bear in mind in general, which is why it's such a good idea to not include the ALBUM data in your track TITLE tag, for example.

Anyway: that's the purpose of this post. Now that I know all my recordings have a YEAR tag, I want to make sure that I haven't ever included the recording year in the ALBUM tag... and used a different date when doing so. If YEAR=1958 and ALBUM says "Peter Grimes 1969", I want to know that 1958 doesn't equal 1969. I'll have to make a manual decision about what to do, and how to fix, any discrepancies found -but the job at hand is to find any discrepancies that do exist.

As always, I'm after a script that will check my entire music library in either one huge go, or in whatever smaller chunks I feel like running from time to time. And here's the script I came up with to do the job:

    1  #!/bin/bash 
    2  # Clear up previous runs 
    3  rm -f /home/hjr/Desktop/freshdatacheck.txt 2> /dev/null 
    4  # Initialise some counters:  
    5  # i=count of records processed 
    6  # b=count of records where YEAR is present in ALBUM, but it's the wrong one (i.e. "bad records") 
    7  # g=count of records where YEAR is present in ALBUM, and it's the right one (i.e. "good records")  
    8  # c=count of records which have a year in their album name at all, whether right or wrong 
    9  # n=count of records which do not have a year in their album name 
   10  i=0 
   11  b=0 
   12  g=0 
   13  c=0 
   14  n=0 
   15  # Enable recursive globbing (**), then loop over every FLAC file found
   16  shopt -s globstar 
   17  for f in **/*.flac; do 
   18  i=$(( i+1 )) 
   19  remainder=$(( i % 100 )) 
   20  DIRNAME=$(dirname "$f") 
   21  ALBUMNAME=$(metaflac "--show-tag=Album" "$f" | grep -oP '=\K.+' ) 
   22  YEARNO=$(metaflac "--show-tag=Date" "$f" | grep -oP '=\K.+' ) 
   23  # We want to check that *IF* the ALBUM contains a date, that 
   24  # date matches the one in the YEAR tag 
   25  ALBUMYEAR=${ALBUMNAME: -5:4}   #extracts the 5,4,3, and 2nd characters from the end of the ALBUMNAME 
   26  NUMCHECK='^[0-9]+$' 
   27  if [ "$remainder" -eq 0 ]; then 
   28      echo "  Checking... : $DIRNAME" 
   29  fi 
   30  if [[ $ALBUMYEAR =~ $NUMCHECK ]]; then 
   31    c=$(( c+1 )) 
   32    if [ "$ALBUMYEAR" = "$YEARNO" ]; then 
   33      g=$(( g+1 )) 
   34    else    
   35      b=$(( b+1 )) 
   36      echo "Date mismatch: ALBUM $ALBUMNAME has $ALBUMYEAR, but YEAR has $YEARNO" >> /home/hjr/Desktop/freshdatacheck.txt 
   37    fi 
   38  else 
   39    n=$(( n+1 )) 
   40  fi 
   41  done 
   42  echo 
   43  echo "Album titles NOT containing a year   : " $n 
   44  echo "Album titles containing a year       : " $c 
   45  echo "Records with matching YEARs          : " $g 
   46  echo "Records with non-matching YEARs      : " $b

Line 3 deletes the text file produced by any previous runs of the script, so that we always start with a clean slate.

Lines 4 to 14 initialise some variables that will act as counters. The function of each is explicitly mentioned, so that "i", for example, will increment as each new music file is processed and will thus end up representing the total number of records processed.

Line 16 allows the script to find all files within a folder, no matter how many sub-folders they may be nested within.

Line 17 instructs the script to find all FLAC files within the current folder structure and to start looping through them, one at a time. Each file name when encountered will have its name written into the "$f" variable.
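
To see lines 16 and 17 in isolation, here is a minimal sketch that builds a throwaway folder tree and counts the FLAC files that the same "**/*.flac" pattern finds in it. The count_flacs function, the folder names and the file names are all invented for the illustration. One thing it does that the script above does not is switch on nullglob: without it, bash leaves an unmatched pattern as literal text, so a loop run in a folder containing no FLAC files would go round once with "$f" set to the pattern itself.

```bash
#!/bin/bash
# Sketch only: count_flacs and every path below are invented for illustration.
count_flacs() {
  # Run in a subshell so the cd and the shopt settings don't leak out
  (
    cd "$1" || exit 1
    shopt -s globstar nullglob   # nullglob is an addition, not in the script above
    n=0
    for f in **/*.flac; do
      n=$(( n+1 ))
    done
    echo "$n"
  )
}

# Build a small throwaway tree, count the FLAC files in it, then tidy up
demo=$(mktemp -d)
mkdir -p "$demo/B/Britten/Peter Grimes"
touch "$demo/B/Britten/Peter Grimes/01.flac" "$demo/B/Britten/Peter Grimes/02.flac"
touch "$demo/top.flac" "$demo/B/notes.txt"
count_flacs "$demo"
rm -rf "$demo"
```

Run as is, this should report 3: two files nested three folders down plus one at the top level, with the text file ignored.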

Line 20 creates a DIRNAME variable that will be set to the name of the folder containing whichever file is currently being processed. This is simply for information purposes, as we'll see, so that we'll know roughly where the script is up to in its processing at any given time.

Lines 21 and 22 fetch the ALBUM and YEAR tag data from out of whatever music file is currently being processed. (The script asks metaflac for a tag called Date: that is how it gets at what I've been calling the YEAR tag.) When you read a FLAC tag, it comes back out as a KEY=VALUE pairing, so that you'd read something like "ALBUM=Peter Grimes" and "DATE=1958" -but we're not interested in the "ALBUM=" or "DATE=" bit of that retrieved data. We only want to know the actual value of the tag. Therefore, the strange-looking "grep -oP..." command on each of these lines strips out everything up to and including the first equals sign in whatever is fetched from the music file. The $ALBUMNAME and $YEARNO variables will thus be set to "Peter Grimes" and "1958" respectively, without the tag names themselves cluttering things up.
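
As an aside, the same stripping can be done by bash itself, without calling grep at all. The following is only a sketch of that alternative, not what the script above uses, and tag_value is a name made up for the example. It might be handy on a system whose grep lacks the -P option, which as far as I know is specific to GNU grep.

```bash
#!/bin/bash
# Sketch: strip everything up to and including the FIRST "=" from a
# KEY=VALUE string, using bash parameter expansion instead of grep.
tag_value() {
  local pair=$1
  # "#*=" removes the shortest leading match of "anything followed by ="
  printf '%s\n' "${pair#*=}"
}

tag_value "ALBUM=Peter Grimes"
tag_value "DATE=1958"
```

Because only the shortest leading match is removed, an equals sign inside the album name itself is left alone.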

Now comes the tricky part. Remember, we're trying to find music files which, if their ALBUM tags contain a recording year, that year doesn't match the YEAR tag value. So first, we have to find out whether the ALBUM tag contains a date at all. What follows is an object lesson in the value of data entry consistency: because my test for 'is there a date in this ALBUM name' depends very specifically on how I would enter a date in the ALBUM tag, if I actually ever did so.

For the opera Peter Grimes, conducted by Benjamin Britten, in 1958, if I was going to include the recording date in that album name, I would have set the ALBUM tag to Peter Grimes (Britten - 1958). And if I was tagging up Karajan's 1966 recording of Beethoven's 5th, I would have tagged it as Symphony No. 5 (Karajan - 1966). And so on: in every case, if I was going to include the recording year in the Album name, I would have done so within a pair of normal brackets, following the conductor's (or other distinguishing artist's) surname and a spaced hyphen. This allows me to say, without equivocation, that if there's a recording date in my ALBUM tag, it will be found in the 5th, 4th, 3rd and 2nd character positions from the end of the entire text entry. Or to put it another way, count 5 characters back from the end of the tag and keep counting for 4 characters: they should then be a date.

Of course, if I haven't got a date in my ALBUM tag, that particular collection of characters would be nonsensical from a numerical point of view. For example, if I've tagged something as Symphony No. 9 (Rattle), then the 5th to 2nd characters from the end of that ALBUM tag would be "ttle"... not a numeric value in sight.
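
Here is that character counting tried out on its own, as a small sketch. The year_slice function name is invented for the example; the expansion inside it is the same one the script uses on line 25.

```bash
#!/bin/bash
# Sketch: take 4 characters, starting 5 back from the end of the album name.
# The space before -5 matters: without it, bash reads ":-" as its
# "use a default value" operator instead of a negative offset.
year_slice() {
  local album=$1
  printf '%s\n' "${album: -5:4}"
}

year_slice "Peter Grimes (Britten - 1958)"
year_slice "Symphony No. 5 (Karajan - 1966)"
year_slice "Symphony No. 9 (Rattle)"
```

The first two calls should print 1958 and 1966; the third prints "ttle", which is exactly the sort of non-numeric result the next step is designed to reject.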

So line 25 pulls out the 5th, 4th, 3rd and 2nd characters from the end of the previously-fetched ALBUM tag. Line 26 sets up something called a 'regular expression' which, simply put, tests whether a piece of information consists of nothing but the digits 0 to 9. It looks hieroglyphical, because regular expressions are like that, but the NUMCHECK variable is simply being asked to check that, between the first and last characters of anything fed to it, only the digits 0 to 9 are found.
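
A quick way to get a feel for that regular expression is to feed it a few values by hand. This is just a sketch: is_all_digits is a made-up name wrapping the same pattern as line 26. Note that the pattern itself doesn't insist on exactly four digits; it's the four-character slice taken on line 25 that keeps things to that length.

```bash
#!/bin/bash
# Sketch: succeed only if the argument consists entirely of the digits 0 to 9.
is_all_digits() {
  local numcheck='^[0-9]+$'   # ^ = start, [0-9]+ = one or more digits, $ = end
  [[ $1 =~ $numcheck ]]
}

for candidate in "1958" "ttle" "195x" ""; do
  if is_all_digits "$candidate"; then
    echo "'$candidate' is numeric"
  else
    echo "'$candidate' is not numeric"
  fi
done
```

Only "1958" should pass; a single stray letter, or an empty value, is enough to fail the test.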

Lines 27 to 29 are a bit of a diversion at this point: they're simply there to output some indication of progress (in the form of the name of the folder currently being worked on) every 100th music file processed. Better to know that the script is doing something periodically than to just sit there, staring at a blank screen and wondering whether anything is happening at all!

Line 30 is where we present the value of the ALBUMYEAR variable to the NUMCHECK regular expression. This is the point where we're saying "does this ALBUM tag contain a year in the 'approved' or 'expected' position or not?" Note that the effect of line 25 is to make this test very precise: I mean, for example, that if I had an album called 1812 Overture (Karajan), that most certainly has a date in it... but since the date is not in the 5th to 2nd positions from the end, the script will not get confused on this point. That album name does not contain a valid date in the correct position, so wouldn't pass the test that line 30 is trying to perform. In that case, the script jumps down to line 39, where the 'n' counter is incremented by 1.

If Line 30's test is passed, however, that means that the album name does claim to contain a recording date. So at this point, at line 31, we increment the 'c' counter by 1. At line 32, we then immediately test whether the year found in the album name matches the year found in the YEAR tag. If it does, at line 33, we increment the 'g' counter and do nothing else. If it doesn't -if we have a mismatch between the year stored in the YEAR tag and the recording year stored in the ALBUM tag- we increment the 'b' counter at line 35 and output details about the affected music file to a text file stored on my desktop.
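
The three possible outcomes can be seen more easily with the decision pulled out into a function of its own. This is a sketch rather than a replacement for lines 25 to 40: the classify function and its "match", "mismatch" and "none" labels are invented here, and the YEAR value paired with the 1812 Overture example is made up.

```bash
#!/bin/bash
# Sketch: classify one ALBUM/YEAR pair using the same tests as the script.
# Prints "match", "mismatch" or "none" (labels invented for this example).
classify() {
  local album=$1 year=$2
  local albumyear=${album: -5:4}
  local numcheck='^[0-9]+$'
  if [[ $albumyear =~ $numcheck ]]; then
    if [ "$albumyear" = "$year" ]; then
      echo "match"       # the script would increment c and g
    else
      echo "mismatch"    # the script would increment c and b, and log the file
    fi
  else
    echo "none"          # the script would increment n
  fi
}

classify "Peter Grimes (Britten - 1958)" "1958"
classify "Peter Grimes (Britten - 1969)" "1958"
classify "1812 Overture (Karajan)" "1966"   # hypothetical YEAR value
```

Those three calls should print match, mismatch and none in that order, the last one showing that a year at the front of an album name is ignored.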

Line 41 tells us to keep looping through every possible audio file found in the current folder until we're finished processing all of them. Once we have finished, lines 42 to 46 output some concluding statistics to the console.

I shall pause at this point to emphasise, once again, that the key part of this script only works because I know, without fail, that if I ever included a recording date in the Album name, it would be done in such a way that "find the last 5th, 4th, 3rd and 2nd characters of the album name" will always find it. If I was haphazard about my tagging, so that sometimes I might tag something as Peter Grimes, Britten, 1958 and sometimes as Symphony No. 5 (Karajan - 1966), then my script couldn't work. I don't mean this to imply you must follow my formatting guidance when tagging your own music: but it does mean that whatever formatting you cook up for yourselves has to be consistent across your entire music library, or this sort of bulk processing against it will never be feasible. If you have cooked up your own, consistent, tagging format that's different from mine... well, good for you, but now you'll have to work out your own way of extracting the date component of an album name from that formatting, because my script's rules won't apply to yours!
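
By way of illustration only, here is a sketch of what the extraction might look like for a library that consistently used the comma-separated layout instead. The function name is invented and the layout is an assumption; the point is simply that a different convention needs a different rule, and that the rule only works if the convention is followed everywhere.

```bash
#!/bin/bash
# Sketch for a hypothetical alternative layout: "Work, Conductor, Year".
# Captures a four-digit year only if it follows ", " at the very end.
year_from_comma_format() {
  local album=$1
  local pattern=', ([0-9]{4})$'
  if [[ $album =~ $pattern ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"   # the digits captured by the brackets
  else
    return 1                             # no year in the expected position
  fi
}

year_from_comma_format "Peter Grimes, Britten, 1958"
year_from_comma_format "Peter Grimes, Britten" || echo "no year found"
```

Feed this function an album name in my bracketed format and it finds nothing, just as my script would find nothing in a comma-separated one.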

Anyway... enough of the theory! Let's test it in practice. First, I navigate to one part of my music library (because I like to start with small parts of it before firing these things off against the whole thing!). Let's see how all the music written by composers whose first names start with 'B' fares in my library:


That's me moving to the appropriate folder on my file system and, once there, invoking the new shell script by typing its full path and name. Here's what happens next:

You can see a periodic display of progress with the multiple "Checking..." lines (thanks to lines 27 to 29 of the above script). At the end of the entire run, we see four lines of 'summary statistics' displayed (courtesy of lines 42 to 46 of the script). From this, I can see that most of my albums in this part of my music collection don't have years in their album names at all. That's a problem that will need to be fixed in the future -but it's not the problem I'm trying to fix right now!

We see that 78 of the many albums in this part of my music collection do claim to have a year in their names (which is good, because they should. All albums should!) Unfortunately, of the 78 records with a year in their ALBUM tag, only 33 of them have a year in ALBUM that matches what's been found in the YEAR tag. A whopping 45 items have a date in ALBUM which is not what the YEAR says it should be: and these are the records that I'm trying to find and fix right now.

Of course, you'll want to see which music files are in error at this point: that's what the freshdatacheck.txt file should be able to tell you. The script (line 36) wrote that out to the desktop for me any time it found a music file with a YEAR/ALBUM date mismatch. Here's what mine looks like:

Well, the good news is that all 45 bad items appear to come from the same album (the script is processing individual music files, any one of which might have erroneous information, even though all the other files for that album are fine, so it makes sense to list each file that's in error separately). That means I only have one real problem to fix up:
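
Since line 36 writes the album name but not the file name, every file from the same bad album produces an identical line in the report, so the report can be boiled down to one line per album with standard tools. A minimal sketch, using a stand-in report with made-up contents rather than my real one (summarise_report is an invented name):

```bash
#!/bin/bash
# Sketch: collapse identical report lines and show how many files produced each.
summarise_report() {
  sort "$1" | uniq -c
}

# Stand-in report with made-up contents, in the format written by line 36
report=$(mktemp)
for _ in 1 2 3; do
  echo "Date mismatch: ALBUM Example Work (Conductor - 1959) has 1959, but YEAR has 1958" >> "$report"
done
summarise_report "$report"
rm -f "$report"
```

Each distinct line comes out once, prefixed with the number of files that triggered it.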

Using my favourite Linux-based tag editor, Easytag, I navigate to the album indicated in the error text and... sure enough, I see that the Album name contains a date (1959), but the Year tag contains a different date: in this case, 1958. You can't really tell just by looking which one of those dates, if either, is actually correct. For that, you're going to need to consult sites like Discogs or Allmusic for recording information -and it's often not easy to find, even so. Ultimately, if you can't find a recording date, you could just make one up or get something approximately correct. It's not completely essential that it should be the gospel truth: only that it should be plausibly correct and consistently applied to both YEAR and ALBUM tags.

In this case, I shall simply select all the tracks, correct the ALBUM tag so that it reads 1958 (because that's actually when this composition was recorded), and save the modified data. That done, I shall re-run my script as before:
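
I did this fix by hand in Easytag, but for anyone who'd rather stay at the command line, something along these lines ought to be possible. Treat it as an untested sketch under stated assumptions: corrected_album is an invented helper, the album and file names are placeholders, it relies on the bracketed layout described earlier, and it only prints the metaflac command (using metaflac's --remove-tag and --set-tag options) rather than running it.

```bash
#!/bin/bash
# Sketch: rebuild an album name so its year slot holds the given year.
# Assumes the "(Name - YYYY)" layout, i.e. year in the 5th to 2nd places from the end.
corrected_album() {
  local album=$1 year=$2
  local keep=$(( ${#album} - 5 ))   # everything before the old year
  printf '%s%s%s\n' "${album:0:keep}" "$year" "${album: -1}"
}

# Dry run only: show the command instead of running it.
# The album name and file name here are placeholders.
new=$(corrected_album "Example Work (Conductor - 1959)" "1958")
echo "Would run: metaflac --remove-tag=ALBUM --set-tag=\"ALBUM=$new\" track01.flac"
```

It would be wise to try anything like this on copies of a few files first, and then re-run the checking script to confirm the mismatch has gone.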

...and sure enough, this time, the script finds 78 music files which purport to have a date in the album name, and all 78 of those dates match the relevant YEAR tag value. There are, on this second run, zero records claiming to have a year in their album tag which turn out to be different from that stored in their YEAR tag. That being the case, I can move on to check other parts of my music library in turn, until eventually I can do this:

...which is a final check of the entire music library in one fell swoop: 61,513 files checked, only 3,894 of which have a date in their album names, but every one of those has the correct date (or at least, a date which is consistent with their YEAR date).

Now at least, I know my tag data is consistent between different tags. The bigger problem remains, however: I have 57,619 files which don't include the recording year in their album names at all, when theory (and thus practice) mandates that they should. So that's the third and final problem with my music library I need to fix up... and it's a third script for another time!

I can't leave this post, however, without re-emphasising two crucial points that arose in this exercise. One is the absolutely critical need for consistency in the way you enter your tag data. If you do it 'composition-comma-conductor-comma-date' in one place, you must not do it 'composition-colon-date-colon-conductor' in another. It's not that your music player won't cope if you do... but you'll never be able to reliably bulk-fix your library by the running of batch scripts if you do. If your data is not stored precisely where and how you expect it to be, you cannot script things to fetch it and fix it in a way that will work consistently across your whole music collection. Consistency of data entry permits consistency of data processing.

And the second point I want to emphasise is that this entire script illustrates why the general rule is that you don't repeat data you've stored in one place in a second place. The fact that the two stores of data can end up disagreeing with each other means you can place no reliance on either instance of the data. That said, we're making an exception to the general rule on this specific occasion, because we have no choice and music players work this way! But I want to draw your attention to the general rule in any case 🙂