Fixing Mistakes #3: Fixing the ALBUM tag

This is the last in my series of three posts explaining what I've done to fix up a silly cataloguing error in my extensive music library. The problem was first described back here. But to recap: whilst I have always "allowed" the inclusion of a recording year in the ALBUM tag where it was necessary (to distinguish, for example, between Boult's 1959 and 1968 recordings of Vaughan Williams' 9th symphony), I only added the recording date rarely, and as an exception to the norm. Rethinking the logic of what counts as recorded music's primary key, however, I realised that the recording date should always and without exception be included in a recording's ALBUM tag.

So, the past two posts have been about the scripts I wrote to (a) check that every recording had a recording year stored in its YEAR tag; and (b) check that, where a recording already had a date in its ALBUM tag, that date matched the recording year held in its YEAR tag.

In this last installment, I now have two things to do: i) for those recordings that don't already have a recording year in their ALBUM tag, I need to put it there. And ii), since it is an axiom of mine that the physical storage structures of my music library should mirror its logical organisation, if my ALBUM tag gets changed to include a recording year, the physical folder containing that recording should also be re-named to include a recording year. And, given the 60,000+ music files involved, all of that has to be done entirely automatically, using scripts!

So. Eyes down for a tricky one (because this last stage of fixing things is the most complex and awkward to accomplish)! Here's the script I wrote to achieve the necessary outcome:

    1  #!/bin/bash 
    2  # Clear up previous runs 
    3  rm -f /home/hjr/Desktop/renamealbums.txt 2> /dev/null 
    4  rm -f /home/hjr/Desktop/renamealbums.sh 2> /dev/null 
       
    5  # Initialise a counter 
    6  i=0 
       
    7  # First read the metadata from a file 
    8  shopt -s globstar 
       
    9  for f in **/*.flac; do 
       
   10  i=$(( i+1 )) 
   11  remainder=$(( i % 100 )) 
   12  MUSICDIR=$(pwd) 
   13  DIRNAME=$(dirname "$f") 
       
   14  # Fetch existing metadata 
   15  ALBUMNAME=$(metaflac "--show-tag=Album" "$f" | grep -oP '=\K.+' ) 
   16  YEARNO=$(metaflac "--show-tag=Date" "$f" | grep -oP '=\K.+' ) 
       
   17  # Extract the data in the 5th-2nd characters from the end of ALBUMNAME. It ought to be a year, but we'll check that 
   18  ALBUMYEAR=${ALBUMNAME: -5:4}    
   19  NUMCHECK='^[0-9]+$' 
       
   20  # Progress counter... 
   21  if [ "$remainder" -eq 0 ]; then 
   22      echo "  Checking... : $DIRNAME" 
   23  fi 
       
   24  # Right: if the extracted characters from the end of ALBUMNAME *are* all numeric (i.e., a year), do nothing. 
   25  # Otherwise, we need to add the YEAR data into the end of the ALBUMNAME tag 
   26  # (DOT='.' below is only a placeholder so that the "do nothing" branch is not empty)
   27  if [[ $ALBUMYEAR =~ $NUMCHECK ]]; then 
   28    DOT='.' 
   29  else  
   30    ALBUMNAME2=${ALBUMNAME%)}" - $YEARNO)"   # That is, trim off last bracket and then add YEARNO+closing bracket to ALBUMNAME 
   31    metaflac --remove-tag=ALBUM "$f"             # That is, get rid of existing ALBUM tag from each music file 
   32    metaflac "--set-tag=ALBUM=$ALBUMNAME2" "$f"  # And set it back, this time using ALBUMNAME2 
   33     
   34    # Since the physical folder name should match the ALBUM name tag, if we change the one, we need to check if we need to change the folder name 
   35    DIRYEAR=${DIRNAME: -5:4} 
   36    if [[ $DIRYEAR =~ $NUMCHECK ]]; then 
   37      DOT='.' 
   38    else 
   39      NEWFOLDERNAME="$MUSICDIR/$ALBUMNAME2" 
   40      DIRNAME2=${DIRNAME%.} 
   41      OLDPATH=$(echo $MUSICDIR/$DIRNAME2 | gawk '{ gsub(/"/,"\\\"") } 1') 
   42      NEWPATH=$(echo $MUSICDIR/$(dirname "$DIRNAME2")/$ALBUMNAME2 | gawk '{ gsub(/"/,"\\\"") } 1') 
   43      echo "mv \"$OLDPATH\" \"$NEWPATH\"" >> /home/hjr/Desktop/renamealbums.txt
   44    fi 
   45 fi 
       
   46 done 
       
   47 # Now all per-file processing is complete, generate a shell script which, when run, 
   48 # would implement the folder name changes 
   49  if [ -f "/home/hjr/Desktop/renamealbums.txt" ]; then
   50    sed -i '1i#!/bin/bash' /home/hjr/Desktop/renamealbums.txt
   51    awk '!a[$0]++' /home/hjr/Desktop/renamealbums.txt > /home/hjr/Desktop/renamealbums.sh 
   52    chmod +x /home/hjr/Desktop/renamealbums.sh 
   53  fi

Lines 3 and 4 just tidy things up from previous runs by deleting any copies of the files the script will seek to create.

Line 6 initialises a counter that will increment by 1 for every audio file the script goes on to read and process.

Line 8 allows the script to find any FLAC files in any sub-folder of the folder in which it is run, and Line 9 initiates the search for, and iteration through, every FLAC file stored within the current directory's sub-folder hierarchy. Line 10 is only reached if the script finds a FLAC file, at which point the 'i' counter is incremented by 1. We thus have a 'record counter'. Line 11 sets up a modulus counter: it divides i by 100 and returns the remainder. This allows us later to create a 'progress indicator' that will do something every 100 records processed, as we'll see. Lines 12 and 13 set two variables. MUSICDIR is set to be whatever the current working directory is; DIRNAME is the path, relative to that working directory, of the folder that contains the file being processed.

Suppose, for example, I have an opera stored in the folder /sourcedir/processing/U/Umberto Giordano/Opera/Andrea Chénier (Levine). And suppose I visit /sourcedir/processing to run the script. In that case, "/sourcedir/processing" will be my MUSICDIR and "U/Umberto Giordano/Opera/Andrea Chénier (Levine)" will be my DIRNAME (a relative path, so it has no leading slash). You'll see why we keep these things distinct shortly.

Note from Line 9 that every FLAC filename that is found is loaded into the "$f" variable. So, when the script does things to "$f", it means to do them to the particular and unique music file being processed on this run through the overall processing loop.
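To make the MUSICDIR/DIRNAME split concrete, here is a minimal sketch that rebuilds the example folder inside a throwaway directory and runs the same loop over it. The temporary directory and the empty '01 - Act 1.flac' file are made up purely for illustration; nothing here reads or writes real music.

```bash
#!/bin/bash
shopt -s globstar                      # let ** match any depth of sub-folders

SANDBOX=$(mktemp -d)                   # throwaway stand-in for the processing area
mkdir -p "$SANDBOX/U/Umberto Giordano/Opera/Andrea Chénier (Levine)"
# An empty placeholder file: only its name and location matter here
touch "$SANDBOX/U/Umberto Giordano/Opera/Andrea Chénier (Levine)/01 - Act 1.flac"

cd "$SANDBOX" || exit 1
MUSICDIR=$(pwd)                        # absolute path of where we are standing

for f in **/*.flac; do
    DIRNAME=$(dirname "$f")            # folder part of the relative path in $f
    echo "f        = $f"
    echo "MUSICDIR = $MUSICDIR"
    echo "DIRNAME  = $DIRNAME"
done
```

Because the glob is relative, "$f" and DIRNAME are relative too, which is why the script has to glue MUSICDIR back on the front whenever it needs a full path.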

So Lines 15 and 16 take the currently-found FLAC file and read from it the values for its ALBUM and YEAR (or DATE) tags. When the metaflac program reads tags from a FLAC file, it reads them as a keyname=value pair, so you might get "ALBUM=Symphony No. 5" and "DATE=1995". In neither case are we interested in the 'keyname=' part of that data: we just want the actual value of the tag, not its name. Hence Lines 15 and 16 use a complex bit of grep -oP '=\K.+' code to strip out everything up to, and including, the equals sign. We are left with ALBUMNAME being set to 'Symphony No. 5' and YEARNO being set to '1995', therefore.
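The stripping step can be tried without metaflac at all. In this sketch the tag line is hard-coded in the "NAME=value" shape described above (an assumption standing in for real metaflac output), and the grep approach is shown next to a pure-Bash alternative that is not in the original script.

```bash
#!/bin/bash
# A sample line in the shape metaflac --show-tag prints (hard-coded here
# so the sketch runs without metaflac or a FLAC file).
tagline='ALBUM=Symphony No. 5 (Karajan)'

# The article's approach: -o prints only the match, and \K throws away
# everything matched so far (the "="), leaving just the value.
via_grep=$(printf '%s\n' "$tagline" | grep -oP '=\K.+')

# Alternative (not in the original script): remove the shortest prefix
# that ends in "=" using Bash parameter expansion.
via_bash=${tagline#*=}

echo "grep: $via_grep"
echo "bash: $via_bash"
```

Both give the same value here. The grep version needs a grep built with Perl-regex support (the -P flag); the parameter-expansion version needs nothing beyond Bash itself.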

Line 18 then processes the ALBUMNAME variable. It extracts from it the 5th, 4th, 3rd and 2nd characters from the end of the variable. In the case of 'Symphony No. 5', therefore, it would set the ALBUMYEAR variable to a value of 'No. '. In the case of 'Symphony No. 5 (Karajan - 1985)', though, ALBUMYEAR would end up being set to '1985'. Line 19 creates a variable called NUMCHECK that holds a regular expression, which will eventually be used to test whether what is presented to it consists of entirely numeric characters (that is, the digits 0 to 9, no exceptions allowed). We'll see why that's needed in just a second.
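Here is a small sketch of that extract-and-test idea, run against the two album names just mentioned. The ends_in_year helper is hypothetical (the original script does the test inline); the substring expression and the pattern are the ones from Lines 18 and 19.

```bash
#!/bin/bash
NUMCHECK='^[0-9]+$'     # a regex pattern held in a variable, as on Line 19

# Hypothetical helper (not in the original script): succeeds if the
# 5th-to-2nd characters from the end of $1 are all digits.
ends_in_year() {
    local tail=${1: -5:4}      # the space before -5 is required by Bash
    [[ $tail =~ $NUMCHECK ]]
}

for name in 'Symphony No. 5' 'Symphony No. 5 (Karajan - 1985)'; do
    if ends_in_year "$name"; then
        echo "year found: [${name: -5:4}] in: $name"
    else
        echo "no year:    [${name: -5:4}] in: $name"
    fi
done
```

Note that $NUMCHECK is left unquoted on the right of =~ on purpose: quoting it would make Bash treat it as a literal string rather than a regular expression.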

Lines 20 to 23 are a bit of a distraction at this point: they don't do anything by way of actual data processing. They simply say "if this is the 100th, 200th, 300th....99000th file, then do something" ...and the something is simply to display on-screen the folder containing the music file being processed on this run through the loop. It simply provides a progress indicator, therefore, so you know the script is actually doing something.
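The counter-plus-remainder trick is easy to watch in isolation. This sketch loops over 250 made-up items instead of real files; the reports variable is an extra added here only so the number of progress lines can be counted.

```bash
#!/bin/bash
i=0
reports=0                               # illustration only: counts progress lines

for n in {1..250}; do                   # 250 pretend "files"
    i=$(( i+1 ))
    remainder=$(( i % 100 ))            # 0 only on the 100th, 200th, ... item
    if [ "$remainder" -eq 0 ]; then
        echo "  Checking... : item $i"
        reports=$(( reports+1 ))
    fi
done

echo "$i items seen, $reports progress lines printed"
```

With 250 items the message appears at items 100 and 200 only, which is also why a small batch of fewer than 100 files produces no progress output at all.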

Line 27 is critically important. It asks whether the value currently assigned to the ALBUMYEAR variable consists only of digits by matching it against the NUMCHECK pattern. If it does, it's probably a year. If it doesn't, then that's the sort of record we're interested in fixing up. So Line 28 is reached if the last few bits of the Album Name tag are already a year: since we don't then have to do anything, we just do a meaningless assignment of a full-stop to the DOT variable.

But Line 29 is what we start doing when the Album Name tag doesn't end in a four-digit year. We start at Line 30 by assigning to a variable called ALBUMNAME2 the value of ALBUMNAME shorn of its last closing bracket, but then adding the YEARNO variable and a new closing bracket. In other words, if the existing ALBUM tag is "Symphony No. 5 (Karajan)", ALBUMNAME2 will be set to "Symphony No. 5 (Karajan", and then have " - 1995)" added to it, resulting in "Symphony No. 5 (Karajan - 1995)".
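The Line 30 expression can be tried on its own. This sketch uses the example from the paragraph above, plus one extra case (a name with no closing bracket) that is my own addition, to show what the expression does when the assumption about a trailing bracket doesn't hold.

```bash
#!/bin/bash
YEARNO=1995

# The normal case: ALBUM ends in a closing bracket
ALBUMNAME='Symphony No. 5 (Karajan)'
ALBUMNAME2=${ALBUMNAME%)}" - $YEARNO)"   # drop one trailing ")", append " - YEAR)"
echo "$ALBUMNAME2"

# Edge case (my addition): no trailing bracket, so % removes nothing and
# the result ends up with a closing bracket that has no partner
BARE='Symphony No. 5'
BARE2=${BARE%)}" - $YEARNO)"
echo "$BARE2"
```

The second result shows that the expression relies on every ALBUM tag already ending in a bracketed performer, so any tags that don't follow that convention would need tidying first.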

Lines 31 and 32 actually do the deed: the first line deletes the existing ALBUM tag from the music file we're dealing with. The ALBUMNAME2 variable is then used to write a new ALBUM tag back to the same music file. At the end of Line 32, therefore, the music file's ALBUM tag has a year in it. The correction job required is now therefore done -and we could move on to the next music file and do the same sort of correction, if necessary.

But not so fast: Lines 35 to 45 do the same sort of "does this thing end in a year?" test to the physical folder storing the music file under investigation. If we've just changed the ALBUM tag to be "Symphony No. 5 (Karajan - 1995)", then the physical directory in which that file is stored should also be called "Symphony No. 5 (Karajan - 1995)". The reason for this is simply that my physical disk structures (i.e., the directory and sub-directory structures) that store my music need to be organised in an exactly similar way, physically, to the way I organise my music logically. That way, I can use a music player to play my music (which will be busy reading logical metadata tags within music files), or I can just browse my music files in the file manager my operating system gives me. Either way, I navigate the same sort of mental organisation structure.

So, Line 35 sets the DIRYEAR variable to the DIRNAME's 5th, 4th, 3rd and 2nd characters from the end. Line 36 then asks if those characters are purely numeric. If they are, great: the physical folder name already ends in a year, so we don't need to do anything and Line 37 again simply sets the DOT variable meaninglessly to a full-stop. But if they are not purely numeric, Line 38 kicks in and we start building the two paths a rename needs. Line 40 trims a trailing full-stop from DIRNAME (which is what dirname returns for a file sitting directly in the current directory). Line 41 glues MUSICDIR and that folder name together to make the old path, and Line 42 makes the new path out of MUSICDIR, the folder's parent directory and the freshly-built ALBUMNAME2. In both cases, gawk escapes any double-quote characters so that they survive being wrapped in quotes. (Line 39's NEWFOLDERNAME variable is never used again.) Line 43 writes out a "move" command, which is Linux's way of renaming a folder, into a text file which in this case will sit in my Desktop folder. So nothing actually happens at this point to achieve the folder rename, but the command needed to achieve it later on is written out to a text file.

There's a reason for doing it this way, of course. Remember that this is being done within a per-file loop. So if a folder contains 8 music files (say), then Line 43 will write out the same "rename folder" command 8 times to the text file. And if we allowed the script to actually rename the folder, it would succeed on the first attempt, and then fail to be able to locate the other 7 files (because they're now in a directory that no longer exists)!

So our desktop text file gets very big, because every single music file contained within a folder will trigger an instruction to rename the one folder containing them all. We'll deal with this duplication problem shortly.
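The "build the old and new paths, then write the command out instead of running it" idea can be sketched in a sandbox. The variable names follow the script, but two things here are my own substitutions rather than the original method: the folder lives in a temporary directory, and the command is quoted with Bash's printf '%q' instead of the gawk escaping. The final eval is only there to show that the generated command works when it is eventually run.

```bash
#!/bin/bash
# Work in a throwaway directory so nothing real is touched
MUSICDIR=$(mktemp -d)
DIRNAME='U/Umberto Giordano/Opera/Andrea Chénier (Levine)'
ALBUMNAME2='Andrea Chénier (Levine - 1977)'
mkdir -p "$MUSICDIR/$DIRNAME"

# Old path: where the folder is now. New path: same parent, new name.
OLDPATH="$MUSICDIR/$DIRNAME"
NEWPATH="$MUSICDIR/$(dirname "$DIRNAME")/$ALBUMNAME2"

# %q re-quotes each argument so the shell reads it back unchanged,
# whatever spaces, brackets or quote marks it contains
RENAME_CMD=$(printf 'mv -- %q %q' "$OLDPATH" "$NEWPATH")
echo "$RENAME_CMD"          # this is the line that would go into the text file

# Later, once every file has been processed, the command is safe to run
eval "$RENAME_CMD"
ls "$MUSICDIR/U/Umberto Giordano/Opera"
```

Deferring the rename means every file in the folder is still where the loop expects it while tags are being rewritten; the folder only moves once, afterwards.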

Line 46 is the closing of the loop initiated back at line 9. So now the processing of an individual music file is complete, and we whizz back to line 9 and pick up the next music file to deal with. We keep iterating around this loop until all discovered music files are processed in the same way. Only when all files are processed do we step onto line 47 and beyond.

Line 49 checks to see if a "renamealbums.txt" file exists on my desktop. If it does, it inserts a line right at the top of it which turns it into being a proper Bash script (Line 50). It then performs some awk magic to remove duplicate lines from the file. Remember that if a 56-track opera didn't have a date in its folder name, the instruction to rename the folder to include a date would have been echoed to the text file 56 times: but we're only interested in seeing it once, since one directory rename keeps all 56 files happy! Line 52 then works some Linux magic to make the resulting, de-duplicated file into an executable Bash script in its own right.
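The de-duplication step is worth seeing on a tiny sample. In this sketch the list of mv commands is invented (placeholder album and performer names, temporary files instead of my Desktop), and the shebang is added by writing it ahead of the awk output rather than with sed, which is a variation on the original and not the original method.

```bash
#!/bin/bash
LIST=$(mktemp)      # stands in for renamealbums.txt
SCRIPT=$(mktemp)    # stands in for renamealbums.sh

# Four commands, three of them identical (one per track in "Album A")
printf '%s\n' \
    'mv "/music/Album A (Performer)" "/music/Album A (Performer - 1977)"' \
    'mv "/music/Album A (Performer)" "/music/Album A (Performer - 1977)"' \
    'mv "/music/Album B (Performer)" "/music/Album B (Performer - 1985)"' \
    'mv "/music/Album A (Performer)" "/music/Album A (Performer - 1977)"' > "$LIST"

{
    echo '#!/bin/bash'
    # seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
    # is true only for first occurrences; awk prints those and skips repeats
    awk '!seen[$0]++' "$LIST"
} > "$SCRIPT"
chmod +x "$SCRIPT"

cat "$SCRIPT"
```

Unlike sort -u, this awk idiom keeps the surviving lines in their original order, which matters if you want the shebang to stay on line 1 and the renames to appear in the order the folders were found.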

My script does not, therefore, actually rename any folders itself, though it does re-tag all the music files. Instead, my script creates a second script which, if run, will do the folder renaming for you. At the end of running that second script, your ALBUM tag name should match your physical folder name exactly, and both will include the recording year, which will match the YEAR tag.

That's quite a lot for a single script to be doing and I don't blame you if you're a little bamboozled by the end of it all! But this is the beauty of Linux, shell scripts and being able to process a music library both logically and physically in identical ways. Let's see the script in action, shall we? And let me state, up front, that you do not want to run this script against your actual music library! It is LETHAL. It will alter music tags and (potentially) folder names with no supervision or oversight, so if it mangles your actual music library, there's no going back! Therefore, please back up your music library before running anything. In my case, I will copy sections of my library (delimited by composer first name initial) to a processing location, run the script against that copy, and only if I'm happy will I then delete the music out of my 'real' library and copy my processed copy back into its place.

The point is, unlike the scripts in the earlier two parts of this series, which merely produced text files telling you of metadata problems which you had to go and fix manually, this script will actually alter things in the metadata of your music files. It does things! So it's really important you do NOT run this script against your one and only copy of your actual music library, or mayhem may result!

Right?

OK. In that case, consider my starting situation:

I've copied my 'U' files into a separate processing area, so you see that I've got a folder for compositions by Umberto Giordano and this one specifically is for his opera Andrea Chénier... and you'll notice that the folder name here has no year or date at the end of it. None of the audio files within that folder have a recording date in their ALBUM tag either, as you can see in EasyTag:

As you can see, the ALBUM tag for every file within that folder is simply 'Andrea Chénier (Levine)'... again, no date anywhere in that.

So: let's run the script:

I've travelled to the 'parent' directory of all the 'U' folders, and invoked the shell script from there.

Now, nothing much by way of progress is reported on this occasion, because there's so little music in my collection that's written by composers named 'U...' that the 'every 100 files' progress indicator never has a chance to kick in. So the script simply runs and appears to do nothing. But let's check the music files' metadata shall we?

Bingo! Notice that the ALBUM tag (over on the right-hand third of the screen, roughly half way down) now contains a date, within the brackets. Instead of (Levine), it now ends (Levine - 1977), so the ALBUM tag now correctly contains the same data that's contained in the YEAR tag. But what of the physical folder?

Well: not so good news:

As you can see, the folder name still simply reads "(Levine)", date-less. But that, of course, is by design: though the script alters a music file's metadata tags, it doesn't alter the physical folder names, but merely creates another script that, if run, would do that.

So let's look at the files created by running the first script. First there's renamealbums.txt:

Well, this is the text file produced by my script. For every music file found without a year in its ALBUM tag, it's generated a 'mv "this folder" "that folder"' command. But the final lines 47 to 53 of my script should take this raw content and turn it into a de-duplicated shell script. So let's see renamealbums.sh:

Sure enough, this file contains only a single instruction to rename the entire folder so that it includes a YEAR value in its name. All that I have to do, therefore, is run this new script to finish the job:

So you see me typing in the full path and name to this new 'renamealbums' shell script and invoking it immediately after the previous script had completed. The effect?

Well, now my file manager is telling me that the folder name ends in "(Levine - 1977)", which exactly mirrors my new ALBUM tag, so metadata and physical folder structures agree once more.

I could have written the script that starts this article so that it automatically ran this 'renamealbums' script, so that running one thing would do everything (just insert a new line at 53, reading: "/home/hjr/Desktop/renamealbums.sh"). But I got nervous: physical file structures are the pointy end of any music collection and messing around with them automatically is just asking for trouble! I therefore thought it wise merely to generate an intermediate script to get the physical folder names correct, so that you could inspect it and check it for any potential errors before running it and committing yourself. Once I'd run the scripts manually a couple of times, though, I got confident enough to include that extra line and let the one script do everything: but incremental baby-steps are no bad thing when so much is at stake!

Once my 'U' files are correctly named in my processing area, it's time to delete the original 'U' files in my actual music library, and move the re-tagged and re-named ones back into their place. Repeat for as many sections of your music collection as you have created!

Job done, basically. The complexity of my scripts in this series of three articles is down to two things. First, I probably can't write very good shell scripts! Nothing we can do about that now, I guess 🙂 But secondly, the fact is, my data is stored in a consistent, predictable way and in a consistent, predictable format. Therefore, with just a little bit of string formatting and processing, I'm able to script up something that fixes up over 60,000 music files with relatively little fuss.

Not that it's perfect, mind you!

You'll notice a series of 6 error messages appeared when I attempted to run the 'renamealbums' script on my "W" files (i.e., stuff like William Walton and Wolfgang Mozart). The reason for this is fairly simple (and a bit of a logic bomb!): the script assumes that the album folder will be named exactly as the ALBUM tag is set. It generates its series of "mv" commands (which is how Linux renames folders) using the value stored in the $ALBUMNAME variable, which was fetched directly from the ALBUM tag. So if you ever tag your ALBUM as (say) "Symphony No. 6" and name the folder it's stored in as "Symphony  No. 6", the mv command will fail. Can you spot the problem?! The folder name has a double space between 'Symphony' and 'No.' and the ALBUM tag doesn't. It's not the only sort of mismatch that might happen, but I think it's the commonest in my setup.
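Reading the listing, one plausible mechanism for the double-space failure (an inference from the code, not something the error messages confirm) is the unquoted echo on Lines 41 and 42: when a variable is expanded without quotes, Bash splits it into words and echo joins them back with single spaces, so a run of two spaces silently becomes one and the "old" path no longer names a folder that exists. This sketch shows the effect, plus a simple folder-versus-tag comparison; both names are the hypothetical ones from the paragraph above.

```bash
#!/bin/bash
DIRNAME='Symphony  No. 6'          # two spaces, as in the mis-typed folder
ALBUMNAME='Symphony No. 6'         # one space, as in the tag

unquoted=$(echo $DIRNAME)          # word-splitting collapses the run of spaces
quoted=$(echo "$DIRNAME")          # quoting preserves the name exactly

echo "unquoted: [$unquoted]"
echo "quoted:   [$quoted]"

# A direct check for the underlying cataloguing slip: does the folder's
# own name differ from the ALBUM value?
if [ "$(basename "$DIRNAME")" != "$ALBUMNAME" ]; then
    echo "MISMATCH: folder [$DIRNAME] vs ALBUM [$ALBUMNAME]"
fi
```

Quoting the expansions would let the rename go through even for a double-spaced folder, but the failed mv is also what flags the mismatch in the first place, so a comparison like the one at the end is a gentler way to find such folders.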

What that means is that the list of errors is certainly something I need to go and sort out for the folder re-naming business -but it's also a list of folder names that don't exactly match the ALBUM they contain... and that's a separate error I ought to have corrected long before now, too.

So, short version: the errors listed are actually rather a good thing, because they are an indication of sloppy typing made years ago, and of mistakes that can now be fixed! The important point is, at least, that the errors don't indicate that anything destructive has gone on, which is just as well! No music files are lost or damaged by this process, in other words.

Take note: you may not agree with the specific way I've organised my music collection, but the fact that it's organised this way allows me to do bulk processing and re-processing to get things corrected and 'just so' without too much difficulty. If you choose to go your own tagging way, fair enough and good luck to you: but keep it consistent and predictable anyway, otherwise you'll never be able to do this sort of bulk re-processing in the future. And that means: a lot of manual re-typing, one file at a time -which I'm betting you won't want to do on any music collection large enough to be cared about in the first place!