Let’s not get physical!

I had a slight mishap with my main PC on New Year’s Eve: Manjaro released a new kernel and I installed it without thinking -and, though I believe the PC rebooted fine, I couldn’t actually see anything on my monitor, so whether it had or not was really kind of moot!

So, a swift rebuild later, and we’re back in business -though it’s not quite how I imagined I would spend my New Year’s Eve!

Anyway: doing the rebuild, I made the decision not to re-install Strawberry or Clementine music players, which have long been my go-to music player/managers on all manner of Linuxes. Why not? Because I’d just released AMP and it was working nicely and I couldn’t imagine a need for either of the two ‘fat’ music players ever again.

I’m still happy with AMP: it does exactly what I need a media player to do. I particularly like the ‘random play’ feature: tell it to play music from a folder that doesn’t itself contain music, but which contains sub-folders which, at some point in the tree hierarchy, themselves contain music files, and AMP will randomly pick a folder to play. It’s meant I’ve been listening to a bewildering variety of composers since New Year’s Day that I probably would never have listened to if left to my own devices. For me, random selection of complete compositions to play is a very important feature to have -and it’s something no other music player I know of, on Windows or Linux, is able to offer.

But here’s the thing: the way AMP does its random selection at the moment is to search through every physical folder that contains FLAC music files and then sort the results of that search into a random order, before picking the top of the scrambled list. Whilst that works, physically traversing a hard disk that contains 1.8TB of music in 10,000+ folders, holding nearly 70,000 FLAC files, is inevitably going to take time. Practically, the result is that every time I run AMP in random-play mode, it takes maybe 30 seconds or more for it to find something that it then decides to play.
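For the curious, here is a minimal Python sketch of that kind of physical search. It is not AMP's actual code: the function name pick_random_folder is made up for illustration, and the only thing it assumes is that a 'music folder' is any folder holding at least one file ending in .flac.

```python
import os
import random


def pick_random_folder(music_root, rng=random):
    """Walk the whole tree, then pick one FLAC-containing folder at random.

    Hypothetical helper for illustration only; not taken from AMP.
    """
    # Every folder under music_root gets visited: this is the slow part.
    candidates = [
        dirpath
        for dirpath, _dirnames, filenames in os.walk(music_root)
        if any(name.lower().endswith(".flac") for name in filenames)
    ]
    if not candidates:
        return None
    rng.shuffle(candidates)  # scramble the list of folders...
    return candidates[0]     # ...and take whatever ends up on top
```

The point to notice is that nothing can be chosen until the whole walk has finished, which is why the wait grows with the size of the collection.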

I can live with that. (I wouldn’t have released it if I couldn’t!) It’s a bit like the old days, when you had to extract an LP from its cardboard sleeve, then remove the LP itself from a plastic or paper inner sleeve, place it on the turntable, and wait for the play arm to position itself, drop down, crackle a bit in the lead-in track, and then the music would begin… The anticipation that AMP will finally find something worth playing is worth enduring, I reckon! It’s all part of the (new and digital!) listening experience. 🙂 And yes, I realise that younger readers who have no idea what an LP is (or was!) won’t have the faintest idea what I’m on about, but that’s fine too:  let’s just agree that deferred gratification is a ‘thing’ that can be desirable, shall we?!

But: really, the problem is in making AMP do a physical trawl through a music collection before it picks something to play. That is inevitably slow -and, worse, it’s inevitably biased. If I have 10,000 Bach tracks and 3,000 Mozart tracks and 4 Ēriks Ešenvalds tracks, what do you think the chances are that AMP will ever pick an Ēriks Ešenvalds track?! Practically zero, as it turns out: the physical dimensions of your music collection will inevitably affect the results of a physical random walk through that collection. Composers with lots of tracks will tend to be favoured for selection over those that have fewer tracks. I noticed this almost immediately upon AMP’s release, when I spotted an unusual propensity for it to select Bach, Mozart and Handel -three composers of whom I have an inordinately large number of ‘tracks’ of music.
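To put rough numbers on that, here is a small Python sketch using the track counts from the example above. It makes a simplifying assumption that every track is equally likely to be landed on, and compares that with giving each composer an equal shot. The variable names are purely illustrative.

```python
from fractions import Fraction

# Track counts taken from the example in the text.
track_counts = {"Bach": 10_000, "Mozart": 3_000, "Ešenvalds": 4}
total_tracks = sum(track_counts.values())

# Size-weighted pick: a composer's chance is their share of all tracks
# (assumption: every track is equally likely to be chosen).
weighted = {c: Fraction(n, total_tracks) for c, n in track_counts.items()}

# Equal-footing pick: every composer gets the same chance, whatever their size.
equal = {c: Fraction(1, len(track_counts)) for c in track_counts}

for composer in track_counts:
    print(
        f"{composer}: {float(weighted[composer]):.4%} size-weighted, "
        f"{float(equal[composer]):.1%} equal-footing"
    )
```

On those figures Ešenvalds gets 4 chances in 13,004, or roughly 0.03%, against one chance in three if composers were treated equally.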

I partially fixed this issue a day or two after release by letting you create an ‘excludes.txt’ file, in which you name composers you don’t want randomly selected. But the probability bias created by physically searching through music collections of uneven sizes is still inherent, even if mitigated somewhat.

Fundamentally: the only thing that will fix that physical size bias is to stop using physical searches through your music and to switch, instead, to logical searches.

By a logical search, I mean that we extract the metadata from our music files and store it in an accessible fashion within a database table. If you’re not knowledgeable about databases, think of a database table as the equivalent of an Excel spreadsheet, with rows of items -each row representing ‘a track from a CD’ and each row being made up of separate columns, where column A stores the composer name, column B the composition name, column C the track title and so on. Either way, the problem now becomes one of picking a row at random from a single ‘spreadsheet’, rather than having to physically walk through 10,000+ folders before making a random choice. Guess which is quicker to do?!

Once you’re dealing with a database table, you can select rows from it extremely quickly: it’s one ‘object’ stored in one physical location, so scanning through it is trivially easy. Since you’re not physically visiting 10,000+ folders, but merely querying a 10,000-row table, we’re talking sub-second search times for any reasonably-sized music collection (by which I mean 30,000 physical CDs or so).

But it gets better: imagine you capture the composer and composition names from your physical music collection into a table we’ll call ALBUMS. Because you have a huge collection of Bach and a tiny one of Ešenvalds, there will be many rows in this table of Bach and only a few of Ešenvalds. The ‘proportion’ problem still raises its head, therefore: you’re much more likely to randomly select a Bach row than an Ešenvalds one.

But this is where databases -and logical data manipulation- work their magic. Because, when working with databases, you can do this:

create table my_composers as select distinct composer from ALBUMS

That is, you can create a new table as a selection of data column(s) from another table. And -the key point- you are populating this new table with only the unique -or ‘distinct‘- composer names. So from a table containing 10,000 rows of Bach and 45 of Ešenvalds, you have just created a new table that contains one row for Bach and one for Ešenvalds: suddenly, the two composers are on an equal probability footing!

So, if you create a table like my_composers, containing one row per composer as a selection made out of your ALBUMS table, you now have a maybe 1,000-row table, one row per composer. A random selection from that table means every composer stands as equal a chance as any other to be selected. Having randomly selected a composer, you can go back to your ALBUMS table and ask for a random selection from there where the composer name matches the composer name you’ve previously selected at random.

At this point, Ešenvalds’ Ubi Caritas is as likely to be selected as Bach’s Mass in B Minor -because the choice of which composer’s music to select was initially made on a completely equal basis, because of the one-row-per-composer in the my_composers table/spreadsheet.
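The two-step pick described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses SQLite (via Python's built-in sqlite3 module) because it needs no setup, not because AMP necessarily uses it; the two-column layout of the ALBUMS table and the function name pick_composition are invented for the example; and the toy data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Assumed, simplified layout for the ALBUMS table.
conn.execute("CREATE TABLE albums (composer TEXT, composition TEXT)")
rows = [("Bach", f"Cantata BWV {n}") for n in range(1, 201)]  # 200 Bach rows
rows.append(("Ešenvalds", "Ubi Caritas"))                     # 1 Ešenvalds row
conn.executemany("INSERT INTO albums VALUES (?, ?)", rows)

# One row per composer, exactly as in the CREATE TABLE ... AS SELECT DISTINCT above.
conn.execute("CREATE TABLE my_composers AS SELECT DISTINCT composer FROM albums")


def pick_composition(conn):
    """Pick a composer on an equal footing, then one of that composer's rows."""
    (composer,) = conn.execute(
        "SELECT composer FROM my_composers ORDER BY RANDOM() LIMIT 1"
    ).fetchone()
    return conn.execute(
        "SELECT composer, composition FROM albums "
        "WHERE composer = ? ORDER BY RANDOM() LIMIT 1",
        (composer,),
    ).fetchone()


picks = [pick_composition(conn)[0] for _ in range(1000)]
print("Ešenvalds picked", picks.count("Ešenvalds"), "times out of", len(picks))
```

With this toy data, Ešenvalds should turn up in roughly half of the picks, despite having one row against Bach's 200.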

All of which is by way of prelude to explaining why AMP, merely days after its release, has been bumped to version 1.02: a new option to create or refresh a music database (or even several music databases) as an extraction of logical data from your physical music files has been provided; and another new option to use that database to generate random selections of music has also been made available. Let me elaborate on what the new options are.

First, there’s this:

amp --musicdir=/root/folder/of/your/music/collection --createdb

That tells AMP to scan your physical music collection and extract its metadata into a database. The database is called “music”, because I didn’t tell it otherwise. But I could have done this:

amp --musicdir=/root/folder/of/your/music/collection --createdb --dbname=main
amp --musicdir=/root/folder/of/your/music/collection --createdb --dbname=overflow

…in which I name the database(s) and can therefore have different databases for different music collections. The ‘music’ name is just the default.

When creating a database, AMP will populate it with a fast scan of the specified music collection. That is, it will race off to the folder specified, find the first FLAC file in each sub-folder, and extract the metadata from just that one file. If you’ve tagged everything correctly, the ALBUM, COMMENTS, GENRE and so on for track 1 of a rip should be just the same in all the other tracks, too, of course -but since AMP doesn’t check all those other files, it’s a lot quicker to scan that way.

If you prefer, however, you can do this at any time after database creation:

amp --musicdir=/root/folder/of/your/music/collection --refreshdb --scanmode=full

That is, you can ask for the database contents to be updated using a full scan, which is where AMP runs off to the physical folder specified and reads the metadata out of every FLAC file found in all folders, not just the first one. This is more thorough (and can expose some tagging errors!), but obviously much slower to perform and complete.
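The difference between the two scan modes boils down to which files get opened. Here is a rough Python sketch of just that file-selection step. It is a guess at the general shape, not AMP's implementation: the function name flac_files_to_scan is hypothetical, 'first FLAC file' is assumed to mean first in alphabetical order, and the actual reading of tags is left out because it would need a third-party tagging library.

```python
import os


def flac_files_to_scan(music_root, scan_mode="fast"):
    """Return the FLAC paths a scan would need to read.

    Hypothetical helper: 'fast' keeps one file per folder, 'full' keeps all.
    """
    if scan_mode not in ("fast", "full"):
        raise ValueError(f"unknown scan mode: {scan_mode!r}")

    selected = []
    for dirpath, dirnames, filenames in os.walk(music_root):
        dirnames.sort()  # visit sub-folders in a predictable order
        flacs = sorted(n for n in filenames if n.lower().endswith(".flac"))
        if scan_mode == "fast":
            flacs = flacs[:1]  # assumption: 'first' means alphabetically first
        selected.extend(os.path.join(dirpath, name) for name in flacs)
    return selected
```

In fast mode the list has at most one entry per folder; in full mode it has one per FLAC file, which for a collection like mine is the difference between 10,000-odd and nearly 70,000 files to read.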

As ever, since I didn’t mention a database name in that full-scan command, AMP will perform the full scan against the default ‘music’ database. If you need it to full-scan a database with a non-default name, you just tack on the ‘dbname’ parameter as before:

amp --musicdir=/root/folder/of/your/music/collection --refreshdb --dbname=main --scanmode=full

And if you like, you can also specify --scanmode=fast explicitly, though that scan mode is the default anyway.

So that’s how you create, populate and keep up-to-date a music database. How do you tell AMP to use that database for the purposes of actually playing music? Like this:

amp