Let’s not get physical! – Absolutely Baching

I had a slight mishap with my main PC on New Year's Eve: Manjaro released a new kernel and I installed it without thinking -and, though I believe the PC rebooted fine, I couldn't actually see anything on my monitor, so whether it had or not was really kind of moot!

So, a swift rebuild later, and we're back in business -though it's not quite how I imagined I would spend my New Year's Eve!

Anyway: doing the rebuild, I made the decision not to re-install Strawberry or Clementine music players, which have long been my go-to music player/managers on all manner of Linuxes. Why not? Because I'd just released AMP and it was working nicely and I couldn't imagine a need for either of the two 'fat' music players ever again.

I'm still happy with AMP: it does exactly what I need a media player to do. I particularly like the 'random play' feature: tell it to play music from a folder that doesn't itself contain music, but which contains sub-folders which, at some point in the tree hierarchy, themselves contain music files, and AMP will randomly pick a folder to play. It's meant I've been listening to a bewildering variety of composers since New Year's Day that I probably would never have listened to if left to my own devices. For me, random selection of complete compositions to play is a very important feature to have -and it's something no other music player I know of, on Windows or Linux, is able to offer.

But here's the thing: the way AMP does its random selection at the moment is to search through every physical folder that contains FLAC music files and then sort the results of that search into a random order, before picking the top of the scrambled list. Whilst that works, physically traversing a hard disk that contains 1.8TB of music in 10,000+ folders and containing nearly 70,000 FLAC files is inevitably going to take time. Practically, the result is that every time I run AMP in random-play mode, it takes maybe 30 seconds or more for it to find something that it then decides to play.
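To make that approach concrete, here is a rough sketch of it in Python. It is not AMP's actual code (AMP is a shell script), and the function names are made up for illustration; it simply walks the tree, collects every folder that holds a FLAC file, shuffles the list and takes the top entry.

```python
import os
import random


def folders_with_flac(music_root):
    """Walk the whole tree and return every folder that directly holds a FLAC file."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(music_root):
        if any(name.lower().endswith(".flac") for name in filenames):
            found.append(dirpath)
    return found


def pick_folder_physically(music_root, rng=random):
    """Shuffle the list of music folders and return the one that lands on top."""
    candidates = folders_with_flac(music_root)
    if not candidates:
        return None  # nothing playable under this root
    rng.shuffle(candidates)
    return candidates[0]
```

The slow part is the walk itself: every folder has to be visited before anything can be chosen, which is why the delay grows with the size of the collection.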

I can live with that. (I wouldn't have released it if I couldn't!) It's a bit like the old days, when you had to extract an LP from its cardboard sleeve, then remove the LP itself from a plastic or paper inner sleeve, place it on the turntable, and wait for the play arm to position itself, drop down, crackle a bit in the lead-in track, and then the music would begin... The anticipation that AMP will finally find something worth playing is worth enduring, I reckon! It's all part of the (new and digital!) listening experience. 🙂 And yes, I realise that younger readers who have no idea what an LP is (or was!) won't have the faintest idea what I'm on about, but that's fine too: let's just agree that deferred gratification is a 'thing' that can be desirable, shall we?!

But: really, the problem is in making AMP do a physical trawl through a music collection before it picks something to play. That is inevitably slow -and, worse, it's inevitably biased. If I have 10,000 Bach tracks and 3,000 Mozart tracks and 4 Ēriks Ešenvalds tracks, what do you think the chances are that AMP will ever pick an Ēriks Ešenvalds track?! Practically zero as it turns out: the physical dimensions of your music collection will inevitably affect the results of a physical random walk through that collection. Composers with lots of tracks will tend to be favoured for selection over those that have fewer tracks. I spotted this almost immediately upon AMP's release, when I noticed an unusual propensity for it to select Bach, Mozart and Handel -three composers of whom I have an inordinately large number of 'tracks' of music.
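To put rough numbers on that bias, here is a small Python calculation using the three track counts just mentioned. It assumes, as a simplification, that every track is equally likely to be picked and that these three composers make up the whole collection; the names `counts` and `chance_of` are just for illustration.

```python
from fractions import Fraction

# Track counts taken from the example in the text.
counts = {"Bach": 10_000, "Mozart": 3_000, "Ešenvalds": 4}
total = sum(counts.values())


def chance_of(composer):
    """Probability that a uniformly random track belongs to this composer."""
    return Fraction(counts[composer], total)


for composer in counts:
    print(f"{composer}: {float(chance_of(composer)):.4%}")
```

On those assumptions, Ešenvalds gets 4 chances in 13,004, which is 1 in 3,251.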

I partially fixed this issue a day or two after release by letting you create an 'excludes.txt' file, in which you name composers you don't want randomly selected. But the probability bias created by physically searching through music collections of uneven sizes is still inherent, even if mitigated somewhat.

Fundamentally: the only thing that will fix that physical size bias is to stop using physical searches through your music and to switch, instead, to logical searches.

By a logical search, I mean that we somehow extract the metadata from our music files and store it in an accessible fashion within a database table. If you're not knowledgeable about databases, think of a database table as the equivalent of an Excel spreadsheet, with rows of items -each row representing 'a track from a CD' and each row being made up of separate columns, where column A stores the composer name, column B the composition name, column C the track title and so on. Either way, the problem now becomes one of picking a row at random from a single 'spreadsheet', rather than having to physically walk through 10,000+ folders before making a random choice. Guess which is quicker to do?!

Once you're dealing with a database table, you can select rows from it extremely quickly: it's one 'object' stored in one physical location, so scanning through it is trivially easy. Since you're not physically visiting 10,000+ folders, but merely querying a 10,000-row table, we're talking sub-second search times, for any reasonably-sized music collection (by which I mean, 30,000 physical CDs or so).
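As an illustration of what 'picking a row at random from a table' looks like, here is a minimal Python sketch using the sqlite3 module from the standard library. The two-column `albums` table and its three rows are invented for the demonstration and are not AMP's real schema.

```python
import sqlite3

DEMO_ROWS = [
    ("Johann Sebastian Bach", "Mass in B Minor"),
    ("Benjamin Britten", "Billy Budd"),
    ("Ēriks Ešenvalds", "Ubi Caritas"),
]


def build_demo_db():
    """Create an in-memory database with a tiny stand-in ALBUMS table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE albums (composer TEXT, composition TEXT)")
    conn.executemany("INSERT INTO albums VALUES (?, ?)", DEMO_ROWS)
    conn.commit()
    return conn


def random_album(conn):
    """Ask the database for one row chosen at random."""
    query = "SELECT composer, composition FROM albums ORDER BY RANDOM() LIMIT 1"
    return conn.execute(query).fetchone()


conn = build_demo_db()
print(random_album(conn))
```

No folders are visited at all here: the whole choice is made by one query against one table.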

But it gets better: imagine you capture the composer and composition names from your physical music collection into a table we'll call ALBUMS. Because you have a huge collection of Bach and a tiny one of Ešenvalds, there will be many rows in this table of Bach and only a few of Ešenvalds. The 'proportion' problem still raises its head, therefore: you're much more likely to randomly select a Bach row than an Ešenvalds one.

But this is where databases -and logical data manipulation- work their magic. Because, when working with databases, you can do this:

create table my_composers as select distinct composer from ALBUMS

That is, you can create a new table as a selection of data column(s) from another table. And -the key point- you are populating this new table with only the unique -or 'distinct'- composer names. So from a table containing 10,000 rows of Bach and 45 of Ešenvalds, you have just created a new table that contains one row for Bach and one for Ešenvalds: suddenly, the two composers are on an equal probability footing!

So, if you create a table that contains one row per composer as a selection made out of your ALBUMS table (the my_composers table above, though you could just as well call it COMPOSERS), you now have a maybe 1000-row table, one row per composer. A random selection from that table means every composer stands as equal a chance as any other to be selected. Having randomly selected a composer, you can go back to your ALBUMS table and ask for a random selection from there where the composer name matches the composer name you've previously selected at random.

At this point, something by Ešenvalds (his Ubi Caritas, say) is as likely to be selected as something by Bach (the Mass in B Minor, say) -because the choice of which composer's music to select was initially made on a completely equal basis, because of the one-row-per-composer in the my_composers table/spreadsheet.
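The two-step pick can be sketched in Python with sqlite3 as follows. This is an illustration under assumptions rather than AMP's own code: the `albums` columns and the 100 placeholder Bach rows are made up, and only the 'create table ... as select distinct' statement comes from the text above.

```python
import sqlite3
from collections import Counter


def build_demo_db():
    """In-memory ALBUMS table: 100 placeholder Bach rows, one Ešenvalds row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE albums (composer TEXT, composition TEXT)")
    rows = [("Johann Sebastian Bach", f"Placeholder work {n}") for n in range(1, 101)]
    rows.append(("Ēriks Ešenvalds", "Ubi Caritas"))
    conn.executemany("INSERT INTO albums VALUES (?, ?)", rows)
    # One row per composer, however many albums each one has.
    conn.execute("CREATE TABLE my_composers AS SELECT DISTINCT composer FROM albums")
    conn.commit()
    return conn


def pick_fairly(conn):
    """Step 1: random composer. Step 2: random album by that composer."""
    (composer,) = conn.execute(
        "SELECT composer FROM my_composers ORDER BY RANDOM() LIMIT 1"
    ).fetchone()
    return conn.execute(
        "SELECT composer, composition FROM albums "
        "WHERE composer = ? ORDER BY RANDOM() LIMIT 1",
        (composer,),
    ).fetchone()


conn = build_demo_db()
tally = Counter(pick_fairly(conn)[0] for _ in range(2000))
print(tally)
```

Run it a few times and the two composers should each come out at around half of the picks, even though Bach has a hundred times as many rows in the albums table.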

All of which is by way of prelude to explaining why AMP, merely days after its release, has been bumped to version 1.02: a new option to create or refresh a music database (or even several music databases) as an extraction of logical data from your physical music files has been provided; and another new option to use that database to generate random selections of music has also been made available. Let me elaborate on what the new options are.

First, there's this:

amp --musicdir=/root/folder/of/your/music/collection --createdb

That tells AMP to scan your physical music collection and extract its metadata into a database. The database is called "music", because I didn't tell it otherwise. But I could have done this:

amp --musicdir=/root/folder/of/your/music/collection --createdb --dbname=main
amp --musicdir=/root/folder/of/your/music/collection --createdb --dbname=overflow

...in which I name the database(s) and can therefore have different databases for different music collections. The 'music' name is just the default.

When creating a database, AMP will populate it with a fast scan of the specified music collection. That is, it will race off to the folder specified, find the first FLAC file in each sub-folder, and extract the metadata from just that one file. If you've tagged everything correctly, the ALBUM, COMMENTS, GENRE and so on for track 1 of a rip should be just the same in all the other tracks, too, of course -but since AMP doesn't check all those other files, it's a lot quicker to scan that way.
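In Python terms, the 'first FLAC file in each sub-folder' idea might look like the sketch below. It is not AMP's code, and it makes two assumptions of its own: that 'first' means first by file name, and that reading the tags themselves (which needs a tag-reading tool or library) is left out.

```python
import os


def first_flac_per_folder(music_root):
    """Yield one FLAC path per folder: the first by name in each folder that has any."""
    for dirpath, dirnames, filenames in os.walk(music_root):
        dirnames.sort()  # visit sub-folders in a predictable order
        flacs = sorted(name for name in filenames if name.lower().endswith(".flac"))
        if flacs:
            # A fast scan would read the metadata of this one file only.
            yield os.path.join(dirpath, flacs[0])
```

A folder of 30 tracks therefore costs one file read instead of 30, which is where the speed comes from.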

If you prefer, however, you can do this at any time after database creation:

amp --musicdir=/root/folder/of/your/music/collection --refreshdb --scanmode=full

That is, you can ask for the database contents to be updated using a full scan, which is where AMP runs off to the physical folder specified and reads the metadata out of every FLAC file found in all folders, not just the first one. This is more thorough (and can expose some tagging errors!), but obviously much slower to perform and complete.

As ever, since I didn't mention a database name in that full scan command, AMP will perform the full scan into the default database, music. If you need it to full-scan a database with a non-default name, you just tack on the 'dbname' parameter as before:

amp --musicdir=/root/folder/of/your/music/collection --refreshdb --dbname=main --scanmode=full

And if you like, you can also specify --scanmode=fast, explicitly, though that scan mode is the default anyway.

So that's how you create and populate and keep up-to-date a musical database. How do you tell AMP to use that database for the purposes of actually playing music? Like this:

amp --musicdir=/root/folder/of/your/music/collection --usedb

The --usedb parameter is what triggers AMP to read from the database: again, without specifying a database name, it will want to use a database called music. But if you've created specially-named databases, you should specify that name at the point of use, too. So:

amp --musicdir=/root/folder/of/your/music/collection --usedb --dbname=main

OR:

amp --musicdir=/root/folder/of/your/music/collection --usedb --dbname=overflow

Since you're using a database to find your music, it's not actually technically necessary to specify the "--musicdir" bit, because AMP will get the file locations of your music from the database. But it helps with a specific cosmetic factor: when playing, AMP displays the full path of the music it's playing... but it strips out the 'root' component of the path, if it knows what it is. It's that last part which is important here. If it knows it's playing music hanging off /music/directory, then when it's playing /music/directory/Britten/Billy Budd, it will only display /Britten/Billy Budd. Hence, the use of the musicdir parameter is still recommended, even if playing from a database.
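The root-stripping behaviour is easy to picture as a small function. Here is a Python sketch of the idea; the name `display_path` is hypothetical and this is not how AMP itself is written.

```python
def display_path(full_path, musicdir=None):
    """Return the path to show on screen, minus the music root if it is known."""
    if musicdir:
        root = musicdir.rstrip("/")
        # Only strip when the path really sits underneath the root folder.
        if full_path == root or full_path.startswith(root + "/"):
            return full_path[len(root):] or "/"
    return full_path


print(display_path("/music/directory/Britten/Billy Budd", "/music/directory"))
print(display_path("/music/directory/Britten/Billy Budd"))
```

With the root supplied, the first call shows /Britten/Billy Budd; without it, the second has nothing to strip and shows the full path.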

Obviously, if you are continually adding to your music collection, you will need to refresh the database periodically: fast or full, that's up to you. But if you've added 80 CDs to your collection, none of those will be selected by AMP for randomised playback until you've done a new database refresh. That doesn't mean AMP can't play them at all, of course: if you tell it a musicdir and not to use a database, it will still do its physical selection randomisation, and so all 80 CDs would be candidates to play in that. You can also go to the folders of any of those 80 CDs and invoke AMP locally, so that you are basically forcing it to play wherever you've moved yourself to in the file system. But yes, without a database refresh, playing from the database won't see them.

It obviously isn't terribly difficult to schedule (via cron) a run of 'amp --musicdir=<something> --refreshdb --scanmode=fast' nightly or maybe once a week. And if your music collection is fairly stable and static (i.e., you're not busily acquiring new material for it), then even monthly or quarterly refreshes will probably suffice.

I should note, too, that if you are using AMP to play music having specified --usedb then a record of your plays will be created within that database in a table called plays. It's pretty much identical to what gets scrobbled to Last.fm (if you enable that), or written to $HOME/.local/share/amp/scrobble.txt (ditto). Think of it as a sort of 'internal scrobble': it means you can later query the 'plays' table using standard sqlite3 client tools without ever having to transmit the information to anyone else. For this reason, AMP always writes to the plays table (if a database is being used), even if --noscrobble was specified.
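If you would rather poke at the plays table from a script than from the sqlite3 client, something like this Python sketch would do. It assumes only that the database is an SQLite file containing a table called plays; the column layout is not described here, so the function asks the database for it rather than guessing, and the path to the database file is left for you to supply.

```python
import sqlite3


def describe_plays(db_path):
    """Return the column names of the plays table and how many rows it holds."""
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA table_info returns one row per column; the name is field 1.
        columns = [row[1] for row in conn.execute("PRAGMA table_info(plays)")]
        (count,) = conn.execute("SELECT COUNT(*) FROM plays").fetchone()
    finally:
        conn.close()
    return columns, count
```

Once you know the column names, ordinary SELECT statements against plays will give you whatever listening statistics you fancy. Note that pointing this at a file without a plays table will raise an sqlite3.OperationalError.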

And one final point: the excludes.txt file still works, even in 'use database' mode: the individual lines of the excludes file (one composer per line) get turned into a SQL statement of the sort '...and artist not in ('Wolfgang Amadeus Mozart', 'George Frideric Handel', 'Johann Sebastian Bach')' and so on. The random picks from the database will therefore definitely not pick composers you've asked to be excluded (provided only that you spell their names in the excludes file exactly as you've spelled them when tagging the relevant music files, of course)!
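For the curious, here is a Python sketch of how a one-name-per-line file can become a 'not in' clause. It differs from the description above in one deliberate way: instead of pasting quoted names into the SQL text, it uses bound parameters (the ? placeholders), so that a name containing an apostrophe can't break the statement. The function names are made up, and the `artist` column is taken from the example clause above.

```python
def read_excludes(path):
    """Read one composer name per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def exclusion_clause(names):
    """Build ' AND artist NOT IN (?, ?, ...)' plus the values to bind to it."""
    if not names:
        return "", []  # nothing to exclude, so add nothing to the query
    placeholders = ", ".join("?" for _ in names)
    return f" AND artist NOT IN ({placeholders})", list(names)


clause, params = exclusion_clause(["Wolfgang Amadeus Mozart", "Johann Sebastian Bach"])
print(clause, params)
```

The returned clause is appended to the end of a query that already has a WHERE condition, and the list of names is passed alongside it when the query is executed. The match is exact, which is why the spelling in the file has to agree with the tags.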

I will be amending the AMP documentation shortly to explain this new database-reliant option. In the meantime, if you want to upgrade and try it out with the new options I've just described above, you can download Version 1.02 of AMP from here. To upgrade, once the download is complete, and assuming you downloaded AMP to your own Downloads folder, just issue the following command:

sudo mv $HOME/Downloads/amp.sh /usr/bin/amp.sh

I don't envisage making any further major alterations to the way AMP works for the foreseeable future, so this might be the last update for a while, and I therefore recommend you perform the upgrade as soon as you can. Any problems with the upgrade, you know how to let me know about them!
