Using beets and Python to filter naughty words in a large music collection out of radio automation.

WTUL's automation was hastily assembled during covid to keep us from going from a 24-hr live DJ station to dead air. I opted for using RadioDJ because it was free, regularly maintained, and had a decent user base. Luckily, we already had a gigantic digital music collection because we paid to have our vinyl digitized a few years ago. However, there wasn't much metadata past the essentials: artist, song title, album title, track number. While we screen our music pretty thoroughly for radio-unfriendly language, it's usually in the form of somebody scrawling "Track 3 sez Fuck" on a sticker on the album cover. That information doesn't make it into the digital files.

So naturally our automation has let some stuff slip. I was tasked with doing something about it. At first the only solutions I could see were:

a) Automation should barely be on anyway now that covid is over, we have so many DJs

b) Whitelisting music that we know is radio-friendly.

Both answers were bad. Try as we might, sometimes people no-show, more so than before we had automation. That's not my problem to solve though, I'm just a tech guy. Whitelisting safe music is very labor-intensive, and the people who would be best suited to it (music directors, station librarian, etc.) are overwhelmed as is.

Then I had a bit of a eureka moment. I knew it was flawed, but it was something, and it was a neat enough idea that I felt like I had to do it anyway.

The gist: scan our collection using an app that automatically embeds lyrics into music files. Search those files for naughty words. Update the RadioDJ database to put those files in an "obscene" category that doesn't get played. The biggest flaw is that the number of songs with lyrics available to scrape (genius.com etc.) is pretty low, especially for a library like ours with tons of unsigned, forgotten music. But it's something!

Part 1: Scan and embed lyrics

I tried a few apps for this and none of them were wonderful. MediaHuman, Mp3nity, and LyricsFinder all failed for me. I eventually settled on the command-line program Beets.

Beets isn't built specifically for this. It's a music library management app that has a lyrics searching plugin. We actually wanted to avoid a bunch of the finer features of this app because our folder structure, metadata, file namings etc were often done very intentionally. That said, it's a pretty cool app if you want to use those extra features, I'm a fan.

So my config file for beets looked like this

directory: ~/music
library: ~/data/musiclibrary.db
import:
  copy: no
  move: no
  write: yes
lyrics:
  auto: yes
  fallback: '.'
plugins: lyrics

So basically I told it not to move or copy any of my files, but that I do want it to write changes to the actual files. I included the lyrics plugin, told it to automatically search for lyrics, and to add a "." in the lyrics field if it didn't find anything. This was to keep it from searching the same lyric-less files every time I updated it, basically a way to mark it as a failure and to ignore it from then on.

Frustratingly, I forgot the line I would run to import, but the most important thing is that I used the -A option to not autotag anything. For whatever reason this did not go quickly for me, despite what their documentation says -- I briefly had it autotagging without user input and I genuinely think it went faster. For our very large library this took days, and I would occasionally have to check back in on it to make sure it hadn't stalled out (sometimes pressing enter would help it along) or to see if I had to restart it.

This is just getting songs into Beets' database. Thankfully the actual fetching of lyrics is much faster than you'd expect. And all you really have to do is type

beet lyrics

And it will automatically do its best to embed lyrics to those files.

And here's where it gets fun. All you have to do is type

beet list lyrics:shit

And you've got a list of songs that say "shit". I thought I would have to write a whole other script to parse these files for lyrics, but Beets has it built in! Obviously you can do it with other words too. See how many songs mention lizards! Probably not many! Certain words might need quotes. For instance, if you don't search for " ass " (with the spaces inside the quotes), you'll get results that include glass or sass etc. But with words like fuck you might want to leave the spaces out because you'll want it to match words like "fucking" or "fuckwad" or "motherfucker" or "fuckety-fuck" or "assfuck"... you get it.
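One catch with the padded-with-spaces trick: it misses the word when it sits next to punctuation or at the end of a line ("ass!" or "ass" followed by a line break). Here's a rough Python sketch of the difference between a plain substring match and a whole-word match. The function name find_word is made up for illustration and isn't part of Beets. Beets' query docs also describe a regular-expression form with a double colon, which may allow the same thing directly; treat that as untested here.

```python
import re


def find_word(lyrics, word, whole_word=True):
    """Return True if `word` appears in `lyrics`, ignoring case.

    With whole_word=True the word must stand alone (word boundaries on
    both sides); with whole_word=False any substring match counts.
    """
    pattern = re.escape(word)
    if whole_word:
        pattern = r"\b" + pattern + r"\b"
    return re.search(pattern, lyrics, re.IGNORECASE) is not None


sample = "She raised a glass, full of sass."
print(find_word(sample, "ass"))                    # whole-word match
print(find_word(sample, "ass", whole_word=False))  # substring match
```

Whole-word matching is what you'd want for "ass", and substring matching is what you'd want for "fuck", for the reasons above.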

Anyway, once we've had our fun with that, let's add the -p flag to make it show a file path and have it put the results into a document:

beet list -p lyrics:shit > shit_songs.txt

And now you have a .txt file containing the location of songs that say shit.
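If you have a whole list of words, typing that command over and over gets old. A minimal sketch of doing it from Python, assuming beet is on your PATH. The helper names and the word list in the usage comment are hypothetical, and the command it builds is the same one shown above.

```python
import subprocess


def beet_list_command(word):
    """Build the same command typed above, for one word."""
    return ["beet", "list", "-p", "lyrics:" + word]


def output_filename(word):
    """shit -> shit_songs.txt (spaces in the word become underscores)."""
    return word.strip().replace(" ", "_") + "_songs.txt"


def export_word(word):
    """Run beet for one word and save the paths it prints to a text file."""
    result = subprocess.run(
        beet_list_command(word), capture_output=True, text=True, check=True
    )
    with open(output_filename(word), "w", encoding="utf-8") as out:
        out.write(result.stdout)


# Usage (needs beets installed and a populated library):
#   for word in ["shit", "fuck"]:
#       export_word(word)
```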

Part 2: Updating your database

This whole thing feels very specific to our use case, and I'm guessing that if anybody finds value in this it'll either be for the lyrics-finding part or for the radiodj update part and not both together. But that's why they're here, just in case you get some inspiration.

Some things you'll need. If you want to use my script, you'll need Python on your computer. On Windows you can just install it from the Microsoft Store. You might also need to run pip3 install mysql-connector-python (the similarly named mysql-connector package is an older, outdated one) to make sure it can access a MySQL database.

You'll need to know some things about your RadioDJ database. Specifically, the name of the database, and your username and password to access it. You'll use those things to find one more thing: the id of the subcategory that you want to move these rude songs into. That subcategory has to already exist. If it doesn't, just go ahead and make one in RadioDJ.

There are many ways to find your subcategory id. One way is to use HeidiSQL: log into your database, click the "subcategory" table and then the "Data" tab. Find the name of your subcategory and the ID (not parentid) that matches it. In my case it was 32, so I wrote down id_subcat=32 somewhere where I could keep track of it.
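If you'd rather not click around, the same lookup can be sketched in Python. This assumes the table is called subcategory and has ID and name columns. The table name and ID come from the steps above, but the name column is a guess, so check it against your own database. The function takes any database cursor, for example one from mysql.connector, and "Obscene" in the usage comment is a hypothetical category name.

```python
def find_subcategory_id(cursor, name):
    """Look up a subcategory's ID by its name.

    Assumes a `subcategory` table with `ID` and `name` columns; the
    `name` column is an assumption, so verify it in your database.
    """
    cursor.execute("SELECT ID FROM subcategory WHERE name = %s", (name,))
    rows = cursor.fetchall()  # read everything so no results are left pending
    return rows[0][0] if rows else None


# Usage, with a cursor from an open mysql.connector connection:
#   print(find_subcategory_id(mycursor, "Obscene"))
```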

OK, now the script. I didn't know any python before I wrote this so please excuse me. Open up a text editor and copy the following.

import sys

import mysql.connector

print(sys.argv)

# the text file of paths is the first command-line argument
naughtyfile = sys.argv[1]

# start as None so the finally block is safe if the connection never opens
mydb = None
try:
    # mysql connection
    mydb = mysql.connector.connect(
        host="localhost",
        user="root",
        password="YOURPASSWORD",
        database="YOURDATABASE"
    )
    if mydb.is_connected():
        print("db connected!")
    mycursor = mydb.cursor()

    # words to look for to determine if it's clean. We have a LOT of files with the word clean in the filename that might not be, so need to be specific
    matches = ["clean edit", "[clean]", "(clean)"]

    # "with" closes the file for us when the loop is done
    with open(naughtyfile, 'r') as file:
        for line in file:
            naughtypath = line.strip()
            if not naughtypath:
                continue  # skip blank lines

            # check if mistakenly labelled dirty
            if any(x in naughtypath.lower() for x in matches):
                print(naughtypath + " is probably clean")

            # if not found to be clean...
            else:
                input_data = (naughtypath,)

                # move the song into the "obscene" subcategory (32 in my database)
                updatequery = "UPDATE songs SET id_subcat=32 WHERE path = %s"
                mycursor.execute(updatequery, input_data)
                mydb.commit()

                # read the row back to confirm the change
                confirmquery = "SELECT artist, title, id_subcat FROM songs WHERE path = %s"
                mycursor.execute(confirmquery, input_data)
                myresult = mycursor.fetchall()

                for x in myresult:
                    print(x)
except mysql.connector.Error as error:
    print("FAILED!: {}".format(error))
finally:
    if mydb is not None and mydb.is_connected():
        mydb.close()
        print("sql connection closed")

As for what that's doing:

We import the mysql.connector and sys libraries. The first lets us talk to MySQL, and sys lets us take an argument from the command line.

Then it connects to the database.

It checks the file path for phrases that suggest the track is clean despite whatever lyrics Beets downloaded (basically, does it say something like "Clean Edit" in the filename?). If it sees one, it prints out that the file is probably clean and leaves it alone.
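That check is easy to try on its own before you let anything touch the database. A small sketch, using the same three phrases as the script. The function name looks_clean and the example paths are made up for illustration.

```python
CLEAN_MARKERS = ["clean edit", "[clean]", "(clean)"]


def looks_clean(path, markers=CLEAN_MARKERS):
    """Return True if the path contains one of the "clean" phrases."""
    lowered = path.lower()
    return any(marker in lowered for marker in markers)


# Hypothetical paths, just to show the behaviour
print(looks_clean("C:/music/Artist/Album/03 Song (Clean).mp3"))
print(looks_clean("C:/music/Artist/Album/04 Clean Slate.mp3"))
```

The second path has "clean" in the title but none of the specific phrases, which is exactly why the list is that picky.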

Otherwise it goes ahead and updates the database and changes its subcategory to 32. Remember, 32 is specific to my database, it's likely different in yours. Change it!
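If editing the number inside the script feels error-prone, one option is to pass it on the command line instead. This is a sketch of that idea and not what the script above does: the --subcat option is hypothetical, and the query takes the id as a parameter in place of the hard-coded 32.

```python
import argparse


def parse_args(argv=None):
    """Read the paths file and the subcategory id from the command line."""
    parser = argparse.ArgumentParser(
        description="Move listed songs into a RadioDJ subcategory."
    )
    parser.add_argument("naughtyfile", help="text file of song paths, one per line")
    parser.add_argument(
        "--subcat", type=int, required=True,
        help="id of the subcategory to move songs into (hypothetical option)",
    )
    return parser.parse_args(argv)


# Same UPDATE as the script, but with the subcategory id as a parameter
UPDATE_QUERY = "UPDATE songs SET id_subcat = %s WHERE path = %s"

# Usage inside the script's loop would look something like:
#   args = parse_args()
#   mycursor.execute(UPDATE_QUERY, (args.subcat, naughtypath))
```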

Then it goes right back and finds the file we just updated and spits out what it finds in the database. It should show the artist name, title, and the subcategory you were hoping for.

If any of this doesn't work, it should throw an error. No matter what, it should close the sql connection when it's done.

Save it, ideally alongside the text files you made in Part 1. In this case I'll call this script naughtywords.py

To Run It

Go to a command prompt and type

python naughtywords.py shit_songs.txt

And hopefully it will have moved all the songs with the word shit into a category that radiodj doesn't touch.

From there, go ahead and make new text files for other undesirable words (using the beet list command from Part 1), then run the script again on each one.
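Since plenty of songs will show up under more than one word, it can be handy to merge the text files into one de-duplicated list first. A rough sketch, assuming the files are UTF-8 with one path per line. Your shell may write a different encoding, so adjust if you see garbage. The function name collect_paths is made up.

```python
def collect_paths(filenames, encoding="utf-8"):
    """Merge several path lists into one, dropping blanks and duplicates.

    Keeps the order in which each path was first seen.
    """
    seen = set()
    ordered = []
    for filename in filenames:
        with open(filename, "r", encoding=encoding) as handle:
            for line in handle:
                path = line.strip()
                if path and path not in seen:
                    seen.add(path)
                    ordered.append(path)
    return ordered


# Usage:
#   paths = collect_paths(["shit_songs.txt", "fuck_songs.txt"])
```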

A few notes

I did all this to essentially be run once. I'm hoping that from here we can blacklist new songs that come in, especially since, as I said before, this process is absolutely not catching everything that comes in.

That said, I hope there's something in here that can give you an idea for your own use case. I'd be amazed if there were many people trying to solve the exact same thing I'm doing here, but my script could likely be adapted to another automation system, or maybe just the Beets bit was helpful.

I'm actually hoping that parts of this could be used to screen our library for songs that are too popular for our station that tries to serve underplayed music. I'm having trouble finding a good way to scrape for that sort of data though.

Anyway, hope it helps. If you have any pointers or wish to just say thanks, email tech at wtulneworleans.com.


WTUL New Orleans 91.5 FM