Classifying Genre Using Machine Learning, Part I

In this module, we will use machine learning to classify songs by genre.

This module is the first in a series on classifying songs by genre using machine learning. It takes a simplified approach to machine learning, introducing fundamental concepts along the way. We will use tools from the Orange Data Mining Library.

Machine learning is a term used to describe many different ways that computers solve problems with various degrees of autonomy. It is regarded as a subfield of artificial intelligence, or AI.

One common application for machine learning is what is known as a classification problem. In a classification problem, we seek to categorize the things being studied according to their observed or known qualities. Machine learning is helpful when we are not sure exactly how these qualities relate to the broader category.

For example, an email provider might use machine learning to attempt to classify incoming mail as spam or not. Let’s say that when a user receives a spam message, they flag it as spam. The machine learning algorithm looks for patterns among all of the messages flagged as spam, such as the presence of specific keywords, a characteristic syntax, or a certain length. This information can be used to predict whether future messages are likely to be spam and should be automatically routed away from the user’s inbox.

Musical genre is an excellent subject for a classification problem because genre tends to be easy to identify superficially, but is notoriously difficult to explain. In this module, we will use a collection of about 7,000 songs as the basis for a simple genre classification program.

First, install the Orange data mining library using the command line:

pip3 install orange3

Or, if the pip3 command is not available but pip already points to a Python 3 installation:

pip install orange3

Once you have installed Orange, download the song collection here. This is a simplified version of a collection originally drawn from Spotify’s database and available in full on Kaggle. Move the zip file to an easy-to-access location on your computer and unzip it. The zip file contains two text files: one labeled “training-short.tab” and one labeled “testing-short.tab.”

For this module, I have divided the song collection into two sets of data. In machine learning, we use what is known as “training” data to train our algorithm, and we use “testing” data to verify how well it performs on similar but unfamiliar data.
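The split used in this module was made ahead of time, and the exact procedure is not described here. As a rough illustration only, one common way to divide a collection is to shuffle it and set aside a fraction for testing. The function name split_rows, the 80/20 ratio, and the placeholder song names are assumptions made for this sketch:

```python
import random

def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (training, testing) lists."""
    shuffled = list(rows)                  # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder data standing in for rows of the song collection.
songs = [f"song_{i}" for i in range(10)]
training, testing = split_rows(songs)
print(len(training), len(testing))
```

Keeping the two sets separate matters because a model that is tested on songs it has already seen will look more accurate than it really is.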

Let’s import the data into Python. First, we import the Orange library:

import Orange as orange

Next, we’ll import each of the text files. Make sure to replace the file path with the appropriate path on your computer:

trainData = orange.data.Table('/Users/my_user_name/Desktop/training-short.tab')
testData = orange.data.Table('/Users/my_user_name/Desktop/testing-short.tab')

(You may get an error about an “invalid byte” at the end of one or both files. If so, don’t worry! It just has to do with the file format and doesn’t affect the data.)

Let’s take a moment to talk about what is actually in these files. Each file consists of several thousand rows representing different songs. Information about each song is stored across seven columns. The first two columns give the title and artist of the song. The next four columns give four musical qualities derived automatically from the audio files in Spotify’s database, known as “Audio Features.” These features are what we’ll use to determine the genre of a given song. They are: danceability, energy, tempo, and time signature. We’ll talk more about audio features and how they are derived in a future module.

To simplify the classification problem further, rather than trying to classify multiple genres at once, we’ll limit ourselves to a single genre: let’s say, electronic dance music or EDM. The last column uses a “1” to indicate that a song is considered EDM and a “0” to indicate that it is not. When we’re training our algorithm, we’ll use this value to tell the system whether the qualities of a given song should be associated with the EDM genre or not. Later on, when we’re testing our algorithm, the algorithm will not take this label into account; we’ll just compare the label with the computer’s guess to see how well it did.
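To make the seven-column layout concrete, here is a small sketch of how one row might be separated into descriptive fields, features, and label. The column names and the two sample rows are invented for illustration and are not taken from the actual files. Orange handles this parsing for you when it loads a table, so this is only meant to show the idea.

```python
import csv
import io

# Two made-up rows in a tab-separated layout like the one described above.
# Column names and values are hypothetical, not copied from the real files.
sample = (
    "title\tartist\tdanceability\tenergy\ttempo\ttime_signature\tedm\n"
    "Example Song A\tExample Artist A\t0.81\t0.92\t128.0\t4\t1\n"
    "Example Song B\tExample Artist B\t0.45\t0.30\t92.5\t3\t0\n"
)

feature_names = ["danceability", "energy", "tempo", "time_signature"]

parsed = []
for row in csv.DictReader(io.StringIO(sample), delimiter="\t"):
    features = [float(row[name]) for name in feature_names]  # what the algorithm learns from
    label = int(row["edm"])                                  # 1 = EDM, 0 = not EDM
    parsed.append((row["title"], features, label))

for title, features, label in parsed:
    print(title, features, label)
```

The title and artist identify the song for a human reader, while only the four numeric features and the label play a role in training and testing.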

Once we have our data loaded in, Orange makes it easy to build our machine learning algorithm. We’ll use a common algorithm known as “K-nearest neighbor” or KNN. First, we’ll build our learning algorithm:

mlLearner = orange.classification.KNNLearner()

And then we’ll train it on the training data:

mlClassifier = mlLearner(trainData)

Pretty simple, right?
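The idea behind k-nearest neighbor is fairly intuitive: to classify a new song, find the k training songs whose features are most similar and let them vote. The following is a minimal pure-Python sketch of that idea, not Orange’s implementation. The function knn_predict, the choice of k=3, the use of straight-line (Euclidean) distance, and the tiny two-feature data set are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(training, new_features, k=3):
    """Guess a label by majority vote among the k closest training examples.

    training is a list of (features, label) pairs.
    """
    # Sort training examples by straight-line distance to the new example.
    by_distance = sorted(
        training,
        key=lambda pair: math.dist(pair[0], new_features),
    )
    # Take the labels of the k nearest examples and return the most common one.
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up (danceability, energy) values; 1 = EDM, 0 = not EDM.
training = [
    ((0.90, 0.95), 1),
    ((0.85, 0.80), 1),
    ((0.80, 0.90), 1),
    ((0.30, 0.20), 0),
    ((0.40, 0.35), 0),
    ((0.25, 0.40), 0),
]

print(knn_predict(training, (0.82, 0.88)))  # closest to the EDM examples
print(knn_predict(training, (0.35, 0.30)))  # closest to the non-EDM examples
```

Because the vote depends on distance, a feature with a large numeric range (such as tempo in beats per minute) can outweigh the others unless the values are rescaled first.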

Now all that remains to do is to test our algorithm. First, let’s create some variables to help us keep track of how frequently our model is correct:

mlCorrect = 0
total = len(testData)

Next, we’ll use a for loop to iterate over the test data. In this block of code, the ML algorithm makes a guess for each row (song) in the test data, then compares it with the correct answer. Remember that the correct answer as to whether a song is EDM or not is indicated by the value (0 or 1) in the seventh column. We use the .get_class() method because the genre is the “class” we are trying to classify.

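for testRow in testData:
    # Ask the trained classifier for its guess about this song.
    mlGuess = mlClassifier(testRow)
    # Look up the true label (0 or 1) stored in the class column.
    realAnswer = testRow.get_class()
    # Count the guess if it matches the true label.
    if mlGuess == realAnswer:
        mlCorrect += 1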

Let’s see how we did by printing our results as a percentage:

print(f'{mlCorrect * 100 / total:2.0f}% Right')

> 88% Right

88% correct is not too bad for starters! This means that based on the qualities we looked at (danceability, energy, tempo, and time signature), we correctly determined whether an unknown track was EDM or not for 88% of the songs in the test data.
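The percentage is simply the number of correct guesses divided by the number of songs tested. As a sketch of that arithmetic with made-up guesses and answers (the accuracy helper and the lists below are illustrative, not output from the model above):

```python
def accuracy(guesses, answers):
    """Return the percentage of guesses that match the corresponding answers."""
    correct = sum(1 for guess, answer in zip(guesses, answers) if guess == answer)
    return correct * 100 / len(answers)

# Made-up labels: 25 songs, of which the hypothetical model gets 22 right.
answers = [1] * 5 + [0] * 20
guesses = [1] * 4 + [0] * 1 + [0] * 18 + [1] * 2

print(f"{accuracy(guesses, answers):2.0f}% Right")
```

Here 22 correct guesses out of 25 songs works out to 88%, the same kind of figure reported above.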

Of course, you probably have lots of questions about the process and results. What features were most important in determining genre? How many of our samples were actually EDM? How many genres were represented? Who decides the “correct” genre? These are all excellent questions and I look forward to exploring them in the next module.

Extensions

  1. Explore the Orange library and see if you get different results using a different algorithm.
  2. The algorithms for Spotify’s audio features are proprietary. However, if you were asked to develop an algorithm to capture the “danceability” or “energy” of a song, what factors would you consider?

Further Reading

Andrew Ng offers a fantastic video introduction to fundamental concepts and methods in machine learning. Parts of this module were adapted from Michael Cuthbert’s video lesson here.