Predicting Song Skipping on Spotify

Using LightGBM to predict my song-skipping habits based solely on audio features

Introduction

In early 2019, Spotify shared some interesting statistics about its platform. Out of more than 35 million songs on the service, Spotify users had created over 2 billion playlists (Oskar Stål, 2019). I think of music taste as being like DNA: very diverse across 7 billion people, yet the building blocks (nucleotides/songs) are the same. As a result, inferring a user’s music taste is challenging, which matters because Spotify’s business model relies on its ability to recommend new songs.

Problem Statement

Spotify doesn’t have a dislike button, so skipped songs are the subtle cues we need to learn from to infer music taste. In this project, I use my 2019 Spotify streaming history to build a predictive model that anticipates whether I will skip a song based solely on its audio features.

You can request your own Spotify streaming history from the privacy settings of your Spotify account.

Data Descriptions

After requesting my Spotify data, I received an email with a ZIP file containing every song I listened to in 2019, the artist’s name, and the streaming duration. The data processing is as follows:

  1. I filter out podcasts and analyze only songs.
  2. I use the Spotify API to retrieve each song’s unique ID and audio features.
  3. I compute the gap between the song’s length and the duration I streamed it for. If the gap exceeds 60 seconds, I infer that the song was skipped.

Below is a detailed Python implementation of the steps.
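As a minimal sketch of the skip-labeling rule in step 3 only (not the full pipeline), the plain-Python snippet below marks a stream as skipped when more than 60 seconds of the track were left unplayed. The field names `msPlayed` and `duration_ms` and the helper names are assumptions for illustration, and the two records are made up.

```python
# Sketch of step 3: a stream counts as a skip when the unplayed
# remainder of the track exceeds 60 seconds.
# Assumed field names: "msPlayed" (time streamed, in ms) and
# "duration_ms" (track length, in ms). Both are illustrative.

SKIP_GAP_MS = 60_000  # the 60-second threshold from step 3


def label_skip(ms_played: int, duration_ms: int, gap_ms: int = SKIP_GAP_MS) -> int:
    """Return 1 if the stream counts as a skip, else 0."""
    return int(duration_ms - ms_played > gap_ms)


def label_streams(streams: list[dict]) -> list[dict]:
    """Return copies of the stream records with a 'skipped' label added."""
    return [
        {**s, "skipped": label_skip(s["msPlayed"], s["duration_ms"])}
        for s in streams
    ]


# Made-up records, for illustration only.
streams = [
    {"msPlayed": 200_000, "duration_ms": 210_000},  # 10 s left unplayed
    {"msPlayed": 30_000, "duration_ms": 240_000},   # 210 s left unplayed
]
labeled = label_streams(streams)
```

One consequence of this rule is that a track shorter than 60 seconds can never be labeled as skipped, because its gap cannot exceed the threshold.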

Since the goal is to test whether audio characteristics alone can predict song skipping, I dropped the columns that contain the song’s title and artist.

The final dataset has the following columns:

[Image: columns of the final dataset]

Assumptions

A crucial step in modeling is to lay out all the assumptions and limitations in order to properly interpret the result. Some assumptions are due to the data collection process and others are part of the modeling process:

  • The user’s music taste is homogeneous, i.e., the mechanism that leads a user to skip a song is static across time.
  • Songs are broken down into audio features, so the lyrics are not interpreted as natural language text. This limitation is important to consider, since lyrical meaning can be a strong predictor of song skipping.

Modeling

I use LightGBM binary classification to infer my song-skipping habits based solely on audio features.

Bayesian Optimization

LightGBM has many hyperparameters, so instead of exhaustively searching all of their possible values, I used Bayesian Optimization for hyperparameter tuning.

Results & Discussion

The model performs better with personalized data, reaching an accuracy of 74.17% (at the 28th iteration of Bayesian Optimization). The assumption that Spotify users are homogeneous is a strong one, and the performance could be improved by gathering more user-level details.

Overall, recommendation engines require both personalized learning about the user and general learning about the songs. In this project, I experimented with machine learning classification using only audio features, audio and user features, and my personal listening history. A further investigation might include the causal relationships between the covariates because perhaps understanding the mechanism by which the data is generated is more informative than curve-fitting.

References

  • Oskar Stål. 2019. Music Recommendations at Spotify. Nordic Data Science and Machine Learning Summit. https://youtu.be/2VvM98flwq0
  • Brian Brost, Rishabh Mehrotra, and Tristan Jehan. 2019. The Music Streaming Sessions Dataset. In Proceedings of the 2019 World Wide Web Conference (WWW ’19), May 13–17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3308558.3313641
