Using Natural Language Processing to Analyze Spotify 2019 Top Artists
Introduction:
One of Spotify’s greatest features is the personalized yearly recap created for each user. It goes through a user’s most listened to artists, albums, playlists, and overall activity. This recap includes interactive visual effects that are easy to share on social media pages.
As part of the release of the end-of-year recap, Spotify published an article highlighting top user activity over the past year and decade. The article even includes a “Playlist of the Decade” with the most streamed songs according to fans. You can read the article here.
According to the article, the most-streamed artists of 2019 were Post Malone, Billie Eilish, and Ariana Grande. In addition, the most-streamed albums of 2019 came from these same artists, with Billie Eilish’s debut album “WHEN WE FALL ASLEEP, WHERE DO WE GO?” taking the number one position. I decided to analyze each of these artists, with the addition of Drake, as he is the most-streamed artist globally of the past decade.
This article will analyze the following albums:
- WHEN WE FALL ASLEEP, WHERE DO WE GO? — Billie Eilish (2019)
- Hollywood’s Bleeding — Post Malone (2019)
- Thank u, next — Ariana Grande (2019)
- Care Package — Drake (2019)
I wanted to use this opportunity to practice web scraping on a new site, exploratory data analysis (EDA), and NLP techniques.
Methods:
Part 1: Web scraping & Data Cleaning
Genius is a website that hosts song lyrics for most publicly released albums and lets users annotate them. It has a well-documented API that is easy to use (https://docs.genius.com/). Before requesting anything through the API, you must create an account and register as an API client, which provides you with a client_id and client_secret. You will need these credentials to authenticate your requests.
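As a rough illustration of how those credentials are used, the sketch below builds (but does not send) a search request for the Genius API using only Python's standard library. It assumes you have generated a client access token on your API client page. The token value and the helper name build_search_request are placeholders of mine, and the /search endpoint with a Bearer token header reflects my reading of the Genius documentation rather than code from this project.

```python
import urllib.parse
import urllib.request

API_ROOT = "https://api.genius.com"


def build_search_request(query, access_token):
    """Build (but do not send) a GET request for the Genius search endpoint."""
    url = f"{API_ROOT}/search?{urllib.parse.urlencode({'q': query})}"
    # The access token is passed in a standard Bearer authorization header.
    headers = {"Authorization": f"Bearer {access_token}"}
    return urllib.request.Request(url, headers=headers)


# "YOUR_ACCESS_TOKEN" is a placeholder; substitute the token from your API client.
request = build_search_request("Post Malone", "YOUR_ACCESS_TOKEN")
print(request.full_url)
```

Sending the request with urllib.request.urlopen(request) and decoding the JSON body would be the next step. It is left out here so the sketch runs without network access.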
First, I scraped each album’s main page with the Python library BeautifulSoup to get the song titles and the link to each individual song page. Then, I followed each song link to scrape that song’s lyrics.
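The first of those two steps might look something like the sketch below. The HTML snippet, the track-link class name, and the extract_song_links helper are all hypothetical stand-ins, since the real Genius markup is different and changes over time. Only the BeautifulSoup calls themselves are real.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for an album page; real Genius markup differs.
SAMPLE_ALBUM_HTML = """
<div class="tracklist">
  <a class="track-link" href="https://genius.com/example-song-one-lyrics">Song One</a>
  <a class="track-link" href="https://genius.com/example-song-two-lyrics">Song Two</a>
</div>
"""


def extract_song_links(album_html):
    """Return (title, url) pairs for every track link found in the album page."""
    soup = BeautifulSoup(album_html, "html.parser")
    links = soup.find_all("a", class_="track-link")
    return [(a.get_text(strip=True), a["href"]) for a in links]


for title, url in extract_song_links(SAMPLE_ALBUM_HTML):
    print(title, url)
```

Each returned URL would then be fetched and parsed the same way to pull out the lyrics.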
After each album was scraped, I took five steps to clean the lyrics data. (1) I removed all punctuation from the lyrics column, so that punctuation would not interfere with the n-gram analysis. (2) I removed a custom list of words. Lyrics scraped from Genius include section labels such as “Chorus” and “Bridge”, and I didn’t want these to inflate the word counts. (3) I lowercased all words in the column. (4) I created a word count for each song by splitting the lyrics on spaces and counting the resulting strings. This would be important to my analysis later. (5) I created a unique word count column to show how many distinct words a song uses, as opposed to its overall word count.
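The five cleaning steps can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual code. The label list and function names are assumptions, and the sketch lowercases before filtering the labels so that a single lowercase list is enough.

```python
import string

# Hypothetical list of section labels to drop; the project's actual list may differ.
SECTION_LABELS = {"chorus", "bridge", "verse", "intro", "outro"}


def clean_lyrics(raw_lyrics, labels=SECTION_LABELS):
    """Strip punctuation, lowercase, and drop section-label words."""
    no_punct = raw_lyrics.translate(str.maketrans("", "", string.punctuation))
    words = no_punct.lower().split()  # split on any whitespace, including newlines
    return [word for word in words if word not in labels]


def word_counts(words):
    """Return (total word count, unique word count) for a cleaned word list."""
    return len(words), len(set(words))


sample = "[Chorus]\nThank you, next (Next)\nThank you, next"
cleaned = clean_lyrics(sample)
print(cleaned, word_counts(cleaned))
```

One caveat of this approach is that it also removes any genuine lyric that happens to use a label word, such as a line that mentions a bridge.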
In Tableau, I graphed these two counts side by side for comparison. Post Malone had the widest range of word counts across his songs. Die For Me (Ft. Future & Halsey) stands out with 955 words. The three songs with the next-highest word counts are all from Drake, including Club Paradise at 781 words. Ariana Grande’s word count is consistent throughout her album, ranging between 280 and 545. One of Billie Eilish’s songs has fewer than 20 words, but that’s because the song is only about 15 seconds long.
Jodeci Freestyle by Drake (Ft. Dennis Graham & J. Cole) had the highest number of unique words. This makes sense, as Drake is a rapper and the song features two other artists. To my surprise, most of the songs on Post Malone’s album were within or above the range of unique words on Drake’s album. All of the songs on Ariana’s thank u, next album contain between 23 and 26 unique words. Although Ariana has a higher overall word count than Billie Eilish, that did not carry over to the number of unique words in a song.
Part 2: Sentiment Analysis
NLTK’s VADER SentimentIntensityAnalyzer was created mainly to analyze the sentiment of social media posts. Song lyrics are generally much longer than a typical social media post, but I decided to try the library on the lyrics anyway.
Using the Python libraries Matplotlib and Seaborn, I graphed each of the sentiment scores that the analyzer returns. All scores range from 0.0 to 1.0 except the compound sentiment score, which ranges from -1.0 to 1.0. The compound score was very volatile across all artists. Post Malone’s and Drake’s albums showed the most volatility, with an almost 50/50 split between compound scores of 1.0 and -1.0. Ariana’s only -1.0 compound score came from her song “bad idea.” Considering the high word count and the small number of unique words in her songs, a likely explanation is that the song repeats the phrase “bad idea” often enough to drive the compound score down. The swings in the compound score made it hard to read the other sentiment scores on the same chart. Below is a Tableau graph with only the positive and negative scores.
The positive and negative sentiment scores are each on a scale from 0 to 1. Based on these two charts, we can see that the songs on thank u, next are rated the most positively of all the albums. Post Malone’s Hollywood’s Bleeding has a higher sentiment score compared to the rest of the albums.
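One likely reason the compound score swings to the extremes on full songs is the way VADER normalizes it. As I understand the implementation, the summed valence of all the words in a text is squashed into the range -1.0 to 1.0 with x / sqrt(x² + α), where α = 15. A long text with many sentiment-bearing words therefore saturates quickly. The sketch below reproduces only that normalization step. The raw sums are made-up illustrative values, not results from this project.

```python
import math


def normalize(score, alpha=15):
    """Squash a raw valence sum into [-1.0, 1.0], mirroring VADER's normalization."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))


# Illustrative raw sums: a short post might total a few points, a full song far more.
for raw_sum in (1.0, 4.0, 20.0, 50.0, -50.0):
    print(f"raw sum {raw_sum:6.1f} -> compound {normalize(raw_sum):+.4f}")
```

Under this normalization, a song only needs a moderately one-sided total for its compound score to sit near one end of the scale. That would be consistent with the near 50/50 split described above.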
Part 3: Most Common Terms
One of the most common steps in NLP analysis is looking at the most frequent terms in the dataset. I graphed the most common unigrams, bigrams, trigrams, and four-grams used throughout each album. This was done with CountVectorizer and TfidfVectorizer from sklearn.feature_extraction.text. CountVectorizer counts how often each term appears across all of the lyrics. TfidfVectorizer weights each term by how often it appears in a document, discounted by how many documents contain it, so it highlights the terms that are most distinctive for each song.
Post Malone had the most common use of terms in his album. This is expected considering his wide range of unique word counts. Ariana Grande and Billie Eilish, who had the narrowest ranges of unique words, had common terms that were the titles of their songs. For example, Ariana’s most common term was the title of her album and single, “thank u, next.” Billie’s most common terms included “Invisalign” and “my strange addiction”, both of which come from songs on her album. Drake’s most common terms across the board included the F-bomb. Another common term was “say my name”, which most Drake fans will recognize as a signature lyric from his well-known song “Girls Love Beyonce”.
Part 4: LDA — Topic modeling
I decided to practice with LatentDirichletAllocation from sklearn.decomposition, which discovers the topics present in a corpus of text. After practicing with it, I realized it wasn’t well suited to the amount of data I was working with. There were only 12 to 17 documents (songs) available for each artist. The LDA model produced many topics that were similar or identical to the most common terms discovered for each artist. LDA works best on a corpus with a large number of documents.
Conclusion:
It’s no surprise that each of these artists charted on Spotify’s top lists of the year and decade. Each of them is consistently played on radio stations and tops the charts for popular music.
Ariana Grande and Billie Eilish fall under the pop genre. Their word counts may have been high, but their unique word counts weren’t. This can make their songs easy for fans to memorize and quickly sing along with. Drake, a rapper, and Post Malone, who falls under the hip-hop category, had significantly larger numbers of total words and unique words. Hip-hop and rap are known as genres full of creative storytelling, and that’s exactly what both of them accomplished on their 2019 albums.
I consider this project a success because I challenged myself by scraping a website that was new to me. I practiced working with Python libraries I hadn’t touched before, and I challenged myself with Tableau. As I work to improve my skills in 2020, I hope to repeat this project with Spotify’s 2020 user recap data.
Thank you for reading!
Project Repo: https://github.com/briannalytle/spotify2019_lyrics