In this project, I built a text classification model on song lyrics. The task is to predict the artist from a piece of text. To train such a model, I first needed to collect a lyrics dataset by scraping lyrics.com. The project consists of the following tasks:

- Download an HTML page with links to songs
- Extract hyperlinks of song pages
- Download and extract the song lyrics
- Vectorize the text using the Bag Of Words method
- Train a classification model that predicts the artist from a piece of text
- Refactor the code into functions
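The final modelling steps above can be sketched end to end. This is a minimal sketch assuming scikit-learn and a toy corpus; in the real project, the corpus would be the scraped lyrics and the labels the artists:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data standing in for the scraped lyrics (assumed for illustration)
corpus = ["like a prayer", "we will rock you", "material girl", "bohemian rhapsody"]
artists = ["Madonna", "Queen", "Madonna", "Queen"]

# Bag of Words vectorizer feeding a Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(corpus, artists)

print(model.predict(["material girl"]))
```

The pipeline keeps vectorizer and classifier together, so new text is vectorized with the same vocabulary the model was trained on.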

Download the URLs of all songs by your favourite artist. Go to the page listing all songs of your favourite artist on lyrics.com.

- Copy the URL
- Use the requests module to download that page
- Save the page to a text file
- Open the page in a text editor
- Examine the HTML code and look for links to songs
- Extract all links using Regular Expressions
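The download-and-extract steps can be sketched as below. The URL is a placeholder and the regex pattern is an assumption about how lyrics.com marks up song links; verify both against the saved HTML before relying on them:

```python
import re
import requests

def download_page(url, path):
    """Download a page with requests and save it for inspection in a text editor."""
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html

def extract_song_links(html):
    # On lyrics.com, song links appear to look like href="/lyric/12345/Artist/Title";
    # this pattern is an assumption -- check it against the HTML you saved.
    return re.findall(r'href="(/lyric/[^"]+)"', html)

# Demonstration on a small HTML snippet; a live run would call download_page first:
snippet = '<td><strong><a href="/lyric/123/Artist/Song">Song</a></strong></td>'
print(extract_song_links(snippet))  # ['/lyric/123/Artist/Song']
```

Relative links like `/lyric/...` need to be prefixed with `https://www.lyrics.com` before the song pages themselves can be downloaded.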

A prior is the assumed probability of an event before taking any data into account. For instance, if we look at the word “yeah” in documents, we would expect it to occur at its average frequency across all documents before looking at any individual document. The probability associated with that average frequency is the prior of “yeah”.

$$P(class)=\frac{{n}_{class}}{{n}_{total}}$$
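The class prior is just a relative frequency, as a quick sketch with hypothetical artist labels shows:

```python
from collections import Counter

# Hypothetical training labels: the artist of each document
labels = ["Madonna", "Madonna", "Queen", "Madonna", "Queen"]

counts = Counter(labels)          # n_class per artist
n_total = len(labels)             # n_total
priors = {artist: n / n_total for artist, n in counts.items()}
print(priors)  # {'Madonna': 0.6, 'Queen': 0.4}
```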

The posterior probability is the probability after looking at the data. In Naive Bayes, we can describe it using Bayes' Theorem:

$$P(class|data)=\frac{P(data|class)\ast P(class)}{P(data)}$$

Here, P(class) is the prior and P(data) is called the marginal probability. In a classifier, we can usually ignore the latter, because we only need the ratio between the classes. Bayes' Theorem is most useful when P(data|class) is easier to calculate than P(class|data).
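A small numeric sketch makes the "ignore P(data)" point concrete. The likelihoods and priors below are toy numbers assumed for illustration, not values from the project:

```python
# Toy values: probability of observing the data (e.g. the word "yeah")
# given each artist, and the class priors.
likelihood = {"Madonna": 0.30, "Queen": 0.05}   # P(data|class)
prior      = {"Madonna": 0.60, "Queen": 0.40}   # P(class)

# Unnormalized posteriors: P(data|class) * P(class).
# P(data) is the same for every class, so it cancels when comparing them.
scores = {c: likelihood[c] * prior[c] for c in prior}
prediction = max(scores, key=scores.get)
print(prediction)  # Madonna

# Dividing by the sum of the scores is the same as dividing by P(data),
# which recovers the proper posterior probabilities:
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
print(posteriors)  # {'Madonna': 0.9, 'Queen': 0.1}
```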
