Alan Craig Data Science Portfolio


Text Classification: Naive Bayes Classifier


In this project, I built a text classification model on song lyrics. The task is to predict the artist from a piece of text. To train such a model, I first needed to collect a lyrics dataset by scraping lyrics.com. The project consists of the following tasks:


- Go to the page listing all songs of your favourite artist on lyrics.com.
- Download the URLs of all songs of that artist.
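The URL-collection step could be sketched like this. The HTML snippet and the `/lyric/` link pattern are illustrative assumptions, not the real lyrics.com markup; a real run would fetch the artist page with `requests` first:

```python
from html.parser import HTMLParser

class SongLinkParser(HTMLParser):
    """Collects hrefs of links that look like song-lyric pages."""

    def __init__(self):
        super().__init__()
        self.song_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            # Assumed pattern: song pages live under /lyric/
            if href.startswith("/lyric/"):
                self.song_urls.append("https://www.lyrics.com" + href)

# Illustrative artist-page markup (in practice this comes from an HTTP request)
page = """
<table>
  <tr><td><a href="/lyric/123/Song-One">Song One</a></td></tr>
  <tr><td><a href="/lyric/456/Song-Two">Song Two</a></td></tr>
  <tr><td><a href="/artist/789">Artist link (skipped)</a></td></tr>
</table>
"""

parser = SongLinkParser()
parser.feed(page)
print(parser.song_urls)
```

Using the standard library's `HTMLParser` keeps the sketch dependency-free; with BeautifulSoup the same filter would be a one-liner.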


What is a prior?

A prior is the assumed probability of an event before taking any data into account. For instance, if we look at the word “yeah” in documents, we would expect it to occur with the average frequency over all documents, before looking at an individual document. The probability associated with the average frequency is the prior of “yeah”.

P(class) = n_class / n_total
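With a labelled corpus, the prior is simply each artist's share of the documents. A minimal sketch, with made-up labels standing in for the scraped dataset:

```python
from collections import Counter

# Hypothetical document labels (one entry per song in the corpus)
labels = ["adele", "adele", "eminem", "adele", "eminem"]

counts = Counter(labels)
n_total = len(labels)

# P(class) = n_class / n_total
priors = {artist: n / n_total for artist, n in counts.items()}
print(priors)
```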

What is posterior probability?

The posterior probability is the probability of an event after taking the data into account. In Naive Bayes, we can describe it using Bayes' theorem:

P(class|data) = P(data|class) · P(class) / P(data)

Here, P(class) is the prior and P(data) is the marginal probability. In a classifier, we can usually ignore the latter, because P(data) is the same for every class and we only need the ratio between the classes. Bayes' theorem is most useful when P(data|class) is easier to calculate than P(class|data).
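The point about ignoring P(data) can be shown with a toy word-count model. All the probabilities below are invented for illustration; the comparison works on log P(data|class) + log P(class), which is proportional to the posterior since the marginal is identical across classes:

```python
import math

# Assumed per-class word likelihoods P(word|class), for illustration only
likelihoods = {
    "adele":  {"hello": 0.05, "yeah": 0.01},
    "eminem": {"hello": 0.01, "yeah": 0.04},
}
priors = {"adele": 0.6, "eminem": 0.4}

def score(cls, words):
    """log P(data|class) + log P(class): proportional to the posterior,
    because the marginal P(data) is the same for every class."""
    return math.log(priors[cls]) + sum(math.log(likelihoods[cls][w]) for w in words)

words = ["hello", "yeah"]
best = max(priors, key=lambda cls: score(cls, words))
print(best)
```

Working in log space avoids numerical underflow when a document contains many words, which is why most Naive Bayes implementations sum log-probabilities rather than multiply raw ones.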


Let's get in touch and talk about your next project.