Predicting Tweet Author with Naive Bayes
Dylan | Jul 15, 2019
After the quick overview of Bayes' Theorem, it's time to dive in and apply the theorem to a real-world classification project. In this project, we'll train a text classifier on a handful of Donald Trump and Barack Obama Tweets before asking our model to predict the author of newly presented Tweets.
By setting up a Twitter developer account and utilizing the Python library, Tweepy, collecting Tweets through Twitter's API is easy. In total, I collected 654 Trump and Obama Tweets for our classifier. If you're interested in following along or exploring the dataset further, the .csv file containing the Tweets is available here!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
Our project begins like any other by importing the necessary libraries for our model. First, we import the standard libraries Pandas, NumPy, Matplotlib, and Seaborn, followed by the specific model and preprocessing imports from scikit-learn. We'll dive into these imports further when we call them in our project.
Loading the Dataset
df = pd.read_csv("trumbama_twitter_data.csv", encoding="ISO-8859-1")
df.columns = ['User', 'Tweet']
The Pandas read_csv function loads the Tweet data stored in our .csv file into a Pandas DataFrame object named df. The dataset only contains two columns. The first contains the last name of the author of the Tweet (Trump or Obama) and the second contains the actual text of the Tweet. To apply useful labels to these two columns in our DataFrame, we'll name them User and Tweet.
Filter Retweets and Split Data
df = df[~df['Tweet'].str.startswith("['RT")]
X_train, X_test, y_train, y_test = train_test_split(df['Tweet'], df['User'], random_state=1)
When each user's Tweets were scraped from Twitter, anything they retweeted was included as well. Since we want our model to predict the author of a Tweet, we shouldn't mix in other authors by including retweets, so we'll have to filter them out of our dataset. As it turns out, each retweet begins with the string "['RT", so dropping every row whose Tweet begins with this string will eliminate all of the retweets.
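To see the retweet filter in isolation, here's a minimal sketch on a made-up DataFrame. The three rows are hypothetical and only mimic the "['RT" prefix described above; the na=False argument is an optional extra (not used in the tutorial code) that treats missing Tweets as non-retweets instead of raising an error.

```python
import pandas as pd

# Hypothetical toy rows that mimic the "['RT" prefix described above
toy = pd.DataFrame({
    "User": ["Trump", "Obama", "Obama"],
    "Tweet": [
        "['Great rally tonight!']",
        "['RT @WhiteHouse: Watch live']",
        "['Thank you, Chicago.']",
    ],
})

# Boolean mask: True wherever the Tweet text starts with the retweet prefix
is_retweet = toy["Tweet"].str.startswith("['RT", na=False)

# ~ inverts the mask, so we keep only the rows that are NOT retweets
filtered = toy[~is_retweet]
print(filtered)
```

Only the first and third rows survive, which is the behavior we want on the full dataset.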
Then, as usual, the train_test_split function from sklearn.model_selection that we imported at the start can be used to split the data into inputs (X) and outputs (y). Since we want our classifier to predict the author based on the Tweet, we'll use the Tweet text as our input and its author (User) as our output. This function also separates the data into two groups: a group for training our Naive Bayes Classifier and a group that we'll hide from the classifier until after training to test the accuracy of the model. By default, train_test_split holds out 25% of the rows for testing, and setting random_state makes the split reproducible.
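Here's a small sketch of what the split returns, using eight made-up rows instead of the real dataset. The Tweet strings are placeholders; the point is only the sizes of the two groups and the fact that inputs and labels stay aligned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset: eight placeholder rows
toy = pd.DataFrame({
    "User": ["Trump", "Obama"] * 4,
    "Tweet": [f"placeholder tweet {i}" for i in range(8)],
})

# With no test_size given, scikit-learn holds out 25% of the rows
X_train, X_test, y_train, y_test = train_test_split(
    toy["Tweet"], toy["User"], random_state=1
)

print(len(X_train), len(X_test))

# Inputs and labels keep matching index values after the shuffle
print(list(X_train.index) == list(y_train.index))
```

If the two authors were very unevenly represented, passing stratify=toy["User"] would be one way to keep the class balance the same in both groups; the tutorial code doesn't do this.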
Apply a Count Vectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
The CountVectorizer class from sklearn.feature_extraction.text tokenizes the text in our Tweets and counts the occurrences of each token. In essence, it learns the vocabulary of the training Tweets and transforms each Tweet into a row of token counts, producing a matrix that's ready to be passed on to the TfidfTransformer.
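To make the "matrix of counts" concrete, here's a minimal sketch on three made-up documents (they are not taken from the dataset). Each row is one document and each column is one token from the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, not taken from the Tweet dataset
docs = [
    "make america great",
    "great health care",
    "america health care care",
]

vect = CountVectorizer()
counts = vect.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each token to its column index
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)
# → ['america', 'care', 'great', 'health', 'make']

print(counts.toarray())
# → [[1 0 1 0 1]
#    [0 1 1 1 0]
#    [1 2 0 1 0]]
```

Notice that "care" appears twice in the last document, so its column holds a 2 in that row.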
Apply a tf-idf Transformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
Tf refers to "term-frequency" and tf-idf stands for "term-frequency inverse document-frequency". Instead of calculating the raw frequencies of tokens in our Tweets, tf-idf decreases the impact of tokens that occur very frequently across all of our Tweets. Tokens commonly used by both Trump and Obama are not as useful for predicting the author as tokens that occur less often but only in association with one of the authors.
We'll be using the weighted frequencies stored in the X_train_tfidf variable to train our classification model.
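A quick sketch of the down-weighting effect, again on made-up documents. With scikit-learn's default settings (smooth_idf=True), the inverse document-frequency of a token t is ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is how many of them contain t, and each row is then scaled to unit length. Here the token "the" appears in every document, so it should end up with a smaller weight than the token that is unique to each document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical mini-corpus: "the" is shared, the second word is unique
docs = ["the wall", "the economy", "the healthcare"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)

# Re-weight the raw counts; toarray() is only for easy printing
tfidf = TfidfTransformer().fit_transform(counts).toarray()

the_col = vect.vocabulary_["the"]  # column index of the shared token
for doc, row in zip(docs, tfidf):
    print(doc, "-> weight of 'the':", round(row[the_col], 3),
          "| largest weight:", round(row.max(), 3))
```

In every row the shared token carries less weight than the distinctive one, even though both appear exactly once in the raw counts.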
Define and Train our Model
clf = MultinomialNB().fit(X_train_tfidf, y_train)
We will be using the Multinomial Naive Bayes Classifier imported from the sklearn.naive_bayes module. The MultinomialNB classifier is well-suited to text classification tasks, and once the model has been fit to our training data, we're ready to make predictions!
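To get a feel for what fitting does, here's a toy sketch with a hand-written count matrix. The three token columns and the four training rows are invented for illustration; the real model is trained on the tf-idf matrix instead.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical token columns: ["great", "healthcare", "wall"]
X = np.array([
    [3, 0, 2],  # Trump-like counts
    [2, 0, 1],
    [0, 3, 0],  # Obama-like counts
    [1, 2, 0],
])
y = np.array(["Trump", "Trump", "Obama", "Obama"])

model = MultinomialNB().fit(X, y)

# A new document that only uses the token "healthcare", twice
new_doc = np.array([[0, 2, 0]])
pred = model.predict(new_doc)
proba = model.predict_proba(new_doc)

print(model.classes_)  # column order of predict_proba
print(pred, proba.round(3))
```

Because "healthcare" only ever appears in the Obama rows, the model assigns almost all of the probability to Obama. The default smoothing (alpha=1.0) keeps the Trump probability small but non-zero.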
Prepare the X_test Data and Make Predictions
X_test_counts = count_vect.transform(X_test)
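X_test_tfidf = tfidf_transformer.transform(X_test_counts)
Just like we applied the count vectorizer and tf-idf transformer to our training data inputs, we'll have to do the same to our test data inputs so that the new inputs arrive in the same format that our model learned from. Note that we only call transform here, not fit_transform: the vocabulary and idf weights were learned from the training data, and refitting them on the test data would change the weights and leak information from the test set into preprocessing.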
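Here's a small sketch of why the already-fitted objects are reused with transform on new text. The documents are made up. A vectorizer fitted on the training text keeps the same columns for new text and simply ignores unseen tokens, whereas a freshly fitted vectorizer would produce a matrix with a different set of columns that the model couldn't interpret.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical training and test documents
train_docs = ["great wall", "great economy", "affordable healthcare"]
test_docs = ["great healthcare plan"]

vect = CountVectorizer()
tfidf = TfidfTransformer()

# Fit on the training text only
train_tfidf = tfidf.fit_transform(vect.fit_transform(train_docs))

# Reuse the fitted objects on the test text: same columns as training
test_tfidf = tfidf.transform(vect.transform(test_docs))
print(train_tfidf.shape, test_tfidf.shape)

# "plan" never appeared in training, so it has no column and is dropped
print("plan" in vect.vocabulary_)

# For contrast: fitting a new vectorizer on the test text changes the columns
refit_counts = CountVectorizer().fit_transform(test_docs)
print(refit_counts.shape)
```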
y_pred = clf.predict(X_test_tfidf)
The predictions will be stored as a NumPy array consisting of the elements "Trump" or "Obama". The position of each predicted User corresponds to the position of its associated Tweet in X_test.
Visualizing Our Results
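from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)  # rows: true authors, columns: predicted authors
ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap='Blues', xticklabels=clf.classes_, yticklabels=clf.classes_)
ax.set_xlabel('Predicted author')
ax.set_ylabel('True author')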
We can quickly set up a confusion matrix to visualize the accuracy of our classifier by importing the confusion_matrix function from the sklearn.metrics module. By passing the array of predicted authors (y_pred) alongside the array of true authors (y_test) of the Tweets, the confusion_matrix function can help us visualize what our model predicts well and what tends to confuse it.
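Since it's easy to mix up which axis is which, here's a tiny sketch with five made-up labels. scikit-learn's confusion_matrix takes the true labels first and the predictions second, so rows correspond to the true author and columns to the predicted author. The explicit labels argument just pins the row and column order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted authors for five Tweets
y_true = ["Obama", "Obama", "Obama", "Trump", "Trump"]
y_hat = ["Obama", "Trump", "Obama", "Trump", "Trump"]

labels = ["Obama", "Trump"]  # fixes the row/column order
toy_cm = confusion_matrix(y_true, y_hat, labels=labels)
print(toy_cm)
# → [[2 1]
#    [0 2]]
```

The top row reads: of three true Obama Tweets, two were predicted as Obama and one as Trump. Swapping the two arguments would transpose the matrix.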
Seaborn and Matplotlib help us convert the text-based confusion matrix into an appealing visualization. By labeling the axes and ticks, our confusion matrix becomes much easier to interpret.
Out of 57 newly presented Obama Tweets, our classifier correctly attributed 44 of them to Obama; however, it also misattributed 13 of them to Trump. On the other hand, among the 25 newly presented Trump Tweets, our classifier correctly attributed all 25 of them to Trump.
With this particular split of the training and test data, whenever our model was presented with a Trump Tweet, it never misattributed it to Obama. When presented with Tweets from Obama, however, it misattributed them to Trump 22.8% (13/57) of the time.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
By using the sklearn.metrics.classification_report function, we're able to quickly evaluate our model's precision, recall, and f1-score. Overall, this particular split of training and testing data yielded an accuracy of 84%. Not too bad for a lightweight Naive Bayes Classifier!
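The headline numbers can also be checked by hand from the four counts reported above (44 Obama Tweets attributed correctly, 13 misattributed to Trump, 25 Trump Tweets attributed correctly, 0 misattributed to Obama). This sketch only redoes that arithmetic; the variable names are my own.

```python
# Counts taken from the confusion matrix described above
obama_correct, obama_as_trump = 44, 13
trump_correct, trump_as_obama = 25, 0

total = obama_correct + obama_as_trump + trump_correct + trump_as_obama

# Accuracy: share of all test Tweets attributed to the right author
accuracy = (obama_correct + trump_correct) / total
print(f"accuracy: {accuracy:.1%}")  # → accuracy: 84.1%

# Share of true Obama Tweets that were misattributed to Trump
obama_miss_rate = obama_as_trump / (obama_correct + obama_as_trump)
print(f"Obama misattributed: {obama_miss_rate:.1%}")  # → 22.8%

# Precision for Trump: of everything predicted as Trump, how much was Trump?
trump_precision = trump_correct / (trump_correct + obama_as_trump)
print(f"Trump precision: {trump_precision:.1%}")  # → 65.8%

# Recall for Trump: of all true Trump Tweets, how many were found?
trump_recall = trump_correct / (trump_correct + trump_as_obama)
print(f"Trump recall: {trump_recall:.1%}")  # → 100.0%
```

So the perfect recall on Trump Tweets comes at the cost of lower precision: a fair share of the Tweets the model labels "Trump" were actually written by Obama.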
Conclusion
Congratulations on making it to the end of this tutorial!
In this tutorial, we explored how to preprocess text data for use in a multinomial Naive Bayes text classifier. Despite its general simplicity, the Naive Bayes classifier remains a powerful and robust classification algorithm. Our classifier managed to predict the author of a Tweet considering only word frequency, with no understanding of meaning, placement, or surrounding words.
Let me know if I left anything unclear and I'll try my best to explain it further! Until next week, keep coding!