
Twitter Hashtag Recommendation

My friends Ahmet Kucuk, Can Firtina, and I developed an offline Twitter hashtag recommendation system this semester for Bilkent's CS464 Introduction to Machine Learning course. I decided to write a short blog post with some insights about the project and the results. Most of the ideas and methods we applied in this project were taken from [1] and [2].

Our motivation for the project was to recommend relevant and common hashtags to a user writing a new tweet. We assumed at the beginning (and verified during the project) that most hashtags are unique, yet about similar topics. Two users may both be tweeting about Facebook but use different hashtags such as #face, #fb, and #facebook. If I search for their tweets by typing #fb into Twitter's search box, I will miss the tweets that contain the other Facebook-related hashtags #face and #facebook (by the way, one could extend this project into a system that recommends similar and related hashtags during search). So we wanted to help users find common and relevant hashtags when they write a new tweet.

Here is a brief summary of the things we did step-by-step during the project.


Data collection

We collected 20 M tweets (with hashtags) using an API caller and an illegal crawler script :) We wrote two pieces of code: one calling the official Twitter API and one crawling Twitter pages to collect tweets. We ran 3 API consumers and 3 crawlers in parallel for one week and had 20 M tweets at the end of it.

The 20 M tweets came from 250 K users who talk about politics, books, news, sports, and technology. We started crawling Twitter from an initial list of users who post tweets about these topics and continued by moving on to their friends and a subset of their followers. Yep, as you just figured out, our dataset was biased and actually quite dirty.

Data preparation

Before building our recommendation model, we had to clean and process our dataset to make it usable. We started by removing links and usernames from the dataset and pruning non-English tweets, which left us 12 M tweets from 150 K users. As a subsequent step, which is quite necessary for any kind of internet-related data, we converted slang words into plain English using an internet slang library that contains ~5400 slang words.
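A dictionary-based replacement is enough to sketch this step. The `SLANG` table below is a tiny hypothetical stand-in for the ~5400-entry slang library we actually used:

```python
import re

# Tiny illustrative stand-in for the ~5400-entry internet slang library.
SLANG = {"u": "you", "r": "are", "gr8": "great", "btw": "by the way"}

def normalize_slang(tweet):
    """Replace each slang token with its plain-English expansion."""
    tokens = re.findall(r"\w+|\S", tweet.lower())
    return " ".join(SLANG.get(tok, tok) for tok in tokens)

print(normalize_slang("u r gr8 btw"))  # -> you are great by the way
```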

The steps that came after pruning and slang removal are not so different from those applied in any machine learning project: we converted words (except hashtags) to their stems and removed stop words and punctuation. The stemming and stop-word removal steps pruned another 8 M tweets and left us 3 M tweets to work with.
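This stage can be sketched as below, with a toy suffix-stripper standing in for a real stemmer and a deliberately tiny stop-word list (both are illustrative, not what we actually used):

```python
# Illustrative stop-word list and suffix set; a real pipeline would use
# a full stop-word list and a proper stemmer (e.g. Porter).
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "about"}
SUFFIXES = ("ing", "ed", "s")

def stem(word):
    """Toy stemmer: strip one of a few common suffixes."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tweet):
    """Lowercase, drop punctuation and stop words, stem everything except hashtags."""
    out = []
    for raw in tweet.lower().split():
        word = raw.strip(".,!?;:'\"")
        if not word or word in STOPWORDS:
            continue
        out.append(word if word.startswith("#") else stem(word))
    return out

print(preprocess("Reading the new books about #python!"))
# -> ['read', 'new', 'book', '#python']
```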

After all the steps described above, the data was still not ready for training a model. We partitioned the remaining tweets by date and saw that almost half of the 3 M tweets were posted in October 2013, with the rest spread over the period from 2011 to September 2013. Of course, there were only a few tweets from 2011 and 2012, and they were almost useless. From January 2013 to September 2013 there were ~100 K tweets per month, but they too had become irrelevant because of the fast-moving nature of Twitter. Ultimately, we finished our data preparation by removing the tweets posted before October 2013 and were left with 1.5 M tweets from 120 K users. Each user had ~17 tweets and each tweet had ~1.6 hashtags.

Insights about the prepared data

Although we gained many insights, the most important thing to note about the prepared dataset is shown in the plot above, which also illustrates our motivation: there are only a few hashtags that are used in many tweets (upper side), and thousands of hashtags that are used only a few times or once (lower side). We aim to make this plot look better by moving the lower dots toward the upper part of the plot.

Building different models


We built a Naive-Bayes model (with smoothing) for recommendation. Our features were the words in a tweet and our labels were hashtags. For a given tweet, we calculated the probability of each hashtag from the co-occurrence of the tweet's words with that hashtag in the dataset, and sorted the hashtags by probability. Initially we compared only the most probable hashtag with the actual hashtag, and the accuracy was quite low. Borrowing an idea from one of the papers cited at the beginning, we decided to look at the top-20 most probable hashtags instead. The motivation was that the hashtags in the top 20 are related and similar to each other, so the actual hashtag is still likely to appear in the top 20 even when it is not number one. With this reasonable change to the evaluation, our accuracy became four times higher than the initial accuracy.
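A minimal sketch of the model (add-one smoothing, top-k ranking); the function names and the tiny corpus below are illustrative, not our actual project code:

```python
import math
from collections import defaultdict

def train(tweets):
    """tweets: list of (words, hashtags) pairs. Count word/hashtag co-occurrences."""
    word_counts = defaultdict(dict)   # hashtag -> {word: count}
    tag_counts = defaultdict(int)     # hashtag -> number of tweets
    vocab = set()
    for words, tags in tweets:
        for tag in tags:
            tag_counts[tag] += 1
            for w in words:
                word_counts[tag][w] = word_counts[tag].get(w, 0) + 1
                vocab.add(w)
    return word_counts, tag_counts, vocab

def recommend(words, model, k=20):
    """Rank hashtags by log P(tag) + sum_w log P(w|tag), with add-one smoothing."""
    word_counts, tag_counts, vocab = model
    total = sum(tag_counts.values())
    scores = {}
    for tag, n in tag_counts.items():
        score = math.log(n / total)                          # prior P(tag)
        denom = sum(word_counts[tag].values()) + len(vocab)  # smoothing denominator
        for w in words:
            score += math.log((word_counts[tag].get(w, 0) + 1) / denom)
        scores[tag] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]

model = train([(["obama", "vote"], ["#politics"]),
               (["goal", "match"], ["#sports"]),
               (["vote", "election"], ["#politics"])])
print(recommend(["vote"], model, k=2))  # -> ['#politics', '#sports']
```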

We had ~300 K distinct hashtags (class labels) and ~150 K distinct words (instance features) in the prepared dataset, so training a Naive-Bayes model required calculating billions of probabilities. Had we followed a dynamic-programming approach, calculating each probability only once and keeping it in memory, a single machine's memory (8 GB, the most our laptops had) would not have been sufficient (and indeed it wasn't), so we recalculated each probability again and again without memoization. Of course, there is room for improvements such as caching the probabilities of popular hashtags.
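For instance, the per-(word, hashtag) smoothed probability could be memoized with a bounded cache, so popular hashtags are not recomputed for every test tweet (the count tables below are illustrative stand-ins):

```python
from functools import lru_cache

# Illustrative stand-ins for the real count tables.
WORD_TAG_COUNTS = {("vote", "#politics"): 2}
TAG_WORD_TOTALS = {"#politics": 4}
VOCAB_SIZE = 5

@lru_cache(maxsize=1_000_000)  # bounded cache keeps memory under control
def smoothed_prob(word, tag):
    """Add-one smoothed P(word | tag); computed once per (word, tag) pair."""
    return (WORD_TAG_COUNTS.get((word, tag), 0) + 1) / (TAG_WORD_TOTALS[tag] + VOCAB_SIZE)

print(smoothed_prob("vote", "#politics"))  # -> 0.3333333333333333
```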

Modifications to Naive-Bayes

We modified our Naive-Bayes model to take the TF-IDF values of a tweet's words into account when calculating the probability of a hashtag given those words. Since tweets are only a few words long, TF was useless and we ignored it. We only added the IDF values of words into the probability calculation, with the motivation that if a rare word occurs with a hashtag, it must be more relevant to it.
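The IDF part can be sketched as follows; the exact weighting scheme shown in the comment is a plausible reconstruction, not necessarily the precise formula we used:

```python
import math

def idf(word, doc_freq, n_tweets):
    """Inverse document frequency: rare words get a higher weight."""
    return math.log(n_tweets / (1 + doc_freq.get(word, 0)))

# Each word's contribution to a hashtag's score is then scaled by its IDF,
# e.g.  score(tag) = log P(tag) + sum_w idf(w) * log P(w | tag),
# which widens the gap between hashtags that do and don't co-occur with a rare word.

doc_freq = {"the": 900_000, "gezi": 120}  # hypothetical document frequencies
print(idf("gezi", doc_freq, 1_500_000) > idf("the", doc_freq, 1_500_000))  # -> True
```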


We also wanted to try another machine learning algorithm for the recommendation model. As any machine learning beginner would, we found SVM interesting (maybe because of the coolness of its name, or maybe because everybody talks about it) and worth trying. We found the libshorttext library for short-text classification, a customized version of liblinear, which is itself a customized version of libsvm. Our requirements (~300 K instances and ~150 K features) seemed to match libshorttext's capabilities, so we decided to give it a shot.

We divided the October 2013 tweets into days and used each day's tweets to recommend hashtags both for tweets from the same day and for new tweets from the following day.

Evaluating the results


We divided the 1.5 M tweets into 5 folds to train the Naive-Bayes model: 1.2 M tweets for training and 300 K for testing. Each fold took 3 to 4 hours to run because of the large number of instances and features. The initial accuracy with the compare-only-the-top-hashtag evaluation was 10%. When we switched the evaluation to the top-20 most probable hashtags, accuracy moved up to 40%. 10% and 40% may not seem valuable, but given the sparsity of our dataset, these accuracies seemed sufficient to us.

Looking at the rank of the actual hashtag in our recommendation list: the overall mean is ~13, which means we find the actual hashtag within the top 14 on average. But when we consider only the tweets whose actual hashtag made the top 20, the mean is ~3. So whenever we do find the actual hashtag, we recommend it within the top 4!
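The two numbers can be computed like this (a sketch of the evaluation with hypothetical names, not our actual script):

```python
def evaluate(ranked_lists, actual_tags, k=20):
    """Top-k accuracy, plus the mean (1-based) rank over the hits only."""
    hits, rank_sum = 0, 0
    for ranked, actual in zip(ranked_lists, actual_tags):
        top_k = ranked[:k]
        if actual in top_k:
            hits += 1
            rank_sum += top_k.index(actual) + 1
    accuracy = hits / len(actual_tags)
    mean_hit_rank = rank_sum / hits if hits else None
    return accuracy, mean_hit_rank

print(evaluate([["#a", "#b", "#c"], ["#d", "#e", "#f"]], ["#b", "#x"], k=3))
# -> (0.5, 2.0)
```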

After the TF-IDF modification, the accuracy of the model decreased by 2-3%, contrary to our expectation. Our interpretation is that there are thousands of meaningless / fuzzy words in the Twitter dataset, and the rareness of a word does not imply much about its importance.


We divided October 2013's tweets into days and trained SVM models for a few random days of the month (we couldn't train a model for every day because of the project's time constraints). We split each randomly selected day's tweets into 3 folds and tested each trained model on one fold of that day and on all tweets of the following day. We barely succeeded in building a model from a single day's tweets (~100 K), since libshorttext's memory and CPU requirements are very high (it uses one-against-one instead of one-against-rest for multi-class classification).

The accuracy of libshorttext was ~10% for the same day and ~4% for the following day. As you can notice, 10% is very close to our initial Naive-Bayes result. Since we could not modify the libshorttext implementation the way we modified our own Naive-Bayes, we tried different training parameters, such as unigram / bigram features and binary presence / term frequency / TF-IDF feature representations, but still could not improve the accuracy significantly.

While these are quite low accuracy values for a classification task, the 10% vs. 4% difference still reveals an important detail: the things people talk about on Twitter change substantially from day to day.


Comparing the effectiveness of the two methods, they do not differ noticeably; but in terms of efficiency, Naive-Bayes beats SVM and seems much more applicable to the hashtag recommendation task. The system could be turned into a real-time one by building models from the tweets of the last X hours / days and recommending hashtags for newly arriving tweets.

This project was baby steps for both me and my friends, since we had almost zero experience with machine learning before the course. The methods we applied could be improved in many respects and much better results could be obtained. But still, it was quite challenging and we had great fun!

If you managed to read this far, I sincerely thank you for your patience and curiosity :) I tried to give a brief summary of a 2-month project, so some important parts may have been skipped. Please feel free to ask anything (except for the dataset, because Twitter forbids distributing collected datasets).