Twitter Data Analysis with R – Text Mining the Kenya Elections
Hello people! I’m back again. This time we will be looking at Twitter analytics. There’s so much that can be done with social media interactions. The other day, Facebook announced that they had surpassed the 1 billion monthly active users (MAU) mark! To be exact, as of the first quarter of 2017, Facebook had about 1.94 billion MAU! That’s just incredible. What this means is that at any given time, we have millions of people worldwide interacting on Facebook. Add to this other social media platforms like Twitter, Instagram and so on.
So what does this mean for organisations? Well, for one, because people are now more likely to express their opinions on social media, a company can view, in real time, how (un)interested consumers are in its newly launched product. It can also use this to assess how happy or otherwise people are with its competitors. So if company A was planning to introduce a product and company B gets a similar product to market first, company A can use social media feedback from company B’s launch to learn what people are actually saying about the product. This is very common with communications companies that fight to gain or retain clients with tariff bundles. This is just a simple example.
Aside from monitoring sentiment on social media, companies can also direct social media users (say, on Twitter) to their website and use Twitter analytics to find out what percentage of users actually visited the website because of a tweet. With additional tools like Google Analytics, you can then track whether these visitors made a purchase whilst on the website.
The above are just two examples of what can be done with social media analytics. Today, we will be looking at how to extract data from Twitter and perform some simple analytics on the data.
Kenya held its general election on 8 August 2017, and we will track how one of the election hashtags performed on Twitter over a short period of time.
Load relevant packages
library(ROAuth)
library(twitteR)
library(tm)
library(igraph)
library(topicmodels)
library(devtools)
library(sentiment)
library(RTextTools)
library(e1071)
library(data.table)
library(ggplot2)
library(readr)
library(microbenchmark) # not required for Twitter analytics
The key part of this work is extracting data from Twitter. To do this, you will need 4 key parameters from Twitter – “consumer key”, “consumer secret”, “access token” and “access token secret”. To get these:
- Create a Twitter account if you don’t have one already
- Go to https://apps.twitter.com/ and log in
- Click on “Create New App”
- Fill out the form, (read and) accept the Developer Agreement and click “Create your Twitter Application”
- On your new app page, you should see the name you gave to your app at the top left of the screen. Click on the “Keys and Access Tokens” tab and copy your “Consumer key” and “Consumer secret”.
- Scroll down and click “Create my access token”, and copy your “Access token” and “Access token secret”.
That’s it!
You can now use these in your application. To keep them safe, open a new R script and assign your keys and secrets to four different variables. You can name them whatever you want, but I have used consumer_key, consumer_secret, access_token, access_secret. Save this file and then load it using the “source” function. This ensures that anyone looking at your analysis code cannot see your keys and secrets, as long as you don’t share the sourced file itself 😉
source("..\\twitterdata.R")
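As a rough sketch, the sourced file might look something like the following. The four variable names are the ones used in this post; the placeholder strings and the environment variable name in the commented-out line are hypothetical, so replace them with whatever suits your setup.

```r
# twitterdata.R: hypothetical contents, replace the placeholders with your own values
consumer_key    <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_token    <- "YOUR_ACCESS_TOKEN"
access_secret   <- "YOUR_ACCESS_TOKEN_SECRET"

# Optional alternative (assumed variable name): read a value from the environment
# consumer_key <- Sys.getenv("TWITTER_CONSUMER_KEY")

# Quick sanity check that all four values are non-empty strings
credentials <- c(consumer_key, consumer_secret, access_token, access_secret)
all_set <- all(nzchar(credentials))
print(all_set)
```

Keeping this file out of version control (or out of anything you publish) is what actually protects the keys.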
Retrieve Tweets
To retrieve tweets by keyword, we search for a hashtag (in our case, we want to retrieve 3000 tweets from the #KenyaDecides hashtag). Note that this is just a simple example to demonstrate how this works. In real life, you may want to make your search more robust by including other possible hashtags that users may use.
Connect with Twitter using your consumer_key, consumer_secret, access_token, access_secret and search for 3000 tweets with this hashtag. Note that the Twitter Search API searches against a sampling of recent Tweets published in the past 7 days:
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
tweets <- searchTwitter('#KenyaDecides', n=3000, lang='en')
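To make the search cover more than one hashtag, one option is to join them with the search operator OR. This is only a sketch: the second hashtag is an illustrative pick, and the actual search call is commented out because it needs an authenticated session.

```r
# Hashtags to search for; the second one is just an illustrative choice
hashtags <- c("#KenyaDecides", "#ElectionKE2017")

# Build a single query string of the form "#a OR #b"
query <- paste(hashtags, collapse = " OR ")
print(query)

# The call below needs a live, authenticated twitteR session, so it is commented out:
# tweets <- searchTwitter(query, n = 3000, lang = "en")
```

Bear in mind that a tweet using both hashtags should only come back once, so the total may be lower than the sum of separate searches.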
Store the data
It is usually good to store your data at this point. Then convert to a dataframe for easy analysis.
#saveRDS(tweets, 'kenyatweets.RDS') # uncomment and run once to save the retrieved tweets
readTweets <- readRDS('kenyatweets.RDS')
tweetDF <- twListToDF(readTweets) # convert to dataframe
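If you would rather not comment and uncomment the save step by hand, a small guard can do it for you. The sketch below uses a toy list in place of the real tweets and a hypothetical file path in the temporary directory.

```r
# Toy stand-in for the list returned by searchTwitter()
toy_tweets <- list("first tweet", "second tweet")

# Hypothetical cache location; use a path of your choice in practice
cache_file <- file.path(tempdir(), "toy_tweets.RDS")

# Save only if there is no cached copy yet
if (!file.exists(cache_file)) {
  saveRDS(toy_tweets, cache_file)
}

# Read the cached copy back and confirm it matches the original
restored <- readRDS(cache_file)
print(identical(toy_tweets, restored))
```

This way the slow (and rate limited) search only has to run when the file is missing.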
That’s it! From this point, you can start analysing your data. This article is meant to show how to connect to Twitter and extract tweets for analysis. In another post, I will show how to analyse these tweets. However, for those who can’t wait, I will show a bit of what can be done below:
Analyse the Data
First, we investigate the dataframe by checking the names of its variables:
names(tweetDF) # list the variable names
##  [1] "text"          "favorited"     "favoriteCount" "replyToSN"
##  [5] "created"       "truncated"     "replyToSID"    "id"
##  [9] "replyToUID"    "statusSource"  "screenName"    "retweetCount"
## [13] "isRetweet"     "retweeted"     "longitude"     "latitude"
As you can see, we’ve got 16 variables for each tweet. “text” is the content of the (re)tweet itself. “favoriteCount” tells us how many favorites the tweet received. “favorited” tells us whether the tweet was favorited; in this data it is FALSE for every row even where favoriteCount is above zero, which suggests it refers to the authenticated account rather than to users in general. “created” gives us the date and time the tweet was sent. The other variables can be deciphered from their names.
From the summary of the data, we can get a quick glance at some statistics.
summary(tweetDF) # summary statistics for every column
##      text           favorited       favoriteCount      replyToSN
##  Length:3000        Mode :logical   Min.   : 0.0000   Length:3000
##  Class :character   FALSE:3000      1st Qu.: 0.0000   Class :character
##  Mode  :character                   Median : 0.0000   Mode  :character
##                                     Mean   : 0.6137
##                                     3rd Qu.: 0.0000
##                                     Max.   :59.0000
##     created                     truncated       replyToSID
##  Min.   :2017-08-10 08:22:56   Mode :logical   Length:3000
##  1st Qu.:2017-08-10 08:48:07   FALSE:2857      Class :character
##  Median :2017-08-10 09:10:05   TRUE :143       Mode  :character
##  Mean   :2017-08-10 09:10:56
##  3rd Qu.:2017-08-10 09:34:25
##  Max.   :2017-08-10 09:59:39
##       id             replyToUID        statusSource
##  Length:3000        Length:3000        Length:3000
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
##   screenName         retweetCount     isRetweet       retweeted
##  Length:3000        Min.   :   0.0   Mode :logical   Mode :logical
##  Class :character   1st Qu.:   1.0   FALSE:836       FALSE:3000
##  Mode  :character   Median :  10.0   TRUE :2164
##                     Mean   : 380.5
##                     3rd Qu.:  98.0
##                     Max.   :2410.0
##   longitude           latitude
##  Length:3000        Length:3000
##  Class :character   Class :character
##  Mode  :character   Mode  :character
# longitude and latitude hold no data (all 3000 values are NA), although
# summary() only reports them as character columns. The most retweeted
# tweet has 2410 retweets and the most favorited has 59 favorites.
# The 3000 tweets were created between 08:22:56 and 09:59:39 on 2017-08-10.
As we can see, we have 3000 tweets (as requested) over a period of just over an hour and a half (08:22:56 to 09:59:39 on 2017-08-10).
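Rather than reading the collection window off the summary by eye, it can be computed from the “created” column. Here is a minimal sketch using only the minimum, median and maximum timestamps printed in the summary output; the UTC time zone is an assumption made so the example behaves the same on any machine.

```r
# Three timestamps taken from the summary output (min, median, max);
# the UTC time zone is assumed here
created <- as.POSIXct(
  c("2017-08-10 09:59:39", "2017-08-10 09:10:05", "2017-08-10 08:22:56"),
  tz = "UTC"
)

# Earliest and latest tweet in the sample
time_window <- range(created)

# Length of the collection window in minutes
span_mins <- as.numeric(difftime(time_window[2], time_window[1], units = "mins"))
print(time_window)
print(span_mins)
```

With the real data you would pass tweetDF$created instead of the three values typed in here.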
We can also view the first and last two rows of the data:
head(tweetDF, 2)
##   text
## 1 A goal without a plan is just a wish.\nMeet our TEAM. #ElectionKE2017 \n#KenyaDecides #kibandabae \n#Githeriman<U+0085> https://t.co/IBupRJ1FD9
## 2 RT @Asamoh_: Dear NASA supporters, don't lose hope. Stand strong. We can't accept and move on. No more fraud. Tuko gangari #KenyaDecides
##   favorited favoriteCount replyToSN             created truncated
## 1     FALSE             0      <NA> 2017-08-10 09:59:39      TRUE
## 2     FALSE             0      <NA> 2017-08-10 09:59:37     FALSE
##   replyToSID                 id replyToUID
## 1       <NA> 895585427385483264       <NA>
## 2       <NA> 895585419160543238       <NA>
##                                                                           statusSource
## 1                   <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## 2 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
##       screenName retweetCount isRetweet retweeted longitude latitude
## 1 pitchblank_ent            0     FALSE     FALSE      <NA>     <NA>
## 2       EndesiaC           27      TRUE     FALSE      <NA>     <NA>
tail(tweetDF, 2)
##      text
## 2999 RT @chriskirwa: RETWEET IN SUPPORT - He has made us laugh during the Tense #KenyaDecides - Let's locate & Celebrate him & his Family #Githe<U+0085>
## 3000 RT @chriskirwa: RETWEET IN SUPPORT - He has made us laugh during the Tense #KenyaDecides - Let's locate & Celebrate him & his Family #Githe<U+0085>
##      favorited favoriteCount replyToSN             created truncated
## 2999     FALSE             0      <NA> 2017-08-10 08:22:56     FALSE
## 3000     FALSE             0      <NA> 2017-08-10 08:22:56     FALSE
##      replyToSID                 id replyToUID
## 2999       <NA> 895561088242262016       <NA>
## 3000       <NA> 895561085805367300       <NA>
##                                                                              statusSource
## 2999 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## 3000 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
##        screenName retweetCount isRetweet retweeted longitude latitude
## 2999   SavednSent         2410      TRUE     FALSE      <NA>     <NA>
## 3000 DennisShonko         2410      TRUE     FALSE      <NA>     <NA>
uniqueN(tweetDF$screenName)
## [1] 1960
This gives us the number of distinct users who have tweeted with the hashtag over the time period.
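For anyone unfamiliar with uniqueN, it comes from the data.table package and counts distinct values. A small sketch with made-up screen names shows that it agrees with the base R idiom length(unique(x)):

```r
library(data.table)

# Made-up screen names standing in for tweetDF$screenName
screen_names <- c("alice", "bob", "alice", "carol", "bob", "alice")

# data.table's shortcut for counting distinct values
n_users <- uniqueN(screen_names)

# The equivalent base R expression
n_users_base <- length(unique(screen_names))

print(c(n_users, n_users_base))
```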
Now let’s use the data.table package for analysis. Convert the dataframe to a data table and order it by the number of retweets.
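tweetDT <- data.table(tweetDF, key="retweetCount") # setting the key sorts by retweetCount (ascending)
tweetDT <- tweetDT[order(retweetCount),] # explicit sort; redundant with the key but harmless
topRetweets <- tail(tweetDT, 10) # the 10 rows with the highest retweet counts
View(topRetweets) # opens the data viewer (interactive sessions only)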
In this case, user @Ahmedkadar1 received the highest number of retweets for his post: “A 100 year old Mzee being led to…” This is the post with the picture of a 100 year old man being led to the polling booth by his 70 year old son. awwww…
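Because the text of a retweet starts with “RT @user:”, the original author can be pulled out of the “text” column with a regular expression. This is a sketch on two shortened example texts, not necessarily how the top user above was identified.

```r
# Two shortened example texts: one retweet and one original tweet
texts <- c(
  "RT @Asamoh_: Dear NASA supporters, don't lose hope. #KenyaDecides",
  "A goal without a plan is just a wish. #KenyaDecides"
)

# A retweet begins with "RT @<handle>:"
is_rt <- grepl("^RT @\\w+:", texts)

# Keep the handle for retweets, NA for everything else
original_author <- ifelse(is_rt, sub("^RT @(\\w+):.*$", "\\1", texts), NA_character_)
print(original_author)
```

On the full data you could then tabulate original_author to see whose tweets were retweeted most often within the sample.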
You can also use this to order tweets by the number of times they were favorited.
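A sketch of that ordering on a toy table follows. The column names match the tweet dataframe, but the values are made up; the minus sign inside order() sorts in descending order so the most favorited rows come first.

```r
library(data.table)

# Toy data with the same column names as the tweet data (values are made up)
toyDT <- data.table(
  screenName    = c("a", "b", "c", "d"),
  favoriteCount = c(3, 59, 0, 12)
)

# Sort by favoriteCount, largest first, and keep the top two rows
topFavorites <- head(toyDT[order(-favoriteCount)], 2)
print(topFavorites)
```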
We can then plot a smoothed curve of the retweet count over the time period:
ggplot(tweetDF, aes(x=created, y=retweetCount)) + geom_smooth()
## `geom_smooth()` using method = 'gam'
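The curve above describes how heavily retweeted the tweets were, not how many tweets were posted. If tweet volume is what you are after, one possible approach is to bin the “created” timestamps. This is a sketch with a handful of made-up timestamps; the 10 minute bin width and the UTC time zone are arbitrary assumptions.

```r
# A few made-up timestamps standing in for tweetDF$created (UTC assumed)
created <- as.POSIXct(
  c("2017-08-10 08:22:56", "2017-08-10 08:25:10", "2017-08-10 08:48:07",
    "2017-08-10 09:10:05", "2017-08-10 09:12:30"),
  tz = "UTC"
)

# Group the timestamps into 10 minute bins and count tweets per bin
bins <- cut(created, breaks = "10 min")
volume <- table(bins)
print(volume)
```

The resulting table can be passed straight to barplot() to see when the hashtag was busiest.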
Who has spent time tweeting on this hashtag?
counts <- table(tweetDF$screenName)
counts <- subset(counts, counts > 5)
barplot(counts, las=2, cex.names = 0.7)
Well done @KBCChannel! Top user of this hashtag within the time period, with over 50 tweets, or about 1.7% of all tweets.
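The share of tweets per user can be worked out with prop.table. The sketch below uses made-up counts (only the @KBCChannel handle comes from the data) just to show the mechanics.

```r
# Made-up screen names; only the KBCChannel handle comes from the real data
screen_names <- c(rep("KBCChannel", 5), rep("user_a", 3), rep("user_b", 2))

# Tweets per user, most active first
counts <- sort(table(screen_names), decreasing = TRUE)

# Each user's share of all tweets, in percent
share <- round(100 * prop.table(counts), 1)
print(share)
```

For the real figures, 50 tweets out of 3000 works out to 50 / 3000, or roughly 1.7%.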
There’s so much that can be done with this dataset. I will try to do another post showing more analysis on it. If you have any analysis you want me to do on this dataset (or another), leave a comment or send me an email. See you soon!