Twitter Data Analysis with R – Text Mining the Kenya Elections

Aug 10, 2017

Hello people! I’m back again. This time we will be looking at Twitter analytics. There’s so much that can be done with social media interactions. Facebook has long since passed the 1 billion monthly active users (MAU) mark; its report for the first quarter of 2017 put the figure at about 1.94 billion. That’s just incredible. What this means is that at any given time, millions of people worldwide are interacting on Facebook. Add to this other social media platforms like Twitter, Instagram and so on.

So what does this mean for organisations? Well, for one, because people are now more likely to express their opinions on social media, a company can see, in real time, how (un)interested consumers are in its newly launched product. It can also use this to assess how happy or otherwise people are with its competitors. So if company A was planning to introduce a product and company B gets to the market first with a similar product, company A can use social media feedback from company B’s launch to learn what people are actually saying about the product. This is very common with communications companies, who fight to gain or retain clients with tariff bundles. This is just a simple example.
Aside from monitoring sentiment on social media, companies can also direct social media users (say, on Twitter) to their website and use Twitter analytics to find out what percentage of users actually visited the website because of a tweet. With additional tools like Google Analytics, you can then track whether these visitors made a purchase while on the website.

The above are just two examples of what can be done with social media analytics. Today, we will look at how to extract data from Twitter and perform some simple analytics on it.
Kenya held its general election on 8 August 2017, and we will track how one of the election hashtags performed on Twitter over a short period of time.

Load relevant packages

library(ROAuth)
library(twitteR)
library(tm)
library(igraph)
library(topicmodels)
library(devtools)
library(sentiment)
library(RTextTools)
library(e1071)
library(data.table)
library(ggplot2)
library(readr)
library(microbenchmark) # not required for Twitter analytics

The key part of this work is extracting data from Twitter. To do this, you will need four parameters from Twitter: “consumer key”, “consumer secret”, “access token” and “access token secret”. To get these:

  1. Create a Twitter account if you don’t have one already
  2. Go to https://apps.twitter.com/ and log in
  3. Click on “Create New App”
  4. Fill out the form, (read and) accept the Developer Agreement and click “Create your Twitter Application”
  5. On your new app page, you should see the name you gave to your app at the top left of the screen. Click on the “Keys and Access Token” tab and copy your “Consumer key” and “Consumer secret”.
     Scroll down and click “Create my access token”, and copy your “Access token” and “Access token secret”.

That’s it!

You can now use this in your application. To keep it safe, open an R script, assign your keys and secrets to four different variables. You can name them whatever you want but I have used consumer_key, consumer_secret, access_token, access_secret. Save this file and then access it using the “source” function. This ensures that anyone looking at your code cannot see your keys and secrets 😉

source("..\\twitterdata.R")
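As a rough sketch of what such a credentials file could contain, the helper below reads each value from an environment variable instead of hard-coding it. The function names and the environment variable names (TWITTER_CONSUMER_KEY and so on) are my own assumptions for illustration; neither Twitter nor twitteR requires them.

```r
# twitterdata.R (hypothetical contents)
# Read one credential from an environment variable and fail loudly if it
# is missing. The variable names used below are illustrative only.
read_secret <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable not set: ", name)
  }
  value
}

# Collect the four values needed by setup_twitter_oauth() in one list.
load_twitter_credentials <- function() {
  list(
    consumer_key    = read_secret("TWITTER_CONSUMER_KEY"),
    consumer_secret = read_secret("TWITTER_CONSUMER_SECRET"),
    access_token    = read_secret("TWITTER_ACCESS_TOKEN"),
    access_secret   = read_secret("TWITTER_ACCESS_SECRET")
  )
}
```

After sourcing the file, something like `creds <- load_twitter_credentials()` followed by `consumer_key <- creds$consumer_key` (and likewise for the other three) would give you the same four variables used in this post.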

Retrieve Tweets

We retrieve tweets by searching on a keyword; in our case, we want 3000 tweets from the #KenyaDecides hashtag. Note that this is just a simple example to demonstrate how this works. In real life, you may want to make your search more robust by including other hashtags that users may use.

Connect to Twitter using your consumer_key, consumer_secret, access_token and access_secret, and search for 3000 tweets with this hashtag. Note that the Twitter Search API searches against a sampling of recent tweets published in the past 7 days:

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
tweets <- searchTwitter('#KenyaDecides', n=3000, lang='en')
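To make the search cover more than one hashtag, one option is to join the hashtags with Twitter's OR search operator. The helper below is a small sketch of that idea; `build_hashtag_query` is a name I made up, and #ElectionKE2017 is used only because it shows up in the tweets collected here.

```r
# Combine several hashtags into a single search string using the OR operator.
# A leading "#" is added to any tag that lacks one; duplicates are dropped.
build_hashtag_query <- function(tags) {
  tags <- trimws(tags)
  tags <- ifelse(startsWith(tags, "#"), tags, paste0("#", tags))
  paste(unique(tags), collapse = " OR ")
}

query <- build_hashtag_query(c("#KenyaDecides", "ElectionKE2017"))
print(query)
# The result could then be passed on, for example:
# tweets <- searchTwitter(query, n = 3000, lang = "en")
```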

Store the data

It is usually good to store your data at this point. Then convert to a dataframe for easy analysis.

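# Run once after downloading, then leave commented out to avoid overwriting
#saveRDS(tweets, 'kenyatweets.RDS')
readTweets <- readRDS('kenyatweets.RDS') # reload the stored tweets
tweetDF <- twListToDF(readTweets) # convert to dataframe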
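If you would rather not comment and uncomment the save line by hand, a small caching helper can decide for you. This is only a sketch of one way to do it; `load_or_fetch` is a hypothetical helper, not part of twitteR.

```r
# Return the cached object if the file exists; otherwise call fetch_fun(),
# save its result to the file, and return it.
load_or_fetch <- function(path, fetch_fun) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  data <- fetch_fun()
  saveRDS(data, path)
  data
}

# Usage (needs an authenticated twitteR session, so it is left commented out):
# readTweets <- load_or_fetch("kenyatweets.RDS", function() {
#   searchTwitter("#KenyaDecides", n = 3000, lang = "en")
# })
```

The download only happens when the file is missing, which also helps you stay within the API rate limits while you iterate on the analysis.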

That’s it! From this point, you can start analysing your data. This article is meant to show how to connect to Twitter and extract tweets for analysis. In another post, I will show how to analyse these tweets. However, for those who can’t wait, I will show a bit of what can be done below:

Analyse the Data

First we investigate the dataframe by checking the names of the various variables

names(tweetDF) # list the variable names
##  [1] "text"          "favorited"     "favoriteCount" "replyToSN"    
##  [5] "created"       "truncated"     "replyToSID"    "id"           
##  [9] "replyToUID"    "statusSource"  "screenName"    "retweetCount" 
## [13] "isRetweet"     "retweeted"     "longitude"     "latitude"

As you can see, we’ve got 16 variables for each tweet. “text” is the content of the (re)tweet itself. “favorited” and “favoriteCount” tell us whether the tweet was favorited and how many favourites it received. “created” gives the date and time the tweet was sent. The other variables can be deciphered from their names.

From the summary of the data, we can get a quick glance at some statistics.

summary(tweetDF) # summary analysis. We can see longitude and latitude 
##      text           favorited       favoriteCount      replyToSN        
##  Length:3000        Mode :logical   Min.   : 0.0000   Length:3000       
##  Class :character   FALSE:3000      1st Qu.: 0.0000   Class :character  
##  Mode  :character                   Median : 0.0000   Mode  :character  
##                                     Mean   : 0.6137                     
##                                     3rd Qu.: 0.0000                     
##                                     Max.   :59.0000                     
##     created                    truncated        replyToSID       
##  Min.   :2017-08-10 08:22:56   Mode :logical   Length:3000       
##  1st Qu.:2017-08-10 08:48:07   FALSE:2857      Class :character  
##  Median :2017-08-10 09:10:05   TRUE :143       Mode  :character  
##  Mean   :2017-08-10 09:10:56                                     
##  3rd Qu.:2017-08-10 09:34:25                                     
##  Max.   :2017-08-10 09:59:39                                     
##       id             replyToUID        statusSource      
##  Length:3000        Length:3000        Length:3000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##   screenName         retweetCount    isRetweet       retweeted      
##  Length:3000        Min.   :   0.0   Mode :logical   Mode :logical  
##  Class :character   1st Qu.:   1.0   FALSE:836       FALSE:3000     
##  Mode  :character   Median :  10.0   TRUE :2164                     
##                     Mean   : 380.5                                  
##                     3rd Qu.:  98.0                                  
##                     Max.   :2410.0                                  
##   longitude           latitude        
##  Length:3000        Length:3000       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
# longitude and latitude are NA for every tweet, so there is no location data.
# The most retweeted tweet has 2410 retweets and the most favorited has 59
# favourites. The 3000 tweets were created between 08:22:56 and 09:59:39
# on 2017-08-10.

As we can see, we have 3000 tweets (as requested) created over a period of roughly an hour and a half (08:22:56 to 09:59:39 on 10-08-2017).
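The length of the collection window can also be computed directly rather than read off the summary. The sketch below uses a hypothetical helper, `collection_window`, and feeds it the Min. and Max. of `created` from the summary output; with the real data you would pass `tweetDF$created` instead. The UTC time zone is an assumption made only so the example is reproducible.

```r
# Report the first and last timestamps and the span between them in minutes.
collection_window <- function(created) {
  first <- min(created)
  last <- max(created)
  list(
    first = first,
    last = last,
    minutes = as.numeric(difftime(last, first, units = "mins"))
  )
}

# Illustrative input: the Min. and Max. of `created` shown in the summary
created <- as.POSIXct(c("2017-08-10 08:22:56", "2017-08-10 09:59:39"), tz = "UTC")
window <- collection_window(created)
print(round(window$minutes, 1))
```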

We can also view the first and last two rows of the data

head(tweetDF, 2)
##                                                                                                                                       text
## 1 A goal without a plan is just a wish.\nMeet our TEAM. #ElectionKE2017 \n#KenyaDecides #kibandabae \n#Githeriman<U+0085> https://t.co/IBupRJ1FD9
## 2 RT @Asamoh_: Dear NASA supporters, don't lose hope. Stand strong. We can't accept and move on. No more fraud. Tuko gangari #KenyaDecides
##   favorited favoriteCount replyToSN             created truncated
## 1     FALSE             0      <NA> 2017-08-10 09:59:39      TRUE
## 2     FALSE             0      <NA> 2017-08-10 09:59:37     FALSE
##   replyToSID                 id replyToUID
## 1       <NA> 895585427385483264       <NA>
## 2       <NA> 895585419160543238       <NA>
##                                                                           statusSource
## 1                   <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## 2 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
##       screenName retweetCount isRetweet retweeted longitude latitude
## 1 pitchblank_ent            0     FALSE     FALSE      <NA>     <NA>
## 2       EndesiaC           27      TRUE     FALSE      <NA>     <NA>
tail(tweetDF, 2)
##                                                                                                                                                      text
## 2999 RT @chriskirwa: RETWEET IN SUPPORT - He has made us laugh during the Tense #KenyaDecides - Let's locate &amp; Celebrate him &amp; his Family #Githe<U+0085>
## 3000 RT @chriskirwa: RETWEET IN SUPPORT - He has made us laugh during the Tense #KenyaDecides - Let's locate &amp; Celebrate him &amp; his Family #Githe<U+0085>
##      favorited favoriteCount replyToSN             created truncated
## 2999     FALSE             0      <NA> 2017-08-10 08:22:56     FALSE
## 3000     FALSE             0      <NA> 2017-08-10 08:22:56     FALSE
##      replyToSID                 id replyToUID
## 2999       <NA> 895561088242262016       <NA>
## 3000       <NA> 895561085805367300       <NA>
##                                                                              statusSource
## 2999 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## 3000 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
##        screenName retweetCount isRetweet retweeted longitude latitude
## 2999   SavednSent         2410      TRUE     FALSE      <NA>     <NA>
## 3000 DennisShonko         2410      TRUE     FALSE      <NA>     <NA>
uniqueN(tweetDF$screenName)
## [1] 1960

This gives us the number of distinct users who interacted with the hashtag over the time period.

Now let’s use the data.table package for analysis. Convert the dataframe to a data table and order it by the number of retweets.

tweetDT <- data.table(tweetDF, key="retweetCount")
tweetDT <- tweetDT[order(retweetCount),] 
topRetweets <- tail(tweetDT, 10)
View(topRetweets)

In this case, user @Ahmedkadar1 received the highest number of retweets for his post: “A 100 year old Mzee being led to…” This is the post with the picture showing a 100-year-old man being led to the polling booth by his 70-year-old son. Awwww…

You can also use this to order tweets by the number of times they were favorited.
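As a sketch of that idea, the helper below ranks rows by any numeric column. It uses base R rather than data.table so that it runs on its own; `top_by` and the toy data frame are made up for illustration, and with the real data you would call it on `tweetDF`.

```r
# Return the n rows with the largest values in the given column.
top_by <- function(df, column, n = 10) {
  ranked <- df[order(df[[column]], decreasing = TRUE), , drop = FALSE]
  head(ranked, n)
}

# Toy data standing in for tweetDF
toy <- data.frame(
  screenName = c("user_a", "user_b", "user_c"),
  favoriteCount = c(2, 59, 0),
  retweetCount = c(10, 0, 2410),
  stringsAsFactors = FALSE
)

top_by(toy, "favoriteCount", n = 2) # most favorited first
top_by(toy, "retweetCount", n = 2)  # most retweeted first
```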

We can then plot a smoothed curve of the number of retweets over the time period

ggplot(tweetDF, aes(x=created, y=retweetCount)) + geom_smooth()
## `geom_smooth()` using method = 'gam'

[Figure: smoothed retweetCount plotted against created time]
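A related view of how the hashtag performed is simply the number of tweets in each time interval. The sketch below bins timestamps into 10-minute intervals with base R's `cut()`; `tweets_per_interval` is a hypothetical helper and the four timestamps are toy values, so with the real data you would pass `tweetDF$created`.

```r
# Count tweets in consecutive time bins of the given width.
tweets_per_interval <- function(created, width = "10 min") {
  table(cut(created, breaks = width))
}

# Toy timestamps standing in for tweetDF$created
created <- as.POSIXct(
  c("2017-08-10 08:22:56", "2017-08-10 08:25:10",
    "2017-08-10 08:41:00", "2017-08-10 09:59:39"),
  tz = "UTC"
)
counts_by_time <- tweets_per_interval(created)
print(counts_by_time)
# barplot(counts_by_time, las = 2, cex.names = 0.7) would chart the result
```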

Who has spent time tweeting on this hashtag?

counts <- table(tweetDF$screenName)
counts <- subset(counts, counts > 5)
barplot(counts, las=2, cex.names = 0.7)

[Figure: bar plot of tweet counts for users with more than 5 tweets]

Well done @KBCChannel! Top user of this hashtag within the time period, with over 50 tweets or about 1.7% of all tweets.
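The share quoted above can be computed rather than read off the chart. The following is a minimal sketch with a made-up helper, `top_users`, and toy screen names; with the real data you would pass `tweetDF$screenName`.

```r
# Tabulate tweets per user and report the top n with their share of all tweets.
top_users <- function(screen_names, n = 5) {
  counts <- sort(table(screen_names), decreasing = TRUE)
  top <- head(counts, n)
  data.frame(
    screenName = names(top),
    tweets = as.integer(top),
    share_pct = round(100 * as.integer(top) / length(screen_names), 1),
    stringsAsFactors = FALSE
  )
}

# Toy screen names standing in for tweetDF$screenName
toy_names <- c(rep("user_a", 5), rep("user_b", 3), "user_c", "user_d")
top_users(toy_names, n = 2)
```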

There’s so much that can be done with this dataset. I will try to do another post showing more analysis of it. If there is any analysis you want me to do on this dataset (or another), leave a comment or send me an email. See you soon!

About the Author

Ayodeji Akiwowo is a seasoned Data Science Consultant with over a decade of experience in both industry and academia, specializing in AI and data analytics. With a deep passion for both technology and faith, Ayodeji bridges the gap between modern AI innovations and Christian values, helping churches and faith-based organizations harness the power of AI responsibly. He is dedicated to ensuring that AI enhances church operations while upholding ethical standards and fostering community. Ayodeji is a trusted advisor and speaker, committed to guiding community organisations through the complexities of technology in a way that honours their mission and traditions.
