Dataset from National Bureau of Statistics
Number of Rows: 37
Number of Columns: 5
Hello friends! I had a bit of time this week and thought I should spend some of it practising my RMarkdown skills. Instead of simply playing around with some not-too-useful data, I decided to work with data from the National Bureau of Statistics (NBS) website.
The data is stored in Excel format, which made it easy to download and import into Python or R for analysis. It contains numerical information on reported crime in all states of Nigeria, including the Federal Capital Territory. It is a simple dataset with information on Offences against Persons (oap), Offences against Property (oapr), Offences against Lawful Authority (oala) and Offences against Local Acts (oalo). I produced some data visualizations and applied unsupervised machine learning to it.
I used R for the analysis in this case just because (no special reason). I could have used Python, which I am also fluent in. My aim in doing this is just to show how clustering works and to show some of the questions we could ask when we have sufficient data.
There are various machine learning algorithms in existence today. In general, we have two major approaches to learning: supervised and unsupervised learning (some will include semi-supervised, but I like to keep it simple). Supervised learning extracts a function from labelled training data. The training data consists of a set of training examples, each pairing an input with its known label. Once the model has been trained, you then test it using test data it has not seen. Unsupervised learning aims to draw inferences from unlabelled data. That is, without having any labels for the data, we try to group the observations into similar groups by some metric.
In this article, I use the NBS data for unsupervised learning. I implemented clustering, which happens to be one of the most popular unsupervised machine learning techniques. Clustering basically finds structure by grouping similar objects together into clusters. In general, clustering algorithms work on the idea that objects in the same cluster should be more similar to each other than to objects in other clusters. I implemented both hierarchical and k-means clustering on the dataset.
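To make the two techniques concrete, here is a minimal sketch in R. It is not my exact analysis code: it uses only the six rows printed by head(crime) further down in this post, and the choice of two clusters, the scaling step and the seed are assumptions made purely for illustration.

```r
# Six rows copied from the head(crime) output in this post (illustration only).
crime_head <- data.frame(
  states = c("FCT", "ABIA", "ADAMAWA", "AKWA-IBOM", "ANAMBRA", "BAUCHI"),
  oap    = c(2984, 230, 779, 840, 898, 812),
  oapr   = c(9350, 113, 1417, 333, 1413, 1713),
  oala   = c(843, 21, 56, 232, 142, 118),
  oalo   = c(4, 0, 7, 6, 81, 14)
)

# Keep only the four offence columns and standardise them so that no single
# column dominates the distance calculation.
features <- scale(crime_head[, c("oap", "oapr", "oala", "oalo")])
rownames(features) <- crime_head$states

# Hierarchical clustering: Euclidean distances, complete linkage (hclust default).
hc <- hclust(dist(features))
hc_groups <- cutree(hc, k = 2)  # k = 2 is an arbitrary choice for this sketch

# k-means with the same (assumed) number of clusters.
set.seed(42)                    # k-means starts from random centres
km <- kmeans(features, centers = 2, nstart = 10)

print(hc_groups)   # cluster label per state from the hierarchical tree
print(km$cluster)  # cluster label per state from k-means
```

Both calls return one cluster label per state. With the full 37-row dataset you would replace crime_head with the crime data frame and experiment with the number of clusters.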
The dataset (from the NBS website)
Crime Statistics on reported offences reflected that a total of 125,790 cases were reported in 2016. Offence against property has the highest number of cases reported with 65,397 of such cases reported. Offence against persons recorded 45,554 cases reported while offence against lawful authority and local acts recorded the least with 12,144 and 2,695 cases recorded respectively. Lagos State has the highest percentage share of total cases reported with 36.08% and 45,385 cases recorded. FCT Abuja and Delta State followed closely with 10.48% and 13,181 and 6.25% and 7,867 cases recorded respectively. Katsina State has the lowest percentage share of total cases reported with 0.10% and 120 cases recorded. Abia and Zamfara States followed closely with 0.29% and 364 and 0.38% and 483 cases recorded respectively.
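The figures quoted above can be cross-checked with a few lines of R. This is just a sanity-check sketch: the numbers are typed in by hand from the NBS summary, and the vector names (offences, state_cases, share) are my own.

```r
# Reported cases per offence category, copied from the NBS summary above.
offences <- c(property = 65397, persons = 45554,
              lawful_authority = 12144, local_acts = 2695)
total <- sum(offences)

# Reported cases for the states mentioned in the summary.
state_cases <- c(Lagos = 45385, FCT = 13181, Delta = 7867,
                 Katsina = 120, Abia = 364, Zamfara = 483)

# Percentage share of the national total, rounded to two decimals.
share <- round(100 * state_cases / total, 2)

print(total)  # 125790
print(share)  # 36.08 10.48 6.25 0.10 0.29 0.38
```

The category counts add up to the stated 125,790 total, and each state's share matches the percentage quoted by NBS.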
crime <- read.csv("crime.csv") # read the data into R
Data Exploration
Some exploration before learning
A key part of the data scientist’s work is data exploration. Data exploration basically involves “looking at” the data to see what sort of information one can glean from it. This is usually done with plots and summary statistics. Once the data was loaded into R, I used the head(), summary() and str() functions to highlight some details of the dataset.
head()
head(crime)
##      states  oap oapr oala oalo sumall
## 1       FCT 2984 9350  843    4  13181
## 2      ABIA  230  113   21    0    364
## 3   ADAMAWA  779 1417   56    7   2259
## 4 AKWA-IBOM  840  333  232    6   1411
## 5   ANAMBRA  898 1413  142   81   2534
## 6    BAUCHI  812 1713  118   14   2657
summary()
summary(crime)
##        states        oap             oapr            oala
##  ABIA     : 1   Min.   :   51   Min.   :   65   Min.   :   0.0
##  ADAMAWA  : 1   1st Qu.:  423   1st Qu.:  497   1st Qu.:  21.0
##  AKWA-IBOM: 1   Median :  656   Median :  897   Median :  57.0
##  ANAMBRA  : 1   Mean   : 1231   Mean   : 1767   Mean   : 328.2
##  BAUCHI   : 1   3rd Qu.:  954   3rd Qu.: 1413   3rd Qu.: 145.0
##  BAYELSA  : 1   Max.   :15426   Max.   :22885   Max.   :6768.0
##  (Other)  :31
##       oalo            sumall
##  Min.   :  0.00   Min.   :  120
##  1st Qu.:  0.00   1st Qu.: 1089
##  Median : 14.00   Median : 1769
##  Mean   : 72.84   Mean   : 3400
##  3rd Qu.:105.00   3rd Qu.: 2534
##  Max.   :356.00   Max.   :45385
##
str()
str(crime)
## 'data.frame':    37 obs. of  6 variables:
##  $ states: Factor w/ 38 levels "ABIA","ADAMAWA",..: 15 1 2 3 4 5 6 7 8 9 ...
##  $ oap   : num  2984 230 779 840 898 ...
##  $ oapr  : num  9350 113 1417 333 1413 ...
##  $ oala  : num  843 21 56 232 142 118 91 0 3 100 ...
##  $ oalo  : num  4 0 7 6 81 14 1 129 269 35 ...
##  $ sumall: num  13181 364 2259 1411 2534 ...
The head() function prints out the first six rows of the dataset. You may have noticed that I renamed some of the columns, as the names in the raw data were too long for me. The new names are fairly intuitive and follow the order of the raw dataset. I also added two new columns (features). The sumall column was added to confirm the values given in the “Total Number of Cases 2016” column, and the percentage column (not shown in the output above) is simply each state's sumall value as a percentage of the overall total. The first and last three columns were not used in the final clustering analysis.
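As a rough sketch of how such derived columns can be built and checked, the snippet below recreates the six rows printed by head(crime) and recomputes the totals. The column names sum_check and percentage are hypothetical names chosen for this example, and the national total of 125,790 is taken from the NBS summary quoted earlier.

```r
# Six rows copied from the head(crime) output above (illustration only).
crime_head <- data.frame(
  states = c("FCT", "ABIA", "ADAMAWA", "AKWA-IBOM", "ANAMBRA", "BAUCHI"),
  oap    = c(2984, 230, 779, 840, 898, 812),
  oapr   = c(9350, 113, 1417, 333, 1413, 1713),
  oala   = c(843, 21, 56, 232, 142, 118),
  oalo   = c(4, 0, 7, 6, 81, 14),
  sumall = c(13181, 364, 2259, 1411, 2534, 2657)
)

offence_cols <- c("oap", "oapr", "oala", "oalo")

# Recompute the per-state total from the four offence columns.
crime_head$sum_check <- rowSums(crime_head[, offence_cols])

# Each state's share of the national total reported by NBS (125,790 cases).
national_total <- 125790
crime_head$percentage <- round(100 * crime_head$sumall / national_total, 2)

print(crime_head[, c("states", "sumall", "sum_check", "percentage")])
```

For these six rows the recomputed totals agree with the sumall column, which is the kind of confirmation the extra column was meant to provide.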
The summary() output shows the summary statistics of the data. Obviously, columns with non-numerical values do not give meaningful statistics; for the states column, summary() only counts how often each level occurs. For the numerical columns, however, we can easily see that the mean and median oap (offences against persons), for example, are 1231 and 656 respectively.
The str() function gives information that is important to every data scientist. Immediately we can see that the data contains 37 observations and 6 variables. The variables are the columns and are also known as features.
For each variable, we also get information on its type. For example, the states variable is a Factor with 38 levels (the state names are really just text labels, but R has read them in as a factor). The other variables are numerical (num) variables.
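If you would rather keep the state names as plain text, read.csv() lets you control this through its stringsAsFactors argument. The sketch below uses a tiny made-up CSV string (csv_text) instead of the real file, so the variable names here are only for illustration. Note that from R 4.0.0 onwards the default is stringsAsFactors = FALSE, so newer versions of R will read the column as character unless told otherwise.

```r
# A tiny inline CSV standing in for crime.csv (values taken from head(crime)).
csv_text <- "states,oap\nFCT,2984\nABIA,230\nADAMAWA,779"

# Read the text column as a factor (the behaviour seen in the str() output above).
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Read the text column as plain character strings instead.
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(as_factor)  # states: Factor w/ 3 levels
str(as_text)    # states: chr

# A factor column can also be converted after loading.
converted <- as.character(as_factor$states)
```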
The above functions are very useful for a data scientist. They are typically among the first tools he or she will use after loading the data into R.
Luckily for us, the data is clean and does not need much pre-processing. Any data scientist will tell you that this isn’t always the case. We can now do some visualizations!