Implementing Unsupervised Machine Learning on Crime Data


Jun 22, 2017

Dataset from National Bureau of Statistics
Number of Rows: 37
Number of Columns: 5

Hello friends! I had a bit of time this week and thought I should spend some of it practising my RMarkdown skills. Instead of simply playing around with some not-too-useful data, I decided to work with data from the National Bureau of Statistics (NBS) website.

The data is stored in Excel format, which made it easy to download and import into Python or R for analysis. It contains numerical information on reported crime in all states of Nigeria, including the Federal Capital Territory. It is a simple dataset with information on Offences against Persons (oap), Offences against Property (oapr), Offences against Lawful Authority (oala) and Offences against Local Acts (oalo). I produced some data visualizations and applied unsupervised machine learning to it.

I used R for the analysis in this case for no special reason; I could equally have used Python, which I am also fluent in. My aim is simply to show how clustering works and to show some of the questions we could ask when we have sufficient data.

There are various Machine Learning algorithms in existence today. In general, we have two major approaches to learning – Supervised and Unsupervised learning (some will include semi-supervised but I like to keep it simple). Supervised learning extracts a function from labelled training data. The training data consist of a set of training examples. Once the model has been trained, you then test the model using test data. Unsupervised Learning aims to draw inferences from unlabelled data. That is, without having any information on the data, we try to group the data into similar groups by some metric.

In this article, I use the NBS data for unsupervised learning. I implemented clustering, which happens to be one of the most popular unsupervised machine learning techniques. Clustering finds structure by grouping similar objects together into clusters. In general, clustering algorithms work on the idea that objects in the same cluster should be more similar to each other than to objects in other clusters. I implemented both hierarchical and k-means clustering on the dataset.
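To make the idea concrete before touching the crime data, here is a minimal sketch of both techniques on made-up points. The toy matrix, the choice of two clusters, complete linkage and Euclidean distance are all assumptions for illustration, not the settings used on the NBS data.

```r
# Toy data (hypothetical, not the NBS figures): two well-separated groups of points
set.seed(42)
toy <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.5), ncol = 2)
)

# Hierarchical clustering: build a tree from pairwise Euclidean distances,
# then cut the tree into two groups
hc <- hclust(dist(toy), method = "complete")
hc_groups <- cutree(hc, k = 2)

# k-means: partition the points into k = 2 groups around cluster centres
km <- kmeans(toy, centers = 2, nstart = 10)
km_groups <- km$cluster

# Cross-tabulate the two labelings to see how far they agree
table(hierarchical = hc_groups, kmeans = km_groups)
```

The cluster numbers themselves are arbitrary labels, so the cross-table is a more reliable way to compare the two results than checking whether the numbers match.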

The dataset (from the NBS website)

Crime statistics on reported offences show that a total of 125,790 cases were reported in 2016. Offences against property had the highest number, with 65,397 cases reported. Offences against persons accounted for 45,554 cases, while offences against lawful authority and against local acts recorded the fewest, with 12,144 and 2,695 cases respectively. Lagos State had the highest share of total cases reported, with 45,385 cases (36.08%). FCT Abuja and Delta State followed with 13,181 cases (10.48%) and 7,867 cases (6.25%) respectively. Katsina State had the lowest share, with 120 cases (0.10%), followed by Abia State with 364 cases (0.29%) and Zamfara State with 483 cases (0.38%).
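The figures quoted above can be cross-checked with a few lines of R. This is only a sanity-check sketch: the numbers are typed in from the paragraph, and the vector names (categories, states, share) are made up for the example.

```r
# Category totals quoted in the text (2016 reported cases)
categories <- c(property = 65397, persons = 45554,
                lawful_authority = 12144, local_acts = 2695)
total <- sum(categories)
total == 125790   # does the sum match the quoted overall total?

# State totals quoted in the text
states <- c(Lagos = 45385, FCT = 13181, Delta = 7867,
            Zamfara = 483, Abia = 364, Katsina = 120)

# Each state's share of the overall total, as a percentage to 2 decimal places
share <- round(100 * states / total, 2)
share
```

If the category totals and the state shares both reproduce the published numbers, we can be reasonably confident the summary was read correctly.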

crime <- read.csv("crime.csv") # read the data into R

Data Exploration

Some exploration before learning

A key part of a data scientist’s work is data exploration. Data exploration basically involves “looking at” the data to see what sort of information one can glean from it, usually in the form of plots and statistics. Once the data was loaded in R, I used the head(), summary() and str() functions to highlight some details of the dataset.

head()

head(crime)
##      states  oap oapr oala oalo sumall
## 1       FCT 2984 9350  843    4  13181
## 2      ABIA  230  113   21    0    364
## 3   ADAMAWA  779 1417   56    7   2259
## 4 AKWA-IBOM  840  333  232    6   1411
## 5   ANAMBRA  898 1413  142   81   2534
## 6    BAUCHI  812 1713  118   14   2657

summary()

summary(crime)
##        states        oap             oapr            oala       
##  ABIA     : 1   Min.   :   51   Min.   :   65   Min.   :   0.0  
##  ADAMAWA  : 1   1st Qu.:  423   1st Qu.:  497   1st Qu.:  21.0  
##  AKWA-IBOM: 1   Median :  656   Median :  897   Median :  57.0  
##  ANAMBRA  : 1   Mean   : 1231   Mean   : 1767   Mean   : 328.2  
##  BAUCHI   : 1   3rd Qu.:  954   3rd Qu.: 1413   3rd Qu.: 145.0  
##  BAYELSA  : 1   Max.   :15426   Max.   :22885   Max.   :6768.0  
##  (Other)  :31                                                   
##       oalo            sumall     
##  Min.   :  0.00   Min.   :  120  
##  1st Qu.:  0.00   1st Qu.: 1089  
##  Median : 14.00   Median : 1769  
##  Mean   : 72.84   Mean   : 3400  
##  3rd Qu.:105.00   3rd Qu.: 2534  
##  Max.   :356.00   Max.   :45385  
## 

str()

str(crime)
## 'data.frame':    37 obs. of  6 variables:
##  $ states: Factor w/ 38 levels "ABIA","ADAMAWA",..: 15 1 2 3 4 5 6 7 8 9 ...
##  $ oap   : num  2984 230 779 840 898 ...
##  $ oapr  : num  9350 113 1417 333 1413 ...
##  $ oala  : num  843 21 56 232 142 118 91 0 3 100 ...
##  $ oalo  : num  4 0 7 6 81 14 1 129 269 35 ...
##  $ sumall: num  13181 364 2259 1411 2534 ...

The head() function prints out the first six rows of the dataset. You may have noticed that I renamed some of the columns, as the names in the raw data were too long for me. The new names are fairly intuitive and follow the order of the raw dataset. I have also added two new columns (also called features). The sumtotal column (shown as sumall in the output above) was added to confirm the values given in the “Total Number of Cases 2016” column, and the percentage column is simply the sumtotal value as a percentage of the overall total. The first and last three columns were not used in the final clustering analysis.
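As a rough sketch of how such check columns can be built, the snippet below retypes the six rows printed by head(crime) and recomputes the totals. The data frame name crime_head is made up for the example, and the overall total of 125,790 is taken from the NBS summary quoted earlier; the steps actually used to create the columns are not shown in the article.

```r
# The six rows printed by head(crime), typed in by hand
crime_head <- data.frame(
  states = c("FCT", "ABIA", "ADAMAWA", "AKWA-IBOM", "ANAMBRA", "BAUCHI"),
  oap    = c(2984, 230, 779, 840, 898, 812),
  oapr   = c(9350, 113, 1417, 333, 1413, 1713),
  oala   = c(843, 21, 56, 232, 142, 118),
  oalo   = c(4, 0, 7, 6, 81, 14),
  sumall = c(13181, 364, 2259, 1411, 2534, 2657)
)

# Recompute each row total from the four offence columns
recomputed <- rowSums(crime_head[, c("oap", "oapr", "oala", "oalo")])
all(recomputed == crime_head$sumall)   # TRUE if every total agrees

# Share of the national total (125,790 is the figure quoted in the text)
crime_head$percentage <- round(100 * crime_head$sumall / 125790, 2)
crime_head[, c("states", "sumall", "percentage")]
```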

The summary() output gives summary statistics for each column. For a non-numerical column such as states it can only count occurrences, so the usual statistics do not apply there. For the numerical columns, however, we can easily see, for example, that the mean and median oap (offences against persons) are 1231 and 656 respectively.
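A mean well above the median usually points to a few very large values pulling the average up. The sketch below shows the same effect using only the six oap values printed by head(crime), so the numbers differ from the full-dataset statistics above; the variable name oap_head is made up for the example.

```r
# oap values for the six states shown by head(crime)
oap_head <- c(2984, 230, 779, 840, 898, 812)

mean(oap_head)     # 1090.5, pulled up by the single large value (FCT)
median(oap_head)   # 826, barely affected by it

# Dropping the largest value brings the mean much closer to the median
mean(oap_head[oap_head < max(oap_head)])   # 711.8
```

This is one reason the median is often the safer "typical value" for skewed data like crime counts.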

The str() function gives information important to every data scientist. Immediately we can see from the output above that the data contains 37 observations and 6 variables. The variables are the columns and are also known as features.

For each variable, we also have information on its type. For example, the states variable is a Factor with 38 levels (the state names are really just labels, but R reads them in as a factor). The other variables are numerical (num) variables.
The above functions are very useful for a data scientist. They are typically among the first tools he or she will use after loading the data into R.
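Note that str() reports 38 levels for only 37 rows, which suggests the factor carries one level that no row uses. The article does not say where it comes from; one possibility (an assumption) is a row such as a totals line that was dropped after import. The sketch below uses a small hypothetical factor, not the real column, to show how to spot and remove an unused level.

```r
# Hypothetical example: a factor whose level set contains a value no row uses
states <- factor(c("ABIA", "ADAMAWA", "FCT"),
                 levels = c("ABIA", "ADAMAWA", "FCT", "TOTAL"))

length(states)    # number of observations
nlevels(states)   # number of levels, one more than the observations here

# Count rows per level; a level with a count of 0 is unused
table(states)

# Drop unused levels, or treat the column as plain text labels instead
states_clean <- droplevels(states)
nlevels(states_clean)
state_names <- as.character(states)
```

Since the state names are not used as a clustering feature, either option would be fine here; cleaning them up mostly keeps later plots and tables tidy.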

Luckily for us, the data is clean and does not need much pre-processing. Any data scientist will tell you that this isn’t always the case. We can now do some visualizations!

About the Author

Ayodeji Akiwowo is a seasoned Data Science Consultant with over a decade experience in both industry and academia, specializing in AI and data analytics. With a deep passion for both technology and faith, Ayodeji bridges the gap between modern AI innovations and Christian values, helping churches and faith-based organizations harness the power of AI responsibly. He is dedicated to ensuring that AI enhances church operations while upholding ethical standards and fostering community. Ayodeji is a trusted advisor and speaker, committed to guiding community organisations through the complexities of technology in a way that honours their mission and traditions.
