Text Mining the KJV Bible

Table of Contents

  1. Loading the document and required packages
  2. Inspecting the loaded document and removing irrelevant sections
  3. Pre-processing
  4. Exploration
  5. Creating Plots
  6. Word Clouds

Sep 8, 2016

Hello friends! Today’s post is a blast from the past! I worked on this during my early days of text mining using the tm package in R, along with some other packages. I will show the most frequently used words via a frequency table and a word cloud. For those new to R, I have released all the code used in this post, so feel free to re-use it as you sharpen your R skills. If you have any questions, please note them down in the comment box below and I will definitely get back to you!

Out of curiosity, I decided to do some analytics on the Bible. Luckily, there are quite a number of electronic versions, including this one here at Project Gutenberg. I must also say this work was inspired by another Rstats project, which did some analytics on the complete works of William Shakespeare.

The Bible is perhaps the most popular book in the world. It was written over a period of roughly 1,600 years, between 1500 BC and AD 100, and contains 66 smaller books by various authors. I used R as my language of choice in this case, but I intend to do the same thing in another post using Python. So let’s dive in!

Quick info: There are approximately 788,258 words in the King James Bible (excluding the Hebrew Alphabet in Psalm 119 or the superscriptions listed in some of the Psalms). Of these, 14,565 are unique.

Let’s Get Started!

Some of the stuff you’ll learn:

  1. How to read a text file into R
  2. Getting and setting the working directory in R
  3. Installing and loading packages in R
  4. Inspecting data
  5. Using data frames
  6. Pre-processing data: stopwords, stemming, lowercasing, etc.
  7. Plotting frequency graphs
  8. Drawing word clouds

Loading the document and required packages

The first thing we need to do is load the document into R. There are several ways of scraping the web for data with R, but in this case I have taken the easy way out: I copied and pasted the text into a text file and then loaded that file into R.

Load the text and store it in a container called bible.

text <- "bible.txt"
bible <- readLines(text)
## Warning in readLines(text): incomplete final line found on 'bible.txt'

readLines() is one of the ways of reading files from your computer into R. Others include read.csv() and read.delim(). If you get an error running the second line above, check whether you are in the correct working directory by typing getwd(). If not, you can set your working directory using setwd().
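For those new to R, the two commands look like this (the path below is a placeholder I made up — point it at whatever folder actually holds bible.txt):

```r
getwd()                         # print the current working directory
setwd("~/projects/bible-text")  # hypothetical path; change to the folder containing bible.txt
```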

Once the file is loaded, you can install and load the required packages.

#install.packages(c("tm", "SnowballC", "ggplot2", "wordcloud"))

library(tm)

library(SnowballC)

library(ggplot2)

library(wordcloud)

Inspecting the loaded document and removing irrelevant sections

You can inspect the text by looking at the first few lines and the last few lines.

head(bible)
## [1] "Project Gutenberg's The Bible, King James Version, Complete, by Various"
## [2] ""                                                                       
## [3] "This eBook is for the use of anyone anywhere at no cost and with"       
## [4] "almost no restrictions whatsoever.  You may copy it, give it away or"   
## [5] "re-use it under the terms of the Project Gutenberg License included"    
## [6] "with this eBook or online at www.gutenberg.net"
tail(bible)
## [1] "     http://www.gutenberg.net"                                     
## [2] ""                                                                  
## [3] "This Web site includes information about Project Gutenberg-tm,"    
## [4] "including how to make donations to the Project Gutenberg Literary" 
## [5] "Archive Foundation, how to help produce our new eBooks, and how to"
## [6] "subscribe to our email newsletter to hear about new eBooks."

As you can see from the output, some extra text that is not part of the Bible is included in our file, and we surely need to remove it. To do that, we first need to find out where these extra passages occur in the document. By scanning through the text, I found that the front matter runs from line 1 to line 94. To remove it, we just need to:

bible <- bible[-(1:94)]

With more scanning, I found that the extra text at the end runs from line 114103 onwards. We remove it with this line of code:

bible <- bible[-(114103:length(bible))]
bible <- gsub("\\band\\b", "", bible)  # \\b word boundaries leave words like "command" intact
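Hard-coded line numbers break whenever the source file changes between Gutenberg releases. As a rough alternative sketch — assuming the file keeps Gutenberg’s usual start and licence markers, which may vary by release — grep() can locate the boundaries for you:

```r
# Instead of the two fixed-index removals above:
start <- grep("^Book 01", bible)[1]                      # first line of Genesis
end   <- grep("End of the Project Gutenberg", bible)[1]  # start of the closing licence
bible <- bible[start:(end - 1)]
```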

Let’s see if we now have just the contents of the Bible.

head(bible)
## [1] "Book 01        Genesis"                                                
## [2] ""                                                                      
## [3] "01:001:001 In the beginning God created the heaven  the earth."        
## [4] ""                                                                      
## [5] "01:001:002 And the earth was without form,  void;  darkness was"       
## [6] "           upon the face of the deep. And the Spirit of God moved upon"
tail(bible)
## [1] "           written in this book."                                         
## [2] ""                                                                         
## [3] "66:022:020 He which testifieth these things saith, Surely I come quickly."
## [4] "           Amen. Even so, come, Lord Jesus."                              
## [5] ""                                                                         
## [6] "66:022:021 The grace of our Lord Jesus Christ be with you all. Amen."

Pre-processing

Now we know that all we have in this document is simply the text of the Bible. Well, not quite. If you inspect the text file visually, you will notice a title line at the start of every book — you can see Genesis’s in the head(bible) output above. These titles will not affect our analytics much, so we can overlook them for now. In subsequent updates of this post, I will show you two different methods for removing these titles (apart from hunting down their respective line numbers one by one).
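As a quick sketch of one such method: every title line in this file starts with “Book” followed by a two-digit number (as in the head(bible) output above), so a single grepl() call can drop them all — assuming no verse line happens to match the same pattern:

```r
# Drop every book-title line, e.g. "Book 01        Genesis"
bible <- bible[!grepl("^Book [0-9]{2}", bible)]
```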

Next we concatenate all of the lines into a single string.

bible <- paste(bible, collapse = " ")

We now have one big document containing the entire Bible. Now let’s convert this into a corpus. A corpus is simply a collection of text documents; it is the main structure for managing documents in the tm text-mining package in R.

docs <- Corpus(VectorSource(bible))

The next step is to pre-process the data. For me, the very first step is to convert all the text to lower case, which ensures that a word appears exactly the same every time it occurs. Pre-processing also removes words and characters that are not useful to our analytics: characters such as the ‘@’ symbol (not that I expect any in the Bible) and punctuation, for instance, so it is important that we strip these out. We can also remove numbers (digits), as I am not really interested in them. Words such as ‘and’, ‘at’ and ‘the’ are very common in English but carry little value for our analytics; these are called stopwords, and we need to get rid of them too. The tm package handles all of this, so make sure it is installed and loaded before the next steps.

docs <- tm_map(docs, content_transformer(tolower))

docs <- tm_map(docs, removePunctuation)

docs <- tm_map(docs, removeNumbers)

docs <- tm_map(docs, removeWords, stopwords("english"))

Next we remove common word endings like ‘ing’, ‘es’ and so on, so that words like ‘living’ and ‘live’ are treated as one, since one is just an inflection of the other. This process is called stemming. Again, R makes it very easy for us via the SnowballC package. Once this is done, we get rid of all the unnecessary whitespace created by the pre-processing above.

docs <- tm_map(docs, stemDocument)

docs <- tm_map(docs, stripWhitespace)

To complete the pre-processing stage, we tell R to treat the pre-processed document as a plain text document.

docs <- tm_map(docs, PlainTextDocument)

The next stage in this analysis is to create a document-term matrix (DTM). According to Wikipedia, a document-term matrix is a mathematical matrix that describes the frequency of terms occurring in a collection of documents: rows correspond to documents in the collection and columns correspond to terms. In simple terms, a matrix is just a collection of values arranged in rows and columns. Our DTM extracts each term and its frequency of occurrence and outputs them as a matrix, as shown below.

dtm <- DocumentTermMatrix(docs)
inspect(dtm)
## <<DocumentTermMatrix (documents: 2, terms: 9134)>>
## Non-/sparse entries: 9134/9134
## Sparsity           : 50%
## Maximal term length: 18
## Weighting          : term frequency (tf)
## Sample             :
##          Terms
## Docs       god lord said shall  son thee  thi thou unto will
##   content 4728 8007 3999  9838 3486 3826 4600 5474 8997 3893
##   meta       0    0    0     0    0    0    0    0    0    0

Exploration

Let’s order the terms by their frequency.

freq <- colSums(as.matrix(dtm))
ord <- order(freq)

Let’s export the matrix to a CSV file, which you can open in Excel.

my_file <- as.matrix(dtm)   
dim(my_file)
## [1]    2 9134
write.csv(my_file, file="biblewords.csv")
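The raw matrix is wide (2 rows by 9,134 columns), which is awkward to browse in a spreadsheet. As an alternative sketch (the output file name here is just my own choice), you could write a sorted two-column word/frequency table instead:

```r
freq_sorted <- sort(freq, decreasing = TRUE)  # most frequent words first
write.csv(data.frame(word = names(freq_sorted), freq = freq_sorted),
          file = "biblewords_sorted.csv", row.names = FALSE)
```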

Then we find the least frequent and most frequent words.

freq[head(ord)]
##  abaddon  abagtha    abana   abdeel   abdiel abelmaim 
##        1        1        1        1        1        1
freq[tail(ord)]
##   thi   god  thou  lord  unto shall 
##  4600  4728  5474  8007  8997  9838

As you can see, the least frequent words include abaddon, abagtha, abana, etc., whilst the most frequent word in the Bible (after removing stopwords) is shall! Let’s find the words in the Bible that occur a thousand or more times.

findFreqTerms(dtm, lowfreq=1000)
##  [1] "also"     "behold"   "came"     "children" "citi"     "come"    
##  [7] "david"    "day"      "even"     "everi"    "father"   "god"     
## [13] "great"    "hast"     "hath"     "hous"     "israel"   "king"    
## [19] "let"      "lord"     "made"     "make"     "man"      "may"     
## [25] "men"      "name"     "now"      "offer"    "one"      "pass"    
## [31] "peopl"    "said"     "saith"    "say"      "shall"    "shalt"   
## [37] "son"      "thee"     "therefor" "thi"      "thing"    "thou"    
## [43] "unto"     "upon"     "went"     "will"     "word"

Creating Plots

Perhaps you are a visual person and would prefer to see the words as a chart instead? We’ll draw a bar chart of the frequencies! First we need to convert them to a data frame.

bibleText <- data.frame(word=names(freq), freq=freq) # Convert to dataframe
subBible <- ggplot(subset(bibleText, freq>2000), aes(word, freq)) # plot subset text with frequency > 2000
subBible <- subBible + geom_bar(stat="identity")

subBible <- subBible + theme(axis.text.x=element_text(angle=45, hjust=1))

subBible

Frequency plot of most common words in the Bible

ggplot(subset(bibleText, freq>2000), aes(x = reorder(word, -freq), y = freq)) +
          geom_bar(stat = "identity") + 
          theme(axis.text.x=element_text(angle=45, hjust=1))

Same as the first plot, but arranged in descending order

We get a nice, clean graph, shown above. Now it’s clear to see the most frequent terms. Note that we have only shown the words which occur more than 2,000 times in the Bible. What interesting things do you notice? Please share in the comments below. Is it surprising that ‘unto’ is the second most frequent word, given the version of the Bible we extracted from?

Word Clouds

We can also build a word cloud of the text. To make it more interesting and include more words, I will lower the threshold to words that appear over 500 times.

wordcloud(names(freq), freq, min.freq=500)

Word cloud of words appearing at least 500 times

I’m sure our word cloud can be more interesting with colours and all. Let’s do that with the code below. Check out whose name made it into the over-1,000-occurrences Hall of Fame. Can you find it? Is it better in colour?

wordcloud(names(freq), freq, min.freq=1000, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))

Colour word cloud of words appearing at least 1,000 times

Can you form any meaningful sentence from the word cloud? I’ll give it a try: “Man shall therefore behold thy saying”. Don’t know if that makes any sense 😀 I’d love to see your examples in the comment box.

Finally, the top 10 most frequent words in the Bible are:

wordcloud(names(freq), freq, min.freq=3000, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))

Word cloud of the ten words appearing at least 3,000 times

Again, let’s see if you can form a meaningful sentence from these words. That’s it for now. If you have any suggestions or questions for me, please add them to the comment box. In subsequent posts, I will replicate the above in Python for all you Python lovers out there. My next post will deal with sentiment analysis using Twitter data. Don’t forget to add your comments below!

About the Author

Ayodeji Akiwowo is a seasoned Data Science Consultant with over a decade of experience in both industry and academia, specializing in AI and data analytics. With a deep passion for both technology and faith, Ayodeji bridges the gap between modern AI innovations and Christian values, helping churches and faith-based organizations harness the power of AI responsibly. He is dedicated to ensuring that AI enhances church operations while upholding ethical standards and fostering community. Ayodeji is a trusted advisor and speaker, committed to guiding community organisations through the complexities of technology in a way that honours their mission and traditions.

