Hello friends! Today’s post is a blast from the past! I worked on this during my early days of text mining in R with the tm package and a few others. I will show the most frequently used words in the Bible via a frequency table and a word cloud. For those new to R, I have included all the code used in this post, so feel free to re-use it as you sharpen your R skills. If you have any questions, note them in the comment box below and I will definitely get back to you!
Out of curiosity, I decided to do some analytics on the Bible. Luckily, there are quite a number of electronic versions available, including this one at Project Gutenberg. I must also say this work was inspired by another Rstats project that did some analytics on the entire works of William Shakespeare.
The Bible is perhaps the most popular book in the world. It was written over a period of roughly 1,600 years, between about 1500 BC and 100 AD, and contains 66 smaller books written by various authors. I used R as my language of choice here, but I intend to do the same thing in Python in another post. So let’s dive in!
Quick info: There are approximately 788,258 words in the King James Bible (excluding the Hebrew alphabet headings in Psalm 119 and the superscriptions listed in some of the Psalms). Of these, 14,565 are unique.
Let’s Get Started!
Some of the stuff you’ll learn:
- Reading a text file into R
- Getting and setting the working directory in R
- Installing and loading packages in R
- Inspecting data
- Using data frames
- Preprocessing text – stopwords, stemming, lowercasing, etc.
- Plotting frequency graphs
- Drawing word clouds
Loading the document and required packages
The first thing we need to do is load the document into R. There are several ways of scraping the web for data with R, but in this case I took the easy way out: I copied and pasted the text into a plain text file and then loaded that file into R.
Load the text and store it in an object called bible.
text <- "bible.txt"
bible <- readLines(text)
## Warning in readLines(text): incomplete final line found on 'bible.txt'
readLines() is one of several ways of reading files from your computer into R; others include read.csv() and read.delim(). The warning about an incomplete final line simply means the file does not end with a newline character and is harmless here. If you get an error running the readLines() call above, check that you are in the correct working directory by typing getwd(); if not, set it with setwd().
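For example, a minimal sketch (the folder path below is just a placeholder; replace it with wherever you saved bible.txt):

getwd()                                # where is R currently looking for files?
# setwd("~/Documents/bible-analysis")  # placeholder path: point R at the folder containing bible.txt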
Once the file is loaded, you can install and load the required packages.
#install.packages(c("tm", "SnowballC", "ggplot2", "wordcloud"))
library(tm)
library(SnowballC)
library(ggplot2)
library(wordcloud)
Inspecting the loaded document and removing irrelevant sections
You can inspect the text by looking at its first few and last few lines (head() and tail() each show six by default).
head(bible)
## [1] "Project Gutenberg's The Bible, King James Version, Complete, by Various" ## [2] "" ## [3] "This eBook is for the use of anyone anywhere at no cost and with" ## [4] "almost no restrictions whatsoever. You may copy it, give it away or" ## [5] "re-use it under the terms of the Project Gutenberg License included" ## [6] "with this eBook or online at www.gutenberg.net"
tail(bible)
## [1] " http://www.gutenberg.net" ## [2] "" ## [3] "This Web site includes information about Project Gutenberg-tm," ## [4] "including how to make donations to the Project Gutenberg Literary" ## [5] "Archive Foundation, how to help produce our new eBooks, and how to" ## [6] "subscribe to our email newsletter to hear about new eBooks."
As you can see from the output, some extra text that is not part of the Bible itself is included in our file. We need to remove these additional lines, and to do that we need to find out where they occur in the document. By scanning through the text, I found that the front matter runs from line 1 to line 94. To remove it, we just need to:
bible <- bible[-(1:94)]
With more scanning, I found that the external text at the end runs from line 114103 to the end. Hence we remove this using the line of code:
bible <- bible[-(114103:length(bible))]
While we are at it, let’s also strip out the word “and” (the word-boundary pattern \b makes sure words like “hand” and “command” are left untouched):
bible <- gsub("\\band\\b", "", bible)  # remove the standalone word "and"; stopword removal later would also catch it
Let’s see if we now have just the contents of the Bible:
head(bible)
## [1] "Book 01 Genesis" ## [2] "" ## [3] "01:001:001 In the beginning God created the heaven the earth." ## [4] "" ## [5] "01:001:002 And the earth was without form, void; darkness was" ## [6] " upon the face of the deep. And the Spirit of God moved upon"
tail(bible)
## [1] " written in this book." ## [2] "" ## [3] "66:022:020 He which testifieth these things saith, Surely I come quickly." ## [4] " Amen. Even so, come, Lord Jesus." ## [5] "" ## [6] "66:022:021 The grace of our Lord Jesus Christ be with you all. Amen."
Pre-processing
Now we know that all we have in this document is simply the text of the Bible. Well, not quite. If you visually inspect the text file, you will notice that each book starts with a title line (such as "Book 01 Genesis" in the output above). These title lines will not affect our analytics much, so we can overlook them for now; in a later update of this post I will show two different methods for removing them (other than hunting down their line numbers one by one).
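For the curious, here is one possible quick-and-dirty sketch (not one of the methods I will cover later). It assumes every title line follows the exact "Book NN Name" pattern seen in the output above and simply drops any matching line:

bible <- bible[!grepl("^Book [0-9]{2} ", bible)]   # drop lines that look like book titles, e.g. "Book 01 Genesis"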
Next we concatenate all of the lines into a single string.
bible <- paste(bible, collapse = " ")
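As a quick sanity check, you can confirm the collapse worked: there should now be exactly one element, and nchar() tells you how many characters it contains.

length(bible)   # should be 1, a single long string
nchar(bible)    # total number of characters in that string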
We now have one big document containing the entire Bible. Now let’s convert this into a corpus. A corpus is simply a collection of text documents; it is the main structure for managing documents in the tm (text mining) package in R.
docs <- Corpus(VectorSource(bible))
The next step is to pre-process the data. For me, the very first task is to convert all the text to lower case, which ensures that a word looks exactly the same every time it occurs. Pre-processing also removes characters that will not be useful for our analytics, such as the ‘@’ symbol (not that I expect any in the Bible) and punctuation. We can also remove numbers (digits), as I am not really interested in them here. Words such as ‘and’, ‘at’ and ‘the’ are very common in the English language but not useful for our analysis; we call these words stopwords, and we need to drop them too. The tm package handles all of this, so make sure you have installed and loaded it before the next steps.
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
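If you are curious which words count as stopwords, you can peek at tm’s built-in English list; the first few entries are short function words like ‘i’, ‘me’ and ‘my’.

head(stopwords("english"))   # preview the built-in English stopword list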
Next we remove common word endings like ‘ing’, ‘es’ and so on, so that words like ‘living’ and ‘live’ are treated as the same word. This process is called stemming, and the SnowballC package makes it easy. Once that is done, we strip out the extra whitespace created by the steps above.
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
To complete the pre-processing stage, we tell R to treat the pre-processed document as a plain text document.
docs <- tm_map(docs, PlainTextDocument)
The next stage in this analysis is to create a document-term matrix (DTM). According to Wikipedia, a document-term matrix is a mathematical matrix that describes the frequency of terms occurring in a collection of documents: rows correspond to documents in the collection and columns correspond to terms. In simple terms, a matrix is a collection of objects arranged in rows and columns. Our DTM extracts the terms and their frequency of occurrence and outputs them as a matrix, as shown below.
dtm <- DocumentTermMatrix(docs)
inspect(dtm)
## <<DocumentTermMatrix (documents: 2, terms: 9134)>>
## Non-/sparse entries: 9134/9134
## Sparsity           : 50%
## Maximal term length: 18
## Weighting          : term frequency (tf)
## Sample             :
##          Terms
## Docs       god lord said shall  son thee  thi thou unto will
##   content 4728 8007 3999  9838 3486 3826 4600 5474 8997 3893
##   meta       0    0    0     0    0    0    0    0    0    0
Exploration
Let’s order the terms by their frequency.
freq <- colSums(as.matrix(dtm))
ord <- order(freq)
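A slightly more direct alternative, if you prefer, is to sort the frequency vector itself in decreasing order:

freq_sorted <- sort(freq, decreasing=TRUE)   # most frequent terms first
head(freq_sorted, 10)                        # the ten most frequent (stemmed) terms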
Let’s export the matrix to a CSV file, which you can open in Excel.
my_file <- as.matrix(dtm)
dim(my_file)
## [1] 2 9134
write.csv(my_file, file="biblewords.csv")
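If you would rather export a tidy two-column word/frequency table sorted from most to least frequent, here is a small sketch (the file name is just my own choice):

word_freq <- data.frame(word=names(freq), freq=freq)    # one row per term
word_freq <- word_freq[order(-word_freq$freq), ]        # most frequent first
write.csv(word_freq, file="biblewords_sorted.csv", row.names=FALSE)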
Then find the least frequent and the most frequent words.
freq[head(ord)]
##  abaddon  abagtha    abana   abdeel   abdiel abelmaim
##        1        1        1        1        1        1
freq[tail(ord)]
##   thi   god  thou  lord  unto shall
##  4600  4728  5474  8007  8997  9838
As you can see, the least frequent words include abaddon, abagtha, abana, etc., whilst the most frequent word in the Bible is shall! Let’s find the words in the Bible that occur a thousand or more times.
findFreqTerms(dtm, lowfreq=1000)
## [1] "also" "behold" "came" "children" "citi" "come" ## [7] "david" "day" "even" "everi" "father" "god" ## [13] "great" "hast" "hath" "hous" "israel" "king" ## [19] "let" "lord" "made" "make" "man" "may" ## [25] "men" "name" "now" "offer" "one" "pass" ## [31] "peopl" "said" "saith" "say" "shall" "shalt" ## [37] "son" "thee" "therefor" "thi" "thing" "thou" ## [43] "unto" "upon" "went" "will" "word"
Creating Plots
Perhaps you are a visual person and prefer to see the words in a chart rather than a table? We’ll draw a bar chart of the frequencies! First we need to convert the frequencies to a data frame.
bibleText <- data.frame(word=names(freq), freq=freq) # Convert to dataframe
subBible <- ggplot(subset(bibleText, freq>2000), aes(word, freq)) # plot subset text with frequency > 2000
subBible <- subBible + geom_bar(stat="identity")
subBible <- subBible + theme(axis.text.x=element_text(angle=45, hjust=1))
subBible
The bars above come out in alphabetical order; a nicer version reorders them by decreasing frequency:
ggplot(subset(bibleText, freq>2000), aes(x = reorder(word, -freq), y = freq)) +
geom_bar(stat = "identity") +
theme(axis.text.x=element_text(angle=45, hjust=1))
We get a nice, clean graph, shown above, and it’s now easy to spot the most frequent terms. Note that we have only plotted the terms which occur more than 2,000 times in the Bible. What interesting things do you notice? Please share in the comments below. Is it surprising that ‘unto’ is the second most frequent word, given the version of the Bible we used?
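If you want proper axis labels and a title, you can tack labs() onto the same plot. This is purely cosmetic and not part of the original analysis; the label text is my own.

ggplot(subset(bibleText, freq > 2000), aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Term", y = "Frequency", title = "Terms occurring more than 2,000 times in the KJV")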
Word Clouds
We can also build a word cloud of the text. To make it more interesting and include more words, this time I will take every word that appears at least 500 times.
wordcloud(names(freq), freq, min.freq=500)
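One small tip: wordcloud() places words in a random order by default, so the picture changes every time you run it. If you want a reproducible layout with the most frequent words in the centre (a minor tweak, not something from the original post), fix the random seed and turn off the random ordering:

set.seed(1234)   # any seed works; fixing it makes the layout reproducible
wordcloud(names(freq), freq, min.freq=500, random.order=FALSE)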
I’m sure our word cloud can be more interesting with colours and all, so let’s do that with the code below. Check out whose name made it into the “over 1,000 occurrences” hall of fame. Can you find it? Is it better in colour?
wordcloud(names(freq), freq, min.freq=1000, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
Can you form any meaningful sentence from the word cloud? I’ll give it a try: “Man shall therefore behold thy saying”. Not sure that makes any sense 😀 I’d love to see your examples in the comment box.
Finally, the top 10 most frequent words in the Bible are:
wordcloud(names(freq), freq, min.freq=3000, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
Again, let’s see if you can form a meaningful sentence from these words. That’s it for now. If you have any suggestions or questions, please add them to the comment box. In subsequent posts I will replicate the above in Python for all you Python lovers out there. My next post will deal with sentiment analysis using Twitter data. Click here to go straight to the post. Don’t forget to add your comments below!