library(tidyverse) # for data manipulation
library(tidytext) # for tokenization
library(hunspell) # for stemming
library(rvest) # for web scraping
witcher <- read_file("data/witcher.txt")
As a big Witcher fan, I have a habit of rereading all the books during the summer. Recently, I have also become interested in text mining techniques, such as topic modeling and sentiment analysis. The opportunity presented itself and I decided to mix two of my favorite things. The result is a short series about my journey into the world of text mining.
In this post, we do most of the data preparation needed for the forthcoming analyses, with the help of the tidytext, rvest and hunspell packages for R, as well as the document converter Pandoc. The steps are as follows:
Convert the books into computer readable format.
Get the data into R and clean it.
Tokenize the data with tidytext and hunspell.
Scrape the names of characters appearing in the books from the Witcher Wiki using rvest.
Converting into plain text
Let’s get the text into a format that would be easy for a computer to read. I got my hands on the Complete Witcher, which includes all five parts of the main saga, as well as the two collections of short stories and the single (semi-)standalone novel. The text itself is in the EPUB format, designed for electronic readers, and if we want to analyze it, we first need to convert it into plain text.
Enter Pandoc, the almost magical software for converting documents across formats. At first, I was worried about how well Pandoc would handle a format as fancy as EPUB, but it turned out much better than expected. All that was necessary was running a basic Pandoc command in the terminal:
pandoc -f epub -t plain -o witcher.txt witcher.epub
The result was almost perfect, so I opted for just a few manual edits to make my job easier in the future. First, I deleted all the publishing information from the very beginning and the very end of the document. Second, I manually added --- before the start of each book and &&& before the start of each chapter, so that I can easily tell them apart later. The first few lines of the resulting document looked like this:
---The Last Wish
&&&THE VOICE OF REASON 1
She came to him towards morning.
She entered very carefully, moving silently, floating through the
chamber like a phantom; the only sound was that of her mantle brushing
her naked skin. Yet this faint sound was enough to wake the witcher
Importing and cleaning the data
Now that the document is in the correct format, we can import it into R. This is a pretty simple task with the Tidyverse packages: the read_file() call in the setup code at the top of this post reads the whole file into R.
The entire Witcher saga is now contained in a single character string called witcher. Next, we split it into individual books and then we split the books into chapters. This is where the manual preprocessing we did in the previous step pays off, since we know that every book starts with the characters --- and every chapter with &&&:
# Splitting text into books -----------------------------------------------
books <- as.list(str_split(witcher, pattern = "---", simplify = TRUE))
books <- books[-1] # The first element is empty, so we can drop it
# Splitting books into chapters -------------------------------------------
books <- map(books, ~as.list(str_split(.x, pattern = "&&&", simplify = TRUE)))
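To see what the two splits produce, here is a minimal sketch on a made-up string. It uses base R's strsplit() instead of str_split(), so it runs without any packages, and the toy text and the object names (toy, toy_books, toy_chapters) are invented for illustration only.

```r
# Toy text using the same markers as the real file (the contents are made up)
toy <- "---Book A&&&Chapter 1\n\nText one.&&&Chapter 2\n\nText two.---Book B&&&Chapter 1\n\nText three."

# Split into books; the piece before the first "---" is empty, so drop it
toy_books <- strsplit(toy, "---", fixed = TRUE)[[1]][-1]

# Split every book into its title followed by its chapters
toy_chapters <- lapply(toy_books, function(b) strsplit(b, "&&&", fixed = TRUE)[[1]])

length(toy_chapters)   # number of books
toy_chapters[[1]][1]   # the first piece of each book is its title
```

The first piece of every book is the title rather than a chapter, which is why the naming step has to pull it out and drop it.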
We are left with a nested list. The list has eight elements, one for each book, and each element is itself a list of individual chapters. However, the elements are not named, so it’s hard to keep track of which element represents which book or chapter. The next step is therefore to name everything:
# Naming books ------------------------------------------------------------
book_names <- map_chr(books, ~.x[[1]])
book_names <- str_replace_all(book_names, pattern = "\\n", "")
book_names <- str_squish(book_names)
names(books) <- book_names
# Naming chapters ---------------------------------------------------------
books <- map(books, ~.x[-1]) # This is just the name of the book, so we drop it
chapter_names <- map(books, ~map(.x, ~str_extract(.x, pattern = ".+\\n\\n")))
chapter_names <- map(chapter_names, ~map(.x, ~str_replace_all(.x, pattern = "\\n", "")))
books <- map2(.x = books, .y = chapter_names, ~setNames(.x, .y))
For each book, we extracted the text on its first line, which is the book name, and saved the results into a vector called book_names. The same was done for each chapter. While we were at it, we also removed all newline characters (\n), so that they won’t get in the way later. Now all elements of the nested list are named, so we can be sure what data we are working with.
The last two things to do in this step are to remove the chapter names from the texts themselves and to replace the double newline characters left in the texts with spaces:
# The chapter names are still in the text, let's drop them
books <- map(books, ~map(.x, ~str_replace(.x, pattern = ".+\\n\\n", replacement = "")))
# Sometimes two "new line" symbols appear in the text, we replace them with a space
books <- map(books, ~map(.x, ~str_replace_all(.x, pattern = "\\n\\n", replacement = " ")))
Tokenization and stemming
Now that the text is cleaned and has a reasonable structure, we can move on to tokenization. Tokenization is the process of splitting text into small chunks of information that can be digested by computer algorithms. The most common way of tokenizing text is splitting it into words, but sometimes it can be beneficial to split the text into sentences or word pairs. We will stick with the classic approach and split the text into individual words, using the unnest_tokens() function from the tidytext package. Because the function expects data in the form of a data frame (or the Tidyverse’s flavor called a tibble) instead of a simple vector, we first convert each chapter using as_tibble(). After the text is tokenized, we can also finally merge the data into a single data frame, instead of a nested list, and get a good look at it:
books <- books %>%
  map(~map(., as_tibble)) %>%
  map(~map(., unnest_tokens, input = "value", output = "word")) %>%
  map(bind_rows, .id = "chapters") %>%
  bind_rows(.id = "book")
head(books, 5)
# A tibble: 5 × 3
  book          chapters              word
  <chr>         <chr>                 <chr>
1 The Last Wish THE VOICE OF REASON 1 she
2 The Last Wish THE VOICE OF REASON 1 came
3 The Last Wish THE VOICE OF REASON 1 to
4 The Last Wish THE VOICE OF REASON 1 him
5 The Last Wish THE VOICE OF REASON 1 towards
The data are now in pretty decent shape, but we still have a few things to do. First, we would like to remove all stop words. Stop words are commonly used words, which by themselves tend not to be useful for analysis. Examples of stop words are “is”, “the” or “I’m”. The tidytext package contains a list of stop words in a data frame called stop_words, making it pretty easy to get rid of them. While we are at it, we also transform chapter names from all capital letters into a more natural format:
books <- books %>%
  mutate(word = str_replace_all(word, "’", "'"),
         chapters = str_to_title(chapters)) %>%
  anti_join(stop_words, "word")
head(books, 5)
# A tibble: 5 × 3
  book          chapters              word
  <chr>         <chr>                 <chr>
1 The Last Wish The Voice Of Reason 1 morning
2 The Last Wish The Voice Of Reason 1 entered
3 The Last Wish The Voice Of Reason 1 carefully
4 The Last Wish The Voice Of Reason 1 moving
5 The Last Wish The Voice Of Reason 1 silently
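The anti-join is what turns the first five tokens (she, came, to, him, towards) into nothing and leaves “morning” as the first surviving word. A minimal base R sketch of the same filtering follows. The stop list toy_stop is a hypothetical five-word stand-in for the much longer stop_words table, and the object names are invented.

```r
# Hypothetical mini stop list; tidytext's stop_words table is far longer
toy_stop <- c("she", "came", "to", "him", "towards")

# The first six tokens of the book, as seen in the outputs above
toy_words <- data.frame(
  word = c("she", "came", "to", "him", "towards", "morning"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose word is NOT in the stop list,
# which is what anti_join(stop_words, "word") does to the real data
kept <- toy_words[!toy_words$word %in% toy_stop, , drop = FALSE]
kept
```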
The very last thing to do is stemming. Currently, our data contain multiple versions of the same word, such as “come”, “comes” and “came”. This is generally not useful, and it makes sense to transform all versions of the same word into a single generic form, called a stem. The hunspell package provides a simple way to do so. We just need to be careful to select the correct dictionary, as the version of the Witcher books we are working with is written in British English. Another complication is that some words may map to two different stems, such as the word “morning”, which may come from either the stem “morning” or “morn”. To deal with this, I will (arbitrarily) pick only the first of the possible stems, just to make my life a bit easier. Some words are also not present in the dictionary used for stemming, usually character or geographical names, and in that case we just assign them a missing value:
books <- books %>%
  mutate(stem = hunspell_stem(word, dict = dictionary("en_GB")),
         stem = map_chr(stem, ~if_else(length(.) > 0,
                                       .x[1],
                                       NA_character_)))
head(books, 5)
# A tibble: 5 × 4
  book          chapters              word      stem
  <chr>         <chr>                 <chr>     <chr>
1 The Last Wish The Voice Of Reason 1 morning   morning
2 The Last Wish The Voice Of Reason 1 entered   enter
3 The Last Wish The Voice Of Reason 1 carefully care
4 The Last Wish The Voice Of Reason 1 moving    moving
5 The Last Wish The Voice Of Reason 1 silently  silent
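The “first stem or missing value” rule is easy to check in isolation. The sketch below applies it to a hand-made list that only mimics the shape of a stemmer's output, with one character vector of candidate stems per word. The list is invented for illustration and is not real hunspell output. In particular, the empty entry for “geralt” is an assumption about how a name missing from the dictionary would look.

```r
# Hand-made stand-in for stemmer output: one vector of candidate stems per word
toy_stems <- list(
  morning = c("morning", "morn"),  # two candidates, as described in the text
  entered = "enter",               # a single candidate
  geralt  = character(0)           # assumed: a name missing from the dictionary
)

# Take the first candidate if there is one, otherwise a missing value
first_stem <- vapply(
  toy_stems,
  function(s) if (length(s) > 0) s[1] else NA_character_,
  character(1)
)
first_stem
```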
And there it is! We have a data frame with a text tokenized into words. For each word, we also have its stem, as well as information on which book and chapter it belongs to.
Scraping character names
The text itself is ready for analysis, but there is still more data for us to get. For my future plans, I figured it would be useful to have a list of all character names that appear in the saga, so that we can easily identify whether a word refers to a person (and which person it is). Fortunately, all character names are available on the Witcher Wiki, so it’s a simple task of scraping them all with the rvest package. The link to the page holding character names for the first book, The Last Wish, looks like this:
https://witcher.fandom.com/wiki/Category:The_Last_Wish_characters
The links for the rest of the books have the same format, with just the book name being different. We can therefore easily construct all the relevant links by gluing together the general parts with the book names prepared earlier:
characters_pages <- paste0("https://witcher.fandom.com/wiki/Category:",
                           str_replace_all(book_names, pattern = " ", replacement = "_"),
                           "_characters")
Now that we have the links, we can import the webpages straight into R. We also name all the imported webpages so that we know which of them relates to which book:
character_pages <- map(characters_pages, read_html)
names(character_pages) <- book_names
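The link construction can be sanity-checked without any network access. Here is a minimal base R sketch with two book titles typed in by hand. The vector toy_names is an assumption standing in for book_names, and only the first resulting link is confirmed by the example URL above.

```r
# Two titles typed in by hand (stand-in for the book_names vector)
toy_names <- c("The Last Wish", "Sword of Destiny")

# Replace spaces with underscores and wrap the result in the fixed URL parts
toy_urls <- paste0("https://witcher.fandom.com/wiki/Category:",
                   gsub(" ", "_", toy_names, fixed = TRUE),
                   "_characters")
toy_urls
```

Because paste0() is vectorized, a single call yields one link per book.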
With the websites imported into R, we just need to extract the relevant data, that is, the list of character names. To do this, we need to find out the name (the CSS selector) of the element used to store character names. A simple option is to use SelectorGadget, which lets us learn the name of an element by simply clicking on it in the browser. Alternatively, some browsers allow you to inspect website elements to determine their names manually. Either way, the element we are interested in is .category-page__member-link. With this information, extracting the data is easy:
character_names <- character_pages %>%
  map(html_elements, ".category-page__member-link") %>%
  map(html_text2) %>%
  map(as_tibble) %>%
  bind_rows(.id = "book")
The last thing we do is extract the first names of characters, since that is how they are addressed most frequently, and drop names that appear more than once, which happens because several characters appear across multiple books. We also make a small “manual” correction to the names list and make sure the first name of the King of the Wild Hunt is not just “King”, as that would be impolite towards royalty. Lastly, we drop all strings containing the word “characters”, as these refer to the name of the character list and were extracted by accident:
character_names <- character_names %>%
  mutate(first_name = str_extract(value, "^[:alpha:]+"),
         first_name = if_else(value == "King of the Wild Hunt",
                              true = "King of the Wild Hunt",
                              false = first_name)) %>%
  rename(full_name = value) %>%
  distinct(full_name, .keep_all = TRUE) %>%
  filter(!str_detect(full_name, "characters"))
head(character_names, 5)
# A tibble: 5 × 3
  book          full_name                     first_name
  <chr>         <chr>                         <chr>
1 The Last Wish "Abergard"                    Abergard
2 The Last Wish "Abrad \"Jack-up-the-Skirt\"" Abrad
3 The Last Wish "Adda of Temeria"             Adda
4 The Last Wish "Adda the White"              Adda
5 The Last Wish "Aen Saevherne"               Aen
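For readers who want to poke at the name cleaning without scraping anything, here is a minimal base R sketch of the same three ideas: take the leading run of letters, patch the King of the Wild Hunt by hand, and drop repeated entries. The input vector toy_full is typed in by hand, with one name deliberately repeated, and all object names are invented.

```r
# Hand-typed names; "Adda of Temeria" is repeated on purpose
toy_full <- c("Adda of Temeria", "Adda the White",
              "King of the Wild Hunt", "Adda of Temeria")

# Leading run of letters, the base R counterpart of str_extract(value, "^[:alpha:]+")
toy_first <- regmatches(toy_full, regexpr("^[[:alpha:]]+", toy_full))

# Manual correction so that the King keeps his full title
toy_first[toy_full == "King of the Wild Hunt"] <- "King of the Wild Hunt"

# Drop repeated rows (here equivalent to distinct() on full_name)
toy_chars <- unique(data.frame(full_name = toy_full,
                               first_name = toy_first,
                               stringsAsFactors = FALSE))
toy_chars
```

Note that the two Addas still share the first name “Adda” after cleaning, so a first name alone does not always identify a single character.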
And that is finally the end. With the text in tokenized form and a list of character names, we are ready for a deeper dive into text mining. But that’s a story for the next time.
Bonus: Creating a word cloud
But there is one very, very last thing we could do (this time for real, I swear). Wouldn’t it be cool to have a word cloud of the most common words as a thumbnail for this blog post? Fortunately, creating word clouds is straightforward if we use the ggwordcloud package.
First, we need to count the number of times each word appears in the text. I will use stems instead of tokens where possible, so that all versions of a word are counted together. If a word doesn’t have a stem, we will use the token instead. After some fiddling, I decided that the 100 most frequent words give us a nicely sized cloud. I also thought it would be nice to have the names of persons in a different color. Thankfully, we can easily identify characters thanks to the data we scraped from the Wiki:
library(ggwordcloud)
wordcloud_data <- books %>%
  mutate(stem = if_else(is.na(stem),
                        true = word,
                        false = stem)) %>%
  count(stem) %>%
  slice_max(n, n = 100) %>%
  mutate(person = if_else(stem %in% str_to_lower(character_names$first_name),
                          true = TRUE,
                          false = FALSE))
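The counting logic can be tried on a tiny hand-made table first. The sketch below uses base R only. The four tokens and their stems are invented, with a missing stem for the name to mirror what happens to words the dictionary does not know.

```r
# Invented tokens; the name has no stem, like words missing from the dictionary
toy_tokens <- data.frame(
  word = c("witcher", "witchers", "geralt", "sword"),
  stem = c("witcher", "witcher", NA, "sword"),
  stringsAsFactors = FALSE
)

# Fall back to the token itself when the stem is missing
toy_tokens$stem <- ifelse(is.na(toy_tokens$stem), toy_tokens$word, toy_tokens$stem)

# Count each stem and sort from most to least frequent
toy_counts <- sort(table(toy_tokens$stem), decreasing = TRUE)
toy_counts
```

One detail worth knowing about the real pipeline: slice_max() keeps ties by default, so asking for n = 100 can return slightly more than 100 rows when several stems share the 100th-highest count.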
Now that the data are ready for plotting, we can create the word cloud itself. We also make sure to use the color theme of this blog and highlight the names:
wordcloud_data %>%
  mutate(stem = if_else(person,
                        true = str_to_title(stem),
                        false = stem)) %>%
  ggplot(aes(label = stem, size = n, color = person)) +
  geom_text_wordcloud(seed = 1262) +
  scale_color_manual(values = c("#fbf1c7", "#fe8019")) +
  scale_size_area(max_size = 8) +
  theme_void() +
  theme(panel.background = element_rect(fill = "#282828", color = "#282828"))