The Witcher in Data II: Sentiment Analysis

The second part of the Witcher text analysis series, where we look into a simple sentiment analysis using the tidytext package.

text-analysis
R
Published: August 9, 2022

This is a follow-up to my first post on the Witcher saga. Previously, we turned the books into a tokenized dataset ready for analysis. Today, we look into sentiment analysis using the tidytext package, as well as the trusty tidyverse.

What is sentiment analysis?

Sentiment analysis, as the name suggests, is a family of techniques used to analyze the sentiment of documents. In other words, these techniques allow us to identify whether given words and sentences are connected with positive or negative emotions. For example, the sentence “Geralt smiled mildly.” has positive sentiment, as the word “smile” is generally connected with positive emotions. On the other hand, “Ciri cursed her pursuers.” is associated with negative sentiment, as the word “curse” is generally seen as negative.

There are many ways sentiment analysis can be carried out. The most sophisticated approaches can potentially handle even complex sentence structures like double negatives or sarcasm. Simpler “bag of words” approaches assign sentiment values to each word separately. For this post, we will content ourselves with this simple analysis.

You may be asking, how does the computer know whether a word is associated with positive or negative sentiment? Well, that’s because we tell it. To perform sentiment analysis, we need a lexicon: a big list of words and their sentiment. Lexicons are often created by a large number of people reading through texts and meticulously rating every word in terms of the emotion they associate with it. The ratings of every word are then averaged to get the final sentiment. Of course, not all words can be associated with a clear-cut emotion, and using different lexicons can lead to different results.
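To make the “bag of words” idea concrete, here is a minimal sketch that scores the two example sentences from above. The lexicon is a toy one with two made-up entries, written just for this illustration and not taken from any published lexicon. Tokenization uses `unnest_tokens()` from tidytext, on the assumption that the text is split into one word per row.

```r
library(dplyr)
library(tidytext)

# The two example sentences from the text
sentences <- tibble(
  id   = c(1, 2),
  text = c("Geralt smiled mildly.", "Ciri cursed her pursuers.")
)

# Toy lexicon: hypothetical entries made up for this illustration
toy_lexicon <- tibble(
  word      = c("smiled", "cursed"),
  sentiment = c("positive", "negative")
)

scored <- sentences %>%
  unnest_tokens(word, text) %>%          # one lowercase word per row
  inner_join(toy_lexicon, by = "word")   # keep only words found in the lexicon

scored
```

Words that are missing from the lexicon (here “geralt”, “mildly”, “ciri”, “her” and “pursuers”) are dropped by the join, so they do not contribute to the sentiment at all.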

Tidytext package

The tidytext package provides access to three general-purpose lexicons:

  • AFINN - categorizes words on a scale from -5 (negative) to 5 (positive).

  • nrc - categorizes words into ten different categories: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

  • bing - categorizes words into positive and negative categories.
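As a rough sketch, the lexicons can be loaded with `get_sentiments()`. The bing lexicon ships with tidytext itself. The AFINN and nrc calls are left commented out because they are fetched through the textdata package, which asks you to accept each lexicon’s license the first time it is used.

```r
library(dplyr)
library(tidytext)

# bing is bundled with tidytext, so this works without a download
bing <- get_sentiments("bing")

# How many words fall into each category
count(bing, sentiment)

# AFINN and nrc are downloaded via the textdata package on first use:
# afinn <- get_sentiments("afinn")  # columns: word, value
# nrc   <- get_sentiments("nrc")    # columns: word, sentiment
```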

We will use the bing lexicon for simplicity. The lexicon is available as a data frame, which can be easily joined with our original tokenized text. After that, it’s a simple matter of counting positive and negative words per book and chapter. We will measure the overall chapter sentiment by the proportion of positive words among all the words in the chapter that appear in the lexicon. The final dataset looks like this, with the sentiment (i.e. the proportion of positive words) being stored in the value variable:

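library(tidyverse)
library(tidytext)

witcher_bing <- witcher %>% 
  # keep only words found in the bing lexicon and attach their sentiment
  inner_join(get_sentiments("bing"), by = "word") %>% 
  # count positive and negative words per chapter
  count(book, chapters, sentiment) %>%
  group_by(book, chapters) %>% 
  # share of each sentiment among the chapter's matched words
  mutate(value = n / sum(n)) %>% 
  filter(sentiment == "positive") %>% 
  group_by(book) %>% 
  mutate(chapter_index = row_number(),
         book_chapter  = paste(book, chapters, sep = "-"),
         book_chapter  = as_factor(book_chapter)) %>% 
  ungroup() %>% 
  select(-c(sentiment, n))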

head(witcher_bing, 5)
# A tibble: 5 × 5
  book          chapters              value chapter_index book_chapter          
  <fct>         <fct>                 <dbl>         <int> <fct>                 
1 The Last Wish The Voice Of Reason 1 0.304             1 The Last Wish-The Voi…
2 The Last Wish The Voice Of Reason 2 0.275             2 The Last Wish-The Voi…
3 The Last Wish The Voice Of Reason 3 0.285             3 The Last Wish-The Voi…
4 The Last Wish The Voice Of Reason 4 0.390             4 The Last Wish-The Voi…
5 The Last Wish The Voice Of Reason 5 0.293             5 The Last Wish-The Voi…

And that’s really all there is to it. For a simple case like this, “sentiment analysis” really amounts to merging the text with a lexicon and plotting the results.
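To see what the value column measures, the same counting steps can be run on a tiny example. The data below is made up for illustration: two hypothetical chapters, “A” and “B”, with a handful of words that have already been matched against a lexicon.

```r
library(dplyr)

# Hypothetical matched words for two chapters (made-up data)
toy <- tibble(
  chapters  = c("A", "A", "A", "A", "B", "B"),
  sentiment = c("positive", "negative", "negative", "negative",
                "positive", "negative")
)

toy_value <- toy %>%
  count(chapters, sentiment) %>%
  group_by(chapters) %>%
  mutate(value = n / sum(n)) %>%   # share of each sentiment within the chapter
  ungroup() %>%
  filter(sentiment == "positive")

toy_value
# Chapter A: 1 positive out of 4 matched words, value = 0.25
# Chapter B: 1 positive out of 2 matched words, value = 0.50
```

Note that the denominator only includes words that appear in the lexicon, so a chapter’s length by itself does not affect the value.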

Witcher’s sentiment

The plot below shows the sentiment of chapters per book, with higher values on the y axis indicating a larger share of positive words. As we can see, words with positive sentiment tend to be in the minority, fitting for a dark fantasy series. Overall, the story of the Witcher is pretty bleak.
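The plotting code is not shown in this post, so the following is only a sketch of how such a chart could be drawn with ggplot2. It uses a made-up stand-in, `toy_bing`, with hypothetical book names and values. With the real data, `witcher_bing` would take its place, assuming the columns book, chapter_index and value as built earlier.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for witcher_bing with the same columns used in the plot
toy_bing <- tibble(
  book          = rep(c("Book 1", "Book 2"), each = 3),
  chapter_index = rep(1:3, times = 2),
  value         = c(0.30, 0.28, 0.39, 0.25, 0.33, 0.27)
)

# One panel per book, chapters in reading order on the x axis
p <- ggplot(toy_bing, aes(x = chapter_index, y = value)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ book) +
  labs(x = "Chapter", y = "Proportion of positive words")

p
```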

There are a few spikes that may be of interest. The short story The Voice of Reason has notably more positive sentiment compared to the rest of the chapters in the first book. This is a bit surprising, considering the chapter is basically a short dispute between Geralt and a pair of knights at the temple of Melitele. The chapter is dripping with (false) pleasantries and faux politeness, so it ends up mostly positive, at least at face value. The second big spike in positive sentiment is towards the middle of Time of Contempt, where Geralt and Yennefer attend a banquet on the wizard island of Thanedd. This is mostly a funny chapter, with jokes and Geralt making fun of most of the mages. However, the fun is short-lived, as a fight breaks out immediately in the next chapter. The situation culminates in Ciri teleporting to the middle of a desert to avoid her pursuers. The chapter in which Ciri walks the desert is the low point of the book, with a lot of grueling challenges she needs to overcome. The ups and downs continue until we reach the most positive part of the saga (at least according to the sentiment coding), when Geralt meets Coral in Season of Storms and starts a short-lived romance with her.

And that’s that. Overall, a simple sentiment analysis won’t replace a real human reading through the text. However, when the amount of text is big or time is short, sentiment analysis can be a useful shortcut for text analysis.