Gender Differences in Course Evaluations in Humanities

In this post, we will explore gender differences in teacher evaluations at the Faculty of Arts, Charles University in Prague.

Categories: modeling, R

Published: April 9, 2023

Gender Differences in Lecturer Ratings

In academia, there has been a long-standing discussion on gender biases in lecturer ratings done by students. Namely, female lecturers tend to be rated more harshly than their male counterparts, although the strength of the bias and its practical significance have been somewhat debated (Özgümüs et al. 2020; Centra and Gaubatz 2000; Boring 2017). However, to my knowledge, there has been little to no research on this topic in the Czech Republic, possibly due to the lack of available data.

One possible source of at least some data is student ratings at the Faculty of Arts, Charles University. The faculty offers study programs in the social sciences, like sociology or psychology, as well as in the humanities, like philosophy or English studies. The faculty also publishes aggregated lecturer ratings at the end of each semester. This provides us with a unique opportunity to gain some insight into possible gender bias in lecturer ratings among humanities students.

Getting the Data

While the lecturer evaluation results are freely available, the data are not easily downloadable in a computer-readable format. Instead, they are hidden inside a web app, which means we’ll have to do some scraping. Probably the best package for scraping interactive websites is RSelenium. I won’t be going into the gritty details, but if you are interested, the relevant scripts can be found in the GitHub repository; a minimal sketch of the general approach is shown below.
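RSelenium works by driving an actual browser session, waiting for the page to render, and then handing the HTML over for parsing. Below is a minimal sketch of that pattern - the URL and the CSS selector are placeholders for illustration, not the ones used in the actual scripts.

Code
library(RSelenium)
library(rvest)

# Start a Selenium server and a browser client
driver <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remote <- driver$client

# Navigate to the evaluation app (placeholder URL) and let it render
remote$navigate("https://example.cuni.cz/evaluations")
Sys.sleep(2)

# Grab the rendered HTML and parse it with rvest as usual
page <- remote$getPageSource()[[1]]
ratings <- read_html(page) |> 
  html_elements(".course-rating") |>   # placeholder selector
  html_text2()

# Clean up after ourselves
remote$close()
driver$server$stop()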

I have scraped data for the last three semesters - from summer 2021/2022 to winter 2022/2023. In theory, there are data from two more years to be scraped, but the questionnaire format is very different (and frankly, I didn’t want to waste time on that). The scraped data contain the names of the lecturer and the course, as well as aggregated answers to both close-ended and open-ended questions. What the data don’t contain is the lecturer’s gender. Instead, I have estimated the gender based on the lecturer’s first name, with help from genderize.io.
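For the curious, genderize.io is a simple REST API: you send a first name and get back the most likely gender together with a probability. The snippet below is just a sketch of the idea - the helper function is mine, not from the actual scripts, and in practice you’d want to deduplicate names and mind the free tier’s rate limits.

Code
library(jsonlite)

# A single lookup: returns the most likely gender and its probability
fromJSON("https://api.genderize.io/?name=petr")

# A small vectorized helper for a column of first names
guess_gender <- function(first_names) {
  vapply(first_names, function(name) {
    res <- fromJSON(paste0("https://api.genderize.io/?name=", URLencode(name)))
    if (is.null(res$gender)) NA_character_ else res$gender
  }, character(1))
}

guess_gender(c("Petr", "Jana"))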

Exploring the Data

With the data ready to go, we can start exploring it. To help with this, we’ll use several packages for visualization and (Bayesian) modeling. To save space, all code in this article is folded - if you want to check it, just click on the small “Code” button.

Code
library(tidyverse)
library(brms)
library(marginaleffects)
library(tidybayes)
library(avom)
library(patchwork)
library(scales)

# Scraped course evaluations
courses <- read_rds("data/course_evals.rds")

# Shared plot theme and geom defaults
theme_set(theme_avom(text = element_text(family = "Fira Sans"),
                     axis.title = element_blank()))

primary_color <- "#83a598"

update_geom_defaults("point", list(color = primary_color))
update_geom_defaults("col", list(fill = primary_color))
update_geom_defaults("line", list(color = primary_color))
update_geom_defaults("smooth", list(color = primary_color))
update_geom_defaults("text", list(family = "Fira Sans"))

Unfortunately, not all courses have data available. If a course had too low a response rate, its results were not published. This means that out of the 9 559 course records, we have data for only 5 934. Since courses with missing data are not useful to us, we’ll drop them.

Code
courses <- filter(courses, !is.na(lecturer_rating))

The gender composition at the faculty is fairly balanced, with men being slightly more common. Most lecturers hold a PhD, with circa 22% having only a Master’s (Mgr.) or lower title - presumably, these are mostly PhD students. Another 25% of lecturers hold the title of “docent” (doc.) or “profesor” (prof.) - titles unique to continental Europe, more prestigious than a “mere” doctorate.

Code
courses |> 
  # One row per unique lecturer-semester combination
  select(lecturer, title, gender, semester) |> 
  distinct() |> 
  select(title, gender, semester) |> 
  # Share of each category within each of the three variables
  pivot_longer(cols = everything()) |> 
  count(name, value) |> 
  mutate(value = if_else(is.na(value),
                         true = "Unknown",
                         false = value),
         value = as_factor(value),
         value = fct_relevel(value, "Unknown", after = Inf),
         value = fct_relabel(value, str_to_title),
         name = fct_relabel(name, str_to_title)) |> 
  mutate(freq = n / sum(n),
         freq_label = percent(freq, accuracy = 1),
         .by = name) |> 
  ggplot(aes(x = value,
             y = freq,
             label = freq_label)) +
  facet_wrap(~name, ncol = 2,
             scales = "free_x") +
  geom_col() +
  geom_text(vjust = -1) +
  scale_x_discrete(labels = ~str_wrap(., 10)) +
  scale_y_continuous(labels = percent_format(accuracy = 1),
                     limits = c(NA, 0.62)) +
  theme(panel.grid.major.x = element_blank())

The primary variable of interest is the lecturer rating, on a scale from 0 (worst) to 100 (best). The score is actually a rescaled average of responses to a five-point Likert item with the following wording: “Please rate the teacher’s pedagogical performance”. For modeling purposes, we’ll transform the variable into the 0 (worst) to 1 (best) range.

Code
courses$lecturer_rating <- courses$lecturer_rating / 100

The scores are highly skewed and it’s very rare for a lecturer to get a score below 0.5 (talk about grade inflation, am I right?). At first glance, there also doesn’t seem to be much of a difference between men and women, or between people with different academic titles. The average rating for women is 0.913 points, while men score 0.906 points on average. Similarly small differences can be found across titles. While lecturers with a PhD enjoy an average of 0.912, lecturers with the “docent” and “profesor” titles score 0.896 on average. There is a very small number of lecturers who earned a score of zero (16, to be precise). After some thinking, I have decided to simply drop them from further analysis, as I didn’t want to spend model parameters on such a small number of observations.
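Summaries like the ones quoted above are easy to compute; a quick sketch along these lines (using the columns of the scraped dataset):

Code
# Mean rating by gender and by academic title
summarise(courses, mean_rating = mean(lecturer_rating), .by = gender)
summarise(courses, mean_rating = mean(lecturer_rating), .by = title)

# Number of courses where the lecturer scored exactly zero
sum(courses$lecturer_rating == 0)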

Code
# Response rate per course; rates above 100% are a known bug (see below)
courses <- courses |> 
  mutate(response_rate = total_resp / enrolled) |> 
  filter(response_rate <= 1)

ratings_gender <- courses |> 
  rename(Gender = gender) |> 
  filter(!is.na(Gender)) |> 
  ggplot(aes(x = lecturer_rating,
             fill = Gender)) +
  stat_density(alpha = 0.5,
               color = NA,
               bounds = c(0, 1)) +
  scale_fill_avom(labels = str_to_title) +
  labs(title = "Teacher Rating (Higher is Better)") +
  theme(legend.position = "bottom")

ratings_title <- courses |> 
  rename(Title = title) |> 
  filter(!is.na(Title)) |> 
  ggplot(aes(x = lecturer_rating,
             fill = Title)) +
  stat_density(alpha = 0.5,
               color = NA,
               bounds = c(0, 1)) +
  scale_fill_avom(labels = str_to_title) +
  theme(legend.position = "bottom")

responses_gender <- courses |> 
  rename(Gender = gender) |> 
  filter(!is.na(Gender) & response_rate <= 1) |> 
  ggplot(aes(x = response_rate,
             fill = Gender)) +
  stat_density(alpha = 0.5,
               color = NA,
               bounds = c(0, 1)) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_fill_avom(labels = str_to_title) +
  labs(title = "Response Rate") +
  theme(legend.position = "bottom")

responses_title <- courses |> 
  rename(Title = title) |> 
  filter(!is.na(Title) & response_rate <= 1) |> 
  ggplot(aes(x = response_rate,
             fill = Title)) +
  stat_density(alpha = 0.5,
               color = NA,
               bounds = c(0, 1)) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_fill_avom(labels = str_to_title) +
  theme(legend.position = "bottom")


(ratings_gender + ratings_title) / (responses_gender + responses_title)

There is also the question of response rates. Not all students fill out the questionnaire - in fact, the average response rate is only 29% for both men and women. Also, there are nine courses with a response rate over 100%. I happen to know that the university app has had problems computing response counts before, so this is most likely a bug. We simply drop the problematic cases (that is what the response_rate <= 1 filter in the code above does).

Global Difference

We’ll start with a model comparing lecturer ratings between genders across all departments. To do this, we are going to use one-inflated beta regression, which is typically used to model proportions (this is why we transformed the lecturer rating onto the 0-1 scale). The brms package doesn’t provide a one-inflated beta distribution specifically, but the model can still be fitted with a bit of a hack: we use the zero-one-inflated beta family and fix the conditional one-inflation parameter (coi) at 1, so that all of the inflation mass is placed on ones. Apart from the main effect of gender, the model will also control for department (using random intercepts with random slopes for gender), and the observations will be weighted by response rate, to make sure courses with very low response rates don’t have undue influence on the results.

Code
m1 <- courses |> 
  filter(lecturer_rating > 0 & response_rate <= 1) |> 
  brm(bf(lecturer_rating | weights(response_rate) ~ gender + (1 + gender|department),
         coi = 1),  # coi fixed at 1: all inflation mass goes to ones
      data = _,
      family = zero_one_inflated_beta(),
      cores = 4,
      threads = threading(2),
      backend = "cmdstanr",
      seed = 1234,
      file = "model/eval-model-department",
      file_refit = "on_change",
      refresh = 0,
      silent  = 2)

After weighting by response rate and controlling for department, the estimated mean difference between male and female lecturers is extremely small - based on the model, it’s most likely somewhere between -0.006 and 0.009 points. Remember that we are working on a scale from 0 to 1, so there is really no evidence that either gender is more popular among the students than the other. That’s nice to hear!
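(The interval quoted here is the 95% credible interval reported by the marginaleffects package; a call along these lines prints it directly.)

Code
# Posterior summary of the average male - female contrast
avg_comparisons(m1,
                variables = "gender",
                re_formula = NULL)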

The plot below shows the so-called “posterior distribution” of the estimated rating difference between male and female lecturers. It is the distribution of plausible results, based on the combination of our model and our data. We can see that the peak of the distribution is close to zero, indicating that zero is the most plausible value.

Code
comparisons(m1, 
  variables = "gender", 
  re_formula = NULL) |> 
  posterior_draws() |> 
  ggplot(aes(x = draw)) +
  stat_halfeye(fill = primary_color) +
  geom_vline(xintercept = 0,
             linetype = "dashed",
             color = "black") +
  labs(x = "Expected Mean Male - Female Difference") +
  annotate(geom = "text",
           x = -0.05,
           y = 0.85,
           label = str_wrap("There is 95% probability that the expected difference between male and female lecturers is between -0.006 and 0.009 points (On scale from 0 to 1).",
                            width = 40),
           hjust = 0) +
  scale_x_continuous(limits = c(-0.05, 0.05)) +
  theme(axis.title.x = element_text())

Department Level Differences

We have seen that there is no difference between men and women at the global level, but what about individual departments? We can use the same model to estimate that. It turns out that the differences across departments are not much bigger than the global difference. A few departments lean slightly in favor of men, but the differences are extremely small.

Code
comparisons(m1,
            newdata = datagrid(gender = c("male", "female"),
                               department = unique(courses$department))) |> 
  mutate(department = fct_reorder(department, estimate)) |> 
  ggplot(aes(x = estimate,
             xmin = conf.low,
             xmax = conf.high,
             y = department)) +
  geom_pointrange() +
  geom_vline(xintercept = 0, linetype = "dashed")

So not only are there no differences at the global level, there are also no departments with suspicious gender imbalances.

Title Level Differences

Finally, we are going to check whether there are gender differences across lecturers with varying academic titles. For this, we’ll create a new model similar to the previous one - one-inflated beta regression with a random effect for department - but with title thrown into the mix.

Code
m2 <- courses |> 
  # Pool the rarer Mgr., Ing. and Bc. titles into a single level
  mutate(title = fct_collapse(title,
                              `Mgr./Ing./Bc.` = c("Mgr.", "Ing.", "Bc.")),
         title = droplevels(title)) |> 
  filter(lecturer_rating > 0 & response_rate <= 1) |> 
  brm(bf(lecturer_rating | weights(response_rate) ~ title * gender + (1|department),
         coi = 1),
      data = _,
      family = zero_one_inflated_beta(),
      cores = 4,
      threads = threading(2),
      backend = "cmdstanr",
      seed = 1234,
      file = "model/eval-model-title",
      file_refit = "on_change",
      refresh = 0,
      silent  = 2)

The results are shown in the plot below. Again, there is virtually no difference between the average ratings of male and female lecturers. The estimates for professors and docents are somewhat less precise than those for PhD holders (notice that their distributions of plausible differences are more stretched out), but in the end, it hardly matters. The estimated differences are negligible.

Code
avg_comparisons(m2,
                newdata = datagrid(gender = c("male", "female"),
                                   title = unique,
                                   department = unique), 
                variables = "gender",
                by = "title",
                re_formula = NULL) |> 
  posterior_draws() |> 
  mutate(title_label = paste0('For group "',
                              title ,
                              '", there is a 95% probability that the gender difference is between ',
                              round(conf.low, 3),
                              " and ",
                              round(conf.high, 3),
                              "."),
         title_label = str_wrap(title_label, 55)) |> 
  ggplot(aes(x = draw,
             y = title)) +
  stat_halfeye(fill = primary_color) +
  geom_vline(xintercept = 0,
             linetype = "dashed") +
  geom_text(aes(x = 0.05,
                label = title_label),
            hjust = 0,
            nudge_y = 0.2) +
  scale_x_continuous(limits = c(NA, 0.2)) +
  labs(x = "Expected Mean Male - Female Difference") +
  theme(axis.title.x = element_text())

Wrapping Up

No matter how we slice it, there is virtually no difference in course ratings based on lecturer gender. This was a bit surprising to me, considering all the papers showing the opposite. On the other hand, with humanities students being more socially liberal than students in the technical or natural sciences, this is perhaps to be expected. Still, points to the Faculty of Arts for gender equality.

Of course, there are some limitations. The data we have used come from a single faculty at a single university. It’s quite possible that with a richer sample, the results would be quite different. Unfortunately, to my knowledge, there are no other faculties that make their lecturer ratings publicly available.

The second problem is that for a large number of courses, no data were available due to low response rates. This could be an issue if the gender differences in these courses were more pronounced, but I have a hard time thinking of a specific mechanism that would lead to this.

Lastly, only aggregated data are available. This makes sense as a way to protect students’ privacy, but it does complicate the analysis. Hopefully, someone at the faculty has done a similar study on the original dataset.

References

Boring, Anne. 2017. “Gender Biases in Student Evaluations of Teaching.” Journal of Public Economics 145 (January): 27–41. https://doi.org/10.1016/j.jpubeco.2016.11.006.
Centra, John A., and Noreen B. Gaubatz. 2000. “Is There Gender Bias in Student Evaluations of Teaching?” The Journal of Higher Education 71 (1): 17–33. https://doi.org/10.1080/00221546.2000.11780814.
Özgümüs, Asri, Holger A. Rau, Stefan T. Trautmann, and Christian König-Kersting. 2020. “Gender Bias in the Evaluation of Teaching Materials.” Frontiers in Psychology 11 (May). https://doi.org/10.3389/fpsyg.2020.01074.