Yogurt study data analysis

This document presents an example of analysis of the Group 1 dataset.

Study description

This is a randomized controlled trial to study whether yogurt consumption has an effect on the microbiome post antibiotic treatment. Absolute abundance of bacteria was measured by 3 qPCR assays (for total, L. crispatus, L. iners). Cytokine levels (in copies/ml of vaginal fluid) were also measured by Luminex. Data was collected at two timepoints, pre and post antibiotic treatment.

Checking if arms are balanced in terms of demographic variables

We first check if the randomization of participant didn’t lead to any unfortunate differences in terms of demographic/clinical variables between the two groups.

To do so, we load the demographic data table (`01_participant_metadata_yogurt.csv`), display the distribution of these variables per arm and create “Table 1”.

Demographic data table

demographic_data <- 
  read_csv("data/01_participant_metadata_yogurt.csv")

Rows: 51 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): pid, arm, birth_control, education, sex
dbl (2): days_since_last_sex, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

First 6 rows of the table:

demographic_data |> head() |> pander()

Table continues below
pid	arm	days_since_last_sex	birth_control
pid_01	unchanged_diet	2	no hormonal birth control
pid_02	unchanged_diet	4	Depoprovera
pid_03	unchanged_diet	3	no hormonal birth control
pid_04	yogurt	1	Depoprovera
pid_05	yogurt	5	Depoprovera
pid_06	unchanged_diet	9	no hormonal birth control

age	education	sex
28	grade 10-12, not matriculated	female
34	less than grade 9	female
32	grade 10-12, matriculated	female
30	grade 10-12, not matriculated	female
34	less than grade 9	female
27	grade 10-12, matriculated	female

Number of participants per arm:

demographic_data |> count(arm) |> pander()

arm	n
unchanged_diet	25
yogurt	26

Visualization of the demographic variables of interest by study arm

There are 4 variables that we are interested in comparing: age, days_since_last_sex, birth_control, and education.

arm_colors <-
  c("unchanged_diet" = "darkseagreen3", yogurt = "slateblue")

g_age <- 
ggplot(demographic_data,
       aes(x = age, fill = arm)) +
  geom_histogram(binwidth = 1) + 
  facet_grid(arm ~ .) + 
  guides(fill = "none") +
  scale_fill_manual(values = arm_colors)

g_birth_control <- 
  ggplot(demographic_data,
       aes(x = birth_control, fill = arm)) +
  geom_bar()  + 
  facet_grid(arm ~ .) + 
  guides(fill = "none") +
  scale_fill_manual(values = arm_colors) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

g_days_since_last_sex <- 
ggplot(demographic_data,
       aes(x = days_since_last_sex, fill = arm)) +
  geom_histogram(binwidth = 1)  + 
  facet_grid(arm ~ .) + 
  guides(fill = "none") +
  scale_fill_manual(values = arm_colors)

g_education <- 
  ggplot(demographic_data,
       aes(x = education, fill = arm)) +
  geom_bar() +
  facet_grid(arm ~ .) + 
  guides(fill = "none") +
  scale_fill_manual(values = arm_colors) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

g_age + g_days_since_last_sex + g_birth_control + g_education

Table 1

We can now build “Table 1” using the tableone package and its function CreateTableOne.

table_one <- 
  CreateTableOne(
    data = demographic_data,
    strata = "arm",
    vars = c("age", "days_since_last_sex","birth_control", "education")
  ) 

table_one |> print(printToggle = FALSE) |> as.data.frame() |> select(-test) |> pander()

	unchanged_diet	yogurt	p
n	25	26
age (mean (SD))	30.44 (3.33)	31.00 (3.29)	0.548
days_since_last_sex (mean (SD))	7.44 (6.70)	9.50 (11.45)	0.439
birth_control = no hormonal birth control (%)	17 (68.0)	9 (34.6)	0.035
education (%)			0.799
grade 10-12, matriculated	6 (24.0)	4 (15.4)
grade 10-12, not matriculated	9 (36.0)	12 (46.2)
less than grade 9	4 (16.0)	5 (19.2)
post-secondary	6 (24.0)	5 (19.2)

Note: an alternative way, is to use the gtsummary package

library(gtsummary) # type install.packages("gtsummary") in the console if you get an error

#BlackLivesMatter

demographic_data |> 
  select(-pid, -sex) |> 
  tbl_summary(by = arm) |> 
  add_p()

Warning for variable 'days_since_last_sex':
simpleWarning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot compute exact p-value with ties

Warning for variable 'age':
simpleWarning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot compute exact p-value with ties

Characteristic	unchanged_diet, N = 25¹	yogurt, N = 26¹	p-value²
days_since_last_sex	6 (3, 10)	5 (3, 11)	0.8
birth_control			0.017
Depoprovera	8 (32%)	17 (65%)
no hormonal birth control	17 (68%)	9 (35%)
age	30.0 (28.0, 33.0)	31.0 (29.0, 32.8)	0.6
education			0.8
grade 10-12, matriculated	6 (24%)	4 (15%)
grade 10-12, not matriculated	9 (36%)	12 (46%)
less than grade 9	4 (16%)	5 (19%)
post-secondary	6 (24%)	5 (19%)
¹ Median (IQR); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test; Fisher’s exact test

The p-values are slightly different because, by default, the two packages use different tests. We note that in both cases, birth_control is flagged as unbalanced between the two groups. We may thus evaluate if this unbalance might explain any of the differences in the outcomes.

Research aim: Evaluating the influence of eating yogurt on the vaginal microbiota composition

Data exploration

The microbiota composition data is in 02_qpcr_results_yogurt.csv file.

Let’s open it to see how the microbiota composition has been characterized:

qpcr <- 
  read_csv(
    "data/02_qpcr_results_yogurt.csv"
  )

Rows: 102 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): sample_id
dbl (3): qpcr_bacteria, qpcr_crispatus, qpcr_iners

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

qpcr |> head() |> pander()

sample_id	qpcr_bacteria	qpcr_crispatus	qpcr_iners
UNCH092	3093064	0	26.84
UNCH042	8275	0	76.04
UNCH015	7532202	0	555
UNCH035	4794320	12.99	204309
UNCH019	5328344	0	107806
UNCH040	5804	0	121045

We see that we have qPCR data for total bacterial load, Lactobacillus crispatus, and Lactobacillus iners.

Let’s display the general distribution of these three quantities. To be able to display the data on the same graph, we can pivot_longer:

qpcr_long <- 
  qpcr |> 
  pivot_longer(
    cols = -sample_id,
    names_to = "variable",
    values_to = "qpcr_value" ,
    names_pattern = "qpcr_(.*)"
  )

ggplot(qpcr_long, aes(x = qpcr_value)) +
  geom_histogram(bins = 30) +
  facet_grid(variable ~ .)

We see that we may have to transform our data. We can try to see what a log transformation would do:

ggplot(qpcr_long, aes(x = qpcr_value)) +
  geom_histogram(bins = 30) +
  facet_grid(variable ~ .) +
  scale_x_log10()

Warning in scale_x_log10(): log-10 transformation introduced infinite values.

Warning: Removed 74 rows containing non-finite outside the scale range
(`stat_bin()`).

Notice that we got a warning that the log transformation introduced infinite values. That’s because some values in L. crispatus and L. iners are 0 and log(0) = -\(\infty\).

These 0s might be a problem for the downstream analyses. There are different ways to deal with these 0s. Here, we’ll use a simple little trick: we replace the 0s by a value that is half of the smaller value otherwise observed.

That value is:

smallest_non_zero <- qpcr_long$qpcr_value |> unique() |> sort() |> magrittr::extract(2)
smallest_non_zero

[1] 0.994

qpcr_long <- 
  qpcr_long |> 
  mutate(
    qpcr_value_b = 
      case_when(
        qpcr_value == 0 ~ smallest_non_zero/2,
        TRUE ~ qpcr_value
      ),
    was_zero = (qpcr_value == 0)
  )

We can now redo the histogram with this modified variable:

ggplot(qpcr_long, aes(x = qpcr_value_b, fill = was_zero)) +
  geom_histogram(bins = 30) +
  facet_grid(variable ~ ., scales = "free_y") +
  scale_x_log10() +
  scale_fill_manual("was zero", values = c("gray40", "gray60"), labels = c("no","yes"))

This gives us a better idea of the overall distributions of the variables describing the microbiota.

Research question

We can now define research questions associated with our research aim.

For example, we can ask the three following questions:

Does eating yogurt affect the total bacterial load?
Does eating yogurt affect the L. crispatus amounts?
Does eating yogurt affect the L. iners amounts?

For each of them, our Null hypothesis is that there are no differences between the arms and the Alternative is that there is a difference.

Outcome variables at baseline

Before answering these questions, we may want to verify that the distributions of these three variables are similar at baseline.

To be able to check that we need to join by the sample information table (00_sample_ids_yogurt.csv) so we have the visit info

sample_info <- 
  read_csv("data/00_sample_ids_yogurt.csv")

Rows: 102 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): pid, time_point, arm, sample_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sample_info |> head() |> pander()

pid	time_point	arm	sample_id
pid_25	baseline	unchanged_diet	UNCH001
pid_51	baseline	yogurt	YOG002
pid_33	baseline	yogurt	YOG003
pid_13	baseline	yogurt	YOG004
pid_37	after_antibiotic	unchanged_diet	UNCH005
pid_09	after_antibiotic	unchanged_diet	UNCH006

We transform the categorical variables into factors, and, when relevant, we define the order of the categories (it is relevant for time_point for example).

sample_info <- 
  sample_info |> 
  mutate(time_point = time_point |> factor(levels = c("baseline", "after_antibiotic")))

qpcr_long_si <-
  qpcr_long |> 
  left_join(sample_info)

Joining with `by = join_by(sample_id)`

qpcr_long_si |> 
  filter(time_point == "baseline") |> 
  ggplot(aes(x = arm, y = qpcr_value_b |> log10(), fill = arm)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.75) + 
  # setting "outlier.shape" to NA does not show the outliers. 
  # we don't need to show them because with the geom_jitter below, we'll still see them.
  geom_jitter(width = 0.2, height = 0, size = 0.75) +
  facet_grid(. ~ variable) +
  scale_fill_manual(values = arm_colors) +
  ggtitle("Distribution of qPCR values at BASELINE by arm") +
  guides(fill = "none")

It looks like almost no one had L. crispatus before intervention but that there might be a small difference in the L. iners amounts.

Let’s see if that difference is statistically significant. We use a non-parametric test to do that.

qpcr_long_si |> 
  filter(time_point == "baseline") |> 
  group_by(variable) |> 
  summarize(
    wtest = wilcox.test(log10(qpcr_value_b) ~ arm)$p.value
  )

Warning: There were 3 warnings in `summarize()`.
The first warning was:
ℹ In argument: `wtest = wilcox.test(log10(qpcr_value_b) ~ arm)$p.value`.
ℹ In group 1: `variable = "bacteria"`.
Caused by warning in `wilcox.test.default()`:
! cannot compute exact p-value with ties
ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.

# A tibble: 3 × 2
  variable  wtest
  <chr>     <dbl>
1 bacteria  0.486
2 crispatus 0.978
3 iners     0.119

None of these differences are statistically significant: we do not reject the Null hypothesis that the microbiota composition (i.e., our outcomes of interest) at baseline is similar between the two groups.

(This is an equivalent of Table 1 but for the outcome of interest at baseline).

Results

We can now answer our questions.

Let’s first display our data to highlight the answers to our results.

We are interested in these changes at the "after_antibiotic" visit since that is after participants had both took the antibiotic and eaten the yogurt.

qpcr_long_si |> 
  filter(time_point == "after_antibiotic") |> 
  ggplot(aes(x = arm, y = qpcr_value_b |> log10(), fill = arm)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.75) + 
  geom_jitter(width = 0.2, height = 0, size = 0.75) +
  facet_grid(. ~ variable) +
  scale_fill_manual(values = arm_colors) +
  guides(fill = "none") +
  ggtitle("Distribution of qPCR values AFTER INTERVENTION by arm")

Visually, it looks like the yogurt might have had an effect on L. crispatus.

!!! NOTE !!! If we had ignored the zeros and done the same visualization using the original data, we would have seen something quite different:

qpcr_long_si |> 
  filter(time_point == "after_antibiotic") |> 
  ggplot(aes(x = arm, y = qpcr_value |> log10(), fill = arm)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.75) +
  # geom_jitter(width = 0.2, height = 0, size = 0.75) + 
  # (you'll see that geom_jitter will still display the -infinite values at the very bottom of the plot) 
  # but geom_boxplot is not able to show include them to the distribution.
  facet_grid(. ~ variable) +
  scale_fill_manual(values = arm_colors) +
  guides(fill = "none") +
  ggtitle("Distribution of qPCR values AFTER INTERVENTION by arm\nzeros EXCLUDED from the plot")

Warning: Removed 17 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Let’s now compute the median values in both arms and test our hypotheses:

medians <- 
qpcr_long_si |> 
  filter(time_point == "after_antibiotic") |> 
  group_by(arm, variable) |> 
  summarize(median_log = median(log10(qpcr_value_b)), .groups = "drop") |> 
  pivot_wider(
    id_cols = variable, names_from = arm, names_prefix = "median_log_",
    values_from = median_log) 

test_res <- 
qpcr_long_si |> 
  filter(time_point == "after_antibiotic") |> 
  group_by(variable) |> 
  summarize(
    wtest_p = wilcox.test(log10(qpcr_value) ~ arm)$p.value
  ) |> 
  mutate(qval = p.adjust(wtest_p, method = "BH"))

Warning: There were 3 warnings in `summarize()`.
The first warning was:
ℹ In argument: `wtest_p = wilcox.test(log10(qpcr_value) ~ arm)$p.value`.
ℹ In group 1: `variable = "bacteria"`.
Caused by warning in `wilcox.test.default()`:
! cannot compute exact p-value with ties
ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.

medians |> left_join(test_res) |> pander()

Joining with `by = join_by(variable)`

variable	median_log_unchanged_diet	median_log_yogurt	wtest_p	qval
bacteria	5.654	5.163	0.07192	0.1136
crispatus	-0.3036	2.328	0.07571	0.1136
iners	5.472	5.502	0.8876	0.8876

We see that, although there is a large difference between the two medians in the two arms for L. crispatus, none of these p-values are smaller than 1/20. So we do not reject any of the null hypotheses.

But because of this large difference in medians for L. crispatus, it might be interesting to examine the before/after L. crispatus values in a high resolution (i.e., visualize them for each participants).

qpcr_long_si |> 
  filter(variable == "crispatus") |> 
  arrange(time_point |> fct_rev(), -qpcr_value) |> 
  # the fct_rev() function allows to sort factor variables in reverse order than the order of the levels
  # this is because I want to order participant IDs by the qpcr values at the post-intervention visit
  mutate(pid = pid |> factor(levels = unique(pid))) |> 
  ggplot(
    aes(x = time_point, y = pid, fill = log10(qpcr_value_b))
  ) +
  geom_tile() + 
  geom_text(aes(label = qpcr_value |> round()), size = 3) + 
  # geom text allows to print text in the viz
  facet_wrap(arm ~ ., scales = "free") +
  scale_fill_gradient(low = "white", high = "steelblue2") +
  guides(fill = "none") +
  labs(title = "L. crispatus qPCR values at both visits in both arms")

So, we see, comparing the yogurt and the control group, that, in the yogurt arm, there are more people who ended up having at least some L. crispatus. Whether that’s enough to restore a healthy vaginal microbiota is something to discuss with microbiota experts so one could decide whether it would be worth designing a new, better powered, study to refine our understanding of the effects of eating yogurt.

Longitudinal visualization

So far, this analysis ignored the longitudinal nature of the data.

Let’s fix that and display the baseline - post-antibiotic trajectories for these three variables in both arms.

qpcr_long_si <- 
  qpcr_long_si |> 
  mutate(
    time_point = time_point |> fct_relevel("baseline")
  )

qpcr_long_si  |> 
  ggplot(aes(x = time_point, y = qpcr_value_b |> log10(), col = arm)) +
  geom_line(aes(group = pid), alpha = 0.5) +
  geom_point() +
  scale_color_manual(values = arm_colors) +
  facet_grid(variable ~ ., scales = "free")

We see huge difference between the visits, but not so much between the arms. We might conclude that the antibiotic drove most of these effects.

What about birth control?

Remember that the two arms were not balanced with respect to birth control? (Results from Table 1)

Let’s do a few visualization to check if the birth control of participant may be related to the outcomes.

qpcr_long_si  |> 
  left_join(demographic_data) |> 
  ggplot(aes(x = time_point, y = qpcr_value_b |> log10(), col = birth_control)) +
  geom_line(aes(group = pid), alpha = 0.5) +
  geom_point(alpha = 0.5) +
  scale_color_manual(values = c("deeppink2","green4")) +
  facet_grid(variable ~ arm, scales = "free")

Joining with `by = join_by(pid, arm)`

qpcr_long_si  |> 
  left_join(demographic_data) |> 
  ggplot(aes(x = arm, y = qpcr_value_b |> log10(), fill = birth_control)) +
  geom_boxplot(varwidth = TRUE) +
  scale_fill_manual(values = c("deeppink2","green4")) +
  facet_grid(variable ~ time_point, scales = "free")

Joining with `by = join_by(pid, arm)`

Visually, we don’t see consistent effects of the birth control on the outcomes.

Additional analyses

There are many additional analyses that could be performed on this data. For example, one could continue this analysis by looking at the impact of the intervention on the cytokines (in the fourth table).