Scientists often evaluate theories or draw conclusions by informally combining results from several experiments. The experiments and the measured outcomes are usually diverse – making a meta-analysis inappropriate – and scientists therefore typically use the number of significant p-values to support their conclusion. P-values, however, are a poor way of integrating results, and since statistical power is often low, “conflicting results” are common. Informal methods of evaluating a series of experiments make inefficient use of the data and can lead to incorrect conclusions and poor decisions. Here we show how to combine diverse evidence across experiments using Bayes factors [1,2], based on a method developed by Kuiper et al. [3].
The procedure is outlined in Figure 1 and consists of the following five steps:
1. Before seeing the data, specify the prior probability of three hypotheses: that the effect is less than zero, greater than zero, and exactly zero. An equal probability of 1/3 is usually appropriate for these exhaustive and mutually exclusive hypotheses. These probabilities will be updated after observing the data. (It is also possible to use a range of values around zero instead of a point value of exactly zero, and the null hypothesis could be a value other than zero; we will ignore these details for now.)
2. Specify a prior distribution for the effect size (ES). The ES depends on the research question and could be a difference between means, the slope of a regression line, or an odds ratio. The BayesCombo package calculates a sensible default prior if none is specified.
3. Calculate the effect size and standard error (SE) for an experiment; these can be obtained from the output of a standard statistical analysis.
4. Calculate the Bayes factor (BF) for each hypothesis, which represents the evidence for a hypothesis after seeing the data, relative to the evidence before seeing the data. The BFs are calculated by the BayesCombo package as the ratio of posterior to prior distributions over a defined range of parameter values.
5. Update the prior probability of each hypothesis with the Bayes factors to give the posterior probability of each hypothesis.
For the next experiment, use these updated probabilities to replace those defined in Step 1 and go through Steps 2–5 again. Repeat for all the experiments you want to include. Let’s work through an example.
We have an experiment where 20 rats were randomised to one of four doses of the antidepressant fluoxetine, given in the drinking water. The time that the rats spent immobile in the Forced Swim Test (FST) was recorded. The FST is a standard behavioural test of “depression” in rodents.
First, we specify the prior probability of three hypotheses: that fluoxetine increases (H>), decreases (H<), or has no effect (H0) on immobility time. These three hypotheses are exhaustive (they include all possible outcomes) and mutually exclusive (only one can be true). Although fluoxetine is known to decrease immobility time, we will specify an equal probability of 1/3 for each hypothesis to illustrate the approach.
Next, we need to specify a prior for the effect size (we define the effect size in Step 3). For now we will use the default prior, which is calculated from the data. It is a normal prior, centred at zero, with the width calculated such that the 99% confidence interval (CI) of the prior matches the 99% CI of the data distribution. The results can be sensitive to the choice of prior, especially if the width is large relative to the data. The default prior is suitable for most situations where you have no prior information to include, and the results are insensitive to small changes near the default value (see below).
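The default prior can be made concrete with a small calculation. The sketch below (in Python, for illustration only) shows one way to read the matching rule; the exact formula used by the BayesCombo package may differ, so treat `se0` here as an approximation:

```python
z99 = 2.5758  # 99.5th percentile of the standard normal (half-width of a 99% CI)

beta, se = -0.252, 0.099  # slope and standard error from the fluoxetine regression

# One reading of the rule: a zero-centred normal prior whose 99% CI
# just covers the 99% CI of the data distribution N(beta, se).
se0 = abs(beta) / z99 + se  # about 0.197
```

Note that the prior width grows with both the size and the uncertainty of the observed effect, so the default prior always comfortably encompasses the likelihood.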
The data for this example are in the labstats package (available on CRAN) and are plotted in Figure 2. To calculate the effect size (and the prior in the previous step) we need to define the analysis. Here, dose is treated as a continuous variable (see reference [5]), and so a linear regression quantifies the relationship between fluoxetine and immobility time. No effect corresponds to a flat line (slope = 0) in Figure 2.
library(labstats)
par(las=1)
plot(time.immob ~ dose, data=fluoxetine, col="royalblue",
     ylab="Time immobile (s)", ylim=c(0, 250), xlim=c(0, 250),
     xlab="Dose of fluoxetine (mg/L in drinking water)")
abline(lm(time.immob ~ dose, data=fluoxetine), col="royalblue") # add regression line
The code below calculates the effect size and standard error. These are the required inputs for calculating the posterior probabilities and are returned by lm(), glm(), and related functions (e.g. from ANOVAs, t-tests, or regressions with Gaussian, Poisson, or binomial outcomes).
summary(lm(time.immob ~ dose, data=fluoxetine))$coefficients
##               Estimate   Std. Error    t value      Pr(>|t|)
## (Intercept)    170.440  14.83681307  11.487642  1.015410e-09
## dose            -0.252   0.09913263  -2.542049  2.043654e-02
From the above output we see that the estimated slope is -0.252, with a standard error of 0.099, and a p-value of 0.020. We now have all the information to calculate the probability that fluoxetine increases, decreases, or has no effect on immobility time.1
Bayes factors are used as an intermediate step in calculating the posterior probabilities of each hypothesis. The functions in the BayesCombo package calculate them automatically, but we take a brief digression from the fluoxetine example to illustrate how they are calculated and to provide some insight into the method. Figure 3 shows a standard Bayesian updating of a prior distribution to a posterior distribution based on the data (likelihood). The prior in the top panel is normal with a mean of zero and a standard deviation of 1.29. The middle panel shows the data (likelihood) distribution, which has a mean of 0.75 and a standard deviation of 1. The bottom panel shows the posterior, which is shifted to positive values relative to the prior and reflects the influence of the data. The prior and likelihood correspond to Steps 2 and 3 in Figure 1. The BFs for the three hypotheses are then calculated as ratios of posterior to prior areas or heights of the distributions:
$$ BF_{H<} = \dfrac{\mathrm{area~of~d}}{\mathrm{area~of~a}} $$
$$ BF_{H0} = \dfrac{\mathrm{height~of~point~e}}{\mathrm{height~of~point~b}} $$
$$ BF_{H>} = \dfrac{\mathrm{area~of~f}}{\mathrm{area~of~c}} $$
For example, 50% of the prior distribution is above 0 (region c), as is 72% of the posterior (region f). The interpretation is that the data have increased the plausibility of hypothesis H> from 50% to 72%. The ratio of these values is the Bayes factor: 0.72/0.5 ≈ 1.4. Thus, we can say that H> is 1.4 times more plausible after seeing the data than before. More generally, a BF > 1 means that the data support a hypothesis, whereas a BF < 1 means that the data do not.
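These numbers can be checked with a short calculation. The sketch below (in Python, for illustration; the package performs an equivalent computation internally) applies the standard conjugate normal update to the Figure 3 distributions and recovers the 72% and the BF of about 1.4:

```python
from math import erf, exp, pi, sqrt

def norm_cdf(x, m, s):
    return 0.5 * (1 + erf((x - m) / (s * sqrt(2))))

def norm_pdf(x, m, s):
    return exp(-0.5 * ((x - m) / s) ** 2) / (s * sqrt(2 * pi))

# Prior N(0, 1.29) and likelihood N(0.75, 1), as in Figure 3
m0, s0 = 0.0, 1.29
m1, s1 = 0.75, 1.0

# Conjugate normal update: precisions add, means are precision-weighted
prec = 1 / s0**2 + 1 / s1**2
post_m = (m0 / s0**2 + m1 / s1**2) / prec
post_s = sqrt(1 / prec)

# Bayes factors as posterior/prior area (H<, H>) or height at zero (H0)
p_gt_prior = 1 - norm_cdf(0, m0, s0)          # region c: 0.5
p_gt_post = 1 - norm_cdf(0, post_m, post_s)   # region f: about 0.72
bf_gt = p_gt_post / p_gt_prior                # about 1.4
bf_lt = norm_cdf(0, post_m, post_s) / norm_cdf(0, m0, s0)
bf_0 = norm_pdf(0, post_m, post_s) / norm_pdf(0, m0, s0)  # Savage-Dickey ratio
```

The height ratio for the point null (bf_0) is the Savage-Dickey density ratio described in references [1,2].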
The final step is to update the prior probability of each hypothesis with the BFs to get the posterior probability of each hypothesis. The equation below shows the calculation for the hypothesis that the effect is greater than zero (H>); the other probabilities are calculated in the same way, substituting the corresponding BF in the numerator. The Pr() terms are the prior probabilities of each hypothesis, and they cancel out of the equation when they are all equal (e.g. all 1/3).
$$ P(H>) = \dfrac{Pr(H>)BF_{H>}}{Pr(H<)BF_{H<} + Pr(H0)BF_{H0} + Pr(H>)BF_{H>}}$$
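With the BFs in hand, the update is just a weighted renormalisation. The Python snippet below uses purely illustrative BF values (not taken from any analysis in this document) to show the arithmetic:

```python
# Posterior probability of each hypothesis = prior x BF, renormalised.
bfs = {"H<": 0.55, "H0": 1.37, "H>": 1.45}   # illustrative values only
priors = {h: 1 / 3 for h in bfs}             # equal priors cancel out

total = sum(priors[h] * bfs[h] for h in bfs)
pph = {h: priors[h] * bfs[h] / total for h in bfs}
```

With equal priors this reduces to normalising the BFs so that they sum to one; with unequal priors, hypotheses with more initial support get proportionally more posterior weight.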
All of the above steps can be conveniently carried out with the pph() function (Posterior Probability of a Hypothesis). Returning to the fluoxetine example, we can calculate the probability that the slope is negative, positive, or zero. Below, we specify the slope (beta = -0.252) and its standard error (se.beta = 0.099) obtained previously from the output of the lm() function; the default settings are used for all other options. The output shows that the probability of a negative slope (H<) is 0.9120, of a positive slope (H>) is 0.0106, and of a zero slope (H0, corresponding to no effect) is 0.0774. Unlike a p-value, which is the probability of the data given a hypothesis, these probabilities have a direct and intuitive interpretation as the probability of a hypothesis given the data.
x1 <- pph(beta = -0.252, se.beta = 0.099)
summary(x1)
## Length Class Mode
## beta 1 -none- numeric
## se.beta 1 -none- numeric
## beta0 1 -none- numeric
## ci 1 -none- numeric
## se0 1 -none- numeric
## post.b 1 -none- numeric
## post.se 1 -none- numeric
## H.priors 3 -none- numeric
## BFs 3 -none- numeric
## pphs 3 -none- numeric
Plotting the output of the pph() function displays the likelihood, prior, and posterior distributions (Fig. 4). The mean of the likelihood (data distribution; dotted line) is centred on -0.252, and the standard error (0.099) determines the width of this distribution. The prior (thick grey line) was calculated automatically; note how it encompasses the likelihood distribution and is centred at zero. The posterior distribution (thin black line) represents the combination of prior and likelihood. The BFs and posterior probabilities of the three hypotheses are then calculated from these prior and posterior distributions (analogous to the distributions in Fig. 3).
The above default prior may be more informative than desired (note how the posterior is pulled towards the prior in Fig. 4), and the easiest way to decrease the influence of the prior is to make it wider by specifying a multiplier. The code below doubles the previous standard error (se.mult = 2), and the posterior is now much closer to the data distribution (Fig. 5). The posterior hypothesis probabilities, however, are similar to the previous analysis; rounded to two decimal places, the probability that the slope is negative is still 0.91.
x2 <- pph(beta = -0.252, se.beta = 0.099, se.mult = 2)
summary(x2)
## Length Class Mode
## beta 1 -none- numeric
## se.beta 1 -none- numeric
## beta0 1 -none- numeric
## ci 1 -none- numeric
## se0 1 -none- numeric
## post.b 1 -none- numeric
## post.se 1 -none- numeric
## H.priors 3 -none- numeric
## BFs 3 -none- numeric
## pphs 3 -none- numeric
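The effect of widening the prior follows directly from the conjugate update: as the prior variance grows, the prior precision shrinks, and the posterior mean moves towards the observed slope. The Python sketch below demonstrates this; the prior SD of 0.197 is a hypothetical stand-in for the default value (the package computes its own):

```python
from math import sqrt

def post_mean_sd(beta, se, se0):
    # Conjugate update of a zero-centred normal prior N(0, se0)
    # with a normal likelihood N(beta, se): precisions add.
    prec = 1 / se0**2 + 1 / se**2
    return (beta / se**2) / prec, sqrt(1 / prec)

beta, se = -0.252, 0.099
se0 = 0.197  # hypothetical stand-in for the default prior SD

m1, s1 = post_mean_sd(beta, se, se0)      # default-width prior
m2, s2 = post_mean_sd(beta, se, 2 * se0)  # prior SD doubled (cf. se.mult = 2)
```

Doubling the prior SD moves the posterior mean closer to the observed slope and widens the posterior slightly, which is the shift visible between Fig. 4 and Fig. 5.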
A final example illustrates other options. The prior for the slope is directly specified as having a mean of 0 (beta0 = 0) and a standard error of 1.2 (se0 = 1.2). In the previous analyses H0 was defined as exactly equal to 0, but here we define H0 as a range of values close to zero using H0 = c(-0.05, 0.05). Finally, the priors on the hypotheses are given via the H.priors argument. The values indicate that the prior probabilities of the slope being negative, zero, and positive are 0.495, 0.495, and 0.01, respectively. The interpretation is that we expect fluoxetine either to decrease immobility time or to have no effect, but we consider it unlikely to increase immobility time.
x3 <- pph(beta = -0.252, se.beta = 0.099, beta0 = 0, se0 = 1.2,
          H0 = c(-0.05, 0.05), H.priors = c(0.495, 0.495, 0.01))
summary(x3)
## Length Class Mode
## beta 1 -none- numeric
## se.beta 1 -none- numeric
## beta0 1 -none- numeric
## ci 1 -none- numeric
## se0 1 -none- numeric
## post.b 1 -none- numeric
## post.se 1 -none- numeric
## H.priors 3 -none- numeric
## BFs 3 -none- numeric
## pphs 3 -none- numeric
With these options, the probability that fluoxetine decreases immobility time is now 0.98, and the distributions are shown in Figure 6.
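With an interval null, the BF for H0 becomes a ratio of areas over [-0.05, 0.05] rather than a ratio of density heights at zero. The Python sketch below illustrates the idea using the prior specified above; it is a simplified calculation rather than the package's exact one, so the numbers are only indicative:

```python
from math import erf, sqrt

def norm_cdf(x, m, s):
    return 0.5 * (1 + erf((x - m) / (s * sqrt(2))))

beta, se = -0.252, 0.099   # likelihood for the slope
beta0, se0 = 0.0, 1.2      # prior for the slope, as specified above

# Conjugate normal update
prec = 1 / se0**2 + 1 / se**2
post_m = (beta / se**2) / prec
post_s = sqrt(1 / prec)

# BF for the interval null: posterior vs prior mass inside [-0.05, 0.05]
lo, hi = -0.05, 0.05
prior_h0 = norm_cdf(hi, beta0, se0) - norm_cdf(lo, beta0, se0)
post_h0 = norm_cdf(hi, post_m, post_s) - norm_cdf(lo, post_m, post_s)
bf_h0 = post_h0 / prior_h0  # below 1: the data reduce support for H0
```

The same area-ratio logic applies to H< and H>, with the regions taken below and above the interval.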
We can use the above procedure to sequentially combine results from multiple experiments, where the posterior probabilities for the hypotheses from one experiment are used as the prior probabilities for the next. It is the analyst's responsibility to ensure that the experiments are testing the same overall hypothesis or theory. In addition, the direction of the effects should be aligned; for example, if a positive effect size in one experiment is interpreted as supporting a theory, but a negative effect size in another experiment also supports the theory, then the negative effect should be multiplied by -1 to change its sign.
The ev.combo() function combines results and requires only effect sizes (beta) and standard errors (se.beta) from two or more experiments as input. The default prior mean (beta0 = 0) is suitable for most analyses, as are the equal prior hypothesis probabilities (H.priors = c(1/3, 1/3, 1/3)). In the example below, assume we have four clinical trials where positive effect sizes indicate a beneficial effect of a treatment.
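The sequential procedure can be sketched end-to-end. The Python code below chains the hypothesis probabilities across a set of made-up trials; the effect sizes, SEs, and the default-prior rule used here are all hypothetical, chosen only to mimic the scenario of four positive results. It illustrates the updating logic, not a reimplementation of ev.combo():

```python
from math import erf, exp, pi, sqrt

def norm_cdf(x, m, s):
    return 0.5 * (1 + erf((x - m) / (s * sqrt(2))))

def norm_pdf(x, m, s):
    return exp(-0.5 * ((x - m) / s) ** 2) / (s * sqrt(2 * pi))

def update_hypotheses(h_probs, beta, se, se0):
    # One pass of Steps 2-5: zero-centred normal prior N(0, se0) for the
    # effect size, conjugate update, BFs, then update the hypothesis probs.
    prec = 1 / se0**2 + 1 / se**2
    post_m = (beta / se**2) / prec
    post_s = sqrt(1 / prec)
    bfs = [norm_cdf(0, post_m, post_s) / norm_cdf(0, 0, se0),        # H<
           norm_pdf(0, post_m, post_s) / norm_pdf(0, 0, se0),        # H0
           (1 - norm_cdf(0, post_m, post_s)) / (1 - norm_cdf(0, 0, se0))]  # H>
    w = [p * b for p, b in zip(h_probs, bfs)]
    return [x / sum(w) for x in w]

# Hypothetical effect sizes and SEs for four trials (not the data used here)
trials = [(2.3, 1.1), (1.2, 1.0), (0.8, 1.1), (1.5, 0.9)]

h_probs = [1 / 3, 1 / 3, 1 / 3]    # start from equal hypothesis priors
for beta, se in trials:
    se0 = abs(beta) / 2.5758 + se  # hypothetical default-prior width
    h_probs = update_hypotheses(h_probs, beta, se, se0)
```

After the loop, h_probs holds the final probabilities for (H<, H0, H>); with consistently positive effects the probability of H> climbs well above the initial 1/3.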
The forestplot() function makes a graph resembling a traditional forest plot (Fig. 7), with the observed effect sizes and their 99% CIs (black lines). The automatically calculated priors for the effect sizes are also plotted (grey lines). All four experiments have positive effect sizes, but only the first experiment is significant at the usual 0.05 level, and simple “vote counting” would suggest that the treatment is ineffective, despite the three non-significant studies being in the predicted direction.
The results summary below shows how the support for the three hypotheses changes as each experiment is added. The first line of the output contains the prior hypothesis probability of 1/3, or 33%. When the first experiment is included (second row), the probability that the effect size is greater than zero (H>) increases to 85%. As more experiments are included, the probability increases further to 98%, and the null retains only 1.7% support.
## Length Class Mode
## N 1 -none- numeric
## beta 4 -none- numeric
## se.beta 4 -none- numeric
## beta0 1 -none- numeric
## ci 1 -none- numeric
## se0 4 -none- numeric
## post.b 4 -none- numeric
## post.se 4 -none- numeric
## H.priors 3 -none- numeric
## pph.uniform 12 -none- numeric
## pphs 15 -none- numeric
It is easier to see how these probabilities change as experiments are added with a graph (Fig. 8).
In addition to seeing how the probabilities of the hypotheses change as experiments are added, we can also plot the evidence for each experiment. This is equivalent to analysing each experiment separately with the pph() function, and the results are shown in Fig. 9. Both graphs plot the same data; the left graph groups the studies by hypothesis, while the right graph groups the hypotheses by study. It is easy to see in the left graph that the H> hypothesis has the most support (above 0.5) for all four experiments.
par(mfrow=c(1,2))
dotchart(x4$pph.uniform, xlim=c(0,1), xlab="PPH", pch=21, bg="grey")    # studies grouped by hypothesis
dotchart(t(x4$pph.uniform), xlim=c(0,1), xlab="PPH", pch=21, bg="grey") # hypotheses grouped by study
[1] Wagenmakers E-J, Lodewyckx T, Kuriyal H, Grasman R (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage-Dickey method. Cognitive Psychology 60: 158-189.
[2] Wetzels R, Grasman RP, Wagenmakers E-J (2010). An encompassing prior generalization of the Savage-Dickey density ratio. Computational Statistics and Data Analysis 54: 2094-2102.
[3] Kuiper RM, Buskens V, Raub W, Hoijtink H (2012). Combining statistical evidence from several studies: A method using Bayesian updating and an example from research on trust problems in social and economic exchange. Sociological Methods and Research 42(1): 60-81.
[4] Lazic SE (2008). Why we should use simpler models if the data allow this: relevance for ANOVA designs in experimental biology. BMC Physiology 8: 16.
[5] Lazic SE (2016). Experimental Design for Laboratory Biologists: Maximising Information and Improving Reproducibility. Cambridge University Press: Cambridge, UK.
A potential source of confusion: we use normal or Gaussian distributions to represent likelihoods and priors, and normal distributions are defined by a mean and standard deviation. A standard error reflects the uncertainty (or precision) of an estimate, which we get from the output of a standard statistical analysis. Thus, we can say that “the estimated slope is -0.252 with a standard error of 0.099”. We can also say that we will represent our uncertainty in the slope as a “normal distribution with a mean of -0.252 and a standard deviation of 0.099”. Here we are treating a standard error and a standard deviation as the same value, which is not usually true. We can do this because we are using the output of one analysis as the input into another. Later in the code you will see the likelihood defined as beta = -0.252 and se.beta = 0.099. Similarly, the prior for the effect size is defined by beta0 and se0. But both se.beta and se0 are the standard deviations of normal distributions (e.g. as in Fig. 4). This confusion in terminology arises because we are taking the output from a frequentist analysis and using it as input into a Bayesian analysis. To keep things simple, we have used se in the code when referring to this quantity.