
Deriving Full Conditional Distribution for Gibbs Sampling

\(\newcommand{\rv}[1]{\mathbf{#1}}\) \(\newcommand{\follows}{\thicksim}\) \(\newcommand{\normal}[1]{\mathcal{N}\!\left(#1\right)}\) \(\newcommand{\ga}[1]{\mathrm{Gamma}\!\left(#1\right)}\) \(\newcommand{\diff}[1]{\mathop{}\!\mathrm{d}#1}\) \(\newcommand{\fun}[2]{#1\left(#2\right)}\) \(\newcommand{\prob}[1]{\fun{p}{#1}}\)

Introduction

Suppose we have three continuous random variables \(\rv{X}\), \(\rv{Y}\), and \(\rv{Z}\), and we wish to sample from their joint probability distribution \(\prob{\rv{X}, \rv{Y}, \rv{Z}}\), which can often be difficult to do directly. Gibbs sampling can be used to sample from this joint distribution provided that we can directly sample from the full conditional distribution of each individual random variable, namely \(\prob{\rv{X} \mid \rv{Y}, \rv{Z}}\), \(\prob{\rv{Y} \mid \rv{X}, \rv{Z}}\), and \(\prob{\rv{Z} \mid \rv{X}, \rv{Y}}\).
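In code form, one iteration of the sampler simply cycles through these full conditionals, updating each variable in turn using the latest values of the others. Below is a minimal R sketch of such a sweep; the three sampler functions are hypothetical placeholders for direct draws from the full conditionals.

# A minimal sketch of one Gibbs iteration (my own illustration): the functions
# sample_x_given_yz(), sample_y_given_xz() and sample_z_given_xy() are
# hypothetical and stand for direct draws from the three full conditionals.
gibbs_sweep <- function(x, y, z) {
  x <- sample_x_given_yz(y, z)
  y <- sample_y_given_xz(x, z)
  z <- sample_z_given_xy(x, y)
  list(x = x, y = y, z = z)
}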

The first step in Gibbs sampling is therefore to derive the full conditional distributions for each of the random variables in the joint distribution. This topic has been addressed in a number of places (e.g. #1 and #2). The common strategy for finding the analytical solution to the full conditional \(\prob{\rv{X} \mid \rv{Y}, \rv{Z}}\) seems to be:

  1. Write down the expression for the full joint distribution of your model (i.e. \(\prob{\rv{X}, \rv{Y}, \rv{Z}}\)).
  2. Drop all the factors and constants that do not involve the random variable \(\rv{X}\).
  3. See if the resultant expression is proportional to the probability density/mass function of a well-known distribution; if so, we have found a full conditional distribution from which we can sample easily; otherwise, we will have to resort to other sampling methods.

Why does it work?

The rationale behind the strategy outlined above may not be immediately obvious to those who are new to Bayesian methods. For \(\prob{\rv{X} \mid \rv{Y}, \rv{Z}}\), note that:

\[ \begin{align} \prob{\rv{X} \mid \rv{Y}, \rv{Z}} &= \frac{ \prob{\rv{X}, \rv{Y}, \rv{Z}} }{ \prob{\rv{Y}, \rv{Z}} } \\ &= \frac{ \prob{\rv{X}, \rv{Y}, \rv{Z}} }{ \int \prob{\rv{X}, \rv{Y}, \rv{Z}} \diff{\rv{X}} } \label{eq:full-conditional-X} \end{align} \]

Now suppose we can factorise the probability density function (p.d.f.) of the joint distribution into a product of two functions:

\[ \begin{align} \prob{\rv{X}, \rv{Y}, \rv{Z}} &= g\!\left( \rv{Y}, \rv{Z} \right) f\!\left( \rv{X}, \rv{Y}, \rv{Z} \right) \label{eq:full-conditional-factorised} \end{align} \]

where the function \(g\!\left( \rv{Y}, \rv{Z} \right)\) does not involve \(\rv{X}\). Substitute \(\eqref{eq:full-conditional-factorised}\) back into \(\eqref{eq:full-conditional-X}\):

\[ \begin{align} \prob{\rv{X} \mid \rv{Y}, \rv{Z}} &= \frac{ g\!\left( \rv{Y}, \rv{Z} \right) f\!\left( \rv{X}, \rv{Y}, \rv{Z} \right) }{ \int g\!\left( \rv{Y}, \rv{Z} \right) f\!\left( \rv{X}, \rv{Y}, \rv{Z} \right) \diff{\rv{X}} } \\ &= \frac{f\!\left( \rv{X}, \rv{Y}, \rv{Z} \right)}{ \int f\!\left( \rv{X}, \rv{Y}, \rv{Z} \right) \diff{\rv{X}} } \label{eq:full-conditional-reduced} \end{align} \]

Since we are working with the full conditional for \(\rv{X}\), both \(\rv{Y}\) and \(\rv{Z}\) in \(\eqref{eq:full-conditional-reduced}\) can be considered observed and thus constant, so the factor \(\fun{f}{ \rv{X}, \rv{Y}, \rv{Z} }\) is in fact a function of \(\rv{X}\) only; note that when \(\rv{Y}\) and \(\rv{Z}\) are treated as constants, expression \(\eqref{eq:full-conditional-reduced}\) represents a valid p.d.f. for \(\rv{X}\), since the denominator normalises it to integrate to one. So if the p.d.f. of the full joint distribution can be factorised in such a way that \(f\!\left( \rv{X}, \rv{Y}, \rv{Z} \right)\) is proportional to the p.d.f. (p.m.f. for a discrete r.v.) of a well-known distribution from which we can sample directly, then we are done; if the full conditional distributions cannot be obtained this way, we will have to resort to sampling methods other than Gibbs sampling.

An Example

The following example is taken from Gilks et al. 1996¹, p. 75. Consider a simple two-parameter Bayesian model:

\[ \begin{align} y_i &\follows \normal{\mu, \tau^{-1}}, & i = 1, \dots, n; \\ \mu &\follows \normal{0, 1}; \\ \tau &\follows \ga{2, 1}; \end{align} \]

\(\normal{\mu, \tau^{-1}}\) denotes a normal distribution with mean \(\mu\) and variance \(\tau^{-1}\); the parameter \(\tau\) is called the precision of the normal distribution and is defined as \(\tau = \frac{1}{\sigma^2}\), where \(\sigma\) is the standard deviation of the normal distribution. \(\ga{\alpha, \beta}\) denotes a gamma distribution whose mean and variance are \(\frac{\alpha}{\beta}\) and \(\frac{\alpha}{\beta^2}\) respectively; its p.d.f. evaluated at \(x\) is given by \(\frac{\beta^\alpha}{\Gamma \left(\alpha \right)} x^{\alpha - 1} e^{- \beta x}\). We will assume independence between the \(\left\lbrace y_i \right\rbrace\) given \(\mu\) and \(\tau\), and also between \(\mu\) and \(\tau\) themselves. In other words, we have independently drawn \(n\) samples (the \(\left\lbrace y_i \right\rbrace\)) from a normal distribution \(\normal{\mu, \tau^{-1}}\) whose mean \(\mu\) and precision \(\tau\) follow a normal prior and a gamma prior respectively.
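As an aside (not part of the original example), R's dgamma() and rgamma() use this same shape/rate parametrisation, which the implementation below relies on; a quick sanity check:

# Check (my own addition) that R's shape/rate parametrisation matches the p.d.f. above:
all.equal(
  dgamma(1.5, shape = 2, rate = 1),                 # R's built-in gamma density
  (1^2 / gamma(2)) * 1.5^(2 - 1) * exp(-1 * 1.5)    # beta^alpha / Gamma(alpha) * x^(alpha - 1) * e^(-beta * x)
)  # should be TRUE

g_draws <- rgamma(1e5, shape = 2, rate = 1)
mean(g_draws)  # should be close to alpha / beta   = 2
var(g_draws)   # should be close to alpha / beta^2 = 2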

To derive the full conditional distributions for \(\mu\) and \(\tau\), we first write down the expression for the full joint distribution for our model:

\[ \begin{align} \prob{y, \mu, \tau} &= \prob{ \mu } \prob{ \tau } \prod_{i = 1}^{n} \prob{ y_i \mid \mu, \tau } \\ &= { \frac{1}{\sqrt{2\pi}} e^{- \frac{\mu^2}{2}} } {\tau e^{- \tau}} \prod_{i = 1}^{n} \sqrt{ \frac{\tau}{2\pi} } e^{\frac{-\tau {\left( y_i - \mu \right)}^2 }{2}} \label{eq:full-joint} \\ \end{align} \]
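As a small aside (my own addition, not from the book), it can be handy to encode this joint density in R on the log scale, so that the derivations below can be sanity-checked numerically:

# Log of the full joint density p(y, mu, tau) for the model above.
log_joint <- function(mu, tau, y) {
  dnorm(mu, mean = 0, sd = 1, log = TRUE) +                    # prior for mu: N(0, 1)
    dgamma(tau, shape = 2, rate = 1, log = TRUE) +             # prior for tau: Gamma(2, 1)
    sum(dnorm(y, mean = mu, sd = 1 / sqrt(tau), log = TRUE))   # likelihood: y_i ~ N(mu, 1 / tau)
}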

Full Conditional for \(\mu\)

Let us start with \(\prob{ \mu \mid y, \tau }\). Factorise \(\eqref{eq:full-joint}\) into two factors \(\fun{g}{ y, \tau }\) and \(\fun{f}{\mu, y, \tau}\) as discussed previously:

\[ \begin{align*} \prob{ \mu, y, \tau } &= { \left( \frac{1}{2\pi} \right) }^{\frac{1}{2}} \tau e^{-\tau} { \left( \frac{\tau}{2\pi} \right) }^{\frac{n}{2}} e^{-\frac{\mu^2}{2}} e^{\sum_{i = 1}^{n} \frac{-\tau {\left( y_i - \mu \right)}^2 }{2}} \\ &= { \left( \frac{1}{2\pi} \right) }^{\frac{1}{2}} \tau e^{-\tau} { \left( \frac{\tau}{2\pi} \right) }^{\frac{n}{2}} \exp{ \left\lbrace -\frac{\mu^2}{2} + \sum_{i = 1}^{n} \frac{-\tau {\left( y_i - \mu \right)}^2 }{2} \right\rbrace } \\ &= { \left( \frac{1}{2\pi} \right) }^{\frac{1}{2}} \tau e^{-\tau} { \left( \frac{\tau}{2\pi} \right) }^{\frac{n}{2}} \\ & \qquad \exp{\left\lbrace -\frac{\left( n \tau + 1 \right)}{2} \left( \mu^2 - 2 \mu \frac{\tau \sum_{i = 1}^{n} y_i}{ n \tau + 1 } + \frac{\tau \sum_{i = 1}^{n} {y_i}^2}{ n \tau + 1 } \right) \right\rbrace} \\ &= { \left( \frac{1}{2\pi} \right) }^{\frac{1}{2}} \tau e^{-\tau} { \left( \frac{\tau}{2\pi} \right) }^{\frac{n}{2}} \\ & \qquad \exp{\left\lbrace -\frac{\left( n \tau + 1 \right)}{2} \left( \mu^2 - 2 \mu \frac{\tau \sum_{i = 1}^{n} y_i}{ n \tau + 1 } + { \left( \frac{\tau \sum_{i = 1}^{n} y_i}{ n \tau + 1 } \right) }^2 - { \left( \frac{\tau \sum_{i = 1}^{n} y_i}{ n \tau + 1 } \right) }^2 + \frac{\tau \sum_{i = 1}^{n} {y_i}^2}{ n \tau + 1 } \right) \right\rbrace} \\ &= \underbrace{ { \left( \frac{1}{2\pi} \right) }^{\frac{1}{2}} \tau { \left( \frac{\tau}{2\pi} \right) }^{\frac{n}{2}} \exp{ \left\lbrace -\tau + \frac{n \tau + 1}{2} { \left( \frac{\tau \sum_{i = 1}^{n} y_i}{ n \tau + 1 } \right) }^2 - \frac{\tau}{2} \sum_{i = 1}^{n} {y_i}^2 \right\rbrace } }_{\fun{g}{ y, \tau } } \\ & \qquad \underbrace{ \exp{ \left\lbrace -\frac{1}{2 {\left( n \tau + 1 \right)}^{-1} } { \left( \mu - \frac{\tau \sum_{i = 1}^{n} y_i}{ n \tau + 1 } \right) }^2 \right\rbrace } }_{ \fun{f}{ \mu, y, \tau } } \end{align*} \]

As discussed before, \(\fun{g}{y, \tau}\) will be cancelled out in \(\prob{ \mu \mid y, \tau }\) and so we are left with:

\[ \begin{align} \prob{ \mu \mid y, \tau } &= \frac{ \exp{ \left\lbrace -\frac{1}{2 {\left( n \tau + 1 \right)}^{-1} } { \left( \mu - \frac{\tau \sum_{i = 1}^{n} y_i}{ n \tau + 1 } \right) }^2 \right\rbrace } }{ \displaystyle \int \exp{ \left\lbrace -\frac{1}{2 {\left( n \tau + 1 \right)}^{-1} } { \left( \mu - \frac{\tau \sum_{i = 1}^{n} y_i}{ n \tau + 1 } \right) }^2 \right\rbrace } \diff{\rv{\mu}} } \\ \end{align} \]

which corresponds to the p.d.f. of a normal distribution whose mean and variance are \(\frac{\tau \sum_{i = 1}^{n} y_i}{ n \tau + 1 }\) and \({ \left( n \tau + 1 \right) }^{-1}\) respectively. Thus we have:

\[ \begin{align} \mu \mid y, \tau &\follows \normal{\frac{\tau \sum_{i = 1}^{n} y_i}{ n \tau + 1 }, { \left( n \tau + 1 \right) }^{-1}} \end{align} \]
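Translated into R (anticipating the implementation below), this draw is a single rnorm() call; note that rnorm() expects a standard deviation, so the variance \({\left( n \tau + 1 \right)}^{-1}\) must be square-rooted. The small wrapper below is simply my own packaging of that draw:

# Draw mu from its full conditional N( tau * sum(y) / (n*tau + 1), (n*tau + 1)^(-1) ).
sample_mu <- function(y, tau) {
  n <- length(y)
  # rnorm() takes a standard deviation, hence the square root of the variance
  rnorm(1, mean = tau * sum(y) / (n * tau + 1), sd = (n * tau + 1)^(-1 / 2))
}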

Full Conditional for \(\tau\)

For \(\prob{ \tau \mid y, \mu }\), we follow a similar procedure but without writing down the factors that are not related to \(\tau\):

\[ \begin{align*} \prob{ \mu, y, \tau } &= { \left( \frac{1}{2\pi} \right) }^{\frac{1}{2}} \tau e^{-\tau} { \left( \frac{\tau}{2\pi} \right) }^{\frac{n}{2}} e^{-\frac{\mu^2}{2}} e^{\sum_{i = 1}^{n} \frac{-\tau {\left( y_i - \mu \right)}^2 }{2}} \\ \prob{ \tau \mid y, \mu } &\propto \tau^{\frac{n}{2} + 1} \exp{ \left\lbrace -\tau -\tau \frac{1}{2} \sum_{i = 1}^{n} {\left( y_i - \mu \right)}^2 \right\rbrace } \\ &\propto \underbrace{\tau^{ \frac{n}{2} + 2 - 1} \exp{ \left\lbrace - \tau \left( 1 + \frac{1}{2} \sum_{i = 1}^{n} {\left( y_i - \mu \right)}^2 \right) \right\rbrace}}_{\fun{f}{\tau, y, \mu}} \end{align*} \]

We were able to use the “proportional to” (\(\propto\)) symbol because the dropped factors are constants for our purpose. The final expression is proportional to the p.d.f. of a gamma distribution with shape \(\alpha = \frac{n}{2} + 2\) and rate \(\beta = 1 + \frac{1}{2} \sum_{i = 1}^{n} {\left( y_i - \mu \right)}^2\). In other words:

\[ \begin{align} \tau \mid y, \mu \follows \ga{\frac{n}{2} + 2, 1 + \frac{1}{2} \sum_{i = 1}^{n} {\left( y_i - \mu \right)}^2} \end{align} \]
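In R this is again a one-liner, since rgamma() uses the same shape/rate parametrisation; the wrapper below is my own packaging of the draw used in the implementation further down:

# Draw tau from its full conditional Gamma(n/2 + 2, 1 + 0.5 * sum((y - mu)^2)).
sample_tau <- function(y, mu) {
  n <- length(y)
  rgamma(1, shape = n / 2 + 2, rate = 1 + 0.5 * sum((y - mu)^2))
}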

Full Conditional for \(y\)

The full conditional distribution for each \(y_{i}\) is simply the normal distribution \(\normal{\mu, \tau^{-1}}\). Of course, most of the time we are not interested in sampling the \(\left\lbrace y_{i} \right\rbrace\) because they are observed; however, in cases where some of the \(y_i\) are missing, we can draw samples for them to use as imputed values.
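For instance, a hypothetical imputation step inside each Gibbs iteration might look like the following, assuming a logical vector missing that flags the missing entries of y (both the vector and the step are my own illustration, not part of the original example):

# Hypothetical imputation step: given the current mu and tau, redraw the
# missing entries of y from N(mu, 1 / tau).
y[missing] <- rnorm(sum(missing), mean = mu, sd = 1 / sqrt(tau))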

A Simple Implementation

The following R code implements a simple Gibbs sampler based on the full conditionals just derived:

set.seed(42)

# generate some y:
y <- rnorm(1000, mean = 3, sd = 4) # equivalent to tau = 1 / 16 = 0.0625
n <- length(y)

# randomly initialise the parameters
mu  <- rnorm(1, 0, 1)
tau <- rgamma(1, shape = 2, rate = 1)


# burn-in phase: give the MC some time to reach convergence
for (iter in 1:5000) {
  mu <- rnorm(1, mean = (tau * sum(y)) / (n * tau + 1), sd = (n * tau + 1)^(-1/2) )
  tau <- rgamma(1, shape = (n / 2) + 2, rate = 1 + 0.5 * sum( (y - mu)^2 ))
}

# do the actual sampling and collect the results
mus <- c()
taus <- c()

for (iter in 1:5000) {
  mu <- rnorm(1, mean = (tau * sum(y)) / (n * tau + 1), sd = (n * tau + 1)^(-1/2) )
  tau <- rgamma(1, shape = (n / 2) + 2, rate = 1 + 0.5 * sum( (y - mu)^2 ))
  mus <- append(mus, mu)
  taus <- append(taus, tau)
}

samples <- data.frame(mu = mus, tau = taus)
summary(samples)
##        mu             tau         
##  Min.   :2.432   Min.   :0.05227  
##  1st Qu.:2.768   1st Qu.:0.06053  
##  Median :2.852   Median :0.06237  
##  Mean   :2.853   Mean   :0.06243  
##  3rd Qu.:2.937   3rd Qu.:0.06437  
##  Max.   :3.318   Max.   :0.07350
hist(samples$mu, xlab = "mu", main = 'Histogram for samples of mu')

Figure 1: Histogram for samples of mu

hist(samples$tau, xlab = "tau", main = 'Histogram for samples of tau')

Figure 2: Histogram for samples of tau

References


  1. Gilks, W. R., S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. London: Chapman & Hall, 1996. Print.