4 min read

Reservoir Sampling

\(\newcommand{\t}[1]{ {#1}^T }\) \(\newcommand{\inv}[1]{ {#1}^{-1} }\) \(\newcommand{\vec}[1]{\mathbf{#1}}\) \(\newcommand{\rv}[1]{\mathbf{#1}}\) \(\newcommand{\follows}{\thicksim}\) \(\newcommand{\normal}[1]{\mathcal{N}\!\left(#1\right)}\) \(\newcommand{\ga}[1]{\mathrm{Gamma}\!\left(#1\right)}\) \(\newcommand{\diff}[1]{\mathop{}\!\mathrm{d}#1}\) \(\newcommand{\fun}[2]{#1\left(#2\right)}\) \(\newcommand{\expect}[1]{\vec{E}\left[#1\right]}\) \(\newcommand{\prob}[1]{\fun{p}{#1}}\) \(\newcommand{\Prob}[1]{\fun{P}{#1}}\) \(\DeclareMathOperator*{\argmax}{arg\,max}\)

Reservoir sampling is a very useful technique that is capable of sampling \(k\) items from a stream of items with an indefinite (but limited) length in just one pass. The algorithm is particularly relevant in the scenarios where the populations to be sampled from are too large to fit into the memory available.

An example implementation of the algorithm in its simplest form in R:

reservoir_sampling <- function(k, N) {
	population <- 1:N
	reservoir  <- population[1:k]
	for (i in (k+1):N) {
		if (runif(1) < (k/i)) {
			j <- sample(1:k, 1)
			reservoir[j] <- population[i]
		}
	}
	return(reservoir)
}

set.seed(42)
reservoir_sampling(2, 10)  # sampling 2 elements from a population of 10 items
## [1] 9 2

We can also check that the probability with which each item is sampled is indeed 20% using Monte Carlo simulation:

N <- 100000
table(sapply(1:N, FUN=function(x) {reservoir_sampling(2, 10)})) / N
## 
##       1       2       3       4       5       6       7       8       9      10 
## 0.20052 0.19662 0.20140 0.20015 0.20216 0.19873 0.19965 0.20018 0.19999 0.20060

We can see that the probability of each item to have been sampled is roughly 0.20.

The basic idea of the algorithm is to maintain a “reservoir” of candidate items. The reservoir is initially filled with the first \(k\) items from the population. As we iterate the \(j\)-th item of the population, we test whether to replace one of the existing candidates in the reservoir with the current item with a pass rate of \(\frac{k}{j}\) (note that \(j \gt k\)).

To see how the math adds up, we note that:

\[ \begin{split} & \Prob{\text{the j-th item ends up staying in the reservoir}} = \\ & \qquad \Prob{\text{the j-th item ends up staying in the reservoir} \mid \text{the j-th item makes it into the reservoir} } \\ & \qquad \Prob{ \text{the j-th item makes it into the reservoir} } \end{split} \]

\(\Prob{ \text{the j-th item makes it into the reservoir}}\) can be further broken down into two cases, \(j <= k\) and \(j > k\), so we have:

\[ \begin{split} & \Prob{\text{the j-th item ends up staying in the reservoir}} = \\ & \qquad \Prob{\text{the j-th item ends up staying in the reservoir} \mid \text{the j-th item makes it into the reservoir}, j \le k} \\ & \qquad \Prob{ \text{the j-th item makes it into the reservoir}, j \le k} + \\ & \qquad \Prob{\text{the j-th item ends up staying in the reservoir} \mid \text{the j-th item makes it into the reservoir}, j > k} \\ & \qquad \Prob{ \text{the j-th item makes it into the reservoir}, j > k} \end{split} \]

Observe that:

  • \(\Prob{ \text{the j-th item makes it into the reservoir}, j \le k}\) is simply \(\frac{k}{N} \cdot 1\).
  • \(\Prob{ \text{the j-th item makes it into the reservoir}, j > k}\) is \(\frac{N - k}{N} \frac{k}{j}\).
  • \(\Prob{\text{the j-th item ends up staying in the reservoir} \mid \text{the j-th item makes it into the reservoir}, j > k}\) is \(\prod_{i=1}^{N-j} \left\{ \left(1 - \frac{k}{j + i}\right) + \frac{k}{j + i} \left(1 - \frac{1}{k}\right) \right\} = \prod_{i=1}^{N-j} \frac{i + j - 1}{i + j} = \frac{j}{j+1} \frac{j+1}{j+2} \frac{j+2}{j+3} \cdots \frac{N-1}{N} = \frac{j}{N}\); this is just the probability that none of the subsequent items after the \(j\)-th item replaced the \(j\)-th item.
  • \(\Prob{\text{the j-th item ends up staying in the reservoir} \mid \text{the j-th item makes it into the reservoir}, j \le k} = \frac{k}{N}\); if an item is initially placed in the reservoir, then this probability is the same as \(\Prob{\text{the j-th item ends up staying in the reservoir} \mid \text{the j-th item makes it into the reservoir}, j = k}\).

So combining the above, we can show that:

\[ \begin{split} & \Prob{\text{the j-th item ends up staying in the reservoir}} \\ & \qquad = \frac{k}{N} \frac{k}{N} + \frac{N - k}{N} \frac{k}{j} \frac{j}{N} \\ & \qquad = \frac{k}{N} \end{split} \]