<p>wesselb.github.io: Thoughts on machine learning and other topics (Jekyll feed, https://wessel.ai/feed.xml, updated 2024-01-23)</p>

<h1>A Short Note on Uniform Integrability (2021-08-05, https://wessel.ai/2021/08/05/uniform-integrability)</h1>

<h2 id="introduction">Introduction</h2>
<p>A sequence of random variables $(X_n)_{n \ge 1} \sub L^1$ is called $L^1$-convergent if there exists some limit $X \in L^1$ such that $\E|X_n - X| \to 0$ as $n \to \infty$.
In this post, we briefly discuss a necessary and sufficient condition for $L^1$-convergence called <em>uniform integrability</em>.</p>
<h2 id="uniform-integrability">Uniform Integrability</h2>
<p><strong>Definition.</strong> A collection of random variables $\mathcal{F}$ is called <em>uniformly integrable</em> if</p>
<p>\begin{equation}
\lim_{K \to \infty} \sup\,\{ \E[\ind_{|X| \ge K} |X|] : X \in \mathcal{F} \} = 0.
\end{equation}</p>
<p>Noting that $\E[\ind_{\abs{X} \ge K} \abs{X}] = \E\abs{X - \ind_{\abs{X} < K} X}$, this condition can also be written as</p>
<p>\begin{equation}
\lim_{K \to \infty} \sup\,\{ \E\abs{X - \ind_{\abs{X} < K} X} : X \in \mathcal{F} \} = 0.
\end{equation}</p>
<p>In other words, if $\mathcal{F}$ is uniformly integrable, then you can choose a single value of $K > 0$ such that, uniformly over $X \in \mathcal{F}$, the random variable $\ind_{\abs{X} < K} X$ is a good approximation of $X$ in terms of the $L^1$-norm.
Crucially, every $\ind_{\abs{X} < K} X$ is a bounded random variable, which is often a desirable property.
Therefore, you could aptly describe a uniformly integrable family as one which allows a <em>uniform bounded approximation</em>.</p>
<p>But what about the name <em>uniform integrability</em>?
For a single variable $X$, it is true that
\begin{equation}
\E\abs{X} < \infty
\iff
\lim_{K \to \infty} \E[\ind_{|X| \ge K} |X|] = 0.
\end{equation}
Hence, you could call a family of random variables <em>uniformly</em> integrable if the limit on the RHS, which is equivalent to integrability, converges uniformly over the family.</p>
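<p>This equivalence is easy to see numerically. The following Monte Carlo sketch (our own illustration, not part of the original argument; the choice of an exponential variable is arbitrary) estimates the tail expectation $\E[\ind_{\abs{X} \ge K} \abs{X}]$ for growing $K$ and watches it vanish:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# X ~ Exponential(1) is integrable: E|X| = 1 < infinity.
x = rng.exponential(1.0, size=1_000_000)

# Monte Carlo estimates of E[1_{|X| >= K} |X|] for growing K.
tails = [np.mean(np.where(np.abs(x) >= K, np.abs(x), 0.0)) for K in [1, 2, 5, 10]]
print(tails)  # decreasing towards zero; analytically (K + 1) e^{-K} here
```

<p>For a <em>family</em> of random variables, uniform integrability asks that this decay happen at a single rate for all members simultaneously.</p>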
<p>The bounded approximation given by uniform integrability can be made a bit nicer.
Instead of bounding $X$ by applying the function $f_K(x) = \ind_{\abs{x} < K} x$, which exhibits a discontinuity at $\abs{x} = K$, uniform integrability allows us to bound $X$ by applying the nicer function $g_K(x) = \max(\min(x, K), -K)$, which is continuous everywhere:
\begin{equation}
\E\abs{g_K(X) - X}
= \E[\ind_{\abs{X} \ge K}\abs{\abs{X} - K}]
\le \E[\ind_{\abs{X} \ge K}\abs{X}] + \E[\ind_{\abs{X} \ge K} K]
\le 2 \E[\ind_{\abs{X} \ge K} \abs{X}],
\end{equation}
which uniformly converges to zero as $K \to \infty$.
Henceforth, for any random variable $X$, denote by $X^K = \max(\min(X, K), -K)$ the <em>truncation of $X$ at level $K$</em>.
Since $g_K$ is continuous, truncating in this way preserves limits.</p>
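<p>As a quick sanity check (our own sketch; the Student-$t$ distribution is just an arbitrary integrable example with heavy tails), the truncation $X^K$ is a one-liner with <code>np.clip</code>, and its $L^1$ error is indeed controlled by the tail expectation:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# A Student-t with 3 degrees of freedom: heavy tails, but still integrable.
x = rng.standard_t(df=3, size=1_000_000)

K = 5.0
x_K = np.clip(x, -K, K)  # the truncation X^K = max(min(X, K), -K)

l1_error = np.mean(np.abs(x - x_K))
tail = np.mean(np.where(np.abs(x) >= K, np.abs(x), 0.0))
print(l1_error, tail)  # l1_error <= 2 * tail, as in the derivation above
```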
<p>Finally, to check that a family of random variables is uniformly integrable, the following two facts are very useful:</p>
<ol>
<li>
<p>If $\sup\,\{ \E[\abs{X}^{p}] : X \in \mathcal{F}\} < \infty$ for some $p > 1$, then $\mathcal{F}$ is uniformly integrable.</p>
</li>
<li>
<p>For every $X \in L^1$, the family $\{ \E[X \cond \mathcal{G}] : \mathcal{G} \text{ is a sub-}\sigma\text{-algebra}\}$ is uniformly integrable.</p>
</li>
</ol>
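<p>A numerical illustration of fact 1 (the two families below are standard textbook examples, not from the text): $X_n = \sqrt{n}\,\mathrm{Bern}(1/n)$ is bounded in $L^2$ and hence uniformly integrable, whereas $X_n = n\,\mathrm{Bern}(1/n)$ is bounded in $L^1$ only and fails uniform integrability:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500_000
K = 10.0

def tail_expectation(samples, K):
    """Monte Carlo estimate of E[1_{|X| >= K} |X|]."""
    a = np.abs(samples)
    return np.mean(np.where(a >= K, a, 0.0))

# X_n = sqrt(n) * Bern(1/n): E[X_n^2] = 1 for all n, so this family is
# bounded in L^2 and hence uniformly integrable by fact 1.
ui_tails = [
    tail_expectation(np.sqrt(n) * (rng.random(n_samples) < 1 / n), K)
    for n in [100, 1_000, 10_000]
]

# X_n = n * Bern(1/n): E|X_n| = 1 for all n, but E[1_{|X_n| >= K} |X_n|] = 1
# whenever n >= K, so this family is not uniformly integrable.
non_ui_tails = [
    tail_expectation(n * (rng.random(n_samples) < 1 / n), K)
    for n in [100, 1_000, 10_000]
]

print(ui_tails)      # small, and shrinking in n
print(non_ui_tails)  # all approximately 1, no matter how large K is
```

<p>The second family shows that boundedness in $L^1$ alone is not enough: all the mass can escape to infinity.</p>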
<h2 id="a-necessary-and-sufficient-condition-for-l1-convergence">A Necessary and Sufficient Condition for $L^1$-Convergence</h2>
<p>A standard way to prove that a sequence of random variables $(X_n)_{n \ge 1}$ is $L^1$-convergent to some limit is to use <em>bounded convergence</em>, an instance of the dominated convergence theorem.
Recall that a sequence of random variables $(X_n)_{n \ge 1}$ is called <em>convergent in probability</em> if there exists a limit $X$ such that $\P(\abs{X - X_n} \ge \e) \to 0$ as $n \to \infty$ for every $\e > 0$.</p>
<p><strong>Theorem (bounded convergence).</strong>
If $(X_n)_{n \ge 1}$ and $X$ are bounded by some $K > 0$ and $X_n \to X$ in probability, then $X_n \to X$ in $L^1$.</p>
<p><strong>Proof.</strong>
Without loss of generality, assume that $X = 0$, so it remains to demonstrate that $\E\abs{X_n} \to 0$.
Let $\e > 0$.
Using the assumption that $\abs{X_n} \le K$, the idea is to consider the cases $\abs{X_n} \in [0, \e]$ and $\abs{X_n} \in (\e, K]$:</p>
<p>\begin{equation}
\E\abs{X_n}
= \E[\abs{X_n} \ind_{\abs{X_n} \in [0, \e]}] + \E[\abs{X_n} \ind_{\abs{X_n} \in (\e, K]}]
\le \e + K\, \E[\ind_{\abs{X_n} \in (\e, K]}]
\le \e + K\, \P(\abs{X_n} \ge \e).
\end{equation}</p>
<p>Using the assumption that $\P(\abs{X_n} \ge \e) \to 0$ as $n \to \infty$, it follows that $\limsup_{n \to \infty} \E\abs{X_n} \le \e$.
Since $\e > 0$ was arbitrary, this proves that $\lim_{n \to \infty} \E\abs{X_n}=0$. <span style="float:right">\(\blacksquare\)</span></p>
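<p>As a sanity check of the theorem (our own Monte Carlo sketch; the particular sequence is illustrative), take $X_n = K\,\mathrm{Bern}(1/n)$, which is bounded by $K$ and converges to $0$ in probability, so bounded convergence gives $\E\abs{X_n} = K/n \to 0$:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500_000
K = 7.0  # a common bound for the whole sequence

# X_n = K * Bern(1/n): bounded by K, and converges to 0 in probability,
# so bounded convergence gives E|X_n| = K / n -> 0.
l1_norms = [
    np.mean(np.abs(K * (rng.random(n_samples) < 1 / n)))
    for n in [10, 100, 1_000]
]
print(l1_norms)  # approximately [0.7, 0.07, 0.007]
```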
<p>Bounded convergence is an incredibly useful tool, but the assumption that $(X_n)_{n \ge 1}$ and $X$ are bounded can be too strong.
A looser assumption is that $(X_n)_{n \ge 1}$ and $X$ uniformly allow a <em>bounded approximation</em>, <em>i.e.</em> that $(X_n)_{n \ge 1}$ (and therefore the union of $(X_n)_{n \ge 1}$ and $X$) are <em>uniformly integrable</em>.
This looser condition turns out to not just be sufficient but also necessary.</p>
<p><strong>Theorem (Vitali’s convergence theorem).</strong>
Let $(X_n)_{n \ge 1}$ be a sequence of random variables and let $X$ be a random variable.
Then (a) $(X_n)_{n \ge 1} \sub L^1$, $X \in L^1$, and $X_n \to X$ in $L^1$ if and only if (b) $(X_n)_{n \ge 1} \sub L^1$ is uniformly integrable and $X_n \to X$ in probability.</p>
<p><strong>Proof.</strong>
We only show the hard direction, which is that (b) implies (a).
Assume that $(X_n)_{n \ge 1} \sub L^1$ is uniformly integrable and $X_n \to X$ in probability.
To begin with, it is true<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> that $X \in L^1$.
Since $X \in L^1$, the sequence $(X_n - X)_{n \ge 1}$ is then also uniformly integrable and $X_n - X \to 0$ in probability, so without loss of generality we may assume that $X = 0$.</p>
<p>Uniform integrability gives a uniform bounded approximation of the sequence:</p>
<p>\begin{equation} \label{eq:uniform-approx}
\lim_{K \to \infty} \sup_{n \ge 1}\, \E\abs{X_n - \ind_{\abs{X_n} < K} X_n} = 0.
\end{equation}</p>
<p>For every $K>0$, the sequence $(\ind_{\abs{X_n} < K} X_n)_{n \ge 1}$ is bounded and $\ind_{\abs{X_n} < K} X_n \to 0$ in probability, so $\ind_{\abs{X_n} < K} X_n \to 0$ in $L^1$ by bounded convergence.
The idea is to then take $K \to \infty$ to show that also $X_n \to 0$ in $L^1$.
To wit, by the triangle inequality,</p>
<p>\begin{equation}
\limsup_{n \to \infty} \E\abs{X_n}
\le \sup_{n \ge 1}\, \E\abs{X_n - \ind_{\abs{X_n} < K} X_n} + \limsup_{n \to \infty} \E\abs{\ind_{\abs{X_n} < K} X_n}
\overset{\text{(i)}}{=} \sup_{n \ge 1}\, \E\abs{X_n - \ind_{\abs{X_n} < K} X_n}
\end{equation}</p>
<p>where (i) follows from the fact that $\ind_{\abs{X_n} < K} X_n \to 0$ in $L^1$ by bounded convergence.
Taking $K \to \infty$ and using \eqref{eq:uniform-approx} then shows the result.
<span style="float:right">\(\blacksquare\)</span></p>
<h2 id="application-strengthening-of-convergence-in-distribution">Application: Strengthening of Convergence in Distribution</h2>
<p>A sequence of random variables $(X_n)_{n \ge 1}$ is called <em>weakly convergent</em> if there exists a limit $X$ such that, for every $f \colon \R \to \R$ continuous and bounded, it is true that $\E[f(X_n)] \to \E[f(X)]$.
A limitation of weak convergence is that it only handles <em>bounded</em> $f$;
for example, weak convergence does not imply that $\E[X_n] \to \E[X]$.
As we illustrate now, the assumption of uniform integrability can be used to strengthen the conclusion of weak convergence to include $\E[X_n] \to \E[X]$.</p>
<p>The key observation is as follows: if $(X_n)_{n \ge 1}$ and $X$ were bounded by some $K > 0$, then we can apply the truncation function $g_K$, which is a continuous and bounded function, to conclude that
\begin{equation}
\E[X_n] = \E[g_K(X_n)] \to \E[g_K(X)] = \E[X].
\end{equation}
Instead of assuming boundedness, now assume only that $(X_n)_{n \ge 1}$ is uniformly integrable.
For all $K > 0$, consider the uniform bounded approximations $(X^K_n)_{n \ge 1}$ and $X^K$.
Because the truncation operation is continuous, the sequence $(X_n^K)_{n \ge 1}$ is still weakly convergent to $X^K$.
Moreover, $(X_n^K)_{n \ge 1}$ and $X^K$ are bounded by $K > 0$.
The foregoing argument then shows that
$
\lim_{n \to \infty} \E[X_n^K] = \E[X^K].
$
Therefore,
\begin{equation}
\lim_{n \to \infty} \E[X_n]
= \lim_{n \to \infty} \lim_{K \to \infty} \E[X_n^K]
= \lim_{K \to \infty} \lim_{n \to \infty} \E[X_n^K]
= \lim_{K \to \infty} \E[X^K]
= \E[X],
\end{equation}
where the interchange of limits is allowed by uniformity of the bounded approximation.</p>
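<p>The argument can be seen at work numerically. In the following sketch (our own example, not from the text), $X \sim \Normal(1, 1)$ and $X_n = (1 + 1/n) X$: the sequence converges weakly to $X$, is bounded in $L^2$ and hence uniformly integrable, and indeed $\E[X_n] \to \E[X] = 1$, with the truncated means $\E[X_n^K]$ uniformly close to $\E[X_n]$:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(1.0, 1.0, size=1_000_000)  # samples of X ~ N(1, 1)

# X_n = (1 + 1/n) X converges weakly to X and is bounded in L^2, hence
# uniformly integrable, so E[X_n] -> E[X] = 1.
K = 10.0
means, trunc_means = [], []
for n in [1, 10, 100]:
    x_n = (1 + 1 / n) * z
    means.append(np.mean(x_n))
    trunc_means.append(np.mean(np.clip(x_n, -K, K)))  # E[X_n^K]

print(means)        # approximately [2.0, 1.1, 1.01]
print(trunc_means)  # nearly identical: the truncation barely bites here
```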
<h2 id="summary">Summary</h2>
<p>A family of random variables is called <em>uniformly integrable</em> if it allows a <em>uniform bounded approximation</em>.
Allowing a uniform bounded approximation turns out to be the right characterisation of $L^1$-convergence:
a sequence that converges in probability is $L^1$-convergent if and only if it is uniformly integrable.
Uniform integrability is a generally useful tool:
if you can prove a result for bounded random variables, then you might be able to prove the result for the greater class of uniformly integrable random variables by considering a uniform bounded approximation.</p>
<p>Thanks to <a href="https://sites.google.com/view/jirihron">Jiri Hron</a> for helpful comments on a draft of this post.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Since $X_n \to X$ in probability, $X_{n_k} \to X$ almost surely along some subsequence $(X_{n_k})_{k \ge 0}$.
Therefore, using Fatou’s lemma,</p>
<p>\begin{equation}
\E\abs{X}
= \E[\lim_{k \to \infty} \abs{X_{n_k}}]
\le \liminf_{k \to \infty} \E[\abs{X_{n_k}}]
< \infty,
\end{equation}</p>
<p>where the right hand side is bounded because any uniformly integrable family is uniformly bounded in $L^1$. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>

<h1>What Keeps a Bayesian Awake at Night (2021-04-07, https://wessel.ai/2021/04/07/what-keeps-a-bayesian-awake-at-night)</h1>

<p>The <a href="http://mlg.eng.cam.ac.uk/">Cambridge Machine Learning Group</a> is launching <a href="https://mlg.eng.cam.ac.uk/blog">a blog</a>, featuring a first <a href="https://mlg.eng.cam.ac.uk/blog/2021/03/31/what-keeps-a-bayesian-awake-at-night-part-1.html">two</a>-<a href="https://mlg.eng.cam.ac.uk/blog/2021/03/31/what-keeps-a-bayesian-awake-at-night-part-2.html">part</a> post about what keeps a Bayesian awake at night.
In the <a href="https://mlg.eng.cam.ac.uk/blog/2021/03/31/what-keeps-a-bayesian-awake-at-night-part-1.html">first part</a>, during the day, we lay out the standard arguments that many use to support Bayesian inference, ranging from more fundamental theorems, like Cox’s theorem, to unit tests, like Wald’s theorem.
In the <a href="https://mlg.eng.cam.ac.uk/blog/2021/03/31/what-keeps-a-bayesian-awake-at-night-part-2.html">second part</a>, at night, we take a closer look at these standard arguments and identify the weaknesses which cause Bayesians to lose sleep: the standard justifications have problems, modelling is hard and sensitive to innocuous details, and—worst of all—one typically must resort to approximate inference.
Check it out!</p>

<h1>Linear Models from a Gaussian Process Point of View with Stheno and JAX (2021-01-19, https://wessel.ai/2021/01/19/linear-models-with-stheno-and-jax)</h1>

<p>By Wessel Bruinsma, <a href="https://jamesr.info/">James Requeima</a>, and <a href="https://scholar.google.com/citations?user=EIpfkw4AAAAJ">Eric Perim Martins</a></p>
<p class="pretitle">Cross-posted on the <a href="https://invenia.github.io/blog/2021/01/19/linear-models-with-stheno-and-jax/">Invenia blog</a>.</p>
<h2 id="introduction">Introduction</h2>
<p>A linear model prescribes a linear relationship between inputs and outputs.
Linear models are amongst the simplest of models, but they are ubiquitous across science.
A linear model with Gaussian distributions on the coefficients forms one of the simplest instances of a <em><a href="https://en.wikipedia.org/wiki/Gaussian_process">Gaussian process</a></em>.
In this post, we will give a brief introduction to linear models from a Gaussian process point of view.
We will see how a linear model can be implemented with <em>Gaussian process probabilistic programming</em> using <a href="https://github.com/wesselb/stheno">Stheno</a>, and how this model can be used to denoise noisy observations.
(Disclosure: <a href="https://willtebbutt.github.io/">Will Tebbutt</a> and Wessel are the authors of Stheno;
Will maintains a <a href="https://github.com/willtebbutt/Stheno.jl">Julia version</a>.)
In short, <a href="https://en.wikipedia.org/wiki/Probabilistic_programming">probabilistic programming</a> is a programming paradigm that brings powerful probabilistic models to the comfort of your programming language, which often comes with tools to automatically perform inference (make predictions).
We will also use <a href="https://github.com/google/jax">JAX</a>’s just-in-time compiler to make our implementation extremely efficient.</p>
<h2 id="linear-models-from-a-gaussian-process-point-of-view">Linear Models from a Gaussian Process Point of View</h2>
<p>Consider a data set \((x_i, y_i)_{i=1}^n \subseteq \R \times \R\) consisting of \(n\) real-valued input–output pairs.
Suppose that we wish to estimate a linear relationship between the inputs and outputs:</p>
\[\label{eq:ax_b}
y_i = a \cdot x_i + b + \e_i,\]
<p>where \(a\) is an unknown slope, \(b\) is an unknown offset, and \(\e_i\) is some error/noise associated with the observation \(y_i\).
To implement this model with Gaussian process probabilistic programming, we need to cast the problem into a <em>functional form</em>.
This means that we will assume that there is some underlying, random function \(y \colon \R \to \R\) such that the observations are evaluations of this function: \(y_i = y(x_i)\).
The model for the random function \(y\) will embody the structure of the linear model \eqref{eq:ax_b}.
This may sound hard, but it is not difficult at all.
We let the random function \(y\) be of the following form:</p>
\[\label{eq:ax_b_functional}
y(x) = a(x) \cdot x + b(x) + \e(x)\]
<p>where \(a\colon \R \to \R\) is a <em>random constant function</em>.
An example of a <em>constant function</em> \(f\) is \(f(x) = 5\).
<em>Random</em> means that the value \(5\) is not fixed, but modelled with a random value drawn from some probability distribution, because we don’t know the true value.
We let \(b\colon \R \to \R\) also be a random <em>constant function</em>, and \(\e\colon \R \to \R\) a random <em>noise function</em>.
Do you see the similarities between \eqref{eq:ax_b} and \eqref{eq:ax_b_functional}?
If all that doesn’t fully make sense, don’t worry; things should become more clear as we implement the model.</p>
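<p>To build intuition before bringing in Stheno, here is a plain NumPy sketch (our own illustration, not code from the post) of one draw from the generative model \eqref{eq:ax_b_functional}, using the same prior scales as the Stheno program later in the post: $a \sim \Normal(0, 1)$, $b \sim \Normal(0, 10)$, and noise scaled by \(0.5\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)

# One draw from the model: a and b are random constants, and eps gives
# independent noise at every input.
a = rng.normal(0.0, 1.0)              # slope: a ~ N(0, 1)
b = rng.normal(0.0, np.sqrt(10.0))    # offset: b ~ N(0, 10)
eps = 0.5 * rng.normal(size=x.shape)  # noise: eps(x_i) ~ N(0, 0.25), i.i.d.

y = a * x + b + eps  # one sample of the random function y
print(y.shape)
```

<p>Stheno packages exactly this kind of construction, but additionally performs inference (conditioning on observations) automatically.</p>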
<p>To model random constant functions and random noise functions, we will use <a href="https://github.com/wesselb/stheno">Stheno</a>, which is a Python library for Gaussian process modelling.
We also have a <a href="https://github.com/willtebbutt/Stheno.jl">Julia version</a>, but in this post we’ll use the Python version.
To install Stheno, run the command</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">--upgrade</span> <span class="nt">--upgrade-strategy</span> eager stheno
</code></pre></div></div>
<p>In Stheno, a Gaussian process can be created with <code class="language-plaintext highlighter-rouge">GP(kernel)</code>, where <code class="language-plaintext highlighter-rouge">kernel</code> is the so-called <a href="https://en.wikipedia.org/wiki/Gaussian_process#Covariance_functions"><em>kernel</em> or <em>covariance function</em> of the Gaussian process</a>.
The kernel determines the properties of the function that the Gaussian process models.
For example, the kernel <code class="language-plaintext highlighter-rouge">EQ()</code> models smooth functions, and the kernel <code class="language-plaintext highlighter-rouge">Matern12()</code> models functions that look jagged.
See the <a href="https://www.cs.toronto.edu/~duvenaud/cookbook/">kernel cookbook</a> for an overview of commonly used kernels and the <a href="https://wesselb.github.io/stheno/docs/_build/html/readme.html#available-kernels">documentation of Stheno</a> for the corresponding classes.
For constant functions, you can set the kernel to simply a constant, for example <code class="language-plaintext highlighter-rouge">1</code>, which then models the constant function with a value drawn from \(\Normal(0, 1)\). (By default, in Stheno, all means are zero; but, if you like, <a href="https://wesselb.github.io/stheno/docs/_build/html/readme.html#available-means">you can also set a mean</a>.)</p>
<p>Let’s start out by creating a Gaussian process for the random constant function \(a(x)\) that models the slope.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">stheno</span> <span class="kn">import</span> <span class="n">GP</span>
<span class="o">>>></span> <span class="n">a</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">a</span>
<span class="n">GP</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>You can see how the Gaussian process looks by simply sampling from it.
To sample from the Gaussian process <code class="language-plaintext highlighter-rouge">a</code> at some inputs <code class="language-plaintext highlighter-rouge">x</code>, evaluate it at those inputs, <code class="language-plaintext highlighter-rouge">a(x)</code>, and call the method <code class="language-plaintext highlighter-rouge">sample</code>: <code class="language-plaintext highlighter-rouge">a(x).sample()</code>.
This shows that you can really think of a Gaussian process just like you think of a function:
pass it some inputs to get (the model for) the corresponding outputs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">a</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">sample</span><span class="p">(</span><span class="mi">20</span><span class="p">));</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="image-container">
<img src="/assets/images/posts/linear-models-constant-functions.png" alt="Samples of a Gaussian process that models a constant function" id="figure-constant-functions" style="width: 100%; max-width: 500px" />
<p class="caption">
Figure 1: Samples of a Gaussian process that models a constant function
</p>
</div>
<p>We’ve sampled a bunch of constant functions.
Sweet!
The next step in the model \eqref{eq:ax_b_functional} is to multiply the slope function \(a(x)\) by \(x\).
To multiply <code class="language-plaintext highlighter-rouge">a</code> by \(x\), we multiply <code class="language-plaintext highlighter-rouge">a</code> by the function <code class="language-plaintext highlighter-rouge">lambda x: x</code>, which also casts \(x\) as a function:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">f</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">f</span>
<span class="n">GP</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o"><</span><span class="k">lambda</span><span class="o">></span><span class="p">)</span>
</code></pre></div></div>
<p>This will give rise to functions like \(x \mapsto 0.1x\) and \(x \mapsto -0.4x\), depending on the value that \(a(x)\) takes.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">sample</span><span class="p">(</span><span class="mi">20</span><span class="p">));</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="image-container">
<img src="/assets/images/posts/linear-models-slope-functions.png" alt="Samples of a Gaussian process that models functions with a random slope" id="figure-slope-functions" style="width: 100%; max-width: 500px" />
<p class="caption">
Figure 2: Samples of a Gaussian process that models functions with a random slope
</p>
</div>
<p>This is starting to look good!
The only ingredient that is missing is an offset.
We model the offset just like the slope, but here we set the kernel to <code class="language-plaintext highlighter-rouge">10</code> instead of <code class="language-plaintext highlighter-rouge">1</code>, which models the offset with a value drawn from \(\Normal(0, 10)\).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">b</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">f</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span>
<span class="nb">AssertionError</span><span class="p">:</span> <span class="n">Processes</span> <span class="n">GP</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o"><</span><span class="k">lambda</span><span class="o">></span><span class="p">)</span> <span class="ow">and</span> <span class="n">GP</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span> <span class="o">*</span> <span class="mi">1</span><span class="p">)</span> <span class="n">are</span> <span class="n">associated</span> <span class="n">to</span> <span class="n">different</span> <span class="n">measures</span><span class="p">.</span>
</code></pre></div></div>
<p>Something went wrong.
Stheno has an abstraction called <em>measures</em>, where only <code class="language-plaintext highlighter-rouge">GP</code>s that are part of the same measure can be combined into new <code class="language-plaintext highlighter-rouge">GP</code>s;
the abstraction of measures is there to keep things safe and tidy.
What goes wrong here is that <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> are not part of the same measure.
Let’s explicitly create a new measure and attach <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> to it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">stheno</span> <span class="kn">import</span> <span class="n">Measure</span>
<span class="o">>>></span> <span class="n">prior</span> <span class="o">=</span> <span class="n">Measure</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">a</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">b</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">f</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span>
<span class="o">>>></span> <span class="n">f</span>
<span class="n">GP</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o"><</span><span class="k">lambda</span><span class="o">></span> <span class="o">+</span> <span class="mi">10</span> <span class="o">*</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>Let’s see what samples from <code class="language-plaintext highlighter-rouge">f</code> look like.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">sample</span><span class="p">(</span><span class="mi">20</span><span class="p">));</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="image-container">
<img src="/assets/images/posts/linear-models-linear-functions.png" alt="Samples of a Gaussian process that models linear functions" id="figure-linear-functions" style="width: 100%; max-width: 500px" />
<p class="caption">
Figure 3: Samples of a Gaussian process that models linear functions
</p>
</div>
<p>Perfect!
We will use <code class="language-plaintext highlighter-rouge">f</code> as our linear model.</p>
<p>In practice, observations are corrupted with noise.
We can add some noise to the lines in <a href="#figure-linear-functions">Figure 3</a> by adding a Gaussian process that models noise.
You can construct such a Gaussian process by using the kernel <code class="language-plaintext highlighter-rouge">Delta()</code>, which models the noise with independent \(\Normal(0, 1)\) variables.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">stheno</span> <span class="kn">import</span> <span class="n">Delta</span>
<span class="o">>>></span> <span class="n">noise</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="n">Delta</span><span class="p">(),</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">y</span> <span class="o">=</span> <span class="n">f</span> <span class="o">+</span> <span class="n">noise</span>
<span class="o">>>></span> <span class="n">y</span>
<span class="n">GP</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o"><</span><span class="k">lambda</span><span class="o">></span> <span class="o">+</span> <span class="mi">10</span> <span class="o">*</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">Delta</span><span class="p">())</span>
<span class="o">>>></span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">sample</span><span class="p">(</span><span class="mi">20</span><span class="p">));</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="image-container">
<img src="/assets/images/posts/linear-models-noisy-linear-functions.png" alt="Samples of a Gaussian process that models noisy linear functions" id="figure-noisy-linear-functions" style="width: 100%; max-width: 500px" />
<p class="caption">
Figure 4: Samples of a Gaussian process that models noisy linear functions
</p>
</div>
<p>That looks more realistic, but perhaps that’s a bit too much noise.
We can tune down the amount of noise, for example, by scaling <code class="language-plaintext highlighter-rouge">noise</code> by <code class="language-plaintext highlighter-rouge">0.5</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">y</span> <span class="o">=</span> <span class="n">f</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">noise</span>
<span class="o">>>></span> <span class="n">y</span>
<span class="n">GP</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o"><</span><span class="k">lambda</span><span class="o">></span> <span class="o">+</span> <span class="mi">10</span> <span class="o">*</span> <span class="mi">1</span> <span class="o">+</span> <span class="mf">0.25</span> <span class="o">*</span> <span class="n">Delta</span><span class="p">())</span>
<span class="o">>>></span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">sample</span><span class="p">(</span><span class="mi">20</span><span class="p">));</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="image-container">
<img src="/assets/images/posts/linear-models-noisy-linear-functions-2.png" alt="Samples of a Gaussian process that models noisy linear functions" id="figure-noisy-linear-functions-2" style="width: 100%; max-width: 500px" />
<p class="caption">
Figure 5: Samples of a Gaussian process that models noisy linear functions
</p>
</div>
<p>Much better.</p>
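<p>Note the <code class="language-plaintext highlighter-rouge">0.25 * Delta()</code> term in the kernel printed above: scaling <code class="language-plaintext highlighter-rouge">noise</code> by a factor of \(c\) scales its variance by \(c^2\). A quick NumPy check of this fact (illustrative only, not part of Stheno):</p>

```python
import numpy as np

# Scaling a random variable by c scales its variance by c ** 2. This is why
# scaling `noise` by 0.5 turns the Delta() term into 0.25 * Delta() in the
# kernel printed above.
rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)  # unit-variance noise samples

print(np.var(0.5 * z))  # close to 0.25
```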
<p>To summarise, our linear model is given by</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prior</span> <span class="o">=</span> <span class="n">Measure</span><span class="p">()</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for slope
</span><span class="n">b</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for offset
</span><span class="n">f</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span> <span class="c1"># Noiseless linear model
</span>
<span class="n">noise</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="n">Delta</span><span class="p">(),</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for noise
</span><span class="n">y</span> <span class="o">=</span> <span class="n">f</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">noise</span> <span class="c1"># Noisy linear model
</span></code></pre></div></div>
<p>We call a program like this a <em>Gaussian process probabilistic program</em> (GPPP).
Let’s generate some noisy synthetic data, <code class="language-plaintext highlighter-rouge">(x_obs, y_obs)</code>, that will make up an example data set \((x_i, y_i)_{i=1}^n\).
We also save the observations without noise added — <code class="language-plaintext highlighter-rouge">f_obs</code> — so we can later check how good our predictions really are.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">x_obs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">50_000</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">f_obs</span> <span class="o">=</span> <span class="mf">0.8</span> <span class="o">*</span> <span class="n">x_obs</span> <span class="o">-</span> <span class="mf">2.5</span>
<span class="o">>>></span> <span class="n">y_obs</span> <span class="o">=</span> <span class="n">f_obs</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">50_000</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="image-container">
<img src="/assets/images/posts/linear-models-observations.png" alt="Some observations" id="figure-observations" style="width: 100%; max-width: 500px" />
<p class="caption">
Figure 6: Some observations
</p>
</div>
<p>We will see next how we can fit our model to this data.</p>
<h2 id="inference-in-linear-models">Inference in Linear Models</h2>
<p>Suppose that we wish to remove the noise from the observations in <a href="#figure-observations">Figure 6</a>.
We carefully phrase this problem in terms of our GPPP:
the observations <code class="language-plaintext highlighter-rouge">y_obs</code> are realisations of the <em>noisy</em> linear model <code class="language-plaintext highlighter-rouge">y</code> at <code class="language-plaintext highlighter-rouge">x_obs</code> — realisations of <code class="language-plaintext highlighter-rouge">y(x_obs)</code> — and we wish to make predictions for the <em>noiseless</em> linear model <code class="language-plaintext highlighter-rouge">f</code> at <code class="language-plaintext highlighter-rouge">x_obs</code> — predictions for <code class="language-plaintext highlighter-rouge">f(x_obs)</code>.</p>
<p>In Stheno, we can make predictions based on observations by <em>conditioning</em> the measure of the model on the observations.
In our GPPP, the measure is given by <code class="language-plaintext highlighter-rouge">prior</code>, so we aim to condition <code class="language-plaintext highlighter-rouge">prior</code> on the observations <code class="language-plaintext highlighter-rouge">y_obs</code> for <code class="language-plaintext highlighter-rouge">y(x_obs)</code>.
Mathematically, this process of incorporating information by conditioning happens through <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes’ rule</a>.
Programmatically, we first make an <code class="language-plaintext highlighter-rouge">Observations</code> object, which represents the information — the observations — that we want to incorporate, and then condition <code class="language-plaintext highlighter-rouge">prior</code> on this object:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">stheno</span> <span class="kn">import</span> <span class="n">Observations</span>
<span class="o">>>></span> <span class="n">obs</span> <span class="o">=</span> <span class="n">Observations</span><span class="p">(</span><span class="n">y</span><span class="p">(</span><span class="n">x_obs</span><span class="p">),</span> <span class="n">y_obs</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">post</span> <span class="o">=</span> <span class="n">prior</span><span class="p">.</span><span class="n">condition</span><span class="p">(</span><span class="n">obs</span><span class="p">)</span>
</code></pre></div></div>
<p>You can also more concisely perform these two steps at once, as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">post</span> <span class="o">=</span> <span class="n">prior</span> <span class="o">|</span> <span class="p">(</span><span class="n">y</span><span class="p">(</span><span class="n">x_obs</span><span class="p">),</span> <span class="n">y_obs</span><span class="p">)</span>
</code></pre></div></div>
<p>This mimics the mathematical notation used for conditioning.</p>
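<p>Mathematically, the conditioning performed here is ordinary Gaussian conditioning. The sketch below redoes it with dense NumPy linear algebra for the model of this post, whose noiseless kernel is \(k(x, x') = x x' + 10\) and whose observations carry <code class="language-plaintext highlighter-rouge">0.25 * Delta()</code> of noise; this is an illustrative reimplementation, not Stheno’s internals:</p>

```python
import numpy as np

# Standard Gaussian conditioning: for zero-mean jointly Gaussian f and y,
#   mean_post = K_fy @ inv(K_yy) @ y_obs,
#   var_post  = K_ff - K_fy @ inv(K_yy) @ K_yf.
# The kernels below mirror the model of this post: k(x, x') = x * x' + 10
# for the noiseless model f, plus 0.25 * I of noise for y, so K_fy = K_ff.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)

K_ff = np.outer(x, x) + 10.0          # covariance of f(x)
K_yy = K_ff + 0.25 * np.eye(len(x))   # covariance of y(x) = f(x) + noise
y_obs = 0.8 * x - 2.5 + 0.5 * rng.standard_normal(len(x))

mean_post = K_ff @ np.linalg.solve(K_yy, y_obs)
var_post = K_ff - K_ff @ np.linalg.solve(K_yy, K_ff)

# The posterior mean should lie close to the noiseless line 0.8 * x - 2.5.
print(np.max(np.abs(mean_post - (0.8 * x - 2.5))))
```

This dense approach costs \(\mathcal{O}(n^3)\), which is why Stheno’s structured representations matter at 50k observations.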
<p>With our updated measure <code class="language-plaintext highlighter-rouge">post</code>, which is often called the <em>posterior</em> measure, we can make a prediction for <code class="language-plaintext highlighter-rouge">f(x_obs)</code> by passing <code class="language-plaintext highlighter-rouge">f(x_obs)</code> to <code class="language-plaintext highlighter-rouge">post</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">pred</span> <span class="o">=</span> <span class="n">post</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">x_obs</span><span class="p">))</span>
<span class="o">>>></span> <span class="n">pred</span><span class="p">.</span><span class="n">mean</span>
<span class="o"><</span><span class="n">dense</span> <span class="n">matrix</span><span class="p">:</span> <span class="n">shape</span><span class="o">=</span><span class="mi">50000</span><span class="n">x1</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span>
<span class="n">mat</span><span class="o">=</span><span class="p">[[</span><span class="o">-</span><span class="mf">2.498</span><span class="p">]</span>
<span class="p">[</span><span class="o">-</span><span class="mf">2.498</span><span class="p">]</span>
<span class="p">[</span><span class="o">-</span><span class="mf">2.498</span><span class="p">]</span>
<span class="p">...</span>
<span class="p">[</span> <span class="mf">5.501</span><span class="p">]</span>
<span class="p">[</span> <span class="mf">5.502</span><span class="p">]</span>
<span class="p">[</span> <span class="mf">5.502</span><span class="p">]]</span><span class="o">></span>
<span class="o">>>></span> <span class="n">pred</span><span class="p">.</span><span class="n">var</span>
<span class="o"><</span><span class="n">low</span><span class="o">-</span><span class="n">rank</span> <span class="n">matrix</span><span class="p">:</span> <span class="n">shape</span><span class="o">=</span><span class="mi">50000</span><span class="n">x50000</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span><span class="p">,</span> <span class="n">rank</span><span class="o">=</span><span class="mi">2</span>
<span class="n">left</span><span class="o">=</span><span class="p">[[</span><span class="mf">1.e+00</span> <span class="mf">0.e+00</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">2.e-04</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">4.e-04</span><span class="p">]</span>
<span class="p">...</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]]</span>
<span class="n">middle</span><span class="o">=</span><span class="p">[[</span> <span class="mf">2.001e-05</span> <span class="o">-</span><span class="mf">2.995e-06</span><span class="p">]</span>
<span class="p">[</span><span class="o">-</span><span class="mf">2.997e-06</span> <span class="mf">6.011e-07</span><span class="p">]]</span>
<span class="n">right</span><span class="o">=</span><span class="p">[[</span><span class="mf">1.e+00</span> <span class="mf">0.e+00</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">2.e-04</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">4.e-04</span><span class="p">]</span>
<span class="p">...</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]]</span><span class="o">></span>
</code></pre></div></div>
<p>The prediction <code class="language-plaintext highlighter-rouge">pred</code> is a <a href="https://en.wikipedia.org/wiki/Multivariate_Gaussian_distribution">multivariate Gaussian distribution</a> with a particular mean and variance, which are displayed above.
You should view <code class="language-plaintext highlighter-rouge">post</code> as a function that assigns a probability distribution — the prediction — to every part of our GPPP, like <code class="language-plaintext highlighter-rouge">f(x_obs)</code>.
Note that the variance of the prediction is a <em>massive</em> matrix of size 50k \(\times\) 50k.
Under the hood, Stheno uses <a href="https://github.com/wesselb/matrix">structured representations for matrices</a> to compute and store matrices in an efficient way.</p>
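<p>The savings of such structured representations are easy to quantify: a dense 50k \(\times\) 50k matrix of 64-bit floats occupies 20 GB, while the rank-2 representation above only needs to store its factors:</p>

```python
# Memory needed to store the 50,000 x 50,000 posterior variance in float64,
# dense versus the rank-2 factored form shown above (left, middle, right).
n, r = 50_000, 2
bytes_per_float = 8  # float64

dense = n * n * bytes_per_float                     # full n x n matrix
factored = (2 * n * r + r * r) * bytes_per_float    # left + right factors and middle

print(dense / 1e9)     # 20.0 (GB)
print(factored / 1e6)  # 1.600032 (MB)
```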
<p>Let’s see what the prediction <code class="language-plaintext highlighter-rouge">pred</code> for <code class="language-plaintext highlighter-rouge">f(x_obs)</code> looks like.
The prediction <code class="language-plaintext highlighter-rouge">pred</code> exposes the method <code class="language-plaintext highlighter-rouge">marginal_credible_bounds()</code> that conveniently computes the mean and associated lower and upper error bounds for you.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">mean</span><span class="p">,</span> <span class="n">error_lower</span><span class="p">,</span> <span class="n">error_upper</span> <span class="o">=</span> <span class="n">pred</span><span class="p">.</span><span class="n">marginal_credible_bounds</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">mean</span>
<span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mf">2.49818708</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.49802708</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.49786708</span><span class="p">,</span> <span class="p">...,</span> <span class="mf">5.50148996</span><span class="p">,</span>
<span class="mf">5.50164996</span><span class="p">,</span> <span class="mf">5.50180997</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">error_upper</span> <span class="o">-</span> <span class="n">error_lower</span>
<span class="n">array</span><span class="p">([</span><span class="mf">0.01753381</span><span class="p">,</span> <span class="mf">0.01753329</span><span class="p">,</span> <span class="mf">0.01753276</span><span class="p">,</span> <span class="p">...,</span> <span class="mf">0.01761883</span><span class="p">,</span> <span class="mf">0.01761935</span><span class="p">,</span>
<span class="mf">0.01761988</span><span class="p">])</span>
</code></pre></div></div>
<p>The error is very small — on the order of \(10^{-2}\) — which means that Stheno predicted <code class="language-plaintext highlighter-rouge">f(x_obs)</code> with high confidence.</p>
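<p>The widths of these bounds relate directly to the marginal posterior standard deviations. Assuming the bounds form a central 95% credible interval (an assumption about <code class="language-plaintext highlighter-rouge">marginal_credible_bounds()</code>, not something stated above), each width equals \(2 \cdot 1.96 \cdot \sigma\), so we can back out \(\sigma\):</p>

```python
# Back out the marginal posterior standard deviation from the bound width,
# assuming a central 95% credible interval (width = 2 * 1.96 * sigma).
# The 0.0175 below is a typical width from the output above.
width = 0.0175
sigma = width / (2 * 1.96)

print(round(sigma, 4))  # 0.0045
```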
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">mean</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="image-container">
<img src="/assets/images/posts/linear-models-denoised-observations.png" alt="Mean of the prediction (blue line) for the denoised observations" id="figure-denoised-observations" style="width: 100%; max-width: 500px" />
<p class="caption">
Figure 7: Mean of the prediction (blue line) for the denoised observations
</p>
</div>
<p>The blue line in <a href="#figure-denoised-observations">Figure 7</a> shows the mean of the predictions.
This line appears to nicely pass through the observations with the noise removed.
But let’s see how good the predictions really are by comparing to <code class="language-plaintext highlighter-rouge">f_obs</code>, which we previously saved.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">f_obs</span> <span class="o">-</span> <span class="n">mean</span>
<span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mf">0.00181292</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.00181292</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.00181292</span><span class="p">,</span> <span class="p">...,</span> <span class="o">-</span><span class="mf">0.00180997</span><span class="p">,</span>
<span class="o">-</span><span class="mf">0.00180997</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.00180997</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">((</span><span class="n">f_obs</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span> <span class="c1"># Compute the mean square error.
</span><span class="mf">3.281323087544209e-06</span>
</code></pre></div></div>
<p>That’s pretty close!
Not bad at all.</p>
<p>We wrap up this section by encapsulating everything that we’ve done so far in a function <code class="language-plaintext highlighter-rouge">linear_model_denoise</code>, which denoises noisy observations from a linear model:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">linear_model_denoise</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">):</span>
<span class="n">prior</span> <span class="o">=</span> <span class="n">Measure</span><span class="p">()</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for slope
</span> <span class="n">b</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for offset
</span> <span class="n">f</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span> <span class="c1"># Noiseless linear model
</span> <span class="n">noise</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="n">Delta</span><span class="p">(),</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for noise
</span> <span class="n">y</span> <span class="o">=</span> <span class="n">f</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">noise</span> <span class="c1"># Noisy linear model
</span>
<span class="n">post</span> <span class="o">=</span> <span class="n">prior</span> <span class="o">|</span> <span class="p">(</span><span class="n">y</span><span class="p">(</span><span class="n">x_obs</span><span class="p">),</span> <span class="n">y_obs</span><span class="p">)</span> <span class="c1"># Condition on observations.
</span> <span class="n">pred</span> <span class="o">=</span> <span class="n">post</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">x_obs</span><span class="p">))</span> <span class="c1"># Make predictions.
</span> <span class="k">return</span> <span class="n">pred</span><span class="p">.</span><span class="n">marginal_credible_bounds</span><span class="p">()</span> <span class="c1"># Return the mean and associated error bounds.
</span></code></pre></div></div>
<p></p>
<p><!-- Prevent tabs. --></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">linear_model_denoise</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">)</span>
<span class="p">(</span><span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mf">2.49818708</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.49802708</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.49786708</span><span class="p">,</span> <span class="p">...,</span> <span class="mf">5.50148996</span><span class="p">,</span>
<span class="mf">5.50164996</span><span class="p">,</span> <span class="mf">5.50180997</span><span class="p">]),</span> <span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mf">2.50695399</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.50679372</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.50663346</span><span class="p">,</span> <span class="p">...,</span> <span class="mf">5.49268055</span><span class="p">,</span>
<span class="mf">5.49284029</span><span class="p">,</span> <span class="mf">5.49300003</span><span class="p">]),</span> <span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mf">2.48942018</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.48926044</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.4891007</span> <span class="p">,</span> <span class="p">...,</span> <span class="mf">5.51029937</span><span class="p">,</span>
<span class="mf">5.51045964</span><span class="p">,</span> <span class="mf">5.51061991</span><span class="p">]))</span>
<span class="o">>>></span> <span class="o">%</span><span class="n">timeit</span> <span class="n">linear_model_denoise</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">)</span>
<span class="mi">233</span> <span class="n">ms</span> <span class="err">±</span> <span class="mf">12.6</span> <span class="n">ms</span> <span class="n">per</span> <span class="n">loop</span> <span class="p">(</span><span class="n">mean</span> <span class="err">±</span> <span class="n">std</span><span class="p">.</span> <span class="n">dev</span><span class="p">.</span> <span class="n">of</span> <span class="mi">7</span> <span class="n">runs</span><span class="p">,</span> <span class="mi">1</span> <span class="n">loop</span> <span class="n">each</span><span class="p">)</span>
</code></pre></div></div>
<p>To denoise 50k observations, <code class="language-plaintext highlighter-rouge">linear_model_denoise</code> takes about 250 ms.
Not terrible, but we can do much better, which is important if we want to scale to larger numbers of observations.
In the next section, we will make this function really fast.</p>
<h2 id="making-inference-fast">Making Inference Fast</h2>
<p>To make <code class="language-plaintext highlighter-rouge">linear_model_denoise</code> fast, firstly, the linear algebra that happens under the hood when <code class="language-plaintext highlighter-rouge">linear_model_denoise</code> is called should be simplified as much as possible.
Fortunately, this happens automatically, due to <a href="https://github.com/wesselb/matrix">the structured representation of matrices</a> that Stheno uses.
For example, when making predictions with Gaussian processes, the main computational bottleneck is usually the construction and inversion of <code class="language-plaintext highlighter-rouge">y(x_obs).var</code>, the variance associated with the observations:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">y</span><span class="p">(</span><span class="n">x_obs</span><span class="p">).</span><span class="n">var</span>
<span class="o"><</span><span class="n">Woodbury</span> <span class="n">matrix</span><span class="p">:</span> <span class="n">shape</span><span class="o">=</span><span class="mi">50000</span><span class="n">x50000</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span>
<span class="n">diag</span><span class="o">=<</span><span class="n">diagonal</span> <span class="n">matrix</span><span class="p">:</span> <span class="n">shape</span><span class="o">=</span><span class="mi">50000</span><span class="n">x50000</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span>
<span class="n">diag</span><span class="o">=</span><span class="p">[</span><span class="mf">0.25</span> <span class="mf">0.25</span> <span class="mf">0.25</span> <span class="p">...</span> <span class="mf">0.25</span> <span class="mf">0.25</span> <span class="mf">0.25</span><span class="p">]</span><span class="o">></span>
<span class="n">lr</span><span class="o">=<</span><span class="n">low</span><span class="o">-</span><span class="n">rank</span> <span class="n">matrix</span><span class="p">:</span> <span class="n">shape</span><span class="o">=</span><span class="mi">50000</span><span class="n">x50000</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span><span class="p">,</span> <span class="n">rank</span><span class="o">=</span><span class="mi">2</span>
<span class="n">left</span><span class="o">=</span><span class="p">[[</span><span class="mf">1.e+00</span> <span class="mf">0.e+00</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">2.e-04</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">4.e-04</span><span class="p">]</span>
<span class="p">...</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]]</span>
<span class="n">middle</span><span class="o">=</span><span class="p">[[</span><span class="mf">10.</span> <span class="mf">0.</span><span class="p">]</span>
<span class="p">[</span> <span class="mf">0.</span> <span class="mf">1.</span><span class="p">]]</span>
<span class="n">right</span><span class="o">=</span><span class="p">[[</span><span class="mf">1.e+00</span> <span class="mf">0.e+00</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">2.e-04</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">4.e-04</span><span class="p">]</span>
<span class="p">...</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]]</span><span class="o">>></span>
</code></pre></div></div>
<p>Indeed, observe that this matrix has a particular structure:
it is a sum of a diagonal and a low-rank matrix.
In Stheno, the sum of a diagonal and a low-rank matrix is called a <em>Woodbury</em> matrix, because the <a href="https://en.wikipedia.org/wiki/Woodbury_matrix_identity">Sherman–Morrison–Woodbury formula</a> can be used to efficiently invert it.
Let’s see how long it takes to construct <code class="language-plaintext highlighter-rouge">y(x_obs).var</code> and then invert it.
We invert <code class="language-plaintext highlighter-rouge">y(x_obs).var</code> using <a href="https://github.com/wesselb/lab">LAB</a>, which is automatically installed alongside Stheno and exposes the API to efficiently work with structured matrices.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">lab</span> <span class="k">as</span> <span class="n">B</span>
<span class="o">>>></span> <span class="o">%</span><span class="n">timeit</span> <span class="n">B</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">y</span><span class="p">(</span><span class="n">x_obs</span><span class="p">).</span><span class="n">var</span><span class="p">)</span>
<span class="mf">28.5</span> <span class="n">ms</span> <span class="err">±</span> <span class="mf">1.69</span> <span class="n">ms</span> <span class="n">per</span> <span class="n">loop</span> <span class="p">(</span><span class="n">mean</span> <span class="err">±</span> <span class="n">std</span><span class="p">.</span> <span class="n">dev</span><span class="p">.</span> <span class="n">of</span> <span class="mi">7</span> <span class="n">runs</span><span class="p">,</span> <span class="mi">10</span> <span class="n">loops</span> <span class="n">each</span><span class="p">)</span>
</code></pre></div></div>
<p>That’s only 30 ms! Not bad for such a big matrix. Without exploiting structure, a 50k \(\times\) 50k matrix takes 20 GB of memory to store and about an hour to invert.</p>
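<p>The speed comes from never forming the dense matrix. A minimal NumPy sketch of the Sherman–Morrison–Woodbury identity (illustrative; not the implementation Stheno uses):</p>

```python
import numpy as np

# Sherman–Morrison–Woodbury:
#   (D + U M U^T)^{-1} = D^{-1} - D^{-1} U (M^{-1} + U^T D^{-1} U)^{-1} U^T D^{-1}.
# The left-hand side inverts an n x n matrix (O(n^3) cost); the right-hand
# side only inverts r x r matrices (here r = 2), costing O(n r^2).
n = 1_000
d = np.full(n, 0.25)                                       # diagonal part
U = np.stack([np.ones(n), np.linspace(0, 10, n)], axis=1)  # low-rank factors
M = np.diag([10.0, 1.0])                                   # middle matrix

D_inv_U = U / d[:, None]                                   # D^{-1} U
small = np.linalg.inv(M) + U.T @ D_inv_U                   # only r x r
woodbury_inv = np.diag(1 / d) - D_inv_U @ np.linalg.solve(small, D_inv_U.T)

dense_inv = np.linalg.inv(np.diag(d) + U @ M @ U.T)        # naive dense check
print(np.allclose(woodbury_inv, dense_inv))  # True
```

The diagonal, factors, and middle matrix above mirror the structure of <code class="language-plaintext highlighter-rouge">y(x_obs).var</code> printed earlier, just at a smaller size.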
<p>Secondly, we would like the code implemented by <code class="language-plaintext highlighter-rouge">linear_model_denoise</code> to be as efficient as possible.
To achieve this, we will use <a href="https://github.com/google/jax">JAX</a> to compile <code class="language-plaintext highlighter-rouge">linear_model_denoise</code> with <a href="https://www.tensorflow.org/xla">XLA</a>, which generates blazingly fast code.
We start out by importing JAX and loading the JAX extension of Stheno.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">jax.numpy</span> <span class="k">as</span> <span class="n">jnp</span>
<span class="o">>>></span> <span class="kn">import</span> <span class="nn">stheno.jax</span> <span class="c1"># JAX extension for Stheno
</span></code></pre></div></div>
<p>We use JAX’s just-in-time (JIT) compiler to compile <code class="language-plaintext highlighter-rouge">linear_model_denoise</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">lab</span> <span class="k">as</span> <span class="n">B</span>
<span class="o">>>></span> <span class="n">linear_model_denoise_jitted</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">jit</span><span class="p">(</span><span class="n">linear_model_denoise</span><span class="p">)</span>
</code></pre></div></div>
<p>Let’s see what happens when we run <code class="language-plaintext highlighter-rouge">linear_model_denoise_jitted</code>.
We must pass <code class="language-plaintext highlighter-rouge">x_obs</code> and <code class="language-plaintext highlighter-rouge">y_obs</code> as JAX arrays to use the compiled version.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">linear_model_denoise_jitted</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">x_obs</span><span class="p">),</span> <span class="n">jnp</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">y_obs</span><span class="p">))</span>
<span class="p">(</span><span class="n">DeviceArray</span><span class="p">([</span><span class="o">-</span><span class="mf">2.4981871</span> <span class="p">,</span> <span class="o">-</span><span class="mf">2.4980271</span> <span class="p">,</span> <span class="o">-</span><span class="mf">2.49786709</span><span class="p">,</span> <span class="p">...,</span> <span class="mf">5.50149004</span><span class="p">,</span>
<span class="mf">5.50165005</span><span class="p">,</span> <span class="mf">5.50181005</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span><span class="p">),</span> <span class="n">DeviceArray</span><span class="p">([</span><span class="o">-</span><span class="mf">2.5069514</span> <span class="p">,</span> <span class="o">-</span><span class="mf">2.50679114</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.50663087</span><span class="p">,</span> <span class="p">...,</span> <span class="mf">5.4927699</span> <span class="p">,</span>
<span class="mf">5.49292964</span><span class="p">,</span> <span class="mf">5.49308938</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span><span class="p">),</span> <span class="n">DeviceArray</span><span class="p">([</span><span class="o">-</span><span class="mf">2.4894228</span> <span class="p">,</span> <span class="o">-</span><span class="mf">2.48926306</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.48910332</span><span class="p">,</span> <span class="p">...,</span> <span class="mf">5.51021019</span><span class="p">,</span>
<span class="mf">5.51037046</span><span class="p">,</span> <span class="mf">5.51053072</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span><span class="p">))</span>
</code></pre></div></div>
<p>Nice!
Let’s see how much faster <code class="language-plaintext highlighter-rouge">linear_model_denoise_jitted</code> is:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="o">%</span><span class="n">timeit</span> <span class="n">linear_model_denoise</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">)</span>
<span class="mi">233</span> <span class="n">ms</span> <span class="err">±</span> <span class="mf">12.6</span> <span class="n">ms</span> <span class="n">per</span> <span class="n">loop</span> <span class="p">(</span><span class="n">mean</span> <span class="err">±</span> <span class="n">std</span><span class="p">.</span> <span class="n">dev</span><span class="p">.</span> <span class="n">of</span> <span class="mi">7</span> <span class="n">runs</span><span class="p">,</span> <span class="mi">1</span> <span class="n">loop</span> <span class="n">each</span><span class="p">)</span>
<span class="o">>>></span> <span class="o">%</span><span class="n">timeit</span> <span class="n">linear_model_denoise_jitted</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">x_obs</span><span class="p">),</span> <span class="n">jnp</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">y_obs</span><span class="p">))</span>
<span class="mf">1.63</span> <span class="n">ms</span> <span class="err">±</span> <span class="mf">16.5</span> <span class="n">µs</span> <span class="n">per</span> <span class="n">loop</span> <span class="p">(</span><span class="n">mean</span> <span class="err">±</span> <span class="n">std</span><span class="p">.</span> <span class="n">dev</span><span class="p">.</span> <span class="n">of</span> <span class="mi">7</span> <span class="n">runs</span><span class="p">,</span> <span class="mi">1000</span> <span class="n">loops</span> <span class="n">each</span><span class="p">)</span>
</code></pre></div></div>
<p>The compiled function <code class="language-plaintext highlighter-rouge">linear_model_denoise_jitted</code> takes only about 2 ms to denoise 50k observations!
Compared to <code class="language-plaintext highlighter-rouge">linear_model_denoise</code>, that’s a speed-up of two orders of magnitude.</p>
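<p>The same pattern applies to any pure function of arrays. Here is a minimal, self-contained sketch using a hypothetical toy function instead of <code class="language-plaintext highlighter-rouge">linear_model_denoise</code> (which requires the Stheno model constructed earlier in the post):</p>

```python
# A minimal sketch of the JIT pattern with a hypothetical toy function.
# This is not the post's model; it only illustrates jax.jit itself.
import jax
import jax.numpy as jnp

def toy_model(x):
    # Stand-in for a model: an elementwise computation followed by a reduction.
    return jnp.sum(jnp.sin(x) ** 2 + jnp.cos(x) ** 2)

# Compile the function with XLA via JAX's JIT compiler.
toy_model_jitted = jax.jit(toy_model)

x = jnp.linspace(0.0, 1.0, 1000)
result = float(toy_model_jitted(x))  # The first call triggers compilation.
```

<p>Subsequent calls reuse the compiled XLA executable, which is where the speed-up in the timings above comes from; the first call pays the compilation cost.</p>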
<h2 id="conclusion">Conclusion</h2>
<p>We’ve seen how a linear model can be implemented with a Gaussian process probabilistic program (GPPP) using <a href="https://github.com/wesselb/stheno">Stheno</a>.
Stheno allows us to focus on model construction, and takes away the distraction of the technicalities that come with making predictions.
This flexibility, however, comes at the cost of some complicated machinery that happens in the background, such as structured representations of matrices.
Fortunately, we’ve seen that this overhead can be completely avoided by compiling your program using <a href="https://github.com/google/jax">JAX</a>, which can result in extremely efficient implementations.
To close this post, and to warm you up for <a href="https://github.com/wesselb/stheno#examples">what’s further possible with Gaussian process probabilistic programming using Stheno</a>, note that the linear model we’ve built can easily be extended to include, for example, a <em>quadratic</em> term:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">quadratic_model_denoise</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">):</span>
<span class="n">prior</span> <span class="o">=</span> <span class="n">Measure</span><span class="p">()</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for slope
</span> <span class="n">b</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for coefficient of quadratic term
</span> <span class="n">c</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for offset
</span> <span class="c1"># Noiseless quadratic model
</span> <span class="n">f</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span> <span class="o">*</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="n">c</span>
<span class="n">noise</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="n">Delta</span><span class="p">(),</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for noise
</span> <span class="n">y</span> <span class="o">=</span> <span class="n">f</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">noise</span> <span class="c1"># Noisy quadratic model
</span>
<span class="n">post</span> <span class="o">=</span> <span class="n">prior</span> <span class="o">|</span> <span class="p">(</span><span class="n">y</span><span class="p">(</span><span class="n">x_obs</span><span class="p">),</span> <span class="n">y_obs</span><span class="p">)</span> <span class="c1"># Condition on observations.
</span> <span class="n">pred</span> <span class="o">=</span> <span class="n">post</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">x_obs</span><span class="p">))</span> <span class="c1"># Make predictions.
</span> <span class="k">return</span> <span class="n">pred</span><span class="p">.</span><span class="n">marginal_credible_bounds</span><span class="p">()</span> <span class="c1"># Return the mean and associated error bounds.
</span></code></pre></div></div>
<p>To use Gaussian process probabilistic programming for your specific problem, the main challenge is to figure out which model you need to use.
Do you need a quadratic term?
Maybe you need an exponential term!
But, using Stheno, implementing the model and making predictions should then be simple.</p>By Wessel Bruinsma, James Requeima, and Eric Perim MartinsJulia Learning Circle: Generated Functions2020-12-13T00:00:00+00:002020-12-13T00:00:00+00:00https://wessel.ai/2020/12/13/julia-learning-circle-meeting-3<p>A normal function returns the result of its computation.
In contrast, a <a href="https://docs.julialang.org/en/v1.6-dev/manual/metaprogramming/#Generated-functions">generated function</a> outputs <em>the code that implements the function</em>.
While generating this code, the generated function can only make use of the <em>types</em> of the arguments, not their <em>values</em>.
In a sense, generated functions offer <a href="https://discourse.julialang.org/t/understanding-generated-functions/10092/4">“on-demand code generation”</a>.
This mechanism is quite powerful and can be used when normal functions in combination with multiple dispatch cannot give you what you need.</p>
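<p>As a minimal illustration before the matrix example, consider the following hypothetical generated function (not from the original post), which unrolls the summation of a tuple. While generating the code, only the <em>type</em> of the argument is available, which is exactly enough to know the length <code class="language-plaintext highlighter-rouge">N</code>:</p>

```julia
# Hypothetical minimal example: a generated function that unrolls the
# summation of an `NTuple`. At generation time, `x` stands for the *type*
# of the argument, so `N` is known, but the values are not.
@generated function tuple_sum(x::NTuple{N, T}) where {N, T}
    terms = [:(x[$n]) for n = 1:N]
    return :(+($(terms...)))  # Emit the unrolled expression.
end

tuple_sum((1, 2, 3))  # Generates `x[1] + x[2] + x[3]` and returns 6.
```

<p>For <code class="language-plaintext highlighter-rouge">N = 3</code>, the generated code contains no loop at all: the compiler specialises the unrolled expression for every tuple length it encounters.</p>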
<p>To illustrate generated functions, we will build on the example of <a href="/2020/11/23/julia-learning-circle-meeting-2.html#case-study-stack-allocated-vectors-aka-a-very-brief-introduction-to-staticarraysjl">stack-allocated vectors from the previous post</a>.
We will extend our stack-allocated vector to a stack-allocated <em>matrix</em>, and we will use a generated function to implement matrix multiplication.
Let’s start out by defining a stack-allocated vector and matrix.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span><span class="nc"> StackMatrix</span><span class="x">{</span><span class="n">T</span><span class="x">,</span> <span class="n">M</span><span class="x">,</span> <span class="n">N</span><span class="x">,</span> <span class="n">L</span><span class="x">}</span>
<span class="n">data</span><span class="o">::</span><span class="kt">NTuple</span><span class="x">{</span><span class="n">L</span><span class="x">,</span> <span class="n">T</span><span class="x">}</span>
<span class="k">end</span>
<span class="k">function</span><span class="nf"> StackVector</span><span class="x">(</span><span class="n">data</span><span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="n">T</span><span class="x">})</span> <span class="k">where</span> <span class="n">T</span>
<span class="k">return</span> <span class="n">StackMatrix</span><span class="x">{</span><span class="n">T</span><span class="x">,</span> <span class="n">length</span><span class="x">(</span><span class="n">data</span><span class="x">),</span> <span class="mi">1</span><span class="x">,</span> <span class="n">length</span><span class="x">(</span><span class="n">data</span><span class="x">)}(</span><span class="kt">Tuple</span><span class="x">(</span><span class="n">data</span><span class="x">))</span>
<span class="k">end</span>
<span class="k">function</span><span class="nf"> StackMatrix</span><span class="x">(</span><span class="n">data</span><span class="o">::</span><span class="kt">Matrix</span><span class="x">{</span><span class="n">T</span><span class="x">})</span> <span class="k">where</span> <span class="n">T</span>
<span class="n">M</span><span class="x">,</span> <span class="n">N</span> <span class="o">=</span> <span class="n">size</span><span class="x">(</span><span class="n">data</span><span class="x">)</span>
<span class="k">return</span> <span class="n">StackMatrix</span><span class="x">{</span><span class="n">T</span><span class="x">,</span> <span class="n">M</span><span class="x">,</span> <span class="n">N</span><span class="x">,</span> <span class="n">length</span><span class="x">(</span><span class="n">data</span><span class="x">)}(</span><span class="kt">Tuple</span><span class="x">(</span><span class="n">data</span><span class="x">[</span><span class="o">:</span><span class="x">]))</span>
<span class="k">end</span>
</code></pre></div></div>
<p>The type signature is <code class="language-plaintext highlighter-rouge">StackMatrix{T, M, N, L}</code> where <code class="language-plaintext highlighter-rouge">T</code> is the type of the elements of the matrix, <code class="language-plaintext highlighter-rouge">M</code> is the number of rows of the matrix, <code class="language-plaintext highlighter-rouge">N</code> is the number of columns of the matrix, and <code class="language-plaintext highlighter-rouge">L = M * N</code> is the total number of elements in the matrix;
even though <code class="language-plaintext highlighter-rouge">L</code> can always be computed from <code class="language-plaintext highlighter-rouge">M</code> and <code class="language-plaintext highlighter-rouge">N</code>, we need <code class="language-plaintext highlighter-rouge">L</code> in the type signature, because it specifies the length of the <code class="language-plaintext highlighter-rouge">NTuple</code>.</p>
<p>Before we implement multiplication of general <code class="language-plaintext highlighter-rouge">StackMatrix{T, M, N, L}</code>s, we first consider the case of <code class="language-plaintext highlighter-rouge">StackMatrix{T, 2, 2, 4}</code>s.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="n">Base</span><span class="o">:</span> <span class="o">*</span>
<span class="k">function</span><span class="nf"> </span><span class="o">*(</span><span class="n">x</span><span class="o">::</span><span class="n">StackMatrix</span><span class="x">{</span><span class="n">T</span><span class="x">,</span> <span class="mi">2</span><span class="x">,</span> <span class="mi">2</span><span class="x">,</span> <span class="mi">4</span><span class="x">},</span> <span class="n">y</span><span class="o">::</span><span class="n">StackMatrix</span><span class="x">{</span><span class="n">T</span><span class="x">,</span> <span class="mi">2</span><span class="x">,</span> <span class="mi">2</span><span class="x">,</span> <span class="mi">4</span><span class="x">})</span> <span class="k">where</span> <span class="n">T</span>
<span class="n">x11</span><span class="x">,</span> <span class="n">x21</span><span class="x">,</span> <span class="n">x12</span><span class="x">,</span> <span class="n">x22</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">data</span>
<span class="n">y11</span><span class="x">,</span> <span class="n">y21</span><span class="x">,</span> <span class="n">y12</span><span class="x">,</span> <span class="n">y22</span> <span class="o">=</span> <span class="n">y</span><span class="o">.</span><span class="n">data</span>
<span class="n">z11</span> <span class="o">=</span> <span class="n">x11</span> <span class="o">*</span> <span class="n">y11</span> <span class="o">+</span> <span class="n">x12</span> <span class="o">*</span> <span class="n">y21</span>
<span class="n">z21</span> <span class="o">=</span> <span class="n">x21</span> <span class="o">*</span> <span class="n">y11</span> <span class="o">+</span> <span class="n">x22</span> <span class="o">*</span> <span class="n">y21</span>
<span class="n">z12</span> <span class="o">=</span> <span class="n">x11</span> <span class="o">*</span> <span class="n">y12</span> <span class="o">+</span> <span class="n">x12</span> <span class="o">*</span> <span class="n">y22</span>
<span class="n">z22</span> <span class="o">=</span> <span class="n">x21</span> <span class="o">*</span> <span class="n">y12</span> <span class="o">+</span> <span class="n">x22</span> <span class="o">*</span> <span class="n">y22</span>
<span class="k">return</span> <span class="n">StackMatrix</span><span class="x">{</span><span class="n">T</span><span class="x">,</span> <span class="mi">2</span><span class="x">,</span> <span class="mi">2</span><span class="x">,</span> <span class="mi">4</span><span class="x">}((</span><span class="n">z11</span><span class="x">,</span> <span class="n">z21</span><span class="x">,</span> <span class="n">z12</span><span class="x">,</span> <span class="n">z22</span><span class="x">))</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Let’s check that the implementation is correct.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">x</span> <span class="o">=</span> <span class="n">randn</span><span class="x">(</span><span class="mi">2</span><span class="x">,</span> <span class="mi">2</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">y</span> <span class="o">=</span> <span class="n">randn</span><span class="x">(</span><span class="mi">2</span><span class="x">,</span> <span class="mi">2</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">x_stack</span> <span class="o">=</span> <span class="n">StackMatrix</span><span class="x">(</span><span class="n">x</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">y_stack</span> <span class="o">=</span> <span class="n">StackMatrix</span><span class="x">(</span><span class="n">y</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">x</span> <span class="o">*</span> <span class="n">y</span>
<span class="mi">2</span><span class="n">×2</span> <span class="kt">Matrix</span><span class="x">{</span><span class="kt">Float64</span><span class="x">}</span><span class="o">:</span>
<span class="o">-</span><span class="mf">1.16361</span> <span class="mf">0.848159</span>
<span class="mf">0.355827</span> <span class="o">-</span><span class="mf">0.441428</span>
<span class="n">julia</span><span class="o">></span> <span class="n">reshape</span><span class="x">(</span><span class="n">collect</span><span class="x">((</span><span class="n">x_stack</span> <span class="o">*</span> <span class="n">y_stack</span><span class="x">)</span><span class="o">.</span><span class="n">data</span><span class="x">),</span> <span class="mi">2</span><span class="x">,</span> <span class="mi">2</span><span class="x">)</span>
<span class="mi">2</span><span class="n">×2</span> <span class="kt">Matrix</span><span class="x">{</span><span class="kt">Float64</span><span class="x">}</span><span class="o">:</span>
<span class="o">-</span><span class="mf">1.16361</span> <span class="mf">0.848159</span>
<span class="mf">0.355827</span> <span class="o">-</span><span class="mf">0.441428</span>
</code></pre></div></div>
<p>Nice!
And it is quite a bit faster, too.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="nd">@benchmark</span> <span class="o">$</span><span class="n">x</span> <span class="o">*</span> <span class="o">$</span><span class="n">y</span>
<span class="n">BenchmarkTools</span><span class="o">.</span><span class="n">Trial</span><span class="o">:</span>
<span class="n">memory</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">112</span> <span class="n">bytes</span>
<span class="n">allocs</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">1</span>
<span class="o">--------------</span>
<span class="n">minimum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">54.112</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">median</span> <span class="n">time</span><span class="o">:</span> <span class="mf">59.993</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">mean</span> <span class="n">time</span><span class="o">:</span> <span class="mf">62.529</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">1.27</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">maximum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">486.884</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">83.35</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="o">--------------</span>
<span class="n">samples</span><span class="o">:</span> <span class="mi">10000</span>
<span class="n">evals</span><span class="o">/</span><span class="n">sample</span><span class="o">:</span> <span class="mi">973</span>
<span class="n">julia</span><span class="o">></span> <span class="nd">@benchmark</span> <span class="o">$</span><span class="x">(</span><span class="kt">Ref</span><span class="x">(</span><span class="n">x_stack</span><span class="x">))[]</span> <span class="o">*</span> <span class="o">$</span><span class="x">(</span><span class="kt">Ref</span><span class="x">(</span><span class="n">y_stack</span><span class="x">))[]</span>
<span class="n">BenchmarkTools</span><span class="o">.</span><span class="n">Trial</span><span class="o">:</span>
<span class="n">memory</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">0</span> <span class="n">bytes</span>
<span class="n">allocs</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">0</span>
<span class="o">--------------</span>
<span class="n">minimum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">3.015</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">median</span> <span class="n">time</span><span class="o">:</span> <span class="mf">3.033</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">mean</span> <span class="n">time</span><span class="o">:</span> <span class="mf">3.077</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">maximum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">16.869</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="o">--------------</span>
<span class="n">samples</span><span class="o">:</span> <span class="mi">10000</span>
<span class="n">evals</span><span class="o">/</span><span class="n">sample</span><span class="o">:</span> <span class="mi">1000</span>
</code></pre></div></div>
<p>The problem with multiplication of general <code class="language-plaintext highlighter-rouge">StackMatrix{T, M, N, L}</code>s is that the implementation depends on the particular values of <code class="language-plaintext highlighter-rouge">M</code> and <code class="language-plaintext highlighter-rouge">N</code>: for example, the number of intermediate variables <code class="language-plaintext highlighter-rouge">z11</code>, <code class="language-plaintext highlighter-rouge">z21</code>, <em>et cetera</em> changes with the dimensions.
We will use a generated function to <em>automatically generate the implementation of the corresponding matrix multiplication</em>.
This code-generation procedure depends on the values of <code class="language-plaintext highlighter-rouge">M</code> and <code class="language-plaintext highlighter-rouge">N</code> and will adapt the implementation accordingly.
Generated functions are defined with the macro <code class="language-plaintext highlighter-rouge">@generated</code>.
The implementation of multiplication of general <code class="language-plaintext highlighter-rouge">StackMatrix{T, M, N, L}</code>s is as follows:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="n">Base</span><span class="o">:</span> <span class="o">*</span>
<span class="nd">@generated</span> <span class="k">function</span><span class="nf"> </span><span class="o">*(</span>
<span class="n">x</span><span class="o">::</span><span class="n">StackMatrix</span><span class="x">{</span><span class="n">T</span><span class="x">,</span> <span class="n">K</span><span class="x">,</span> <span class="n">M</span><span class="x">,</span> <span class="n">L₁</span><span class="x">},</span>
<span class="n">y</span><span class="o">::</span><span class="n">StackMatrix</span><span class="x">{</span><span class="n">T</span><span class="x">,</span> <span class="n">M</span><span class="x">,</span> <span class="n">N</span><span class="x">,</span> <span class="n">L₂</span><span class="x">}</span>
<span class="x">)</span> <span class="k">where</span> <span class="x">{</span><span class="n">T</span><span class="x">,</span> <span class="n">K</span><span class="x">,</span> <span class="n">M</span><span class="x">,</span> <span class="n">N</span><span class="x">,</span> <span class="n">L₁</span><span class="x">,</span> <span class="n">L₂</span><span class="x">}</span>
<span class="c"># Unpack `x`.</span>
<span class="n">tuple_x</span> <span class="o">=</span> <span class="kt">Expr</span><span class="x">(</span><span class="o">:</span><span class="n">tuple</span><span class="x">,</span> <span class="x">[</span><span class="kt">Symbol</span><span class="x">(</span><span class="s">"x_</span><span class="si">$(k)</span><span class="s">_</span><span class="si">$(m)</span><span class="s">"</span><span class="x">)</span> <span class="k">for</span> <span class="n">m</span> <span class="o">=</span> <span class="mi">1</span><span class="o">:</span><span class="n">M</span> <span class="k">for</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">1</span><span class="o">:</span><span class="n">K</span><span class="x">]</span><span class="o">...</span><span class="x">)</span>
<span class="n">unpack_x</span> <span class="o">=</span> <span class="o">:</span><span class="x">(</span><span class="o">$</span><span class="n">tuple_x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">data</span><span class="x">)</span>
<span class="c"># Unpack `y`.</span>
<span class="n">tuple_y</span> <span class="o">=</span> <span class="kt">Expr</span><span class="x">(</span><span class="o">:</span><span class="n">tuple</span><span class="x">,</span> <span class="x">[</span><span class="kt">Symbol</span><span class="x">(</span><span class="s">"y_</span><span class="si">$(m)</span><span class="s">_</span><span class="si">$(n)</span><span class="s">"</span><span class="x">)</span> <span class="k">for</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">1</span><span class="o">:</span><span class="n">N</span> <span class="k">for</span> <span class="n">m</span> <span class="o">=</span> <span class="mi">1</span><span class="o">:</span><span class="n">M</span><span class="x">]</span><span class="o">...</span><span class="x">)</span>
<span class="n">unpack_y</span> <span class="o">=</span> <span class="o">:</span><span class="x">(</span><span class="o">$</span><span class="n">tuple_y</span> <span class="o">=</span> <span class="n">y</span><span class="o">.</span><span class="n">data</span><span class="x">)</span>
<span class="c"># Perform multiplication.</span>
<span class="n">mults</span> <span class="o">=</span> <span class="kt">Vector</span><span class="x">{</span><span class="kt">Expr</span><span class="x">}()</span>
<span class="k">for</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">1</span><span class="o">:</span><span class="n">K</span><span class="x">,</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">1</span><span class="o">:</span><span class="n">N</span>
<span class="n">expr</span> <span class="o">=</span> <span class="kt">Expr</span><span class="x">(</span>
<span class="o">:</span><span class="n">call</span><span class="x">,</span>
<span class="o">:+</span><span class="x">,</span>
<span class="x">[</span><span class="o">:</span><span class="x">(</span><span class="o">$</span><span class="x">(</span><span class="kt">Symbol</span><span class="x">(</span><span class="s">"x_</span><span class="si">$(k)</span><span class="s">_</span><span class="si">$(m)</span><span class="s">"</span><span class="x">))</span> <span class="o">*</span> <span class="o">$</span><span class="x">(</span><span class="kt">Symbol</span><span class="x">(</span><span class="s">"y_</span><span class="si">$(m)</span><span class="s">_</span><span class="si">$(n)</span><span class="s">"</span><span class="x">)))</span> <span class="k">for</span> <span class="n">m</span> <span class="o">=</span> <span class="mi">1</span><span class="o">:</span><span class="n">M</span><span class="x">]</span><span class="o">...</span>
<span class="x">)</span>
<span class="n">push!</span><span class="x">(</span><span class="n">mults</span><span class="x">,</span> <span class="o">:</span><span class="x">(</span><span class="o">$</span><span class="x">(</span><span class="kt">Symbol</span><span class="x">(</span><span class="s">"z_</span><span class="si">$(k)</span><span class="s">_</span><span class="si">$(n)</span><span class="s">"</span><span class="x">))</span> <span class="o">=</span> <span class="o">$</span><span class="n">expr</span><span class="x">))</span>
<span class="k">end</span>
<span class="c"># Pack `z`.</span>
<span class="n">tuple_z</span> <span class="o">=</span> <span class="kt">Expr</span><span class="x">(</span><span class="o">:</span><span class="n">tuple</span><span class="x">,</span> <span class="x">[</span><span class="kt">Symbol</span><span class="x">(</span><span class="s">"z_</span><span class="si">$(k)</span><span class="s">_</span><span class="si">$(n)</span><span class="s">"</span><span class="x">)</span> <span class="k">for</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">1</span><span class="o">:</span><span class="n">N</span> <span class="k">for</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">1</span><span class="o">:</span><span class="n">K</span><span class="x">]</span><span class="o">...</span><span class="x">)</span>
<span class="n">pack_z</span> <span class="o">=</span> <span class="o">:</span><span class="x">(</span><span class="n">StackMatrix</span><span class="x">{</span><span class="n">T</span><span class="x">,</span> <span class="n">K</span><span class="x">,</span> <span class="n">N</span><span class="x">,</span> <span class="n">L₃</span><span class="x">}(</span><span class="o">$</span><span class="n">tuple_z</span><span class="x">))</span>
<span class="k">return</span> <span class="kt">Expr</span><span class="x">(</span>
<span class="o">:</span><span class="n">block</span><span class="x">,</span>
<span class="n">unpack_x</span><span class="x">,</span>
<span class="n">unpack_y</span><span class="x">,</span>
<span class="n">mults</span><span class="o">...</span><span class="x">,</span>
<span class="o">:</span><span class="x">(</span><span class="n">L₃</span> <span class="o">=</span> <span class="n">K</span> <span class="o">*</span> <span class="n">N</span><span class="x">),</span>
<span class="o">:</span><span class="x">(</span><span class="k">return</span> <span class="o">$</span><span class="n">pack_z</span><span class="x">)</span>
<span class="x">)</span>
<span class="k">end</span>
</code></pre></div></div>
<p>If we omit the macro <code class="language-plaintext highlighter-rouge">@generated</code>, we can call the implementation to inspect the generated code:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">x_stack</span> <span class="o">*</span> <span class="n">y_stack</span>
<span class="k">quote</span>
<span class="x">(</span><span class="n">x_1_1</span><span class="x">,</span> <span class="n">x_2_1</span><span class="x">,</span> <span class="n">x_1_2</span><span class="x">,</span> <span class="n">x_2_2</span><span class="x">)</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">data</span>
<span class="x">(</span><span class="n">y_1_1</span><span class="x">,</span> <span class="n">y_2_1</span><span class="x">,</span> <span class="n">y_1_2</span><span class="x">,</span> <span class="n">y_2_2</span><span class="x">)</span> <span class="o">=</span> <span class="n">y</span><span class="o">.</span><span class="n">data</span>
<span class="n">z_1_1</span> <span class="o">=</span> <span class="n">x_1_1</span> <span class="o">*</span> <span class="n">y_1_1</span> <span class="o">+</span> <span class="n">x_1_2</span> <span class="o">*</span> <span class="n">y_2_1</span>
<span class="n">z_1_2</span> <span class="o">=</span> <span class="n">x_1_1</span> <span class="o">*</span> <span class="n">y_1_2</span> <span class="o">+</span> <span class="n">x_1_2</span> <span class="o">*</span> <span class="n">y_2_2</span>
<span class="n">z_2_1</span> <span class="o">=</span> <span class="n">x_2_1</span> <span class="o">*</span> <span class="n">y_1_1</span> <span class="o">+</span> <span class="n">x_2_2</span> <span class="o">*</span> <span class="n">y_2_1</span>
<span class="n">z_2_2</span> <span class="o">=</span> <span class="n">x_2_1</span> <span class="o">*</span> <span class="n">y_1_2</span> <span class="o">+</span> <span class="n">x_2_2</span> <span class="o">*</span> <span class="n">y_2_2</span>
<span class="n">L₃</span> <span class="o">=</span> <span class="n">K</span> <span class="o">*</span> <span class="n">N</span>
<span class="k">return</span> <span class="n">StackMatrix</span><span class="x">{</span><span class="n">T</span><span class="x">,</span> <span class="n">K</span><span class="x">,</span> <span class="n">N</span><span class="x">,</span> <span class="n">L₃</span><span class="x">}((</span><span class="n">z_1_1</span><span class="x">,</span> <span class="n">z_2_1</span><span class="x">,</span> <span class="n">z_1_2</span><span class="x">,</span> <span class="n">z_2_2</span><span class="x">))</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Sweet!
This looks very much like our earlier implementation of the two-by-two case.
Let’s again check that the implementation is correct.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">x</span> <span class="o">=</span> <span class="n">randn</span><span class="x">(</span><span class="mi">4</span><span class="x">,</span> <span class="mi">2</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">y</span> <span class="o">=</span> <span class="n">randn</span><span class="x">(</span><span class="mi">2</span><span class="x">,</span> <span class="mi">3</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">x_stack</span> <span class="o">=</span> <span class="n">StackMatrix</span><span class="x">(</span><span class="n">x</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">y_stack</span> <span class="o">=</span> <span class="n">StackMatrix</span><span class="x">(</span><span class="n">y</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">x</span> <span class="o">*</span> <span class="n">y</span>
<span class="mi">4</span><span class="n">×3</span> <span class="kt">Matrix</span><span class="x">{</span><span class="kt">Float64</span><span class="x">}</span><span class="o">:</span>
<span class="mf">0.125514</span> <span class="o">-</span><span class="mf">0.0135978</span> <span class="o">-</span><span class="mf">0.0283178</span>
<span class="o">-</span><span class="mf">1.93756</span> <span class="mf">0.450559</span> <span class="mf">1.17303</span>
<span class="mf">2.56769</span> <span class="o">-</span><span class="mf">0.365378</span> <span class="o">-</span><span class="mf">0.845966</span>
<span class="mf">3.22549</span> <span class="o">-</span><span class="mf">0.602203</span> <span class="o">-</span><span class="mf">1.50065</span>
<span class="n">julia</span><span class="o">></span> <span class="n">reshape</span><span class="x">(</span><span class="n">collect</span><span class="x">((</span><span class="n">x_stack</span> <span class="o">*</span> <span class="n">y_stack</span><span class="x">)</span><span class="o">.</span><span class="n">data</span><span class="x">),</span> <span class="mi">4</span><span class="x">,</span> <span class="mi">3</span><span class="x">)</span>
<span class="mi">4</span><span class="n">×3</span> <span class="kt">Matrix</span><span class="x">{</span><span class="kt">Float64</span><span class="x">}</span><span class="o">:</span>
<span class="mf">0.125514</span> <span class="o">-</span><span class="mf">0.0135978</span> <span class="o">-</span><span class="mf">0.0283178</span>
<span class="o">-</span><span class="mf">1.93756</span> <span class="mf">0.450559</span> <span class="mf">1.17303</span>
<span class="mf">2.56769</span> <span class="o">-</span><span class="mf">0.365378</span> <span class="o">-</span><span class="mf">0.845966</span>
<span class="mf">3.22549</span> <span class="o">-</span><span class="mf">0.602203</span> <span class="o">-</span><span class="mf">1.50065</span>
</code></pre></div></div>
<p>Like the two-by-two case, this implementation is quite a bit faster, too.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="nd">@benchmark</span> <span class="o">$</span><span class="n">x</span> <span class="o">*</span> <span class="o">$</span><span class="n">y</span>
<span class="n">BenchmarkTools</span><span class="o">.</span><span class="n">Trial</span><span class="o">:</span>
<span class="n">memory</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">176</span> <span class="n">bytes</span>
<span class="n">allocs</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">1</span>
<span class="o">--------------</span>
<span class="n">minimum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">205.100</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">median</span> <span class="n">time</span><span class="o">:</span> <span class="mf">219.162</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">mean</span> <span class="n">time</span><span class="o">:</span> <span class="mf">229.711</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.74</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">maximum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">1.679</span> <span class="n">μs</span> <span class="x">(</span><span class="mf">75.82</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="o">--------------</span>
<span class="n">samples</span><span class="o">:</span> <span class="mi">10000</span>
<span class="n">evals</span><span class="o">/</span><span class="n">sample</span><span class="o">:</span> <span class="mi">530</span>
<span class="n">julia</span><span class="o">></span> <span class="nd">@benchmark</span> <span class="o">$</span><span class="x">(</span><span class="kt">Ref</span><span class="x">(</span><span class="n">x_stack</span><span class="x">))[]</span> <span class="o">*</span> <span class="o">$</span><span class="x">(</span><span class="kt">Ref</span><span class="x">(</span><span class="n">y_stack</span><span class="x">))[]</span>
<span class="n">BenchmarkTools</span><span class="o">.</span><span class="n">Trial</span><span class="o">:</span>
<span class="n">memory</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">0</span> <span class="n">bytes</span>
<span class="n">allocs</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">0</span>
<span class="o">--------------</span>
<span class="n">minimum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">15.097</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">median</span> <span class="n">time</span><span class="o">:</span> <span class="mf">15.665</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">mean</span> <span class="n">time</span><span class="o">:</span> <span class="mf">16.987</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">maximum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">103.605</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="o">--------------</span>
<span class="n">samples</span><span class="o">:</span> <span class="mi">10000</span>
<span class="n">evals</span><span class="o">/</span><span class="n">sample</span><span class="o">:</span> <span class="mi">997</span>
</code></pre></div></div>A normal function outputs the result of the computation by the function. In contrast, a generated function outputs the code that implements the function. While generating this code, the generated function can only make use of the types of the arguments, not their values. In a sense, generated functions offer “on-demand code generation”. This mechanism is quite powerful and can be used when normal functions in combination with multiple dispatch cannot give you what you need.Julia Learning Circle: Memory Allocations and Garbage Collection2020-11-23T00:00:00+00:002020-11-23T00:00:00+00:00https://wessel.ai/2020/11/23/julia-learning-circle-meeting-2<h2 id="immutable-and-mutable-types">Immutable and Mutable Types</h2>
<p>Concrete types in Julia are either immutable or mutable.
Immutable types are created with <code class="language-plaintext highlighter-rouge">struct ImmutableType</code> and mutable types are created with <code class="language-plaintext highlighter-rouge">mutable struct MutableType</code>.
The advantage of immutable types is that they can be allocated on the <em>stack</em> as opposed to on the <em>heap</em>.
Allocating objects on the stack is typically more performant due to cache locality and the stack’s simpler, but more rigid, memory structure.</p>
<p>An interesting situation occurs when an <em>immutable</em> type references a <em>mutable</em> type.
<a href="https://github.com/JuliaLang/julia/blob/release-1.5/NEWS.md#compilerruntime-improvements">Since Julia 1.5, such immutable types can be allocated on the stack.</a></p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="k">struct</span><span class="nc"> A</span>
<span class="n">data</span><span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float64</span><span class="x">}</span>
<span class="k">end</span>
<span class="n">julia</span><span class="o">></span> <span class="n">a</span> <span class="o">=</span> <span class="n">A</span><span class="x">(</span><span class="n">randn</span><span class="x">(</span><span class="mi">3</span><span class="x">))</span>
<span class="n">A</span><span class="x">([</span><span class="mf">0.9462871255469765</span><span class="x">,</span> <span class="mf">1.1995018446247545</span><span class="x">,</span> <span class="mf">0.7153882414691778</span><span class="x">])</span>
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">A</code> is immutable, but references a <code class="language-plaintext highlighter-rouge">Vector{Float64}</code>, which is mutable.
This means that the field <code class="language-plaintext highlighter-rouge">a.data</code> cannot be reassigned, but, since <code class="language-plaintext highlighter-rouge">a.data</code> itself is mutable, its elements, such as <code class="language-plaintext highlighter-rouge">a.data[1]</code>, <em>can</em> be changed.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">a</span><span class="o">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">randn</span><span class="x">(</span><span class="mi">3</span><span class="x">)</span>
<span class="n">ERROR</span><span class="o">:</span> <span class="n">setfield!</span> <span class="n">immutable</span> <span class="k">struct</span><span class="nc"> of</span> <span class="n">type</span> <span class="n">A</span> <span class="n">cannot</span> <span class="n">be</span> <span class="n">changed</span>
<span class="n">Stacktrace</span><span class="o">:</span>
<span class="x">[</span><span class="mi">1</span><span class="x">]</span> <span class="n">setproperty!</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="n">A</span><span class="x">,</span> <span class="n">f</span><span class="o">::</span><span class="kt">Symbol</span><span class="x">,</span> <span class="n">v</span><span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float64</span><span class="x">})</span>
<span class="err">@</span> <span class="n">Base</span> <span class="o">./</span><span class="n">Base</span><span class="o">.</span><span class="n">jl</span><span class="o">:</span><span class="mi">34</span>
<span class="x">[</span><span class="mi">2</span><span class="x">]</span> <span class="n">top</span><span class="o">-</span><span class="n">level</span> <span class="n">scope</span>
<span class="err">@</span> <span class="n">REPL</span><span class="x">[</span><span class="mi">5</span><span class="x">]</span><span class="o">:</span><span class="mi">1</span>
<span class="n">julia</span><span class="o">></span> <span class="n">a</span><span class="o">.</span><span class="n">data</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="mf">1.0</span>
</code></pre></div></div>
<p>Types <code class="language-plaintext highlighter-rouge">T</code> that satisfy <code class="language-plaintext highlighter-rouge">isbitstype(T) == true</code> are a subset of immutable types.
They are immutable types that reference only other <code class="language-plaintext highlighter-rouge">isbitstype</code> types or <em>primitive types</em>.
Primitive types are types whose data are a simple collection of bits.
A collection of primitive types <a href="https://docs.julialang.org/en/v1/manual/types/#Primitive-Types">is defined by Base</a>.
The purpose of primitive types is to facilitate interoperability with LLVM.</p>
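<p>As a quick REPL sanity check, using the type <code class="language-plaintext highlighter-rouge">A</code> defined above:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>julia> isbitstype(Float64)  # primitive type
true
julia> isbitstype(NTuple{3, Float64})  # immutable, references only isbitstype types
true
julia> isbitstype(Vector{Float64})  # mutable
false
julia> isbitstype(A)  # immutable, but references the mutable Vector{Float64}
false
</code></pre></div></div>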
<h2 id="case-study-stack-allocated-vectors-aka-a-very-brief-introduction-to-staticarraysjl">Case Study: Stack-Allocated Vectors (A.K.A. a Very Brief Introduction to StaticArrays.jl)</h2>
<p>The usual <code class="language-plaintext highlighter-rouge">Vector{Float64}</code> is mutable, which means that it is heap allocated.
Let’s see if we can create a more performant vector by creating a vector type that is allocated on the <em>stack</em>.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span><span class="nc"> StackVector</span><span class="x">{</span><span class="n">N</span><span class="x">}</span>
<span class="n">data</span><span class="o">::</span><span class="kt">NTuple</span><span class="x">{</span><span class="n">N</span><span class="x">,</span> <span class="kt">Float64</span><span class="x">}</span>
<span class="k">end</span>
<span class="n">StackVector</span><span class="x">(</span><span class="n">data</span><span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float64</span><span class="x">})</span> <span class="o">=</span> <span class="n">StackVector</span><span class="x">(</span><span class="kt">Tuple</span><span class="x">(</span><span class="n">data</span><span class="x">))</span>
</code></pre></div></div>
<p>Define <code class="language-plaintext highlighter-rouge">+</code> for our newly defined <code class="language-plaintext highlighter-rouge">StackVector</code>.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="n">Base</span><span class="o">:</span> <span class="o">+</span>
<span class="o">+</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="n">StackVector</span><span class="x">{</span><span class="n">N</span><span class="x">},</span> <span class="n">y</span><span class="o">::</span><span class="n">StackVector</span><span class="x">{</span><span class="n">N</span><span class="x">})</span> <span class="k">where</span> <span class="n">N</span> <span class="o">=</span> <span class="n">StackVector</span><span class="x">{</span><span class="n">N</span><span class="x">}(</span><span class="n">x</span><span class="o">.</span><span class="n">data</span> <span class="o">.+</span> <span class="n">y</span><span class="o">.</span><span class="n">data</span><span class="x">)</span>
</code></pre></div></div>
<p>Let’s check that this works as intended.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">x</span> <span class="o">=</span> <span class="n">randn</span><span class="x">(</span><span class="mi">10</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">y</span> <span class="o">=</span> <span class="n">randn</span><span class="x">(</span><span class="mi">10</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">stack_x</span> <span class="o">=</span> <span class="n">StackVector</span><span class="x">(</span><span class="n">x</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">stack_y</span> <span class="o">=</span> <span class="n">StackVector</span><span class="x">(</span><span class="n">y</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span>
<span class="mi">10</span><span class="o">-</span><span class="n">element</span> <span class="kt">Vector</span><span class="x">{</span><span class="kt">Float64</span><span class="x">}</span><span class="o">:</span>
<span class="o">-</span><span class="mf">0.5453143850886275</span>
<span class="mf">2.120385168072067</span>
<span class="mf">1.1278328263047377</span>
<span class="mf">1.6358682579762607</span>
<span class="o">-</span><span class="mf">0.22486252827622277</span>
<span class="o">-</span><span class="mf">2.1333012655133836</span>
<span class="mf">2.6754332229859767</span>
<span class="o">-</span><span class="mf">0.7701873679976846</span>
<span class="mf">0.26775849165909</span>
<span class="o">-</span><span class="mf">2.7389288669831786</span>
<span class="n">julia</span><span class="o">></span> <span class="n">collect</span><span class="x">((</span><span class="n">stack_x</span> <span class="o">+</span> <span class="n">stack_y</span><span class="x">)</span><span class="o">.</span><span class="n">data</span><span class="x">)</span>
<span class="mi">10</span><span class="o">-</span><span class="n">element</span> <span class="kt">Vector</span><span class="x">{</span><span class="kt">Float64</span><span class="x">}</span><span class="o">:</span>
<span class="o">-</span><span class="mf">0.5453143850886275</span>
<span class="mf">2.120385168072067</span>
<span class="mf">1.1278328263047377</span>
<span class="mf">1.6358682579762607</span>
<span class="o">-</span><span class="mf">0.22486252827622277</span>
<span class="o">-</span><span class="mf">2.1333012655133836</span>
<span class="mf">2.6754332229859767</span>
<span class="o">-</span><span class="mf">0.7701873679976846</span>
<span class="mf">0.26775849165909</span>
<span class="o">-</span><span class="mf">2.7389288669831786</span>
</code></pre></div></div>
<p>That looks good.
Now let’s see what avoiding allocations on the heap gets us.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="k">using</span> <span class="n">BenchmarkTools</span>
<span class="n">julia</span><span class="o">></span> <span class="nd">@benchmark</span> <span class="o">$</span><span class="n">x</span> <span class="o">+</span> <span class="o">$</span><span class="n">y</span>
<span class="n">BenchmarkTools</span><span class="o">.</span><span class="n">Trial</span><span class="o">:</span>
<span class="n">memory</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">160</span> <span class="n">bytes</span>
<span class="n">allocs</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">1</span>
<span class="o">--------------</span>
<span class="n">minimum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">53.664</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">median</span> <span class="n">time</span><span class="o">:</span> <span class="mf">56.126</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">mean</span> <span class="n">time</span><span class="o">:</span> <span class="mf">59.544</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">1.83</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">maximum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">572.958</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">87.42</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="o">--------------</span>
<span class="n">samples</span><span class="o">:</span> <span class="mi">10000</span>
<span class="n">evals</span><span class="o">/</span><span class="n">sample</span><span class="o">:</span> <span class="mi">987</span>
<span class="n">julia</span><span class="o">></span> <span class="nd">@benchmark</span> <span class="o">$</span><span class="n">stack_x</span> <span class="o">+</span> <span class="o">$</span><span class="n">stack_y</span>
<span class="n">BenchmarkTools</span><span class="o">.</span><span class="n">Trial</span><span class="o">:</span>
<span class="n">memory</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">0</span> <span class="n">bytes</span>
<span class="n">allocs</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">0</span>
<span class="o">--------------</span>
<span class="n">minimum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">0.052</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">median</span> <span class="n">time</span><span class="o">:</span> <span class="mf">0.055</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">mean</span> <span class="n">time</span><span class="o">:</span> <span class="mf">0.055</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">maximum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">0.099</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="o">--------------</span>
<span class="n">samples</span><span class="o">:</span> <span class="mi">10000</span>
<span class="n">evals</span><span class="o">/</span><span class="n">sample</span><span class="o">:</span> <span class="mi">1000</span>
</code></pre></div></div>
<p>Whoa!
What happened here is that the compiler is a little too clever:
it managed to compute the result at compile time and essentially hardcoded the answer.
Compare this with</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="nd">@benchmark</span> <span class="o">$</span><span class="x">(</span><span class="n">stack_x</span> <span class="o">+</span> <span class="n">stack_y</span><span class="x">)</span>
<span class="n">BenchmarkTools</span><span class="o">.</span><span class="n">Trial</span><span class="o">:</span>
<span class="n">memory</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">0</span> <span class="n">bytes</span>
<span class="n">allocs</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">0</span>
<span class="o">--------------</span>
<span class="n">minimum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">0.052</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">median</span> <span class="n">time</span><span class="o">:</span> <span class="mf">0.055</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">mean</span> <span class="n">time</span><span class="o">:</span> <span class="mf">0.056</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">maximum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">8.968</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="o">--------------</span>
<span class="n">samples</span><span class="o">:</span> <span class="mi">10000</span>
<span class="n">evals</span><span class="o">/</span><span class="n">sample</span><span class="o">:</span> <span class="mi">1000</span>
</code></pre></div></div>
<p>To stop the compiler from being too clever, <a href="https://github.com/JuliaCI/BenchmarkTools.jl#quick-start">BenchmarkTools.jl</a> advises the following trick:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="nd">@benchmark</span> <span class="o">$</span><span class="x">(</span><span class="kt">Ref</span><span class="x">(</span><span class="n">stack_x</span><span class="x">))[]</span> <span class="o">+</span> <span class="o">$</span><span class="x">(</span><span class="kt">Ref</span><span class="x">(</span><span class="n">stack_y</span><span class="x">))[]</span>
<span class="n">BenchmarkTools</span><span class="o">.</span><span class="n">Trial</span><span class="o">:</span>
<span class="n">memory</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">0</span> <span class="n">bytes</span>
<span class="n">allocs</span> <span class="n">estimate</span><span class="o">:</span> <span class="mi">0</span>
<span class="o">--------------</span>
<span class="n">minimum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">2.276</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">median</span> <span class="n">time</span><span class="o">:</span> <span class="mf">2.293</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">mean</span> <span class="n">time</span><span class="o">:</span> <span class="mf">2.401</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="n">maximum</span> <span class="n">time</span><span class="o">:</span> <span class="mf">30.049</span> <span class="n">ns</span> <span class="x">(</span><span class="mf">0.00</span><span class="o">%</span> <span class="n">GC</span><span class="x">)</span>
<span class="o">--------------</span>
<span class="n">samples</span><span class="o">:</span> <span class="mi">10000</span>
<span class="n">evals</span><span class="o">/</span><span class="n">sample</span><span class="o">:</span> <span class="mi">1000</span>
</code></pre></div></div>
<p>That looks more reasonable.
For this small array, compared to allocating on the heap, that’s a 25x improvement in runtime!
This example demonstrates that allocations on the heap can substantially contribute to the total runtime of a program.</p>
<p>The idea of allocating vectors on the stack is certainly not mine.
Check out the fantastic <a href="https://github.com/JuliaArrays/StaticArrays.jl">StaticArrays.jl</a>, which provides a generic implementation of stack-allocated arrays.
If the size of the array is small, <a href="https://github.com/JuliaArrays/StaticArrays.jl#speed">these stack-allocated arrays can be significantly more performant than their heap-allocated counterparts</a>.
StaticArrays.jl works by automagically generating implementations of linear algebra operations that are optimised for specific sizes of vectors or matrices by using <a href="https://docs.julialang.org/en/v1/manual/metaprogramming/#Generated-functions">generated functions</a>.</p>
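<p>To give a flavour of the package, here is a minimal example (see the StaticArrays.jl documentation for the full API):</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>julia> using StaticArrays
julia> v = SVector(1.0, 2.0, 3.0);  # the size is part of the type: SVector{3, Float64}
julia> w = v + v;  # stack-allocated, no heap allocations
</code></pre></div></div>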
<h2 id="garbage-collection">Garbage Collection</h2>
<p>As more and more objects are allocated on the heap, eventually the heap fills up.
The purpose of the <em>garbage collector</em> is to clean up the heap every once in a while.
The underlying principle of garbage collection is that objects are considered <em>garbage</em>, and hence can be cleaned up, if it can be proven that they cannot be <em>reached</em> (used) by any future code.</p>
<p>Julia’s garbage collector algorithm is called <em>mark and sweep</em>.
This algorithm consists of two phases:
the <em>mark phase</em>, where all objects that are <em>not</em> garbage are found and marked so;
and the <em>sweep phase</em>, where all <em>unmarked</em> objects are cleaned.
The mark phase first establishes a set of objects that are definitely <em>not</em> garbage.
This set is called the <em>root set</em>, and <a href="https://stackoverflow.com/questions/30080745/how-does-the-mark-in-mark-and-sweep-function-trace-out-the-set-of-objects-acce">essentially consists of all global variables and everything on the stack</a>.
The garbage collector then follows everything that the root set references, and everything that those references reference, and marks those objects along the way.</p>
<p>During the sweep phase, the unmarked objects are <em>freed</em>, which simply means that it is internally recorded that their memory can be freely overwritten and used for something else.
These unmarked objects are found by walking through the whole heap.
Marked objects, on the other hand, remain untouched.
They are also not moved around:
you can imagine that the memory used by marked objects could sometimes be rearranged into a more compact layout, but doing so takes time.
Because Julia’s garbage collector leaves marked objects in place, its mark-and-sweep algorithm is called <em>non-moving</em> or <em>non-compacting</em>.</p>
<p>There is more fancy stuff going on.
For example, Julia’s garbage collector is <a href="https://en.wikipedia.org/wiki/Tracing_garbage_collection#Generational_GC_(ephemeral_GC)"><em>generational</em></a>.
You can check out the docstrings of <a href="https://github.com/JuliaLang/julia/blob/master/src/gc.c">gc.c</a> for more details.</p>Immutable and Mutable TypesJulia Learning Circle: JIT and Method Invalidations2020-11-07T00:00:00+00:002020-11-07T00:00:00+00:00https://wessel.ai/2020/11/07/julia-learning-circle-meeting-1<p>I am participating in a learning circle with the goal of gaining a better understanding of the <a href="https://julialang.org/">Julia language</a>.
To better retain what we learn, I will be turning my notes into small blog posts.
The posts should be simple, quick, but hopefully enjoyable reads.</p>
<p>The code snippets in this post are run on Julia 1.6.0-DEV.1440.</p>
<h2 id="just-in-time-compilation">Just-in-Time Compilation</h2>
<p>The first time a method is run, it is compiled <a href="https://en.wikipedia.org/wiki/Just-in-time_compilation">just-in-time</a> (JIT).
The compilation time can be measured with <code class="language-plaintext highlighter-rouge">@time</code>.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">A</span> <span class="o">=</span> <span class="n">randn</span><span class="x">(</span><span class="kt">Float64</span><span class="x">,</span> <span class="mi">3</span><span class="x">,</span> <span class="mi">3</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="nd">@time</span> <span class="n">inv</span><span class="x">(</span><span class="n">A</span><span class="x">);</span>
<span class="mf">0.244590</span> <span class="n">seconds</span> <span class="x">(</span><span class="mf">559.50</span> <span class="n">k</span> <span class="n">allocations</span><span class="o">:</span> <span class="mf">31.983</span> <span class="n">MiB</span><span class="x">,</span> <span class="mf">2.82</span><span class="o">%</span> <span class="n">gc</span> <span class="n">time</span><span class="x">,</span> <span class="mf">99.94</span><span class="o">%</span> <span class="n">compilation</span> <span class="n">time</span><span class="x">)</span>
<span class="n">julia</span><span class="o">></span> <span class="nd">@time</span> <span class="n">inv</span><span class="x">(</span><span class="n">A</span><span class="x">);</span>
<span class="mf">0.000015</span> <span class="n">seconds</span> <span class="x">(</span><span class="mi">4</span> <span class="n">allocations</span><span class="o">:</span> <span class="mf">1.953</span> <span class="n">KiB</span><span class="x">)</span>
</code></pre></div></div>
<p>The method <code class="language-plaintext highlighter-rouge">inv(::Matrix{Float64})</code> is now compiled and fast to call.
However, <code class="language-plaintext highlighter-rouge">inv(::Matrix{Float32})</code>, for example, is not yet compiled, so its first call will incur compilation time.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">A</span> <span class="o">=</span> <span class="n">randn</span><span class="x">(</span><span class="kt">Float32</span><span class="x">,</span> <span class="mi">3</span><span class="x">,</span> <span class="mi">3</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="nd">@time</span> <span class="n">inv</span><span class="x">(</span><span class="n">A</span><span class="x">);</span>
<span class="mf">0.188690</span> <span class="n">seconds</span> <span class="x">(</span><span class="mf">449.85</span> <span class="n">k</span> <span class="n">allocations</span><span class="o">:</span> <span class="mf">25.852</span> <span class="n">MiB</span><span class="x">,</span> <span class="mf">96.79</span><span class="o">%</span> <span class="n">compilation</span> <span class="n">time</span><span class="x">)</span>
<span class="n">julia</span><span class="o">></span> <span class="nd">@time</span> <span class="n">inv</span><span class="x">(</span><span class="n">A</span><span class="x">);</span>
<span class="mf">0.000017</span> <span class="n">seconds</span> <span class="x">(</span><span class="mi">4</span> <span class="n">allocations</span><span class="o">:</span> <span class="mf">1.125</span> <span class="n">KiB</span><span class="x">)</span>
</code></pre></div></div>
<p>The Julia JIT is simple:
it compiles a method the first time the method is called.
This simplicity, however, comes at the cost of start-up time and compilation delays during runtime.
Other approaches, like <a href="https://www.pypy.org/">PyPy</a>, first run the code on an interpreter, profile the code, and then compile bits of the code based on the profiling results;
this is called <a href="https://en.wikipedia.org/wiki/Profile-guided_optimization">profile-guided optimisation</a> (PGO).</p>
<h2 id="method-invalidation">Method Invalidation</h2>
<p>Once a method is compiled, it can happen that it needs to be recompiled.
Namely, a method is compiled under certain assumptions, and these assumptions may no longer hold as more code is loaded.</p>
<p>For example, suppose that a compiled method <code class="language-plaintext highlighter-rouge">m</code> uses the instance <code class="language-plaintext highlighter-rouge">my_add(x::Float64, y::Float64)</code> obtained from the implementation for <code class="language-plaintext highlighter-rouge">my_add(x::Real, y::Real)</code>.
If a direct implementation of <code class="language-plaintext highlighter-rouge">my_add(x::Float64, y::Float64)</code> is then added, the compiled method <code class="language-plaintext highlighter-rouge">m</code> needs to be recompiled to make use of this direct implementation: <code class="language-plaintext highlighter-rouge">m</code> gets <em>invalidated</em>.</p>
<p>Here’s that example:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">my_add</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Real</span><span class="x">,</span> <span class="n">y</span><span class="o">::</span><span class="kt">Real</span><span class="x">)</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span>
<span class="n">my_add</span> <span class="x">(</span><span class="n">generic</span> <span class="k">function</span><span class="nf"> with</span> <span class="mi">1</span> <span class="n">method</span><span class="x">)</span>
<span class="n">julia</span><span class="o">></span> <span class="n">my_sum</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="n">T</span><span class="x">})</span> <span class="k">where</span> <span class="n">T</span><span class="o"><:</span><span class="kt">Real</span> <span class="o">=</span> <span class="n">reduce</span><span class="x">(</span><span class="n">my_add</span><span class="x">,</span> <span class="n">x</span><span class="x">;</span> <span class="n">init</span><span class="o">=</span><span class="n">one</span><span class="x">(</span><span class="n">T</span><span class="x">))</span>
<span class="n">my_sum</span> <span class="x">(</span><span class="n">generic</span> <span class="k">function</span><span class="nf"> with</span> <span class="mi">1</span> <span class="n">method</span><span class="x">)</span>
<span class="n">julia</span><span class="o">></span> <span class="n">my_sum</span><span class="x">(</span><span class="n">randn</span><span class="x">(</span><span class="mi">10</span><span class="x">))</span>
<span class="mf">0.65443378603631</span>
</code></pre></div></div>
<p>We then add a direct implementation for <code class="language-plaintext highlighter-rouge">my_add(x::Float64, y::Float64)</code>.
To detect the method invalidation, we use <a href="https://github.com/timholy/SnoopCompile.jl">SnoopCompile.jl</a>.</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">trees</span> <span class="o">=</span> <span class="n">invalidation_trees</span><span class="x">(</span><span class="nd">@snoopr</span> <span class="k">begin</span>
<span class="n">my_add</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="n">y</span><span class="o">::</span><span class="kt">Float64</span><span class="x">)</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span>
<span class="k">end</span><span class="x">)</span>
<span class="mi">1</span><span class="o">-</span><span class="n">element</span> <span class="kt">Vector</span><span class="x">{</span><span class="n">SnoopCompile</span><span class="o">.</span><span class="n">MethodInvalidations</span><span class="x">}</span><span class="o">:</span>
<span class="n">inserting</span> <span class="n">my_add</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="n">y</span><span class="o">::</span><span class="kt">Float64</span><span class="x">)</span> <span class="k">in</span> <span class="n">Main</span> <span class="n">at</span> <span class="n">REPL</span><span class="x">[</span><span class="mi">12</span><span class="x">]</span><span class="o">:</span><span class="mi">2</span> <span class="n">invalidated</span><span class="o">:</span>
<span class="n">backedges</span><span class="o">:</span> <span class="mi">1</span><span class="o">:</span> <span class="n">superseding</span> <span class="n">my_add</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Real</span><span class="x">,</span> <span class="n">y</span><span class="o">::</span><span class="kt">Real</span><span class="x">)</span> <span class="k">in</span> <span class="n">Main</span> <span class="n">at</span> <span class="n">REPL</span><span class="x">[</span><span class="mi">8</span><span class="x">]</span><span class="o">:</span><span class="mi">1</span> <span class="n">with</span> <span class="n">MethodInstance</span> <span class="k">for</span> <span class="n">my_add</span><span class="x">(</span><span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="o">::</span><span class="kt">Float64</span><span class="x">)</span> <span class="x">(</span><span class="mi">10</span> <span class="n">children</span><span class="x">)</span>
<span class="mi">1</span> <span class="n">mt_cache</span>
<span class="n">julia</span><span class="o">></span> <span class="n">trees</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span><span class="o">.</span><span class="n">backedges</span><span class="x">[</span><span class="k">end</span><span class="x">]</span>
<span class="n">MethodInstance</span> <span class="k">for</span> <span class="n">my_add</span><span class="x">(</span><span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="o">::</span><span class="kt">Float64</span><span class="x">)</span> <span class="n">at</span> <span class="n">depth</span> <span class="mi">0</span> <span class="n">with</span> <span class="mi">10</span> <span class="n">children</span>
<span class="n">julia</span><span class="o">></span> <span class="n">show</span><span class="x">(</span><span class="n">trees</span><span class="x">[</span><span class="mi">1</span><span class="x">]</span><span class="o">.</span><span class="n">backedges</span><span class="x">[</span><span class="k">end</span><span class="x">];</span> <span class="n">minchildren</span><span class="o">=</span><span class="mi">0</span><span class="x">,</span> <span class="n">maxdepth</span><span class="o">=</span><span class="mi">100</span><span class="x">)</span>
<span class="n">MethodInstance</span> <span class="k">for</span> <span class="n">my_add</span><span class="x">(</span><span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="o">::</span><span class="kt">Float64</span><span class="x">)</span> <span class="x">(</span><span class="mi">10</span> <span class="n">children</span><span class="x">)</span>
<span class="n">MethodInstance</span> <span class="k">for</span> <span class="x">(</span><span class="o">::</span><span class="n">Base</span><span class="o">.</span><span class="n">BottomRF</span><span class="x">{</span><span class="n">typeof</span><span class="x">(</span><span class="n">my_add</span><span class="x">)})(</span><span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="o">::</span><span class="kt">Float64</span><span class="x">)</span> <span class="x">(</span><span class="mi">9</span> <span class="n">children</span><span class="x">)</span>
<span class="n">MethodInstance</span> <span class="k">for</span> <span class="n">_foldl_impl</span><span class="x">(</span><span class="o">::</span><span class="n">Base</span><span class="o">.</span><span class="n">BottomRF</span><span class="x">{</span><span class="n">typeof</span><span class="x">(</span><span class="n">my_add</span><span class="x">)},</span> <span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float64</span><span class="x">})</span> <span class="x">(</span><span class="mi">8</span> <span class="n">children</span><span class="x">)</span>
<span class="n">MethodInstance</span> <span class="k">for</span> <span class="n">foldl_impl</span><span class="x">(</span><span class="o">::</span><span class="n">Base</span><span class="o">.</span><span class="n">BottomRF</span><span class="x">{</span><span class="n">typeof</span><span class="x">(</span><span class="n">my_add</span><span class="x">)},</span> <span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float64</span><span class="x">})</span> <span class="x">(</span><span class="mi">7</span> <span class="n">children</span><span class="x">)</span>
<span class="n">MethodInstance</span> <span class="k">for</span> <span class="n">mapfoldl_impl</span><span class="x">(</span><span class="o">::</span><span class="n">typeof</span><span class="x">(</span><span class="n">identity</span><span class="x">),</span> <span class="o">::</span><span class="n">typeof</span><span class="x">(</span><span class="n">my_add</span><span class="x">),</span> <span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float64</span><span class="x">})</span> <span class="x">(</span><span class="mi">6</span> <span class="n">children</span><span class="x">)</span>
<span class="n">MethodInstance</span> <span class="k">for</span> <span class="n">_mapreduce_dim</span><span class="x">(</span><span class="o">::</span><span class="n">typeof</span><span class="x">(</span><span class="n">identity</span><span class="x">),</span> <span class="o">::</span><span class="n">typeof</span><span class="x">(</span><span class="n">my_add</span><span class="x">),</span> <span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float64</span><span class="x">},</span> <span class="o">::</span><span class="kt">Colon</span><span class="x">)</span> <span class="x">(</span><span class="mi">5</span> <span class="n">children</span><span class="x">)</span>
<span class="n">MethodInstance</span> <span class="k">for</span> <span class="n">var</span><span class="s">"#mapreduce#665"</span><span class="x">(</span><span class="o">::</span><span class="kt">Colon</span><span class="x">,</span> <span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="o">::</span><span class="n">typeof</span><span class="x">(</span><span class="n">mapreduce</span><span class="x">),</span> <span class="o">::</span><span class="n">typeof</span><span class="x">(</span><span class="n">identity</span><span class="x">),</span> <span class="o">::</span><span class="n">typeof</span><span class="x">(</span><span class="n">my_add</span><span class="x">),</span> <span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float64</span><span class="x">})</span> <span class="x">(</span><span class="mi">4</span> <span class="n">children</span><span class="x">)</span>
<span class="n">MethodInstance</span> <span class="k">for</span> <span class="x">(</span><span class="o">::</span><span class="n">Base</span><span class="o">.</span><span class="n">var</span><span class="s">"#mapreduce##kw"</span><span class="x">)(</span><span class="o">::</span><span class="kt">NamedTuple</span><span class="x">{(</span><span class="o">:</span><span class="n">init</span><span class="x">,),</span> <span class="kt">Tuple</span><span class="x">{</span><span class="kt">Float64</span><span class="x">}},</span> <span class="o">::</span><span class="n">typeof</span><span class="x">(</span><span class="n">mapreduce</span><span class="x">),</span> <span class="o">::</span><span class="n">typeof</span><span class="x">(</span><span class="n">identity</span><span class="x">),</span> <span class="o">::</span><span class="n">typeof</span><span class="x">(</span><span class="n">my_add</span><span class="x">),</span> <span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float64</span><span class="x">})</span> <span class="x">(</span><span class="mi">3</span> <span class="n">children</span><span class="x">)</span>
<span class="n">MethodInstance</span> <span class="k">for</span> <span class="n">var</span><span class="s">"#reduce#667"</span><span class="x">(</span><span class="o">::</span><span class="n">Base</span><span class="o">.</span><span class="n">Iterators</span><span class="o">.</span><span class="n">Pairs</span><span class="x">{</span><span class="kt">Symbol</span><span class="x">,</span> <span class="kt">Float64</span><span class="x">,</span> <span class="kt">Tuple</span><span class="x">{</span><span class="kt">Symbol</span><span class="x">},</span> <span class="kt">NamedTuple</span><span class="x">{(</span><span class="o">:</span><span class="n">init</span><span class="x">,),</span> <span class="kt">Tuple</span><span class="x">{</span><span class="kt">Float64</span><span class="x">}}},</span> <span class="o">::</span><span class="n">typeof</span><span class="x">(</span><span class="n">reduce</span><span class="x">),</span> <span class="o">::</span><span class="n">typeof</span><span class="x">(</span><span class="n">my_add</span><span class="x">),</span> <span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float64</span><span class="x">})</span> <span class="x">(</span><span class="mi">2</span> <span class="n">children</span><span class="x">)</span>
<span class="n">MethodInstance</span> <span class="k">for</span> <span class="x">(</span><span class="o">::</span><span class="n">Base</span><span class="o">.</span><span class="n">var</span><span class="s">"#reduce##kw"</span><span class="x">)(</span><span class="o">::</span><span class="kt">NamedTuple</span><span class="x">{(</span><span class="o">:</span><span class="n">init</span><span class="x">,),</span> <span class="kt">Tuple</span><span class="x">{</span><span class="kt">Float64</span><span class="x">}},</span> <span class="o">::</span><span class="n">typeof</span><span class="x">(</span><span class="n">reduce</span><span class="x">),</span> <span class="o">::</span><span class="n">typeof</span><span class="x">(</span><span class="n">my_add</span><span class="x">),</span> <span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float64</span><span class="x">})</span> <span class="x">(</span><span class="mi">1</span> <span class="n">children</span><span class="x">)</span>
<span class="n">MethodInstance</span> <span class="k">for</span> <span class="n">my_sum</span><span class="x">(</span><span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="kt">Float64</span><span class="x">})</span> <span class="x">(</span><span class="mi">0</span> <span class="n">children</span><span class="x">)</span>
</code></pre></div></div>
<p>This shows the whole call stack.
You can interactively navigate the stack with <code class="language-plaintext highlighter-rouge">ascend(trees[1].backedges[end])</code>, which uses <a href="https://github.com/JuliaDebug/Cthulhu.jl">Cthulhu.jl</a>.</p>
<p>Let’s perform some timings to see whether we can detect delays due to method invalidations.
Start up a fresh Julia REPL.</p>
<div title="Invalidation" class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="k">using</span> <span class="n">SnoopCompile</span>
<span class="n">julia</span><span class="o">></span> <span class="n">x</span> <span class="o">=</span> <span class="n">randn</span><span class="x">(</span><span class="mi">10</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">my_add</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Real</span><span class="x">,</span> <span class="n">y</span><span class="o">::</span><span class="kt">Real</span><span class="x">)</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="x">;</span>
<span class="n">julia</span><span class="o">></span> <span class="n">my_sum</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="n">T</span><span class="x">})</span> <span class="k">where</span> <span class="n">T</span><span class="o"><:</span><span class="kt">Real</span> <span class="o">=</span> <span class="n">reduce</span><span class="x">(</span><span class="n">my_add</span><span class="x">,</span> <span class="n">x</span><span class="x">;</span> <span class="n">init</span><span class="o">=</span><span class="n">one</span><span class="x">(</span><span class="n">T</span><span class="x">));</span>
<span class="n">julia</span><span class="o">></span> <span class="nd">@time</span> <span class="n">my_sum</span><span class="x">(</span><span class="n">x</span><span class="x">);</span>
<span class="mf">0.023856</span> <span class="n">seconds</span> <span class="x">(</span><span class="mf">79.31</span> <span class="n">k</span> <span class="n">allocations</span><span class="o">:</span> <span class="mf">4.761</span> <span class="n">MiB</span><span class="x">,</span> <span class="mf">99.88</span><span class="o">%</span> <span class="n">compilation</span> <span class="n">time</span><span class="x">)</span>
<span class="n">julia</span><span class="o">></span> <span class="n">my_add</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Float64</span><span class="x">,</span> <span class="n">y</span><span class="o">::</span><span class="kt">Float64</span><span class="x">)</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="x">;</span>
<span class="n">julia</span><span class="o">></span> <span class="nd">@time</span> <span class="n">my_sum</span><span class="x">(</span><span class="n">x</span><span class="x">);</span>
<span class="mf">0.016896</span> <span class="n">seconds</span> <span class="x">(</span><span class="mf">53.17</span> <span class="n">k</span> <span class="n">allocations</span><span class="o">:</span> <span class="mf">2.952</span> <span class="n">MiB</span><span class="x">,</span> <span class="mf">99.94</span><span class="o">%</span> <span class="n">compilation</span> <span class="n">time</span><span class="x">)</span>
</code></pre></div></div>
<div title="No Invalidation" class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="k">using</span> <span class="n">SnoopCompile</span>
<span class="n">julia</span><span class="o">></span> <span class="n">x</span> <span class="o">=</span> <span class="n">randn</span><span class="x">(</span><span class="mi">10</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">my_add</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Real</span><span class="x">,</span> <span class="n">y</span><span class="o">::</span><span class="kt">Real</span><span class="x">)</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="x">;</span>
<span class="n">julia</span><span class="o">></span> <span class="n">my_sum</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="n">T</span><span class="x">})</span> <span class="k">where</span> <span class="n">T</span><span class="o"><:</span><span class="kt">Real</span> <span class="o">=</span> <span class="n">reduce</span><span class="x">(</span><span class="n">my_add</span><span class="x">,</span> <span class="n">x</span><span class="x">;</span> <span class="n">init</span><span class="o">=</span><span class="n">one</span><span class="x">(</span><span class="n">T</span><span class="x">));</span>
<span class="n">julia</span><span class="o">></span> <span class="nd">@time</span> <span class="n">my_sum</span><span class="x">(</span><span class="n">x</span><span class="x">);</span>
<span class="mf">0.023979</span> <span class="n">seconds</span> <span class="x">(</span><span class="mf">79.31</span> <span class="n">k</span> <span class="n">allocations</span><span class="o">:</span> <span class="mf">4.761</span> <span class="n">MiB</span><span class="x">,</span> <span class="mf">99.89</span><span class="o">%</span> <span class="n">compilation</span> <span class="n">time</span><span class="x">)</span>
<span class="n">julia</span><span class="o">></span> <span class="n">my_add</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Float32</span><span class="x">,</span> <span class="n">y</span><span class="o">::</span><span class="kt">Float32</span><span class="x">)</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="x">;</span>
<span class="n">julia</span><span class="o">></span> <span class="nd">@time</span> <span class="n">my_sum</span><span class="x">(</span><span class="n">x</span><span class="x">);</span>
<span class="mf">0.000004</span> <span class="n">seconds</span> <span class="x">(</span><span class="mi">1</span> <span class="n">allocation</span><span class="o">:</span> <span class="mi">16</span> <span class="n">bytes</span><span class="x">)</span>
</code></pre></div></div>
<p>In the first case, where <code class="language-plaintext highlighter-rouge">my_add(::Float64, ::Float64)</code> gets invalidated, the second call of <code class="language-plaintext highlighter-rouge">my_sum(x)</code> again incurs compilation time.
This does not happen in the second case.</p>
<p>Lastly, we discuss one more common scenario in which method invalidations happen.
Consider</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">f</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">Int</span><span class="x">)</span> <span class="o">=</span> <span class="mi">1</span><span class="x">;</span>
<span class="n">julia</span><span class="o">></span> <span class="n">g</span><span class="x">(</span><span class="n">x</span><span class="x">)</span> <span class="o">=</span> <span class="n">f</span><span class="x">(</span><span class="n">x</span><span class="x">);</span>
<span class="n">julia</span><span class="o">></span> <span class="n">g</span><span class="x">(</span><span class="s">"1"</span><span class="x">)</span>
<span class="n">ERROR</span><span class="o">:</span> <span class="kt">MethodError</span><span class="o">:</span> <span class="n">no</span> <span class="n">method</span> <span class="n">matching</span> <span class="n">f</span><span class="x">(</span><span class="o">::</span><span class="kt">String</span><span class="x">)</span>
<span class="n">Closest</span> <span class="n">candidates</span> <span class="n">are</span><span class="o">:</span>
<span class="n">f</span><span class="x">(</span><span class="o">::</span><span class="kt">Int64</span><span class="x">)</span> <span class="n">at</span> <span class="n">REPL</span><span class="x">[</span><span class="mi">8</span><span class="x">]</span><span class="o">:</span><span class="mi">1</span>
<span class="n">Stacktrace</span><span class="o">:</span>
<span class="x">[</span><span class="mi">1</span><span class="x">]</span> <span class="n">g</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">String</span><span class="x">)</span>
<span class="err">@</span> <span class="n">Main</span> <span class="o">./</span><span class="n">REPL</span><span class="x">[</span><span class="mi">9</span><span class="x">]</span><span class="o">:</span><span class="mi">1</span>
<span class="x">[</span><span class="mi">2</span><span class="x">]</span> <span class="n">top</span><span class="o">-</span><span class="n">level</span> <span class="n">scope</span>
<span class="err">@</span> <span class="n">REPL</span><span class="x">[</span><span class="mi">10</span><span class="x">]</span><span class="o">:</span><span class="mi">1</span>
</code></pre></div></div>
<p>The compiled method instance <code class="language-plaintext highlighter-rouge">g(::String)</code> throws a <code class="language-plaintext highlighter-rouge">MethodError</code>.
In particular, it assumes that there is no implementation for <code class="language-plaintext highlighter-rouge">f(::String)</code>.
If we add that implementation, then <code class="language-plaintext highlighter-rouge">g(::String)</code> needs to be recompiled to make use of the then-available <code class="language-plaintext highlighter-rouge">f(::String)</code>.
Invalidations of this kind link back to the method table.
They show up in the property <code class="language-plaintext highlighter-rouge">mt_backedges</code> of <code class="language-plaintext highlighter-rouge">MethodInvalidations</code>:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">julia</span><span class="o">></span> <span class="n">invalidation_trees</span><span class="x">(</span><span class="nd">@snoopr</span> <span class="k">begin</span> <span class="n">f</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">String</span><span class="x">)</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">end</span><span class="x">)</span>
<span class="mi">1</span><span class="o">-</span><span class="n">element</span> <span class="kt">Vector</span><span class="x">{</span><span class="n">SnoopCompile</span><span class="o">.</span><span class="n">MethodInvalidations</span><span class="x">}</span><span class="o">:</span>
<span class="n">inserting</span> <span class="n">f</span><span class="x">(</span><span class="n">x</span><span class="o">::</span><span class="kt">String</span><span class="x">)</span> <span class="k">in</span> <span class="n">Main</span> <span class="n">at</span> <span class="n">REPL</span><span class="x">[</span><span class="mi">11</span><span class="x">]</span><span class="o">:</span><span class="mi">1</span> <span class="n">invalidated</span><span class="o">:</span>
<span class="n">mt_backedges</span><span class="o">:</span> <span class="mi">1</span><span class="o">:</span> <span class="n">signature</span> <span class="kt">Tuple</span><span class="x">{</span><span class="n">typeof</span><span class="x">(</span><span class="n">f</span><span class="x">),</span> <span class="kt">String</span><span class="x">}</span> <span class="n">triggered</span> <span class="n">MethodInstance</span> <span class="k">for</span> <span class="n">g</span><span class="x">(</span><span class="o">::</span><span class="kt">String</span><span class="x">)</span> <span class="x">(</span><span class="mi">0</span> <span class="n">children</span><span class="x">)</span>
</code></pre></div></div>
<p>I am participating in a learning circle with the goal of gaining a better understanding of the Julia language. To better retain what we learn, I will be turning my notes into small blog posts. The posts should be simple, quick, but hopefully enjoyable reads.</p>
<p><strong>Solutions for High-Dimensional Statistics</strong>, 2020-08-21. <a href="https://wessel.ai/2020/08/21/high-dimensional-statistics">https://wessel.ai/2020/08/21/high-dimensional-statistics</a></p>
<p>A brief update:
<a href="https://scholar.google.com/citations?user=Jp7hKlAAAAAJ">Jiri</a> and I have been working through the new book <a href="https://www.cambridge.org/core/books/highdimensional-statistics/8A91ECEEC38F46DAB53E9FF8757C7A4E"><em>High-Dimensional Statistics: A Non-Asymptotic Viewpoint</em> by Martin E. Wainwright</a>, which has been really good so far.
In the process, we have produced solutions for a subset of the exercises.
Since some of the exercises are considerably challenging, we have decided to publicly post our worked solutions.
<a href="https://high-dimensional-statistics.github.io/">Check it out!</a></p>
<p><strong>A Short Note on The Y Combinator</strong>, 2018-08-16. <a href="https://wessel.ai/2018/08/16/y-combinator">https://wessel.ai/2018/08/16/y-combinator</a></p>
<p class="pretitle">Cross-posted at the <a href="https://invenia.github.io/blog/2018/08/20/ycombinator/">Invenia blog</a>.</p>
<h2 id="introduction">Introduction</h2>
<p>This post is a short note on the notorious <em>Y combinator</em>.
No, not <a href="https://ycombinator.com">that company</a>, but the computer-sciency object that looks like this:</p>
\[\label{eq:Y-combinator}
Y = \lambda\, f : (\lambda\, x : f\,(x\, x))\, (\lambda\, x : f\,(x\, x)).\]
<p>Don’t worry if that looks complicated; we’ll get down to some examples and the nitty gritty details in just a second.
But first, <em>what</em> even is this Y combinator thing?
Simply put, the Y combinator is a higher-order function \(Y\) that can be used to define recursive functions in languages that don’t support recursion.
Cool!</p>
<p>For readers unfamiliar with the above notation, the right-hand side of Equation \eqref{eq:Y-combinator} is a <em>lambda term</em>, which is a valid expression in <a href="https://en.wikipedia.org/wiki/Lambda_calculus"><em>lambda calculus</em></a>:</p>
<ol>
<li>\(x\), a variable, is a lambda term;</li>
<li>if \(t\) is a lambda term, then the anonymous function \(\lambda\, x : t\) is a lambda term;</li>
<li>if \(s\) and \(t\) are lambda terms, then \(s\, t\) is a lambda term, which should be interpreted as \(s\) applied with argument \(t\); and</li>
<li>nothing else is a lambda term.</li>
</ol>
<p>For example, if we apply \(\lambda\, x : y\,x\) to \(z\), we find</p>
\[\label{eq:example}
(\lambda\, x : y\,x)\, z = y\,z.\]
<p>Although the notation in Equation \eqref{eq:example} suggests multiplication, note that everything is function application, because really that’s all there is in lambda calculus.</p>
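<p>As a quick sanity check, this reduction can be mimicked in Python, where anonymous functions are written with <code class="language-plaintext highlighter-rouge">lambda</code>. The function <code class="language-plaintext highlighter-rouge">y</code> below is a hypothetical placeholder, chosen purely for illustration:</p>

```python
# A hypothetical unary function standing in for the free variable y.
y = lambda v: ("y applied to", v)

# (lambda x : y x) applied to "z" reduces to y "z".
lhs = (lambda x: y(x))("z")
rhs = y("z")
assert lhs == rhs  # both sides evaluate to ("y applied to", "z")
```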
<p>Consider the factorial function \(\code{fact}\):</p>
\[\label{eq:fact-recursive}
\code{fact} =
\lambda\, n :
(\code{if}\,
(\code{iszero}\, n) \,
1 \,
(\code{multiply}\,
n\,
(\code{fact}\,
(\code{subtract}\, n\, 1)))).\]
<p>In words, if \(n\) is zero, return \(1\); otherwise, multiply \(n\) with \(\code{fact}(n-1)\).
Equation \eqref{eq:fact-recursive} would be a valid expression if lambda calculus allowed us to use \(\code{fact}\) in the definition of \(\code{fact}\).
Unfortunately, it doesn’t.
Tricky.
Let’s replace the inner \(\code{fact}\) by a variable \(f\):</p>
\[\code{fact}' =
\lambda\, f: \lambda\, n :
(\code{if}\,
(\code{iszero}\, n) \,
1 \,
(\code{multiply}\,
n\,
(f\,
(\code{subtract}\, n\, 1)))).\]
<p>Now, crucially, the Y combinator \(Y\) is precisely designed to construct \(\code{fact}\) from \(\code{fact}'\):</p>
\[Y\, \code{fact}' = \code{fact}.\]
<p>To see this, let’s denote \(\code{fact2}=Y\,\code{fact}'\) and verify that \(\code{fact2}\) indeed equals \(\code{fact}\):</p>
<p>\begin{align}
\code{fact2}
&= Y\, \code{fact}' \newline
&= (\lambda\, f : (\lambda\, x : f\,(x\, x))\, (\lambda\, x : f\,(x\, x)))\, \code{fact}' \newline
&= (\lambda\, x : \code{fact}'\,(x\, x) )\, (\lambda\, x : \code{fact}'\,(x\, x)) \label{eq:step-1} \newline
&= \code{fact}'\, ((\lambda\, x : \code{fact}'\, (x\, x))\,(\lambda\, x : \code{fact}'\, (x\, x))) \label{eq:step-2} \newline
&= \code{fact}'\, (Y\, \code{fact}') \newline
&= \code{fact}'\, \code{fact2},
\end{align}</p>
<p>which is <em>exactly</em> what we’re looking for, because the first argument to \(\code{fact}'\) should be the actual factorial function, \(\code{fact2}\) in this case.
Neat!</p>
<p>We hence see that \(Y\) can indeed be used to define recursive functions in languages that don’t support recursion.
Where does this magic come from, you say?
Sit tight, because that’s up next!</p>
<h2 id="deriving-the-y-combinator">Deriving the Y Combinator</h2>
<p>This section introduces a simple trick that can be used to derive Equation \eqref{eq:Y-combinator}.
We also show how this trick can be used to derive analogues of the Y combinator that implement <em>mutual recursion</em> in languages that don’t even support simple recursion.</p>
<p>Again, let’s start out by considering a recursive function:</p>
\[f = \lambda\, x:g[f, x]\]
<p>where \(g\) is some lambda term that depends on \(f\) and \(x\).
As we discussed before, such a definition is not allowed.
However, pulling out \(f\),</p>
\[\label{eq:fixed-point}
f = \underbrace{(\lambda \, f' :\lambda\, x:g[f', x])}_{h}\,\, f = h\, f.\]
<p>we do find that \(f\) is a <em>fixed point</em> of \(h\): \(f\) is invariant under applications of \(h\).
Now—and this is the trick—suppose that \(f\) is the result of a function \(\hat{f}\) applied to itself: \(f=\hat{f}\,\hat{f}\).
Then Equation \eqref{eq:fixed-point} becomes</p>
\[{\color{red}\hat{f}} \,\hat{f}
= h\,(\hat{f}\, \hat{f})
= ({\color{red}\lambda\,x:h(x\,x)})\,\,\hat{f},\]
<p>from which we, by inspection, infer that</p>
\[\hat{f} = \lambda\,x:h(x\,x).\]
<p>Therefore,</p>
\[f
= \hat{f}\hat{f}
= (\lambda\,x:h(x\,x))\,(\lambda\,x:h(x\,x)).\]
<p>Pulling out \(h\),</p>
\[f
= (\lambda\, h': (\lambda\,x:h'\,(x\,x))\,(\lambda\,x:h'\,(x\,x)))\, h
= Y\, h,\]
<p>where suddenly a wild Y combinator has appeared.</p>
<p>The above derivation shows that \(Y\) is a <em>fixed-point combinator</em>.
Passed some function \(h\), \(Y\,h\) gives a fixed point of \(h\):
\(f = Y\,h\) satisfies \(f = h\,f\).</p>
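<p>The fixed-point property can also be checked numerically. A literal transcription of \(Y\) loops forever under Python's eager evaluation (as discussed in the implementation section below), so this sketch uses the eta-expanded variant derived there:</p>

```python
# Eta-expanded Y combinator (the lazily evaluating variant).
Y = lambda f: (lambda x: f(lambda y: x(x)(y)))(lambda x: f(lambda y: x(x)(y)))

# h maps a candidate factorial to an improved factorial.
h = lambda f: lambda n: 1 if n == 0 else n * f(n - 1)

# f = Y h is a fixed point of h: f and h f agree pointwise.
f = Y(h)
for n in range(6):
    assert f(n) == h(f)(n)
assert f(5) == 120
```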
<p>Pushing it even further, consider two functions that depend on each other:</p>
<p>\begin{align}
f &= \lambda\,x:k_f[x, f, g], &
g &= \lambda\,x:k_g[x, f, g]
\end{align}</p>
<p>where \(k_f\) and \(k_g\) are lambda terms that depend on \(x\), \(f\), and \(g\).
This is foul play, as we know.
We proceed as before and pull out \(f\) and \(g\):</p>
<p>\begin{align}
f
= \underbrace{
(\lambda\,f':\lambda\,g':\lambda\,x:k_f[x, f', g'])
}_{h_f} \,\, f\, g
= h_f\, f\, g
\end{align}</p>

<p>\begin{align}
g
= \underbrace{
(\lambda\,f':\lambda\,g':\lambda\,x:k_g[x, f', g'])
}_{h_g} \,\, f\, g
= h_g\, f\, g.
\end{align}</p>
<p>Now—here’s that trick again—let \(f = \hat{f}\,\hat{f}\,\hat{g}\) and \(g = \hat{g}\,\hat{f}\,\hat{g}\).<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
Then</p>
<p>\begin{align}
\hat{f}\,\hat{f}\,\hat{g}
&= h_f\,(\hat{f}\,\hat{f}\,\hat{g})\,(\hat{g}\,\hat{f}\,\hat{g})
= (\lambda\,x:\lambda\,y:h_f\,(x\,x\,y)\,(y\,x\,y))\,\,\hat{f}\,\hat{g},\newline
\hat{g}\,\hat{f}\,\hat{g}
&= h_g\,(\hat{f}\,\hat{f}\,\hat{g})\,(\hat{g}\,\hat{f}\,\hat{g})
= (\lambda\,x:\lambda\,y:h_g\,(x\,x\,y)\,(y\,x\,y))\,\,\hat{f}\,\hat{g},
\end{align}</p>
<p>which suggests that</p>
<p>\begin{align}
\hat{f} &= \lambda\,x:\lambda\,y:h_f\,(x\,x\,y)\,(y\,x\,y), \newline
\hat{g} &= \lambda\,x:\lambda\,y:h_g\,(x\,x\,y)\,(y\,x\,y).
\end{align}</p>
<p>Therefore</p>
<p>\begin{align}
f
&= \hat{f}\,\hat{f}\,\hat{g} \newline
&=
(\lambda\,x:\lambda\,y:h_f\,(x\,x\,y)\,(y\,x\,y))\,
(\lambda\,x:\lambda\,y:h_f\,(x\,x\,y)\,(y\,x\,y))\,
(\lambda\,x:\lambda\,y:h_g\,(x\,x\,y)\,(y\,x\,y)) \newline
&= Y_f\, h_f\, h_g
\end{align}</p>
<p>where</p>
\[Y_f = (\lambda\, h_f':
\lambda\, h_g':
(\lambda\,x:\lambda\,y:h_f'\,(x\,x\,y)\,(y\,x\,y))\,
(\lambda\,x:\lambda\,y:h_f'\,(x\,x\,y)\,(y\,x\,y))\,
(\lambda\,x:\lambda\,y:h_g'\,(x\,x\,y)\,(y\,x\,y))).\]
<p>Similarly,</p>
\[g = Y_g\, h_f\, h_g.\]
<p><em>Dang</em>, laborious, but that worked.
And thus we have derived two analogues \(Y_f\) and \(Y_g\) of the Y combinator that implement mutual recursion in languages that don’t even support simple recursion.</p>
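<p>As a sketch of how this construction plays out in practice, here is a Python transcription for a hypothetical mutually recursive pair, <code class="language-plaintext highlighter-rouge">is_even</code> and <code class="language-plaintext highlighter-rouge">is_odd</code>, which are not part of the derivation above. As with \(Y\) itself (see the next section), the self-applications \(x\,x\,y\) and \(y\,x\,y\) must be eta-expanded to terminate under Python's eager evaluation:</p>

```python
# hf and hg play the roles of h_f and h_g: each takes candidate
# implementations of both functions and returns an improved version.
hf = lambda f: lambda g: lambda n: True if n == 0 else g(n - 1)   # "even"
hg = lambda f: lambda g: lambda n: False if n == 0 else f(n - 1)  # "odd"

# f-hat and g-hat from the derivation, with eta-expanded self-applications.
fhat = lambda x: lambda y: hf(lambda n: x(x)(y)(n))(lambda n: y(x)(y)(n))
ghat = lambda x: lambda y: hg(lambda n: x(x)(y)(n))(lambda n: y(x)(y)(n))

# f = f-hat f-hat g-hat and g = g-hat f-hat g-hat.
is_even = fhat(fhat)(ghat)
is_odd = ghat(fhat)(ghat)

assert is_even(10) and not is_even(7)
assert is_odd(7) and not is_odd(10)
```

<p>Calling <code class="language-plaintext highlighter-rouge">is_even(10)</code> bounces between the two functions all the way down to zero, even though neither function ever refers to itself or to the other by name.</p>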
<h2 id="implementing-the-y-combinator-in-python">Implementing the Y Combinator in Python</h2>
<p>Well, that’s cool and all, but let’s see whether this Y combinator thing actually works.
Consider the following nearly 1-to-1 translation of \(Y\) and \(\code{fact}'\) to Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Y</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">f</span><span class="p">:</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">(</span><span class="n">x</span><span class="p">)))(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">(</span><span class="n">x</span><span class="p">)))</span>
<span class="n">fact</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">f</span><span class="p">:</span> <span class="k">lambda</span> <span class="n">n</span><span class="p">:</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">n</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="n">n</span> <span class="o">*</span> <span class="n">f</span><span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>If we try to run this, we run into some weird recursion:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">Y</span><span class="p">(</span><span class="n">fact</span><span class="p">)(</span><span class="mi">4</span><span class="p">)</span>
<span class="nb">RecursionError</span><span class="p">:</span> <span class="n">maximum</span> <span class="n">recursion</span> <span class="n">depth</span> <span class="n">exceeded</span>
</code></pre></div></div>
<p>Eh?
What’s going on?
Let’s, for closer inspection, once more write down \(Y\):</p>
\[Y = \lambda\, f: (\lambda\, x : f\,(x\, x))\, (\lambda\, x : f\,(x\, x)).\]
<p>After \(f\) is passed to \(Y\), \((\lambda\, x : f\,(x\, x))\) is passed to \((\lambda\, x : f\,(x\, x))\); which then evaluates \(x\, x\), which passes \((\lambda\, x : f\,(x\, x))\) to \((\lambda\, x : f\,(x\, x))\); which then again evaluates \(x\, x\), which again passes \((\lambda\, x : f\,(x\, x))\) to \((\lambda\, x : f\,(x\, x))\); <em>ad infinitum</em>.
Written down differently, evaluation of \(Y\, f\, x\) yields</p>
\[Y\, f\, x
= (Y\, f)\, x
= (Y\, (Y\, f))\, x
= (Y\, (Y\, (Y\, f)))\, x
= (Y\, (Y\, (Y\, (Y\, f))))\, x
= \ldots,\]
<p>which goes on indefinitely.
Consequently, \(Y\, f\) will not evaluate in finite time, and this is the cause of the <code class="language-plaintext highlighter-rouge">RecursionError</code>.
But we can fix this, and quite simply so: only allow the recursion—the \(x\,x\) bit—to happen when it’s passed an argument; in other words, replace</p>
\[\label{eq:strict-evaluation}
x\,x \to \lambda\,y:x\,x\,y.\]
<p>Substituting Equation \eqref{eq:strict-evaluation} into Equation \eqref{eq:Y-combinator}, we find</p>
\[\label{eq:strict-Y-combinator}
Y = \lambda\, f : (\lambda\, x : f(\lambda\, y: x\, x\,y))\, (\lambda\, x : f(\lambda\, y:x\, x\, y)).\]
<p>Translating to Python,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Y</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">f</span><span class="p">:</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">f</span><span class="p">(</span><span class="k">lambda</span> <span class="n">y</span><span class="p">:</span> <span class="n">x</span><span class="p">(</span><span class="n">x</span><span class="p">)(</span><span class="n">y</span><span class="p">)))(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">f</span><span class="p">(</span><span class="k">lambda</span> <span class="n">y</span><span class="p">:</span> <span class="n">x</span><span class="p">(</span><span class="n">x</span><span class="p">)(</span><span class="n">y</span><span class="p">)))</span>
</code></pre></div></div>
<p>And then we try again:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">Y</span><span class="p">(</span><span class="n">fact</span><span class="p">)(</span><span class="mi">4</span><span class="p">)</span>
<span class="mi">24</span>
<span class="o">>>></span> <span class="n">Y</span><span class="p">(</span><span class="n">fact</span><span class="p">)(</span><span class="mi">3</span><span class="p">)</span>
<span class="mi">6</span>
<span class="o">>>></span> <span class="n">Y</span><span class="p">(</span><span class="n">fact</span><span class="p">)(</span><span class="mi">2</span><span class="p">)</span>
<span class="mi">2</span>
<span class="o">>>></span> <span class="n">Y</span><span class="p">(</span><span class="n">fact</span><span class="p">)(</span><span class="mi">1</span><span class="p">)</span>
<span class="mi">1</span>
</code></pre></div></div>
<p>Sweet success!</p>
<h2 id="summary">Summary</h2>
<p>To recapitulate, the Y combinator is a higher-order function that can be used to define recursion—and even mutual recursion—in languages that don’t support recursion.
One way of deriving \(Y\) is to assume that the recursive function under consideration \(f\) is the result of some other function \(\hat{f}\) applied to itself:
\(f = \hat{f}\,\hat{f}\);
after some simple manipulation, the result can then be determined by inspection.
Although \(Y\) can indeed be used to define recursive functions, it cannot be applied literally in a contemporary, eagerly evaluated programming language; recursion errors then occur.
Fortunately, this can be fixed simply by letting the recursion in \(Y\) happen when needed—that is, <em>lazily</em>.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Do you see why this is the appropriate generalisation of letting \(f=\hat{f}\,\hat{f}\)? <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p><strong>Hello, World</strong>, 2018-06-19. <a href="https://wessel.ai/2018/06/19/hello-world">https://wessel.ai/2018/06/19/hello-world</a></p>
<p>Hello, world! Another blog has come into existence. Woo! Find out more about me <a href="/portfolio">here</a> and <a href="/about">here</a>.</p>
<p>Posts to follow soon. I promise.</p>