
Linear Regression: A Kernel-Method Perspective

\(\newcommand{\t}[1]{ {#1}^T }\) \(\newcommand{\inv}[1]{ {#1}^{-1} }\) \(\newcommand{\vec}[1]{\boldsymbol{#1}}\) \(\newcommand{\diff}[1]{\mathop{}\!\mathrm{\partial}#1}\) \(\newcommand{\fun}[2]{#1\left(#2\right)}\) \(\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}\) \(\newcommand{\abs}[1]{\left| #1 \right|}\) \(\newcommand{\name}[1]{\mathrm{#1}}\) \(\newcommand{\new}{\name{new}}\) \(\newcommand{\dist}[1]{\name{#1}}\) \(\DeclareMathOperator*{\argmax}{arg\,max}\) \(\DeclareMathOperator*{\argmin}{arg\,min}\) \(\DeclareMathOperator*{\diag}{diag}\) \(\newcommand{\rv}[1]{\mathbf{#1}}\) \(\newcommand{\follows}{\thicksim}\) \(\newcommand{\normal}[1]{\mathcal{N}\!\left(#1\right)}\) \(\newcommand{\ga}[1]{\mathrm{Gamma}\!\left(#1\right)}\) \(\newcommand{\diff}[1]{\mathop{}\!\mathrm{d}#1}\) \(\newcommand{\expect}[1]{\mathbb{E}\left[#1\right]}\) \(\newcommand{\vari}[1]{\mathbb{V}\left[#1\right]}\) \(\newcommand{\covari}[1]{\mathrm{Cov}\left(#1\right)}\) \(\newcommand{\prob}[1]{\fun{p}{#1}}\)

We all know that in linear regression, the OLS estimate of the coefficient vector \(\vec{\beta}\) takes the following form:

\[ \hat{\vec{\beta}} = \inv{ \left( \t{ \vec{X} } \vec{X} \right) } \t{\vec{X}} \vec{y} \]

\(\vec{X}\) is the design matrix of shape \(N \times K\) where \(N\) is the number of observations and \(K\) is the number of variables; \(\vec{y}\) is the response vector of shape \(N \times 1\); \(\hat{\vec{\beta}}\) is of shape \(K \times 1\).
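To make this concrete, here is a minimal R sketch on a small synthetic dataset (the names `X_toy`, `y_toy` and `beta_hat` are mine) confirming that the closed-form expression agrees with the coefficients returned by `lm()`:

set.seed(42)
X_toy <- matrix(rnorm(200 * 3), ncol = 3)                      # N = 200 observations, K = 3 variables
y_toy <- drop(X_toy %*% c(1, -2, 0.5)) + rnorm(200, sd = 0.1)  # a linear response plus a little noise

beta_hat <- solve(t(X_toy) %*% X_toy) %*% t(X_toy) %*% y_toy   # the closed-form OLS estimate
all.equal(drop(beta_hat), unname(coef(lm(y_toy ~ X_toy - 1)))) # TRUE: matches lm() fitted without an intercept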

But what is the expression really saying?

One approach to answering this question is to consider how one is able to compute \(\hat{\vec{y}}\) from the training data \(\vec{X}\) using the coefficient estimate \(\hat{\vec{\beta}}\):

\[ \begin{align} \hat{\vec{y}} &= \vec{X} \hat{\vec{\beta}} \\ &= \vec{X} \inv{ \left( \t{ \vec{X} } \vec{X} \right) } \t{\vec{X}} \vec{y} \label{eq:yprediction} \end{align} \]
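Continuing the toy sketch (reusing `X_toy`, `y_toy` and `beta_hat` from above), we can check that applying \(\vec{X} \inv{ \left( \t{ \vec{X} } \vec{X} \right) } \t{\vec{X}}\) directly to \(\vec{y}\) reproduces the fitted values:

H_toy <- X_toy %*% solve(t(X_toy) %*% X_toy) %*% t(X_toy)   # the N x N matrix that maps y to y-hat
yhat_toy <- drop(H_toy %*% y_toy)
all.equal(yhat_toy, drop(X_toy %*% beta_hat))               # TRUE: same predictions as X %*% beta-hat
all.equal(yhat_toy, unname(fitted(lm(y_toy ~ X_toy - 1))))  # TRUE: and the same as lm()'s fitted values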

Suppose the input data matrix \(\vec{X}\) is already centred. Then \({\t{\vec{X}} \vec{X}}\) (of shape \(K \times K\)) is proportional to the covariance matrix of \(\vec{X}\), i.e. \(\vec{\Sigma}\). Since \({\t{\vec{X}} \vec{X}}\) is symmetric and real-valued, we can eigen-decompose its inverse into \({\vec{Q}} {\vec{\Lambda}}^{-1} \inv{\vec{Q}}\), where \(\vec{Q}\) is an orthonormal matrix whose columns are the unit eigenvectors of \({\t{\vec{X}} \vec{X}}\), and \(\vec{\Lambda}\) is a diagonal matrix holding the corresponding eigenvalues.

We can decompose \({\vec{\Lambda}}^{-1}\) into \({\vec{\Lambda}}^{-\frac{1}{2}} \t{{\vec{\Lambda}}^{-\frac{1}{2}}}\) and re-arrange the decomposition:

\[ \begin{align} \inv{\left( \t{\vec{X}} \vec{X} \right)} &= {\vec{Q}} {\vec{\Lambda}}^{-\frac{1}{2}} \t{{\vec{\Lambda}}^{-\frac{1}{2}}} \inv{\vec{Q}} \\ &= \t{ \left( \t{\vec{\Lambda}^{-\frac{1}{2}}} \t{\vec{Q}} \right) } { \left( \t{\vec{\Lambda}^{-\frac{1}{2}}} \inv{\vec{Q}} \right) } \\ &= \t{ \left( {\vec{\Lambda}^{-\frac{1}{2}}} \inv{\vec{Q}} \right) } { \left( {\vec{\Lambda}^{-\frac{1}{2}}} \inv{\vec{Q}} \right) } \quad \left( \text{noting that } \t{ \vec{Q}} = \inv{\vec{Q}} \text{ for orthonormal matrices} \right) \end{align} \]

Plugging this back into \(\eqref{eq:yprediction}\):

\[ \begin{align} \hat{\vec{y}} &= \vec{X} \t{ \left( {\vec{\Lambda}^{-\frac{1}{2}}} \inv{\vec{Q}} \right) } { \left( {\vec{\Lambda}^{-\frac{1}{2}}} \inv{\vec{Q}} \right) } \t{\vec{X}} \vec{y} \\ &= \t{ \left( {\vec{\Lambda}^{-\frac{1}{2}}} \inv{\vec{Q}} \t{\vec{X}} \right) } { \left( {\vec{\Lambda}^{-\frac{1}{2}}} \inv{\vec{Q}} \t{\vec{X}} \right) } \vec{y} \\ &= \underbrace{ \t{ \left( \inv{ \left( \vec{Q} \vec{\Lambda}^{\frac{1}{2}} \right) } \t{\vec{X}} \right) } }_{N \times K} \underbrace{ \left( \inv{ \left( \vec{Q} \vec{\Lambda}^{\frac{1}{2}} \right) } \t{\vec{X}} \right) }_{K \times N} \underbrace{\vec{y}}_{N \times 1} \end{align} \]
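Again reusing the toy objects from above, we can verify this factorisation numerically (`Q_toy`, `L_toy` and `Phi_toy` are names I made up):

e_toy <- eigen(t(X_toy) %*% X_toy)
Q_toy <- e_toy$vectors                              # orthonormal eigenvectors
L_toy <- diag(e_toy$values)                         # eigenvalues on the diagonal
Phi_toy <- X_toy %*% Q_toy %*% solve(sqrt(L_toy))   # the N x K factor above: each row of X_toy re-expressed in the new basis
all.equal(drop(Phi_toy %*% t(Phi_toy) %*% y_toy), yhat_toy)   # TRUE: the Gram-matrix route gives the same y-hat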

\(\inv{ \left( \vec{Q} \vec{\Lambda}^{\frac{1}{2}} \right)} \t{\vec{X}}\) is a particularly interesting quantity. From one of my previous posts on PCA, we know that \(\inv{ \left( \vec{Q} \vec{\Lambda}^{\frac{1}{2}} \right)} \t{\vec{X}}\) can be interpreted as a representation of \(\t{\vec{X}}\) in an alternative basis specified by the columns of \(\vec{Q} \vec{\Lambda}^{\frac{1}{2}}\).

What this means is that \(\t{\left(\inv{ \left( \vec{Q} \vec{\Lambda}^{\frac{1}{2}} \right)} \t{\vec{X}} \right)} {\left( \inv{ \left( \vec{Q} \vec{\Lambda}^{\frac{1}{2}} \right)} \t{\vec{X}} \right)}\) is a Gram matrix whose \(i,j\)-th element is the dot product between the \(i\)-th and the \(j\)-th samples (rows) of \(\vec{X}\), as represented in the alternative basis. We will discuss why we would want to compute these dot products in the alternative basis later on.

Let \({\left[\vec{X}_{i}\right]}_{\vec{Q}}\) denote the \(i\)-th sample in \(\vec{X}\) as represented in the alternative basis specified by the columns of \(\vec{Q} \vec{\Lambda}^{\frac{1}{2}}\) (subscripted with \(\vec{Q}\) for brevity). One can derive the \(i\)-th element of \(\hat{\vec{y}}\) as follows:

\[ \begin{align} \hat{\vec{y}}_{i} &= \sum_{j = 1}^{N} \t{{\left[\vec{X}_{i}\right]}_{\vec{Q}}} \cdot {\left[\vec{X}_{j}\right]}_{\vec{Q}} \vec{y}_{j} \end{align} \]
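Concretely, for a single observation in the toy data (say \(i = 1\)):

i <- 1
w_i <- drop(Phi_toy %*% Phi_toy[i, ])     # inner products between sample i and every sample j, in the new basis
all.equal(sum(w_i * y_toy), yhat_toy[i])  # TRUE: the weighted sum reproduces the i-th fitted value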

Intuitively, what the above says is that the prediction \(\hat{\vec{y}}_{i}\) can be computed as a weighted sum of the elements in \(\vec{y}\), where the weights are the inner products between the \(i\)-th sample in \(\vec{X}\) and each of the samples in \(\vec{X}\). The “closer” another sample \(\vec{X}_{j}\) is to \(\vec{X}_{i}\) under this inner product, the greater the influence of \(\vec{y}_{j}\) on the prediction \(\hat{\vec{y}}_{i}\).

But why bother computing the inner products in the alternative basis \(\vec{Q}\)? The intuition is that we would like to normalise the similarities by how densely the input data are packed in each direction, so that the weights are invariant to the scales and correlations exhibited in the data.
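As a quick illustration of this invariance (still with the toy objects above; `A_toy` is an arbitrary invertible matrix I made up), rescaling and mixing the inputs leaves the matrix of weights \(\vec{X} \inv{ \left( \t{\vec{X}} \vec{X} \right) } \t{\vec{X}}\), and hence the predictions, unchanged:

A_toy <- matrix(c(10, 0, 0,
                  3, 1, 0,
                  0, 0, 0.5), nrow = 3)   # an arbitrary invertible rescaling/mixing of the input variables
X_resc <- X_toy %*% A_toy
H_resc <- X_resc %*% solve(t(X_resc) %*% X_resc) %*% t(X_resc)
all.equal(H_resc, H_toy)                  # TRUE: the weights do not depend on the scales or correlations of the inputs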


Let’s illustrate the ideas above with some synthetic data:

pacman::p_load("mvtnorm", "tidyverse", "gganimate")  # tidyverse (%>%, ggplot2, tibble, forcats) and gganimate are used below

N <- 1000                                    # number of observations
S <- matrix(c(0.2, 0.2, 0.2, 0.4), nrow=2)   # covariance matrix of the two (correlated) input variables
X <- mvtnorm::rmvnorm(N, mean = c(0, 0), sigma = S) %>% `colnames<-`(c("v1", "v2"))  # X is a data matrix of shape (N, 2)
y <- X %*% c(-0.5, 2)                        # a noiseless linear response

# visualise the input data
ggplot(X %>% as_tibble()) + geom_point(aes(x = `v1`, y = `v2`)) + coord_fixed(ratio = 1)

Now compute some of the quantities we discussed:

Σ <- t(X) %*% X      # proportional to the covariance matrix of X, which is (approximately) centred
e <- eigen(Σ)
Λ <- diag(e$values)  # diagonal matrix of eigenvalues
Q <- e$vectors       # orthonormal matrix whose columns are the unit eigenvectors
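As a quick sanity check (reusing `Σ`, `Q` and `Λ` from the chunk above), `Q` is indeed orthonormal and the decomposition reconstructs `Σ`:

all.equal(solve(Q), t(Q))                                 # TRUE: the inverse of Q is its transpose
all.equal(Q %*% Λ %*% t(Q), Σ, check.attributes = FALSE)  # TRUE: Q Λ Q^T recovers t(X) %*% X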

Note that \(\t{\left(\inv{\vec{Q}}\right)} = \t{\left(\t{\vec{Q}}\right)} = \vec{Q}\), so we have \(\t{\left(\inv{ \left( \vec{Q} \vec{\Lambda}^{\frac{1}{2}} \right)} \t{\vec{X}}\right)} = \vec{X} \t{ \left( \inv{ \left( \vec{Q} \vec{\Lambda}^{\frac{1}{2}} \right) } \right) } = \vec{X} \t{ \left( \vec{\Lambda}^{-\frac{1}{2}} \inv{\vec{Q}} \right) } = \vec{X} \vec{Q} \vec{\Lambda}^{-\frac{1}{2}}\).

Let us transform X with Q %*% solve(sqrt(Λ)) and visualise the results:

rbind(
	(X) %>% as_tibble() %>% mutate(seq = seq_len(N), state = "observed"), 
	(X %*% Q %*% solve(sqrt(Λ))) %>% `colnames<-`(c("v1", "v2")) %>% as_tibble() %>%  mutate(seq = seq_len(N), state = "normalised")
) %>% { 
	ggplot(.) + 
		geom_point(aes(x = `v1`, y = `v2`, group = `seq`, colour = state)) + 
		coord_fixed(ratio = 1) + 
		theme(axis.title.y = element_text(vjust = +3), axis.title.x = element_text(vjust = -0.2)) +
		transition_states(fct_relevel(state, c("observed", "normalised")), transition_length = 2, state_length = 1, wrap = FALSE ) + 
		ease_aes("linear") + 
		view_zoom_manual(pause_length=3, aspect_ratio=1, delay=1, xmin=c(-2.0, -0.1), xmax=c(2.0,  0.1), ymin=c(-2.0,  -0.1), ymax=c(2.0, 0.1))
}

We can see that transforming X into this new basis defined by Q %*% sqrt(Λ) has the effect of “de-correlating” X. How is this related to “normalisation” though? To see this, note how the (relatively short) distances between points along the densely packed, low-variance direction become relatively greater after the transformation, whereas the distances between points along the spread-out, high-variance direction shrink in relative terms. In other words, a given distance between two points along a dense direction is worth more than the same distance along a sparse direction, because the typical distance between points along the dense direction is smaller; the dense directions are more “competitive” at being “similar”, so to speak. A closely related concept is the Mahalanobis distance.
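To make the link with the Mahalanobis distance concrete (reusing `X`, `Q`, `Λ` and `Σ` from above, and picking the first two rows arbitrarily): the squared Euclidean distance between two points in the transformed space equals their squared Mahalanobis-style distance with \(\t{\vec{X}}\vec{X}\) as the scale matrix, i.e. the usual Mahalanobis distance up to a factor of \(N\), since \(\t{\vec{X}}\vec{X}\) is proportional to the covariance of the centred data.

Z <- X %*% Q %*% solve(sqrt(Λ))                           # every sample expressed in the new basis
d_whitened <- sum((Z[1, ] - Z[2, ])^2)                    # squared Euclidean distance after the transformation
d_mahal <- mahalanobis(X[1, ], center = X[2, ], cov = Σ)  # squared distance scaled by solve(t(X) %*% X)
all.equal(d_whitened, as.numeric(d_mahal))                # TRUE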

Another interesting observation from the findings above is that the transformation can identify at most \(K\) correlating directions; this can be seen as a limitation of the transformation being linear. This insight also suggests a different way of interpreting linear regression than the OLS story we are all familiar with:

  • A linear regression can be seen as a kernel machine with a linear/inner-product kernel that operates in a “reference” vector space.
    • It is a kernel machine because the predictions can be formulated such that they depend only on the inner products between the observations (see the sketch after this list).
  • It is our assumption that the observed data can be represented in this reference vector space such that they follow a standard (potentially multivariate) normal distribution.
  • However, we only get to obtain a “distorted” view of the dataset: the observed data are represented in an alternative vector space, which is morphed from the reference space via a linear transformation.
  • Fortunately, it is possible to identify the linear transformation the vector space underwent by analysing the covariances exhibited by the observed dataset (i.e. to answer the question: “what linear transformation could have caused a normally distributed dataset to become correlated the way we observed?”).
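Here is a minimal sketch of that kernel-machine view (reusing `X`, `y`, `Q` and `Λ` from above; `x_new` is an arbitrary query point I made up): the prediction for a new input can be computed purely from its inner products with the training samples in the transformed basis.

x_new <- c(0.3, -0.1)                               # a hypothetical new observation
phi_new <- drop(solve(sqrt(Λ)) %*% t(Q) %*% x_new)  # x_new expressed in the new basis
Phi <- X %*% Q %*% solve(sqrt(Λ))                   # training samples expressed in the new basis
weights <- drop(Phi %*% phi_new)                    # one inner product per training sample
yhat_new_kernel <- sum(weights * y)                 # weighted sum of the training responses
yhat_new_ols <- drop(t(x_new) %*% solve(t(X) %*% X) %*% t(X) %*% y)
all.equal(yhat_new_kernel, yhat_new_ols)            # TRUE: identical to the usual OLS prediction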
