Change of Basis
Suppose that the columns of the matrix \(B = \left[ \begin{array}{rr} 3 & -2 \\ 1 & 4 \end{array} \right]\) represent a basis whose vectors are expressed in the standard or natural basis1 \(B_0 = \left[ \begin{array}{rr} 1 & 0 \\ 0 & 1 \end{array} \right]\). A vector \(\vec{v}_{\left[B\right]} = \left[ \begin{array}{r} 1 \\ 1 \end{array} \right]\) written in the basis \(B\) can be translated into its counterpart \(\vec{v}_{\left[B_0\right]}\), the same vector expressed in the basis \(B_0\), by left-multiplying it by the matrix \(B\):
\[ B \vec{v}_{\left[B\right]} = \left[ \begin{array}{rr} 3 & -2 \\ 1 & 4 \end{array} \right] \left[ \begin{array}{r} 1 \\ 1 \end{array} \right] = \left[ \begin{array}{r} 1 \\ 5 \end{array} \right] = \vec{v}_{\left[B_0\right]}\text{.}\]
This equation also tells us how to translate a vector written in the standard basis \(B_0\) back to an alternative basis \(B\):
\[\vec{v}_{\left[B\right]} = B^{-1} \vec{v}_{\left[B_0\right]}\]
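As a quick numerical check, here is a small R sketch (base R only; the matrix and vector are the ones from the example above) showing the translation in both directions:
B <- matrix(c(3, 1, -2, 4), nrow = 2)   # columns are the basis vectors of B
v_B <- c(1, 1)                          # the vector expressed in basis B
# Translate from basis B into the standard basis B0.
v_B0 <- B %*% v_B                       # gives (1, 5)
# Translate back from the standard basis B0 into basis B.
solve(B) %*% v_B0                       # recovers (1, 1)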
The maximum variance formulation of PCA can now be stated as follows: find an orthonormal basis \(B\) such that the variance of the data, when translated into this new basis, is maximised.
Matrix Similarity
Consider an invertible matrix \(P\) and two other matrices \(A\) and \(B\), each representing a linear transformation. The matrix \(A\) is said to be similar to \(B\) (and vice versa) if \(B = PA\inv{P}\).
The intuition behind the concept of matrix similarity is that similar matrices represent the same linear transformation under different bases. To see this, consider a non-zero vector \(\vec{v}\) in the standard basis; transforming it with \(B\) amounts to:
\[ B \vec{v} = \underbrace{ P \underbrace{A \overbrace{ \inv{P} \vec{v}}^{\vec{v}_{\left[P\right]}}}_{\vec{v}_{\left[P\right]} \text{ transformed by } A}}_{A \vec{v}_{\left[P\right]} \text{ translated back to the original basis}} \]
In other words, transforming the vector \(\vec{v}\) with \(A\) inside the alternative basis defined by \(P\), and then translating the transformed vector back to the standard basis, yields the same result as transforming \(\vec{v}\) with \(B\) directly in the standard (i.e. current) basis.
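To make this concrete, here is a minimal R sketch with an invertible \(P\) and a transformation \(A\) chosen arbitrarily for illustration:
P <- matrix(c(3, 1, -2, 4), nrow = 2)   # an invertible change of basis matrix
A <- matrix(c(2, 0, 1, 3), nrow = 2)    # some linear transformation
B <- P %*% A %*% solve(P)               # B is similar to A by construction
v <- c(1, 2)                            # a vector in the standard basis
# Transform v with B directly in the standard basis ...
B %*% v
# ... or translate v into basis P, transform it with A, then translate back.
P %*% (A %*% (solve(P) %*% v))          # same result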
Diagonalizable Matrix
A matrix \(A\) is said to be “diagonalizable” if it is similar to a diagonal matrix, i.e. \(\inv{P} A P = D\) where \(D\) is a diagonal matrix. An intuitive geometric interpretation of a diagonalizable matrix \(A\) is that the linear transformation that \(A\) represents under the standard basis is equivalent to a stretch-only transformation under the alternative basis \(P\).
Why are diagonalizable matrices interesting? It turns out that if a matrix \(A\) can be diagonalized, then we have:
\[ \begin{align} \inv{P} A P &= D \\ A P &= P D \end{align} \]
We can write \(D\) as \(D = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix}\), and also write \(P\) as column vectors \(P = \begin{bmatrix} \vec{\alpha}_1 & \vec{\alpha}_2 & \dots & \vec{\alpha}_n \end{bmatrix}\). This leads to:
\[ \begin{align} A \begin{bmatrix} \vec{\alpha}_1 & \vec{\alpha}_2 & \dots & \vec{\alpha}_n \end{bmatrix} &= \begin{bmatrix} \vec{\alpha}_1 & \vec{\alpha}_2 & \dots & \vec{\alpha}_n \end{bmatrix} \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix} \\ \begin{bmatrix} A \vec{\alpha}_1 & A \vec{\alpha}_2 & \dots & A \vec{\alpha}_n \end{bmatrix} &= \begin{bmatrix} \lambda_1 \vec{\alpha}_1 & \lambda_2 \vec{\alpha}_2 & \dots & \lambda_n \vec{\alpha}_n \end{bmatrix} \end{align} \]
which is equivalent to:
\[A \vec{\alpha}_i = \lambda_i \vec{\alpha}_i \]
In other words, if a matrix \(A\) is similar to a diagonal matrix \(D\) (i.e. diagonalizable) under a change of basis matrix \(P\), then the column vectors of \(P\) are the eigenvectors of \(A\), and the values on the diagonal of \(D\) are the eigenvalues of \(A\), each corresponding to one of the eigenvectors in \(P\).
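In R, eigen() hands us both ingredients at once, so we can verify this relationship on a small example (the matrix \(A\) below is an arbitrary choice with distinct real eigenvalues):
A <- matrix(c(4, 1, 2, 3), nrow = 2)    # an arbitrary diagonalizable matrix
e <- eigen(A)
P <- e$vectors                          # columns are the eigenvectors of A
D <- diag(e$values)                     # diagonal matrix of the eigenvalues
# inv(P) A P recovers the diagonal matrix D ...
solve(P) %*% A %*% P
# ... and P D inv(P) reconstructs A.
P %*% D %*% solve(P)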
Orthogonal Matrices
An orthogonal matrix is a square matrix whose constituent columns \(\vec{u}_i\)s (and rows) are:
- unit vectors (i.e. \({\t{\vec{u}_i}} \vec{u}_i = 1\)), and
- orthogonal to each other (i.e. \(\t{\vec{u}_i} \vec{u}_j = 0, \forall i \ne j\)).
Using the definition, we can see that an orthogonal matrix \(\vec{Q}\) has the following property:
\[ \t{\vec{Q}} \vec{Q} = \vec{Q} \t{\vec{Q}} = \vec{I} \]
where \(\vec{I}\) is the identity matrix. To see how this is the case, consider \(\vec{Q}\) in its column form:
\[ \vec{Q} = \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \dots & \vec{u}_N \end{bmatrix} \]
where \(\vec{u}_i\) is the \(i\)-th column vector of \(\vec{Q}\). The transpose \(\t{\vec{Q}}\) is then:
\[ \t{\vec{Q}} = \begin{bmatrix} \t{\vec{u}_1} \\ \t{\vec{u}_2} \\ \vdots \\ \t{\vec{u}_N} \end{bmatrix} \]
where \(\t{\vec{u}_i}\), the transpose of the \(i\)-th column vector of \(\vec{Q}\), forms the \(i\)-th row of \(\t{\vec{Q}}\). The product \(\t{\vec{Q}}\vec{Q}\) now becomes:
\[ \begin{equation} \t{\vec{Q}}\vec{Q} = \begin{bmatrix} \t{\vec{u}_1} \\ \t{\vec{u}_2} \\ \vdots \\ \t{\vec{u}_N} \end{bmatrix} \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \dots & \vec{u}_N \end{bmatrix} = \begin{bmatrix} \t{\vec{u}_1} \vec{u}_1 & \t{\vec{u}_1} \vec{u}_2 & \dots & \t{\vec{u}_1} \vec{u}_N \\ \t{\vec{u}_2} \vec{u}_1 & \t{\vec{u}_2} \vec{u}_2 & \dots & \t{\vec{u}_2} \vec{u}_N \\ \vdots & \vdots & \ddots & \vdots \\ \t{\vec{u}_N} \vec{u}_1 & \t{\vec{u}_N} \vec{u}_2 & \dots & \t{\vec{u}_N} \vec{u}_N \end{bmatrix} \label{eq:QTQ} \end{equation}\]
Note that \({\t{\vec{u}_i}} \vec{u}_i = 1\) and \(\t{\vec{u}_i} \vec{u}_j = 0, \forall i \ne j\) hold by definition, so the elements on the diagonal of \(\eqref{eq:QTQ}\) are all \(1\) while every other element is \(0\); thus \(\t{\vec{Q}} \vec{Q} = \vec{I}\). A similar derivation gives \(\vec{Q} \t{\vec{Q}} = \vec{I}\).
Combined with the definition of the inverse of a matrix, i.e. \(\inv{\vec{Q}} \vec{Q} = \vec{Q} \inv{\vec{Q}} = \vec{I}\), we can also conclude that \(\t{\vec{Q}} = \inv{\vec{Q}}\).
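These properties are easy to check in R; below I build a random orthogonal matrix from the QR decomposition of a random square matrix (just one convenient way to get one):
set.seed(42)
# Build a random 4 x 4 orthogonal matrix via the QR decomposition.
Q <- qr.Q(qr(matrix(rnorm(16), nrow = 4)))
# The columns are unit vectors and mutually orthogonal, so both products give the identity.
round(t(Q) %*% Q, 10)
round(Q %*% t(Q), 10)
# The transpose therefore equals the inverse.
all.equal(t(Q), solve(Q))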
So, why bother with orthogonal matrices?
It turns out that every real-valued symmetric matrix \(\vec{A}\) can be decomposed (or more precisely, diagonalized) as:
\[ \vec{A} = \vec{Q} \vec{\Lambda} \t{\vec{Q}} \]
where \(\vec{Q}\) is a real orthogonal matrix whose columns are the eigenvectors of \(\vec{A}\), and \(\vec{\Lambda}\) is a diagonal matrix whose entries are the eigenvalues of \(\vec{A}\).2
It is good to know that symmetric matrices can be decomposed this way, but what then? Note that for a data matrix \(\vec{X}\) of any shape, its covariance matrix \(\vec{C} = \expect{ \left(\vec{X} - \expect{\vec{X}} \right) \t{\left( \vec{X} - \expect{\vec{X}} \right)}}\) is symmetric, and thus we have:
\[ \vec{C} = \vec{Q} \vec{\Lambda} \t{\vec{Q}} \]
We shall see that this decomposition will become quite relevant later on.
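As a brief sanity check in R, the eigendecomposition of a covariance matrix computed from some random data does have this form: the eigenvector matrix is orthogonal and \(\vec{Q} \vec{\Lambda} \t{\vec{Q}}\) reconstructs the matrix.
set.seed(42)
C <- cov(matrix(rnorm(100), nrow = 20, ncol = 5))   # a symmetric covariance matrix
e <- eigen(C)
Q <- e$vectors                                      # eigenvectors of C (orthonormal columns)
Lambda <- diag(e$values)                            # eigenvalues of C on the diagonal
all.equal(C, Q %*% Lambda %*% t(Q))                 # the decomposition reconstructs C
round(t(Q) %*% Q, 10)                               # Q is orthogonal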
Principal Component Analysis, a.k.a. PCA
There are two common ways to formulate principal component analysis (PCA), both of which lead to the same algorithm.3 In this article, I derive PCA based on the maximum variance formulation. Under this formulation, the goal of the procedure is to find a new set of orthonormal bases such that, when the data are translated into this new set of bases, the variation in the transformed data is accounted for by as few dimensions as possible.
Derivation of PCA under the Maximum Variance Formulation
The discussion in this section largely follows the expositions presented in Frank Wood’s excellent slides.
First, let us introduce the settings and some notations used in the derivation:
- \(D\) is an \(m \times n\) centered data matrix containing \(m\) variables and \(n\) observations represented in the standard basis; the data matrix is centered in the sense that the mean of each variable has been subtracted from that variable. Note that in this set-up, the data points are represented as column vectors rather than row vectors.
- \(P\) is an \(m \times m\) change of basis matrix that transforms vectors from the target orthonormal basis (the one that maximises the variance of the transformed data) into the standard basis. \(\vec{\alpha}_k\) denotes the \(k\)-th column of \(P\).
Let us now transform the data from the natural basis into the target basis:
\[D_{\left[P\right]} = \inv{P} D = \t{P} D\]
Here we exploit the fact that \(P\) is an orthogonal matrix (its columns are orthonormal), and so \(\inv{P} = \t{P}\).
The covariance matrix for the transformed data (omitting the constant \(1/(n-1)\) scaling factor, which does not affect the directions we are after) is:
\[ \begin{align} \vec{C}_{\left[P\right]} &= D_{\left[P\right]} \t{D_{\left[P\right]}} \\ &= \t{P} D \t{D} P \\ &= \t{P} \vec{C} P \label{eq:transformed-covariance} \end{align} \]
The variances of the transformed data are encoded in the diagonal of \(\eqref{eq:transformed-covariance}\). In particular, the variance of the transformed data projected onto the \(k\)-th dimension of the basis defined by \(P\) is \(\t{\vec{\alpha}_k} \vec{C} \vec{\alpha}_k\). Our goal is to maximise this quantity subject to the constraint that \(\vec{\alpha}_k\) be of unit length, i.e. \(\t{\vec{\alpha}_k} \vec{\alpha}_k = 1\). This can be achieved by introducing a Lagrange multiplier \(\lambda_k\) for the constraint and optimising \(\t{\vec{\alpha}_k} \vec{C} \vec{\alpha}_k - \lambda_k \left( \t{\vec{\alpha}_k} \vec{\alpha}_k - 1\right)\):
\[ \begin{align} \frac{\partial \left[ \t{\vec{\alpha}_k} \vec{C} \vec{\alpha}_k - \lambda_k \left( \t{\vec{\alpha}_k} \vec{\alpha}_k - 1\right) \right] }{\partial \vec{\alpha}_k} = 2 \vec{C} \vec{\alpha}_k - 2 \lambda_k \vec{\alpha}_k = 0 \end{align} \]
which (after dividing through by two) leads to
\[ \begin{align} \vec{C} \vec{\alpha}_k &= \lambda_k \vec{\alpha}_k \label{eq:eigen-equation} \end{align} \] which is to say that \(\lambda_k\) must be one of the eigenvalues of the covariance matrix \(\vec{C}\), and \(\vec{\alpha}_k\) must be the corresponding eigenvector for the eigenvalue \(\lambda_k\). However, \(\eqref{eq:eigen-equation}\) only tells us that the axes of the target basis \(P\) must be eigenvectors of \(\vec{C}\); it does not say which eigenvector accounts for the most variance. Observe that the target quantity for the optimisation is
\[\t{\vec{\alpha}_k} \vec{C} \vec{\alpha}_k = \t{\vec{\alpha}_k} \lambda_k \vec{\alpha}_k = \lambda_k \t{\vec{\alpha}_k} \vec{\alpha}_k = \lambda_k\]
This implies that the target quantity is maximised when \(\lambda_k\) is the largest eigenvalue of \(\vec{C}\); the axis that accounts for the most variance in the basis \(P\) is therefore the eigenvector that corresponds to the largest eigenvalue, i.e. the first principal component (PC). Let us denote the eigenvalue and eigenvector associated with the first PC as \(\lambda_1\) and \(\vec{\alpha}_1\).
To find the eigenvalue and the eigenvector associated with the second principal component, i.e. \(\lambda_2\) and \(\vec{\alpha}_2\), we optimise the quantity \(\t{\vec{\alpha}_2} \vec{C} \vec{\alpha}_2\) once again but with the additional constraint that \(\vec{\alpha}_2\) must be orthogonal to \(\vec{\alpha}_1\), i.e. \(\t{\vec{\alpha}_1} \vec{\alpha}_2 = 0\). Again, this can be achieved by applying an additional Lagrange multiplier to the optimisation:
\[ \begin{align} \frac{\partial \left[ \t{\vec{\alpha}_2} \vec{C} \vec{\alpha}_2 - \lambda_2 \left( \t{\vec{\alpha}_2} \vec{\alpha}_2 - 1\right) - \phi \t{\vec{\alpha}_1} \vec{\alpha}_2 \right] }{\partial \vec{\alpha}_2} = 2 \vec{C} \vec{\alpha}_2 - 2 \lambda_2 \vec{\alpha}_2 - \phi \vec{\alpha}_1 = 0 \end{align} \]
Left-multiplying the above equation by \(\t{\vec{\alpha}_1}\) gives:
\[ \begin{align} 2 \t{\vec{\alpha}_1} \vec{C} \vec{\alpha}_2 - 2 \lambda_2 \t{\vec{\alpha}_1} \vec{\alpha}_2 - \phi \t{\vec{\alpha}_1} \vec{\alpha}_1 &= 0 \\ 0 - 0 - \phi &= 0 \\ \phi &= 0 \end{align} \]
The first term vanishes because \(\vec{C}\) is symmetric, so \(\t{\vec{\alpha}_1} \vec{C} \vec{\alpha}_2 = \t{\left(\vec{C} \vec{\alpha}_1\right)} \vec{\alpha}_2 = \lambda_1 \t{\vec{\alpha}_1} \vec{\alpha}_2 = 0\); the second term vanishes by the orthogonality constraint; and \(\t{\vec{\alpha}_1} \vec{\alpha}_1 = 1\) by the unit-length constraint.
Therefore we see that the optimisation process leads to:
\[ \begin{align} \vec{C} \vec{\alpha}_2 &= \lambda_2 \vec{\alpha}_2 \end{align} \]
which is the same as the case for the first PC except \(\lambda_2\) is the second largest eigenvalue of \(\vec{C}\), and \(\vec{\alpha}_2\) is the corresponding eigenvector associated with \(\lambda_2\).
The procedure can be repeated to obtain the eigenvalues and eigenvectors for the rest of the PCs. In general, the \(k\)-th principal component is represented by the eigenvector associated with the \(k\)-th largest eigenvalue of the covariance matrix \(\vec{C}\).
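To connect the derivation back to code, here is a minimal R sketch of PCA via the eigendecomposition of the covariance matrix, compared against prcomp() on some random data. (Note that cov() and prcomp() expect observations in rows, whereas the derivation above keeps them in columns, and both include the \(1/(n-1)\) scaling, which only rescales the eigenvalues.)
set.seed(42)
X <- matrix(rnorm(100, 0, 2), nrow = 20, ncol = 5)  # 20 observations, 5 variables
Xc <- scale(X, center = TRUE, scale = FALSE)        # centre each variable
C <- cov(Xc)                                        # covariance matrix of the centred data
e <- eigen(C)                                       # eigendecomposition of C
e$values                                            # variance explained by each PC, largest first
scores <- Xc %*% e$vectors                          # data translated into the eigenvector basis
# prcomp() performs the same computation; its sdev^2 are the eigenvalues above.
p <- prcomp(X, retx = TRUE)
all.equal(e$values, p$sdev^2)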
Relationship between PCA and SVD
A common misconception about PCA is that one has to use SVD to implement it. As has been demonstrated above, this is not true — one can directly perform an eigendecomposition over the symmetric covariance matrix \(\vec{C}\) of the centered data to obtain the eigenvectors and eigenvalues for the principal components:
\[ \begin{align} \vec{C} = \vec{P} \vec{\Lambda} \t{\vec{P}} \label{eq:svd-decomposition} \end{align} \]
where \(\vec{P}\) is the change of basis matrix whose column vectors are the eigenvectors of \(\vec{C}\), and \(\vec{\Lambda}\) is a diagonal matrix containing the eigenvalues of \(\vec{C}\).
To see how PCA and SVD are related, consider the singular value decomposition of a row-wise centered, \(m \times n\) data matrix \(\vec{X}\):
\[ \begin{align} \vec{X} = \vec{U}\vec{\Sigma}\t{\vec{V}} \end{align} \]
The covariance matrix \(\vec{C} = \vec{X} \t{\vec{X}}\) (again omitting the constant scaling factor) can then be written as:
\[ \begin{align} \vec{X} \t{\vec{X}} = \vec{U}\vec{\Sigma}\t{\vec{V}} \vec{V} \t{\vec{\Sigma}} \t{\vec{U}} \end{align} \]
Since \(\vec{V}\) is an orthogonal matrix, we have \(\t{\vec{V}} \vec{V} = \vec{I}\):
\[ \begin{align} \vec{X} \t{\vec{X}} &= \vec{U} \vec{\Sigma}^2 \t{\vec{U}} \label{eq:covariance-decomposition} \end{align} \]
Comparing \(\eqref{eq:covariance-decomposition}\) with \(\eqref{eq:svd-decomposition}\), we see that the eigenvalues (the diagonal of \(\vec{\Sigma}^2\)) and the eigenvectors (the columns of \(\vec{U}\)) of the covariance matrix can be obtained by applying SVD directly to the centered data matrix \(\vec{X}\).
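A short R sketch of this correspondence, following the convention used above (variables in rows, observations in columns, and \(\vec{C} = \vec{X}\t{\vec{X}}\) without any scaling factor); the data are random draws for illustration:
set.seed(42)
X <- matrix(rnorm(100), nrow = 5, ncol = 20)        # 5 variables, 20 observations
X <- X - rowMeans(X)                                # centre each row (variable)
s <- svd(X)                                         # X = U Sigma t(V)
e <- eigen(X %*% t(X))                              # eigendecomposition of C = X t(X)
# The squared singular values equal the eigenvalues of the covariance matrix ...
all.equal(s$d^2, e$values)
# ... and the left singular vectors match the eigenvectors up to sign.
round(abs(t(s$u) %*% e$vectors), 10)                # approximately the identity matrix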
Some Observations about PCA
- Applying PCA to a data set does not change the total variance of the data, but redistributes it:
pacman::p_load(tidyverse)
set.seed(42)
# Make some random matrix whose entries are distributed normally.
X <- matrix(rnorm(100, 0, 2), nrow = 20, ncol = 5)
# Examine the trace of the covariance matrix of X:
X %>% cov() %>% diag() %>% sum()
## [1] 22.02276
# Examine the total variance in PCA-transformed data.
prcomp(X, retx = TRUE) %>% `$`(x) %>% cov() %>% diag() %>% sum()
## [1] 22.02276
- The variance explained by the \(k\)-th principal component is given by the \(k\)-th largest eigenvalue of the covariance matrix:
# Examine the eigenvalues of the covariance matrix
X %>% cov() %>% eigen() %>% `$`(values)
## [1] 8.931190 4.893004 3.892862 2.801816 1.503892
# Examine the variance for each PC in PCA-transformed data.
prcomp(X, retx = TRUE) %>% `$`(x) %>% cov() %>% diag()
## PC1 PC2 PC3 PC4 PC5
## 8.931190 4.893004 3.892862 2.801816 1.503892
1. “A standard basis, also called a natural basis, is a special orthonormal vector basis in which each basis vector has a single non-zero entry with value \(1\)” (http://mathworld.wolfram.com/StandardBasis.html).
2. See http://nlp.stanford.edu/IR-book/html/htmledition/matrix-decompositions-1.html and https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix#Real_symmetric_matrices for additional details.
3. Bishop, Christopher M. (2007): Pattern Recognition and Machine Learning. Springer Berlin Heidelberg.