
Gradient Methods

\(\newcommand{\rv}[1]{\mathbf{#1}}\) \(\newcommand{\vec}[1]{\mathbf{#1}}\) \(\newcommand{\follows}{\thicksim}\) \(\newcommand{\normal}[1]{\mathcal{N}\!\left(#1\right)}\) \(\newcommand{\ga}[1]{\mathrm{Gamma}\!\left(#1\right)}\) \(\newcommand{\diff}[1]{\mathop{}\!\mathrm{d}#1}\) \(\newcommand{\fun}[2]{\mathit{#1}\left(#2\right)}\) \(\newcommand{\prob}[1]{\fun{p}{#1}}\) \(\newcommand{\logistic}[1]{\fun{logistic}{#1}}\)

This article is meant to serve as a personal memoir on the principles of gradient methods; hopefully it can be useful to others who wish to get an introduction to the method.

Gradient Method

The gradient method is an optimisation algorithm for finding the maximum/minimum value(s) of a differentiable function. The algorithm is an iterative procedure that gradually updates the parameters of the function in directions along which the value of the function increases/decreases the fastest until no new updates to the parameters can be found (i.e. the procedure has converged).

Let us first make up a simple polynomial function: \(f\left(x\right) = -x ^ 4 + x ^ 3 - 2x ^ 2 + 3x + 5\) :

target <- function(x) { - (x ^ 4) + (x ^ 3) - 2 * (x ^ 2) + 3 * x + 5 }
plot(target, -1, 2)

Suppose we would like to find the maximum of this function using the gradient method. To do this, we first derive the gradient function of the target function: \[ \begin{align} \frac{\partial f\left(x\right)}{\partial x} &= -4x^3 + 3x^2 - 4x + 3 \label{eq:poly-derivative} \end{align} \] The above gradient function translates simply to the following R function:

gradient <- function(x) {-4 * x ^ 3 + 3 * x ^ 2 - 4 * x + 3}

There are two variants of the gradient method: gradient ascent and gradient descent. The former is used for finding the maximum value(s) of a function while the latter is used for finding the minimum value(s). In this case, we will be using gradient ascent since we want to find the maximum of the function, but the rationale behind the two methods is the same. The basic idea of gradient ascent is to gradually move the parameters of the function (i.e. \(x\) in this case) in the direction to which the gradient vector points when evaluated at the current set of parameters; this is also the direction along which the value of the function \(\fun{f}{x}\) increases the fastest.
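
To make the update rule concrete, here is a single hand-computed update (an illustrative sketch of my own; the learning rate of 0.01 and the starting point are made up):

eta <- 0.01                     # made-up learning rate, for illustration only
x0 <- 0                         # arbitrary starting point
x1 <- x0 + eta * gradient(x0)   # gradient(0) = 3, so x1 = 0.03
target(x1) > target(x0)         # the small uphill step increases the target
## [1] TRUE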

The following R function implements the algorithm described above:

gradient.ascent <- function(gradient, target, learning.rate = 10, learning.rate.decay = 0.9999, tolerance = .00001, init) {
  # set the initial parameters
  old_x <- init
  new_x <- old_x
  
  repeat {
    
    # evaluate the gradient at old_x
    g <- gradient(old_x)
    
    # on normalised gradient:
    # http://stats.stackexchange.com/questions/22568/difference-in-using-normalized-gradient-and-gradient
    # compute the norm of the gradient vector
    len <- base::norm(g, type = "2")
    if (len != 0) {
      g <- g / len
    }
    
    # update the old_x to obtain the new_x
    new_x <- old_x + learning.rate * g
    
    # Check for convergence: if the change in the target value between the
    # updated and the current parameters is smaller than the specified
    # tolerance, stop. isTRUE() treats an NA/NaN difference as not converged.
    if (isTRUE(abs(target(new_x) - target(old_x)) < tolerance)) {
      break
    }
    
    old_x <- new_x
    
    # Reduce the learning rate as the iteration proceeds. This should help
    # avoid "jumping" around the optimal parameters.
    learning.rate <- learning.rate * learning.rate.decay
    
  }
  return(old_x)
}

The most important parameter in this process is the learning.rate. If set too large, the algorithm may keep “overshooting” the optimal value for the parameter; if set too small, the algorithm can take too long to reach convergence.
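
To make this trade-off visible, here is a small illustrative experiment (my own sketch; the step sizes and iteration counts are made up) that applies raw, un-normalised gradient steps to the polynomial above:

take.steps <- function(lr, n, x = 0) {
  # apply n raw gradient-ascent updates with a fixed learning rate
  for (i in seq_len(n)) x <- x + lr * gradient(x)
  x
}
take.steps(lr = 0.5, n = 5)    # overshoots 0.75 and diverges
take.steps(lr = 0.001, n = 5)  # creeps towards 0.75, but very slowly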

The gradient.ascent function above employs two additional tricks to improve convergence. The first is to normalise the gradient vector to unit length before using it to update the parameters; this makes the learning.rate easier to choose, as its scale no longer depends on the data. The second is to gradually shrink the learning rate, which prevents the procedure from overshooting the maximum towards the end of the optimisation.

A key property of the gradient method is that the algorithm is guaranteed to find the global maximum of the function only if the function is concave1 (convex if the target is a minimum); otherwise, the algorithm only finds local extrema. In this case, the target function is concave: its derivative (Eq. \(\eqref{eq:poly-derivative}\)) is monotonically decreasing, because the second derivative, \(-12x^2 + 6x - 4\), is negative for every \(x\). The procedure is therefore guaranteed to find the global maximum, which we know analytically occurs at \(x = 0.75\).
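
We can confirm this numerically (a quick check added here for illustration): the second derivative attains its largest value at \(x = 0.25\), where it is still negative:

second.derivative <- function(x) { -12 * x ^ 2 + 6 * x - 4 }
optimize(second.derivative, interval = c(-10, 10), maximum = TRUE)$objective  # ≈ -3.25 < 0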

We can now run the gradient ascent procedure to find the \(x\) that maximises \(f\left(x\right) = -x ^ 4 + x ^ 3 - 2x ^ 2 + 3x + 5\):

gradient.ascent(gradient, target, learning.rate = .01, init = 0)
## [1] 0.75
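
As a sanity check (not part of the gradient procedure itself), base R's optimize finds the same maximiser:

optimize(target, interval = c(-1, 2), maximum = TRUE)$maximum  # ≈ 0.75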

Multivariate Gradient Ascent

The previous section used the gradient ascent algorithm to maximise a univariate function. The same method can be applied to multivariate functions. As an example, we will use the gradient ascent algorithm to maximise the probability density function of a 2-dimensional multivariate normal distribution. The probability density function of a multivariate normal distribution is defined as: \[\normal{\vec{x};\vec{\mu}, \vec{\Sigma}} = \frac{1}{\sqrt{\left(2\pi\right)^k\left|\vec{\Sigma}\right|}} \exp{ \left\{ -\frac{1}{2} {\left( \vec{x} - \vec{\mu} \right) }^{\mathrm{T}} {\vec{\Sigma}}^{-1} {\left( \vec{x} - \vec{\mu} \right) } \right\} } \] where \(\vec{\mu}\) and \(\vec{\Sigma}\) are the mean vector and the covariance matrix of the normal distribution respectively; \(k\) is the number of dimensions of the multivariate distribution; \(\left|\vec{\Sigma}\right|\) denotes the determinant of \(\vec{\Sigma}\). The above function translates to the following R code:

multivariate.normal <- function(x, mu, sigma) {
  # quadratic form (x - mu)' sigma^{-1} (x - mu), computed row by row
  q <- apply(x, MARGIN = 1, FUN = function(x) { t(x - mu) %*% solve(sigma) %*% (x - mu) })
  # note the normalising constant divides (not multiplies) the exponential
  exp(-q / 2) / sqrt((2 * pi) ^ ncol(x) * det(sigma))
}

In this example, we will be working with the following multivariate normal distribution:

# dplyr provides the %>% pipe and mutate() used below
library(dplyr)
mu <- c(1, 1)
sigma <- matrix(c(1, .75, .75, 1), ncol = 2)

x <- seq(-5, 5, length.out = 100)
y <- seq(-5, 5, length.out = 100)
z <- expand.grid(x = x, y = y) %>% mutate(z = multivariate.normal(cbind(x, y), mu, sigma)) %>% '$'(z)
persp(x, y, matrix(z, nrow = length(x)), xlab = "x", ylab = "y", zlab = "z", theta = 20, phi = 20)

The mean of the multivariate normal distribution is located at \(\left(1, 1\right)\); its covariance matrix is \[\begin{pmatrix} 1 & 0.75 \\ 0.75 & 1 \end{pmatrix}\]. As such, the maximum of the probability density function of the distribution occurs at \(\left(1, 1\right)\).
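
As a quick illustrative check (my addition, not part of the original analysis), the grid point with the highest density computed above should indeed lie near \(\left(1, 1\right)\), up to the resolution of the 100-point grid:

grid <- expand.grid(x = x, y = y)
grid[which.max(z), ]  # approximately (1, 1), limited by the grid spacing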

Taking the derivative of the density function with respect to \(\vec{x}\) gives: \[\frac{\partial \normal{\vec{x};\vec{\mu}, \vec{\Sigma}} }{\partial \vec{x} } = - \vec{\Sigma}^{-1} \left(\vec{x} - \vec{\mu}\right) \normal{\vec{x};\vec{\mu}, \vec{\Sigma}} \] (the \(\frac{1}{2}\) in the exponent cancels against the factor \(2\) produced by differentiating the quadratic form) which translates to the following R code:

multivariate.normal.gradient <- function(x, mu, sigma) {
  # -(x - mu)' sigma^{-1}, row by row, scaled by the density; no extra factor of 1/2
  - t(apply(x, MARGIN = 1, FUN = function(x) { (x - mu) %*% solve(sigma) })) * multivariate.normal(x, mu, sigma)
}

Since we only want to optimise the function with respect to \(\vec{x}\), we need to create an intermediate anonymous function that evaluates the gradient of the multivariate normal density with \(\vec{\mu}\) and \(\vec{\Sigma}\) held fixed:

# note that mu and sigma have been defined previously:
gradient.ascent(function(x) { multivariate.normal.gradient(matrix(x, nrow = 1), mu, sigma) }, function(x) {multivariate.normal(matrix(x, nrow = 1), mu, sigma)}, init = c(0, 0))
##      [,1] [,2]
## [1,]    1    1

which gives the correct answer.
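
To double-check the analytic gradient, we can also compare it against a central finite-difference approximation at an arbitrary point (numeric.gradient and the test point p are hypothetical helpers added for this check, not part of the original code):

numeric.gradient <- function(f, x, h = 1e-6) {
  # central finite differences, one coordinate at a time
  sapply(seq_along(x), function(i) {
    e <- numeric(length(x))
    e[i] <- h
    (f(x + e) - f(x - e)) / (2 * h)
  })
}
p <- c(0.3, -0.2)  # arbitrary test point
numeric.gradient(function(v) { multivariate.normal(matrix(v, nrow = 1), mu, sigma) }, p)
multivariate.normal.gradient(matrix(p, nrow = 1), mu, sigma)
# the two results should agree to several decimal places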

Logistic Regression with Gradient Method

In this section, I am going to use gradient ascent to find the maximum likelihood estimates of the coefficients in a logistic regression model.

Logistic Regression

Logistic regression is a widely used model for performing binary classifications. The basic idea of logistic regression is to pass the result of an ordinary linear regression through a logistic activation function.

Data

We use example data from “https://stats.idre.ucla.edu/stat/data/binary.csv” as test input:

# readr provides read_csv(); binary.csv is a local copy of the dataset linked above
library(readr)
data <- read_csv("binary.csv", show_col_types = FALSE)

As a benchmark, we first perform a logistic regression on the dataset using R’s built-in glm function:

glm(admit ~ ., data = data, family = binomial(link = "logit"))
## 
## Call:  glm(formula = admit ~ ., family = binomial(link = "logit"), data = data)
## 
## Coefficients:
## (Intercept)          gre          gpa         rank  
##   -3.449548     0.002294     0.777014    -0.560031  
## 
## Degrees of Freedom: 399 Total (i.e. Null);  396 Residual
## Null Deviance:	    500 
## Residual Deviance: 459.4 	AIC: 467.4

We will later check the results obtained by the gradient ascent algorithm implemented in this document against those produced by R’s built-in logistic regression procedure.

The logistic function is a function from the sigmoid family and is defined as: \[ \logistic{x} = \frac{1}{1 + e^{-x}} \] and it plots as follows:

logistic <- function(x) { 1 / (1 + exp(-x)) }
plot(logistic, from = -10, to = 10)

A Bayesian formulation of the logistic regression problem can be specified as follows: \[ \vec{y}_i \follows \fun{Bernoulli}{logistic(\vec{w}^{\mathrm{T}} \vec{x}_i)} \] which states that each \(\vec{y}_i\) is a sample from a Bernoulli distribution whose “success probability” is \(logistic(\vec{w}^{\mathrm{T}} \vec{x}_i)\); the greater the value of \(\vec{w}^{\mathrm{T}} \vec{x}_i\) is, the closer the “success probability” is to \(1\). Through this formulation, we can write down the log-likelihood function for the logistic regression as follows: \[ \begin{align} \log \mathcal{L}\left(\vec{w}, \vec{x}, \vec{y}\right) &= \log \prod_{i = 1}^{N} \fun{Bernoulli}{ \vec{y}_i; \logistic{\vec{w}^{\mathrm{T}} \vec{x}_i} } \\ &= \sum_{i = 1}^{N} { \log{\logistic{\vec{w}^{\mathrm{T}} \vec{x}_i}^{\vec{y}_i} {\left( 1 - \logistic{\vec{w}^{\mathrm{T}} \vec{x}_i} \right)}^{1 - \vec{y}_i}} } \\ &= \sum_{i = 1}^{N} \vec{y}_i \log{\logistic{\vec{w}^{\mathrm{T}} \vec{x}_i}} \\ &\qquad + \left(1 - \vec{y}_i\right) \log{\left(1 - \logistic{\vec{w}^{\mathrm{T}} \vec{x}_i}\right)} \\ &= \sum_{i = 1}^{N} \vec{y}_i \log{\logistic{\vec{w}^{\mathrm{T}} \vec{x}_i}} \\ &\qquad + \log{\left(1 - \logistic{\vec{w}^{\mathrm{T}} \vec{x}_i}\right)} - \vec{y}_i \log{\left(1 - \logistic{\vec{w}^{\mathrm{T}} \vec{x}_i}\right)} \\ &= \sum_{i = 1}^{N} { \vec{y}_i \log{ \frac{\logistic{\vec{w}^{\mathrm{T}} \vec{x}_i}}{\left(1 - \logistic{\vec{w}^{\mathrm{T}} \vec{x}_i}\right)} } + \log{\left(1 - \logistic{\vec{w}^{\mathrm{T}} \vec{x}_i}\right)} } \\ &= \sum_{i = 1}^{N} {\vec{y}_i \log{ \frac{\frac{1}{1 + e^{-\vec{w}^{\mathrm{T}} \vec{x}_i}}}{1 - \frac{1}{1 + e^{-\vec{w}^{\mathrm{T}} \vec{x}_i}}} } + \log{\left(1 - \logistic{\vec{w}^{\mathrm{T}} \vec{x}_i}\right)} } \\ &= \sum_{i = 1}^{N} {\vec{y}_i \log{\frac{1}{1 + e^{-\vec{w}^{\mathrm{T}} \vec{x}_i} - 1}} + \log{\left(1 - \logistic{\vec{w}^{\mathrm{T}} \vec{x}_i}\right)} } \\ &= \sum_{i = 1}^{N} {\vec{y}_i \log{ e^{\vec{w}^{\mathrm{T}} \vec{x}_i} } + \log{\left(1 - \logistic{\vec{w}^{\mathrm{T}} \vec{x}_i}\right)} } \\ &= \sum_{i = 1}^{N} { \vec{y}_i \vec{w}^{\mathrm{T}} \vec{x}_i + \log{\left(\logistic{-\vec{w}^{\mathrm{T}} \vec{x}_i}\right)} } \label{eq:logistic-likelihood} \end{align} \]

Equation \(\eqref{eq:logistic-likelihood}\) translates into the following R function:

# Eq. above, term by term: y_i * w'x_i - log(1 + exp(w'x_i)); log1p(z) computes log(1 + z)
log.likelihood <- function(w, x, y) { sum( y * x %*% w - log1p(exp(x %*% w)) ) }

To take the derivative of \(\eqref{eq:logistic-likelihood}\), note that: \[ \begin{align} \frac{ \partial \log \logistic{ - \vec{w}^{\mathrm{T}} \vec{x}_i }}{\partial \vec{w}} &= \logistic{-\vec{w}^{\mathrm{T}} \vec{x}_i}^{-1} \frac{\partial \frac{1}{1 + e^{\vec{w}^{\mathrm{T}} \vec{x}_i}} }{\partial \vec{w}} \\ &= - \logistic{-\vec{w}^{\mathrm{T}} \vec{x}_i}^{-1} \logistic{-\vec{w}^{\mathrm{T}} \vec{x}_i}^{2} \frac{ \partial \left( 1 + e^{\vec{w}^{\mathrm{T}} \vec{x}_i} \right) }{\partial \vec{w}} \\ &= - \logistic{-\vec{w}^{\mathrm{T}} \vec{x}_i} \left(e^{\vec{w}^{\mathrm{T}} \vec{x}_i}\right) \vec{x}_i \\ &= - \frac{1}{1 + e^{\vec{w}^{\mathrm{T}} \vec{x}_i}} e^{\vec{w}^{\mathrm{T}} \vec{x}_i} \vec{x}_i\\ &= - \logistic{\vec{w}^{\mathrm{T}} \vec{x}_i} \vec{x}_i \end{align} \]

We now obtain the gradient function of the log-likelihood function for the Bayesian formulation of a logistic regression: \(\fun{gradient}{\vec{w}, \vec{x}, \vec{y}} = \sum_{i = 1}^{N} {\vec{x}_i \left( \vec{y_i} - \logistic{\vec{w}^{\mathrm{T}} \vec{x}_i} \right)}\). In R, this can be coded as the following function:

# each row x_i is weighted by the scalar residual (y_i - logistic(w'x_i)) and summed
log.likelihood.gradient <- function(w, x, y) { colSums( x * c(y - logistic(x %*% w)) ) }

Note that the Hessian (matrix of second derivatives) of the log-likelihood function is negative semi-definite everywhere: \[ \frac{\partial^2 \log \mathcal{L}\left(\vec{w}, \vec{x}, \vec{y}\right)}{ \partial \vec{w} \, \partial \vec{w}^{\mathrm{T}} } = - \sum_{i = 1}^{N} { \frac{ e^{\vec{w}^{\mathrm{T}} \vec{x}_i } }{ { \left( 1 + e^{\vec{w}^{\mathrm{T}} \vec{x}_i } \right) }^2 } \vec{x}_i \vec{x}_i^{\mathrm{T}} } \] which implies that the log-likelihood function is concave, so the gradient ascent algorithm is guaranteed to reach its global maximum.

We can now apply the gradient ascent algorithm to find the coefficient vector \(\vec{w}\) that maximises the log-likelihood function:

x <- data[, 2:4] %>% mutate(bias = 1) %>% data.matrix
y <- data[, 1] %>% data.matrix
gradient.ascent(function(w) {log.likelihood.gradient(w, x, y)}, function(w) {log.likelihood(w, x, y)}, learning.rate = 100, tolerance = .0001, init = c(0, 0, 0, 0))
##         gre         gpa        rank        bias 
##  0.00229396  0.77701357 -0.56003139 -3.44954840

It can be seen that the gradient ascent algorithm has arrived at the same set of coefficients as R’s built-in logistic regression function.
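
As a final cross-check (my addition; w.hat is just an illustrative variable name), the maximised log-likelihood should also agree with the one reported by glm, which for a binary logistic model is half the residual deviance shown earlier (\(459.4 / 2 \approx 229.7\)) with a negative sign:

w.hat <- gradient.ascent(function(w) {log.likelihood.gradient(w, x, y)}, function(w) {log.likelihood(w, x, y)}, learning.rate = 100, tolerance = .0001, init = c(0, 0, 0, 0))
log.likelihood(w.hat, x, y)                                              # ≈ -229.7
logLik(glm(admit ~ ., data = data, family = binomial(link = "logit")))   # ≈ -229.7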


  1. A differentiable function is concave if and only if its derivative function is monotonically decreasing.↩︎