# Tutorial¶

This is a tutorial on elastic net regularized generalized linear models. We will go through the math to setup the penalized negative log-likelihood loss function and the coordinate descent algorithm for optimization.

Here are some other resources from a recent talk.

At present this tutorial does not cover Tikhonov regularization or group lasso, but we look forward to adding more material shortly.

Reference Jerome Friedman, Trevor Hastie and Rob Tibshirani. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, Vol. 33(1), 1-22 [pdf].

# Author: Pavan Ramkumar

import numpy as np
from scipy.special import expit


## GLM with elastic net penalty¶

In the elastic net regularized generalized Linear Model (GLM), we want to solve the following problem.

$\min_{\beta_0, \beta} \frac{1}{N} \sum_{i = 1}^N \ell (y_i, \beta_0 + \beta^T x_i) + \lambda [0.5(1 - \alpha)\| \beta \|_2^2 + \alpha \| \beta \|_1]$

where $$\ell (y_i, \beta_0 + \beta^T x_i)$$ is the negative log-likelihood of an observation $$i$$. Here, we will go through the softplus link function case and show how we can optimize this cost function

## Poisson-like GLM¶

The pyglmnet implementation comes with poisson, binomial, gaussian, and multinomial distributions, but for illustration, we will walk you through a particular adaptation of the canonical Poisson generalized linear model (GLM).

For the Poisson GLM, $$\lambda_i$$ is the rate parameter of an inhomogeneous linear-nonlinear Poisson (LNP) process with instantaneous mean given by:

$\lambda_i = \exp(\beta_0 + \beta^T x_i)$

where $$x_i \in \mathcal{R}^{p \times 1}, i = \{1, 2, \dots, n\}$$ are the observed independent variables (predictors), $$\beta_0 \in \mathcal{R}^{1 \times 1}$$, $$\beta \in \mathcal{R}^{p \times 1}$$ are linear coefficients. $$\lambda_i$$ is also known as the conditional intensity function, conditioned on $$(\beta_0, \beta)$$ and $$q(z) = \exp(z)$$ is the nonlinearity.

For numerical reasons, let’s adopt a stabilizing non-linearity, known as the softplus or the smooth rectifier Dugas et al., 2001, and adopted by Jonathan Pillow’s and Liam Paninski’s groups for neural data analysis. See for instance: Park et al., 2014.

$q(z) = \log(1+\exp(z))$

The softplus prevents $$\lambda$$ in the canonical inverse link function from exploding when the argument to the exponent is large. In this modification, the formulation is no longer an exact LNP, nor an exact GLM, but :math:pmmathcal{L}(beta_0, beta) is still concave (convex) and we can use gradient ascent (descent) to optimize it.

$\lambda_i = q(\beta_0 + \beta^T x_i) = \log(1 + \exp(\beta_0 + \beta^T x_i))$
def qu(z):
"""The non-linearity."""
qu = np.log1p(np.exp(z))

def lmb(self, beta0, beta, X):
"""Conditional intensity function."""
z = beta0 + np.dot(X, beta)
l = self.qu(z)
return l


## Poisson Log-likelihood¶

The likelihood of observing the spike count $$y_i$$ under the Poisson likelihood function with inhomogeneous rate $$\lambda_i$$ is given by:

$\prod_i P(y = y_i) = \prod_i \frac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!}$

The log-likelihood is given by:

$\mathcal{L} = \sum_i \bigg\{y_i \log(\lambda_i) - \lambda_i - log(y_i!)\bigg\}$

However, we are interested in maximizing the log-likelihood as a function of $$\beta_0$$ and $$\beta$$. Thus, we can drop the factorial term:

$\mathcal{L}(\beta_0, \beta) = \sum_i \bigg\{y_i \log(\lambda_i) - \lambda_i\bigg\}$

## Elastic net penalty¶

For large models we need to penalize the log likelihood term in order to prevent overfitting. The elastic net penalty is given by:

$\mathcal{P}_\alpha(\beta) = (1-\alpha)\frac{1}{2} \|\beta\|^2_{\mathcal{l}_2} + \alpha\|\beta\|_{\mathcal{l}_1}$

The elastic net interpolates between two extremes. $$\alpha = 0$$ is known as ridge regression and $$\alpha = 1$$ is known as LASSO. Note that we do not penalize the baseline term $$\beta_0$$.

def penalty(alpha, beta):
"""the penalty term"""
P = 0.5 * (1 - alpha) * np.linalg.norm(beta, 2) + \
alpha * np.linalg.norm(beta, 1)
return P


## Objective function¶

We minimize the objective function:

$J(\beta_0, \beta) = -\mathcal{L}(\beta_0, \beta) + \lambda \mathcal{P}_\alpha(\beta)$

where $$\mathcal{L}(\beta_0, \beta)$$ is the Poisson log-likelihood and $$\mathcal{P}_\alpha(\beta)$$ is the elastic net penalty term and $$\lambda$$ and $$\alpha$$ are regularization parameters.

def loss(beta0, beta, reg_lambda, X, y):
"""Define the objective function for elastic net."""
L = logL(beta0, beta, X, y)
P = penalty(beta)
J = -L + reg_lambda * P
return J


To calculate the gradients of the cost function with respect to $$\beta_0$$ and $$\beta$$, let’s plug in the definitions for the log likelihood and penalty terms from above.

$\begin{split}\begin{eqnarray} J(\beta_0, \beta) &= \sum_i \bigg\{ \log(1 + \exp(\beta_0 + \beta^T x_i))\\ & - y_i \log(\log(1 + \exp(\beta_0 + \beta^T x_i)))\bigg\}\\ & + \lambda(1 - \alpha)\frac{1}{2} \|\beta\|^2_{\mathcal{l_2}} + \lambda\alpha\|\beta\|_{\mathcal{l_1}} \end{eqnarray}\end{split}$

Since we will apply co-ordinate descent, let’s rewrite this cost in terms of each scalar parameter $$\beta_j$$

$\begin{split}\begin{eqnarray} J(\beta_0, \beta) &= \sum_i \bigg\{ \log(1 + \exp(\beta_0 + \sum_j \beta_j x_{ij})) & - y_i \log(\log(1 + \exp(\beta_0 + \sum_j \beta_j x_{ij})))\bigg\}\\ & + \lambda(1-\alpha)\frac{1}{2} \sum_j \beta_j^2 + \lambda\alpha\sum_j \mid\beta_j\mid \end{eqnarray}\end{split}$

Let’s take the derivatives of some big expressions using chain rule. Define $$z_i = \beta_0 + \sum_j \beta_j x_{ij}$$.

For the nonlinearity in the first term $$y = q(z) = \log(1+e^{z(\theta)})$$,

$\begin{split}\begin{eqnarray} \frac{\partial y}{\partial \theta} &= \frac{\partial q}{\partial z}\frac{\partial z}{\partial \theta}\\ & = \frac{e^z}{1+e^z}\frac{\partial z}{\partial \theta}\\ & = \sigma(z)\frac{\partial z}{\partial \theta} \end{eqnarray}\end{split}$

For the nonlinearity in the second term $$y = \log(q(z)) = \log(\log(1+e^{z(\theta)}))$$,

$\begin{split}\begin{eqnarray} \frac{\partial y}{\partial \theta} & = \frac{1}{q(z)}\frac{\partial q}{\partial z}\frac{\partial z}{\partial \theta}\\ & = \frac{\sigma(z)}{q(z)}\frac{\partial z}{\partial \theta} \end{eqnarray}\end{split}$

where $$\dot q(z)$$ happens to be be the sigmoid function,

$\sigma(z) = \frac{e^z}{1+e^z}$

Putting it all together, we have,

$\frac{\partial J}{\partial \beta_0} = \sum_i \sigma(z_i) - \sum_i y_i\frac{\sigma(z_i)}{q(z_i)}$
$\frac{\partial J}{\partial \beta_j} = \sum_i \sigma(z_i) x_{ij} - \sum_i y_i \frac{\sigma(z_i)}{q(z_i)}x_{ij} + \lambda(1-\alpha)\beta_j + \lambda\alpha \text{sgn}(\beta_j)$

def grad_L2loss(beta0, beta, reg_lambda, X, y):
z = beta0 + np.dot(X, beta)
s = expit(z)
q = qu(z)
grad_beta0 = np.sum(s) - np.sum(y * s / q)
np.dot(np.transpose(y * s / q), X)) + \
reg_lambda * (1 - alpha) * beta


Note that this is all we need for a classic batch gradient descent implementation. (for the 'batch-gradient' solver).

However, let’s also derive the Hessian terms that will be useful for second-order optimization methods implemented in the 'cdfast' solver.

## Hessian terms¶

Second-order derivatives can accelerate convergence to local minima by providing optimal step sizes. However, they are expensive to compute.

This is where co-ordinate descent shines. Since we update only one parameter $$\beta_j$$ per step, we can simply use the $$j^{th}$$ diagonal term in the Hessian matrix to perform an approximate Newton update as:

$\beta_j^{t+1} = \beta_j^{t} - \bigg\{\frac{\partial^2 J}{\partial \beta_j^2}\bigg\}^{-1} \frac{\partial J}{\partial \beta_j}$

Let’s use calculus again to compute these diagonal terms. Recall that:

$\begin{split}\begin{eqnarray} \dot q(z) & = \sigma(z)\\ \dot\sigma(z) = \sigma(z)(1-\sigma(z)) \end{eqnarray}\end{split}$

Using these, and applying the product rule

$\frac{\partial}{\partial z}\bigg\{ \frac{\sigma(z)}{q(z)} \bigg\} = \frac{\sigma(z)(1-\sigma(z))}{q(z)} - \frac{\sigma(z)}{q(z)^2}$

Plugging all these in, we get

$\frac{\partial^2 J}{\partial \beta_0^2} = \sum_i \sigma(z_i)(1 - \sigma(z_i)) - \sum_i y_i \bigg\{ \frac{\sigma(z_i) (1 - \sigma(z_i))}{q(z_i)} - \frac{\sigma(z_i)}{q(z_i)^2} \bigg\}$
$\begin{split}\begin{eqnarray} \frac{\partial^2 J}{\partial \beta_j^2} & = \sum_i \sigma(z_i)(1 - \sigma(z_i)) x_{ij}^2 \\ & - \sum_i y_i \bigg\{ \frac{\sigma(z_i) (1 - \sigma(z_i))}{q(z_i)} \\ & - \frac{\sigma(z_i)}{q(z_i)^2} \bigg\}x_{ij}^2 + \lambda(1-\alpha) \end{eqnarray}\end{split}$
def hessian_loss(beta0, beta, alpha, reg_lambda, X, y):
z = beta0 + np.dot(X, beta)
q = qu(z)
s = expit(z)
hess_beta = np.transpose(np.dot(np.transpose(grad_s), X * X)
- np.dot(np.transpose(y * grad_s_by_q), X * X))\
+ reg_lambda * (1-alpha)
return hess_beta0, hess_beta


Also see the cheatsheet for calculus required to derive gradients and Hessians for other distributions.

## Cyclical co-ordinate descent¶

Parameter update step

In cylical coordinate descent with elastic net, we store an active set, $$\mathcal{K}$$, of parameter indices that we update. Since the $$\mathcal{l}_1$$ terms $$|\beta_j|$$ are not differentiable at zero, we use the gradient without the $$\lambda\alpha \text{sgn}(\beta_j)$$ term to update $$\beta_j$$. Let’s call these gradient terms $$\tilde{g}_k$$.

We start by initializing $$\mathcal{K}$$ to contain all parameter indices Let’s say only the $$k^{th}$$ parameter is updated at time step $$t$$.

$\begin{split}\begin{eqnarray} \beta_k^{t} & = \beta_k^{t-1} - (h_k^{t-1})^{-1} \tilde{g}_k^{t-1} \\ \beta_j^{t} & = \beta_j^{t-1}, \forall j \neq k \end{eqnarray}\end{split}$

Next, we apply a soft thresholding step for $$k \neq 0$$ after every update iteration, as follows. $$\beta_k^{t} = \mathcal{S}_{\lambda\alpha}(\beta_k^{t})$$

where

$\begin{split}S_\lambda(x) = \begin{cases} 0 & \text{if} & |x| \leq \lambda\\ \text{sgn}(x)||x|-\lambda| & \text{if} & |x| > \lambda \end{cases}\end{split}$

If $$\beta_k^{t}$$ has been zero-ed out, we remove $$k$$ from the active set.

$\mathcal{K} = \mathcal{K} \setminus \left\{k\right\}$
def prox(X, l):
"""Proximal operator."""
return np.sign(X) * (np.abs(X) - l) * (np.abs(X) > l)


Caching the z update step

Next, we want to update $$\beta_{k+1}$$ at the next time step $$t+1$$. For this we need the gradient and Hessian terms, $$\tilde{g}_{k+1}$$ and :mathh_{k+1}. If we update them instead of recalculating them, we can save on a lot of multiplications and additions. This is possible because we only update one parameter at a time. Let’s calculate how to make these updates.

$z_i^{t} = z_i^{t-1} - \beta_k^{t-1}x_{ik} + \beta_k^{t}x_{ik}$
$z_i^{t} = z_i^{t-1} - (h_k^{t-1})^{-1} \tilde{g}_k^{t-1}x_{ik}$

If $$k = 0$$,

$\tilde{g}_{k+1}^t = \sum_i \sigma(z_i^t) - \sum_i y_i \frac{\sigma(z_i^t)}{q(z_i^t)}$

If $$k > 0$$,

$\begin{split}\begin{eqnarray} \tilde{g}_{k+1}^t & = \sum_i \sigma(z_i^t) x_{i,k+1} - \sum_i y_i \frac{\sigma(z_i^t)}{q(z_i^t)} x_{i,k+1} & + \lambda(1-\alpha)\beta_{k+1}^t \end{eqnarray}\end{split}$
def grad_loss_k(z, beta_k, alpha, rl, xk, y, k):
"""Gradient update for a single coordinate
"""
q = qu(z)
s = expit(z)
if(k == 0):
gk = np.sum(s) - np.sum(y*s/q)
else:
gk = np.sum(s*xk) - np.sum(y*s/q*xk) + rl*(1-alpha)*beta_k
return gk


Hessian update

If $$k = 0$$,

$\begin{split}h_{k+1}^t & = \sum_i \sigma(z_i^t)(1 - \sigma(z_i^t)) \\ & - \sum_i y_i \bigg\{ \frac{\sigma(z_i^t) (1 - \sigma(z_i^t))}{q(z_i^t)} - \frac{\sigma(z_i^t)}{q(z_i^t)^2} \bigg\}\end{split}$

If $$k > 0$$,

$\begin{split}\begin{eqnarray} h_{k+1}^t & = \sum_i \sigma(z_i^t)(1 - \sigma(z_i^t)) x_{i,k+1}^2 \\ & - \sum_i y_i \bigg\{ \frac{\sigma(z_i^t) (1 - \sigma(z_i^t))}{q(z_i^t)} & - \frac{\sigma(z_i^t)}{q(z_i^t)^2} \bigg\}x_{i,k+1}^2 + \lambda(1-\alpha) \end{eqnarray}\end{split}$
def hess_loss_k(z, alpha, rl, xk, y, k):
"""Hessian update for a single coordinate
"""
q = qu(z)
s = expit(z)
if(k == 0):
`
We often find the optimal regularization parameter $$\lambda$$ through cross-validation. Thus, in practice, we fit the model several times over a range of $$\lambda$$‘s $$\{ \lambda_{max} \geq \dots \geq \lambda_0\}$$.
Instead of re-fitting the model each time, we can solve the problem for the most-regularized model ($$\lambda_{max}$$) and then initialize the subsequent model with this solution. The path that each parameter takes through the range of regularization parameters is known as the regularization path, and the trick of initializing each model with the previous model’s solution is known as a warm restart.