Multinomial logistic regression

Problem formulation

Let $\mathbf{x} \in \mathbb{R}^D$ denote an input vector (e.g. the image of a handwritten digit) and $y \in \{1,...,C\}$ the corresponding label (e.g. the digit class).

We assume that there exists a function $f(\cdot; \boldsymbol\theta): \mathbb{R}^D \mapsto [0,1]^C$ parametrized by $\boldsymbol\theta \in \mathbb{R}^P$ such that:

$$p(y=k|\mathbf{x} ; \boldsymbol\theta) = f_k(\mathbf{x}; \boldsymbol\theta), \qquad \forall k \in \{1,...,C\},$$

where $f_k(\mathbf{x}; \boldsymbol\theta)$ is the $k$-th entry of $f(\mathbf{x}; \boldsymbol\theta) \in [0,1]^C$.

Our goal is to build a model $f(\mathbf{x}; \boldsymbol\theta)$ such that the prediction

$$ \hat{y} = \arg\max_{k \in \{1,...,C\}} p(y=k|\mathbf{x} ; \boldsymbol\theta) = \arg\max_{k \in \{1,...,C\}} f_k(\mathbf{x}; \boldsymbol\theta)$$

is as close as possible to the true label $y$.

Model

Using the softmax function, our model $f_k(\mathbf{x}; \boldsymbol\theta)$ is defined as follows:

$$ f_k(\mathbf{x}; \boldsymbol\theta) = \frac{\exp(\boldsymbol\theta_k^\top \mathbf{x} + b_k)}{ \sum_{m=1}^C \exp(\boldsymbol\theta_m^\top \mathbf{x} + b_m)}, $$

where $\boldsymbol\theta = \{\boldsymbol\theta_k \in \mathbb{R}^D, b_k \in \mathbb{R}\}_{k=1}^{C}$ are the model parameters.

This is the multinomial logistic regression model. Despite the word "regression" in its name, it is a model for solving classification problems.

Multinomial logistic regression is considered a linear method because the class scores are affine combinations of the input features. The softmax function, which is itself nonlinear, is only introduced to ensure that $f_k(\mathbf{x}; \boldsymbol\theta) \ge 0$ and $\sum_{k=1}^C f_k(\mathbf{x}; \boldsymbol\theta) = 1$, the properties required to interpret the output as a probability distribution over the classes.

Note that the bias terms $b_k$ can be absorbed into the weight vectors $\boldsymbol\theta_k$ if we append to the input vector $\mathbf{x}$ an additional dimension equal to 1. The model thus simplifies to:

$$ f_k(\mathbf{x}; \boldsymbol\theta) = \frac{\exp(\boldsymbol\theta_k^\top \mathbf{x})}{ \sum_{m=1}^C \exp(\boldsymbol\theta_m^\top \mathbf{x})}. $$

In the following, we assume that the data dimension $D$ already includes this additional constant dimension.
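To make the model concrete, here is a minimal NumPy sketch (the names softmax_model, X and theta are ours, not necessarily those used later in the notebook). It also shows how the constant dimension can be appended to the inputs, and it subtracts the row-wise maximum of the scores before exponentiating, a standard trick that does not change the result since the softmax is invariant to shifting all scores by the same constant.

```python
import numpy as np

def softmax_model(X, theta):
    """f(x; theta) for a batch of inputs.
    X     : (N, D) array of inputs (constant dimension already included)
    theta : (C, D) array of parameters, one row per class
    Returns an (N, C) array whose rows are valid probability vectors."""
    scores = X @ theta.T                          # (N, C), entry (i, k) is theta_k^T x_i
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability (shift invariance)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Appending the constant dimension to raw inputs:
X_raw = np.random.randn(5, 3)                     # 5 samples, 3 raw features
X = np.hstack([X_raw, np.ones((5, 1))])           # now D = 4 includes the constant 1
theta = np.random.randn(4, 4)                     # C = 4 classes, D = 4
probs = softmax_model(X, theta)                   # each row of probs sums to 1
```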

Loss function

We assume that we have access to a training dataset of $N$ i.i.d. samples $\{(\mathbf{x}_i, y_i) \in \mathbb{R}^D \times \{1,...,C\}\}_{i=1}^N$: we are in a supervised learning framework.

The empirical risk is defined by:

$$ \mathcal{L}(\boldsymbol\theta) = \frac{1}{N} \sum_{i=1}^N \ell(y_i, f(\mathbf{x}_i; \boldsymbol\theta)), $$

where $\ell$ is a loss function measuring the discrepancy between the true label $y_i$ and the prediction $f(\mathbf{x}_i; \boldsymbol\theta)$. In multinomial logistic regression, we use the negative log-likelihood (NLL) as the loss function.

Let us define the one-hot vector $\mathbf{t}_i \in \{0,1\}^C$ encoding the true label $y_i$:

$$ t_{i,k} = [\mathbf{t}_i]_k = \begin{cases} 1 \hspace{.25cm} \text{if } y_i =k \\ 0 \hspace{.25cm} \text{otherwise } \end{cases}.$$

It is just another representation of the label.
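As an illustration, the one-hot encoding can be computed in one line with NumPy (a small sketch assuming labels in $\{1,...,C\}$, hence the index shift by 1):

```python
import numpy as np

y = np.array([2, 1, 3, 2])     # labels y_i in {1, ..., C}
C = 3
T = np.eye(C)[y - 1]           # (N, C) array: row i is the one-hot vector t_i
# T = [[0, 1, 0],
#      [1, 0, 0],
#      [0, 0, 1],
#      [0, 1, 0]]
```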

The NLL loss function is defined by:

$$ \begin{aligned} \ell(y_i, f(\mathbf{x}_i; \boldsymbol\theta)) &= - \ln p(\mathbf{x}_i,y_i; \boldsymbol\theta) \\ &= - \ln p(y_i | \mathbf{x}_i ; \boldsymbol\theta) - \ln p(\mathbf{x}_i) \\ &= - \ln \prod_{k=1}^{C} p(y_i=k | \mathbf{x}_i ; \boldsymbol\theta)^{t_{i,k}} + cst\\ &= - \ln \prod_{k=1}^{C} f_k(\mathbf{x}_i; \boldsymbol\theta)^{t_{i,k}} + cst\\ &= - \sum_{k=1}^{C} t_{i,k} \ln f_k(\mathbf{x}_i; \boldsymbol\theta) + cst. \end{aligned} $$

Ignoring the constant, the complete loss function is therefore given by:

$$ \mathcal{L}(\boldsymbol\theta) = - \frac{1}{N} \sum_{i=1}^N \sum_{k=1}^{C} t_{i,k} \ln f_k(\mathbf{x}_i; \boldsymbol\theta). $$

This is called the cross-entropy loss function.
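A direct NumPy translation of this formula could look as follows (a sketch with our own variable names; the small epsilon avoids evaluating $\ln 0$ when a predicted probability underflows to zero):

```python
import numpy as np

def cross_entropy_loss(T, F, eps=1e-12):
    """Average cross-entropy between one-hot targets T (N, C)
    and predicted class probabilities F (N, C)."""
    return -np.mean(np.sum(T * np.log(F + eps), axis=1))

# Example: a confident correct prediction yields a small loss.
T = np.array([[0., 1., 0.]])
F = np.array([[0.05, 0.9, 0.05]])
loss = cross_entropy_loss(T, F)   # = -ln(0.9), roughly 0.105
```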

Optimization algorithm

To estimate the model parameters $\boldsymbol\theta$, we have to minimize the loss function $\mathcal{L}(\boldsymbol\theta)$. To do so, we can use the gradient descent algorithm, an iterative algorithm that repeats the update:

$$ \boldsymbol\theta \leftarrow \boldsymbol\theta - \eta \nabla \mathcal{L}(\boldsymbol\theta), $$

where $\eta > 0$ is the learning rate. Both the learning rate and the initialization of the parameters have a critical influence on the behavior of the algorithm.

We will perform this update independently for all $\boldsymbol\theta_j \in \boldsymbol\theta$, $j \in \{1,...,C\}$, so we need to compute

$$ \frac{\partial \mathcal{L}}{\partial \boldsymbol\theta_j} = - \frac{1}{N} \sum_{i=1}^N \sum_{k=1}^{C} t_{i,k} \frac{\partial}{\partial \boldsymbol\theta_j} \ln f_k(\mathbf{x}_i; \boldsymbol\theta). $$
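To illustrate the update rule itself, independently of the logistic regression model, here is a minimal gradient descent loop on a toy quadratic objective (our own toy example; grad plays the role of $\nabla \mathcal{L}(\boldsymbol\theta)$):

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, n_iter=100):
    """Generic gradient descent: theta <- theta - eta * grad(theta)."""
    theta = theta0.copy()
    for _ in range(n_iter):
        theta = theta - eta * grad(theta)
    return theta

# Toy objective L(theta) = ||theta||^2 / 2, whose gradient is simply theta.
# Starting from (3, -2), the iterates converge towards the minimizer (0, 0).
theta_hat = gradient_descent(lambda th: th, theta0=np.array([3.0, -2.0]))
```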

Exercise 1 - Theoretical work

You can provide handwritten answers; there is no need to write them in LaTeX (but you can if you want).

Question 1) Using the model's definition, give the expression of $\displaystyle \frac{\partial}{\partial \boldsymbol\theta_j} \ln f_k(\mathbf{x}; \boldsymbol\theta)$. You need to consider separately the cases where $j=k$ and $j \neq k$. Your answer will involve the indicator function $\mathbb{1}_{\{j=k\}}$ which is equal to $1$ if $j=k$, $0$ otherwise. You will use the chain rule of derivatives and the following formula:

$$ \frac{\partial}{\partial \mathbf{v}} \mathbf{v}^\top \mathbf{A} = \mathbf{A}, $$

which holds for any vector $\mathbf{v}$ and matrix $\mathbf{A}$ of appropriate dimension.

Question 2) Show that $\displaystyle \frac{\partial}{\partial \boldsymbol\theta_j} \mathcal{L}(\boldsymbol\theta) = \frac{1}{N} \sum_{i=1}^N (f_j(\mathbf{x}_i; \boldsymbol\theta) - t_{i,j}) \mathbf{x}_i$ using the result of the previous question and the definition of $t_{i,k}$.

Exercise 2 - Practical work

Dataset

We start by loading a dataset of handwritten digits.

We can look at the first 5 training images.

Question 1) Look at the dimensions of the variables x_train, x_val, x_test and y_train, y_val, y_test and briefly describe the dataset.


Answer


Prediction

Question 2) In the next cell, you have to complete the function prediction to implement the prediction model $f(\mathbf{x}; \boldsymbol\theta) \in [0,1]^C$, $\mathbf{x} \in \mathbb{R}^D$.

This function takes as input a batch of input vectors (an $(N, D)$ array) and the model parameters.

It should return a $(N, C)$ array of class probabilities.

To sum over one dimension of a numpy array you will use numpy.sum(a, axis=None, dtype=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>).
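For example, the keepdims=True argument keeps the summed axis as a dimension of size 1, so that the result broadcasts correctly when normalizing the rows of an $(N, C)$ array:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
row_sums = np.sum(A, axis=1, keepdims=True)   # shape (2, 1) instead of (2,)
A_normalized = A / row_sums                   # each row now sums to 1
```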


Solution


Let's try the prediction function on one example using random parameters. We will use the above-defined function proba_to_class to compute the argmax of the probabilities.

Cross-entropy loss and gradient

The cross_entropy function in the next cell computes the average cross-entropy between the targets (encoded as one-hot vectors) and the predictions.

Question 3) You have to complete the cross_entropy_grad function that computes the gradient of the cross entropy loss. It takes as input:

It should return a $(C, D)$ array containing the partial derivatives of the cross-entropy loss w.r.t. each scalar parameter (row $j$ corresponds to $\partial \mathcal{L} / \partial \boldsymbol\theta_j$).
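As a hint on the expected vectorization, the expression derived in Exercise 1, Question 2 can be implemented in a few lines (a sketch with our own argument names, which may differ from those expected by the notebook):

```python
import numpy as np

def cross_entropy_grad_sketch(X, T, F):
    """Gradient of the average cross-entropy w.r.t. theta.
    X : (N, D) inputs, T : (N, C) one-hot targets, F : (N, C) predicted probabilities.
    Returns a (C, D) array whose row j is (1/N) * sum_i (f_j(x_i) - t_{i,j}) x_i."""
    N = X.shape[0]
    return (F - T).T @ X / N
```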


Solution


Let's try your implementation.

Learning with gradient descent

Question 4) You have to complete the next cell to implement the gradient descent learning algorithm using the above-defined functions: prediction, cross_entropy_grad, cross_entropy, and proba_to_class.

After each epoch (iteration), you will compute the cross-entropy on the training and validation sets and append them to the lists ce_train and ce_val. You will also compute the accuracy using the accuracy_score function of scikit-learn and append it to the lists acc_train and acc_val.

You will implement early stopping to automatically stop the iterative learning algorithm when the cross-entropy loss has not decreased for a certain number of consecutive epochs, specified by the patience variable.
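The patience logic can be sketched as follows (a minimal, self-contained illustration with our own names; in the actual notebook, val_losses would be replaced by the cross-entropy values computed after each epoch):

```python
import numpy as np

def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch at which training stops: the first epoch after which
    the monitored loss has not improved for `patience` consecutive epochs."""
    best = np.inf
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

# With this dummy loss curve, training stops at epoch 5
# (no improvement over 0.7 during epochs 3, 4 and 5).
stop = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.69], patience=3)
```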


Solution


In the next cell we plot the loss function and the accuracy across epochs for the training and validation sets.

Question 5) What do you observe?


Answer


Question 6) Compare the final performance on the training and test sets. How does the performance change when you vary the number of examples in the training dataset?


Answer