Multinomial logistic regression

Problem formulation

Let $\mathbf{x} \in \mathbb{R}^D$ denote an input vector (e.g. the image of a handwritten digit) and $y \in \{1,...,C\}$ the corresponding label (e.g. the digit class).

We assume that there exists a function $f(\cdot; \boldsymbol\theta): \mathbb{R}^D \mapsto [0,1]^C$ parametrized by $\boldsymbol\theta \in \mathbb{R}^P$ such that:

$$p(y=k|\mathbf{x} ; \boldsymbol\theta) = f_k(\mathbf{x}; \boldsymbol\theta), \qquad \forall k \in \{1,...,C\},$$

where $f_k(\mathbf{x}; \boldsymbol\theta)$ is the $k$-th entry of $f(\mathbf{x}; \boldsymbol\theta) \in [0,1]^C$.

Our goal is to build a model $f(\mathbf{x}; \boldsymbol\theta)$ such that the prediction

$$ \hat{y} = \arg\max_{k \in \{1,...,C\}} p(y=k|\mathbf{x} ; \boldsymbol\theta) = \arg\max_{k \in \{1,...,C\}} f_k(\mathbf{x}; \boldsymbol\theta)$$

is as close as possible to the true label $y$.

Model

Using the softmax function, our model $f_k(\mathbf{x}; \boldsymbol\theta)$ is defined as follows:

$$ f_k(\mathbf{x}; \boldsymbol\theta) = \frac{\exp(\boldsymbol\theta_k^\top \mathbf{x} + b_k)}{ \sum_{m=1}^C \exp(\boldsymbol\theta_m^\top \mathbf{x} + b_m)}, $$

where $\boldsymbol\theta = \{\boldsymbol\theta_k \in \mathbb{R}^D, b_k \in \mathbb{R}\}_{k=1}^{C}$ are the model parameters.

This is the multinomial logistic regression model. Despite the word "regression" in its name, it is a model for solving classification problems.

Multinomial logistic regression is considered a linear method because the class scores are affine combinations of the input features. The softmax function, which is itself nonlinear, is only introduced to ensure that $f_k(\mathbf{x}; \boldsymbol\theta) \ge 0$ and $\sum_{k=1}^C f_k(\mathbf{x}; \boldsymbol\theta) = 1$, the properties required to interpret the output as a probability distribution over the classes.

Note that the bias terms $b_k$ can be absorbed into the weight vectors $\boldsymbol\theta_k$ if we append to the input vector $\mathbf{x}$ an additional dimension equal to 1. The model thus simplifies to:

$$ f_k(\mathbf{x}; \boldsymbol\theta) = \frac{\exp(\boldsymbol\theta_k^\top \mathbf{x})}{ \sum_{m=1}^C \exp(\boldsymbol\theta_m^\top \mathbf{x})}. $$

In the following, we assume that the data dimension $D$ already includes this additional constant dimension.
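To make the model concrete, here is a minimal NumPy sketch (the names softmax_model, X and theta are ours, not necessarily those used later in the notebook). It also shows how the constant dimension can be appended to the inputs, and it subtracts the row-wise maximum of the scores before exponentiating, a standard trick that does not change the result since the softmax is invariant to shifting all scores by the same constant.

```python
import numpy as np

def softmax_model(X, theta):
    """f(x; theta) for a batch of inputs.
    X     : (N, D) array of inputs (constant dimension already included)
    theta : (C, D) array of parameters, one row per class
    Returns an (N, C) array whose rows are valid probability vectors."""
    scores = X @ theta.T                          # (N, C), entry (i, k) is theta_k^T x_i
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability (shift invariance)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Appending the constant dimension to raw inputs:
X_raw = np.random.randn(5, 3)                     # 5 samples, 3 raw features
X = np.hstack([X_raw, np.ones((5, 1))])           # now D = 4 includes the constant 1
theta = np.random.randn(4, 4)                     # C = 4 classes, D = 4
probs = softmax_model(X, theta)                   # each row of probs sums to 1
```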

Loss function

We assume that we have access to a training dataset of $N$ i.i.d. samples $\{(\mathbf{x}_i, y_i) \in \mathbb{R}^D \times \{1,...,C\}\}_{i=1}^N$: we are in a supervised learning framework.

The empirical risk is defined by:

$$ \mathcal{L}(\boldsymbol\theta) = \frac{1}{N} \sum_{i=1}^N \ell(y_i, f(\mathbf{x}_i; \boldsymbol\theta)), $$

where $\ell$ is a loss function measuring the discrepancy between the true label $y_i$ and the prediction $f(\mathbf{x}_i; \boldsymbol\theta)$. In multinomial logistic regression, we use the negative log-likelihood (NLL) as the loss function.

Let us define the one-hot vector $\mathbf{t}_i \in \{0,1\}^C$ encoding the true label $y_i$:

$$ t_{i,k} = [\mathbf{t}_i]_k = \begin{cases} 1 \hspace{.25cm} \text{if } y_i =k \\ 0 \hspace{.25cm} \text{otherwise } \end{cases}.$$

It is just another representation of the label.
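As an illustration, the one-hot encoding can be computed in one line with NumPy (a small sketch assuming labels in $\{1,...,C\}$, hence the index shift by 1):

```python
import numpy as np

y = np.array([2, 1, 3, 2])     # labels y_i in {1, ..., C}
C = 3
T = np.eye(C)[y - 1]           # (N, C) array: row i is the one-hot vector t_i
# T = [[0, 1, 0],
#      [1, 0, 0],
#      [0, 0, 1],
#      [0, 1, 0]]
```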

The NLL loss function is defined by:

$$ \begin{aligned} \ell(y_i, f(\mathbf{x}_i; \boldsymbol\theta)) &= - \ln p(\mathbf{x}_i,y_i; \boldsymbol\theta) \\ &= - \ln p(y_i | \mathbf{x}_i ; \boldsymbol\theta) - \ln p(\mathbf{x}_i) \\ &= - \ln \prod_{k=1}^{C} p(y_i=k | \mathbf{x}_i ; \boldsymbol\theta)^{t_{i,k}} + cst\\ &= - \ln \prod_{k=1}^{C} f_k(\mathbf{x}_i; \boldsymbol\theta)^{t_{i,k}} + cst\\ &= - \sum_{k=1}^{C} t_{i,k} \ln f_k(\mathbf{x}_i; \boldsymbol\theta) + cst. \end{aligned} $$

Ignoring the constant, the complete loss function is therefore given by:

$$ \mathcal{L}(\boldsymbol\theta) = - \frac{1}{N} \sum_{i=1}^N \sum_{k=1}^{C} t_{i,k} \ln f_k(\mathbf{x}_i; \boldsymbol\theta). $$

This is called the cross-entropy loss function.
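A direct NumPy translation of this formula could look as follows (a sketch with our own variable names; the small epsilon avoids evaluating $\ln 0$ when a predicted probability underflows to zero):

```python
import numpy as np

def cross_entropy_loss(T, F, eps=1e-12):
    """Average cross-entropy between one-hot targets T (N, C)
    and predicted class probabilities F (N, C)."""
    return -np.mean(np.sum(T * np.log(F + eps), axis=1))

# Example: a confident correct prediction yields a small loss.
T = np.array([[0., 1., 0.]])
F = np.array([[0.05, 0.9, 0.05]])
loss = cross_entropy_loss(T, F)   # = -ln(0.9), roughly 0.105
```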

Optimization algorithm

To estimate the model parameters $\boldsymbol\theta$, we have to minimize the loss function $\mathcal{L}(\boldsymbol\theta)$. To do so, we can use the gradient descent algorithm, an iterative algorithm that repeats the update:

$$ \boldsymbol\theta \leftarrow \boldsymbol\theta - \eta \nabla \mathcal{L}(\boldsymbol\theta), $$

where $\eta > 0$ is the learning rate. Both the learning rate and the initialization of the parameters have a critical influence on the behavior of the algorithm.

We will perform this update independently for all $\boldsymbol\theta_j \in \boldsymbol\theta$, $j \in \{1,...,C\}$, so we need to compute

$$ \frac{\partial \mathcal{L}}{\partial \boldsymbol\theta_j} = - \frac{1}{N} \sum_{i=1}^N \sum_{k=1}^{C} t_{i,k} \frac{\partial}{\partial \boldsymbol\theta_j} \ln f_k(\mathbf{x}_i; \boldsymbol\theta). $$
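To illustrate the update rule itself, independently of the logistic regression model, here is a minimal gradient descent loop on a toy quadratic objective (our own toy example; grad plays the role of $\nabla \mathcal{L}(\boldsymbol\theta)$):

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, n_iter=100):
    """Generic gradient descent: theta <- theta - eta * grad(theta)."""
    theta = theta0.copy()
    for _ in range(n_iter):
        theta = theta - eta * grad(theta)
    return theta

# Toy objective L(theta) = ||theta||^2 / 2, whose gradient is simply theta.
# Starting from (3, -2), the iterates converge towards the minimizer (0, 0).
theta_hat = gradient_descent(lambda th: th, theta0=np.array([3.0, -2.0]))
```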

Exercise 1 - Theoretical work

You can provide handwritten answers; there is no need to write them in LaTeX (but you can if you want).

Question 1) Using the model's definition, give the expression of $\displaystyle \frac{\partial}{\partial \boldsymbol\theta_j} \ln f_k(\mathbf{x}; \boldsymbol\theta)$. You need to consider separately the cases where $j=k$ and $j \neq k$. Your answer will involve the indicator function $\mathbb{1}_{\{j=k\}}$ which is equal to $1$ if $j=k$, $0$ otherwise. You will use the chain rule of derivatives and the following formula:

$$ \frac{\partial}{\partial \mathbf{v}} \mathbf{v}^\top \mathbf{A} = \mathbf{A}, $$

which holds for any vector $\mathbf{v}$ and matrix $\mathbf{A}$ of appropriate dimension.

Question 2) Show that $\displaystyle \frac{\partial}{\partial \boldsymbol\theta_j} \mathcal{L}(\boldsymbol\theta) = \frac{1}{N} \sum_{i=1}^N (f_j(\mathbf{x}_i; \boldsymbol\theta) - t_{i,j}) \mathbf{x}_i$ using the result of the previous question and the definition of $t_{i,k}$.

Exercise 2 - Practical work

Dataset

We start by loading a dataset of handwritten digits.

We can look at the first 5 training images.

Question 1) Look at the dimensions of the variables x_train, x_val, x_test and y_train, y_val, y_test and briefly describe the dataset.


Answer


Prediction

Question 2) In the next cell, you have to complete the function prediction to implement the prediction model $f(\mathbf{x}; \boldsymbol\theta) \in [0,1]^C$, $\mathbf{x} \in \mathbb{R}^D$.

This function takes as input a batch of input vectors (an $(N, D)$ array) and the model parameters.

It should return a $(N, C)$ array of class probabilities.

To sum over one dimension of a numpy array you will use numpy.sum(a, axis=None, dtype=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>).
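For example, the keepdims=True argument keeps the summed axis as a dimension of size 1, so that the result broadcasts correctly when normalizing the rows of an $(N, C)$ array:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
row_sums = np.sum(A, axis=1, keepdims=True)   # shape (2, 1) instead of (2,)
A_normalized = A / row_sums                   # each row now sums to 1
```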


Solution


Let's try the prediction function on one example using random parameters. We will use the above-defined function proba_to_class to compute the argmax of the probabilities.

Cross-entropy loss and gradient

The cross_entropy function in the next cell computes the average cross-entropy between the targets (encoded as one-hot vectors) and the predictions.

Question 3) You have to complete the cross_entropy_grad function that computes the gradient of the cross entropy loss. It takes as input:

It should return a $(C, D)$ array containing the partial derivatives of the cross-entropy loss w.r.t. each scalar parameter (row $j$ corresponds to $\partial \mathcal{L} / \partial \boldsymbol\theta_j$).
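As a hint on the expected vectorization, the expression derived in Exercise 1, Question 2 can be implemented in a few lines (a sketch with our own argument names, which may differ from those expected by the notebook):

```python
import numpy as np

def cross_entropy_grad_sketch(X, T, F):
    """Gradient of the average cross-entropy w.r.t. theta.
    X : (N, D) inputs, T : (N, C) one-hot targets, F : (N, C) predicted probabilities.
    Returns a (C, D) array whose row j is (1/N) * sum_i (f_j(x_i) - t_{i,j}) x_i."""
    N = X.shape[0]
    return (F - T).T @ X / N
```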


Solution


Let's try your implementation.

Learning with gradient descent

Question 4) You have to complete the next cell to implement the gradient descent learning algorithm using the above-defined functions: prediction, cross_entropy_grad, cross_entropy, and proba_to_class.

After each epoch (iteration), you will compute the cross-entropy on the training and validation sets and append them to the lists ce_train and ce_val. You will also compute the accuracy using the accuracy_score function of scikit-learn and append it to the lists acc_train and acc_val.

You will implement early stopping to automatically stop the iterative learning algorithm when the cross-entropy loss has not decreased for a certain number of consecutive epochs, specified by the patience variable.
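The patience logic can be sketched as follows (a minimal, self-contained illustration with our own names; in the actual notebook, val_losses would be replaced by the cross-entropy values computed after each epoch):

```python
import numpy as np

def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch at which training stops: the first epoch after which
    the monitored loss has not improved for `patience` consecutive epochs."""
    best = np.inf
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

# With this dummy loss curve, training stops at epoch 5
# (no improvement over 0.7 during epochs 3, 4 and 5).
stop = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.69], patience=3)
```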


Solution


In the next cell we plot the loss function and the accuracy across epochs for the training and validation sets.

Question 5) What do you observe?


Answer


Question 6) Compare the final performance on the training and test sets. How does the performance change when you vary the number of examples in the training dataset?


Answer