Gaussian mixture model estimation with the EM algorithm

Image credit: Bayesian Learning for Signal Processing, Antoine Deleforge, LVA/ICA 2015 Summer School.

On Tuesday, October 7, 1949, Thomas Bayes visits Oxford University. Upon his arrival, three prankster students throw dozens of small stones at him from the roof. Bayes wants to know which student threw which stone. Determined, he begins to note the 2D position of every single stone on the ground.

Theoretical work

Generative model

For his investigation, Thomas Bayes defines the generative process of the observed data as follows:

He observes a realization of a set of observed random variables denoted by $ \mathbf{x} = \{\mathbf{x}_n \in \mathbb{R}^2\}_{n=1}^N$, where $\mathbf{x}_n$ corresponds to the 2D position of the $n$-th stone.

These observations are generated from a set of latent unobserved random variables denoted by $ \mathbf{z} = \{z_n \in \{1,...,K\} \}_{n=1}^N$, where $z_n$ denotes the identity of the student (among $K=3$ students) who threw the $n$-th stone.

The relationships between the latent and observed variables are defined by their joint distribution, also called complete-data likelihood:

$$ \begin{aligned} p(\mathbf{x}, \mathbf{z}; \theta) &= \prod_{n=1}^N p(\mathbf{x}_n | {z}_n; \theta) p({z}_n; \theta) \\ &= \prod_{n=1}^N \prod_{k=1}^K \left( p(\mathbf{x}_n | {z}_n=k; \theta) p({z}_n=k; \theta) \right)^{\mathbb{1}\{z_n = k\}}, \end{aligned} $$

where $\mathbb{1}\{z_n = k\} = \begin{cases}1 & \text{if } z_n = k \\ 0 & \text{otherwise}\end{cases}$.

The prior over the latent variables follows a categorical distribution: $$ p({z}_n=k; \theta) = \pi_k, \qquad k \in \{1,...,K\}, \qquad \text{with }\, \pi_k > 0\, \text{ and }\, \sum_{k=1}^K \pi_k = 1. $$

The likelihood is Gaussian:

$$ p(\mathbf{x}_n | z_n=k; \theta) = \mathcal{N}(\mathbf{x}_n; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$

with $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \displaystyle \frac{1}{\sqrt{\det(2\pi \boldsymbol\Sigma)}} \exp\left(-\frac 1 2 ({\mathbf x}-{\boldsymbol\mu})^\mathrm{T}{\boldsymbol\Sigma}^{-1}({\mathbf x}-{\boldsymbol\mu})\right).$
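
As a quick sanity check, this density can be evaluated directly in NumPy. Below is a minimal sketch (names and values are illustrative, not part of the notebook), compared against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_log_pdf(x, mu, Sigma):
    """ln N(x; mu, Sigma) for a single point, following the formula above."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)   # ln det(2*pi*Sigma)
    return -0.5 * (logdet + diff @ np.linalg.solve(Sigma, diff))

x = np.array([1.0, -0.5])
mu = np.zeros(2)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_log_pdf(x, mu, Sigma))                     # manual formula
print(multivariate_normal.logpdf(x, mean=mu, cov=Sigma))  # should match
```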

The set of unknown deterministic model parameters is defined by:

$$ \theta = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^K. $$

The complete-data log-likelihood is therefore given by: $$ \ln p(\mathbf{x}, \mathbf{z}; \theta) = \sum_{n=1}^N \sum_{k=1}^K \mathbb{1}\{z_n = k\} \left(\ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right). $$
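
To build intuition about this generative process, here is a small sketch (all parameter values are made up) that simulates "stone positions": each $z_n$ is drawn from the categorical prior, then each $\mathbf{x}_n$ from the corresponding Gaussian component.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 3
pi = np.array([0.5, 0.3, 0.2])                           # categorical prior over the K students
mus = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 3.0]])    # one 2D mean per component
Sigmas = np.stack([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])])

z = rng.choice(K, size=N, p=pi)                                         # latent identities z_n
x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])   # observed positions x_n
print(x.shape)  # (N, 2)
```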

Posterior inference

Exercise 1

Question 1.1

Give the expression of the responsibilities $ r_{n,k} \triangleq p(z_n = k | \mathbf{x}_n; \theta)$.

Question 1.2

How can you interpret the responsibilities?

Question 1.3

In order to compute the responsibilities, it is necessary to estimate the unknown model parameters $\theta$. To do so, we would like to maximize the log-marginal likelihood $\ln p(\mathbf{x}; \theta)$. Give its expression and explain why it cannot be directly optimized.

Expectation-Maximization algorithm

As direct maximum log-marginal likelihood estimation is intractable, we will derive an expectation-maximization (EM) algorithm.

Exercise 2

Question 2.1

Let $\tilde{\theta}$ denote the current estimate of the model parameters. Using the above definition of the complete-data log-likelihood, solve the E-step, that is, compute the so-called $Q$-function, defined by:

$$\begin{aligned} Q(\theta, \tilde{\theta}) &= \mathbb{E}_{p(\mathbf{z} | \mathbf{x}; \tilde{\theta})}[\ln p(\mathbf{x}, \mathbf{z}; \theta)] \end{aligned}$$

Make the dependency on the model parameters $\theta = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^K$ explicit (any constant with respect to these parameters can be omitted).

Hints:

Question 2.2

You now have to solve the M-step, that is, update the model parameters by maximizing $Q(\theta, \tilde{\theta})$ with respect to (w.r.t.) $\theta$. To do so, you will simply set the partial derivatives of $Q(\theta, \tilde{\theta})$ w.r.t. $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$ and $\pi_k$ to zero.

Useful matrix derivation formulas can be found in the appendix at the end of this notebook, or in the Matrix Cookbook.

Question 2.2a

Compute the partial derivative of $Q(\theta, \tilde{\theta})$ w.r.t $\boldsymbol{\mu}_k$ and set it to zero to get the update of $\boldsymbol{\mu}_k$.

$$\begin{aligned} \nabla_{\boldsymbol{\mu}_k} Q(\theta, \tilde{\theta}) &= \\ \end{aligned}$$

You will express the update as a function of $N_k = \sum_{n=1}^N \tilde{r}_{n,k}$. If we interpret $\tilde{r}_{n,k}$ as being equal to 1 if $\mathbf{x}_n$ belongs to component $k$ and 0 otherwise, $N_k$ corresponds to the number of points assigned to cluster $k$.

Hint: $\boldsymbol{\Sigma}_k$ is a covariance matrix so it is symmetric.

Question 2.2b

Compute the partial derivative of $Q(\theta, \tilde{\theta})$ w.r.t $\boldsymbol{\Sigma}_k$ and set it to zero to get the update of $\boldsymbol{\Sigma}_k$.

$$\begin{aligned} \nabla_{\boldsymbol{\Sigma}_k} Q(\theta, \tilde{\theta}) &= \\ \end{aligned}$$

You will express the update as a function of $N_k = \sum_{n=1}^N \tilde{r}_{n,k}$.

Hint: Use the trace trick: $\mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} = tr(\mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\mathbf{x}) = tr(\boldsymbol{\Sigma}^{-1}\mathbf{x}\mathbf{x}^\top) $ and then refer to the matrix derivation formulas in the appendix.
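
If in doubt, the trace trick itself is easy to check numerically (a throwaway sketch with random values):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=2)
A = rng.normal(size=(2, 2))
Sigma = A @ A.T + np.eye(2)                      # a random SPD "covariance"
Sigma_inv = np.linalg.inv(Sigma)

lhs = x @ Sigma_inv @ x                          # x^T Sigma^{-1} x
rhs = np.trace(Sigma_inv @ np.outer(x, x))       # tr(Sigma^{-1} x x^T)
print(np.isclose(lhs, rhs))                      # True
```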

Question 2.2c

The update for $\pi_k$ is obtained by maximizing $Q(\theta, \tilde{\theta})$ under the constraint that $\sum_{k=1}^K \pi_k = 1$. We obtain:

$$ \pi_k = N_k / N, $$

where $N_k = \sum_{n=1}^N \tilde{r}_{n,k}$. The optimal prior probability $p(z_n = k) = \pi_k$ is thus given by the (soft) number of points $N_k$ assigned to cluster $k$ divided by the total number of points $N$.

To obtain this expression, you have to use the method of Lagrange multipliers.
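
In outline (one way to carry out this derivation), keep only the terms of $Q(\theta, \tilde{\theta})$ that depend on the $\pi_k$ and handle the constraint with a multiplier $\lambda$:

$$ \mathcal{L}(\{\pi_k\}, \lambda) = \sum_{n=1}^N \sum_{k=1}^K \tilde{r}_{n,k} \ln \pi_k + \lambda \left( \sum_{k=1}^K \pi_k - 1 \right), \qquad \frac{\partial \mathcal{L}}{\partial \pi_k} = \frac{N_k}{\pi_k} + \lambda = 0. $$

Solving gives $\pi_k = -N_k / \lambda$; summing over $k$ and using $\sum_{k=1}^K \pi_k = 1$ yields $\lambda = -N$, hence $\pi_k = N_k / N$.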

Practical work

The GMM class defined in the previous cell implements a Gaussian mixture model. It has two important methods:

In the following cell, we instantiate this class for our problem.

Exercise 3

Exercise 3.1

Complete the method that computes the log-marginal likelihood (LML) and run the following cell.

The LML is defined as a sum over the data points. You will divide this sum by the number of data points, so that the value of the objective function does not depend on the size of the dataset. In other words, compute the mean instead of the sum.
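
As a reference point, here is a standalone sketch of that computation (not the notebook's GMM method; the function and variable names are illustrative), using scipy.special.logsumexp to sum the per-component terms in the log domain:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mean_log_marginal_likelihood(x, pi, mus, Sigmas):
    """Mean over n of ln p(x_n) = ln sum_k pi_k N(x_n; mu_k, Sigma_k)."""
    N, K = x.shape[0], pi.shape[0]
    log_joint = np.empty((N, K))                 # ln pi_k + ln N(x_n; mu_k, Sigma_k)
    for k in range(K):
        log_joint[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(x, mean=mus[k], cov=Sigmas[k])
    return logsumexp(log_joint, axis=1).mean()   # mean instead of sum, as required above
```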

Exercise 3.2

Complete the method that computes the E-step and run the following cell.
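
For comparison only, the E-step can also be written as a standalone function (independent of the notebook's GMM class, with illustrative names), normalizing the log-joint terms in the log domain for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def responsibilities(x, pi, mus, Sigmas):
    """r_{n,k} proportional to pi_k N(x_n; mu_k, Sigma_k), normalized over k."""
    N, K = x.shape[0], pi.shape[0]
    log_joint = np.empty((N, K))
    for k in range(K):
        log_joint[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(x, mean=mus[k], cov=Sigmas[k])
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
```

Hard cluster assignments, used just below, then follow from `responsibilities(x, pi, mus, Sigmas).argmax(axis=1)`.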

To assign each point to a cluster, we simply look at the argmax of the responsibilities. Run the following cell.

Can you explain what you observe?

Exercise 3.3

Complete the method that computes the M-step and run the following cell.

Hint: Updating the covariance matrix requires computing the outer product of vectors. Look at the numpy newaxis trick notebook to help you (a small demo is also sketched below).
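
Here is a tiny illustration of that trick (the shapes are made up): broadcasting two views of a centered data matrix produces one outer product per data point.

```python
import numpy as np

diff = np.arange(6.0).reshape(3, 2)                        # 3 centered points in 2D, i.e. x_n - mu_k
outer = diff[:, :, np.newaxis] * diff[:, np.newaxis, :]    # shape (3, 2, 2): one outer product per point
print(outer.shape)                                         # (3, 2, 2)
print(np.allclose(outer[0], np.outer(diff[0], diff[0])))   # True
```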

If you got all my encouraging messages, then you are ready to fit the GMM on the data!

In the following cell, we plot the log-marginal likelihood across iterations. It should be monotonically increasing, a nice property of the EM algorithm that is very useful for debugging: if the log-marginal likelihood decreases, there is a bug.
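
A cheap way to exploit this property is to assert it over consecutive iterations; a sketch, assuming `lml` holds the per-iteration values:

```python
import numpy as np

lml = [-3.20, -2.80, -2.70, -2.65]   # made-up per-iteration values, for illustration only
assert (np.diff(lml) >= -1e-10).all(), "log-marginal likelihood decreased: likely a bug in the E- or M-step"
```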

Let's have a look at the results.

We used synthetic data, so we actually also know the true model parameters.

This is not perfect, but not that bad either...

Exercise 3.4

Question 3.4.1 - Re-run the complete pipeline several times after changing the random seed that is used to instantiate the GMM. Explain what you observe and propose a method to choose the best model among the ones you obtained from several runs.

Question 3.4.2 - Briefly explain the commonalities and differences between the K-means algorithm and the EM algorithm for the GMM model. You can use different sources of information, e.g.

It is your responsibility to assess the reliability of the information you can find on the Internet.

Question 3.4.3 - Ask ChatGPT to complete the code below to perform image segmentation with the EM algorithm for the GMM model, using 3 clusters. You may need to install scikit-learn; if you do not want to, ask ChatGPT not to use this package, but the code will be much longer and harder to read. Explain in one sentence how this image segmentation method works and explain what you observe in the result.
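
For reference, the kind of answer one might expect looks roughly like the following sketch (the image loading step is an assumption, since the actual picture used in the notebook is not reproduced here); sklearn.mixture.GaussianMixture fits the GMM with EM and labels each pixel with a component index.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in image: in the notebook this would be the provided picture as an (H, W, 3) array.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))

pixels = image.reshape(-1, 3)                                 # one RGB vector per pixel
gmm = GaussianMixture(n_components=3, random_state=0).fit(pixels)
segmentation = gmm.predict(pixels).reshape(image.shape[:2])   # component index per pixel
```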

Appendix

For $f: \mathbb{R}^{I \times J} \rightarrow \mathbb{R}$, the gradient is defined by $\frac{d}{d \mathbf{X}} f(\mathbf{X}) = \nabla_{\mathbf{X}}f(\mathbf{X}) = \left[\frac{\partial}{\partial X_{ij}} f(\mathbf{X}) \right]_{ij}$.

Below are some useful derivatives:

$$ \frac{\partial \mathbf{x}^T \mathbf{a}}{\partial \mathbf{x}} = \frac{\partial \mathbf{a}^T \mathbf{x}}{\partial \mathbf{x}} = \mathbf{a} $$

$$ \frac{\partial \mathbf{x}^T \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = 2 \mathbf{A} \mathbf{x}, \qquad \text{if } \mathbf{A} \text{ is symmetric} $$

$$ \frac{\partial}{\partial \mathbf{X}} tr(\mathbf{A}\mathbf{X}^T) = \mathbf{A} $$

$$ \frac{\partial}{\partial \mathbf{X}} tr(\mathbf{A}\mathbf{X}) = \mathbf{A}^T $$

$$ \frac{\partial}{\partial \mathbf{X}} tr(\mathbf{X}^{-1}\mathbf{A}) = -(\mathbf{X}^{-1}\mathbf{A}\mathbf{X}^{-1})^T $$

$$ \frac{\partial}{\partial \mathbf{X}} \ln \det(\mathbf{X}) = (\mathbf{X}^{-1})^T = (\mathbf{X}^T)^{-1} $$
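
These identities can also be spot-checked numerically with central finite differences; for instance, for the log-determinant derivative (a throwaway sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
X = A @ A.T + 3 * np.eye(3)           # a well-conditioned positive-definite matrix

analytic = np.linalg.inv(X).T         # (X^{-1})^T
eps = 1e-6
numeric = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = eps                 # perturb a single entry X_{ij}
        numeric[i, j] = (np.log(np.linalg.det(X + E)) - np.log(np.linalg.det(X - E))) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```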