class: middle, title-slide count: false .center[ # Non-negative Matrix Factorization
.bold[Simon Leglaive]
.tiny[Artificial Intelligence and Deep Learning course
CentraleSupélec] ] .footnote[ This lecture is adapted from the following presentations: - Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017. - Slim Essid and Alexey Ozerov, [A tutorial on nonnegative matrix factorization with applications to audiovisual content analysis](https://perso.telecom-paristech.fr/essid/teach/NMF_tutorial_ICME-2014.pdf), 2014. ] --- class: middle # Today - Introduction and applications of NMF - Algorithms for NMF - Activity 1 (Jupyter notebook) - NMF for audio signal unmixing - Activity 2 (Jupyter notebook) --- count: false class: center, middle # Introduction, notations and applications --- class: middle Data often come in matrix form. .center[
] .credit[Slide credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- class: middle Data often come in matrix form. .center[
] .credit[Slide credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- class: middle Data often come in matrix form. .center[
] .credit[Slide credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- class: middle Data often come in matrix form. .center[
] .credit[Slide credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- ## Low-rank matrix factorization .center[
] .credit[Image credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] - Data matrix: $\mathbf{V} = [\mathbf{v}\_1, ..., \mathbf{v}\_N] \in \mathbb{R}^{F \times N}$ - Dictionary matrix: $\mathbf{W} = [\mathbf{w}\_1, ..., \mathbf{w}\_K] \in \mathbb{R}^{F \times K}$ - Activation matrix: $\mathbf{H} = [\mathbf{h}\_1, ..., \mathbf{h}\_N] \in \mathbb{R}^{K \times N}$ - Factorization rank chosen such that $K(F + N) \ll FN$. --- ## Low-rank matrix factorization .center[
] .credit[Image credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] $$ \mathbf{v}\_n \approx \hat{\mathbf{v}}\_n := \mathbf{W} \mathbf{h}\_n = \sum\_{k=1}^K \mathbf{w}\_{k} h\_{kn}, \qquad \mathbf{h}\_n = [h\_{1n}, ..., h\_{Kn}]^\top. $$ The column $\mathbf{v}\_n \in \mathbb{R}^F$ of the data matrix is approximated by a .bold[linear combination] of $K$ templates/atoms $\\{\mathbf{w}\_{k} \in \mathbb{R}^F\\}\_{k=1}^K$, determined by the $K$ activation coefficients $\\{ h\_{kn} \in \mathbb{R} \\}\_{k=1}^K$. --- ## Low-rank matrix factorization .center[
] .credit[Image credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] $$ \mathbf{v}\_n \approx \hat{\mathbf{v}}\_n := \mathbf{W} \mathbf{h}\_n = \sum\_{k=1}^K \mathbf{w}\_{k} h\_{kn}, \qquad \mathbf{h}\_n = [h\_{1n}, ..., h\_{Kn}]^\top. $$ The column $\mathbf{v}\_n \in \mathbb{R}^F$ of the data matrix is approximated by a .bold[linear combination] of $K$ templates/atoms $\\{\mathbf{w}\_{k} \in \mathbb{R}^F\\}\_{k=1}^K$, determined by the $K$ activation coefficients $\\{ h\_{kn} \in \mathbb{R} \\}\_{k=1}^K$. --- ## Principal component analysis (PCA) - Let's assume that the data are centered: $\mathbb{E}[\mathbf{v}\_n] = 0$. - PCA is equivalent to matrix factorization under the following constraints: - the dictionary matrix $\mathbf{W}$ is semi-orthogonal: $\mathbf{W}^\top \mathbf{W} = \mathbf{I}$. - the activation matrix is defined by $\mathbf{H} = \mathbf{W}^\top \mathbf{V}$. We thus have $ \mathbf{v}\_n \approx \hat{\mathbf{v}}\_n = \mathbf{W} \mathbf{h}\_n = \mathbf{W} \mathbf{W}^\top \mathbf{v}\_n$. - The dictionary matrix is estimated in the least-squares sense: $$ \underset{\mathbf{W} \in \mathbb{R}^{F \times K}}{\min} \frac{1}{N} \sum\_{n=1}^N \parallel \mathbf{v}\_n - \mathbf{W} \mathbf{W}^\top \mathbf{v}\_n \parallel\_2^2, \qquad s.t. \qquad \mathbf{W}^\top \mathbf{W} = \mathbf{I}.$$ - The solution is given by computing the eigenvalue decomposition of the empirical covariance matrix $\hat{\mathbf{R}}\_{\mathbf{v}\mathbf{v}} := \frac{1}{N} \sum\_{n=1}^N \mathbf{v}\_n \mathbf{v}\_n^\top$ and by taking $\mathbf{W}$ as the matrix built from the $K$ dominant eigenvectors (i.e. associated with the $K$ largest eigenvalues). --- ## Principal component analysis (PCA) ### Application on face images A face is represented as a linear combination of basis images called .bold[eigenfaces]. .center[
] PCA allows for constructive and destructive combinations of basis images. .credit[Image credit: D. D. Lee and H. S. Seung, [Learning the parts of objects by non-negative matrix factorization](https://www.nature.com/articles/44565.pdf?origin=ppub), 1999] --- ## Non-negative matrix factorization (NMF) .center[
] - Examples of non-negative data: pixel intensities, amplitude Fourier spectra, fluorescence spectra, occurrence counts, user scores. - Non-negativity constraints: $\mathbf{V} \in \mathbb{R}\_{\color{brown}+}^{F \times N}$, $\mathbf{W} \in \mathbb{R}\_{\color{brown}+}^{F \times K}$ and $\mathbf{H} \in \mathbb{R}\_{\color{brown}+}^{K \times N}$. - We will also use the notation $\mathbf{V} \ge 0$, $\mathbf{W} \ge 0$ and $\mathbf{H} \ge 0$, which should not be confused with positive semi-definiteness. .citation[D. D. Lee and H. S. Seung, [Learning the parts of objects by non-negative matrix factorization](https://www.nature.com/articles/44565.pdf?origin=ppub), 1999] --- ## Non-negative matrix factorization (NMF) ### Application on face images A face is represented as a linear combination of .bold[non-negative] basis images. .center[
] NMF only allows for constructive combinations of basis images. It leads to intuitive notions of parts of faces in the dictionary matrix (mouth, nose, eyes, chin, etc.). .credit[Image credit: D. D. Lee and H. S. Seung, [Learning the parts of objects by non-negative matrix factorization](https://www.nature.com/articles/44565.pdf?origin=ppub), 1999] --- exclude: true ## NMF for feature learning in face recognition .bold[Training stage] - $\mathbf{V}\_{tr} \in \mathbb{R}\_{\color{brown}+}^{F \times N}$ is a dataset of $N$ face images composed of $F$ pixels. - We estimate $\mathbf{W}\_{tr} \in \mathbb{R}\_{\color{brown}+}^{F \times K}$ and $\mathbf{H}\_{tr} \in \mathbb{R}\_{\color{brown}+}^{K \times N}$ such that $\mathbf{V}\_{tr} \approx \mathbf{W}\_{tr} \mathbf{H}\_{tr}$. - We train a classifier for face recognition using $\mathbf{H}\_{tr}$ as features. Each column of the activation matrix is a $K$-dimensional feature vector characterizing the $n$-th image in the dataset. .bold[Test stage] - We observe a new image $\mathbf{v}\_n \in \mathbb{R}\_+^F$. - We decompose it over the learned dictionary, i.e. we compute $\mathbf{h}\_n \in \mathbb{R}\_+^K$ such that: $$\mathbf{v}\_n \approx \mathbf{W}\_{tr} \mathbf{h}\_n. $$ - We use $\mathbf{h}\_n$ as a feature vector for classification. --- ## NMF for topics analysis in documents .center[
] - $\mathbf{V} \in \mathbb{R}\_{+}^{F \times N}$ is a .bold[document-term] matrix. - Each entry of $\mathbf{V}$ is the occurrence count of a word (among a vocabulary of $F$ words) in a document (among $N$ documents in the dataset). - The dictionary matrix contains .bold[topics] and the activation matrix indicates the importance of each topic for each document. .credit[Image credit: Adapted from Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- exclude: true ## NMF for topics analysis in documents .center[
] .tiny[ - The darkness of the text indicates the frequency of each word for each discovered topic in the dictionary matrix. - The decomposition of the document on the right (eight most frequent words in the encyclopedia entry on the "Constitution of the United States") gives high weight to the upper two topics, and none to the lower two, as shown by the shaded squares in the activation matrix. ] .credit[Image credit: D. D. Lee and H. S. Seung, [Learning the parts of objects by non-negative matrix factorization](https://www.nature.com/articles/44565.pdf?origin=ppub), 1999] --- ## NMF for movie recommendation .center[
] - $\mathbf{V} \in \mathbb{R}\_{+}^{F \times N}$ is a .bold[movie rating] matrix. - Each entry of $\mathbf{V}$ is the rating of a movie (among a catalogue of $F$ movies) by a user (among $N$ users in the dataset). - The dictionary matrix contains movie .bold[genres] (action, comedy, etc.) and the activation matrix indicates how much each user likes each genre. - Netflix Prize (2007-2009): $1,000,000 for the first movie recommendation system able to reduce the error rate by 10% (on a private test dataset). .credit[Image credit: Adapted from Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- ## NMF for unmixing A rank-$K$ NMF can be written as the sum of $K$ rank-1 matrices: $$ \mathbf{V} \approx \hat{\mathbf{V}} = \sum\_{k=1}^K \hat{\mathbf{V}}\_k, \qquad \hat{\mathbf{V}}\_k = ({\mathbf{W}})\_{:,k} ({\mathbf{H}})\_{k,:} $$ .center[
] --- ## NMF for unmixing A rank-$K$ NMF can be written as the sum of $K$ rank-1 matrices: $$ \mathbf{V} \approx \hat{\mathbf{V}} = \sum\_{k=1}^K \hat{\mathbf{V}}\_k, \qquad \hat{\mathbf{V}}\_k = ({\mathbf{W}})\_{:,k} ({\mathbf{H}})\_{k,:} $$ .center[
] --- ## Difficulty: rank of the factorization .grid[ .kol-1-2[ .center[
] ] .kol-1-2[ - The rank of the factorization $K$ is the number of columns (or rows) in the dictionary (or activation) matrix. - Choosing the rank of the factorization is a difficult problem. It depends on the data and the application. ] ] A suitable choice is very important. There is a trade-off between: - .bold[Fidelity]: A greater rank leads to a better approximation of the data matrix. - .bold[Complexity]: A lower rank leads to a less complex model, with fewer parameters to estimate (easier to solve). .credit[Slide credit: Inspired by S. Essid and A. Ozerov, [A tutorial on nonnegative matrix factorization with applications to audiovisual content analysis](https://perso.telecom-paristech.fr/essid/teach/NMF_tutorial_ICME-2014.pdf), 2014.] --- ## Difficulty: the solution is not unique - NMF is an .bold[ill-posed problem]: it has an infinite number of solutions. - Let $\hat{\mathbf{V}} = \mathbf{W} \mathbf{H} $ with $\mathbf{W} \ge 0$ and $\mathbf{H} \ge 0$. - Any invertible matrix $\mathbf{Q}$ such that $\tilde{\mathbf{W}} := \mathbf{W}\mathbf{Q} \ge 0$ and $\tilde{\mathbf{H}} := \mathbf{Q}^{-1}\mathbf{H} \ge 0$ provides an alternative factorization with the same reconstruction: $$ \tilde{\mathbf{V}} = \tilde{\mathbf{W}} \tilde{\mathbf{H}} = \mathbf{W}\mathbf{Q} \mathbf{Q}^{-1}\mathbf{H} = \mathbf{W} \mathbf{H} = \hat{\mathbf{V}}. $$ --- exclude: true - In particular, $\mathbf{Q}$ can be any non-negative generalized permutation matrix, e.g. in dimension 3: $$ \mathbf{Q} = \begin{pmatrix} 0 & 0 & 2 \\\\ 0 & 3 & 0 \\\\ 1 & 0 & 0 \end{pmatrix}. $$ This case simply scales and permutes the template vectors in the dictionary matrix. We generally normalize the columns of $\mathbf{W}$ so that they have unit $L\_2$ norm, for instance. .credit[Slide credit: Inspired by S. Essid and A. Ozerov, [A tutorial on nonnegative matrix factorization with applications to audiovisual content analysis](https://perso.telecom-paristech.fr/essid/teach/NMF_tutorial_ICME-2014.pdf), 2014.] --- exclude: true ## Geometric interpretation NMF assumes that the data are well described by a polyhedral cone generated by the columns of $\mathbf{W}$. .left-column[ .center[
] $$ \mathcal{C}\_{\mathbf{W}} = \left\\{ \sum\_{k=1}^{K} h\_k \mathbf{w}\_k, (h\_1, ..., h\_K) \in \mathbb{R}\_+^K \right\\} $$ ] .right-column[ .center[
] .bold[Problem]: Multiple cones can lead to the same reconstruction of the data. ] .credit[Slide credit: Inspired by S. Essid and A. Ozerov, [A tutorial on nonnegative matrix factorization with applications to audiovisual content analysis](https://perso.telecom-paristech.fr/essid/teach/NMF_tutorial_ICME-2014.pdf), 2014.] --- count: false class: center, middle # Algorithms for NMF --- ## NMF as a constrained optimization problem - Given a data matrix $\mathbf{V} \in \mathbb{R}\_{+}^{F \times N}$, we need to estimate the model parameters $\mathbf{W} \in \mathbb{R}\_{+}^{F \times K}$ and $\mathbf{H} \in \mathbb{R}\_{+}^{K \times N}$. - This is achieved by solving a constrained optimization problem: $$ \underset{\mathbf{W} \in \mathbb{R}\_+^{F \times K}, \mathbf{H} \in \mathbb{R}\_{+}^{K \times N}}{\min} D(\mathbf{V} | \mathbf{W} \mathbf{H}),$$ where $D(\cdot | \cdot)$ is a separable measure of fit: $$\begin{aligned} D(\mathbf{V} | \mathbf{W} \mathbf{H}) =& \sum\_{f=1}^{F} \sum\_{n=1}^{N} d\Big((\mathbf{V})\_{fn} | (\mathbf{W} \mathbf{H})\_{fn} \Big) = \sum\_{f=1}^{F} \sum\_{n=1}^{N} d\left( v\_{fn} \Big| \sum\_{k=1}^K w\_{fk} h\_{kn} \right), \end{aligned}$$ and $d(x | y)$ is a scalar divergence defined for all $x, y \ge 0$: - $d(x | y) \ge 0$ for all $x, y \ge 0$; - $d(x | y) = 0$ if and only if $x = y$. --- ## Popular divergences - Squared Euclidean distance $$ d\_{EUC}(x | y) = \frac{1}{2}(x-y)^2 $$ - Generalized Kullback-Leibler divergence $$ d\_{KL}(x | y) = x \ln\left( \frac{x}{y} \right) + y - x $$ - Itakura-Saito divergence $$ d\_{IS}(x | y) = \frac{x}{y} - \ln\left( \frac{x}{y} \right) - 1$$ --- exclude: true ## Convexity properties
|                | EUC | KL  | IS  |
|----------------|-----|-----|-----|
| Convex on $x$  | yes | yes | yes |
| Convex on $y$  | yes | yes | no  |
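As an illustration, here is a minimal Python sketch that plots the three divergences $d(x \,|\, y)$ as functions of $y$ for a fixed $x$ (the reference value $x=1$, the grid of $y$ values and the use of NumPy/Matplotlib are arbitrary illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt

# Scalar divergences d(x | y) defined on the previous slide.
def d_euc(x, y): return 0.5 * (x - y) ** 2
def d_kl(x, y):  return x * np.log(x / y) + y - x
def d_is(x, y):  return x / y - np.log(x / y) - 1

x = 1.0                        # arbitrary reference value
y = np.linspace(0.05, 4, 500)  # avoid y = 0, where d_KL and d_IS diverge

for d, name in [(d_euc, "EUC"), (d_kl, "KL"), (d_is, "IS")]:
    plt.plot(y, d(x, y), label=name)
plt.xlabel("y")
plt.ylabel("d(x = 1 | y)")
plt.legend()
plt.show()
```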
.center[
] .credit[Image credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- exclude: true ## The $\beta$-divergence Parametric family of divergences: $$ d\_{\beta}(x,y) = \begin{cases} \frac{1}{\beta(\beta - 1)} (x^\beta + (\beta - 1)y^\beta - \beta\, x\, y^{\beta - 1}) & \beta \in \mathbb{R}\setminus\\{0,1\\} \\\\ \frac{x}{y} - \ln\left( \frac{x}{y} \right) - 1 & \beta = 0 \\\\ x \ln\left( \frac{x}{y} \right) + y - x & \beta = 1 \end{cases}$$ Special cases: - squared Euclidean distance for $\beta = 2$ - generalized Kullback-Leibler for $\beta = 1$ - Itakura-Saito divergence for $\beta = 0$ .credit[Slide credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- exclude: true ## The $\beta$-divergence .center[
] .credit[Slide credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- ## NMF algorithm $$ \underset{\mathbf{W} \in \mathbb{R}\_+^{F \times K}, \mathbf{H} \in \mathbb{R}\_{+}^{K \times N}}{\min} D(\mathbf{V} | \mathbf{W} \mathbf{H})$$ - This optimization problem is in general .bold[not jointly convex] in $\mathbf{W}$ and $\mathbf{H}$. - The cost function admits multiple local minima. - It may be individually convex in $\mathbf{W}$ or $\mathbf{H}$ if a proper divergence is chosen. - We resort to a .bold[block-coordinate] approach, alternating the optimization w.r.t $\mathbf{W}$ with $\mathbf{H}$ fixed, and w.r.t $\mathbf{H}$ with $\mathbf{W}$ fixed. - The resulting iterative algorithm is sensitive to the initialization. --- ## NMF algorithm - Updates of $\mathbf{W}$ and $\mathbf{H}$ are equivalent by transposition: $$ \mathbf{V} \approx \mathbf{W} \mathbf{H} \Leftrightarrow \mathbf{V}^\top \approx \mathbf{H}^\top \mathbf{W}^\top $$ - The objective function is separable in the columns of $\mathbf{H}$ (or the rows of $\mathbf{W}$): $$D(\mathbf{V} | \mathbf{W} \mathbf{H}) = \sum\_{n=1}^{N} D(\mathbf{v}\_{n} | \mathbf{W} \mathbf{h}\_{n} ) $$ - We end up with a .bold[non-negative linear regression] problem: $$ \underset{\mathbf{h} \in \mathbb{R}\_+^K}{\min} \left\\{ C(\mathbf{h}) := D(\mathbf{v} | \mathbf{W} \mathbf{h}) \right\\}.$$ - We can solve this optimization problem using a majorize-minimization (MM) algorithm. --- count: false class: middle, center ## Majorization-minimization (MM) --- ### Auxiliary function **.bold[Definition]** - The $\mathbb{R}\_+^K \times \mathbb{R}\_+^K \mapsto \mathbb{R}\_+$ mapping $G(\mathbf{h} | \tilde{\mathbf{h}})$ is an .bold[auxiliary function] to $C(\mathbf{h})$ if and only if: $$ \forall \mathbf{h} \in \mathbb{R}\_+^K,\hspace{.15cm} C(\mathbf{h}) = G(\mathbf{h} | \mathbf{h}) \qquad \text{and} \qquad \forall (\mathbf{h}, \tilde{\mathbf{h}}) \in \mathbb{R}\_+^K \times \mathbb{R}\_+^K,\hspace{.15cm} C(\mathbf{h}) \le G(\mathbf{h} | \tilde{\mathbf{h}}). $$ In other words an auxiliary function $G(\mathbf{h} | \tilde{\mathbf{h}})$ is a .bold[majorizing function] or .bold[upper bound] of $C(\mathbf{h})$ which is tight for $\mathbf{h} = \tilde{\mathbf{h}}$. -- **.bold[MM algorithm]** - The optimization of $C(\mathbf{h})$ can be replaced by an .bold[iterative optimization] of $G(\mathbf{h} | \tilde{\mathbf{h}})$: 1. Initialize $\mathbf{h}^{(0)}$ 2. Iterate $ \mathbf{h}^{(i+1)} = \underset{\mathbf{h} \in \mathbb{R}\_+^K}{\arg\min} \hspace{.25cm} G(\mathbf{h} | \mathbf{h}^{(i)}). $ -- We can show that under this update the cost function $C(\mathbf{h})$ is monotonically decreasing (cf. appendix): $$C(\mathbf{h}^{(i+1)}) \le C(\mathbf{h}^{(i)}).$$ --- ## Majorization-minimization .center[
] .credit[Slide credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- ## Majorization-minimization .center[
] .credit[Slide credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- ## Majorization-minimization .center[
] .credit[Slide credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- ## Majorization-minimization .center[
] .credit[Slide credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- ## Majorization-minimization .center[
] .credit[Slide credit: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- class: center, middle ## The example of IS-NMF --- class: middle .bold[Non-negative linear regression] problem: $$ \underset{\mathbf{h}\_n \in \mathbb{R}\_+^K}{\min} \left\\{ C(\mathbf{h}\_n) := D\_{IS}(\mathbf{v}\_{n} | \mathbf{W} \mathbf{h}\_{n} ) = \sum\_{f=1}^F d\_{IS} \left( v\_{fn} | \sum\_{k=1}^K w\_{fk} h\_{kn} \right) \right\\}.$$ where we recall that $ d\_{IS}(x | y) = \frac{x}{y} - \ln\left( \frac{x}{y} \right) - 1$. We can solve this optimization problem with an MM algorithm: 1. Define an auxiliary function. 2. Minimize the auxiliary function iteratively. --- class: middle, center Due to time constraints, we will not see the calculus details 😞. But they are in the appendix slides 🥳. --- class: middle Using an MM algorithm, we obtain the following **multiplicative update rules**: $$ \mathbf{H} \leftarrow \mathbf{H} \odot \left[ \frac{\mathbf{W}^\top \left[\left( \mathbf{W}\mathbf{H}\right)^{\odot - 2} \odot \mathbf{V} \right] }{\mathbf{W}^\top \left( \mathbf{W}\mathbf{H}\right)^{\odot - 1}} \right]^{\odot 1/2}; \qquad \mathbf{W} \leftarrow \mathbf{W} \odot \left[ \frac{\left[\left( \mathbf{W}\mathbf{H}\right)^{\odot - 2} \odot \mathbf{V} \right] \mathbf{H}^\top }{\left( \mathbf{W}\mathbf{H}\right)^{\odot - 1} \mathbf{H}^\top } \right]^{\odot 1/2}, $$ where $\odot$ denotes element-wise operations, and matrix division is also element-wise. .credit[C. Févotte and J. Idier, [Algorithms for nonnegative matrix factorization with the beta-divergence](https://hal.archives-ouvertes.fr/file/index/docid/526044/filename/beta_nmf_hal.pdf), Neural computation, 2010] --- count: false class: center, middle # Activity #1 ## IS-NMF on a toy example --- count: false class: center, middle # NMF for audio signal unmixing --- ## Spectrogram representation The spectrogram is a 3D representation of an audio signal. .center[
] --- class: center, middle .width-100[] --- class: center, middle The spectrogram is computed from the short-time Fourier transform (STFT) of the signal. --- ## Analysis (direct STFT) The STFT is obtained by computing the discrete Fourier transform (DFT) on short overlapping smoothed windows of the signal. .center[
] --- exclude: true Let $x(t) \in \mathbb{R}$ be a signal defined in the time domain $t \in \mathbb{Z}$. A .bold[frame] is defined for all $n \in \mathbb{Z}$ by: $$ x\_n(t) = x(t + nH) w\_a(t), $$ - $w\_a(t)$ is a smooth analysis window with support (where it is non-zero) $\\{0,..., L\_w-1\\}$; - $H$ is the analysis hop size (increment) such that $L\_w / H \in \mathbb{N}$; - the overlap between two successive frames is equal to $L\_w - H$. The STFT is defined for all time-frequency points $(f,n) \in \\{0,...,F-1\\} \times \mathbb{Z}$ by the DFT of the frame $x\_n(t)$ whose support is the same as the analysis window $w\_a(t)$: $$ X(f,n) = \frac{1}{\sqrt{L\_w}}\sum\_{t \in \mathbb{Z}} x\_n(t) \exp \left( -\imath 2 \pi \frac{f t}{L\_w} \right).$$ The number of frequency bins is $F = L\_w$. --- class: middle For a real-valued signal, the STFT is generally complex-valued: $$ \mathbf{X} \in \mathbb{C}^{F \times N}, $$ where $F$ is the number of frequency bins and $N$ the number of time frames. - the modulus of the STFT $|\mathbf{X}| \in \mathbb{R}_+^{F \times N}$ is called the **magnitude spectrogram**, - the squared modulus $|\mathbf{X}|^{\odot 2} \in \mathbb{R}_+^{F \times N}$ of the STFT is called the **power spectrogram**, - the argument $\arg(\mathbf{X}) \in [0, 2\pi[^{F \times N}$ of the STFT is called the **phase spectrogram**. --- ## Short-time Fourier transform - Synthesis (inverse STFT) The inverse STFT is computed by taking the inverse DFT of the spectra at all time indices and by overlap-add. .center[
] --- exclude: true For each frame $n \in \mathbb{Z}$ of the STFT, we first compute the inverse DFT: $$ \hat{x}\_n(t) = \frac{1}{\sqrt{L\_w}}\sum\_{f = 0}^{F-1} X(f,n) \exp \left( +\imath 2 \pi \frac{f t}{L\_w} \right) $$ The inverse STFT is then computed by overlap-add, for all $t \in \mathbb{Z}$: $$ \hat{x}(t) = \sum\_{n \in \mathbb{Z}} w\_s(t-nH) \hat{x}\_n(t - nH) , $$ where $w\_s(t)$ is a smooth synthesis window with the same support as the analysis window $w\_a(t)$. --- exclude: true ## Perfect reconstruction We can show that perfect reconstruction is achieved, i.e. $\hat{x}(t) = x(t)$ for all $t \in \mathbb{Z}$, if $$\sum\_{n \in \mathbb{Z}} w\_a(t - nH) w\_s(t - nH) = 1. $$ This is the case for instance with the sine window defined by: $$ w\_a(t) = w\_s(t) = \begin{cases} \sin\left(\frac{\pi}{L\_w}\left(t + \frac{1}{2}\right)\right) & \text{ if } 0 \le t \le L\_w - 1 \\\\ 0 & \text{ otherwise} \end{cases}, $$ and with $H = L\_w / 2$. --- ## Analyze - transform - synthesize .left-column[ .center[
] ] .right-column[ Examples of transformation: - time stretching or pitch shifting with the phase vocoder. .tiny[Warning: the following audio example can hurt your hearing] .center[
] - time-frequency masking for source separation (audio unmixing). ] --- ## NMF for audio source separation .center[
] - $\mathbf{V} \in \mathbb{R}\_{+}^{F \times N}$ is an .bold[audio spectrogram] matrix. - Each entry of $\mathbf{V}$ is the magnitude or power of the Fourier coefficient at a given frequency (among $F$ frequencies) and time instant (among $N$ time frames). - The dictionary matrix contains .bold[spectral templates] (e.g. spectra of musical notes) and the activation matrix indicates the activation of these templates over the time frames. - Examples of application: audio source separation, automatic music transcription (from the activation matrix). --- class: middle .center[
] Once the NMF model parameters are estimated from the mixture spectrogram $\mathbf{V} = |\mathbf{X}|$ (or power spectrogram), we can reconstruct the individual sound components by - building rank-1 spectrogram matrices: $$\hat{\mathbf{V}}\_k = ({\mathbf{W}})\_{:,k} ({\mathbf{H}})\_{k,:}$$ - using the phase of the mixture to create a complex-valued STFT matrix: $$\hat{\mathbf{X}}\_k = \hat{\mathbf{V}}\_k \odot \exp(j \arg(\mathbf{X})) $$ - inverting the STFT to compute the waveform. --- ## Piano toy example .center[
] .credit[Credits: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- .center[
] .credit[Credits: Cédric Févotte, [Nonnegative matrix factorisation and friends for audio signal separation](https://www.irit.fr/~Cedric.Fevotte/talks/spars2017.pdf), SPARS summer school, Lisbon, 2017.] --- count: false class: center, middle # Activity #2 ## IS-NMF for audio signal unmixing --- count: false class: center, middle # Appendix --- count: false class: middle, center # Majorization-minimization ## Monotonicity of the cost function --- count: false - The update $ \mathbf{h}^{(i+1)} = \underset{\mathbf{h} \in \mathbb{R}\_+^K}{\arg\min} \hspace{.15cm} G(\mathbf{h} | \mathbf{h}^{(i)})$ implies that $G(\mathbf{h}^{(i+1)} | \mathbf{h}^{(i)}) \le G(\mathbf{h}^{(i)} | \mathbf{h}^{(i)})$. - By definition of the auxiliary function: - $ C(\mathbf{h}^{(i+1)}) \le G(\mathbf{h}^{(i+1)} | \mathbf{h}^{(i)}) $ - $ C(\mathbf{h}^{(i)}) = G(\mathbf{h}^{(i)} | \mathbf{h}^{(i)}) $ - Putting everything together: $C(\mathbf{h}^{(i+1)}) \le G(\mathbf{h}^{(i+1)} | \mathbf{h}^{(i)}) \le G(\mathbf{h}^{(i)} | \mathbf{h}^{(i)}) = C(\mathbf{h}^{(i)}).$ The original cost function thus decreases monotonically. --- count: false class: center, middle # Majorization-minimization for NMF ## The example of IS-NMF --- count: false ## Two important inequalities .bold[Majorization by the tangent] If $g: \mathbb{R}\_+ \mapsto \mathbb{R}$ is a concave function, it can be majorized by its tangent (1st order Taylor expansion) at an arbitrary point $\tilde{x} \in \mathbb{R}\_+$: $$g(x) \le g(\tilde{x}) + g'(\tilde{x}) (x - \tilde{x}), \qquad \forall x \in \mathbb{R}_+, $$ with equality iff $x = \tilde{x}$. -- count: false .bold[Jensen's inequality] If $g: \mathbb{R}\_+ \mapsto \mathbb{R}$ is convex, for any $\\{x\_k \ge 0 \\}\_{k=1}^K$ and $\\{\phi\_k \ge 0 \\}\_{k=1}^K$ such that $\sum\_{k=1}^K \phi\_k = 1$ we have $$g\left(\sum\_{k=1}^K x\_k\right) \le \sum\_{k=1}^K \phi\_k g\left(\frac{x\_k}{\phi\_k} \right),$$ with equality iff $\displaystyle \phi\_k = \frac{x\_k}{ \sum\_{k'=1}^K x\_{k'}}$. --- count: false ## IS-NMF Finding the proper upper bound is the crucial point. Let's study an example with the IS divergence. - The updates of $\mathbf{W}$ and $\mathbf{H}$ are equivalent by transposition, so we only focus on $\mathbf{H}$. - The objective function is separable in the columns of $\mathbf{H}$: $$D\_{IS}(\mathbf{V} | \mathbf{W} \mathbf{H}) = \sum\_{n=1}^{N} D\_{IS}(\mathbf{v}\_{n} | \mathbf{W} \mathbf{h}\_{n} ) $$ - We end up with the .bold[non-negative linear regression] problem: $$ \underset{\mathbf{h}\_n \in \mathbb{R}\_+^K}{\min} \left\\{ C(\mathbf{h}\_n) := D\_{IS}(\mathbf{v}\_{n} | \mathbf{W} \mathbf{h}\_{n} ) \right\\}.$$ - From the definition of the IS divergence: $$ C(\mathbf{h}\_n) = \sum\_{f=1}^F \left[ \frac{v\_{fn}}{\sum\_{k=1}^K w\_{fk} h\_{kn} } \right] + \sum\_{f=1}^F \left[ \ln\left(\sum\_{k=1}^K w\_{fk} h\_{kn}\right) \right] + cst(\mathbf{h}\_n). 
$$ --- count: false ## Concave part - Let us define $$\overset{\frown}{\mathcal{C}}(\mathbf{h}\_n) = \sum\_{f=1}^F \overset{\frown}{\mathcal{C}}\_f(\mathbf{h}\_n), \qquad \overset{\frown}{\mathcal{C}}\_f(\mathbf{h}\_n) = \ln\left(\sum\_{k=1}^K w\_{fk} h\_{kn}\right).$$ - As the composition of a concave ($x \mapsto \ln(x)$) and a linear function, $\overset{\frown}{\mathcal{C}}\_f(\mathbf{h}\_n)$ is concave so we majorize it by its tangent (1st order Taylor expansion) at an arbitrary point $\tilde{\mathbf{h}}\_n \in \mathbb{R}\_+^K$: $$\overset{\frown}{\mathcal{C}}\_f(\mathbf{h}\_n) \le \overset{\frown}{\mathcal{G}}\_f(\mathbf{h}\_n | \tilde{\mathbf{h}}\_n) := \overset{\frown}{\mathcal{C}}\_f(\tilde{\mathbf{h}}\_n) + \nabla^\top \overset{\frown}{\mathcal{C}}\_f(\tilde{\mathbf{h}}\_n) (\mathbf{h}\_n - \tilde{\mathbf{h}}\_n), $$ where $\nabla^\top$ denotes the transpose of the gradient. - The condition $\overset{\frown}{\mathcal{C}}\_f(\mathbf{h}\_n) = \overset{\frown}{\mathcal{G}}\_f(\mathbf{h}\_n | \mathbf{h}\_n)$ is trivially met. --- count: false ## Concave part - Straightforward calculus leads to: $$\begin{aligned} \overset{\frown}{\mathcal{G}}\_f(\mathbf{h}\_n | \tilde{\mathbf{h}}\_n) &:= \overset{\frown}{\mathcal{C}}\_f(\tilde{\mathbf{h}}\_n) + \nabla^\top \overset{\frown}{\mathcal{C}}\_f(\tilde{\mathbf{h}}\_n) (\mathbf{h}\_n - \tilde{\mathbf{h}}\_n) \\\\ &= \ln\left(\sum\_{k=1}^K w\_{fk} \tilde{h}\_{kn}\right) + \sum\_{k=1}^K \left[ \frac{ w\_{fk} }{ \sum\_{k'=1}^K w\_{fk'} \tilde{h}\_{k'n} } ( h\_{kn} - \tilde{h}\_{kn}) \right] \end{aligned}$$ --- count: false ## Convex part - Let us define $$\overset{\smile}{\mathcal{C}}(\mathbf{h}\_n) = \sum\_{f=1}^F \overset{\smile}{\mathcal{C}}\_f(\mathbf{h}\_n), \qquad \overset{\smile}{\mathcal{C}}\_f(\mathbf{h}\_n) = \frac{v\_{fn}}{\sum\_{k=1}^K w\_{fk} h\_{kn} } .$$ - As the composition of a convex ($x \mapsto \frac{1}{x}$ on $\mathbb{R}\_+$) and a linear function, $\overset{\smile}{\mathcal{C}}\_f(\mathbf{h}\_n)$ is convex, so it can be majorized using Jensen's inequality. --- count: false ## Convex part - We introduce a set of auxiliary variables $\\{\phi\_{k, fn} \ge 0 \\}\_{k=1}^K$ defined by: $$ \phi\_{k, fn} = \frac{w\_{fk} \tilde{h}\_{kn}}{\sum\_{k'=1}^K w\_{fk'} \tilde{h}\_{k'n}}.$$ - We have$ \sum\_{k=1}^K \phi\_{k, fn} = 1$ and by Jensen's inequality: $$ \overset{\smile}{\mathcal{C}}\_f(\mathbf{h}\_n) \le \overset{\smile}{\mathcal{G}}\_f(\mathbf{h}\_n | \tilde{\mathbf{h}}\_n) := v\_{fn} \sum\_{k=1}^K \frac{\phi\_{k, fn}^2}{w\_{fk} h\_{kn}} = v\_{fn} \sum\_{k=1}^K \frac{w\_{fk} \tilde{h}\_{kn}^2 }{ h\_{kn} \left(\sum\_{k'=1}^K w\_{fk'} \tilde{h}\_{k'n} \right)^2}. $$ - The condition $\overset{\smile}{\mathcal{C}}\_f(\mathbf{h}\_n) = \overset{\smile}{\mathcal{G}}\_f(\mathbf{h}\_n | \mathbf{h}\_n)$ is trivially met. 
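--- count: false ## Numerical check of the majorization (illustration) The following short Python sketch (not part of the original derivation; the random values and variable names are arbitrary) verifies numerically that the Jensen-based bound of the convex part is tight at $\mathbf{h}\_n = \tilde{\mathbf{h}}\_n$ and majorizes it elsewhere:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
w = rng.random(K) + 0.1        # one row of W, i.e. (w_{f1}, ..., w_{fK})
v = rng.random() + 0.1         # the corresponding data entry v_{fn}
h_tilde = rng.random(K) + 0.1  # current point where the bound is built

def C_f(h):
    # Convex part of the cost: v_{fn} / sum_k w_{fk} h_{kn}
    return v / (w @ h)

def G_f(h):
    # Jensen upper bound built at h_tilde (formula of the previous slide)
    return v * np.sum(w * h_tilde**2 / (h * (w @ h_tilde) ** 2))

assert np.isclose(C_f(h_tilde), G_f(h_tilde))  # tightness at h = h_tilde
for _ in range(1000):
    h = rng.random(K) + 0.1
    assert C_f(h) <= G_f(h) + 1e-12             # majorization everywhere
print("Jensen majorization verified on random draws.")
```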
--- count: false - Putting everything together, we end up with the following auxiliary function for $C(\mathbf{h}\_n)$: $$ \begin{aligned} G(\mathbf{h}\_n | \tilde{\mathbf{h}}\_n) =& \sum\_{f=1}^F \sum\_{k=1}^K \left[ \frac{ w\_{fk} }{ \sum\_{k'=1}^K w\_{fk'} \tilde{h}\_{k'n} } h\_{kn} + v\_{fn} \frac{w\_{fk} \tilde{h}\_{kn}^2 }{ h\_{kn} \left(\sum\_{k'=1}^K w\_{fk'} \tilde{h}\_{k'n} \right)^2} \right] + cst(\mathbf{h}\_n) \end{aligned}$$ - This auxiliary function is separable in convex functions of the individual coefficients $h_{kn}$, for $k \in \\{1,...,K\\}$, which is convenient for updating the parameters: $$ h\_{kn}^{(i+1)} = \underset{h\_{kn} \in \mathbb{R}\_+}{\arg\min} \hspace{.15cm} G(\mathbf{h}\_n | \mathbf{h}\_n^{(i)}). $$ - Cancelling the partial derivative of the auxiliary function w.r.t $h\_{kn}$, we obtain: $$ h\_{kn}^{(i+1)} = h\_{kn}^{(i)} \left[ \frac{\sum\_{f=1}^F v\_{fn} w\_{fk} \left(\sum\_{k'=1}^K w\_{fk'} h\_{k'n}^{(i)} \right)^{-2} }{ \sum\_{f=1}^F w\_{fk} \left(\sum\_{k'=1}^K w\_{fk'} h\_{k'n}^{(i)} \right)^{-1}} \right]^{1/2}. $$ --- count: false ## IS-NMF multiplicative update rules - We can summarize the update as follows: $$ h\_{kn} \leftarrow h\_{kn} \left[ \frac{\sum\_{f=1}^F w\_{fk} v\_{fn} \left( \mathbf{W}\mathbf{H}\right)\_{fn}^{-2} }{ \sum\_{f=1}^F w\_{fk} \left( \mathbf{W}\mathbf{H}\right)\_{fn}^{-1} } \right]^{1/2}. $$ Non-negativity is ensured provided the coefficients are initialized with non-negative values. -- count: false - Equivalently, in matrix form we have for $\mathbf{H}$ and $\mathbf{W}$: $$ \mathbf{H} \leftarrow \mathbf{H} \odot \left[ \frac{\mathbf{W}^\top \left[\left( \mathbf{W}\mathbf{H}\right)^{\odot - 2} \odot \mathbf{V} \right] }{\mathbf{W}^\top \left( \mathbf{W}\mathbf{H}\right)^{\odot - 1}} \right]^{\odot 1/2}; \qquad \mathbf{W} \leftarrow \mathbf{W} \odot \left[ \frac{\left[\left( \mathbf{W}\mathbf{H}\right)^{\odot - 2} \odot \mathbf{V} \right] \mathbf{H}^\top }{\left( \mathbf{W}\mathbf{H}\right)^{\odot - 1} \mathbf{H}^\top } \right]^{\odot 1/2}, $$ where $\odot$ denotes element-wise operations, and matrix division is also element-wise.
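--- count: false ## IS-NMF in NumPy (illustrative sketch) As a complement to the notebook activities, here is a minimal NumPy implementation of these multiplicative updates (the function name `is_nmf`, the number of iterations and the small constant `eps` added for numerical stability are illustrative choices, not prescribed by the algorithm):

```python
import numpy as np

def is_nmf(V, K, n_iter=200, eps=1e-12, seed=0):
    """Minimal IS-NMF sketch using the multiplicative updates above.

    V: (F, N) non-negative data matrix (e.g. a power spectrogram).
    K: rank of the factorization.
    Returns W (F, K) and H (K, N) such that V is approximately W @ H.
    """
    rng = np.random.default_rng(seed)
    F, N = V.shape
    # Non-negative random initialization (the algorithm is sensitive to it).
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        V_hat = W @ H + eps
        # H <- H * [ W^T((W H)^-2 * V) / W^T (W H)^-1 ]^(1/2), element-wise
        H *= np.sqrt((W.T @ (V / V_hat**2)) / (W.T @ (1.0 / V_hat)))
        V_hat = W @ H + eps
        # W <- W * [ ((W H)^-2 * V) H^T / (W H)^-1 H^T ]^(1/2), element-wise
        W *= np.sqrt(((V / V_hat**2) @ H.T) / ((1.0 / V_hat) @ H.T))
    return W, H
```

For instance, on a power spectrogram `V = np.abs(X)**2` one would call `W, H = is_nmf(V, K=3)` and monitor $D\_{IS}(\mathbf{V} | \mathbf{W}\mathbf{H})$ over the iterations, which should decrease monotonically.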