

Bayesian Methods for Machine Learning

Lecture 2 - Fundamentals of machine learning



Simon Leglaive



CentraleSupélec

This presentation is adapted from INFO8010 - Deep Learning course by Gilles Louppe at ULiège.

1 / 45

In the last episode...

We want to infer some latent information from observations.

  • Data: Get unlabeled data D={xii.i.dp(x)}i=1N\mathcal{D} = \left\{\mathbf{x}_i \overset{i.i.d}{\sim} p^\star(\mathbf{x})\right\}_{i=1}^N.
  • Modeling: Define a model that relates the latent variable of interest to the observations p(x,z;θ)=p(xz;θ)p(z;θ)p(\mathbf{x},z; \theta) = p(\mathbf{x} \mid z; \theta) p(z; \theta).
  • Inference: Compute the posterior distribution p(zx;θ)p(z \mid \mathbf{x} ; \theta), which can then be used in many different ways.
  • Learning: Estimate the unknown model parameters θ\theta by maximizing the log-marginal likelihood lnp(x;θ)\ln p(\mathbf{x}; \theta) averaged over the dataset.

This corresponds to a form of unsupervised learning, using a generative Bayesian modeling approach.

2 / 45

Today

We will focus on supervised learning with discriminative models.

The key concepts you should be familiar with at the end of this course are the following:

  • Supervised learning
  • Generative vs. discriminative model
  • Empirical risk minimization
  • Multinomial logistic regression (lab session)
3 / 45

Supervised learning

4 / 45

The term supervised means that, at training time, the input data samples xiX\mathbf{x}_i \in \mathcal{X} are labeled with the information we will try to infer at test time. The labels are represented by yiYy_i \in \mathcal{Y}.

Image credits: Antoine Deleforge, Inria, course given at Télécom Physique Strasbourg.

5 / 45

Training time

At training time, our observations consist of (input, label) pairs, which are assumed to be i.i.d according to some distribution p(x,y)p^\star(\mathbf{x},y):

D={(xi,yi)X×Yi.i.dp(x,y)}i=1N.\mathcal{D} = \Big\{(\mathbf{x}_i,y_i) \in \mathcal{X} \times \mathcal{Y} \overset{i.i.d}{\sim} p^\star(\mathbf{x},y) \Big\}_{i=1}^N.

For instance, xi\mathbf{x}_i is a DD-dimensional input feature vector and yiy_i is a scalar label (e.g., a category or a real value).

6 / 45

Test time

At test time, we only observe a new input data sample x\mathbf{x}, from which we want to infer yy.

This means computing the posterior distribution p(yx;θ)p(y \mid \mathbf{x} ; \theta), which depends on a set of parameters θ\theta that have been estimated during the learning stage, at training time.

In supervised learning, we are often only interested in a point estimate y^\hat{y}, for instance:

  • classification task:

    y^=argmaxk{1,...,C}p(y=kx;θ){1,...,C}; \hat{y} = \underset{k \in \{1,...,C\}}{\arg\max}\, p(y=k \mid \mathbf{x} ; \theta ) \in \{1,...,C\};

  • regression task:

    y^=Ep(yx;θ)[y]R. \hat{y} = \mathbb{E}_{p(y \mid \mathbf{x} ; \theta )}\left[ y \right] \in \mathbb{R}.
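As a minimal illustration (not part of the original slides), here is a Python sketch of both point estimates, assuming the posterior has already been computed: a hypothetical categorical posterior for classification, and a Gaussian posterior whose mean is the model output for regression.

    import numpy as np

    # Hypothetical posterior p(y = k | x; theta) over C = 3 classes.
    posterior = np.array([0.1, 0.7, 0.2])
    y_hat_class = int(np.argmax(posterior)) + 1   # arg max over k in {1, ..., C}

    # For a Gaussian posterior p(y | x; theta) = N(y; f(x; theta), sigma^2),
    # the expectation E[y | x] is simply the model output f(x; theta).
    f_x = 2.37                                    # hypothetical model output
    y_hat_reg = f_x

    print(y_hat_class, y_hat_reg)                 # 2 2.37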

7 / 45

Supervised learning with generative modeling

We have access to a labeled dataset D={(xi,yi)i.i.dp(x,y)}i=1N\mathcal{D} = \left\{(\mathbf{x}_i,y_i) \overset{i.i.d}{\sim} p^\star(\mathbf{x},y) \right\}_{i=1}^N.

  • Generative modeling - we define the joint distribution to explain how the input data x\mathbf{x} are generated from the label yy:

    p(x,y;θ)=p(xy;θ)p(y;θ).p(\mathbf{x}, y; \theta) = p(\mathbf{x} \mid y; \theta) p(y; \theta).
  • Inference - we "invert" this generative model using Bayes theorem:

    p(yx;θ)=p(xy;θ)p(y;θ)p(x;θ).p(y \mid \mathbf{x} ; \theta) = \frac{p(\mathbf{x} \mid y; \theta) p(y; \theta)}{p(\mathbf{x}; \theta)}.

  • Learning - we estimate the unknown model parameters θ\theta by maximizing the log-likelihood lnp(x,y;θ)\ln p(\mathbf{x}, y; \theta) averaged over the dataset D\mathcal{D}:

    maxθEp(x,y)[lnp(x,y;θ)].\underset{\theta}{\max}\hspace{.1cm} \mathbb{E}_{p^\star(\mathbf{x},y)} [ \ln p(\mathbf{x}, y; \theta) ] .

    No need to marginalize over yy as it is observed in a supervised setting!

8 / 45

Supervised learning with discriminative modeling

We have access to a labeled dataset D={(xi,yi)i.i.dp(x,y)}i=1N\mathcal{D} = \left\{(\mathbf{x}_i,y_i) \overset{i.i.d}{\sim} p^\star(\mathbf{x},y) \right\}_{i=1}^N.

  • Discriminative modeling - we define the joint distribution to directly explain how the label yy is obtained from the input data x\mathbf{x}:

    p(x,y;θ)=p(yx;θ)p(x),p(\mathbf{x}, y; \theta) = p(y \mid \mathbf{x}; \theta) p(\mathbf{x}),

    where p(x)p(\mathbf{x}) is non-informative (e.g., uniform after normalization) and does not depend on θ\theta.

  • Inference - super easy, just evaluate the model!

  • Learning - we estimate the unknown model parameters θ\theta by maximizing the log-likelihood lnp(x,y;θ)=lnp(yx;θ)+cst\ln p(\mathbf{x}, y; \theta) = \ln p(y \mid \mathbf{x}; \theta) + cst averaged over the dataset D\mathcal{D}:

    maxθEp(x,y)[lnp(x,y;θ)]minθEp(x,y)[lnp(yx;θ)].\underset{\theta}{\max}\hspace{.1cm} \mathbb{E}_{p^\star(\mathbf{x},y)} \left[ \ln p(\mathbf{x}, y; \theta) \right] \hspace{.2cm}\Leftrightarrow\hspace{.2cm} \underset{\theta}{\min}\hspace{.1cm} \mathbb{E}_{p^\star(\mathbf{x},y)} \left[ - \ln p(y \mid \mathbf{x}; \theta) \right] .

    In supervised discriminative learning, lnp(yx;θ)- \ln p(y \mid \mathbf{x}; \theta) is often called the negative log-likelihood (NLL) function, which can be confusing (e.g., torch.nn.NLLLoss).
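To make the naming pitfall concrete, here is a minimal PyTorch sketch (the scores and target are hypothetical): torch.nn.NLLLoss expects log-probabilities ln p(y | x; θ), while torch.nn.CrossEntropyLoss takes raw scores and applies the log-softmax internally; both return the same negative log-likelihood.

    import torch
    import torch.nn as nn

    logits = torch.tensor([[2.0, -1.0, 0.5]])    # hypothetical scores for C = 3 classes
    target = torch.tensor([0])                   # ground-truth class index

    log_probs = nn.LogSoftmax(dim=1)(logits)     # ln p(y = k | x; theta)
    nll = nn.NLLLoss()(log_probs, target)        # - ln p(y = target | x; theta)
    ce = nn.CrossEntropyLoss()(logits, target)   # same quantity, computed from the raw scores

    assert torch.allclose(nll, ce)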

9 / 45

Wrap-up

  • Data: Get the labeled training dataset D={(xi,yi)i.i.dp(x,y)}i=1N\mathcal{D} = \left\{(\mathbf{x}_i,y_i) \overset{i.i.d}{\sim} p^\star(\mathbf{x},y) \right\}_{i=1}^N.
  • Modeling/inference: Define p(yx;θ)p(y \mid \mathbf{x}; \theta).
  • Learning: Estimate θ\theta by minimizing lnp(yx;θ)- \ln p(y \mid \mathbf{x}; \theta) averaged over D\mathcal{D}.
  • 3 fundamental examples are available in the appendix. We will study the multinomial classification example in detail during the lab session.

  • Having to define a model as the expression of a probability mass/density function p(yx;θ)p(y \mid \mathbf{x}; \theta) can be restrictive, and is actually not necessary.

    A more general approach to supervised discriminative learning is based on the principle of empirical risk minimization.

11 / 45

Empirical risk minimization

12 / 45
  • Consider a function f:XYf : \mathcal{X} \to \mathcal{Y} produced by some learning algorithm, which can depend on some parameters not represented here.

    y^=f(x)\hat{y} = f(\mathbf{x}) is the prediction of the ground-truth label yy associated with x\mathbf{x}.

  • The predictions of this function can be evaluated through a loss function :Y×YR,\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}, such that (y,f(x))0\ell(y, f(\mathbf{x})) \geq 0 measures how close the prediction f(x)f(\mathbf{x}) is from yy.

    For example:

    • Classification: (y,f(x))=1y≠f(x)\ell(y,f(\mathbf{x})) = \mathbf{1}_{y \neq f(\mathbf{x})}
    • Regression: (y,f(x))=(yf(x))2\ell(y,f(\mathbf{x})) = (y - f(\mathbf{x}))^2

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

13 / 45

Let F\mathcal{F} denote the hypothesis space, i.e. the set of all functions ff that can be produced by the chosen learning algorithm.

We are looking for a function fFf \in \mathcal{F} with a small expected risk (or generalization error) R(f)=Ep(x,y)[(y,f(x))]=Ep(x)[Ep(yx)[(y,f(x))]].R(f) = \mathbb{E}_{p^\star(\mathbf{x},y)}\left[ \ell(y, f(\mathbf{x})) \right] = \mathbb{E}_{p^\star(\mathbf{x})}\left[ \mathbb{E}_{p^\star(y| \mathbf{x})}\left[ \ell(y, f(\mathbf{x})) \right] \right].

This means that for a given data generating distribution p(x,y)p^\star(\mathbf{x},y) and for a given hypothesis space F\mathcal{F}, the optimal model is f=argminfFR(f).f^\star = \arg \min_{f \in \mathcal{F}} R(f).

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

14 / 45

Unfortunately, since p(x,y)p^\star(\mathbf{x},y) is unknown, the expected risk cannot be evaluated and the optimal model cannot be determined.

However, if we have i.i.d. training data D={(xi,yi)}i=1N\mathcal{D} = \{(\mathbf{x}_i, y_i) \}_{i=1}^N, we can compute an estimate, the empirical risk (or training error) R^(f,D)=1N(xi,yi)D(yi,f(xi)).\hat{R}(f, \mathcal{D}) = \frac{1}{N} \sum_{(\mathbf{x}_i, y_i) \in \mathcal{D}} \ell(y_i, f(\mathbf{x}_i)).

This estimate is unbiased and can be used to find a good enough approximation of ff^\star. This results in the empirical risk minimization principle: fD=argminfFR^(f,D).f^\star_{\mathcal{D}} = \arg \min_{f \in \mathcal{F}} \hat{R}(f, \mathcal{D}).
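As a minimal sketch (with made-up data), the empirical risk is simply the average of per-sample losses over the training set; the zero-one and squared losses introduced earlier are used for illustration.

    import numpy as np

    def empirical_risk(loss, f, X, y):
        """Average of loss(y_i, f(x_i)) over the dataset D."""
        return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])

    zero_one = lambda y, y_hat: float(y != y_hat)   # classification loss
    squared = lambda y, y_hat: (y - y_hat) ** 2     # regression loss

    # Hypothetical 1-D regression data and a placeholder predictor.
    X = np.array([0.0, 1.0, 2.0])
    y = np.array([0.1, 0.9, 2.2])
    f = lambda x: x
    print(empirical_risk(squared, f, X, y))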

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

15 / 45

What does unbiased mean?

=> The expectation of the empirical risk estimate over datasets D\mathcal{D} equals the expected risk.

Most supervised machine learning algorithms, including neural networks, implement empirical risk minimization.

Under some regularity assumptions, empirical risk minimizers converge:

limNfD=f\lim_{N \to \infty} f^\star_{\mathcal{D}} = f^\star

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

16 / 45

This is why tuning the parameters of the model to make it work on the training data is a reasonable thing to do.

Regression example

Consider the joint probability distribution p(x,y)p^\star(x,y) induced by the data generating process (x,y)p(x,y)xU([10;10]),ϵN(0,σ2),y=g(x)+ϵ(x,y) \sim p^\star(x,y) \Leftrightarrow x \sim \mathcal{U}([-10;10]), \epsilon \sim \mathcal{N}(0, \sigma^2), y = g(x) + \epsilon where xRx \in \mathbb{R}, yRy\in\mathbb{R} and gg is an unknown polynomial of degree 3.

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

17 / 45

Regression is used to study the relationship between two continuous variables.

Of course, it can be extended to higher dimensions.

Image credit: https://noeliagorod.com/2019/05/21/machine-learning-for-everyone-in-simple-words-with-real-world-examples-yes-again/

18 / 45

Regression examples:

  • predict the hospitalization time given your medical record
  • predict your insurance rate when taking out a loan to buy a house, given all your personal data
  • predict your click-through rate in online advertising given your browsing history
  • predict the temperature given carbon emission rates
  • predict your age given a picture of you
  • predict a picture of you in 10 years given a picture of you today

Step 1: Defining the model

Our goal is to find a function ff that makes good predictions on average over p(x,y)p^\star(x,y).

Consider the hypothesis space fFf \in \mathcal{F} of polynomials of degree 3 defined through their parameters wR4\mathbf{w} \in \mathbb{R}^4 such that y^f(x;w)=d=03wdxd\hat{y} \triangleq f(x; \mathbf{w}) = \sum_{d=0}^3 w_d x^d

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

19 / 45

Step 2: Defining the loss function

For this regression problem, we use the squared error loss (y,f(x;w))=(yf(x;w))2\ell(y, f(x;\mathbf{w})) = (y - f(x;\mathbf{w}))^2 to measure how wrong the predictions are.

Therefore, our goal is to find the best value w\mathbf{w}^\star such that w=argminwR(w)=argminwEp(x,y)[(yf(x;w))2]\begin{aligned} \mathbf{w}^\star &= \arg\min_\mathbf{w} R(\mathbf{w}) \\ &= \arg\min_\mathbf{w} \mathbb{E}_{p^\star(x,y)}\left[ (y-f(x;\mathbf{w}))^2 \right] \end{aligned}

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

20 / 45

Step 3: Training

Given a large enough training set D={(xi,yi)i=1,,N}\mathcal{D} = \{(x_i, y_i) | i=1,\ldots,N\}, the empirical risk minimization principle tells us that a good estimate wD\mathbf{w}^\star_{\mathcal{D}} of w\mathbf{w}^\star can be found by minimizing the empirical risk: wD=argminwR^(w,D)=argminw1N(xi,yi)D(yif(xi;w))2=argminw1N(xi,yi)D(yid=03wdxid)2=argminw1N(y1y2yN)y(x10x13x20x23xN0xN3)X(w0w1w2w3)2\begin{aligned} \mathbf{w}^\star_{\mathcal{D}} &= \arg\min_\mathbf{w} \hat{R}(\mathbf{w},\mathcal{D}) \\ &= \arg\min_\mathbf{w} \frac{1}{N} \sum_{(x_i, y_i) \in \mathcal{D}} (y_i - f(x_i;\mathbf{w}))^2 \\ &= \arg\min_\mathbf{w} \frac{1}{N} \sum_{(x_i, y_i) \in \mathcal{D}} \Big(y_i - \sum_{d=0}^3 w_d x_i^d\Big)^2 \\ &= \arg\min_\mathbf{w} \frac{1}{N} \left\lVert \underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \ldots \\ y_N \end{pmatrix}}_{\mathbf{y}} - \underbrace{\begin{pmatrix} x_1^0 \ldots x_1^3 \\ x_2^0 \ldots x_2^3 \\ \ldots \\ x_N^0 \ldots x_N^3 \end{pmatrix}}_{\mathbf{X}} \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \end{pmatrix} \right\rVert^2 \end{aligned}

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

21 / 45

This is ordinary least squares regression, for which the solution is known analytically: wD=(XTX)1XTy\mathbf{w}^\star_{\mathcal{D}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

In many situations, the problem is more difficult and we cannot find the solution analytically. We resort to iterative optimization algorithms, such as (variants of) gradient descent.
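Below is a minimal sketch on synthetic data (a hypothetical cubic gg and noise level, not the slides' exact figures) showing both the closed-form solution and plain gradient descent on the same empirical risk; in practice one would rescale the features or use np.linalg.lstsq for better conditioning.

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma = 200, 2.0
    x = rng.uniform(-10.0, 10.0, size=n)
    g = lambda t: 0.05 * t**3 - 0.5 * t + 1.0        # hypothetical true degree-3 polynomial
    y = g(x) + rng.normal(0.0, sigma, size=n)

    X = np.vander(x, N=4, increasing=True)           # design matrix with columns x^0, ..., x^3

    # Closed form: w = (X^T X)^{-1} X^T y, solved as a linear system.
    w_closed = np.linalg.solve(X.T @ X, X.T @ y)

    # Plain gradient descent on the empirical risk (1/n) * ||y - X w||^2.
    w = np.zeros(4)
    lr = 1e-7                                        # tiny step size: the x^3 column has a large scale
    for _ in range(50_000):
        grad = (2.0 / n) * X.T @ (X @ w - y)         # gradient of the mean squared error
        w -= lr * grad

    print(w_closed)
    print(w)                                         # slowly approaches w_closed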

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

22 / 45

The expected risk minimizer f(x;w)f(x;\mathbf{w}^\star) within our hypothesis space F\mathcal{F} (polynomials of degree 3) is g(x)g(x) itself (i.e. the polynomial of degree 3 with the true parameters).

Therefore, on this toy problem, we can verify that f(x;wD)f(x;w)=g(x)f(x;\mathbf{w}^\star_{\mathcal{D}}) \to f(x;\mathbf{w}^\star) = g(x) as NN \to \infty.

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

23 / 45

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

24 / 45


What if we consider a hypothesis space F\mathcal{F} in which candidate functions ff are either too "simple" or too "complex" with respect to the true data generating process?

F\mathcal{F} = polynomials of degree 1, 2, 3, 4, 5 and 10 (fits shown on successive slides).

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

25 / 45

Error vs. degree dd of the polynomial.

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

26 / 45

Why shouldn't we pick the largest dd?

Bayes risk and estimator

Let YX\mathcal{Y}^{\mathcal X} be the set of all functions f:XYf : \mathcal{X} \to \mathcal{Y}.

We define the Bayes risk as the minimal expected risk over all possible functions, RB=minfYXR(f),R_B = \min_{f \in \mathcal{Y}^{\mathcal X}} R(f), and we call the model fBf_B that achieves this minimum the Bayes estimator.

No model ff can perform better than fBf_B.

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

27 / 45

The capacity of a hypothesis space F\mathcal{F} induced by a learning algorithm intuitively represents the ability to find a good model fFf \in \mathcal{F} that can fit any function, regardless of its complexity.

If the capacity is infinite, we can fit any function, but in practice the capacity is always finite.

The capacity can be controlled through hyper-parameters of the learning algorithm. For example:

  • The degree of the family of polynomials;
  • The number of layers in a neural network;
  • The number of training iterations;
  • Regularization terms.

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

28 / 45

Underfitting and overfitting

  • If the capacity of F\mathcal{F} is too low, then fBFf_B \notin \mathcal{F} and R(f)RBR(f) - R_B is large for any fFf \in \mathcal{F}, including ff^\star and fDf^\star_{\mathcal{D}}. Such models ff are said to underfit the data.
  • If the capacity of F\mathcal{F} is too high, then fBFf_B \in \mathcal{F} or R(f)RBR(f^\star) - R_B is small.
    However, because of the high capacity of the hypothesis space, the empirical risk minimizer fDf^\star_{\mathcal{D}} could fit the training data arbitrarily well such that

    R(fD)RBR^(fD,D)0.R(f^\star_{\mathcal{D}}) \geq R_B \geq \hat{R}(f^\star_{\mathcal{D}}, \mathcal{D}) \geq 0.

This indicates that the empirical risk R^(fD,D)\hat{R}(f^\star_{\mathcal{D}}, \mathcal{D}) is a poor estimator of the expected risk R(fD)R(f^\star_{\mathcal{D}}). In this situation, fDf^\star_{\mathcal{D}} becomes too specialized with respect to the training data D\mathcal{D}; fDf^\star_{\mathcal{D}} is said to overfit the data.

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

29 / 45

Therefore, our goal is to adjust the capacity of the hypothesis space such that the expected risk of the empirical risk minimizer (the generalization error) R(fD)R(f^\star_{\mathcal{D}}) gets as low as possible, and not simply the empirical risk of the empirical risk minimizer (training error) R^(fD,D)\hat{R}(f^\star_{\mathcal{D}}, \mathcal{D}).

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

30 / 45

Comment that for deep networks, the training error may go to 0 while the generalization error does not necessarily go up!

An unbiased estimate of the expected risk can be obtained by evaluating fDf^\star_{\mathcal{D}} on data Dtest\mathcal{D}_\text{test} independent from the training samples D\mathcal{D}: R^(fD,Dtest)=1Ntest(xi,yi)Dtest(yi,fD(xi))\hat{R}(f^\star_{\mathcal{D}}, \mathcal{D}_\text{test}) = \frac{1}{N_\text{test}} \sum_{(\mathbf{x}_i, y_i) \in \mathcal{D}_\text{test}} \ell(y_i, f^\star_{\mathcal{D}}(\mathbf{x}_i))

This test error estimate can be used to evaluate the actual performance of the model. However, it should not be used, at the same time, for model selection.

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

31 / 45

Error vs. degree dd of the polynomial.

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

32 / 45

What value of dd shall you select?

But then how good is this selected model?

This should be avoided at all costs!

Credits: Francois Fleuret, EE559 Deep Learning, EPFL.

33 / 45

Instead, keep a separate validation set for tuning the hyper-parameters.
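A minimal sketch of this protocol on synthetic data (hypothetical split sizes and true cubic): fit one model per degree on the training set, pick the degree with the lowest validation error, and report the test error of that single selected model.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-10.0, 10.0, size=300)
    y = 0.05 * x**3 - 0.5 * x + 1.0 + rng.normal(0.0, 2.0, size=x.shape)

    idx = rng.permutation(len(x))
    train, val, test = idx[:200], idx[200:250], idx[250:]   # train / validation / test split

    def fit(d, i):                                  # least-squares fit of a degree-d polynomial
        X = np.vander(x[i], N=d + 1, increasing=True)
        return np.linalg.lstsq(X, y[i], rcond=None)[0]

    def mse(w, i):                                  # empirical risk on the subset indexed by i
        X = np.vander(x[i], N=len(w), increasing=True)
        return np.mean((y[i] - X @ w) ** 2)

    degrees = range(1, 11)
    models = {d: fit(d, train) for d in degrees}
    d_best = min(degrees, key=lambda d: mse(models[d], val))   # model selection on the validation set
    print(d_best, mse(models[d_best], test))                   # test error reported only once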

Credits: Francois Fleuret, EE559 Deep Learning, EPFL.

34 / 45

Comment on the comparison of algorithms from one paper to the other.

Bias-variance decomposition

Consider a fixed point x\mathbf{x} and the prediction y^=fD(x)\hat{y}=f^\star_{\mathcal{D}}(\mathbf{x}) of the empirical risk minimizer at x\mathbf{x}.

Then the local expected risk of fDf^\star_{\mathcal{D}} is R(fDx)=Ep(yx)[(yfD(x))2]=Ep(yx)[(yfB(x)+fB(x)fD(x))2]=Ep(yx)[(yfB(x))2]+Ep(yx)[(fB(x)fD(x))2]=R(fBx)+(fB(x)fD(x))2\begin{aligned} R(f^\star_{\mathcal{D}}|\mathbf{x}) &= \mathbb{E}_{p^\star(y|\mathbf{x})} \left[ (y - f^\star_{\mathcal{D}}(\mathbf{x}))^2 \right] \\ &= \mathbb{E}_{p^\star(y|\mathbf{x})} \left[ (y - f_B(\mathbf{x}) + f_B(\mathbf{x}) - f^\star_{\mathcal{D}}(\mathbf{x}))^2 \right] \\ &= \mathbb{E}_{p^\star(y|\mathbf{x})} \left[ (y - f_B(\mathbf{x}))^2 \right] + \mathbb{E}_{ p^\star(y|\mathbf{x})} \left[ (f_B(\mathbf{x}) - f^\star_{\mathcal{D}}(\mathbf{x}))^2 \right] \\ &= R(f_B|\mathbf{x}) + (f_B(\mathbf{x}) - f^\star_{\mathcal{D}}(\mathbf{x}))^2 \end{aligned}

  • R(fBx)R(f_B|\mathbf{x}) is the local expected risk of the Bayes estimator. This term cannot be reduced.
  • (fB(x)fD(x))2(f_B(\mathbf{x}) - f^\star_{\mathcal{D}}(\mathbf{x}))^2 represents the discrepancy between fBf_B and fDf^\star_{\mathcal{D}}.

Remarks:

  • R(f)=Ep(x,y)[(y,f(x))]=Ep(x)[Ep(yx)[(y,f(x))]]=Ep(x)[R(fx)]\scriptsize R(f) = \mathbb{E}_{p^\star(\mathbf{\mathbf{x}},y)}\left[ \ell(y, f(\mathbf{\mathbf{x}})) \right] = \mathbb{E}_{p^\star(\mathbf{\mathbf{x}})}\left[ \mathbb{E}_{p^\star(y| \mathbf{\mathbf{x}})}\left[ \ell(y, f(\mathbf{\mathbf{x}})) \right] \right] = \mathbb{E}_{p^\star(\mathbf{\mathbf{x}})}\left[ R(f | \mathbf{x}) \right]
  • To go from the second to third line we used the fact that fB(x)=argminfYXR(f)=Ep(yx)[y]f_B(\mathbf{x}) = \arg\min_{f \in \mathcal{Y}^{\mathcal X}} R(f) = \mathbb{E}_{p^\star(y|\mathbf{x})}[y].

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

35 / 45

If D\mathcal{D} is itself considered as a random variable, then fDf^\star_{\mathcal{D}} is also a random variable, along with its predictions y^=fD(x)\hat{y} = f^\star_{\mathcal{D}}(\mathbf{x}).

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

36 / 45


Formally, taking the expectation of the local expected risk over the training set D\mathcal{D} yields: ED[R(fDx)]=ED[R(fBx)+(fB(x)fD(x))2]=R(fBx)+ED[(fB(x)fD(x))2]=R(fBx)noise+(fB(x)ED[fD(x)])2bias2+ED[(fD(x)ED[fD(x)])2]var\begin{aligned} &\mathbb{E}_\mathcal{D} \left[ R(f^\star_{\mathcal{D}}|\mathbf{x}) \right] \\ &= \mathbb{E}_\mathcal{D} \left[ R(f_B|\mathbf{x}) + (f_B(\mathbf{x}) - f^\star_{\mathcal{D}}(\mathbf{x}))^2 \right] \\ &= R(f_B|\mathbf{x}) + \mathbb{E}_\mathcal{D} \left[ (f_B(\mathbf{x}) - f^\star_{\mathcal{D}}(\mathbf{x}))^2 \right] \\ &= \underbrace{R(f_B|\mathbf{x})}_{\text{noise}} + \underbrace{(f_B(\mathbf{x}) - \mathbb{E}_\mathcal{D}\left[ f^\star_{\mathcal{D}}(\mathbf{x}) \right] )^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_\mathcal{D}\left[ ( f^\star_{\mathcal{D}}(\mathbf{x}) - \mathbb{E}_\mathcal{D}\left[ f^\star_{\mathcal{D}}(\mathbf{x}) \right])^2 \right]}_{\text{var}} \end{aligned}

This decomposition is known as the bias-variance decomposition.

  • The noise term quantifies the irreducible part of the expected risk.
  • The bias term measures the discrepancy between the average model and the Bayes estimator.
  • The variance term quantifies the variability of the predictions.
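As a minimal sketch of this decomposition (a synthetic setup with a hypothetical cubic gg and the squared loss, so that fB(x)=E[yx]=g(x)f_B(\mathbf{x}) = \mathbb{E}[y|\mathbf{x}] = g(\mathbf{x}) here): draw many independent training sets, refit the model on each, and estimate the noise, bias² and variance terms at a fixed point x0.

    import numpy as np

    rng = np.random.default_rng(2)
    g = lambda t: 0.05 * t**3 - 0.5 * t + 1.0      # hypothetical true function, so f_B = g here
    sigma, n, d, x0 = 2.0, 50, 3, 4.0

    preds = []
    for _ in range(2000):                          # 2000 independent training sets D
        x = rng.uniform(-10.0, 10.0, size=n)
        y = g(x) + rng.normal(0.0, sigma, size=n)
        X = np.vander(x, N=d + 1, increasing=True)
        w = np.linalg.lstsq(X, y, rcond=None)[0]   # empirical risk minimizer f*_D
        preds.append(np.vander([x0], N=d + 1, increasing=True) @ w)   # f*_D(x0)
    preds = np.concatenate(preds)

    noise = sigma**2                               # R(f_B | x0) for the squared loss
    bias2 = (g(x0) - preds.mean()) ** 2            # (f_B(x0) - E_D[f*_D(x0)])^2
    var = preds.var()                              # E_D[(f*_D(x0) - E_D[f*_D(x0)])^2]
    print(noise, bias2, var)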

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

37 / 45

Bias-variance trade-off

  • Reducing the capacity makes fDf^\star_{\mathcal{D}} fit the data less on average, which increases the bias term.
  • Increasing the capacity makes fDf^\star_{\mathcal{D}} vary a lot with the training data, which increases the variance term.

Credits: Francois Fleuret, EE559 Deep Learning, EPFL.

38 / 45

Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.

39 / 45

What do you observe?


Lab session on multinomial logistic regression

40 / 45

Appendix: 3 fundamental examples

41 / 45

Binary classification example

  • Training data: (x,y)X×Y(\mathbf{x}, y) \in \mathcal{X} \times \mathcal{Y} with X=RD\mathcal{X} = \mathbb{R}^D and Y={0,1} \mathcal{Y} = \{0, 1\}.

  • Model: We assume there exists a function f(;θ):RD[0,1]f(\cdot; \theta): \mathbb{R}^D \mapsto [0, 1] such that

    p(y=1x;θ)=f(x;θ)p(y=1|\mathbf{x} ; \theta) = f(\mathbf{x}; \theta) and p(y=0x;θ)=1f(x;θ)p(y=0|\mathbf{x} ; \theta) = 1 - f(\mathbf{x}; \theta).

    The model can be compactly rewritten using the following probability mass function (pmf) for all yYy \in \mathcal{Y}:

    p(yx;θ)=(f(x;θ))y(1f(x;θ))(1y). p(y \mid \mathbf{x} ; \theta) = \Big(f(\mathbf{x}; \theta)\Big)^y \Big(1-f(\mathbf{x}; \theta)\Big)^{(1-y)}.

  • The NLL function is called the binary cross-entropy:

    lnp(yx;θ)=yln(f(x;θ))(1y)ln(1f(x;θ)).- \ln p(y \mid \mathbf{x} ; \theta) = - y \ln\Big(f(\mathbf{x}; \theta)\Big) - (1-y) \ln\Big(1-f(\mathbf{x}; \theta)\Big).
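A minimal numerical check of this expression, assuming a logistic-regression-style model f(x;θ)=σ(wx+b)f(\mathbf{x};\theta) = \sigma(\mathbf{w}^\top\mathbf{x} + b) with hypothetical parameter values:

    import numpy as np

    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    x = np.array([0.5, -1.2])                      # input features
    w, b = np.array([1.0, 2.0]), 0.1               # hypothetical parameters theta = (w, b)
    y = 1                                          # ground-truth label in {0, 1}

    f = sigmoid(w @ x + b)                         # f(x; theta) = p(y = 1 | x; theta)
    bce = -y * np.log(f) - (1 - y) * np.log(1 - f) # binary cross-entropy - ln p(y | x; theta)
    print(bce)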

42 / 45

CC-class classification example

  • Training data: (x,y)X×Y(\mathbf{x}, y) \in \mathcal{X} \times \mathcal{Y} with X=RD\mathcal{X} = \mathbb{R}^D and Y={1,...,C} \mathcal{Y} = \{1,...,C\}.

  • Model: We assume there exists a function f(;θ):RD[0,1]Cf(\cdot; \theta): \mathbb{R}^D \mapsto [0, 1]^C such that

    p(y=kx;θ)=fk(x;θ)p(y=k \mid \mathbf{x} ; \theta) = f_k(\mathbf{x}; \theta) for all k{1,...,C}k \in \{1,...,C\} and k=1Cfk(x;θ)=1\sum\limits_{k=1}^C f_k(\mathbf{x}; \theta) = 1.

    It can be compactly rewritten using the following pmf for all yYy \in \mathcal{Y}:

    p(yx;θ)=k=1Cp(y=kx;θ)1y=k=k=1Cfk(x;θ)1y=k. p(y \mid \mathbf{x} ; \theta) = \prod_{k=1}^{C} p(y=k \mid \mathbf{x} ; \theta)^{\mathbf{1}_{y = k}} = \prod_{k=1}^{C} f_k(\mathbf{x}; \theta)^{\mathbf{1}_{y = k}} .

  • The NLL function is called the cross-entropy:

    lnp(yx;θ)=k=1C1y=kln(fk(x;θ)).- \ln p(y \mid \mathbf{x} ; \theta) = - \sum_{k=1}^{C} \mathbf{1}_{y = k} \ln\Big( f_k(\mathbf{x}; \theta) \Big).
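A minimal sketch with a softmax parameterization of fk(x;θ)f_k(\mathbf{x};\theta), as used in multinomial logistic regression (the weights W and bias b are hypothetical):

    import numpy as np

    def softmax(a):
        a = a - a.max()                            # shift for numerical stability
        e = np.exp(a)
        return e / e.sum()

    C, D = 3, 4
    rng = np.random.default_rng(0)
    W, b = rng.normal(size=(C, D)), np.zeros(C)    # hypothetical parameters theta = (W, b)
    x = rng.normal(size=D)
    y = 2                                          # ground-truth class in {1, ..., C}

    f = softmax(W @ x + b)                         # f_k(x; theta): a valid pmf over the C classes
    cross_entropy = -np.log(f[y - 1])              # - sum_k 1_{y=k} ln f_k(x; theta)
    print(cross_entropy)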

43 / 45

Logistic regression vs. naive Bayes classifiers

  • When f(x;θ)f(\mathbf{x}; \theta) is an affine function of x\mathbf{x}, this model is called (multinomial) logistic regression.

    It is one of the simplest discriminative models for supervised classification.

  • The most popular generative model for supervised classification is called the naive Bayes classifier.

  • For a comparison between the generative and discriminative approaches to supervised classification, see (Ng and Jordan, 2002).

Ng, Andrew Y.; Jordan, Michael I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems (NIPS).

44 / 45

Regression example

  • Training data: (x,y)X×Y(\mathbf{x}, \mathbf{y}) \in \mathcal{X} \times \mathcal{Y} with X=RP\mathcal{X} = \mathbb{R}^P and Y=RQ \mathcal{Y} = \mathbb{R}^Q.

  • Model: We assume there exists a function f(;θ):RPRQf(\cdot; \theta): \mathbb{R}^P \mapsto \mathbb{R}^Q such that

    p(yx;θ)=N(y;f(x;θ),I)=(2π)Q/2exp(12yf(x;θ)22)p(\mathbf{y} \mid \mathbf{x}; \theta) = \mathcal{N}\Big(\mathbf{y}; f(\mathbf{x}; \theta), \mathbf{I} \Big) = (2 \pi)^{-Q/2} \exp\Big( - \frac{1}{2} \parallel \mathbf{y} - f(\mathbf{x}; \theta) \parallel_2^2 \Big).

  • The NLL gives the squared error:

    lnp(yx;θ)=12yf(x;θ)22+cst.- \ln p(\mathbf{y} \mid \mathbf{x} ; \theta) = \frac{1}{2} \parallel \mathbf{y} - f(\mathbf{x}; \theta) \parallel_2^2 + \,cst.
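A quick numerical check (with a hypothetical target and model output) that this NLL equals the squared error up to the additive constant (Q/2)ln(2π)(Q/2)\ln(2\pi):

    import numpy as np

    Q = 2
    y = np.array([1.0, -0.5])                      # hypothetical target
    f_x = np.array([0.8, 0.1])                     # hypothetical model output f(x; theta)

    nll = 0.5 * Q * np.log(2 * np.pi) + 0.5 * np.sum((y - f_x) ** 2)
    half_sq_err = 0.5 * np.sum((y - f_x) ** 2)
    print(np.isclose(nll - half_sq_err, 0.5 * Q * np.log(2 * np.pi)))   # True: constant offset only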

45 / 45
