Simon Leglaive
CentraleSupélec
This presentation is adapted from INFO8010 - Deep Learning course by Gilles Louppe at ULiège.
We want to infer some latent information from observations.
This corresponds to a form of unsupervised learning, using a generative Bayesian modeling approach.
We will focus on supervised learning with discriminative models.
The key concepts you should be familiar with at the end of this course are the following:
The term supervised means that, at training time, the input data samples xi∈X are labeled with the information we will try to infer at test time. The labels are represented by yi∈Y.
Image credits: Antoine Deleforge, Inria, course given at Télécom Physique Strasbourg.
At training time, our observations consist of (input, label) pairs, which are assumed to be i.i.d according to some distribution p⋆(x,y):
D={(xi,yi)∈X×Y∼i.i.dp⋆(x,y)}i=1N.
For instance, xi is a D-dimensional input feature vector and yi is a scalar label (e.g., a category or a real value).
At test time, we only observe a new input data sample x, from which we want to infer y.
This means computing the posterior distribution p(y∣x;θ), which depends on a set of parameters θ that have been estimated during the learning stage, at training time.
In supervised learning, we are often only interested in a point estimate y^, for instance:
classification task:
y^=k∈{1,...,C}argmaxp(y=k∣x;θ)∈{1,...,C};
regression task:
y^=Ep(y∣x;θ)[y]∈R.
We have access to a labeled dataset D={(xi,yi)∼i.i.dp⋆(x,y)}i=1N.
Generative modeling - we define the joint distribution to explain how the input data x are generated from the label y:
p(x,y;θ)=p(x∣y;θ)p(y;θ).Inference - we "invert" this generative model using Bayes theorem:
p(y∣x;θ)=p(x;θ)p(x∣y;θ)p(y;θ).
Learning - we estimate the unknown model parameters θ by maximizing the log-likelihood lnp(x,y;θ) averaged over the dataset D:
θmaxEp⋆(x,y)[lnp(x,y;θ)].
No need to marginalize over y as it is observed in a supervised setting!
We have access to a labeled dataset D={(xi,yi)∼i.i.dp⋆(x,y)}i=1N.
Discriminative modeling - we define the joint distribution to directly explain how the label y is obtained from the input data x:
p(x,y;θ)=p(y∣x;θ)p(x),
where p(x) is non-informative (e.g., uniform after normalization) and does not depend on θ.
Inference - super easy, just evaluate the model!
Learning - we estimate the unknown model parameters θ by maximizing the log-likelihood lnp(x,y;θ)=lnp(y∣x;θ)+cst averaged over the dataset D:
θmaxEp⋆(x,y)[lnp(x,y;θ)]⇔θminEp⋆(x,y)[−lnp(y∣x;θ)].
In supervised discriminative learning, −lnp(y∣x;θ) is often called the negative log-likelihood (NLL) function, which can be confusing (e.g., torch.nn.NLLLoss).
3 fundamental examples are available in the appendix. We will study the multinomial classification example in detail during the lab session.
Having to define a model as the expression of a probability mass/density function p(y∣x;θ) can be restrictive, and is actually not necessary.
A more general approach to supervised discriminative learning is based on the principle of empirical risk minimization.
Consider a function f:X→Y produced by some learning algorithm, which can depend on some parameters not represented here.
y^=f(x) is the prediction of the ground-truth label y associated with x.
The predictions of this function can be evaluated through a loss function ℓ:Y×Y→R, such that ℓ(y,f(x))≥0 measures how close the prediction f(x) is from y.
For example:
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
Let F denote the hypothesis space, i.e. the set of all functions f than can be produced by the chosen learning algorithm.
We are looking for a function f∈F with a small expected risk (or generalization error) R(f)=Ep⋆(x,y)[ℓ(y,f(x))]=Ep⋆(x)[Ep⋆(y∣x)[ℓ(y,f(x))]].
This means that for a given data generating distribution p⋆(x,y) and for a given hypothesis space F, the optimal model is f⋆=argf∈FminR(f).
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
Unfortunately, since p⋆(x,y) is unknown, the expected risk cannot be evaluated and the optimal model cannot be determined.
However, if we have i.i.d. training data D={(xi,yi)}i=1N, we can compute an estimate, the empirical risk (or training error) R^(f,D)=N1(xi,yi)∈D∑ℓ(yi,f(xi)).
This estimate is unbiased and can be used for finding a good enough approximation of f⋆. This results into the empirical risk minimization principle: fD⋆=argf∈FminR^(f,D)
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
What does unbiased mean?
=> The expected empirical risk estimate (over d) is the expected risk.
Most supervised machine learning algorithms, including neural networks, implement empirical risk minimization.
Under some regularity assumptions, empirical risk minimizers converge:
N→∞limfD⋆=f⋆
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
This is why tuning the parameters of the model to make it work on the training data is a reasonable thing to do.
Consider the joint probability distribution p⋆(x,y) induced by the data generating process (x,y)∼p⋆(x,y)⇔x∼U([−10;10]),ϵ∼N(0,σ2),y=g(x)+ϵ where x∈R, y∈R and g is an unknown polynomial of degree 3.
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
Regression is used to study the relationship between two continuous variables.
Of course, it can be extended to higher dimensions.
Image credit: https://noeliagorod.com/2019/05/21/machine-learning-for-everyone-in-simple-words-with-real-world-examples-yes-again/
Regression examples:
Our goal is to find a function f that makes good predictions on average over p⋆(x,y).
Consider the hypothesis space f∈F of polynomials of degree 3 defined through their parameters w∈R4 such that y^≜f(x;w)=d=0∑3wdxd
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
For this regression problem, we use the squared error loss ℓ(y,f(x;w))=(y−f(x;w))2 to measure how wrong the predictions are.
Therefore, our goal is to find the best value w⋆ such w⋆=argwminR(w)=argwminEp⋆(x,y)[(y−f(x;w))2]
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
Given a large enough training set D={(xi,yi)∣i=1,…,N}, the empirical risk minimization principle tells us that a good estimate wD⋆ of w⋆ can be found by minimizing the empirical risk: wD⋆=argwminR^(w,D)=argwminN1(xi,yi)∈D∑(yi−f(xi;w))2=argwminN1(xi,yi)∈D∑(yi−d=0∑3wdxid)2=argwminN1yy1y2…yN−Xx10…x13x20…x23…xN0…xN3w0w1w2w32
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
This is ordinary least squares regression, for which the solution is known analytically: wD⋆=(XTX)−1XTy
In many situations, the problem is more difficult and we cannot find the solution analytically. We resort to iterative optimization algorithms, such as (variants of) gradient descent.
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
The expected risk minimizer f(x;w⋆) within our hypothesis space F (polynomials of degree 3) is g(x) itself (i.e. the polynomial of degree 3 with the true parameters).
Therefore, on this toy problem, we can verify that f(x;wD⋆)→f(x;w⋆)=g(x) as N→∞.
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
What if we consider a hypothesis space F in which candidate functions f are either too "simple" or too "complex" with respect to the true data generating process?
F = polynomials of degree 1
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
What if we consider a hypothesis space F in which candidate functions f are either too "simple" or too "complex" with respect to the true data generating process?
F = polynomials of degree 2
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
What if we consider a hypothesis space F in which candidate functions f are either too "simple" or too "complex" with respect to the true data generating process?
F = polynomials of degree 3
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
What if we consider a hypothesis space F in which candidate functions f are either too "simple" or too "complex" with respect to the true data generating process?
F = polynomials of degree 4
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
What if we consider a hypothesis space F in which candidate functions f are either too "simple" or too "complex" with respect to the true data generating process?
F = polynomials of degree 5
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
What if we consider a hypothesis space F in which candidate functions f are either too "simple" or too "complex" with respect to the true data generating process?
F = polynomials of degree 10
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
Error vs. degree d of the polynomial.
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
Why shouldn't we pick the largest d?
Let YX be the set of all functions f:X→Y.
We define the Bayes risk as the minimal expected risk over all possible functions, RB=f∈YXminR(f), and call Bayes estimator the model fB that achieves this minimum.
No model f can perform better than fB.
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
The capacity of an hypothesis space F induced by a learning algorithm intuitively represents the ability to find a good model f∈F that can fit any function, regardless of its complexity.
If the capacity is infinite, we can fit any function, but in practice the capacity is always finite.
The capacity can be controlled through hyper-parameters of the learning algorithm. For example:
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
If the capacity of F is too high, then fB∈F or R(f⋆)−RB is small.
However, because of the high capacity of the hypothesis space, the empirical risk minimizer fD⋆ could fit the training data arbitrarily well such that
R(fD⋆)≥RB≥R^(fD⋆,D)≥0.
This indicates that the empirical risk R^(fD⋆,D) is a poor estimator of the expected risk R(fD⋆). In this situation, fD⋆ becomes too specialized with respect to the true data generating process, fD⋆ is said to overfit the data.
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
Therefore, our goal is to adjust the capacity of the hypothesis space such that the expected risk of the empirical risk minimizer (the generalization error) R(fD⋆) gets as low as possible, and not simply the empirical risk of the empirical risk minimizer (training error) R^(fD⋆,D).
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
Comment that for deep networks, training error may goes to 0 while the generalization error may not necessarily go up!
An unbiased estimate of the expected risk can be obtained by evaluating fD⋆ on data Dtest independent from the training samples D: R^(fD⋆,Dtest)=Ntest1(xi,yi)∈Dtest∑ℓ(yi,fD⋆(xi))
This test error estimate can be used to evaluate the actual performance of the model. However, it should not be used, at the same time, for model selection.
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
Error vs. degree d of the polynomial.
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
What value of d shall you select?
But then how good is this selected model?
Instead, keep a separate validation set for tuning the hyper-parameters.
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
Comment on the comparison of algorithms from one paper to the other.
Consider a fixed point x and the prediction y^=fD⋆(x) of the empirical risk minimizer at x.
Then the local expected risk of fD⋆ is R(fD⋆∣x)=Ep⋆(y∣x)[(y−fD⋆(x))2]=Ep⋆(y∣x)[(y−fB(x)+fB(x)−fD⋆(x))2]=Ep⋆(y∣x)[(y−fB(x))2]+Ep⋆(y∣x)[(fB(x)−fD⋆(x))2]=R(fB∣x)+(fB(x)−fD⋆(x))2
Remarks:
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
If D is itself considered as a random variable, then fD⋆ is also a random variable, along with its predictions y^=fD⋆(x).
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
Formally, the expected local expected risk yields to: ED[R(fD⋆∣x)]=ED[R(fB∣x)+(fB(x)−fD⋆(x))2]=R(fB∣x)+ED[(fB(x)−fD⋆(x))2]=noiseR(fB∣x)+bias2(fB(x)−ED[fD⋆(x)])2+varED[(fD⋆(x)−ED[fD⋆(x)])2]
This decomposition is known as the bias-variance decomposition.
Credits: Gilles Louppe, INFO8010 - Deep Learning, ULiège.
Credits: Francois Fleuret, EE559 Deep Learning, EPFL.
What do you observe?
Training data: (x,y)∈X×Y with X=RD and Y={0,1}.
Model: We assume there exists a function f(⋅;θ):RD↦[0,1] such that
p(y=1∣x;θ)=f(x;θ) and p(y=0∣x;θ)=1−f(x;θ).
The model can be compactly rewritten using the following probability mass function (pmf) for all y∈Y:
p(y∣x;θ)=(f(x;θ))y(1−f(x;θ))(1−y).
The NLL function is called the binary cross-entropy:
−lnp(y∣x;θ)=−yln(f(x;θ))−(1−y)ln(1−f(x;θ)).
Training data: (x,y)∈X×Y with X=RD and Y={1,...,C}.
Model: We assume there exists a function f(⋅;θ):RD↦[0,1]C such that
p(y=k∣x;θ)=fk(x;θ) for all k∈{1,...,C} and k=1∑Cfk(x;θ)=1.
It can be compactly rewritten using the following pmf for all y∈Y:
p(y∣x;θ)=k=1∏Cp(y=k∣x;θ)1y=k=k=1∏Cfk(x;θ)1y=k.
The NLL function is called the cross-entropy:
−lnp(y∣x;θ)=−k=1∑C1y=kln(fk(x;θ)).
When f(x;θ) is an affine function of x, this model is called (multinomial) logistic regression.
It is one of the simplest discriminative models for supervised classification.
The most popular generative model for supervised classification is called the naive Bayes classifier.
For a comparison between the generative and discriminative approches to supervised classification, see (Ng and Jordan, 2002).
Ng, Andrew Y.; Jordan, Michael I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems (NIPS).
Training data: (x,y)∈X×Y with X=RP and Y=RQ.
Model: We assume there exists a function f(⋅;θ):RP↦RQ such that
p(y∣x;θ)=N(y;f(x;θ),I)=(2π)−Q/2exp(−21∥y−f(x;θ)∥22).
The NLL gives the squared error:
−lnp(y∣x;θ)=21∥y−f(x;θ)∥22+cst.
We want to infer some latent information from observations.
This corresponds to a form of unsupervised learning, using a generative Bayesian modeling approach.
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |