

Bayesian Methods for Machine Learning

Lecture 3 (part 1) - Bayesian networks



Simon Leglaive



CentraleSupélec

1 / 33

Bayesian networks

2 / 33

A probabilistic graphical model (PGM) comprises nodes (also called vertices) connected by links (also known as edges or arcs).

Each node represents a random variable, and the links express probabilistic relationships between these variables.

  • In Bayesian networks or directed graphical models (focus of today's lecture), the links of the graph have a particular directionality indicated by arrows.

  • In Markov random fields, or undirected graphical models, the links do not carry arrows and have no directional significance.

3 / 33

Example in medical diagnosis

In 1998, the LDS Hospital in Salt Lake City, Utah, developed a Bayesian network that distinguished patients with pneumonia from patients with other diseases, with high sensitivity (0.95) and specificity (0.965). It was used in the clinic for many years.

sensitivity: proportion of positives that are correctly identified; specificity: proportion of negatives that are correctly identified.

4 / 33

From the joint distribution to the graph

The graph captures the way in which the joint distribution over all the random variables can be decomposed into a product of factors, each depending only on a subset of the variables.

Consider first an arbitrary joint distribution p(a, b, c) over three variables a, b, and c.

By application of the product rule (also called chain rule) of probability, we can write the joint distribution in the following form, without making any assumption:

p(a, b, c) = p(c | a, b) p(a, b) = p(c | a, b) p(b | a) p(a)
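
This identity can be checked numerically on any discrete joint distribution. The short sketch below (plain Python, with a made-up joint table over three binary variables, not taken from the slides) verifies the factorization term by term.

```python
import random

vals = (0, 1)
raw = {(a, b, c): random.random() for a in vals for b in vals for c in vals}
Z = sum(raw.values())
p = {k: v / Z for k, v in raw.items()}           # arbitrary joint p(a, b, c)

def p_a(a):                                      # marginal p(a)
    return sum(p[a, b, c] for b in vals for c in vals)

def p_b_given_a(b, a):                           # p(b | a) = p(a, b) / p(a)
    return sum(p[a, b, c] for c in vals) / p_a(a)

def p_c_given_ab(c, a, b):                       # p(c | a, b) = p(a, b, c) / p(a, b)
    return p[a, b, c] / sum(p[a, b, cc] for cc in vals)

for (a, b, c), pabc in p.items():
    assert abs(pabc - p_c_given_ab(c, a, b) * p_b_given_a(b, a) * p_a(a)) < 1e-12
print("p(a, b, c) = p(c | a, b) p(b | a) p(a) holds for every (a, b, c)")
```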

5 / 33


The factorization p(a, b, c) = p(c | a, b) p(b | a) p(a) can be represented as a Bayesian network.

  • introduce a node for each of the random variables;
  • associate each node with the corresponding conditional distribution;
  • for each conditional distribution, add directed links (arrows) from the nodes corresponding to the variables on which the distribution is conditioned.

If there is a link going from a node a to a node b, then we say that node a is the parent of node b, and we say that node b is the child of node a.

6 / 33

This same principle applies for the joint distribution of an arbitrary number K of variables:

p(x_1, x_2, ..., x_K) = p(x_K | x_1, x_2, ..., x_{K-1}) \,...\, p(x_2 | x_1) p(x_1).

We can again represent this as a directed graph having K nodes, one for each conditional distribution on the right-hand side of the above equation. Each node has incoming links from all lower-numbered nodes.

We say that this graph is fully connected because there is a link between every pair of nodes.

7 / 33

From the graph to the joint distribution

Consider the following Bayesian network, which is not fully connected as, for instance, there is no link from x_1 to x_2 or from x_3 to x_7.

The joint distribution of all the variables in this graph can be written as a product of conditional distributions, where each variable is conditioned only on the parents of the corresponding node.

\begin{aligned} p(x_1, x_2, ..., x_7) &= p(x_7 | x_{1:6}) \, p(x_6 | x_{1:5}) \, p(x_5 | x_{1:4}) \, p(x_4 | x_{1:3}) \, p(x_3 | x_{1:2}) \, p(x_2 | x_{1}) \, p(x_{1}) \\ &= p(x_7 | x_4, x_5) \, p(x_6 | x_4) \, p(x_5 | x_1, x_3) \, p(x_4 | x_1, x_2, x_3) \, p(x_3) \, p(x_2) \, p(x_1) \end{aligned}

8 / 33

Factorizing the joint distribution in a Bayesian network

The joint distribution defined by a Bayesian network is given by the product, over all the nodes of the graph, of a conditional distribution for each node conditioned on the variables corresponding to the parents of that node.

Thus, for a graph with K nodes, the joint distribution is given by:

p(\mathbf{x}) = \prod_{k=1}^K p(x_k | \text{pa}_k),

where \text{pa}_k denotes the set of parents of x_k and \mathbf{x} = \{x_1, x_2, ..., x_K\}.

This key equation expresses the factorization properties of the joint distribution for a Bayesian network.
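
As a concrete illustration of this factorization, here is a minimal sketch of how the joint probability of a full assignment could be evaluated in code. The three-node network and its conditional probability values are hypothetical, chosen only to show the bookkeeping: each node stores its parents and a function returning p(x_k | pa_k).

```python
# Hypothetical binary network: x1 -> x2, (x1, x2) -> x3.
# Each entry: (parents, function giving p(value | parent values)).
model = {
    "x1": ((), lambda v, pa: 0.6 if v == 1 else 0.4),
    "x2": (("x1",),
           lambda v, pa: (0.9 if v == 1 else 0.1) if pa["x1"] == 1
                         else (0.3 if v == 1 else 0.7)),
    "x3": (("x1", "x2"),
           lambda v, pa: 0.5),          # uniform whatever the parents are
}

def joint(model, x):
    """Evaluate p(x) = prod_k p(x_k | pa_k) for a full assignment x."""
    prob = 1.0
    for node, (parents, cpt) in model.items():
        prob *= cpt(x[node], {name: x[name] for name in parents})
    return prob

print(joint(model, {"x1": 1, "x2": 0, "x3": 1}))   # 0.6 * 0.1 * 0.5 = 0.03
```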

Although we have considered each node to correspond to a single variable, we can equally well associate sets of variables and vector-valued variables with the nodes of a graph.

9 / 33

Important restriction

This is valid as long as there are no directed cycles, i.e. there are no closed paths within the graph such that we can move from node to node along links following the direction of the arrows and end up back at the starting node.

Bayesian networks are also called directed acyclic graphs, or DAGs.

10 / 33

Formal definition of a Bayesian network

A Bayesian network is a directed graph G = (V, E) together with:

  • A random variable x_k for each node k \in V
  • A conditional probability distribution p(x_k | \text{pa}_k) for each node, defining the probability distribution of x_k conditioned on its parents.
11 / 33

Example (from lecture 2)

p(\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_N, z) = p(z) \prod_{i=1}^N p(\mathbf{x}_i | z)

Latent variables are represented with empty circles, observations with filled circles.

The rectangle corresponds to the plate notation: the sub-graph contained in a rectangle is repeated according to the indicated indices. Any link that crosses a plate boundary is also replicated.

13 / 33

Generative model and ancestral sampling

A Bayesian network captures the causal process by which the data were generated. For this reason, such models are often called generative models.

We can generate data from the definition of the Bayesian network, by sampling successively from the individual conditional distributions.

This method is called ancestral sampling: each variable is sampled given the previously drawn values of its parents, following an ordering consistent with the arrows.

14 / 33

For example, assume we want to sample from p(x_1, x_2, x_3) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2), where we know each one of the conditional distributions.

Ancestral sampling:

  • we first sample \tilde{x}_1 from p(x_1)
  • then we sample \tilde{x}_2 from p(x_2 | \tilde{x}_1)
  • finally we sample \tilde{x}_3 from p(x_3 | \tilde{x}_1, \tilde{x}_2)

We obtain a sample (\tilde{x}_1, \tilde{x}_2, \tilde{x}_3) from the joint distribution p(x_1, x_2, x_3).

To obtain a sample from some marginal distribution corresponding to a subset of the variables, e.g. p(x_2), we simply take the sampled values for the required nodes and ignore the sampled values for the remaining nodes, e.g. \tilde{x}_2.
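
Here is a minimal sketch of ancestral sampling for this three-variable example, assuming (purely for illustration) that all variables are binary and that the conditional distributions are the hand-picked Bernoulli tables written in the code.

```python
import random

def bernoulli(p):
    """Return 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

def sample_joint():
    # Hypothetical conditional distributions, sampled in topological order.
    x1 = bernoulli(0.7)                                    # x1 ~ p(x1)
    x2 = bernoulli(0.9 if x1 == 1 else 0.2)                # x2 ~ p(x2 | x1)
    x3 = bernoulli(0.8 if (x1 == 1 and x2 == 1) else 0.1)  # x3 ~ p(x3 | x1, x2)
    return x1, x2, x3

samples = [sample_joint() for _ in range(10_000)]

# A sample from the marginal p(x2) is obtained by keeping the x2 component
# and ignoring the other sampled values.
x2_samples = [x2 for (_, x2, _) in samples]
print(sum(x2_samples) / len(x2_samples))   # Monte Carlo estimate of p(x2 = 1)
```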

15 / 33

D-Separation

16 / 33

Conditional independence

  • Consider three variables a, b, and c, and suppose that the conditional distribution of a, given b and c, is such that it does not depend on b:

    p(a | b, c) = p(a | c).

    We say that a is conditionally independent of b given c.

  • This can be expressed in a slightly different way if we consider the joint distribution of a and b conditioned on c:

    p(a, b | c) = p(a | b, c) p(b | c) = p(a | c) p(b | c).

  • This says that the variables a and b are statistically independent, given c.

17 / 33
  • Conditional independence properties simplify both the structure of a model and the computations needed to perform inference and learning in Bayesian networks.

  • An important and elegant feature of graphical models is that conditional independence properties of the joint distribution can be read directly from the graph without having to perform any analytical manipulations.

  • The general framework for achieving this is called D-separation, where "D" stands for "directed".

18 / 33

To motivate and illustrate the concept of D-separation, let’s start by looking at three simple Bayesian network structures with three nodes a, b and c.

19 / 33

"Tail-to-tail" or "common parent" structure

  • None of the variables are observed.

  • The node c is said to be tail-to-tail because it is connected to the tails of the two arrows.

  • The joint distribution can be written as:

    p(a, b, c) = p(c) p(a | c) p(b | c)

  • Are a and b independent?

    p(a, b) = \int p(a, b, c) \, dc = \int p(c) p(a | c) p(b | c) \, dc \neq p(a) p(b)

    because by the model definition p(a | c) \neq p(a) and p(b | c) \neq p(b).

    Intuitively, c connects a and b, making them dependent.

We assume continuous random variables. In the discrete case, the integral is simply replaced by a sum.

20 / 33

"Tail-to-tail" or "common parent" structure

  • The variable c is now observed.

  • The joint distribution can be written as:

    p(a, b, c) = p(c) p(a | c) p(b | c)

  • Are a and b conditionally independent?

    p(a, b | c) = \frac{p(a, b, c)}{p(c)} = \frac{p(c) p(a | c) p(b | c)}{p(c)} = p(a | c) p(b | c)

    c contains all the information that determines the outcomes of a and b. Once it is observed, there is nothing else that affects these variables’ outcomes. In other words, p(a, b | c) = p(a | b, c) p(b | c) = p(a | c) p(b | c).

21 / 33

"Head-to-tail" or "cascade" structure

  • None of the variables are observed.

  • The node c is said to be head-to-tail because it is connected to the head and tail of the left and right arrows, respectively.

  • The joint distribution can be written as:

    p(a, b, c) = p(a) p(c | a) p(b | c)

  • Are a and b independent?

    p(a, b) = \int p(a, b, c) \, dc = \int p(a) p(c | a) p(b | c) \, dc \neq p(a) p(b)

    because by the model definition p(b | c) \neq p(b).

    c connects a and b, making them dependent.

22 / 33

"Head-to-tail" or "cascade" structure

  • The variable c is now observed.

  • The joint distribution can be written as:

    p(a, b, c) = p(a) p(c | a) p(b | c)

  • Are a and b conditionally independent?

    p(a, b | c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a) p(c | a)}{p(c)} p(b | c) = p(a | c) p(b | c)

    c contains all the information that determines the outcome of b, so once it is observed, a has no influence on b anymore. In other words, p(a, b | c) = p(b | a, c) p(a | c) = p(b | c) p(a | c).

23 / 33

"Head-to-head" or "V" structure

  • None of the variables are observed.

  • The node c is said to be head-to-head because it is connected to the heads of the two arrows.

  • The joint distribution can be written as:

    p(a, b, c) = p(a) p(b) p(c | a, b)

  • Are a and b independent?

    p(a, b) = \int p(a, b, c) \, dc = \int p(a) p(b) p(c | a, b) \, dc = p(a) p(b)

    a and b are two independent factors that determine the outcome of c.

24 / 33

"Head-to-head" or "V" structure

  • The variable c is now observed.

  • The node c is said to be head-to-head because it is connected to the heads of the two arrows.

  • The joint distribution can be written as:

    p(a, b, c) = p(a) p(b) p(c | a, b)

  • Are a and b conditionally independent?

    p(a, b | c) = \frac{p(a, b, c)}{p(c)} = \frac{p(c | a, b) p(a)}{p(c)} p(b) \quad \neq \quad p(a | c) p(b | c) = \frac{p(c | a) p(a)}{p(c)} p(b | c)

    because by the model definition p(c | a, b) \neq p(c | a) and p(b | c) \neq p(b).

    Observing c makes a and b dependent, as it is a common child of the two nodes. Suppose that c = a + b: if we know the value of c, then a and b cannot vary independently.

    This third example has the opposite behaviour from the first two (a small numeric check follows below).
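
This "explaining away" behaviour can be checked numerically. The sketch below assumes a toy model that is not taken from the slides: a and b are independent fair coin flips and c = a OR b. Marginally p(a, b) = p(a) p(b), but conditioning on c = 1 makes a and b dependent.

```python
from itertools import product

def p_joint(a, b, c):
    """p(a, b, c) = p(a) p(b) p(c | a, b), with c deterministically equal to a OR b."""
    return 0.5 * 0.5 * (1.0 if c == (a | b) else 0.0)

def p_ab(a, b):
    """Marginal p(a, b), summing c out."""
    return sum(p_joint(a, b, c) for c in (0, 1))

def p_ab_given_c(a, b, c):
    """Conditional p(a, b | c) = p(a, b, c) / p(c)."""
    p_c = sum(p_joint(x, y, c) for x, y in product((0, 1), repeat=2))
    return p_joint(a, b, c) / p_c

# Unconditionally, a and b are independent: p(a, b) = 0.25 for every pair.
print([p_ab(a, b) for a, b in product((0, 1), repeat=2)])   # [0.25, 0.25, 0.25, 0.25]

# Given c = 1, they are not: p(a=0, b=0 | c=1) = 0 while p(a=0 | c=1) p(b=0 | c=1) > 0.
p_a0 = sum(p_ab_given_c(0, b, 1) for b in (0, 1))
p_b0 = sum(p_ab_given_c(a, 0, 1) for a in (0, 1))
print(p_ab_given_c(0, 0, 1), p_a0 * p_b0)                   # 0.0 vs ~0.111
```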

25 / 33

Summary for a 3-variable graph

  • A tail-to-tail (common parent) node makes the two other nodes conditionally independent when it is observed.
  • A head-to-tail (cascade) node makes the two other nodes conditionally independent when it is observed.
  • A head-to-head (V-structure) node makes the two other nodes conditionally dependent when it is observed, and/or when at least one of its descendants is observed.

    Suppose that c = a + b and d = c + 2: if we know the value of c and/or d, then a and b cannot vary independently.


We can apply these 3 principles recursively to analyze larger Bayesian networks with arbitrary structure.

This is the notion of D-separation.


26 / 33

Definition of D-Separation

Consider a Bayesian network in which A, B, and C are arbitrary nonintersecting sets of nodes.


We say that A and B are D-separated given C if all possible paths that connect any node in A to any node in B are blocked given C.

Equivalently, A and B are D-separated given C if there is no active (i.e. unblocked) path connecting them.

If A and B are D-separated given C, then p(A, B | C) = p(A | C) p(B | C).


A path is said to be blocked given observed variables O if it includes a node Y such that either:

  • Y is a head-to-tail node (X \rightarrow Y \rightarrow Z or X \leftarrow Y \leftarrow Z) and Y is in O (i.e. observed), or
  • Y is a tail-to-tail node (X \leftarrow Y \rightarrow Z) and Y is in O (i.e. observed), or
  • Y is a head-to-head node (X \rightarrow Y \leftarrow Z) and neither Y nor any of its descendants is in O (i.e. not observed).
27 / 33

Recipe for D-separation:

  • List all the paths between any node in A and any node in B.

  • If all paths are blocked, A and B are D-separated given C.

  • Equivalently, if you can find one active path (i.e. not blocked), A and B are not D-separated given C.

28 / 33

A path is called active given observed variables O if for every consecutive triplet of variables X, Y, Z on the path, one of the following holds (a small code sketch of this test follows the list):

  • Y is a head-to-tail node (X \rightarrow Y \rightarrow Z or X \leftarrow Y \leftarrow Z) and Y is not in O, i.e. not observed.
  • Y is a tail-to-tail node (X \leftarrow Y \rightarrow Z) and Y is not in O, i.e. not observed.
  • Y is a head-to-head node (X \rightarrow Y \leftarrow Z) and Y or one of its descendants is in O, i.e. is observed.
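
These rules translate directly into a small path-enumeration test. The sketch below is plain Python on a toy DAG given as a dictionary mapping each node to the set of its children; the graph at the bottom is hypothetical, and the code is meant for small hand-built examples, not as an efficient general algorithm.

```python
def children(dag, node):
    return dag.get(node, set())

def descendants(dag, node):
    """All nodes reachable from `node` by following the arrows."""
    seen, stack = set(), list(children(dag, node))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(children(dag, n))
    return seen

def undirected_paths(dag, start, goal):
    """All simple paths between start and goal, ignoring arrow directions."""
    nbrs = {}
    for parent, kids in dag.items():
        for kid in kids:
            nbrs.setdefault(parent, set()).add(kid)
            nbrs.setdefault(kid, set()).add(parent)
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == goal:
            paths.append(path)
            continue
        for nxt in nbrs.get(path[-1], set()):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths

def path_is_blocked(dag, path, observed):
    for x, y, z in zip(path, path[1:], path[2:]):
        head_to_head = y in children(dag, x) and y in children(dag, z)
        if head_to_head:
            # Blocked unless y or one of its descendants is observed.
            if y not in observed and not (descendants(dag, y) & observed):
                return True
        else:
            # Head-to-tail or tail-to-tail: blocked if y is observed.
            if y in observed:
                return True
    return False

def d_separated(dag, a, b, observed):
    """True if every path between a and b is blocked given the observed set."""
    return all(path_is_blocked(dag, p, observed)
               for p in undirected_paths(dag, a, b))

# Toy DAG (hypothetical, not the one on the slide): a -> c, b -> c, c -> e.
dag = {"a": {"c"}, "b": {"c"}, "c": {"e"}}
print(d_separated(dag, "a", "b", set()))    # True: collider c not observed
print(d_separated(dag, "a", "b", {"e"}))    # False: a descendant of c is observed
print(d_separated(dag, "a", "b", {"c"}))    # False: collider c itself is observed
```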

Are a and d D-separated given e?

Are a and d D-separated given b?

29 / 33
  • In the graph on the left, the path from a to d is not blocked by node b, because it is a tail-to-tail node which is not observed; nor is it blocked by node c, because it is a head-to-head node with a descendant e which is observed. Thus a and d are not D-separated given e.

  • In the graph on the right, the path from a to d is blocked by node b, because this is a tail-to-tail node that is observed. This path is also blocked by node c, because c is a head-to-head node and neither it nor its descendant e is in the conditioning set. So a and d are indeed D-separated given b.

Are a and e D-separated given b and c?

Are b and c D-separated given a and e?

30 / 33
  • In the graph on the left, the path from a to e is blocked by node b, because this is a head-to-tail node that is observed. The same applies for node c. Thus a and e are D-separated given b and c.

  • In the graph on the right, b and c are connected by an active path (i.e. a path that is not blocked). This path goes through node e, which is a head-to-head node that is observed. Therefore b and c are not D-separated given a and e, even though the path which goes through node a is blocked, as node a is a tail-to-tail node that is observed.

  • You may find all this pretty abstract, but it is actually essential when you perform inference in Bayesian networks.

Homework

For each of the three Markov models shown below, are x_{t-1} and y_{t+1} D-separated given

  1. x_t
  2. y_t
  3. \{x_t, y_t\}?


The left model is called a hidden Markov model (HMM). HMMs can be applied in many fields where the goal is to recover a data sequence that is not immediately observable. For instance, in automatic speech recognition we observe a sequence of audio features and want to infer a sequence of language tokens (e.g., phonemes).

31 / 33
  • With the left model: 1. Yes, 2. No, 3. Yes.
  • With the center model: 1. No, 2. Yes, 3. Yes.
  • With the right model: 1. No, 2. No, 3. Yes.

Markov blanket

  • Consider the joint distribution of an arbitrary number K of variables p(x_1, x_2, ..., x_K) represented by a Bayesian network with K nodes.

  • The conditional distribution of an arbitrary variable x_k given all the remaining variables in the graph only depends on the variables in its Markov blanket, defined as the set of nodes comprising its parents, its children, and its co-parents (the other parents of its children):

    p\left(x_k |\{x_{i}\}_{i \neq k } \right) = p\left(x_k | \text{MB}(x_k) \right).

  • Given its Markov blanket, x_k is conditionally independent of all the remaining variables in the graph.

  • The Markov blanket contains all the information one needs in order to infer x_k.

32 / 33

Consider the conditional distribution of a particular variable x_k given all the remaining ones \{x_{i}\}_{i \neq k}:

\begin{aligned} p\left(x_k |\{x_{i}\}_{i \neq k } \right) &= \frac{p(x_1, x_2, ..., x_K)}{p(\{x_{i}\}_{i \neq k })} \\ &= \frac{p(x_1, x_2, ..., x_K)}{\int p(x_1, x_2, ..., x_K) \, d x_k} \\ &= \frac{\prod_{j=1}^K p(x_j | \text{pa}_j)}{\int \prod_{j=1}^K p(x_j | \text{pa}_j) \, d x_k}. \end{aligned}

  • Any factor p(x_j | \text{pa}_j) that does not have any functional dependence on x_k can be taken outside the integral and will therefore cancel between numerator and denominator.
  • The only factors that remain will be the conditional distribution p(x_k | \text{pa}_k) for node x_k itself, together with the conditional distributions p(x_j | \text{pa}_j) such that x_k is in \text{pa}_j.

  • p(x_k | \text{pa}_k) will depend on the parents of x_k, whereas the remaining conditionals p(x_j | \text{pa}_j) will depend on the children of x_k, as well as its co-parents (the other parents of x_j).

Therefore, p\left(x_k |\{x_{i}\}_{i \neq k } \right) only depends on the Markov blanket of x_k.
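
Reading the Markov blanket off the graph structure is purely mechanical: collect the parents, the children, and the other parents of those children. Here is a minimal sketch, using the same node-to-children dictionary convention as the D-separation sketch above and a hypothetical graph.

```python
def markov_blanket(dag, node):
    """Markov blanket of `node` in a DAG given as a dict mapping
    each node to the set of its children."""
    children = dag.get(node, set())
    parents = {n for n, kids in dag.items() if node in kids}
    co_parents = {n for n, kids in dag.items() if kids & children} - {node}
    return parents | children | co_parents

# Hypothetical DAG: z -> x, z -> y, w -> y, y -> t.
dag = {"z": {"x", "y"}, "w": {"y"}, "y": {"t"}}
print(markov_blanket(dag, "z"))   # {'x', 'y', 'w'}: children x and y, plus co-parent w
```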

33 / 33
