Simon Leglaive
CentraleSupélec
A probabilistic graphical model (PGM) comprises nodes (also called vertices) connected by links (also known as edges or arcs).
Each node represents a random variable, and the links express probabilistic relationships between these variables.
In Bayesian networks, or directed graphical models (the focus of today's lecture), the links of the graph have a particular directionality indicated by arrows.
In Markov random fields, or undirected graphical models, the links do not carry arrows and have no directional significance.
In 1998, the LDS Hospital in Salt Lake City, Utah, developed a Bayesian network to distinguish patients with pneumonia from patients with other diseases, with high sensitivity (0.95) and specificity (0.965). It was used for many years in the clinic.
sensitivity: proportion of positives that are correctly identified; specificity: proportion of negatives that are correctly identified.
The graph captures the way in which the joint distribution over all the random variables can be decomposed into a product of factors, each depending only on a subset of the variables.
Consider first an arbitrary joint distribution p(a,b,c) over three variables a, b, and c.
By application of the product rule (also called chain rule) of probability, we can write the joint distribution in the following form, without making any assumption:
p(a,b,c)=p(c∣a,b)p(a,b)=p(c∣a,b)p(b∣a)p(a)
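As a quick sanity check, this decomposition can be verified numerically on a toy discrete distribution. The sketch below (a made-up joint table over three binary variables, purely illustrative) extracts the chain-rule factors and confirms that their product recovers the joint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary (made-up) joint distribution over three binary variables a, b, c.
joint = rng.random((2, 2, 2))
joint /= joint.sum()                               # p(a, b, c)

# Factors of the chain rule p(a, b, c) = p(c | a, b) p(b | a) p(a).
p_a = joint.sum(axis=(1, 2))                       # p(a)
p_ab = joint.sum(axis=2)                           # p(a, b)
p_b_given_a = p_ab / p_a[:, None]                  # p(b | a)
p_c_given_ab = joint / p_ab[:, :, None]            # p(c | a, b)

# Rebuild the joint from the factors and compare.
rebuilt = p_c_given_ab * p_b_given_a[:, :, None] * p_a[:, None, None]
assert np.allclose(rebuilt, joint)
```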
The factorization p(a,b,c)=p(c∣a,b)p(b∣a)p(a) can be represented as a Bayesian network.
If there is a link going from node a to node b, we say that a is the parent of b and that b is the child of a.
This same principle applies for the joint distribution of an arbitrary number K of variables:
p(x1,x2,...,xK)=p(xK∣x1,x2,...,xK−1)...p(x2∣x1)p(x1).
We can again represent this as a directed graph having K nodes, one for each conditional distribution on the right-hand side of the above equation. Each node has incoming links from all lower-numbered nodes.
We say that this graph is fully connected because there is a link between every pair of nodes.
Consider the following Bayesian network, which is not fully connected as, for instance, there is no link from x1 to x2 or from x3 to x7.
The joint distribution of all the variables in this graph can be written as a product of conditional distributions, where each variable is conditioned only on the parents of the corresponding node.
p(x1, x2, ..., x7) = p(x7∣x1:6) p(x6∣x1:5) p(x5∣x1:4) p(x4∣x1:3) p(x3∣x1:2) p(x2∣x1) p(x1) = p(x7∣x4,x5) p(x6∣x4) p(x5∣x1,x3) p(x4∣x1,x2,x3) p(x3) p(x2) p(x1)
The joint distribution defined by a Bayesian network is given by the product, over all the nodes of the graph, of a conditional distribution for each node conditioned on the variables corresponding to the parents of that node.
Thus, for a graph with K nodes, the joint distribution is given by:
p(x) = ∏_{k=1}^{K} p(xk ∣ pak),
where pak denotes the set of parents of xk and x={x1,x2,...,xK}.
This key equation expresses the factorization properties of the joint distribution for a Bayesian network.
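To make the factorization concrete, here is a minimal sketch (with made-up conditional probabilities for binary variables, and a hypothetical helper p_k_given_pa) that evaluates the joint of the seven-node example above as a product of one conditional per node given its parents, and checks that it sums to one over all configurations.

```python
import itertools

# Parent sets of the seven-node example:
# p(x1)p(x2)p(x3) p(x4|x1,x2,x3) p(x5|x1,x3) p(x6|x4) p(x7|x4,x5)
parents = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [1, 3], 6: [4], 7: [4, 5]}

# Placeholder conditionals p(x_k | pa_k) for binary variables.
# The numbers are made up purely for illustration.
def p_k_given_pa(k, x_k, pa_values):
    p_one = min(0.9, 0.3 + 0.1 * (k % 3) + 0.1 * sum(pa_values))
    return p_one if x_k == 1 else 1.0 - p_one

def joint(x):
    """x maps node index -> value; returns p(x1, ..., x7) as a product of factors."""
    prob = 1.0
    for k, pa in parents.items():
        prob *= p_k_given_pa(k, x[k], [x[j] for j in pa])
    return prob

# The product of normalized conditionals defines a valid joint distribution.
total = sum(joint(dict(zip(parents, vals)))
            for vals in itertools.product([0, 1], repeat=7))
assert abs(total - 1.0) < 1e-12
```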
Although we have considered each node to correspond to a single variable, we can equally well associate sets of variables and vector-valued variables with the nodes of a graph.
Important restriction
This is valid as long as there are no directed cycles, i.e. there are no closed paths within the graph such that we can move from node to node along links following the direction of the arrows and end up back at the starting node.
Bayesian networks are also called directed acyclic graphs, or DAGs.
A Bayesian network is a directed acyclic graph G=(V,E) together with one random variable xk for each node k∈V and, for each node, a conditional distribution p(xk∣pak) of that variable given its parents.
p(x1, x2, ..., xN, z) = p(z) ∏_{i=1}^{N} p(xi ∣ z)
Latent variables are represented with empty circles, observations with filled circles.
The rectangle corresponds to the plate notation: the sub-graph contained in a rectangle is repeated according to the indicated indices. Any link that crosses a plate boundary is also replicated.
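As an illustration of this plate model, the sketch below samples once from a simple instance in which the prior p(z) and the conditionals p(xi ∣ z) are assumed Gaussian; these distributional choices are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 5

# p(z): prior over the shared latent variable (assumed standard normal here).
z = rng.normal(loc=0.0, scale=1.0)

# p(x_i | z): the N observations are drawn independently given z
# (assumed Gaussian with mean z and unit variance for this sketch).
x = rng.normal(loc=z, scale=1.0, size=N)

print("latent z  :", z)
print("observed x:", x)
```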
A Bayesian network captures the causal process by which the data were generated. For this reason, such models are often called generative models.
We can generate data from the definition of the Bayesian network, by sampling successively from the individual conditional distributions.
This method is called ancestral sampling: each variable is sampled after its parents, so that all of a node's ancestors have been sampled before the node itself.
For example, assume we want to sample from p(x1,x2,x3)=p(x1)p(x2∣x1)p(x3∣x1,x2) where we know each one of the conditional distributions.
Ancestral sampling: first sample x~1 from p(x1), then x~2 from p(x2∣x~1), and finally x~3 from p(x3∣x~1,x~2).
We obtain a sample (x~1,x~2,x~3) from the joint distribution p(x1,x2,x3).
To obtain a sample from some marginal distribution corresponding to a subset of the variables, e.g. p(x2), we simply take the sampled values for the required nodes and ignore the sampled values for the remaining nodes, e.g. x~2.
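A minimal sketch of ancestral sampling for this three-node example, assuming made-up Bernoulli conditional distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up Bernoulli conditionals for p(x1) p(x2 | x1) p(x3 | x1, x2).
def sample_x1():
    return rng.random() < 0.6                           # p(x1 = 1) = 0.6

def sample_x2(x1):
    return rng.random() < (0.8 if x1 else 0.3)          # p(x2 = 1 | x1)

def sample_x3(x1, x2):
    return rng.random() < (0.1 + 0.4 * x1 + 0.4 * x2)   # p(x3 = 1 | x1, x2)

# Ancestral sampling: each variable is drawn after its parents.
x1 = sample_x1()
x2 = sample_x2(x1)
x3 = sample_x3(x1, x2)
print(int(x1), int(x2), int(x3))   # one sample from the joint p(x1, x2, x3)
# Keeping only x2 gives a sample from the marginal p(x2).
```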
Consider three variables a, b, and c, and suppose that the conditional distribution of a, given b and c, is such that it does not depend on b:
p(a∣b,c)=p(a∣c).
We say that a is conditionally independent of b given c.
This can be expressed in a slightly different way if we consider the joint distribution of a and b conditioned on c:
p(a,b∣c)=p(a∣b,c)p(b∣c)=p(a∣c)p(b∣c).
This says that the variables a and b are statistically independent, given c.
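This can be checked numerically. The sketch below builds a toy joint of the form p(c)p(a∣c)p(b∣c) from made-up tables, verifies that p(a,b∣c) = p(a∣c)p(b∣c) by construction, and shows that a and b are nevertheless dependent marginally.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up tables defining p(a, b, c) = p(c) p(a | c) p(b | c),
# so that a and b are conditionally independent given c by construction.
p_c = np.array([0.4, 0.6])
p_a_given_c = rng.dirichlet(np.ones(2), size=2)   # rows: c, columns: a
p_b_given_c = rng.dirichlet(np.ones(2), size=2)   # rows: c, columns: b

# Joint p(a, b, c), indexed as [a, b, c].
joint = np.einsum('c,ca,cb->abc', p_c, p_a_given_c, p_b_given_c)

# Check p(a, b | c) = p(a | c) p(b | c) for each value of c.
p_ab_given_c = joint / joint.sum(axis=(0, 1), keepdims=True)
outer = np.einsum('ca,cb->abc', p_a_given_c, p_b_given_c)
assert np.allclose(p_ab_given_c, outer)

# Marginally, however, a and b are not independent in general.
p_ab = joint.sum(axis=2)
p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
print(np.allclose(p_ab, np.outer(p_a, p_b)))      # typically False
```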
Conditional independence properties simplify both the structure of a model and the computations needed to perform inference and learning in Bayesian networks.
An important and elegant feature of graphical models is that conditional independence properties of the joint distribution can be read directly from the graph without having to perform any analytical manipulations.
The general framework for achieving this is called D-separation, where "D" stands for "directed".
To motivate and illustrate the concept of D-separation, let’s start by looking at three simple Bayesian network structures with three nodes a, b and c.
None of the variables are observed.
The node c is said to be tail-to-tail because it is connected to the tails of the two arrows.
The joint distribution can be written as:
p(a,b,c) = p(c) p(a∣c) p(b∣c)
Are a and b independent?
p(a,b) = ∫ p(a,b,c) dc = ∫ p(c) p(a∣c) p(b∣c) dc ≠ p(a)p(b) in general.
The factorization p(a)p(b) would only hold if p(a∣c)=p(a) and p(b∣c)=p(b), which the model does not assume.
Intuitively, c connects a and b, making them dependent.
We assume continuous random variables. In the discrete case, the integral is simply replaced by a sum.
The variable c is now observed.
The joint distribution can be written as:
p(a,b,c) = p(c) p(a∣c) p(b∣c)
Are a and b conditionally independent given c?
p(a,b∣c) = p(a,b,c) / p(c) = p(c) p(a∣c) p(b∣c) / p(c) = p(a∣c) p(b∣c)
c contains all the information that determines the outcomes of a and b. Once it is observed, there is nothing else that affects these variables’ outcomes. In other words, p(a,b∣c)=p(a∣b,c)p(b∣c)=p(a∣c)p(b∣c).
None of the variables are observed.
The node c is said to be head-to-tail because it is connected to the head and tail of the left and right arrows, respectively.
The joint distribution can be written as:
p(a,b,c) = p(a) p(c∣a) p(b∣c)
Are a and b independent?
p(a,b) = ∫ p(a,b,c) dc = ∫ p(a) p(c∣a) p(b∣c) dc = p(a) p(b∣a) ≠ p(a)p(b) in general.
The factorization p(a)p(b) would only hold if p(b∣a)=p(b), which the model does not assume.
c connects a and b, making them dependent.
The variable c is now observed.
The joint distribution can be written as:
p(a,b,c) = p(a) p(c∣a) p(b∣c)
Are a and b conditionally independent given c?
p(a,b∣c) = p(a,b,c) / p(c) = p(a) p(c∣a) p(b∣c) / p(c) = p(a∣c) p(b∣c), using Bayes' theorem p(a∣c) = p(a) p(c∣a) / p(c).
c contains all the information that determines the outcomes of b, so once it is observed a has no influence on b anymore. In other words, p(a,b∣c)=p(b∣a,c)p(a∣c)=p(b∣c)p(a∣c).
None of the variables are observed.
The node c is said to be head-to-head because it is connected to the heads of the two arrows.
The joint distribution can be written as:
p(a,b,c)=p(a)p(b)p(c∣a,b)
Are a and b independent?
p(a,b)=∫p(a,b,c)dc=∫p(a)p(b)p(c∣a,b)dc=p(a)p(b)
a and b are two independent factors that determine the outcome of c.
The variable c is now observed.
The joint distribution can be written as:
p(a,b,c) = p(a) p(b) p(c∣a,b)
Are a and b conditionally independent given c?
p(a,b∣c) = p(a,b,c) / p(c) = p(a) p(b) p(c∣a,b) / p(c) ≠ p(a∣c) p(b∣c) in general.
The factorization p(a∣c) p(b∣c) = p(c∣a) p(a) p(b∣c) / p(c) would only hold if p(c∣a,b)=p(c∣a) and p(b∣c)=p(b), which the model does not assume.
Observing c makes a and b dependent, as it is a common child of the two nodes. Suppose that c = a + b: if we know the value of c, a and b cannot vary independently.
This third example has the opposite behaviour from the first two.
A head-to-head node (V-structure) makes its two parent nodes conditionally dependent when it is observed, and/or when at least one of its descendants is observed.
Suppose that c = a + b and d = c + 2: if we know the value of c and/or d, a and b cannot vary independently.
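This "explaining away" behaviour can be checked numerically. The sketch below uses the deterministic example c = a + b with two independent fair binary variables (made-up numbers): a and b are independent marginally, but become dependent once c is observed.

```python
import numpy as np

# Head-to-head example with c = a + b for independent binary a and b.
p_a = np.array([0.5, 0.5])
p_b = np.array([0.5, 0.5])

# Joint p(a, b, c) with c deterministic given (a, b): c = a + b in {0, 1, 2}.
joint = np.zeros((2, 2, 3))
for a in range(2):
    for b in range(2):
        joint[a, b, a + b] = p_a[a] * p_b[b]

# Marginally, a and b are independent.
p_ab = joint.sum(axis=2)
assert np.allclose(p_ab, np.outer(p_a, p_b))

# Conditioned on c = 1, they are not: knowing a determines b.
p_ab_given_c1 = joint[:, :, 1] / joint[:, :, 1].sum()
p_a_given_c1 = p_ab_given_c1.sum(axis=1)
p_b_given_c1 = p_ab_given_c1.sum(axis=0)
print(np.allclose(p_ab_given_c1, np.outer(p_a_given_c1, p_b_given_c1)))  # False
```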
We can apply these 3 principles recursively to analyze larger Bayesian networks with arbitrary structure.
This is the notion of D-separation.
Consider a Bayesian network in which A, B, and C are arbitrary nonintersecting sets of nodes.
We say that A and B are D-separated given C if all possible paths that connect any node in A to any node in B are blocked given C.
Equivalently, A and B are D-separated given C if there is no active (i.e. unblocked) path connecting them.
If A and B are D-separated given C, then p(A,B∣C)=p(A∣C)p(B∣C).
A path is said to be blocked given observed variables O if it includes a node Y such that either:
Y is a tail-to-tail or head-to-tail node on the path and Y is in O, or
Y is a head-to-head node on the path and neither Y nor any of its descendants is in O.
Recipe for D-separation:
List all the paths between any node in A and any node in B.
If all paths are blocked, A and B are D-separated given C.
Equivalently, if you can find one active path (i.e. not blocked), A and B are not D-separated given C.
A path is called active given observed variables O if, for every consecutive triplet of variables X, Y, Z on the path, one of the following holds:
Y is a tail-to-tail node (X ← Y → Z) and Y is not in O,
Y is a head-to-tail node (X → Y → Z or X ← Y ← Z) and Y is not in O,
Y is a head-to-head node (X → Y ← Z) and Y or at least one of its descendants is in O.
Are a and d D-separated given e?
Are a and d D-separated given b?
In the graph on the left, the path from a to d is not blocked by node b because it is a tail-to-tail node which is not observed, nor is it blocked by node c because it is a head-to-head node and it has a descendant e which is observed. Thus a and d are not D-separated given e.
In the graph on the right, the path from a to d is blocked by node b because this is a tail-to-tail node that is observed. This path is also blocked by node c, because c is a head-to-head node and neither it nor its descendant e is in the conditioning set. So a and d are indeed D-separated given b.
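These two queries can also be checked programmatically. The sketch below reconstructs a DAG consistent with the description above (assumed edges a→c, b→c, b→d, c→e) and, assuming a networkx version that provides nx.d_separated (renamed is_d_separator in recent releases), tests both conditioning sets.

```python
import networkx as nx  # assumes networkx >= 2.8, which provides nx.d_separated

# DAG reconstructed from the description above: a -> c <- b -> d, and c -> e.
G = nx.DiGraph([("a", "c"), ("b", "c"), ("b", "d"), ("c", "e")])

# Left case: conditioning on e activates the head-to-head node c,
# so a and d are NOT d-separated given e.
print(nx.d_separated(G, {"a"}, {"d"}, {"e"}))   # expected: False

# Right case: conditioning on b blocks the tail-to-tail node b,
# and c stays blocked (neither c nor its descendant e is observed).
print(nx.d_separated(G, {"a"}, {"d"}, {"b"}))   # expected: True
```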
Are a and e D-separated given b and c?
Are b and c D-separated given a and e?
In the graph on the left, the path from a to e is blocked by node b because it is a head-to-tail node that is observed. The same applies to node c. Thus a and e are D-separated given b and c.
In the graph on the right, b and c are connected by an active path (i.e. a path that is not blocked). This path goes through node e, which is a head-to-head node that is observed. Therefore b and c are not D-separated given a and e, even though the path through node a is blocked, since a is a tail-to-tail node that is observed.
You may find all this pretty abstract, but it is actually essential when you perform inference in Bayesian networks.
For each of the three Markov models shown below, are xt−1 and yt+1 D-separated given
The left model is called a hidden Markov model (HMM). HMMs can be applied in many fields where the goal is to recover a data sequence that is not immediately observable. For instance, in automatic speech recognition we observe a sequence of audio features and want to infer a sequence of language tokens (e.g., phonemes).
Consider the joint distribution of an arbitrary number K of variables p(x1,x2,...,xK) represented by a Bayesian network with K nodes.
The conditional distribution of an arbitrary variable xk given all the remaining variables in the graph only depends on the variables in its Markov blanket, defined as the set of nodes comprising its parents, its children, and its co-parents:
p(xk ∣ {xi}i≠k) = p(xk ∣ MB(xk)).
Given its Markov blanket, xk is conditionally independent of all the remaining variables in the graph.
The Markov blanket contains all the information one needs in order to infer xk.
Consider the conditional distribution of a particular variable xk given all the remaining ones {xi}i≠k:
p(xk ∣ {xi}i≠k) = p(x1, x2, ..., xK) / p({xi}i≠k) = p(x1, x2, ..., xK) / ∫ p(x1, x2, ..., xK) dxk = ∏_{j=1}^{K} p(xj ∣ paj) / ∫ ∏_{j=1}^{K} p(xj ∣ paj) dxk.
The only factors that remain will be the conditional distribution p(xk∣pak) for node xk itself, together with the conditional distributions p(xj∣paj) where xk is in paj.
p(xk∣pak) will depend on the parents of xk, whereas the remaining conditionals p(xj∣paj) will depend on the children of xk, as well as its co-parents (the other parents of xj).
Therefore, p(xk ∣ {xi}i≠k) only depends on the Markov blanket of xk.
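As a small illustration, the sketch below computes the Markov blanket of x4 in the seven-node example graph used earlier, directly from the parent/child structure (using networkx only for the graph representation).

```python
import networkx as nx

# Seven-node example graph from earlier in the lecture.
G = nx.DiGraph([
    ("x1", "x4"), ("x2", "x4"), ("x3", "x4"),   # parents of x4
    ("x1", "x5"), ("x3", "x5"),                 # parents of x5
    ("x4", "x6"),                               # parent of x6
    ("x4", "x7"), ("x5", "x7"),                 # parents of x7
])

def markov_blanket(G, node):
    """Parents, children, and co-parents (other parents of the children)."""
    parents = set(G.predecessors(node))
    children = set(G.successors(node))
    co_parents = {p for c in children for p in G.predecessors(c)} - {node}
    return parents | children | co_parents

print(sorted(markov_blanket(G, "x4")))
# -> ['x1', 'x2', 'x3', 'x5', 'x6', 'x7']: parents x1,x2,x3; children x6,x7; co-parent x5
```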