class: center, middle
# Bayesian Methods for Machine Learning .small-vspace[ ] ### Lecture 3 (part 1) - Bayesian networks
.bold[Simon Leglaive]
.tiny[CentraleSupélec] --- class: center, middle ## Bayesian networks --- class: middle A probabilistic graphical model (PGM) comprises **nodes** (also called vertices) connected by **links** (also known as edges or arcs). Each node represents a random variable, and the links express probabilistic relationships between these variables. - In **Bayesian networks** or **directed graphical models** (focus of today's lecture), the links of the graph have a particular directionality indicated by arrows. - In **Markov random fields**, or **undirected graphical models**, the links do not carry arrows and have no directional significance. --- ### Example in medical diagnosis In 1998, the LDS Hospital in Salt Lake City, Utah, developed a Bayesian network to distinguish patients with pneumonia from patients with other diseases, with high sensitivity (0.95) and specificity (0.965). It was used for many years in the clinic. .center.width-60[![](images/diagnostic_bayes_net.png)] .footnote[sensitivity: proportion of positives that are correctly identified; specificity: proportion of negatives that are correctly identified.] --- class: middle ### From the joint distribution to the graph The graph captures the way in which **the joint distribution over all the random variables can be decomposed into a product of factors, each depending only on a subset of the variables**. Consider first an arbitrary joint distribution $p(a, b, c)$ over three variables $a$, $b$, and $c$. By application of the product rule (also called chain rule) of probability, we can write the joint distribution in the following form, without making any assumption: $$ p(a,b,c) = p(c | a, b) p(a, b) = p(c | a, b) p(b | a) p(a) $$ --- The factorization $p(a,b,c) = p(c | a, b) p(b | a) p(a)$ can be represented as a **Bayesian network**. .left-column[ - introduce a node for each of the random variables; ] .right-column[ .center.width-60[![](images/abc_graph_1.png)] ] .reset-column[ ] --- count: false The factorization $p(a,b,c) = p(c | a, b) p(b | a) p(a)$ can be represented as a **Bayesian network**. .left-column[ - introduce a node for each of the random variables; - associate each node with the corresponding conditional distribution; ] .right-column[ .center.width-60[![](images/abc_graph_2.png)] ] .reset-column[ ] --- count: false The factorization $p(a,b,c) = p(c | a, b) p(b | a) p(a)$ can be represented as a **Bayesian network**. .left-column[ - introduce a node for each of the random variables; - associate each node with the corresponding conditional distribution; - for each conditional distribution, add directed links (arrows) from the nodes corresponding to the variables on which the distribution is conditioned. ] .right-column[ .center.width-60[![](images/abc_graph_3.png)] ] .reset-column[ ] -- count: false If there is a link going from a node $a$ to a node $b$, then we say that node $a$ is the **parent** of node $b$, and we say that node $b$ is the **child** of node $a$. --- class: middle This same principle applies for the joint distribution of an arbitrary number $K$ of variables: $$ p(x\_1, x\_2, ..., x\_K) = p(x\_K | x\_1, x\_2, ..., x\_{K-1} ) \,...\, p(x\_2 | x\_1) p(x\_1). $$ We can again represent this as a directed graph having $K$ nodes, one for each conditional distribution on the right-hand side of the above equation. Each node has incoming links from all lower-numbered nodes. We say that this graph is **fully connected** because there is a link between every pair of nodes. 
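---

class: middle

Since no assumption is made, the chain-rule factorization can be checked numerically on any joint distribution. Below is a minimal NumPy sketch (the random table `p_abc` and its dimensions are arbitrary illustrative choices) that builds a joint distribution over three discrete variables and checks that the factorization $p(a,b,c) = p(c | a, b) p(b | a) p(a)$ holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary joint distribution p(a, b, c) over three discrete variables
# taking 2, 3 and 4 values respectively (purely illustrative sizes).
p_abc = rng.random((2, 3, 4))
p_abc /= p_abc.sum()

# Marginals and conditionals obtained from the joint via the sum and product rules.
p_ab = p_abc.sum(axis=2)                 # p(a, b)
p_a = p_ab.sum(axis=1)                   # p(a)
p_b_given_a = p_ab / p_a[:, None]        # p(b | a)
p_c_given_ab = p_abc / p_ab[:, :, None]  # p(c | a, b)

# Chain rule: p(a, b, c) = p(c | a, b) p(b | a) p(a), without any assumption.
reconstruction = p_c_given_ab * p_b_given_a[:, :, None] * p_a[:, None, None]
print(np.allclose(p_abc, reconstruction))  # True
```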
--- ### From the graph to the joint distribution .left-column[ Consider the following Bayesian network, which is not fully connected as, for instance, there is no link from $x\_1$ to $x\_2$ or from $x\_3$ to $x\_7$. The joint distribution of all the variables in this graph can be written as a product of conditional distributions, where each variable is **conditioned only on the parents** of the corresponding node. ] .right-column[ .center.width-50[![](images/graph_bishop.png)] ] .reset-column[ ] $$ \begin{aligned} p(x\_1, x\_2, ... x\_7) &= p(x\_7 | x\_{1:6}) p(x\_6 | x\_{1:5}) p(x\_5 | x\_{1:4}) p(x\_4 | x\_{1:3}) p(x\_3 | x\_{1:2}) p(x\_2 | x\_{1}) p(x\_{1}) \\\\ &= \end{aligned} $$ --- count: false ### From the graph to the joint distribution .left-column[ Consider the following Bayesian network, which is not fully connected as, for instance, there is no link from $x\_1$ to $x\_2$ or from $x\_3$ to $x\_7$. The joint distribution of all the variables in this graph can be written as a product of conditional distributions, where each variable is **conditioned only on the parents** of the corresponding node. ] .right-column[ .center.width-50[![](images/graph_bishop.png)] ] .reset-column[ ] $$ \begin{aligned} p(x\_1, x\_2, ... x\_7) &= p(x\_7 | x\_{1:6}) p(x\_6 | x\_{1:5}) p(x\_5 | x\_{1:4}) p(x\_4 | x\_{1:3}) p(x\_3 | x\_{1:2}) p(x\_2 | x\_{1}) p(x\_{1}) \\\\ &= p(x\_7 | x\_4, x\_5) p(x\_6 | x\_4) p(x\_5 | x\_1, x\_3) p(x\_4 | x\_1, x\_2, x\_3) p(x\_3) p(x\_2) p(x\_1) \end{aligned} $$ --- ### Factorizing the joint distribution in a Bayesian network The joint distribution defined by a Bayesian network is given by the product, over all the nodes of the graph, of a conditional distribution for each node conditioned on the variables corresponding to the parents of that node. Thus, for a graph with $K$ nodes, the joint distribution is given by: $$ p(\mathbf{x}) = \prod\_{k=1}^K p(x\_k | \text{pa}\_k), $$ where $\text{pa}\_k$ denotes the set of **parents** of $x\_k$ and $\mathbf{x} = \\{x\_1, x\_2, ..., x\_K\\}$. **This key equation expresses the factorization properties of the joint distribution for a Bayesian network.** Although we have considered each node to correspond to a single variable, we can equally well associate sets of variables and vector-valued variables with the nodes of a graph. --- class: middle .bold[Important restriction] This is valid as long as there are no **directed cycles**, i.e. there are no closed paths within the graph such that we can move from node to node along links following the direction of the arrows and end up back at the starting node. Bayesian networks are also called **directed acyclic graphs**, or DAGs. --- class: middle ### Formal definition of a Bayesian network A Bayesian network is a directed graph $G = (V, E)$ together with: - A random variable $x\_k$ for each node $k \in V$ - A conditional probability distribution $p(x\_k | \text{pa}\_k )$ for each node, defining the probability distribution of $x\_k$ conditioned on its parents. --- ### Example (from lecture 2) .grid[ .kol-3-5[ .center.width-100[![](images/bayes_8.svg)] $$ p(\mathbf{x}\_1, \mathbf{x}\_2, ..., \mathbf{x}\_N, z) = p(z) \prod\_{i=1}^N p(\mathbf{x}\_i | z) $$ ] .kol-2-5[ .center.width-90[![](images/BN_example.png)] .medium[Latent variables are represented with empty circles, observations with filled circles.] 
] ] --- ### Example (from lecture 2) .grid[ .kol-3-5[ .center.width-100[![](images/bayes_8.svg)] $$ p(\mathbf{x}\_1, \mathbf{x}\_2, ..., \mathbf{x}\_N, z) = p(z) \prod\_{i=1}^N p(\mathbf{x}\_i | z) $$ ] .kol-2-5[ .center.width-40[![](images/BN_example_plate.png)] .medium[Latent variables are represented with empty circles, observations with filled circles. The rectangle corresponds to the plate notation: the sub-graph contained in a rectangle is repeated according to the indicated indices. Any link that crosses a plate boundary is also replicated.] ] ] --- ### Generative model and ancestral sampling A Bayesian network captures the causal process by which the data were generated. For this reason, such models are often called **generative models**. -- count: false We can generate data from the definition of the Bayesian network, by sampling successively from the individual conditional distributions. This method is called **ancestral sampling**: each variable is sampled given its parents, once all of its ancestors have been sampled. --- class: middle For example, assume we want to sample from $p(x\_1, x\_2, x\_3) = p(x\_1)p(x\_2 | x\_1) p(x\_3 | x\_1, x\_2)$ where we know each one of the conditional distributions. Ancestral sampling: - we first sample $\tilde{x}\_1$ from $p(x\_1)$ - then we sample $\tilde{x}\_2$ from $p(x\_2 | \tilde{x}\_1)$ - finally we sample $\tilde{x}\_3$ from $p(x\_3 | \tilde{x}\_1, \tilde{x}\_2)$ We obtain a sample $(\tilde{x}\_1, \tilde{x}\_2, \tilde{x}\_3)$ from the joint distribution $p(x\_1, x\_2, x\_3)$. To obtain a sample from some marginal distribution corresponding to a subset of the variables, e.g. $p(x\_2)$, we simply keep the sampled values for the required nodes and ignore the sampled values for the remaining nodes, e.g. we keep $\tilde{x}\_2$ and discard $\tilde{x}\_1$ and $\tilde{x}\_3$. --- exclude: true ### Generative model with latent variables Hidden variables may be introduced to **build a complex distribution from simpler ones**. The Student's t example (see Lecture 2): $$\begin{cases} p(x | v) &= \mathcal{N} \left(x; \mu, v \right) \\\\[.25cm] p(v) &= \mathcal{IG}\left(v; \displaystyle \frac{\alpha}{2}, \frac{\alpha}{2} \lambda^2\right) \end{cases} \hspace{.5cm} \Leftrightarrow \hspace{.5cm} p(x) = \int\_{0}^{+\infty} p(x | v) p(v) dv = \mathcal{T}_{\alpha}(x; \mu, \lambda) $$ .vspace[ .center[.width-50[![](images/studentsT_pdf_logy.svg)]] ] --- exclude: true class: middle Hidden variables may also have an **explicit interpretation**. .center[
] In either case, the technique of ancestral sampling applied to a generative model mimics the creation of the observed data. --- class: center, middle ## D-Separation --- ### Conditional independence - Consider three variables $a$, $b$, and $c$, and suppose that the conditional distribution of $a$, given $b$ and $c$, is such that it does not depend on $b$: $$ p(a | b,c) = p(a|c). $$ We say that **$a$ is conditionally independent of $b$ given $c$**. -- count: false - This can be expressed in a slightly different way if we consider the joint distribution of $a$ and $b$ conditioned on $c$: $$p(a, b | c) = p(a | b,c) p(b | c) = p(a|c) p(b|c). $$ - This says that the variables $a$ and $b$ are statistically independent, given $c$. --- class: middle - Conditional independence properties simplify both the structure of a model and the computations needed to perform inference and learning in Bayesian networks. - An important and elegant feature of graphical models is that **conditional independence properties of the joint distribution can be read directly from the graph** without having to perform any analytical manipulations. - The general framework for achieving this is called **D-separation**, where "D" stands for "directed". --- class: middle To motivate and illustrate the concept of D-separation, let’s start by looking at three simple Bayesian network structures with three nodes $a$, $b$ and $c$. --- class: middle ### "Tail-to-tail" or "common parent" structure .right.width-70[![](images/tail2tail.svg) ] - None of the variables are observed. - The node $c$ is said to be tail-to-tail because it is connected to the tails of the two arrows. - The joint distribution reads: $$ p(a, b, c) = p(c) p(a |c) p(b | c) $$ - Are $a$ and $b$ independent? $$ p(a,b) = \int p(a, b, c) dc = \int p(c) p(a |c) p(b | c) dc \neq p(a) p(b) $$ because, in general, $p(a |c) \neq p(a)$ and $p(b | c) \neq p(b)$. Intuitively, $c$ connects $a$ and $b$, making them dependent. .footnote[We assume continuous random variables. In the discrete case, the integral is simply replaced by a sum.] --- class: middle ### "Tail-to-tail" or "common parent" structure .right.width-70[![](images/tail2tail_2.svg) ] - The variable $c$ is now **observed**. - The joint distribution reads: $$ p(a, b, c) = p(c) p(a |c) p(b | c) $$ - Are $a$ and $b$ **conditionally** independent? $$ p(a, b | c) = \frac{p(a, b, c)}{p(c)} = \frac{p(c) p(a |c) p(b | c)}{p(c)} = p(a |c) p(b | c) $$ $c$ contains all the information that determines the outcomes of $a$ and $b$. Once it is observed, there is nothing else that affects these variables’ outcomes. In other words, $p(a, b | c) = p(a | b, c) p(b | c) = p(a |c) p(b | c)$. --- class: middle ### "Head-to-tail" or "cascade" structure .right.width-70[![](images/head2tail.svg) ] - None of the variables are observed. - The node $c$ is said to be head-to-tail because it is connected to the head and tail of the left and right arrows, respectively. - The joint distribution reads: $$ p(a, b, c) = p(a) p(c | a) p(b |c) $$ - Are $a$ and $b$ independent? $$ p(a,b) = \int p(a, b, c) dc = \int p(a) p(c | a) p(b |c) dc \neq p(a) p(b) $$ because, in general, $p(b | c) \neq p(b)$. $c$ connects $a$ and $b$, making them dependent. --- class: middle ### "Head-to-tail" or "cascade" structure .right.width-70[![](images/head2tail_2.svg) ] - The variable $c$ is now **observed**. - The joint distribution reads: $$ p(a, b, c) = p(a) p(c | a) p(b |c) $$ - Are $a$ and $b$ **conditionally** independent? 
$$ p(a, b | c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a) p(c | a) }{p(c)} p(b |c) = p(a |c) p(b |c) $$ $c$ contains all the information that determines the outcome of $b$, so once it is observed $a$ has no influence on $b$ anymore. In other words, $p(a, b | c) = p(b |a, c) p(a | c) = p(b |c) p(a | c)$. --- class: middle ### "Head-to-head" or "V" structure .right.width-70[![](images/head2head.svg) ] - None of the variables are observed. - The node $c$ is said to be head-to-head because it is connected to the heads of the two arrows. - The joint distribution reads: $$ p(a, b, c) = p(a) p(b) p(c |a, b) $$ - Are $a$ and $b$ independent? $$ p(a,b) = \int p(a, b, c) dc = \int p(a) p(b) p(c |a, b) dc = p(a) p(b) $$ $a$ and $b$ are two independent factors that determine the outcome of $c$. --- class: middle ### "Head-to-head" or "V" structure .right.width-70[![](images/head2head_2.svg) ] - The variable $c$ is now **observed**. - The node $c$ is said to be head-to-head because it is connected to the heads of the two arrows. - The joint distribution reads: $$ p(a, b, c) = p(a) p(b) p(c |a, b) $$ - Are $a$ and $b$ **conditionally** independent? $$ p(a,b | c) = \frac{p(a, b, c)}{p(c)} = \frac{p(c |a, b) p(a)}{p(c)} p(b) \quad \neq \quad p(a|c) p(b|c) = \frac{p(c | a) p(a)}{p(c)} p(b|c) $$ because, in general, $p(c |a, b) \neq p(c |a)$ and $p(b|c) \neq p(b)$. Observing $c$ makes $a$ and $b$ dependent, as it is a common child of the two nodes. Suppose that $c = a + b$: if we know the value of $c$, then $a$ and $b$ cannot vary independently. This third example has the opposite behaviour from the first two. --- ### Summary for a 3-variable graph - A **tail-to-tail** (common parent) node makes the two other nodes conditionally **independent** when it is observed. - A **head-to-tail** (cascade) node makes the two other nodes conditionally **independent** when it is observed. - A **head-to-head** (V-structure) node makes the two other nodes conditionally ***dependent*** when it is observed. --- count: false ### Summary for a 3-variable graph - A **tail-to-tail** (common parent) node makes the two other nodes conditionally **independent** when it is observed. - A **head-to-tail** (cascade) node makes the two other nodes conditionally **independent** when it is observed. - A **head-to-head** (V-structure) node makes the two other nodes conditionally ***dependent*** when it is observed, .italic[and/or when at least one of its descendants is observed.] .tiny.italic[Suppose that $c = a + b$ and $d = c + 2$: if we know the value of $c$ and/or $d$, then $a$ and $b$ cannot vary independently.] .vspace[ ] .center.block-center-70[ --- We can apply these three principles recursively to analyze larger Bayesian networks with arbitrary structure. **This is the notion of D-separation.** --- ] --- exclude: true ### D-Separation Consider a Bayesian network in which $A$, $B$, and $C$ are arbitrary nonintersecting sets of nodes. We say that **$A$ and $B$ are D-separated given $C$** if $A$ and $B$ are not connected by any **active path** given $C$. If $A$ and $B$ are D-separated given $C$, then $p(A , B | C) = p(A |C) p(B | C)$. An **undirected path** is called **active** given observed variables $O$ if for every consecutive triplet of variables $X$, $Y$, $Z$ on the path, one of the following holds: - $Y$ is a head-to-tail node ($X \rightarrow Y \rightarrow Z$ or $X \leftarrow Y \leftarrow Z$) and $Y$ is not in $O$, i.e. not observed. - $Y$ is a tail-to-tail node ($X \leftarrow Y \rightarrow Z$) and $Y$ is not in $O$, i.e. 
not observed. - $Y$ is a head-to-head node ($X \rightarrow Y \leftarrow Z$) and $Y$ or any of its descendants is in $O$, i.e. is observed. --- exclude: true class: middle **Recipe** for D-separation: 1. list all the paths between any node in $A$ and any node in $B$ 2. check if there is any active path given $C$ 3. if you cannot find any active path, $A$ and $B$ are D-separated given $C$, i.e. conditionally independent --- ### Definition of D-Separation Consider a Bayesian network in which $A$, $B$, and $C$ are arbitrary nonintersecting sets of nodes. --- We say that **$A$ and $B$ are D-separated given $C$** if all possible paths that connect any node in $A$ to any node in $B$ are **blocked** given $C$. Equivalently, **$A$ and $B$ are D-separated given $C$** if they are not connected by any path that is **not blocked** (i.e. that is active). If $A$ and $B$ are D-separated given $C$, then $p(A , B | C) = p(A |C) p(B | C)$. --- A path is said to be **blocked** given observed variables $O$ if it includes a node $Y$ such that either: - $Y$ is a head-to-tail node ($X \rightarrow Y \rightarrow Z$ or $X \leftarrow Y \leftarrow Z$) and $Y$ is in $O$ (i.e. observed), **or** - $Y$ is a tail-to-tail node ($X \leftarrow Y \rightarrow Z$) and $Y$ is in $O$ (i.e. observed), **or** - $Y$ is a head-to-head node ($X \rightarrow Y \leftarrow Z$) and neither $Y$ nor any of its descendants is in $O$ (i.e. none of them is observed). --- class: middle **Recipe** for D-separation: - List all the paths between any node in $A$ and any node in $B$. - If all paths are blocked, $A$ and $B$ are D-separated given $C$. - Equivalently, if you can find one active path (i.e. not blocked), $A$ and $B$ are not D-separated given $C$. ??? A path is called **active** given observed variables $O$ if for every consecutive triplet of variables $X$, $Y$, $Z$ on the path, one of the following holds: - $Y$ is a head-to-tail node ($X \rightarrow Y \rightarrow Z$ or $X \leftarrow Y \leftarrow Z$) and $Y$ is not in $O$, i.e. not observed. - $Y$ is a tail-to-tail node ($X \leftarrow Y \rightarrow Z$) and $Y$ is not in $O$, i.e. not observed. - $Y$ is a head-to-head node ($X \rightarrow Y \leftarrow Z$) and $Y$ or any of its descendants is in $O$, i.e. is observed. --- class: middle .left-column[ .width-60[![](images/D-separation1.svg) ] Are $a$ and $d$ D-separated given $e$? ] .right.right-column[ .width-60[![](images/D-separation2.svg) ] Are $a$ and $d$ D-separated given $b$? ] .reset-column[ ] ??? - In the graph on the left, the path from a to d is not blocked by node b because it is a tail-to-tail node which is not observed, nor is it blocked by node c because it is a head-to-head node and it has a descendant e which is observed. Thus $a$ and $d$ are **not** D-separated given $e$. - In the graph on the right, the path from a to d is blocked by node b because this is a tail-to-tail node that is observed. This path is also blocked by node c because c is a head-to-head node and neither it nor its descendant e is in the conditioning set. So $a$ and $d$ are indeed D-separated given $b$. --- class: middle .left-column[ .width-60[![](images/D-separation3.svg) ] Are $a$ and $e$ D-separated given $b$ and $c$? ] .right.right-column[ .width-60[![](images/D-separation4.svg) ] Are $b$ and $c$ D-separated given $a$ and $e$? ] .reset-column[ ] ??? - In the graph on the left, the path from a to e is blocked by node b because this is a head-to-tail node that is observed. The same applies for node c. Thus $a$ and $e$ are D-separated given $b$ and $c$. 
- In the graph on the right, b and c are connected by an active path (i.e. a path that is not blocked). This path goes through node e, which is a head-to-head node that is observed. Therefore $b$ and $c$ are not D-separated, even though the path that goes through node a is blocked, as node a is a tail-to-tail node that is observed. - You may find all this pretty abstract, but it is actually essential when you perform inference in Bayesian networks. --- ### Homework For each of the three Markov models shown below, are $x\_{t-1}$ and $y\_{t+1}$ D-separated given 1. $x\_t$ 2. $y\_t$ 3. $\\{x\_t, y\_t\\}$?
.center.width-100[![](images/markov_models.svg) ] .footnote[The left model is called a [hidden Markov model](https://en.wikipedia.org/wiki/Hidden_Markov_model) (HMM). HMMs can be applied in many fields where the goal is to recover a data sequence that is not immediately observable. For instance, in automatic speech recognition we observe a sequence of audio features and want to infer a sequence of language tokens (e.g., phonemes).] ??? - With the left model: 1. Yes, 2. No, 3. Yes. - With the center model: 1. No, 2. Yes, 3. Yes. - With the right model: 1. No, 2. No, 3. Yes. --- class: middle ## Markov blanket - Consider the joint distribution of an arbitrary number $K$ of variables $p(x\_1, x\_2, ..., x\_K)$ represented by a Bayesian network with $K$ nodes. - The conditional distribution of an arbitrary variable $x\_k$ given all the remaining variables in the graph only depends on the variables in its **Markov blanket**, defined as the **set of nodes comprising its parents, its children and the co-parents** (the other parents of its children): $$ p\left(x\_k |\\{x\_{i}\\}\_{i \neq k } \right) = p\left(x\_k | \text{MB}(x\_k) \right). $$ .grid[ .kol-3-4[ - **Given its Markov blanket, $x\_k$ is conditionally independent of all the remaining variables in the graph.** - The Markov blanket contains all the information one needs in order to infer $x\_k$. ] .kol-1-4[ .center.width-80[![](images/markov_blanket.svg) ] ] ] --- Consider the conditional distribution of a particular variable $x\_k$ given all the remaining ones $\\{x\_{i}\\}\_{i \neq k }$: $$ \begin{aligned} p\left(x\_k |\\{x\_{i}\\}\_{i \neq k } \right) &= \frac{p(x\_1, x\_2, ..., x\_K)}{p(\\{x\_{i}\\}\_{i \neq k })} \\\\ &= \frac{p(x\_1, x\_2, ..., x\_K)}{\int p(x\_1, x\_2, ..., x\_K) d x\_k} \\\\ &= \frac{\prod\_{j=1}^K p(x\_j | \text{pa}\_j)}{\int \prod\_{j=1}^K p(x\_j | \text{pa}\_j) d x\_k}. \end{aligned} $$ - Any factor $p(x\_j | \text{pa}\_j)$ that does not have any functional dependence on $x\_k$ can be taken outside the integral and will therefore cancel between numerator and denominator. - The only factors that remain will be the conditional distribution $p(x\_k | \text{pa}\_k)$ for node $x\_k$ itself, together with the conditional distributions $p(x\_j | \text{pa}\_j)$ where $x\_k$ is in $\text{pa}\_j$. - $p(x\_k | \text{pa}\_k)$ will depend on the **parents** of $x\_k$, whereas the remaining conditionals $p(x\_j | \text{pa}\_j)$ will depend on the **children** of $x\_k$, as well as its **co-parents** (the other parents of $x\_j$). Therefore, $p\left(x\_k |\\{x\_{i}\\}\_{i \neq k } \right)$ only depends on the Markov blanket of $x\_k$.
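---

class: middle

The Markov blanket can be read directly off the graph structure. Below is a minimal Python sketch (the dictionary-of-parents representation of the DAG and the function name `markov_blanket` are choices made here for illustration), applied to the 7-node network used earlier in the lecture: for each node, the blanket is the union of its parents, its children, and its children's other parents.

```python
# DAG of the 7-node example from earlier in the lecture, given by its parent sets:
# p(x) = p(x7|x4,x5) p(x6|x4) p(x5|x1,x3) p(x4|x1,x2,x3) p(x3) p(x2) p(x1)
parents = {
    1: set(), 2: set(), 3: set(),
    4: {1, 2, 3}, 5: {1, 3}, 6: {4}, 7: {4, 5},
}

def markov_blanket(k, parents):
    """Markov blanket of node k: its parents, children and co-parents."""
    children = {j for j, pa in parents.items() if k in pa}
    co_parents = set().union(*(parents[j] for j in children)) if children else set()
    return (parents[k] | children | co_parents) - {k}

for k in sorted(parents):
    print(k, sorted(markov_blanket(k, parents)))
# For instance, node 5 -> [1, 3, 4, 7]: its parents (1, 3), its child (7)
# and the co-parent (4), i.e. the other parent of node 7.
```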