
Bayesian Networks: A Comprehensive Guide


What is a Model?

A model is a declarative representation of our understanding of the world. It's a representation within a computer that captures our understanding of what the relevant variables are and how they interact with each other. Declarative means that the representation stands on its own: we can make sense of it apart from any algorithm that we might choose to apply to it.

This means that the same model can be used by one algorithm that answers one kind of question, by other algorithms that answer different kinds of questions, by algorithms that answer the same question more efficiently, or by ones that make different trade-offs between accuracy and computational cost.

Also, we can separate the construction of the model from the algorithms used to reason with it. We can construct methodologies that elicit these models from a human expert, learn them from historical data using statistical machine learning techniques, or combine the two.

What is Probabilistic?

Probabilistic refers to uncertainty. Uncertainty comes in many forms and arises for many different reasons:

  • Partial knowledge of state of the world
  • Noisy observations
  • Phenomena not covered by our model
  • Inherent stochasticity (phenomena that are randomly determined)

Probability theory is a framework that allows us to deal with uncertainty in ways that are principled and that bring to bear important and valuable tools:

  • Declarative representation with clear semantics
  • Powerful reasoning patterns
  • Established learning methods

What is Graphical?

Probabilistic graphical models are a synthesis between ideas from probability theory and statistics on one hand, and ideas from computer science on the other.

In order to capture probability distributions over spaces involving a large number of factors, we need probability distributions over what are called random variables. We represent the world through these variables, each of which captures some facet of it.

Our goal is to capture our uncertainty about the possible states of the world in terms of their probability distribution or what's called a joint distribution over the possible assignments to the set of random variables.

Types of Graphical Models

An example of a graphical model is the Bayesian network. It uses a directed graph as its intrinsic (native) representation: the random variables are represented by nodes in the graph, and the edges represent the probabilistic connections between those random variables in a formal way.

Another example of a graphical model is the Markov network, which uses an undirected graph.

The graphical representation gives us:

  • Intuitive and compact data structure
  • Efficient reasoning using general-purpose algorithms
  • Sparse parameterization: feasible elicitation, learning from data – in both cases a reduction in the number of parameters is very valuable

Key Components

Representation

  • Directed and undirected
  • Temporal and plate models

Inference (Reasoning)

  • Exact and approximate
  • Decision making

Learning

  • Parameters and structure
  • With and without complete data

The Student Example

Joint distribution in the student example:

  • Variable 1: Intelligence(I) – it has 2 values: low, high
  • Variable 2: Difficulty(D) – it has 2 values: easy, hard
  • Variable 3: Grade(G) – it has 3 values: g1, g2, g3

The joint distribution P(I, D, G) therefore has 2 × 2 × 3 = 12 entries (probabilities).

We also count independent parameters: parameters whose values are not completely determined by the values of the other parameters. All the probabilities have to sum to 1, so if you tell me eleven of the twelve, I know the twelfth; the number of independent parameters is therefore eleven.
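
A tiny sketch of this counting in Python; the variable names follow the text, and no probability values are needed just to count assignments:

```python
# Count the entries of the joint distribution P(I, D, G) in the student
# example: one probability per assignment to the three variables.
from itertools import product

intelligence = ["low", "high"]   # I
difficulty = ["easy", "hard"]    # D
grade = ["g1", "g2", "g3"]       # G

assignments = list(product(intelligence, difficulty, grade))
print(len(assignments))      # 12 = 2 x 2 x 3 entries
print(len(assignments) - 1)  # 11 independent parameters (entries sum to 1)
```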

Key Operations

Conditioning: Reduction

  • If we observe the value of one variable, we keep only the entries of the joint distribution consistent with that observation and drop the rest. This operation is called reduction.
  • The result is an unnormalized measure, which means it doesn't sum to 1. So we need to normalize this distribution, i.e., divide by the sum of the remaining entries so they sum to 1 again (see the sketch below).
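
A minimal sketch of reduction, assuming a made-up joint table for P(I, D, G) (the probability values are illustrative, not from the text):

```python
# Reduction: condition the joint distribution P(I, D, G) on an observation,
# then renormalize. All probability values below are made up.
joint = {
    ("low", "easy", "g1"): 0.10, ("low", "easy", "g2"): 0.08, ("low", "easy", "g3"): 0.12,
    ("low", "hard", "g1"): 0.05, ("low", "hard", "g2"): 0.10, ("low", "hard", "g3"): 0.15,
    ("high", "easy", "g1"): 0.12, ("high", "easy", "g2"): 0.06, ("high", "easy", "g3"): 0.02,
    ("high", "hard", "g1"): 0.10, ("high", "hard", "g2"): 0.06, ("high", "hard", "g3"): 0.04,
}

# Observe G = g1: keep only the consistent entries (an unnormalized measure).
reduced = {(i, d, g): p for (i, d, g), p in joint.items() if g == "g1"}

# Normalize so the remaining entries sum to 1, giving P(I, D | G = g1).
z = sum(reduced.values())
conditional = {k: p / z for k, p in reduced.items()}
print(sum(conditional.values()))  # 1.0
```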

Marginalization

  • Marginalizing (summing) out G from the joint P(I, D, G) gives the marginal distribution P(I, D) – see the sketch below.
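
Marginalization is then just summing out a variable; this sketch reuses the made-up `joint` table from the reduction sketch above:

```python
# Sum out G from P(I, D, G) to get the marginal P(I, D).
# (Reuses the made-up `joint` table from the reduction sketch above.)
marginal = {}
for (i, d, g), p in joint.items():
    marginal[(i, d)] = marginal.get((i, d), 0.0) + p

print(marginal[("low", "easy")])  # 0.10 + 0.08 + 0.12 ≈ 0.30
```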

Factors

A factor is a function, or a table, that takes a set of variables as arguments and, just like any function, gives us a value for every assignment to those variables. The set of variables is called the scope of the factor. A joint distribution is a factor: for every combination of values of I, D, and G, it gives me a number.

Conditional Probability Distribution (CPD)

It gives us the conditional probability of the variable G given I and D – P(G | I, D). This means that for every combination of values of the variables I and D, we have a probability distribution over G.
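
A CPD can be sketched as a table with one row, i.e. one distribution over G, per assignment to the parents I and D. The numeric values below are illustrative assumptions, not values taken from the text:

```python
# A CPD P(G | I, D) as a table: one distribution over G for each (I, D)
# combination. The probability values are made up for illustration.
cpd_g = {
    ("low",  "easy"): {"g1": 0.30, "g2": 0.40, "g3": 0.30},
    ("low",  "hard"): {"g1": 0.05, "g2": 0.25, "g3": 0.70},
    ("high", "easy"): {"g1": 0.90, "g2": 0.08, "g3": 0.02},
    ("high", "hard"): {"g1": 0.50, "g2": 0.30, "g3": 0.20},
}

# Each row is a proper probability distribution over G: it sums to 1.
for parents, row in cpd_g.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9
```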

Factor Operations

  • Factor Product
  • Factor Marginalization
  • Factor Reduction

Why Factors?

  • It turns out that factors are the fundamental building block for defining distributions in high-dimensional spaces. The way we define an exponentially large probability distribution over a high-dimensional space is by multiplying factors together (a factor product sketch follows this list).
  • Also, the same set of basic operations that we use to define probability distributions in these high-dimensional spaces is what we use to manipulate them, which gives us a set of basic inference algorithms.
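
As a sketch of the factor product, two factors that share a variable multiply entry-wise wherever they agree on the shared variable. The factor names and values here are illustrative:

```python
# Factor phi1 over scope {A} and factor phi2 over scope {A, B}.
phi1 = {("a1",): 0.5, ("a2",): 0.8}
phi2 = {("a1", "b1"): 0.5, ("a1", "b2"): 0.7,
        ("a2", "b1"): 0.1, ("a2", "b2"): 0.2}

# Factor product over scope {A, B}: multiply entries that agree on A.
phi3 = {(a, b): phi1[(a,)] * phi2[(a, b)] for (a, b) in phi2}
print(phi3[("a1", "b2")])  # 0.5 * 0.7 = 0.35
```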

The Student Example Extended

The extended student example has five variables:

  • Grade (G)
  • Course Difficulty (D)
  • Student Intelligence (I)
  • Student SAT (S)
  • Reference Letter (L)

G depends on D and I; S depends on I; L depends on G.

The model is a representation of how we believe the world works.

Chain Rule for Bayesian Networks

P(D, I, G, S, L) factorizes via the chain rule, so to calculate this joint probability distribution:

P(D, I, G, S, L) = P(D) P(I) P(G | D, I) P(S | I) P(L | G)

For one particular assignment: P(d0, i1, g3, s1, l1) = 0.6 × 0.3 × 0.02 × 0.01 × 0.8
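
As a quick check of the arithmetic; the mapping of each number to a specific CPD entry (e.g. P(d0) = 0.6) is an assumption about the example's tables:

```python
# Chain rule for BNs: the joint probability of a full assignment is the
# product of one CPD entry per variable. Which number corresponds to which
# CPD entry below is an assumption (see lead-in).
p_d0      = 0.6   # P(D = d0)
p_i1      = 0.3   # P(I = i1)
p_g3_d0i1 = 0.02  # P(G = g3 | D = d0, I = i1)
p_l1_g3   = 0.01  # P(L = l1 | G = g3)
p_s1_i1   = 0.8   # P(S = s1 | I = i1)

p = p_d0 * p_i1 * p_g3_d0i1 * p_l1_g3 * p_s1_i1
print(p)  # ≈ 2.88e-05
```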

A Bayesian network is a directed acyclic graph (DAG) G whose nodes represent the random variables X1, ..., Xn. Acyclic means there are no directed cycles: following the edges, you can never get back to where you started.

And for each node in the graph, we have a CPD of that variable given its parents in the graph; for the grade node this would be the probability of G given I and D, P(G | I, D).

The BN represents a joint distribution via the chain rule for Bayesian networks: P(X1, ..., Xn) = ∏ P(Xi | Parents(Xi)).

A BN defines a legal distribution: P ≥ 0. P is a product of factors (CPDs), each CPD is non-negative, and if you multiply a bunch of non-negative factors, you get a non-negative result.

A BN defines a legal distribution: ∑ P = 1, i.e., the entries of the joint sum to 1. The sketch below checks both properties on a tiny network.
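
A minimal sketch verifying both properties by brute force on a hypothetical two-node network X → Y with made-up CPDs:

```python
# Verify that the chain-rule product defines a legal distribution on a tiny
# hypothetical network X -> Y. All CPD values below are made up.
from itertools import product

p_x = {0: 0.7, 1: 0.3}                       # P(X)
p_y_given_x = {0: {0: 0.9, 1: 0.1},          # P(Y | X = 0)
               1: {0: 0.4, 1: 0.6}}          # P(Y | X = 1)

total = 0.0
for x, y in product([0, 1], repeat=2):
    pr = p_x[x] * p_y_given_x[x][y]          # chain rule: P(x) P(y | x)
    assert pr >= 0                           # product of non-negative CPDs
    total += pr
print(total)  # 1.0 (up to floating-point rounding)
```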

Reasoning Patterns

  • Causal reasoning (top to bottom)
  • Evidential reasoning (bottom to top)
  • Intercausal reasoning (2 causes of a single effect)

Flow of Probabilistic Influence

When can X influence Y? Influence means that conditioning on X changes our beliefs about Y.

  • X → Y / yes – causal
  • X ← Y / yes – evidential
  • X → W → Y / yes
  • X ← W ← Y / yes
  • X ← W → Y / yes
  • X → W ← Y / V-structure no

To activate a v-structure X → W ← Y, W or one of its descendants must be observed.
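
These rules can be sketched as a small function; the `kind` labels and the `w_descendants` argument are hypothetical simplifications for a single X – W – Y triple, not notation from the original text:

```python
# When can influence flow through a single X - W - Y triple?
# `observed` is the set of observed variables; `w_descendants` (only needed
# for the v-structure case) is assumed to be supplied from the graph.
def triple_active(kind, w, observed, w_descendants=frozenset()):
    if kind == "v-structure":  # X -> W <- Y
        # Active only if W or one of its descendants is observed.
        return w in observed or bool(set(w_descendants) & set(observed))
    # Causal (X -> W -> Y), evidential (X <- W <- Y), or common cause
    # (X <- W -> Y): active exactly when W itself is not observed.
    return w not in observed

print(triple_active("causal", "W", observed=set()))       # True
print(triple_active("causal", "W", observed={"W"}))       # False: W blocks it
print(triple_active("v-structure", "W", observed=set()))  # False
print(triple_active("v-structure", "W", observed={"W"}))  # True
```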