EM Algorithm

Genevieve
Sep 18, 2021 · 5 min read

The EM (Expectation-Maximisation) Algorithm is the go-to algorithm whenever we have to do parameter estimation with hidden variables, such as in hidden Markov models. For some reason, it is often poorly explained, and students end up confused as to what exactly we are maximising in the E-step and the M-step. Here is my attempt at a (hopefully) clear, step-by-step explanation of exactly how the EM algorithm works.

We begin by assuming we have some data, X

Z = 1: background, Z = 2: model

where Z refers to the ‘model’ that generates the data. We consider the scenario of motif finding in a DNA sequence (finding special conserved patterns in the DNA; for example, regions with many CG repeats are known as CpG islands). Here we assume that the sequence is generated by two probability distributions: the first is the background and the second we call the ‘model’ in this example. (This is also known as a mixture model.) The variable indicating which distribution generated each observed data point (in this case each emitted DNA letter, A/T/C/G) is known as the hidden or latent variable, Z. In theory there can be any number of hidden states/models that generate the data. For example, say you have a data set of heights and weights from different animals and you model them with normal distributions. You would then have one set of parameters (mean and standard deviation) per species of animal, and the species of animal would be Z.
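Concretely, in a mixture model each data point is drawn from one of the components with probability given by the mixing proportions λ. In the DNA example, k ranges over just the background and the motif model:

```
P(x_i \mid \theta, \lambda) \;=\; \sum_{k} P(Z_i = k \mid \lambda)\, P(x_i \mid Z_i = k, \theta) \;=\; \sum_{k} \lambda_k\, P(x_i \mid \theta_k)
```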

The objective of the EM algorithm is to maximise the likelihood of the observed data X given the parameters, θ and λ. In the animal example, θ is the set of means and standard deviations for each value of Z, while λ is the proportion of each species Z out of the total.

we want to maximise the log likelihood of observed data
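Written out (assuming the data points are independent), the observed log-likelihood is:

```
\ell_{\text{obs}}(\theta, \lambda) \;=\; \log P(X \mid \theta, \lambda) \;=\; \sum_{i=1}^{N} \log \sum_{k} \lambda_k\, P(x_i \mid \theta_k)
```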

The observed log-likelihood is usually not so easy to maximise directly (the sum sits inside the log). Instead, we find a lower bound for it:

getting the lower bound for the observed log-likelihood equation
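The trick is to introduce an arbitrary distribution q(Z) over the hidden variables and apply Jensen’s inequality:

```
\ell_{\text{obs}}(\theta, \lambda) \;=\; \log \sum_{Z} q(Z)\, \frac{P(X, Z \mid \theta, \lambda)}{q(Z)}
\;\ge\; \underbrace{\mathbb{E}_q\!\left[\log P(X, Z \mid \theta, \lambda)\right]}_{\text{expected hidden log-likelihood}} \;-\; \mathbb{E}_q\!\left[\log q(Z)\right]
```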

We find that the observed log-likelihood is lower bounded by the expected hidden log-likelihood minus the expectation of the log of the arbitrary probability distribution q. We will call this lower bound F(q, θ, λ).

So, in the E-step, what we are doing is maximising the lower bound with respect to q. This happens when q = P[Z|X, θ, λ], the posterior. (More details on how we get this are below, in the section on the Kullback–Leibler divergence.) The expectation of Z under this q, E_q[Z], is exactly the posterior probability of each hidden state, also known as the responsibility. Thus, in the E-step, we are just setting Z to be the responsibility.

In the E-step, we set Z to be the responsibility
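Explicitly, writing r_ik for the responsibility of component k for data point i, this is just Bayes’ rule:

```
r_{ik} \;=\; P(Z_i = k \mid x_i, \theta, \lambda)
\;=\; \frac{\lambda_k\, P(x_i \mid \theta_k)}{\sum_{j} \lambda_j\, P(x_i \mid \theta_j)}
```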

In the M-step, we maximise the lower bound with respect to (θ, λ). Since the second term is constant with respect to (θ, λ), we get

In the M-step we maximise w.r.t (θ, λ)
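That is, the M-step solves:

```
(\theta^{\text{new}}, \lambda^{\text{new}}) \;=\; \arg\max_{\theta,\, \lambda}\; \mathbb{E}_q\!\left[\log P(X, Z \mid \theta, \lambda)\right]
```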

Since we can write the hidden log-likelihood as follows:
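Using indicator variables Z_ik (equal to 1 if data point i came from component k, and 0 otherwise), the hidden log-likelihood is:

```
\log P(X, Z \mid \theta, \lambda) \;=\; \sum_{i=1}^{N} \sum_{k} Z_{ik} \left[\, \log \lambda_k \;+\; \log P(x_i \mid \theta_k) \,\right]
```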

and since Z has been set equal to the responsibility in the E-step, we can replace Z in the above expression.

We write the expected hidden log-likelihood in terms of responsibilities
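Replacing each Z_ik by its expectation, the responsibility r_ik, gives:

```
\mathbb{E}_q\!\left[\log P(X, Z \mid \theta, \lambda)\right] \;=\; \sum_{i=1}^{N} \sum_{k} r_{ik} \left[\, \log \lambda_k \;+\; \log P(x_i \mid \theta_k) \,\right]
```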

Maximising with respect to λ gives

M-step (1) update λ according to responsibilities
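Using a Lagrange multiplier to enforce that the λ_k sum to 1, the update is simply the average responsibility of each component:

```
\lambda_k^{\text{new}} \;=\; \frac{1}{N} \sum_{i=1}^{N} r_{ik}
```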

Maximising with respect to θ gives

M-step (2) update θ according to the expected counts
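For the DNA example, if we assume each component k is a simple categorical distribution over the four letters (a modelling assumption on my part; the exact update depends on how the motif model is parameterised), the update is the normalised expected count of each letter:

```
\theta_k(b)^{\text{new}} \;=\; \frac{\sum_{i:\, x_i = b} r_{ik}}{\sum_{i=1}^{N} r_{ik}}, \qquad b \in \{A, C, G, T\}
```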

We then repeat the E-step and M-step until the change in the parameters or the likelihood is small. Then comes the usual disclaimer that this algorithm only finds a local maximum of the likelihood function.
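To make the loop concrete, here is a minimal Python sketch of the E-step/M-step iteration for the two-component DNA example. It assumes (my simplification, not from any particular library) that both the background and the ‘model’ are independent categorical distributions over single letters; the function and variable names are my own.

```python
import numpy as np

LETTERS = "ACGT"

def em_dna_mixture(seq, n_iter=100, tol=1e-8, seed=0):
    """Sketch of EM for a two-component (background vs model) letter mixture."""
    rng = np.random.default_rng(seed)
    x = np.array([LETTERS.index(c) for c in seq])   # encode letters as 0..3
    n = len(x)

    # theta[0] = background distribution over ACGT, theta[1] = 'model' distribution.
    theta = rng.dirichlet(np.ones(4), size=2)
    lam = 0.5                                       # mixing proportion of the 'model'
    prev_ll = -np.inf

    for _ in range(n_iter):
        # E-step: responsibility r[i] = P(Z_i = model | x_i, theta, lam)
        p_bg = (1 - lam) * theta[0, x]
        p_model = lam * theta[1, x]
        denom = p_bg + p_model
        r = p_model / denom

        # Observed log-likelihood, used to monitor convergence.
        ll = np.sum(np.log(denom))

        # M-step: lam is the average responsibility; theta comes from expected counts.
        lam = r.mean()
        for b in range(4):
            mask = (x == b)
            theta[1, b] = r[mask].sum()             # expected 'model' count of letter b
            theta[0, b] = (1 - r[mask]).sum()       # expected background count of letter b
        theta[1] /= theta[1].sum()
        theta[0] /= theta[0].sum()

        if ll - prev_ll < tol:                      # likelihood barely changed: stop
            break
        prev_ll = ll

    return lam, theta, ll

# Example usage on a toy sequence:
# lam, theta, ll = em_dna_mixture("ATATATCGCGCGCGATAT")
```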

Above, we presented a specific example of the EM algorithm in use. We can generalise the derivation by introducing a measure known as the Kullback–Leibler divergence.

Kullback-Leibler divergence
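For two distributions q and P over the same discrete support, it is defined as:

```
D\!\left(q \,\|\, P\right) \;=\; \sum_{z} q(z) \log \frac{q(z)}{P(z)}
```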

It has two useful properties: it is always greater than or equal to 0, and it equals 0 if and only if q = P. Thus, we can think of it as measuring the dissimilarity between q and P.

Recall how we derived the lower bound for the observed log-likelihood above. We tweak the expansion a little:
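The tweaked expansion splits the observed log-likelihood exactly (no inequality this time) into the lower bound F and a KL term:

```
\log P(X \mid \theta) \;=\; \underbrace{\mathbb{E}_q\!\left[\log P(X, Z \mid \theta)\right] \;-\; \mathbb{E}_q\!\left[\log q(Z)\right]}_{F(q,\, \theta)} \;+\; D\!\left(q \,\|\, P(Z \mid X, \theta)\right)
```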

Since the Kullback–Leibler divergence is always non-negative, we again obtain the lower bound F (the same lower bound as in the previous section). In this version, it is easy to see that in order to maximise F with respect to q, we set q to P(Z|X, θ), thus making D(q||P) = 0.

In the E-step, we maximise F with respect to q. Hence, we set q=P, thus after this step:

After the E-step
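Because the KL term is now zero, the bound is tight:

```
F\!\left(q^{\text{new}}, \theta\right) \;=\; \log P(X \mid \theta) \;=\; \ell_{\text{obs}}(\theta)
```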

In the M-step, we maximise with respect to θ (as derived in the first section).

M-step (we can ignore the 2nd term)

We can ignore the 2nd term since it does not depend on θ.
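So the M-step reduces to:

```
\theta^{\text{new}} \;=\; \arg\max_{\theta}\; F(q, \theta) \;=\; \arg\max_{\theta}\; \mathbb{E}_q\!\left[\log P(X, Z \mid \theta)\right]
```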

So we can see that as F increases, l_obs(θ) increases.
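Chaining the two steps together gives the usual monotonicity guarantee: a full EM iteration can never decrease the observed log-likelihood.

```
\ell_{\text{obs}}\!\left(\theta^{(t+1)}\right) \;\ge\; F\!\left(q^{(t+1)}, \theta^{(t+1)}\right) \;\ge\; F\!\left(q^{(t+1)}, \theta^{(t)}\right) \;=\; \ell_{\text{obs}}\!\left(\theta^{(t)}\right)
```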

In summary,

Graphic taken from lecture slide 2 of this course

In the EM algorithm, we iteratively maximise the lower bound: with respect to q in the E-step, which turns out to be equivalent to setting the hidden variables equal to their posterior expectations (the responsibilities); and with respect to the parameters in the M-step, which is equivalent to finding the parameters that maximise the expected hidden log-likelihood.
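Or, written compactly:

```
\text{E-step:}\quad q^{(t+1)} \;=\; \arg\max_{q}\; F\!\left(q, \theta^{(t)}\right) \;=\; P\!\left(Z \mid X, \theta^{(t)}\right)
\text{M-step:}\quad \theta^{(t+1)} \;=\; \arg\max_{\theta}\; F\!\left(q^{(t+1)}, \theta\right) \;=\; \arg\max_{\theta}\; \mathbb{E}_{q^{(t+1)}}\!\left[\log P(X, Z \mid \theta)\right]
```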
