Visualizing Covariance

Genevieve · Published in The Startup · Feb 11, 2021

A common assumption when we build a model is that the variables are independent, but this assumption often doesn’t hold perfectly, and here we will show how covariance degrades your estimates. First we derive the likelihood distribution for a model; next we show how the shape of this distribution, and hence the confidence interval of our estimates, changes with variance. Finally, we show how to visualise the effect of covariance graphically and how high covariance between the variables you are trying to estimate affects the confidence intervals.

Starting with Bayes’ Theorem, we can write the posterior probability of some model X given observed data D and prior information I as:
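
$$ P(X \mid D, I) = \frac{P(D \mid X, I)\, P(X \mid I)}{P(D \mid I)} $$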

We assume that the prior is uniform and that the evidence is a constant that does not depend on X, so the posterior is proportional to the likelihood. Hence if we maximise the likelihood, we maximise the posterior.

Our model generates ideal estimates F = (F1, F2, …, Fn). The difference between the ideal estimates and the observed data D = (D1, D2, …, Dn) is the error, or noise. We assume that the measurement noise follows a Gaussian distribution with mean 0 and variance σ². Assuming that each measurement is independent, we can write the probability of observing the real data, given our model and prior information, as the product of the probabilities of the individual discrepancies:
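
$$ P(D \mid X, I) = \prod_{k=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(F_k - D_k)^2}{2\sigma^2} \right) $$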

Taking the logarithm, we have the log-likelihood L:
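
$$ L = \ln P(D \mid X, I) = \text{constant} - \sum_{k=1}^{n} \frac{(F_k - D_k)^2}{2\sigma^2} $$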

When we maximise L, we minimise the residuals.

By maximising L, we can determine the parameters X that minimise the discrepancy between the model and the observed data. This is the maximum likelihood parameter estimate, which we shall denote X*.

To see how the variance affects our likelihood, we remind ourselves that P(X|D,I) ∝ exp(L). We then Taylor expand L around the maximum point X*:
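
$$ L(X) \approx L(X^*) + \left. \frac{dL}{dX} \right|_{X^*} (X - X^*) + \frac{1}{2} \left. \frac{d^2 L}{dX^2} \right|_{X^*} (X - X^*)^2 + \dots $$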

The second term is 0 because the slope of L is 0 at the maximum.

So,
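
$$ P(X \mid D, I) \propto \exp\!\left[ \frac{1}{2} \left. \frac{d^2 L}{dX^2} \right|_{X^*} (X - X^*)^2 \right] $$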

We can compare this with the Gaussian distribution:
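
$$ p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) $$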

We see that we can write P(X|D,I) as a Gaussian distribution with mean X* and variance -1/(d²L/dX²) evaluated at X*, i.e. the negative inverse of the second derivative of the log-likelihood at the maximum. Naturally, the larger the variance, the wider the Gaussian and the wider our confidence interval for a given confidence level.

The variance determines the width of the normal distribution

Another way to examine the quality of our estimates is by looking at the deviation from the maximum likelihood. Expanding the log-likelihood around X* again, we can write this deviation as Q:
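
$$ Q = L(X) - L(X^*) \approx \frac{1}{2} \sum_{i} \sum_{j} \left. \frac{\partial^2 L}{\partial X_i \, \partial X_j} \right|_{X^*} (X_i - X_i^*)(X_j - X_j^*) $$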

We see that Q defines a region around the maximum likelihood.

We now examine Q again in matrix notation:
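
$$ Q = \frac{1}{2} (X - X^*)^T H \, (X - X^*), \qquad H_{ij} = \left. \frac{\partial^2 L}{\partial X_i \, \partial X_j} \right|_{X^*} $$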

Hessian matrix

Since the Hessian matrix is symmetric, it is also diagonalizable:
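
$$ H = E \, D \, E^T $$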

Hessian matrix is diagonalizable

We can define a mapping:
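
$$ y = E^T (X - X^*), \qquad Q = \frac{1}{2} y^T D \, y $$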

We map estimated variables X onto y

where E is a square matrix whose columns are the normalised eigenvectors of the Hessian and D is a diagonal matrix whose entries are the corresponding eigenvalues, in the same order as the columns of E.

We see that we can actually write Q as the equation of an ellipse in the space of y for the two-variable case:
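
$$ Q = \frac{1}{2}\left( \lambda_1 y_1^2 + \lambda_2 y_2^2 \right) = k $$

where λ1 and λ2 are the two eigenvalues on the diagonal of D. This is an ellipse with semi-axes √(2k/λ1) and √(2k/λ2); at a maximum both the eigenvalues and k are negative, so these ratios are positive.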

The ellipse defines the region over which Q equals some constant value k.

Let’s finally get around to examining covariance now.

In 2D, for a model with two parameters x and y, we can write Q as:
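
$$ Q = \frac{1}{2}\left[ A (x - x^*)^2 + 2C (x - x^*)(y - y^*) + B (y - y^*)^2 \right] $$

where A = ∂²L/∂x², B = ∂²L/∂y² and C = ∂²L/∂x∂y are the second derivatives evaluated at the maximum (x*, y*).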

We can get the marginal distribution P(x|D,I) for x alone if we integrate over all possible values of y:
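
$$ P(x \mid D, I) = \int_{-\infty}^{\infty} P(x, y \mid D, I) \, dy \propto \int_{-\infty}^{\infty} e^{Q} \, dy $$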

We substitute the expression for Q, complete the square in y, and factor out the parts that do not depend on y. Integrals from -∞ to ∞ of exp(-y²/(2σ²)) have the standard solution σ√(2π), which we substitute into our expression:
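
$$ P(x \mid D, I) \propto \exp\!\left[ \frac{AB - C^2}{2B} (x - x^*)^2 \right] $$

Comparing this with a Gaussian in x, the variance of x is σx² = -B/(AB - C²).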

We can apply the same method to get the variance of y and the covariance of x and y.

The covariance matrix is given by:
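
$$ \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix} = \frac{1}{AB - C^2} \begin{pmatrix} -B & C \\ C & -A \end{pmatrix} = -H^{-1} $$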

We can see that it is just the negative inverse of the Hessian matrix.

Moreover, since the determinant of the Hessian is the product of its eigenvalues, when the covariance becomes very high, C becomes large, the determinant AB - C² becomes very small, and one eigenvalue approaches zero. The area of the ellipse Q = k is proportional to k divided by the square root of the product of the eigenvalues, i.e. k/√(determinant), so when the covariance is high the ellipse becomes very large and very elongated in one direction.

This in turn increases the marginal error bars for a given k, so for a given confidence level the error bars of our estimates would be larger and our estimates less precise when the covariance is high. It also causes convergence problems for optimisation algorithms that use the second derivative (Hessian), such as Newton-Raphson. In the case of Newton-Raphson, a large covariance means a nearly singular Hessian and hence a very large step size, which can cause the algorithm to repeatedly overshoot and undershoot the maximum and fail to converge.
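
To make the effect concrete, here is a minimal Python sketch (the function name and the toy values of A, B and C are just for illustration) that builds the covariance matrix as the negative inverse Hessian and compares the marginal error bars with and without a large cross term:

```python
import numpy as np

def covariance_from_hessian(A, B, C):
    """Return the covariance matrix -H^(-1) for a two-parameter model whose
    log-likelihood has Hessian H = [[A, C], [C, B]] at its maximum."""
    H = np.array([[A, C], [C, B]])
    return -np.linalg.inv(H)

# Hypothetical second derivatives: the curvatures A and B are fixed, while the
# cross term C (which drives the covariance) is zero in one case and large in the other.
for C in (0.0, 0.95):
    cov = covariance_from_hessian(A=-1.0, B=-1.0, C=C)
    sigma_x, sigma_y = np.sqrt(np.diag(cov))
    print(f"C = {C:.2f}: sigma_x = {sigma_x:.2f}, sigma_y = {sigma_y:.2f}, "
          f"cov(x, y) = {cov[0, 1]:.2f}")

# C = 0.00 gives sigma_x = sigma_y = 1.00; C = 0.95 inflates both to about 3.20,
# and the constant-Q ellipse becomes highly elongated along one eigenvector.
```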

In summary, check for covariance between your variables before trying to fit any model!
