Confidence Intervals for prediction in GLMMs

With LM and GLM the predict function can return the standard error for the predicted values on either the observed data or on new data. This is then used to draw confidence or prediction intervals around the fitted regression lines. The confidence intervals (CI) focus on the regression lines and can be interpreted as (assuming that we draw 95% CI): “If we would repeat our sampling X times the regression line would fall between this interval 95% of the time”. On the other hand the prediction interval focus on single data point and could be interpreted as (again assuming that we draw 95% CI): “If we would sample X times at these particular value for the explanatory variables, the response value would fall between this interval 95% of the time”. The wikipedia page has some nice explanation about the meaning of confidence intervals.
For GLMM the predict function does not allow one to derive standard error, the reason being (from the help page of predict.merMod): “There is no option for computing standard errors of predictions because it is difficult to define an efficient method that incorporates uncertainty in the variance parameters”. This means there is for now no way to include in the computation of the standard error for predicted values the fact that the fitted random effect standard deviation are just estimates and may be more or less well estimated. We can however still derive confidence or prediction intervals keeping in mind that we might underestimate the uncertainty around the estimates.


library(lme4)

#first case simple lmer, simulate 100 data points from 10 groups with one continuous fixed effect variable
x<-runif(100,0,10)
f<-gl(n = 10,k = 10)
data<-data.frame(x=x,f=f)
modmat<-model.matrix(~x,data)
#the fixed effect coefficient
fixed<-c(1,0.5)
#the random effect
rnd<-rnorm(10,0,0.7)
#the simulated response values
data$y<-rnorm(100,modmat%*%fixed+rnd[f],0.3)

#model
m<-lmer(y~x+(1|f),data)

#first CI and PI using predict-like method, using code posted here: http://glmm.wikidot.com/faq
newdat<-data.frame(x=seq(0,10,length=20))
mm<-model.matrix(~x,newdat)
newdat$y<-mm%*%fixef(m) 
#predict(m,newdat,re.form=NA) would give the same results
pvar1 <- diag(mm %*% tcrossprod(vcov(m),mm))
tvar1 <- pvar1+VarCorr(m)$f[1] # must be adapted for more complex models
newdat <- data.frame(
  newdat
  , plo = newdat$y-1.96*sqrt(pvar1)
  , phi = newdat$y+1.96*sqrt(pvar1)
  , tlo = newdat$y-1.96*sqrt(tvar1)
  , thi = newdat$y+1.96*sqrt(tvar1)
)

#second version with bootMer
#we have to define a function that will be applied to the nsim simulations
#here we basically get a merMod object and return the fitted values
predFun<-function(.) mm%*%fixef(.) 
bb<-bootMer(m,FUN=predFun,nsim=200) #do this 200 times
#as we did this 200 times the 95% CI will be bordered by the 5th and 195th value
bb_se<-apply(bb$t,2,function(x) x[order(x)][c(5,195)])
newdat$blo<-bb_se[1,]
newdat$bhi<-bb_se[2,]

plot(y~x,data)
lines(newdat$x,newdat$y,col="red",lty=2,lwd=3)
lines(newdat$x,newdat$plo,col="blue",lty=2,lwd=2)
lines(newdat$x,newdat$phi,col="blue",lty=2,lwd=2)
lines(newdat$x,newdat$tlo,col="orange",lty=2,lwd=2)
lines(newdat$x,newdat$thi,col="orange",lty=2,lwd=2)
lines(newdat$x,newdat$bhi,col="darkgreen",lty=2,lwd=2)
lines(newdat$x,newdat$blo,col="darkgreen",lty=2,lwd=2)
legend("topleft",legend=c("Fitted line","Confidence interval","Prediction interval","Bootstrapped CI"),col=c("red","blue","orange","darkgreen"),lty=2,lwd=2,bty="n")

bootGLMM1

This looks pretty familiar, the prediction interval being always bigger than the confidence interval.
Now in the help page for the predict.merMod function the authors of the lme4 package wrote that bootMer should be the prefered method to derive confidence intervals from GLMM. The idea there is to simulate N times new data from the model and get some statistic of interest. In our case we are interested in deriving the bootstrapped fitted values to get confidence interval for the regression line. bb$t is a matrix with the observation in the column and the different bootstrapped samples in the rows. To get the 95% CI for the fitted line we then need to get the [0.025*N,0.975*N] values of the sorted bootstrapped values.

The bootstrapped CI falls pretty close to the “normal” CI, even if for each bootstrapped sample new random effect values were calculated (because use.u=FALSE per default in bootMer)

Now let’s turn to a more complex example, a Poisson GLMM with two crossed random effects:


#second case more complex design with two crossed RE and a poisson response
x<-runif(100,0,10)
f1<-gl(n = 10,k = 10)
f2<-as.factor(rep(1:10,10))
data<-data.frame(x=x,f1=f1,f2=f2)
modmat<-model.matrix(~x,data)

fixed<-c(-0.12,0.35)
rnd1<-rnorm(10,0,0.7)
rnd2<-rnorm(10,0,0.2)

mus<-modmat%*%fixed+rnd1[f1]+rnd2[f2]
data$y<-rpois(100,exp(mus))

m<-glmer(y~x+(1|f1)+(1|f2),data,family="poisson")

#for GLMMs we have to back-transform the prediction after adding/removing the SE
newdat<-data.frame(x=seq(0,10,length=20))
mm<-model.matrix(~x,newdat)
y<-mm%*%fixef(m)
pvar1 <- diag(mm %*% tcrossprod(vcov(m),mm))
tvar1 <- pvar1+VarCorr(m)$f1[1]+VarCorr(m)$f2[1]  ## must be adapted for more complex models
newdat <- data.frame(
  x=newdat$x,
  y=exp(y),
   plo = exp(y-1.96*sqrt(pvar1))
  , phi = exp(y+1.96*sqrt(pvar1))
  , tlo = exp(y-1.96*sqrt(tvar1))
  , thi = exp(y+1.96*sqrt(tvar1))
)

#second version with bootMer
predFun<-function(.) exp(mm%*%fixef(.)) 
bb<-bootMer(m,FUN=predFun,nsim=200)
bb_se<-apply(bb$t,2,function(x) x[order(x)][c(5,195)])
newdat$blo<-bb_se[1,]
newdat$bhi<-bb_se[2,]

#plot
plot(y~x,data)
lines(newdat$x,newdat$y,col="red",lty=2,lwd=3)
lines(newdat$x,newdat$plo,col="blue",lty=2,lwd=2)
lines(newdat$x,newdat$phi,col="blue",lty=2,lwd=2)
lines(newdat$x,newdat$tlo,col="orange",lty=2,lwd=2)
lines(newdat$x,newdat$thi,col="orange",lty=2,lwd=2)
lines(newdat$x,newdat$bhi,col="darkgreen",lty=2,lwd=2)
lines(newdat$x,newdat$blo,col="darkgreen",lty=2,lwd=2)
legend("topleft",legend=c("Fitted line","Confidence interval","Prediction interval","Bootstrapped CI"),col=c("red","blue","orange","darkgreen"),lty=2,lwd=2,bty="n")

bootGLMM2

Again in this case the bootstrapped CI falled pretty close to the “normal” CI. We have here seen three different way to derive intervals representing the uncertainty around the regression lines (CI) and the response points (PI). Choosing among them would depend on what you want to see (what is the level of uncertainty around my fitted line vs if I sample new observations which value will they take), but also for complex model on computing power, as bootMer can take some time to run for GLMM with many observations and complex model structure.

7 thoughts on “Confidence Intervals for prediction in GLMMs

  1. very well explained!

    lines 40 & 41 in the first code chunk and 36 & 37 in the second

    newdat$blo<-bb_se[1,]
    newdat$bhi<-bb_se[2,]

    should probably be

    newdat$blo<-bb_se[5,]
    newdat$bhi<-bb_se[195,]

    let me know if correct

    1. Thanks for your comment and interest!

      Actually bb$t is a matrix with B x n dimension were B is the number of bootstrapped replication (here 200) and n the number of observation (here 20). We then have the observations in the columns and the bootstrapped replication in the lines. We are here interested in the lower and upper limit of the 95% confidence interval so to get for each column the 5th and 195th value on the sorted vector of bootstrapped replication. This is what the apply call does. We then have 2 x n matrix (bb_se) with the first line being the lower limit and the second line the upper one.

      As always in R there are 1000s of ways to do one thing!

  2. Thanks for your great post. I have tried to replicate it using a model containing a spline term, but with the splines the matrix is non-conformable, so running line 22 gives an error. In contrast, at line 23 you mention that predict could substitute line 22, but predict works fine. I have reproduced an example using your data:

    > library(splines)
    > m mm newdat mm newdat$y<-mm%*%fixef(m)
    Error in mm %*% fixef(m) : non-conformable arguments

    Do you have any suggestion for how to overcome this problem?

    Also, I echo Pankil's concern – when I run those two lines I get a prediction interval that is tiny and doesn't include the mean, so I think he is correct inasmuch as at least line 41 should perhaps be something different?

    Thanks,
    Sheena

  3. I agree with the others, the apply command simply sorts the matrix rows, so the 1st and 2nd row would give the lowest bootstrapped values out of the 200 simulations. Pankil’s solution should work.

    Cheers,
    Steffen.

    1. Hi Steffen,
      Thanks for your comment, note that the command to create the bb_se matrix object is NOT:

      bb_se<-apply(bb$t,2,function(x) x[order(x)])

      Which would just sort the COLUMNS, it is rather

      bb_se<-apply(bb$t,2,function(x) x[order(x)][c(5,195)])

      Which in addition to sorting the columns take only the 5th and 195th values from each columns, representing a 95% bootstrapped confidence interval.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s