Confidence Intervals for prediction in GLMMs

With LM and GLM the predict function can return the standard error of the predicted values, on either the observed data or on new data. These standard errors are then used to draw confidence or prediction intervals around the fitted regression lines. A confidence interval (CI) focuses on the regression line and can be interpreted as (assuming a 95% CI): “if we repeated our sampling many times, the estimated regression line would fall within this interval 95% of the time”. A prediction interval (PI), on the other hand, focuses on single data points and can be interpreted as (again for a 95% PI): “if we sampled many times at these particular values of the explanatory variables, the response values would fall within this interval 95% of the time”. The Wikipedia page has some nice explanations about the meaning of confidence intervals.
For GLMMs the predict function does not return standard errors, the reason being (from the help page of predict.merMod): “There is no option for computing standard errors of predictions because it is difficult to define an efficient method that incorporates uncertainty in the variance parameters”. In other words, there is for now no way for the standard errors of the predicted values to account for the fact that the fitted random effect standard deviations are themselves estimates and may be more or less well estimated. We can still derive confidence or prediction intervals, keeping in mind that we might underestimate the uncertainty around the estimates.
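As a reminder, here is a minimal sketch of what this looks like for a plain linear model, where predict() handles both interval types directly (the simulated data and object names are only for illustration):

#CI and PI from a simple linear model, for comparison
x_lm<-runif(100,0,10)
y_lm<-1+0.5*x_lm+rnorm(100,0,0.3)
m_lm<-lm(y_lm~x_lm)
newdat_lm<-data.frame(x_lm=seq(0,10,length=20))
#confidence interval around the regression line
predict(m_lm,newdat_lm,interval="confidence")
#prediction interval around new observations
predict(m_lm,newdat_lm,interval="prediction")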


library(lme4)

#first case simple lmer, simulate 100 data points from 10 groups with one continuous fixed effect variable
x<-runif(100,0,10)
f<-gl(n = 10,k = 10)
data<-data.frame(x=x,f=f)
modmat<-model.matrix(~x,data)
#the fixed effect coefficient
fixed<-c(1,0.5)
#the random effect
rnd<-rnorm(10,0,0.7)
#the simulated response values
data$y<-rnorm(100,modmat%*%fixed+rnd[f],0.3)

#model
m<-lmer(y~x+(1|f),data)

#first CI and PI using predict-like method, using code posted here: http://glmm.wikidot.com/faq
newdat<-data.frame(x=seq(0,10,length=20))
mm<-model.matrix(~x,newdat)
newdat$y<-mm%*%fixef(m) 
#predict(m,newdat,re.form=NA) would give the same results
pvar1 <- diag(mm %*% tcrossprod(vcov(m),mm))
tvar1 <- pvar1+VarCorr(m)$f[1] # must be adapted for more complex models
newdat <- data.frame(
  newdat
  , plo = newdat$y-1.96*sqrt(pvar1)
  , phi = newdat$y+1.96*sqrt(pvar1)
  , tlo = newdat$y-1.96*sqrt(tvar1)
  , thi = newdat$y+1.96*sqrt(tvar1)
)

#second version with bootMer
#we have to define a function that will be applied to the nsim simulations
#here we basically get a merMod object and return the fitted values
predFun<-function(.) mm%*%fixef(.) 
bb<-bootMer(m,FUN=predFun,nsim=200) #do this 200 times
#as we did this 200 times the 95% CI will be bordered by the 5th and 195th value
bb_se<-apply(bb$t,2,function(x) x[order(x)][c(5,195)])
newdat$blo<-bb_se[1,]
newdat$bhi<-bb_se[2,]

plot(y~x,data)
lines(newdat$x,newdat$y,col="red",lty=2,lwd=3)
lines(newdat$x,newdat$plo,col="blue",lty=2,lwd=2)
lines(newdat$x,newdat$phi,col="blue",lty=2,lwd=2)
lines(newdat$x,newdat$tlo,col="orange",lty=2,lwd=2)
lines(newdat$x,newdat$thi,col="orange",lty=2,lwd=2)
lines(newdat$x,newdat$bhi,col="darkgreen",lty=2,lwd=2)
lines(newdat$x,newdat$blo,col="darkgreen",lty=2,lwd=2)
legend("topleft",legend=c("Fitted line","Confidence interval","Prediction interval","Bootstrapped CI"),col=c("red","blue","orange","darkgreen"),lty=2,lwd=2,bty="n")

[Figure bootGLMM1: the fitted regression line with the confidence, prediction and bootstrapped intervals]

This looks pretty familiar: the prediction interval is always wider than the confidence interval.
Now, in the help page for the predict.merMod function, the authors of the lme4 package write that bootMer should be the preferred method to derive confidence intervals from GLMMs. The idea is to simulate N new data sets from the model and to compute some statistic of interest on each. In our case we are interested in the bootstrapped fitted values, in order to get a confidence interval for the regression line. bb$t is a matrix with the observations in the columns and the different bootstrapped samples in the rows. To get the 95% CI for the fitted line we then need to take the [0.025*N, 0.975*N] values of the sorted bootstrapped values.
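Equivalently, quantile() avoids hard-coding the 5th and 195th positions and adapts to any nsim; a minimal sketch:

#same bounds as above, via the 2.5% and 97.5% quantiles
bb_ci<-apply(bb$t,2,quantile,probs=c(0.025,0.975))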

The bootstrapped CI falls pretty close to the “normal” CI, even though new random effect values were drawn for each bootstrapped sample (because use.u=FALSE by default in bootMer).
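If instead we want to condition on the fitted random effect values, we can set use.u = TRUE; a sketch (the resulting CI then reflects uncertainty in the fixed effects only):

#keep the estimated random effects fixed across the simulations
bb_u<-bootMer(m,FUN=predFun,nsim=200,use.u=TRUE)
bb_u_ci<-apply(bb_u$t,2,quantile,probs=c(0.025,0.975))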

Now let’s turn to a more complex example, a Poisson GLMM with two crossed random effects:


#second case more complex design with two crossed RE and a poisson response
x<-runif(100,0,10)
f1<-gl(n = 10,k = 10)
f2<-as.factor(rep(1:10,10))
data<-data.frame(x=x,f1=f1,f2=f2)
modmat<-model.matrix(~x,data)

fixed<-c(-0.12,0.35)
rnd1<-rnorm(10,0,0.7)
rnd2<-rnorm(10,0,0.2)

mus<-modmat%*%fixed+rnd1[f1]+rnd2[f2]
data$y<-rpois(100,exp(mus))

m<-glmer(y~x+(1|f1)+(1|f2),data,family="poisson")

#for GLMMs we have to back-transform the prediction after adding/removing the SE
newdat<-data.frame(x=seq(0,10,length=20))
mm<-model.matrix(~x,newdat)
y<-mm%*%fixef(m)
pvar1 <- diag(mm %*% tcrossprod(vcov(m),mm))
tvar1 <- pvar1+VarCorr(m)$f1[1]+VarCorr(m)$f2[1]  ## must be adapted for more complex models
newdat <- data.frame(
  x=newdat$x,
  y=exp(y),
   plo = exp(y-1.96*sqrt(pvar1))
  , phi = exp(y+1.96*sqrt(pvar1))
  , tlo = exp(y-1.96*sqrt(tvar1))
  , thi = exp(y+1.96*sqrt(tvar1))
)

#second version with bootMer
predFun<-function(.) exp(mm%*%fixef(.)) 
bb<-bootMer(m,FUN=predFun,nsim=200)
bb_se<-apply(bb$t,2,function(x) x[order(x)][c(5,195)])
newdat$blo<-bb_se[1,]
newdat$bhi<-bb_se[2,]

#plot
plot(y~x,data)
lines(newdat$x,newdat$y,col="red",lty=2,lwd=3)
lines(newdat$x,newdat$plo,col="blue",lty=2,lwd=2)
lines(newdat$x,newdat$phi,col="blue",lty=2,lwd=2)
lines(newdat$x,newdat$tlo,col="orange",lty=2,lwd=2)
lines(newdat$x,newdat$thi,col="orange",lty=2,lwd=2)
lines(newdat$x,newdat$bhi,col="darkgreen",lty=2,lwd=2)
lines(newdat$x,newdat$blo,col="darkgreen",lty=2,lwd=2)
legend("topleft",legend=c("Fitted line","Confidence interval","Prediction interval","Bootstrapped CI"),col=c("red","blue","orange","darkgreen"),lty=2,lwd=2,bty="n")

[Figure bootGLMM2: the back-transformed fitted line and intervals for the Poisson GLMM]

Again the bootstrapped CI falls pretty close to the “normal” CI. We have now seen three different ways to derive intervals representing the uncertainty around the regression line (CI) and around new response values (PI). Choosing among them depends on what you want to show (what is the level of uncertainty around my fitted line, versus which values new observations would take), but also, for complex models, on computing power, as bootMer can take some time to run for GLMMs with many observations and a complex model structure.

Generating ANOVA-like table from GLMM using parametric bootstrap

[UPDATE: I modified the code of the function a bit; you no longer need to pass the random effect terms as character.]

This article may also be found on RPubs: http://rpubs.com/hughes/63269

In the list of worst to best ways to test for effects in GLMMs, the FAQ at http://glmm.wikidot.com/faq states that parametric bootstrapping is among the best options. PBmodcomp in the pbkrtest package implements such parametric bootstrapping by comparing a full model to a null one. The function simulates data (the response vector) from the null model, fits both the null and full model to these simulated data, and derives a likelihood ratio test (LRT) statistic for each simulated data set. We can then compare the observed LRT statistic to the null distribution generated from the many simulations and derive a p-value. The advantage of this method over the classical p-values derived from a chi-square test on the LRT statistic is that the parametric bootstrap does not assume any particular null distribution (like the chi-square) but instead derives the null distribution from the model and the data at hand. I have written a little function that wraps around PBmodcomp to compute bootstrapped p-values for each term in a model by adding them sequentially. This leads to ANOVA-like tables of the kind typically obtained when calling anova on a glm object.
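Before turning to the wrapper, here is a minimal sketch of a single PBmodcomp call on simulated data (all object names here are made up for the illustration; note that the larger model is passed first):

library(lme4)
library(pbkrtest)
x_pb<-runif(100,0,10)
f_pb<-gl(n=10,k=10)
y_pb<-rpois(100,exp(0.1+0.2*x_pb+rnorm(10,0,0.5)[f_pb]))
m_full<-glmer(y_pb~x_pb+(1|f_pb),family="poisson")
m_null<-glmer(y_pb~1+(1|f_pb),family="poisson")
#parametric bootstrap comparison of the two nested models
PBmodcomp(m_full,m_null,nsim=100)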

#the libraries used
library(lme4)
library(arm)
library(pbkrtest)
#the function with the following arguments
#@model the merMod model fitted by lmer or glmer
#@w the weights used in a binomial fit
#@seed you can set a seed to find back the same results after bootstrapping
#@nsim the number of bootstrapped simulation, if set to 0 return the Chisq p-value from the LRT test
#@fixed should the variables be dropped based on the order given in the model (TRUE) or should the dropping go from worst to best variable (FALSE)
anova_merMod<-function(model,w=FALSE,seed=round(runif(1,0,100),0),nsim=0,fixed=TRUE){
  require(pbkrtest)
  data<-model@frame
  if(w){
    weight<-data[,which(names(data)=="(weights)")]
    data<-data[,-which(names(data)=="(weights)")]
  }
  f<-formula(model)
  resp<-as.character(f)[2]
  rand<-lme4:::findbars(f)
  rand<-lapply(rand,function(x) as.character(x))
  rand<-lapply(rand,function(x) paste("(",x[2],x[1],x[3],")",sep=" "))
  rand<-paste(rand,collapse=" + ")
  
  #generate a list of reduced model formula
  fs<-list()
  fs[[1]]<-as.formula(paste(resp,"~ 1 +",rand))
  nb_terms<-length(attr(terms(model),"term.labels"))
  if(nb_terms>1){
    #the next two line will make that the terms in the formula will add first the most important term, and the least important one at the end 
    #going first through the interactions and then through the main effects
    mat<-data.frame(term=attr(terms(model),"term.labels"),SSQ=anova(model)[,3],stringsAsFactors = FALSE)
    mat_inter<-mat[grep(":",mat$term),]
    mat_main<-mat[!rownames(mat)%in%rownames(mat_inter),]
    if(!fixed){
      mat_main<-mat_main[do.call(order,list(-mat_main$SSQ)),]
      mat_inter<-mat_inter[do.call(order,list(-mat_inter$SSQ)),]
      mat<-rbind(mat_main,mat_inter)
    }   
    for(i in 1:nb_terms){
      tmp<-c(mat[1:i,1],rand)
      fs[[i+1]]<-reformulate(tmp,response=resp)
    }      
  }
  else{
    mat<-data.frame(term=attr(terms(model),"term.labels"),stringsAsFactors = FALSE)
  }

  #fit the reduced model to the data

  
  fam<-family(model)[1]$family
  if(fam=="gaussian"){
    m_fit<-lapply(fs,function(x) lmer(x,data,REML=FALSE))
  }
  else if(fam=="binomial"){
    m_fit<-lapply(fs,function(x) glmer(x,data,family=fam,weights=weight))
  }
  else{
    m_fit<-lapply(fs,function(x) glmer(x,data,family=fam))
  }

  if(nb_terms==1){
    if(fam=="gaussian"){
      m_fit[[2]]<-lmer(formula(model),data,REML=FALSE)
    }
    else if(fam=="binomial"){
      m_fit[[2]]<-glmer(formula(model),data,family=fam,weights=weight)
    }
    else{
      m_fit[[2]]<-glmer(formula(model),data,family=fam)
    }
  }
  #compare nested model with one another and get LRT values (ie increase in the likelihood of the models as parameters are added)
  tab_out<-NULL
  
  for(i in 1:(length(m_fit)-1)){
    if(nsim>0){
      comp<-PBmodcomp(m_fit[[i+1]],m_fit[[i]],seed=seed,nsim=nsim)    
      term_added<-mat[i,1]
      #here are reported the bootstrapped p-values, ie not assuming any parametric distribution like chi-square to the LRT values generated under the null model
      #these p-values represent the number of time the simulated LRT value (under null model) are larger than the observe one
      tmp<-data.frame(term=term_added,LRT=comp$test$stat[1],p_value=comp$test$p.value[2])
      tab_out<-rbind(tab_out,tmp)
      print(paste("Variable ",term_added," tested",sep=""))
    }
    else{
      comp<-anova(m_fit[[i+1]],m_fit[[i]])
      term_added<-mat[i,1]
      tmp<-data.frame(term=term_added,LRT=comp[2,6],p_value=comp[2,8])
      tab_out<-rbind(tab_out,tmp)
    }
    
  }  
  print(paste("Seed set to:",seed))
  return(tab_out)  
}

You pass your merMod model to the function (see example below); if you fitted a binomial GLMM you also need to provide the weights as a vector. You can then set a seed, and the last argument is the number of bootstrap simulations; it defaults to 0, which returns the classical chi-square p-value from the LRT, but if you want to report bootstrapped results larger values (ie 1000, 10000) should be used.

Let’s look at a simple LMM example:

data(grouseticks)
m<-lmer(TICKS~cHEIGHT+YEAR+(1|BROOD),grouseticks)
summary(m)
## Linear mixed model fit by REML ['lmerMod']
## Formula: TICKS ~ cHEIGHT + YEAR + (1 | BROOD)
##    Data: grouseticks
## 
## REML criterion at convergence: 2755
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.406 -0.246 -0.036  0.146  5.807 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  BROOD    (Intercept) 87.3     9.34    
##  Residual             28.1     5.30    
## Number of obs: 403, groups:  BROOD, 118
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   5.4947     1.6238    3.38
## cHEIGHT      -0.1045     0.0264   -3.95
## YEAR96        4.1910     2.2424    1.87
## YEAR97       -4.3304     2.2708   -1.91
## 
## Correlation of Fixed Effects:
##         (Intr) cHEIGH YEAR96
## cHEIGHT -0.091              
## YEAR96  -0.726  0.088       
## YEAR97  -0.714  0.052  0.518
anova_merMod(model=m,nsim=100)
## [1] "Variable cHEIGHT tested"
## [1] "Variable YEAR tested"
## [1] "Seed set to: 23"
##     term      LRT    p_value
## 1 cHEIGHT 14.55054 0.00990099
## 2    YEAR 14.40440 0.00990099

The resulting table shows, for each term in the model, the likelihood ratio test statistic, which is basically the decrease in deviance when going from the null to the full model, and the p-value; you may look at the PBtest line in the Details section of ?PBmodcomp to see how it is computed.
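In essence, the PBtest p-value is the proportion of simulated LRT statistics at least as large as the observed one, with a +1 correction; a sketch with hypothetical vectors lrt.sim (simulated statistics) and lrt.obs (observed statistic):

#bootstrapped p-value from nsim simulated LRT statistics
(sum(lrt.sim>=lrt.obs)+1)/(nsim+1)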

Now let’s see how to use the function with binomial GLMM:

#simulate some binomial data
x1<-runif(100,-2,2)
x2<-runif(100,-2,2)
group<-gl(n = 20,k = 5)
rnd.eff<-rnorm(20,mean=0,sd=1.5)
p<-1+0.5*x1-2*x2+rnd.eff[group]+rnorm(100,0,0.3)
y<-rbinom(n = 100,size = 10,prob = invlogit(p))
prop<-y/10
#fit a model
m<-glmer(prop~x1+x2+(1|group),family="binomial",weights = rep(10,100))
summary(m)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: prop ~ x1 + x2 + (1 | group)
## Weights: rep(10, 100)
## 
##      AIC      BIC   logLik deviance df.resid 
##    288.6    299.1   -140.3    280.6       96 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -2.334 -0.503  0.181  0.580  2.466 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  group  (Intercept) 1.38     1.18    
## Number of obs: 100, groups:  group, 20
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.748      0.287    2.61   0.0092 ** 
## x1             0.524      0.104    5.02  5.3e-07 ***
## x2            -2.083      0.143  -14.56  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##    (Intr) x1    
## x1  0.090       
## x2 -0.205 -0.345
#takes some time
anova_merMod(m,w = rep(10,100),nsim=100)
## [1] "Variable x1 tested"
## [1] "Variable x2 tested"
## [1] "Seed set to: 98"
##   term      LRT p_value
## 1   x1   0.0429 0.80392
## 2   x2 502.0921 0.01961

For binomial models, the model must be fitted on proportion data and a vector of weights (ie the number of binomial trials) must be passed to the ‘w’ argument. Some warning messages may pop up at the end of the function; these come from convergence failures in PBmodcomp and do not affect the results. You may read the article on the pbkrtest package (http://www.jstatsoft.org/v59/i09/) to better understand where they come from.
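Note that an equivalent way to fit the same binomial GLMM without going through the weights argument is the two-column (successes, failures) response specification; a sketch using the objects simulated above:

#cbind(successes, failures) instead of proportions plus weights
m_cbind<-glmer(cbind(y,10-y)~x1+x2+(1|group),family="binomial")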

Happy modeling, and as Ben Bolker says: “When all else fails, don’t forget to keep p-values in perspective: http://www.phdcomics.com/comics/archive.php?comicid=905”

Using bootMer to do model comparison in R

Setting the right random effect part in mixed effect models can be tricky in many applied situations. I will not talk here about choosing whether a grouping variable (sites, individuals …) should be included as a fixed or as a random term; please see Gelman and Hill (2006) and Zuur et al (2009) for more information. Here I will present the use of the bootMer function in the package lme4 to compare two models with different random effect specifications and decide whether one model does a (significantly) better job at fitting the data. The standard way to compare two models is to derive the likelihood ratio test (LRT) statistic and, since it should follow a chi-square distribution, derive a p-value corresponding to the probability of observing such an extreme LRT under the null hypothesis that both models perform equally well. This approach works relatively well for GLMs, but for (G)LMMs several problems arise, mainly due to boundary effects (the null hypothesis being in this case that the variance of the random effects is 0); see Bolker et al (2009). One way to do model comparison in (G)LMMs is to derive bootstrapped likelihood values from the two competing models and to draw confidence intervals around the observed values to decide whether one model performs better than the other. Below is some code using simulated data (a cleaner version with more graphs can be found here: http://rpubs.com/hughes/22059):

library(lme4)
library(arm)
library(RColorBrewer)

##### work on model comparison using bootMer ##### simulate some data and fit a
##### random intercept model to them
x <- runif(100, 0, 10)
# the grouping variable
site <- gl(n = 10, k = 10)
# the random intercept effect, the simulated standard deviation around the
# intercept is 1
rnd <- rnorm(10, 0, 1)
# the simulated response variable, note that the fixed effect coefficients
# are 1 for the intercept and 3 for the slope. Also the simulated residuals
# have a standard deviation of one
y <- rep(1 + rnd, each = 10) + 3 * x + rnorm(100, 0, 1)
# fit the model using Maximum Likelihood to be able to use the LRT
m1 <- lmer(y ~ x + (1 | site), REML = FALSE)

# simulate to generate credible intervals
simu <- sim(m1, n.sims = 1000)
# a new model matrix with ordered and equally spaced predictor values
new.x <- model.matrix(~x, data = data.frame(x = seq(0, 10, length.out = 100)))
new.y <- matrix(ncol = 1000, nrow = 100)
# get the predicted response values for each 1000 simulations of the fixed
# effect model parameters
new.y <- apply(simu@fixef, 1, function(x) new.x %*% x)
# compute the lower/upper quantile
lower <- apply(new.y, 1, function(x) quantile(x, prob = 0.025))
upper <- apply(new.y, 1, function(x) quantile(x, prob = 0.975))
median <- apply(new.y, 1, function(x) quantile(x, prob = 0.5))

# nice plot
pal <- brewer.pal(10, "RdYlBu")
plot(y ~ x, col = rep(pal, each = 10), pch = 16)
lines(new.x[, 2], median, col = "blue", lwd = 2)
lines(new.x[, 2], lower, col = "red", lwd = 2, lty = 2)
lines(new.x[, 2], upper, col = "red", lwd = 2, lty = 2)

[Figure fig1: simulated data with the median fitted line and the 95% credible interval]

# fit a second model with a random slope effect
m2 <- lmer(y ~ x + (x | site), REML = FALSE)

# using bootMer to compute 100 bootstrapped log-likelihood
b1 <- bootMer(m1, FUN = function(x) as.numeric(logLik(x)), nsim = 100)
b2 <- bootMer(m2, FUN = function(x) as.numeric(logLik(x)), nsim = 100)

# the observed LRT value
lrt <- as.numeric(-2 * logLik(m1) + 2 * logLik(m2))
# the 100 bootstrapped LRT
lrt.b <- -2 * b1$t + 2 * b2$t
# plot
quant <- quantile(lrt.b, probs = c(0.025, 0.975))
plot(1, lrt, xlab = "", ylab = "Likelihood ratio test", xaxt = "n", ylim = c(quant[1] + 
    1, quant[2] + 1))
abline(h = 0, lty = 2, lwd = 2, col = "red")
segments(1, quant[1], 1, quant[2], lend = 1)

In the above example the 95% CI of the bootstrapped LRT crosses the 0 line, which means that neither model fits the data better than the other. In this case the rule would be to keep the simplest model (the one with the fewest parameters), here the random-intercept model.

Let's look at another example:

# now simulate data from random intercept/ slope
rnd.slope <- rnorm(10, 0, 0.5)
y <- rep(1 + rnd, each = 10) + rep(3 + rnd.slope, each = 10) * x + rnorm(100, 
    0, 1)

# the new models
m3 <- lmer(y ~ x + (x | site), REML = FALSE)
m4 <- lmer(y ~ x + (1 | site), REML = FALSE)

# LRT the observed values
lrt <- -2 * logLik(m4) + 2 * logLik(m3)
# the bootstrap
b3 <- bootMer(m3, FUN = function(x) as.numeric(logLik(x)), nsim = 100)
b4 <- bootMer(m4, FUN = function(x) as.numeric(logLik(x)), nsim = 100)

# the 100 bootstrapped LRT
lrt.b <- -2 * b4$t + 2 * b3$t

# the nice plot
quant <- quantile(lrt.b, probs = c(0.025, 0.975))
plot(1, lrt, xlab = "", ylab = "Likelihood ratio test", xaxt = "n", ylim = c(0, 
    quant[2] + 1))
abline(h = 0, lty = 2, lwd = 2, col = "red")
segments(1, quant[1], 1, quant[2], lend = 1)

In this second example the random intercept/slope model fits the data much better than the random intercept one, so this random effect structure should be kept. As mentioned in Bolker et al (2009), the relevance of the LRT depends on the design and on the interest placed on the random terms. When the random terms arise from the particular design of the study (sites, blocks …) and are considered a “nuisance”, they may be included in the models without testing the improvement in fit that their inclusion provides. When the random term effects are of interest (individual sampling units …), then the LRT is a sensible way to detect and interpret the effect of the random terms. The function PBmodcomp in the package pbkrtest allows one to do all the preceding code in just one line, with various ways to test for the significance of the likelihood ratio (thanks to Ben Bolker for his comment).
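For completeness, a sketch of that one-line equivalent for the second example (larger model first):

library(pbkrtest)
PBmodcomp(m3,m4,nsim=100)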

Biblio:
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J. S. S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24(3), 127-135.
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Zuur, A., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed effects models and extensions in ecology with R. Springer.

Generalized Linear Mixed Models in Ecology and in R

I had a nice workshop two weeks ago in Tübingen (southern Germany) on Generalized Linear Mixed Models (GLMM) in R. The course was given by two ecologists, Dr. Pius and Fränzi Korner-Nievergelt, who now spend half of their time doing statistical consulting (http://www.oikostat.ch/navigation_engl.htm). Nice references concerning GLMMs are: the 2009 Bolker paper (paper), the 2007 book by Gelman (book1) and the 2009 Zuur mixed effect book (book2).

The course was very nice, starting from basic linear models and moving to more complex modelling techniques like GLMM. The teachers are also among the growing (tiny) number of ecologists who are trying out and applying Bayesian data analysis to their datasets, for theoretical as well as practical reasons (some complex model structures can only be fitted within a Bayesian framework).

I will most certainly need a few years to fully digest and apply what I learned and saw there, but I also had to give a small workshop for my working group to spread the knowledge gained. So this post is just to share the R script I used to show how to fit GLMMs, how to assess GLMM assumptions, when to choose between fixed and mixed effect models, how to do model selection in GLMMs, and how to draw inference from GLMMs.

As a teaser here are two cool graphs that you can do with this code:

[Figures GLMM_plot1 and GLMM_plot3: two example plots produced by the script below]

######################################################################
# R script used for the GLMM mini-Workshop on 11th March in Freising #
#       Author: Lionel Hertzog and Franzi Korner-Nievergelt          #
######################################################################


#load the libraries
library(lme4)
library(nlme)
library(arm)

########################
# part 0 fitting GLMM #
 #  #  #  #  #  #  # 

#the example dataset we will use
data<-read.table("/home/lionel/Documents/PhD/GLMM_WS/data/rikz.txt",sep=" ",head=TRUE)
#first a random intercept model
mod_lme1<-lme(Richness~NAP,data=data,random=~1|Beach)
mod_lmer1<-lmer(Richness~NAP+(1|Beach),data=data)
#then a random slope plus intercept model
mod_lme2<-lme(Richness~NAP,data=data,random=NAP|Beach)
mod_lmer2<-lmer(Richness~NAP+(NAP|Beach),data=data)
#Poisson model
mod_glmer1<-glmer(Richness~NAP+(1|Beach),data=data,family="poisson")
#nested and crossed random effect??

##################################
#   part 1 mixed vs fixed effect #
  #   #   #   #   #   #   #   #
#factor variable with intercept only effect
#simulate data in a fixed effect way
x<-rnorm(120,0,1)
f<-gl(n=6,k=20)
modmat<-model.matrix(~x+f,data.frame(x=x,f=f))
betas<-c(1,2,0.3,-3,0.5,1.2,0.8)
y<-modmat%*%betas+rnorm(120,0,1)

#the fixed effect model
m_lm<-lm(y~x+f)
#the mixed effect model
m_lme<-lme(y~x,random=~1|f)

#checking model assumptions in both case
par(mfrow=c(2,2))
plot(m_lm)

plot(fitted(m_lme),resid(m_lme))
qqnorm(resid(m_lme))
qqnorm(ranef(m_lme)[,1])
plot(x,resid(m_lme))

#looking at the fitted parameters
summary(m_lm)
summary(m_lme)

#plot the fit of the model
par(mfrow=c(1,1))
library(RColorBrewer)
pal<-brewer.pal(6,"Set1")
plot(y~x,col=pal[f],pch=16,main="Linear Model")
for(i in 1:length(levels(f))){
  if(i==1){
    lines(x,coef(m_lm)[1]+coef(m_lm)[2]*x,col=pal[i],lwd=1.5)
  }
  else{
    lines(x,coef(m_lm)[1]+coef(m_lm)[2]*x+coef(m_lm)[i+1],col=pal[i],lwd=1.5)
  }
}

plot(y~x,col=pal[f],pch=16,main="Linear Mixed Model")
for(i in 1:length(levels(f))){
 lines(x,fixef(m_lme)[1]+fixef(m_lme)[2]*x+ranef(m_lme)[i,1],col=pal[i],lwd=1.5) 
}
#no clear difference visible

#now generate a random slope/intercept model through the mixed effect technique
rnd.eff<-rnorm(5,0,2)
slp.eff<-rnorm(5,0,1)
all.eff<-c(1,2,rnd.eff,slp.eff)
modmat<-model.matrix(~x*f,data.frame(x=x,f=f))
y<-modmat%*%all.eff+rnorm(120,0,1)

#build the two model
m_lm2<-lm(y~x*f)
m_lme2<-lme(y~x,random=~x|f)

#checking model assumptions
par(mfrow=c(2,2))
plot(m_lm2)
plot(fitted(m_lme2),resid(m_lme2))
abline(h=0,lty=2,col="red")
qqnorm(resid(m_lme2))
qqnorm(ranef(m_lme2)[,1])
qqnorm(ranef(m_lme2)[,2])

#summary of the models
summary(m_lm2)
summary(m_lme2)

#plot the model fitted values
par(mfrow=c(1,2))
plot(y~x,col=pal[f],pch=16,main="Linear Model")
for(i in 1:length(levels(f))){
  if(i==1){
    lines(x,coef(m_lm2)[1]+coef(m_lm2)[2]*x,col=pal[i],lwd=1.5)
  }
  else{
    lines(x,coef(m_lm2)[1]+(coef(m_lm2)[2]+coef(m_lm2)[i+6])*x+coef(m_lm2)[i+1],col=pal[i],lwd=1.5)
  }
}

plot(y~x,col=pal[f],pch=16,main="Linear Mixed Model")
for(i in 1:length(levels(f))){
  lines(x,fixef(m_lme2)[1]+(fixef(m_lme2)[2]+ranef(m_lme2)[i,2])*x+ranef(m_lme2)[i,1],col=pal[i],lwd=1.5) 
}

#again no clear difference can be seen ...

#conclusion
#end of Practical 1

#######################
#   Practical 2      #
  #   #   #   #   #  

#checking model assumptions
par(mfrow=c(2,2))
plot(fitted(m_lme2),resid(m_lme2))
abline(h=0,lty=2,col="red")
qqnorm(resid(m_lme2))
qqline(resid(m_lme2))
qqnorm(ranef(m_lme2)[,1])
qqline(ranef(m_lme2)[,1])
qqnorm(ranef(m_lme2)[,2])
qqline(ranef(m_lme2)[,2])
scatter.smooth(fitted(m_lme2),sqrt(abs(resid(m_lme2))))

#wrong data
modmat[,2]<-log(modmat[,2]+10)
y<-modmat%*%all.eff+runif(120,0,5)
m_wrg<-lme(y~x,random=~x|f)

plot(fitted(m_wrg),resid(m_wrg))
abline(h=0,lty=2,col="red")
qqnorm(resid(m_wrg))
qqline(resid(m_wrg))
qqnorm(ranef(m_wrg)[,1])
qqline(ranef(m_wrg)[,1])
qqnorm(ranef(m_wrg)[,2])
qqline(ranef(m_wrg)[,2])
scatter.smooth(fitted(m_wrg),sqrt(abs(resid(m_wrg))))

#plot fitted values vs resid, qqnorm the residuals and all random effects

#end of practical 2

###################
#  Practical 3   #
 #  #  #  #  #  #

#Model selection
#work with the RIKZ dataset from Zuur et al

data<-read.table("/home/lionel/Documents/PhD/GLMM_WS/data/rikz.txt",sep=" ",head=TRUE)

#testing the random effect
#a first model
mod1<-lme(Richness~NAP+Exposure,data=data,random=~1|Beach,method="REML")
#a second model without the random term, gls is used because it also fits the model through REML
mod2<-gls(Richness~NAP+Exposure,data=data,method="REML")
#likelihood ratio test, not very precise for low sample size
anova(mod1,mod2)

# parametric bootstrap
lrt.obs <- anova(mod1, mod2)$L.Ratio[2] # save the observed likelihood ratio test statistic
n.sim <- 1000  # use 1000 for a real data analysis
lrt.sim <- numeric(n.sim)
dattemp <- data
for(i in 1:n.sim){
  dattemp$ysim <- simulate(lm(Richness ~ NAP+Exposure, data=dattemp))$sim_1 # simulate new observations from the null-model
  modnullsim <- gls(ysim ~ NAP+Exposure, data=dattemp)   # fit the null-model
  modaltsim <-lme(ysim ~ NAP+Exposure, random=~1|Beach, data=dattemp)  # fit the alternative model
  lrt.sim[i] <- anova(modnullsim, modaltsim)$L.Ratio[2] # save the likelihood ratio test statistic
}

(sum(lrt.sim>=lrt.obs)+1)/(n.sim+1)  # p-value

#plot
par(mfrow=c(1,1))
hist(lrt.sim, xlim=c(0, max(c(lrt.sim, lrt.obs))), col="blue", xlab="likelihood ratio test statistic", ylab="density", cex.lab=1.5, cex.axis=1.2)
abline(v=lrt.obs, col="orange", lwd=3)

#model selection for the fixed effect part, to use the simulate function we need a merMod object
mod1_ML<-lme(Richness~NAP+Exposure,data,random=~1|Beach,method="ML")
mod3<-lme(Richness~NAP,data,random=~1|Beach,method="ML")
mod1_lmer<-lmer(Richness~NAP+Exposure+(1|Beach),data=data,REML=FALSE)
mod3_lmer<-lmer(Richness~NAP+(1|Beach),data=data,REML=FALSE)
#compare with lme results
summary(mod1_lmer)
summary(mod1_ML)
#anova
anova(mod1_lmer,mod3_lmer)

#again parametric boostrapping of the LRT
lrt.obs<-anova(mod1_lmer, mod3_lmer)$Chisq[2]
n.sim <- 1000  # use 1000 for a real data analysis
lrt.sim <- numeric(n.sim)
dattemp <- data
for(i in 1:n.sim){
  dattemp$ysim <-  unlist(simulate(mod3_lmer)) # simulate new observations from the null-model
  modnullsim <- lmer(ysim ~ NAP+(1|Beach), data=dattemp,REML=FALSE)   # fit the null-model
  modaltsim <-lmer(ysim ~ NAP+Exposure+(1|Beach), data=dattemp,REML=FALSE)  # fit the alternative model
  lrt.sim[i] <- anova(modnullsim, modaltsim)$Chisq[2] # save the likelihood ratio test statistic
}

(sum(lrt.sim>=lrt.obs)+1)/(n.sim+1)  # p-value

#plot
hist(lrt.sim, xlim=c(0, max(c(lrt.sim, lrt.obs))), col="blue", xlab="likelihood ratio test statistic", ylab="density", cex.lab=1.5, cex.axis=1.2)
abline(v=lrt.obs, col="orange", lwd=3)

#the next step would be to drop NAP first and then see if the likelihood ratio test is significant, and whether dropping Exposure first always
#leads to a higher LRT statistic
#other methods, AIC..
#R square computation for GLMM, see supplementary material from Nakagawa 2013 MEE
VarF <- var(as.vector(fixef(mod1_lmer) %*% t(mod1_lmer@pp$X)))
# VarCorr() extracts variance components
# attr(VarCorr(lmer.model),'sc')^2 extracts the residual variance, VarCorr()$Beach extracts the variance of the Beach effect
VarF/(VarF + VarCorr(mod1_lmer)$Beach[1] + attr(VarCorr(mod1_lmer), "sc")^2 )

#compute the conditional R-square
(VarF + VarCorr(mod1_lmer)$Beach[1])/(VarF + VarCorr(mod1_lmer)$Beach[1] + (attr(VarCorr(mod1_lmer), "sc")^2))

#end of practical 3


######################
#    Practical 4    #
 #  #  #  #  #  #  #

#drawing inference from a model
#p-values can be retrieved from lme and glmer calls but not from lmer
summary(mod1)
summary(mod1_lmer)

mod1_glmer<-glmer(Richness~NAP+Exposure+(1|Beach),data=data,family="poisson")
summary(mod1_glmer)

#using sim from the arm package
n.sim<-1000
simu<-sim(mod1_glmer,n.sims=n.sim)
head(simu@fixef)
#95% credible interval
apply(simu@fixef,2,quantile,prob=c(0.025,0.5,0.975))
#plotting the effect of NAP on the richness
nsim <- 1000
bsim <- sim(mod1_glmer, n.sim=nsim)
newdat <- data.frame(NAP=seq(-1.5, 2.5, length=100),Exposure=mean(data$Exposure))
Xmat <- model.matrix(~NAP+Exposure, data=newdat)
predmat <- matrix(ncol=nsim, nrow=nrow(newdat))
predmat<-apply(bsim@fixef,1,function(x) exp(Xmat%*%x))
newdat$lower <- apply(predmat, 1, quantile, prob=0.025)
newdat$upper <- apply(predmat, 1, quantile, prob=0.975)
newdat$med<-apply(predmat, 1, quantile, prob=0.5)

plot(Richness~NAP, data=data, pch=16, las=1, cex.lab=1.4, cex.axis=1.2)
lines(newdat$NAP,newdat$med,col="blue",lty=1,lwd=1.5)
lines(newdat$NAP,newdat$upper,col="red",lty=2,lwd=1.5)
lines(newdat$NAP,newdat$lower,col="red",lty=2,lwd=1.5)

#to compare the coefficients between the different terms, standardize the variables
data$stdNAP<-scale(data$NAP)
data$stdExposure<-scale(data$Exposure)
mod2_glmer<-glmer(Richness~stdNAP+stdExposure+(1|Beach),data=data,family="poisson")

#simulate to draw the posterior distribution of the coefficients
n.sim<-1000
simu<-sim(mod2_glmer,n.sims=n.sim)
head(simu@fixef)
#95% credible interval
coeff<-t(apply(simu@fixef,2,quantile,prob=c(0.025,0.5,0.975)))
#plot
plot(1:3,coeff[,2],ylim=c(-0.8,2),xaxt="n",xlab="",ylab="Estimated values")
axis(side=1,at=1:3,labels=attr(fixef(mod2_glmer),"names"))
segments(x0=1:3,y0=coeff[,1],x1=1:3,y1=coeff[,3],lend=1)
abline(h=0,lty=2,col="red")

#end of practical 4

Computing R square for Generalized Linear Mixed Models in R

R square is a widely used measure of model fit; in General Linear Models (GLM) it can be interpreted as the percent of variance in the response variable explained by the model. This measure is unitless, which makes it useful for comparing models between studies in meta-analyses.
Generalized Linear Mixed Models (GLMM) extend GLMs by including a hierarchical structure in the model; indeed, GLMs assume that the observations are independent from each other. In biological studies this assumption is often violated under certain experimental designs, for example in repeated measurement schemes (several samples from the same unit over time), or in field studies with block designs that account for natural variation in, say, a soil gradient. GLMMs are therefore becoming a popular technique among biologists to account for the multilevel structure in their datasets; see Bolker et al (2009) Trends in Ecology and Evolution. However, the various variance terms in these models (ie variance at the block level, variance at the observation level) make the computation of R square problematic. An article by Nakagawa and Schielzeth (2013, Methods in Ecology and Evolution) presents a technique to derive R square for these models.
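In a nutshell, the two estimators can be written as follows for the simplest case, a Gaussian model with a single random intercept (the helper function below is my own sketch, not from the article):

#marginal R2: variance explained by the fixed effects alone
#conditional R2: variance explained by the fixed plus the random effects
r2_glmm<-function(var_fixed,var_random,var_resid){
  tot<-var_fixed+var_random+var_resid
  c(marginal=var_fixed/tot,conditional=(var_fixed+var_random)/tot)
}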

Below is some R code to derive these values for Gaussian, binomial and Poisson GLMMs; the code for the computation of the R square is taken from the appendix of the article.
To help comprehension, let us imagine that the data come from a field study where the following were recorded in 8 plots: plant biomass (Gaussian), presence of oak species (binary/binomial), and number of caught coleopterans (Poisson). We think that these variables depend on the temperature and the vegetation type in the plot. In this example we have one random factor, the plot, and two explanatory variables: one continuous (the temperature) and one factor (the vegetation type).

#code for computing marginal and conditional R-square for gaussian, binomial and poisson GLMM
#the code for computing the R-square is taken from Nakagawa and Schielzeth 2013 MEE
library(lme4)
library(arm)
#making some simulation for gaussian data (plant biomass) depending on one continuous variable, one factor and one random intercept effect
temp<-runif(200,0,10)
veg_type<-gl(n=4,k=50,labels=c("A","B","C","D"))
#shuffle the factor
veg_type<-veg_type[sample(1:200,size=200,replace=FALSE)]
plot<-gl(n=8,k=25,labels=paste("A",1:8,sep=""))
modmat<-model.matrix(~i+plot+temp+veg_type-1,data.frame(i=rep(1,200),plot=plot,temp=temp,veg_type=veg_type))

#the plot effect
intercept.eff<-rnorm(8,mean=3,sd=2)
#put together all the effects
eff<-c(8,intercept.eff,3,0.3,1.5,-4)

#compute the response variable, the plant biomass
mus<-modmat%*%eff
y<-rnorm(200,mean=mus,sd=1)
#put all in one data frame
data<-data.frame(y=y,temp=temp,veg_type=veg_type,plot=plot)

#the null model with only the plot term
mod0<-lmer(y~1+(1|plot),data)
#the full model
mod1<-lmer(y~temp+veg_type+(1|plot),data)
#model summary
summary(mod0)
summary(mod1)

#compute the marginal R-square
#compute the variance in the fitted values
VarF <- var(as.vector(fixef(mod1) %*% t(mod1@pp$X)))
# VarCorr() extracts variance components
# attr(VarCorr(lmer.model),'sc')^2 extracts the residual variance, VarCorr()$plot extracts the variance of the plot effect
VarF/(VarF + VarCorr(mod1)$plot[1] + attr(VarCorr(mod1), "sc")^2)

#compute the conditional R-square
(VarF + VarCorr(mod1)$plot[1])/(VarF + VarCorr(mod1)$plot[1] + (attr(VarCorr(mod1), "sc")^2))

# AIC and BIC need to be calculated with ML, not REML
mod0ML<-lmer(y~1+(1|plot),data,REML=FALSE)
mod1ML<-lmer(y~temp+veg_type+(1|plot),data,REML=FALSE)

# View model fits for both models fitted by ML
summary(mod0ML)
summary(mod1ML)

#computing the percent of change in variance (PCV)
#for the plot-level random intercept variance; VarCorr() already returns variances, so no squaring is needed
1-(VarCorr(mod1)$plot[1]/VarCorr(mod0)$plot[1])
#for the residuals
1-(var(residuals(mod1))/var(residuals(mod0)))

There are two R squares computed: the marginal R square, which is the variance of the fitted values divided by the total variance, and the conditional R square, which is the variance of the fitted values plus the variance of the random effects, divided by the total variance. In their article Nakagawa and Schielzeth advise reporting these R squares along with AIC and the Proportion of Change in Variance (PCV), which measures the change of variance of specific components when adding fixed effects, ie how much of the variance at the plot level is explained by adding the two fixed effects.
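As a side note, if you prefer not to compute these by hand, my understanding is that the r.squaredGLMM function in the MuMIn package implements the same Nakagawa and Schielzeth estimators:

library(MuMIn)
r.squaredGLMM(mod1)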

Below is the code for computing the same indices for the binomial model (the presence of oaks) and the Poisson model (the number of caught coleopterans):

#simulating binomial response data
plot.eff<-rnorm(8,0,2)
all.eff<-c(-0.3,plot.eff,0.02,0.5,-0.8,0.8)
ps<-invlogit(modmat%*%all.eff)
ys<-rbinom(200,1,ps)

data<-data.frame(y=ys,temp=temp,veg_type=veg_type,plot=plot)

mod0<-glmer(y~1+(1|plot),data,family="binomial")
mod1<-glmer(y~temp+veg_type+(1|plot),data,family="binomial")

VarF <- var(as.vector(fixef(mod1) %*% t(mod1@pp$X)))
# VarCorr() extracts variance components
# attr(VarCorr(lmer.model),'sc')^2 extracts the residual variance, VarCorr()$plot extracts the variance of the plot effect
VarF/(VarF + VarCorr(mod1)$plot[1] + attr(VarCorr(mod1), "sc")^2 + (pi^2)/3)

#compute the conditional R-square
(VarF + VarCorr(mod1)$plot[1])/(VarF + VarCorr(mod1)$plot[1] + (attr(VarCorr(mod1), "sc")^2) + (pi^2)/3)

#computing the percent of change in variance (PCV)
#for the plot-level random intercept variance
1-(VarCorr(mod1)$plot[1]/VarCorr(mod0)$plot[1])
#for the residuals
1-(var(residuals(mod1))/var(residuals(mod0)))

#poisson data
plot.eff<-rnorm(8,0,2)
all.eff<-c(-0.3,plot.eff,0.2,0.5,-0.8,0.8)
lambdas<-exp(modmat%*%all.eff)
ys<-rpois(200,lambdas)

data<-data.frame(y=ys,temp=temp,veg_type=veg_type,plot=plot)

mod0<-glmer(y~1+(1|plot),data,family="poisson")
mod1<-glmer(y~temp+veg_type+(1|plot),data,family="poisson")

VarF <- var(as.vector(fixef(mod1) %*% t(mod1@pp$X)))
# VarCorr() extracts variance components
# attr(VarCorr(lmer.model),'sc')^2 extracts the residual variance, VarCorr()$plot extracts the variance of the plot effect
VarF/(VarF + VarCorr(mod1)$plot[1] + attr(VarCorr(mod1), "sc")^2 + log(1 + 1/exp(as.numeric(fixef(mod0)))))

#compute the conditional R-square
(VarF + VarCorr(mod1)$plot[1])/(VarF + VarCorr(mod1)$plot[1] + (attr(VarCorr(mod1), "sc")^2) + log(1 + 1/exp(as.numeric(fixef(mod0)))))

#computing the percent of change in variance (PCV)
#for the plot-level random intercept variance
1-(VarCorr(mod1)$plot[1]/VarCorr(mod0)$plot[1])
#for the residuals
1-(var(residuals(mod1))/var(residuals(mod0)))

If you run the code you may notice that some PCVs are negative; this means that the variance at that particular level has increased after the inclusion of the fixed effects. Have a look at the Nakagawa article and compare their values with the ones obtained by running the code above; the authors also provide the R code they used for their analysis. You may also go and look at the RPubs version of this article (much nicer to read!): http://rpubs.com/hughes/13853