Plotting regression curves with confidence intervals for LM, GLM and GLMM in R

Once models have been fitted and checked and re-checked comes the time to interpret them. The easiest way to do so is to plot the response variable against the explanatory variables (I call them predictors), adding to this plot the fitted regression curve together with (if you are feeling fancy) a confidence interval around it. I prefer this approach over partial residual plots because it averages out any other potentially confounding predictors and so focuses on the effect of one focal predictor on the response. In my work I have done this hundreds of times and finally decided to wrap it all into a function to clean up my code a little bit. As always a cleaner version of this post is available here.
Let's dive into some code (the function is at the end of the post; just copy/paste it into your R environment):

#####LM example######
#we measured plant biomass for 120 pots under 3 nutrient treatments and across a gradient of CO2
#due to limited space in our greenhouse chambers we had to use 4 of them, so we established a blocking design
data<-data.frame(C=runif(120,-2,2),N=gl(n = 3,k = 40,labels = c("Few","Medium","A_lot")),Block=rep(rep(paste0("B",1:4),each=10),times=3))
xtabs(~N+Block,data)

##         Block
## N        B1 B2 B3 B4
##   Few    10 10 10 10
##   Medium 10 10 10 10
##   A_lot  10 10 10 10

modmat<-model.matrix(~Block+C*N,data)
#the parameters of the model
params<-c(10,-0.4,2.3,-1.5,1,0.5,2.3,0.6,2.7)
#simulate a response vector
data$Biom<-rnorm(120,modmat%*%params,1)
#fit the model
m<-lm(Biom~Block+C*N,data)
summary(m)

## 
## Call:
## lm(formula = Biom ~ Block + C * N, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.11758 -0.68801 -0.01582  0.75057  2.55953 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.0768     0.2370  42.521  < 2e-16 ***
## BlockB2      -0.3011     0.2690  -1.119 0.265364    
## BlockB3       2.3322     0.2682   8.695 3.54e-14 ***
## BlockB4      -1.4505     0.2688  -5.396 3.91e-07 ***
## C             0.7585     0.1637   4.633 9.89e-06 ***
## NMedium       0.4842     0.2371   2.042 0.043489 *  
## NA_lot        2.4011     0.2335  10.285  < 2e-16 ***
## C:NMedium     0.7287     0.2123   3.432 0.000844 ***
## C:NA_lot      3.2536     0.2246  14.489  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.028 on 111 degrees of freedom
## Multiple R-squared:  0.9201, Adjusted R-squared:  0.9144 
## F-statistic: 159.8 on 8 and 111 DF,  p-value: < 2.2e-16

Here we would normally continue with some model checks. From the model we would like to plot the effect of CO2 on plant biomass for each level of N addition. Of course we want to average out the Block effect (otherwise we would have to plot one separate line for each Block). This is how it works:

pred<-plot_fit(m,focal_var = "C",inter_var = "N")
head(pred)

##            C   N       LC     Pred       UC
## 1 -1.9984087 Few 8.004927 8.706142 9.407358
## 2 -1.7895749 Few 8.213104 8.864545 9.515986
## 3 -1.5807411 Few 8.417943 9.022948 9.627952
## 4 -1.3719073 Few 8.618617 9.181350 9.744084
## 5 -1.1630735 Few 8.814119 9.339753 9.865387
## 6 -0.9542397 Few 9.003286 9.498156 9.993026

#the function outputs a data frame with columns for the varying predictors,
#a column for the predicted values (Pred), a lower bound (LC) and an upper one (UC)
#let's plot this
plot(Biom~C,data,col=c("red","green","blue")[data$N],pch=16,xlab="CO2 concentration",ylab="Plant biomass")
lines(Pred~C,pred[1:20,],col="red",lwd=3)
lines(LC~C,pred[1:20,],col="red",lwd=2,lty=2)
lines(UC~C,pred[1:20,],col="red",lwd=2,lty=2)
lines(Pred~C,pred[21:40,],col="green",lwd=3)
lines(LC~C,pred[21:40,],col="green",lwd=2,lty=2)
lines(UC~C,pred[21:40,],col="green",lwd=2,lty=2)
lines(Pred~C,pred[41:60,],col="blue",lwd=3)
lines(LC~C,pred[41:60,],col="blue",lwd=2,lty=2)
lines(UC~C,pred[41:60,],col="blue",lwd=2,lty=2)
legend("topleft",legend = c("Few","Medium","A lot"),col=c("red","green","blue"),pch=16,lwd=3,title = "N addition",bty="n")

plotFit1

The cool thing is that the function will also work for GLM, LMM and GLMM. For mixed-effects models the confidence interval is computed by parametric bootstrapping.
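The gist of the bootstrap step, as a minimal sketch separate from the function itself (fit_mm and new_dat are hypothetical placeholders for a fitted merMod object and a data frame of new predictor values):

#sketch: bootstrap the fixed-effect predictions and take their quantiles
library(lme4)
pfun<-function(fit) predict(fit,newdata=new_dat,re.form=~0)
bb<-bootMer(fit_mm,FUN=pfun,nsim=200)
#2.5% and 97.5% quantiles of the bootstrapped predictions
#(for a GLMM, back-transform them with the inverse link)
ci<-apply(bb$t,2,quantile,probs=c(0.025,0.975))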

######LMM example#######
#now let's say that we took 5 measurements per pot and we don't want to aggregate them
data<-data.frame(Pots=rep(paste0("P",1:120),each=5),Block=rep(rep(paste0("B",1:4),each=5*10),times=3),N=gl(n = 3,k = 40*5,labels=c("Few","Medium","A_lot")),C=rep(runif(120,-2,2),each=5))
#a random intercept term
rnd_int<-rnorm(120,0,0.4)
modmat<-model.matrix(~Block+C*N,data)
lin_pred<-modmat%*%params+rnd_int[factor(data$Pots)]
data$Biom<-rnorm(600,lin_pred,1)
library(lme4)
m<-lmer(Biom~Block+C*N+(1|Pots),data)
summary(m)

## Linear mixed model fit by REML ['lmerMod']
## Formula: Biom ~ Block + C * N + (1 | Pots)
##    Data: data
## 
## REML criterion at convergence: 1765.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6608 -0.6446 -0.0340  0.6077  3.2002 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Pots     (Intercept) 0.1486   0.3855  
##  Residual             0.9639   0.9818  
## Number of obs: 600, groups:  Pots, 120
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  9.93917    0.13225   75.15
## BlockB2     -0.42019    0.15153   -2.77
## BlockB3      2.35993    0.15364   15.36
## BlockB4     -1.36188    0.15111   -9.01
## C            0.97208    0.07099   13.69
## NMedium      0.36236    0.13272    2.73
## NA_lot       2.25624    0.13189   17.11
## C:NMedium    0.70157    0.11815    5.94
## C:NA_lot     2.78150    0.10764   25.84
## 
## Correlation of Fixed Effects:
##           (Intr) BlckB2 BlckB3 BlckB4 C      NMedim NA_lot C:NMdm
## BlockB2   -0.572                                                 
## BlockB3   -0.575  0.493                                          
## BlockB4   -0.576  0.495  0.493                                   
## C          0.140 -0.012 -0.018 -0.045                            
## NMedium   -0.511  0.003  0.027  0.007 -0.118                     
## NA_lot    -0.507  0.008  0.003  0.003 -0.119  0.502              
## C:NMedium -0.134  0.019  0.161  0.038 -0.601  0.175  0.071       
## C:NA_lot  -0.107  0.077  0.020  0.006 -0.657  0.078  0.129  0.394

#again model checks should come here
#for LMM and GLMM we also need to pass the name(s) of the random effect(s) as a character (vector)
pred<-plot_fit(m,focal_var = "C",inter_var = "N",RE = "Pots")
#let's plot this
plot(Biom~C,data,col=c("red","green","blue")[data$N],pch=16,xlab="CO2 concentration",ylab="Plant biomass")
lines(Pred~C,pred[1:20,],col="red",lwd=3)
lines(LC~C,pred[1:20,],col="red",lwd=2,lty=2)
lines(UC~C,pred[1:20,],col="red",lwd=2,lty=2)
lines(Pred~C,pred[21:40,],col="green",lwd=3)
lines(LC~C,pred[21:40,],col="green",lwd=2,lty=2)
lines(UC~C,pred[21:40,],col="green",lwd=2,lty=2)
lines(Pred~C,pred[41:60,],col="blue",lwd=3)
lines(LC~C,pred[41:60,],col="blue",lwd=2,lty=2)
lines(UC~C,pred[41:60,],col="blue",lwd=2,lty=2)
legend("topleft",legend = c("Few","Medium","A lot"),col=c("red","green","blue"),pch=16,lwd=3,title = "N addition",bty="n")

plotFit2
Please note a few elements:
– so far the function only returns 95% confidence intervals
– I have tested it on the various types of models that I usually build, but there are most certainly still some bugs hanging around, so if the function returns an error please let me know which model you fitted and what error was returned
– the bootstrap computation can take some time for GLMM, so be ready to wait a few minutes if you have a big, complex model
– the function accepts a vector of variable names for the inter_var argument; it should also work for the RE argument, even though I have not tried it yet

Happy plotting!

Here is the code for the function:

#for lm, glm, lmer and glmer models
#parameters:
#@m: the fitted lm, glm or merMod object (must be provided)
#@focal_var: a character, the name of the variable of interest that will be plotted on the x axis, ie the varying variable (must be provided)
#@inter_var: a character or character vector, the name(s) of the variable(s) interacting with the focal variable, ie categorical variables for which a prediction will be drawn for each level across the focal_var gradient
#@RE: for a merMod object, a character or character vector with the name(s) of the random effect variable(s) (so far I have only tried it with one RE)
#@n: a numeric, the number of data points that will form the gradient of the focal variable
#@n_core: the number of cores used to compute the bootstrapped CI for GLMM models

plot_fit<-function(m,focal_var,inter_var=NULL,RE=NULL,n=20,n_core=4){
  require(arm)  
  dat<-model.frame(m)
  #turn all character variable to factor
  dat<-as.data.frame(lapply(dat,function(x){
    if(is.character(x)){
      as.factor(x)
    }
    else{x}
  }))
  #make a sequence from the focal variable
  x1<-list(seq(min(dat[,focal_var]),max(dat[,focal_var]),length=n))
  #grab the names and unique values of the interacting variables
  isInter<-which(names(dat)%in%inter_var)
  if(length(isInter)==1){
    x2<-list(unique(dat[,isInter]))
    names(x2)<-inter_var
  }
  if(length(isInter)>1){
    x2<-lapply(dat[,isInter],unique)
  }
  if(length(isInter)==0){
    x2<-NULL
  }
  #all_var<-x1
  #add the focal variable to this list
  all_var<-c(x1,x2)
  #expand.grid on it
  names(all_var)[1]<-focal_var
  all_var<-expand.grid(all_var)
  
  #remove varying variables and non-predictors
  dat_red<-dat[,-c(1,which(names(dat)%in%c(focal_var,inter_var,RE,"X.weights."))),drop=FALSE]
  if(dim(dat_red)[2]==0){
    new_dat<-all_var
    #no other variables to average over
    name_f<-character(0)
  }
  else{
    fixed<-lapply(dat_red,function(x) if(is.numeric(x)) mean(x) else factor(levels(x)[1],levels = levels(x)))
    #the number of rows in the new_dat frame
    fixed<-lapply(fixed,rep,dim(all_var)[1])
    #create the new_dat frame starting with the varying focal variable and potential interactions
    new_dat<-cbind(all_var,as.data.frame(fixed)) 
    #get the names of the factor variables that will be averaged over
    name_f<-names(dat_red)[sapply(dat_red,function(x) ifelse(is.factor(x),TRUE,FALSE))]
  }  
    
  
  #get the predicted values
  cl<-class(m)[1]
  if(cl=="lm"){
    pred<-predict(m,newdata = new_dat,se.fit=TRUE)
  }
  
  if(cl=="glm" | cl=="negbin"){
    #predicted values on the link scale
    pred<-predict(m,newdata=new_dat,type="link",se.fit=TRUE)
  }
  if(cl=="glmerMod" | cl=="lmerMod"){
    pred<-list(fit=predict(m,newdata=new_dat,type="link",re.form=~0))
    #for bootstrapped CI
    new_dat<-cbind(new_dat,rep(0,dim(new_dat)[1]))
    names(new_dat)[dim(new_dat)[2]]<-as.character(formula(m)[[2]])
    mm<-model.matrix(formula(m,fixed.only=TRUE),new_dat)
  }
  #average over potential categorical variables  
  if(length(name_f)>0){
    if(cl=="glmerMod" | cl=="lmerMod"){
      coef_f<-lapply(name_f,function(x) fixef(m)[grep(paste0("^",x,"\\w+$"),names(fixef(m)))])
    }
    else{
      coef_f<-lapply(name_f,function(x) coef(m)[grep(paste0("^",x,"\\w+$"),names(coef(m)))])
    }    
    pred$fit<-pred$fit+sum(unlist(lapply(coef_f,function(x) mean(c(0,x)))))
  }
  #to get the back-transformed values, grab the inverse link function of the model
  linkinv<-family(m)$linkinv
  
  #get the back-transformed predictions together with the 95% CI for LM, GLM and negative binomial models
  if(cl=="glm" | cl=="lm" | cl=="negbin"){
    pred$pred<-linkinv(pred$fit)
    pred$LC<-linkinv(pred$fit-1.96*pred$se.fit)
    pred$UC<-linkinv(pred$fit+1.96*pred$se.fit)
  }
  
  #for GLMM need to use bootstrapped CI, see ?predict.merMod
  if(cl=="glmerMod" | cl=="lmerMod"){
    pred$pred<-linkinv(pred$fit)
    predFun<-function(.) mm%*%fixef(.)
    bb<-bootMer(m,FUN=predFun,nsim=200,parallel="multicore",ncpus=n_core) #do this 200 times
    bb$t<-apply(bb$t,1,function(x) linkinv(x))
    #as we did this 200 times the 95% CI will be bordered by the 5th and 195th value
    bb_se<-apply(bb$t,1,function(x) x[order(x)][c(5,195)])
    pred$LC<-bb_se[1,]
    pred$UC<-bb_se[2,] 
  }
  
  #the output
  out<-as.data.frame(cbind(new_dat[,1:(length(inter_var)+1)],pred$LC,pred$pred,pred$UC))
  names(out)<-c(names(new_dat)[1:(length(inter_var)+1)],"LC","Pred","UC")
  return(out)
}

Count data: To Log or Not To Log

Count data are widely collected in ecology, for example when one counts the number of birds or the number of flowers. These data naturally follow a Poisson or negative binomial distribution and are therefore sometimes tricky to fit with standard LMs. A traditional approach has been to log-transform such data and then fit LMs to the transformed data. Recently a paper advocated against the use of such transformations since they lead to high bias in the estimated coefficients. More recently another paper argued that log-transformation of count data followed by an LM leads to a lower type I error rate (ie saying that an effect is significant when it is not) than GLMs. What should we do then?

Using a slightly modified version of the code published in the Ives 2015 paper, let's explore the impact of these different modelling strategies on the rejection of the null hypothesis and on the bias of the estimated coefficients.


#load the libraries
library(MASS)
library(lme4)
library(ggplot2)
library(RCurl)
library(plyr)

#download and load the functions that will be used
URL<-"https://raw.githubusercontent.com/Lionel68/Jena_Exp/master/stats/ToLog_function.R"
download.file(URL,destfile=paste0(getwd(),"/ToLog_function.R"),method="curl")
source("ToLog_function.R")

Following the paper from Ives and the code therein, I simulated a predictor (x) and a response (y) that follows a negative binomial distribution and is linearly related to x (on the log scale).
In the first case I look at the impact of varying the sample size on the rejection of the null hypothesis and on the bias in the estimated coefficient between y and x.
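To make the setting concrete, here is a rough sketch of what one replicate of such a simulation could look like (the actual code lives in ToLog_function.R and may differ in details; the function name, defaults and the two models shown here are illustrative only):

#one illustrative replicate: counts whose log-mean is linear in x,
#fitted with a negative binomial GLM and with a log(y+1) LM
library(MASS)
one_rep<-function(n=100,b0=1,b1=0.5,theta=1){
  x<-runif(n,-1,1)
  y<-rnbinom(n,mu=exp(b0+b1*x),size=theta)
  p_nb<-summary(glm.nb(y~x))$coefficients["x","Pr(>|z|)"]
  b_log1<-coef(lm(log(y+1)~x))["x"]
  c(p_NB=p_nb,bias_Log1=unname(b_log1)-b1)
}
one_rep()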


######univariate NB case############
base_theme<-theme(title=element_text(size=20),text=element_text(size=18))
#range over n
output<-compute.stats(NRep=500,n.range = c(10,20,40,80,160,320,640,1280))
ggplot(output,aes(x=n,y=Reject,color=Model))+geom_path(size=2)+scale_x_log10()+labs(x="Sample Size (log-scale)",y="Proportion of Rejected H0")+base_theme
ggplot(output,aes(x=n,y=Bias,color=Model))+geom_path(size=2)+scale_x_log10()+labs(x="Sample Size (log-scale)",y="Bias in the slope coefficient")+base_theme

Figure 1

For this simulation round the coefficient of the slope (b1) was set to 0 (no effect of x on y), and only the sample size varied. The top panel shows the average proportion of times that the p-value of the slope coefficient was lower than 0.05 (H0: b1=0 rejected). We see that for low sample sizes (<40) the negative binomial model has a higher proportion of rejected H0 (type I error rate), but this difference between the models disappears as we reach sample sizes bigger than 100. The bottom panel shows the bias (estimated value – true value) in the estimated coefficient. For very low sample sizes (n=10), Log0001, negative binomial and quasi-Poisson have higher bias than Log1 and LogHalf. For larger sample sizes the difference between the GLM team (NB and QuasiP) and the LM one (Log1 and LogHalf) gradually decreases, and both teams converge to a bias around 0. Only Log0001 behaves very badly. From what we see here it seems that Log1 and LogHalf are good choices for count data: they have low type I error and low bias along the whole sample size gradient.

The issue is that an effect of exactly 0 never exists in real life, where most effects are small (but non-zero), so the null hypothesis will never be strictly true. Let's look now at how the different models behave when we vary b1, first alone and then crossed with variation in sample size.


#range over b1
output2<-compute.stats(NRep=500,b1.range = seq(-2,2,length=17))
ggplot(output2,aes(x=b1,y=Reject,color=Model))+geom_path(size=2)+base_theme+labs(y="Proportion of Rejected H0")
ggplot(output2,aes(x=b1,y=Bias,color=Model))+geom_path(size=2)+base_theme+labs(y="Bias in the slope coefficient")

Figure 2

Here the sample size was set to 100. What we see in the top graph is that for a slope of exactly 0 all models have a similar average proportion of rejection of the null hypothesis. As b1 becomes smaller or bigger, the average proportion of rejection shows a very similar increase for all models except Log0001, which increases more slowly. These curves basically represent the power of the models to detect an effect and are very similar to Fig. 2 in the Ives 2015 paper. The bottom panel shows that all the LM models behave badly in terms of bias: they have small bias only for coefficients very close to 0, and as the coefficient gets bigger the absolute bias increases. This means that even if the LM models are able to detect an effect with similar power, the estimated coefficient is wrong. This is due to the value added to the untransformed count data in order to avoid -Inf for 0s. I have no idea how one could account for these added values arithmetically and then remove their effect …

Next let’s cross variation in the coefficient with sample size variation:

#range over n and b1
output3<-compute.stats(NRep=500,b1.range=seq(-1.5,1.5,length=9),n.range=c(10,20,40,80,160,320,640,1280))
ggplot(output3,aes(x=n,y=Reject,color=Model))+geom_path(size=2)+scale_x_log10()+facet_wrap(~b1)+base_theme+labs(x="Sample size (log-scale)",y="Proportion of rejected H0")
ggplot(output3,aes(x=n,y=Bias,color=Model))+geom_path(size=2)+scale_x_log10()+facet_wrap(~b1)+base_theme+labs(x="Sample size (log-scale)",y="Bias in the slope coefficient")

Figure 3

The top panel shows one big issue with focusing only on the significance level: rejection of H0 depends not only on the size of the effect but also on the sample size. For example for b1=0.75 (a rather large value since we work on the exponential scale) less than 50% of the models rejected the null hypothesis at a sample size of 10. Of course as the effect size gets larger the impact of the sample size on the rejection of the null hypothesis is reduced. However most effects around the world are small, so we need big sample sizes to be able to “detect” them using null hypothesis testing. The top graph also shows that NB is slightly better than the other models and that Log0001 again has the worst performance. The bottom graph shows something interesting: the bias is quasi-constant over the sample size gradient (maybe if we looked closer we would see some variation). Irrespective of how many data points you collect, the LMs will always have bigger bias than the GLMs (except for the artificial case of b1=0).

Finally, in Ives 2015 the big surprise was the explosion of type I error rates with increasing variation in individual-level random error (adding a normally distributed random value to the linear predictor of each data point and varying the standard deviation of these random values), as can be seen in Fig. 3 of the paper.


#range over b1 and sd.eps
output4<-compute.statsGLMM(NRep=500,b1.range=seq(-1.5,1.5,length=9),sd.eps.range=seq(0.01,2,length=10))
ggplot(output4,aes(x=sd.eps,y=Reject,color=Model))+geom_path(size=2)+facet_wrap(~b1)+base_theme+labs(x="",y="Proportion of rejected H0")
ggplot(output4,aes(x=sd.eps,y=Bias,color=Model))+geom_path(size=2)+facet_wrap(~b1)+base_theme+labs(x="Standard Deviation of the random error",y="Bias of the slope coefficient")

Figure 4

Before looking at the figure in detail, please note that a standard deviation of 2 is very high in this context; remember that these values are added to the linear predictor, which is then exponentiated, so we end up with very large deviations. In the top panel there are two surprising results: the sign of the coefficient affects the pattern of null hypothesis rejection, and I do not see the explosion of rejection rates for NB or QuasiP that is presented in Ives 2015. In his paper Ives reported the LRT test for the NB models, whereas I am reporting the p-values from the model summary directly (Wald test). If some people around have computing power, it would be interesting to see whether changing the seed and/or increasing the number of replications leads to different patterns … The bottom panel shows again that the LMs' bias is large; the NB and QuasiP models show an increase in bias with the standard deviation, but only when the coefficient is negative (I suspect some issue with exponentiating large positive random errors). As expected the GLMM performs best in this context.

Pre-conclusion: in real life of course we would rarely have a model with only one predictor; most of the time we build larger models with complex interaction structures between the predictors. This will of course affect both H0 rejection and bias, but this is material for a future post🙂

Let's wrap it up: we've seen that even if log transformation followed by an LM seems to be a good choice for keeping a lower type I error rate than GLMs, this advantage will be rather minimal with empirical data (no effects are exactly 0) and comes at the cost of potentially large bias. Ecologists sometimes have the bad habit of turning their analyses into a star hunt (R's standard model output gives stars to significant effects), and focusing only on models that behave best with respect to significance (but with large bias) does not seem like a good strategy to me. More and more people call for increasing the predictive power of ecological models; for that we need modelling techniques that estimate effects precisely (with low bias). In this context transforming the data to make them somehow fit normality assumptions is sub-optimal; it is much more natural to think about what type of process generated the data (normal, Poisson, negative binomial, with or without hierarchical structure) and then model it accordingly. There are extensive discussions nowadays about the use and abuse of p-values in science, and I think that in our analyses we should slowly but surely shift our focus from “significance/p-values<0.05/star hunting” alone to a more balanced mix of effect sizes (or standardized slopes), p-values and R-squared.

A function to help graphical model checks of lm and ANOVA

As always a more colourful version of this post is available on rpubs.

Even if LMs are very simple models at the basis of many more complex ones, they still have assumptions that, if not met, render any interpretation of the model plainly wrong. In my field of research most people were taught to check ANOVA assumptions using tests like Levene & co. This is however not the best way to check whether a model meets its assumptions, as p-values depend on the sample size: with small samples we will almost never reject the null hypothesis, while with big samples even small deviations will lead to significant p-values (discussion). As ANOVA and linear models are two different ways to look at the same model (explanation), we can check ANOVA assumptions using graphical checks from a linear model. In R this is easily done using plot(model), but people often ask me what amount of deviation makes me reject a model. One easy way to see if the model-checking graphs are off the charts is to simulate data from the model, fit the model to these newly simulated data and compare the graphical checks from the simulated data with those from the real data. If you cannot differentiate between the simulated and the real data then your model is fine; if you can, then try again!

EDIT: You can also make the conclusion from your visual inspection more rigorous by following the protocols outlined in this article. (Thanks to Tim for the comment)

Below is a little function that implements this idea:

 

lm.test<-function(m){
  require(plyr)
  #the model frame
  dat<-model.frame(m)
  #the model matrix
  f<-formula(m)
  modmat<-model.matrix(f,dat)
  #the standard deviation of the residuals
  sd.resid<-sd(resid(m))
  #sample size
  n<-dim(dat)[1]
  #get the right-hand side of the formula  
  #rhs<-all.vars(update(f, 0~.))
  #simulate 8 response vectors from model
  ys<-lapply(1:8,function(x) rnorm(n,modmat%*%coef(m),sd.resid))
  #refit the models
  ms<-llply(ys,function(y) lm(y~modmat[,-1]))
  #put the residuals and fitted values in a list
  df<-llply(ms,function(x) data.frame(Fitted=fitted(x),Resid=resid(x)))
  #select a random number from 2 to 8
  rnd<-sample(2:8,1)
  #put the original data into the list
  df<-c(df[1:(rnd-1)],list(data.frame(Fitted=fitted(m),Resid=resid(m))),df[rnd:8])

  #plot 
  par(mfrow=c(3,3))
  l_ply(df,function(x){
    plot(Resid~Fitted,x,xlab="Fitted",ylab="Residuals")
    abline(h=0,lwd=2,lty=2)
  })

  l_ply(df,function(x){
    qqnorm(x$Resid)
    qqline(x$Resid)
  })

  out<-list(Position=rnd)
  return(out)
}

 

This function prints the two basic plots: one looking at the spread of the residuals around the fitted values, the other looking at the normality of the residuals. The function returns the position of the real model in the 3×3 window, counting from left to right and from top to bottom (ie position 1 is the upper-left graph).

Let’s try the function:
 

#a simulated data frame of independent variables
dat<-data.frame(Temp=runif(100,0,20),Treatment=gl(n = 5,k = 20))
contrasts(dat$Treatment)<-"contr.sum"
#the model matrix
modmat<-model.matrix(~Temp*Treatment,data=dat)
#the coefficient
coeff<-rnorm(10,0,4)
#simulate response data
dat$Biomass<-rnorm(100,modmat%*%coeff,1)
#the model
m<-lm(Biomass~Temp*Treatment,dat)
#model check
chk<-lm.test(m)

 

lm.test1

lm.test2

Can you find which one is the real one? I could not; here is the answer:

 

chk
$Position
[1] 4

Happy and safe modelling!

Using and interpreting different contrasts in linear models in R

When building a regression model with a categorical variable that has more than two levels (eg “Cold”, “Freezing”, “Warm”), R internally does some transformation to be able to compute the regression coefficients. R turns your categorical variable into a set of contrasts; the number of contrasts is the number of levels in the variable (3 in the example above) minus 1. Here I will present three ways to set the contrasts; depending on your research question and your variables, one may be more appropriate than the others.
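A quick way to see what these three codings look like is to print the contrast matrices themselves; these are base R functions, shown here for a 3-level factor:

# the three contrast matrices discussed below, for a 3-level factor
contr.treatment(3)  # treatment (dummy) coding, the default
contr.helmert(3)    # Helmert coding
contr.sum(3)        # sum (effect or deviation) coding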

 # setting seed so that numerical results stay stable
 set.seed(25)
 # let's imagine an experiment which measure plant biomass based on various
 # levels of nutrient added to the medium first case one treatments three
 # levels
 f <- gl(n = 3, k = 20, labels = c("control", "low", "high"))
 # with treatments contrasts (default)
 mat <- model.matrix(~f, data = data.frame(f = f))
 # this tell us which contrast has been used to generate the model matrix
 attr(mat, "contrasts")
 
## $f
 ## [1] "contr.treatment"
# simulate some ys
 beta <- c(12, 3, 6)  #these are the simulated regression coefficients
 y <- rnorm(n = 60, mean = mat %*% beta, sd = 2)
 m <- lm(y ~ f)
 summary(m)
## 
## Call:
## lm(formula = y ~ f)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.350 -1.611  0.161  1.162  5.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   11.498      0.463   24.84  < 2e-16 ***
## flow           3.037      0.655    4.64  2.1e-05 ***
## fhigh          6.163      0.655    9.41  3.3e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.07 on 57 degrees of freedom
## Multiple R-squared:  0.609, Adjusted R-squared:  0.595 
## F-statistic: 44.3 on 2 and 57 DF,  p-value: 2.45e-12

The first coefficient, the intercept, is the estimated average for the baseline group, which in our case is the control group (in the control group the biomass is estimated to be 11.49 on average). The second coefficient, “flow”, is the estimated difference between the average of the baseline group and the average of the “low” group (this group has an increase of 3.03 in biomass compared to the control group). Similarly the third coefficient is the difference between the average of the baseline and the “high” group (this group has an increase of 6.16 in biomass compared to the control group).
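As a quick sanity check, reusing the objects created above, the group means line up with these coefficients:

# observed group means
tapply(y, f, mean)
# means implied by the treatment-coded coefficients
c(coef(m)[1], coef(m)[1] + coef(m)[2], coef(m)[1] + coef(m)[3])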

 # plot the results
 plot(y ~ rep(1:3, each = 20), xaxt = "n", xlab = "Treatment")
 axis(1, at = c(1, 2, 3), labels = levels(f))
 points(c(1, 2, 3), c(coef(m)[1], coef(m)[2:3] + coef(m)[1]), pch = 16, cex = 2)
 

c1

This is the default; now let's turn to other contrast options:

# now with helmert contrasts
mat <- model.matrix(~f, data = data.frame(f = f), contrasts.arg = list(f = "contr.helmert"))
attr(mat, "contrasts")
## $f
## [1] "contr.helmert"
# simulate the ys
beta <- c(5, 3, 2)
y <- rnorm(n = 60, mean = mat %*% beta, sd = 1.5)
# model
m <- lm(y ~ f, contrasts = list(f = "contr.helmert"))  #there we tell the model to use helmert contrast to build the model
summary(m)
## 
## Call:
## lm(formula = y ~ f, contrasts = list(f = "contr.helmert"))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.281 -0.973 -0.202  0.752  2.986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.756      0.184    25.9   <2e-16 ***
## f1             3.116      0.225    13.9   <2e-16 ***
## f2             1.949      0.130    15.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.42 on 57 degrees of freedom
## Multiple R-squared:  0.88,   Adjusted R-squared:  0.876 
## F-statistic:  209 on 2 and 57 DF,  p-value: <2e-16

Now the meaning of the coefficients is different: the intercept is the average biomass over all the levels (across control, low and high). f1 compares the second level (low) to the first (control); with Helmert coding the coefficient is half the difference between the two group means, so plants in the low treatment have an increase in biomass of about 2 x 3.12 ≈ 6.2 compared to the control. f2 compares the third level (high) to the first two: the coefficient is one third of the difference between the average of the high treatment and the combined average of control and low, so the mean biomass in the high treatment is higher by about 3 x 1.95 ≈ 5.8 than that of the control and low treatments taken together. This type of contrast is a bit harder to interpret but is well suited for variables whose levels have a natural order, eg (“0”, “>0 and <5”, “>5 and <10” …), where we can gradually compare each level to the ones before it.
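A quick check of this interpretation, reusing the objects just created (the coefficient order is as in the summary above):

# observed group means
tapply(y, f, mean)
# means implied by the Helmert-coded coefficients
c(control = unname(coef(m)[1] - coef(m)[2] - coef(m)[3]),
  low = unname(coef(m)[1] + coef(m)[2] - coef(m)[3]),
  high = unname(coef(m)[1] + 2 * coef(m)[3]))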

As a note, in most applied cases we can set the contrasts that a model will use with contrasts(factor) <- 'some_contrasts'. Let's turn to the last type of contrasts, the effect (or deviation) coding, also called sum contrasts:

# now with sum contrasts, let's spice up a little bit and introduce an
# interaction with a continuous variables
x <- runif(60, -2, 2)
levels(f) <- c("North", "South", "Center")  #let's use levels which cannot easily be given an order or a baseline
mat <- model.matrix(~x * f, data = data.frame(f = f), contrasts.arg = list(f = "contr.sum"))
attr(mat, "contrasts")
 # simulate the ys
beta <- c(5, -2, 3, 2, 1, -4)
y <- rnorm(n = 60, mean = mat %*% beta, sd = 1.5)
# model
m <- lm(y ~ x * f, contrasts = list(f = "contr.sum"))
summary(m)
## 
## Call:
## lm(formula = y ~ x * f, contrasts = list(f = "contr.sum"))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.271 -1.021 -0.165  0.867  3.914 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.057      0.207   24.42  < 2e-16 ***
## x             -1.869      0.185  -10.13  4.4e-14 ***
## f1             2.699      0.295    9.16  1.4e-12 ***
## f2             2.351      0.293    8.03  8.8e-11 ***
## x:f1           0.954      0.264    3.61  0.00067 ***
## x:f2          -4.269      0.268  -15.92  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.58 on 54 degrees of freedom
## Multiple R-squared:  0.922,  Adjusted R-squared:  0.914 
## F-statistic:  127 on 5 and 54 DF,  p-value: <2e-16

This type of contrast is useful when there is no natural way to set a baseline or an ordering in the levels of the variable. I changed the names of the levels to illustrate what I mean: let's imagine that we had sampled our plants in three different localities; it is hard to decide what the baseline should be, and deviation coding is a nice way to model such data. The intercept in this context is the overall mean across the levels (as with the Helmert contrasts), so overall the plant biomass was 5.05. The second coefficient is the average slope between biomass and the x variable: if we increase x by one, plant biomass decreases by 1.87 across the geographical gradient. f1 is the difference between the mean in the North locality and the overall mean, and similarly f2 is the difference between the South locality and the overall mean. To get the estimated average value in the Center locality we have to do:

coef(m)[1] - coef(m)[3] - coef(m)[4]
## (Intercept)
## 0.00678

The interaction coefficients are the deviations of the slope within a group from the overall slope. Therefore in the North, if we increase x by 1 we decrease the biomass by 1.87-0.95= 0.92; similarly the slope in the South is -1.87+(-4.27)= -6.14 and in the Center: -1.87-0.95-(-4.27) = +1.45. Each of these coefficients comes with an assessment of the significance of the difference between its group and the overall mean. So far I could not find a way to assess the significance of the difference between the overall mean and the last group directly from this summary (a common workaround is to reorder the factor levels so that a different level is the one dropped, and refit the model). The group-specific slopes can be computed from the coefficients, as shown below.
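A minimal sketch of those group-specific slopes, computed directly from the coefficients of the fitted model:

# group-specific slopes from the sum-coded coefficients
c(North = unname(coef(m)["x"] + coef(m)["x:f1"]),
  South = unname(coef(m)["x"] + coef(m)["x:f2"]),
  Center = unname(coef(m)["x"] - coef(m)["x:f1"] - coef(m)["x:f2"]))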

Let’s do a nice figure of this:

 # a nice plot
 plot(y ~ x, xlab = "Temperature", ylab = "Biomass", pch = 16, col = rep(c("orange",
 "violetred", "coral"), each = 20))
 abline(a = coef(m)[1], b = coef(m)[2], lwd = 2, lty = 1, col = "blue")  # overall (average) intercept and slope
 abline(a = coef(m)[1] + coef(m)[3], b = coef(m)[2] + coef(m)[5], lwd = 2, lty = 2,
 col = "orange")
 abline(a = coef(m)[1] + coef(m)[4], b = coef(m)[2] + coef(m)[6], lwd = 2, lty = 2,
 col = "violetred")
 abline(a = coef(m)[1] - coef(m)[3] - coef(m)[4], b = coef(m)[2] - coef(m)[5] -
 coef(m)[6], lwd = 2, lty = 2, col = "coral")
 legend("topright", legend = c("Average", "North", "South", "Center"), col = c("blue",
 "orange", "violetred", "coral"), pch = 16, lty = c(1, 2, 2, 2), bty = "n")
 

c2

That's it for today. This page was very helpful in understanding the meaning of the various contrasts: http://www.unc.edu/courses/2006spring/ecol/145/001/docs/lectures/lecture26.htm, and the Wikipedia page is also rather good on this: https://en.wikipedia.org/wiki/Categorical_variable

 

Interpreting regression coefficients in R

Linear models are a very simple statistical technique and are often (if not always) a useful start for more complex analyses. It is however not so straightforward to understand what the regression coefficients mean, even in the simplest case when there are no interactions in the model. If we are not only fishing for stars (ie only interested in whether a coefficient is different from 0 or not), we can get much more information (to my mind) from these regression coefficients than from another widely used technique, ANOVA. Comparing the respective benefits and drawbacks of both approaches is beyond the scope of this post. Here I would like to explain what each regression coefficient means in a linear model and how we can improve their interpretability, following part of the discussion in the Schielzeth (2010) Methods in Ecology and Evolution paper.

Let's make a hypothetical example that will follow us through the post: say that we collected 10 grams of soil at 100 sampling sites, where half of the sites were fertilized with nitrogen and the other half were kept as controls. We also recorded measures of mean spring temperature and annual precipitation from neighbouring meteorological stations. We are interested in how temperature and precipitation affect the biomass of soil micro-organisms, and in the effect of nitrogen addition. To keep things simple we do not expect any interaction here.

# let's simulate the data the explanatory variables: temperature (x1),
# precipitation (x2) and the treatment (1=Control, 2= N addition)
set.seed(1)
x1 <- rnorm(100, 10, 2)
x2 <- rnorm(100, 100, 10)
x3 <- gl(n = 2, k = 50)
modmat <- model.matrix(~x1 + x2 + x3, data = data.frame(x1, x2, x3))
# vector of fixed effect
betas <- c(10, 2, 0.2, 3)
# generate data
y <- rnorm(n = 100, mean = modmat %*% betas, sd = 1)
# first model
m <- lm(y ~ x1 + x2 + x3)
summary(m)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8805 -0.4948  0.0359  0.7103  2.6669 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.4757     1.2522    8.37  4.8e-13 ***
## x1            2.0102     0.0586   34.33  < 2e-16 ***
## x2            0.1938     0.0111   17.52  < 2e-16 ***
## x32           3.1359     0.2109   14.87  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.05 on 96 degrees of freedom
## Multiple R-squared:  0.949,  Adjusted R-squared:  0.947 
## F-statistic:  596 on 3 and 96 DF,  p-value: <2e-16

Let's go through each coefficient. The intercept is the fitted biomass value when temperature and precipitation are both equal to 0 for the control units. In this context it is relatively meaningless, since a site with a precipitation of 0mm is unlikely to occur; we cannot therefore draw further interpretation from this coefficient. Then x1 means that, holding x2 (precipitation) constant, an increase of 1° in temperature leads to an increase of 2mg of soil biomass, irrespective of whether we are in the control or nutrient-added units. Similarly x2 means that, holding x1 (temperature) constant, a 1mm increase in precipitation leads to an increase of 0.19mg of soil biomass. Finally x32 is the difference between the control and the nutrient-added group when all the other variables are held constant: at a temperature of 10° and a precipitation of 100mm, adding nutrients to the soil changes the biomass from 10+2x10+0.19x100= 49mg to about 52mg.
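The same comparison can be obtained directly with predict(), reusing the model m fitted above; the values differ slightly from the hand calculation because rounded coefficient values were used there:

# fitted biomass for control vs N addition at a temperature of 10° and a precipitation of 100mm
predict(m, newdata = data.frame(x1 = 10, x2 = 100, x3 = factor(c(1, 2))))

Now let's make a figure of the effect of temperature on soil biomass: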

plot(y ~ x1, col = rep(c("red", "blue"), each = 50), pch = 16, xlab = "Temperature [°C]", 
    ylab = "Soil biomass [mg]")
abline(a = coef(m)[1], b = coef(m)[2], lty = 2, lwd = 2, col = "red")
Coeff1

What happened there? It seems as if our model is completely underestimating the y values … Well, what we drew is the estimated effect of temperature on soil biomass for the control group and for a precipitation of 0mm, which is not very interesting; instead we might be more interested in the effect at the average precipitation value:

plot(y ~ x1, col = rep(c("red", "blue"), each = 50), pch = 16, xlab = "Temperature [°C]", 
    ylab = "Soil biomass [mg]")
abline(a = coef(m)[1] + coef(m)[3] * mean(x2), b = coef(m)[2], lty = 2, lwd = 2, 
    col = "red")
abline(a = coef(m)[1] + coef(m)[4] + coef(m)[3] * mean(x2), b = coef(m)[2], 
    lty = 2, lwd = 2, col = "blue")
# averaging effect of the factor variable
abline(a = coef(m)[1] + mean(c(0, coef(m)[4])) + coef(m)[3] * mean(x2), b = coef(m)[2], 
    lty = 1, lwd = 2)
legend("topleft", legend = c("Control", "N addition"), col = c("red", "blue"), 
    pch = 16)
Coeff2

Now this looks better: the black line is the effect of temperature on soil biomass averaging out the effect of the treatment, which is useful if we are only interested in the temperature effect.

In this model the intercept did not make much sense. A way to remedy this is to center the explanatory variables, ie to subtract the mean value from each variable.

# now center the continuous variable to change interpretation of the
# intercept
data_center <- data.frame(x1 = x1 - mean(x1), x2 = x2 - mean(x2), x3 = x3)
modmat <- model.matrix(~x1 + x2 + x3, data = data.frame(x1 = x1, x2 = x2, x3 = x3))
data_center$y_center <- rnorm(n = 100, mean = modmat %*% betas, sd = 1)

# second model
m_center <- lm(y_center ~ x1 + x2 + x3, data_center)
summary(m_center)
## 
## Call:
## lm(formula = y_center ~ x1 + x2 + x3, data = data_center)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4700 -0.5525 -0.0287  0.6701  1.7920 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50.4627     0.1423   354.6   <2e-16 ***
## x1            1.9724     0.0561    35.2   <2e-16 ***
## x2            0.1946     0.0106    18.4   <2e-16 ***
## x32           2.8976     0.2020    14.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1 on 96 degrees of freedom
## Multiple R-squared:  0.951,  Adjusted R-squared:  0.949 
## F-statistic:  620 on 3 and 96 DF,  p-value: <2e-16

Now through this centering we know that under average temperature and precipitation conditions the soil biomass in the control plots is equal to 50.46mg; in the nitrogen-enriched plots we have around 53mg of soil biomass. The slopes do not change; we are just shifting where the intercept lies, making it directly interpretable. Let's do a plot:

plot(y_center ~ x2, data_center, col = rep(c("red", "blue"), each = 50), pch = 16, 
    xlab = "Precipitation [mm]", ylab = "Biomass [mg]")
abline(a = coef(m_center)[1], b = coef(m_center)[3], lty = 2, lwd = 2, col = "red")
abline(a = coef(m_center)[1] + coef(m_center)[4], b = coef(m_center)[3], lty = 2, 
    lwd = 2, col = "blue")
# averaging effect of the factor variable
abline(a = coef(m_center)[1] + mean(c(0, coef(m_center)[4])), b = coef(m_center)[3], 
    lty = 1, lwd = 2)
legend("bottomright", legend = c("Control", "N addition"), col = c("red", "blue"), 
    pch = 16)
Coeff3

We might also be interested in knowing which of temperature or precipitation has the biggest impact on soil biomass. From the raw slopes we cannot get this information, as variables with a small standard deviation will tend to have bigger regression coefficients than variables with a large standard deviation, even for the same underlying effect. One solution is to derive standardized slopes, expressed in standard-deviation units (SD of y per SD of x) and therefore directly comparable in strength between continuous variables:

# now if we want to find out which of the two continuous variables as the
# most importance on y we can compute the standardized slopes from the
# unstandardized ones:
std_slope <- function(model, variable) {
    # standardized slope: multiply the raw slope by sd(x)/sd(y), both taken from the model frame
    return(coef(model)[variable] * (sd(model$model[[variable]])/sd(model$model[[1]])))
}

std_slope(m, "x1")
##     x1 
## 0.7912
std_slope(m, "x2")
##     x2 
## 0.4067

From this we can conclude that temperature has a bigger impact on soil biomass than precipitation. If we wanted to compare the continuous variables with the binary variable, we could standardize our variables by dividing by twice their standard deviation, following Gelman (2008) Statistics in Medicine.
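A minimal sketch of that idea, reusing the objects from above (here I simply center the two continuous variables and divide them by twice their standard deviation before refitting; the binary treatment is left as is):

# standardize the continuous predictors following Gelman (2008)
x1_s <- (x1 - mean(x1))/(2 * sd(x1))
x2_s <- (x2 - mean(x2))/(2 * sd(x2))
m_std <- lm(y ~ x1_s + x2_s + x3)
coef(m_std)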

Here we saw in a simple linear context how to derive quite a lot of information from our estimated regression coefficients; this understanding can then be applied to more complex models like GLM or GLMM. These models offer us much more information than just a binary significant/non-significant categorization. Happy coding.

Checking (G)LM model assumptions in R

(Generalized) Linear models make some strong assumptions concerning the data structure:

  1. Independence of the data points
  2. Correct distribution of the residuals
  3. Correct specification of the variance structure
  4. Linear relationship between the response and the linear predictor

For a simple lm, assumptions 2-4 mean that the residuals should be normally distributed, the variance should be homogeneous across the fitted values of the model and for each predictor separately, and the y's should be linearly related to the predictors. In R, checking these assumptions from an lm or glm object is fairly easy:

# testing model assumptions some simulated data
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rnorm(100, 0, 1)
m <- lm(y ~ x)
par(mfrow = c(2, 2))
plot(m)

Check_01

The top-left and top-right graphs are the most important ones. The top-left graph checks for the homogeneity of the variance and the linearity of the relation: if you see no pattern in this graph (ie if it looks like stars in the sky), then these assumptions are met. The second graph checks for the normal distribution of the residuals: the points should fall on a line. The bottom-left graph is similar to the top-left one, but the y-axis is changed: this time the residuals are square-root standardized (?rstandard), making it easier to see heterogeneity of the variance. The fourth one allows detecting points that have too big an impact on the regression coefficients and that should be removed. These graphs from simulated data are extremely nice; in applied statistics you will rarely see such nice graphs. Now many people new to linear modelling, used to strict black-and-white p-value decisions, are a bit lost, not knowing when their model is fine and when it should be rejected. Below is an example of a model that is clearly wrong:

# some wrong model
y <- 1 + 2 * x + 1 * x^2 - 0.5 * x^3
m <- lm(y ~ x)
par(mfrow = c(2, 2))
plot(m)

Check_02

These two examples are easy; real life is not. Real-life models are sometimes hard to assess. The bottom line is that you should always check your model assumptions and be truthful: reporting and interpreting models that do not meet their assumptions is bad science and close to falsification of the results. Now let's see a real-life example where it is tricky to decide whether the model meets the assumptions or not. The dataset is in the ggplot2 package; look at ?mpg for a description:

 

# a real life dataset
library(ggplot2)
head(mpg)
##   manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact
m <- lm(cty ~ displ + factor(cyl), mpg)
par(mfrow = c(2, 2))
plot(m)

Check_03

The residuals-vs-fitted graph looks rather OK to me; there is somewhat higher variance for high fitted values, but this does not look too bad. However the qqplot (checking the normality of the residuals) looks pretty awful, with the residuals on the right consistently moving further away from the theoretical line. A nice way to see whether the patterns are different from those expected under the model assumptions is to derive new response values from the fitted coefficients and the residual variance; you can then draw 8 new plots and randomly allocate the real plot to a position. If you are able to find the real plot and its pattern is different from the others, then the model does not meet its assumptions:

# randomizing to see if the patterns are different from expected
modmat <- model.matrix(~displ + factor(cyl), data = mpg)
mus <- modmat %*% coef(m)
set.seed(1246)
# the position of the real plot in a 3x3 panel
s <- sample(1:9, size = 1)
par(mfrow = c(3, 3))
for (i in 1:9) {
    if (i == s) {
        # the real plot
        qqnorm(resid(m))
        qqline(resid(m))
    } else {
        # draw new y values from the fitted values with the residuals standard
        # deviation
        y <- rnorm(dim(mpg)[1], mus, sd(resid(m)))
        y <- y - fitted(m)
        qqnorm(y)
        qqline(y)
    }

}

Check_04

Are you able to find in which panel the real plot is? I can: it is in the second row, third column. The other qqplots do not look that different from the real one; there are however a few points that are definitely further away from what we expect under the model assumptions. A next step would be to look at these points and understand where these discrepancies might come from (measurement error, special cases …). We can also derive such plots for checking the first graph, as sketched below.
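Here is one possible sketch of the same randomization approach applied to the residuals-vs-fitted plot (reusing m, mus and mpg from above):

# randomizing the residuals vs fitted plot
s2 <- sample(1:9, size = 1)
par(mfrow = c(3, 3))
for (i in 1:9) {
    if (i == s2) {
        # the real plot
        plot(fitted(m), resid(m), xlab = "Fitted", ylab = "Residuals")
        abline(h = 0, lty = 2)
    } else {
        # simulate new responses from the fitted means and the residual
        # standard deviation, then refit the model and plot its residuals
        y_new <- rnorm(dim(mpg)[1], mus, sd(resid(m)))
        m_new <- lm(y_new ~ displ + factor(cyl), mpg)
        plot(fitted(m_new), resid(m_new), xlab = "Fitted", ylab = "Residuals")
        abline(h = 0, lty = 2)
    }
}
s2  # position of the real plot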

Resources on model checking: