Skip to content

A function to help graphical model checks of lm and ANOVA

As always a more colourful version of this post is available on rpubs.

Even if LM are very simple models at the basis of many more complex ones, LM still have some assumptions that if not met would render any interpretation from the models plainly wrong. In my field of research most people were taught about checking ANOVA assumptions using tests like Levene & co. This is however not the best way to check if my model meet its assumptions as p-values depend on the sample size, with small sample size we will almost never reject the null hypothesis while with big sample even small deviation will lead to significant p-values (discussion). As ANOVA and linear models are two different ways to look at the same model (explanation) we can check ANOVA assumptions using graphical check from a linear model. In R this is easily done using plot(model), but people often ask me what amount of deviation makes me reject a model. One easy way to see if the model checking graphs are off the charts is to simulate data from the model, fit the model to these newly simulated data and compare the graphical checks from the simulated data with the real data. If you cannot differentiate between the simulated and the real data then your model is fine, if you can then try again!

EDIT: You can also make the conclusion form your visual inspection more rigorous following the protocols outlined in this article. (Thanks for Tim comment)

Below is a little function that implement this idea:


  #the model frame
  #the model matrix
  #the standard deviation of the residuals
  #sample size
  #get the right-hand side of the formula  
  #rhs<-all.vars(update(f, 0~.))
  #simulate 8 response vectors from model
  ys<-lapply(1:8,function(x) rnorm(n,modmat%*%coef(m),sd.resid))
  #refit the models
  ms<-llply(ys,function(y) lm(y~modmat[,-1]))
  #put the residuals and fitted values in a list
  df<-llply(ms,function(x) data.frame(Fitted=fitted(x),Resid=resid(x)))
  #select a random number from 2 to 8
  #put the original data into the list





This function print the two basic plots: one looking at the spread of the residuals around the fitted values, the other one look at the normality of the residuals. The function return the position of the real model in the 3×3 window, counting from left to right and from top to bottom (ie position 1 is upper left graph).

Let’s try the function:

#a simulated data frame of independent variables
dat<-data.frame(Temp=runif(100,0,20),Treatment=gl(n = 5,k = 20))
#the model matrix
#the coefficient
#simulate response data
#the model
#model check




Can you find which one is the real one? I could not, here is the answer:


[1] 4

Happy and safe modelling!

Generating ANOVA-like table from GLMM using parametric bootstrap


This article may also be found on RPubs:

In the list of worst to best way to test for effect in GLMM the list on state that parametric bootstrapping is among the best options. PBmodcomp in the pbkrtest package implement such parametric bootstrapping by comparing a full model to a null one. The function simulate data (the response vector) from the null model then fit these data to the null and full model and derive a likelihood ratio test for each of the simulated data. Then we can compare the observed likelihood ratio test to the null distribution generated from the many simulation and derive a p-value. The advantage of using such a method over the classical p-values derived from a chi-square test on the likelihood ratio test is that in the parametric bootstrap we do not assume any null distribution (like chi-square) but instead derive our own null distribution from the model and the data at hand. We do not make the assumption then that the likelihood ratio test statistic is chi-square distributed. I have made a little function that wraps around the PBmodcomp function to compute bootstrapped p-values for each term in a model by sequentially adding them. This lead to anova-like table that are typically obtained when one use the command anova on a glm object.

#the libraries used
#the function
  #generate a list of reduced model formula
  fs[[1]]<-as.formula(paste(resp,"~ 1 +",rand))
    for(i in 1:nb_terms){

  #fit the reduced model to the data
    m_fit<-lapply(fs,function(x) lmer(x,data,REML=FALSE))
  } else if(fam=="binomial"){
    m_fit<-lapply(fs,function(x) glmer(x,data,family=fam,weights=w))
  }  else if(fam=="Negative Binomial"){
 m_fit<-lapply(fs,function(x) glmer.nb(x,data))
 }  else if(fam=="poisson"){
    m_fit<-lapply(fs,function(x) glmer(x,data,family=fam))
  #compare nested model with one another and get LRT values (ie increase in the likelihood of the models as parameters are added)
  for(i in 1:(length(m_fit)-1)){
    #here are reported the bootstrapped p-values, ie not assuming any parametric distribution like chi-square to the LRT values generated under the null model
    #these p-values represent the number of time the simulated LRT value (under null model) are larger than the observe one
    print(paste("Variable ",term_added," tested",sep=""))
  print(paste("Seed set to:",seed))

You pass your GLMM model to the function together with the random part as character (see example below), if you fitted a binomial GLMM you also need to provide the weights as a vector, you can then set a seed and the last argument is the number of simulation to do, it is set by default to 50 for rapid checking purpose but if you want to report these results larger values (ie 1000, 10000) should be used.

Let’s look at a simple LMM example:

## Linear mixed model fit by REML ['lmerMod']
## Formula: TICKS ~ cHEIGHT + YEAR + (1 | BROOD)
##    Data: grouseticks
## REML criterion at convergence: 2755
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.406 -0.246 -0.036  0.146  5.807 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  BROOD    (Intercept) 87.3     9.34    
##  Residual             28.1     5.30    
## Number of obs: 403, groups:  BROOD, 118
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   5.4947     1.6238    3.38
## cHEIGHT      -0.1045     0.0264   -3.95
## YEAR96        4.1910     2.2424    1.87
## YEAR97       -4.3304     2.2708   -1.91
## Correlation of Fixed Effects:
##         (Intr) cHEIGH YEAR96
## cHEIGHT -0.091              
## YEAR96  -0.726  0.088       
## YEAR97  -0.714  0.052  0.518
## [1] "Variable cHEIGHT tested"
## [1] "Variable YEAR tested"
## [1] "Seed set to: 63"
##      term   LRT p_value
## 1 cHEIGHT 14.55 0.01961
## 2    YEAR 14.40 0.01961

The resulting table show for each term in the model the likelihood ratio test, which is basically the decrease in deviance when going from the null to the full model and the p value, you may look at the PBtest line in the details of ?PBmodcomp to see how it is computed.

Now let’s see how to use the function with binomial GLMM:

#simulate some binomial data
group<-gl(n = 20,k = 5)
y<-rbinom(n = 100,size = 10,prob = invlogit(p))
#fit a model
m<-glmer(prop~x1+x2+(1|group),family="binomial",weights = rep(10,100))
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: prop ~ x1 + x2 + (1 | group)
## Weights: rep(10, 100)
##      AIC      BIC   logLik deviance df.resid 
##    288.6    299.1   -140.3    280.6       96 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -2.334 -0.503  0.181  0.580  2.466 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  group  (Intercept) 1.38     1.18    
## Number of obs: 100, groups:  group, 20
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.748      0.287    2.61   0.0092 ** 
## x1             0.524      0.104    5.02  5.3e-07 ***
## x2            -2.083      0.143  -14.56  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Correlation of Fixed Effects:
##    (Intr) x1    
## x1  0.090       
## x2 -0.205 -0.345
#takes some time
anova_merMod(m,rand = "(1|group)",w = rep(10,100))
## [1] "Variable x1 tested"
## [1] "Variable x2 tested"
## [1] "Seed set to: 98"
##   term      LRT p_value
## 1   x1   0.0429 0.80392
## 2   x2 502.0921 0.01961

For binomial model, the model must be fitted with proportion data and a vector of weights (ie the number of binomial trial) must be passed to the ‘w’ argument. Some warning message may pop up at the end of the function, this comes from convergence failure in PBmodcomp, this do not affect the results, you may read the article from the pbkrtest package: to understand better where this comes from.

Happy modeling and as Ben Bolker say: “When all else fails, don’t forget to keep p-values in perspective:

Evenness effect on ecosystem functioning


Most biodiversity ecosystem function (BEF) experiment focus on one aspect of diversity: species richness, and had to it some functional composition flavour. The treatment manipulate the number of species and the number of broad functional group occurring. The issue there is that diversity is not only species richness, the classical definition of diversity tells us that it is the combination between richness and evenness that makes diversity (Tuomitso 2012). Diversity is therefore the number of biological unit present at a given location at a given time taking into account how are the individuals distributed amongst the species.

Some definition:

A community with 10 species where 99% of individuals belong to one species is thought to be less diverse than a community with 10 species and equal repartition of the individuals between the species. The difference in the distribution of the total abundance between the different species in a community is called evenness. A community where all species have the same abundance will get the highest possible evenness value while a community where most of the individual belong to one species will have the lowest evenness value. There is a whole bunch of index that can be used to represent these three different components: species richness, evenness and diversity, people interested by knowing how to measure these should read Magurran and McGill 2011.

Here in this post I will talk about the ways by which a community of changing evenness will affect its ecosystem function.

The early literature on BEF already highlighted the potential impacts of changing evenness on the function of the ecosystem (Chapin et al 2000), as changes in the evenness will impact the distribution of the functional traits values (see figure 1 in Hillebrand et al 2008).

A community evenness will shift much earlier than its species richness, before a species is loss its abundance is likely to decline gradually leading to variation in evenness if this decline is not uniformly affecting all species in the system (ie if all species loose 5 individuals between time 1 and time 2, evenness will be unchanged). Tracking community evenness gives us some kind of early warning signals of what is going on in the community.

We can make three different scenarios on how changes in evenness will affect ecosystem function:

– if the function comes from synergies between the species in the community (ie complementarity, mutualism …) a reduction in evenness would lead to decline in the function rates as the beneficial supports between the species declines.

– if the function rates is dominated by the output of one species (ie one dominant species) and this species performance for the focal function is below the average, then a decrease in evenness due to the increased dominance of this species will lead to decreasing function rates

– if the function rates is dominated by the output of one species performing above average (a superproductive species), then the decrease in evenness caused by the increased dominance of this species will lead to an increase in the function.

Empirical results of the relation of evenness and ecosystem function are still rather rare, Wilsey and Potvin (2000) found in a grassland system that in plots with higher evenness there were higher variation in the light-capturing strategies used by the plant and so higher biomass production.

Evenness and dominance are two closely linked concept that often correlates between one another, Hillebrand et al (2008) reviewed some of the literature investigating links between evenness or dominance and ecosystem function, all kind of relation have already been found ie negative, positive or no relation.

So evenness effect on ecosystem function may take different form and will be sensible to spatial and time scale under study but knowledge on these relations will help us predict how the ecosystem functions will respond to the continuous shifts in the abundance of the species under anthropogenic pressures.


Chapin III, F. Stuart, et al. “Consequences of changing biodiversity.” Nature 405.6783 (2000): 234-242.

Hillebrand, Helmut, Danuta M. Bennett, and Marc W. Cadotte. “Consequences of dominance: a review of evenness effects on local and regional ecosystem processes.” Ecology 89.6 (2008): 1510-1520.

Magurran, Anne E., and Brian J. McGill, eds. Biological diversity: frontiers in measurement and assessment. Vol. 12. Oxford: Oxford University Press, 2011.

Tuomisto, Hanna. “An updated consumer’s guide to evenness and related indices.” Oikos 121.8 (2012): 1203-1218.

Wilsey, Brian J., and Catherine Potvin. “Biodiversity and ecosystem functioning: importance of species evenness in an old field.” Ecology 81.4 (2000): 887-892.

Using and interpreting different contrasts in linear models in R

When building a regression model with categorical variables with more than two levels (ie “Cold”, “Freezing”, “Warm”) R is doing internally some transformation to be able to compute regression coefficient. What R is doing is that it is turning your categorical variables into a set of contrasts, this number of contrasts is the number of level in the variable (3 in the example above) minus 1. Here I will present three ways to set the contrasts and depending on your research question and your variables one might be more appropriate than the others.

 # setting seed so that numerical results stay stable
 # let's imagine an experiment which measure plant biomass based on various
 # levels of nutrient added to the medium first case one treatments three
 # levels
 f <- gl(n = 3, k = 20, labels = c("control", "low", "high"))
 # with treatments contrasts (default)
 mat <- model.matrix(~f, data = data.frame(f = f))
 # this tell us which contrast has been used to generate the model matrix
 attr(mat, "contrasts")
## $f
 ## [1] "contr.treatment"
# simulate some ys
 beta <- c(12, 3, 6)  #these are the simulated regression coefficient
 y <- rnorm(n = 60, mean = mat %*% beta, sd = 2)
 m <- lm(y ~ f)
## Call:
## lm(formula = y ~ f)
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.350 -1.611  0.161  1.162  5.201 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
 ## (Intercept) 11.498 0.463 24.84 < 2e-16 ***
 ## flow        3.037 0.655 4.64 2.1e-05 ***
 ## fhigh       6.163 0.655 9.41 3.3e-13 ***
 ## ---
 ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 ## Residual standard error: 2.07 on 57 degrees of freedom
 ## Multiple R-squared: 0.609, Adjusted R-squared: 0.595
 ## F-statistic: 44.3 on 2 and 57 DF, p-value: 2.45e-12

The first coefficient, the intercept, is the estimated average for the baseline group, which in our case is the control group (in the control group the biomass is estimated to be on average 11.49). The second coefficient “flow” is the estimated difference between the average in the baseline group and the average in the “low” group (this group has an increase in 3.03 biomass compared to the control group). Similarly the third coefficient is the difference between the average in the baseline and the “high” group (this group has an increase of 6.16 biomass compared to the control group).

 # plot the results
 plot(y ~ rep(1:3, each = 20), xaxt = "n", xlab = "Treatment")
 axis(1, at = c(1, 2, 3), labels = levels(f))
 points(c(1, 2, 3), c(coef(m)[1], coef(m)[2:3] + coef(m)[1]), pch = 16, cex = 2)


This is by default, now let’s turn to other contrasts options:

  # now with helmert contrasts
 mat  attr(mat, "contrasts")
 ## $f
 ## [1] "contr.helmert"
# simulate the ys
beta <- c(5, 3, 2)
y <- rnorm(n = 60, mean = mat %*% beta, sd = 1.5)
# model
m <- lm(y ~ f, contrasts = list(f = "contr.helmert"))  #there we tell the model to use helmert contrast to build the model
## Call:
## lm(formula = y ~ f, contrasts = list(f = "contr.helmert"))
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.281 -0.973 -0.202  0.752  2.986 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.756      0.184    25.9   <2e-16 ***
## f1             3.116      0.225    13.9   <2e-16 ***
## f2             1.949      0.130    15.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.42 on 57 degrees of freedom
## Multiple R-squared:  0.88,   Adjusted R-squared:  0.876 
## F-statistic:  209 on 2 and 57 DF,  p-value: <2e-16

Now the meaning of the various coefficient is different, the intercept is the average biomass over all the levels (across control, low and high), f1 is the difference between the average of the first level (control) and the average of the second one (low), plants in the treatment low have a 3.11 increase in biomass. f2 is the difference between the average of control and low and the average of high treatment. To put this differently, if we put together the data from the control and low treatment and compare there average value to the average value of plants in the high treatment we would get fhigh. Mean biomass of the plants in the high treatment is higher by 1.95 than plants of the control and low treatment. This type of contrast is a bit harder to interpret but is well suited for variables where the levels have an order, ie (“0”,“>0 and <5″,”>5 and <10” …), there we can gradually compare the successive levels.

As a note, in most applied cases we can set the contrasts that will be used by the model using contrasts(factor)<-‘some_contrasts’. Let’s turn to the last contrasts the effect (or deviation) coding:

# now with sum contrasts, let's spice up a little bit and introduce an
# interaction with a continuous variables
x <- runif(60, -2, 2)
levels(f) <- c("North", "South", "Center")  #let's make different level which cannot easily be given an order or a baseline
mat <- model.matrix(~x * f, data = data.frame(f = f), contrasts.arg = list(f = "contr.sum"))
attr(mat, "contrasts")
 # simulate the ys
beta <- c(5, -2, 3, 2, 1, -4)
y <- rnorm(n = 60, mean = mat %*% beta, sd = 1.5)
# model
m <- lm(y ~ x * f, contrasts = list(f = "contr.sum"))
## Call:
## lm(formula = y ~ x * f, contrasts = list(f = "contr.sum"))
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.271 -1.021 -0.165  0.867  3.914 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.057      0.207   24.42  < 2e-16 ***
## x             -1.869      0.185  -10.13  4.4e-14 ***
## f1             2.699      0.295    9.16  1.4e-12 ***
## f2             2.351      0.293    8.03  8.8e-11 ***
## x:f1           0.954      0.264    3.61  0.00067 ***
## x:f2          -4.269      0.268  -15.92  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.58 on 54 degrees of freedom
## Multiple R-squared:  0.922,  Adjusted R-squared:  0.914 
## F-statistic:  127 on 5 and 54 DF,  p-value: <2e-16

This type of contrasts is useful when there is no natural way to set a baseline or an ordering in the different levels of the variables. I changed the name of the level to illustrate what I mean by this, let’s imagine in this context that we had sampled our plant in three different localities, it is hard to determine in this context what should be the baseline, the deviation coding is a nice way to model these type of data. The intercept in this context is the overall mean across the levels (as in the helmert contrasts), so overall the plant biomass was 5.05. The second one the the average slope between the biomass and the x variable, if we increase x by one the plant biomass decrease by -1.87 across the geographical gradient. f1 is the difference between the overall mean and the mean in the north locality, similarly f2 is the difference between the overall mean and the south locality. To get the estimated average value at the center locality we have to do:

coef(m)[1] - coef(m)[3] - coef(m)[4]
## (Intercept)
## 0.00678

The interaction coefficient are the deviation of the slope within a group from the overall slope, therefore in the north if we increase x by 1, we decrease the biomass by 1.87-0.95= 0.92, similarly the slope in the south is -1.87+(-4.27)= -6.14 and in the center: -1.87-0.95-(-4.27) = +1.45. Around each of these coefficient we have some assessement of the significance of the difference between the overall mean and the various groups. So far I could not find a way to assess the significance of the difference between the overall mean and the last group …

Let’s do a nice figure of this:

 # a nice plot
 plot(y ~ x, xlab = "Temperature", ylab = "Biomass", pch = 16, col = rep(c("orange",
 "violetred", "coral"), each = 20))
 abline(a = 4.55, b = -2, lwd = 2, lty = 1, col = "blue")
 abline(a = coef(m)[1] + coef(m)[3], b = coef(m)[2] + coef(m)[5], lwd = 2, lty = 2,
 col = "orange")
 abline(a = coef(m)[1] + coef(m)[4], b = coef(m)[2] + coef(m)[6], lwd = 2, lty = 2,
 col = "violetred")
 abline(a = coef(m)[1] - coef(m)[3] - coef(m)[4], b = coef(m)[2] - coef(m)[5] -
 coef(m)[6], lwd = 2, lty = 2, col = "coral")
 legend("topright", legend = c("Average", "North", "South", "Center"), col = c("blue",
 "orange", "violetred", "coral"), pch = 16, lty = c(1, 2, 2, 2), bty = "n")


Thats’s it for today, this page was greatly helpful in understanding the meaning of the various contrasts:, the wikipedia page is also rather nice on this:


What roles should ecologist have in the society?


This post is a reflection on the roles that ecologist should have in the society, it was sparked by some discussion I had at the BES/SFE meeting in december. This is one of my first try at blogging essay-like away from my nice beloved R-code.

First of I’d like to start with a definition of ecologist, a concept that might be not so clear in people’s mind. At least in French (and maybe in other languages) the word to name someone who is doing ecology as a science and the word for political/militanting ecology is the same (écologistes). I would like to stress here the difference between the two: the first one represent people doing ecology as a science, so trying to answer questions using repeatable protocols and interpreting the results stemming from it in the light of previous works. The second one represent peoples fighting for some values that are dear to them, and trying to implement in the political agenda some policies that tend to reach these goals. Throughout this post the word ecologist will refer to the first meaning of the word (scientist).

In the recent years there have been more and more calls for ecologist to do policy-relevant science. Also the creation of the IPBES is a formidable opportunity for ecologist to engage in assessment that will hopefully guide policy-making at various levels (international to local). What are the different roles that an ecologist could take at this interface between science and policy but also science and the society?

The first role which most ecologist feel confident about is being an expert, your knowledge in a certain area is acknowledged by your peers and when policy-maker need some information on particular topic (ie a government would like to protect a certain bird species) they call you to provide them with some knowledge. In this context the ecologist give some information on the possible outcomes from the various alternatives on the table and then go back to his/her research. Under this scheme the ecologist do not rank the various alternatives, just outline their likely consequences and benefits, it is then for the policy-maker and other members of the civil society present at the discussion to decide what would be the best course of actions based on this newly-acquired knowledge and some sets of values.

First note aside about using words: some words carry values and should be avoided by scientist talking to society/policy-maker as an expert (ie poor, good …), values comes from a certain perspectives and if the perspective is clearly defined (ie we want to maintain 60% of bird species in this area at any costs) then using value-laden words maybe ok (ie removing wind farms will be beneficial for the policy). However in a general context it can be misleading to use such words (ie biodiversity is good) because they call for a desired outcome that may or may not be explicitly stated. In a broader context the use of neutral words like increase or vary is to be preferred.

The second role goes one step further and do not only outline the likely consequences of various policy alternatives but also state which of these alternatives is the best, this is called advocating. Here there come some serious issues that what one person defines as best alternatives do not depend on scientific knowledge but on personal, normative values (forest are inherently better than urban habitat for biodiversity). This role can make sense because ecologist working on a certain system for 20-30 years have developed a deep understanding of this system, why should they not state what they feel is the best? Even more provocative, we could say that it is some kind of duty for ecologist to state what is best are they are the one knowing better and if they remain silent suboptimal options could be taken up as other voices will make stronger case: “It would be irresponsible not to share views based on deep experience with the public and decision makers” Mooney and Ehrlich see litterature below.

Second note aside on scientist as perfectly neutral person: the idea that scientist in general are neutral and able to give information without our opinions transpiring into it is most certainly untrue. We are human beings with constraints affecting how we work and how we communicate and interpret our results. More importantly we all have values and must be able as every citizen to fight for them. As a note within a note this is actually the nice thing about turning ecology from a qualitative science to a quantitative one, numbers will tend not to reflect ones opinions and feelings (even if they maybe very important in understanding your system/data).

My point of view is to be both, expert and advocate but in different settings and making clear what role we are endorsing when making a statement. Engage in panel discussion with policy-maker as an expert and provide information on various alternatives, but also engage with the society and when advocating for particular outcome clearly state it and acknowledge that you are no longer an expert but a advocate, your values are not inherently better than others. The trick here would be to still be invited to participate to the discussion with the policy-maker, I do not know how to solve this …

Finding the right balance between neutral expert and engaged citizen will be one of the challenge of the future ecologists, a quote to finish: “Why should I study science if I can’t personally influence the direction of events?” Students asking Roughgarden a tricky question.

Some food for thoughts:

Summary and thoughts of the BES/SFE meeting


This week was the British Ecological Society/ Société Francaise d’Écologie meeting in Lille, and as I am sitting on the train on the way home I’d like to give some thoughts and a summary of the various bits and pieces that I got throughout the week.

First I think it important to emphasize that this was the first meeting of its kind, french and english ecologists decided to unit their strength and make a common annual meeting. This gave a collaborative flavor through the whole meeting and as the last plenary speaker said: ”it ensured to have a combination of both good music and great food”. Over the week around 1200 ecologists met and talked.

So now here are a couple of unsorted important informations that sticks out of my notes:

– Ontogenetic habitat shifts (an insect larvae going from aquatic environment to the land once adult) stabilize higher trophic levels in the modeled systems of Guill (Article)

– Interaction network modularity is a strong predictor of functional group diversity in Montoya system (website)

– Indirect mutualism (when two different predator reduce the population size of competing prey allowing them to survive in the environment) was found in the aphids-parasitoid system of Sanders (website)

– The british have impressive citizen monitoring schemes with the Bird Breeding survey with data being collected for 20 years and the Butterfly Monitoring survey with data for almost 40 years, these dataset will be extremely useful in understanding how anthropogenic pressure will act in affecting species ranges and communities shifts.

Pedro Jordano gave a great plenary talk on the interactome, the concept of the interaction network at different scales. As population sizes varies, interaction will disappear before a species goes locally extinct. These lost interaction will impact the ecosystem functions. He and his team uses cool genomic techniques to track which animal species moved seeds from which mother tree, giving them a full mapping of the seed dispersal functions at small and landscape scales.

– Ecologist uses a variety of data collection approaches from simulations to surveys with field or lab experiments somewhere in between. From simulations to field surveys we have a decreasing gradient of replicability but an increasing gradient of realism. I really feel that our generation of ecologist will have to make use of this vast array of techniques to go back and forth from theory to application of the knowledge generated. These ideas were touched by the symposium on bridging local-scale process to global-scale patterns

– There was then a fascinating workshop on ecosystem assessment tools organized by Aletta Bonn with three speakers to give the tone of the session: Sandra Lavorel, a scientist advisor at the european commission and the chief scientist coordinator of the UK National Ecosystem Assessment. The return on experience of the UK coordinator was great to hear, how the compartmentalized process of the assessment hindered some of the outcomes. We then discussed of how we ecologist felt about the science-policy interface in biodiversity. A few comments were exchanged on nature valuation that has been shifting in the recent years from purely economically based, and then spoke most of the remaining time about uncertainty. The issue is that policy-maker expect us to throw precise numbers on various scenarios while we understand (yet) little on the processes structuring and (dis)assembling ecological communities and their relation to ecosystem function. A fundamental issue there is that our uncertainty do not only come from our lack of knowledge but also from the inherent stochastic processes affecting natural systems. There is and will always will be some part of the system that we will not be able to predict. There it is very interesting to look at the vocabulary developed by the IPCC in their later report (from high to low confidence) and rooting their main conclusion in probability seems the way to go in biodiversity as well.

– So far we think that biotic interactions are important only at the local scale, but what if these effects also impact communities at bigger scales? Kevin Cazelles is trying to develop models scaling biotic interactions to larger scales and their effect on species coexistence.

Tim Benton gave a great but slightly depressing talk in the symposium on the EU Common Agricultural Policy. His main message was that policy-makers are only interested by stuff that would destroy jobs in the next week to the next year, longer-term trends or catastrophic shifts at farther away horizon are not in their agenda. He advocated for ecologist going out there and talking to the industry or the society because when the demand his shifting (towards organic product, fair trade …), the industries will align themselves to keep their profit and the policy-maker will then follow and make these shifts easier.

Stephan Peishl nicely showed in his simulation that individuals at the moving-edge of a species distribution will accumulate deleterious mutations due to lower population sizes. If we find similar results in nature this would have major consequences for species conservation.

All in all this was a great conference, gave me some nice ideas of things to test on my dataset, re-enforced my vision that linking networks to ecosystem function will be a key topic in the coming years, but made me think even more about what role I should/would like to endorse, I will talk more on this in a later post.

Now you might wonder what was my (little) contribution to this conference, I basically talked about the direct impact of plant species richness and the big influence of some plant functional groups on the diversity of our arthropod communities. I received quite a few encouragements of various people finding the story pretty nice and convincing, I got a couple of feedbacks that I will try to check in the coming weeks.

I am pretty tired of this intensive week but am really looking forward for the next one!

Interpreting regression coefficient in R

Linear models are a very simple statistical techniques and is often (if not always) a useful start for more complex analysis. It is however not so straightforward to understand what the regression coefficient means even in the most simple case when there are no interactions in the model. If we are not only fishing for stars (ie only interested if a coefficient is different for 0 or not) we can get much more information (to my mind) from these regression coefficient than from another widely used technique which is ANOVA. Comparing the respective benefit and drawbacks of both approaches is beyond the scope of this post. Here I would like to explain what each regression coefficient means in a linear model and how we can improve their interpretability following part of the discussion in Schielzeth (2010) Methods in Ecology and Evolution paper.

Let’s make an hypothetical example that will follow us through the post, say that we collected 10 grams of soils at 100 sampling sites, where half of the site were fertilized with Nitrogen and the other half was kept as control. We also used recorded measure of mean spring temperature and annual precipitation from neighboring meteorological stations. We are interested to know how temperature and precipitation affect the biomass of soil micro-organisms, and to look at the effect of nitrogen addition. To keep things simple we do not expect any interaction here.

# let's simulate the data the explanatory variables: temperature (x1),
# precipitation (x2) and the treatment (1=Control, 2= N addition)
x1 <- rnorm(100, 10, 2)
x2 <- rnorm(100, 100, 10)
x3 <- gl(n = 2, k = 50)
modmat <- model.matrix(~x1 + x2 + x3, data = data.frame(x1, x2, x3))
# vector of fixed effect
betas <- c(10, 2, 0.2, 3)
# generate data
y <- rnorm(n = 100, mean = modmat %*% betas, sd = 1)
# first model
m <- lm(y ~ x1 + x2 + x3)
## Call:
## lm(formula = y ~ x1 + x2 + x3)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8805 -0.4948  0.0359  0.7103  2.6669 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.4757     1.2522    8.37  4.8e-13 ***
## x1            2.0102     0.0586   34.33  < 2e-16 ***
## x2            0.1938     0.0111   17.52  < 2e-16 ***
## x32           3.1359     0.2109   14.87  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.05 on 96 degrees of freedom
## Multiple R-squared:  0.949,  Adjusted R-squared:  0.947 
## F-statistic:  596 on 3 and 96 DF,  p-value: <2e-16

Let’s go through each coefficient: the intercept is the fitted biomass value when temperature and precipitation are both equal to 0 for the Control units. In this context it is relatively meaningless since a site with a precipitation of 0mm is unlikely to occur, we cannot therefore draw further interpretation from this coefficient. Then x1 means that if we hold x2 (precipitation) constant an increase in 1° of temperature lead to an increase of 2mg of soil biomass, this is irrespective of whether we are in the control or nutrient added unit. Similarly x2 means that if we hold x1 (temperature) constant a 1mm increase in precipitation lead to an increase of 0.19mg of soil biomass. Finally x32 is the difference between the control and the nutrient added group when all the other variables are held constant, so if we are at a temperature of 10° and a precipitation of 100mm, adding nutrient to the soil lead to changes from 10+2x10+0.19x100= 49mg to 52mg of soil biomass. Now let’s make a figure of the effect of temperature on soil biomass

plot(y ~ x1, col = rep(c("red", "blue"), each = 50), pch = 16, xlab = "Temperature [°C]", 
    ylab = "Soil biomass [mg]")
abline(a = coef(m)[1], b = coef(m)[2], lty = 2, lwd = 2, col = "red")

What happened there? It seems as if our model is completely underestimating the y values … Well what we have been drawing is the estimated effect of temperature on soil biomass for the control group and for a precipitation of 0mm, this is not so interesting, instead we might be more interested to look at the effect for average precipitation values:

plot(y ~ x1, col = rep(c("red", "blue"), each = 50), pch = 16, xlab = "Temperature [°C]", 
    ylab = "Soil biomass [mg]")
abline(a = coef(m)[1] + coef(m)[3] * mean(x2), b = coef(m)[2], lty = 2, lwd = 2, 
    col = "red")
abline(a = coef(m)[1] + coef(m)[4] + coef(m)[3] * mean(x2), b = coef(m)[2], 
    lty = 2, lwd = 2, col = "blue")
# averaging effect of the factor variable
abline(a = coef(m)[1] + mean(c(0, coef(m)[4])) + coef(m)[3] * mean(x2), b = coef(m)[2], 
    lty = 1, lwd = 2)
legend("topleft", legend = c("Control", "N addition"), col = c("red", "blue"), 
    pch = 16)

Now this look better, the black line is the effect of temperature on soil biomass averaging out the effect of the treatment, it might be of interest if we are only interested in looking at temperature effects.

In this model the intercept did not make much sense, a way to remedy this is to center the explanatory variables, ie removing the mean value from the variables.

# now center the continuous variable to change interpretation of the
# intercept
data_center <- data.frame(x1 = x1 - mean(x1), x2 = x2 - mean(x2), x3 = x3)
modmat <- model.matrix(~x1 + x2 + x3, data = data.frame(x1 = x1, x2 = x2, x3 = x3))
data_center$y_center <- rnorm(n = 100, mean = modmat %*% betas, sd = 1)

# second model
m_center <- lm(y_center ~ x1 + x2 + x3, data_center)
## Call:
## lm(formula = y_center ~ x1 + x2 + x3, data = data_center)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4700 -0.5525 -0.0287  0.6701  1.7920 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50.4627     0.1423   354.6   <2e-16 ***
## x1            1.9724     0.0561    35.2   <2e-16 ***
## x2            0.1946     0.0106    18.4   <2e-16 ***
## x32           2.8976     0.2020    14.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1 on 96 degrees of freedom
## Multiple R-squared:  0.951,  Adjusted R-squared:  0.949 
## F-statistic:  620 on 3 and 96 DF,  p-value: <2e-16

Now through this centering we know that under average temperature and precipitation conditions the soil biomass in the control plot is equal to 50.46mg, in the nitrogen enriched plot we have 53mg of soil biomass. The slopes are not changing we are just shifting where the intercept lie making it directly interpretable. Let’s do a plot

plot(y_center ~ x2, data_center, col = rep(c("red", "blue"), each = 50), pch = 16, 
    xlab = "Precipitation [mm]", ylab = "Biomass [mg]")
abline(a = coef(m_center)[1], b = coef(m_center)[3], lty = 2, lwd = 2, col = "red")
abline(a = coef(m_center)[1] + coef(m_center)[4], b = coef(m_center)[3], lty = 2, 
    lwd = 2, col = "blue")
# averaging effect of the factor variable
abline(a = coef(m_center)[1] + mean(c(0, coef(m_center)[4])), b = coef(m_center)[3], 
    lty = 1, lwd = 2)
legend("bottomright", legend = c("Control", "N addition"), col = c("red", "blue"), 
    pch = 16)

We might also be interested in knowing which from the temperature or the precipitation as the biggest impact on the soil biomass, from the raw slopes we cannot get this information as variables with low standard deviation will tend to have bigger regression coefficient and variables with high standard deviation will have low regression coefficient. One solution is to derive standardized slopes that are in unit of standard deviation and therefore directly comparable in terms of their strength between continuous variables:

# now if we want to find out which of the two continuous variables as the
# most importance on y we can compute the standardized slopes from the
# unstandardized ones:
std_slope <- function(model, variable) {
    return(coef(model)[variable] * (sd(m$model[[variable]])/sd(m$model[[1]])))

std_slope(m, "x1")
##     x1 
## 0.7912
std_slope(m, "x2")
##     x2 
## 0.4067

From this we can conclude that temperature as a bigger impact on soil biomass than precipitation. If we wanted to compare the continuous variables with the binary variable we could standardize our variables by dividing by two times their standard deviation following Gelman (2008) Statistics in medecine.

Here we saw in a simple linear context how to derive quite a lot of information from our estimated regression coefficient, this understanding can then be apply to more complex models like GLM or GLMM. These models are offering us much more information than just the binary significant/non-significant categorization. Happy coding.


Get every new post delivered to your Inbox.