Hierarchical models with RStan (Part 1)

Real-world data sometimes show complex structure that calls for special models. When data are organized in more than one level, hierarchical models are the most relevant tool for the analysis. One classic example: when you record student performance from different schools, you might record student-level variables (age, ethnicity, social background) as well as school-level variables (number of students, budget). In this post I will show how to fit such models using RStan. As there is much to say and try on such models, I restrict myself in this post to a rather simple example and will extend this to more complex situations in later posts.

A few words about RStan:

If you don’t know anything about STAN and RStan make sure to check out this webpage. In a few words, RStan is an R interface to the STAN programming language that lets you fit Bayesian models. A classical workflow looks like this:

  1. Write a STAN model file ending with the .stan extension
  2. In R, fit the model using the RStan package, passing the model file and the data to the stan function
  3. Check the model fit; a great way to do it is to use the shinystan package

First example with simulated data:

Say that we recorded the response of 10 different plant species to rising temperature and nitrogen concentration. We measured the biomass of 10 individuals per species along a gradient of both temperature and nitrogen concentration, and we would like to know how these two variables affect plant biomass. In a hierarchical model we let the regression parameters vary between the species; this means that, for example, species A might have a more positive slope between temperature and biomass than species B. Note however that we do not fit a separate regression to each species: the regression parameters for the different species are themselves assumed to be drawn from a common statistical distribution.

In mathematical terms this example can be written:

\mu_{ij} = \beta_{0j} + \beta_{1j} * Temperature_{ij} + \beta_{2j} * Nitrogen_{ij}

\beta_{0j} \sim N(\gamma_{0},\tau_{0})

… (same for all regression coefficients) …

y_{ij} \sim N(\mu_{ij},\sigma)

For observations i: 1 … N and species j: 1 … J.

This is how such a model looks in STAN:

/*A simple example of a hierarchical model*/
data {
  int<lower=1> N; //the number of observations
  int<lower=1> J; //the number of groups
  int<lower=1> K; //number of columns in the model matrix
  int<lower=1,upper=J> id[N]; //vector of group indices
  matrix[N,K] X; //the model matrix
  vector[N] y; //the response variable
}
parameters {
  vector[K] gamma; //population-level regression coefficients
  vector<lower=0>[K] tau; //the standard deviation of the regression coefficients

  vector[K] beta[J]; //matrix of group-level regression coefficients
  real<lower=0> sigma; //standard deviation of the individual observations
}
model {
  vector[N] mu; //linear predictor
  //priors
  gamma ~ normal(0,5); //weakly informative priors on the regression coefficients
  tau ~ cauchy(0,2.5); //weakly informative priors, see section 6.9 in STAN user guide
  sigma ~ gamma(2,0.1); //weakly informative priors, see section 6.9 in STAN user guide
  
  for(j in 1:J){
   beta[j] ~ normal(gamma,tau); //fill the matrix of group-level regression coefficients 
  }
  
  for(n in 1:N){
    mu[n] = X[n] * beta[id[n]]; //compute the linear predictor using relevant group-level regression coefficients 
  }

  //likelihood
  y ~ normal(mu,sigma);
}

You can copy/paste the code into an empty text editor and save it as a .stan file.

Now we turn to R:

#load libraries
library(rstan)
library(RColorBrewer)
#where the STAN model is saved
setwd("~/Desktop/Blog/STAN/")
#simulate some data
set.seed(20161110)
N<-100 #sample size
J<-10 #number of plant species
id<-rep(1:J,each=10) #index of plant species
K<-3 #number of regression coefficients
#population-level regression coefficient
gamma<-c(2,-1,3)
#standard deviation of the group-level coefficient
tau<-c(0.3,2,1)
#standard deviation of individual observations
sigma<-1
#group-level regression coefficients
beta<-mapply(function(g,t) rnorm(J,g,t),g=gamma,t=tau) 
#the model matrix
X<-model.matrix(~x+y,data=data.frame(x=runif(N,-2,2),y=runif(N,-2,2)))
y<-vector(length = N)
for(n in 1:N){
  #simulate response data
  y[n]<-rnorm(1,X[n,]%*%beta[id[n],],sigma)
}
#run the model
m_hier<-stan(file="hierarchical1.stan",data=list(N=N,J=J,K=K,id=id,X=X,y=y))

The MCMC sampling takes place (it took about 90 seconds per chain on my computer), and then I got this warning message: “Warning messages:
1: There were 61 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
2: Examine the pairs() plot to diagnose sampling problems”

Here is an explanation for this warning: “For some intuition, imagine walking down a steep mountain. If you take too big of a step you will fall, but if you can take very tiny steps you might be able to make your way down the mountain, albeit very slowly. Similarly, we can tell Stan to take smaller steps around the posterior distribution, which (in some but not all cases) can help avoid these divergences.” from here. This issue occurs quite often in hierarchical models with limited sample size; the simplest solution is to re-parameterize the model, in other words to re-write the equations so that the MCMC sampler has an easier time sampling the posterior distribution.
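Before re-parameterizing, one can also simply follow the warning's advice and increase adapt_delta above its default of 0.8 (a quick sketch; sampling gets slower but the divergences often disappear):

#re-run the same model forcing the sampler to take smaller steps
m_hier<-stan(file="hierarchical1.stan",data=list(N=N,J=J,K=K,id=id,X=X,y=y),
             control=list(adapt_delta=0.99))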

Below is a new STAN model with a non-centered parameterization (See Section 22.6 in STAN user guide):

parameters {
  vector[K] gamma; //population-level regression coefficients
  vector<lower=0>[K] tau; //the standard deviation of the regression coefficients
  //implementing Matt's trick
  vector[K] beta_raw[J];
  real<lower=0> sigma; //standard deviation of the individual observations
}
transformed parameters {
  vector[K] beta[J]; //matrix of group-level regression coefficients
  //computing the group-level coefficient, based on non-centered parametrization based on section 22.6 STAN (v2.12) user's guide
  for(j in 1:J){
    beta[j] = gamma + tau .* beta_raw[j];
  }
}
model {
  vector[N] mu; //linear predictor
  //priors
  gamma ~ normal(0,5); //weakly informative priors on the regression coefficients
  tau ~ cauchy(0,2.5); //weakly informative priors, see section 6.9 in STAN user guide
  sigma ~ gamma(2,0.1); //weakly informative priors, see section 6.9 in STAN user guide
  for(j in 1:J){
   beta_raw[j] ~ normal(0,1); //standard normal draws, scaled and shifted in the transformed parameters block
  }
  for(n in 1:N){
    mu[n] = X[n] * beta[id[n]]; //compute the linear predictor using relevant group-level regression coefficients 
  }
  //likelihood
  y ~ normal(mu,sigma);
}

Note that the data block (not repeated here) is identical in the two cases; what changed is how the group-level coefficients beta are obtained.

We turn back to R:

#re-parametrize the model
m_hier<-stan(file="hierarchical1_reparam.stan",data=list(N=N,J=J,K=K,id=id,X=X,y=y))
#no more divergent iterations, we can start exploring the model
#a great way to start is to use the shinystan library
#library(shinystan)
#launch_shinystan(m_hier)
#model inference
print(m_hier,pars=c("gamma","tau","sigma"))
Inference for Stan model: hierarchical1_reparam.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

          mean se_mean   sd  2.5%   25%   50%  75% 97.5% n_eff Rhat
gamma[1]  1.96    0.00 0.17  1.61  1.86  1.96 2.07  2.29  2075    1
gamma[2] -0.03    0.02 0.77 -1.53 -0.49 -0.04 0.43  1.55  1047    1
gamma[3]  2.81    0.02 0.49  1.84  2.52  2.80 3.12  3.79   926    1
tau[1]    0.34    0.01 0.21  0.02  0.19  0.33 0.46  0.79  1135    1
tau[2]    2.39    0.02 0.66  1.47  1.94  2.26 2.69  4.04  1234    1
tau[3]    1.44    0.01 0.41  0.87  1.16  1.37 1.65  2.43  1317    1
sigma     1.04    0.00 0.09  0.89  0.98  1.04 1.10  1.23  2392    1

Samples were drawn using NUTS(diag_e) at Thu Nov 10 14:11:41 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The regression parameters were all decently estimated except for the second slope coefficient (gamma[2]): the simulated value was -1, while the posterior mean is -0.03 (though the wide 95% credible interval, -1.53 to 1.55, still covers the simulated value).

All the MCMC samples for all coefficients can easily be extracted and used to compute whatever you are interested in:

#extract the MCMC samples
mcmc_hier<-extract(m_hier)
str(mcmc_hier)

#plot average response to explanatory variables
X_new<-model.matrix(~x+y,data=data.frame(x=seq(-2,2,by=0.2),y=0))
#get predicted values for each MCMC sample
pred_x1<-apply(mcmc_hier$gamma,1,function(beta) X_new %*% beta)
#now get median and 95% credible intervals
pred_x1<-apply(pred_x1,1,quantile,probs=c(0.025,0.5,0.975))
#same stuff for the second explanatory variables
X_new<-model.matrix(~x+y,data=data.frame(x=0,y=seq(-2,2,by=0.2)))
pred_x2<-apply(mcmc_hier$gamma,1,function(beta) X_new %*% beta)
pred_x2<-apply(pred_x2,1,quantile,probs=c(0.025,0.5,0.975))

Here we basically generated new model matrices where only one variable varied at a time; this allowed us to get the model prediction for the effect of, say, temperature on plant biomass under average nutrient conditions. These predictions were obtained by multiplying the model matrix with the coefficients for each MCMC sample (the first apply command); from these samples we then get the median together with the 95% credible interval (the second apply command).
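Since we simulated the data, we can also check that the group-level coefficients were recovered (a quick sketch comparing the simulated beta matrix to the posterior medians):

#posterior median of each group-level coefficient vs its simulated value
beta_med<-apply(mcmc_hier$beta,c(2,3),median)
plot(as.numeric(beta),as.numeric(beta_med),
     xlab="Simulated coefficients",ylab="Posterior medians")
abline(0,1)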

Now we can plot this (the code for the plots is at the end of the post):

[Figure hier1: population-level response to the two explanatory variables with 95% CrI]

Another important plot shows the variation in the regression parameters between the species; again, this is easily done using the MCMC samples:

library(ggplot2)
#now we could look at the variation in the regression coefficients between the groups with caterpillar plots
ind_coeff<-apply(mcmc_hier$beta,c(2,3),quantile,probs=c(0.025,0.5,0.975))
df_ind_coeff<-data.frame(Coeff=rep(c("(Int)","X1","X2"),each=10),LI=c(ind_coeff[1,,1],ind_coeff[1,,2],ind_coeff[1,,3]),Median=c(ind_coeff[2,,1],ind_coeff[2,,2],ind_coeff[2,,3]),HI=c(ind_coeff[3,,1],ind_coeff[3,,2],ind_coeff[3,,3]))
gr<-paste("Gr",1:10)
df_ind_coeff$Group<-factor(gr,levels=gr)
#we may also add the population-level median estimate
pop_lvl<-data.frame(Coeff=c("(Int)","X1","X2"),Median=apply(mcmc_hier$gamma,2,quantile,probs=0.5))
ggplot(df_ind_coeff,aes(x=Group,y=Median))+geom_point()+
  geom_linerange(aes(ymin=LI,ymax=HI))+coord_flip()+
  facet_grid(.~Coeff)+
  geom_hline(data=pop_lvl,aes(yintercept=Median),color="blue",linetype="dashed")+
  labs(y="Regression parameters")

[Figure hier2: caterpillar plot of the group-level regression coefficients with population-level medians]

The cool thing with using STAN is that we can extend or modify the model in many ways. These will be the topics of future posts, which will include: crossed and nested designs, multilevel modelling, non-normal distributions and much more. Stay tuned!

Code for the first plot:

cols<-brewer.pal(10,"Set3")
par(mfrow=c(1,2),mar=c(4,4,0,1),oma=c(0,0,3,5))
plot(y~X[,2],pch=16,xlab="Temperature",ylab="Response variable",col=cols[id])
lines(seq(-2,2,by=0.2),pred_x1[1,],lty=2,col="red")
lines(seq(-2,2,by=0.2),pred_x1[2,],lty=1,lwd=3,col="blue")
lines(seq(-2,2,by=0.2),pred_x1[3,],lty=2,col="red")
plot(y~X[,3],pch=16,xlab="Nitrogen concentration",ylab="Response variable",col=cols[id])
lines(seq(-2,2,by=0.2),pred_x2[1,],lty=2,col="red")
lines(seq(-2,2,by=0.2),pred_x2[2,],lty=1,lwd=3,col="blue")
lines(seq(-2,2,by=0.2),pred_x2[3,],lty=2,col="red")
mtext(text = "Population-level response to the two\nexplanatory variables with 95% CrI",side = 3,line = 0,outer=TRUE)
legend(x=2.1,y=10,legend=paste("Gr",1:10),ncol = 1,col=cols,pch=16,bty="n",xpd=NA,title = "Group\nID")


Simulating local community dynamics under ecological drift

In 2001, Stephen Hubbell's book on the neutral theory of biodiversity marked a major shift from classical community ecology. Before this book the niche-assembly framework dominated the study of community dynamics. Very briefly, under this framework local species composition is the result of the resources available at a particular site, and species presence or absence depends on species niches (as defined in Chase 2003). As a result community assembly was seen as a deterministic process: a specific set of species should emerge from a given set of resources, and species composition should be stable as long as the system is not disturbed. Hubbell's theory, based on a dispersal-assembly framework (historically developed by MacArthur and Wilson), departed from this mainstream view of community dynamics by assuming that local species composition is transient, with no equilibrium. Species abundances follow a random walk under ecological drift: species continuously go locally extinct while a constant rain of immigrants from the metacommunity brings new recruits. The interested reader is kindly encouraged to read: this, that or that.

Being an R geek, and after reading Chapter 4 of Hubbell's book, I decided to implement his model of local community dynamics under ecological drift. Do not expect any advanced analysis of the model behavior; Hubbell actually derives analytical solutions for this model in that chapter. This post is mostly educational, to get a glimpse into the first part of Hubbell's neutral model.

Before diving into the code it is helpful to verbally outline the main characteristics of the model:

  • The model is individual-based and neutral, assuming that all individuals are equal in their probability to die or to persist.
  • The model assumes a zero-sum dynamic: the total number of individuals (J) is constant. Imagine for example a forest: when a tree dies it frees space for a new individual to grow
  • At each time step a number of individuals (D) die and are replaced by new recruits coming either from the metacommunity (with probability m) or from the local community (with probability 1-m)
  • The probability that a new recruit belongs to a given species is proportional to that species' relative abundance, either in the metacommunity (assumed to be fixed, with a log-normal species abundance distribution) or in the local community (which obviously varies)

We are now equipped to immerse ourselves into R code:

#implement the zero-sum dynamics with ecological drift from Hubbell's theory (Chp 4)
untb<-function(R=100,J=1000,D=1,m=0,Time=100,seed=as.numeric(Sys.time())){
  mat<-matrix(0,ncol=R,nrow=Time+1,dimnames=
  list(paste0("Time",0:Time),as.character(1:R)))
  #the metacommunity SAD
  metaSAD<-rlnorm(n=R)
  metaSAD<-metaSAD/sum(metaSAD)
  #initialize the local community
  begin<-table(sample(x=1:R,size = J,replace = TRUE,prob = metaSAD))
  mat[1,attr(begin,"dimnames")[[1]]]<-begin
  #loop through the time and get community composition
  for(i in 2:(Time+1)){
    newC<-helper_untb(vec = mat[(i-1),],R = R,D = D,m = m,metaSAD=metaSAD)
    mat[i,attr(newC,"dimnames")[[1]]]<-newC
  }
  return(mat)
}
#the function that compute individual death and 
#replacement by new recruits 
helper_untb<-function(vec,R,D,m,metaSAD){
  ind<-rep(1:R,times=vec)
  #remove D individuals from the community
  ind_new<-ind[-sample(x = 1:length(ind),size = D,replace=FALSE)]
  #are the new recruit from the Local or Metacommunity?
  recruit<-sample(x=c("L","M"),size=D,replace=TRUE,prob=c((1-m),m))
  #sample species ID according to the relevant SAD
  ind_recruit<-c(ind_new,sapply(recruit,function(isLocal) {
    if(isLocal=="L")
      sample(x=ind_new,size=1)
    else
      sample(1:R,size=1,prob=metaSAD)
  }))
  return(table(ind_recruit))
}

I always like simple theories that can be implemented with a few lines of code and do not require fancy computations. Let's put the function into action and get, for a closed community (m=0) of 10 species, the species abundance and rank-abundance distribution dynamics:

library(reshape2) #for data manipulation I
library(plyr) #for data manipulation II
library(ggplot2) #for the nicest plot in the world
#look at population dynamics over time
comm1<-untb(R = 10,J = 100,D = 10,m = 0,Time = 200)
#quick and dirty first graphs
comm1m<-melt(comm1)
comm1m$Time<-as.numeric(comm1m$Var1)-1
ggplot(comm1m,aes(x=Time,y=value,color=factor(Var2)))+
  geom_path()+labs(y="Species abundance")+
  scale_color_discrete(name="Species ID")
#let's plot the RAD curve over time
pre_rad<-apply(comm1,1,function(x) sort(x,decreasing = TRUE))
pre_radm<-melt(pre_rad)
pre_radm$Time<-as.numeric(pre_radm$Var2)-1
ggplot(pre_radm,aes(x=Var1,y=value/100,color=Time,group=Time))+
  geom_line()+labs(x="Species rank",y="Species relative abundance")

[Figure untb1.1: species abundance dynamics and rank-abundance curves over time]

We see here an important feature of the zero-sum dynamics: with no immigration the model will always (eventually) lead to the dominance of one species. This is because with no immigration there are so-called absorbing states; species abundances are attracted to these states, and once these values are reached the abundances stay constant. In this case there are only two absorbing states for species relative abundance: 0 or 1, either you go extinct or you dominate the community.
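A quick numerical check of this argument, using the untb function defined above (the choice of 2000 time steps is mine and may not be long enough for every single run to reach fixation):

set.seed(42)
#run 20 closed communities and record the final relative abundance
#of the most common species
finals<-replicate(20,{
  comm<-untb(R=10,J=100,D=10,m=0,Time=2000)
  max(comm[nrow(comm),])/100
})
finals #most (if not all) runs should have drifted to monodominance (ie 1)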

In the next step we open up the community while varying the death rate, and we track the species richness dynamics:

#apply this over a range of parameters
pars<-expand.grid(R=20,J=200,D=c(1,2,4,8,16,32,64),
                  m=c(0,0.01,0.05,0.1,0.5),Time=200)
comm_list<-alply(pars,1,function(p) do.call(untb,as.list(p)))
#get species richness for each parameter set
rich<-llply(comm_list,function(x) apply(x,1,function(y) length(y[y!=0])))
rich<-llply(rich,function(x) data.frame(Time=0:200,Rich=x))
rich<-rbind.fill(rich)
rich$D<-rep(pars$D,each=201)
rich$m<-rep(pars$m,each=201)
ggplot(rich,aes(x=Time,y=Rich,color=factor(D)))+geom_path()+
  facet_grid(.~m)

[Figure untb3: species richness dynamics across death rates (D) and immigration probabilities (m)]

As we increase the probability that new recruits come from the metacommunity, the impact of increasing the number of deaths per time step vanishes. This is because the metacommunity is static in this model: even if species are wiped out by an important local disturbance, there will always be new immigrants coming in, maintaining species richness at rather constant levels.
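To put a number on this claim, one can average richness over the last time steps for each parameter combination (a quick sketch using the rich data frame built above):

#average richness over the last 50 time steps per D x m combination
ddply(rich,.(D,m),summarize,Rich_end=mean(Rich[Time>150]))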

There is an R package for everything these days and the neutral theory is no exception; check out the untb package, which implements the full neutral model (ie with speciation and dynamic metacommunities) and also allows you to fit the model to empirical data and get parameter estimates.

Exploring the diversity of Life using Rvest and the Catalog of Life

I am writing the general introduction for my thesis and wanted a nice illustration of the diversity of Arthropods compared to other phyla (my work focuses on Arthropods, so this is a nice motivation). As the literature I have had access to so far uses pie charts to graphically represent these diversities, and knowing that pie charts are bad, I decided to create my own illustration.

Fortunately I came across the Catalogue of Life website, which provides (among other things) an overview of the number of species in each phylum. So I decided to have a go at directly importing the data from the website into R using the rvest package.

Let’s go:


#load the packages
library(rvest)
library(ggplot2)
library(scales)#for comma separator in ggplot2 axis

#read the data
col<-read_html("http://www.catalogueoflife.org/col/info/totals")
col%>%
  html_table(header=TRUE)->sp_list  
sp_list<-sp_list[[1]]
#some minor data re-formatting
#re-format the data frame keeping only animals, plants and
#fungi
sp_list<-sp_list[c(3:37,90:94,98:105),-4]
#add a kingdom column
sp_list$kingdom<-rep(c("Animalia","Fungi","Plantae"),times=c(35,5,8))
#remove the nasty commas and turn into numeric
sp_list[,2]<-as.numeric(gsub(",","",sp_list[,2]))
sp_list[,3]<-as.numeric(gsub(",","",sp_list[,3]))
names(sp_list)[2:3]<-c("Nb_Species_Col","Nb_Species")

Now we are ready to make the first plot:

ggplot(sp_list,aes(x=Taxon,y=Nb_Species,fill=kingdom))+
 geom_bar(stat="identity")+
 coord_flip()+
 scale_fill_discrete(name="Kingdom")+
 labs(y="Number of Species",x="",title="The diversity of life")


[Figure Div1: number of described species per phylum, colored by kingdom]

This is a bit overwhelming: half of the phyla have fewer than 1000 species, so let's make a second graph with only the phyla comprising more than 1000 species. And just to make things nicer, I sort the phyla by number of species:

subs<-subset(sp_list,Nb_Species>1000)
subs$Taxon<-factor(subs$Taxon,levels=subs$Taxon[order(subs$Nb_Species)])
ggplot(subs,aes(x=Taxon,y=Nb_Species,fill=kingdom))+
 geom_bar(stat="identity")+
 theme(panel.border=element_rect(linetype="dashed",color="black",fill=NA),
 panel.background=element_rect(fill="white"),
 panel.grid.major.x=element_line(linetype="dotted",color="black"))+
 coord_flip()+
 scale_fill_discrete(name="Kingdom")+
 labs(y="Number of Species",x="",
 title="The diversity of multicellular organisms from the Catalog of Life")+
 scale_y_continuous(expand=c(0,0),limits=c(0,1250000),labels=comma)


[Figure Div2: number of described species for the phyla with more than 1000 species]

That’s it for a first exploration of the powers of rvest. This was actually super easy; I expected to have to spend much more time trying to decipher the underlying XML, but rvest seems to know its way around …

This graph is also a nice reminder that most of the described multicellular species out there are small crawling beetles, and that we still know alarmingly little about their ecology and threat status. As a comparison, all the vertebrates (birds, mammals, reptiles, fishes and some other taxa) fall within the Chordata, with a total of around 50,000 described species. An even more sobering thought is that the total number of described species is only a fraction of what is left undescribed.

Exploring Spatial Patterns and Coexistence

Today is a rainy day and I had to drop my plans to go hiking; instead I continued reading “Self-Organization in Complex Ecosystems” by Richard Solé and Jordi Bascompte.

As I will be busy with spatial models in the coming weeks at the iDiv summer school, I was closely reading chapter three on spatial self-organization and decided to try to implement in R one of the Coupled Map Lattice Models described in the book.

The model is based on a discrete-time version of the Lotka-Volterra competition model. For two competing species, we model the normalized population abundances (the population abundance divided by the carrying capacity) N1 and N2 using the following equations:

N_{1,t+1} = \phi_{N_{1}}(N_{1,t},N_{2,t}) = N_{1,t} * exp[r_{N_{1}} * (1 - N_{1,t} - \alpha_{12} * N_{2,t})]

N_{2,t+1} = \phi_{N_{2}}(N_{1,t},N_{2,t}) = N_{2,t} * exp[r_{N_{2}} * (1 - N_{2,t} - \alpha_{21} * N_{1,t})]

Where r is the growth rate and the \alpha are the interspecific competition coefficients. Note that since we use normalized abundances, the intraspecific competition coefficients are equal to one.

Space can be introduced into this model by considering dispersal between neighboring cells. Say we have a 50 x 50 lattice and that individuals disperse between neighboring cells with probability D; as a result the normalized population abundance of species 1 in cell k at time t+1 will be:

N_{1,t+1,k} = (1 - D) * \phi_{N_{1}}(N_{1,t,k},N_{2,t,k}) + \frac{D}{8} * \sum_{j=1}^{8} \phi_{N_{1}}(N_{1,t,j},N_{2,t,j})

Where j is an index of all 8 neighboring cells around cell k. We have a similar equation for the competing species.

Now let’s implement this in R (you may find a nicer version of the code here).

I start by defining two functions: one that will compute the population dynamics based on the first two equations, and the other one that will return for each cell in lattice the row numbers of the neighboring cells:

#load the packages used in this post
library(plyr)     #for rbind.fill and ldply
library(reshape2) #for melt
library(ggplot2)
library(viridis)  #for scale_fill_viridis

#Lotka-Volterra discrete time competition function
#@Ns are the population abundance
#@rs are the growth rate
#@as is the competition matrix
discrete<-function(Ns,rs,as){
  #return population abundance at time t+1
  return(as.numeric(Ns*exp(rs*(1-as%*%Ns))))
}

#a helper function returning the row numbers of all neighbouring cells
find_neighbour<-function(vec,dat=dat){
  #%in% effectively selects the whole 3x3 block of cells around the focal cell
  lnb<-which(dat[,"x"]%in%(vec["x"]+c(-1,-1,-1,0,1,1,1,0)) &
             dat[,"y"]%in%(vec["y"]+c(-1,0,1,1,1,0,-1,-1)))
  #remove own line
  lnb<-lnb[which(lnb!=vec["lnb"])]
  return(lnb)
}
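A quick sanity check of the population-dynamics function, using the parameter values from the book that are set below (with both interspecific coefficients at 1.2 the coexistence equilibrium is 1/(1+1.2) ≈ 0.45):

#one time step starting from N1=N2=0.5, both species should move toward ~0.45
discrete(Ns=c(0.5,0.5),rs=c(1.5,1.5),
         as=matrix(c(1,1.2,1.2,1),ncol=2,byrow=TRUE))
#returns approximately c(0.43,0.43)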

Then I create a lattice and fill it with the two species, adding some random noise.

dat<-expand.grid(x=1:50,y=1:50)
dat$N1<-0.5+rnorm(2500,0,0.1)
dat$N2<-0.5+rnorm(2500,0,0.1)
dat$lnb<-1:2500
list_dat<-list(dat)

The next step is to initialize all the parameters; in the example given in the book the two species have equal growth rates, equal dispersal rates and equal interspecific competition.

#competition parameters
as<-matrix(c(1,1.2,1.2,1),ncol=2,byrow=TRUE)
#growth rate
rs<-c(1.5,1.5)
#dispersal rate
ds<-c(0.05,0.05)
#infos on neighbouring cells
neigh<-apply(dat,1,function(x) find_neighbour(x,dat))
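A quick check of the neighbourhood structure: on a 50 x 50 lattice, the 4 corner cells should have 3 neighbours, the edge cells 5 and the interior cells 8:

table(sapply(neigh,length))
#   3    5    8 
#   4  192 2304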

Now we are ready to run the model (the code is rather ugly, sorry for that, Saturday coding makes me rather lazy …)

generation<-1:60
#model the dynamic assuming absorbing boundary (ie ind that go out of the grid are lost to the system)
for(i in generation){  
  list_dat[[(i+1)]]<-rbind.fill(apply(list_dat[[i]],1,function(x){
    #population dynamics within one grid cell
    intern<-(1-ds)*discrete(Ns=x[c("N1","N2")],rs=rs,as=as)
    #grab the neighbouring cells
    neigh_cell<-list_dat[[i]][neigh[[x["lnb"]]],]
    #the number of immigrants coming into the focal grid cell
    imm<-(ds/8)*rowSums(apply(neigh_cell,1,function(y) {
      discrete(Ns=y[c("N1","N2")],rs=rs,as=as)}))
    out<-data.frame(x=x["x"],y=x["y"],N1=intern[1]+imm[1],
                    N2=intern[2]+imm[2],lnb=x["lnb"])
    return(out)
  }))
  #print(i)
}

First let’s reproduce figure 3.8 from the book:

#look at coexistence between the two species within one cell
cell525<-ldply(list_dat,function(x) c(x[525,"N1"],x[525,"N2"]))
cell525$Gen<-1:61
cell525<-melt(cell525,id.vars="Gen",measure.vars=1:2,
              variable.name="Species",value.name="Pop")
ggplot(cell525,aes(x=Gen,y=Pop,color=Species))+geom_path()

#look at coexistence across the system
all<-ldply(list_dat,function(x) c(sum(x[,"N1"]),sum(x[,"N2"])))
all$Gen<-1:61
all<-melt(all,id.vars="Gen",measure.vars=1:2,
          variable.name="Species",value.name="Pop")
ggplot(all,aes(x=Gen,y=Pop,color=Species))+geom_path()


[Figure space1: population dynamics of the two species within one cell and summed over the whole lattice]

And now figure 3.9 plus an extra figure looking at the correlation in population abundance between the two species at the beginning and the end of the simulation:

#compare species-species abundance correlation at the beginning and at the end of the simulation
exmpl<-rbind(list_dat[[1]],list_dat[[60]])
exmpl$Gen<-rep(c("Gen : 1","Gen : 60"),each=2500)
ggplot(exmpl,aes(x=N1,y=N2))+geom_point()+facet_grid(.~Gen)

[Figure space3: correlation between the two species' abundances at generation 1 and generation 60]

#look at variation in spatial patterns at the beginning and the end of the simulation
exmplm<-melt(exmpl,id.vars=c("x","y","Gen"),measure.vars=c("N1","N2"),
             variable.name="Species",value.name="Pop")
ggplot(exmplm,aes(x,y,fill=Pop))+geom_raster()+
  scale_fill_viridis(option="plasma",direction=-1)+facet_grid(Gen~Species)

[Figure space2: spatial abundance patterns of both species at generation 1 and generation 60]


Neat stuff! I am looking forward to implementing other models from this book.


Plotting regression curves with confidence intervals for LM, GLM and GLMM in R

Once models have been fitted and checked and re-checked comes the time to interpret them. The easiest way to do so is to plot the response variable versus the explanatory variables (I call them predictors), adding to this plot the fitted regression curve together (if you are feeling fancy) with a confidence interval around it. This approach is preferable to the partial-residual one because it averages out any other potentially confounding predictors and so focuses only on the effect of one focal predictor on the response. In my work I have been doing this hundreds of times and finally decided to put it all into a function to clean up my code a little bit. As always, a cleaner version of this post is available here.
Let's dive into some code (the function is at the end of the post; just copy/paste it into your R environment):

#####LM example######
#we measured plant biomass for 120 pots under 3 nutrient treatments and across a gradient of CO2
#due to limited place in our greenhouse chambers we had to use 4 of them, so we established a blocking design
data<-data.frame(C=runif(120,-2,2),N=gl(n = 3,k = 40,labels = c("Few","Medium","A_lot")),Block=rep(rep(paste0("B",1:4),each=10),times=3))
xtabs(~N+Block,data)

##         Block
## N        B1 B2 B3 B4
##   Few    10 10 10 10
##   Medium 10 10 10 10
##   A_lot  10 10 10 10

modmat<-model.matrix(~Block+C*N,data)
#the paramters of the models
params<-c(10,-0.4,2.3,-1.5,1,0.5,2.3,0.6,2.7)
#simulate a response vector
data$Biom<-rnorm(120,modmat%*%params,1)
#fit the model
m<-lm(Biom~Block+C*N,data)
summary(m)

## 
## Call:
## lm(formula = Biom ~ Block + C * N, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.11758 -0.68801 -0.01582  0.75057  2.55953 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.0768     0.2370  42.521  < 2e-16 ***
## BlockB2      -0.3011     0.2690  -1.119 0.265364    
## BlockB3       2.3322     0.2682   8.695 3.54e-14 ***
## BlockB4      -1.4505     0.2688  -5.396 3.91e-07 ***
## C             0.7585     0.1637   4.633 9.89e-06 ***
## NMedium       0.4842     0.2371   2.042 0.043489 *  
## NA_lot        2.4011     0.2335  10.285  < 2e-16 ***
## C:NMedium     0.7287     0.2123   3.432 0.000844 ***
## C:NA_lot      3.2536     0.2246  14.489  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.028 on 111 degrees of freedom
## Multiple R-squared:  0.9201, Adjusted R-squared:  0.9144 
## F-statistic: 159.8 on 8 and 111 DF,  p-value: < 2.2e-16

Here we would normally continue with some model checks. From the model we would like to plot the effect of CO2 on plant biomass for each level of N addition, and of course we want to average out the Block effect (otherwise we would have to plot one separate line for each Block). This is how it works:

pred<-plot_fit(m,focal_var = "C",inter_var = "N")
head(pred)

##            C   N       LC     Pred       UC
## 1 -1.9984087 Few 8.004927 8.706142 9.407358
## 2 -1.7895749 Few 8.213104 8.864545 9.515986
## 3 -1.5807411 Few 8.417943 9.022948 9.627952
## 4 -1.3719073 Few 8.618617 9.181350 9.744084
## 5 -1.1630735 Few 8.814119 9.339753 9.865387
## 6 -0.9542397 Few 9.003286 9.498156 9.993026

#the function output a data frame with columns for the varying predictors
#a column for the predicted values (Pred), one lower bound (LC) and an upper one (UC)
#let's plot this
plot(Biom~C,data,col=c("red","green","blue")[data$N],pch=16,xlab="CO2 concentration",ylab="Plant biomass")
lines(Pred~C,pred[1:20,],col="red",lwd=3)
lines(LC~C,pred[1:20,],col="red",lwd=2,lty=2)
lines(UC~C,pred[1:20,],col="red",lwd=2,lty=2)
lines(Pred~C,pred[21:40,],col="green",lwd=3)
lines(LC~C,pred[21:40,],col="green",lwd=2,lty=2)
lines(UC~C,pred[21:40,],col="green",lwd=2,lty=2)
lines(Pred~C,pred[41:60,],col="blue",lwd=3)
lines(LC~C,pred[41:60,],col="blue",lwd=2,lty=2)
lines(UC~C,pred[41:60,],col="blue",lwd=2,lty=2)
legend("topleft",legend = c("Few","Medium","A lot"),col=c("red","green","blue"),pch=16,lwd=3,title = "N addition",bty="n")

[Figure plotFit1: plant biomass vs CO2 with fitted lines and 95% CI per N level (LM example)]

The cool thing is that the function will also work for GLMs, LMMs and GLMMs. For mixed-effect models the confidence interval is computed by parametric bootstrapping:

######LMM example#######
library(lme4)
#now let's say that we took 5 measurements per pot and we don't want to aggregate them
data<-data.frame(Pots=rep(paste0("P",1:120),each=5),Block=rep(rep(paste0("B",1:4),each=5*10),times=3),N=gl(n = 3,k = 40*5,labels=c("Few","Medium","A_lot")),C=rep(runif(120,-2,2),each=5))
#a random intercept term
rnd_int<-rnorm(120,0,0.4)
modmat<-model.matrix(~Block+C*N,data)
lin_pred<-modmat%*%params+rnd_int[factor(data$Pots)]
data$Biom<-rnorm(600,lin_pred,1)
m<-lmer(Biom~Block+C*N+(1|Pots),data)
summary(m)

## Linear mixed model fit by REML ['lmerMod']
## Formula: Biom ~ Block + C * N + (1 | Pots)
##    Data: data
## 
## REML criterion at convergence: 1765.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6608 -0.6446 -0.0340  0.6077  3.2002 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Pots     (Intercept) 0.1486   0.3855  
##  Residual             0.9639   0.9818  
## Number of obs: 600, groups:  Pots, 120
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  9.93917    0.13225   75.15
## BlockB2     -0.42019    0.15153   -2.77
## BlockB3      2.35993    0.15364   15.36
## BlockB4     -1.36188    0.15111   -9.01
## C            0.97208    0.07099   13.69
## NMedium      0.36236    0.13272    2.73
## NA_lot       2.25624    0.13189   17.11
## C:NMedium    0.70157    0.11815    5.94
## C:NA_lot     2.78150    0.10764   25.84
## 
## Correlation of Fixed Effects:
##           (Intr) BlckB2 BlckB3 BlckB4 C      NMedim NA_lot C:NMdm
## BlockB2   -0.572                                                 
## BlockB3   -0.575  0.493                                          
## BlockB4   -0.576  0.495  0.493                                   
## C          0.140 -0.012 -0.018 -0.045                            
## NMedium   -0.511  0.003  0.027  0.007 -0.118                     
## NA_lot    -0.507  0.008  0.003  0.003 -0.119  0.502              
## C:NMedium -0.134  0.019  0.161  0.038 -0.601  0.175  0.071       
## C:NA_lot  -0.107  0.077  0.020  0.006 -0.657  0.078  0.129  0.394

#again, model checks should come here
#for LMM and GLMM we also need to pass the names of the random effects as a character (vector)
pred<-plot_fit(m,focal_var = "C",inter_var = "N",RE = "Pots")
#let's plot this
plot(Biom~C,data,col=c("red","green","blue")[data$N],pch=16,xlab="CO2 concentration",ylab="Plant biomass")
lines(Pred~C,pred[1:20,],col="red",lwd=3)
lines(LC~C,pred[1:20,],col="red",lwd=2,lty=2)
lines(UC~C,pred[1:20,],col="red",lwd=2,lty=2)
lines(Pred~C,pred[21:40,],col="green",lwd=3)
lines(LC~C,pred[21:40,],col="green",lwd=2,lty=2)
lines(UC~C,pred[21:40,],col="green",lwd=2,lty=2)
lines(Pred~C,pred[41:60,],col="blue",lwd=3)
lines(LC~C,pred[41:60,],col="blue",lwd=2,lty=2)
lines(UC~C,pred[41:60,],col="blue",lwd=2,lty=2)
legend("topleft",legend = c("Few","Medium","A lot"),col=c("red","green","blue"),pch=16,lwd=3,title = "N addition",bty="n")

[Figure plotFit2: plant biomass vs CO2 with fitted lines and bootstrapped 95% CI per N level (LMM example)]
Please note a few elements:
– so far the function only returns 95% confidence intervals
– I have tested it on the various types of models that I usually build, but there are most certainly still some bugs hanging around, so if the function returns an error please let me know of the model you fitted and the error returned
– the bootstrap computation can take some time for GLMMs, so be ready to wait a few minutes if you have a big complex model (the core of this bootstrap step is sketched below)
– the function accepts a vector of variable names for the inter_var argument; it should also work for the RE argument, even though I have not tried it yet
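For reference, the bootstrapped CI inside the function boils down to the following (a sketch, assuming a fitted merMod object m and a fixed-effect model matrix mm for the new data, as built inside the function):

library(lme4)
#parametric bootstrap: simulate from the model, refit, and recompute
#the fixed-effect predictions each time
predFun<-function(.) as.numeric(mm%*%fixef(.))
bb<-bootMer(m,FUN=predFun,nsim=200)
#2.5% and 97.5% quantiles across the bootstrap replicates, per predicted point
apply(bb$t,2,quantile,probs=c(0.025,0.975))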

Happy plotting!

Here is the code for the function:

#for lm, glm, lmer and glmer models
#parameters:
#@m : the fitted lm, glm or merMod object (need to be provided)
#@focal_var: a character, the name of the variable of interest that will be plotted on the x axis, ie the varying variable (need to be provided)
#@inter_var: a character or character vector, the name of the variable(s) interacting with the focal variable, ie categorical variables from which predictions will be drawn for each level across the focal_var gradient
#@RE: if giving a merMod object, pass a character or character vector with the name of the random effect variable(s) (so far I only tried with one RE)
#@n: a numeric, the number of data points that will form the gradient of the focal variable
#@n_core: the number of cores used to compute the bootstrapped CI for GLMM models

plot_fit<-function(m,focal_var,inter_var=NULL,RE=NULL,n=20,n_core=4){
  require(arm)  
  dat<-model.frame(m)
  #turn all character variable to factor
  dat<-as.data.frame(lapply(dat,function(x){
    if(is.character(x)){
      as.factor(x)
    }
    else{x}
  }))
  #make a sequence from the focal variable
  x1<-list(seq(min(dat[,focal_var]),max(dat[,focal_var]),length=n))
  #grab the names and unique values of the interacting variables
  isInter<-which(names(dat)%in%inter_var)
  if(length(isInter)==1){
    x2<-list(unique(dat[,isInter]))
    names(x2)<-inter_var
  }
  if(length(isInter)>1){
    x2<-lapply(dat[,isInter],unique)
  }
  if(length(isInter)==0){
    x2<-NULL
  }
  #all_var<-x1
  #add the focal variable to this list
  all_var<-c(x1,x2)
  #expand.grid on it
  names(all_var)[1]<-focal_var
  all_var<-expand.grid(all_var)
  
  #remove the response, the varying variables and the non-predictors
  dat_red<-dat[,-c(1,which(names(dat)%in%c(focal_var,inter_var,RE,"X.weights."))),drop=FALSE]
  name_f<-character(0) #default, in case there are no variables to average over
  if(dim(dat_red)[2]==0){
    new_dat<-all_var
  }
  else{
    fixed<-lapply(dat_red,function(x) if(is.numeric(x)) mean(x) else factor(levels(x)[1],levels = levels(x)))
    #the number of rows in the new_dat frame
    fixed<-lapply(fixed,rep,dim(all_var)[1])
    #create the new_dat frame starting with the varying focal variable and potential interactions
    new_dat<-cbind(all_var,as.data.frame(fixed)) 
    #get the names of the factor variables to average over; name_f stays empty when there are none
    name_f<-names(dat_red)[sapply(dat_red,is.factor)]
  }  
    
  
  #get the predicted values
  cl<-class(m)[1]
  if(cl=="lm"){
    pred<-predict(m,newdata = new_dat,se.fit=TRUE)
  }
  
  if(cl=="glm" | cl=="negbin"){
    #predicted values on the link scale
    pred<-predict(m,newdata=new_dat,type="link",se.fit=TRUE)
  }
  if(cl=="glmerMod" | cl=="lmerMod"){
    pred<-list(fit=predict(m,newdata=new_dat,type="link",re.form=~0))
    #for bootstrapped CI
    new_dat<-cbind(new_dat,rep(0,dim(new_dat)[1]))
    names(new_dat)[dim(new_dat)[2]]<-as.character(formula(m)[[2]])
    mm<-model.matrix(formula(m,fixed.only=TRUE),new_dat)
  }
  #average over potential categorical variables  
  if(length(name_f)>0){
    if(cl=="glmerMod" | cl=="lmerMod"){
      coef_f<-lapply(name_f,function(x) fixef(m)[grep(paste0("^",x,"\\w+$"),names(fixef(m)))])
    }
    else{
      coef_f<-lapply(name_f,function(x) coef(m)[grep(paste0("^",x,"\\w+$"),names(coef(m)))])
    }    
    pred$fit<-pred$fit+sum(unlist(lapply(coef_f,function(x) mean(c(0,x)))))
  }
  #to get the back-transform values get the inverse link function
  linkinv<-family(m)$linkinv
  
  #get the back transformed prediction together with the 95% CI for LM and GLM
  if(cl=="glm" | cl=="lm"){
    pred$pred<-linkinv(pred$fit)
    pred$LC<-linkinv(pred$fit-1.96*pred$se.fit)
    pred$UC<-linkinv(pred$fit+1.96*pred$se.fit)
  }
  
  #for GLMM need to use bootstrapped CI, see ?predict.merMod
  if(cl=="glmerMod" | cl=="lmerMod"){
    pred$pred<-linkinv(pred$fit)
    predFun<-function(.) mm%*%fixef(.)
    bb<-bootMer(m,FUN=predFun,nsim=200,parallel="multicore",ncpus=n_core) #do this 200 times
    #back-transform the bootstrapped predictions (note: this apply also transposes
    #bb$t, so that rows are now predictions and columns the bootstrap replicates)
    bb$t<-apply(bb$t,1,function(x) linkinv(x))
    #as we did this 200 times the 95% CI will be bordered by the 5th and 195th value
    bb_se<-apply(bb$t,1,function(x) x[order(x)][c(5,195)])
    pred$LC<-bb_se[1,]
    pred$UC<-bb_se[2,] 
  }
  
  #the output
  out<-as.data.frame(cbind(new_dat[,1:(length(inter_var)+1)],pred$LC,pred$pred,pred$UC))
  names(out)<-c(names(new_dat)[1:(length(inter_var)+1)],"LC","Pred","UC")
  return(out)
}

Two little annoying stats details

UPDATED: Thanks to Ben's and Florian's comments I've updated the first part of the post


A very brief post at the end of the field season on two little “details” that annoy me in papers/analyses that I see being done (sometimes) around me.

The first one concerns mixed-effect models where the fitted model contains a grouping factor (say month or season) both as a fixed-effect term and as a random-effect term (on the right side of the | in the lme4 model syntax). I don't really understand why anyone would want to do this, so instead of spending time writing equations let's just make a simple simulation example and see what the consequences of doing this are:


library(lme4)
set.seed(20150830)
#an example of a situation measuring plant biomass in four different months along a gradient of temperature
data<-data.frame(temp=runif(100,-2,2),month=gl(n=4,k=25))
modmat<-model.matrix(~temp+month,data)
#the coefficient
eff<-c(1,2,0.5,1.2,-0.9)
data$biom<-rnorm(100,modmat%*%eff,1)
#the simulated coefficients for month are 0.5, 1.2, -0.9
#a simple lm
m_fixed<-lm(biom~temp+month,data)
coef(m_fixed) #not too bad
## (Intercept)        temp      month2      month3      month4 
##   0.9567796   2.0654349   0.4307483   1.2649599  -0.8925088

#a lmm with month ONLY as random term
m_rand<-lmer(biom~temp+(1|month),data)
fixef(m_rand)

## (Intercept)        temp 
##    1.157095    2.063714

ranef(m_rand)

## $month
##   (Intercept)
## 1  -0.1916665
## 2   0.2197100
## 3   1.0131908
## 4  -1.0412343

VarCorr(m_rand) #the estimated sd for the month coeff

##  Groups   Name        Std.Dev.
##  month    (Intercept) 0.87720 
##  Residual             0.98016

sd(c(0,0.5,1.2,-0.9)) #the simulated one, not too bad!

## [1] 0.8831761

#now a lmm with month as both fixed and random term
m_fixedrand<-lmer(biom~temp+month+(1|month),data)
fixef(m_fixedrand)

## (Intercept)        temp      month2      month3      month4 
##   0.9567796   2.0654349   0.4307483   1.2649599  -0.8925088

ranef(m_fixedrand) #very, VERY small

## $month
##     (Intercept)
## 1  0.000000e+00
## 2  1.118685e-15
## 3 -9.588729e-16
## 4  5.193895e-16

VarCorr(m_fixedrand)

##  Groups   Name        Std.Dev.
##  month    (Intercept) 0.40397 
##  Residual             0.98018

#how does it affect the estimation of the fixed effect coefficients?
summary(m_fixed)$coefficients

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  0.9567796  0.2039313  4.691676 9.080522e-06
## temp         2.0654349  0.1048368 19.701440 2.549792e-35
## month2       0.4307483  0.2862849  1.504614 1.357408e-01
## month3       1.2649599  0.2772677  4.562233 1.511379e-05
## month4      -0.8925088  0.2789932 -3.199035 1.874375e-03

summary(m_fixedrand)$coefficients

##               Estimate Std. Error    t value
## (Intercept)  0.9567796  0.4525224  2.1143256
## temp         2.0654349  0.1048368 19.7014396
## month2       0.4307483  0.6390118  0.6740851
## month3       1.2649599  0.6350232  1.9919901
## month4      -0.8925088  0.6357784 -1.4038048

#the numeric estimates are not affected, but the standard errors of the intercept
#and of the month coefficients are doubled; this makes it less likely that a
#significant p-value will be found for these effects, ie a higher chance to infer
#that there is no month effect when there is one

#and what if we simulate data as the model assumes, ie a fixed effect of month
#plus a random component added on top of it?
rnd.eff<-rnorm(4,0,1.2)
mus<-modmat%*%eff+rnd.eff[data$month]
data$biom2<-rnorm(100,mus,1)
#an lmm model
m_fixedrand2<-lmer(biom2~temp+month+(1|month),data)
fixef(m_fixedrand2) #weird coeff values for the fixed effect for month

## (Intercept)        temp      month2      month3      month4 
##   -2.064083    2.141428    1.644968    4.590429    3.064715

c(0,eff[3:5])+rnd.eff #if we look at the intervals between the intercept and the different levels we can realize that the fixed effect part of the model sucked in the added random part

## [1] -2.66714133 -1.26677658  1.47977624  0.02506236

VarCorr(m_fixedrand2)

##  Groups   Name        Std.Dev.
##  month    (Intercept) 0.74327 
##  Residual             0.93435

ranef(m_fixedrand2) #again very VERY small

## $month
##     (Intercept)
## 1  1.378195e-15
## 2  7.386264e-15
## 3 -2.118975e-14
## 4 -7.752347e-15

#so this is basically not working: it does not make sense to have a grouping factor
#both as a fixed-effect term and as a random term (ie on the right-hand side of the |)

Take-home message: don't put a grouping factor as both a fixed and a random term in your mixed-effect model. lmer is not able to separate the fixed from the random part of the effect (and I don't know how it could be done) and basically gives everything to the fixed effect, leaving very small random effects. The issue is a bit pernicious, because if you only looked at the standard deviation of the random term in the merMod summary output you could not have guessed that something is wrong. You need to actually look at the random effects to realize that they are incredibly small. So beware, when building complex models with many fixed and random terms, to always check the estimated random effects.
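A quick programmatic version of this check (a sketch using the m_fixedrand model fitted above):

#flag suspiciously small random effects: here all the BLUPs are numerically zero
all(abs(unlist(ranef(m_fixedrand)))<1e-10) #TRUE, a red flag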

Now this is only true for factor variables. If you have a continuous variable (say year) that affects your response both through a long-term trend and through some between-level (between-year) variation, it actually makes sense (provided you have enough data points) to fit a model with this variable as both a fixed and a random term. Let's look into this:


#an example of a situation measuring plant biomass in 10 different years along a gradient of temperature
set.seed(10)
data<-data.frame(temp=runif(100,-2,2),year=rep(1:10,each=10))
modmat<-model.matrix(~temp+year,data)
#the coefficient
eff<-c(1,2,-1.5)
rnd_eff<-rnorm(10,0,0.5)
data$y<-rnorm(100,modmat%*%eff+rnd_eff[data$year],1)
#a simple lm
m_fixed<-lm(y~temp+year,data)
summary(m_fixed)

Call:
lm(formula = y ~ temp + year, data = data)

Residuals:
Min       1Q   Median       3Q      Max
-2.27455 -0.83566 -0.03557  0.92881  2.74613

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.41802    0.25085   5.653 1.59e-07 ***
temp         2.11359    0.11230  18.820  < 2e-16 ***
year        -1.54711    0.04036 -38.336  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.159 on 97 degrees of freedom
Multiple R-squared:  0.9508,    Adjusted R-squared:  0.9498
F-statistic: 937.1 on 2 and 97 DF,  p-value: < 2.2e-16
coef(m_fixed) #not too bad

(Intercept)        temp        year
1.418019    2.113591   -1.547111
#a lmm
m_rand<-lmer(y~temp+year+(1|factor(year)),data)
summary(m_rand)

Linear mixed model fit by REML ['lmerMod']
Formula: y ~ temp + year + (1 | factor(year))
Data: data

REML criterion at convergence: 304.8

Scaled residuals:
Min      1Q  Median      3Q     Max
-2.0194 -0.7775  0.1780  0.6733  1.9903

Random effects:
Groups       Name        Variance Std.Dev.
factor(year) (Intercept) 0.425    0.6519
Residual                 1.004    1.0019
Number of obs: 100, groups:  factor(year), 10

Fixed effects:
Estimate Std. Error t value
(Intercept)  1.40266    0.49539   2.831
temp         2.01298    0.10191  19.752
year        -1.54832    0.07981 -19.400

Correlation of Fixed Effects:
(Intr) temp
temp  0.031
year -0.885  0.015
fixef(m_rand) #very close to the lm estimation

(Intercept)        temp        year
1.402660    2.012976   -1.548319
plot(rnd_eff,ranef(m_rand)[[1]][,1])
abline(0,1) #not too bad
VarCorr(m_rand) #the estimated sd for the within-year variation

Groups       Name        Std.Dev.
factor(year) (Intercept) 0.65191
Residual                 1.00189

Interestingly, we see in this case that the standard errors of the intercept and of the year slope are twice as big in the LMM compared to the LM (with correspondingly smaller t-values). This means that not taking the between-year random variation into account leads us to over-estimate the precision of the long-term temporal trend (we might conclude that there is a significant effect when there is a lot of unaccounted noise). I still don't fully grasp how this works, but thanks to the commenters for pointing this out.

The second issue is maybe a bit older, but I saw it appear in a recent paper (which is a cool one except for this stats detail). After fitting a model with several predictors, one wants to plot their effects on the response; some people use partial-residual plots to do this (wiki). The issue with these plots is that when two variables have a high covariance, the partial-residual plot will tend to be over-optimistic concerning the effect of variable X on Y (ie the plot will look much nicer than it should). Again, let's do a little simulation on this:


library(MASS)
set.seed(20150830)
#say we measure plant biomass in relation to measured temperature and the number of sunny hours per week
#the variance-covariance matrix between temperature and sunny hours
sig<-matrix(c(2,0.7,0.7,10),ncol=2,byrow=TRUE)
#simulate some data
xs<-mvrnorm(100,c(5,50),sig)
data<-data.frame(temp=xs[,1],sun=xs[,2])
modmat<-model.matrix(~temp+sun,data)
eff<-c(1,2,0.2)
data$biom<-rnorm(100,modmat%*%eff,0.7)

m<-lm(biom~temp+sun,data)
sun_new<-data.frame(sun=seq(40,65,length=20),temp=mean(data$temp))
#partial residual plot of sun
sun_res<-resid(m)+coef(m)[3]*data$sun
plot(data$sun,sun_res,xlab="Number of sunny hours",ylab="Partial residuals of Sun")
lines(sun_new$sun,coef(m)[3]*sun_new$sun,lwd=3,col="red")

[Figure Annoy1: partial-residual plot for the number of sunny hours]


#plot of sun effect while controlling for temp
pred_sun<-predict(m,newdata=sun_new)
plot(biom~sun,data,xlab="Number of sunny hours",ylab="Plant biomass")
lines(sun_new$sun,pred_sun,lwd=3,col="red")

[Figure Annoy2: plant biomass vs sunny hours with the fitted line controlling for temperature]


#same stuff for temp
temp_new<-data.frame(temp=seq(1,9,length=20),sun=mean(data$sun))
pred_temp<-predict(m,newdata=temp_new)
plot(biom~temp,data,xlab="Temperature",ylab="Plant biomass")
lines(temp_new$temp,pred_temp,lwd=3,col="red")

[Figure Annoy3: plant biomass vs temperature with the fitted line controlling for sunny hours]

The first graph is a partial-residual plot; from this graph alone we would be tempted to say that the number of sunny hours has a large influence on biomass. This conclusion is biased by the fact that the number of sunny hours covaries with temperature, and temperature has a large influence on plant biomass. So which is more important, temperature or sun? The way to resolve this is to plot the actual observations and to add a fitted regression line computed from a new dataset (sun_new in the example) where one variable is allowed to vary while all the others are fixed to their means. This way we see how an increase in the number of sunny hours at average temperature affects the biomass (the second figure). The final graph then shows the effect of temperature while controlling for the effect of the number of sunny hours.

Happy modelling!

Count data: To Log or Not To Log

Count data are widely collected in ecology, for example when one counts the number of birds or the number of flowers. These data naturally follow a Poisson or negative binomial distribution and are therefore sometimes tricky to fit with standard LMs. A traditional approach has been to log-transform such data and then fit LMs to the transformed data. Recently a paper advocated against the use of such transformations, since they led to high bias in the estimated coefficients. More recently another paper argued that log-transformation of count data followed by an LM led to a lower type I error rate (ie saying that an effect is significant when it is not) than GLMs. What should we do then?

Using a slightly modified version of the code published in the Ives 2015 paper, let's explore the impact of these different modelling strategies on the rejection of the null hypothesis and on the bias of the estimated coefficients.


#load the libraries
library(MASS)
library(lme4)
library(ggplot2)
library(RCurl)
library(plyr)

#download and load the functions that will be used
URL<-"https://raw.githubusercontent.com/Lionel68/Jena_Exp/master/stats/ToLog_function.R"
download.file(URL,destfile=paste0(getwd(),"/ToLog_function.R"),method="curl")
source("ToLog_function.R")

Following the paper from Ives and the code therein, I simulated a predictor (x) and a response (y) that follows a negative binomial distribution and is linearly related to x.
In the first case I look at the impact of varying the sample size on the rejection of the null hypothesis and on the bias in the estimated coefficient between y and x.


######univariate NB case############
base_theme<-theme(title=element_text(size=20),text=element_text(size=18))
#range over n
output<-compute.stats(NRep=500,n.range = c(10,20,40,80,160,320,640,1280))
ggplot(output,aes(x=n,y=Reject,color=Model))+geom_path(size=2)+scale_x_log10()+labs(x="Sample Size (log-scale)",y="Proportion of Rejected H0")+base_theme
ggplot(output,aes(x=n,y=Bias,color=Model))+geom_path(size=2)+scale_x_log10()+labs(x="Sample Size (log-scale)",y="Bias in the slope coefficient")+base_theme

[Figure 1: proportion of rejected H0 (top) and bias in the slope coefficient (bottom) as a function of sample size]

For this simulation round the coefficient of the slope (b1) was set to 0 (no effect of x on y), and only the sample size varied. The top panel shows the average proportion of times that the p-value of the slope coefficient was lower than 0.05 (H0: b1=0 rejected). We see that for low sample sizes (<40) the Negative Binomial model has a higher proportion of rejected H0 (type I error rate), but this difference between the models disappears once we reach sample sizes bigger than 100. The bottom panel shows the bias (estimated value – true value) in the estimated coefficient. For very low sample size (n=10), Log001, Negative Binomial and Quasipoisson have higher bias than Log1 and LogHalf. For larger sample sizes the difference between the GLM team (NB and QuasiP) and the LM one (Log1 and LogHalf) gradually decreases, and both teams converge to a bias around 0. Only Log0001 behaved very badly. From what we saw here, it seems that Log1 and LogHalf are good choices for count data: they have low type I error and low bias along the whole sample size gradient.

The issue is that an effect of exactly 0 never exists in real life, where most effects are small (but non-zero), and thus the null hypothesis will never be true. Let's look now at how the different models behave when we vary b1, first alone and then crossed with sample size variation.


#range over b1
outpu<-compute.stats(NRep=500,b1.range = seq(-2,2,length=17))
ggplot(outpu,aes(x=b1,y=Reject,color=Model))+geom_path(size=2)+base_theme+labs(y="Proportion of Rejected H0")
ggplot(outpu,aes(x=b1,y=Bias,color=Model))+geom_path(size=2)+base_theme+labs(y="Bias in the slope coefficient")

[Figure 2: proportion of rejected H0 (top) and bias in the slope coefficient (bottom) as a function of b1]

Here the sample size was set to 100. What we see in the top graph is that for a slope of exactly 0 all models have a similar average proportion of rejection of the null hypothesis. As b1 becomes smaller or bigger, the average proportion of rejection shows a very similar increase for all models except for Log0001, which increases more slowly. These curves basically represent the power of the models to detect an effect and are very similar to Fig. 2 in the Ives 2015 paper. The bottom panel shows that all the LM models behave badly concerning their bias: they have small bias only for very small (close to 0) coefficients, and as the coefficient gets bigger the absolute bias increases. This means that even if the LM models are able to detect an effect with similar power, the estimated coefficient is wrong. This is due to the value added to the untransformed count data in order to avoid -Inf for 0s. I have no idea how one might arithmetically take these added values into account and then remove their effect …
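To see this added-constant bias in isolation, here is a small self-contained sketch (the names and parameter values are mine, not from the Ives code):

library(MASS)
set.seed(1)
x<-runif(1000,-1,1)
#negative binomial response with a true slope of 0.5 on the log scale
y<-rnegbin(1000,mu=exp(1+0.5*x),theta=1)
coef(lm(log(y+1)~x))["x"] #attenuated estimate, below the true 0.5
coef(glm.nb(y~x))["x"] #close to the true 0.5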

Next let's cross the variation in the coefficient with the variation in sample size:

#range over n and b1
output3<-compute.stats(NRep=500,b1.range=seq(-1.5,1.5,length=9),n.range=c(10,20,40,80,160,320,640,1280))
ggplot(output3,aes(x=n,y=Reject,color=Model))+geom_path(size=2)+scale_x_log10()+facet_wrap(~b1)+base_theme+labs(x="Sample size (log-scale)",y="Proportion of rejected H0")
ggplot(output3,aes(x=n,y=Bias,color=Model))+geom_path(size=2)+scale_x_log10()+facet_wrap(~b1)+base_theme+labs(x="Sample size (log-scale)",y="Bias in the slope coefficient")

[Figure 3: proportion of rejected H0 (top) and bias (bottom) as a function of sample size, faceted by b1]

The top panel shows one big issue of focusing only on the significance level: rejection of H0 depends not only on the size of the effect but also on the sample size. For example, for b1=0.75 (a rather large value since we work on the exponential scale), fewer than 50% of all models rejected the null hypothesis for a sample size of 10. Of course, as the effect size gets larger the impact of sample size on the rejection of the null hypothesis is reduced. However, most effects out there are small, so we need big sample sizes to be able to “detect” them using null hypothesis testing. The top graph also shows that NB is slightly better than the other models and that Log0001 again has the worst performance. The bottom graphs show something interesting: the bias is quasi-constant over the sample size gradient (maybe if we looked closer we would see some variation). Irrespective of how many data points you collect, the LMs will always have bigger bias than the GLMs (except for the artificial case of b1=0).

To finish with, in Ives 2015 the big surprise was the explosion of type I error rates with increasing variation in individual-level random error (adding a normally distributed random value to the linear predictor of each data point and varying the standard deviation of these random values), as can be seen in Fig. 3 of the paper.


#range over b1 and sd.eps
output4<-compute.statsGLMM(NRep=500,b1.range=seq(-1.5,1.5,length=9),sd.eps.range=seq(0.01,2,length=10))
ggplot(output4,aes(x=sd.eps,y=Reject,color=Model))+geom_path(size=2)+facet_wrap(~b1)+base_theme+labs(x="",y="Proportion of rejected H0")
ggplot(output4,aes(x=sd.eps,y=Bias,color=Model))+geom_path(size=2)+facet_wrap(~b1)+base_theme+labs(x="Standard Deviation of the random error",y="Bias of the slope coefficient")

Figure 4

Before looking at the figure in detail, please note that a standard deviation of 2 in this context is very high; remember that these values are added to the linear predictor, which is then exponentiated, so we end up with very large deviations. In the top panel there are two surprising results: the sign of the coefficient affects the pattern of null hypothesis rejection, and I do not see the explosion of rejection rates for NB or QuasiP that is presented in Ives 2015. In his paper Ives reported the LRT test for the NB models, whereas I am reporting the p-values from the model summary directly (Wald test); the two are sketched below. If some people around have computing power it would be interesting to see whether changing the seed and/or increasing the number of replications leads to different patterns … The bottom panel shows again that the LMs’ bias is big; the NB and QuasiP models show an increase in bias with the standard deviation, but only when the coefficient is negative (I suspect some issue with the exponentiation of large positive random errors); as expected, the GLMM performs best in this context.
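
For completeness, here is what the two tests look like on one simulated dataset (a hedged sketch; the data generation, seed and object names are mine, not those of the Ives 2015 code):

library(MASS)
set.seed(42) #arbitrary seed
x<-runif(100,-1,1)
y<-rnegbin(100,mu=exp(0.5+0.75*x),theta=2)
m1<-glm.nb(y~x)
m0<-glm.nb(y~1)
summary(m1)$coefficients["x","Pr(>|z|)"] #Wald test, what I report here
anova(m0,m1) #likelihood ratio test, as reported in Ives 2015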

Pre-conclusion: in real life, of course, we would rarely have a model with only one predictor; most of the time we build larger models with complex interaction structures between the predictors. This will of course affect both H0 rejection and bias, but this is material for a next post🙂

Let’s wrap it up: we’ve seen that even if LM transformations seem to be a good choice for having a lower type I error rate than GLMs, this advantage will be rather minimal when using empirical data (no effects are exactly 0) and potentially dangerous (large bias). Ecologists sometimes have the bad habit of turning their analysis into a star hunt (the standard R model output gives stars to significant effects), and focusing only on models that behave best towards significance (but have large bias) does not seem a good strategy to me. More and more people call for increasing the predictive power of ecological models; we then need modelling techniques that are able to estimate the effects precisely (low bias). In this context, transforming the data to make them somehow fit normality assumptions is sub-optimal; it is much more natural to think about what type of process generated the data (normal, Poisson, negative binomial, with or without hierarchical structure) and then model it accordingly. There are extensive discussions nowadays about the use and abuse of p-values in science, and I think that in our analyses we should slowly but surely shift our focus from “significance/p-values<0.05/star hunting” only to a more balanced mix of effect sizes (or standardized slopes), p-values and R-squared.

Confidence Intervals for prediction in GLMMs

With LM and GLM the predict function can return the standard error of the predicted values, on either the observed data or new data. This is then used to draw confidence or prediction intervals around the fitted regression lines. The confidence interval (CI) focuses on the regression line and can be interpreted as (assuming that we draw a 95% CI): “If we repeated our sampling X times, the regression line would fall within this interval 95% of the time”. The prediction interval (PI), on the other hand, focuses on single data points and can be interpreted as (again assuming a 95% interval): “If we sampled X times at these particular values of the explanatory variables, the response values would fall within this interval 95% of the time”. The wikipedia page has some nice explanations about the meaning of confidence intervals.
For GLMM the predict function does not allow one to derive standard errors, the reason being (from the help page of predict.merMod): “There is no option for computing standard errors of predictions because it is difficult to define an efficient method that incorporates uncertainty in the variance parameters”. This means there is for now no way to include in the computation of the standard errors of predicted values the fact that the fitted random effect standard deviations are just estimates that may be more or less well estimated. We can however still derive confidence or prediction intervals, keeping in mind that we might underestimate the uncertainty around the estimates.
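
As a quick reminder of what we are trying to reproduce, with a plain lm both intervals come for free from predict (a sketch on simulated data; all names are mine):

x0<-runif(100,0,10)
y0<-rnorm(100,1+0.5*x0,1)
ml<-lm(y0~x0)
newd<-data.frame(x0=seq(0,10,length=20))
head(predict(ml,newd,interval="confidence")) #CI around the regression line
head(predict(ml,newd,interval="prediction")) #PI around new observations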


library(lme4)

#first case simple lmer, simulate 100 data points from 10 groups with one continuous fixed effect variable
x<-runif(100,0,10)
f<-gl(n = 10,k = 10)
data<-data.frame(x=x,f=f)
modmat<-model.matrix(~x,data)
#the fixed effect coefficient
fixed<-c(1,0.5)
#the random effect
rnd<-rnorm(10,0,0.7)
#the simulated response values
data$y<-rnorm(100,modmat%*%fixed+rnd[f],0.3)

#model
m<-lmer(y~x+(1|f),data)

#first CI and PI using predict-like method, using code posted here: http://glmm.wikidot.com/faq
newdat<-data.frame(x=seq(0,10,length=20))
mm<-model.matrix(~x,newdat)
newdat$y<-mm%*%fixef(m) 
#predict(m,newdat,re.form=NA) would give the same results
pvar1 <- diag(mm %*% tcrossprod(vcov(m),mm))
tvar1 <- pvar1+VarCorr(m)$f[1] # must be adapted for more complex models
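#pvar1 is the variance of the fitted line from the fixed effects only -> used for the CI below
#tvar1 additionally includes the random-effect variance -> used for the 'prediction interval' below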
newdat <- data.frame(
  newdat
  , plo = newdat$y-1.96*sqrt(pvar1)
  , phi = newdat$y+1.96*sqrt(pvar1)
  , tlo = newdat$y-1.96*sqrt(tvar1)
  , thi = newdat$y+1.96*sqrt(tvar1)
)

#second version with bootMer
#we have to define a function that will be applied to the nsim simulations
#here we basically get a merMod object and return the fitted values
predFun<-function(.) mm%*%fixef(.) 
bb<-bootMer(m,FUN=predFun,nsim=200) #do this 200 times
#as we did this 200 times the 95% CI will be bordered by the 5th and 195th value
bb_se<-apply(bb$t,2,function(x) x[order(x)][c(5,195)])
newdat$blo<-bb_se[1,]
newdat$bhi<-bb_se[2,]

plot(y~x,data)
lines(newdat$x,newdat$y,col="red",lty=2,lwd=3)
lines(newdat$x,newdat$plo,col="blue",lty=2,lwd=2)
lines(newdat$x,newdat$phi,col="blue",lty=2,lwd=2)
lines(newdat$x,newdat$tlo,col="orange",lty=2,lwd=2)
lines(newdat$x,newdat$thi,col="orange",lty=2,lwd=2)
lines(newdat$x,newdat$bhi,col="darkgreen",lty=2,lwd=2)
lines(newdat$x,newdat$blo,col="darkgreen",lty=2,lwd=2)
legend("topleft",legend=c("Fitted line","Confidence interval","Prediction interval","Bootstrapped CI"),col=c("red","blue","orange","darkgreen"),lty=2,lwd=2,bty="n")

bootGLMM1

This looks pretty familiar, the prediction interval always being bigger than the confidence interval.
Now, in the help page for the predict.merMod function, the authors of the lme4 package wrote that bootMer should be the preferred method to derive confidence intervals from GLMMs. The idea there is to simulate new data from the model N times and compute some statistic of interest on each of these simulations. In our case we are interested in the bootstrapped fitted values, to get a confidence interval for the regression line. bb$t is a matrix with the observations in the columns and the different bootstrapped samples in the rows. To get the 95% CI for the fitted line we then need to get the [0.025*N, 0.975*N] values of the sorted bootstrapped values.

The bootstrapped CI falls pretty close to the “normal” CI, even though for each bootstrapped sample new random effect values were drawn (because use.u=FALSE by default in bootMer).
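
Note, as a side remark, that the same interval can be obtained a bit more directly with quantile, which also generalizes to any number of simulations (a sketch, reusing the bb object from above):

#equivalent to picking the 5th and 195th sorted values when nsim=200
bb_se<-apply(bb$t,2,quantile,probs=c(0.025,0.975))
newdat$blo<-bb_se[1,]
newdat$bhi<-bb_se[2,]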

Now let’s turn to a more complex example, a Poisson GLMM with two crossed random effects:


#second case more complex design with two crossed RE and a poisson response
x<-runif(100,0,10)
f1<-gl(n = 10,k = 10)
f2<-as.factor(rep(1:10,10))
data<-data.frame(x=x,f1=f1,f2=f2)
modmat<-model.matrix(~x,data)

fixed<-c(-0.12,0.35)
rnd1<-rnorm(10,0,0.7)
rnd2<-rnorm(10,0,0.2)

mus<-modmat%*%fixed+rnd1[f1]+rnd2[f2]
data$y<-rpois(100,exp(mus))

m<-glmer(y~x+(1|f1)+(1|f2),data,family="poisson")

#for GLMMs we have to back-transform the prediction after adding/removing the SE
newdat<-data.frame(x=seq(0,10,length=20))
mm<-model.matrix(~x,newdat)
y<-mm%*%fixef(m)
pvar1 <- diag(mm %*% tcrossprod(vcov(m),mm))
tvar1 <- pvar1+VarCorr(m)$f1[1]+VarCorr(m)$f2[1]  ## must be adapted for more complex models
newdat <- data.frame(
  x = newdat$x
  , y = exp(y)
  , plo = exp(y-1.96*sqrt(pvar1))
  , phi = exp(y+1.96*sqrt(pvar1))
  , tlo = exp(y-1.96*sqrt(tvar1))
  , thi = exp(y+1.96*sqrt(tvar1))
)

#second version with bootMer
predFun<-function(.) exp(mm%*%fixef(.)) 
bb<-bootMer(m,FUN=predFun,nsim=200)
bb_se<-apply(bb$t,2,function(x) x[order(x)][c(5,195)])
newdat$blo<-bb_se[1,]
newdat$bhi<-bb_se[2,]

#plot
plot(y~x,data)
lines(newdat$x,newdat$y,col="red",lty=2,lwd=3)
lines(newdat$x,newdat$plo,col="blue",lty=2,lwd=2)
lines(newdat$x,newdat$phi,col="blue",lty=2,lwd=2)
lines(newdat$x,newdat$tlo,col="orange",lty=2,lwd=2)
lines(newdat$x,newdat$thi,col="orange",lty=2,lwd=2)
lines(newdat$x,newdat$bhi,col="darkgreen",lty=2,lwd=2)
lines(newdat$x,newdat$blo,col="darkgreen",lty=2,lwd=2)
legend("topleft",legend=c("Fitted line","Confidence interval","Prediction interval","Bootstrapped CI"),col=c("red","blue","orange","darkgreen"),lty=2,lwd=2,bty="n")

bootGLMM2

Again in this case the bootstrapped CI fell pretty close to the “normal” CI. We have now seen three different ways to derive intervals representing the uncertainty around the regression lines (CI) and around the response points (PI). Choosing among them depends on what you want to see (what is the level of uncertainty around my fitted line vs. if I sample new observations, which values will they take), but also, for complex models, on computing power, as bootMer can take some time to run for GLMMs with many observations and complex model structures.
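
As a side note, if one wanted a bootstrapped CI conditional on the fitted random effect values instead of drawing new ones each time, bootMer has the use.u argument (a sketch reusing the m and predFun objects from above):

#condition on the estimated random effects instead of simulating new ones
bb_c<-bootMer(m,FUN=predFun,nsim=200,use.u=TRUE)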

Exploration of Functional Diversity indices using Shiny

Biological diversity (or biodiversity) is a complex concept with many different aspects to it, like species richness, evenness or functional redundancy. My field of research focuses on understanding the effects of changing plant diversity on higher trophic level communities but also on ecosystem function. Even if the founding papers of this area of research already hypothesized that all components of biodiversity could influence ecosystem function (see Fig. 1 in Chapin et al 2000), the first experimental results focused on taxonomic diversity (i.e. species richness, Shannon diversity, Shannon evenness …). More recently the importance of functional diversity as the main driver of changes in ecosystem function has been emphasized by several authors (e.g. Reiss et al 2009). The idea behind functional diversity is basically to measure a characteristic (a trait) of the organisms under study, for example the height of a plant or the body mass of an insect, and then derive an index of how diverse these trait values are at a particular site. A nice introduction to the topic is the chapter by Evan Weiher in Biological Diversity.

Now, just as taxonomic diversity has many different indices, so does functional diversity. Recent developments of a new multidimensional framework and of an R package allow researchers to easily derive functional diversity indices from their dataset. But finding the right index for one’s system can be rather daunting, and as several studies showed there is no single best index (see Weiher Ch. 13 in Biological Diversity) but rather a set of different indices, each showing a different facet of functional diversity, like functional richness, functional evenness or functional divergence.

Here I show a little Shiny App to graphically explore, in a 2D trait space, the meaning of a set of functional diversity indices. The App is still in its infancy and many things could be added (e.g. variation in trait distributions …) but here it is:

#load libraries
library(shiny)
library(MASS)
library(geometry)
library(plotrix)
library(FD)

#launch App
runGitHub("JenaExp",username = "Lionel68",subdir = "FD/Shiny_20150426")

All codes are available here: https://github.com/Lionel68/JenaExp/tree/master/FD
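
If you prefer to start from your own data rather than from the App, here is a minimal hedged sketch of deriving the indices with the dbFD function of the FD package (the trait and abundance matrices are randomly generated and all names are mine):

library(FD)
set.seed(42) #arbitrary seed
#a species x trait matrix
traits<-matrix(rnorm(20*2),ncol=2,dimnames=list(paste0("Sp",1:20),c("Trait1","Trait2")))
#a site x species abundance matrix (+1 so that every species occurs somewhere)
abund<-matrix(rpois(5*20,2)+1,nrow=5,dimnames=list(paste0("Site",1:5),paste0("Sp",1:20)))
fd<-dbFD(traits,abund)
fd$FRic;fd$FEve;fd$FDiv #functional richness, evenness and divergence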

Literature:

Chapin III, F. Stuart, et al. “Consequences of changing biodiversity.” Nature 405.6783 (2000): 234-242.

Reiss, Julia, et al. “Emerging horizons in biodiversity and ecosystem functioning research.” Trends in ecology & evolution 24.9 (2009): 505-514.

Weiher, E. “A primer of trait and functional diversity.” Biological diversity, frontiers in measurement and assessment (2011): 175-193.

Villéger, Sébastien, Norman WH Mason, and David Mouillot. “New multidimensional functional diversity indices for a multifaceted framework in functional ecology.” Ecology 89.8 (2008): 2290-2301.

A function to help graphical model checks of lm and ANOVA

As always a more colourful version of this post is available on rpubs.

Even if LMs are very simple models at the basis of many more complex ones, they still have assumptions that, if not met, would render any interpretation of the model plainly wrong. In my field of research most people were taught to check ANOVA assumptions using tests like Levene & co. This is however not the best way to check whether a model meets its assumptions, as p-values depend on sample size: with small sample sizes we will almost never reject the null hypothesis, while with big samples even small deviations will lead to significant p-values (discussion). As ANOVA and linear models are two different ways to look at the same model (explanation), we can check ANOVA assumptions using graphical checks of a linear model. In R this is easily done using plot(model), but people often ask me what amount of deviation makes me reject a model. One easy way to see if the model-checking graphs are off the charts is to simulate data from the model, fit the model to these newly simulated data and compare the graphical checks from the simulated data with those from the real data. If you cannot differentiate between the simulated and the real data, then your model is fine; if you can, then try again!

EDIT: You can also make the conclusions from your visual inspection more rigorous by following the protocols outlined in this article. (Thanks to Tim for the comment)

Below is a little function that implements this idea:


lm.test<-function(m){
  require(plyr)
  #the model frame
  dat<-model.frame(m)
  #the model matrix
  f<-formula(m)
  modmat<-model.matrix(f,dat)
  #the standard deviation of the residuals
  sd.resid<-sd(resid(m))
  #sample size
  n<-dim(dat)[1]
  #simulate 8 response vectors from model
  ys<-lapply(1:8,function(x) rnorm(n,modmat%*%coef(m),sd.resid))
  #refit the models
  ms<-llply(ys,function(y) lm(y~modmat[,-1]))
  #put the residuals and fitted values in a list
  df<-llply(ms,function(x) data.frame(Fitted=fitted(x),Resid=resid(x)))
  #select a random number from 2 to 8
  rnd<-sample(2:8,1)
  #put the original data into the list
  df<-c(df[1:(rnd-1)],list(data.frame(Fitted=fitted(m),Resid=resid(m))),df[rnd:8])

  #plot 
  par(mfrow=c(3,3))
  l_ply(df,function(x){
    plot(Resid~Fitted,x,xlab="Fitted",ylab="Residuals")
    abline(h=0,lwd=2,lty=2)
  })

  l_ply(df,function(x){
    qqnorm(x$Resid)
    qqline(x$Resid)
  })

  out<-list(Position=rnd)
  return(out)
}


This function prints the two basic plots: one looking at the spread of the residuals around the fitted values, the other looking at the normality of the residuals. The function returns the position of the real model in the 3×3 window, counting from left to right and from top to bottom (i.e. position 1 is the upper left graph).
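
As a side note, base R already provides a simulate method for lm objects that does essentially the same job as the rnorm call inside the function (it uses the df-adjusted residual standard error rather than sd(resid(m))); a minimal sketch:

ys<-simulate(m,nsim=8) #a data.frame with 8 simulated response vectors (sim_1 ... sim_8)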

Let’s try the function:

#a simulated data frame of independent variables
dat<-data.frame(Temp=runif(100,0,20),Treatment=gl(n = 5,k = 20))
contrasts(dat$Treatment)<-"contr.sum"
#the model matrix
modmat<-model.matrix(~Temp*Treatment,data=dat)
#the coefficient
coeff<-rnorm(10,0,4)
#simulate response data
dat$Biomass<-rnorm(100,modmat%*%coeff,1)
#the model
m<-lm(Biomass~Temp*Treatment,dat)
#model check
chk<-lm.test(m)


lm.test1

lm.test2

Can you find which one is the real one? I could not; here is the answer:


chk
$Position
[1] 4

Happy and safe modelling!