Review of the Fun from 2016

One year has passed; now is the time to look back at my activity for 2016, explore some stats and make some blogging resolutions for 2017. This post was inspired by Simon Leather’s roundabout review of the year, especially the stats part.

Looking back:

In 2016 I posted 9 articles, which is pretty similar to 2015 (11 articles) but rather infrequent for a blog. Part of this can be explained by the nice opportunity that arose in late 2015 to post regular R tutorials directly at DataScience+; I was really happy to contribute to this and be part of a really diverse team of writers. Some posts over there were pretty successful, like this one or this one. In March I decided to spend some time on this blog again, but to be frank I did not have that many ideas for R tutorials any more. Around that time I also started to write my PhD thesis, which left me with limited motivation to spend extra time writing blog posts. Later on, in May, some inspiration came back, and between May and December I posted once per month.

Some stats:

Of the 9 articles posted in 2016, two were purely about ecological themes, three (my favourites) presented some simple ecological models and simulated their dynamics in R, and four looked like R tutorials but were pretty specialized. Of the nine articles published this year, the one with the most views was the post where I showed how to fit simple hierarchical models with RStan. In terms of overall views this article was the 8th most visited article this year with 591 views, well behind the two top posts: GLMM for ecologists and a post on interpreting interaction coefficients, each one having more than 3,000 views. Looking at the temporal dynamics of the number of monthly views over the last three years, we see an increase up to mid-2015, a decline up to mid-2016 and an increase until December 2016.

[Figure: number of monthly views with a fitted smoothed curve.]

I guess this reflects pretty well the fact that I stopped posting between November 2015 and May 2016; blog views increase only if new material is regularly posted (no way!). Let’s go further and look at how these views were distributed between the posts; Rank-Abundance Distributions are a great way to visualize this.

[Figure: Rank-Abundance Distribution. On the x-axis are the post ranks from most viewed (1) to least viewed; on the y-axis is the log10 of the number of views. If all posts got the same number of views we would see a horizontal line.]

In this graph we see the increase in the proportion of views going to the most-read posts in 2016 compared to 2015 and 2014. Basically a third of all views in 2016 were on only two posts out of the ~70 posts. It seems that successful posts become more successful over time; thank you, Google Bots. Finally, WordPress also gives me the number of views per country, with 147 countries represented in 2016. Let’s look at the proportion of views (within years) coming from the top 10 countries.

[Figure: proportion of views per year from the top 10 countries.]

Around a third of the views came from the US, and this stayed rather constant between 2014 and 2016. One out of 10 views was from Germany in 2014; this proportion dropped to one out of 20 in 2016. At the same time the proportion of views from the UK kept rising. Readership on this blog is mostly from western countries, but I am always pretty happy when I see some connections from countries I was not aware of, like Guam, Malawi or the Macao Special Administrative Region of the People’s Republic of China.

Resolutions for 2017:

Most readers come to this blog interested in R stuff. I do have a lot of fun writing these posts, and it is always a challenge to explain (complex) concepts. For instance, you think that you know what an interaction term is, but then you realize that you do not really know how to convey your (vague) understanding; to explain it, you first need to develop your own understanding. All this to say that I will continue posting R-related material, but I intend to have more and more posts about what I actually do (ecology) and maybe some advice posts for students and PhD candidates. I am also thinking of redesigning the homepage a bit, moving towards a more professional website like this one, and of posting a bit more regularly, like twice a month. Thanks for reading, thanks for commenting; it is always satisfying to hear that the things I write are actually useful to some. Happy 2017!

Spatial Dynamics in a Host-Parasitoid System

In this post I will explore the spatial dynamics that arise from a simple deterministic host-parasitoid model with migration in a homogeneous environment. If the previous sentence did not drive you away, congrats(!), you are most certainly in a terminal state towards biological nerditude.

The idea of this post emerged while reading Chp. 6 in Hanski and Gilpin’s Metapopulation book. In this chapter Nee and colleagues expand the classical one-species metapopulation models to models with interacting species. In one example they discuss their 1992 paper, where they took a simple host-parasitoid system and applied it in a metapopulation context where local populations (patches) are linked by migration between adjacent patches. Their main result is that a system that is unstable in a single local population becomes stable at the metapopulation level (does this sound familiar?). In other words, local host-parasitoid systems that go extinct in isolation show persistence of both species when local populations exchange migrants. Pretty cool stuff; of course along the way they also show some beautiful spatial patterns, but that is secondary (isn’t it?).

Here is the model:

N_{i,t} = J_{i,t-1}  *  e^{-a*P_{i,t-1}}

Q_{i,t} = c * (J_{i,t-1} - N_{i,t})

J_{i,t} = \lambda * (N_{i,t} * (1 - \mu_{n}) + \mu_{n}  * \sum_{j=1}^{8} \frac{N_{j,t}}{8})

P_{i,t} = Q_{i,t} * (1 - \mu_{q}) + \mu_{q}  * \sum_{j=1}^{8} \frac{Q_{j,t}}{8}

Where N is the abundance of pre-dispersal hosts, Q is the abundance of pre-dispersal parasitoids, J is the abundance of post-dispersal hosts, and P is the abundance of post-dispersal parasitoids. i is an index for the patches, t is the time, and j is an index for the patches neighbouring patch i (we consider here only the direct neighbours). a is the searching efficiency of the parasitoid, c is the number of parasitoids per host, \lambda is the host growth rate, and the \mu are the migration rates for the host (n) and the parasitoid (q).

Got it? Now let’s code this into R:

#Local dynamics following Nicholson-Bailey 
#host-parasitoid function with one parameter: a
#@J: matrix of juvenile abundance in the previous time step
#@P: number of adult parasitoid in the previous time step
#@a: the searching efficiency of the parasitoid
#@c: the number of parasitoid per hosts
local_dyn<-function(J,P,a=0.1,c=1){
  N <- J*exp(-a*P)
  Q <- c*(J - N)
  return(list(N=N,Q=Q))
}
#a helper function returning line numbers of all neighbouring cells
find_neighbour<-function(vec,dat=dat){
  lnb<-which(dat[,"x"]%in%(vec["x"]+
c(-1,-1,-1,0,1,1,1,0)) & dat[,"y"]%in%(vec["y"]+c(-1,0,1,1,1,0,-1,-1)))
  #remove own line
  lnb<-lnb[which(lnb!=vec["lnb"])]
  return(lnb)
}
#compute the next-generation parasitoid and hosts
#with immigration and absorbing boundaries
#@N: matrix of adult host in the current generation
#@Q: matrix of emerging parasitoid in the current generation
#@neigh: a list with the cell ID of neighbouring cells
#@mu: immigration rates
#@lbd: rate of increase of host population
#@bound: type of boundaries, either absorbing where individuals
#are lost from the system at the boundaries or reflective where
#individuals are not able to leave the system
imm_dyn <- function(N,Q,neigh,mu_n=0.5,mu_q=0.5,lbd=2,bound="absorbing"){
  if(bound=="absorbing"){
    J <- lbd*(N*(1-mu_n) + mu_n*laply(neigh,function(x) sum(N[x]/8)))
    P <- Q*(1-mu_q) + mu_q*laply(neigh,function(x) sum(Q[x]/8))
  }
  else if(bound=="reflective"){
    J <- lbd*(N*(1-mu_n) + 
mu_n*laply(neigh,function(x) sum(N[x]/dat$Nb[dat$lnb%in%x])))
    P <- Q*(1-mu_q) + 
mu_q*laply(neigh,function(x) sum(Q[x]/dat$Nb[dat$lnb%in%x]))
  }
  else{
    stop("Boundary type provided is not implemented yet")
  }
  return(list(J=J,P=P))
}

Here are three functions: the first one implements the local host-parasitoid dynamics, the second one is just a helper function to get the cell IDs of neighbouring cells, and the third one implements the migration between neighbouring cells. A critical point in spatially-explicit models is boundary behaviour. A boundary can either be absorbing: modelled entities are able to leave the simulated system, and when they do they are lost; circular: modelled entities are able to leave the system and when they do they re-enter it at the opposite end (i.e. the system is a torus); or reflective: modelled entities cannot leave the system. Here I implement absorbing (hosts or parasitoids leaving the metapopulation are lost) and reflective (hosts and parasitoids are “afraid” to leave the metapopulation) boundaries.
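Circular boundaries are not implemented above, but here is a sketch of what the wrap-around neighbour lookup could look like (my own illustration, assuming the same dat layout with columns x, y and lnb that is created in the next code block):

#hypothetical torus version of find_neighbour (not used below)
find_neighbour_torus<-function(vec,dat,n_x=30,n_y=30){
  #wrap the coordinates: column 0 maps to n_x and column n_x+1 maps to 1
  xs<-(vec["x"]+c(-1,-1,-1,0,1,1,1,0)-1)%%n_x+1
  ys<-(vec["y"]+c(-1,0,1,1,1,0,-1,-1)-1)%%n_y+1
  #match each of the 8 (x,y) pairs to its row number in dat
  sapply(1:8,function(i) which(dat$x==xs[i] & dat$y==ys[i]))
}

On a torus every cell has exactly 8 neighbours, so the division by 8 in the migration terms would always be exact.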

Simulation time:

#load the libraries
library(plyr) #for data wrangling
library(reshape2) #for melting
library(ggplot2) #for nice graphs
library(viridis) #for nice colors
#set up seed
set.seed(20161211)
#set up number of generations and grid size
generation<-1:100
n_x <- n_y <- 30
#representation of the grid
dat<-expand.grid(x=1:n_x,y=1:n_y)
dat$lnb<-1:(n_x*n_y)
#create a list where each element contain the cell ID 
#of the neighbour of the focal cell
neigh<-apply(dat,1,function(x) find_neighbour(x,dat))
#compute the number of neighbours for each cells to use 
#in reflective boundaries
dat$Nb<-laply(neigh,length)
#set the object keeping the dynamic and the initial condition
Js <- array(0,dim=c(30,30,101))
Ps <- array(0,dim = c(30,30,101))
#host and parasitoid start at one cell
Js[15,15,1] <- rpois(n=1,lambda=5)
Ps[15,15,1] <- rpois(n=1,lambda = 2)
#simulation
for(i in generation){  
  tmp <- local_dyn(J = Js[,,i],P = Ps[,,i],a=1)
  tmp2 <- imm_dyn(N = tmp[["N"]],Q = tmp[["Q"]],
    neigh = neigh,lbd=2,bound="absorbing")
  Js[,,(i+1)] <- tmp2[["J"]]
  Ps[,,(i+1)] <- tmp2[["P"]]
  print(i)
}
#plot temporal dynamic for one patch
plot(1:101,Js[20,15,],type="l",lwd=2,col="green",
  xlab="Generations",ylab="Population density")
lines(1:101,Ps[20,15,],lty=2,lwd=2,col="orange")
legend("topleft",c("Host","Parasitoid"),lwd=2,
  lty=c(1,2),col=c("green","orange"),bty="n")

[Figure: host and parasitoid population densities over time in one patch.]

#temporal dynamic across one transect from the center outwards
tra<-Js[15:30,15,]
tram<-melt(tra,value.name = "Pop")
names(tram)[1:2]<-c("Cell","Gen")
ggplot(tram,aes(x=Gen,y=Pop,color=factor(Cell),group=Cell))+geom_line()+
  labs(x="Generation",y="Population density")

[Figure: population dynamics along a transect from the central cell outwards.]

This is actually pretty interesting: we see the wave of colonisation from the original cell (N° 1) up to the edge of the simulated system (cell N° 16). The first wave is pretty well ordered from 1 to 16, but then at generation 25 cells in the middle peak a bit earlier than the ones closer to the center, and this is then amplified for the next peaks at generations 40 and 50.

#Phase space plot for average densities
ph<-data.frame(Gen=1:101,Host=log(apply(Js,3,mean)),
  Parasitoid=log(apply(Ps,3,mean)))
ggplot(ph,aes(x=Host,y=Parasitoid))+geom_path()+
  geom_label(aes(label=Gen))

[Figure: phase space of average host and parasitoid densities, absorbing boundaries.]

This graph shows the dynamics of the host-parasitoid average densities; it reveals, among other things, that we have cycles of varying amplitude.

#look at spatial dynamic for host and parasitoid
#between generations 50 and 53
exmpl<-data.frame(Gen=rep(50:53,each=900),x=rep(rep(1:30,times=30),4),
 y=rep(rep(1:30,each=30),4),Host=as.numeric(Js[,,50:53]),
 Parasitoid=as.numeric(Ps[,,50:53]))
exmplm<-melt(exmpl,id.vars=c("x","y","Gen"),variable.name="Species",
  value.name="Pop")
ggplot(exmplm,aes(x,y,fill=Pop))+geom_raster(interpolate=TRUE)+
  scale_fill_viridis(option="plasma",direction=-1)+
  facet_grid(Gen~Species)+labs(x="",y="")+
  theme(axis.text=element_text(size=0),axis.ticks.length=unit(0,"cm"))

[Figure: spatial distribution of hosts and parasitoids, generations 50 to 53, absorbing boundaries.]

Pretty nice! Now let’s run the same simulation but this time with reflective boundaries, i.e. hosts and parasitoids cannot leave the system:

#make the boundary reflective
Js2 <- array(0,dim=c(30,30,101))
Ps2 <- array(0,dim = c(30,30,101))
#host and parasitoid start at one cell
Js2[15,15,1] <- rpois(n=1,lambda=5)
Ps2[15,15,1] <- rpois(n=1,lambda = 2)

for(i in generation){  
  tmp <- local_dyn(J = Js2[,,i],P = Ps2[,,i],a=1)
  tmp2 <- imm_dyn(N = tmp[["N"]],Q = tmp[["Q"]],neigh = neigh,lbd=2,
    mu_n=0.5,mu_q=0.5,bound="reflective")
  Js2[,,(i+1)] <- tmp2[["J"]]
  Ps2[,,(i+1)] <- tmp2[["P"]]
  print(i)
}

#Phase space plot for average densities
ph<-data.frame(Gen=1:101,Host=log(apply(Js2,3,mean)),
  Parasitoid=log(apply(Ps2,3,mean)))
ggplot(ph,aes(x=Host,y=Parasitoid))+geom_path()+
  geom_label(aes(label=Gen))

[Figure: phase space of average host and parasitoid densities, reflective boundaries.]

Again we see cycles of host-parasitoid densities with changing amplitude.

#spatial patterns
exmpl<-data.frame(Gen=rep(75:78,each=900),x=rep(rep(1:30,times=30),4),
  y=rep(rep(1:30,each=30),4),Host=as.numeric(Js2[,,75:78]),
  Parasitoid=as.numeric(Ps2[,,75:78]))
exmplm<-melt(exmpl,id.vars=c("x","y","Gen"),
  variable.name="Species",value.name="Pop")
ggplot(exmplm,aes(x,y,fill=Pop))+geom_raster(interpolate=TRUE)+
  scale_fill_viridis(option="magma",direction=-1)+
  facet_grid(Gen~Species)+labs(x="",y="")+
  theme(axis.text=element_text(size=0),axis.ticks.length=unit(0,"cm"))

[Figure: spatial distribution of hosts and parasitoids, generations 75 to 78, reflective boundaries.]

Using ImageMagick and its convert function, one can easily turn this into an animated GIF and observe the whole dynamics.
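For instance, a minimal sketch (assuming ImageMagick’s convert command is available on the system) could export one PNG per generation from R and stitch them together:

#export one frame every second generation and assemble a GIF
for(i in seq(1,101,by=2)){
  png(sprintf("frame_%03d.png",i))
  image(Js2[,,i],axes=FALSE,main=paste("Generation",i))
  dev.off()
}
system("convert -delay 20 frame_*.png simu2.gif")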

[Animation: spatial dynamics of hosts and parasitoids across the grid.]

Why bother with all this? Developing such models reveals that depending on the size of the system (the number of patches), the metapopulation of hosts and parasitoids may or may not persist. In other words, if we reduce the size of natural habitats, a host-parasitoid system that was persisting through time despite large variations might suddenly be driven to extinction.

Crossed and Nested hierarchical models with STAN and R

Below I will expand on previous posts on Bayesian regression modelling using STAN (see previous instalments here, here, and here). Topic of the day is modelling crossed and nested designs in hierarchical models using STAN in R.

Crossed designs appear when we have more than one grouping variable and when data are recorded for each combination of the grouping variables. For example, say that we measured the growth of a fungus on different Petri dishes and that we took several samples from each dish. In this example we have two grouping variables: the Petri dish and the sample. Since we have observations for each combination of the two grouping variables, we are in a crossed design. We can model this using a hierarchical model with an intercept representing the average growth, a parameter representing the deviation from this average for each Petri dish, and an additional parameter representing the deviation from the average for each sample. Below is the corresponding model in STAN code:

/*A simple example of a crossed hierarchical model
*based on the Penicillin data from the lme4 package
*/

data {
  int<lower=0> N;//number of observations
  int<lower=0> n_sample;//number of samples
  int<lower=0> n_plate;//number of plates
  int<lower=1,upper=n_sample> sample_id[N];//vector of sample indices
  int<lower=1,upper=n_plate> plate_id[N];//vector of plate indices
  vector[N] y;
}
parameters {
  vector[n_sample] gamma;//vector of sample deviation from the average 
  vector[n_plate] delta;//vector of plate deviation from the average
  real<lower=0> mu;//average diameter value
  real<lower=0> sigma_gamma;//standard deviation of the gamma coeffs
  real<lower=0> sigma_delta;//standard deviation of the delta coeffs
  real<lower=0> sigma_y;//standard deviation of the observations
}
transformed parameters {
  vector[N] y_hat;

  for (i in 1:N)
    y_hat[i] = mu + gamma[sample_id[i]] + delta[plate_id[i]];
}
model {
  //prior on the scale coefficient
//weakly informative priors, see section 6.9 in STAN user guide
  sigma_gamma ~ cauchy(0,2.5);
  sigma_delta ~ cauchy(0,2.5);
  sigma_y ~ gamma(2,0.1);
  //get sample and plate level deviation
  gamma ~ normal(0, sigma_gamma);
  delta ~ normal(0, sigma_delta);
  //likelihood
  y ~ normal(y_hat, sigma_y);
}
generated quantities {
//sample predicted values from the model for posterior predictive checks
  real y_rep[N];
  for(n in 1:N)
    y_rep[n] = normal_rng(y_hat[n],sigma_y);
}

After pasting and saving this code into a .stan file, we turn to R, using the Penicillin dataset from the lme4 package as a (real-life) example:

library(lme4)
library(rstan)
library(shinystan)#for great model viz
library(ggplot2)#for great viz in general
data(Penicillin)
#look if we have sample for each combination
xtabs(~plate+sample,Penicillin)
#create the plate and sample index
plate_id<-as.numeric(Penicillin$plate)
sample_id<-as.numeric(Penicillin$sample)
#the model matrix (just an intercept in this case)
X<-matrix(rep(1,dim(Penicillin)[1]),ncol=1)

#fit the model
m_peni<-stan(file = "crossed_penicillin.stan",
data=list(N=dim(Penicillin)[1],n_sample=length(unique(sample_id)),
n_plate=length(unique(plate_id)),sample_id=sample_id,
plate_id=plate_id,y=Penicillin$diameter))

#launch_shinystan(m_peni)

The model seems to fit pretty nicely: all chains converged for all parameters (Rhat around 1), we have decent posterior distributions (top panel in the figure below) and also a good correlation between observed and fitted data (bottom panel in the figure below).

[Figure: posterior distributions (top) and observed vs fitted data (bottom).]
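The bottom panel can also be reproduced by hand from the y_rep values generated by the model (a quick sketch using the objects created above):

#observed vs mean posterior predicted diameters
y_rep<-extract(m_peni)$y_rep
plot(Penicillin$diameter,apply(y_rep,2,mean),
  xlab="Observed diameter",ylab="Mean predicted diameter")
abline(0,1,lty=2)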

In a next step we can look at the deviations from the average diameter for each sample and each plate (Petri dish):

#make caterpillar plot
mcmc_peni<-extract(m_peni)
sample_eff<-apply(mcmc_peni$gamma,2,quantile,probs=c(0.025,0.5,0.975))
df_sample<-data.frame(ID=unique(Penicillin$sample),Group="Sample",
LI=sample_eff[1,],Median=sample_eff[2,],HI=sample_eff[3,])
plate_eff<-apply(mcmc_peni$delta,2,quantile,probs=c(0.025,0.5,0.975))
df_plate<-data.frame(ID=unique(Penicillin$plate),Group="Plate",
LI=plate_eff[1,],Median=plate_eff[2,],HI=plate_eff[3,])
df_all<-rbind(df_sample,df_plate)
ggplot(df_all,aes(x=ID,y=Median))+geom_point()+
 geom_linerange(aes(ymin=LI,ymax=HI))+facet_wrap(~Group,scales="free")+
 geom_hline(aes(yintercept=0),color="blue",linetype="dashed")+
 labs(y="Regression parameters")

[Figure: caterpillar plot of the sample and plate effects.]

We can compare this figure to Figure 2.2 here, where the same model was fitted to the data using lmer.
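For reference, the corresponding lmer call is a one-liner; the Bayesian medians above should be close to the conditional modes it returns:

#the same crossed model fitted with lme4
m_lmer<-lmer(diameter~1+(1|plate)+(1|sample),data=Penicillin)
summary(m_lmer)
ranef(m_lmer) #plate- and sample-level deviations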

I now turn to nested designs. Nested designs occur when there is more than one grouping variable and there is a hierarchy in these variables, with each category of a lower-level variable being present at only one level of the higher-level variable. For example, if we measured student scores within classes within schools we would have a nested hierarchical design. In the following I will use the Arabidopsis dataset from the lme4 package. Arabidopsis plants from different regions (Netherlands, Spain and Sweden) and from different populations within these regions (a nested design) were collected, and the researchers looked at the effect of herbivory and nutrient addition on the number of fruits produced per plant. Below is the corresponding STAN code:

/*Nested regression example
*Three-levels with varying-intercept
*based on: https://rpubs.com/kaz_yos/stan-multi-2
*and applied to the Arabidopsis data from lme4
*/

data {
  int<lower=1> N; //number of observations
  int<lower=1> P; //number of populations
  int<lower=1> R; //number of regions
//population ID
  int<lower=1,upper=P> PopID[N]; 
 //index mapping each population to its region
  int<lower=1,upper=R> PopWithinReg[P]; 

  int<lower=0> Fruit[N]; //the response variable
  real AMD[N]; //predictor variable, whether the apical meristem was unclipped (0) or clipped (1)
  real nutrient[N]; //predictor variable, whether nutrient level were control (0) or higher (1)
}

parameters {
  //regression slopes
  real beta_0; //intercept
  real beta_1; //effect of clipping apical meristem on number of fruits
  real beta_2; //effect of increasing nutrient level on number of fruits

  //the deviation from the intercept at the different levels
  real dev_pop[P]; //deviation between the populations within a region
  real dev_reg[R]; //deviation between the regions

  //the standard deviation for the deviations
  real<lower=0> sigma_pop;
  real<lower=0> sigma_reg;
}

transformed parameters {
  //varying intercepts
  real beta_0pop[P];
  real beta_0reg[R];

  //the linear predictor for the observations, on the log scale
  //(no lower bound since a log-rate can be negative)
  real lambda[N];

  //compute the varying intercept at the region level
  for(r in 1:R){
    beta_0reg[r] = beta_0 + dev_reg[r];}

  //compute varying intercept at the population within region level
  for(p in 1:P){
     beta_0pop[p] = beta_0reg[PopWithinReg[p]] + dev_pop[p];}

  //the linear predictor
  for(n in 1:N){
     lambda[n] = beta_0pop[PopID[n]] + beta_1 * AMD[n] + beta_2 * nutrient[n];}
}

model {
  //weakly informative priors on the slopes
  beta_0 ~ cauchy(0,5);
  beta_1 ~ cauchy(0,5);
  beta_2 ~ cauchy(0,5);

  //weakly informative prior on the standard deviation
  sigma_pop ~ cauchy(0,2.5);
  sigma_reg ~ cauchy(0,2.5);

  //distribution of the varying intercept
  dev_pop ~ normal(0,sigma_pop);
  dev_reg ~ normal(0,sigma_reg);

  //likelihood
  Fruit ~ poisson_log(lambda);
}

generated quantities {
//sample predicted values from the model for posterior predictive checks
 int<lower=0> fruit_rep[N];
 for(n in 1:N)
   fruit_rep[n] = poisson_log_rng(lambda[n]);
}

I decided to use a Poisson distribution as the response is a count variable. The only “tricky” part is the index linking a particular population to its specific region (PopWithinReg). In this model we assume that variation between populations within regions only affects the average number of fruits, but does not affect the plant responses to the simulated herbivory (AMD) and to increased nutrient levels. In other words, population within region is an intercept-only “random effect”. We turn back to R:

 data("Arabidopsis")
#generate the IDs
pop.id <- as.numeric(Arabidopsis$popu)
pop_to_reg <- as.numeric(factor(substr(levels(Arabidopsis$popu),3,4)))
#create the predictor variables
amd <- ifelse(Arabidopsis$amd=="unclipped",0,1)
nutrient <- ifelse(Arabidopsis$nutrient==1,0,1)

m_arab <- stan("nested_3lvl.stan",data=list(N=625,P=9,R=3,PopID=pop.id,
PopWithinReg=pop_to_reg,Fruit=Arabidopsis$total.fruits,
AMD=amd,nutrient=nutrient))
#check model
#launch_shinystan(m_arab)
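As an aside, a frequentist analogue of this nested, intercept-only structure can be sketched with lme4 (a hypothetical comparison using the raw columns of the dataset, not part of the original analysis):

#nested random intercepts: populations within regions
m_glmer<-glmer(total.fruits~amd+nutrient+(1|reg/popu),
  family=poisson,data=Arabidopsis)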

RStan warns us that we had some divergent iterations; we could correct this using a non-centered re-parametrization (see this post and the STAN user guide). More worrisome is the discrepancy between the posterior predictive data and the observed data:

[Figure: posterior predictive vs observed fruit counts.]

We can explore these errors for each population within each region:

mcmc_arab <- extract(m_arab)
#plot obs vs fitted data across groups
fit_arab <- mcmc_arab$fruit_rep
#average across MCMC samples
Arabidopsis$Fit <- apply(fit_arab,2,mean)
#plot obs vs fit
ggplot(Arabidopsis,aes(x=total.fruits,y=Fit,color=amd,
  shape=factor(nutrient)))+
  geom_point()+facet_wrap(~popu,scales="free")+
  geom_abline(aes(intercept=0,slope=1))

[Figure: observed vs fitted fruit counts for each population.]

The model basically predicts four values, one for each combination of the two treatment variables. The original data are way more dispersed than the fitted ones; one could try a negative binomial distribution, while also letting the treatment effects vary between the populations and the regions …
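A quick way to see the problem is to compare the mean and the variance of the raw counts; for a Poisson they should be roughly equal, and here they are not even close (a rough check that ignores the covariates):

#overdispersion check on the raw counts
with(Arabidopsis,c(Mean=mean(total.fruits),Variance=var(total.fruits)))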

That’s it for this post; for further examples, the STAN-wiki is a great source of regression models.

Hierarchical models with RStan (Part 1)

Real-world data sometimes show complex structures that call for the use of special models. When data are organized in more than one level, hierarchical models are the most relevant tool for data analysis. One classic example: when you record student performance from different schools, you might decide to record student-level variables (age, ethnicity, social background) as well as school-level variables (number of students, budget). In this post I will show how to fit such models using RStan. As there is much to say and try with such models, I restrict myself in this post to a rather simple example; I will extend this to more complex situations in later posts.

A few words about RStan:

If you don’t know anything about STAN and RStan make sure to check out this webpage. In a few words, RStan is an R interface to the STAN programming language that lets you fit Bayesian models. A classical workflow looks like this:

  1. Write a STAN model file with a .stan extension
  2. In R, fit the model using the RStan package, passing the model file and the data to the stan function
  3. Check the model fit; a great way to do this is to use the shinystan package

First example with simulated data:

Say that we recorded the response of 10 different plant species to rising temperature and nitrogen concentration. We measured the biomass of 10 individuals per species along a gradient of both temperature and nitrogen concentration, and we would like to know how these two variables affect plant biomass. In a hierarchical model we let the regression parameters vary between the species; this means that, for example, species A might have a more positive slope between temperature and biomass than species B. Note however that we do not fit a separate regression to each species; rather, the regression parameters for the different species are themselves drawn from a common statistical distribution.

In mathematical terms this example can be written:

\mu_{ij} = \beta_{0j} + \beta_{1j} * Temperature_{ij} + \beta_{2j} * Nitrogen_{ij}

\beta_{0j} \sim N(\gamma_{0},\tau_{0})

… (same for all regression coefficients) …

y_{ij} \sim N(\mu_{ij},\sigma)

For observations i: 1 … N and species j: 1 … J.

This is how such a model looks in STAN:

/*A simple example of a hierarchical model*/
data {
  int<lower=1> N; //the number of observations
  int<lower=1> J; //the number of groups
  int<lower=1> K; //number of columns in the model matrix
  int<lower=1,upper=J> id[N]; //vector of group indices
  matrix[N,K] X; //the model matrix
  vector[N] y; //the response variable
}
parameters {
  vector[K] gamma; //population-level regression coefficients
  vector<lower=0>[K] tau; //the standard deviation of the regression coefficients

  vector[K] beta[J]; //matrix of group-level regression coefficients
  real<lower=0> sigma; //standard deviation of the individual observations
}
model {
  vector[N] mu; //linear predictor
  //priors
  gamma ~ normal(0,5); //weakly informative priors on the regression coefficients
  tau ~ cauchy(0,2.5); //weakly informative priors, see section 6.9 in STAN user guide
  sigma ~ gamma(2,0.1); //weakly informative priors, see section 6.9 in STAN user guide
  
  for(j in 1:J){
   beta[j] ~ normal(gamma,tau); //fill the matrix of group-level regression coefficients 
  }
  
  for(n in 1:N){
    mu[n] = X[n] * beta[id[n]]; //compute the linear predictor using relevant group-level regression coefficients 
  }

  //likelihood
  y ~ normal(mu,sigma);
}

You can copy/paste the code into an empty text editor and save it as a .stan file.

Now we turn to R:

#load libraries
library(rstan)
library(RColorBrewer)
#where the STAN model is saved
setwd("~/Desktop/Blog/STAN/")
#simulate some data
set.seed(20161110)
N<-100 #sample size
J<-10 #number of plant species
id<-rep(1:J,each=10) #index of plant species
K<-3 #number of regression coefficients
#population-level regression coefficient
gamma<-c(2,-1,3)
#standard deviation of the group-level coefficient
tau<-c(0.3,2,1)
#standard deviation of individual observations
sigma<-1
#group-level regression coefficients
beta<-mapply(function(g,t) rnorm(J,g,t),g=gamma,t=tau) 
#the model matrix
X<-model.matrix(~x+y,data=data.frame(x=runif(N,-2,2),y=runif(N,-2,2)))
y<-vector(length = N)
for(n in 1:N){
  #simulate response data
  y[n]<-rnorm(1,X[n,]%*%beta[id[n],],sigma)
}
#run the model
m_hier<-stan(file="hierarchical1.stan",data=list(N=N,J=J,K=K,id=id,X=X,y=y))

The MCMC sampling takes place (took about 90 sec per chain on my computer), and then I got this warning message: “Warning messages:
1: There were 61 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
2: Examine the pairs() plot to diagnose sampling problems”

Here is an explanation for this warning: “For some intuition, imagine walking down a steep mountain. If you take too big of a step you will fall, but if you can take very tiny steps you might be able to make your way down the mountain, albeit very slowly. Similarly, we can tell Stan to take smaller steps around the posterior distribution, which (in some but not all cases) can help avoid these divergences.” from here. This issue occurs quite often in hierarchical models with limited sample size; the simplest solution is to re-parameterize the model, in other words to re-write the equations so that the MCMC sampler has an easier time sampling the posterior distribution.
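Before re-writing anything, one can also simply follow the warning’s advice and increase adapt_delta, which is often enough when there are only a few divergences (at the cost of a slower fit):

#tell the sampler to take smaller steps
m_hier<-stan(file="hierarchical1.stan",
  data=list(N=N,J=J,K=K,id=id,X=X,y=y),
  control=list(adapt_delta=0.99))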

Below is a new STAN model with a non-centered parameterization (See Section 22.6 in STAN user guide):

parameters {
  vector[K] gamma; //population-level regression coefficients
  vector<lower=0>[K] tau; //the standard deviation of the regression coefficients
  //implementing Matt's trick
  vector[K] beta_raw[J];
  real<lower=0> sigma; //standard deviation of the individual observations
}
transformed parameters {
  vector[K] beta[J]; //matrix of group-level regression coefficients
  //computing the group-level coefficient, based on non-centered parametrization based on section 22.6 STAN (v2.12) user's guide
  for(j in 1:J){
    beta[j] = gamma + tau .* beta_raw[j];
  }
}
model {
  vector[N] mu; //linear predictor
  //priors
  gamma ~ normal(0,5); //weakly informative priors on the regression coefficients
  tau ~ cauchy(0,2.5); //weakly informative priors, see section 6.9 in STAN user guide
  sigma ~ gamma(2,0.1); //weakly informative priors, see section 6.9 in STAN user guide
  for(j in 1:J){
   beta_raw[j] ~ normal(0,1); //fill the matrix of group-level regression coefficients 
  }
  for(n in 1:N){
    mu[n] = X[n] * beta[id[n]]; //compute the linear predictor using relevant group-level regression coefficients 
  }
  //likelihood
  y ~ normal(mu,sigma);
}

Note that the data block is identical in the two cases, so I do not show it again.

We turn back to R:

#re-parametrize the model
m_hier<-stan(file="hierarchical1_reparam.stan",data=list(N=N,J=J,K=K,id=id,X=X,y=y))
#no more divergent iterations, we can start exploring the model
#a great way to start is to use the shinystan library
#library(shinystan)
#launch_shinystan(m_hier)
#model inference
print(m_hier,pars=c("gamma","tau","sigma"))
Inference for Stan model: hierarchical1_reparam.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

          mean se_mean   sd  2.5%   25%   50%  75% 97.5% n_eff Rhat
gamma[1]  1.96    0.00 0.17  1.61  1.86  1.96 2.07  2.29  2075    1
gamma[2] -0.03    0.02 0.77 -1.53 -0.49 -0.04 0.43  1.55  1047    1
gamma[3]  2.81    0.02 0.49  1.84  2.52  2.80 3.12  3.79   926    1
tau[1]    0.34    0.01 0.21  0.02  0.19  0.33 0.46  0.79  1135    1
tau[2]    2.39    0.02 0.66  1.47  1.94  2.26 2.69  4.04  1234    1
tau[3]    1.44    0.01 0.41  0.87  1.16  1.37 1.65  2.43  1317    1
sigma     1.04    0.00 0.09  0.89  0.98  1.04 1.10  1.23  2392    1

Samples were drawn using NUTS(diag_e) at Thu Nov 10 14:11:41 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The regression parameters were all decently estimated except for the second slope coefficient: the simulated value was -1.
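Part of this discrepancy is expected: with only 10 species drawn from a distribution with standard deviation 2, the realized mean of the group-level slopes can easily land far from -1. We can check the realized values directly (using the simulation objects still in the workspace):

#the realized means of the simulated group-level coefficients
colMeans(beta)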

All MCMC samples for all coefficients can easily be extracted and used to compute whatever your interest is:

#extract the MCMC samples
mcmc_hier<-extract(m_hier)
str(mcmc_hier)

#plot average response to explanatory variables
X_new<-model.matrix(~x+y,data=data.frame(x=seq(-2,2,by=0.2),y=0))
#get predicted values for each MCMC sample
pred_x1<-apply(mcmc_hier$gamma,1,function(beta) X_new %*% beta)
#now get median and 95% credible intervals
pred_x1<-apply(pred_x1,1,quantile,probs=c(0.025,0.5,0.975))
#same stuff for the second explanatory variables
X_new<-model.matrix(~x+y,data=data.frame(x=0,y=seq(-2,2,by=0.2)))
pred_x2<-apply(mcmc_hier$gamma,1,function(beta) X_new %*% beta)
pred_x2<-apply(pred_x2,1,quantile,probs=c(0.025,0.5,0.975))

Here we basically generated new model matrices where only one variable moves at a time; this allows us to get the model prediction for the effect of, say, temperature on plant biomass under average nutrient conditions. These predictions were obtained by multiplying the model matrix with the coefficients for each MCMC sample (the first apply command), and from these samples we get the median with 95% credible intervals (the second apply command).

Now we can plot this (code for the plots at the end of the post)

[Figure: population-level response to the two explanatory variables with 95% CrI.]

Another important plot is the variation in the regression parameters between the species; again this is easily done using the MCMC samples:

#now we could look at the variation in the regression coefficients between the groups doing caterpillar plots
ind_coeff<-apply(mcmc_hier$beta,c(2,3),quantile,probs=c(0.025,0.5,0.975))
df_ind_coeff<-data.frame(Coeff=rep(c("(Int)","X1","X2"),each=10),
  LI=c(ind_coeff[1,,1],ind_coeff[1,,2],ind_coeff[1,,3]),
  Median=c(ind_coeff[2,,1],ind_coeff[2,,2],ind_coeff[2,,3]),
  HI=c(ind_coeff[3,,1],ind_coeff[3,,2],ind_coeff[3,,3]))
gr<-paste("Gr",1:10)
df_ind_coeff$Group<-factor(gr,levels=gr)
#we may also add the population-level median estimate
pop_lvl<-data.frame(Coeff=c("(Int)","X1","X2"),Median=apply(mcmc_hier$gamma,2,quantile,probs=0.5))
ggplot(df_ind_coeff,aes(x=Group,y=Median))+geom_point()+
  geom_linerange(aes(ymin=LI,ymax=HI))+coord_flip()+
  facet_grid(.~Coeff)+
  geom_hline(data=pop_lvl,aes(yintercept=Median),color="blue",linetype="dashed")+
  labs(y="Regression parameters")

[Figure: caterpillar plot of the group-level regression coefficients.]

The cool thing with using STAN is that we can extend or modify the model in many ways. This will be the topic of future posts, which will include: crossed and nested designs, multilevel modelling, non-normal distributions and much more. Stay tuned!

Code for the first plot:

cols<-brewer.pal(10,"Set3")
par(mfrow=c(1,2),mar=c(4,4,0,1),oma=c(0,0,3,5))
plot(y~X[,2],pch=16,xlab="Temperature",ylab="Response variable",col=cols[id])
lines(seq(-2,2,by=0.2),pred_x1[1,],lty=2,col="red")
lines(seq(-2,2,by=0.2),pred_x1[2,],lty=1,lwd=3,col="blue")
lines(seq(-2,2,by=0.2),pred_x1[3,],lty=2,col="red")
plot(y~X[,3],pch=16,xlab="Nitrogen concentration",ylab="Response variable",col=cols[id])
lines(seq(-2,2,by=0.2),pred_x2[1,],lty=2,col="red")
lines(seq(-2,2,by=0.2),pred_x2[2,],lty=1,lwd=3,col="blue")
lines(seq(-2,2,by=0.2),pred_x2[3,],lty=2,col="red")
mtext(text = "Population-level response to the two\nexplanatory variables with 95% CrI",side = 3,line = 0,outer=TRUE)
legend(x=2.1,y=10,legend=paste("Gr",1:10),ncol = 1,col=cols,pch=16,bty="n",xpd=NA,title = "Group\nID")


150 years of ecology: lessons for the future

Last week I was at the annual meeting of the German, Austrian and Swiss ecologists (GfÖ), which took place in Marburg (DE). The title of the conference referred to the coining of the word Oecologie in 1866 by Ernst Haeckel, and the conference aimed to “reflect upon the progress we have made in the past and […] to identify the many unsolved scientific questions that still confront us.” (Volkmar Wölters)

How I found the conference:

  • Marburg is a cool town; the “Altstadt” is super-pretty, with loads of cosy bars to extend interactions on scientific and non-scientific topics well into the night (and morning for the bravest).
  • As always for a GfÖ conference, the participants were mostly PhD students and PostDocs from Germany, with little diversity in terms of disciplines, most of the people being hardcore ecologists.
  • Odd distribution of talks between the sessions: parallel to my talk on the links between plant diversity and herbivory/predation through consumer community shifts ran a talk (in another session) on plant diversity increasing food quantity and quality with positive effects on social bees, while two talks later in my own session we heard about negative effects of plant richness on root decomposition …
  • Great social events: the ice breaker was supported by awesome food, and the BBQ night revealed the great musical talent of prominent German ecologists (including the GfÖ president) and also fantastic dynamism when it comes to dancing to hits from the 80s.
  • If I take one lesson for the future it is: be a generalist, develop skills and interact with associated disciplines like remote sensing (as Nathalie Pettorelli does) or social science to bring new ideas, new concepts and new data into ecology.


Overview of the week:

(All of this is based on my notes and my vague, selective memories of the talks; it is certainly not an unbiased view of the conference.)

Day 1:

I gave my talk on the first day, almost right after the keynote speech of Susanne Fritz, so my notes on her talk are rather sparse. She argued that ecologists should embrace the historical dimension of their datasets; for example, selective extinction (big species going extinct first) might explain some present-day trait distributions. After the keynote speech I gave my talk despite technical failures (I spent 15 minutes shouting my slides in an overly large lecture hall); I received little feedback from it, so I guess that what I am doing is OK. Felix Fornoff presented some nice results on trap-nesting bees: he found that tree diversity effects on these insects are explained by changes in canopy area. As the canopy grows, the climatic conditions under the trees become more stable (less wind, fewer temperature extremes), which positively affects the abundance and richness of trap-nesting bees. After the coffee break I had the pleasure to hear an inspiring talk by Leana Zoller on the effect of artificial light on night pollination. In her experiment she put some LED street lights into areas not affected by light pollution and recorded the effect on Cirsium oleraceum fitness. She found a strong effect of light pollution on this naive flora and fauna: seed mass of C. oleraceum declined by 20% in light-polluted compared to control plots.

Day 2:

A nice talk by Per Schleuss caught my attention on the second day just before lunch: he looked at the degradation of Kobresia grasslands in Tibet due to desiccation (climate change) and overgrazing in the area. He found that 70% of the soil organic carbon trapped by the system was either being eroded, leading to important river pollution, or being mineralized, a process which releases carbon dioxide into the atmosphere. Christian Wirth was the next keynote speaker, with a provocative title: “After the hype: a reality check for trait-based functional biodiversity research”. The talk nicely outlined the different ways to quantify trait variation (mean, dispersion, range), each having different implications for ecosystem functions. He also showed that ecosystem functions are not determined by one unique “super-trait” but that on average five traits are necessary per ecosystem function. He went on talking about intraspecific trait variation, saying that “if there are too many individuals varying in their traits to hope to put together a community piece by piece and to predict ecosystem function, then forget about plant traits”. A sobering statement, as more and more data accumulate showing large within-species variation in trait values. At the end of the day Sanne Van Den Berge talked about the biodiversity of hedgerows and tree lines in Belgium; in the study area such structures represent 0.7% of the land cover but host 45% of the floral diversity. Unfortunately these structures are endangered by human pressures, especially the older elements, which also have the highest diversity. She also reported that local diversity increased by, on average, 4 species between 1974 and 2015, something to add (maybe) to the current debate on biodiversity trends.

Day 3:

I spent the morning sitting in the movement ecology session hearing some nice talks, in particular one by Wiebke Ullmann on hare movement in agricultural landscapes and how energy expenditure was affected by home range size, which was in turn affected by farm size, as well as one by Lisa Fisler on hoverfly migration through the Swiss Alps, with impressive videos and data showing that these marvellous tiny creatures actively fly against the winds through mountain passes up to 1900 m! After lunch I went to a cool session on curious natural history observations, or as one of the speakers put it: “[he is] completely fed up with hypothesis-driven science”. Mark-Oliver Rödel presented great insights on the adaptation of the red-rubber frog, which spends the dry season within ant nests completely unharmed by these ferocious insects, which hunt and feed on other frogs in the region. Then Manfred Türke delighted us with his observations of slug poop. It appears that many mite species survive the passage through the slug intestine; he found on average 3 surviving mites per dropping. He thinks that mites dramatically increase their dispersal ability through the use of this “slug high-speed train”.

Day 4:

The last day started with a keynote speech by Shahid Naeem, where he reflected on the evolution of conservation paradigms: from nature for itself in the 60s, to nature for people in the 2000s, to people and nature in current times. He argued that biodiversity is a multidimensional concept (taxonomic diversity, functional diversity …) and that one needs to look at and preserve all dimensions to support ecosystem functions. Going on, he argued that human well-being should be the primary focus of contemporary ecological approaches: if higher human well-being is the target, then we need a high provision of services from the ecosystems, which comes from well-functioning ecosystems, which are supported by biodiversity (QED); therefore society, by protecting biodiversity, will protect itself. This makes sense, even if I have little love for utilitarian arguments when it comes to biodiversity conservation. Later that day I heard an inspiring talk by Jasper Wubs on how one can steer restoration in different directions by using different soil inocula. In other words, if you inoculate some heathland soil into a bare ground plot you get a heathland plant community, and if 10 meters away you inoculate grassland soil you get a grassland community. How great is that? Finally I listened with great attention to Florian Hartig’s talk on model selection and its effect on prediction and inference, using a simulation study. He showed that unconditional model averaging and the LASSO led to the best performance when it comes to prediction, while for inference keeping the full model led to the best results.


All in all, the GfÖ meetings are always pretty nice: loads of little scientific nuggets here and there and great interactions with cool people. The next meeting will be in Belgium for a very exciting joint meeting with the British Ecological Society; I am already looking forward to it.


Shiny and Leaflet for Visualization Awards

Next week is the meeting of the German (and Swiss and Austrian) ecologists in Marburg, and the organizing team launched a visualization contest based on spatial data of the stores present in the city. Nadja Simons and I decided to enter the contest; our idea was to link the store data to the city bus network, promoting a sustainable and safe way of moving through the town (we are ecologists after all).

We used leaflet to prepare the map and plot the bus lines, bus stops and stores. On top of that, because all this information is rather overwhelming to grasp (more than 200 stores and 50 bus stops), we implemented a shiny App to allow users to restrict their search for the best bars in Marburg.

All the code is on GitHub and the App can be accessed by clicking here.

Go there, enjoy, and if you want to learn a bit about the process of developing the App come back here.


First a few words on what is so cool about leaflet:

  • There are a lot of maps available
  • With the AwesomeMarker plugin, available in the GitHub repo of the package, you can directly tap into three awesome icon libraries: Font Awesome, Glyphicons and Ionicons
  • Leaflet understands HTML code; we used it to provide a clickable link to the bus timetables
  • Using groups makes it easy to interactively add or remove groups of objects from the map

Having this was already nice, but putting it into a shiny App was even more exciting. Here are some of the main points:

  • A very important concept in shiny is reactivity; the way I understood it is that a reactive object is a small function that gets updated every time the user inputs something, see this nice tutorial for more on this.
  • Our idea for the App was that stores should appear when the mouse passes over their nearest bus stop. The issue there is that the stores must then disappear when the mouse moves out. The trick is to create a set of reactiveValues that are NULL as long as the event (mouse passes over a bus stop) does not occur AND return to NULL when the event is finished (mouse moves out); helped by this post, we were able to implement what we wanted (see the sketch after this list).
  • Shiny gives you a lot of freedom to create and customize the design of the App, what I found very cool was the possibility to have tabs.
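Here is a minimal sketch of that reactiveValues trick in a toy app (hypothetical code with a single hard-coded marker, not the contest App):

#markers stored in reactiveValues: set on mouseover, reset on mouseout
library(shiny)
library(leaflet)
ui<-fluidPage(leafletOutput("map"))
server<-function(input,output,session){
  rv<-reactiveValues(stop_id=NULL) #NULL while no bus stop is hovered
  output$map<-renderLeaflet({
    leaflet()%>%addTiles()%>%
      addCircleMarkers(lng=8.77,lat=50.81,layerId="stop1")
  })
  observeEvent(input$map_marker_mouseover,{
    rv$stop_id<-input$map_marker_mouseover$id
  })
  observeEvent(input$map_marker_mouseout,{
    rv$stop_id<-NULL #event finished: back to NULL, the stores vanish
  })
}
shinyApp(ui,server)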

Be sure to check the leaflet tutorial page, where 90% of my questions were answered, and also check out the many articles on shiny.

See you in Marburg if you’ll be around!


Simulating local community dynamics under ecological drift

In 2001, Stephen Hubbell’s book on the neutral theory of biodiversity marked a major shift from classical community ecology. Before this book the niche-assembly framework dominated the study of community dynamics. Very briefly, under this framework local species composition is the result of the resources available at a particular site, and species presence or absence depends on the species’ niche (as defined in Chase 2003). As a result, community assembly was seen as a deterministic process: a specific set of species should emerge from a given set of resources, and species composition should be stable as long as the system is not disturbed. Hubbell’s theory, based on a dispersal-assembly framework (historically developed by MacArthur and Wilson), was a departure from this mainstream view of community dynamics, assuming that local species compositions are transient, with no equilibrium. Species abundances follow a random walk under ecological drift: species are continuously going locally extinct while a constant rain of immigrants from the metacommunity brings new recruits. The interested reader is kindly encouraged to read: this, that or that.

Being an R geek, and after reading Chapter 4 of Hubbell’s book, I decided to implement his model of local community dynamics under ecological drift. Do not expect any advanced analysis of the model behavior; Hubbell actually derives analytical solutions for this model in that chapter. This post is mostly educational, a glimpse into the first part of Hubbell’s neutral model.

Before diving into the code it is helpful to verbally outline the main characteristics of the model:

  • The model is individual-based and neutral, assuming that all individuals are equal in their probability of dying or persisting.
  • The model assumes a zero-sum dynamic: the total number of individuals (J) is constant. Imagine for example a forest: when a tree dies it frees space for a new individual to grow
  • At each time step a number of individuals (D) die and are replaced by new recruits coming either from the metacommunity (with probability m) or from the local community (with probability 1-m)
  • The probability that a new recruit belongs to a given species is proportional to that species’ relative abundance, either in the metacommunity (assumed to be fixed, with a log-normal distribution) or in the local community (which obviously varies)

We are now equipped to immerse ourselves into R code:

#implement the zero-sum dynamic with ecological drift from Hubbell's theory (Chp. 4)
untb<-function(R=100,J=1000,D=1,m=0,Time=100,seed=as.numeric(Sys.time())){
  mat<-matrix(0,ncol=R,nrow=Time+1,dimnames=
  list(paste0("Time",0:Time),as.character(1:R)))
  #the metacommunity SAD
  metaSAD<-rlnorm(n=R)
  metaSAD<-metaSAD/sum(metaSAD)
  #initialize the local community
  begin<-table(sample(x=1:R,size = J,replace = TRUE,prob = metaSAD))
  mat[1,attr(begin,"dimnames")[[1]]]<-begin
  #loop through the time and get community composition
  for(i in 2:(Time+1)){
    newC<-helper_untb(vec = mat[(i-1),],R = R,D = D,m = m,metaSAD=metaSAD)
    mat[i,attr(newC,"dimnames")[[1]]]<-newC
  }
  return(mat)
}
#the function that computes individual deaths and
#replacement by new recruits
helper_untb<-function(vec,R,D,m,metaSAD){
  ind<-rep(1:R,times=vec)
  #remove D individuals from the community
  ind_new<-ind[-sample(x = 1:length(ind),size = D,replace=FALSE)]
  #are the new recruit from the Local or Metacommunity?
  recruit<-sample(x=c("L","M"),size=D,replace=TRUE,prob=c((1-m),m))
  #sample species ID according to the relevant SAD
  ind_recruit<-c(ind_new,sapply(recruit,function(isLocal) {
    if(isLocal=="L")
      sample(x=ind_new,size=1)
    else
      sample(1:R,size=1,prob=metaSAD)
  }))
  return(table(ind_recruit))
}

I always like simple theories that can be implemented with few lines of code and do not require fancy computations. Let’s put the function into action: for a closed community (m=0) of 10 species we track the species abundances and the rank-abundance distribution dynamics:

library(reshape2) #for data manipulation I
library(plyr) #for data manipulation II
library(ggplot2) #for the nicest plot in the world
#look at population dynamics over time
comm1<-untb(R = 10,J = 100,D = 10,m = 0,Time = 200)
#quick and dirty first graphs
comm1m<-melt(comm1)
comm1m$Time<-as.numeric(comm1m$Var1)-1
ggplot(comm1m,aes(x=Time,y=value,color=factor(Var2)))+
  geom_path()+labs(y="Species abundance")+
  scale_color_discrete(name="Species ID")
#let's plot the RAD curve over time
pre_rad<-apply(comm1,1,function(x) sort(x,decreasing = TRUE))
pre_radm<-melt(pre_rad)
pre_radm$Time<-as.numeric(pre_radm$Var2)-1
ggplot(pre_radm,aes(x=Var1,y=value/100,color=Time,group=Time))+
  geom_line()+labs(x="Species rank",y="Species relative abundance")

[Figures: species abundance dynamics and rank-abundance curves over time.]

We see here an important feature of the zero-sum dynamics: with no immigration the model will always (eventually) lead to the dominance of one species. This is because with no immigration there are so-called absorbing states; species abundances are attracted to these states, and once such a value is reached the abundance stays constant. In this case there are only two absorbing states for a species’ relative abundance, 0 or 1: either you go extinct or you dominate the community.
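We can check this numerically: a long enough closed-community run should end with one species holding all J individuals (a quick sketch; the number of generations needed will vary with the random draws):

#with m = 0 the community should eventually reach monodominance
comm_end<-untb(R=10,J=100,D=10,m=0,Time=5000)
comm_end[5001,] #expected: one species at 100, all the others at 0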

In the next step we open up the community while varying the death rate, and we track the species richness dynamics:

#apply this over a range of parameters
pars<-expand.grid(R=20,J=200,D=c(1,2,4,8,16,32,64),
  m=c(0,0.01,0.05,0.1,0.5),Time=200)
comm_list<-alply(pars,1,function(p) do.call(untb,as.list(p)))
#get species abundance for each parameter set
rich<-llply(comm_list,function(x) apply(x,1,function(y) length(y[y!=0])))
rich<-llply(rich,function(x) data.frame(Time=0:200,Rich=x))
rich<-rbind.fill(rich)
rich$D<-rep(pars$D,each=201)
rich$m<-rep(pars$m,each=201)
ggplot(rich,aes(x=Time,y=Rich,color=factor(D)))+geom_path()+
  facet_grid(.~m)

[Figure: species richness over time for different death rates (D) and immigration rates (m).]

As we increase the probability that new recruits come from the metacommunity, the impact of increasing the number of deaths per time step vanishes. This is because the metacommunity is static in this model: even if species are wiped out by an important local disturbance, there will always be new immigrants coming in, maintaining species richness at a rather constant level.

There is an R package for everything these days, and neutral theory is no exception: check out the untb package, which implements the full neutral model (i.e. with speciation and dynamic metacommunities) but also allows fitting the model to empirical data and getting parameter estimates.
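For instance, something along these lines should work (a sketch from memory of the package interface; double-check ?untb for the exact arguments):

#simulate drift with immigration for 1000 generations, keeping the trajectory
library(untb)
a<-untb(start=rep(1,100),prob=0.005,gens=1000,keep=TRUE)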

Exploring the diversity of Life using Rvest and the Catalog of Life

I am writing the general introduction of my thesis and wanted a nice illustration of the diversity of Arthropods compared to other phyla (my work focuses on Arthropods, so this is a nice motivation). As the literature I have had access to so far uses pie charts to graphically represent these diversities, and knowing that pie charts are bad, I decided to create my own illustration.

Fortunately I came across the Catalogue of Life website, which provides (among other things) an overview of the number of species in each phylum. So I decided to have a go at directly importing the data from the website into R using the rvest package.

Let’s go:


#load the packages
library(rvest)
library(ggplot2)
library(scales)#for comma separator in ggplot2 axis

#read the data
col<-read_html("http://www.catalogueoflife.org/col/info/totals")
col%>%
  html_table(header=TRUE)->sp_list  
sp_list<-sp_list[[1]]
#some minor data re-formatting
#re-format the data frame keeping only animals, plants and
#fungi
sp_list<-sp_list[c(3:37,90:94,98:105),-4]
#add a kingdom column
sp_list$kingdom<-rep(c("Animalia","Fungi","Plantae"),times=c(35,5,8))
#remove the nasty commas and turn into numeric
sp_list[,2]<-as.numeric(gsub(",","",sp_list[,2]))
sp_list[,3]<-as.numeric(gsub(",","",sp_list[,3]))
names(sp_list)[2:3]<-c("Nb_Species_Col","Nb_Species")

Now we are ready to make the first plot:

ggplot(sp_list,aes(x=Taxon,y=Nb_Species,fill=kingdom))+
 geom_bar(stat="identity")+
 coord_flip()+
 scale_fill_discrete(name="Kingdom")+
 labs(y="Number of Species",x="",title="The diversity of life")


[Figure: number of species per phylum.]

This is a bit overwhelming: half of the phyla have fewer than 1,000 species, so let’s make a second graph with only the phyla comprising more than 1,000 species. And just to make things nicer I sort the phyla by their number of species:

subs<-subset(sp_list,Nb_Species>1000)
subs$Taxon<-factor(subs$Taxon,levels=subs$Taxon[order(subs$Nb_Species)])
ggplot(subs,aes(x=Taxon,y=Nb_Species,fill=kingdom))+
 geom_bar(stat="identity")+
 theme(panel.border=element_rect(linetype="dashed",color="black",fill=NA),
 panel.background=element_rect(fill="white"),
 panel.grid.major.x=element_line(linetype="dotted",color="black"))+
 coord_flip()+
 scale_fill_discrete(name="Kingdom")+
 labs(y="Number of Species",x="",
 title="The diversity of multicellular organisms from the Catalog of Life")+
 scale_y_continuous(expand=c(0,0),limits=c(0,1250000),labels=comma)


[Figure: number of species for phyla with more than 1,000 species.]

That’s it for a first exploration of the powers of rvest. This was actually super easy; I expected to have to spend much more time trying to decipher XML code, but rvest seems to know its way around …

This graph is also a nice reminder that most of the described multicellular species out there are small crawling beetles, and that we still know alarmingly little about their ecology and threat status. As a comparison, all the vertebrates (all birds, mammals, reptiles, fishes and some other taxa) fall within the Chordata, with a total of around 50,000 described species. An even more sobering thought is the fact that the total number of described species is only a fraction of what remains undescribed.

Exploring Spatial Patterns and Coexistence

Today is a rainy day and I had to drop my plans to go out hiking; instead I continued reading “Self-Organization in Complex Ecosystems” by Richard Solé and Jordi Bascompte.

As I will be busy in the coming weeks with spatial models at the iDiv summer school, I was closely reading chapter three on spatial self-organization and decided to try to implement in R one of the Coupled Map Lattice models described in the book.

The model is based on a discrete-time version of the Lotka-Volterra competition model. For example, for two competing species we model the normalized population abundances (the population abundance divided by the carrying capacity) N1 and N2 using the following equations:

N_{1,t+1} = \phi_{N_{1}}(N_{1,t},N_{2,t}) = N_{1,t} * exp[r_{N_{1}} * (1 - N_{1,t} - \alpha_{12} * N_{2,t})]

N_{2,t+1} = \phi_{N_{2}}(N_{1,t},N_{2,t}) = N_{2,t} * exp[r_{N_{2}} * (1 - N_{2,t} - \alpha_{21} * N_{1,t})]

Where r is the growth rate and the \alpha are the interspecific competition coefficients. Note that since we use normalized abundances, the intraspecific competition coefficients are equal to one.

Space can be introduced into this model by considering dispersal between neighboring cells. Say we have a 50 x 50 lattice map and that individuals disperse to neighboring cells with probability D; as a result the normalized abundance of species 1 in cell k at time t+1 will be:

N_{1,t+1,k} = (1 - D) * \phi_{N_{1}}(N_{1,t,k},N_{2,t,k}) + \frac{D}{8} * \sum_{j=1}^{8} \phi_{N_{1}}(N_{1,t,j},N_{2,t,j})

Where j indexes the 8 neighboring cells around cell k. A similar equation holds for the competing species.

Now let’s implement this in R (you may find a nicer version of the code here).

I start by defining two functions: one that computes the population dynamics based on the first two equations, and one that returns, for each cell in the lattice, the row numbers of the neighboring cells:

#the packages used in this post
library(plyr)     #for rbind.fill and ldply
library(reshape2) #for melt
library(ggplot2)  #for the plots
library(viridis)  #for scale_fill_viridis

#Lotka-Volterra discrete time competition function
#@Ns are the population abundances
#@rs are the growth rates
#@as is the competition matrix
discrete<-function(Ns,rs,as){
  #return the population abundances at time t+1
  return(as.numeric(Ns*exp(rs*(1-as%*%Ns))))
}

#a helper function returning the row numbers of all neighbouring cells,
#ie the cells in the 3x3 block around the focal cell minus the cell itself
find_neighbour<-function(vec,dat){
  lnb<-which(dat[,"x"]%in%(vec["x"]+c(-1,0,1)) &
             dat[,"y"]%in%(vec["y"]+c(-1,0,1)))
  #remove the focal cell itself
  lnb<-lnb[which(lnb!=vec["lnb"])]
  return(lnb)
}

Then I create a lattice, fill it with the two species and some random noise.

#a 50 x 50 lattice, both species starting around 0.5 with some noise
dat<-expand.grid(x=1:50,y=1:50)
dat$N1<-0.5+rnorm(2500,0,0.1)
dat$N2<-0.5+rnorm(2500,0,0.1)
dat$lnb<-1:2500 #the row (line) number of each cell
#store the lattice at each generation in a list
list_dat<-list(dat)
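As a quick sanity check of find_neighbour (my own addition): with a 50 x 50 lattice a corner cell should have 3 neighbours and an interior cell 8.

length(find_neighbour(unlist(dat[1,]),dat))  #cell (1,1), a corner: 3
length(find_neighbour(unlist(dat[60,]),dat)) #cell (10,2), interior: 8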

The next step is to initialize all the parameters; in the example given in the book the two species have equal growth rates, equal dispersal and equal interspecific competition.

#competition parameters
as<-matrix(c(1,1.2,1.2,1),ncol=2,byrow=TRUE)
#growth rate
rs<-c(1.5,1.5)
#dispersal rate
ds<-c(0.05,0.05)
#row numbers of the neighbouring cells of each cell, computed once
neigh<-apply(dat,1,function(x) find_neighbour(x,dat))

Now we are ready to run the model (the code is rather ugly, sorry for that, Saturday coding makes me rather lazy …)

generation<-1:60
#model the dynamics assuming absorbing boundaries
#(ie individuals that go out of the grid are lost to the system)
for(i in generation){
  list_dat[[(i+1)]]<-rbind.fill(apply(list_dat[[i]],1,function(x){
    #population dynamics within the focal grid cell
    intern<-(1-ds)*discrete(Ns=x[c("N1","N2")],rs=rs,as=as)
    #grab the neighbouring cells
    neigh_cell<-list_dat[[i]][neigh[[x["lnb"]]],]
    #the immigrants coming into the focal grid cell
    imm<-(ds/8)*rowSums(apply(neigh_cell,1,function(y){
      discrete(Ns=y[c("N1","N2")],rs=rs,as=as)}))
    out<-data.frame(x=x["x"],y=x["y"],N1=intern[1]+imm[1],
                    N2=intern[2]+imm[2],lnb=x["lnb"])
    return(out)
  }))
  #print(i) #uncomment to follow progress
}

First let’s reproduce figure 3.8 from the book:

#look at coexistence between the two species within one cell
cell525<-ldply(list_dat,function(x) c(x[525,"N1"],x[525,"N2"]))
cell525$Gen<-1:61
cell525<-melt(cell525,id.vars="Gen",measure.vars=1:2,
              variable.name="Species",value.name="Pop")
ggplot(cell525,aes(x=Gen,y=Pop,color=Species))+geom_path()

#look at coexistence across the whole system
all<-ldply(list_dat,function(x) c(sum(x[,"N1"]),sum(x[,"N2"])))
all$Gen<-1:61
all<-melt(all,id.vars="Gen",measure.vars=1:2,
          variable.name="Species",value.name="Pop")
ggplot(all,aes(x=Gen,y=Pop,color=Species))+geom_path()

 

space1
Abundance of the two species over the generations, within one focal cell and summed across the whole lattice.

And now figure 3.9 plus an extra figure looking at the correlation in population abundance between the two species at the beginning and the end of the simulation:

#compare species-species abundance correlation at the beginning and at the end of the simulation
exmpl<-rbind(list_dat[[1]],list_dat[[60]])
exmpl$Gen<-rep(c("Gen : 1","Gen : 60"),each=2500)
ggplot(exmpl,aes(x=N1,y=N2))+geom_point()+facet_grid(.~Gen)

space3
Species-species abundance correlation across cells at the beginning (Gen 1) and the end (Gen 60) of the simulation.

#look at variation in spatial patterns at the beginning and the end of the simulation
exmplm<-melt(exmpl,id.vars=c("x","y","Gen"),
             measure.vars=c("N1","N2"),variable.name="Species",value.name="Pop")
ggplot(exmplm,aes(x,y,fill=Pop))+geom_raster()+
  scale_fill_viridis(option="plasma",direction=-1)+facet_grid(Gen~Species)

space2
Spatial abundance patterns of the two species at the beginning and the end of the simulation.

 

Neat stuff, I am looking forward to implementing other models from this book!

 

Discussion on Biodiversity Trends in the Anthropocene

So today at our group meeting I gave a talk inspired by the McGill et al (2015) paper.

The basic messages that I take from this paper are threefold: (i) biodiversity is a complex and vague concept, so it is very easy, if one is not careful, to end up comparing apples and oranges when trying to synthesize information on biodiversity trends; (ii) spatial and temporal scales have a huge impact on biodiversity patterns, so when talking about biodiversity trends we should be explicit about the scales under consideration; and (iii) we are in desperate need of community data collected in a standardized way across many sites (especially in Africa and the tropics) and repeatedly over time.

In the presentation (which can be found here) I showed studies looking at different biodiversity trends, relying heavily on the literature in the McGill paper but also adding results published since that paper came out (here, here and here). I deliberately spent more time talking about trends in local species richness because I knew it would be discussed afterwards, as quite a few people in our group (including myself) work on biodiversity experiments.

Then I talked about the controversies that arose from the studies reporting flat trends in local species richness. When the Vellend et al paper came out in 2013 I was actually in Jena, sitting with the group of Nico Eisenhauer, and since the Vellend paper took quite a hard stance against grassland biodiversity experiment field sites, some people in the group published a short reply. At that time I did not feel really concerned by these issues (I was, after all, in the first year of my PhD), and when I look at the presentation I gave that year it is funny to note that I presented results from the Murphy and Romanuk study, which was easier to fit into the classic rhetoric of "we are losing biodiversity and it will affect ecosystem functioning", rather than the results from Vellend and colleagues. I briefly mentioned it in the presentation so I will do the same here: there is a great article, with great comments, on the Dynamic Ecology blog on this topic.

More recently a paper led by Gonzalez appeared in Ecology criticizing two specific papers reporting no trend in local species richness: Vellend et al 2013 (again!) and Dornelas et al 2014. They raised three major points: (i) the samples in both studies are not representative of global species richness and anthropogenic impacts, so it is not possible to claim that the results are globally valid; rather, they show a biased picture of the trend in local species richness. For me this is the major point. In contrast to the more recent study by Elahi et al, where the authors raised already in the abstract the caveat that their sample was not a random sample of coastal habitats due to the under-representation of impacted sites, the Vellend and Dornelas papers can be read as being globally valid. (ii) Time-series duration affects the results, with longer time series being more likely to report losses of species richness than shorter ones. Here it is interesting to compare Fig S2 in the Vellend paper with Fig 3 in the Gonzalez paper: it is exactly the same figure, the only difference being that the x-axis (study duration) is logged in one case and not in the other. I leave it to the discretion of the reader to reflect on whether the impact of study duration on species richness trends is more likely to be linear or logarithmic. (iii) Depending on the baseline used, different results might be reached. For example, say a piece of forest habitat has been completely logged and richness was measured before and after logging: what should be the baseline for looking at local richness trends over time, the data before logging or after it? Gonzalez et al argue that the undisturbed state should be used, while Vellend et al argue that in this case the disturbance will have a major impact on ecosystem functioning, and that classical BEF studies manipulating species richness within a certain habitat type would therefore be of little value.
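Just to illustrate point (ii): here is a toy example (entirely simulated data, my own addition, not a re-analysis of either paper) showing how logging the duration axis changes the visual impression of the very same data:

library(ggplot2)
#fake study durations and richness-change slopes, unrelated by construction
set.seed(123)
toy<-data.frame(duration=runif(100,2,100),slope=rnorm(100,0,0.05))
#same data plotted with a linear and with a logged duration axis
ggplot(toy,aes(x=duration,y=slope))+geom_point()+geom_smooth(method="lm")
ggplot(toy,aes(x=duration,y=slope))+geom_point()+geom_smooth(method="lm")+
  scale_x_log10()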

I personally think that issues (ii) and (iii) are easily addressed by re-analyses of the data, which can then be published and keep the discussion going; these two points are to my mind not really major ones. The issue of non-representativeness is much bigger: since most of the sites in the Vellend and Dornelas papers are far from representative of human impacts on natural systems, one needs to be careful when interpreting their results (which can be said of many studies out there). Anyway, the perfect dataset to look at biodiversity trends does not exist, so in the meantime it is nice to have such studies to generate some debate, shake some of the dusty, untested ideas that we hold (like "biodiversity is declining") and then go out and collect more data, to maybe one day be able to build this perfect dataset.

In the discussion that ensued after the talk it was interesting to hear that we talked much more about big concepts, like: are scientists responsible for the interpretation that the public makes of their results; since there are so many publications out there, one can cherry-pick studies to prove any point; the scientific success model is driving us towards big projects, big groups and big papers with catchy titles, complex stats and colourful figures. We spent more time talking about how we do science, and how the system is less than optimal, than about biodiversity trends.

Anyhow, I could go on for quite some pages on this topic; I'll stop here, hoping to find the motivation to write more posts in the near future.