
Plotting SEMs in R using semPlot

This is a short post presenting the great package semPlot, developed by Sacha Epskamp (check out his nice website), to make nice plots from your SEMs. SEMs are a modelling tool that allows the researcher to investigate complex relationships between variables, and there are many links to free tutorials online. Here I present a few tricks to plot SEMs in R that I have been using, but if you look at ?semPaths or ?qgraph you will see that there are many, many options to tune your graphs that I will not present here.

#loading the libraries
library(MASS) #for mvrnorm to simulate multivariate normal data
library(clusterGeneration) #this is to generate a positive definite covariance matrix
library(lavaan) #to fit the SEM
library(semPlot) #to plot the SEM
#simulate some data
sig<-genPositiveDefMat("onion",dim=5,eta=4)$Sigma #the covariance matrix
mus<-c(10,5,120,35,6) #the vector of the means
data<-as.data.frame(mvrnorm(100,mu=mus,Sigma=sig)) #the dataset (a sample size of 100 is assumed here)
names(data)<-c("CO2","Temp","Nitro","Biom","Rich") #giving it some names
#building an SEM with a latent variable
m<-'Abiot =~ CO2 + Temp + Nitro
Biom ~ Abiot
Rich ~ Abiot + Biom'
#fit the model
m.fit<-sem(m,data)

#the plot
#basic version, the what argument specifies what should be plotted, here we choose to look at the standardized path coefficients
semPaths(m.fit,what="std")

Here is just a basic version of the plot; many things can be changed. I will focus on the layout of the graph, the labels going into the nodes, and the different groups of variables.


#define the labels that will go into the nodes (illustrative values, adapt them to your own model)
lbls<-c("CO2\nconcentration","Temperature","Soil\nnitrogen","Plant\nbiomass","Plant\nrichness","Abiotic\nenvironment")
#define the groups
grps<-list(Abiotic=c("CO2","Temp","Nitro","Abiot"),Plant=c("Biom","Rich"))
#define the layout, a two-column matrix of X/Y node positions in the (-1,1)(-1,1) space
ly<-matrix(c(-0.75,-0.25,0.25,0.75,0.75,0,-0.75,-0.75,-0.75,0.75,0,-0.25),ncol=2)
#new plot
semPaths(m.fit,what="std",layout=ly,nodeLabels=lbls,groups=grps,
         color=c("brown","green"),sizeMan=8,posCol=c("blue","red"))
text(0.9,0.9,labels="Some text about\nthe model or\nthe weather in Indonesia")



In this new plot I used the layout argument to specify my home-made layout. The plot lives in a (-1,1)(-1,1) space and the position of each node can be specified using a two-column matrix containing the X and Y positions. To find out the order of the nodes and the edges one can do something like:
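One possible way to inspect that order is through the intermediate S4 object that semPlot builds from the fitted model (a sketch, assuming semPlot's semPlotModel class with its @Vars and @Pars slots):

```r
library(semPlot)
# build the intermediate model representation used by semPaths
mod <- semPlotModel(m.fit)
mod@Vars  # the variables (nodes) in order
mod@Pars  # the parameters (edges) in order
```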





Using this knowledge we can define our own node labels using the nodeLabels argument (the \n adds a line break in the label). Finally, the groups argument needs a list of character vectors of the different groups; the nodes belonging to each group can get a particular color defined by color. sizeMan controls the size of the manifest nodes and posCol the color of the edges; when given two colors, the first one will be used for the positive edges and the second one for the negative ones.

Again, just by looking at the help pages of semPaths you will see many more ways to tailor your graphs to your needs. Happy plotting!

Using bootMer to do model comparison in R

Setting the right random effect part in mixed effect models can be tricky in many applied situations. I will not talk here about choosing whether a grouping variable (sites, individuals …) should be included as a fixed or as a random term; please see Gelman and Hill (2006) and Zuur et al (2009) for more information. Here I will present the use of the bootMer function in the package lme4 to compare two models with different random effect specifications and decide whether one model does a (significantly) better job at fitting the data. The standard way to compare two models is to derive the likelihood ratio test (LRT) value and, since this should follow a chi-square distribution, derive a p-value corresponding to the probability of observing such an extreme LRT under the null hypothesis that both models perform equally well. This approach works relatively well for GLMs, but for (G)LMMs several problems arise, mainly due to boundary effects (the null hypothesis being in this case that the variance of the random effects is 0), see Bolker et al (2009). One way to do model comparison in (G)LMMs is to derive bootstrapped likelihood values from the two competing models and to draw confidence intervals around the observed values to decide if one model performs better than the other. Below is some code with simulated data (a cleaner version with more graphs can be found here).
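Before the mixed-model code, here is that standard LRT in its simplest base-R form (ordinary lm fits and my own toy data, so none of the mixed-model subtleties get in the way):

```r
set.seed(42)
x <- runif(100, 0, 10)
y <- 1 + 3 * x + rnorm(100, 0, 1)
m0 <- lm(y ~ 1)  # null model without the slope
m1 <- lm(y ~ x)  # model with the slope
# the likelihood ratio test statistic
lrt <- as.numeric(-2 * logLik(m0) + 2 * logLik(m1))
# p-value from the chi-square distribution,
# df = difference in the number of parameters between the models
p.val <- pchisq(lrt, df = 1, lower.tail = FALSE)
```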


##### work on model comparison using bootMer #####
##### simulate some data and fit a random intercept model to them
library(lme4) #for lmer and bootMer
library(arm) #for sim
library(RColorBrewer) #for brewer.pal
x <- runif(100, 0, 10)
# the grouping variable
site <- gl(n = 10, k = 10)
# the random intercept effect, the simulated standard deviation around the
# intercept is 1
rnd <- rnorm(10, 0, 1)
# the simulated response variable, note that the fixed effect coefficients
# are 1 for the intercept and 3 for the slope. Also the simulated residuals
# will have a standard deviation of one
y <- rep(1 + rnd, each = 10) + 3 * x + rnorm(100, 0, 1)
# fit the model using Maximum Likelihood to be able to use the LRT
m1 <- lmer(y ~ x + (1 | site), REML = FALSE)

# simulate to generate credible intervals
simu <- sim(m1, n.sims = 1000)
# a new model matrix with ordered and equally spaced predictor values
new.x <- model.matrix(~x, data = data.frame(x = seq(0, 10, length.out = 100)))
new.y <- matrix(ncol = 1000, nrow = 100)
# get the predicted response values for each 1000 simulations of the fixed
# effect model parameters
new.y <- apply(simu@fixef, 1, function(x) new.x %*% x)
# compute the lower/upper quantile
lower <- apply(new.y, 1, function(x) quantile(x, prob = 0.025))
upper <- apply(new.y, 1, function(x) quantile(x, prob = 0.975))
median <- apply(new.y, 1, function(x) quantile(x, prob = 0.5))

# nice plot
pal <- brewer.pal(10, "RdYlBu")
plot(y ~ x, col = rep(pal, each = 10), pch = 16)
lines(new.x[, 2], median, col = "blue", lwd = 2)
lines(new.x[, 2], lower, col = "red", lwd = 2, lty = 2)
lines(new.x[, 2], upper, col = "red", lwd = 2, lty = 2)


# fit a second model with a random slope effect
m2 <- lmer(y ~ x + (x | site), REML = FALSE)

# using bootMer to compute 100 bootstrapped log-likelihoods
b1 <- bootMer(m1, FUN = function(x) as.numeric(logLik(x)), nsim = 100)
b2 <- bootMer(m2, FUN = function(x) as.numeric(logLik(x)), nsim = 100)

# the observed LRT value
lrt <- as.numeric(-2 * logLik(m1) + 2 * logLik(m2))
# the 100 bootstrapped LRT
lrt.b <- -2 * b1$t + 2 * b2$t
# plot
quant <- quantile(lrt.b, probs = c(0.025, 0.975))
plot(1, lrt, xlab = "", ylab = "Likelihood ratio test", xaxt = "n", ylim = c(quant[1] + 
    1, quant[2] + 1))
abline(h = 0, lty = 2, lwd = 2, col = "red")
segments(1, quant[1], 1, quant[2], lend = 1)

In the above example the 95% CI of the bootstrapped LRT crosses the 0 line, which means that neither model fits the data better than the other. In this case the rule would be to use the simplest model (the one with the lowest number of parameters), which is the random-intercept model.

Let's make another example:

# now simulate data from random intercept/ slope
rnd.slope <- rnorm(10, 0, 0.5)
y <- rep(1 + rnd, each = 10) + rep(3 + rnd.slope, each = 10) * x + rnorm(100, 
    0, 1)

# the new models
m3 <- lmer(y ~ x + (x | site), REML = FALSE)
m4 <- lmer(y ~ x + (1 | site), REML = FALSE)

# the observed LRT value
lrt <- -2 * logLik(m4) + 2 * logLik(m3)
# the bootstrap
b3 <- bootMer(m3, FUN = function(x) as.numeric(logLik(x)), nsim = 100)
b4 <- bootMer(m4, FUN = function(x) as.numeric(logLik(x)), nsim = 100)

# the 100 bootstrapped LRT
lrt.b <- -2 * b4$t + 2 * b3$t

# the nice plot
quant <- quantile(lrt.b, probs = c(0.025, 0.975))
plot(1, lrt, xlab = "", ylab = "Likelihood ratio test", xaxt = "n", ylim = c(0, 
    quant[2] + 1))
abline(h = 0, lty = 2, lwd = 2, col = "red")
segments(1, quant[1], 1, quant[2], lend = 1)

In this second example the random intercept/slope model fits the data much better than the random-intercept one; this random effect structure should be kept. As mentioned in Bolker et al (2009), whether the LRT is relevant depends on the design and on the interest that is put on the random terms. In the case where random terms are due to the particular design of the study (sites, blocks …) and are considered a “nuisance”, they may be included in the models without testing for the increase in fit that their inclusion provides. In the case where the random effects are of interest (individual sampling units …), using the LRT is a sensible way to detect and interpret their effect. The function PBmodcomp in the package pbkrtest allows one to do all the preceding code in just one line, with various ways to test for the significance of the likelihood ratio (thanks to Ben Bolker for his comment).
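A sketch of that one-liner (assuming the m3/m4 models fitted above are in the workspace; PBmodcomp takes the larger model first, and nsim = 100 mirrors the bootMer runs above):

```r
library(pbkrtest)
# parametric bootstrap comparison of the two nested models
pb <- PBmodcomp(m3, m4, nsim = 100)
summary(pb)
```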

Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J. S. S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in ecology & evolution, 24(3), 127-135.
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Zuur, A., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed effects models and extensions in ecology with R. Springer.

Regular expression and associated functions in R

When working with strings, regular expressions are an extremely powerful tool to look for specific patterns. In informatics a string is several characters put together; this can be words, sentences, or DNA code. Regular expressions were developed in the fifties (thanks to Jeff Newmiller for his comment) by computer scientists, and I discovered them using Perl. They have since been implemented in various other programming languages due to their nice functionalities. Here I will present the functions in R that use regular expressions and some general uses of regular expressions in ecology. As always, a cleaner version of the post is available online.

#regular expression functions, there are 7 functions that can use regular expressions:
#grep, grepl, sub, gsub, regexpr, gregexpr and regexec

Each of these functions will look for a certain pattern in the provided string(s), which may be a vector of strings.

# example using grep
words <- c("abc", "Cde", "Truc", "Machin")  #a vector of strings
grep("c", words)  #looking for 'c' in each element of the vector, this returns the indices of the elements containing the pattern
grep("c", words, value = TRUE)  #same but returns the elements of the vector containing the pattern
# by default grep is case-sensitive, it can be turned case-insensitive
grep("C", words, ignore.case = TRUE)
# example using grepl
grepl("c", words)  #returns a logical vector indicating whether the pattern was found in each element of the vector
# example using sub and gsub, these two functions replace the pattern with a
# replacement value specified by the user
species <- c("Rattus_domesticus", "Formica_rufa", "Germanium_pratense_germanica")
sub("_", " ", species)  #sub will only replace the first occurrence of the pattern in the string
gsub("_", " ", species)  #gsub will replace all occurrences
# example using regexpr and gregexpr
species <- c("Onthophagus_vacca", "Copris_hispanus", "Carabus_hispanus_hispanus")
regexpr("hisp", species)  #regexpr returns the position in the string of the first occurrence of the pattern as well as the length of the pattern matched
gregexpr("hisp", species)  #gregexpr will return the positions of all matched occurrences of the pattern
regexec("hisp", species)  #similar to regexpr but with a different output formatting

As seen in the few examples above, when we have a clear idea of the pattern we want to match we can just pass it to the pattern argument of the functions. However, sometimes we do not know the exact form of the pattern, or we want to match several closely related strings at once; this is where regular expressions come into play. They are an abstract way to represent the different elements (letters, digits, spaces …) present in strings.

# the regular expression help page in R
?regex
# regular expression with 5 different strings
words <- c("Bonjour", "Bienvenue", "Au revoir", "A la bonne heure", "2 heures")
#'\\w' will match any word character (letters, digits or the underscore) present in the strings
grep("\\w", words)  #there are word characters in all elements of the vector
# now if we want to keep only elements containing at least 7 consecutive
# word characters we use:
grep("\\w{7}", words)
# we can also match digits using '\\d', space '\\s' and their negation:
# '\\W','\\D','\\S' could you guess what the coming regular expression
# will match?
grep("\\w{2}\\W\\w+", words, value = TRUE)
# by placing '^' we match from the beginning of the string, similarly $ will
# match the end
grep("^\\w{2}\\W\\w+$", words, value = TRUE)
# a last one using '\\d'
grep("\\d\\W\\w+", words, value = TRUE)

Several comments: {n} will match an item exactly n times, {n,} n or more times, and {n,m} at least n times but no more than m times. We can also use + to match an item one or more times and * to match it zero or more times. Have a look at ?regex where everything is explained at length.
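A tiny illustration of these quantifiers on some toy strings of my own:

```r
toy <- c("aa", "aaa", "aaaa", "b")
# at least three a's in a row
grep("a{3}", toy, value = TRUE)  # "aaa" "aaaa"
# exactly two or three a's, anchored so nothing else is allowed around them
grep("^a{2,3}$", toy, value = TRUE)  # "aa" "aaa"
# one or more a's, matches every element containing an 'a'
grep("a+", toy, value = TRUE)  # "aa" "aaa" "aaaa"
```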

Now, I use regular expressions most of the time to format labels or species names; this is where gsub in combination with regular expressions becomes very handy. For example:

# We have three labels with a plot ID BX followed by a genus species
# information with 3 letters each (Poatri = Poa trivialis), we would like to
# have the first letter for the genus and the species as upper-case
species <- c("B1_Poatri", "B2_PlaLan", "B3_lAtPRA")
# in sub and gsub we can put part of the pattern between parentheses to call
# it in the replacement argument
gsub(".{5}(\\w{1}).*", "\\1", species)
# now we use the argument perl=TRUE to use the \\U and \\L special
# symbols that set the following letters to upper and lower-case
# respectively
gsub("(.{3})(\\w{1})(\\w{2})(\\w{1})(\\w{2})", "\\1\\U\\2\\L\\3\\U\\4\\L\\5", 
    species, perl = TRUE)

Here ends this first overview of regular expressions in R. I use them quite often for formatting strings when I don’t want to spend hours with calc. There are many subtleties not covered here (UTF-8 mode, perl mode …) but this information should be enough to get everyone started.

Why I will go vote on May 25th

Between May 22nd and 25th, Europeans will elect their representatives to the European Parliament. This election is often marked by disinterest, a lack of understanding of what is at stake, and therefore very high abstention: in the last election only 40% of French voters turned out. I will explain a few of the reasons why I think this ballot is important and deserves my attention.



The European Union guarantees peace

Almost 100 years ago Europe sank into the murderous madness of the First World War, destroying a continent that had until then been prosperous. Remembering the European history of the last century reinforces the feeling that the European Union is essential to continue the reconciliation between peoples and to leave behind the barbarity and racism of that century. The European Union cannot develop without the support of its population, and voting in the only European elections that currently exist is the best way to show that support.


This election is exceptional

During the coming term the powers of the European Parliament will be extended, and the legislative weight of the European Commission (a political institution composed of members appointed by the national governments) and of the Parliament will gradually be put on an equal footing. In other words, the European Parliament now has real political weight in the Union, which means that the vote of the European population will have an impact on European policy through its elected representatives.

Another more symbolic but just as important point: the president of the next European Commission will be elected by the Parliament, which means that the next majority party will be able to send one of its members to the head of the Union. Previously the president of the Commission was appointed by the various European heads of government; now one of the key representatives of Europe will be elected by universal suffrage!


Europe is an opportunity for the future

Faced with current problems such as climate change, the crisis in Ukraine, and the economic crisis, solutions and actions at the national level have a limited reach and little effectiveness. When 28 states representing more than 400 million people commit to an action, it carries much greater political and diplomatic weight. I therefore think that we must move towards stronger European integration to meet the challenges of the coming decades. Voting in the European elections shows all politicians that the population cares about these issues and is confident that Europe will be better able to solve them than the national governments.

A few references:

So here are my main reasons for going to the polls on May 25th to express my confidence in Europe. What about you? Do you trust Europe? What would you change if you had the power to do so?

Importing 100 years of climate change into R

This is a flashback post: I was working on species distribution shifts over the last 40 years last summer, and recently Rémi Genevest contacted me asking how I managed to import the CRU TS 1.2 dataset into R. As always, a more readable version of the code can be found here.

At that time I used a not very elegant coding involving SpatialPixels and SpatialGridDataFrame. Scrolling back to the question I asked on the R-sig-geo mailing list back then, I stumbled across an answer from Robert Hijmans that I did not take into account at the time. Now, one year later, I found that his answer was going in the right direction and made some heavy changes to the code.

#reading in CRU files into R
library(raster) #for brick() and the raster manipulations below

#for the CRU TS 1.2 data download the .zip from the CRU website

#the raster we get at the end, the data are monthly for all the years between 1901 and 2000
temp<-brick(nrows=228,ncols=258,xmn=-11,xmx=32,ymn=34,ymx=72,nl=1200,crs=CRS("+proj=longlat +datum=WGS84"))

#example using the temperature

#now turn the data into a matrix format with every line corresponding to a raster cell and the first two columns giving the column and row numbers of the cell
#now add the temperature data from these cells for all months of all years
numb<-apply(numb,1,function(x) seq(x[1],37465029,1203))
mat<-cbind(mat,apply(numb,2,function(x) as.numeric(all_dat[x])))

#reverse the rows number since they are numbered from bottom to top in CRU and from top to bottom in rasters

#get the cell numbers of each box defined in the CRU dataset
#attribute to these cells the temperature values
#divide by 10 to get the temperature in degree celsius
#put names to the layers

#the winter mean temperature between 1914 and 1918


#the standard deviation in temperature for the years 1901 and 2000


The only mathematical magic involved here is changing the row numbers. From this huge dataset we can then do lots of neat things: for example we can see how cold the soldiers of the First World War were (first raster plot), or we can look at changes in the standard deviation of temperature between the years 1901 and 2000, after one century of climate change.
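The row flipping itself is plain base-R matrix indexing; a toy example (my own, not taken from the CRU code):

```r
# a small matrix standing in for the CRU grid
mm <- matrix(1:6, nrow = 3, byrow = TRUE)
# CRU numbers its rows from bottom to top, rasters from top to bottom,
# so we reverse the row order:
mm.flipped <- mm[nrow(mm):1, ]
```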

If you use such data in your work do not forget to cite the owners: Mitchell, T.D., Carter, T.R., Jones, P.D., Hulme,M., New, M., 2003: A comprehensive set of high-resolution grids of monthly climate for Europe and the globe: the observed record (1901-2000) and 16 scenarios (2001-2100). Journal of Climate: submitted

And if you have some knowledge of similar dataset (monthly values over Europe) at a finer spatial resolution please contact me!



Recursive functions (with R)

Recursive functions are functions that call themselves during their own execution.

One of the challenges posted by Cédric on his website dares us to write a factorial function that is recursive.

Here is a solution in R:

#recursive factorial
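A minimal recursive implementation (my sketch of the solution described; the stopping condition at n == 1 is what ends the chain of self-calls):

```r
f <- function(n) {
  if (n == 1) {
    return(1)  # base case, stops the recursion
  }
  n * f(n - 1)  # recursive call on a smaller problem
}
f(3)  # 3 * f(2) = 3 * 2 * f(1) = 6
```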

Understanding how and why this works is rather complicated and burns some neurons. I picture it as Russian dolls: when we call f(3), the function calls f(2), which itself calls f(1).

Thanks for the challenge, Cédric.

Checking (G)LM model assumptions in R

(Generalized) Linear models make some strong assumptions concerning the data structure:

  1. Independence of the data points
  2. Correct distribution of the residuals
  3. Correct specification of the variance structure
  4. Linear relationship between the response and the linear predictor

For a simple lm, points 2-4 mean that the residuals should be normally distributed, the variance should be homogeneous across the fitted values of the model and for each predictor separately, and the y’s should be linearly related to the predictors. In R, checking these assumptions from an lm or glm object is fairly easy:

# testing model assumptions on some simulated data
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rnorm(100, 0, 1)
m <- lm(y ~ x)
par(mfrow = c(2, 2))
plot(m)

The top-left and top-right graphs are the most important ones. The top-left graph checks for the homogeneity of the variance and the linear relation; if you see no pattern in this graph (i.e. if it looks like stars in the sky), then your assumptions are met. The second graph checks for the normal distribution of the residuals: the points should fall on a line. The bottom-left graph is similar to the top-left one except that the y-axis shows square-root standardized residuals (?rstandard), making it easier to see heterogeneity of the variance. The fourth one allows detecting points that have a too big impact on the regression coefficients and that deserve closer inspection. These graphs from simulated data are extremely nice; in applied statistics you will rarely see such nice graphs. Many people new to linear modelling and used to strict black-and-white p-value decisions are a bit lost, not knowing when their model is fine and when it should be rejected. Below is an example of a model that is clearly wrong:
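The quantities behind these plots can also be pulled out numerically; for instance the standardized residuals used by the scale-location plot (base R, on simulated data like the above):

```r
set.seed(1)
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rnorm(100, 0, 1)
m <- lm(y ~ x)
rs <- rstandard(m)  # standardized residuals (see ?rstandard)
# under the model assumptions roughly 95% of them should lie within +/- 2
mean(abs(rs) < 2)
```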

# a wrong model: the response is generated with quadratic and cubic
# terms but we fit only a straight line
y <- 1 + 2 * x + 1 * x^2 - 0.5 * x^3
m <- lm(y ~ x)
par(mfrow = c(2, 2))
plot(m)

These two examples are easy; real life is not. Real-life models are sometimes hard to assess. The bottom line is that you should always check your model assumptions and be truthful. Reporting and interpreting models that do not meet their assumptions is bad science and close to falsification of the results. Now let’s see a real-life example where it is tricky to decide whether the model meets the assumptions or not; the dataset is in the ggplot2 library, look at ?mpg for a description:


# a real life dataset
library(ggplot2)  #for the mpg dataset
head(mpg)
##   manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact
m <- lm(cty ~ displ + factor(cyl), mpg)
par(mfrow = c(2, 2))
plot(m)


The residuals vs fitted graph looks rather OK to me; there is some higher variance for high fitted values but this does not look too bad. However the qqplot (checking the normality of the residuals) looks pretty awful, with residuals on the right consistently moving away from the theoretical line. A nice way to see whether the patterns differ from those expected under the model conditions is to derive new response values from the fitted coefficients and the residual variance: you can then draw 8 new plots and randomly allocate the real plot to a position; if you are able to find the real plot, and if its patterns are different from the others, then the model does not meet its assumptions:

# randomizing to see if the patterns are different from expected
modmat <- model.matrix(~displ + factor(cyl), data = mpg)
mus <- modmat %*% coef(m)
# the position of the real plot in a 3x3 panel
s <- sample(1:9, size = 1)
par(mfrow = c(3, 3))
for (i in 1:9) {
    if (i == s) {
        # the real plot
        qqnorm(resid(m))
        qqline(resid(m))
    } else {
        # draw new y values from the fitted values with the residuals standard
        # deviation and turn them into new residuals
        y <- rnorm(dim(mpg)[1], mus, sd(resid(m)))
        y <- y - fitted(m)
        qqnorm(y)
        qqline(y)
    }
}
Are you able to find which panel holds the real plot? I can: it is on the second row, third column. The other qqplots do not look that different from the real one; there are, however, a few points that are definitively away from what we expect under the model assumptions. A next step would be to look at these points and understand where these discrepancies might come from (measurement error, special cases …). We can also derive such plots for checking the first graph.
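That last check, for the residuals-vs-fitted panel, can be sketched the same way in base R (my own toy version on simulated data):

```r
set.seed(2)
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rnorm(100)
m <- lm(y ~ x)
# where the real residuals-vs-fitted plot will land in the 3x3 grid
s <- sample(1:9, size = 1)
par(mfrow = c(3, 3))
for (i in 1:9) {
    if (i == s) {
        # the real plot
        plot(fitted(m), resid(m), xlab = "Fitted", ylab = "Residuals")
    } else {
        # new responses simulated under the fitted model
        y.sim <- rnorm(length(y), fitted(m), sd(resid(m)))
        plot(fitted(m), y.sim - fitted(m), xlab = "Fitted", ylab = "Residuals")
    }
}
```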

Resources on model checking:

