Count data: To Log or Not To Log

Count data are widely collected in ecology, for example when one count the number of birds or the number of flowers. These data follow naturally a Poisson or negative binomial distribution and are therefore sometime tricky to fit with standard LMs. A traditional approach has been to log-transform such data and then fit LMs to the transformed data. Recently a paper advocated against the use of such transformation since it led to high bias in the estimated coefficients. More recently another paper argued that log-transformation of count data followed by LM led to lower type I error rate (ie saying that an effect is significant when it is not) than GLMs. What should we do then?

Using a slightly changed version of the code published in the Ives 2015 paper let’s explore the impact of using these different modelling strategies on the rejection of the null hypothesis and the bias of the estimated coefficients.

#load the libraries

#download and load the functions that will be used

Following the paper from Ives and code therein I simulated some predictor (x) and a response (y) that follow a negative binomial distribution and is linearly related to x.
In the first case I look at the impact of varying the sample size on the rejection of the null hypothesis and the bias in the estimated coefficient between y and x.

######univariate NB case############
#range over n
output<-compute.stats(NRep=500,n.range = c(10,20,40,80,160,320,640,1280))
ggplot(output,aes(x=n,y=Reject,color=Model))+geom_path(size=2)+scale_x_log10()+labs(x="Sample Size (log-scale)",y="Proportion of Rejected H0")+base_theme
ggplot(output,aes(x=n,y=Bias,color=Model))+geom_path(size=2)+scale_x_log10()+labs(x="Sample Size (log-scale)",y="Bias in the slope coefficient")+base_theme

Figure 1

For this simulation round the coefficient of the slope (b1) was set to 0 (no effect of x on y), and only the sample size varied. The top panel show the average proportion of time that the p-value of the slope coefficient was lower than 0.05 (H0:b1=0 rejected). We see that for low sample size (<40) the Negative Binomial model has higher proportion of rejected H0 (type I error rate) but this difference between the model disappear as we reached sample size bigger than 100. The bottom panel show the bias (estimated value – true value) in the estimated coefficient. For very low sample size (n=10), Log001, Negative Binomial and Quasipoisson have higher bias than Log1 and LogHalf. For larger sample size the difference between the GLM team (NB and QuasiP) and the LM one (Log1 and LogHalf) gradually decrease and both teams converged to a bias around 0 for larger sample size. Only Log0001 is behaved very badly. From what we saw here it seems that Log1 and LogHalf are good choices for count data, they have low Type I error and Bias along the whole sample size gradient.

The issue is that an effect of exactly 0 never exist in real life where most of the effect are small (but non-zero) thus the Null Hypothesis will never be true. Let’s look know how the different models behaved when we vary b1 alone in a first time and crossed with sample size variation.

#range over b1
outpu<-compute.stats(NRep=500,b1.range = seq(-2,2,length=17))
ggplot(outpu,aes(x=b1,y=Reject,color=Model))+geom_path(size=2)+base_theme+labs(y="Proportion of Rejected H0")
ggplot(outpu,aes(x=b1,y=Bias,color=Model))+geom_path(size=2)+base_theme+labs(y="Bias in the slope coefficient")


Here the sample size was set to 100, what we see in the top graph is that for a slope of exactly 0 all model have a similar average proportion of rejection of the null hypothesis. As b1 become smaller or bigger the average proportion of rejection show very similar increase for all model expect for Log0001 which has a slower increase. This curves basically represent the power of the model to detect an effect and is very similar to the Fig2 in the Ives 2015 paper. Now the bottom panel show that all the LM models have bad behaviour concerning their bias, they have only small bias for very small (close to 0) coefficient, has the coefficient gets bigger the absolute bias increase. This means that even if the LM models are able to detect an effect with similar power the estimated coefficient is wrong. This is due to the value added to the untransformed count data in order to avoid -Inf for 0s. I have no idea on how one may take into account arithmetically these added values and then remove its effects …

Next let’s cross variation in the coefficient with sample size variation:

#range over n and b1
ggplot(output3,aes(x=n,y=Reject,color=Model))+geom_path(size=2)+scale_x_log10()+facet_wrap(~b1)+base_theme+labs(x="Sample size (log-scale)",y="Proportion of rejected H0")
ggplot(output3,aes(x=n,y=Bias,color=Model))+geom_path(size=2)+scale_x_log10()+facet_wrap(~b1)+base_theme+labs(x="Sample size (log-scale)",y="Bias in the slope coefficient")

Figure 3

The tope panel show one big issue of focussing only on the significance level: rejection of H0 depend not only on the size of the effect but also on the sample size. For example for b1=0.75 (a rather large value since we work on the exponential scale) less than 50% of all models rejected the null hypothesis for a sample size of 10. Of course as the effect sizes gets larger the impact of the sample size on the rejection of the null hypothesis is reduced. However most effect around the world are small so that we need big sample size to be able to “detect” them using null hypothesis testing. The top graph also shows that NB is slightly better than the other models and that Log0001 is again having the worst performance. The bottom graphs show something interesting, the bias is quasi-constant over the sample size gradient (maybe if we would look closer we would see some variation). Irrespective of how many data point you will collect the LMs will always have bigger bias than the GLMs (expect for the artificial case of b1=0)

To finish with in Ives 2015 the big surprise was the explosion of type I error with the increase in the variation in individual-level random error (adding a random normally distributed value to the linear predictor of each data point and varying the standard deviation of these random values) as can be seen in the Fig3 of the paper.

#range over b1 and sd.eps
ggplot(output4,aes(x=sd.eps,y=Reject,color=Model))+geom_path(size=2)+facet_wrap(~b1)+base_theme+labs(x="",y="Proportion of rejected H0")
ggplot(output4,aes(x=sd.eps,y=Bias,color=Model))+geom_path(size=2)+facet_wrap(~b1)+base_theme+labs(x="Standard Deviation of the random error",y="Bias of the slope coefficient")

Figure 4

Before looking at the figure in detail please note that a standard deviation of 2 in this context is very high, remember that these values will be added to the linear predictor which will be exponentiated so that we will end up with very large deviation. In the top panel there are two surprising results, the sign of the coefficient affect the pattern of null hypothesis rejection and I do not see the explosion of rejection rates for NB or QuasiP that are presented in Ives 2015. In his paper Ives reported the LRT test for the NB models when I am reporting the p-values from the model summary directly (Wald test). If some people around have computing power it would be interesting to see if changing the seed and/or increasing the number of replication lead to different patterns … The bottom panel show again that the LMs bias are big, the NB and QuasiP models have an increase in the bias with the standard deviation but only if the coefficient is negative (I suspect some issue with the exponentiating of large random positive error), as expected the GLMM perform the best in this context.

Pre-conclusion, in real life of course we would rarely have a model with only one predictor, most of the time we will build larger models with complex interaction structure between the predictors. This will of course affect both H0 rejection and Bias, but this is material for a next post 🙂

Let’s wrap it up; we’ve seen that even if LM transformation seem to be a good choice for having a lower type I error rate than GLMs this advantage will be rather minimal when using empirical data (no effect are equal to 0) and potentially dangerous (large bias). Ecologists have sometime the bad habits to turn their analysis into a star hunt (R standard model output gives stars to significant effects) and focusing only on using models that have the best behavior towards significance (but large bias) does not seem to be a good strategy to me. More and more people call for increasing the predictive power of ecological model, we need then modelling techniques that are able to precisely (low bias) estimate the effects. In this context transforming the data to make them somehow fit normal assumption is sub-optimal, it is much more natural to think about what type of processes generated the data (normal, poisson, negative binomial, with or without hierarchical structure) and then model it accordingly. There are extensive discussion nowadays about the use and abuse of p-values in science and I think that in our analysis we should slowly but surely shifts our focus from “significance/p-values<0.05/star hunting” only to a more balanced mix of effect-sizes (or standardized slopes), p-values and R-square.

Checking (G)LM model assumptions in R

(Generalized) Linear models make some strong assumptions concerning the data structure:

  1. Independance of each data points
  2. Correct distribution of the residuals
  3. Correct specification of the variance structure
  4. Linear relationship between the response and the linear predictor

For simple lm 2-4) means that the residuals should be normally distributed, the variance should be homogenous across the fitted values of the model and for each predictors separately, and the y’s should be linearly related to the predictors. In R checking these assumptions from a lm and glm object is fairly easy:

# testing model assumptions some simulated data
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rnorm(100, 0, 1)
m <- lm(y ~ x)
par(mfrow = c(2, 2))


The top-left and top-right graphs are the most important one, the top-left graph check for the homogeneity of the variance and the linear relation, if you see no pattern in this graph (ie if this graph looks like stars in the sky), then your assumptions are met. The second graphs check for the normal distribution of the residuals, the points should fall on a line. The bottom-left graph is similar to the top-left one, the y-axis is changed, this time the residuals are square-root standardized (?rstandard) making it easier to see heterogeneity of the variance. The fourth one allow detecting points that have a too big impact on the regression coefficients and that should be removed. These graphs from simulated data are extremely nice, in applied statistics you will rarely see such nice graphs. Now many people new to linear modelling and used to strict p-values black and white decision are a bit lost not knowing when there model is fine and when it should be rejected. Below is an example of a model that is clearly wrong:

# some wrong model
y <- 1 + 2 * x + 1 * x^2 - 0.5 * x^3
m <- lm(y ~ x)
par(mfrow = c(2, 2))


These two example are easy, life is not. Real-life models are sometimes hard to assess, the bottom-line is you should always check your model assumptions and be truthfull. Reporting and interpreting models that do not meet their assumptions is bad science and close to falsification of the results. Now let’s see a real life example where it is tricky to decide if the model meet the assumptions or not, the dataset is in the ggplot2 library just look at ?mpg for a description:


# a real life dataset
##   manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact
m <- lm(cty ~ displ + factor(cyl), mpg)
par(mfrow = c(2, 2))


The residuals vs fitted graphs looks rather ok to me, there is some higher variance for high fitted values but this does not look too bad to me, however the qqplot (checking the normality of the residuals) looks pretty awfull with residuals on the right consistently going further away from the theoretical line. A nice way to see if the patterns are different from those expected under the model conditions is to derive new response values from the fitted coefficient and the residual variance, you can then derive 8 new plots and randomly allocate the real plot to a position, if you are able to find the real plot and if its pattern are different from the other then the model do not meet its assumptions:

# randomizing to see if the patterns are different from expected
modmat <- model.matrix(~displ + factor(cyl), data = mpg)
mus <- modmat %*% coef(m)
# the position of the real plot in a 3x3 panel
s <- sample(1:9, size = 1)
par(mfrow = c(3, 3))
for (i in 1:9) {
    if (i == s) {
        # the real plot
    } else {
        # draw new y values from the fitted values with the residuals standard
        # deviation
        y <- rnorm(dim(mpg)[1], mus, sd(resid(m)))
        y <- y - fitted(m)



Are you able to find in which panel the real plot is? I can it is on the second row, third column. The other qqplot do not look that different from the real one, there are however a few points that are definitevely away from what we expect under the model assumptions. A next step would be to look at these points and understand where these discrepency might come from (measurement error, special case…) We can also derive such plots for checking the first graph.

Resources on model checking:

First Lecture ever

So .. just a quick post to share with you my first lecture ever. It was for a master course that my working group is running on Experimental design and statistics in ecology. I talked about tools for non-normal data in ecology. I felt rather confortable and I hope that my messages got somehow into their head. Anyhow it was extremely pleasant to say: “Here is what is very likely to fall for the exam…”

Here is the link to the lecture:

Enjoy it!

Generalized Linear Modelling in R (part 1)

In classical linear modelling we are assuming that the response variable (Y) is normally distributed, however for certain type of data like count data or presence/absence data this is not the case.

There is in statistic an ensemble of technique called Generalized Linear Modelling (GLM in short) where the reponse variable follow one known distribution, these are for example: gaussian, poisson, Bernoulli or binomial.

A GLM is made of three parts: the assumptions of the distribution of the response variable, the predictors used to build the model (the X part) and the link between the predictors and the response variable.

Let’s first see an exmple of GLM with a gaussian distribution (which is basically a linear model)

#load the dataset airquality measuring various climatic variables

#you can compare the two output which gives similar estimates of the slope and intercept

#model validation plot, residuals vs fitted values, residuals vs predictors and histogramm of #residuals

#add a plot with the estimated intercept and slope (which are both significantly different from #0)

Model Validation Plots

From this example we see that linear models are GLM with a gaussian distribution. The model validation plot show that there is no pattern in the residuals which could violate the assumption of the homogeneity of the variance and the residuals are normally distributed. The conclusion that we can draw from this are: the wind values significantly affect the temperature value in a negative way.

Know let’s look at some data which are not normally distributed, count data which follow a poisson distribution:

#a poisson example with the count of Amphibians road kills (Zuur, 2009)
rk<-data.frame(death=c(22,14,65,55,88,104,49, 66, 26, 47, 35, 55, 44, 30, 33, 29, 34, 64, 76, 32, 34, 32, 35, 22, 34, 25, 18,14, 14, 7, 7, 17, 10, 3, 6, 5, 2, 3, 2, 2, 7, 3, 5, 4, 7, 12, 7, 14, 10, 4, 11, 3)
,distance=250.214 , 741.179, 1240.080, 1739.885, 2232.130, 2724.089, 3215.511, 3709.401, 4206.477, 4704.176,5202.328, 5700.669, 6199.342, 6698.151, 7187.762, 7668.833, 8152.155, 8633.224, 
9101.411, 9573.578,10047.630, 10523.939, 11002.496, 11482.896, 11976.232, 12470.968, 12968.285, 13465.914, 13961.321, 14432.954,14904.995, 15377.983, 15854.389, 16335.936, 16810.109, 
17235.045, 17673.064, 18167.269, 18656.949, 19149.507,19645.717, 20141.987, 20640.729, 21138.903, 21631.542, 22119.102 ,22613.647, 23113.450, 23606.088, 24046.886,24444.874, 24884.803)

#do a poisson model

#model validation plot with the fitted data
plot(r~f,ylab="Residuals",xlab="Fitted values")

#add the predicted values with the 95% confidence interval

Model Validation Plots

The summary command on a glm object give similar information as in a lm object. We have the formula call, informations about the distribution of the residuals, the estimated coefficiants; in this case they are on a logarithmic scale since the default link between the predictors and the response variable for a poisson distribution is the log function (see ?family). Then information on the total and residuals sum of squares, and an AIC coefficient which can be usefull when comparing different models.


Zuur et al books,

Zeilis et al,