Science is also about convincing others, be it your supervisors or your collaborators, that what you intend to do is relevant to get the answers sought. This “what” can be anything from designing an experiment, to collecting samples or formatting the data. This post was inspired by a recent discussion with one of my supervisor who argued against using hierarchical regression models with a beta distribution to analyse our data. The intent of this post is for me to get some concrete arguments for this approach but it may also help others faced by collaborators or reviewers asking for justification of using seemingly complex methods. I will first provide a bit of background on the data and the different alternatives that we have to analyse them, then I will make a case for hierarchical models in general also outlining when they may be inappropriate and in a last will talk about the advantages of using non-normal regressions vs data transformation.
In this project we want to link leaf herbivory rates (proportion of damaged leaf area) to leaf elemental concentrations (Carbon, Nitrogen and Phosphorus), sotchiometric ratios and plant diversity. In summer 2014, I looked at tens of thousands of leaves from 20 different plant species to estimate herbivory rates and two other colleagues (Jordan Guiz and Nina Hacker) measured C, N and P concentration on the same leaves. This is a pretty nice dataset to work with. We have data on 20 different plant species and we expect difference in responses between the species, two main options are open: (i) fit one regression model per plant species (the classical approach in my current working environment) or (ii) fit hierarchical models with regression parameters varying between the species. In addition, as the hebrivory rates are proportions bounded between 0 and 1, again two options are possible: (i) transform the herbivory values and use normal regression models or (ii) use beta regression models.
Why fit hierarchical models?
I must first clarify some terminology, hierarchical models are equivalent to mixed-effect model, I do not use this last term as the division between fixed and random effects is rather confusing. I prefer to use: varying-intercept or constant slopes, this makes it clearer what parameters varies between grouping levels (ie species) or not.
The first benefit of using hierarchical models is the possibility to model parameters varying between groups of data (like an interaction) without having to estimate separate values per group. For instance if I was to fit one regression per plant species to my dataset and say that I have 5 parameters per model, I would have to estimate 100 parameters, which seems pretty large. On the other hand hierarchical models assume that all species-level parameters come from a specific distribution and one would just need to estimate the parameter of this distribution to get all species-level values. In practice we usually assume that the regression parameters are normally distributed across the groups and so we need to estimate two parameters per varying coefficients: the mean and the standard deviation. For instance, if we think that the effect of nitrogen concentration on herbivory rates varies between the plant species we would estimate the average nitrogen effect across all plant species plus the variation in this effect. In other words the first benefit of hierarchical models is to model the variation of the effects to certain grouping variables without dramatically inflating the number of parameters to estimate when there are large number of groups.
Another benefit arise when for some groups we have only limited sample sizes, in my dataset we have species with 50 observations but other species have only 20. Hierarchical models use partial pooling to drive extreme parameter estimates in groups with low sample size towards the overall mean. More specifically, say that if I modelled the effect of nitrogen on herbivory for a plant species with only few observations and observed an effect of 10, while the average nitrogen effect across all plant species is 1. It is likely that this large effect is due to a few extreme observations. Fitting a hierarchical model will shrink this estimate towards 1 using partial pooling, see section 12.2 in Gelman and Hill book. Hierarchical models use informations available across all groups to estimate parameters that varies between the groups, while in separate regression models only “local”, group-level information, that may be extreme depending on sample size, is used. Interestingly this was the main criticism of my supervisor against using hierarchical models, his point being that actual species-level might be distorted by partial pooling. Being largely influenced by Gelman and Hill book on this topic I never saw this as a disadvantage. As outlined above partial pooling drives estimates from groups with few observations towards the overall means as we have little information from these groups. In the limit case where we have an infinite number of observations per groups, the partial pooling estimate will just be the group-level means (ie no pooling). I would argue that partial pooling is a desirable property to avoid extreme estimates to pop-up that would just be representing a combination of measurement error and noise but no proper signal whatsoever.
More conceptually, hierarchical models are a natural way to account for structure and dependency in the data. Ecological data often comes in groups be it experimentally created groups like blocks, benches or climate chambers or biologically meaningful groups like species, individuals or communities. Hierarchical models when properly set account for this in parsimonious ways, have a look at Bolker et al paper on GLMM for ecology and evolution.
I would argue that my dataset would be best analyse with hierarchical models because we have a rather large number of groups (20 species) with regression parameters potentially varying between them. In this context it is interesting to get both the average effect size but also how variable is this effect between the species. Such information might also be derived, albeit in a more cumbersome way, from fitting separate regression models, for instance I do not know if one can get quantitative estimation of variation in regression parameters between separate models.
To transform or not to transform?
This is a pretty old debate, in ecology two papers were pretty influential this one dealing with count data, and this one with proportions. My current philosophy is that if the data collected fall in a pretty clear case like counting stuff or performing a set of 0/1 experiment, then using generalized linear models with appropriate distribution, maybe some overdispersion parameters, and standard link function is the way to go. If the data distribution looks weird but the data could fall within a normal distribution (ie plant height, or biomass) then one can try different standard transformation and look at model residuals to check if everything is ok.
Transforming is not just about getting normally distributed residuals, actually linear models are pretty resistant to departure from normality, what is more important is to have an equal variance of the error (to get, among other thing, correct estimation of standard error) but also linearity and additivity of the effects. For instance say that your true model is of the form: y = exp(beta1*x1)exp(beta2*x2)exp(r), with r ~ N(0,sigma), then log-transforming the relation gives: log(y) = beta1*x1 + beta2*x2 + r. The first relation was neither additive nor linear while after log-transforming a linear model would be appropriate. We never now the true model that generated our empirical data, so the best way to get a grasp at how correct our transformations and models are we need to plot, plot and re-plot. Plot the response against the predictor, plot the residuals against the fitted values, plot the residuals against predictors not included in the models …
So in the case at hand where I have proportion data, one option would be to use a beta regression. The beta distribution is pretty flexible and can fit a variety of shapes (See graph below inspired from Fig. 4.15 in Ben Bolker emdbook).
The issue is that zeroes are not allowed for beta regression as the likelihood shoots up to infinity for 0 or 1 values. Adding a small constant to all values solve this issue but seems unnatural. The other option would be to logit-transform the proportion data, after adding a small constant to again avoid the zeroes issue.
I would argue for using a beta regression as I am always feeling a bit uneasy to directly transform my response variables. I prefer to find an appropriate distribution and let it takes care of linking my response variables with potential constraints (ie non-negative values) to my predictors. It seems to be more natural and leave the data on their original scale without forcing them to take on specific shapes. Of course one should still check that model assumptions are respected by doing the usual checks. Early exploration in this case show that a beta regression fits pretty well to the data. But I might have to actually show plots for logit-transform data to convince my supervisor to continue using beta regression. More generally, in ecology we almost always assume linearity and additivity of effects when using statistical models, most ecologists always assume that transforming is only for having normally distributed response data.
At the end of the day my experience of the last years has been that I would need anyway to fit a wide variety of models some making sense, others being pretty hard to interpret. The art of science come then into play when trying to develop a coherent story from these multiple results.