Prediction in ecology - implementing a priority

In this post I would like to comment a bit on the recent forum paper by Houlahan et al. on the importance of prediction for demonstrating ecological understanding. Before going through what I took from the paper, I will outline where I stood on prediction and understanding before reading the article.

Explanation vs Prediction

During my master's I spent some time learning and using machine learning algorithms to answer different questions. Going through the literature on this wide topic, two papers had a large influence on my vision of statistical modeling. The first was by Leo Breiman, on the two cultures of statistical modeling. There Breiman argues that most statisticians approach problems using data models that produce simple and understandable pictures of the relation between predictors and responses. The issue is that simple models do not generally generate accurate predictions, especially in complex situations. Breiman therefore argues for statisticians to use a wider range of tools, including algorithmic models, which are more accurate but less interpretable.

The second paper was by Galit Shmueli, titled “To Explain or to Predict”. Its basic message is that every step in the modeling process, such as variable selection, choice of model algorithm or model selection, will be affected by whether the study aims at explaining or at predicting. If you want to explain some pattern, a specific set of methods can be used to understand the processes at work; if the aim is to predict a response under certain conditions, different methods will allow you to make accurate predictions. An important point is that rather than seeing understanding vs prediction as a simple trade-off, one should see it as a two-dimensional space: every model has some explanatory power but also some predictive accuracy. It is the researcher's task to decide which combination of understanding and predictive power is expected in the project at hand, in order to choose appropriate methods. These are the thoughts, more or less formed, that were in the back of my head while reading the Houlahan paper.
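To make the two-cultures contrast concrete, here is a minimal toy sketch of my own (not from either paper, and with made-up data): a simple linear “data model” with interpretable coefficients against an algorithmic k-nearest-neighbour regression, compared on out-of-sample error for a nonlinear response.

```python
import math
import random

random.seed(42)

# Toy data: a nonlinear response with observation noise.
def simulate(n):
    xs = [random.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + random.gauss(0, 0.2) for x in xs]
    return xs, ys

x_train, y_train = simulate(200)
x_test, y_test = simulate(200)

# "Data model" culture: simple linear regression with an interpretable slope.
def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = fit_linear(x_train, y_train)

# "Algorithmic" culture: k-nearest-neighbour regression, flexible but opaque.
def knn_predict(x, xs, ys, k=5):
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
    return sum(ys[i] for i in nearest) / k

def rmse(pred, obs):
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

rmse_linear = rmse([slope * x + intercept for x in x_test], y_test)
rmse_knn = rmse([knn_predict(x, x_train, y_train) for x in x_test], y_test)
print(f"linear RMSE: {rmse_linear:.2f}, k-NN RMSE: {rmse_knn:.2f}")
```

On this nonlinear truth the opaque k-NN model predicts new data better than the interpretable line, but it yields no slope or intercept to reason about, which is exactly the trade-off Breiman and Shmueli discuss.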

The priority of prediction

The main message I take from the Houlahan paper is that ecologists can claim understanding only if they are able to make quantitative predictions that are verified by data across space and time. The authors describe an iterative process, close to a Popperian framework, where after acquiring knowledge ecologists would demonstrate their understanding by making “correct and risky predictions”. If the predictions are not verified against new data, then new understanding should be acquired. The issue is that there is currently little incentive for ecologists to demonstrate their understanding; the authors outline three reasons for this: (i) institutions, by placing a high weight on novelty, encourage researchers to spend most of their career acquiring new knowledge (see this post), (ii) ecology is not perceived as a critical science by the public (compared to medicine, for example), so there is little pressure from the public to push ecologists to demonstrate their understanding of the world, and (iii) ecology is complex, everything depends on everything and is potentially context-dependent, so if ecologists were to try to demonstrate their understanding there might be some bad surprises (the new book by Mark Vellend addresses this issue by focusing on high-level processes rather than low-level ones).

What should we do?

The authors outline six ways in which ecology should change if prediction were set as a priority: (i) move away from qualitative (effect vs no effect) hypotheses towards quantitative ones, (ii) identify modeling techniques suited for prediction, (iii) assess the replicability of ecological patterns across space and time, (iv) estimate measurement errors, which put an upper limit on prediction accuracy, (v) extend the concept of power beyond its use in a Null Hypothesis Significance Testing framework (see Ben Bolker's take on this, page 217), and (vi) be explicit about the type of science done; for example, one study might aim at presenting a new and unexplained pattern, and building on this a subsequent study may investigate the mechanisms driving the pattern.
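Point (iv) can be illustrated with a small simulation of my own (the noise level is an arbitrary assumption): even a model that predicts the true state of the system perfectly cannot score better against noisy observations than the measurement-error standard deviation.

```python
import math
import random

random.seed(1)

SIGMA = 0.5   # assumed measurement-error standard deviation
n = 10_000

# True state of the system (an arbitrary smooth signal for illustration).
truth = [math.sin(i / 100) for i in range(n)]
# What we actually record: the truth contaminated by measurement error.
observed = [t + random.gauss(0, SIGMA) for t in truth]

# A "perfect" model predicts the true state exactly...
perfect_predictions = truth
# ...yet its RMSE against the *observations* cannot fall below SIGMA.
rmse = math.sqrt(sum((p - o) ** 2
                     for p, o in zip(perfect_predictions, observed)) / n)
print(f"RMSE of a perfect model: {rmse:.3f} (noise sd: {SIGMA})")
```

No amount of extra model complexity can push the score below that floor; only better measurements can.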

Some thoughts:

I like the concept of demonstrating understanding; it would force ecologists to go beyond chasing novelty and to spend more effort on confirming past results, re-using models set up by others, and testing old theories before proposing new ones … None of this is new, yet this paper brings it all together in the light of prediction. I am, however, a bit unsure as to how such studies should proceed. As outlined above, understanding and prediction call for different workflows (variable selection, choice of model algorithm, model selection …). So how should we build models if we want to demonstrate understanding through prediction? I could not find a clear way forward in the article, so I am left speculating how it could go.

The first important step for me would be to realize that maximizing absolute predictive accuracy should not be the goal when demonstrating understanding. Models with high predictive accuracy might be further away from the true model (the goal of understanding efforts) than models with lower predictive accuracy (see the Shmueli paper). Rather, predictive ability should be compared amongst models of similar complexity that remain tractable; otherwise one tends towards including many parameters and using complex algorithms that are not designed to give an understandable picture of the true model. Demonstrating understanding should therefore draw on the data modeling culture outlined by Breiman. Second, I would argue that the concept of demonstrating understanding outlined in the paper is a replicability issue using out-of-sample prediction as a metric: basically something like this article, but using prediction on new data rather than p-values or effect sizes as the metric.

Finally, measurement error and validity should be taken seriously. Big databases are continuously emerging, providing formidable tools to test the generality of our models, but one still needs to keep in mind that the data should be closely linked to the object under study. Vague concepts like biodiversity have many more or less correlated facets, making it likely that we end up using sub-optimal indicators for the processes under scrutiny (read this for more info).
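As a sketch of what "demonstrating understanding through out-of-sample prediction" could look like in practice (purely illustrative, with made-up variables and parameter values): fit a simple model on one batch of data, then confront it with data it has never seen and compare it against a naive baseline.

```python
import math
import random

random.seed(7)

# Hypothetical relationship: abundance responds linearly to temperature.
def simulate_site(n, slope=2.0, intercept=10.0, noise=1.0):
    temps = [random.uniform(5, 25) for _ in range(n)]
    abund = [intercept + slope * t + random.gauss(0, noise) for t in temps]
    return temps, abund

# Fit a simple linear model on data from one site/period...
x1, y1 = simulate_site(100)
n = len(x1)
mx, my = sum(x1) / n, sum(y1) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(x1, y1))
         / sum((x - mx) ** 2 for x in x1))
intercept = my - slope * mx

# ...then confront it with *new* data, never used for fitting.
x2, y2 = simulate_site(100)

def rmse(pred, obs):
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

rmse_model = rmse([intercept + slope * x for x in x2], y2)
rmse_baseline = rmse([my] * len(y2), y2)  # naive "predict the training mean"
print(f"out-of-sample RMSE: model {rmse_model:.2f} vs baseline {rmse_baseline:.2f}")
```

Beating the naive baseline on genuinely new data is the kind of "correct and risky prediction" the paper asks for; failing to beat it would send us back to acquiring new understanding.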

If you are captivated by this topic, make sure to read the series of blog posts and their comments on “Ecologists need to do a better job of prediction” by Brian McGill. Let's see how the ecological literature responds to these pleas to move towards a more predictive focus.

10 thoughts on “Prediction in ecology - implementing a priority”

  1. Hi Lionel, thanks for reading and writing about our paper – you gave a better summary than I could have. And you raise a great point about making ‘prediction’ operational. But I disagree that you can’t simply maximize absolute predictive accuracy in demonstrating understanding.

    You present an argument I’ve run into before in the machine learning community: that models that are more true can make worse predictions than models that are less true. That argument seems to be based on the idea that the variables in your model and the functional relationships among variables define the ‘trueness’ of the model, and that the parameter estimates don’t have anything to do with the truth of the model. But the only way (that I can see) that a model with the correct variables and the correct functional relationships can make worse predictions than a model that has missing/too many variables or incorrect functional relationships is that its parameter estimates are worse. That’s the tradeoff – for a given number of data points, the parameter estimates get less reliable as the model complexity goes up. In my opinion, parameter estimates also define the truth of a model, and if the parameter estimates are way off then the model is far from the truth.

    I understand why, in theory, parameter estimates are seen as less important than getting the variables and functional relationships right: if you have the variables and functional relationships right, then parameter estimates are just a sampling and measurement error problem, and if you have enough good data you’ll get good parameter estimates. But in practice, poor parameter estimates leave you just as far from the truth as missing/too many variables (in my opinion). All this to say that I think maximizing absolute predictive accuracy is the way to go.
    With one caveat: evidence of causality. Good predictions don’t necessarily demonstrate causality, and so evidence of causality + prediction = understanding.

    Best, Jeff Houlahan

    1. Hi Jeff, Thanks for commenting, it is nice to read your argument.
      I see your point about parameter estimates, but how, in real life, do you attribute model error (in terms of parameter bias) to data limitations versus wrong functional relationships or incorrect predictors? You do have a paragraph in your paper that addresses the measurement error issue in some way. Looking at the predictive limitations of a model, one can argue both “we should get more data” and “we should try new variables / new functional relationships” – how do we know which way to go?
      My main point concerning absolute predictive power is that I could build ensemble machine learning models with 5-way interactions and tens of complex predictors, and if I compared the predictive power of this model to that of a mechanistic one with a few parameters and simple predictors (abundance …), I would argue that the first one would get higher scores. Yet in terms of demonstrating that we understand what is going on, the first model gives us little information, while with the mechanistic model one could look at where the model is wrong and imagine ways to take this into account.
      Food for thoughts.

  2. Hi Lionel, it is food for thought but I think the way out is relatively clear (although lots of people may disagree).
    Let me start with your first point – the original point was that sometimes simpler models that don’t contain all causal variables make better predictions than more complex models that do contain all the causal variables. My point was that that difference in predictive ability comes down to poor parameter estimates in the more complex models, and that makes them ‘less’ true. So, my position is that I don’t buy the premise – I don’t think it’s possible for ‘truer’ models to make worse predictions. Now, whether the answer is more data to improve parameter estimates or including new variables or functional relationships is always unknown, but they certainly aren’t mutually exclusive. More data, all things being equal, is always a good idea but, of course, things are not always equal and sometimes the costs outweigh the benefits. And if there are good reasons for looking at other variables or functional relationships then they should be explored. You take your best shot and you test how successful you’ve been by looking at how much your predictions have improved.
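[The parameter-estimate argument above can be put in a small simulation; this sketch is mine, not from the paper, and all effect sizes are made up. With few data points, a model with the true structure but a weak extra driver can predict worse out of sample than a simpler misspecified model, purely because its extra parameter estimate is noisy.]

```python
import random

random.seed(3)

B1, B2 = 1.0, 0.1   # assumed true effects: one strong driver, one weak one
SIGMA, N_TRAIN, N_TEST, REPS = 1.0, 8, 50, 1000

def simulate(n):
    x1 = [random.gauss(0, 1) for _ in range(n)]
    x2 = [random.gauss(0, 1) for _ in range(n)]
    y = [B1 * a + B2 * b + random.gauss(0, SIGMA) for a, b in zip(x1, x2)]
    return x1, x2, y

def mse(pred, obs):
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)

err_simple = err_true = 0.0
for _ in range(REPS):
    x1, x2, y = simulate(N_TRAIN)
    # Simple (misspecified) model: regress y on x1 only, no intercept.
    b_simple = sum(a * v for a, v in zip(x1, y)) / sum(a * a for a in x1)
    # "True-structure" model: regress y on x1 and x2 (2x2 normal equations).
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * v for a, v in zip(x1, y))
    t2 = sum(b * v for b, v in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1_hat = (s22 * t1 - s12 * t2) / det
    b2_hat = (s11 * t2 - s12 * t1) / det
    # Evaluate both models on fresh data.
    tx1, tx2, ty = simulate(N_TEST)
    err_simple += mse([b_simple * a for a in tx1], ty) / REPS
    err_true += mse([b1_hat * a + b2_hat * b
                     for a, b in zip(tx1, tx2)], ty) / REPS

print(f"avg out-of-sample MSE: simple {err_simple:.3f}, true-structure {err_true:.3f}")
```

[On average the true-structure model predicts worse here: the variance it pays to estimate the weak effect from eight points exceeds the bias the simple model pays for omitting it, which is exactly the trade-off described in the comment above.]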
    I think in your second paragraph you are making the opposite point (although I may have misunderstood you) – that more complex models will make better predictions…but be less interpretable. I agree with you but not for the reason that you are implying. A ‘mechanistic’ model is still just a model that contains variables, functional relationships and parameter estimates. It has an underlying story that the model builders tell but it still has the sum of the three components (i.e. variables, functional relationships and parameter estimates) just as the machine learning model does. To me the big problem with complex machine learning models is not that they aren’t ‘mechanistic’, it’s that they are, to some extent, black boxes. In a nutshell, when we put together a simple model that predicts well, the model understands the world well and we understand our model but when we put together a very complex model that predicts well, it may be that the model understands the world very well but we don’t understand our model. That’s not a problem for prediction but it is for understanding. Our thesis is predicated on the assumption that we understand our models perfectly but that isn’t always the case with complex models and, in particular, with machine learning models like neural nets or random forests. This adds a dimension that we didn’t explore. Thanks for taking the time to think and write about this, Lionel – I’ve enjoyed the conversation. Best, Jeff H

  3. Jeff, fwiw, I had similar thoughts as Lionel when skimming your paper.

    I would say the fundamental problem is the following: “good predictions = good theory” works in a controlled environment, e.g. physics, with clean data and simple models.

    In ecology, however, there is likely no “true” theory. All models that we could bring forward will be seriously wrong and less complex than reality. In this context, things may work a bit differently.

    ML has amazed classical statistics by showing that it is possible to create highly predictive models in such a situation with structures that are clearly not causal and offer very little “understanding”.

    About “understanding”: I would say that in the widest sense we are speaking about causal modelling, i.e. approaches that try to detect causal structures in big datasets. The criteria for causality, however, are usually not predictions. The typical approach is more in the direction of what you mention above, e.g. asking: are the parameter estimates for this correlation stable? If not, there must be some other causal predictor, etc.

  4. Hi Florian, I disagree with every point you make – that probably means you’re on exactly the right track. But I’ll still try and make my case.

    Complex reality and messy data aren’t relevant to the fundamental assertion that prediction is the only way to demonstrate understanding. I don’t believe our assertion can be refuted by stating that “it’s hard to make good predictions because the world is complex and the data are messy”. It has to be refuted by identifying alternatives to prediction for demonstrating understanding. Complexity and messy data make it harder to make accurate predictions but that just means that complexity and messy data make it harder for us to understand the world. I agree with that. So, I think the statement that prediction is the only way to demonstrate understanding is true in all systems, it’s just that some systems are easier to understand than others.

    I don’t agree that there is no ‘true’ theory in ecology. How is it possible that ‘truth’ exists in physics but not in ecology? When I hear people say this I think what they really mean is that ecology is way more complex than physics (i.e. the true number of drivers in ecology is extremely large, the functional relationships are non-linear and involve many interactions and it would take an impractically large amount of data to precisely measure all parameters) and so identifying the true model is beyond our current capabilities. That is much different than saying that a true model doesn’t exist. The statement “All models that we could bring forward will be seriously wrong and less complex than reality” may be currently true but it speaks only to our capabilities, not to whether there is a true model. There may be practical reasons to target simple but “untrue” models but generally our goal should be to try and get closer and closer to the true model.

    How is it that ML models make good predictions but aren’t causal? The only way I can see that could happen is if a ‘true’ driver is replaced by a variable that is tightly correlated with the ‘true’ driver. That’s where experimentation should be used – to confirm or deny causality. And if an ML model makes good predictions and the input variables have all been demonstrated to be causal, how can we say there is no understanding?
    Where I agree that ML complicates matters is that the position of our paper assumes that you understand your model perfectly. That’s true of most traditional approaches like regression (that we understand the model perfectly), but for something like a multi-layer neural net the model is very difficult for human minds to grasp – it’s, if not a black box, then at least a grey box. In the extreme example, we could have a model that understands the world perfectly (i.e. makes perfect predictions and all drivers have been demonstrated to be causal) but that we don’t understand at all. That’s a problem that we will have to confront.

    I guess I was wrong to say I disagree with everything, because I partially agree with your last paragraph – that we need to demonstrate causality and causality would not be demonstrated by making good predictions. Causality would usually be demonstrated using experiments. So, I see prediction and causality as necessary and sufficient for demonstrating understanding but, in most cases, the approaches would be different. Causality is likely to require controlled randomised experiments (although some kind of structural equation modelling approach might integrate things), and prediction will likely require modelling. But ultimately, predictive ability will demonstrate understanding – if you find that variable A is a causal driver of variable B but variable A makes poor predictions of variable B, then you have poor understanding.

    So, while I think your take on prediction is informed and considered, I just don’t agree with it. That may be a little too diplomatic – I’m pretty sure it’s wrong.


    1. Thanks Florian and Jeff for continuing the discussion.

      On the complexity inherent in ecological data, I really like Vellend's approach in his new book, recommending that we look at higher-level processes like selection or drift instead of low-level processes like intraspecific competition or predation, where everything affects everything and is context-dependent. I would argue that this is the way to go; it would make the ecological world a less complex and messy place, enabling us to develop better theories/models/predictions.

      Machine learning models make better predictions because they are way more complex than traditional statistical or mechanistic models. They were also developed precisely to make good predictions; every aspect of these algorithms is optimized for predictive performance, while statistical models were built on probability distributions, randomness, likelihood … This makes these algorithms “naturally” better than other approaches regarding predictive power. “And if an ML model makes good predictions and the input variables have all been demonstrated to be causal, how can we say there is no understanding?” – this is the critical point; I would argue on the contrary that in such an exercise one does not derive any understanding. If you fit such ML models and look at the response curves (predicted values vs predictor), in most cases you will find super complex relationships that you would have a hard time linking to understanding. More generally, I doubt that we will ever come to a situation in ecology where all the input variables that we can think of and have actually measured have been demonstrated to be causal within the studied systems.

      “But ultimately, predictive ability will demonstrate understanding – if you find that variable A is a causal driver of variable B but variable A makes poor predictions of variable B then you have poor understanding” – I totally agree with you on this. I guess the only point I am still struggling with is how to actually do it. So we agree on the goal but might still have different opinions on how to reach it.


    2. Jeff,

      without getting too philosophical about the meaning of truth, let’s assume that things in ecology follow rules and we can understand them.

      The issue I have is: how do we do that if we have a massively complex system composed of many parts?

      You mention one thing: “experimentation should be used” – this is the classical idea of reductionism, and I guess we don’t have to argue about that one. If we can reduce the problem to a simple experiment, we should do so. But then we also don’t need a new philosophy about predictions.

      The question I thought you were after is: how can we reliably “identify” whether a complex model with many processes and regulating factors is “correct / useful”? My point here is that we will likely never include all real processes in a mechanistic model (there are too many), so we have to be careful about what we mean by wrong. A model is not necessarily wrong if it neglects a particular factor but has the right causal structure otherwise.

      So, my point is: if we compare models via a prediction contest under cross-validation, we might set up mechanistic models to fail against ML approaches. We have to look in more detail at WHAT is predicted correctly. The WHAT need not be physical model outputs; it could be correlations between variables, dependency structures, any kind of pattern. That is what modelers have done for a long time; unfortunately it’s not such a trivial task.
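[A minimal version of such a prediction contest under cross-validation might look like the sketch below; it is illustrative only, the candidate models and data are made up, and a real contest would of course also score the richer "WHAT" mentioned above, not just point predictions.]

```python
import math
import random

random.seed(11)

def kfold_rmse(fit, predict, xs, ys, k=5):
    """Mean held-out RMSE of a model under k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_errs = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sq = [(predict(model, xs[i]) - ys[i]) ** 2 for i in fold]
        fold_errs.append(math.sqrt(sum(sq) / len(sq)))
    return sum(fold_errs) / k

# Candidate 1: intercept-only model ("no effect of x").
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def pred_mean(model, x):
    return model

# Candidate 2: simple linear model.
def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return slope, my - slope * mx

def pred_linear(model, x):
    return model[0] * x + model[1]

# Made-up data with a real (linear) effect.
xs = [random.uniform(0, 10) for _ in range(60)]
ys = [1.5 * x + random.gauss(0, 1) for x in xs]

cv_flat = kfold_rmse(fit_mean, pred_mean, xs, ys)
cv_linear = kfold_rmse(fit_linear, pred_linear, xs, ys)
print(f"CV RMSE: intercept-only {cv_flat:.2f}, linear {cv_linear:.2f}")
```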
