cloudberries

species distribution modelling (again)

[tags] machine learning publication
[date]
2025.09.26

Using Opportunistic Plant Sightings to Improve Ecological Models

Also as part of CLEF 2025, Darren and I returned to the GeoLifeCLEF competition, following up on our previous paper. This time we led with a different research question. Rather than chasing the highest possible score through ensemble complexity, we wanted to test something more specific: can you use the abundance of messy, opportunistically-collected species sightings to improve a model that’s ultimately trained on the smaller, more reliable survey data? The full paper is available here.

The Problem

There are two kinds of plant occurrence data available at scale in Europe. Presence-Absence (PA) data comes from systematic surveys where observers record both what they found and what wasn’t there. It’s reliable, but there isn’t much of it and it doesn’t cover everywhere evenly. Presence-Only (PO) data comes from sources like citizen science platforms and museum collections, comprising millions of records, but with no absence information and significant geographic bias toward places people frequently go to look at plants.

The question is whether the PO data, despite its limitations, can still be useful. The approach we took was simple: pre-train a model on PO data first, then fine-tune it on PA data, and see if that beats training on PA data alone.

What We Did

Three ResNet18 models were trained, one each on Sentinel-2 satellite imagery, Landsat time series data, and bioclimatic cubes. Each was pre-trained for 10 epochs on a subset of the PO data, then fine-tuned for a further 10 epochs on the PA training data. A matching set of baseline models was trained for 10 epochs on PA data only, starting from random weight initialisation.

Before getting into modelling, it was worth checking whether the spatial distribution of the training and test data actually overlapped meaningfully. They didn’t, particularly. The Jensen-Shannon Divergence between the PA training and test sets was 0.59, which is a fairly substantial mismatch. Switzerland is a good illustration of this: the PA test data covers it broadly, the PA training data barely touches it, but the PO data covers it reasonably well. That’s the gap the pre-training is meant to help bridge.
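For concreteness, one way to compute a spatial JSD like this is to bin both sets of coordinates into a shared grid and compare the occupancy histograms. The grid resolution below is an assumption for illustration, not the binning used in the paper.

```python
import numpy as np

def js_divergence(p, q, base=2):
    # Jensen-Shannon divergence between two discrete distributions;
    # lies in [0, 1] when base=2 (0 = identical, 1 = disjoint support).
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log(0) terms contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) / np.log(base)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def spatial_jsd(coords_a, coords_b, bins=20):
    # Bin both (lon, lat) point sets on a shared grid, then compare
    # the flattened occupancy histograms.
    pts = np.vstack([coords_a, coords_b])
    edges = [np.linspace(pts[:, i].min(), pts[:, i].max(), bins + 1)
             for i in range(2)]
    h_a, _, _ = np.histogram2d(coords_a[:, 0], coords_a[:, 1], bins=edges)
    h_b, _, _ = np.histogram2d(coords_b[:, 0], coords_b[:, 1], bins=edges)
    return js_divergence(h_a.ravel(), h_b.ravel())
```

On this scale, 0.59 is well past the halfway mark, which is what makes the train/test mismatch substantial.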

Results

Pre-training on PO data improved all three models by around 7% in samples-averaged F1-score over their PA-only equivalents. The Landsat model was the strongest overall, reaching an F1 of 0.178 with PO pre-training compared to roughly 0.166 without it.
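"Samples-averaged" here means F1 is computed per survey plot and then averaged across plots, which is what scikit-learn's `average="samples"` option does. The toy labels below are made up to show the mechanics.

```python
import numpy as np
from sklearn.metrics import f1_score

# Two survey plots, multi-hot over four hypothetical species (1 = present).
y_true = np.array([[1, 1, 0, 0],
                   [0, 1, 1, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 1]])

# Plot 1: precision 1, recall 1/2  -> F1 = 2/3
# Plot 2: precision 2/3, recall 1  -> F1 = 0.8
# Samples average: (2/3 + 0.8) / 2 ≈ 0.733
score = f1_score(y_true, y_pred, average="samples")
```

Because the average is over plots rather than over species, a model is rewarded for getting each individual site's species list roughly right, not just for nailing the common species globally.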

Training on PO data alone was not sufficient to produce a useful model: the Bioclimatic and Sentinel models pre-trained on PO data without any fine-tuning didn't even beat a naive baseline of always predicting the 25 most common species. The improvement only materialised once the PA fine-tuning stage was added. That's about what you'd expect: PO data can teach a model something about what environments look like, but without absence labels it can't properly learn what it means for a species not to be there.
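That naive baseline is easy to state precisely: rank species by training-set frequency and predict the same top 25 everywhere. A sketch with made-up data (the plot and species counts are illustrative):

```python
import numpy as np

# Hypothetical multi-hot training labels: rows = survey plots,
# columns = species (1 = recorded present).
rng = np.random.default_rng(42)
n_plots, n_species = 200, 500
train_y = (rng.random((n_plots, n_species)) < 0.05).astype(int)

K = 25
prevalence = train_y.sum(axis=0)               # how often each species occurs
top_k = np.argsort(prevalence)[-K:]            # indices of the K commonest species
baseline_pred = np.zeros(n_species, dtype=int)
baseline_pred[top_k] = 1                       # identical prediction for every test plot
```

A model that can't beat this is effectively ignoring its inputs, which is why it's a useful floor for sanity-checking the PO-only runs.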

One practical note: PO pre-training took roughly 18 times longer per epoch than PA training, simply because there’s so much more of it. That’s a real cost, though it remained within the reach of a single consumer GPU throughout.

Reflections

The 7% improvement is encouraging but not dramatic, and there's a fair amount left unexplored. Hyperparameter tuning was minimal, the pre-training strategy was straightforward, and we didn't use the full PO dataset. It's a proof of concept more than an optimised pipeline. The more interesting claim is that PO data helps even when the spatial distributions of PO and PA data don't perfectly align, which they rarely will in practice, though the mismatch will usually be less extreme than it was in the data here.