Predicting Plant Species with Machine Learning
As part of the GeoLifeCLEF 2024 competition on Kaggle, Darren and I built a system for predicting which plant species are present at a given location in Europe, using satellite imagery, climate data, soil characteristics, and other environmental variables as inputs. The full paper is available here.
The Problem
Given a location, predict which of 11,255 plant species are present there. The output is multi-label (any number of species can be present at once), which makes this considerably harder than a typical classification task. The training data included around five million plant occurrence records from across Europe, though these came with the usual caveats of opportunistically collected data: just because a species isn't recorded somewhere doesn't mean it isn't there.
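To make the multi-label framing concrete, here's a minimal sketch of how occurrence records become a binary presence matrix, one row per survey and one column per species. The record values and sizes are toy placeholders, not the competition data:

```python
import numpy as np

# Hypothetical toy occurrence records: (survey_id, species_id) pairs.
records = [(0, 3), (0, 7), (1, 3), (2, 1), (2, 7), (2, 9)]

n_surveys, n_species = 3, 10  # toy sizes; the real task has 11,255 species

# Multi-label target: a binary matrix rather than a single class per row.
Y = np.zeros((n_surveys, n_species), dtype=np.int8)
for survey, species in records:
    Y[survey, species] = 1

# Any number of species can be present at a single survey location.
species_per_survey = Y.sum(axis=1)  # here: [2, 1, 3]
```

The absence caveat shows up here too: a zero in this matrix means "not recorded", not "confirmed absent".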
What We Did
The approach ended up being an ensemble of two main models: an XGBoost regression model trained on the tabular environmental data, and a multi-modal neural network that combined ResNet18 for the environmental time series with a Swin Transformer for the satellite images. Neither model alone was particularly remarkable, but combining them gave a meaningful improvement in the F1 score.
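The ensembling itself was simple: average the per-species probabilities from the two models before thresholding. A sketch, with random arrays standing in for the XGBoost and multi-modal network outputs (the real pipeline's weighting and thresholds may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for each model's per-species probabilities (surveys x species).
p_xgb = rng.random((4, 10))   # tabular XGBoost model
p_nn = rng.random((4, 10))    # multi-modal ResNet18 + Swin model

# Unweighted average of the two models' outputs.
p_ensemble = (p_xgb + p_nn) / 2

# Downstream, the averaged probabilities are converted to a species set
# per survey (e.g. by taking the top-K highest-probability species).
predicted = p_ensemble > 0.5
```

Averaging works here because the two models make partly uncorrelated errors: the tabular model and the image/time-series model see different views of each location.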
One thing that helped more than expected was running the same multi-modal model multiple times with different random seeds and averaging the outputs. Each additional instance improved performance, though with diminishing returns. Six seeds gave our best result. It’s not the most elegant solution, but it does work quite well.
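The diminishing-returns behaviour is what you'd expect from averaging independently noisy predictors. A toy simulation, with Gaussian noise standing in for seed-to-seed training variance (the noise model is an assumption, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = rng.random(1000)  # hypothetical "true" per-species probabilities

# Each seed's model output = truth + independent noise, a stand-in for
# retraining the same multi-modal model with a different random seed.
seed_outputs = [true_p + rng.normal(0, 0.1, size=true_p.shape)
                for _ in range(6)]

# Mean absolute error of the k-seed average: falls roughly like 1/sqrt(k),
# so each extra seed helps, but by less than the one before it.
errors = [np.abs(np.mean(seed_outputs[:k], axis=0) - true_p).mean()
          for k in (1, 3, 6)]
```

Under this noise model the six-seed average is noticeably better than a single run, but most of the gain comes from the first few seeds, matching what we saw in practice.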
We also needed to predict not just which species are present, but how many. A fixed top-K approach worked as a baseline, but training a separate XGBoost model to predict the count per survey, and then nudging that count upward by a few to account for false positives, consistently outperformed it. The micro-averaged F1 metric rewards finding more true positives even at the cost of some false positives, which explains why predicting slightly more species than the model strictly suggests tends to help.
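The selection step can be sketched as follows. The per-survey counts are hard-coded here for illustration; in the actual pipeline they came from a separate XGBoost regressor, and the exact offset value is a tuning choice:

```python
import numpy as np

rng = np.random.default_rng(1)
probs = rng.random((3, 10))  # ensemble probabilities, surveys x species

# Predicted species count per survey (placeholder values; in the pipeline
# these come from an XGBoost model trained on survey features).
predicted_counts = np.array([2, 4, 1])
offset = 2  # nudge upward: micro-F1 rewards extra true positives

def top_k_species(p_row, k):
    """Indices of the k highest-probability species for one survey."""
    return set(np.argsort(p_row)[::-1][:k])

predictions = [top_k_species(probs[i], predicted_counts[i] + offset)
               for i in range(len(probs))]
```

With micro-averaged F1, a marginal extra guess that turns out correct adds a true positive across the whole pool, while a wrong one adds only a single false positive, so over-predicting by a small fixed offset is a reasonable bet.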
Results
Our best submission achieved an F1 score of 0.355, up from a baseline of around 0.316. The gains came incrementally: ensembling the two model types added about 0.02, the dynamic count prediction added another 0.01, and the multi-seed averaging added a further 0.01. None of it was dramatic, but it all added up.
Reflections
The main constraint throughout was compute. Training the multi-modal model once takes a while; training it six times costs six times as much. There's almost certainly a smarter path to the same result through better hyperparameter tuning or a stronger single model, rather than brute-forcing it with repeated runs.
The plant occurrence data was also anonymized to protect rare species, which made it harder to understand where the models were struggling. That’s a reasonable tradeoff for conservation purposes, but it does make debugging somewhat opaque.
If you're interested in the full details (data preprocessing, hyperparameters, the extended results tables), the paper has all of it.