## The Problem
The EY Open Science Data Challenge is one of the largest annual data competitions in the world, drawing thousands of participants across 146 countries. The 2026 edition — Optimizing Clean Water Supply — focused on forecasting river water quality across South Africa using geospatial and remote-sensing datasets.
Participants were given ground-truth measurements at a set of geographic sampling sites and asked to predict three water quality targets:
| Target | What it measures |
|---|---|
| Total Alkalinity | Buffering capacity of water against acidification |
| Electrical Conductance | Proxy for dissolved ion concentration / salinity |
| Dissolved Reactive Phosphorus | Bioavailable phosphorus — key driver of algal blooms |
The core training data spans 2011–2015 and covers roughly 200 river sampling locations in South Africa. The critical constraint: the validation set contained entirely different geographic regions from the training set. A model that simply memorised local patterns would fail — it had to learn generalisable relationships between environmental signals and water chemistry.
EY provided a starter notebook with a simple linear baseline. It scored poorly; improving on it was the real starting point.
## Technical Approach
The pipeline had four stages: data acquisition, index & feature engineering, spatial aggregation, and model training per target.
### 1 — Data Acquisition
The EY baseline used only the core dataset. To improve generalisation, three additional sources were pulled in via their respective APIs and aligned to each sampling point's coordinates:
- Landsat — optical surface reflectance bands (visible, NIR, SWIR, thermal) and derived water indices providing rich spectral information
- TerraClimate — monthly climate rasters (precipitation, temperature, evapotranspiration, soil moisture) capturing the catchment-level hydrological context that drives alkalinity and conductance
- Copernicus DEM — 30 m elevation grid used to derive slope and upstream catchment area, both strong proxies for mineral weathering (alkalinity) and runoff ionic load (conductance)
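A minimal sketch of the alignment step: each raster value is looked up at the grid cell nearest to a sampling point's coordinates. The grid, axes, and point values below are toy data, not the actual challenge datasets:

```python
import numpy as np

def sample_raster(grid, lat_axis, lon_axis, points):
    """Nearest-cell lookup: pull one raster value per (lat, lon) sampling point."""
    lat_idx = np.abs(lat_axis[:, None] - points[:, 0]).argmin(axis=0)
    lon_idx = np.abs(lon_axis[:, None] - points[:, 1]).argmin(axis=0)
    return grid[lat_idx, lon_idx]

# Toy 4x4 "elevation" grid over a small lat/lon window
lat_axis = np.array([-30.0, -30.1, -30.2, -30.3])
lon_axis = np.array([25.0, 25.1, 25.2, 25.3])
grid = np.arange(16, dtype=float).reshape(4, 4)

points = np.array([[-30.04, 25.24], [-30.28, 25.02]])
print(sample_raster(grid, lat_axis, lon_axis, points))  # one value per sampling point
```

The same lookup pattern applies to any of the three raster sources; in practice a library such as rasterio or xarray would handle the coordinate transforms.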
### 2 — Index & Feature Engineering
Raw bands were combined into physically motivated indices. Each target guided which indices to prioritise:
```python
eps = 1e-9  # avoids division by zero; band arrays come from Landsat surface reflectance

# Water / sediment indices — relevant for Alkalinity & Conductance
ndwi = (green - nir) / (green + nir + eps)    # water extent
ndti = (red - green) / (red + green + eps)    # turbidity, suspended solids
sabi = (nir - red) / (blue + green + eps)     # surface algal bloom

# Phosphorus proxies — DRP fuels phytoplankton, shifts red/NIR balance
fai = nir - (red + (swir - red) * 0.3)        # floating algae index
rednir = red / (nir + eps)                    # red-to-NIR ratio
```
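As a quick sanity check, NDWI behaves as expected on synthetic reflectance values (the numbers below are illustrative, not real Landsat measurements):

```python
import numpy as np

eps = 1e-9
# Synthetic reflectance for two pixels: open water first, bare soil second
green = np.array([0.06, 0.12])
nir   = np.array([0.02, 0.30])

ndwi = (green - nir) / (green + nir + eps)
print(ndwi)  # positive over water, negative over land
```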
### 3 — Spatial Aggregation
Features were then matched by latitude, longitude, and sampling date, so that corresponding values stay on the same row.
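That join can be sketched with pandas; the column names here are hypothetical stand-ins for whatever the challenge files call these fields:

```python
import pandas as pd

# Ground-truth samples and one external feature table, keyed on the same triple
samples = pd.DataFrame({
    "latitude": [-30.04, -30.28],
    "longitude": [25.24, 25.02],
    "date": ["2013-04-01", "2013-04-01"],
    "total_alkalinity": [120.0, 85.0],
})
climate = pd.DataFrame({
    "latitude": [-30.04, -30.28],
    "longitude": [25.24, 25.02],
    "date": ["2013-04-01", "2013-04-01"],
    "precip_mm": [42.0, 11.0],
})

# Left join keeps every ground-truth row even if a feature source has gaps
merged = samples.merge(climate, on=["latitude", "longitude", "date"], how="left")
print(merged[["total_alkalinity", "precip_mm"]])
```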
### 4 — Model Selection
Separate models were trained per target after benchmarking several algorithms:
**Random Forest → Dissolved Reactive Phosphorus.** DRP is the noisiest target — driven by localised algal dynamics, short-term weather events, and non-linear threshold effects (blooms can spike suddenly). Random Forest's ensemble of deep, independent trees handles this variance well and is less likely to overfit the sparse, positively skewed DRP distribution than a boosted model, which would chase outliers aggressively.
**XGBoost → Total Alkalinity & Electrical Conductance.** Alkalinity and conductance are more smoothly determined by geology and catchment ion load — relationships that are monotonic and well-structured. XGBoost's gradient boosting captures these interactions efficiently, benefits more from the terrain and climate features (which have ordinal structure), and converged to lower validation RMSE than Random Forest on both targets.
## Results & Impact
| Metric | Value |
|---|---|
| Final score (avg. R²) | 0.4079 |
| Placement | 70th out of 945 teams (3 000+ participants) |
| Scored targets | 3 (Total Alkalinity, Electrical Conductance, Dissolved Reactive Phosphorus) |
## Methodology
The project followed a structured four-stage process, typical of applied data science consulting:
- Problem Scoping — Defined the prediction targets, understood the spatial generalisation constraint, and identified why the EY baseline underperformed (it relied on a single data source with no terrain or climate context).
- Data Acquisition & Integration — Extended the pipeline with three external datasets (Landsat, TerraClimate, Copernicus DEM), each chosen because it adds a physically distinct signal. Data was aligned by coordinates and sampling dates.
- Feature Engineering & Modelling — Built domain-informed spectral indices and trained per-target models. Model selection was guided by the structure of each target: gradient boosting for smooth geochemical relationships, random forest for noisy ecological signals.
- Validation & Delivery — Used spatial cross-validation to simulate the out-of-sample submission constraint. Predictions were submitted to the EY leaderboard as the final deliverable.
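The spatial cross-validation idea (holding out whole regions rather than random rows) can be sketched with scikit-learn's `GroupKFold`; the region labels below are illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Each sampling site belongs to a region; holding out whole regions
# mimics the leaderboard's unseen-geography constraint.
regions = np.array(["A", "A", "B", "B", "C", "C"])
X = np.arange(12, dtype=float).reshape(6, 2)
y = np.arange(6, dtype=float)

cv = GroupKFold(n_splits=3)
for train_idx, test_idx in cv.split(X, y, groups=regions):
    # No region appears on both sides of a split
    assert set(regions[test_idx]).isdisjoint(regions[train_idx])
```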