EY Data & AI Challenge

The Problem

The EY Open Science Data Challenge is one of the largest annual data competitions in the world, drawing thousands of participants from 146 countries. The 2026 edition, Optimizing Clean Water Supply, focused on forecasting river water quality across South Africa using geospatial and remote-sensing datasets.

In the challenge, we were given ground-truth measurements at a set of geographic sampling sites and asked to predict three water quality targets:

Target	What it measures
Total Alkalinity	Buffering capacity of water against acidification
Electrical Conductance	Proxy for dissolved ion concentration / salinity
Dissolved Reactive Phosphorus	Bioavailable phosphorus — key driver of algal blooms

The core training data spans 2011–2015 and covers roughly 200 river sampling locations in South Africa. The critical constraint: the validation set covered entirely different geographic regions from the training set. A model that simply memorised local patterns would fail; it had to learn generalisable relationships between environmental signals and water chemistry.
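
One way to reproduce this constraint in local validation is to hold out whole regions at a time. The sketch below is an assumption about how that could be done, not a description of the pipeline used here: it uses scikit-learn's GroupKFold on synthetic data, and the `region` labels are hypothetical (in practice they might come from catchment IDs or coordinate clusters).

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Synthetic stand-in data: 12 samples spread over 4 hypothetical regions.
X = rng.normal(size=(12, 3))                  # placeholder features
y = rng.normal(size=12)                       # placeholder target
region = np.repeat(["A", "B", "C", "D"], 3)   # hypothetical region label per sample

# Each fold holds out whole regions, so no region appears in both
# the training and the test indices of the same fold.
splitter = GroupKFold(n_splits=4)
folds = list(splitter.split(X, y, groups=region))

for train_idx, test_idx in folds:
    held_out = set(region[test_idx])
    seen = set(region[train_idx])
    print(sorted(held_out), "held out; overlap with training:", held_out & seen)
```

Scores from a split like this tend to be lower than those from a random split, but they are closer to what an unseen-region test set measures.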

EY provided a starter notebook with a simple linear baseline. It scored poorly, and improving on it was the real starting point.

Technical Approach

I built my own pipeline around three main steps: pulling in extra environmental data, engineering features from it, and training separate models for each target. The main idea was to move beyond the starter notebook and give the model more context about the water, climate, and terrain around each sampling location.

1 — Data Acquisition

The EY baseline used only the core dataset. To improve generalisation, three additional sources were pulled in via their respective APIs and aligned to each sampling point's coordinates:

Landsat — optical surface reflectance bands (visible, NIR, SWIR, thermal) and derived water indices providing rich spectral information
TerraClimate — monthly climate rasters (precipitation, temperature, evapotranspiration, soil moisture) capturing the catchment-level hydrological context that drives alkalinity and conductance
Copernicus DEM — 30 m elevation grid used to derive slope and upstream catchment area, both strong proxies for mineral weathering (alkalinity) and runoff ionic load (conductance)
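
Aligning gridded data to a sampling point comes down to finding the grid cell that contains or is nearest to the point. The sketch below shows that step with plain NumPy on a toy raster; the function name `sample_nearest`, the grid layout, and the coordinates are all hypothetical, and a real pipeline would more likely rely on a raster library to handle projections and tiling.

```python
import numpy as np

def sample_nearest(grid, lats, lons, site_lat, site_lon):
    """Return the value of the cell whose centre is nearest to each site.

    grid is a 2-D array shaped (len(lats), len(lons)); lats and lons hold
    the cell-centre coordinates of a regular grid (hypothetical layout).
    """
    site_lat = np.atleast_1d(site_lat)
    site_lon = np.atleast_1d(site_lon)
    # For every site, pick the row/column with the smallest coordinate gap.
    row = np.abs(lats[:, None] - site_lat[None, :]).argmin(axis=0)
    col = np.abs(lons[:, None] - site_lon[None, :]).argmin(axis=0)
    return grid[row, col]

# Toy 3x4 raster standing in for, e.g., one month of precipitation.
lats = np.array([-25.0, -26.0, -27.0])
lons = np.array([27.0, 28.0, 29.0, 30.0])
grid = np.arange(12, dtype=float).reshape(3, 4)

# Two hypothetical sampling sites.
site_lat = np.array([-25.9, -27.2])
site_lon = np.array([28.1, 29.6])

values = sample_nearest(grid, lats, lons, site_lat, site_lon)
print(values)  # [ 5. 11.]
```

Nearest-cell sampling is the simplest choice; averaging over a buffer around each site is a common alternative when single pixels are noisy.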

2 — Index & Feature Engineering

Raw bands were combined into physically motivated indices. Each target guided which indices to prioritise:

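# Inputs: per-pixel surface reflectance for each Landsat band (e.g. NumPy arrays)
eps = 1e-9  # guards against division by zero

# Water / sediment indices, relevant for Alkalinity & Conductance
ndwi = (green - nir) / (green + nir + eps)    # water extent
ndti = (red - green) / (red + green + eps)    # turbidity, suspended solids
sabi = (nir - red) / (blue + green + eps)     # surface algal bloom

# Phosphorus proxies: DRP fuels phytoplankton, shifts red/NIR balance
# 0.3 stands in for the wavelength ratio (λ_nir - λ_red) / (λ_swir - λ_red),
# whose exact value depends on the sensor's band centres
fai = nir - (red + (swir - red) * 0.3)        # floating algae index
rednir = red / (nir + eps)                    # red-to-NIR ratio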
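
As a usage illustration, the same formulas can be wrapped in a function and applied to whole arrays of pixels at once. This is a minimal sketch: the function name `spectral_indices` and the reflectance values are made up for the example and are not taken from the challenge data.

```python
import numpy as np

EPS = 1e-9  # same division-by-zero guard as above

def spectral_indices(blue, green, red, nir, swir):
    """Compute the five indices from per-band reflectance arrays."""
    return {
        "ndwi": (green - nir) / (green + nir + EPS),
        "ndti": (red - green) / (red + green + EPS),
        "sabi": (nir - red) / (blue + green + EPS),
        "fai": nir - (red + (swir - red) * 0.3),
        "rednir": red / (nir + EPS),
    }

# Made-up reflectances for two pixels: open water, then a bright-NIR surface.
bands = {
    "blue": np.array([0.06, 0.04]),
    "green": np.array([0.08, 0.07]),
    "red": np.array([0.05, 0.05]),
    "nir": np.array([0.02, 0.30]),
    "swir": np.array([0.01, 0.15]),
}

features = spectral_indices(**bands)
for name, value in features.items():
    print(name, np.round(value, 3))
```

With these inputs NDWI is positive for the water pixel and negative for the bright-NIR pixel, which is the behaviour the index is designed to show.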

3 — Model Selection

Separate models were trained per target after benchmarking several algorithms:

Random Forest → Dissolved Reactive Phosphorus

DRP is the noisiest target, driven by localised algal dynamics, short-term weather events, and non-linear threshold effects (blooms can spike suddenly). Random Forest averages many deep, independently grown trees, which damps this variance. It is also less likely to overfit the sparse, positively skewed DRP distribution than a boosted model, whose successive trees fit the remaining residuals and so tend to chase outliers.

XGBoost → Total Alkalinity & Electrical Conductance

Alkalinity and conductance are more smoothly determined by geology and catchment ion load, and these relationships are largely monotonic and well-structured. XGBoost's gradient boosting captures these interactions efficiently, benefits more from the terrain and climate features (which are continuous and ordered), and converged to lower validation RMSE than Random Forest on both targets.
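
The per-target pairing can be sketched as a dictionary with one estimator per target. Everything below is illustrative: the data are synthetic, the hyperparameters are library defaults rather than the tuned values, and if `xgboost` is not installed the code falls back to scikit-learn's `GradientBoostingRegressor`, which is a stand-in and not XGBoost itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

try:
    from xgboost import XGBRegressor as Booster  # used when xgboost is available
except ImportError:
    # Stand-in only: scikit-learn gradient boosting, not XGBoost.
    from sklearn.ensemble import GradientBoostingRegressor as Booster

rng = np.random.default_rng(42)

# Synthetic placeholder data: 200 sites, 6 features, 3 targets.
X = rng.normal(size=(200, 6))
targets = {
    "alkalinity": 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200),
    "conductance": X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=200),
    # Spiky, mostly-zero signal to mimic a skewed target.
    "drp": np.abs(X[:, 3]) * (X[:, 4] > 1) + rng.normal(scale=0.1, size=200),
}

# One estimator per target, mirroring the pairing described above.
models = {
    "alkalinity": Booster(),
    "conductance": Booster(),
    "drp": RandomForestRegressor(n_estimators=200, random_state=0),
}

for name, model in models.items():
    model.fit(X, targets[name])

predictions = {name: model.predict(X) for name, model in models.items()}
```

Keeping the models in a dictionary keyed by target makes it straightforward to swap one estimator without touching the others while benchmarking.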

Results & Impact

Metric	Value
Final score (avg. R²)	0.4079
Placement	70th out of 945 teams (3 000+ participants)
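
For reference, an averaged R² can be computed as below. This sketch assumes the score is the unweighted mean of the per-target R² values, which is an interpretation of "avg. R²" rather than the official scoring code; the numbers are made up so the arithmetic is easy to check.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mean_r2(truth, preds):
    """Unweighted mean of per-target R² (assumed scoring rule)."""
    scores = {name: r2(truth[name], preds[name]) for name in truth}
    return sum(scores.values()) / len(scores), scores

# Made-up values for illustration only.
truth = {
    "alkalinity": [1.0, 2.0, 3.0, 4.0],
    "conductance": [10.0, 20.0, 30.0, 40.0],
    "drp": [0.1, 0.2, 0.3, 0.4],
}
preds = {
    "alkalinity": [1.0, 2.0, 3.0, 4.0],       # perfect fit, R² = 1
    "conductance": [25.0, 25.0, 25.0, 25.0],  # always the mean, R² = 0
    "drp": [0.1, 0.2, 0.3, 0.4],              # perfect fit, R² = 1
}

score, per_target = mean_r2(truth, preds)
print(per_target)
print(round(score, 4))  # 0.6667
```

Because R² is 0 for a model that always predicts the mean and can go negative for worse ones, a single weak target pulls the average down noticeably.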

Process

The baseline's main weakness was too little context per sampling point, so most of the work went into adding the right external signals and then iterating. Validation RMSE and leaderboard feedback guided which features and models to keep, rather than betting on one model up front.