Ondřej Kutil


EY Data & AI Challenge

Water quality prediction using satellite data and machine learning models.

Python · scikit-learn
github ↗

The Problem

The EY Open Science Data Challenge is one of the largest annual data competitions in the world, drawing thousands of participants from 146 countries. The 2026 edition, Optimizing Clean Water Supply, focused on forecasting river water quality across South Africa using geospatial and remote-sensing datasets.

In the challenge, we were given ground-truth measurements at a set of geographic sampling sites and asked to predict three water quality targets:

Target                          What it measures
Total Alkalinity                Buffering capacity of water against acidification
Electrical Conductance          Proxy for dissolved ion concentration / salinity
Dissolved Reactive Phosphorus   Bioavailable phosphorus; key driver of algal blooms

The core training data spans 2011–2015 and covers roughly 200 river sampling locations in South Africa. The critical constraint was that the validation set covered entirely different geographic regions from the training set. A model that simply memorised local patterns would fail; it had to learn generalisable relationships between environmental signals and water chemistry.
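One way to mimic this constraint locally is spatially blocked validation, where every site in a geographic block is held out together. The sketch below is an illustration only, not the validation scheme used in the project: the function names, the 1-degree block size, and the coordinates are all hypothetical.

```python
from collections import defaultdict


def spatial_block_id(lat, lon, cell_deg=1.0):
    """Map a site to a square grid cell (cell_deg is an assumed block size)."""
    return (int(lat // cell_deg), int(lon // cell_deg))


def leave_block_out_splits(sites, cell_deg=1.0):
    """Yield (block, train_idx, valid_idx); no block appears on both sides."""
    blocks = defaultdict(list)
    for i, (lat, lon) in enumerate(sites):
        blocks[spatial_block_id(lat, lon, cell_deg)].append(i)
    for block, valid_idx in sorted(blocks.items()):
        held_out = set(valid_idx)
        train_idx = [i for i in range(len(sites)) if i not in held_out]
        yield block, train_idx, valid_idx


# Made-up (lat, lon) pairs, for illustration only
sites = [(-25.7, 28.2), (-25.3, 28.9), (-33.9, 18.4), (-33.1, 18.8), (-29.8, 31.0)]
splits = list(leave_block_out_splits(sites))
```

With scikit-learn, the same block ids could instead be passed as `groups` to `GroupKFold`, which keeps each group on one side of every fold.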

EY provided a starter notebook with a simple linear baseline. The baseline scored poorly, which was the real starting point.

Technical Approach

I built my own pipeline around three main steps: pulling in extra environmental data, engineering features from it, and training separate models for each target. The main idea was to move beyond the starter notebook and give the model more context about the water, climate, and terrain around each sampling location.

1 — Data Acquisition

The EY baseline used only the core dataset. To improve generalisation, three additional sources were pulled in via their respective APIs and aligned to each sampling point's coordinates:

  • Landsat — optical surface reflectance bands (visible, NIR, SWIR, thermal) and derived water indices providing rich spectral information
  • TerraClimate — monthly climate rasters (precipitation, temperature, evapotranspiration, soil moisture) capturing the catchment-level hydrological context that drives alkalinity and conductance
  • Copernicus DEM — 30 m elevation grid used to derive slope and upstream catchment area, both strong proxies for mineral weathering (alkalinity) and runoff ionic load (conductance)
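As a rough illustration of the terrain step, slope can be derived from an elevation grid with central differences. This is a minimal sketch, not the project's code: it assumes the DEM has already been resampled to square 30 m cells in a projected coordinate system, and the function name and toy grid are hypothetical.

```python
import math


def slope_degrees(dem, row, col, cell_size_m=30.0):
    """Slope at an interior cell from central differences of its four neighbours."""
    dz_dx = (dem[row][col + 1] - dem[row][col - 1]) / (2 * cell_size_m)
    dz_dy = (dem[row + 1][col] - dem[row - 1][col]) / (2 * cell_size_m)
    # Gradient magnitude (rise over run) converted to an angle in degrees
    return math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))


# Toy grid: elevation rises 30 m per 30 m cell towards the east
tilted = [
    [100.0, 130.0, 160.0],
    [100.0, 130.0, 160.0],
    [100.0, 130.0, 160.0],
]
flat = [[50.0] * 3 for _ in range(3)]

tilted_slope = slope_degrees(tilted, 1, 1)
flat_slope = slope_degrees(flat, 1, 1)
```

If the grid is kept in geographic coordinates instead, the cell size in metres varies with latitude and would have to be computed per row rather than fixed.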

2 — Index & Feature Engineering

Raw bands were combined into physically motivated indices. Each target guided which indices to prioritise:

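# blue, green, red, nir, swir: Landsat surface reflectance values per band
eps = 1e-9  # small constant to avoid division by zero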

# Water / sediment indices — relevant for Alkalinity & Conductance
ndwi   = (green - nir)  / (green + nir  + eps)  # water extent
ndti   = (red   - green) / (red   + green + eps)  # turbidity, suspended solids
sabi   = (nir   - red)  / (blue  + green + eps)  # surface algal bloom

# Phosphorus proxies — DRP fuels phytoplankton, shifts red/NIR balance
fai    = nir - (red + (swir - red) * 0.3)        # floating algae index
rednir = red / (nir + eps)                        # red-to-NIR ratio
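The same formulas can be wrapped in a function and checked on a single pixel. The reflectance values below are made up for illustration (roughly a clear-water pixel, where green exceeds NIR), and the function name is hypothetical.

```python
EPS = 1e-9  # same role as eps above: avoids division by zero


def spectral_indices(blue, green, red, nir, swir):
    """Compute the indices listed above for one pixel (or NumPy arrays of pixels)."""
    return {
        "ndwi": (green - nir) / (green + nir + EPS),
        "ndti": (red - green) / (red + green + EPS),
        "sabi": (nir - red) / (blue + green + EPS),
        "fai": nir - (red + (swir - red) * 0.3),
        "rednir": red / (nir + EPS),
    }


# Hypothetical reflectances, not taken from the challenge data
sample = spectral_indices(blue=0.04, green=0.06, red=0.05, nir=0.02, swir=0.01)
```

For this pixel NDWI comes out at about 0.5, which is the positive value expected over open water.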

3 — Model Selection

Separate models were trained per target after benchmarking several algorithms:

Random Forest → Dissolved Reactive Phosphorus

DRP is the noisiest target, driven by localised algal dynamics, short-term weather events, and non-linear threshold effects (blooms can spike suddenly). Random Forest's ensemble of deep, independently grown trees averages out much of this variance and is less likely to overfit the sparse, positively skewed DRP distribution than a boosted model, which tends to chase outliers.

XGBoost → Total Alkalinity & Electrical Conductance

Alkalinity and conductance are more smoothly determined by geology and catchment ion load, relationships that are largely monotonic and well structured. XGBoost's gradient boosting captures these interactions efficiently, benefits more from the terrain and climate features (which have ordinal structure), and reached a lower validation RMSE than Random Forest on both targets.
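The per-target split described above can be sketched as a small registry that maps each target to its own estimator. This is an assumption-laden illustration rather than the project's pipeline: scikit-learn's `GradientBoostingRegressor` stands in for XGBoost so the example needs only scikit-learn, the target keys and hyperparameters are hypothetical, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Hypothetical target names; GradientBoostingRegressor is a stand-in for XGBoost
MODEL_FACTORIES = {
    "total_alkalinity": lambda: GradientBoostingRegressor(random_state=0),
    "electrical_conductance": lambda: GradientBoostingRegressor(random_state=0),
    "dissolved_reactive_phosphorus": lambda: RandomForestRegressor(
        n_estimators=50, random_state=0
    ),
}


def fit_per_target(X, targets):
    """Fit one independent model per target and return them keyed by target name."""
    models = {}
    for name, y in targets.items():
        model = MODEL_FACTORIES[name]()
        model.fit(X, y)
        models[name] = model
    return models


# Synthetic features and targets, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
targets = {
    "total_alkalinity": 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60),
    "electrical_conductance": X[:, 1] - X[:, 0] + rng.normal(scale=0.1, size=60),
    "dissolved_reactive_phosphorus": X[:, 2] ** 2 + rng.normal(scale=0.5, size=60),
}
models = fit_per_target(X, targets)
predictions = {name: model.predict(X) for name, model in models.items()}
```

Keeping the models in a dictionary makes it easy to swap the estimator for one target without touching the others.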

Results & Impact

Metric                  Value
Final score (avg. R²)   0.4079
Placement               70th out of 945 teams (3 000+ participants)
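For reference, an averaged R² can be computed as below. The challenge's exact scoring formula is not given here, so this sketch assumes an unweighted mean of the per-target R² values; the function names and numbers are illustrative only.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def average_r2(truth, preds):
    """Return (mean R² across targets, per-target R²), assuming equal weights."""
    scores = {name: r2_score(truth[name], preds[name]) for name in truth}
    return sum(scores.values()) / len(scores), scores


# Toy values: target "a" is predicted perfectly, target "b" only by its mean
truth = {"a": [1.0, 2.0, 3.0, 4.0], "b": [1.0, 2.0, 3.0, 4.0]}
preds = {"a": [1.0, 2.0, 3.0, 4.0], "b": [2.5, 2.5, 2.5, 2.5]}
mean_score, per_target = average_r2(truth, preds)
```

A perfect prediction scores 1 and a constant mean prediction scores 0, so the toy average is 0.5. Under this equal-weight assumption, one hard target can pull the overall score down noticeably.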

Methodology

The project was a good exercise in working through an unfamiliar data problem without pretending to know the perfect answer from the start:

  1. Understand the baseline — I started by checking what the EY starter notebook did and why it was limited. The biggest issue was that it used very little context around each sampling point.
  2. Add better signals — I brought in Landsat, TerraClimate, and elevation data because water quality is affected by more than the measured location itself. Climate, terrain, and spectral information all seemed worth testing.
  3. Build features and compare models — I created water and algae-related indices, then tested different models instead of assuming one algorithm would work best for every target.
  4. Submit and iterate — I used validation results and leaderboard feedback to improve the pipeline step by step. This was less about one perfect model and more about making a series of reasonable improvements.

Code & Artifacts