## The Problem
When a model fails in production, someone needs to open the black box and explain why. That requires understanding the mathematics beneath the surface, not just calling `.fit()` in sklearn.
This repository builds core ML algorithms from first principles using only NumPy and pandas, then validates each against the sklearn equivalent on identical data. The goal is the kind of deep technical understanding needed to audit model behaviour, diagnose failures, and explain predictions to non-technical stakeholders.
## Algorithms Covered
| Algorithm | Key concept practised |
|---|---|
| Linear regression (OLS) | Normal equations, gradient descent optimisation |
| k-means | Centroid initialisation, assignment/update iterations, convergence |
| Neural networks | Forward/backward propagation, activation functions |
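As a taste of what "from scratch" means for the last row, here is a minimal, hypothetical sketch (not the repository's actual code) of forward and backward propagation through a single sigmoid layer, with the analytic gradient checked against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))        # 8 samples, 3 features
y = rng.normal(size=(8, 1))
W = rng.normal(size=(3, 1)) * 0.1  # single linear layer's weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W):
    # Forward pass: linear layer -> sigmoid -> mean squared error
    a = sigmoid(X @ W)
    return np.mean((a - y) ** 2)

# Backward pass: chain rule through the MSE and the sigmoid
a = sigmoid(X @ W)
dL_da = 2.0 * (a - y) / len(X)   # derivative of the mean squared error
da_dz = a * (1.0 - a)            # derivative of the sigmoid
grad = X.T @ (dL_da * da_dz)     # gradient w.r.t. W, shape (3, 1)

# Sanity check: analytic gradient vs central finite difference on one weight
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
assert np.isclose(grad[0, 0], numeric, atol=1e-6)
```

The finite-difference check at the end is the kind of verification step each notebook applies before trusting a hand-derived gradient.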
Every implementation follows the same constraint: NumPy and pandas only, with no scikit-learn, no PyTorch, and no shortcuts. Every matrix operation, gradient computation, and update rule is written explicitly.
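For example, the OLS row above reduces to a few explicit NumPy operations. A hypothetical sketch (variable names are illustrative, not the repository's) solving the normal equations directly:

```python
import numpy as np

# Synthetic data with a known intercept and weights
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.5]) + 0.5 + rng.normal(scale=0.1, size=100)

# Add an intercept column, then solve the normal equations
# (X^T X) w = X^T y  -- np.linalg.solve is preferred over an explicit inverse
Xb = np.column_stack([np.ones(len(X)), X])
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
# w recovers roughly [0.5, 3.0, -1.5]: intercept, then the two weights
```

Using `np.linalg.solve` rather than inverting the matrix is itself one of the design decisions the notebooks document: it is both faster and numerically safer.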
## Methodology
Each algorithm follows a structured five-step process:
- Select — Choose an algorithm with clear mathematical foundations and practical relevance.
- Derive — Work through the mathematics: loss functions, gradients, update rules. Document each derivation step-by-step.
- Implement — Build from scratch in NumPy or pandas. Every matrix operation is visible and explainable.
- Validate — Run both the from-scratch implementation and the sklearn equivalent on the same dataset. Compare outputs numerically and visually.
- Document — Write up the case study in notebook format, explaining not just what the algorithm does but why each design decision matters.
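The Validate step in the repository compares against sklearn. As a self-contained illustration of the same idea, this hypothetical sketch checks a from-scratch gradient-descent fit against the closed-form least-squares solution, which is what sklearn's `LinearRegression` computes:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.05, size=200)

# Reference result: closed-form least squares
w_ref = np.linalg.lstsq(X, y, rcond=None)[0]

# From-scratch result: batch gradient descent on mean squared error
w = np.zeros(3)
lr = 0.1
for _ in range(2000):
    grad = (2.0 / len(X)) * X.T @ (X @ w - y)
    w -= lr * grad

# The validation step: the two implementations must agree numerically
assert np.allclose(w, w_ref, atol=1e-4)
```

The same pattern applies to every algorithm: run both implementations on identical data and demand numerical agreement, not just similar-looking plots.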
## What This Demonstrates
This project is less about novel results and more about demonstrating the ability to read a paper, translate mathematics into working code, and verify correctness rigorously.
The case study format — one notebook per algorithm — forces clear written explanation of every step. That's the same skill required when presenting model audits to stakeholders: explaining why a model makes the predictions it does, what its failure modes are, and where the boundaries of its reliability lie.
Key competencies shown:
- Mathematical fluency — loss functions, derivatives, matrix operations implemented from first principles
- Training dynamics — learning rates, convergence analysis, optimisation algorithms
- Validation discipline — every implementation verified against a reference library
- Technical communication — each notebook is a self-contained explainer, not just code
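The training-dynamics point can be made concrete: on even a simple least-squares loss, gradient descent converges or diverges depending on the learning rate. A hypothetical sketch (thresholds chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0])  # noiseless target, exact solution exists

def final_loss(lr, steps=200):
    # Batch gradient descent on mean squared error
    w = np.zeros(2)
    for _ in range(steps):
        w -= lr * (2.0 / len(X)) * X.T @ (X @ w - y)
    return np.mean((X @ w - y) ** 2)

# A small step converges; a step larger than 2/L (L = largest Hessian
# eigenvalue, here close to 2) makes each update overshoot and diverge
assert final_loss(0.1) < 1e-6
assert final_loss(1.5) > 1e6
```

Runs like this, swept over a grid of learning rates, are how the notebooks turn "convergence analysis" from a phrase into a plot.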