## The Problem
When a model fails in production, someone needs to open the black box and explain why. That requires understanding the mathematics beneath the surface, not just calling `.fit()` in sklearn.
This repository builds core ML algorithms from first principles using only NumPy and pandas, then validates each against the sklearn equivalent on identical data. The goal is the kind of deep technical understanding needed to audit model behaviour, diagnose failures, and explain predictions to non-technical stakeholders.
## Algorithms Covered
| Algorithm | Key concept practised |
|---|---|
| Linear regression (OLS) | Normal equations, gradient descent optimisation |
| k-means | Centroid initialisation, assignment/update iterations, convergence |
| Neural networks | Forward/backward propagation, activation functions |
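As a taste of what "from scratch" means for the last row, here is a minimal, hypothetical sketch (not the repository's actual code) of forward and backward propagation through a single sigmoid layer, with the analytic gradient checked against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))        # 8 samples, 3 features
y = rng.normal(size=(8, 1))
W = rng.normal(size=(3, 1)) * 0.1  # single linear layer's weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W):
    # Forward pass: linear layer -> sigmoid -> mean squared error
    a = sigmoid(X @ W)
    return np.mean((a - y) ** 2)

# Backward pass: chain rule through the MSE and the sigmoid
a = sigmoid(X @ W)
dL_da = 2.0 * (a - y) / len(X)   # derivative of the mean squared error
da_dz = a * (1.0 - a)            # derivative of the sigmoid
grad = X.T @ (dL_da * da_dz)     # gradient w.r.t. W, shape (3, 1)

# Sanity check: analytic gradient vs central finite difference on one weight
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
assert np.isclose(grad[0, 0], numeric, atol=1e-6)
```

The finite-difference check at the end is the kind of verification step each notebook applies before trusting a hand-derived gradient.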
Every implementation follows the same constraint: NumPy and pandas only, with no scikit-learn, no PyTorch, and no shortcuts. Every matrix operation, gradient computation, and update rule is written explicitly.
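For example, the OLS row above reduces to a few explicit NumPy operations. A hypothetical sketch (variable names are illustrative, not the repository's) solving the normal equations directly:

```python
import numpy as np

# Synthetic data with a known intercept and weights
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.5]) + 0.5 + rng.normal(scale=0.1, size=100)

# Add an intercept column, then solve the normal equations
# (X^T X) w = X^T y  -- np.linalg.solve is preferred over an explicit inverse
Xb = np.column_stack([np.ones(len(X)), X])
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
# w recovers roughly [0.5, 3.0, -1.5]: intercept, then the two weights
```

Using `np.linalg.solve` rather than inverting the matrix is itself one of the design decisions the notebooks document: it is both faster and numerically safer.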
## Methodology
Each algorithm follows a structured five-step process:
- Select — Choose an algorithm with clear mathematical foundations and practical relevance.
- Derive — Work through the mathematics: loss functions, gradients, update rules. Document each derivation step-by-step.
- Implement — Build from scratch in NumPy or pandas. Every matrix operation is visible and explainable.
- Validate — Run both the from-scratch implementation and the sklearn equivalent on the same dataset. Compare outputs numerically and visually.
- Document — Write up the case study in notebook format, explaining not just what the algorithm does but why each design decision matters.
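The Validate step in the repository compares against sklearn. As a self-contained illustration of the same idea, this hypothetical sketch checks a from-scratch gradient-descent fit against the closed-form least-squares solution, which is what sklearn's `LinearRegression` computes:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.05, size=200)

# Reference result: closed-form least squares
w_ref = np.linalg.lstsq(X, y, rcond=None)[0]

# From-scratch result: batch gradient descent on mean squared error
w = np.zeros(3)
lr = 0.1
for _ in range(2000):
    grad = (2.0 / len(X)) * X.T @ (X @ w - y)
    w -= lr * grad

# The validation step: the two implementations must agree numerically
assert np.allclose(w, w_ref, atol=1e-4)
```

The same pattern applies to every algorithm: run both implementations on identical data and demand numerical agreement, not just similar-looking plots.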
## What This Demonstrates
This project is less about novel results and more about demonstrating the ability to read a paper, translate mathematics into working code, and verify correctness rigorously.
The case study format — one notebook per algorithm — forces clear written explanation of every step. That's the same skill required when presenting model audits to stakeholders: explaining why a model makes the predictions it does, what its failure modes are, and where the boundaries of its reliability lie.
Key competencies shown:
- Mathematical fluency — loss functions, derivatives, matrix operations implemented from first principles
- Training dynamics — learning rates, convergence analysis, optimisation algorithms
- Validation discipline — every implementation verified against a reference library
- Technical communication — each notebook is a self-contained explainer, not just code
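The training-dynamics point can be made concrete: on even a simple least-squares loss, gradient descent converges or diverges depending on the learning rate. A hypothetical sketch (thresholds chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0])  # noiseless target, exact solution exists

def final_loss(lr, steps=200):
    # Batch gradient descent on mean squared error
    w = np.zeros(2)
    for _ in range(steps):
        w -= lr * (2.0 / len(X)) * X.T @ (X @ w - y)
    return np.mean((X @ w - y) ** 2)

# A small step converges; a step larger than 2/L (L = largest Hessian
# eigenvalue, here close to 2) makes each update overshoot and diverge
assert final_loss(0.1) < 1e-6
assert final_loss(1.5) > 1e6
```

Runs like this, swept over a grid of learning rates, are how the notebooks turn "convergence analysis" from a phrase into a plot.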