Ondřej Kutil


Data Analytics Case Studies

End-to-end ML pipelines — from async web scraping to XGBoost and BERT model comparison.

Python · XGBoost · BERT · Web Scraping
github ↗

The Problem

Applying machine learning to real-world domains requires more than model training — it requires building the entire pipeline: data acquisition, feature engineering, model selection, and evaluation against domain-specific metrics. This repository is a collection of end-to-end ML case studies that demonstrate exactly that process.

Featured Projects

Real Estate Price Prediction

An end-to-end pipeline that scrapes listings from Sreality.cz (the largest Czech real estate portal), engineers features from both structured data and free-text descriptions, and trains a tuned XGBoost regressor to predict apartment prices in Prague.

Data Acquisition

  • Async scraping with batching, robust JSON/HTML parsing, and incremental appends to keep memory bounded during long runs
  • Thousands of listings collected with full metadata: location, layout, condition, amenities, energy ratings
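The batching-with-incremental-appends pattern can be sketched as below. This is a minimal illustration, not the project's actual scraper: `fetch_listing` simulates the network call (the real pipeline hits the Sreality.cz API), and the JSONL output path is a placeholder.

```python
import asyncio
import json

async def fetch_listing(listing_id: int) -> dict:
    """Stand-in for an async HTTP request returning one parsed listing."""
    await asyncio.sleep(0)  # placeholder for network I/O
    return {"id": listing_id, "price": 100_000 + listing_id}

async def scrape(ids, batch_size=8, out_path="listings.jsonl"):
    written = 0
    with open(out_path, "w", encoding="utf-8") as f:
        # Fixed-size batches: each batch is fetched concurrently, appended
        # to disk, then discarded, so memory stays bounded on long runs.
        for start in range(0, len(ids), batch_size):
            batch = ids[start:start + batch_size]
            listings = await asyncio.gather(*(fetch_listing(i) for i in batch))
            for listing in listings:
                f.write(json.dumps(listing) + "\n")
                written += 1
    return written

n = asyncio.run(scrape(list(range(20))))
```

Writing one JSON object per line (JSONL) is what makes the appends incremental: a crashed run loses at most the current batch, and the file never needs to be re-read during scraping.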

Feature Engineering

The preprocessing pipeline extracts 30+ features across multiple categories:

Category         | Example features
-----------------|---------------------------------------------------
Location         | City, district, street, ZIP, lat/long
Structural       | Layout (2+kk, 3+1), floor, building type, condition
Area             | Usable area, balcony, terrace, cellar, garden
Amenities        | Garage, parking, elevator, furnished
Energy           | Energy rating, heating type, low-energy flag
Text-derived     | Metro/tram/bus proximity, park, school, renovated
Transport / POIs | Nearby grocery, restaurant, leisure, doctor counts

Text-derived features are extracted from listing descriptions using keyword matching — flags like desc_has_metro, desc_is_renovated, desc_is_sunny that capture signals not present in the structured fields.
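The keyword-matching step can be sketched as follows. The keyword lists here are illustrative examples, not the project's actual vocabularies; only the flag names (`desc_has_metro`, `desc_is_renovated`, `desc_is_sunny`) come from the description above.

```python
# Example keyword vocabularies (Czech + English); the real lists differ.
KEYWORD_FLAGS = {
    "desc_has_metro": ["metro", "metra"],
    "desc_is_renovated": ["po rekonstrukci", "renovated"],
    "desc_is_sunny": ["slunný", "světlý", "sunny"],
}

def extract_text_flags(description: str) -> dict:
    """Turn a free-text listing description into binary feature flags."""
    text = description.lower()
    return {
        flag: int(any(kw in text for kw in keywords))
        for flag, keywords in KEYWORD_FLAGS.items()
    }

flags = extract_text_flags("Světlý byt 2+kk po rekonstrukci, 5 minut od metra.")
# flags → {"desc_has_metro": 1, "desc_is_renovated": 1, "desc_is_sunny": 1}
```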

Model Training

  • XGBoost regressor with hyperparameter tuning via GridSearchCV
  • Evaluation using RMSE and R², with feature importance analysis to identify the strongest price drivers
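The tuning-and-evaluation loop looks roughly like the sketch below. To keep it self-contained it uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in for `xgboost.XGBRegressor` (which exposes the same estimator API); the parameter grid is an example, not the project's actual search space.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engineered listing features and prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Exhaustive grid search with cross-validation, as in the real pipeline.
param_grid = {"n_estimators": [100, 200], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid, cv=3)
search.fit(X_tr, y_tr)

pred = search.predict(X_te)
rmse = mean_squared_error(y_te, pred) ** 0.5
r2 = r2_score(y_te, pred)
importances = search.best_estimator_.feature_importances_  # strongest drivers
```

`feature_importances_` on the best estimator is what supports the "strongest price drivers" analysis: each entry is the share of split gain attributable to one feature.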

Sentiment Analysis

Text classification comparing a fine-tuned BERT model against an LSTM baseline to evaluate the trade-off between modern transformer architectures and traditional recurrent approaches.

Aspect                   | BERT                     | LSTM
-------------------------|--------------------------|--------------------------
Architecture             | Pre-trained transformer  | Recurrent neural network
Tokenisation             | WordPiece                | Word-level
Class imbalance handling | ✓                        | ✓
Evaluation metrics       | Accuracy, F1, ROC-AUC    | Accuracy, F1, ROC-AUC

The project implements:

  • Data preprocessing and tokenisation pipelines for both architectures
  • Class imbalance handling and regularisation
  • Side-by-side evaluation with accuracy, F1-score, ROC-AUC, and confusion matrices
  • Analysis of the performance vs. computational cost trade-off
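The side-by-side evaluation can be sketched with scikit-learn metrics. The prediction scores below are hypothetical placeholders standing in for the two models' outputs; only the comparison structure mirrors the project.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score)

# Hypothetical test labels and per-model probability scores.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
model_scores = {
    "BERT": np.array([0.1, 0.2, 0.4, 0.9, 0.8, 0.7, 0.6, 0.3, 0.9, 0.2]),
    "LSTM": np.array([0.3, 0.4, 0.6, 0.7, 0.6, 0.8, 0.4, 0.5, 0.7, 0.3]),
}

# One metrics row per model, so the two architectures are directly comparable.
report = {}
for name, scores in model_scores.items():
    y_pred = (scores >= 0.5).astype(int)
    report[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, scores),
        "confusion": confusion_matrix(y_true, y_pred).tolist(),
    }
```

Using the raw scores for ROC-AUC but thresholded predictions for accuracy/F1 keeps the threshold-free and threshold-dependent views of each model separate.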

Skills Demonstrated

  • Full-pipeline ML — from raw data acquisition to trained, evaluated models
  • Web scraping — async architecture with batching and robust parsing
  • Feature engineering — structured fields, text extraction, domain-informed transformations
  • Model comparison — systematic benchmarking across architectures (XGBoost, BERT, LSTM)
  • Visual storytelling — matplotlib, seaborn, and Plotly for communicating results


What I'd Do Differently

For the real estate project, replacing the keyword-flag approach for text features with TF-IDF or lightweight embeddings would capture richer semantic signals from descriptions. For sentiment analysis, adding a proper train/validation/test split timeline and testing on an out-of-domain corpus would better demonstrate real-world generalisation performance.