Ondřej Kutil


Data Analytics Case Studies

End-to-end ML pipelines — from async web scraping to XGBoost and BERT model comparison.

Python · XGBoost · BERT · Web Scraping
github ↗

The Problem

Applying machine learning to real-world domains requires more than model training — it requires building the entire pipeline: data acquisition, feature engineering, model selection, and evaluation against domain-specific metrics. This repository is a collection of end-to-end ML case studies that demonstrate exactly that process.

Featured Projects

Real Estate Price Prediction

An end-to-end pipeline that scrapes listings from Sreality.cz (the largest Czech real estate portal), engineers features from both structured data and free-text descriptions, and trains a tuned XGBoost regressor to predict apartment prices in Prague.

Data Acquisition

  • Async scraping with batching, robust JSON/HTML parsing, and incremental appends to keep memory bounded during long runs
  • Thousands of listings collected with full metadata: location, layout, condition, amenities, energy ratings
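The batching-with-incremental-appends pattern can be sketched as below. This is a minimal illustration, not the project's actual scraper: `fetch_listing` simulates the network call (the real pipeline hits the Sreality.cz API), and the JSONL output path is a placeholder.

```python
import asyncio
import json

async def fetch_listing(listing_id: int) -> dict:
    """Stand-in for an async HTTP request returning one parsed listing."""
    await asyncio.sleep(0)  # placeholder for network I/O
    return {"id": listing_id, "price": 100_000 + listing_id}

async def scrape(ids, batch_size=8, out_path="listings.jsonl"):
    written = 0
    with open(out_path, "w", encoding="utf-8") as f:
        # Fixed-size batches: each batch is fetched concurrently, appended
        # to disk, then discarded, so memory stays bounded on long runs.
        for start in range(0, len(ids), batch_size):
            batch = ids[start:start + batch_size]
            listings = await asyncio.gather(*(fetch_listing(i) for i in batch))
            for listing in listings:
                f.write(json.dumps(listing) + "\n")
                written += 1
    return written

n = asyncio.run(scrape(list(range(20))))
```

Writing one JSON object per line (JSONL) is what makes the appends incremental: a crashed run loses at most the current batch, and the file never needs to be re-read during scraping.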

Feature Engineering

The preprocessing pipeline extracts 30+ features across multiple categories:

Category         | Example features
-----------------|---------------------------------------------------
Location         | City, district, street, ZIP, lat/long
Structural       | Layout (2+kk, 3+1), floor, building type, condition
Area             | Usable area, balcony, terrace, cellar, garden
Amenities        | Garage, parking, elevator, furnished
Energy           | Energy rating, heating type, low-energy flag
Text-derived     | Metro/tram/bus proximity, park, school, renovated
Transport / POIs | Nearby grocery, restaurant, leisure, doctor counts

Text-derived features are extracted from listing descriptions using keyword matching — flags like desc_has_metro, desc_is_renovated, desc_is_sunny that capture signals not present in the structured fields.
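The keyword-matching step can be sketched as follows. The keyword lists here are illustrative examples, not the project's actual vocabularies; only the flag names (`desc_has_metro`, `desc_is_renovated`, `desc_is_sunny`) come from the description above.

```python
# Example keyword vocabularies (Czech + English); the real lists differ.
KEYWORD_FLAGS = {
    "desc_has_metro": ["metro", "metra"],
    "desc_is_renovated": ["po rekonstrukci", "renovated"],
    "desc_is_sunny": ["slunný", "světlý", "sunny"],
}

def extract_text_flags(description: str) -> dict:
    """Turn a free-text listing description into binary feature flags."""
    text = description.lower()
    return {
        flag: int(any(kw in text for kw in keywords))
        for flag, keywords in KEYWORD_FLAGS.items()
    }

flags = extract_text_flags("Světlý byt 2+kk po rekonstrukci, 5 minut od metra.")
# flags → {"desc_has_metro": 1, "desc_is_renovated": 1, "desc_is_sunny": 1}
```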

Model Training

  • XGBoost regressor with hyperparameter tuning via GridSearchCV
  • Evaluation using RMSE and R², with feature importance analysis to identify the strongest price drivers
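The tuning-and-evaluation loop looks roughly like the sketch below. To keep it self-contained it uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in for `xgboost.XGBRegressor` (which exposes the same estimator API); the parameter grid is an example, not the project's actual search space.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engineered listing features and prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Exhaustive grid search with cross-validation, as in the real pipeline.
param_grid = {"n_estimators": [100, 200], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid, cv=3)
search.fit(X_tr, y_tr)

pred = search.predict(X_te)
rmse = mean_squared_error(y_te, pred) ** 0.5
r2 = r2_score(y_te, pred)
importances = search.best_estimator_.feature_importances_  # strongest drivers
```

`feature_importances_` on the best estimator is what supports the "strongest price drivers" analysis: each entry is the share of split gain attributable to one feature.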

Sentiment Analysis

Text classification comparing a fine-tuned BERT model against an LSTM baseline to evaluate the trade-off between modern transformer architectures and traditional recurrent approaches.

Aspect                   | BERT                     | LSTM
-------------------------|--------------------------|--------------------------
Architecture             | Pre-trained transformer  | Recurrent neural network
Tokenisation             | WordPiece                | Word-level
Class imbalance handling | ✓                        | ✓
Evaluation metrics       | Accuracy, F1, ROC-AUC    | Accuracy, F1, ROC-AUC

The project implements:

  • Data preprocessing and tokenisation pipelines for both architectures
  • Class imbalance handling and regularisation
  • Side-by-side evaluation with accuracy, F1-score, ROC-AUC, and confusion matrices
  • Analysis of the performance vs. computational cost trade-off
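The side-by-side evaluation can be sketched with scikit-learn metrics. The prediction scores below are hypothetical placeholders standing in for the two models' outputs; only the comparison structure mirrors the project.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score)

# Hypothetical test labels and per-model probability scores.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
model_scores = {
    "BERT": np.array([0.1, 0.2, 0.4, 0.9, 0.8, 0.7, 0.6, 0.3, 0.9, 0.2]),
    "LSTM": np.array([0.3, 0.4, 0.6, 0.7, 0.6, 0.8, 0.4, 0.5, 0.7, 0.3]),
}

# One metrics row per model, so the two architectures are directly comparable.
report = {}
for name, scores in model_scores.items():
    y_pred = (scores >= 0.5).astype(int)
    report[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, scores),
        "confusion": confusion_matrix(y_true, y_pred).tolist(),
    }
```

Using the raw scores for ROC-AUC but thresholded predictions for accuracy/F1 keeps the threshold-free and threshold-dependent views of each model separate.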

Skills Demonstrated

  • Full-pipeline ML — from raw data acquisition to trained, evaluated models
  • Web scraping — async architecture with batching and robust parsing
  • Feature engineering — structured fields, text extraction, domain-informed transformations
  • Model comparison — systematic benchmarking across architectures (XGBoost, BERT, LSTM)
  • Visual storytelling — matplotlib, seaborn, and Plotly for communicating results


What I'd Do Differently

For the real estate project, replacing the keyword-flag approach for text features with TF-IDF or lightweight embeddings would capture richer semantic signals from descriptions. For sentiment analysis, adding a proper train/validation/test split timeline and testing on an out-of-domain corpus would better demonstrate real-world generalisation performance.