## The Problem
Applying machine learning to real-world domains requires more than model training — it requires building the entire pipeline: data acquisition, feature engineering, model selection, and evaluation against domain-specific metrics. This repository is a collection of end-to-end ML case studies that demonstrate exactly that process.
## Featured Projects
### Real Estate Price Prediction
An end-to-end pipeline that scrapes listings from Sreality.cz (the largest Czech real estate portal), engineers features from both structured data and free-text descriptions, and trains a tuned XGBoost regressor to predict apartment prices in Prague.
#### Data Acquisition
- Async scraping with batching, robust JSON/HTML parsing, and incremental appends to keep memory bounded during long runs
- Thousands of listings collected with full metadata: location, layout, condition, amenities, energy ratings
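The batching-plus-incremental-append pattern above can be sketched as follows. This is a minimal illustration using only the standard library: `fetch_listing` is a hypothetical stub standing in for the real HTTP call to Sreality.cz, and the batch size and output path are illustrative.

```python
import asyncio
import json
from pathlib import Path

async def fetch_listing(listing_id: int) -> dict:
    """Hypothetical stand-in for the real async HTTP call to the listing API."""
    await asyncio.sleep(0)  # placeholder for network latency
    return {"id": listing_id, "price": 100_000 * listing_id}

async def scrape(ids: list[int], out_path: Path, batch_size: int = 4) -> None:
    # Fetch IDs in fixed-size batches and append each batch to a JSON Lines
    # file immediately, so memory stays bounded during long runs.
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        results = await asyncio.gather(*(fetch_listing(i) for i in batch))
        with out_path.open("a", encoding="utf-8") as f:
            for record in results:
                f.write(json.dumps(record) + "\n")

asyncio.run(scrape(list(range(10)), Path("listings.jsonl")))
```

Appending one batch at a time also makes long scrapes restartable: completed batches are already on disk if a run is interrupted.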
#### Feature Engineering
The preprocessing pipeline extracts 30+ features across multiple categories:
| Category | Example features |
|---|---|
| Location | City, district, street, ZIP, lat/long |
| Structural | Layout (2+kk, 3+1), floor, building type, condition |
| Area | Usable area, balcony, terrace, cellar, garden |
| Amenities | Garage, parking, elevator, furnished |
| Energy | Energy rating, heating type, low-energy flag |
| Text-derived | Metro/tram/bus proximity, park, school, renovated |
| Transport / POIs | Nearby grocery, restaurant, leisure, doctor counts |
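One domain-informed transformation in the structural category is parsing Czech layout strings: "2+kk" means two rooms plus a kitchen corner, "3+1" three rooms plus a separate kitchen. A minimal sketch (the helper name and return shape are illustrative, not the project's actual code):

```python
import re

def parse_layout(layout: str) -> dict:
    """Turn a Czech layout string ('2+kk', '3+1') into numeric features.
    '+kk' denotes a kitchen corner, '+N' a separate kitchen.
    Hypothetical helper for illustration only."""
    match = re.fullmatch(r"(\d+)\+(kk|\d+)", layout.strip().lower())
    if not match:
        return {"rooms": None, "separate_kitchen": None}
    return {
        "rooms": int(match.group(1)),
        "separate_kitchen": match.group(2) != "kk",
    }

print(parse_layout("2+kk"))  # 2 rooms, kitchen corner
print(parse_layout("3+1"))   # 3 rooms, separate kitchen
```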
Text-derived features are extracted from listing descriptions using keyword matching: flags such as `desc_has_metro`, `desc_is_renovated`, and `desc_is_sunny` that capture signals not present in the structured fields.
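The keyword-flag extraction can be sketched as a dictionary of flag names mapped to trigger terms; the keyword lists below are illustrative, not the project's actual vocabulary.

```python
# Illustrative keyword lists; the project's actual vocabulary may differ.
KEYWORD_FLAGS = {
    "desc_has_metro": ["metro"],
    "desc_is_renovated": ["renovated", "rekonstrukce"],
    "desc_is_sunny": ["sunny", "slunny"],
}

def extract_text_flags(description: str) -> dict:
    """Binary flags from a free-text listing description via substring matching."""
    text = description.lower()
    return {
        flag: int(any(kw in text for kw in keywords))
        for flag, keywords in KEYWORD_FLAGS.items()
    }

flags = extract_text_flags("Sunny flat, five minutes from the metro.")
```

Substring matching is crude (no stemming, no negation handling), which is exactly the limitation the "What I'd Do Differently" section addresses.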
#### Model Training
- XGBoost regressor with hyperparameter tuning via GridSearchCV
- Evaluation using RMSE and R², with feature importance analysis to identify the strongest price drivers
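The tuning-and-evaluation loop can be sketched as below. The project uses `xgboost.XGBRegressor`; this sketch substitutes scikit-learn's `GradientBoostingRegressor` so it runs with scikit-learn alone, and the synthetic data and parameter grid are illustrative, not the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the scraped listing features and prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The project tunes xgboost.XGBRegressor; GradientBoostingRegressor is a
# drop-in stand-in here. Swap the estimator and grid for the real pipeline.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

pred = search.predict(X_test)
rmse = mean_squared_error(y_test, pred) ** 0.5
r2 = r2_score(y_test, pred)
importances = search.best_estimator_.feature_importances_
```

`feature_importances_` on the best estimator is what backs the "strongest price drivers" analysis: each entry is that feature's share of the model's split gain.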
### Sentiment Analysis
Text classification comparing a fine-tuned BERT model against an LSTM baseline to evaluate the trade-off between modern transformer architectures and traditional recurrent approaches.
| Aspect | BERT | LSTM |
|---|---|---|
| Architecture | Pre-trained transformer | Recurrent neural network |
| Tokenisation | WordPiece | Word-level |
| Class imbalance handling | ✓ | ✓ |
| Evaluation metrics | Accuracy, F1, ROC-AUC | Accuracy, F1, ROC-AUC |
The project implements:
- Data preprocessing and tokenisation pipelines for both architectures
- Class imbalance handling and regularisation
- Side-by-side evaluation with accuracy, F1-score, ROC-AUC, and confusion matrices
- Analysis of the performance vs. computational cost trade-off
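The side-by-side evaluation step can be sketched with `sklearn.metrics`. The labels and probability scores below are hypothetical placeholders standing in for the real BERT and LSTM outputs on a held-out set.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, roc_auc_score)

# Hypothetical held-out labels and per-model probability scores,
# standing in for the real BERT and LSTM predictions.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
scores = {
    "BERT": np.array([0.1, 0.2, 0.4, 0.9, 0.8, 0.7, 0.6, 0.3]),
    "LSTM": np.array([0.2, 0.6, 0.3, 0.8, 0.4, 0.7, 0.9, 0.1]),
}

report = {}
for name, proba in scores.items():
    pred = (proba >= 0.5).astype(int)  # threshold probabilities at 0.5
    report[name] = {
        "accuracy": accuracy_score(y_true, pred),
        "f1": f1_score(y_true, pred),
        "roc_auc": roc_auc_score(y_true, proba),   # threshold-free ranking metric
        "confusion": confusion_matrix(y_true, pred),
    }
```

Computing ROC-AUC from the raw probabilities (rather than thresholded labels) keeps the comparison independent of each model's calibration.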
## Skills Demonstrated
- Full-pipeline ML — from raw data acquisition to trained, evaluated models
- Web scraping — async architecture with batching and robust parsing
- Feature engineering — structured fields, text extraction, domain-informed transformations
- Model comparison — systematic benchmarking across architectures (XGBoost, BERT, LSTM)
- Visual storytelling — matplotlib, seaborn, and Plotly for communicating results
## Code & Artifacts
## What I'd Do Differently
For the real estate project, replacing the keyword-flag approach for text features with TF-IDF or lightweight embeddings would capture richer semantic signal from descriptions. For sentiment analysis, adopting a stricter train/validation/test split protocol and evaluating on an out-of-domain corpus would better demonstrate real-world generalisation.
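The TF-IDF alternative would be a small change in practice: feed the raw descriptions through a vectorizer instead of the keyword flags. A minimal sketch with scikit-learn, using toy descriptions and illustrative (untuned) parameters:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy listing descriptions standing in for the scraped corpus.
descriptions = [
    "Sunny renovated flat near the metro with a balcony",
    "Quiet apartment close to a park, needs renovation",
    "Modern flat with garage and terrace near a tram stop",
]

# Unigrams and bigrams replace the binary keyword flags with weighted term
# features; ngram_range and min_df here are illustrative, not tuned.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X_text = vectorizer.fit_transform(descriptions)

print(X_text.shape)  # (number of documents, vocabulary size)
```

The resulting sparse matrix can be horizontally stacked with the structured features before training, so the regressor sees both signal sources at once.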