About This Project
A full-cycle data engineering and analytical intelligence project built on a real-world Portuguese hotel booking dataset.
- Normalized relational schema — 11 entities, 4 functional domains, 3NF throughout
- Full SQL Server implementation with PK/FK, CHECK, and UNIQUE constraints
- 4-stage bulk insert pipeline: Stage → Validate → Normalize → Error-log
- 10 analytical SQL views: occupancy, revenue, cancellations, housekeeping, staff
- CRUD operations: INSERT reservations, UPDATE status, DELETE test records
- Full 20-question Appendix B EDA: profiling, quality, univariate, bivariate, multivariate
- Python: pandas, matplotlib, seaborn — clean notebook with 18 visualizations
- Random Forest classifier: AUC 0.86, Accuracy 81%, F1 0.79
- k-Means guest clustering: k=3, Silhouette Score 0.61
- Orange Data Mining workflows for both supervised and unsupervised learning
Relational Schema Design
11 normalized tables across 4 functional domains — Reservations as the central hub with 4 foreign keys.
- Hotels — HotelID, Name, City, Country, StarRating
- RoomTypes — TypeID, Name, Capacity, BasePrice
- Rooms — RoomID, UNIQUE(HotelID, RoomNum)
- Guests — GuestID, Name, Email, IsRepeatedGuest
- Staff — StaffID, Role, Department, HotelID FK
- Services — ServiceID, Name, Category, Price
- Reservations — 4×FK hub, Status, Channel, Nights
- ServiceUsage — M:N junction (Reservation × Service)
- Invoices — 1:1 with Reservation, OutstandingAmount
- Payments — 1:N under Invoice, partial payments OK
- HousekeepingLogs — StartTime, EndTime, CleaningType
- HotelBooking_Staging — flat NVARCHAR buffer for CSV ingestion
- 4-stage ETL: Load → Validate → Normalize → Error-log
- 119,390 rows ingested in under 30 seconds
10 Analytical SQL Views
| # | View Name | Business Question | Key Technique |
|---|---|---|---|
| * | vw_HotelOccupancyRate | Occupancy % per hotel (last 30 days) | COUNT DISTINCT · NULLIF guard |
| * | vw_MonthlyRevenue | Revenue trend by month & hotel | YEAR/MONTH · GROUP BY |
| * | vw_GuestStayHistory | Guest lifetime value & stay count | LEFT JOIN · SUM · COUNT |
| * | vw_CancellationByChannel | Cancellation rate per booking channel | Correlated subquery · CAST % |
| * | vw_RoomTypePerformance | Revenue & demand by room type | LEFT JOIN · AVG · SUM |
| * | vw_OutstandingInvoices | Unpaid balances & days overdue | DATEDIFF · WHERE outstanding > 0 |
| * | vw_ServiceUsageAnalysis | Most revenue-generating services | LEFT JOIN Services · rank |
| * | vw_PaymentMethodBreakdown | Revenue share per payment method | Scalar subquery · % share |
| * | vw_HousekeepingTurnaround | Avg & max cleaning time per room | DATEDIFF MINUTE · AVG · MAX |
| * | vw_StaffPerformance | Revenue & bookings per staff member | Multiple LEFT JOINs · COUNT · SUM |
CRUD Operations & Entity Relationship
INSERT INTO Reservations (
HotelID, GuestID, RoomID,
CheckInDate, Nights, Status
) VALUES (
1, 1042, 55,
'2025-06-15', 3, 'Confirmed'
);
UPDATE Reservations
SET Status = 'Canceled',
UpdatedAt = GETDATE()
WHERE ReservationID = 8921
AND Status != 'CheckedOut';
EDA Gallery
Comprehensive data quality, univariate, bivariate, and multivariate analysis.
Univariate histograms and KDE curves for all numerical dataset features.
Boxplots highlighting statistical outliers in lead time, ADR, and stay duration.
Distribution of Average Daily Rate after capping at the 99th percentile.
Executive KPIs & Interactive Charts
Mirroring the 10 SQL views — real computed metrics from the hotel booking dataset.
Executive KPIs & Interactive Charts
Mirroring the 10 SQL views — real computed metrics from the hotel booking dataset.
Predictive Intelligence
Machine Learning Engine
Four classification models trained on 119,390 hotel bookings to predict cancellation risk at booking time — powering real-time intervention strategies.
Four Models — One Champion
Cancellation Risk Calculator
Real-Time Booking Risk Assessment
Enter booking parameters below — the Random Forest model scores cancellation probability instantly.
and click Calculate
What Drives Cancellations?
🌲 Random Forest — Feature Importance
📈 Logistic Regression — Coefficients
ROC Curve Comparison
End-to-End Workflow
Model Intelligence Insights
Lead Time is #1 Predictor
Highest importance across all models. Bookings made 180+ days ahead cancel at 2.3× the rate of same-week bookings.
Deposit Paradox
Non-Refund deposits show ~99% cancellation — a model-confirmed OTA inventory-blocking behaviour not a genuine booking signal.
Repeat Guest Shield
is_repeated_guest is a top-5 feature. Returning guests cancel at ~5% vs 37% baseline — loyalty programmes directly reduce ML-predicted risk.
Online TA Volatility
Online TA segment drives 47% of bookings but is the highest-risk segment. The model assigns mean probability of 0.48 to OTA bookings.
Special Requests = Trust Signal
Each additional special request reduces cancellation probability by ~8%. Guests who customize their stay are 3× less likely to cancel.
Revenue × Risk Trade-off
High-ADR bookings show lower cancel risk for Direct channel but higher risk for OTA. Targeted pre-auth policies recommended by segment.
Predictive Models & Guest Segmentation
Four classification models compared on 119,390 bookings — plus k-Means guest clustering into 3 behavioral profiles.
Direct booking channel
High special requests
Non-Refund OTA deposits
Speculative booking behaviour
Predictable ADR patterns
Strong revenue contribution
6 Actionable Strategies
Derived directly from EDA findings and validated by the ML classifier. Structured by implementation priority.