{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.colab import drive\n",
    "drive.mount('/content/drive')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Library Imports ───────────────────────────────────────────────────────\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.ticker as mticker\n",
    "import seaborn as sns\n",
    "import math\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Global plot style\n",
    "sns.set_theme(style='whitegrid', palette='muted')\n",
    "plt.rcParams.update({'figure.dpi': 120, 'axes.titlesize': 13,\n",
    "                     'axes.titleweight': 'bold', 'axes.labelsize': 11})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 1. Dataset Initialization & Overview\n",
    "\n",
    "We load the Hotel Booking dataset and examine its structure: dimensions, column names, data types, and a sample of raw rows.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Load Data ─────────────────────────────────────────────────────────────\n",
    "df = pd.read_csv('/content/drive/MyDrive/EDA final project/hotel_booking.csv')\n",
    "\n",
    "print(f\"Dataset shape  : {df.shape[0]:,} rows × {df.shape[1]} columns\")\n",
    "print(\"-\" * 50)\n",
    "print(\"Column list:\")\n",
    "for col in df.columns:\n",
    "    print(f\"  • {col}\")\n",
    "print(\"-\" * 50)\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Preview 5 Rows ────────────────────────────────────────────────────────\n",
    "pd.set_option('display.max_columns', None)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 2. Data Dictionary\n",
    "\n",
    "| Column | Description | Expected Type | Constraint |\n",
    "|:---|:---|:---|:---|\n",
    "| `hotel` | Hotel type | Categorical | Resort Hotel / City Hotel |\n",
    "| `is_canceled` | Booking cancelled? | Boolean | 0 = No, 1 = Yes |\n",
    "| `lead_time` | Days between booking & arrival | Integer | ≥ 0 |\n",
    "| `arrival_date_year` | Year of arrival | Integer | 2015–2017 |\n",
    "| `arrival_date_month` | Month of arrival | String | Jan–Dec |\n",
    "| `arrival_date_week_number` | ISO week number | Integer | 1–53 |\n",
    "| `arrival_date_day_of_month` | Day of arrival | Integer | 1–31 |\n",
    "| `stays_in_weekend_nights` | Weekend nights booked | Integer | ≥ 0 |\n",
    "| `stays_in_week_nights` | Weekday nights booked | Integer | ≥ 0 |\n",
    "| `adults` | Number of adults | Integer | ≥ 0 |\n",
    "| `children` | Number of children | Float→Int | ≥ 0 |\n",
    "| `babies` | Number of babies | Integer | ≥ 0 |\n",
    "| `meal` | Meal plan | Categorical | BB / HB / FB / No Meal |\n",
    "| `country` | Guest country of origin | Categorical | ISO 3166 |\n",
    "| `market_segment` | Market segment | Categorical | e.g., Online TA |\n",
    "| `distribution_channel` | Booking channel | Categorical | e.g., Direct |\n",
    "| `is_repeated_guest` | Returning guest? | Boolean | 0 / 1 |\n",
    "| `previous_cancellations` | Prior cancellations by guest | Integer | ≥ 0 |\n",
    "| `previous_bookings_not_canceled` | Prior kept bookings | Integer | ≥ 0 |\n",
    "| `reserved_room_type` | Room type requested | Categorical | A–L |\n",
    "| `assigned_room_type` | Room type assigned | Categorical | A–L |\n",
    "| `booking_changes` | Number of changes made | Integer | ≥ 0 |\n",
    "| `deposit_type` | Deposit type | Categorical | No Deposit / Non Refund / Refundable |\n",
    "| `agent` | Travel agent ID | Float→Int | 0 = No Agent |\n",
    "| `company` | Company ID | Float | High nulls → dropped |\n",
    "| `days_in_waiting_list` | Days on waitlist | Integer | ≥ 0 |\n",
    "| `customer_type` | Guest type | Categorical | Transient / Contract / etc. |\n",
    "| `adr` | Average Daily Rate (€) | Float | ≥ 0 |\n",
    "| `required_car_parking_spaces` | Parking requested | Integer | ≥ 0 |\n",
    "| `total_of_special_requests` | Special requests count | Integer | ≥ 0 |\n",
    "| `reservation_status` | Final booking status | Categorical | Canceled / Check-Out / No-Show |\n",
    "| `reservation_status_date` | Date of last status change | Object→Date | Valid date |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 3. Data Types Inspection\n",
    "\n",
    "We audit each column's current dtype against its logical type to flag mismatches before cleaning.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Data Types Audit ──────────────────────────────────────────────────────\n",
    "print(\"Current Data Types:\")\n",
    "print(df.dtypes.to_string())\n",
    "print(\"-\" * 50)\n",
    "\n",
    "issues = []\n",
    "if df['reservation_status_date'].dtype == 'object':\n",
    "    issues.append(\"  [ISSUE] 'reservation_status_date' is text  → convert to datetime\")\n",
    "for col in ['children', 'agent', 'company']:\n",
    "    if col in df.columns and df[col].dtype == 'float64':\n",
    "        issues.append(f\"  [ISSUE] '{col}' is float (has NaNs)          → fill NaNs then cast to int\")\n",
    "for col in ['is_canceled', 'is_repeated_guest']:\n",
    "    issues.append(f\"  [NOTE]  '{col}' is int64 (0/1 flag)            → convert to bool\")\n",
    "\n",
    "print(\"\\nIdentified Issues:\")\n",
    "for i in issues:\n",
    "    print(i)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 4. Missing Values Assessment\n",
    "\n",
    "We quantify missing data per column before deciding on a handling strategy.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Missing Values Report ─────────────────────────────────────────────────\n",
    "missing_count = df.isnull().sum()\n",
    "missing_pct   = (missing_count / len(df)) * 100\n",
    "\n",
    "missing_report = pd.DataFrame({\n",
    "    'Missing Count': missing_count,\n",
    "    'Missing %': missing_pct.round(2)\n",
    "}).query('`Missing Count` > 0').sort_values('Missing Count', ascending=False)\n",
    "\n",
    "print(f\"Total rows in dataset: {len(df):,}\")\n",
    "print()\n",
    "display(missing_report)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 5. Missing Values Handling Plan\n",
    "\n",
    "| Column | Strategy | Justification |\n",
    "|:---|:---|:---|\n",
    "| `company` | **Drop column** | > 90 % missing — adds noise, no analytical value |\n",
    "| `agent` | **Fill with 0** | Null = no agent used; 0 preserves this business meaning |\n",
    "| `country` | **Fill with \"Unknown\"** | < 1 % missing; preserves rows without distorting distribution |\n",
    "| `children` | **Fill with 0** | Few missing; null implies zero children |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Execute Missing Handling Plan ────────────────────────────────────────\n",
    "df.drop(columns=['company'], inplace=True)\n",
    "df['agent']    = df['agent'].fillna(0)\n",
    "df['country']  = df['country'].fillna('Unknown')\n",
    "df['children'] = df['children'].fillna(0)\n",
    "\n",
    "# Verification\n",
    "remaining = df[['agent', 'country', 'children']].isnull().sum()\n",
    "print(\"Missing values after treatment:\")\n",
    "print(remaining.to_string())\n",
    "print(\"\\n✓ All targeted missing values resolved. 'company' column dropped.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 6. Data Type Corrections\n",
    "\n",
    "With NaNs resolved, we safely cast each column to its logical type.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Apply Type Corrections ────────────────────────────────────────────────\n",
    "df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'])\n",
    "df['children']                = df['children'].astype('int64')\n",
    "df['agent']                   = df['agent'].astype('int64')\n",
    "df['is_canceled']             = df['is_canceled'].astype(bool)\n",
    "df['is_repeated_guest']       = df['is_repeated_guest'].astype(bool)\n",
    "\n",
    "print(\"Updated types for corrected columns:\")\n",
    "cols_to_verify = ['reservation_status_date', 'children', 'agent',\n",
    "                  'is_canceled', 'is_repeated_guest']\n",
    "print(df[cols_to_verify].dtypes.to_string())\n",
    "print(\"\\n✓ All data type corrections applied.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 7. Duplicate Records Analysis\n",
    "\n",
    "We first check for **exact duplicates** (identical across every column), then distinguish technical errors from legitimate repeat-guest bookings using a derived `Guest_ID`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Exact Duplicate Check ─────────────────────────────────────────────────\n",
    "dup_count = df.duplicated().sum()\n",
    "print(f\"Exact duplicate rows: {dup_count}\")\n",
    "\n",
    "if dup_count > 0:\n",
    "    print(\"\\nSample duplicate rows:\")\n",
    "    display(df[df.duplicated()].head(5))\n",
    "else:\n",
    "    print(\"✓ No exact duplicate rows found.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Derived Guest_ID (Composite Key) ──────────────────────────────────────\n",
    "df['Guest_ID'] = (df['name'].astype(str) + '|' +\n",
    "                  df['email'].astype(str) + '|' +\n",
    "                  df['phone-number'].astype(str))\n",
    "\n",
    "total_dup_ids = df['Guest_ID'].duplicated().sum()\n",
    "\n",
    "# Technical duplicates: same guest + same arrival date + same hotel\n",
    "tech_errors = df.duplicated(\n",
    "    subset=['Guest_ID', 'hotel',\n",
    "            'arrival_date_year', 'arrival_date_month', 'arrival_date_day_of_month'],\n",
    "    keep=False\n",
    ").sum()\n",
    "\n",
    "loyal_repeats = max(0, total_dup_ids - tech_errors)\n",
    "\n",
    "print(f\"Duplicate Guest IDs (repeat visitors)  : {total_dup_ids:,}\")\n",
    "print(f\"Confirmed technical duplicates (errors): {tech_errors:,}\")\n",
    "print(f\"Legitimate repeat bookings (loyalty)   : {loyal_repeats:,}\")\n",
    "print()\n",
    "print(\"Decision: Keep all records — duplicate Guest_IDs represent returning customers,\")\n",
    "print(\"          not data entry errors.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 8. Validity Rules Check\n",
    "\n",
    "We flag records that violate domain logic identified during the `describe()` review.\n",
    "\n",
    "| Rule | Condition | Expected Action |\n",
    "|:---|:---|:---|\n",
    "| Financial integrity | `adr < 0` | Remove |\n",
    "| Ghost bookings | adults + children + babies = 0 | Remove |\n",
    "| Unrealistic occupancy | `adults > 10` | Remove |\n",
    "| Negative lead time | `lead_time < 0` | Remove |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Validity Violations Count ────────────────────────────────────────────\n",
    "neg_adr       = df[df['adr'] < 0]\n",
    "zero_guests   = df[(df['adults'] + df['children'] + df['babies']) == 0]\n",
    "extreme_adults= df[df['adults'] > 10]\n",
    "neg_lead      = df[df['lead_time'] < 0]\n",
    "\n",
    "print(f\"1. Negative ADR               : {len(neg_adr):,} rows\")\n",
    "print(f\"2. Ghost bookings (0 guests)  : {len(zero_guests):,} rows\")\n",
    "print(f\"3. Extreme adults (> 10)      : {len(extreme_adults):,} rows\")\n",
    "print(f\"4. Negative lead time         : {len(neg_lead):,} rows\")\n",
    "\n",
    "if len(extreme_adults):\n",
    "    print(\"\\nSample — extreme adults:\")\n",
    "    display(extreme_adults[['name','adults','children','hotel','adr']].head(3))\n",
    "\n",
    "if len(zero_guests):\n",
    "    print(\"\\nSample — ghost bookings:\")\n",
    "    display(zero_guests[['name','adults','children','babies','hotel']].head(3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Remove Invalid Records ────────────────────────────────────────────────\n",
    "initial_size = len(df)\n",
    "\n",
    "df = df[\n",
    "    ((df['adults'] + df['children'] + df['babies']) > 0) &\n",
    "    (df['adr'] >= 0) &\n",
    "    (df['adults'] <= 10) &\n",
    "    (df['lead_time'] >= 0)\n",
    "].copy()\n",
    "\n",
    "removed = initial_size - len(df)\n",
    "print(f\"Rows removed : {removed:,}\")\n",
    "print(f\"Rows retained: {len(df):,}\")\n",
    "print()\n",
    "print(f\"Min ADR after cleaning      : {df['adr'].min():.2f}\")\n",
    "print(f\"Min total guests after clean: {(df['adults']+df['children']+df['babies']).min()}\")\n",
    "print(\"✓ All impossible values removed.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 9. Category Cleanliness (Label Consistency)\n",
    "\n",
    "We strip whitespace, unify casing, and map synonymous labels in key categorical columns to prevent fragmented frequency counts.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Discover Raw Unique Values ────────────────────────────────────────────\n",
    "cat_cols = ['hotel', 'meal', 'market_segment',\n",
    "            'distribution_channel', 'reserved_room_type',\n",
    "            'deposit_type', 'customer_type']\n",
    "\n",
    "print(\"Raw unique values before cleaning:\")\n",
    "for col in cat_cols:\n",
    "    print(f\"  {col}: {sorted(df[col].unique())}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Apply Cleaning Plan ───────────────────────────────────────────────────\n",
    "for col in cat_cols:\n",
    "    df[col] = df[col].astype(str).str.strip().str.title()\n",
    "\n",
    "# Standardise meal labels (Undefined & SC both mean No Meal Plan)\n",
    "meal_map = {\n",
    "    'Undefined': 'No Meal', 'Sc': 'No Meal',\n",
    "    'Bb': 'Bed & Breakfast', 'Hb': 'Half Board', 'Fb': 'Full Board'\n",
    "}\n",
    "df['meal'] = df['meal'].replace(meal_map)\n",
    "\n",
    "print(\"Cleaned unique values for key columns:\")\n",
    "for col in ['meal', 'market_segment', 'deposit_type']:\n",
    "    print(f\"  {col}: {sorted(df[col].unique())}\")\n",
    "print(\"\\n✓ Category labels cleaned.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 10. Comprehensive Numerical Summary\n",
    "\n",
    "We compute mean, median, std, min, and max for all numerical columns, then highlight where mean ≠ median (evidence of skewness / outliers).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Statistical Summary Table ─────────────────────────────────────────────\n",
    "summary = df.describe().loc[['mean','50%','std','min','max']]\n",
    "summary.rename(index={'50%': 'median'}, inplace=True)\n",
    "print(\"Statistical Summary (transposed for readability):\")\n",
    "display(summary.T.round(2))\n",
    "\n",
    "# ─── Mean vs Median Skewness Commentary ────────────────────────────────────\n",
    "print(\"\\nMean vs Median Observations (key columns):\")\n",
    "print(\"-\" * 55)\n",
    "for col in ['lead_time', 'adr', 'total_of_special_requests',\n",
    "            'stays_in_week_nights', 'booking_changes']:\n",
    "    if col not in df.columns:\n",
    "        continue\n",
    "    m, med = df[col].mean(), df[col].median()\n",
    "    if med == 0:\n",
    "        tag = \"Extremely right-skewed (median = 0)\"\n",
    "    elif m > med * 1.15:\n",
    "        tag = f\"Right-skewed  (mean {m:.1f} > median {med:.1f})\"\n",
    "    elif m < med * 0.85:\n",
    "        tag = f\"Left-skewed   (mean {m:.1f} < median {med:.1f})\"\n",
    "    else:\n",
    "        tag = f\"Roughly symmetric (mean {m:.1f} ≈ median {med:.1f})\"\n",
    "    print(f\"  {col:<35}: {tag}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 11. Numerical Distributions & Skewness Analysis\n",
    "\n",
    "Histograms with KDE curves reveal the shape of each numerical variable.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Distribution Histograms ───────────────────────────────────────────────\n",
    "numeric_cols = df.select_dtypes(include='number').columns.tolist()\n",
    "n_cols = 3\n",
    "n_rows = math.ceil(len(numeric_cols) / n_cols)\n",
    "\n",
    "fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, n_rows * 4))\n",
    "axes = axes.flatten()\n",
    "\n",
    "for i, col in enumerate(numeric_cols):\n",
    "    sns.histplot(df[col], kde=True, bins=30, ax=axes[i], color='steelblue')\n",
    "    axes[i].set_title(f'Distribution: {col}')\n",
    "    axes[i].set_xlabel('')\n",
    "    axes[i].set_ylabel('Frequency')\n",
    "\n",
    "# Hide unused subplots\n",
    "for j in range(i + 1, len(axes)):\n",
    "    axes[j].set_visible(False)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.suptitle('Numerical Feature Distributions', y=1.01, fontsize=15, fontweight='bold')\n",
    "plt.show()\n",
    "\n",
    "# ─── Key Interpretations ────────────────────────────────────────────────────\n",
    "interpretations = {\n",
    "    'lead_time'               : 'Right-skewed — most guests book close to arrival; a long tail of early planners exists.',\n",
    "    'adr'                     : 'Right-skewed — most rates cluster around the median; a few luxury prices pull the mean up.',\n",
    "    'stays_in_week_nights'    : 'Right-skewed — short stays (1–4 nights) dominate; long stays are rare.',\n",
    "    'adults'                  : 'Concentrated at 2 — reflects standard double-occupancy room bookings.',\n",
    "    'total_of_special_requests': 'Highly right-skewed — the vast majority of guests make zero special requests.',\n",
    "}\n",
    "print(\"\\nKey Distribution Insights:\")\n",
    "for col, text in interpretations.items():\n",
    "    if col in numeric_cols:\n",
    "        print(f\"  • {col}: {text}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 12. Outlier Detection & Treatment\n",
    "\n",
    "Boxplots expose extreme values. We decide per-column whether to **keep**, **cap (Winsorise)**, or **remove** outliers.\n",
    "\n",
    "| Column | Decision | Rationale |\n",
    "|:---|:---|:---|\n",
    "| `adr` | **Cap at 99th percentile** | Extreme luxury prices distort the mean; the 99th cap preserves high-end reality |\n",
    "| `lead_time` | **Keep** | Long advance bookings are legitimate business events |\n",
    "| `stays_in_week_nights` | **Cap at 30 nights** | Stays > 30 nights are non-standard and likely mis-entered |\n",
    "| `adults` | **Cap at ≤ 4** | > 4 adults in one room is unrealistic; already filtered above |\n",
    "| `children`, `babies` | **Keep** | Large-family records are genuine |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Boxplot Grid (Before Treatment) ──────────────────────────────────────\n",
    "outlier_cols = ['lead_time', 'adr',\n",
    "                'stays_in_weekend_nights', 'stays_in_week_nights', 'adults']\n",
    "\n",
    "fig, axes = plt.subplots(1, len(outlier_cols),\n",
    "                         figsize=(18, 5))\n",
    "for ax, col in zip(axes, outlier_cols):\n",
    "    sns.boxplot(y=df[col], ax=ax, color='salmon', width=0.5, flierprops=dict(marker='.', alpha=0.3))\n",
    "    ax.set_title(col)\n",
    "    ax.set_xlabel('')\n",
    "\n",
    "plt.suptitle('Outlier Detection — Boxplots (Before Treatment)',\n",
    "             fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Apply Outlier Treatment ───────────────────────────────────────────────\n",
    "# 1. Cap ADR at 99th percentile (Winsorise)\n",
    "adr_99 = df['adr'].quantile(0.99)\n",
    "df['adr'] = df['adr'].clip(upper=adr_99)\n",
    "\n",
    "# 2. Remove extreme total stay durations\n",
    "df = df[(df['stays_in_weekend_nights'] + df['stays_in_week_nights']) <= 30]\n",
    "\n",
    "size_after = len(df)\n",
    "print(f\"ADR capped at        : {adr_99:.2f} €\")\n",
    "print(f\"Rows after treatment : {size_after:,}\")\n",
    "\n",
    "# ─── Boxplot After Treatment ────────────────────────────────────────────────\n",
    "fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n",
    "sns.boxplot(y=df['adr'], ax=axes[0], color='lightgreen',\n",
    "            flierprops=dict(marker='.', alpha=0.3))\n",
    "axes[0].set_title('ADR After Capping (99th pct)')\n",
    "\n",
    "sns.boxplot(y=df['stays_in_week_nights'] + df['stays_in_weekend_nights'],\n",
    "            ax=axes[1], color='lightblue',\n",
    "            flierprops=dict(marker='.', alpha=0.3))\n",
    "axes[1].set_title('Total Nights After Capping (≤30)')\n",
    "\n",
    "plt.suptitle('Outlier Treatment Results', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 13. Categorical Variable Analysis\n",
    "\n",
    "We display frequency tables and bar charts for the most analytically important categorical columns.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Frequency Tables ──────────────────────────────────────────────────────\n",
    "cat_cols_analysis = ['hotel', 'meal', 'market_segment',\n",
    "                     'distribution_channel', 'deposit_type', 'customer_type']\n",
    "\n",
    "for col in cat_cols_analysis:\n",
    "    counts  = df[col].value_counts()\n",
    "    pct     = df[col].value_counts(normalize=True).mul(100).round(2)\n",
    "    tbl = pd.DataFrame({'Count': counts, 'Percentage (%)': pct})\n",
    "    print(f\"\\n{'='*50}\")\n",
    "    print(f\"  {col.upper()}\")\n",
    "    print('='*50)\n",
    "    display(tbl)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Bar Charts for Key Categorical Columns ────────────────────────────────\n",
    "fig, axes = plt.subplots(2, 3, figsize=(18, 10))\n",
    "axes = axes.flatten()\n",
    "\n",
    "for i, col in enumerate(cat_cols_analysis):\n",
    "    order = df[col].value_counts().index\n",
    "    sns.countplot(data=df, y=col, order=order, ax=axes[i],\n",
    "                  palette='Blues_r', saturation=0.85)\n",
    "    axes[i].set_title(f'Frequency: {col}')\n",
    "    axes[i].set_xlabel('Count')\n",
    "    axes[i].set_ylabel('')\n",
    "    # Add percentage labels\n",
    "    total = len(df)\n",
    "    for p in axes[i].patches:\n",
    "        pct = f'{100 * p.get_width() / total:.1f}%'\n",
    "        axes[i].annotate(pct,\n",
    "                         (p.get_width() + total*0.002, p.get_y() + p.get_height()/2),\n",
    "                         va='center', fontsize=9)\n",
    "\n",
    "plt.suptitle('Categorical Feature Distributions', fontsize=15, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 14. Rare Category Handling\n",
    "\n",
    "Categories with < 3 % frequency are consolidated into **'Other'** to avoid noise in models and visualisations.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Identify & Group Rare Categories ─────────────────────────────────────\n",
    "threshold = 0.03\n",
    "cols_to_check = ['country', 'market_segment', 'distribution_channel', 'meal']\n",
    "\n",
    "print(\"Rare Category Report (< 3 %):\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "for col in cols_to_check:\n",
    "    rel_freq    = df[col].value_counts(normalize=True)\n",
    "    rare_labels = rel_freq[rel_freq < threshold].index.tolist()\n",
    "    if rare_labels:\n",
    "        print(f\"\\n  {col.upper()} — {len(rare_labels)} rare labels found:\")\n",
    "        print(f\"  Examples: {rare_labels[:5]}\")\n",
    "        df[col] = df[col].apply(lambda x: 'Other' if x in rare_labels else x)\n",
    "        print(f\"  → Grouped into 'Other' successfully.\")\n",
    "    else:\n",
    "        print(f\"\\n  {col.upper()} — No rare categories found.\")\n",
    "\n",
    "print(\"\\nUpdated COUNTRY top-10 after consolidation:\")\n",
    "display(df['country'].value_counts().head(10).to_frame('Count'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 15. Numerical Correlation Analysis\n",
    "\n",
    "We compute the Pearson correlation matrix and highlight the five strongest pairwise relationships.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Correlation Heatmap ───────────────────────────────────────────────────\n",
    "numeric_df  = df.select_dtypes(include='number')\n",
    "corr_matrix = numeric_df.corr()\n",
    "\n",
    "plt.figure(figsize=(14, 10))\n",
    "mask = np.triu(np.ones_like(corr_matrix, dtype=bool))   # show lower triangle only\n",
    "sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f',\n",
    "            cmap='RdBu_r', center=0, linewidths=0.4,\n",
    "            annot_kws={'size': 8}, vmin=-1, vmax=1)\n",
    "plt.title('Pearson Correlation Matrix (Lower Triangle)', fontsize=14)\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# ─── Top 5 Strongest Pairs ─────────────────────────────────────────────────\n",
    "upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))\n",
    "top5  = upper.stack().reindex(\n",
    "    upper.stack().abs().sort_values(ascending=False).index\n",
    ").head(5)\n",
    "\n",
    "print(\"\\nTop 5 Strongest Numerical Relationships:\")\n",
    "print(\"-\" * 55)\n",
    "for (f1, f2), v in top5.items():\n",
    "    direction = \"positive\" if v > 0 else \"negative\"\n",
    "    print(f\"  {f1}  ↔  {f2}\")\n",
    "    print(f\"    Correlation: {v:.3f}  ({direction})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 16. Visualising Key Numerical Relationships\n",
    "\n",
    "Scatter plots with regression lines confirm the nature of the two strongest correlations.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Scatter Plots for Top Pairs ───────────────────────────────────────────\n",
    "df_sample = df.sample(n=min(2000, len(df)), random_state=42)\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(16, 6))\n",
    "\n",
    "# Plot 1: Weekend vs week nights\n",
    "sns.regplot(data=df_sample, x='stays_in_weekend_nights', y='stays_in_week_nights',\n",
    "            ax=axes[0],\n",
    "            scatter_kws={'alpha': 0.25, 'color': '#2c7bb6', 's': 15},\n",
    "            line_kws={'color': '#d7191c', 'lw': 2})\n",
    "axes[0].set_title('Weekend Nights vs Week Nights')\n",
    "axes[0].set_xlabel('Weekend Nights')\n",
    "axes[0].set_ylabel('Week Nights')\n",
    "\n",
    "# Plot 2: Adults vs ADR\n",
    "sns.regplot(data=df_sample, x='adults', y='adr',\n",
    "            ax=axes[1],\n",
    "            scatter_kws={'alpha': 0.25, 'color': '#1a9641', 's': 15},\n",
    "            line_kws={'color': '#d7191c', 'lw': 2})\n",
    "axes[1].set_title('Number of Adults vs ADR')\n",
    "axes[1].set_xlabel('Adults')\n",
    "axes[1].set_ylabel('Average Daily Rate (€)')\n",
    "\n",
    "plt.suptitle('Key Numerical Relationships', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"Interpretation:\")\n",
    "print(\"  1. Weekend vs Week Nights: Strong positive linear trend — longer stays are\")\n",
    "print(\"     proportionally distributed across the whole week.\")\n",
    "print(\"  2. Adults vs ADR: Upward slope confirms that room price increases with guest count.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 17. Numeric Effects by Category\n",
    "\n",
    "We compare ADR across hotel types and market segments to understand revenue drivers.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── ADR by Hotel Type ─────────────────────────────────────────────────────\n",
    "hotel_adr = (df.groupby('hotel')['adr']\n",
    "               .agg(Mean='mean', Median='median', Std='std', Count='count')\n",
    "               .reset_index()\n",
    "               .rename(columns={'hotel': 'Hotel Type'}))\n",
    "\n",
    "print(\"ADR by Hotel Type:\")\n",
    "display(hotel_adr.round(2))\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Bar chart\n",
    "sns.barplot(data=df, x='hotel', y='adr', estimator='mean',\n",
    "            ax=axes[0], palette='Set2', capsize=0.1)\n",
    "axes[0].set_title('Mean ADR by Hotel Type')\n",
    "axes[0].set_ylabel('Mean ADR (€)')\n",
    "axes[0].set_xlabel('')\n",
    "\n",
    "# Violin plot — shows full distribution\n",
    "sns.violinplot(data=df, x='hotel', y='adr',\n",
    "               ax=axes[1], palette='Set2', inner='quartile')\n",
    "axes[1].set_title('ADR Distribution by Hotel Type')\n",
    "axes[1].set_ylabel('ADR (€)')\n",
    "axes[1].set_xlabel('')\n",
    "\n",
    "plt.suptitle('ADR Analysis by Hotel Type', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# ─── ADR by Market Segment ──────────────────────────────────────────────────\n",
    "seg_adr = (df.groupby('market_segment')['adr']\n",
    "             .agg(Mean='mean', Count='count')\n",
    "             .sort_values('Mean', ascending=False)\n",
    "             .reset_index())\n",
    "\n",
    "plt.figure(figsize=(12, 5))\n",
    "sns.barplot(data=seg_adr, x='market_segment', y='Mean',\n",
    "            palette='Blues_r', order=seg_adr['market_segment'])\n",
    "plt.title('Mean ADR by Market Segment')\n",
    "plt.ylabel('Mean ADR (€)')\n",
    "plt.xlabel('')\n",
    "plt.xticks(rotation=30, ha='right')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 18. Cancellation Drivers — Multivariate Analysis\n",
    "\n",
    "We investigate three predictors of `is_canceled`: deposit type, lead time, and prior cancellations.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Cancellation Dashboard ────────────────────────────────────────────────\n",
    "fig, axes = plt.subplots(1, 3, figsize=(20, 6))\n",
    "\n",
    "# 1. Deposit Type vs Cancellation (stacked 100% bar)\n",
    "dep_cancel = pd.crosstab(df['deposit_type'], df['is_canceled'],\n",
    "                          normalize='index') * 100\n",
    "dep_cancel.plot(kind='bar', stacked=True, ax=axes[0],\n",
    "                color=['#2ca25f', '#de2d26'], edgecolor='white')\n",
    "axes[0].set_title('Cancellation Rate by Deposit Type')\n",
    "axes[0].set_ylabel('Percentage (%)')\n",
    "axes[0].set_xlabel('')\n",
    "axes[0].tick_params(axis='x', rotation=15)\n",
    "axes[0].legend(['Not Canceled', 'Canceled'], loc='lower right')\n",
    "\n",
    "# 2. Lead Time vs Cancellation (boxplot)\n",
    "sns.boxplot(data=df, x='is_canceled', y='lead_time',\n",
    "            ax=axes[1], palette='Set2')\n",
    "axes[1].set_title('Lead Time vs Cancellation Status')\n",
    "axes[1].set_xlabel('Canceled')\n",
    "axes[1].set_ylabel('Lead Time (days)')\n",
    "axes[1].set_xticklabels(['No (0)', 'Yes (1)'])\n",
    "\n",
    "# 3. Avg Previous Cancellations (barplot)\n",
    "sns.barplot(data=df, x='is_canceled', y='previous_cancellations',\n",
    "            ax=axes[2], palette='coolwarm', capsize=0.1)\n",
    "axes[2].set_title('Avg Prior Cancellations by Status')\n",
    "axes[2].set_xlabel('Canceled')\n",
    "axes[2].set_ylabel('Mean Prior Cancellations')\n",
    "axes[2].set_xticklabels(['No (0)', 'Yes (1)'])\n",
    "\n",
    "plt.suptitle('Cancellation Driver Analysis', fontsize=15, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# ─── Summary Table ──────────────────────────────────────────────────────────\n",
    "print(\"\\nMean values by cancellation status:\")\n",
    "summary = df.groupby('is_canceled')[\n",
    "    ['lead_time', 'previous_cancellations', 'adr', 'total_of_special_requests']\n",
    "].mean().round(2)\n",
    "summary.index = ['Not Canceled', 'Canceled']\n",
    "display(summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 19. Multivariate Correlation Heatmap\n",
    "\n",
    "A final full-feature heatmap highlights multicollinearity and pinpoints features most correlated with `is_canceled`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ─── Full Correlation Heatmap ──────────────────────────────────────────────\n",
    "numeric_df2  = df.select_dtypes(include=np.number)\n",
    "corr_matrix2 = numeric_df2.corr()\n",
    "\n",
    "plt.figure(figsize=(15, 11))\n",
    "sns.heatmap(corr_matrix2, annot=True, fmt='.2f',\n",
    "            cmap='coolwarm', linewidths=0.4, center=0,\n",
    "            annot_kws={'size': 7.5}, vmin=-1, vmax=1)\n",
    "plt.title('Multivariate Correlation Heatmap — All Numeric Features',\n",
    "          fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# ─── 3 Key Insights ─────────────────────────────────────────────────────────\n",
    "print(\"Key Insights from the Heatmap:\")\n",
    "print()\n",
    "print(\"  1. LEAD TIME ↔ IS_CANCELED (positive):\")\n",
    "print(\"     Bookings made far in advance have a notably higher cancellation rate.\")\n",
    "print(\"     This suggests guests are less committed when booking months ahead.\")\n",
    "print()\n",
    "print(\"  2. WEEKEND NIGHTS ↔ WEEK NIGHTS (strong positive):\")\n",
    "print(\"     These two features are highly redundant — they measure the same\")\n",
    "print(\"     construct (total stay length). One could be dropped before modelling.\")\n",
    "print()\n",
    "print(\"  3. ADR ↔ ADULTS / CHILDREN (moderate positive):\")\n",
    "print(\"     Room pricing is primarily driven by guest count rather than\")\n",
    "print(\"     booking channel, confirming a per-person pricing model.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 20. Final EDA Summary: Insights, Risks & Roadmap\n",
    "\n",
    "### I. Top 5 Analytical Insights\n",
    "\n",
    "1. **Lead Time → Cancellation**: Significant positive correlation. Guests booking months in advance are substantially more likely to cancel — a key risk signal for revenue management.\n",
    "2. **Deposit Type Predictive Power**: \"Non-Refund\" deposit holders paradoxically show a high cancellation rate, likely driven by the OTA (Online Travel Agency) segment behaviour.\n",
    "3. **Seasonality**: Booking volume peaks in summer (July/August), especially for Resort Hotels. Temporal features carry strong predictive information.\n",
    "4. **Customer Loyalty**: Repeat guests (`is_repeated_guest = True`) cancel at a significantly lower rate — a high-value binary feature for cancellation prediction.\n",
    "5. **Market Segment Volatility**: \"Online TA\" dominates volume (~47 %) but carries the highest cancellation risk, making segment a strong categorical predictor.\n",
    "\n",
    "---\n",
    "\n",
    "### II. Top 5 Data Quality Risks Identified & Resolved\n",
    "\n",
    "| # | Issue | Status |\n",
    "|:---|:---|:---|\n",
    "| 1 | Missing values in `agent`, `country`, `children`, `company` | ✓ Resolved (imputed / dropped) |\n",
    "| 2 | ADR outliers (extreme luxury rates) | ✓ Capped at 99th percentile |\n",
    "| 3 | Ghost bookings (0 total guests) | ✓ Removed |\n",
    "| 4 | Inconsistent categorical labels (e.g., `meal` = 'Sc' / 'Undefined') | ✓ Standardised |\n",
    "| 5 | Rare country / segment categories cluttering distributions | ✓ Consolidated into 'Other' |\n",
    "\n",
    "---\n",
    "\n",
    "### III. Recommended Next Steps\n",
    "\n",
    "1. **Feature Engineering**: Create `total_stay` (weekend + week nights), `total_guests` (adults + children + babies), and an `arrival_season` label.\n",
    "2. **Encoding**: One-hot encode low-cardinality categoricals; target-encode `country`.\n",
    "3. **Scaling**: Standardise `lead_time`, `adr`, and `days_in_waiting_list` before modelling.\n",
    "4. **Modelling**: Train a classification model (Logistic Regression → Random Forest → XGBoost) on `is_canceled` as the target variable.\n",
    "5. **Orange Data Mining**: Import the cleaned CSV into Orange and replicate key visualisations (distributions, scatter plots, heatmap) as required by the project rubric.\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
