import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

15 Take-Home Midterm Exam
15.1 Overview
In this take-home exam, you will work with the Bike Sharing Dataset (daily data, day.csv) to explore and predict bike rental demand. The exam is designed to test your skills in data import, cleaning, exploratory data analysis (EDA), and predictive modeling. You will also write a short memo summarizing your findings and recommendations.
15.1.1 Data source
The dataset comes from the UCI Machine Learning Repository: Bike Sharing Dataset by Hadi Fanaee-T. It contains daily counts of bike rentals in Washington, D.C. along with weather and calendar information.
You will use day.csv (731 rows; one row per day). Variables (from the dataset README) are listed below:
dteday: date
season: season (1=spring, 2=summer, 3=fall, 4=winter)
yr: year (0=2011, 1=2012)
mnth: month (1–12)
holiday: whether the day is a holiday (1=yes, 0=no)
weekday: day of week (0–6)
workingday: 1 if neither weekend nor holiday, else 0
weathersit: weather situation
- 1: Clear / few clouds / partly cloudy
- 2: Mist + cloudy
- 3: Light snow / light rain + thunderstorm
- 4: Heavy rain / ice pellets / snow + fog
temp: normalized temperature in Celsius (divided by 41)
atemp: normalized “feels like” temperature (divided by 50)
hum: normalized humidity (divided by 100)
windspeed: normalized wind speed (divided by 67)
casual: count of casual users
registered: count of registered users
cnt: total rentals (casual + registered)
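Since the weather variables are stored in normalized form, approximate raw values can be recovered by multiplying by the constants stated above. A minimal sketch (the one-row frame and the `_c`/`_pct` column names are illustrative, not part of the dataset):

```python
import pandas as pd

# Hypothetical one-row example in the dataset's normalized units
row = pd.DataFrame({"temp": [0.5], "atemp": [0.4], "hum": [0.6], "windspeed": [0.2]})

# Undo the normalization described in the dataset README
raw = pd.DataFrame({
    "temp_c": row["temp"] * 41,      # temperature in Celsius
    "atemp_c": row["atemp"] * 50,    # "feels like" temperature in Celsius
    "hum_pct": row["hum"] * 100,     # humidity in percent
    "wind": row["windspeed"] * 67,   # wind speed
})
print(raw.round(1))
```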
15.1.2 Problem
Our main goal is to forecast daily bike rental demand (cnt). Accurate forecasts can help staffing, bike rebalancing, and maintenance planning.
15.1.3 Expectations for your submission
- You should submit:
  - A _exam_bike_name.qmd file of your report with your code/answers embedded. Your code should be executable and produce the same results when run.
  - A rendered HTML or PDF report. The report should be well-organized and clearly labeled.
- You may follow the steps and use any reasonable statistical/machine learning/AI approaches.
- Your answers should be reproducible.
- You should justify key choices if applicable (e.g., data splitting, feature engineering, metric selection, model choice).
15.1.3.1 Load packages and data
Load the packages you plan to use (e.g., pandas, numpy, matplotlib or seaborn, and scikit-learn). For example:
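A minimal import block covering the packages named above might look like this (seaborn is left commented out as an optional alternative for plotting):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import seaborn as sns  # optional alternative for plotting

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
```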
You can also import sklearn.compose.ColumnTransformer, sklearn.preprocessing.OneHotEncoder, and models like LinearRegression, Ridge, RandomForestRegressor, etc.
Place day.csv in the same folder as your .qmd, or in a data/ subfolder.
# load data
day = pd.read_csv("data/day.csv")
day["dteday"] = pd.to_datetime(day["dteday"])
day.shape
(731, 16)
15.2 Part 1 — Data import and data quality
15.2.1 Task 1 — Data checks: missingness, duplicates, and consistency
- Display the first 5-10 rows of the dataset to understand its structure.
- How many missing values are in each column?
- Are there duplicated dates (dteday)? If yes, how many?
- Verify whether cnt equals casual + registered for all rows. If not, report how many rows fail.
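One possible starting point (not a required solution) for these checks, sketched as a function and demonstrated on a tiny synthetic frame; on the real data you would call `data_quality_report(day)`:

```python
import pandas as pd

def data_quality_report(df: pd.DataFrame) -> dict:
    """Return the Task 1 checks: missingness, duplicate dates, cnt consistency."""
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "n_duplicate_dates": int(df["dteday"].duplicated().sum()),
        "n_rows_cnt_mismatch": int((df["cnt"] != df["casual"] + df["registered"]).sum()),
    }

# Tiny synthetic frame (hypothetical values) just to show the interface
demo = pd.DataFrame({
    "dteday": pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-02"]),
    "casual": [5, 10, 7],
    "registered": [20, 30, 40],
    "cnt": [25, 40, 48],  # last row is deliberately inconsistent
})
print(data_quality_report(demo))
```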
# TODO

15.2.2 Task 2 — Examine individual variables and marginal distributions
- Report summary statistics of temp, atemp, hum, windspeed, or make some visualization (e.g., boxplot, histogram) of those variables. Note that these variables are normalized to be between 0 and 1 (except that humidity can be 0).
- Briefly comment: do these distributions look plausible? Any potential outliers or anomalies?
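A compact way to get all four summaries at once is `describe()`; the sketch below uses synthetic uniform values as a stand-in, and on the real data the call is `day[["temp", "atemp", "hum", "windspeed"]].describe()`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in frame with the four normalized weather variables (values are synthetic)
weather = pd.DataFrame({
    "temp": rng.uniform(0, 1, 100),
    "atemp": rng.uniform(0, 1, 100),
    "hum": rng.uniform(0, 1, 100),
    "windspeed": rng.uniform(0, 1, 100),
})

# describe() reports count/mean/std/min/quartiles/max for each column
stats = weather.describe()
print(stats.loc[["mean", "std", "min", "max"]].round(3))
```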
# TODO

15.3 Part 2 — Exploratory data analysis (EDA)
15.3.1 Task 3 — Summary statistics for the response (demand)
Compute and report:
- Overall mean, median, and standard deviation of cnt.
- Mean cnt by year (yr), and the difference between 2012 and 2011.
Briefly interpret what you see.
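These summaries map directly onto `agg` and `groupby`; a sketch on a small synthetic frame that uses the same yr coding as day.csv (0=2011, 1=2012):

```python
import pandas as pd

# Synthetic stand-in for day[["yr", "cnt"]]
demo = pd.DataFrame({
    "yr": [0, 0, 0, 1, 1, 1],
    "cnt": [1000, 1200, 1100, 2000, 2200, 2100],
})

overall = demo["cnt"].agg(["mean", "median", "std"])
by_year = demo.groupby("yr")["cnt"].mean()
diff_2012_vs_2011 = by_year[1] - by_year[0]

print(overall)
print(by_year)
print("2012 minus 2011:", diff_2012_vs_2011)
```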
# TODO

15.3.2 Task 4 — Visualize trends over time
Make a time-series plot of cnt vs dteday.
- Is there an overall trend over time?
- Do you see seasonality (repeating yearly patterns)?
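A plain matplotlib line plot is enough here; the sketch below fabricates a daily series with a trend plus a yearly cycle so it runs standalone, and on the real data you would plot `day["dteday"]` against `day["cnt"]`:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for day["cnt"] vs day["dteday"]
dates = pd.date_range("2011-01-01", periods=731, freq="D")
t = np.arange(731)
cnt = 2000 + 2 * t + 1500 * np.sin(2 * np.pi * t / 365.25)  # trend + yearly cycle

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(dates, cnt, lw=0.8)
ax.set_xlabel("date")
ax.set_ylabel("cnt")
ax.set_title("Daily rentals over time")
fig.tight_layout()
```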
# TODO

15.3.3 Task 5 — Visual comparisons by groups
Create two appropriate and informative plots (your choice) to compare cnt across groups. Pick from:
season, weathersit, workingday, weekday, mnth
Write a short interpretation for each plot.
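As one example of a group comparison, a boxplot of cnt by season can be built with `groupby` plus `ax.boxplot`; the frame below is synthetic so the sketch runs standalone (on the real data, `day.boxplot(column="cnt", by="season")` is a one-liner alternative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-in for day[["season", "cnt"]]
demo = pd.DataFrame({
    "season": rng.integers(1, 5, 400),
    "cnt": rng.normal(4500, 1500, 400).clip(0),
})

# One box per season; groupby sorts keys, so groups come out in order 1..4
groups = [g["cnt"].values for _, g in demo.groupby("season")]
fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xticklabels(["spring", "summer", "fall", "winter"])
ax.set_xlabel("season")
ax.set_ylabel("cnt")
```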
# TODO

15.4 Part 3 — Predictive modeling
15.4.1 Task 6 — Set up modeling target and features
We will predict:
- Target: cnt (daily total rentals)
Choose feature set X using only variables that would be known at prediction time. For example:
features = ["season","yr","mnth","holiday","weekday","workingday",
"weathersit","temp","atemp","hum","windspeed"]
X = day[features].copy()
y = day["cnt"].copy()

- Explain briefly (1–3 sentences) why you included these features and excluded others. In particular, why casual and registered should be excluded.
- If you want to engineer new features, briefly explain your rationale.
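If you do engineer features, one common choice for calendar fields is a cyclical (sine/cosine) encoding, so that December and January end up close together rather than 11 units apart. A hypothetical sketch (the mnth_sin/mnth_cos names are illustrative, not part of the dataset):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"mnth": range(1, 13)})

# Map months 1..12 onto the unit circle so the encoding is periodic
demo["mnth_sin"] = np.sin(2 * np.pi * (demo["mnth"] - 1) / 12)
demo["mnth_cos"] = np.cos(2 * np.pi * (demo["mnth"] - 1) / 12)
print(demo.round(3))
```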
# TODO

15.4.2 Task 7 — Train/test split
We will use the following code to make train/test data split.
day_sorted = day.sort_values("dteday").reset_index(drop=True)
train_prop = 0.80
split = int(len(day_sorted) * train_prop)
train = day_sorted.iloc[:split]
test = day_sorted.iloc[split:]
X_train = train[features]
y_train = train["cnt"]
X_test = test[features]
y_test = test["cnt"]
len(train), len(test)
(584, 147)
- Explain how the train/test split is done in the provided code.
- Explain why we do not use random splitting.
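If you also cross-validate within the training window, scikit-learn's TimeSeriesSplit keeps the same chronological discipline: every fold validates only on data later than what it trains on. A minimal sketch on index positions (the small n is a stand-in for the 584 training rows):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n = 20  # stand-in for the number of training rows
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(np.arange(n)))

for i, (train_idx, val_idx) in enumerate(folds):
    # Every validation index comes strictly after every training index
    print(f"fold {i}: train ends at {train_idx.max()}, "
          f"val {val_idx.min()}..{val_idx.max()}")
```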
# TODO

15.4.3 Task 8 — Model training
The following code fits a ridge regression with one-hot encoding for categorical variables.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
# ---------------------------------------------------------
# Assumes you already have: X_train, y_train, X_test, y_test
# ---------------------------------------------------------
# 1) Identify categorical vs numeric columns
# Treat integer-coded calendar/weather fields as categorical if present
cat_cols = []
for c in ["season", "yr", "mnth", "holiday", "weekday", "workingday", "weathersit"]:
if c in X_train.columns:
cat_cols.append(c)
# Also include any object/category columns (e.g., if you kept dteday as string)
cat_cols += X_train.select_dtypes(include=["object", "category"]).columns.tolist()
cat_cols = sorted(set(cat_cols))
num_cols = [c for c in X_train.columns if c not in cat_cols]
# 2) Preprocess: scale numeric, one-hot categorical
preprocess = ColumnTransformer(
transformers=[
("num", StandardScaler(), num_cols),
("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
],
remainder="drop",
)
# 3) Ridge pipeline + CV tuning
ridge_pipe = Pipeline(
steps=[
("preprocess", preprocess),
("model", Ridge())
]
)
param_grid = {
"model__alpha": np.logspace(-3, 3, 25) # tune strength of regularization
}
grid = GridSearchCV(
estimator=ridge_pipe,
param_grid=param_grid,
scoring="neg_root_mean_squared_error",
cv=5,
n_jobs=-1,
)
grid.fit(X_train, y_train)
print("Best alpha:", grid.best_params_["model__alpha"])
best_ridge = grid.best_estimator_
# 4) Evaluate on test set
pred = best_ridge.predict(X_test)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, pred)
print(f"Test RMSE: {rmse:,.2f}")
print(f"Test R^2 : {r2:.3f}")

Best alpha: 1.7782794100389228
Test RMSE: 1,105.08
Test R^2 : 0.652
- Using the same train/test split, fit another model of your choice. You may optionally consider other ways of improving the model, e.g., variable transformation, feature engineering, etc.
- Evaluate your model with the same metrics. Compare performance and state which model you would choose.
- (Bonus) Your model does not need to outperform the ridge regression, but if it does, you get bonus points.
- (Bonus) Report the top-5 features from either model based on feature importance metrics, and briefly interpret them.
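One model that fits this comparison naturally is a random forest, which also exposes feature importances for the bonus question. The sketch below runs on synthetic data shaped like the exam's feature list (the target formula is made up purely for illustration); in your submission you would fit on the real X_train/y_train and X_test/y_test instead:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)
# Synthetic stand-in with the same feature names as the exam's feature list
features = ["season", "yr", "mnth", "holiday", "weekday", "workingday",
            "weathersit", "temp", "atemp", "hum", "windspeed"]
n = 400
X = pd.DataFrame(rng.uniform(0, 1, (n, len(features))), columns=features)
# Hypothetical target driven mostly by temp and hum, plus noise
y = 5000 * X["temp"] - 2000 * X["hum"] + rng.normal(0, 100, n)

# Chronological-style split: first 80% train, last 20% test
X_train, X_test = X.iloc[:320], X.iloc[320:]
y_train, y_test = y.iloc[:320], y.iloc[320:]

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = float(r2_score(y_test, pred))
print(f"RMSE: {rmse:.1f}, R^2: {r2:.3f}")

# Feature importances, sorted descending; top 5 as asked in the bonus
top5 = pd.Series(rf.feature_importances_, index=features).nlargest(5)
print(top5)
```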
15.5 Part 4 — Summary
15.5.1 Task 9
Write a short memo (as if to a bike-share operations manager).
- 2–3 key insights from your analysis
- What model you chose and why
- One or two limitations and one or two next steps (e.g., adding external features or handling holidays better)