import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

15 Take-Home Midterm Exam
15.1 Overview
In this take-home exam, you will work with the Bike Sharing Dataset (daily data, day.csv) to explore and predict bike rental demand. The exam is designed to test your skills in data import, cleaning, exploratory data analysis (EDA), and predictive modeling. You will also write a short memo summarizing your findings and recommendations.
15.1.1 Data source
The dataset comes from the UCI Machine Learning Repository: Bike Sharing Dataset by Hadi Fanaee-T. It contains daily counts of bike rentals in Washington, D.C. along with weather and calendar information.
You will use day.csv (731 rows; one row per day). Variables (from the dataset README) are listed below:
dteday: date
season: season (1=spring, 2=summer, 3=fall, 4=winter)
yr: year (0=2011, 1=2012)
mnth: month (1–12)
holiday: whether the day is a holiday (1=yes, 0=no)
weekday: day of week (0–6)
workingday: 1 if neither weekend nor holiday, else 0
weathersit: weather situation
- 1: Clear / few clouds / partly cloudy
- 2: Mist + cloudy
- 3: Light snow / light rain + thunderstorm
- 4: Heavy rain / ice pellets / snow + fog
temp: normalized temperature in Celsius (divided by 41)
atemp: normalized “feels like” temperature (divided by 50)
hum: normalized humidity (divided by 100)
windspeed: normalized wind speed (divided by 67)
casual: count of casual users
registered: count of registered users
cnt: total rentals (casual + registered)
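Since the weather variables are stored in normalized form, approximate raw values can be recovered by multiplying by the constants stated above. A minimal sketch (the one-row frame and the `_c`/`_pct` column names are illustrative, not part of the dataset):

```python
import pandas as pd

# Hypothetical one-row example in the dataset's normalized units
row = pd.DataFrame({"temp": [0.5], "atemp": [0.4], "hum": [0.6], "windspeed": [0.2]})

# Undo the normalization described in the dataset README
raw = pd.DataFrame({
    "temp_c": row["temp"] * 41,      # temperature in Celsius
    "atemp_c": row["atemp"] * 50,    # "feels like" temperature in Celsius
    "hum_pct": row["hum"] * 100,     # humidity in percent
    "wind": row["windspeed"] * 67,   # wind speed
})
print(raw.round(1))
```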
15.1.2 Problem
Our main goal is to forecast daily bike rental demand (cnt). Accurate forecasts can help staffing, bike rebalancing, and maintenance planning.
15.1.3 Expectations for your submission
- You should submit:
  - A _exam_bike_name.qmd file of your report with your code/answers embedded. Your code should be executable and produce the same results when run.
  - A rendered HTML or PDF report. The report should be well-organized and clearly labeled.
- You may follow the steps and use any reasonable statistical/machine learning/AI approaches.
- Your answers should be reproducible.
- You should justify key choices if applicable (e.g., data splitting, feature engineering, metric selection, model choice).
15.1.3.1 Load packages and data
Load the packages you plan to use (e.g., pandas, numpy, matplotlib or seaborn, and scikit-learn). For example:
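A minimal import block covering the packages named above might look like this (seaborn is left commented out as an optional alternative for plotting):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import seaborn as sns  # optional alternative for plotting

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
```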
You can also import sklearn.compose.ColumnTransformer, sklearn.preprocessing.OneHotEncoder, and models like LinearRegression, Ridge, RandomForestRegressor, etc.
Place day.csv in the same folder as your .qmd, or in a data/ subfolder.
# load data
day = pd.read_csv("data/day.csv")
day["dteday"] = pd.to_datetime(day["dteday"])
day.shape
(731, 16)
15.2 Part 1 — Data import and data quality
15.2.1 Task 1 — Data checks: missingness, duplicates, and consistency
- Display the first 5-10 rows of the dataset to understand its structure.
- How many missing values are in each column?
- Are there duplicated dates (dteday)? If yes, how many?
- Verify whether cnt equals casual + registered for all rows. If not, report how many rows fail.
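One possible starting point (not a required solution) for these checks, sketched as a function and demonstrated on a tiny synthetic frame; on the real data you would call `data_quality_report(day)`:

```python
import pandas as pd

def data_quality_report(df: pd.DataFrame) -> dict:
    """Return the Task 1 checks: missingness, duplicate dates, cnt consistency."""
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "n_duplicate_dates": int(df["dteday"].duplicated().sum()),
        "n_rows_cnt_mismatch": int((df["cnt"] != df["casual"] + df["registered"]).sum()),
    }

# Tiny synthetic frame (hypothetical values) just to show the interface
demo = pd.DataFrame({
    "dteday": pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-02"]),
    "casual": [5, 10, 7],
    "registered": [20, 30, 40],
    "cnt": [25, 40, 48],  # last row is deliberately inconsistent
})
print(data_quality_report(demo))
```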
# TODO

15.2.2 Task 2 — Examine individual variables and marginal distributions
- Report summary statistics of temp, atemp, hum, windspeed, or make some visualization (e.g., boxplot, histogram) of those variables. Note that these variables are normalized to be between 0 and 1 (except that humidity can be 0).
- Briefly comment: do these distributions look plausible? Any potential outliers or anomalies?
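A compact way to get all four summaries at once is `describe()`; the sketch below uses synthetic uniform values as a stand-in, and on the real data the call is `day[["temp", "atemp", "hum", "windspeed"]].describe()`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in frame with the four normalized weather variables (values are synthetic)
weather = pd.DataFrame({
    "temp": rng.uniform(0, 1, 100),
    "atemp": rng.uniform(0, 1, 100),
    "hum": rng.uniform(0, 1, 100),
    "windspeed": rng.uniform(0, 1, 100),
})

# describe() reports count/mean/std/min/quartiles/max for each column
stats = weather.describe()
print(stats.loc[["mean", "std", "min", "max"]].round(3))
```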
# TODO

15.3 Part 2 — Exploratory data analysis (EDA)
15.3.1 Task 3 — Summary statistics for the response (demand)
Compute and report:
- Overall mean, median, and standard deviation of cnt.
- Mean cnt by year (yr), and the difference between 2012 and 2011.
Briefly interpret what you see.
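These summaries map directly onto `agg` and `groupby`; a sketch on a small synthetic frame that uses the same yr coding as day.csv (0=2011, 1=2012):

```python
import pandas as pd

# Synthetic stand-in for day[["yr", "cnt"]]
demo = pd.DataFrame({
    "yr": [0, 0, 0, 1, 1, 1],
    "cnt": [1000, 1200, 1100, 2000, 2200, 2100],
})

overall = demo["cnt"].agg(["mean", "median", "std"])
by_year = demo.groupby("yr")["cnt"].mean()
diff_2012_vs_2011 = by_year[1] - by_year[0]

print(overall)
print(by_year)
print("2012 minus 2011:", diff_2012_vs_2011)
```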
# TODO

15.3.2 Task 4 — Visualize trends over time
Make a time-series plot of cnt vs dteday.
- Is there an overall trend over time?
- Do you see seasonality (repeating yearly patterns)?
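A plain matplotlib line plot is enough here; the sketch below fabricates a daily series with a trend plus a yearly cycle so it runs standalone, and on the real data you would plot `day["dteday"]` against `day["cnt"]`:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for day["cnt"] vs day["dteday"]
dates = pd.date_range("2011-01-01", periods=731, freq="D")
t = np.arange(731)
cnt = 2000 + 2 * t + 1500 * np.sin(2 * np.pi * t / 365.25)  # trend + yearly cycle

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(dates, cnt, lw=0.8)
ax.set_xlabel("date")
ax.set_ylabel("cnt")
ax.set_title("Daily rentals over time")
fig.tight_layout()
```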
# TODO

15.3.3 Task 5 — Visual comparisons by groups
Create two appropriate and informative plots (your choice) to compare cnt across groups. Pick from:
season, weathersit, workingday, weekday, mnth
Write a short interpretation for each plot.
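As one example of a group comparison, a boxplot of cnt by season can be built with `groupby` plus `ax.boxplot`; the frame below is synthetic so the sketch runs standalone (on the real data, `day.boxplot(column="cnt", by="season")` is a one-liner alternative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-in for day[["season", "cnt"]]
demo = pd.DataFrame({
    "season": rng.integers(1, 5, 400),
    "cnt": rng.normal(4500, 1500, 400).clip(0),
})

# One box per season; groupby sorts keys, so groups come out in order 1..4
groups = [g["cnt"].values for _, g in demo.groupby("season")]
fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xticklabels(["spring", "summer", "fall", "winter"])
ax.set_xlabel("season")
ax.set_ylabel("cnt")
```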
# TODO

15.4 Part 3 — Predictive modeling
15.4.1 Task 6 — Set up modeling target and features
We will predict:
- Target: cnt (daily total rentals)
Choose feature set X using only variables that would be known at prediction time. For example:
features = ["season","yr","mnth","holiday","weekday","workingday",
"weathersit","temp","atemp","hum","windspeed"]
X = day[features].copy()
y = day["cnt"].copy()

- Explain briefly (1–3 sentences) why you included these features and excluded others. In particular, why casual and registered should be excluded.
- If you want to engineer new features, briefly explain your rationale.
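If you do engineer features, one common choice for calendar fields is a cyclical (sine/cosine) encoding, so that December and January end up close together rather than 11 units apart. A hypothetical sketch (the mnth_sin/mnth_cos names are illustrative, not part of the dataset):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"mnth": range(1, 13)})

# Map months 1..12 onto the unit circle so the encoding is periodic
demo["mnth_sin"] = np.sin(2 * np.pi * (demo["mnth"] - 1) / 12)
demo["mnth_cos"] = np.cos(2 * np.pi * (demo["mnth"] - 1) / 12)
print(demo.round(3))
```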
# TODO

15.4.2 Task 7 — Train/test split
We will use the following code to make train/test data split.
day_sorted = day.sort_values("dteday").reset_index(drop=True)
train_prop = 0.80
split = int(len(day_sorted) * train_prop)
train = day_sorted.iloc[:split]
test = day_sorted.iloc[split:]
X_train = train[features]
y_train = train["cnt"]
X_test = test[features]
y_test = test["cnt"]
len(train), len(test)
(584, 147)
- Explain how the train/test split is done in the provided code.
- Explain why we do not use random splitting.
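If you also cross-validate within the training window, scikit-learn's TimeSeriesSplit keeps the same chronological discipline: every fold validates only on data later than what it trains on. A minimal sketch on index positions (the small n is a stand-in for the 584 training rows):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n = 20  # stand-in for the number of training rows
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(np.arange(n)))

for i, (train_idx, val_idx) in enumerate(folds):
    # Every validation index comes strictly after every training index
    print(f"fold {i}: train ends at {train_idx.max()}, "
          f"val {val_idx.min()}..{val_idx.max()}")
```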
# TODO

15.4.3 Task 8 — Model training
The following code fits a ridge regression with one-hot encoding for categorical variables.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
# ---------------------------------------------------------
# Assumes you already have: X_train, y_train, X_test, y_test
# ---------------------------------------------------------
# 1) Identify categorical vs numeric columns
# Treat integer-coded calendar/weather fields as categorical if present
cat_cols = []
for c in ["season", "yr", "mnth", "holiday", "weekday", "workingday", "weathersit"]:
if c in X_train.columns:
cat_cols.append(c)
# Also include any object/category columns (e.g., if you kept dteday as string)
cat_cols += X_train.select_dtypes(include=["object", "category"]).columns.tolist()
cat_cols = sorted(set(cat_cols))
num_cols = [c for c in X_train.columns if c not in cat_cols]
# 2) Preprocess: scale numeric, one-hot categorical
preprocess = ColumnTransformer(
transformers=[
("num", StandardScaler(), num_cols),
("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
],
remainder="drop",
)
# 3) Ridge pipeline + CV tuning
ridge_pipe = Pipeline(
steps=[
("preprocess", preprocess),
("model", Ridge())
]
)
param_grid = {
"model__alpha": np.logspace(-3, 3, 25) # tune strength of regularization
}
grid = GridSearchCV(
estimator=ridge_pipe,
param_grid=param_grid,
scoring="neg_root_mean_squared_error",
cv=5,
n_jobs=-1,
)
grid.fit(X_train, y_train)
print("Best alpha:", grid.best_params_["model__alpha"])
best_ridge = grid.best_estimator_
# 4) Evaluate on test set
pred = best_ridge.predict(X_test)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, pred)
print(f"Test RMSE: {rmse:,.2f}")
print(f"Test R^2 : {r2:.3f}")

Best alpha: 1.7782794100389228
Test RMSE: 1,105.08
Test R^2 : 0.652
- Using the same train/test split, fit another model of your choice. You may optionally consider other ways of improving the model, e.g., variable transformation, feature engineering, etc.
- Evaluate your model with the same metrics. Compare performance and state which model you would choose.
- (Bonus) Your model does not need to outperform the ridge regression, but if it does, you get bonus points.
- (Bonus) Report the top-5 features from either model based on feature importance metrics, and briefly interpret them.
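One model that fits this comparison naturally is a random forest, which also exposes feature importances for the bonus question. The sketch below runs on synthetic data shaped like the exam's feature list (the target formula is made up purely for illustration); in your submission you would fit on the real X_train/y_train and X_test/y_test instead:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)
# Synthetic stand-in with the same feature names as the exam's feature list
features = ["season", "yr", "mnth", "holiday", "weekday", "workingday",
            "weathersit", "temp", "atemp", "hum", "windspeed"]
n = 400
X = pd.DataFrame(rng.uniform(0, 1, (n, len(features))), columns=features)
# Hypothetical target driven mostly by temp and hum, plus noise
y = 5000 * X["temp"] - 2000 * X["hum"] + rng.normal(0, 100, n)

# Chronological-style split: first 80% train, last 20% test
X_train, X_test = X.iloc[:320], X.iloc[320:]
y_train, y_test = y.iloc[:320], y.iloc[320:]

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = float(r2_score(y_test, pred))
print(f"RMSE: {rmse:.1f}, R^2: {r2:.3f}")

# Feature importances, sorted descending; top 5 as asked in the bonus
top5 = pd.Series(rf.feature_importances_, index=features).nlargest(5)
print(top5)
```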
15.5 Part 4 — Summary
15.5.1 Task 9
Write a short memo (as if to a bike-share operations manager).
- 2–3 key insights from your analysis
- What model you chose and why
- One or two limitations and one or two next steps (e.g., adding external features or handling holidays better)