10  Exercises

  1. Quarto and Git setup Quarto and Git are two important tools for data science. Get familiar with them through the following tasks. Please use the templates/hw.qmd template to document, for each step, what you did, the obstacles you encountered, and how you overcame them. Think of this as a user manual for students who are new to these tools. Use the command line interface.

    1. Set up SSH authentication between your computer and your GitHub account.
    2. Install Quarto onto your computer following the instructions of Get Started.
    3. Pick a tool of your choice (e.g., VS Code, Jupyter Notebook, Emacs, etc.) and follow the instructions to reproduce the "line plot on a polar axis" example.
    4. Render the homework into a pdf file and put the file into a release in your GitHub repo.
  2. Working on Homework Problems All the requirements on homework styles have reasons. Reviewing these questions helps you understand them.

    1. What are the differences between binary and source files?
    2. Why do we not want to track binary files in a repo?
    3. Why do I require pdf output via release?
    4. Why do I not want your files added via ‘upload’?
    5. Why do I require line width under 80?
    6. Why is it not a good idea to have spaces in file/folder names?
  3. Contributing to the Class Notes To contribute to the class notes, you need to have a working copy of the sources on your computer. Document the following steps in a qmd file in the form of a step-by-step manual, as if you were explaining them to someone who wants to contribute too. Make at least 10 commits for this task, each with an informative message.

    1. Create a fork of the notes repo into your own GitHub account.
    2. Clone it to an appropriate folder on your computer.
    3. Render the class notes on your computer; document the obstacles and solutions.
    4. Make a new branch (and name it appropriately) to experiment with your changes.
    5. Check out your branch and add your wishes to the wish list; commit with an informative message; and push the changes to your GitHub account.
    6. Make a pull request to the class notes repo from your fork on GitHub. Make sure you have clear messages to document the changes.
  4. Monty Hall Consider a generalized Monty Hall experiment. Suppose that the game starts with \(n\) doors; after you pick one, the host opens \(m \le n - 2\) doors, none of which hides the prize. Include sufficient text around the code chunks to explain them.

    1. Write a function to simulate the experiment once. The function takes two arguments ndoors and nempty, which represent the number of doors and the number of empty doors shown by the host, respectively. It returns the results of the two strategies, switch and no-switch, from playing this game (a minimal sketch is given after this list).
    2. Play this game with 3 doors and 1 empty a few times.
    3. Play this game with 10 doors and 8 empty a few times.
    4. Write a function to demonstrate the Monty Hall problem through simulation. The function takes three arguments ndoors, nempty, and ntrials, where ntrials is the number of trials in a simulation. The function should return the proportion of wins for both the switch and no-switch strategies.
    5. Apply your function with 3 doors (1 empty) and 10 doors (8 empty), both with 1000 trials. Summarize your results.
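    A minimal sketch of the single-trial simulator in item 1 is below. The function and argument names follow the problem statement; the internal choices (random pick, random host selection, random switch target) are just one possible design.

    ```python
    import random

    def montyhall_once(ndoors, nempty):
        """Play one round; return the win (True/False) outcome of the
        switch and no-switch strategies as a dictionary."""
        doors = list(range(ndoors))
        prize = random.choice(doors)
        pick = random.choice(doors)
        # the host opens nempty doors that hide no prize and are not your pick
        openable = [d for d in doors if d not in (pick, prize)]
        opened = set(random.sample(openable, nempty))
        # no-switch wins exactly when the original pick was right
        no_switch_win = pick == prize
        # switch moves to a door chosen at random among the other closed doors
        remaining = [d for d in doors if d != pick and d not in opened]
        switch_win = random.choice(remaining) == prize
        return {"switch": switch_win, "no-switch": no_switch_win}
    ```

    Items 2 and 3 then amount to calling montyhall_once(3, 1) and montyhall_once(10, 8) a few times and comparing the outcomes.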
  5. Approximating \(\pi\) Write a function to do a Monte Carlo approximation of \(\pi\). The function takes a Monte Carlo sample size n as input, and returns a point estimate of \(\pi\) and a 95% confidence interval. Apply your function with sample sizes 1000, 2000, 4000, and 8000. Repeat the experiment 1000 times for each sample size and check the empirical probability that the confidence intervals cover the true value of \(\pi\). Comment on the results.
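    A minimal sketch of such a function, assuming the standard quarter-circle estimator and a normal-approximation confidence interval:

    ```python
    import numpy as np
    from scipy import stats

    def mc_pi(n, level=0.95):
        """Monte Carlo estimate of pi with a normal-approximation CI."""
        x, y = np.random.uniform(size=(2, n))
        phat = np.mean(x**2 + y**2 <= 1)          # estimates pi / 4
        est = 4 * phat
        se = 4 * np.sqrt(phat * (1 - phat) / n)   # binomial standard error
        z = stats.norm.ppf(1 - (1 - level) / 2)
        return est, (est - z * se, est + z * se)
    ```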

  6. Google Billboard Ad Find the first 10-digit prime number occurring in consecutive digits of \(e\). This was a Google recruiting ad.
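    One direct approach, sketched below, is to generate digits of \(e\) with sympy and slide a 10-digit window over them; the 1,100-digit cutoff is a working assumption that turns out to be more than enough for the first hit.

    ```python
    from sympy import E, N, isprime

    # generate e to 1,100 significant digits and drop the decimal point
    digits = str(N(E, 1100)).replace(".", "")
    # slide a 10-digit window and test each candidate for primality
    for i in range(len(digits) - 9):
        window = digits[i:i + 10]
        if window[0] != "0" and isprime(int(window)):
            print(f"first 10-digit prime in e: {window} (starting at digit {i + 1})")
            break
    ```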

  7. Game 24 The math game 24 is an addictive game among number lovers. With four randomly selected cards from a deck of poker cards, use all four values and elementary arithmetic operations (\(+-\times /\)) to come up with 24. Let \(\square\) be one of the four numbers and let \(\bigcirc\) represent one of the four operators. For example, \[\begin{equation*} (\square \bigcirc \square) \bigcirc (\square \bigcirc \square) \end{equation*}\] is one way to group the operations.

    1. List all the possible ways to group the four numbers.
    2. How many possible ways are there to check for a solution?
    3. Write a function to solve the problem in a brute-force way. The inputs of the function are four numbers. The function returns a list of solutions. Some of the solutions will be equivalent, but let us not worry about that for now.
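    A brute-force sketch for item 3. Exact arithmetic with Fraction avoids floating-point surprises with division, and the five expression templates correspond to the five groupings from item 1.

    ```python
    from fractions import Fraction
    from itertools import permutations, product

    OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b if b != 0 else None}

    def combine(op, a, b):
        """Apply op, propagating None (division by zero) upward."""
        return None if a is None or b is None else OPS[op](a, b)

    def game24(w, x, y, z, target=24):
        """Return all expressions (as strings) that evaluate to target."""
        solutions = []
        for a, b, c, d in permutations([w, x, y, z]):
            A, B, C, D = (Fraction(v) for v in (a, b, c, d))
            for p, q, r in product(OPS, repeat=3):
                # the five ways to parenthesize four numbers
                candidates = [
                    (combine(r, combine(q, combine(p, A, B), C), D),
                     f"(({a} {p} {b}) {q} {c}) {r} {d}"),
                    (combine(r, combine(p, A, combine(q, B, C)), D),
                     f"({a} {p} ({b} {q} {c})) {r} {d}"),
                    (combine(q, combine(p, A, B), combine(r, C, D)),
                     f"({a} {p} {b}) {q} ({c} {r} {d})"),
                    (combine(p, A, combine(r, combine(q, B, C), D)),
                     f"{a} {p} (({b} {q} {c}) {r} {d})"),
                    (combine(p, A, combine(q, B, combine(r, C, D))),
                     f"{a} {p} ({b} {q} ({c} {r} {d}))"),
                ]
                solutions.extend(expr for val, expr in candidates if val == target)
        return solutions
    ```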
  8. NYC Crash Data Cleaning The NYC motor vehicle collisions data with documentation is available from NYC Open Data. The raw data needs some cleaning.

    1. Use the filter on the website to download the crash data for the week of June 30, 2024 in CSV format; save it under a directory data with an informative name (e.g., nyccrashes_2024w0630_by20240916.csv); and read the data into a pandas data frame with careful handling of the date/time variables (a sketch of these first steps appears after this list).
    2. Clean up the variable names. Use lower cases and replace spaces with underscores.
    3. Check the crash date and time to see if they really match the filter we intended. Remove the extra rows if needed.
    4. Get the basic summaries of each variable: missing percentage; descriptive statistics for continuous variables; frequency tables for discrete variables.
    5. Are there invalid longitude and latitude values in the data? If so, replace them with NA.
    6. Are there zip_code values that are not legitimate NYC zip codes? If so, replace them with NA.
    7. Are there missing values in zip_code and borough? Do they always co-occur?
    8. Are there cases where zip_code and borough are missing but the geo codes are not missing? If so, fill in zip_code and borough using the geo codes.
    9. Is it redundant to keep both location and the longitude/latitude at the NYC Open Data server?
    10. Check the frequency of crash_time by hour. Is there a matter of bad luck at exactly midnight? How would you interpret this?
    11. Is the number of persons killed/injured the sum of the numbers of pedestrians, cyclists, and motorists killed/injured? If so, is it redundant to keep these two columns at the NYC Open Data server?
    12. Print the whole frequency table of contributing_factor_vehicle_1. Convert lower case to upper case and check the frequencies again.
    13. If you were given an opportunity to meet the data provider, what suggestions would you make based on your data exploration experience?
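    A sketch of the first few cleaning steps (items 1, 2, and 5). The file name follows item 1; the raw column names (e.g., CRASH DATE, CRASH TIME) and the date/time format are assumed to match the NYC Open Data CSV export.

    ```python
    import numpy as np
    import pandas as pd

    df = pd.read_csv("data/nyccrashes_2024w0630_by20240916.csv",
                     dtype={"ZIP CODE": "string"})
    # combine the date and time columns into one datetime variable
    df["CRASH DATETIME"] = pd.to_datetime(
        df["CRASH DATE"] + " " + df["CRASH TIME"], format="%m/%d/%Y %H:%M")
    # clean variable names: lower case, spaces replaced with underscores
    df.columns = df.columns.str.lower().str.replace(" ", "_")
    # longitude/latitude outside a generous NYC bounding box are treated as invalid
    bad = ~df["longitude"].between(-74.3, -73.6) | ~df["latitude"].between(40.4, 41.0)
    df.loc[bad, ["longitude", "latitude"]] = np.nan
    ```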
  9. NYC Crash Data Exploration Except for the first question, use the cleaned crash data in feather format.

    1. Construct a contingency table for missingness in the geocode (latitude and longitude) by borough. Is the missing pattern the same across boroughs? Formulate a hypothesis and test it (a sketch follows this list).
    2. Construct an hour variable with integer values from 0 to 23. Plot the histogram of the number of crashes by hour. Plot it by borough.
    3. Overlay the locations of the crashes on a map of NYC. The map could be a static map or a Google map.
    4. Create a new variable severe which is one if the number of persons injured or killed is 1 or more, and zero otherwise. Construct a cross table for severe versus borough. Is the severity of the crashes the same across boroughs? Test, with an appropriate test, the null hypothesis that the two variables are not associated.
    5. Merge the crash data with the Census zip code database which contains zip-code level demographic or socioeconomic variables.
    6. Fit a logistic model with severe as the outcome variable and covariates that are available in the data or can be engineered from the data. For example, zip code level covariates obtained from merging with the zip code database; crash hour; number of vehicles involved.
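    A sketch of item 1; the feather file name is a hypothetical placeholder for wherever you saved the cleaned data.

    ```python
    import pandas as pd
    from scipy import stats

    df = pd.read_feather("data/nyccrashes_cleaned.feather")  # hypothetical file name
    df["geo_missing"] = df["latitude"].isna() | df["longitude"].isna()
    # contingency table of missing geocode by borough
    tab = pd.crosstab(df["borough"], df["geo_missing"])
    # chi-square test of independence between borough and missingness
    chi2, pval, dof, _ = stats.chi2_contingency(tab)
    print(tab)
    print(f"chi-square = {chi2:.2f}, df = {dof}, p-value = {pval:.3g}")
    ```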
  10. NYC Crash Severity Modeling Using the cleaned NYC crash data merged with zip-code-level information, predict whether a crash is severe (the severe variable).

    1. Set the random seed to 1234. Randomly select 20% of the crashes as testing data and leave the remaining 80% as training data.
    2. Fit a logistic model on the training data and validate the performance on the testing data. Explain the confusion matrix result from the testing data. Compute the F1 score.
    3. Fit a logistic model on the training data with \(L_1\) regularization. Select the tuning parameter by 5-fold cross-validation using the F1 score as the criterion (a sketch follows this list).
    4. Apply the regularized logistic regression to predict the severity of the crashes in the testing data. Compare the performance of the two logistic models in terms of accuracy, precision, recall, F1-score, and AUC.
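    A sketch of items 1-3 with scikit-learn. The feather file name and the covariate columns are hypothetical placeholders for whatever you engineered.

    ```python
    import pandas as pd
    from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
    from sklearn.metrics import confusion_matrix, f1_score
    from sklearn.model_selection import train_test_split

    df = pd.read_feather("data/nyccrashes_merged.feather")   # hypothetical file name
    y = df["severe"]
    X = df[["crash_hour", "n_vehicles", "median_income"]]    # hypothetical covariates

    # item 1: 20% testing, 80% training, seed 1234
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=1234)

    # item 2: plain logistic regression, evaluated on the testing data
    fit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    pred = fit.predict(X_test)
    print(confusion_matrix(y_test, pred))
    print("F1:", f1_score(y_test, pred))

    # item 3: L1-regularized logistic regression; the tuning parameter is chosen
    # by 5-fold cross-validation with the F1 score as the criterion
    l1_fit = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=20,
                                  cv=5, scoring="f1", max_iter=1000)
    l1_fit.fit(X_train, y_train)
    ```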
  11. Midterm project: Street flooding in NYC The class presentation at the 2025 NYC Open Data Week is scheduled for Monday, March 24, 2:00-3:00 pm.

    The NYC Open Data of 311 Service Requests contains all service requests from 2010 to the present. This analysis focuses on two sewer-related complaints in 2024: Street Flooding (SF) and Catch Basin (CB). SF complaints serve as a practical indicator of street flooding, while CB complaints provide insights into a key infrastructural factor—when catch basins fail to drain rainwater properly due to blockages or structural issues, water accumulates on the streets. SF complaints are typically filed when residents observe standing water or flooding, whereas CB complaints report clogged basins, defective grates, or other drainage problems. The dataset is available in CSV format as data/nycflood2024.csv. Refer to the online data dictionary for a detailed explanation of variable meanings. Try to tell a story in your report while going through the questions.

    1. Data cleaning.
      1. Import the data and rename the columns following our preferred style.
      2. Summarize the missing information. Are there variables that are close to completely missing?
      3. Is there redundant information in the data? Try storing the data in the Arrow format and comment on the efficiency gain (a sketch follows this list).
      4. Are there invalid NYC zip codes or boroughs? Can some of the missing values be filled? If so, fill them.
      5. Are there date errors? Examples are a closed_date earlier than the created_date; closed_date and created_date matching to the second; and dates falling exactly at midnight or noon to the second.
      6. Summarize your suggestions to the data curator in several bullet points.
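      A sketch of items 1 and 3. The raw column names (e.g., Created Date, Closed Date) are assumed to match the 311 CSV export, and writing feather files requires pyarrow.

      ```python
      import os
      import pandas as pd

      df = pd.read_csv("data/nycflood2024.csv",
                       parse_dates=["Created Date", "Closed Date"])
      # rename columns: lower case with underscores
      df.columns = df.columns.str.lower().str.replace(" ", "_")
      # store in the Arrow (feather) format and compare file sizes
      df.to_feather("data/nycflood2024.feather")
      csv_mb = os.path.getsize("data/nycflood2024.csv") / 1e6
      fth_mb = os.path.getsize("data/nycflood2024.feather") / 1e6
      print(f"CSV: {csv_mb:.1f} MB vs feather: {fth_mb:.1f} MB")
      ```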
    2. Exploratory analysis.
      1. Visualize the locations of complaints on a NYC map, with different symbols for different descriptors.
      2. Create a variable response_time, which is the duration from created_date to closed_date.
      3. Visualize the comparison of response time by complaint descriptor and borough. The original scale may not be the best choice given the long tails and outliers.
      4. Is there a significant difference in response time between SF and CB complaints? Across different boroughs? Does the difference between SF and CB depend on the borough? State your hypotheses, justify your tests, and summarize your results in plain English (a sketch of one such comparison follows this list).
      5. Create a binary variable over3d to indicate that a service request took three days or longer to close.
      6. Does over3d depend on the complaint descriptor, borough, or weekday (vs weekend/holiday)? State your hypotheses, justify your test, and summarize your results.
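      A sketch of items 2 and 4 for the overall SF-versus-CB comparison, assuming the cleaned data frame df has lower-cased columns created_date, closed_date, and descriptor. A rank-based test is used here because response times are heavily right-skewed; the borough and interaction questions call for a two-way analysis (e.g., on log response time).

      ```python
      from scipy import stats

      # response time in hours
      df["response_time"] = (df["closed_date"] - df["created_date"]).dt.total_seconds() / 3600

      is_sf = df["descriptor"].str.contains("Street Flooding", case=False, na=False)
      is_cb = df["descriptor"].str.contains("Catch Basin", case=False, na=False)
      sf = df.loc[is_sf, "response_time"].dropna()
      cb = df.loc[is_cb, "response_time"].dropna()

      # two-sample rank (Mann-Whitney) test of SF vs CB response times
      stat, pval = stats.mannwhitneyu(sf, cb, alternative="two-sided")
      print(f"Mann-Whitney U = {stat:.0f}, p-value = {pval:.3g}")
      ```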
    3. Modeling the occurrence of overly long response time.
      1. Create a data set which contains the outcome variable over3d and variables that might be useful in predicting it. Consider including time-of-day effects (e.g., rush hour vs. late-night), seasonal trends, and neighborhood-level demographics. Zip code level information could be useful too, such as the zip code area and the ACS 2023 variables (data/nyc_zip_areas.feather and data/acs2023.feather).
      2. Randomly select 20% of the complaints as testing data with seed 1234. Build a logistic model to predict over3d for the complaints with the training data. If you have tuning parameters, justify how they were selected.
      3. Construct the confusion matrix from your prediction with a threshold of 1/2 on both training and testing data. Explain your accuracy, recall, precision, and F1 score to a New Yorker.
      4. Construct the ROC curve of your fitted logistic model and obtain the AUROC for both training and testing data (a sketch follows this list). Explain your results to a New Yorker.
      5. Identify the most important predictors of over3d. Use model coefficients or feature importance (e.g., odds ratios, standardized coefficients, or SHAP values).
      6. Summarize your results to a New Yorker who is not data science savvy in several bullet points.
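      A sketch of item 4, assuming fit is the fitted logistic model from item 2 and X_train, y_train, X_test, y_test are the training/testing splits.

      ```python
      from sklearn.metrics import RocCurveDisplay, roc_auc_score

      for name, X_part, y_part in [("training", X_train, y_train),
                                   ("testing", X_test, y_test)]:
          prob = fit.predict_proba(X_part)[:, 1]   # predicted probability of over3d
          print(f"{name} AUROC: {roc_auc_score(y_part, prob):.3f}")

      # ROC curve on the testing data
      RocCurveDisplay.from_estimator(fit, X_test, y_test)
      ```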
    4. Modeling the count of SF complaints by zip code.
      1. Create a data set by aggregating the counts of SF and CB complaints by day for each zip code.
      2. Merge the NYC precipitation data (data/rainfall_CP.csv) by day into this data set.
      3. Merge the NYC zip code level landscape variables (data/nyc_zip_lands.csv) and ACS 2023 variables into the data set.
      4. For each day, create two variables representing the 1-day lags of the precipitation and the number of CB complaints.
      5. Filter the data to March 1 through November 30, excluding the winter months when flooding is less frequent.
      6. Compare a Poisson regression with a Negative Binomial regression to account for overdispersion. Which model fits better? Explain the results to a New Yorker.
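      A sketch of item 6 with statsmodels; the aggregated data frame daily and the covariates rain_lag1 and cb_lag1 are illustrative names from items 1-4.

      ```python
      import statsmodels.api as sm
      import statsmodels.formula.api as smf

      formula = "sf_count ~ rain_lag1 + cb_lag1"
      # Poisson regression for the daily SF counts
      pois = smf.glm(formula, data=daily, family=sm.families.Poisson()).fit()
      # Negative Binomial regression; alpha is the dispersion parameter (it can
      # also be estimated jointly with smf.negativebinomial)
      negb = smf.glm(formula, data=daily,
                     family=sm.families.NegativeBinomial(alpha=1.0)).fit()
      # compare the fits, e.g., by AIC (smaller is better)
      print("Poisson AIC:", pois.aic, " NegBin AIC:", negb.aic)
      ```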