11  Exercises

  1. Quarto and Git setup Quarto and Git are two important tools for data science. Get familiar with them through the following tasks. Please use the templates/hw.qmd template.

    1. Install Quarto onto your computer following the instructions in Get Started. Document the obstacles you encountered and how you overcame them.
    2. Pick a tool of your choice (e.g., VS Code, Jupyter Notebook, Emacs, etc.) and follow the instructions to reproduce the line plot on polar axis example.
    3. Render the homework into a PDF file and put the file into a release in your GitHub repo. Document any obstacles you encounter and how you overcome them.
  2. Git basics and GitHub setup Learn the Git basics and set up an account on GitHub if you do not already have one. Practice the tips on Git in the notes. By going through the following tasks, ensure your repo has at least 10 commits, each with an informative message. Regularly check the status of your repo using git status. The specific tasks are:

    1. Clone the class notes repo to an appropriate folder on your computer.
    2. Add all the files to your designated homework repo from GitHub Classroom and work on that repo for the rest of the problem.
    3. Add your name and wishes to the Wishlist; commit.
    4. Remove the Last, First entry from the list; commit.
    5. Create a new file called add.qmd containing a few lines of text; commit.
    6. Remove add.qmd (pretending that this is by accident); commit.
    7. Recover the accidentally removed file add.qmd; add a long line (a paragraph without a hard break); add a short line (under 80 characters); commit.
    8. Change one word in the long line and one word in the short line; use git diff to see the difference from the last commit; commit.
    9. Play with other git operations and commit.
  3. Contributing to the Class Notes To contribute to the class notes, you need to have a working copy of the sources on your computer. Document the following steps in a qmd file as if you are explaining them to someone who wants to contribute too.

    1. Create a fork of the notes repo into your own GitHub account.
    2. Clone it to an appropriate folder on your computer.
    3. Render the classnotes on your computer; document the obstacles and solutions.
    4. Make a new branch (and name it appropriately) to experiment with your changes.
    5. Check out your branch and add your wishes to the wish list; commit with an informative message; and push the changes to your GitHub account.
    6. Make a pull request to the class notes repo from your fork on GitHub. Make sure you have clear messages to document the changes.
  4. Monty Hall Write a function to demonstrate the Monty Hall problem through simulation. The function takes two arguments ndoors and ntrials, representing the number of doors in the experiment and the number of trials in a simulation, respectively. The function should return the proportion of wins for both the switch and no-switch strategies. Apply your function with 3 doors and 5 doors, both with 1000 trials. Include sufficient text around the code to explain your results.
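
     A minimal simulation sketch in Python, not a prescribed solution. It assumes the host opens all but one of the unpicked doors (ndoors - 2 goat doors), the standard generalization beyond three doors; the argument names follow the problem statement.

     ```python
     import numpy as np

     def monty_hall(ndoors, ntrials, seed=None):
         """Proportion of wins for the switch and no-switch strategies."""
         rng = np.random.default_rng(seed)
         prize = rng.integers(ndoors, size=ntrials)  # door hiding the car
         pick = rng.integers(ndoors, size=ntrials)   # contestant's first pick
         # The host opens ndoors - 2 goat doors, so switching wins exactly
         # when the first pick was wrong.
         return {"no_switch": np.mean(prize == pick),
                 "switch": np.mean(prize != pick)}

     for k in (3, 5):
         print(k, "doors:", monty_hall(k, 1000, seed=2024))
     ```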

  5. Approximating \(\pi\) Write a function to do a Monte Carlo approximation of \(\pi\). The function takes a Monte Carlo sample size n as input, and returns a point estimate of \(\pi\) and a 95% confidence interval. Apply your function with sample sizes 1000, 2000, 4000, and 8000. Repeat the experiment 1000 times for each sample size and check the empirical probability that the confidence intervals cover the true value of \(\pi\). Comment on the results.
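
     One way to sketch this is the quarter-circle method: with n uniform points in the unit square, the proportion p landing inside the quarter circle estimates \(\pi/4\), and the binomial normal approximation gives the interval \(4p \pm 4 \times 1.96 \sqrt{p(1-p)/n}\). The function name here is illustrative.

     ```python
     import numpy as np

     def approx_pi(n, seed=None):
         """Monte Carlo point estimate of pi and a 95% confidence interval."""
         rng = np.random.default_rng(seed)
         x, y = rng.random(n), rng.random(n)
         p = np.mean(x**2 + y**2 <= 1.0)   # P(point in quarter circle) = pi/4
         half = 4 * 1.96 * np.sqrt(p * (1 - p) / n)
         return 4 * p, (4 * p - half, 4 * p + half)

     for n in (1000, 2000, 4000, 8000):
         print(n, approx_pi(n, seed=n))
     ```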

  6. Google Billboard Ad Find the first 10-digit prime number occurring in consecutive digits of \(e\). This was a Google recruiting ad.
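
     A sketch of one possible search, assuming the mpmath and sympy packages are available: mpmath supplies digits of \(e\) at arbitrary precision and sympy tests primality. The answer is known to occur within the first couple hundred decimal digits, so 1,000 digits is ample.

     ```python
     from mpmath import mp
     from sympy import isprime

     mp.dps = 1100                                  # working precision in digits
     digits = mp.nstr(mp.e, 1001).replace(".", "")  # "27182818284..."

     def first_prime_window(digits, width=10):
         """First width-digit prime in consecutive digits of the string."""
         for i in range(len(digits) - width + 1):
             window = digits[i:i + width]
             if window[0] != "0" and isprime(int(window)):
                 return window

     print(first_prime_window(digits))
     ```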

  7. Game 24 The math game 24 is an addictive game among number lovers. With four cards randomly selected from a deck of poker cards, use all four values and elementary arithmetic operations (\(+-\times /\)) to come up with 24. Let \(\square\) be one of the four numbers. Let \(\bigcirc\) represent one of the four operators. For example, \[\begin{equation*} (\square \bigcirc \square) \bigcirc (\square \bigcirc \square) \end{equation*}\] is one way to group the operations.

    1. List all the possible ways to group the four numbers.
    2. How many possible ways are there to check for a solution?
    3. Write a function to solve the problem in a brute-force way (a sketch follows this list). The inputs of the function are four numbers. The function returns a list of solutions. Some of the solutions will be equivalent, but let us not worry about that for now.
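
     A brute-force sketch under these counts: enumerate the \(4! = 24\) orderings of the numbers, the \(4^3 = 64\) operator choices, and the five groupings from question 1, evaluating each with exact rational arithmetic so divisions do not introduce floating-point error. The test hand is arbitrary.

     ```python
     from fractions import Fraction
     from itertools import permutations, product

     # The five ways to parenthesize four numbers combined by three operators.
     GROUPINGS = [
         "(({0} {4} {1}) {5} {2}) {6} {3}",
         "({0} {4} ({1} {5} {2})) {6} {3}",
         "({0} {4} {1}) {5} ({2} {6} {3})",
         "{0} {4} (({1} {5} {2}) {6} {3})",
         "{0} {4} ({1} {5} ({2} {6} {3}))",
     ]

     def solve24(nums, target=24):
         """Return all expressions over the four numbers that hit target."""
         solutions = set()
         for a, b, c, d in permutations(nums):
             for ops in product("+-*/", repeat=3):
                 for g in GROUPINGS:
                     shown = g.format(a, b, c, d, *ops)
                     # Evaluate with Fractions for exact arithmetic.
                     exact = g.format(*(f"Fraction({x})" for x in (a, b, c, d)),
                                      *ops)
                     try:
                         if eval(exact, {"Fraction": Fraction}) == target:
                             solutions.add(shown)
                     except ZeroDivisionError:
                         pass
         return sorted(solutions)

     print(solve24([4, 7, 8, 8]))
     ```
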
  8. NYC Crash Data Cleaning The NYC motor vehicle collisions data with documentation is available from NYC Open Data. The raw data needs some cleaning.

    1. Use the filter from the website to download the crash data of the week of June 30, 2024 in CSV format; save it under a directory data with an informative name (e.g., nyccrashes_2024w0630_by20240916.csv); read the data into a pandas data frame with careful handling of the datetime variables (a starter sketch follows this list).
    2. Clean up the variable names. Use lower cases and replace spaces with underscores.
    3. Get the basic summaries of each variable: missing percentage; descriptive statistics for continuous variables; frequency tables for discrete variables.
    4. Are there invalid longitude and latitude values in the data? If so, replace them with NA.
    5. Are there zip_code values that are not legit NYC zip codes? If so, replace them with NA.
    6. Are there missing values in zip_code and borough? Do they always co-occur?
    7. Are there cases where zip_code and borough are missing but the geo codes are not missing? If so, fill in zip_code and borough using the geo codes.
    8. Is it redundant to keep both location and the longitude/latitude at the NYC Open Data server?
    9. Check the frequency of crash_time by hour. Is midnight really a time of exceptionally bad luck? How would you interpret this?
    10. Is the number of persons killed/injured the sum of the numbers of pedestrians, cyclists, and motorists killed/injured? If so, is it redundant to keep these two columns at the NYC Open Data server?
    11. Print the whole frequency table of contributing_factor_vehicle_1. Convert lowercase letters to uppercase and check the frequencies again.
    12. If given an opportunity to meet the data provider, what suggestions would you make based on your data exploration experience?
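
     A hedged pandas starter for tasks 1, 2, and 4. The raw column names (CRASH DATE, CRASH TIME, ZIP CODE, LATITUDE, LONGITUDE) are as posted on NYC Open Data at the time of writing, and the bounding box for valid NYC coordinates is a loose assumption; adjust both to your download.

     ```python
     import numpy as np
     import pandas as pd

     df = pd.read_csv("data/nyccrashes_2024w0630_by20240916.csv",
                      dtype={"ZIP CODE": "string"})

     # Careful datetime handling: combine the date and time columns.
     df["crash_datetime"] = pd.to_datetime(
         df["CRASH DATE"] + " " + df["CRASH TIME"], format="%m/%d/%Y %H:%M")

     # Task 2: lowercase names with underscores.
     df.columns = df.columns.str.lower().str.replace(" ", "_")

     # Task 4: coordinates outside a loose NYC bounding box (including the
     # common (0, 0) placeholder) become missing.
     bad = ~(df["longitude"].between(-74.3, -73.6)
             & df["latitude"].between(40.4, 41.0))
     df.loc[bad, ["longitude", "latitude"]] = np.nan

     # A peek at task 9: crashes by hour of day.
     print(df["crash_datetime"].dt.hour.value_counts().sort_index())
     ```
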
  9. NYC Crash Data Exploration Except for the first question, use the cleaned crash data in feather format.

    1. Construct a contingency table for missingness in geocode (latitude and longitude) by borough. Is the missing pattern the same across boroughs? Formulate a hypothesis and test it (see the sketch after this list).
    2. Construct an hour variable with integer values from 0 to 23. Plot the histogram of the number of crashes by hour. Plot it by borough.
    3. Overlay the locations of the crashes on a map of NYC. The map could be a static map or a Google map.
    4. Create a new variable severe which is one if the number of persons injured or killed is 1 or more, and zero otherwise. Construct a cross table for severe versus borough. Is the severity of the crashes the same across boroughs? Test the null hypothesis that the two variables are not associated with an appropriate test.
    5. Merge the crash data with the zip code database.
    6. Fit a logistic model with severe as the outcome variable and covariates that are available in the data or can be engineered from it; for example, zip-code-level covariates obtained by merging with the zip code database, the crash hour, and the number of vehicles involved.
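
     A sketch for questions 1 and 4, assuming the cleaned feather file and the snake_case column names produced in the previous exercise (the file name here is a placeholder). A chi-square test of independence is one appropriate choice for both questions.

     ```python
     import pandas as pd
     from scipy.stats import chi2_contingency

     # Placeholder file name for the cleaned data from the previous exercise.
     df = pd.read_feather("data/nyccrashes_cleaned.feather")

     # Question 1: missing geocode by borough, with a chi-square test.
     miss_geo = df["latitude"].isna() | df["longitude"].isna()
     tab = pd.crosstab(df["borough"], miss_geo)
     chi2, pval, dof, _ = chi2_contingency(tab)
     print(tab)
     print(f"chi-square = {chi2:.2f}, df = {dof}, p-value = {pval:.3g}")

     # Question 4: severity indicator and its cross table with borough.
     df["severe"] = ((df["number_of_persons_injured"].fillna(0)
                      + df["number_of_persons_killed"].fillna(0)) >= 1).astype(int)
     print(pd.crosstab(df["severe"], df["borough"]))
     ```
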
  10. NYC Crash severity modeling Using the cleaned NYC crash data, merged with zip-code-level information, predict whether a crash is severe (the severe variable from the previous exercise). A scikit-learn sketch follows the steps below.

    1. Set the random seed to 1234. Randomly select 20% of the crashes as testing data and keep the remaining 80% as training data.
    2. Fit a logistic model on the training data and validate the performance on the testing data. Explain the confusion matrix result from the testing data. Compute the F1 score.
    3. Fit a logistic model on the training data with \(L_1\) regularization. Select the tuning parameter with 5-fold cross-validation using the F1 score.
    4. Apply the regularized logistic regression to predict the severity of the crashes in the testing data. Compare the performance of the two logistic models in terms of accuracy, precision, recall, F1-score, and AUC.
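
     A scikit-learn sketch of the workflow. The synthetic X, y from make_classification stand in for the features you engineer from the crash data; liblinear is one solver that supports the \(L_1\) penalty, and the grid of C values is an arbitrary starting point.

     ```python
     import numpy as np
     from sklearn.datasets import make_classification
     from sklearn.linear_model import LogisticRegression
     from sklearn.model_selection import train_test_split, GridSearchCV
     from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                  precision_score, recall_score, roc_auc_score)

     # Stand-in data; in practice X, y come from the merged crash data.
     X, y = make_classification(n_samples=2000, n_features=8, random_state=1234)
     X_train, X_test, y_train, y_test = train_test_split(
         X, y, test_size=0.2, random_state=1234)

     # Plain (unpenalized) logistic model; use penalty="none" on older sklearn.
     plain = LogisticRegression(penalty=None, max_iter=1000).fit(X_train, y_train)
     print(confusion_matrix(y_test, plain.predict(X_test)))
     print("F1:", f1_score(y_test, plain.predict(X_test)))

     # L1-regularized model; tune C by 5-fold CV on the F1 score.
     grid = GridSearchCV(
         LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
         param_grid={"C": np.logspace(-3, 2, 11)}, scoring="f1", cv=5)
     grid.fit(X_train, y_train)
     pred = grid.predict(X_test)
     print("accuracy:", accuracy_score(y_test, pred),
           "precision:", precision_score(y_test, pred),
           "recall:", recall_score(y_test, pred),
           "F1:", f1_score(y_test, pred),
           "AUC:", roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1]))
     ```
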
  11. Midterm project: Noise complaints in NYC The NYC Open Data of 311 Service Requests contains all requests from 2010 to the present. We consider a subset of it: requests to the NYPD about noise complaints created between 00:00:00 06/30/2024 and 24:00:00 07/06/2024. The subset is available in CSV format as data/nypd311w063024noise_by100724.csv. Read the data dictionary online to understand the meaning of the variables.

    1. Data cleaning.
      • Import the data, rename the columns with our preferred styles.
      • Summarize the missing information. Are there variables that are close to completely missing?
      • Is there redundant information in the data? Try storing the data in the Arrow format and comment on the efficiency gain.
      • Are there invalid NYC zip codes or boroughs? Justify and clean them if so.
      • Are there date errors? Examples include a closed_date earlier than the created_date; closed_date and created_date matching to the second; dates falling exactly at midnight or noon to the second; and an action_update_date after the closed_date.
      • Summarize your suggestions to the data curator in several bullet points.
    2. Data exploration.
      • If we suspect that response time may depend on the time of day when a complaint is made, we can compare the response times for complaints submitted during nighttime and daytime. To do this, we can visualize the comparison by complaint type, borough, and weekday (vs weekend/holiday).
      • Perform a formal hypothesis test to confirm the observations from your visualization. Formally state your hypotheses and summarize your conclusions in plain English.
      • Create a binary variable over2h to indicate that a service request took two hours or longer to close.
      • Does over2h depend on the complaint type, borough, or weekday (vs weekend/holiday)? State your hypotheses and summarize your conclusions in plain English.
    3. Data analysis.
      • The addresses of NYC police precincts are stored in data/nypd_precincts.csv. Use geocoding tools to find their geocodes (longitude and latitude) from the addresses; a sketch follows this problem.
      • Create a variable dist2pp which represents the distance from each request incident to the nearest police precinct.
      • Create zip-code-level variables by merging with data from the uszipcode package.
      • Randomly select 20% of the complaints as testing data with seed 1234. Build a logistic model to predict over2h for the noise complaints with the training data, using all the variables you can engineer from the available data. If you have tuning parameters, justify how they were selected.
      • Assess the performance of your model in terms of commonly used metrics. Summarize your results for a New Yorker who is not data science savvy.
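
     For the geocoding and dist2pp steps, one possible sketch uses geopy's Nominatim geocoder (wrapped in a RateLimiter, since the free service allows roughly one request per second) and a vectorized haversine distance. The address column name and the user-agent string are assumptions; check your CSV's actual headers.

     ```python
     import numpy as np
     import pandas as pd
     from geopy.geocoders import Nominatim
     from geopy.extra.rate_limiter import RateLimiter

     # Geocode precinct addresses; the `address` column name is an assumption.
     precincts = pd.read_csv("data/nypd_precincts.csv")
     geocode = RateLimiter(Nominatim(user_agent="nyc311-hw").geocode,
                           min_delay_seconds=1)   # respect Nominatim's limit
     locs = precincts["address"].apply(geocode)   # None if a lookup fails
     precincts["lat"] = [l.latitude if l else np.nan for l in locs]
     precincts["lon"] = [l.longitude if l else np.nan for l in locs]

     def haversine_km(lat1, lon1, lat2, lon2):
         """Great-circle distance in km; inputs in degrees, broadcastable."""
         lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
         a = (np.sin((lat2 - lat1) / 2) ** 2
              + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
         return 2 * 6371.0 * np.arcsin(np.sqrt(a))

     # dist2pp: distance (km) from each complaint to the nearest precinct.
     requests = pd.read_csv("data/nypd311w063024noise_by100724.csv")
     d = haversine_km(requests["Latitude"].to_numpy()[:, None],
                      requests["Longitude"].to_numpy()[:, None],
                      precincts["lat"].to_numpy()[None, :],
                      precincts["lon"].to_numpy()[None, :])
     requests["dist2pp"] = np.nanmin(d, axis=1)
     ```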