12  Exercises

  1. Git basics and GitHub setup Learn the Git basics and set up an account on GitHub if you do not already have one. Practice the Git tips from the notes. As you work through the following tasks, make sure your repo ends up with at least 10 commits, each with an informative message. Regularly check the status of your repo using git status. The specific tasks are:

    1. Clone the class notes repo to an appropriate folder on your computer.
    2. Add all the files to your designated homework repo from GitHub Classroom and work on that repo for the rest of the problem.
    3. Add your name and wishes to the Wishlist; commit.
    4. Remove the Last, First entry from the list; commit.
    5. Create a new file called add.qmd containing a few lines of text; commit.
    6. Remove add.qmd (pretending that this is by accident); commit.
    7. Recover the accidentally removed file add.qmd; add a long line (a paragraph without a hard break); add a short line (under 80 characters); commit.
    8. Change one word in the long line and one word in the short line; use git diff to see the difference from the last commit; commit.
    9. Play with other git operations and commit.
  2. Contributing to the Class Notes

    1. Create a fork of the notes repo into your own GitHub account.
    2. Clone it to your local computer.
    3. Make a new branch to experiment with your changes.
    4. Check out your branch and add your wishes to the wish list; push to your GitHub account.
    5. Make a pull request to the class notes repo from your fork on GitHub. Make sure your messages clearly document the changes.
  3. Monty Hall Write a function to demonstrate the Monty Hall problem through simulation. The function takes two arguments, ndoors and ntrials, representing the number of doors in the experiment and the number of trials in a simulation, respectively. The function should return the proportion of wins for both the switch and no-switch strategies. Apply your function with 3 doors and 5 doors, both with 1000 trials. Include sufficient text around the code to explain your results.
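
     A minimal sketch of one possible implementation, assuming the prize door and the initial pick are uniform over the doors, and that the host opens a single goat door even when ndoors > 3 (one of several possible generalizations):

     ```python
     import numpy as np

     def monty_hall(ndoors, ntrials, seed=None):
         """Simulate Monty Hall; return win proportions for both strategies."""
         rng = np.random.default_rng(seed)
         switch_wins = stay_wins = 0
         for _ in range(ntrials):
             prize = rng.integers(ndoors)
             pick = rng.integers(ndoors)
             # the host opens one door that is neither the pick nor the prize
             opened = rng.choice([d for d in range(ndoors)
                                  if d != pick and d != prize])
             stay_wins += (pick == prize)
             # the switcher moves to a random door other than pick and opened
             new_pick = rng.choice([d for d in range(ndoors)
                                    if d not in (pick, opened)])
             switch_wins += (new_pick == prize)
         return {"switch": switch_wins / ntrials,
                 "no_switch": stay_wins / ntrials}

     print(monty_hall(3, 1000, seed=42))
     print(monty_hall(5, 1000, seed=42))
     ```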

  4. Approximating \(\pi\) Write a function to do a Monte Carlo approximation of \(\pi\). The function takes a Monte Carlo sample size n as input, and returns a point estimate of \(\pi\) and a 95% confidence interval. Apply your function with sample sizes 1000, 2000, 4000, and 8000. Repeat the experiment 1000 times for each sample size and check the empirical probability that the confidence intervals cover the true value of \(\pi\). Comment on the results.
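
     A sketch of the standard hit-or-miss scheme: sample points uniformly in the unit square, estimate \(\pi\) as four times the proportion landing in the quarter circle, and form the confidence interval by a normal approximation:

     ```python
     import numpy as np

     def approx_pi(n, seed=None):
         """Monte Carlo estimate of pi with a normal-approximation 95% CI."""
         rng = np.random.default_rng(seed)
         x, y = rng.random(n), rng.random(n)
         phat = np.mean(x**2 + y**2 <= 1)            # estimates pi / 4
         half = 1.96 * np.sqrt(phat * (1 - phat) / n)
         return 4 * phat, (4 * (phat - half), 4 * (phat + half))

     # empirical coverage for one sample size
     hits = sum(lo <= np.pi <= hi
                for _, (lo, hi) in (approx_pi(1000) for _ in range(1000)))
     print(hits / 1000)
     ```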

  5. Google Billboard Ad Find the first 10-digit prime number occurring in consecutive digits of \(e\). This was a Google recruiting ad.
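
     A brute-force sketch using mpmath for high-precision digits of \(e\) and sympy for primality testing (both packages assumed available; extend the precision if no prime turns up in the window scanned):

     ```python
     from mpmath import mp, nstr
     from sympy import isprime

     mp.dps = 200                                       # working precision
     digits = nstr(mp.e, 150).replace("2.", "", 1)      # digits after the point

     for i in range(len(digits) - 9):
         window = digits[i:i + 10]
         # a window with a leading zero is not a 10-digit number
         if window[0] != "0" and isprime(int(window)):
             print(f"position {i + 1}: {window}")
             break
     ```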

  6. Game 24 The math game 24 is an addictive game among number lovers. With four cards randomly selected from a deck of poker cards, use all four values and the elementary arithmetic operations (\(+-\times /\)) to come up with 24. Let \(\square\) be one of the four numbers. Let \(\bigcirc\) represent one of the four operators. For example, \[\begin{equation*} (\square \bigcirc \square) \bigcirc (\square \bigcirc \square) \end{equation*}\] is one way to group the operations.

    1. List all the possible ways to group the four numbers.
    2. How many possible ways are there to check for a solution?
    3. Write a function to solve the problem in a brute-force way (a sketch follows this list). The inputs of the function are four numbers. The function returns a list of solutions. Some of the solutions will be equivalent, but let us not worry about that for now.
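
     A brute-force sketch that enumerates permutations of the numbers, operator choices, and the five groupings, using Fraction for exact arithmetic so that solutions like \((3 / 7 + 3) \times 7\) are not lost to floating-point rounding:

     ```python
     from fractions import Fraction
     from itertools import permutations, product
     import operator

     OPS = {"+": operator.add, "-": operator.sub,
            "*": operator.mul, "/": operator.truediv}

     def solve24(nums, target=24):
         """Brute force: return expression strings that evaluate to target."""
         sols = []
         for a, b, c, d in set(permutations(Fraction(x) for x in nums)):
             for p, q, r in product(OPS, repeat=3):
                 fp, fq, fr = OPS[p], OPS[q], OPS[r]
                 # the five ways to group four numbers with three operators
                 forms = [
                     (lambda: fr(fq(fp(a, b), c), d),
                      f"(({a} {p} {b}) {q} {c}) {r} {d}"),
                     (lambda: fr(fp(a, fq(b, c)), d),
                      f"({a} {p} ({b} {q} {c})) {r} {d}"),
                     (lambda: fq(fp(a, b), fr(c, d)),
                      f"({a} {p} {b}) {q} ({c} {r} {d})"),
                     (lambda: fp(a, fr(fq(b, c), d)),
                      f"{a} {p} (({b} {q} {c}) {r} {d})"),
                     (lambda: fp(a, fq(b, fr(c, d))),
                      f"{a} {p} ({b} {q} ({c} {r} {d}))"),
                 ]
                 for value, expr in forms:
                     try:
                         if value() == target:
                             sols.append(expr)
                     except ZeroDivisionError:
                         pass
         return sols

     print(solve24([4, 7, 8, 8]))
     ```
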
  7. The NYC motor vehicle collisions data with documentation is available from NYC Open Data. The raw data needs some cleaning.

    1. Use the filter from the website to download the crash data of January 2023; save it under a directory data with an informative name (e.g., nyc_crashes_202301.csv).
    2. Get basic summaries of each variable: missing percentage; descriptive statistics for continuous variables; frequency tables for discrete variables.
    3. Do the LATITUDE and LONGITUDE values all look legitimate? If not (e.g., zeroes), code them as missing values.
    4. If OFF STREET NAME is not missing, are there any missing LATITUDE and LONGITUDE? If so, geocode the addresses (see the sketch after this list).
    5. (Optional) Are the missing patterns of ON STREET NAME and LATITUDE the same? Summarize the missing patterns by a cross table. If ON STREET NAME and CROSS STREET NAME are available, use geocoding by intersection to fill the LATITUDE and LONGITUDE.
    6. Are ZIP CODE and BOROUGH always missing together? If LATITUDE and LONGITUDE are available, use reverse geocoding to fill the ZIP CODE and BOROUGH.
    7. Print the whole frequency table of CONTRIBUTING FACTOR VEHICLE 1. Convert lowercase to uppercase and check the frequencies again.
    8. Given an opportunity to meet the data provider, what suggestions would you make to improve the data, based on your data exploration experience?
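
     For the geocoding tasks above, one option is the geopy interface to the free Nominatim service (rate limits apply, hence the RateLimiter; the user_agent string is arbitrary, and the column names are assumed unchanged from the raw data):

     ```python
     import pandas as pd
     from geopy.geocoders import Nominatim
     from geopy.extra.rate_limiter import RateLimiter

     df = pd.read_csv("data/nyc_crashes_202301.csv")

     geolocator = Nominatim(user_agent="nyc_crash_hw")   # any identifying string
     geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

     # rows with a street address but no coordinates
     mask = df["OFF STREET NAME"].notna() & df["LATITUDE"].isna()
     for i in df[mask].index:
         # append the city to disambiguate the street address
         loc = geocode(f'{df.at[i, "OFF STREET NAME"]}, New York City, NY')
         if loc is not None:
             df.at[i, "LATITUDE"] = loc.latitude
             df.at[i, "LONGITUDE"] = loc.longitude
     ```
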
  8. Except for the first task, use the cleaned data set with missing geocodes imputed (data/nyc_crashes_202301_cleaned.csv).

    1. Construct a contingency table for missingness in geocode (latitude and longitude) by borough. Is the missing pattern the same across boroughs? Formulate a hypothesis and test it (a sketch follows this list).
    2. Construct an hour variable with integer values from 0 to 23. Plot the histogram of the number of crashes by hour. Plot it by borough.
    3. Overlay the locations of the crashes on a map of NYC. The map could be a static map or a Google map.
    4. Create a new variable injury, which is one if the number of persons injured is 1 or more, and zero otherwise. Construct a cross table for injury versus borough. Test the null hypothesis that the two variables are not associated.
    5. Merge the crash data with the zip code database.
    6. Fit a logistic model with injury as the outcome variable and covariates that are available in the data or can be engineered from the data. For example, zip code level covariates can be obtained by merging with the zip code database.
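
     For the test in task 1, a sketch with pandas and scipy (column names assumed unchanged from the raw data):

     ```python
     import pandas as pd
     from scipy.stats import chi2_contingency

     df = pd.read_csv("data/nyc_crashes_202301_cleaned.csv")

     # missing-geocode indicator by borough
     miss = df["LATITUDE"].isna() | df["LONGITUDE"].isna()
     tab = pd.crosstab(df["BOROUGH"], miss)
     print(tab)

     # chi-square test of independence between borough and missingness
     chi2, pval, dof, expected = chi2_contingency(tab)
     print(f"chi2 = {chi2:.2f}, df = {dof}, p-value = {pval:.4g}")
     ```
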
  9. Using the cleaned NYC crash data, perform classification of injury with a support vector machine and compare the results with the benchmark from regularized logistic regression. Use the last week's data as the testing data. A sketch of the fitting and evaluation workflow follows the list below.

    1. Explain the parameters you used in your fitting for each method.
    2. Explain the confusion matrix result from each fit.
    3. Compare the performance of the two approaches in terms of accuracy, precision, recall, F1-score, and AUC.
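
     A sketch with scikit-learn, assuming the feature matrix and binary labels (X_train, y_train, X_test, y_test) have already been engineered, with the last week held out as the test set:

     ```python
     from sklearn.svm import SVC
     from sklearn.linear_model import LogisticRegression
     from sklearn.pipeline import make_pipeline
     from sklearn.preprocessing import StandardScaler
     from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                  f1_score, roc_auc_score, confusion_matrix)

     models = {
         # RBF-kernel SVM; C and gamma are the tuning parameters to justify
         "svm": make_pipeline(StandardScaler(),
                              SVC(C=1.0, kernel="rbf", probability=True)),
         # L2-regularized logistic regression; C is the inverse penalty strength
         "logit": make_pipeline(StandardScaler(),
                                LogisticRegression(C=1.0, max_iter=1000)),
     }

     for name, model in models.items():
         model.fit(X_train, y_train)
         pred = model.predict(X_test)
         prob = model.predict_proba(X_test)[:, 1]
         print(name, confusion_matrix(y_test, pred))
         print(f"  acc={accuracy_score(y_test, pred):.3f}",
               f"prec={precision_score(y_test, pred):.3f}",
               f"rec={recall_score(y_test, pred):.3f}",
               f"f1={f1_score(y_test, pred):.3f}",
               f"auc={roc_auc_score(y_test, prob):.3f}")
     ```
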
  10. The NYC Open Data of 311 Service Requests contains all requests from 2010 to present. We consider a subset of it with request time between 00:00:00 01/15/2023 and 24:00:00 01/21/2023. The subset is available in CSV format as data/nyc311_011523-012123_by022023.csv. Read the data dictionary to understand the meaning of the variables.

    1. Clean the data: fill missing fields as much as possible; check for obvious data entry errors (e.g., can Closed Date be earlier than Created Date?); summarize your suggestions to the data curator in several bullet points.
    2. Remove requests that are not made to NYPD and create a new variable duration, which represents the time period from the Created Date to the Closed Date (see the sketch after this list). Note that duration may be censored for some requests. Visualize the distribution of the uncensored duration by weekday/weekend and by borough, and test whether the distributions are the same across weekdays/weekends of creation and across boroughs.
    3. Define a binary variable over3h which is 1 if duration is greater than 3 hours. Note that it can be obtained even for censored duration. Build a model to predict over3h. If your model has tuning parameters, justify their choices. Apply this model to the 311 requests of NYPD in the week of 01/22/2023. Assess the performance of your model.
    4. Now you know the data quite well. Come up with a research question of interest that can be answered by the data, which could be analytics or visualizations. Perform the needed analyses and answer your question.
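
      For the duration variable in task 2, a pandas sketch; the timestamp format and column names follow the 311 data dictionary, and a request without a Closed Date by the data pull in February 2023 is treated as censored:

      ```python
      import pandas as pd

      df = pd.read_csv("data/nyc311_011523-012123_by022023.csv")
      nypd = df[df["Agency"] == "NYPD"].copy()

      fmt = "%m/%d/%Y %I:%M:%S %p"                   # e.g. "01/15/2023 02:33:00 PM"
      created = pd.to_datetime(nypd["Created Date"], format=fmt)
      closed = pd.to_datetime(nypd["Closed Date"], format=fmt)  # NaT if still open

      nypd["duration"] = (closed - created).dt.total_seconds() / 3600  # in hours
      nypd["weekend"] = created.dt.dayofweek >= 5
      # censored requests were created in January but still open at the pull,
      # so they must already be over 3 hours
      nypd["over3h"] = ((nypd["duration"] > 3) | nypd["duration"].isna()).astype(int)
      ```
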
  11. NYC Rodents Rats in NYC are widespread, as they are in many densely populated areas (https://en.wikipedia.org/wiki/Rats_in_New_York_City). As of October 2023, NYC dropped from 2nd to 3rd place in the annual “rattiest city” list released by a pest control company. In the 311 Service Request Data, there is one complaint type Rodent. Extract all the requests with complaint type Rodent created between January 1, 2022 and December 31, 2023, and save them into a csv file named rodent_2022-2023.csv (a sketch of the extraction follows the list below).

    1. Are there any complaints that are not closed yet?
    2. Are there any complaints with a closed date before the created date?
    3. How many agencies was this complaint type reported to?
    4. Summarize the missingness for each variable.
    5. Construct a frequency table for the descriptor variable, and a cross table of descriptor by year.
    6. Which types of ‘DESCRIPTOR’ do you think should be included if our interest is rodent sightings?
    7. Take a subset of the data with the descriptors you chose and summarize the response time by borough.
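
      A sketch of the extraction with pandas, assuming you start from a CSV export of the 311 service requests with the standard column names (the input file name here is hypothetical):

      ```python
      import pandas as pd

      df = pd.read_csv("nyc311_requests.csv")    # hypothetical export file name
      created = pd.to_datetime(df["Created Date"], format="%m/%d/%Y %I:%M:%S %p")

      mask = ((df["Complaint Type"] == "Rodent")
              & (created >= "2022-01-01") & (created < "2024-01-01"))
      df[mask].to_csv("rodent_2022-2023.csv", index=False)
      ```
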
  12. NYC rodent sightings data cleaning The data appears to need some cleaning before any further analysis. Some missing values could be filled based on other columns.

    1. Checking all 47 column names suggests that some columns might be redundant. Identify them and demonstrate the redundancy.
    2. Are zip code and borough always missing together? If geocodes are available, use reverse geocoding to fill the zip code (see the sketch after this list).
    3. Export the cleaned data in both csv and feather format. Comment on the file sizes.
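
      For task 2, the uszipcode package can reverse geocode offline; a sketch with assumed column names:

      ```python
      import pandas as pd
      from uszipcode import SearchEngine

      df = pd.read_csv("rodent_2022-2023.csv")
      search = SearchEngine()

      # fill missing zip codes from coordinates where available
      mask = df["Incident Zip"].isna() & df["Latitude"].notna()
      for i in df[mask].index:
          res = search.by_coordinates(df.at[i, "Latitude"], df.at[i, "Longitude"],
                                      radius=5, returns=1)
          if res:
              df.at[i, "Incident Zip"] = res[0].zipcode

      df.to_csv("rodent_2022-2023_cleaned.csv", index=False)
      df.to_feather("rodent_2022-2023_cleaned.feather")   # needs pyarrow installed
      ```
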
  13. SQL Practice on NYC rodent sightings The NYC rodent sightings data that we prepared could be stored more efficiently using a database. Let us start from the csv file you exported in the last problem. A SQLite-based sketch of the first few steps follows the list below.

    1. Create a table called rodent from the csv file.
    2. The agency and agency_name columns are redundant in the table. Create a table called agency, which contains only these two columns, one agency per row.
    3. Drop the agency_name column from the rodent table. Justify why we do not need it there.
    4. Comment on the size of the rodent table (or its exported csv file) before and after dropping the agency_name column.
    5. Come up with a schema for the two tables that allows even more efficient storage of the agency column in the rodent table. Hint: use an integer to code the agencies.
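
      A sketch of the first three tasks using SQLite through Python, assuming the column names in the cleaned csv have been standardized to snake_case (agency, agency_name); note that DROP COLUMN requires SQLite 3.35 or later:

      ```python
      import sqlite3
      import pandas as pd

      con = sqlite3.connect("rodent.db")

      # 1. load the csv into a table called rodent
      pd.read_csv("rodent_2022-2023_cleaned.csv").to_sql(
          "rodent", con, index=False, if_exists="replace")

      # 2. one row per agency with its (redundant) full name
      con.execute("""CREATE TABLE agency AS
                     SELECT DISTINCT agency, agency_name FROM rodent""")

      # 3. agency_name can always be recovered by joining on the agency code
      con.execute("ALTER TABLE rodent DROP COLUMN agency_name")
      con.commit()
      con.close()
      ```
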
  14. Logistic Modeling The response time to 311 service requests is a measure of civic service quality. Let us model the response time to 311 requests with complaint type Rodent.

    1. Compute the response time in hours. Note that some response times will be missing because the closed date is unavailable.
    2. Create a binary variable over3d, which is one if the response time is greater than 3 days, and zero otherwise. Note that this variable should have no missing values.
    3. Use the package uszipcode to obtain zip code level covariates such as median household income and median home value. Merge these variables into the rodent data.
    4. Split the data at random into training (80%) and testing (20%). Build a logistic model to predict over3d on the training data, and validate the performance on the testing data.
    5. Build a lasso logistic model to predict over3d, and justify your choice of the tuning parameter. Validate on the testing data.
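
      For the lasso fit in task 5, a scikit-learn sketch, reusing X_train, y_train, X_test, and y_test from task 4; features are standardized so the penalty treats them comparably:

      ```python
      from sklearn.linear_model import LogisticRegressionCV
      from sklearn.metrics import roc_auc_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

      # L1-penalized logistic regression; the inverse penalty strength C is
      # the tuning parameter, chosen here by 5-fold cross-validation on AUC
      lasso = make_pipeline(
          StandardScaler(),
          LogisticRegressionCV(penalty="l1", solver="liblinear",
                               Cs=20, cv=5, scoring="roc_auc"))
      lasso.fit(X_train, y_train)
      print("test AUC:", roc_auc_score(y_test, lasso.predict_proba(X_test)[:, 1]))
      ```
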
  15. Midterm Project: Rodents in NYC Rodents in NYC are widespread, as they are in many densely populated areas. As of October 2023, NYC dropped from 2nd to 3rd place in the annual “rattiest city” list released by a pest control company. Rat sightings in NYC were analyzed by Dr. Michael Walsh in a 2014 PeerJ article. We investigate this problem from a different angle with the NYC Rodent Inspection data, provided by the Department of Health and Mental Hygiene (DOHMH). Download the 2022-2023 data by filtering INSPECTION_DATE to between 11:59:59 pm of 12/31/2021 and 12:00:00 am of 01/01/2024, and INSPECTION_TYPE to either Initial or Compliance (the download should be about 108 MB). Read the metadata to understand the data.

    1. Data cleaning.
      • There are two zipcode columns: ZIP_CODE and Zipcodes. Which one represents the zipcode of the inspection site? Comment on the data dictionary.
      • Summarize the missing information. Are there missing values that can be filled using other columns? Fill them if yes.
      • Is there redundant information in the data? Try storing the data using arrow and comment on the efficiency gain.
      • Are there invalid zipcodes or boroughs? Justify and clean them up if yes.
    2. Data exploration.
      • Create a binary variable passing indicating whether the inspection result is a pass. Does passing depend on whether the inspection is initial or compliance? State your hypothesis and summarize your test result.
      • Are the passing patterns different across boroughs for initial inspections? How about compliance inspections? State your hypotheses and summarize your test results.
      • If we suspect that the passing rate may depend on the time of day of the inspection, we may compare the passing rates for inspections done in the mornings with those done in the afternoons. Visualize the comparison by borough and inspection type.
      • Perform a formal hypothesis test to confirm the observations from your visualization.
    3. Data analytics.
      • Aggregate the inspections by zip code to create a dataset with five columns. The first three columns are zipcode; n_initial, the count of the initial inspections in that zipcode; and n_initpass, the number of initial inspections with a passing result in that zipcode. The other two variables are n_compliance and n_comppass, the counterpart for compliance inspections.
      • Add a variable to your dataset, n_sighting, which represents the number of rodent sightings from the 311 service request data in the same 2022-2023 period.
      • Merge your dataset with the simple zipcode table in package uszipcode by zipcode to obtain demographic and socioeconomic variables at the zipcode level.
      • Build a binomial regression for the passing rate of initial inspections at the zipcode level (a GLM sketch follows the last item of this problem). Assess the goodness-of-fit of your model. Summarize your results for a New Yorker who is not data science savvy.
    4. Now you know the data quite well. Come up with a research question of interest that can be answered by the data, which could be analytics or visualizations. Perform the needed analyses and answer your question.
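
      For the binomial regression, statsmodels accepts a two-column (successes, failures) response; a sketch in which zips is the aggregated zipcode-level dataset built above and the covariate names are illustrative:

      ```python
      import statsmodels.api as sm

      # response: passing and failing counts of initial inspections per zipcode
      y = zips[["n_initpass"]].copy()
      y["n_initfail"] = zips["n_initial"] - zips["n_initpass"]

      # illustrative covariates merged from the 311 data and uszipcode
      X = sm.add_constant(zips[["n_sighting", "median_household_income",
                                "population_density"]])

      fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
      print(fit.summary())
      # a quick goodness-of-fit check: residual deviance vs degrees of freedom
      print(fit.deviance, fit.df_resid)
      ```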