12 Exercises
Git basics and GitHub setup

Learn the Git basics and set up an account on GitHub if you do not already have one. Practice the Git tips from the notes. As you go through the following tasks, ensure your repo has at least 10 commits, each with an informative message, and regularly check the status of your repo with `git status`. The specific tasks are:

- Clone the class notes repo to an appropriate folder on your computer.
- Add all the files to your designated homework repo from GitHub Classroom and work on that repo for the rest of the problem.
- Add your name and wishes to the wish list; commit.
- Remove the `Last, First` entry from the list; commit.
- Create a new file called `add.qmd` containing a few lines of text; commit.
- Remove `add.qmd` (pretending that this was an accident); commit.
- Recover the accidentally removed file `add.qmd`; add a long line (a paragraph without a hard break); add a short line (under 80 characters); commit.
- Change one word in the long line and one word in the short line; use `git diff` to see the difference from the last commit; commit.
- Play with other Git operations and commit.
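The add/remove/recover/diff cycle above can be sketched in a throwaway local repo, assuming `git` is installed; the file name follows the problem, everything else (identity, messages) is made up for the demo.

```shell
# Self-contained demo in a temporary directory; no remote needed.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.email "student@example.com"   # local identity for the demo
git config user.name  "Student"

printf 'a few lines\nof text\n' > add.qmd
git add add.qmd
git commit -q -m "Add add.qmd with a few lines of text"

git rm -q add.qmd                             # "accidental" removal
git commit -q -m "Remove add.qmd by accident"

git revert --no-edit HEAD                     # recover the removed file
git status --short                            # should print nothing (clean)

echo 'one extra short line' >> add.qmd
git diff                                      # inspect the change before committing
git commit -q -am "Add a short line"
```

Cloning and pushing to your GitHub Classroom repo work the same way, with `git clone` and `git push` added at the two ends.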
Contributing to the Class Notes
- Create a fork of the notes repo into your own GitHub account.
- Clone it to your local computer.
- Make a new branch to experiment with your changes.
- Check out your branch and add your wishes to the wish list; push to your GitHub account.
- Make a pull request to the class notes repo from your fork on GitHub. Make sure your messages clearly document the changes.
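The branching steps above can be rehearsed locally before touching your fork; this sketch assumes `git` is installed, and the fork/pull-request steps themselves happen on GitHub so they appear only as comments.

```shell
# Self-contained local demo of the branch workflow.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q notes && cd notes
git config user.email "student@example.com"
git config user.name  "Student"
echo "wish list" > wishlist.md
git add wishlist.md && git commit -q -m "Start wish list"

git branch my-wishes              # new branch to experiment on
git checkout -q my-wishes         # or: git switch my-wishes
echo "- Student: more examples" >> wishlist.md
git commit -q -am "Add my wishes"
# then: git push origin my-wishes, and open a pull request on GitHub
```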
Monty Hall

Write a function to demonstrate the Monty Hall problem through simulation. The function takes two arguments, `ndoors` and `ntrials`, representing the number of doors in the experiment and the number of trials in a simulation, respectively. The function should return the proportion of wins for both the switch and no-switch strategies. Apply your function with 3 doors and 5 doors, both with 1000 trials. Include sufficient text around the code to explain your results.

Approximating \(\pi\)

Write a function to do a Monte Carlo approximation of \(\pi\). The function takes a Monte Carlo sample size `n` as input, and returns a point estimate of \(\pi\) and a 95% confidence interval. Apply your function with sample sizes 1000, 2000, 4000, and 8000. Repeat the experiment 1000 times for each sample size and check the empirical probability that the confidence intervals cover the true value of \(\pi\). Comment on the results.

Google Billboard Ad

Find the first 10-digit prime number occurring in consecutive digits of \(e\). This was a Google recruiting ad.
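For the Monty Hall problem above, a minimal simulation sketch; the function name and the dictionary return format are one possible choice, and for more than 3 doors it assumes the host opens all but one of the other doors.

```python
import random

def monty_hall(ndoors=3, ntrials=1000, seed=None):
    """Estimate win proportions for the switch and no-switch strategies."""
    rng = random.Random(seed)
    wins_switch = wins_stay = 0
    for _ in range(ntrials):
        prize = rng.randrange(ndoors)
        pick = rng.randrange(ndoors)
        # Host opens ndoors - 2 goat doors, leaving one other door closed;
        # the switcher takes that remaining closed door.
        if pick == prize:
            wins_stay += 1       # staying wins iff the first pick was right
        else:
            wins_switch += 1     # otherwise the remaining door holds the prize
    return {"switch": wins_switch / ntrials, "stay": wins_stay / ntrials}
```

With 3 doors the switch proportion should be near 2/3; with 5 doors (under this host behavior) near 4/5.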
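For the \(\pi\) approximation problem, one common design: sample points in the unit square, estimate the quarter-circle probability, and build a normal-approximation interval. The function name and return format are illustrative.

```python
import math
import random

def approx_pi(n, seed=None):
    """Monte Carlo estimate of pi with a normal-approximation 95% CI."""
    rng = random.Random(seed)
    # Count points falling inside the quarter circle of radius 1.
    inside = sum(rng.random() ** 2 + rng.random() ** 2 <= 1 for _ in range(n))
    p_hat = inside / n                       # estimates pi / 4
    est = 4 * p_hat
    se = 4 * math.sqrt(p_hat * (1 - p_hat) / n)
    return est, (est - 1.96 * se, est + 1.96 * se)
```

Repeating `approx_pi(n)` many times and recording how often the interval brackets `math.pi` gives the empirical coverage asked for in the problem.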
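One way to attack the Google billboard problem with only the standard library: generate digits of \(e\) from its series with the `decimal` module, then test 10-digit windows with a Miller-Rabin check that is deterministic for numbers of this size with the bases used below. The helper names and the 200-digit budget are choices, not requirements.

```python
from decimal import Decimal, getcontext

def e_digits(ndigits):
    """First ndigits decimal digits of e (including the leading 2)."""
    getcontext().prec = ndigits + 10         # guard digits
    e, term = Decimal(0), Decimal(1)
    for k in range(1, 2 * ndigits):          # k! quickly outgrows 10**prec
        e += term                            # term is 1/(k-1)! here
        term /= k
    return str(e).replace(".", "")[:ndigits]

def is_prime(n):
    """Miller-Rabin, deterministic for n < 3.3e24 with these bases."""
    if n < 2:
        return False
    small = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in small:
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for a in small:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def first_10digit_prime_in_e(ndigits=200):
    digits = e_digits(ndigits)
    for i in range(len(digits) - 9):
        window = digits[i:i + 10]
        if window[0] != "0" and is_prime(int(window)):
            return int(window)
    return None
```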
Game 24

The math game 24 is an addictive game among number lovers. With four cards randomly selected from a deck of poker cards, use all four values and the elementary arithmetic operations (\(+\), \(-\), \(\times\), \(/\)) to come up with 24. Let \(\square\) be one of the four numbers and \(\bigcirc\) represent one of the four operators. For example, \[\begin{equation*} (\square \bigcirc \square) \bigcirc (\square \bigcirc \square) \end{equation*}\] is one way to group the operations.

- List all the possible ways to group the four numbers.
- How many possible ways are there to check for a solution?
- Write a function to solve the problem in a brute-force way. The inputs of the function are four numbers. The function returns a list of solutions. Some of the solutions will be equivalent, but let us not worry about that for now.
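A brute-force sketch (one possible design, with an assumed name `game24`): exact arithmetic with `fractions.Fraction` avoids floating-point equality issues, and repeatedly combining any ordered pair of remaining values covers all permutations and groupings at once.

```python
from fractions import Fraction

def game24(a, b, c, d, target=24):
    """Return expression strings over the four numbers that evaluate to target."""
    def combine(items):
        # items: list of (exact value, expression string) pairs yet to be merged
        if len(items) == 1:
            yield items[0]
            return
        for i in range(len(items)):
            for j in range(len(items)):
                if i == j:
                    continue
                rest = [items[k] for k in range(len(items)) if k not in (i, j)]
                (x, ex), (y, ey) = items[i], items[j]
                pairs = [(x + y, f"({ex}+{ey})"),
                         (x - y, f"({ex}-{ey})"),
                         (x * y, f"({ex}*{ey})")]
                if y != 0:                      # never divide by zero
                    pairs.append((x / y, f"({ex}/{ey})"))
                for v, e in pairs:
                    yield from combine(rest + [(v, e)])
    start = [(Fraction(n), str(n)) for n in (a, b, c, d)]
    return sorted({e for v, e in combine(start) if v == target})
```

Ordered pairs double-count commutative operations, so equivalent solutions appear, as the problem anticipates; the set keeps only distinct expression strings.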
The NYC motor vehicle collisions data with documentation is available from NYC Open Data. The raw data needs some cleaning. (JY: Add variable name cleaning next year.)

- Use the filter from the website to download the crash data of January 2023; save it under a directory `data` with an informative name (e.g., `nyc_crashes_202301.csv`).
- Get basic summaries of each variable: missing percentage; descriptive statistics for continuous variables; frequency tables for discrete variables.
- Do the `LATITUDE` and `LONGITUDE` values all look legitimate? If not (e.g., zeroes), code them as missing values.
- If `OFF STREET NAME` is not missing, are there any missing `LATITUDE` and `LONGITUDE`? If so, geocode the addresses.
- (Optional) Are the missing patterns of `ON STREET NAME` and `LATITUDE` the same? Summarize the missing patterns by a cross table. If `ON STREET NAME` and `CROSS STREET NAME` are available, use geocoding by intersection to fill the `LATITUDE` and `LONGITUDE`.
- Are `ZIP CODE` and `BOROUGH` always missing together? If `LATITUDE` and `LONGITUDE` are available, use reverse geocoding to fill the `ZIP CODE` and `BOROUGH`.
- Print the whole frequency table of `CONTRIBUTING FACTOR VEHICLE 1`. Convert lower cases to upper cases and check the frequencies again.
- If provided an opportunity to meet the data provider, what suggestions do you have to make the data better, based on your data exploration experience?
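The first cleaning steps above can be sketched as follows, assuming `pandas` is available; the column names follow the problem, but the small inline frame stands in for the real `data/nyc_crashes_202301.csv`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real crash file; in practice start from
# pd.read_csv("data/nyc_crashes_202301.csv").
df = pd.DataFrame({
    "LATITUDE":  [40.71, 0.0, np.nan, 40.83],
    "LONGITUDE": [-74.00, 0.0, np.nan, -73.94],
    "BOROUGH":   ["MANHATTAN", None, "BRONX", None],
})

# Zero coordinates are placeholders, not real locations: recode as missing.
df.loc[df["LATITUDE"] == 0, "LATITUDE"] = np.nan
df.loc[df["LONGITUDE"] == 0, "LONGITUDE"] = np.nan

# Missing percentage of each variable.
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Cross table of missingness patterns, e.g. BOROUGH vs LATITUDE.
pattern = pd.crosstab(df["BOROUGH"].isna(), df["LATITUDE"].isna())
print(pattern)
```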
Except for the first problem, use the cleaned data set with missing geocodes imputed (`data/nyc_crashes_202301_cleaned.csv`).

- Construct a contingency table for missingness in geocode (latitude and longitude) by borough. Is the missing pattern the same across boroughs? Formulate a hypothesis and test it.
- Construct an `hour` variable with integer values from 0 to 23. Plot the histogram of the number of crashes by `hour`. Plot it by borough.
- Overlay the locations of the crashes on a map of NYC. The map could be a static map or a Google map.
- Create a new variable `injury`, which is one if the number of persons injured is 1 or more, and zero otherwise. Construct a cross table for `injury` versus borough. Test the null hypothesis that the two variables are not associated.
- Merge the crash data with the zip code database.
- Fit a logistic model with `injury` as the outcome variable and covariates that are available in the data or can be engineered from the data. For example, zip code level covariates can be obtained by merging with the zip code database.
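The last step above can be sketched with `statsmodels` (assumed available); the covariate names and the synthetic data here are illustrative stand-ins for the engineered crash features.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the cleaned crash data: `injury` and two
# illustrative covariates (hour of day and a zip-level income measure).
rng = np.random.default_rng(2023)
n = 500
df = pd.DataFrame({
    "hour": rng.integers(0, 24, n),
    "median_income": rng.normal(70, 15, n),
})
logit_p = -1.0 + 0.05 * df["hour"] - 0.01 * df["median_income"]
df["injury"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# Logistic regression via a model formula; summary() shows coefficients,
# standard errors, and p-values.
fit = smf.logit("injury ~ hour + median_income", data=df).fit(disp=0)
print(fit.summary())
phat = fit.predict(df)                  # fitted injury probabilities
```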
Using the cleaned NYC crash data, perform classification of `injury` with a support vector machine and compare the results with the benchmark from regularized logistic regression. Use the last week's data as testing data.

- Explain the parameters you used in your fitting for each method.
- Explain the confusion matrix result from each fit.
- Compare the performance of the two approaches in terms of accuracy, precision, recall, F1-score, and AUC.
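The comparison can be sketched with scikit-learn (assumed available); a synthetic classification problem stands in for the crash features, and the `C` and kernel values shown are defaults to be tuned, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the engineered crash features and `injury` labels;
# with the real data, split by week instead of at random.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "svm": make_pipeline(StandardScaler(),
                         SVC(C=1.0, kernel="rbf", probability=True)),
    "logistic_l2": make_pipeline(StandardScaler(), LogisticRegression(C=1.0)),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    prob = model.predict_proba(X_te)[:, 1]
    print(name, confusion_matrix(y_te, pred), sep="\n")
    scores[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred),
        "recall": recall_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "auc": roc_auc_score(y_te, prob),
    }
```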
The NYC Open Data of 311 Service Requests contains all requests from 2010 to present. We consider a subset of it with request time between 00:00:00 01/15/2023 and 24:00:00 01/21/2023. The subset is available in CSV format as `data/nyc311_011523-012123_by022023.csv`. Read the data dictionary to understand the meaning of the variables.

- Clean the data: fill missing fields as much as possible; check for obvious data entry errors (e.g., can `Closed Date` be earlier than `Created Date`?); summarize your suggestions to the data curator in several bullet points.
- Remove requests that are not made to NYPD and create a new variable `duration`, which represents the time period from the `Created Date` to the `Closed Date`. Note that `duration` may be censored for some requests. Visualize the distribution of uncensored `duration` by weekday/weekend and by borough, and test whether the distributions are the same across weekdays/weekends of their creation and across boroughs.
- Define a binary variable `over3h`, which is 1 if `duration` is greater than 3 hours. Note that it can be obtained even for censored `duration`. Build a model to predict `over3h`. If your model has tuning parameters, justify their choices. Apply this model to the 311 requests of NYPD in the week of 01/22/2023. Assess the performance of your model.
- Now you know the data quite well. Come up with a research question of interest that can be answered by the data, which could be analytics or visualizations. Perform the needed analyses and answer your question.
NYC Rodents

Rats in NYC are widespread, as they are in many densely populated areas (https://en.wikipedia.org/wiki/Rats_in_New_York_City). As of October 2023, NYC dropped from the 2nd to the 3rd place in the annual "rattiest city" list released by a pest control company. In the 311 Service Request Data, there is one complaint type, `Rodent`. Extract all the requests with complaint type `Rodent` created between January 1, 2022 and December 31, 2023. Save them into a csv file named `rodent_2022-2023.csv`.

- Are there any complaints that are not closed yet?
- Are there any complaints with a closed date before the created date?
- How many agencies was this complaint type reported to?
- Summarize the missingness for each variable.
- Summarize a frequency table for the `descriptor` variable, and summarize a cross table by year.
- Which types of `descriptor` do you think should be included if our interest is rodent sighting?
- Take a subset of the data with the descriptors you chose and summarize the response time by borough.
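The frequency table and year-by-descriptor cross table can be sketched as follows, assuming `pandas` is available; the inline rows and lowercase column names stand in for the real `rodent_2022-2023.csv`.

```python
import pandas as pd

# Stand-in rows for rodent_2022-2023.csv; only the columns used here.
df = pd.DataFrame({
    "created_date": pd.to_datetime(["2022-03-01", "2022-07-15",
                                    "2023-01-20", "2023-08-09"]),
    "descriptor": ["Rat Sighting", "Mouse Sighting",
                   "Rat Sighting", "Condition Attracting Rodents"],
})

freq = df["descriptor"].value_counts()                 # frequency table
by_year = pd.crosstab(df["descriptor"], df["created_date"].dt.year)
print(freq)
print(by_year)
```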
NYC rodent sightings data cleaning

The data appear to need some cleaning before any further analysis. Some missing values could be filled based on other columns.

- Checking all 47 column names suggests that some columns might be redundant. Identify them and demonstrate the redundancy.
- Are zip code and borough always missing together? If geocodes are available, use reverse geocoding to fill the zip code.
- Export the cleaned data in both csv and feather formats. Comment on the file sizes.
SQL Practice on NYC rodent sightings

The NYC rodent sightings data that we prepared could be stored more efficiently using a database. Let us start from the csv file you exported in the last problem.

- Create a table called `rodent` from the csv file.
- The `agency` and `agency_name` columns are redundant in the table. Create a table called `agency`, which contains only these two columns, one agency per row.
- Drop the `agency_name` column from the `rodent` table. Justify why we do not need it here.
- Comment on the sizes of the table (or exported csv file) of `rodent` before and after dropping the `agency_name` column.
- Come up with a schema for the two tables that allows even more efficient storage of the `agency` column in the `rodent` table. *Hint: use an integer to code the agencies.*
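The table manipulations above can be sketched with Python's built-in `sqlite3` module; the in-memory database and the two made-up rows stand in for the real csv import, and `ALTER TABLE ... DROP COLUMN` assumes SQLite 3.35 or later.

```python
import sqlite3

# In-memory sketch; in practice load rodent_2022-2023.csv (e.g., with
# pandas and df.to_sql("rodent", con)).
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE rodent
               (created_date TEXT, agency TEXT, agency_name TEXT, borough TEXT)""")
cur.executemany(
    "INSERT INTO rodent VALUES (?, ?, ?, ?)",
    [("2022-03-01", "DOHMH", "Department of Health and Mental Hygiene", "BRONX"),
     ("2023-01-20", "DOHMH", "Department of Health and Mental Hygiene", "QUEENS")],
)

# One agency per row in a lookup table, then drop the redundant column:
# agency_name can always be recovered by joining on agency.
cur.execute("CREATE TABLE agency AS SELECT DISTINCT agency, agency_name FROM rodent")
cur.execute("ALTER TABLE rodent DROP COLUMN agency_name")
con.commit()
```

For the integer-coding scheme, the `agency` table's implicit `rowid` can serve as the key, with `rodent` storing that integer instead of the agency string.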
Logistic Modeling

The response time to 311 service requests is a measure of civic service quality. Let us model the response time to 311 requests with complaint type `Rodent`.

- Compute the response time in hours. Note that some response times will be missing because of unavailable closed dates.
- Compute a binary variable `over3d`, which is one if the response time is greater than 3 days, and zero otherwise. Note that this variable should have no missing values.
- Use the package `uszipcode` to obtain zip code level covariates such as median household income and median home value. Merge these variables into the rodent data.
- Split the data at random into training (80%) and testing (20%) sets. Build a logistic model to predict `over3d` on the training data, and validate the performance on the testing data.
- Build a lasso logistic model to predict `over3d`, and justify your choice of the tuning parameter. Validate on the testing data.
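The lasso step can be sketched with scikit-learn (assumed available) on synthetic features standing in for the covariates merged from `uszipcode`; `LogisticRegressionCV` chooses the penalty strength by cross-validation, which is one way to justify the tuning parameter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in: a few zip-level covariates and an over3d label.
rng = np.random.default_rng(311)
X = rng.normal(size=(400, 6))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The L1 penalty needs a solver that supports it; a grid of Cs values
# (inverse penalty strengths) is searched by 5-fold cross-validation.
lasso = LogisticRegressionCV(Cs=10, penalty="l1", solver="liblinear", cv=5)
lasso.fit(X_tr, y_tr)
test_acc = lasso.score(X_te, y_te)
print("chosen C:", lasso.C_[0], "test accuracy:", test_acc)
```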
Midterm Project: Rodents in NYC

Rodents in NYC are widespread, as they are in many densely populated areas. As of October 2023, NYC dropped from the 2nd to the 3rd place in the annual "rattiest city" list released by a pest control company. Rat sightings in NYC were analyzed by Dr. Michael Walsh in a 2014 PeerJ article. We investigate this problem from a different angle with the NYC Rodent Inspection data, provided by the Department of Health and Mental Hygiene (DOHMH). Download the 2022-2023 data by filtering `INSPECTION_DATE` to between 11:59:59 pm of 12/31/2021 and 12:00:00 am of 01/01/2024 and `INSPECTION_TYPE` to either `Initial` or `Compliance` (which should be about 108 MB). Read the metadata to understand the data.

- Data cleaning.
  - There are two zip code columns: `ZIP_CODE` and `Zipcodes`. Which one represents the zip code of the inspection site? Comment on the data dictionary.
  - Summarize the missing information. Are there missing values that can be filled using other columns? Fill them if yes.
  - Is there redundant information in the data? Try storing the data using `arrow` and comment on the efficiency gain.
  - Are there invalid zip codes or boroughs? Justify and clean them up if yes.
- Data exploration.
  - Create a binary variable `passing` indicating whether the inspection result is a pass. Does `passing` depend on whether the inspection is initial or compliance? State your hypothesis and summarize your test result.
  - Are the passing patterns different across boroughs for initial inspections? How about compliance inspections? State your hypotheses and summarize your test results.
  - If we suspect that the passing rate may depend on the time of day of the inspection, we may compare the passing rates for inspections done in the mornings and inspections done in the afternoons. Visualize the comparison by borough and inspection type.
  - Perform a formal hypothesis test to confirm the observations from your visualization.
- Data analytics.
  - Aggregate the inspections by zip code to create a dataset with five columns. The first three columns are `zipcode`; `n_initial`, the count of the initial inspections in that zip code; and `n_initpass`, the number of initial inspections with a passing result in that zip code. The other two variables are `n_compliance` and `n_comppass`, the counterparts for compliance inspections.
  - Add a variable to your dataset, `n_sighting`, which represents the number of rodent sightings from the 311 service request data in the same 2022-2023 period.
  - Merge your dataset with the simple zip code table in the package `uszipcode` by zip code to obtain demographic and socioeconomic variables at the zip code level.
  - Build a binomial regression for the passing rate of initial inspections at the zip code level. Assess the goodness of fit of your model. Summarize your results for a New Yorker who is not data science savvy.
- Now you know the data quite well. Come up with a research question of interest that can be answered by the data, which could be analytics or visualizations. Perform the needed analyses and answer your question.