14 Exercises
Pick up Git basics and set up an account at GitHub if you don’t have one. Practice the Git tips in the notes. Make sure you have at least 10 commits in the repo, each with an informative message. Keep checking the status of your repo with `git status`. My grader will grade the repo.
- Clone the `ids-s23` repo to your own computer.
- Add your name and wishes to the Wishlist; commit with an informative message.
- Remove the `Last, First` entry from the list; commit.
- Create a new file called `add.qmd` containing a few lines of text; commit.
- Remove `add.qmd` (pretending that this is by accident); commit.
- Recover the accidentally removed file `add.qmd`; add a long line (a paragraph without a hard break); add a short line (under 80 characters); commit.
- Change one word in the long line and one word in the short line; use `git diff` to see the difference from the last commit; commit.
- Put the repo into the GitHub Classroom homework repo with `git remote add` and `git push`.
Get ready for contributing to the classnotes.
- Create a fork of the `ids-s23` repo into your own GitHub account.
- Clone it to your local computer.
- Make a new branch to experiment with your changes.
- Check out your branch and add your wishes to the wish list; push to your GitHub account.
- Make a pull request to my `ids-s23` repo from your fork at GitHub. Make sure you have clear messages to document the changes.
Write a function to demonstrate the Monty Hall problem through simulation. The function takes two arguments, `ndoors` and `ntrials`, representing the number of doors in the experiment and the number of trials in a simulation, respectively. The function should return the proportion of wins for both the switch and no-switch strategies. Apply your function with 3 doors and 5 doors, both with 1000 trials. Include sufficient text around the code to explain it.
Write a function to do a Monte Carlo approximation of \(\pi\). The function takes a Monte Carlo sample size `n` as input, and returns a point estimate of \(\pi\) and a 95% confidence interval. Apply your function with sample sizes 1000, 2000, 4000, and 8000. Repeat the experiment 1000 times for each sample size and check the empirical probability that the confidence intervals cover the true value of \(\pi\). Comment on the results.
Find the first 10-digit prime number occurring in consecutive digits of \(e\). This was a Google recruiting ad.
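
As a hint for the search over the digits of \(e\), one possible approach is a sliding window over a high-precision decimal expansion combined with a primality check. The sketch below is only one way to set this up: the helper names `is_prime` and `first_prime_in_e` and the default of 500 scanned digits are assumptions made for illustration, and trial division is chosen for simplicity rather than speed.

```python
from decimal import Decimal, localcontext
from typing import Optional


def is_prime(m: int) -> bool:
    """Primality check by trial division; adequate for 10-digit integers."""
    if m < 2:
        return False
    if m % 2 == 0:
        return m == 2
    d = 3
    while d * d <= m:
        if m % d == 0:
            return False
        d += 2
    return True


def first_prime_in_e(width: int, ndigits: int = 500) -> Optional[int]:
    """Return the first `width`-digit prime among the first `ndigits` digits of e.

    Both names and the default `ndigits` are hypothetical choices.
    """
    with localcontext() as ctx:
        # Carry a few extra digits so that rounding in the last places
        # cannot change the digits that are actually scanned.
        ctx.prec = ndigits + 10
        digits = str(Decimal(1).exp()).replace(".", "")[:ndigits]
    for i in range(len(digits) - width + 1):
        window = digits[i:i + width]
        # A window with a leading zero is not a `width`-digit number.
        if window[0] != "0" and is_prime(int(window)):
            return int(window)
    return None


print(first_prime_in_e(3))  # 271: the first three digits of e form a prime
```

Calling `first_prime_in_e(10)` runs the same scan for the exercise itself. If it returned `None`, the natural fix would be to increase `ndigits`.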
The NYC motor vehicle collisions data with documentation is available from NYC Open Data. The raw data needs some cleaning. (JY: Add variable name cleaning next year.)
- Use the filter from the website to download the crash data of January 2023; save it under a directory `data` with an informative name (e.g., `nyc_crashes_202301.csv`).
- Get basic summaries of each variable: missing percentage; descriptive statistics for continuous variables; frequency tables for discrete variables.
- Do the `LATITUDE` and `LONGITUDE` values all look legitimate? If not (e.g., zeroes), code them as missing values.
- If `OFF STREET NAME` is not missing, are there any missing `LATITUDE` and `LONGITUDE`? If so, geocode the addresses.
- (Optional) Are the missing patterns of `ON STREET NAME` and `LATITUDE` the same? Summarize the missing patterns by a cross table. If `ON STREET NAME` and `CROSS STREET NAME` are available, use geocoding by intersection to fill the `LATITUDE` and `LONGITUDE`.
- Are `ZIP CODE` and `BOROUGH` always missing together? If `LATITUDE` and `LONGITUDE` are available, use reverse geocoding to fill the `ZIP CODE` and `BOROUGH`.
- Print the whole frequency table of `CONTRIBUTING FACTOR VEHICLE 1`. Convert lowercase letters to uppercase and check the frequencies again.
- Given an opportunity to meet the data provider, what suggestions do you have to make the data better based on your data exploration experience?
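
Some of the cleaning steps above can be sketched with pandas. The rows below are toy values made up for illustration, not records from the NYC data, and only three of the columns named in the exercise are included. With the real file, the data frame would presumably come from `pd.read_csv` on the downloaded CSV instead.

```python
import numpy as np
import pandas as pd

# Toy rows made up for illustration; they are not taken from the NYC data.
crashes = pd.DataFrame({
    "LATITUDE": [40.71, 0.0, np.nan, 40.65],
    "LONGITUDE": [-73.99, 0.0, np.nan, -73.95],
    "CONTRIBUTING FACTOR VEHICLE 1": [
        "Unspecified", "UNSPECIFIED", "Driver Inattention/Distraction",
        "Unspecified",
    ],
})

# Code zero coordinates as missing values.
coords = ["LATITUDE", "LONGITUDE"]
crashes[coords] = crashes[coords].replace(0, np.nan)

# Percentage of missing values in each column.
missing_pct = crashes.isna().mean() * 100

# Frequency table before and after converting to uppercase.
factor = crashes["CONTRIBUTING FACTOR VEHICLE 1"]
freq_raw = factor.value_counts(dropna=False)
freq_upper = factor.str.upper().value_counts(dropna=False)

print(missing_pct)
print(freq_raw)
print(freq_upper)
```

Comparing `freq_raw` with `freq_upper` shows whether categories that differ only in letter case have been merged.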
Except for the first problem, use the cleaned data set with missing geocode imputed (`data/nyc_crashes_202301_cleaned.csv`).
- Construct a contingency table for missingness in geocode (latitude and longitude) by borough. Is the missing pattern the same across boroughs? Formulate a hypothesis and test it.
- Construct an `hour` variable with integer values from 0 to 23. Plot the histogram of the number of crashes by `hour`. Plot it by borough.
- Overlay the locations of the crashes on a map of NYC. The map could be a static map or Google map.
- Create a new variable `injury` which is one if the number of persons injured is 1 or more; and zero otherwise. Construct a cross table for `injury` versus borough. Test the null hypothesis that the two variables are not associated.
- Merge the crash data with the zip code database.
- Fit a logistic model with `injury` as the outcome variable and covariates that are available in the data or can be engineered from the data. For example, zip code level covariates can be obtained by merging with the zip code database.
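
The `hour` and `injury` constructions and the test of association might look like the following sketch. It makes several assumptions: the column names `CRASH TIME` and `NUMBER OF PERSONS INJURED` and the `H:MM` time format are guesses about the data layout, and the rows are toy values made up for illustration. The chi-squared test from SciPy is one reasonable choice here, not the only one.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy rows made up for illustration; column names are assumptions.
crashes = pd.DataFrame({
    "CRASH TIME": ["0:05", "8:30", "8:45", "17:10", "23:59", "17:40"],
    "BOROUGH": ["BRONX", "QUEENS", "BRONX", "QUEENS", "BRONX", "QUEENS"],
    "NUMBER OF PERSONS INJURED": [0, 2, 0, 1, 0, 3],
})

# Integer hour from 0 to 23, taken from the part before the colon.
crashes["hour"] = crashes["CRASH TIME"].str.split(":").str[0].astype(int)

# injury is 1 if at least one person was injured, and 0 otherwise.
crashes["injury"] = (crashes["NUMBER OF PERSONS INJURED"] >= 1).astype(int)

# Cross table of injury versus borough, then a chi-squared test of
# independence (null hypothesis: the two variables are not associated).
table = pd.crosstab(crashes["injury"], crashes["BOROUGH"])
chi2, p_value, dof, expected = chi2_contingency(table.to_numpy())

print(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}")
```

With only six toy rows the expected counts are far too small for the chi-squared approximation to be trustworthy, so the printed p-value is for demonstration only.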
Using the cleaned NYC crash data, perform classification of `injury` with support vector machine and compare the results with the benchmark from regularized logistic regression. Use the last week’s data as testing data.
- Explain the parameters you used in your fitting for each method.
- Explain the confusion matrix result from each fit.
- Compare the performance of the two approaches in terms of accuracy, precision, recall, F1-score, and AUC.
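
The comparison workflow can be sketched with scikit-learn as below. Everything about the data here is an assumption: the features and labels are synthetic, generated by `make_classification` as a stand-in for the crash covariates and `injury`, and holding out the final rows only mimics the "last week as testing data" split. The values `C=1.0` and the RBF kernel are placeholder settings, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the crash covariates and the injury label.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)

# Mimic a time-based split by holding out the final rows for testing.
X_train, X_test = X[:450], X[450:]
y_train, y_test = y[:450], y[450:]

models = {
    # LogisticRegression applies an L2 penalty by default; C is the
    # inverse regularization strength.
    "logistic": make_pipeline(StandardScaler(),
                              LogisticRegression(C=1.0, max_iter=1000)),
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Continuous decision scores are needed for the AUC.
    score = model.decision_function(X_test)
    results[name] = {
        "confusion": confusion_matrix(y_test, pred),
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "auc": roc_auc_score(y_test, score),
    }

for name, metrics in results.items():
    print(name, {k: v for k, v in metrics.items() if k != "confusion"})
```

Putting the scaler inside each pipeline keeps the test rows out of the scaling step, so both methods are evaluated on the same footing.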
(Mid-term team project) The NYC Open Data of 311 Service Requests contains all requests from 2010 to present. We consider a subset of it with request time between 00:00:00 01/15/2023 and 24:00:00 01/21/2023. The subset is available in CSV format as `data/nyc311_011523-012123_by022023.csv`. Read the data dictionary to understand the meaning of the variables.
- Clean the data: fill missing fields as much as possible; check for obvious data entry errors (e.g., can `Closed Date` be earlier than `Created Date`?); summarize your suggestions to the data curator in several bullet points.
- Remove requests that are not made to NYPD and create a new variable `duration`, which represents the time period from the `Created Date` to `Closed Date`. Note that `duration` may be censored for some requests. Visualize the distribution of uncensored `duration` by weekdays/weekend and by borough, and test whether the distributions are the same across weekdays/weekends of their creation and across boroughs.
- Define a binary variable `over3h` which is 1 if `duration` is greater than 3 hours. Note that it can be obtained even for censored `duration`. Build a model to predict `over3h`. If your model has tuning parameters, justify their choices. Apply this model to the 311 requests of NYPD in the week of 01/22/2023. Assess the performance of your model.
- Now you know the data quite well. Come up with a research question of interest that can be answered by the data, which could be analytics or visualizations. Perform the needed analyses and answer your question.
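
To illustrate how `duration` and `over3h` can be built when some requests are still open, here is a minimal sketch. It relies on several assumptions: the rows are toy values made up for illustration, the `Agency` column name and the `MM/DD/YYYY hh:mm:ss AM/PM` timestamp format are guesses about the file layout, and the extraction time of 02/20/2023 is inferred from the file name rather than stated anywhere.

```python
import pandas as pd

# Toy rows made up for illustration; not taken from the 311 data.
requests = pd.DataFrame({
    "Agency": ["NYPD", "NYPD", "DOT", "NYPD"],
    "Created Date": ["01/15/2023 08:00:00 AM", "01/16/2023 09:00:00 PM",
                     "01/17/2023 10:00:00 AM", "01/21/2023 11:00:00 PM"],
    "Closed Date": ["01/15/2023 09:30:00 AM", "01/17/2023 02:00:00 AM",
                    "01/17/2023 11:00:00 AM", None],
})

# Assumed timestamp format; adjust it to match the actual file.
fmt = "%m/%d/%Y %I:%M:%S %p"
for col in ["Created Date", "Closed Date"]:
    requests[col] = pd.to_datetime(requests[col], format=fmt)

# Keep only requests made to NYPD.
nypd = requests[requests["Agency"] == "NYPD"].copy()

# Assumed extraction time, used as the censoring time for open requests.
extracted = pd.Timestamp("2023-02-20")
nypd["censored"] = nypd["Closed Date"].isna()
end = nypd["Closed Date"].fillna(extracted)

# Duration in hours; for censored rows this is only a lower bound.
nypd["duration"] = (end - nypd["Created Date"]) / pd.Timedelta(hours=1)

# A lower bound above 3 hours already determines over3h = 1. A censored
# lower bound of 3 hours or less would leave over3h undetermined.
nypd["over3h"] = (nypd["duration"] > 3).astype(int)

print(nypd[["Created Date", "Closed Date", "censored", "duration", "over3h"]])
```

Because the assumed extraction time falls weeks after the last creation time in the subset, every open request in this sketch has a lower bound well above 3 hours.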