11 Exercises
Setting up the computing environment Using the right computing tools and environment is a prerequisite for data science projects. Set up your computer for this course with the following steps. For each step, document what you did, the obstacles you encountered, and how you overcame them. If you used AI, document your prompts. Note that the steps you take may depend on your computer’s operating system. Think of this as a user manual for students who are new to the process. Use the command line interface.
- Install R.
- Install Positron or RStudio.
- Install Quarto.
- Set up SSH authentication between your computer and your GitHub account.
- Render your homework into an HTML file.
- Print the HTML file to a PDF file and include it in the release of this homework assignment.
Getting familiar with the command line interface The command line interface (CLI) is widely used among computing professionals, even though graphical user interfaces (GUI) are more common for everyday users. Be clear and concise in your explanations. Provide short examples if you think they will help illustrate your points.
- Research and explain why many professionals prefer using the CLI over a GUI. Consider aspects such as efficiency, automation, reproducibility, or remote access.
- Identify your top five favorite commands in the Unix/Linux shell. For each command, explain what it does. If there is an option/flag you found particularly useful, describe it.
- Identify your top five favorite Git commands. For each command, explain what it does. If there is an option/flag you found particularly useful, describe it.
Do women promote different policies than men? Chattopadhyay & Duflo (2004) studied whether female policymakers make different choices than male policymakers by exploiting a unique natural experiment in India. India’s 1993 constitutional amendment required that one-third of village council head positions (pradhans) be randomly reserved for women, creating a setting where the assignment of female leaders was exogenous. In the data here, villages were randomly assigned to have a female council head. The dataset is available in the class notes source as `data/india.csv`. As shown in Table 11.1, the dataset contains four variables.
Table 11.1: Variables in india.csv

| variable | description |
|----------|-------------|
| village | village identifier (“Gram Panchayat number _ village number”) |
| female | whether village was assigned a female politician: 1 = yes, 0 = no |
| water | number of new (or repaired) drinking water facilities in the village since random assignment |
| irrigation | number of new (or repaired) irrigation facilities in the village since random assignment |

- Using the correct R code, set your working directory.
- Load the `tidyverse` package (a starter sketch for these first steps follows this list).
- Load the data as a `data.frame` and assign the name `india` to it.
- Utilizing either `head()` or `glimpse()`, view the first few rows of the dataset. Substantively describe what these functions do.
- What does each observation in this dataset represent?
- Substantively interpret the first observation in the dataset.
- For each variable in the dataset, identify the type of variable (character, numeric binary, or numeric non-binary).
- How many observations are in the dataset? In other words, how many villages were part of this experiment? Additionally, provide a substantive answer.
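A minimal sketch of the first few steps, assuming the working directory is set so that `data/india.csv` resolves correctly:

```r
# Load packages and read the data as a data.frame (one row per village)
library(tidyverse)

india <- read.csv("data/india.csv")

glimpse(india)  # columns: village, female, water, irrigation
nrow(india)     # number of villages in the experiment
```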
Land of opportunity in the US Chetty et al. (2014) show that children’s chances of rising out of poverty in the U.S. vary sharply across commuting zones, with higher mobility in areas with less segregation, less inequality, stronger schools, more two-parent families, and greater social capital. The data and its variable dictionary are available as `data/Chetty_2014.csv` and `data/Chetty_2014_dict.csv`. Here we look into a subset of the data, exploring the relationship between economic mobility and CZ characteristics: household income per capita (`hhi_percap`). The mobility measure that you will use in this analysis captures the probability that a child born to a parent in quintile 1 moves to income quintile 5 as an adult (`prob_q1q5`).

- Read and filter the data to keep only the 100 largest commuting zones (CZ). You will work with this filtered data, `chetty_top100`, throughout this lab (a starter sketch follows this list).
- Make a scatterplot with household income per capita (`hhi_percap`) on the x-axis and mobility (`abs_mobility`) on the y-axis. (A) Describe the graph: what is the approximate range of the x-axis? (B) What is the approximate range of the y-axis? (C) Do you think there is a relationship between these two variables?
- Add color to represent the geographic region (`region`) on your scatterplot. (A) What patterns does this reveal? (B) Describe the distribution of the data, by region.
- Represent geographic region (`region`) on your scatterplot using shape instead of color. Compare the use of color vs. shape to represent the region: what are the benefits and drawbacks of each?
- Going back to the graph you just made, which uses color to represent the geographic region, add another aesthetic to represent the size of the population (`pop_2000`, population from the 2000 Census). Describe any relationships between size and region.
- Split your plot into facets to display scatterplots of your data by region. (A) Compare this split plot to the combined plot earlier. Are there aspects of the relationship between `hhi_percap` and mobility that are easier to detect in the faceted plot than in the combined plot? (B) Which regions appear to have a relatively stronger relationship between `hhi_percap` and mobility?
- Add information on the census division (`division`) to your graph using the color aesthetic. (A) What does this reveal about divisional differences in the West?
- Create a plot of the relationship between `hhi_percap` and `abs_mobility` with two layers: (1) a scatterplot colored by region, and (2) a smooth fit with no standard error, also colored by region. (A) What patterns does this illustrate in the data?
- Create a bar graph that displays the count of CZs by region and fill each bar using information on census division. What do you learn from this graph? (A) Make a new bar graph with position `dodge`. (B) Make a new bar graph with position `fill`.
- What is the relative advantage of each of the three bar graphs?
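A possible sketch of the filtering step and the first scatterplot; it assumes `pop_2000` is the appropriate measure of CZ size for picking the 100 largest CZs (check `data/Chetty_2014_dict.csv`):

```r
library(tidyverse)

chetty <- read_csv("data/Chetty_2014.csv")

# Keep the 100 largest CZs, assuming pop_2000 measures CZ population
chetty_top100 <- chetty |>
  slice_max(pop_2000, n = 100)

# Scatterplot of mobility against household income per capita
ggplot(chetty_top100, aes(x = hhi_percap, y = abs_mobility)) +
  geom_point()
```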
Connecticut schools Data on schools from the Common Core of Data (CCD) are collected by the National Center for Education Statistics. Information about the CCD can be found here. We are working with a subset of the 2013-14 dataset, `data/ct_schools`, in the classnotes repo. A variable codebook for the CCD, which explains what each column in the data file represents, is located in the flat file in the record layout column and 2013-14 row.

- Load the appropriate packages and load the data.
- Data manipulation.
  - Make a new variable `sch_type` that has the value `Charter`, `Magnet`, or `TPS`, to specify whether the school is a charter, magnet, or traditional public school (a starter sketch follows this exercise).
  - How many schools of each of these three types are in our area?
  - Make a table that shows the number of schools of each of the three school types that are missing data for free lunch.
- Descriptives.
  - How many schools are elementary schools? Middle schools? High schools? Other? Missing?
  - How many schools are eligible for Title I status? Summarize the count of schools in each category.
- Racial composition.
  - Create new variables that compute the percentage of students who are Black, White, Hispanic, Asian, or another race (call this variable “other”), and the percentage of students receiving free OR reduced-price lunch.
  - Visualize these data to see the variability of each of these variables (Hint: use `geom_histogram`).
  - Visualize the variation in the percent of students receiving free lunch (`frelch`) for magnet schools, TPS, and charter schools (Hint: use `geom_boxplot`).
  - Visualize the variation in the percent of students of each race and ethnicity in the data file (Black, White, Hispanic, Asian, and Other) for each of the types of schools (magnet, TPS, and charter).
- With more R skills, what types of questions could you answer using this dataset?
- What are some questions you have about this dataset? What information would you add to the dataset, if you could?
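A minimal sketch of the `sch_type` step, assuming the data are already loaded as `ct_schools`; the flag names `chartr` and `magnet` are hypothetical placeholders, so substitute the actual column names from the CCD record layout:

```r
library(tidyverse)

# chartr and magnet are hypothetical charter/magnet flags (1 = yes)
ct_schools <- ct_schools |>
  mutate(sch_type = case_when(
    chartr == 1 ~ "Charter",
    magnet == 1 ~ "Magnet",
    TRUE        ~ "TPS"
  ))

# Number of schools of each type
ct_schools |> count(sch_type)
```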
Early care and education, NC, 2007–2014 The data on early care and education were carefully collected by Scott Latham and colleagues at the Stanford Center for Educational Policy Analysis, in support of a recent publication, available here. Additional background on North Carolina’s Quality Rating and Improvement System (QRIS) can be found here. We have a copy in the classnotes repo:
`data/NC_ECE_2007-2014.csv`. You can download it into your homework repo directory, but please do not commit it to the repo; it’s over 11 MB. Instead of a formal online codebook, Professor Latham provided the following description:
“We have a panel of all licensed child care providers in North Carolina from 2007 to 2014. This includes both center-based providers and family child care homes. For all providers, we have county identifiers and zip codes. We also know facility types (e.g., independent, Head Start, local public school), enrollment, capacity, and some zip-code-level demographics (e.g., percentage below poverty, percentage Black, percentage Hispanic). For most providers, we have information on quality as measured by North Carolina’s Quality Rating and Improvement System (QRIS). However, many of these indicators are not readily interpretable because they are tied to the QRIS rubric (e.g., a 1–7 measure of teacher/staff education and credentials). The one quality measure that is relatively straightforward to interpret is the ERS rating—a widely used measure of observed classroom quality. These are elective, so we only have them for a subset of providers.”
- Load the needed packages and the data. Take a glimpse of the data.
- Make a new data frame called `nc14` that contains all NC facilities in the year 2014, with the variables `fname`, `zip`, `ftype`, `p_pov`, `med_income`, `QRIS_ERS`, and a new variable `p_cap = enroll/capacity`. You will use this data frame in the remaining questions (a starter sketch follows this list).
- Make a new data frame that shows the number of facilities evaluated in each zip code in 2014, and the value of `p_pov` for each zip code.
  - Sort in decreasing order of the number of facilities.
  - Make a plot from the data frame you made to demonstrate variation in percent poverty across zip codes in NC.
  - Explain why your plot is different from a plot that demonstrates variation in `p_pov` in the original data frame.
- Add a new variable to `nc14` that lumps together various facility types into five groups, based on the value of `ftype`, as follows: Independent; Franchise; Religious sponsored; Federal, Head Start, or Local public school; All others.
  - Make a plot to show the variation in `p_cap` across these five groups.
- Make a boxplot to visualize covariation between `QRIS_ERS` and `med_income`, binning `med_income` to treat it as a categorical variable, and representing the number of observations in each category by the width of the corresponding box.
- Visualize the relationship between `med_income` and `QRIS_ERS` using a scatter plot.
  - Visualize the same relationship using `geom_bin2d`.
  - Discuss the pros and cons of each of these visualizations.
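A possible sketch of the `nc14` step and the zip-code summary, assuming the full panel is loaded as `nc_ece` and that the year column is named `year` (verify with `glimpse()`):

```r
library(tidyverse)

nc14 <- nc_ece |>
  filter(year == 2014) |>               # "year" is an assumed column name
  mutate(p_cap = enroll / capacity) |>
  select(fname, zip, ftype, p_pov, med_income, QRIS_ERS, p_cap)

# Facilities per zip code (p_pov is a zip-level measure), sorted by count
nc14 |>
  count(zip, p_pov, name = "n_facilities", sort = TRUE)
```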
Quality of life, NC This project was created by the UNCC Urban Institute in collaboration with the city of Charlotte, Mecklenburg County, and nearby townships. This dataset describes how various aspects of Charlotte residents’ “quality of life” varies according to neighborhood. You can explore and learn more about this project here. The data is
`data/qol_PS3.csv`. We explore how adolescent birth rate and income are related at the neighborhood level in Mecklenburg County.

- Load the appropriate packages and load the data.
- Visualize the covariation between the rate of adolescent births and median household income. What do you observe about the relationship between these variables?
- Create a boxplot of the rate of adolescent birth by household income quartile. Use the features of the boxplot (i.e., the bar, box, whiskers, and dots) to interpret your findings and discuss the pros and cons of this visualization, including breaking income up by quartile (a starter sketch for this and the previous plot follows this list).
- Use the `cut_width` strategy for binning the household income variable. Discuss the pros and cons of this visualization.
- Create a visualization that compares the distributions of the rate of adolescent birth for low-income neighborhoods (with a median HHI of 24257), and high-income neighborhoods (with a median HHI of 48514).
- Visualize the combined distribution of adolescent births, household income, and access to adequate prenatal care. Note that you may want to bin one or more of these variables to visualize these relationships, and you will need to find a way to represent a third variable.
- Describe why you decided to assign variables to the aesthetics you chose. How did these choices help you tell a compelling story with this visualization?
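A sketch of the first two plots; `med_hhi` and `adol_birth_rate` are hypothetical column names, so substitute the actual names used in `data/qol_PS3.csv`:

```r
library(tidyverse)

qol <- read_csv("data/qol_PS3.csv")

# Covariation between adolescent birth rate and median household income
ggplot(qol, aes(x = med_hhi, y = adol_birth_rate)) +
  geom_point()

# Rate of adolescent births by household income quartile
qol |>
  mutate(income_quartile = ntile(med_hhi, 4)) |>
  ggplot(aes(x = factor(income_quartile), y = adol_birth_rate)) +
  geom_boxplot()
```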
Quality of life, NC (continued) Conduct your own exploratory data analysis (EDA). Your EDA will be graded on your choice of an appropriate research question, the clarity of your visualizations, and how well your visualizations allow you to recognize patterns in your data and provide insight into the question you have posed.
- Pose and answer at minimum four iterative, related EDA questions, using a minimum of 3 different variables (as in the last exercise). Write as though you were guiding future students through an exploration of the data, along with an answer key. Be sure to clearly state your EDA question, explain why you chose the visualizations that you did, and describe your findings.
- What were the easiest and hardest parts of this assignment?
- With more R skills, what types of questions could you answer using this dataset?
- What are some questions you have about this dataset? What information would you add to the dataset, if you could?
Role of colleges in economic mobility This exercise is based on a data set from Chetty et al., the same author team as the economic mobility data we have worked with, but a different paper and different data that examines the role of colleges in economic mobility; see
`data/mrc.csv`. If you wish, you can read about this study here.

- Load the appropriate packages and the data, and take a glimpse of the data.
- Get to know the data.
  - What does each row in this data set represent?
  - Make a table of the number of colleges in each state, sorted by number.
  - Explain why there are more than 50 rows in the table you just produced even though there are only 50 states, and write code to show what is in these extra rows.
  - For the four CZs in New York with the most colleges, output a table of the CZ name, the number of colleges in the CZ, and the average value of `par_median` in the CZ, sorted by the number of colleges in the CZ.
- Compare parent and child median income.
  - Make a new variable called `pk_ratio` that is the ratio of parent median income to child median income, and another new variable to indicate whether this ratio is low (<= 2), medium (between 2 and 3), or high (>= 3) (a starter sketch follows this exercise).
  - Graphically display the number of colleges with low, medium, and high parent-to-child median income ratio by CZ in New York.
- Visualize income mobility.
  - Graphically display the number of colleges in each state.
  - Graphically display the distribution of `mr_kq5_pq1`.
  - Visualize the variability in `mr_kq5_pq1` by CZ name in New York and describe your visualization.
  - Visualize the relationship between `pk_ratio`, `mr_kq5_pq1`, and `trend_parq1` using `geom_boxplot`. Describe your visualization.
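A possible sketch of the state counts and the `pk_ratio` variables, assuming columns named `state`, `par_median`, and `k_median` (child median income); check the actual names with `glimpse(mrc)`:

```r
library(tidyverse)

mrc <- read_csv("data/mrc.csv")

# Number of colleges in each state, sorted by count
mrc |> count(state, sort = TRUE)

# Parent-to-child median income ratio, grouped into low / medium / high
mrc <- mrc |>
  mutate(
    pk_ratio = par_median / k_median,   # k_median is an assumed column name
    pk_group = case_when(
      pk_ratio <= 2 ~ "low",
      pk_ratio < 3  ~ "medium",
      TRUE          ~ "high"
    )
  )
```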
NYC crashes in the week of Labor Day, 2025 The data come from the NYC Open Data portal: Motor Vehicle Collisions–Crashes. This dataset contains information reported to the NYPD for motor vehicle crashes, available as
`data/nyc_crashes_lbdwk_2025.csv`. For this exercise, a subset covering Labor Day week 2025 was downloaded. It includes 29 variables, which are explained at the portal. The raw data contain missing values in several fields.

- Importing the data.
  - Read the data into R and take a glimpse.
  - The variable names are all uppercase, with spaces replaced by periods when imported. Rename the variables to use lowercase, with underscores in place of the periods (a starter sketch follows this exercise).
  - Are all the observations really in our target time frame? If not, filter out those that are not and use the filtered data for the rest of the exercise.
  - Do you need the `location` variable in the data? What suggestion would you give to the data curator?
- Missing values.
  - Are there borough values that should be coded as `NA`? If so, recode them.
  - Are there unreasonable geocodes (latitude/longitude) that should be coded as `NA`? If so, recode them.
  - How many records have the geocode missing?
  - How many records have both the zip code and the borough missing?
  - Create a logical variable `fillable_zip`, which is `TRUE` when the geocode is not missing but the zip code is missing. Compare the rate of fillable zip codes across the seven days.
- Data exploration.
  - Create a variable `hour` to store the hour in which a collision occurs.
  - Plot the number of crashes by hour by borough. Summarize the patterns in the figure (e.g., are there more crashes during rush hours?).
  - How many crashes occurred at exactly midnight? How about other whole hours? Is this a matter of bad luck when the clock strikes midnight?
  - Create a logical variable `business_day`, which is `TRUE` if the day is a business day and `FALSE` otherwise.
  - Plot the number of crashes by business day by borough. Summarize your observations.
- Severity analysis.
  - Create a logical variable `severe`, which is `TRUE` if a crash involved at least one person injured or one person killed.
  - Create a count variable `n_vehicles` to store the number of vehicles involved in each crash.
  - Create a contingency table of `severe` and `n_vehicles`. Summarize your observations.
  - Create a contingency table of `severe` and `hour`. Summarize your observations.
  - Identify the top 10 severe crashes. Are some of them geographically clustered?
- Contributing factors.
  - Create a frequency table of the contributing factors for vehicle 1.
  - Are there contributing factors that differ only in case? If so, convert uppercase to lowercase.
  - What are the top five contributing factors?
  - Should any values in the frequency table be considered `NA`?
  - What are the most frequent vehicle categories for vehicle 1 and vehicle 2? Comment on the results.
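A possible sketch of the renaming and `hour` steps; it assumes the time field imports as `CRASH.TIME` (so `crash_time` after renaming) in an "HH:MM" format, which should be verified with `glimpse()`:

```r
library(tidyverse)
library(lubridate)

crashes <- read.csv("data/nyc_crashes_lbdwk_2025.csv")

# Lowercase names with underscores in place of the periods
crashes <- crashes |>
  rename_with(~ str_replace_all(str_to_lower(.x), "\\.", "_"))

# Hour of the crash, parsed from crash_time (e.g., "14:35")
crashes <- crashes |>
  mutate(hour = hour(hm(crash_time)))
```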
Covid Cases Map Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) COVID-19 Data Repository provides daily global reports of confirmed, death, and recovery counts for each country and subnational region, along with geographic coordinates and derived indicators such as incident rate and case fatality ratio. The dataset is publicly available on GitHub, where detailed field definitions and documentation can be found in the accompanying README file.
- Download the data. Using the `download.file()` function in R, write code to download the daily report for April 14, 2021, save it as a local CSV file named `COVID19_04-14-2021.csv`, and load it into R for inspection. Display the first few rows of the data frame (a sketch of this step follows the list).
- The shapes of counties in North Carolina are stored in `data/nc_counties.RData` in the class notes repo. Use it to make a map of the cumulative number of confirmed cases, per capita, in each county in North Carolina.
- Download the daily report for April 13, 2021, read it into R, and filter to North Carolina. Create a variable that represents the number of confirmed cases that occurred between April 13 and April 14. Lastly, compute the confirmed cases per capita and visualize them in a map.
- The shapes of states in the US are stored in `data/us_geom.RData` in the classnotes repo. Make a map of the cumulative number of confirmed cases per capita in each state in the United States.
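A sketch of the download step; the URL follows the JHU CSSE daily-reports folder layout on GitHub, which should be verified against the repository README:

```r
# Daily report for April 14, 2021 (path pattern assumed from the JHU CSSE repo)
url <- paste0(
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/",
  "csse_covid_19_data/csse_covid_19_daily_reports/04-14-2021.csv"
)
download.file(url, destfile = "COVID19_04-14-2021.csv")

covid <- read.csv("COVID19_04-14-2021.csv")
head(covid)
```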