11 Exercises
Setting up the computing environment Using the right computing tools and environment is a prerequisite for data science projects. Set up your computer for this course with the following steps. For each step, document what you did, the obstacles you encountered, and how you overcame them. If you used AI, document your prompts. Note that the steps you take may depend on your computer’s operating system. Think of this as a user manual for students who are new to the process. Use the command line interface.
- Install R.
- Install Positron or RStudio.
- Install Quarto.
- Set up SSH authentication between your computer and your GitHub account.
- Render your homework into an HTML file.
- Print the HTML file to a PDF file and include it in the release of this homework assignment.
Getting familiar with the command line interface The command line interface (CLI) is widely used among computing professionals, even though graphical user interfaces (GUI) are more common for everyday users. Be clear and concise in your explanations. Provide short examples if you think they will help illustrate your points.
- Research and explain why many professionals prefer using the CLI over a GUI. Consider aspects such as efficiency, automation, reproducibility, or remote access.
- Identify your top five favorite commands in the Unix/Linux shell. For each command, explain what it does. If there is an option/flag you found particularly useful, describe it.
- Identify your top five favorite Git commands. For each command, explain what it does. If there is an option/flag you found particularly useful, describe it.
Do women promote different policies than men? Chattopadhyay & Duflo (2004) studied whether female policymakers make different choices than male policymakers by exploiting a unique natural experiment in India. India’s 1993 constitutional amendment required that one-third of village council head positions (pradhans) be randomly reserved for women, creating a setting where the assignment of female leaders was exogenous. In the data here, villages were randomly assigned to have a female council head. The dataset is available in the class notes source as `data/india.csv`. As shown in Table 11.1, the dataset contains four variables.
Table 11.1: Variables in india.csv

| variable | description |
|----------|-------------|
| village | village identifier (“Gram Panchayat number _ village number”) |
| female | whether village was assigned a female politician: 1 = yes, 0 = no |
| water | number of new (or repaired) drinking water facilities in the village since random assignment |
| irrigation | number of new (or repaired) irrigation facilities in the village since random assignment |

- Using the correct R code, set your working directory.
- Load the `tidyverse` package (a starter sketch for these first steps follows this list).
- Load the data as a `data.frame` and assign the name `india` to it.
- Utilizing either `head()` or `glimpse()`, view the first few rows of the dataset. Substantively describe what these functions do.
- What does each observation in this dataset represent?
- Substantively interpret the first observation in the dataset.
- For each variable in the dataset, identify the type of variable (character, numeric binary, or numeric non-binary).
- How many observations are in the dataset? In other words, how many villages were part of this experiment? Additionally, provide a substantive answer.
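A minimal sketch of the first few steps, assuming the working directory is set so that `data/india.csv` resolves correctly:

```r
# Load packages and read the data as a data.frame (one row per village)
library(tidyverse)

india <- read.csv("data/india.csv")

glimpse(india)  # columns: village, female, water, irrigation
nrow(india)     # number of villages in the experiment
```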
Land of opportunity in the US Chetty et al. (2014) show that children’s chances of rising out of poverty in the U.S. vary sharply across commuting zones, with higher mobility in areas with less segregation, less inequality, stronger schools, more two-parent families, and greater social capital. The data and its variable dictionary are available as `data/Chetty_2014.csv` and `data/Chetty_2014_dict.csv`. Here we look into a subset of the data, exploring the relationship between economic mobility and CZ characteristics: household income per capita (`hhi_percap`). The mobility measure that you will use in this analysis captures the probability that a child born to a parent in quintile 1 moves to income quintile 5 as an adult (`prob_q1q5`).

- Read and filter the data to keep only the 100 largest commuting zones (CZ). You will work with this filtered data, `chetty_top100`, throughout this lab (a starter sketch follows this list).
- Make a scatterplot with household income per capita (`hhi_percap`) on the x-axis and mobility (`abs_mobility`) on the y-axis. (A) Describe the graph: what is the approximate range of the x-axis? (B) What is the approximate range of the y-axis? (C) Do you think there is a relationship between these two variables?
- Add color to represent the geographic region (`region`) on your scatterplot. (A) What patterns does this reveal? (B) Describe the distribution of the data, by region.
- Represent geographic region (`region`) on your scatterplot using shape instead of color. Compare the use of color vs. shape to represent the region: what are the benefits and drawbacks of each?
- Going back to the graph you just made, which uses color to represent the geographic region, add another aesthetic to represent the size of the population (`pop_2000`, population from the 2000 Census). Describe any relationships between size and region.
- Split your plot into facets to display scatterplots of your data by region. (A) Compare this split plot to the combined plot earlier. Are there aspects of the relationship between `hhi_percap` and mobility that are easier to detect in the faceted plot than in the combined plot? (B) Which regions appear to have a relatively stronger relationship between `hhi_percap` and mobility?
- Add information on the census division (`division`) to your graph using the color aesthetic. (A) What does this reveal about divisional differences in the West?
- Create a plot of the relationship between `hhi_percap` and `abs_mobility` with two layers: (1) a scatterplot colored by region, and (2) a smooth fit with no standard error, also colored by region. (A) What patterns does this illustrate in the data?
- Create a bar graph that displays the count of CZs by region and fill each bar using information on census division. What do you learn from this graph? (A) Make a new bar graph with position `dodge`. (B) Make a new bar graph with position `fill`.
- What is the relative advantage of each of the three bar graphs?
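A possible sketch of the filtering step and the first scatterplot; it assumes `pop_2000` is the appropriate measure of CZ size for picking the 100 largest CZs (check `data/Chetty_2014_dict.csv`):

```r
library(tidyverse)

chetty <- read_csv("data/Chetty_2014.csv")

# Keep the 100 largest CZs, assuming pop_2000 measures CZ population
chetty_top100 <- chetty |>
  slice_max(pop_2000, n = 100)

# Scatterplot of mobility against household income per capita
ggplot(chetty_top100, aes(x = hhi_percap, y = abs_mobility)) +
  geom_point()
```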
Connecticut schools Data on schools from the Common Core of Data (CCD) are collected by the National Center for Education Statistics. Information about the CCD can be found here. We are working with a subset of the 2013-14 dataset, `data/ct_schools`, in the classnotes repo. A variable codebook for the CCD, which explains what each column in the data file represents, is located in the flat file in the record layout column and 2013-14 row.

- Load the appropriate packages and load the data.
- Data manipulation.
  - Make a new variable `sch_type` that has the value `Charter`, `Magnet`, or `TPS`, to specify whether the school is a charter, magnet, or traditional public school (a starter sketch follows this exercise).
  - How many schools of each of these three types are in our area?
  - Make a table that shows the number of schools of each of the three school types that are missing data for free lunch.
- Descriptives.
  - How many schools are elementary schools? Middle schools? High schools? Other? Missing?
  - How many schools are eligible for Title I status? Summarize the count of schools in each category.
- Racial composition.
  - Create new variables that compute the percentage of students who are Black, White, Hispanic, Asian, or another race (call this variable “other”), and the percentage of students receiving free OR reduced-price lunch.
  - Visualize these data to see the variability of each of these variables (Hint: use `geom_histogram`).
  - Visualize the variation in the percent of students receiving free lunch (`frelch`) for magnet schools, TPS, and charter schools (Hint: use `geom_boxplot`).
  - Visualize the variation in the percent of students of each race and ethnicity in the data file (Black, White, Hispanic, Asian, and Other) for each of the types of schools (magnet, TPS, and charter).
- With more R skills, what types of questions could you answer using this dataset?
- What are some questions you have about this dataset? What information would you add to the dataset, if you could?
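A minimal sketch of the `sch_type` step, assuming the data are already loaded as `ct_schools`; the flag names `chartr` and `magnet` are hypothetical placeholders, so substitute the actual column names from the CCD record layout:

```r
library(tidyverse)

# chartr and magnet are hypothetical charter/magnet flags (1 = yes)
ct_schools <- ct_schools |>
  mutate(sch_type = case_when(
    chartr == 1 ~ "Charter",
    magnet == 1 ~ "Magnet",
    TRUE        ~ "TPS"
  ))

# Number of schools of each type
ct_schools |> count(sch_type)
```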
Early care and education, NC, 2007–2014 The data on early care and education were carefully collected by Scott Latham and colleagues at the Stanford Center for Educational Policy Analysis, in support of a recent publication, available here. Additional background on North Carolina’s Quality Rating and Improvement System (QRIS) can be found here. We have a copy in the classnotes repo:
`data/NC_ECE_2007-2014.csv`. You can download it into your homework repo directory, but please do not commit it to the repo; it’s over 11 MB. Instead of a formal online codebook, Professor Latham provided the following description:
“We have a panel of all licensed child care providers in North Carolina from 2007 to 2014. This includes both center-based providers and family child care homes. For all providers, we have county identifiers and zip codes. We also know facility types (e.g., independent, Head Start, local public school), enrollment, capacity, and some zip-code-level demographics (e.g., percentage below poverty, percentage Black, percentage Hispanic). For most providers, we have information on quality as measured by North Carolina’s Quality Rating and Improvement System (QRIS). However, many of these indicators are not readily interpretable because they are tied to the QRIS rubric (e.g., a 1–7 measure of teacher/staff education and credentials). The one quality measure that is relatively straightforward to interpret is the ERS rating—a widely used measure of observed classroom quality. These are elective, so we only have them for a subset of providers.”
- Load the needed packages and the data. Take a glimpse of the data.
- Make a new data frame called `nc14` that contains all NC facilities in the year 2014, with the variables `fname`, `zip`, `ftype`, `p_pov`, `med_income`, `QRIS_ERS`, and a new variable `p_cap = enroll/capacity`. You will use this data frame in the remaining questions (a starter sketch follows this list).
- Make a new data frame that shows the number of facilities evaluated in each zip code in 2014, and the value of `p_pov` for each zip code.
  - Sort in decreasing order of the number of facilities.
  - Make a plot from the data frame you made to demonstrate variation in percent poverty across zip codes in NC.
  - Explain why your plot is different from a plot that demonstrates variation in `p_pov` in the original data frame.
- Add a new variable to `nc14` that lumps together various facility types into five groups, based on the value of `ftype`, as follows: Independent; Franchise; Religious sponsored; Federal, Head Start, or Local public school; All others.
  - Make a plot to show the variation in `p_cap` across these five groups.
- Make a boxplot to visualize covariation between `QRIS_ERS` and `med_income`, binning `med_income` to treat it as a categorical variable, and representing the number of observations in each category by the width of the corresponding box.
- Visualize the relationship between `med_income` and `QRIS_ERS` using a scatter plot.
  - Visualize the same relationship using `geom_bin2d`.
  - Discuss the pros and cons of each of these visualizations.
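A possible sketch of the `nc14` step and the zip-code summary, assuming the full panel is loaded as `nc_ece` and that the year column is named `year` (verify with `glimpse()`):

```r
library(tidyverse)

nc14 <- nc_ece |>
  filter(year == 2014) |>               # "year" is an assumed column name
  mutate(p_cap = enroll / capacity) |>
  select(fname, zip, ftype, p_pov, med_income, QRIS_ERS, p_cap)

# Facilities per zip code (p_pov is a zip-level measure), sorted by count
nc14 |>
  count(zip, p_pov, name = "n_facilities", sort = TRUE)
```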
Quality of life, NC This project was created by the UNCC Urban Institute in collaboration with the city of Charlotte, Mecklenburg County, and nearby townships. This dataset describes how various aspects of Charlotte residents’ “quality of life” varies according to neighborhood. You can explore and learn more about this project here. The data is
`data/qol_PS3.csv`. We explore how adolescent birth rate and income are related at the neighborhood level in Mecklenburg County.

- Load the appropriate packages and load the data.
- Visualize the covariation between the rate of adolescent births and median household income. What do you observe about the relationship between these variables?
- Create a boxplot of the rate of adolescent birth by household income quartile. Use the features of the boxplot (i.e., the bar, box, whiskers, and dots) to interpret your findings and discuss the pros and cons of this visualization, including breaking income up by quartile (a starter sketch for this and the previous plot follows this list).
- Use the `cut_width` strategy for binning the household income variable. Discuss the pros and cons of this visualization.
- Create a visualization that compares the distributions of the rate of adolescent birth for low-income neighborhoods (with a median HHI of 24257), and high-income neighborhoods (with a median HHI of 48514).
- Visualize the combined distribution of adolescent births, household income, and access to adequate prenatal care. Note that you may want to bin one or more of these variables to visualize these relationships, and you will need to find a way to represent a third variable.
- Describe why you decided to assign variables to the aesthetics you chose. How did these choices help you tell a compelling story with this visualization?
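A sketch of the first two plots; `med_hhi` and `adol_birth_rate` are hypothetical column names, so substitute the actual names used in `data/qol_PS3.csv`:

```r
library(tidyverse)

qol <- read_csv("data/qol_PS3.csv")

# Covariation between adolescent birth rate and median household income
ggplot(qol, aes(x = med_hhi, y = adol_birth_rate)) +
  geom_point()

# Rate of adolescent births by household income quartile
qol |>
  mutate(income_quartile = ntile(med_hhi, 4)) |>
  ggplot(aes(x = factor(income_quartile), y = adol_birth_rate)) +
  geom_boxplot()
```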
Quality of life, NC (continued) Conduct your own exploratory data analysis (EDA). Your EDA will be graded on your choice of an appropriate research question, the clarity of your visualizations, and how well your visualizations allow you to recognize patterns in your data and provide insight into the question you have posed.
- Pose and answer at minimum four iterative, related EDA questions, using a minimum of 3 different variables (as in the last exercise). Write as though you were guiding future students through an exploration of the data, along with an answer key. Be sure to clearly state your EDA question, explain why you chose the visualizations that you did, and describe your findings.
- What were the easiest and hardest parts of this assignment?
- With more R skills, what types of questions could you answer using this dataset?
- What are some questions you have about this dataset? What information would you add to the dataset, if you could?
Role of colleges in economic mobility This exercise is based on a data set from Chetty et al., the same author team as the economic mobility data we have worked with, but a different paper and different data that examines the role of colleges in economic mobility; see
`data/mrc.csv`. If you wish, you can read about this study here.

- Load the appropriate packages and the data, and take a glimpse of the data.
- Get to know the data.
  - What does each row in this data set represent?
  - Make a table of the number of colleges in each state, sorted by number.
  - Explain why there are more than 50 rows in the table you just produced even though there are only 50 states, and write code to show what is in these extra rows.
  - For the four CZs in New York with the most colleges, output a table of the CZ name, the number of colleges in the CZ, and the average value of `par_median` in the CZ, sorted by the number of colleges in the CZ.
- Compare parent and child median income.
  - Make a new variable called `pk_ratio` that is the ratio of parent median income to child median income, and another new variable to indicate whether this ratio is low (<= 2), medium (between 2 and 3), or high (>= 3) (a starter sketch follows this exercise).
  - Graphically display the number of colleges with low, medium, and high parent-to-child median income ratio by CZ in New York.
- Visualize income mobility.
  - Graphically display the number of colleges in each state.
  - Graphically display the distribution of `mr_kq5_pq1`.
  - Visualize the variability in `mr_kq5_pq1` by CZ name in New York and describe your visualization.
  - Visualize the relationship between `pk_ratio`, `mr_kq5_pq1`, and `trend_parq1` using `geom_boxplot`. Describe your visualization.
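A possible sketch of the state counts and the `pk_ratio` variables, assuming columns named `state`, `par_median`, and `k_median` (child median income); check the actual names with `glimpse(mrc)`:

```r
library(tidyverse)

mrc <- read_csv("data/mrc.csv")

# Number of colleges in each state, sorted by count
mrc |> count(state, sort = TRUE)

# Parent-to-child median income ratio, grouped into low / medium / high
mrc <- mrc |>
  mutate(
    pk_ratio = par_median / k_median,   # k_median is an assumed column name
    pk_group = case_when(
      pk_ratio <= 2 ~ "low",
      pk_ratio < 3  ~ "medium",
      TRUE          ~ "high"
    )
  )
```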
NYC crashes in the week of Labor Day, 2025 The data come from the NYC Open Data portal: Motor Vehicle Collisions–Crashes. This dataset contains information reported to the NYPD for motor vehicle crashes, available as
`data/nyc_crashes_lbdwk_2025.csv`. For this exercise, a subset covering Labor Day week 2025 was downloaded. It includes 29 variables, which are explained at the portal. The raw data contain missing values in several fields.

- Importing the data.
  - Read the data into R and take a glimpse.
  - The variable names are all uppercase, with spaces replaced by periods when imported. Rename the variables to use lowercase, with underscores in place of the periods (a starter sketch follows this exercise).
  - Are all the observations really in our target time frame? If not, filter out those that are not and use the filtered data for the rest of the exercise.
  - Do you need the `location` variable in the data? What suggestion would you give to the data curator?
- Missing values.
  - Are there borough values that should be coded as `NA`? If so, recode them.
  - Are there unreasonable geocodes (latitude/longitude) that should be coded as `NA`? If so, recode them.
  - How many records have the geocode missing?
  - How many records have both the zip code and the borough missing?
  - Create a logical variable `fillable_zip`, which is `TRUE` when the geocode is not missing but the zip code is missing. Compare the rate of fillable zip codes across the seven days.
- Data exploration.
  - Create a variable `hour` to store the hour in which a collision occurs.
  - Plot the number of crashes by hour by borough. Summarize the patterns in the figure (e.g., are there more crashes during rush hours?).
  - How many crashes occurred at exactly midnight? How about other whole hours? Is this a matter of bad luck when the clock strikes midnight?
  - Create a logical variable `business_day`, which is `TRUE` if the day is a business day and `FALSE` otherwise.
  - Plot the number of crashes by business day by borough. Summarize your observations.
- Severity analysis.
  - Create a logical variable `severe`, which is `TRUE` if a crash involved at least one person injured or one person killed.
  - Create a count variable `n_vehicles` to store the number of vehicles involved in each crash.
  - Create a contingency table of `severe` and `n_vehicles`. Summarize your observations.
  - Create a contingency table of `severe` and `hour`. Summarize your observations.
  - Identify the top 10 severe crashes. Are some of them geographically clustered?
- Contributing factors.
  - Create a frequency table of the contributing factors for vehicle 1.
  - Are there contributing factors that differ only in case? If so, convert uppercase to lowercase.
  - What are the top five contributing factors?
  - Should any values in the frequency table be considered `NA`?
  - What are the most frequent vehicle categories for vehicle 1 and vehicle 2? Comment on the results.
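A possible sketch of the renaming and `hour` steps; it assumes the time field imports as `CRASH.TIME` (so `crash_time` after renaming) in an "HH:MM" format, which should be verified with `glimpse()`:

```r
library(tidyverse)
library(lubridate)

crashes <- read.csv("data/nyc_crashes_lbdwk_2025.csv")

# Lowercase names with underscores in place of the periods
crashes <- crashes |>
  rename_with(~ str_replace_all(str_to_lower(.x), "\\.", "_"))

# Hour of the crash, parsed from crash_time (e.g., "14:35")
crashes <- crashes |>
  mutate(hour = hour(hm(crash_time)))
```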
Covid Cases Map Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) COVID-19 Data Repository provides daily global reports of confirmed, death, and recovery counts for each country and subnational region, along with geographic coordinates and derived indicators such as incident rate and case fatality ratio. The dataset is publicly available on GitHub, where detailed field definitions and documentation can be found in the accompanying README file.
- Download the data. Using the `download.file()` function in R, write code to download the daily report for April 14, 2021, save it as a local CSV file named `COVID19_04-14-2021.csv`, and load it into R for inspection. Display the first few rows of the data frame (a sketch of this step follows the list).
- The shapes of counties in North Carolina are stored in `data/nc_counties.RData` in the class notes repo. Use it to make a map of the cumulative number of confirmed cases, per capita, in each county in North Carolina.
- Download the daily report for April 13, 2021, read it into R, and filter to North Carolina. Create a variable that represents the number of confirmed cases that occurred between April 13 and April 14. Lastly, compute the confirmed cases per capita and visualize them in a map.
- The shapes of states in the US are stored in `data/us_geom.RData` in the classnotes repo. Make a map of the cumulative number of confirmed cases per capita in each state in the United States.
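A sketch of the download step; the URL follows the JHU CSSE daily-reports folder layout on GitHub, which should be verified against the repository README:

```r
# Daily report for April 14, 2021 (path pattern assumed from the JHU CSSE repo)
url <- paste0(
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/",
  "csse_covid_19_data/csse_covid_19_daily_reports/04-14-2021.csv"
)
download.file(url, destfile = "COVID19_04-14-2021.csv")

covid <- read.csv("COVID19_04-14-2021.csv")
head(covid)
```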