6 Exercises
Setting up the computing environment. Using the right computing tools and environment is a prerequisite for data science projects. Set up your computer for this course with the following steps, using the command line interface. For each step, document what you did, the obstacles you encountered, and how you overcame them. If you used AI, document your prompts. Note that the steps you take may depend on your computer’s operating system. Think of this as a user manual for students who are new to these tools.
- Install R.
- Install Positron or RStudio.
- Install Quarto.
- Set up SSH authentication between your computer and your GitHub account.
- Render your homework into an HTML file.
- Print the HTML file to a PDF file and include it in the release of this homework assignment.
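Once the installation steps are done, a quick check from an R session can confirm that the tools are visible on your search path. This is only a sketch: the executable names `quarto`, `git`, and `ssh` are the usual ones but are an assumption here and may differ on your system.

```r
# Assumed executable names; adjust if your installation uses different ones.
tools <- c("R", "quarto", "git", "ssh")

# Sys.which() returns the full path of each executable, or "" if not found.
paths <- Sys.which(tools)

status <- data.frame(
  tool  = tools,
  found = nzchar(paths),   # TRUE when a path was returned
  path  = unname(paths)
)
print(status)

# Report the version of R that is currently running.
cat(R.version.string, "\n")
```

A `FALSE` in the `found` column usually means the tool is either not installed or not on the `PATH` seen by R.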
Getting familiar with the command line interface. The command line interface (CLI) is widely used among computing professionals, even though graphical user interfaces (GUIs) are more common for everyday users. Be clear and concise in your explanations. Provide short examples if you think they will help illustrate your points.
- Research and explain why many professionals prefer using the CLI over a GUI. Consider aspects such as efficiency, automation, reproducibility, or remote access.
- Identify your top five favorite commands in the Unix/Linux shell. For each command, explain what it does. If there is an option/flag you found particularly useful, describe it.
- Identify your top five favorite Git commands. For each command, explain what it does. If there is an option/flag you found particularly useful, describe it.
Do women promote different policies than men? Chattopadhyay & Duflo (2004) studied whether female policymakers make different choices than male policymakers by exploiting a unique natural experiment in India. India’s 1993 constitutional amendment required that one-third of village council head positions (pradhans) be randomly reserved for women, creating a setting where the assignment of female leaders was exogenous. In the data here, villages were randomly assigned to have a female council head. The dataset is available in the source of the class notes as `data/india.csv`. As shown in Table 6.1, the dataset contains four variables.
Table 6.1: Variables in india.csv

| variable | description |
|---|---|
| `village` | village identifier (“Gram Panchayat number _ village number”) |
| `female` | whether village was assigned a female politician: 1 = yes, 0 = no |
| `water` | number of new (or repaired) drinking water facilities in the village since random assignment |
| `irrigation` | number of new (or repaired) irrigation facilities in the village since random assignment |

- Using the correct R code, set your working directory.
- Load the `tidyverse` package.
- Load the data as a `data.frame` and assign the name `india` to it.
- Utilizing either `head()` or `glimpse()`, view the first few rows of the dataset. Substantively describe what these functions do.
- What does each observation in this dataset represent?
- Substantively interpret the first observation in the dataset.
- For each variable in the dataset, identify the type of variable (character vs. numeric binary vs. numeric non-binary).
- How many observations are in the dataset? In other words, how many villages were part of this experiment? Additionally, provide a substantive answer.
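The loading and inspection steps above might look roughly like the following sketch. To keep it runnable without the course files, it writes a tiny made-up table with the columns from Table 6.1 to a temporary file and reads that back; the values are invented, and in the exercise you would read `data/india.csv` instead.

```r
library(tidyverse)

# Hypothetical toy rows with the columns of Table 6.1 (values are made up).
toy <- tibble(
  village    = c("1_1", "1_2", "2_1"),
  female     = c(1, 1, 0),
  water      = c(10, 0, 2),
  irrigation = c(0, 5, 2)
)
tmp <- tempfile(fileext = ".csv")
write_csv(toy, tmp)

# In the exercise, replace `tmp` with "data/india.csv".
india <- as.data.frame(read_csv(tmp, show_col_types = FALSE))

head(india, 2)        # first rows, printed as a table
glimpse(india)        # one line per column: name, type, first values
nrow(india)           # number of observations (villages)
sapply(india, class)  # storage type of each variable
```

`head()` and `glimpse()` show the same rows from different angles: the first prints them row by row, the second transposes the view so that every column and its type fit on screen.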
Land of Opportunity in the US. Chetty et al. (2014) show that children’s chances of rising out of poverty in the U.S. vary sharply across commuting zones, with higher mobility in areas with less segregation, less inequality, stronger schools, more two-parent families, and greater social capital. The data and the variable dictionary are available as `data/Chetty_2014.csv` and `data/Chetty_2014_dict.csv`. Here we look into a subset of the data, exploring the relationship between economic mobility and one CZ characteristic: household income per capita (`hhi_percap`). The mobility measure that you will use in this analysis captures the probability that a child born to a parent in quintile 1 moves to income quintile 5 as an adult (`prob_q1q5`).
- Read and filter data to get data only for the 100 largest commuting zones (CZ). You will work with filtered data, `chetty_top100`, throughout this lab.
- Make a scatterplot with household income per capita (`hhi_percap`) on the x-axis and mobility (`abs_mobility`) on the y-axis. (A) Describe the graph: what is the approximate range of the x-axis? (B) What is the approximate range of the y-axis? (C) Do you think there is a relationship between these two variables?
- Add color to your scatterplot to represent the geographic region (`region`). (A) What patterns does this reveal? (B) Describe the distribution of the data, by region.
- Represent geographic region (`region`) on your scatterplot using shape instead of color. Compare the use of color vs. shape to represent the region: what are the benefits and drawbacks of each?
- Going back to the graph that uses color to represent the geographic region, add another aesthetic to represent the size of the population (`pop_2000`, population from the 2000 Census). Describe any relationships between size and region.
- Split your plot into facets to display scatterplots of your data by region. (A) Compare this split plot to the combined plot earlier. Are there aspects of the relationship between `hhi_percap` and mobility that are easier to detect in the faceted plot than in the combined plot? (B) Which regions appear to have a relatively stronger relationship between `hhi_percap` and mobility?
- Add information on the census division (`division`) to your graph using the color aesthetic. (A) What does this reveal about divisional differences in the West?
- Create a plot of the relationship between `hhi_percap` and `abs_mobility` with two layers: (1) a scatterplot colored by region, and (2) a smoothed fit line without the standard error band, also colored by region. (A) What patterns does this illustrate in the data?
- Create a bar graph that displays the count of CZs by region and fill each bar using information on census division. What do you learn from this graph? (A) Make new bar graphs with position `dodge`. (B) Make new bar graphs with position `fill`.
- What is the relative advantage of each of the three bar graphs?
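The plotting steps above can be sketched on simulated data. Everything in the stand-in table is random rather than real: the column names follow the exercise, but the values, the division labels (`A`/`B`), and the choice to define "largest" by `pop_2000` are assumptions made for illustration.

```r
library(tidyverse)

set.seed(1)
regions <- c("Midwest", "Northeast", "South", "West")

# Simulated stand-in for data/Chetty_2014.csv (random values, not real data).
chetty <- tibble(
  pop_2000     = round(runif(150, 1e5, 5e6)),
  hhi_percap   = runif(150, 25000, 55000),
  abs_mobility = runif(150, 35, 50),
  region       = sample(regions, 150, replace = TRUE)
) %>%
  # Hypothetical division labels nested within region.
  mutate(division = paste(region, sample(c("A", "B"), 150, replace = TRUE)))

# Assumption: the "100 largest" CZs are the 100 with the highest pop_2000.
chetty_top100 <- chetty %>%
  slice_max(pop_2000, n = 100, with_ties = FALSE)

# Scatterplot: color by region, point size by population,
# a smooth without the standard error band, and one facet per region.
p_scatter <- ggplot(chetty_top100,
                    aes(x = hhi_percap, y = abs_mobility, color = region)) +
  geom_point(aes(size = pop_2000), alpha = 0.7) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  facet_wrap(~ region)

# Bar graph of CZ counts by region, filled by division.
# Swap "dodge" for "stack" (the default) or "fill" to compare the three.
p_bar <- ggplot(chetty_top100, aes(x = region, fill = division)) +
  geom_bar(position = "dodge")

print(p_scatter)
print(p_bar)
```

Mapping `size` inside `geom_point()` rather than in the top-level `aes()` keeps the population aesthetic from being passed to the smoothing layer.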
Connecticut Schools. Data on schools from the Common Core of Data (CCD) are collected by the National Center for Education Statistics. Information about the CCD is available on the NCES website. We are working with a subset of the 2013-14 dataset, `data/ct_schools`, in the classnotes repo. A variable codebook for the CCD, which explains what each column in the data file represents, is located in the `flat file`, in the record layout column and the 2013-14 row.
- Load the appropriate packages and load the data.
- Data manipulation.
  - Make a new variable `sch_type` that has the value `Charter`, `Magnet`, or `TPS`, to specify whether the school is a charter, magnet, or traditional public school.
  - How many schools of each of these three types are in our area?
  - Make a table that shows the number of schools of each of the three school types that are missing data for free lunch.
- Descriptives.
  - How many schools are elementary schools? Middle schools? High schools? Other? Missing?
  - How many schools are eligible for Title I status? Summarize the count of schools in each category.
- Racial composition.
  - Create new variables that compute the percentage of students who are Black, White, Hispanic, Asian, or another race (call this variable “other”), and the percentage of students receiving free OR reduced price lunch.
  - Visualize these data to see the variability of each of these variables (Hint: Use `geom_histogram`).
- Visualize the variation in the percent of students receiving free lunch (`frelch`) for magnet schools, TPS, and charter schools (Hint: Use `geom_boxplot`).
- Visualize the variation in the percent of students of each race and ethnicity in the data file (Black, White, Hispanic, Asian, and Other) for each type of school (magnet, TPS, and charter).
- With more R skills, what types of questions could you answer using this dataset?
- What are some questions you have about this dataset? What information would you add to the dataset, if you could?
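One possible shape for the data manipulation steps is sketched below on a toy table. Only `frelch` and `sch_type` come from the exercise; the column names `charter`, `magnet`, `member`, and `redlch`, the 1 = yes / 2 = no coding, and the use of `NA` for missing values are assumptions, so check the CCD record layout for the actual names and codes.

```r
library(tidyverse)

# Toy stand-in for the CT schools data (values are made up).
# Assumed: charter/magnet coded 1 = yes, 2 = no; member = total enrollment;
# frelch/redlch = counts of free / reduced price lunch students.
schools <- tibble(
  charter = c(1, 2, 2, 2, 1),
  magnet  = c(2, 1, 2, 2, 2),
  member  = c(200, 400, 500, 300, 250),
  frelch  = c(120, 100, NA, 90, NA),
  redlch  = c(20, 40, 30, 10, 25)
)

schools <- schools %>%
  mutate(
    # case_when() checks conditions in order; the first match wins.
    sch_type = case_when(
      charter == 1 ~ "Charter",
      magnet == 1  ~ "Magnet",
      TRUE         ~ "TPS"
    ),
    # Percent of students receiving free OR reduced price lunch.
    pct_frl = 100 * (frelch + redlch) / member
  )

# Number of schools of each type.
type_counts <- schools %>% count(sch_type)

# Number of schools per type with missing free lunch data.
missing_frelch <- schools %>%
  group_by(sch_type) %>%
  summarise(n_missing = sum(is.na(frelch)), .groups = "drop")

print(type_counts)
print(missing_frelch)

# Boxplot of the lunch percentage by school type (rows with NA are dropped).
p_box <- ggplot(schools, aes(x = sch_type, y = pct_frl)) +
  geom_boxplot()
print(p_box)
```

If the real file encodes missing values with negative sentinel codes rather than `NA`, those would need to be recoded before computing percentages.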