Raw data are rarely ready for direct analysis. We often need to reshape, filter, or summarize before we can create meaningful plots or fit statistical models. The tidyverse provides a consistent grammar for these operations, with dplyr as its central package.
In this chapter, we will learn the most important data manipulation verbs. Each verb is a function that takes a data frame (or tibble) as the first argument, applies some manipulation, and returns a new data frame.
Backward Compatibility in the Tidyverse
The tidyverse strives to minimize disruption, but backward compatibility is not guaranteed. Breaking changes sometimes occur—especially in major releases—to improve consistency or fix design issues. Functions are usually deprecated with warnings before removal, giving time to update code. For long-term stability, pin package versions with tools like renv and always review release notes when upgrading.
6.2 Core dplyr Verbs
The six most commonly used verbs are:
filter() — select rows based on conditions
arrange() — reorder rows
select() — choose columns
mutate() — add or modify columns
group_by() — define groups for analysis
summarise() — collapse groups into summaries
All verbs follow the same pattern: the first argument is a data frame, and subsequent arguments describe manipulations using column names.
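As a minimal sketch of this shared pattern, the following chains all six verbs on a small made-up tibble. The `toy` data and its column names are hypothetical and invented for illustration; they are not part of the flights dataset.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  carrier = c("AA", "AA", "UA", "UA", "DL"),
  delay   = c(10, 35, -5, 60, 20),
  dist    = c(500, 800, 300, 1200, 700)
)

result <- toy |>
  filter(delay > 0) |>                           # keep rows with a positive delay
  select(carrier, delay, dist) |>                # choose columns
  mutate(delay_per_100 = delay / dist * 100) |>  # add a derived column
  group_by(carrier) |>                           # define groups
  summarise(mean_delay = mean(delay), n = n()) |>  # collapse to one row per group
  arrange(desc(mean_delay))                      # reorder rows

result
```

Each step receives a data frame and returns a new one, so `toy` itself is left unchanged.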
We will illustrate these verbs using the nycflights13::flights dataset.
The filter() function selects rows that satisfy logical conditions. Logical operators like & (and), | (or), and ! (not) are often used, along with comparisons such as ==, !=, <, and >=.
== equals
!= not equal
< less than, <= less than or equal
> greater than, >= greater than or equal
& logical AND (both conditions must be true)
| logical OR (at least one condition must be true)
! logical NOT (negates a condition)
Examples with the flights data:
Code
# Flights on January 1
flights |> filter(month == 1 & day == 1)
Code
# Flights in January or February
flights |> filter(month == 1 | month == 2)
Code
# Flights not in January
flights |> filter(!(month == 1))
Code
# Flights with arrival delay over 2 hours and from JFK
flights |> filter(arr_delay > 120 & origin == "JFK")
Code
# Flights that were either very early (dep_delay < -15) or very late (dep_delay > 120)
# Note the space in "< -15": written as "<-15", R would parse it as an assignment
flights |> filter(dep_delay < -15 | dep_delay > 120)
Code
# Flights in summer months AND either from JFK or LGA
flights |> filter(month %in% c(6, 7, 8) & (origin == "JFK" | origin == "LGA"))
Code
# Flights in December with departure delay over 2 hours, OR in January with arrival delay over 2 hours
flights |> filter((month == 12 & dep_delay > 120) | (month == 1 & arr_delay > 120))
Code
# Flights from EWR where either (dep_delay > 60 AND arr_delay > 60) OR (dep_delay < -30 AND arr_delay < -30)
flights |>
  filter(
    origin == "EWR" &
      ((dep_delay > 60 & arr_delay > 60) | (dep_delay < -30 & arr_delay < -30))
  )
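Two shorthand forms appear alongside these operators: conditions separated by commas inside filter() are combined with AND, and %in% is a compact alternative to chaining == with |. The sketch below checks both equivalences on a tiny stand-in for the flights data; the `mini` tibble and its values are hypothetical.

```r
library(dplyr)

# Hypothetical miniature stand-in for flights; values are invented
mini <- tibble(
  month  = c(1, 1, 2, 6, 12),
  day    = c(1, 2, 1, 15, 1),
  origin = c("JFK", "LGA", "EWR", "JFK", "LGA")
)

# Comma-separated conditions behave like conditions joined with &
with_comma <- mini |> filter(month == 1, day == 1)
with_and   <- mini |> filter(month == 1 & day == 1)

# %in% matches any value in a set, like == chained with |
with_in <- mini |> filter(origin %in% c("JFK", "LGA"))
with_or <- mini |> filter(origin == "JFK" | origin == "LGA")

# Compare the paired results
all.equal(with_comma, with_and)
all.equal(with_in, with_or)
```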
Sometimes it is handy to keep the group structure after summarising, which can be achieved with .groups = "keep". In this case, the groups persist until they are explicitly removed with ungroup(). By default, summarise() drops the last grouping level, so with a single grouping variable the result is already ungrouped and a following ungroup() is redundant.
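The grouping behaviour can be inspected with group_vars(). The following is a small sketch on hypothetical data (the `toy` tibble is invented for illustration), comparing the default with .groups = "keep".

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  origin = c("JFK", "JFK", "LGA", "LGA"),
  month  = c(1, 2, 1, 2),
  delay  = c(10, 20, 30, 40)
)

# Default with one grouping variable: the result is ungrouped
by_origin <- toy |>
  group_by(origin) |>
  summarise(mean_delay = mean(delay))
group_vars(by_origin)

# .groups = "keep": the grouping by origin is retained
kept <- toy |>
  group_by(origin) |>
  summarise(mean_delay = mean(delay), .groups = "keep")
group_vars(kept)

# With two grouping variables, dropping the last level leaves origin
two_level <- toy |>
  group_by(origin, month) |>
  summarise(mean_delay = mean(delay), .groups = "drop_last")
group_vars(two_level)

# ungroup() removes any remaining grouping explicitly
group_vars(ungroup(kept))
```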
6.3 Reshaping Data
Real-world data often need reshaping. In tidy data, each row holds one observation and each column holds one variable.
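As a hedged illustration, reshaping between wide and long layouts can be sketched with tidyr's pivot_longer() and pivot_wider(). The choice of these two functions and the `wide` tibble below are assumptions made for this example; the data values are invented.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per month (not tidy)
wide <- tibble(
  origin = c("JFK", "LGA"),
  jan    = c(10, 30),
  feb    = c(20, 40)
)

# Wide to long: one row per origin-month observation
long <- wide |>
  pivot_longer(cols = c(jan, feb), names_to = "month", values_to = "delay")

# Long back to wide: one column per month again
back <- long |>
  pivot_wider(names_from = month, values_from = delay)

long
back
```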
Other join types include inner_join(), right_join(), and full_join().
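These join types differ in which unmatched rows they keep. The sketch below uses two small hypothetical tables (`flights_mini` and `airlines_mini`, with invented values) so the row counts are easy to verify by hand.

```r
library(dplyr)

# Hypothetical tables; "ZZ" has no match in airlines_mini, "DL" none in flights_mini
flights_mini  <- tibble(flight = c(1, 2, 3), carrier = c("AA", "UA", "ZZ"))
airlines_mini <- tibble(
  carrier = c("AA", "UA", "DL"),
  name    = c("American", "United", "Delta")
)

# inner_join(): only carriers present in both tables (AA, UA)
inner <- inner_join(flights_mini, airlines_mini, by = "carrier")

# right_join(): every row of the second table (AA, UA, DL)
right <- right_join(flights_mini, airlines_mini, by = "carrier")

# full_join(): every row of either table (AA, UA, ZZ, DL)
full <- full_join(flights_mini, airlines_mini, by = "carrier")

nrow(inner)
nrow(right)
nrow(full)
```

Rows without a match on the other side are filled with NA in the columns that come from that side.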
6.5 Assigning Results of Manipulations
When working with dplyr, it is often useful to save the results of a manipulation into a new object. This allows you to reuse the processed data without repeating all of the steps.
Code
# Filter flights on January 1st and arrange by departure delay
flights_jan1 <- flights |>
  filter(month == 1, day == 1) |>
  arrange(dep_delay)

# Print the first few rows
head(flights_jan1)
You can now work with flights_jan1 in later code chunks without re-running the entire manipulation. This practice is especially helpful for long workflows where the same processed data will be used multiple times.
Create a data frame with CZs that have absolute mobility of at least 40.
Create a data frame with CZs in any state other than CT, MA, or NY, with absolute mobility of at least 40.
Create a data frame with CZs that are in CT, MA, or NY and have absolute mobility less than 40.
Create a data frame with CZs that are in CT, MA, or NY, sorting the CZs in decreasing order of absolute mobility and keeping just the CZ name, state, and absolute mobility variables in the resulting data frame.
Create a new data set with only the following variables: cz_name, state, pop_2000, abs_mobility, hhi_percap, and any variable that starts with frac.
Make new variables for each of these quantities:
The number of people in each CZ who consider themselves to be religious.
The log base 2 of the per capita household income (hint: log2() ).
The proportion of people who are not married.
6.7 Summary and Best Practices
Begin with a clear idea of what manipulation you need.
Chain verbs together with the pipe operator for readability.
Use group_by() and summarise() to move from raw detail to aggregated insights.
Reshape and join data as needed to bring it into tidy form.
These manipulations prepare your data for visualization and modeling, ensuring clarity and reproducibility in analysis.
Chetty, R., Hendren, N., Kline, P., & Saez, E. (2014). Where is the land of opportunity? The geography of intergenerational mobility in the United States. Quarterly Journal of Economics, 129(4), 1553–1623. https://doi.org/10.1093/qje/qju022