6  Data Manipulation

6.1 Introduction

Raw data are rarely ready for direct analysis. We often need to reshape, filter, or summarize before we can create meaningful plots or fit statistical models. The tidyverse provides a consistent grammar for these operations, with dplyr as its central package.

In this chapter, we will learn the most important data manipulation verbs. Each verb is a function that takes a data frame (or tibble) as the first argument, applies some manipulation, and returns a new data frame.

Backward Compatibility in the Tidyverse

The tidyverse strives to minimize disruption, but backward compatibility is not guaranteed. Breaking changes sometimes occur—especially in major releases—to improve consistency or fix design issues. Functions are usually deprecated with warnings before removal, giving time to update code. For long-term stability, pin package versions with tools like renv and always review release notes when upgrading.

6.2 Core dplyr Verbs

The six most commonly used verbs are:

  • filter() — select rows based on conditions
  • arrange() — reorder rows
  • select() — choose columns
  • mutate() — add or modify columns
  • group_by() — define groups for analysis
  • summarise() — collapse groups into summaries

All verbs follow the same pattern: the first argument is a data frame, and subsequent arguments describe manipulations using column names.

We will illustrate these verbs using the nycflights13::flights dataset.

Code
library(nycflights13)
flights <- nycflights13::flights
str(flights)
tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
 $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
 $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
 $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
 $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
 $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
 $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
 $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
 $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
 $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
 $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
 $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
 $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
 $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
 $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
 $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

6.2.1 Filtering Rows with filter()

The filter() function selects rows that satisfy logical conditions. Logical operators like & (and), | (or), and ! (not) are often used, along with comparisons such as ==, !=, <, and >=.

  • == equals
  • != not equal
  • < less than, <= less than or equal
  • > greater than, >= greater than or equal
  • & logical AND (both conditions must be true)
  • | logical OR (at least one condition must be true)
  • ! logical NOT (negates a condition)

Examples with the flights data:

Code
# Flights on January 1
flights |>
filter(month == 1 & day == 1)
Code
# Flights in January or February
flights |>
filter(month == 1 | month == 2)
Code
# Flights not in January
flights |>
filter(!(month == 1))
Code
# Flights with arrival delay over 2 hours and from JFK
flights |>
filter(arr_delay > 120 & origin == "JFK")
Code
# Flights that were either very early (dep_delay < -15) or very late (dep_delay > 120)
flights |>
filter(dep_delay < -15 | dep_delay > 120)
Code
# Flights in summer months AND either from JFK or LGA
flights |>
filter((month %in% c(6,7,8)) & (origin == "JFK" | origin == "LGA"))
Code
# Flights in December with departure delay over 2 hours OR (in January with arrival delay over 2 hours)
flights |>
filter((month == 12 & dep_delay > 120) | (month == 1 & arr_delay > 120))
Code
# Flights from EWR where either (dep_delay > 60 AND arr_delay > 60) OR (dep_delay < -30 AND arr_delay < -30)
flights |>
filter(origin == "EWR" & ((dep_delay > 60 & arr_delay > 60) |
(dep_delay < -30 & arr_delay < -30)))

6.2.1.1 Handling Missing Values

Missing values (NA) require care. For example:

Code
stocks <- data.frame(
  year   = c(2015, 2015, 2016, 2016),
  qtr    = c(1, 2, 2, NA),
  return = c(1.1, NA, 0.9, 2.0)
)

stocks |> filter(is.na(qtr) | is.na(return))
Code
stocks |> filter(!is.na(qtr) & !is.na(return))

6.2.2 Reordering Rows with arrange()

arrange() sorts rows by column values. Missing values are sorted last.

Code
flights |> arrange(year, month, day) |> str()
tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
 $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
 $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
 $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
 $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
 $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
 $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
 $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
 $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
 $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
 $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
 $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
 $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
 $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
 $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
 $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

Descending order is done with desc():

Code
flights |> arrange(desc(dep_delay)) |> str()
tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month         : int [1:336776] 1 6 1 9 7 4 3 6 7 12 ...
 $ day           : int [1:336776] 9 15 10 20 22 10 17 27 22 5 ...
 $ dep_time      : int [1:336776] 641 1432 1121 1139 845 1100 2321 959 2257 756 ...
 $ sched_dep_time: int [1:336776] 900 1935 1635 1845 1600 1900 810 1900 759 1700 ...
 $ dep_delay     : num [1:336776] 1301 1137 1126 1014 1005 ...
 $ arr_time      : int [1:336776] 1242 1607 1239 1457 1044 1342 135 1236 121 1058 ...
 $ sched_arr_time: int [1:336776] 1530 2120 1810 2210 1815 2211 1020 2226 1026 2020 ...
 $ arr_delay     : num [1:336776] 1272 1127 1109 1007 989 ...
 $ carrier       : chr [1:336776] "HA" "MQ" "MQ" "AA" ...
 $ flight        : int [1:336776] 51 3535 3695 177 3075 2391 2119 2007 2047 172 ...
 $ tailnum       : chr [1:336776] "N384HA" "N504MQ" "N517MQ" "N338AA" ...
 $ origin        : chr [1:336776] "JFK" "JFK" "EWR" "JFK" ...
 $ dest          : chr [1:336776] "HNL" "CMH" "ORD" "SFO" ...
 $ air_time      : num [1:336776] 640 74 111 354 96 139 167 313 109 149 ...
 $ distance      : num [1:336776] 4983 483 719 2586 589 ...
 $ hour          : num [1:336776] 9 19 16 18 16 19 8 19 7 17 ...
 $ minute        : num [1:336776] 0 35 35 45 0 0 10 0 59 0 ...
 $ time_hour     : POSIXct[1:336776], format: "2013-01-09 09:00:00" "2013-06-15 19:00:00" ...

6.2.3 Selecting Columns with select()

select() picks out specific columns. Useful helpers include starts_with() and ends_with().

Code
flights |> select(year, month, day) |> str()
tibble [336,776 × 3] (S3: tbl_df/tbl/data.frame)
 $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month: int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day  : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
Code
flights |> select(starts_with("dep")) |> str()
tibble [336,776 × 2] (S3: tbl_df/tbl/data.frame)
 $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
 $ dep_delay: num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...

Columns can also be renamed inline:

Code
flights |> select(tail = tailnum, everything()) |> str()
tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
 $ tail          : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
 $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
 $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
 $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
 $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
 $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
 $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
 $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
 $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
 $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
 $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
 $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
 $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
 $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
 $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
 $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

6.2.4 Creating New Columns with mutate()

mutate() adds new variables or modifies existing ones.

Code
flights |> mutate(speed = distance / (air_time/60)) |> str()
tibble [336,776 × 20] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
 $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
 $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
 $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
 $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
 $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
 $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
 $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
 $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
 $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
 $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
 $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
 $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
 $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
 $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
 $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
 $ speed         : num [1:336776] 370 374 408 517 394 ...

It is common to create multiple new variables at once:

Code
flights |> mutate(
  gain = dep_delay - arr_delay,
  hours = air_time / 60
) |> str()
tibble [336,776 × 21] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
 $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
 $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
 $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
 $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
 $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
 $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
 $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
 $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
 $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
 $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
 $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
 $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
 $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
 $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
 $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
 $ gain          : num [1:336776] -9 -16 -31 17 19 -16 -24 11 5 -10 ...
 $ hours         : num [1:336776] 3.78 3.78 2.67 3.05 1.93 ...

6.2.5 Grouping and Summarizing

The group_by() function defines groups of rows, and summarise() reduces each group to summary statistics.

Code
flights |>
  group_by(month) |>
  summarise(delay = mean(dep_delay, na.rm = TRUE))

Multiple grouping variables are possible:

Code
flights |>
  group_by(carrier, month) |>
  summarise(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
  ) |> str()
gropd_df [185 × 4] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ carrier: chr [1:185] "9E" "9E" "9E" "9E" ...
 $ month  : int [1:185] 1 2 3 4 5 6 7 8 9 10 ...
 $ delay  : num [1:185] 16.9 16.5 13.4 13.6 22.7 ...
 $ n      : int [1:185] 1573 1459 1627 1511 1462 1437 1494 1456 1540 1673 ...
 - attr(*, "groups")= tibble [16 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ carrier: chr [1:16] "9E" "AA" "AS" "B6" ...
  ..$ .rows  : list<int> [1:16] 
  .. ..$ : int [1:12] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ : int [1:12] 13 14 15 16 17 18 19 20 21 22 ...
  .. ..$ : int [1:12] 25 26 27 28 29 30 31 32 33 34 ...
  .. ..$ : int [1:12] 37 38 39 40 41 42 43 44 45 46 ...
  .. ..$ : int [1:12] 49 50 51 52 53 54 55 56 57 58 ...
  .. ..$ : int [1:12] 61 62 63 64 65 66 67 68 69 70 ...
  .. ..$ : int [1:12] 73 74 75 76 77 78 79 80 81 82 ...
  .. ..$ : int [1:12] 85 86 87 88 89 90 91 92 93 94 ...
  .. ..$ : int [1:12] 97 98 99 100 101 102 103 104 105 106 ...
  .. ..$ : int [1:12] 109 110 111 112 113 114 115 116 117 118 ...
  .. ..$ : int [1:5] 121 122 123 124 125
  .. ..$ : int [1:12] 126 127 128 129 130 131 132 133 134 135 ...
  .. ..$ : int [1:12] 138 139 140 141 142 143 144 145 146 147 ...
  .. ..$ : int [1:12] 150 151 152 153 154 155 156 157 158 159 ...
  .. ..$ : int [1:12] 162 163 164 165 166 167 168 169 170 171 ...
  .. ..$ : int [1:12] 174 175 176 177 178 179 180 181 182 183 ...
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE

Sometimes, the keeping the group structured after summarising can be handy, which can be achieved with .groups = "keep". In this case, groups persist unless they are explicitly removed with ungroup(). With the default summarise(), a single grouping variable is dropped, producing an ungrouped result; so ungroup() after that is redundant.

6.3 Reshaping Data

Real-world data often need reshaping. Tidy data prefers one observation per row, one variable per column.

6.3.1 Wide to Long: pivot_longer()

Code
table4a <- tibble(
  country = c("A", "B"),
  `1999` = c(745, 377),
  `2000` = c(377, 345)
)

table4a |> pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

6.3.2 Long to Wide: pivot_wider()

Code
table2 <- tibble(
  country = c("A", "A", "B", "B"),
  year = c(1999, 2000, 1999, 2000),
  type = c("cases", "cases", "cases", "cases"),
  count = c(745, 377, 377, 345)
)

table2 |> pivot_wider(names_from = type, values_from = count)

6.4 Combining Data from Multiple Tables

Joins combine data from two tables based on a common key column. The type of join determines which rows are kept:

  • Inner Join: Keeps only rows with matching keys in both tables.
  • Left Join: Keeps all rows from the left table, adding matches from the right table where available (missing values filled with NA).
  • Right Join: Keeps all rows from the right table, adding matches from the left table where available.
  • Full Join: Keeps all rows from both tables, filling unmatched values with NA.

Consider the planes data.

Code
planes <- nycflights13::planes
str(planes)
tibble [3,322 × 9] (S3: tbl_df/tbl/data.frame)
 $ tailnum     : chr [1:3322] "N10156" "N102UW" "N103US" "N104UW" ...
 $ year        : int [1:3322] 2004 1998 1999 1999 2002 1999 1999 1999 1999 1999 ...
 $ type        : chr [1:3322] "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" ...
 $ manufacturer: chr [1:3322] "EMBRAER" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" ...
 $ model       : chr [1:3322] "EMB-145XR" "A320-214" "A320-214" "A320-214" ...
 $ engines     : int [1:3322] 2 2 2 2 2 2 2 2 2 2 ...
 $ seats       : int [1:3322] 55 182 182 182 55 182 182 182 182 182 ...
 $ speed       : int [1:3322] NA NA NA NA NA NA NA NA NA NA ...
 $ engine      : chr [1:3322] "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan" ...

Join the flights data and the planes data:

Code
flights2 <- flights |> select(year, month, day, carrier, tailnum)
planes2 <- planes |> select(tailnum, type, manufacturer)

flights2 |> left_join(planes2, by = "tailnum") |> str()
tibble [336,776 × 7] (S3: tbl_df/tbl/data.frame)
 $ year        : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month       : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ carrier     : chr [1:336776] "UA" "UA" "AA" "B6" ...
 $ tailnum     : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
 $ type        : chr [1:336776] "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" ...
 $ manufacturer: chr [1:336776] "BOEING" "BOEING" "BOEING" "AIRBUS" ...
Code
# Try inner, right, and full join too

Other join types include inner_join(), right_join(), and full_join().

6.5 Assigning Results of Manipulations

When working with dplyr, it is often useful to save the results of a manipulation into a new object. This allows you to reuse the processed data without repeating all of the steps.

Code
# Filter flights on January 1st and arrange by departure delay
flights_jan1 <- flights |>
filter(month == 1, day == 1) |>
arrange(dep_delay)

# Print the first few rows
head(flights_jan1)

You can now work with flights_jan1 in later code chunks without re-running the entire manipulation. This practice is especially helpful for long workflows where the same processed data will be used multiple times.

6.6 In-Class Example

Consider the data of Chetty et al. (2014).

  1. Create a data frame with CZs in CT, MA, and NY.
  2. Create a data frame with CZ’s with absolute mobility of at least 40
  3. Create a data frame with CZ’s in any state other than CT, MA, or NY, with absolute mobility at least 40
  4. Create a data frame with CZ’s that are in CT, MA, NY and have absolute mobility less than 40.
  5. Create a data frame with CZ’s that are in CT, MA, NY, sorting the CZ’s in decreasing order of absolute mobility, and keeping just the CZ name, state, and absolute mobility variables in the resulting data frames
  6. Create a new data set with only the following variables: cz_name, state, pop_2000, abs_mobility, hhi_percap, and any variable that starts with frac.
  7. Make new variables for each of these quantities:
    1. The number of people in each CZ who consider themselves to be religious.
    2. The log base 2 of the per capita household income (hint: log2() ).
    3. The proportion of people who are not married.

6.7 Summary and Best Practices

  • Begin with a clear idea of what manipulation you need.
  • Chain verbs together with the pipe operator for readability.
  • Use group_by() and summarise() to move from raw detail to aggregated insights.
  • Reshape and join data as needed to bring it into tidy form.

These manipulations prepare your data for visualization and modeling, ensuring clarity and reproducibility in analysis.