R is an open-source language, widely used in data science. One of its greatest strengths is the ecosystem of packages developed by the community. These packages make it easier to perform tasks such as importing data, cleaning it, and creating visualizations.
Two packages will be central for us. The tidyverse package, which bundles several related packages such as dplyr and ggplot2, provides a coherent framework for data wrangling and visualization. The gapminder package offers a dataset on life expectancy, GDP per capita, and population across countries and years, which will serve as a running example in our practice.
If packages are not already installed, we can add them to our system with install.packages(). Installation is needed only once, but packages must be loaded every time we start a new R session.
install.packages("tidyverse")
install.packages("gapminder")
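Because installation is needed only once, it can be convenient to check which packages are missing before calling install.packages(). The following is a minimal sketch of one way to do that, not part of the original workflow: the helper name missing_packages is hypothetical, and the interactive() guard is an assumption added so that nothing is downloaded when the script runs non-interactively.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "gapminder"))

# Install only what is missing, and only in an interactive session.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

If both packages are already present, to_install is empty and nothing is installed.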
Once installed, packages are made available in a session by loading them with library().
library(tidyverse)
library(gapminder)
After loading a dataset, it is good practice to examine its structure. Functions such as str() and summary() provide a quick overview of variable types, sample values, and ranges.
str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
$ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
summary(gapminder)
country continent year lifeExp
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
Algeria : 12 Asia :396 Median :1980 Median :60.71
Angola : 12 Europe :360 Mean :1980 Mean :59.47
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
Australia : 12 Max. :2007 Max. :82.60
(Other) :1632
pop gdpPercap
Min. :6.001e+04 Min. : 241.2
1st Qu.:2.794e+06 1st Qu.: 1202.1
Median :7.024e+06 Median : 3531.8
Mean :2.960e+07 Mean : 7215.3
3rd Qu.:1.959e+07 3rd Qu.: 9325.5
Max. :1.319e+09 Max. :113523.1
Once data are loaded, the next step is to explore and understand the dataset. A data frame (or tibble, in tidyverse) is the standard format for rectangular data. R provides many built-in functions to examine, summarize, and manipulate data frames.
dim(df) – number of rows and columns
nrow(df), ncol(df) – number of rows or columns separately
str(df) – internal structure (types, first few values)
glimpse(df) (from dplyr) – a cleaner version of str()
names(df) or colnames(df) – list column names
rownames(df) – list row names (rarely used in tidy data)
head(df) – first six rows
tail(df) – last six rows
summary(df) – variable-by-variable summaries
sapply(df, class) – variable types
sapply(df, function) – apply any function to each column (e.g., mean, min, max)
df$var – access a column by name
df[ , "var"] – same as above, but more general
df[1:5, ] – first five rows
df[ , 1:3] – first three columns
subset(df, condition) – filter rows by condition
filter(df, condition), select(df, cols) – dplyr equivalents for filtering rows and selecting columns
unique(df$var) – unique values in a column
table(df$var) – frequency counts
is.na(df) – identify missing values
Together, these functions give a toolkit for becoming familiar with any new dataset.
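As an illustration, several of these functions can be applied to the gapminder data. This is only a sketch: the object names (oceania_base, oceania_dplyr, continent_counts, n_countries) are chosen here for the example, and Oceania is picked simply because it is the smallest continent in the dataset.

```r
library(dplyr)
library(gapminder)

dim(gapminder)               # rows and columns
names(gapminder)             # column names
sapply(gapminder, class)     # type of each column
head(gapminder$lifeExp, 3)   # first values of one column
gapminder[1:5, 1:3]          # first five rows, first three columns

# Base R and dplyr ways of filtering rows
oceania_base  <- subset(gapminder, continent == "Oceania")
oceania_dplyr <- gapminder %>%
  filter(continent == "Oceania") %>%
  select(country, year, lifeExp)

continent_counts <- table(gapminder$continent)      # frequency counts
n_countries      <- length(unique(gapminder$country))  # distinct countries
```

The two filtering approaches return the same rows; the dplyr version additionally keeps only three columns.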
A crucial step in working with data is recognizing the types of variables. Variable types determine how we visualize, summarize, and analyze data.
Variables are broadly divided into numerical and categorical. Numerical variables can be continuous, such as income or life expectancy, or discrete, such as the number of siblings or a graduation year. Categorical variables can be nominal, with no inherent order (for example, country or gender), or ordinal, with an order that matters (such as education levels or rankings). R also provides support for logical variables, representing true/false values, and date variables, with built-in functions for handling time information.
# Example: variable types in gapminder
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
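The gapminder columns cover only some of these types. To see the others, the sketch below builds one small vector per type in base R. All values are made up for illustration and are not taken from gapminder.

```r
# Made-up illustrative values, not taken from gapminder.
siblings <- c(0L, 2L, 1L, 3L)                      # numerical, discrete
income   <- c(1520.5, 980.0, 2310.75, 1750.2)      # numerical, continuous
country  <- factor(c("Kenya", "Ghana", "Kenya", "Egypt"))  # categorical, nominal
education <- factor(
  c("primary", "tertiary", "secondary", "primary"),
  levels  = c("primary", "secondary", "tertiary"),
  ordered = TRUE
)                                                  # categorical, ordinal
employed    <- income > 1000                       # logical
interviewed <- as.Date(c("2007-01-15", "2007-03-02",
                         "2007-03-20", "2007-06-30"))  # date

class(education)            # an ordered factor
education < "tertiary"      # comparisons are defined for ordered factors
format(interviewed, "%Y")   # extract the year from a date
diff(range(interviewed))    # time span between first and last date
```

Declaring the levels explicitly is what gives the ordinal variable its order; without ordered = TRUE, R would treat education as nominal.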
Understanding variable types is not just theoretical. The type guides decisions about visualization, statistical summaries, and models. For instance, the mean is meaningful for a numerical variable but not for a nominal one.
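One way to make this concrete is a small function that chooses a summary based on the variable's type. The helper describe_variable below is hypothetical, written only for this illustration; it is a sketch rather than a recommended general-purpose tool.

```r
library(gapminder)

# Hypothetical helper: pick a summary that suits the variable's type.
describe_variable <- function(x) {
  if (is.numeric(x)) {
    c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
  } else {
    table(x, useNA = "ifany")
  }
}

describe_variable(gapminder$lifeExp)    # numerical: mean and standard deviation
describe_variable(gapminder$continent)  # nominal: frequency counts

# The mean of a factor is not meaningful; R returns NA with a warning.
suppressWarnings(mean(gapminder$continent))
```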
Real-world datasets are rarely clean. Messiness can arise from missing values, inconsistent formats, poorly named variables, or categories coded in multiple ways. Dates might appear in different styles, proper nouns might be inconsistently capitalized, and numeric values might be stored as text.
Cleaning data involves identifying and fixing these problems. R provides many tools for this work. Missing values can be detected with is.na() and handled using functions such as na.omit(). Variable names can be adjusted with rename(). The mutate() function can change types or create new variables, and joins such as left_join() allow information from multiple tables to be combined.
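Since gapminder is already clean, the sketch below uses a tiny made-up table to show these tools working together. The tables raw and regions, their column names, and all values are invented for this example.

```r
library(dplyr)

# Made-up messy table: numbers stored as text, a missing value,
# unhelpful column names, and inconsistent capitalization.
raw <- data.frame(
  Country.Name = c("kenya", "Ghana", "EGYPT"),
  le           = c("54.1", NA, "71.3"),
  stringsAsFactors = FALSE
)

# Made-up lookup table to join against.
regions <- data.frame(
  country = c("Kenya", "Ghana", "Egypt"),
  region  = c("East", "West", "North"),
  stringsAsFactors = FALSE
)

sum(is.na(raw$le))  # how many values are missing in this column

clean <- raw %>%
  rename(country = Country.Name, life_exp = le) %>%
  mutate(
    # Capitalize the first letter only, lower-case the rest
    country  = paste0(toupper(substring(country, 1, 1)),
                      tolower(substring(country, 2))),
    # Convert text to numbers
    life_exp = as.numeric(life_exp)
  ) %>%
  left_join(regions, by = "country")

complete <- na.omit(clean)  # drop rows that contain a missing value
```

Dropping rows with na.omit() is the simplest option but discards information; whether that is acceptable depends on the analysis.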
# Example: identify missing values in gapminder
sum(is.na(gapminder))
[1] 0
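A single total can hide where missing values occur, so it is often more useful to count them per column or per row. A brief sketch using base R functions:

```r
library(gapminder)

colSums(is.na(gapminder))        # missing values in each column
anyNA(gapminder)                 # TRUE if any value is missing
sum(!complete.cases(gapminder))  # rows with at least one missing value
```

For gapminder all of these report no missing values, consistent with the total of 0 above.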
The tidyverse philosophy emphasizes keeping data in a “tidy” format, where each variable is a column, each observation is a row, and each type of observation forms its own table. Working toward tidy data makes later analysis and visualization much easier.
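To illustrate the difference, the sketch below first reshapes gapminder into a "wide" layout with one column per year, which is not tidy because the year variable is spread across column names, and then restores the tidy layout. It assumes the tidyr package (part of the tidyverse) is available; the object names wide and long are chosen for this example.

```r
library(dplyr)
library(tidyr)
library(gapminder)

# Untidy: one row per country, one column per year
wide <- gapminder %>%
  select(country, year, lifeExp) %>%
  pivot_wider(names_from = year, values_from = lifeExp)

# Tidy again: one row per country-year observation
long <- wide %>%
  pivot_longer(-country, names_to = "year", values_to = "lifeExp") %>%
  mutate(year = as.integer(year))  # column names come back as text

dim(wide)
dim(long)
```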
To see these ideas in practice, consider analyzing life expectancy in African countries. We might start by filtering the data to include only Africa, checking for missing values, and confirming variable types. Once the data are tidy, we can compute summaries and produce visualizations that reveal patterns over time.
africa <- gapminder %>% filter(continent == "Africa")
head(africa)
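The remaining steps described above (checking for missing values, then summarizing and plotting over time) might look like the following. This is a sketch under assumptions: the summary columns mean_lifeExp and n_countries and the choice of a line plot are illustrative, not prescribed by the text.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

africa <- gapminder %>% filter(continent == "Africa")

# Checks before analysis: no missing values, expected types
stopifnot(!anyNA(africa))
stopifnot(is.numeric(africa$lifeExp), is.integer(africa$year))

# Average life expectancy across African countries in each year
africa_by_year <- africa %>%
  group_by(year) %>%
  summarise(
    mean_lifeExp = mean(lifeExp),
    n_countries  = n()
  )

# Line plot of the yearly average
p <- ggplot(africa_by_year, aes(x = year, y = mean_lifeExp)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Mean life expectancy (years)",
       title = "Life expectancy in Africa")
p
```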
This example illustrates the general workflow: install and load packages, import data, understand variable types, clean messy data, and prepare the dataset for analysis.