4  First Impression with Data

4.1 R Packages and Data

R is an open-source language, widely used in data science. One of its greatest strengths is the ecosystem of packages developed by the community. These packages make it easier to perform tasks such as importing data, cleaning it, and creating visualizations.

Two packages will be central for us. The tidyverse, itself a collection of packages, provides a coherent framework for data wrangling and visualization. The gapminder package offers a dataset on life expectancy, GDP per capita, and population across countries and years, which will serve as our running example.

If packages are not already installed, we can add them to our system with install.packages(). Installation is needed only once, but packages must be loaded every time we start a new R session.

Code
install.packages("tidyverse")
install.packages("gapminder")
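Because installation is needed only once, it can be convenient to install only what is missing. The following is a minimal sketch in base R; `missing_packages` is a hypothetical helper name introduced here for illustration, and the `interactive()` guard is an assumption made so that scripts run in batch mode do not trigger downloads.

```r
# Return the subset of `pkgs` that is not yet installed.
# `missing_packages` is a hypothetical helper, not part of any package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "gapminder"))

# Install only what is missing; skipped in non-interactive runs
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Note that requireNamespace() loads a package's namespace as a side effect of checking for it, which is usually harmless at the start of a session.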

Once installed, packages are made available in a session by loading them with library().

Code
library(tidyverse)
library(gapminder)

After loading a dataset, it is good practice to examine its structure. Functions such as str() and summary() provide a quick overview of variable types, sample values, and ranges.

Code
str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
Code
summary(gapminder)
        country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
 Australia  :  12                  Max.   :2007   Max.   :82.60  
 (Other)    :1632                                                
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1  
                                       

4.2 Exploring Data Frames in R

Once data are loaded, the next step is to explore and understand the dataset. A data frame (or tibble, in the tidyverse) is the standard format for rectangular data. R provides many built-in functions to examine, summarize, and manipulate data frames.

4.2.1 Structure and dimensions

  • dim(df) – number of rows and columns
  • nrow(df), ncol(df) – number of rows or columns separately
  • str(df) – internal structure (types, first few values)
  • glimpse(df) (from dplyr) – a more compact alternative to str(), with one line per column
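As a quick illustration of these functions, here is a small sketch on a toy data frame. The data frame `df` and its values are invented for the example and do not come from gapminder.

```r
# Toy data frame, invented for illustration
df <- data.frame(
  country = c("A", "B", "C"),
  year    = c(2000L, 2005L, 2010L),
  value   = c(1.5, 2.5, 3.5)
)

dim(df)   # rows, then columns
nrow(df)  # number of rows
ncol(df)  # number of columns
str(df)   # one line per column: name, type, first values
```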

4.2.2 Column names and metadata

  • names(df) or colnames(df) – list column names
  • rownames(df) – list row names (rarely used in tidy data)

4.2.3 First and last rows

  • head(df) – first six rows by default; head(df, n) returns the first n
  • tail(df) – last six rows by default; tail(df, n) returns the last n
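The row count returned by both functions can be changed with the `n` argument. A short sketch on an invented ten-row data frame:

```r
# Toy data frame, invented for illustration
df <- data.frame(id = 1:10, squared = (1:10)^2)

head(df)         # rows 1 to 6 (default n = 6)
head(df, n = 3)  # rows 1 to 3
tail(df, n = 2)  # rows 9 and 10
```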

4.2.4 Summaries

  • summary(df) – variable-by-variable summaries
  • sapply(df, class) – the class of each column
  • sapply(df, f) – apply a function f to each column (e.g., mean, min, max, which are meaningful only for numeric columns)
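Numeric summaries such as mean() only make sense for numeric columns, so one reasonable pattern is to select those columns first. This is a sketch on invented data, not the only way to do it.

```r
# Toy data frame, invented for illustration
df <- data.frame(
  group  = factor(c("a", "b", "a")),
  height = c(1.6, 1.7, 1.8),
  count  = c(2L, 4L, 6L)
)

sapply(df, class)  # named character vector with one class per column

# Restrict to numeric columns before applying a numeric summary
is_num    <- sapply(df, is.numeric)
col_means <- sapply(df[is_num], mean)
```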

4.2.5 Accessing columns and rows

  • df$var – access a column by name
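  • df[ , "var"] – select a column by name with bracket indexing; on a base data frame this returns a vector, but on a tibble it returns a one-column tibble (use df[["var"]] to get the vector in both cases)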
  • df[1:5, ] – first five rows
  • df[ , 1:3] – first three columns
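The different indexing forms can return different kinds of objects, which the following base R sketch makes explicit. The data frame is invented for illustration; the comment about tibbles describes their documented default of never dropping to a vector under single-bracket indexing.

```r
# Toy data frame, invented for illustration
df <- data.frame(x = 1:6, y = letters[1:6], z = (1:6) / 2)

v1 <- df$x                     # integer vector
v2 <- df[["x"]]                # same vector, name given as a string
v3 <- df[, "x"]                # vector for a base data frame (dimension dropped)
d1 <- df[, "x", drop = FALSE]  # one-column data frame; tibbles behave this way by default

first_rows <- df[1:5, ]        # rows 1 to 5, all columns
first_cols <- df[, 1:2]        # all rows, columns 1 and 2
```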

4.2.6 Subsetting and filtering

  • subset(df, condition) – filter rows by condition
  • With tidyverse: filter(df, condition), select(df, cols)
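To compare the base R and tidyverse approaches side by side, here is a minimal sketch on an invented data frame. It assumes dplyr is installed; the column names and the threshold of 20 are arbitrary choices for the example.

```r
library(dplyr)

# Toy data frame, invented for illustration
df <- data.frame(
  name  = c("a", "b", "c", "d"),
  score = c(10, 25, 40, 55),
  group = c("x", "y", "x", "y")
)

# Base R: rows where score exceeds 20
high_base <- subset(df, score > 20)

# dplyr: the same rows, then keep only two columns
high_dplyr <- df %>%
  filter(score > 20) %>%
  select(name, score)
```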

4.2.7 Checking contents

  • unique(df$var) – unique values in a column
  • table(df$var) – frequency counts
  • is.na(df) – logical matrix that is TRUE wherever a value is missing
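A short sketch of these checks on invented data with a few missing values. Counting missing values per column with colSums() is one common idiom, shown here as a suggestion rather than a required step.

```r
# Toy data frame with missing values, invented for illustration
df <- data.frame(
  colour = c("red", "blue", "red", NA),
  size   = c(1, NA, 3, 4)
)

unique(df$colour)                  # distinct values, NA included
table(df$colour)                   # counts; NA is excluded by default
table(df$colour, useNA = "ifany")  # counts including NA

na_matrix  <- is.na(df)            # logical matrix, same shape as df
na_per_col <- colSums(na_matrix)   # number of missing values per column
```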

Together, these functions give a toolkit for becoming familiar with any new dataset.

4.3 Variable Types

A crucial step in working with data is recognizing the types of variables. Variable types determine how we visualize, summarize, and analyze data.

Variables are broadly divided into numerical and categorical. Numerical variables can be continuous, such as income or life expectancy, or discrete, such as the number of siblings or a graduation year. Categorical variables can be nominal, with no inherent order (for example, country or gender), or ordinal, with an order that matters (such as education levels or rankings). R also provides support for logical variables, representing true/false values, and date variables, with built-in functions for handling time information.

Variable types
Code
# Example: variable types in gapminder
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

Understanding variable types is not just theoretical. The type guides decisions about visualization, statistical summaries, and models. For instance, the mean is meaningful for a numerical variable but not for a nominal one.
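The variable types described in this section map onto R classes roughly as sketched below. All values are invented for illustration, and the choice of summaries at the end is only meant to show how the type shapes what is sensible to compute.

```r
continuous <- c(61.2, 70.5, 58.9)                  # numeric (double)
discrete   <- c(0L, 2L, 1L)                        # integer
nominal    <- factor(c("Asia", "Africa", "Asia"))  # unordered factor
ordinal    <- factor(c("low", "high", "medium"),
                     levels = c("low", "medium", "high"),
                     ordered = TRUE)               # ordered factor
flag       <- c(TRUE, FALSE, TRUE)                 # logical
day        <- as.Date(c("2007-01-01", "2007-06-30", "2007-12-31"))  # Date

mean(continuous)  # meaningful: arithmetic mean
table(nominal)    # for categories, counts are the natural summary
ordinal < "high"  # comparisons respect the level order
diff(day)         # Date arithmetic: gaps between consecutive dates
```

Calling mean() on the nominal factor returns NA with a warning, which is R's way of signalling that the summary does not apply.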

4.4 Messy Data

Real-world datasets are rarely clean. Messiness can arise from missing values, inconsistent formats, poorly named variables, or categories coded in multiple ways. Dates might appear in different styles, proper nouns might be inconsistently capitalized, and numeric values might be stored as text.

Cleaning data involves identifying and fixing these problems, and R provides many tools for this work. Missing values can be detected with is.na() and removed with na.omit(), which drops every row that contains a missing value. Variable names can be adjusted with rename() from dplyr. The mutate() function can convert types or create new variables, and joins such as left_join() combine information from multiple tables.
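Since gapminder is already clean, the cleaning tools are easier to see on a deliberately messy toy example. The tables `raw` and `regions`, their column names, and their values are all invented for illustration; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Toy tables, invented for illustration
raw <- data.frame(
  Country.Name = c("Chad", "Mali", "Togo"),
  pop_text     = c("100", "250", NA),   # numbers stored as text
  stringsAsFactors = FALSE
)
regions <- data.frame(
  country = c("Chad", "Mali", "Togo"),
  region  = c("Central", "West", "West")
)

clean <- raw %>%
  rename(country = Country.Name) %>%      # new_name = old_name
  mutate(pop = as.numeric(pop_text)) %>%  # convert text to numbers
  select(-pop_text) %>%                   # drop the original text column
  left_join(regions, by = "country")      # add the region for each country

complete <- na.omit(clean)                # drop rows with any missing value
```

Whether dropping incomplete rows is appropriate depends on the analysis; na.omit() is shown here only because it is the simplest option.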

Code
# Example: identify missing values in gapminder
sum(is.na(gapminder))
[1] 0

The tidyverse philosophy emphasizes keeping data in a “tidy” format, where each variable is a column, each observation is a row, and each type of observation forms its own table. Working toward tidy data makes later analysis and visualization much easier.
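As one possible illustration of moving toward a tidy layout, the sketch below reshapes a wide table, with one column per year, into one row per country-year observation. It assumes the tidyr package (part of the tidyverse) is installed; the table and its values are invented for the example.

```r
library(tidyr)

# Toy wide table, invented for illustration: one column per year
wide <- data.frame(
  country = c("A", "B"),
  `1952`  = c(40.1, 55.3),
  `2007`  = c(60.2, 72.8),
  check.names = FALSE
)

# One row per country-year observation
long <- pivot_longer(
  wide,
  cols      = -country,
  names_to  = "year",
  values_to = "lifeExp"
)

# Column names arrive as text, so convert the year to an integer
long$year <- as.integer(long$year)
```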

4.5 Putting It Together

To see these ideas in practice, consider analyzing life expectancy in African countries. We might start by filtering the data to include only Africa, checking for missing values, and confirming variable types. Once the data are tidy, we can compute summaries and produce visualizations that reveal patterns over time.

Code
# Keep only the rows for African countries
africa <- gapminder %>% filter(continent == "Africa")
head(africa)
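The remaining steps of this workflow might look like the following sketch. It assumes dplyr, ggplot2, and gapminder are installed; the names `africa_by_year`, `mean_lifeExp`, and `n_countries` are chosen here for illustration, and averaging across countries without population weights is a simplifying assumption.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

africa <- gapminder %>% filter(continent == "Africa")

# Checks mentioned above: missing values and variable types
sum(is.na(africa))
sapply(africa, class)

# Unweighted mean life expectancy across African countries, per year
africa_by_year <- africa %>%
  group_by(year) %>%
  summarise(
    mean_lifeExp = mean(lifeExp),
    n_countries  = n_distinct(country)
  )

# Line plot of the yearly means
ggplot(africa_by_year, aes(x = year, y = mean_lifeExp)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Mean life expectancy (years)")
```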

This example illustrates the general workflow: install and load packages, import data, understand variable types, clean messy data, and prepare the dataset for analysis.