7  Data Exploration

7.1 Introduction

Data exploration is the disciplined process that connects raw records to credible analysis. Its goals are to identify data quality issues, understand what variables mean operationally, and generate falsifiable claims that later analysis can scrutinize. The work is iterative: initial checks surface inconsistencies or gaps; targeted cleaning follows; then renewed examination tests whether earlier conclusions still hold. Reproducibility is non-negotiable, so every step should live in code with a brief log explaining what changed and why. Critically, this phase is not confirmatory inference. Instead, it frames questions clearly, proposes measurable definitions, and records assumptions that later sections will test formally.

Scope of this chapter. We will develop practical habits for high-quality exploration. First, we present cleaning principles: consistency of formats and units, completeness and missing-data mechanisms, accuracy and duplicates, and integrity across related fields, together with clear documentation. Next, we practice numerically driven summaries: distributional statistics, grouped descriptives, cross-tabulations, and simple association checks that reveal promising relationships. Finally, we show how to state hypotheses correctly, with the null representing no effect or independence, and how to run appropriate tests in Python (ANOVA or Kruskal–Wallis for group means, Welch’s t-test or Mann–Whitney for two groups, OLS slope tests with robust errors, and Pearson or Spearman correlations), pairing p-values with effect sizes and intervals.

7.2 Getting to Know Data

Any analysis begins by becoming familiar with the dataset. This involves learning what the observations represent, what types of variables are recorded, and whether the structure meets the expectations of tidy data. Before turning to a specific example, we highlight general principles.

  • Units of observation. Each row of a dataset should correspond to a single unit, such as an individual, transaction, or property sale. Misalignment of units often leads to errors in later analysis.

  • Variables and their types. Columns record attributes of units. These may be continuous measurements, counts, ordered categories, or nominal labels. Recognizing the correct type is essential because it dictates which summary statistics and hypothesis tests are appropriate.

  • Tidy data principles. In a tidy format, each row is one observation and each column is one variable. When data are stored otherwise, reshaping is necessary before analysis can proceed smoothly.
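
As a minimal sketch with made-up numbers, the wide table below stores one column per year, which violates the one-column-per-variable rule; pandas' melt reshapes it so each row is one (city, year) observation:

import pandas as pd

# Hypothetical wide table: one row per city, one column per year
wide = pd.DataFrame({
    'city': ['Ames', 'Des Moines'],
    'price_2007': [180000, 150000],
    'price_2008': [185000, 152000],
})

# melt() reshapes to one row per (city, year) observation
tidy = wide.melt(id_vars='city', var_name='year', value_name='price')
tidy['year'] = tidy['year'].str.replace('price_', '', regex=False).astype(int)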

We now illustrate these ideas using a reduced version of the Ames Housing dataset (De Cock, 2009), available directly from OpenML. This filtered version contains roughly 1,460 observations: relative to the full data, it drops older sales, keeps only certain years, and removes some variables, making modeling more approachable for beginners.

import openml

# Load Ames Housing (OpenML ID 42165)
dataset = openml.datasets.get_dataset(42165)
df, *_ = dataset.get_data()

# Basic dimensions
df.shape
(1460, 81)

The dataset contains 1,460 house sales described by 81 variables.

It is useful to view the first few rows to confirm the structure.

# First few rows
df.head()
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape LandContour Utilities  ...  PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0   1          60       RL         65.0     8450   Pave  None      Reg         Lvl    AllPub  ...         0   None  None        None       0      2   2008       WD        Normal    208500
1   2          20       RL         80.0     9600   Pave  None      Reg         Lvl    AllPub  ...         0   None  None        None       0      5   2007       WD        Normal    181500
2   3          60       RL         68.0    11250   Pave  None      IR1         Lvl    AllPub  ...         0   None  None        None       0      9   2008       WD        Normal    223500
3   4          70       RL         60.0     9550   Pave  None      IR1         Lvl    AllPub  ...         0   None  None        None       0      2   2006       WD       Abnorml    140000
4   5          60       RL         84.0    14260   Pave  None      IR1         Lvl    AllPub  ...         0   None  None        None       0     12   2008       WD        Normal    250000

5 rows × 81 columns

Each row corresponds to one property sale, while columns record attributes such as lot size, neighborhood, and sale price.

Understanding whether variables are numeric, categorical, or temporal guides later exploration and cleaning.

# Dtypes in pandas
df.dtypes.value_counts()
object     43
int64      22
uint8      13
float64     3
Name: count, dtype: int64
# Example: show a few variables with types
df.dtypes.head(10)
Id               int64
MSSubClass       uint8
MSZoning        object
LotFrontage    float64
LotArea          int64
Street          object
Alley           object
LotShape        object
LandContour     object
Utilities       object
dtype: object

Most features are numeric or categorical. Some, such as YearBuilt, are integers but represent calendar years.

The outcome of interest is the sale price.

# The default target from OpenML
dataset.default_target_attribute
'SalePrice'

Numeric summaries highlight scale and possible outliers.

# Summary statistics for numeric columns
df.describe().T.head(10)
               count          mean          std     min      25%     50%       75%       max
Id            1460.0    730.500000   421.610009     1.0   365.75   730.5   1095.25    1460.0
MSSubClass    1460.0     56.897260    42.300571    20.0    20.00    50.0     70.00     190.0
LotFrontage   1201.0     70.049958    24.284752    21.0    59.00    69.0     80.00     313.0
LotArea       1460.0  10516.828082  9981.264932  1300.0  7553.50  9478.5  11601.50  215245.0
OverallQual   1460.0      6.099315     1.382997     1.0     5.00     6.0      7.00      10.0
OverallCond   1460.0      5.575342     1.112799     1.0     5.00     5.0      6.00       9.0
YearBuilt     1460.0   1971.267808    30.202904  1872.0  1954.00  1973.0   2000.00    2010.0
YearRemodAdd  1460.0   1984.865753    20.645407  1950.0  1967.00  1994.0   2004.00    2010.0
MasVnrArea    1452.0    103.685262   181.066207     0.0     0.00     0.0    166.00    1600.0
BsmtFinSF1    1460.0    443.639726   456.098091     0.0     0.00   383.5    712.25    5644.0

Categorical summaries reveal balance among levels.

# Frequency counts for categorical columns
df['Neighborhood'].value_counts().head()
Neighborhood
NAmes      225
CollgCr    150
OldTown    113
Edwards    100
Somerst     86
Name: count, dtype: int64
# Another example
df['GarageType'].value_counts(dropna=False)
GarageType
Attchd     870
Detchd     387
BuiltIn     88
None        81
Basment     19
CarPort      9
2Types       6
Name: count, dtype: int64

Simple checks can detect implausible combinations.

# Houses should not be sold before built
(df['YrSold'] < df['YearBuilt']).sum()
np.int64(0)
# Garage year built should not precede house year built
(df['GarageYrBlt'] < df['YearBuilt']).sum()
np.int64(9)

These checks suggest the dataset is well curated, although quirks such as the nine garages recorded as older than their houses require careful interpretation.

7.3 Data Cleaning Principles

Exploration begins with data cleaning. The purpose is not to modify values casually but to identify issues, understand their sources, and decide on a transparent response. The following principles provide structure.

Consistency. Variables should follow the same format and unit across all records. Dates should have a common representation, categorical labels should not differ by spelling, and measurements should use the same scale.
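
As a small illustration with this dataset's columns, MoSold and YrSold store the sale month and year separately; assembling them into a single datetime column gives every record one common date representation. A minimal sketch, not a required step:

import pandas as pd

# Build a single sale-date column from the month and year parts (day set to 1)
sale_date = pd.to_datetime(dict(year=df['YrSold'], month=df['MoSold'], day=1))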

Completeness. Missing values are unavoidable. It is important to determine whether they arise from data entry errors, survey nonresponse, or structural absence. For example, a missing value in FireplaceQu often indicates that a house has no fireplace rather than missing information.

Accuracy. Values should be plausible. Obvious errors include negative square footage or sale years in the future. Duplicate records also fall under this category.
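
A quick duplicate check complements these plausibility checks; a small sketch, using Id as the presumed unique key:

# Full-row duplicates and duplicated Ids (both expected to be zero here)
df.duplicated().sum()
df.duplicated(subset='Id').sum()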

Integrity. Relationships between variables should be logically consistent. A house cannot be sold before it was built. If related totals exist, the sum of parts should match the total.
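
In the Ames data, for instance, the basement columns should satisfy an accounting identity; a sketch of that check, assuming the standard codebook names BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, and TotalBsmtSF:

# Finished and unfinished basement areas should sum to the recorded total
parts = df['BsmtFinSF1'] + df['BsmtFinSF2'] + df['BsmtUnfSF']
(parts != df['TotalBsmtSF']).sum()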

Transparency. All cleaning decisions should be recorded. Reproducibility requires that another analyst can understand what was changed and why.
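
One lightweight way to honor this principle is a machine-readable log kept next to the code; the structure below is only a suggestion:

# Append one entry per cleaning decision; export alongside the final results
cleaning_log = []
cleaning_log.append({
    'column': 'PoolQC',
    'action': 'treat missing as "no pool"',
    'reason': 'structural absence; verified against PoolArea == 0',
})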

7.3.1 Illustration with Ames Housing

We apply these principles to the Ames dataset. The first step is to inspect missing values.

# Count missing values in each column
df.isna().sum().sort_values(ascending=False).head(15)
PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageFinish      81
GarageQual        81
GarageYrBlt       81
GarageType        81
GarageCond        81
BsmtExposure      38
BsmtFinType2      38
BsmtCond          37
BsmtFinType1      37
dtype: int64

Several variables, such as PoolQC, MiscFeature, and Alley, contain many missing entries. Documentation shows that these are structural, indicating the absence of the feature.

# Example: check PoolQC against PoolArea
(df['PoolQC'].isna() & (df['PoolArea'] > 0)).sum()
np.int64(0)

The result is zero, confirming that missing PoolQC means no pool.

Consistency can be checked by reviewing categorical labels.

# Distinct values in Exterior1st
df['Exterior1st'].unique()
array(['VinylSd', 'MetalSd', 'Wd Sdng', 'HdBoard', 'BrkFace', 'WdShing',
       'CemntBd', 'Plywood', 'AsbShng', 'Stucco', 'BrkComm', 'AsphShn',
       'Stone', 'ImStucc', 'CBlock'], dtype=object)

If spelling variants are detected, they should be harmonized.
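
In this dataset, the companion column Exterior2nd is known to spell some of the same materials differently than Exterior1st (for example 'Wd Shng' and 'CmentBd'); a mapping like the sketch below, assuming those variants are present, harmonizes the labels:

# Map spelling variants in Exterior2nd to the Exterior1st conventions
fixes = {'Wd Shng': 'WdShing', 'CmentBd': 'CemntBd', 'Brk Cmn': 'BrkComm'}
df['Exterior2nd'] = df['Exterior2nd'].replace(fixes)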

Accuracy checks involve searching for implausible values.

# Negative or zero living area
(df['GrLivArea'] <= 0).sum()
np.int64(0)

Integrity checks verify logical relationships.

# Houses sold before they were built
(df['YrSold'] < df['YearBuilt']).sum()
np.int64(0)
# Garage built before house built
(df['GarageYrBlt'] < df['YearBuilt']).sum()
np.int64(9)

These checks help identify issues to document and, if appropriate, correct in a reproducible way.

7.4 Common Exploration Practices

After establishing data cleaning principles, the next step is to compute summaries that describe the distributions of variables and their relationships. This section avoids graphics, relying instead on tables and statistics.

7.4.1 Univariate summaries

Simple statistics reveal scale, central tendency, and variability.

# Sale price distribution
df['SalePrice'].describe()
count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

The sale price is right-skewed, with a mean higher than the median.

# Lot area distribution
df['LotArea'].describe()
count      1460.000000
mean      10516.828082
std        9981.264932
min        1300.000000
25%        7553.500000
50%        9478.500000
75%       11601.500000
max      215245.000000
Name: LotArea, dtype: float64

Lot size shows extreme variation, indicating possible outliers.
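
Skewness and upper-tail quantiles put numbers on that impression:

# Quantify the right tail of lot area
df['LotArea'].skew()
df['LotArea'].quantile([0.90, 0.95, 0.99])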

7.4.2 Grouped summaries

Comparisons across categories highlight differences in central tendency.

# Mean sale price by neighborhood
df.groupby('Neighborhood')['SalePrice'].mean().sort_values().head()
Neighborhood
MeadowV     98576.470588
IDOTRR     100123.783784
BrDale     104493.750000
BrkSide    124834.051724
Edwards    128219.700000
Name: SalePrice, dtype: float64

Neighborhoods differ substantially in average sale price.
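
Group means based on only a handful of sales are noisy, so it helps to pair each mean with its sample size; a sketch:

# Neighborhood means alongside their counts
df.groupby('Neighborhood')['SalePrice'].agg(['mean', 'count']).sort_values('mean').head()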

# Median sale price by overall quality
df.groupby('OverallQual')['SalePrice'].median()
OverallQual
1      50150.0
2      60000.0
3      86250.0
4     108000.0
5     133000.0
6     160000.0
7     200141.0
8     269750.0
9     345000.0
10    432390.0
Name: SalePrice, dtype: float64

Higher quality ratings correspond to higher prices.

7.4.3 Cross-tabulations

Cross-tabulations summarize associations between categorical variables.

import pandas as pd
# Neighborhood by garage type
pd.crosstab(df['Neighborhood'], df['GarageType']).head()
GarageType    2Types  Attchd  Basment  BuiltIn  CarPort  Detchd
Neighborhood
Blmngtn            0      17        0        0        0       0
Blueste            0       2        0        0        0       0
BrDale             0       2        0        0        0      13
BrkSide            0       3        0        1        0      44
ClearCr            0      24        0        1        0       2

Some garage types are common only in specific neighborhoods.
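
Row-normalizing the table turns counts into within-neighborhood shares, which are easier to compare across rows; a sketch:

# Share of each garage type within a neighborhood
pd.crosstab(df['Neighborhood'], df['GarageType'], normalize='index').round(2).head()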

7.4.4 Correlations

Correlation coefficients capture linear associations between numeric variables.

# Correlation between living area and sale price
df[['GrLivArea','SalePrice']].corr()
           GrLivArea  SalePrice
GrLivArea   1.000000   0.708624
SalePrice   0.708624   1.000000
# Correlation between lot area and sale price
df[['LotArea','SalePrice']].corr()
            LotArea  SalePrice
LotArea    1.000000   0.263843
SalePrice  0.263843   1.000000

Living area is strongly correlated with price, while lot area shows a weaker association.
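
The same idea scales to all numeric columns at once; ranking their correlations with the target is a quick way to shortlist candidates for closer study:

# Rank numeric columns by correlation with sale price
df.select_dtypes('number').corr()['SalePrice'].sort_values(ascending=False).head(8)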

7.4.5 Conditional summaries

Examining distributions within subgroups can surface interaction patterns.

# Average sale price by house style
df.groupby('HouseStyle')['SalePrice'].mean().sort_values()
HouseStyle
1.5Unf    110150.000000
SFoyer    135074.486486
1.5Fin    143116.740260
2.5Unf    157354.545455
SLvl      166703.384615
1Story    175985.477961
2Story    210051.764045
2.5Fin    220000.000000
Name: SalePrice, dtype: float64

House style is another factor associated with variation in price.

These practices provide a numerical portrait of the data, guiding later steps where hypotheses will be stated explicitly and tested with appropriate statistical methods.

7.5 Forming and Testing Hypotheses

Exploration becomes more rigorous when we state hypotheses formally and run appropriate tests. The null hypothesis always represents no effect, no difference, or no association. The alternative expresses the presence of an effect. The examples below use the Ames Housing data.

7.5.1 Neighborhood and sale price

Hypothesis:

  • H0: The mean sale prices are equal across neighborhoods.
  • H1: At least one neighborhood has a different mean.

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

sub = df[['SalePrice','Neighborhood']].dropna()
model = smf.ols('SalePrice ~ C(Neighborhood)', data=sub).fit()
anova_lm(model, typ=2)
                       sum_sq      df          F         PR(>F)
C(Neighborhood)  5.023606e+12    24.0  71.784865  1.558600e-225
Residual         4.184305e+12  1435.0        NaN            NaN

ANOVA tests equality of group means. If significant, post-hoc comparisons can identify which neighborhoods differ.
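
A minimal sketch of such a post-hoc step, using Tukey’s HSD from statsmodels:

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# All pairwise neighborhood comparisons with family-wise error control
tukey = pairwise_tukeyhsd(sub['SalePrice'], sub['Neighborhood'])
print(tukey.summary())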

7.5.2 Year built and sale price

Hypothesis:

  • H0: The slope for YearBuilt is zero; there is no linear relationship.
  • H1: The slope is not zero.

model = smf.ols('SalePrice ~ YearBuilt', data=df).fit(cov_type='HC3')
model.summary().tables[1]
                 coef   std err         z     P>|z|     [0.025     0.975]
Intercept   -2.53e+06  1.36e+05   -18.667     0.000   -2.8e+06  -2.26e+06
YearBuilt   1375.3735    68.973    19.941     0.000   1240.189   1510.558

The regression slope test checks whether newer houses tend to sell for more.
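
To keep the focus on magnitude rather than significance alone, the slope can be translated into a substantive unit; a sketch reusing the fitted model:

# Estimated price difference per decade of construction, with 95% CI
model.params['YearBuilt'] * 10
model.conf_int().loc['YearBuilt'] * 10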

7.5.3 Lot area and sale price

Hypothesis:

  • H0: The population correlation is zero.
  • H1: The correlation is not zero.

from scipy import stats
sub = df[['LotArea','SalePrice']].dropna()
stats.pearsonr(sub['LotArea'], sub['SalePrice'])
PearsonRResult(statistic=np.float64(0.2638433538714058), pvalue=np.float64(1.123139154918551e-24))

A Pearson correlation tests linear association. A Spearman rank correlation can be used when distributions are skewed.
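
A sketch of the rank-based alternative on the same pair:

# Spearman correlation is less sensitive to the skew in LotArea
stats.spearmanr(sub['LotArea'], sub['SalePrice'])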

7.5.4 Fireplaces and sale price

Hypothesis:

  • H0: The mean sale price is the same for houses with and without a fireplace.
  • H1: The mean sale prices differ.

sub = df[['SalePrice','Fireplaces']].dropna()
sub['has_fp'] = (sub['Fireplaces'] > 0).astype(int)

g1 = sub.loc[sub['has_fp']==1, 'SalePrice']
g0 = sub.loc[sub['has_fp']==0, 'SalePrice']

stats.ttest_ind(g1, g0, equal_var=False)
TtestResult(statistic=np.float64(21.105376324953664), pvalue=np.float64(4.666259945494159e-84), df=np.float64(1171.6295727321062))

Welch’s t-test compares means when variances differ.
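
Because sale prices are right-skewed, a rank-based companion check is sensible; a sketch on the same two groups:

# Mann–Whitney U test as a nonparametric alternative
stats.mannwhitneyu(g1, g0, alternative='two-sided')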

7.5.5 Garage type and neighborhood

Hypothesis:

  • H0: Garage type and neighborhood are independent.
  • H1: Garage type and neighborhood are associated.

import pandas as pd
ct = pd.crosstab(df['GarageType'], df['Neighborhood'])
# Unpack the result; the expected-frequency array is large and not shown
chi2, p, dof, expected = stats.chi2_contingency(ct)
chi2, p, dof
(794.6871326886048, 5.179037846292559e-100, 120)

A chi-square test checks for association between two categorical variables.
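
The chi-square statistic grows with sample size, so an effect size aids interpretation; a sketch of Cramér’s V, reusing chi2 and ct from above:

import numpy as np

# Cramér's V: 0 indicates independence, 1 a perfect association
n = ct.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(ct.shape) - 1)))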


7.5.6 Quick reference table for common tests

Situation                     Null hypothesis                Test                  Python tool
k-group mean comparison       All group means equal          One-way ANOVA         anova_lm
k-group, nonparametric        All group distributions equal  Kruskal–Wallis        stats.kruskal
Two means, unequal variance   Means equal                    Welch’s t-test        stats.ttest_ind
Two groups, nonparametric     Distributions equal            Mann–Whitney U        stats.mannwhitneyu
Linear relationship           Slope = 0                      OLS slope test        ols + robust SE
Continuous association        Correlation = 0                Pearson correlation   stats.pearsonr
Monotone association          Correlation = 0                Spearman correlation  stats.spearmanr
Categorical association       Independence                   Chi-square test       stats.chi2_contingency
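
Kruskal–Wallis is the one test in the table that has not yet appeared in the running examples; a minimal sketch across neighborhoods:

from scipy import stats

# Kruskal–Wallis: nonparametric analogue of one-way ANOVA
groups = [g['SalePrice'].to_numpy() for _, g in df.groupby('Neighborhood')]
stats.kruskal(*groups)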

These examples illustrate how hypotheses guide exploration. Each test produces a statistic, a p-value, and often an effect size. Results are provisional and informal, but they shape which relationships merit deeper investigation.

7.6 Iterative Nature of Exploration

Exploration is rarely linear. Cleaning, summarizing, and testing feed back into each other. Each new discovery can prompt a return to earlier steps.

7.6.1 Cycle of exploration

  • Inspect variables and detect anomalies.
  • Clean data based on what anomalies reveal.
  • Summarize distributions and relationships.
  • Formulate and test new hypotheses.
  • Revisit cleaning when results suggest overlooked issues.

7.6.2 Example: Garage year built

Initial inspection may suggest that many values of GarageYrBlt are missing. Documentation indicates that missing means no garage.

# Count missing garage years
df['GarageYrBlt'].isna().sum()
np.int64(81)

When checking integrity, we may notice that some garage years precede the house year built.

# Garage built before house built
(df['GarageYrBlt'] < df['YearBuilt']).sum()
np.int64(9)

This prompts a decision: treat as data entry error, keep with caution, or exclude in certain analyses.
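
A conservative default is to flag rather than delete, leaving the choice to each downstream analysis; a sketch (the column name is ours):

# Flag records whose garage predates the house; keep the rows
df['GarageYrConflict'] = df['GarageYrBlt'] < df['YearBuilt']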

7.6.3 Example: Living area and sale price

A strong correlation between GrLivArea and SalePrice may surface.

# Correlation
sub = df[['GrLivArea','SalePrice']].dropna()
sub.corr()
           GrLivArea  SalePrice
GrLivArea   1.000000   0.708624
SalePrice   0.708624   1.000000

If a few extremely large houses are driving the correlation, it may be necessary to investigate further.

# Identify extreme values
sub[sub['GrLivArea'] > 4000]
      GrLivArea  SalePrice
523        4676     184750
691        4316     755000
1182       4476     745000
1298       5642     160000

These observations may be genuine luxury properties, or they may distort summary statistics. The decision is context-dependent and should be documented.
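
A quick sensitivity check makes the stakes concrete: recompute the correlation with the extreme rows excluded and compare, keeping the same ad hoc 4,000 square-foot cutoff used above:

# Correlation with and without houses above 4000 sq ft of living area
sub.corr()
sub[sub['GrLivArea'] <= 4000].corr()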

Exploration is not a one-pass process. Findings in one step often require revisiting previous steps. Clear documentation ensures that these iterations are transparent and reproducible.

7.7 Good Practices in Data Exploration

Clear habits in exploration make later analysis more reliable and easier to share. The following practices help ensure quality and reproducibility.

7.7.1 Reproducibility

  • Keep all work in notebooks or scripts, never only in spreadsheets.
  • Ensure that another analyst can rerun the code and obtain identical results.
# Example: set a random seed for reproducibility
import numpy as np
np.random.seed(20250923)
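
Identical results also depend on identical software, so recording versions next to the outputs is a cheap safeguard:

# Example: record interpreter and library versions with the results
import sys
import numpy as np
import pandas as pd
print(sys.version)
print('pandas', pd.__version__, '| numpy', np.__version__)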

7.7.2 Documentation

  • Record cleaning decisions explicitly. For example, note that NA in PoolQC means no pool.
  • Keep a running log of questions, anomalies, and decisions.
# Example: create an indicator for presence of a pool
df['HasPool'] = df['PoolArea'] > 0

7.7.3 Balanced curiosity and rigor

  • Exploration can generate many possible stories. Avoid over-interpreting patterns before formal testing.
  • Distinguish clearly between exploratory checks and confirmatory analysis.

7.7.4 Effect sizes and intervals

  • Report not only p-values but also effect sizes and confidence intervals.
  • This practice keeps focus on the magnitude of relationships.
# Example: compute Cohen's d for fireplace vs no fireplace
sub = df[['SalePrice','Fireplaces']].dropna()
sub['has_fp'] = (sub['Fireplaces'] > 0).astype(int)

# Difference in group means (with fireplace minus without)
mean_diff = sub.groupby('has_fp')['SalePrice'].mean().diff().iloc[-1]
# Simple average of the two group standard deviations; this approximates
# the df-weighted pooled SD and is adequate for an exploratory check
pooled_sd = sub.groupby('has_fp')['SalePrice'].std().mean()
cohens_d = mean_diff / pooled_sd
cohens_d
np.float64(1.144008229281349)
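
A confidence interval for the raw mean difference complements the standardized effect; a sketch using statsmodels with the unequal-variance (Welch) option:

from statsmodels.stats.weightstats import CompareMeans, DescrStatsW

# 95% CI for the mean price difference (fireplace minus no fireplace)
g1 = sub.loc[sub['has_fp'] == 1, 'SalePrice']
g0 = sub.loc[sub['has_fp'] == 0, 'SalePrice']
cm = CompareMeans(DescrStatsW(g1.to_numpy()), DescrStatsW(g0.to_numpy()))
cm.tconfint_diff(usevar='unequal')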

7.7.5 Transparency

  • Share both code and notes with collaborators.
  • Version control with Git helps track changes and decisions.

Good practices keep exploration structured and reproducible. They also create a record of reasoning that improves collaboration and supports later analysis.