Data exploration is the disciplined process that connects raw records to credible analysis. Its goals are to identify data quality issues, understand what variables mean operationally, and generate falsifiable claims that later analysis can scrutinize. The work is iterative: initial checks surface inconsistencies or gaps; targeted cleaning follows; then renewed examination tests whether earlier conclusions still hold. Reproducibility is non-negotiable, so every step should live in code with a brief log explaining what changed and why. Critically, this phase is not confirmatory inference. Instead, it frames questions clearly, proposes measurable definitions, and records assumptions that later sections will test formally.
Scope of this chapter. We will develop practical habits for high-quality exploration. First, we present cleaning principles: consistency of formats and units, completeness and missing-data mechanisms, accuracy and duplicates, and integrity across related fields, together with clear documentation. Next, we practice numerically driven summaries: distributional statistics, grouped descriptives, cross-tabulations, and simple association checks that reveal promising relationships. Finally, we show how to state hypotheses correctly—with the null representing no effect or independence—and run appropriate tests in Python (ANOVA or Kruskal–Wallis for group means, Welch’s t-test or Mann–Whitney for two groups, OLS slope tests with robust errors, and Pearson or Spearman correlations), pairing p-values with effect sizes and intervals.
Any analysis begins by becoming familiar with the dataset. This involves learning what the observations represent, what types of variables are recorded, and whether the structure meets the expectations of tidy data. Before turning to a specific example, we highlight general principles.
Units of observation. Each row of a dataset should correspond to a single unit, such as an individual, transaction, or property sale. Misalignment of units often leads to errors in later analysis.
Variables and their types. Columns record attributes of units. These may be continuous measurements, counts, ordered categories, or nominal labels. Recognizing the correct type is essential because it dictates which summary statistics and hypothesis tests are appropriate.
Tidy data principles. In a tidy format, each row is one observation and each column is one variable. When data are stored otherwise, reshaping is necessary before analysis can proceed smoothly.
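For example, a minimal sketch with a small hypothetical wide table (the column names are invented, not from the Ames data) shows how pandas reshapes it into tidy form:
import pandas as pd
# Hypothetical wide layout: one column per assessment year
wide = pd.DataFrame({'house_id': [1, 2],
                     'tax_2019': [1500, 2100],
                     'tax_2020': [1550, 2150]})
# Tidy long layout: each row is one house-year observation
tidy = wide.melt(id_vars='house_id', var_name='year', value_name='tax')
tidy['year'] = tidy['year'].str.replace('tax_', '').astype(int)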
We now illustrate these ideas using a reduced version of the Ames Housing dataset (De Cock, 2009), which is available directly from OpenML. This filtered subset of roughly 1,460 observations drops older sales, keeps only certain years, and removes some variables, which makes modeling cleaner for beginners.
import openml
# Load Ames Housing (OpenML ID 42165)
dataset = openml.datasets.get_dataset(42165)
df, *_ = dataset.get_data()
# Basic dimensions
df.shape
(1460, 81)
The dataset contains 1,460 house sales and 81 variables.
It is useful to view the first few rows to confirm the structure.
# First few rows
df.head()
 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | None | Reg | Lvl | AllPub | ... | 0 | None | None | None | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | None | Reg | Lvl | AllPub | ... | 0 | None | None | None | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | None | IR1 | Lvl | AllPub | ... | 0 | None | None | None | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | None | IR1 | Lvl | AllPub | ... | 0 | None | None | None | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | None | IR1 | Lvl | AllPub | ... | 0 | None | None | None | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
Each row corresponds to one property sale, while columns record attributes such as lot size, neighborhood, and sale price.
Understanding whether variables are numeric, categorical, or temporal guides later exploration and cleaning.
# Dtypes in pandas
df.dtypes.value_counts()
object 43
int64 22
uint8 13
float64 3
Name: count, dtype: int64
# Example: show a few variables with types
df.dtypes.head(10)
Id int64
MSSubClass uint8
MSZoning object
LotFrontage float64
LotArea int64
Street object
Alley object
LotShape object
LandContour object
Utilities object
dtype: object
Most features are numeric or categorical. Some, such as YearBuilt, are integers but represent calendar years.
The outcome of interest is the sale price.
# The default target from OpenML
dataset.default_target_attribute
'SalePrice'
Numeric summaries highlight scale and possible outliers.
# Summary statistics for numeric columns
df.describe().T.head(10)
 | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Id | 1460.0 | 730.500000 | 421.610009 | 1.0 | 365.75 | 730.5 | 1095.25 | 1460.0 |
MSSubClass | 1460.0 | 56.897260 | 42.300571 | 20.0 | 20.00 | 50.0 | 70.00 | 190.0 |
LotFrontage | 1201.0 | 70.049958 | 24.284752 | 21.0 | 59.00 | 69.0 | 80.00 | 313.0 |
LotArea | 1460.0 | 10516.828082 | 9981.264932 | 1300.0 | 7553.50 | 9478.5 | 11601.50 | 215245.0 |
OverallQual | 1460.0 | 6.099315 | 1.382997 | 1.0 | 5.00 | 6.0 | 7.00 | 10.0 |
OverallCond | 1460.0 | 5.575342 | 1.112799 | 1.0 | 5.00 | 5.0 | 6.00 | 9.0 |
YearBuilt | 1460.0 | 1971.267808 | 30.202904 | 1872.0 | 1954.00 | 1973.0 | 2000.00 | 2010.0 |
YearRemodAdd | 1460.0 | 1984.865753 | 20.645407 | 1950.0 | 1967.00 | 1994.0 | 2004.00 | 2010.0 |
MasVnrArea | 1452.0 | 103.685262 | 181.066207 | 0.0 | 0.00 | 0.0 | 166.00 | 1600.0 |
BsmtFinSF1 | 1460.0 | 443.639726 | 456.098091 | 0.0 | 0.00 | 383.5 | 712.25 | 5644.0 |
Categorical summaries reveal balance among levels.
# Frequency counts for categorical columns
df['Neighborhood'].value_counts().head()
Neighborhood
NAmes 225
CollgCr 150
OldTown 113
Edwards 100
Somerst 86
Name: count, dtype: int64
# Another example
df['GarageType'].value_counts(dropna=False)
GarageType
Attchd 870
Detchd 387
BuiltIn 88
None 81
Basment 19
CarPort 9
2Types 6
Name: count, dtype: int64
Simple checks can detect implausible combinations.
# Houses should not be sold before built
(df['YrSold'] < df['YearBuilt']).sum()
np.int64(0)
# Garage year built should not precede house year built
(df['GarageYrBlt'] < df['YearBuilt']).sum()
np.int64(9)
These checks suggest that while the dataset is well curated, certain quirks, such as the nine garage years that precede the house's construction year, require careful interpretation.
Exploration begins with data cleaning. The purpose is not to modify values casually but to identify issues, understand their sources, and decide on a transparent response. The following principles provide structure.
Consistency. Variables should follow the same format and unit across all records. Dates should have a common representation, categorical labels should not differ by spelling, and measurements should use the same scale.
Completeness. Missing values are unavoidable. It is important to determine whether they arise from data entry errors, survey nonresponse, or structural absence. For example, a missing value in FireplaceQu often indicates that a house has no fireplace rather than missing information.
Accuracy. Values should be plausible. Obvious errors include negative square footage or sale years in the future. Duplicate records also fall under this category.
Integrity. Relationships between variables should be logically consistent. A house cannot be sold before it was built. If related totals exist, the sum of parts should match the total.
Transparency. All cleaning decisions should be recorded. Reproducibility requires that another analyst can understand what was changed and why.
We apply these principles to the Ames dataset. The first step is to inspect missing values.
# Count missing values in each column
df.isna().sum().sort_values(ascending=False).head(15)
PoolQC 1453
MiscFeature 1406
Alley 1369
Fence 1179
FireplaceQu 690
LotFrontage 259
GarageFinish 81
GarageQual 81
GarageYrBlt 81
GarageType 81
GarageCond 81
BsmtExposure 38
BsmtFinType2 38
BsmtCond 37
BsmtFinType1 37
dtype: int64
Several variables, such as PoolQC, MiscFeature, and Alley, contain many missing entries. Documentation shows that these are structural, indicating the absence of the feature.
# Example: check PoolQC against PoolArea
(df['PoolQC'].isna() & (df['PoolArea'] > 0)).sum()
np.int64(0)
The result is zero, confirming that missing PoolQC means no pool.
Consistency can be checked by reviewing categorical labels.
# Distinct values in Exterior1st
df['Exterior1st'].unique()
array(['VinylSd', 'MetalSd', 'Wd Sdng', 'HdBoard', 'BrkFace', 'WdShing',
'CemntBd', 'Plywood', 'AsbShng', 'Stucco', 'BrkComm', 'AsphShn',
'Stone', 'ImStucc', 'CBlock'], dtype=object)
If spelling variants are detected, they should be harmonized.
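A minimal sketch of what harmonization might look like; the variant spellings in the mapping are hypothetical, not values actually present in this column:
# Hypothetical cleanup: trim whitespace and map variant spellings to canonical labels
label_map = {'Wd Shng': 'WdShing', 'CmentBd': 'CemntBd', 'Brk Cmn': 'BrkComm'}
df['Exterior1st'] = df['Exterior1st'].str.strip().replace(label_map)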
Accuracy checks involve searching for implausible values.
# Negative or zero living area
(df['GrLivArea'] <= 0).sum()
np.int64(0)
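Duplicate records are another accuracy concern. A quick sketch counts fully identical rows and repeated Id values:
# Fully duplicated rows and repeated identifiers
df.duplicated().sum(), df['Id'].duplicated().sum()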
Integrity checks verify logical relationships.
# Houses sold before they were built
(df['YrSold'] < df['YearBuilt']).sum()
np.int64(0)
# Garage built before house built
(df['GarageYrBlt'] < df['YearBuilt']).sum()
np.int64(9)
These checks help identify issues to document and, if appropriate, correct in a reproducible way.
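One lightweight way to keep such decisions transparent is to maintain a small log next to the analysis code; the structure below is a sketch of our own convention, not part of any library:
# A reproducible record of findings and the responses chosen
cleaning_log = [
    {'check': 'GarageYrBlt < YearBuilt', 'n_rows': 9,
     'decision': 'keep, flag for sensitivity analyses'},
    {'check': 'PoolQC missing while PoolArea > 0', 'n_rows': 0,
     'decision': 'treat missing PoolQC as structural (no pool)'},
]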
After establishing data cleaning principles, the next step is to compute summaries that describe the distributions of variables and their relationships. This section avoids graphics, relying instead on tables and statistics.
Simple statistics reveal scale, central tendency, and variability.
# Sale price distribution
df['SalePrice'].describe()
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
The sale price is right-skewed, with a mean higher than the median.
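A quick numerical confirmation, sketched with pandas' built-in sample skewness:
# Positive skewness indicates a long right tail of expensive houses
df['SalePrice'].skew()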
# Lot area distribution
df['LotArea'].describe()
count 1460.000000
mean 10516.828082
std 9981.264932
min 1300.000000
25% 7553.500000
50% 9478.500000
75% 11601.500000
max 215245.000000
Name: LotArea, dtype: float64
Lot size shows extreme variation, indicating possible outliers.
Comparisons across categories highlight differences in central tendency.
# Mean sale price by neighborhood
df.groupby('Neighborhood')['SalePrice'].mean().sort_values().head()
Neighborhood
MeadowV 98576.470588
IDOTRR 100123.783784
BrDale 104493.750000
BrkSide 124834.051724
Edwards 128219.700000
Name: SalePrice, dtype: float64
Neighborhoods differ substantially in average sale price.
# Median sale price by overall quality
df.groupby('OverallQual')['SalePrice'].median()
OverallQual
1 50150.0
2 60000.0
3 86250.0
4 108000.0
5 133000.0
6 160000.0
7 200141.0
8 269750.0
9 345000.0
10 432390.0
Name: SalePrice, dtype: float64
Higher quality ratings correspond to higher prices.
Cross-tabulations summarize associations between categorical variables.
import pandas as pd
# Neighborhood by garage type
pd.crosstab(df['Neighborhood'], df['GarageType']).head()
GarageType | 2Types | Attchd | Basment | BuiltIn | CarPort | Detchd |
---|---|---|---|---|---|---|
Neighborhood | ||||||
Blmngtn | 0 | 17 | 0 | 0 | 0 | 0 |
Blueste | 0 | 2 | 0 | 0 | 0 | 0 |
BrDale | 0 | 2 | 0 | 0 | 0 | 13 |
BrkSide | 0 | 3 | 0 | 1 | 0 | 44 |
ClearCr | 0 | 24 | 0 | 1 | 0 | 2 |
Some garage types are common only in specific neighborhoods.
Correlation coefficients capture linear associations between numeric variables.
# Correlation between living area and sale price
'GrLivArea','SalePrice']].corr() df[[
 | GrLivArea | SalePrice
---|---|---
GrLivArea | 1.000000 | 0.708624
SalePrice | 0.708624 | 1.000000
# Correlation between lot area and sale price
'LotArea','SalePrice']].corr() df[[
 | LotArea | SalePrice
---|---|---
LotArea | 1.000000 | 0.263843
SalePrice | 0.263843 | 1.000000
Living area is strongly correlated with price, while lot area shows a weaker association.
Examining distributions within subgroups can surface interaction patterns.
# Average sale price by house style
df.groupby('HouseStyle')['SalePrice'].mean().sort_values()
HouseStyle
1.5Unf 110150.000000
SFoyer 135074.486486
1.5Fin 143116.740260
2.5Unf 157354.545455
SLvl 166703.384615
1Story 175985.477961
2Story 210051.764045
2.5Fin 220000.000000
Name: SalePrice, dtype: float64
House style is another factor associated with variation in price.
These practices provide a numerical portrait of the data, guiding later steps where hypotheses will be stated explicitly and tested with appropriate statistical methods.
Exploration becomes more rigorous when we state hypotheses formally and run appropriate tests. The null hypothesis always represents no effect, no difference, or no association. The alternative expresses the presence of an effect. The examples below use the Ames Housing data.
Hypothesis:
- H0: The mean sale prices are equal across neighborhoods.
- H1: At least one neighborhood has a different mean.
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
sub = df[['SalePrice','Neighborhood']].dropna()
model = smf.ols('SalePrice ~ C(Neighborhood)', data=sub).fit()
anova_lm(model, typ=2)
 | sum_sq | df | F | PR(>F)
---|---|---|---|---
C(Neighborhood) | 5.023606e+12 | 24.0 | 71.784865 | 1.558600e-225 |
Residual | 4.184305e+12 | 1435.0 | NaN | NaN |
ANOVA tests equality of group means. If significant, post-hoc comparisons can identify which neighborhoods differ.
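A minimal sketch of such a follow-up, assuming statsmodels' Tukey HSD procedure (other post-hoc methods are equally valid):
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# All pairwise neighborhood comparisons with a family-wise error correction
tukey = pairwise_tukeyhsd(endog=sub['SalePrice'], groups=sub['Neighborhood'], alpha=0.05)
print(tukey.summary())
With 25 neighborhoods the summary lists several hundred pairwise contrasts, so in practice one usually inspects only the comparisons of interest.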
Hypothesis:
- H0: The slope for YearBuilt is zero; no linear relationship.
- H1: The slope is not zero.
model = smf.ols('SalePrice ~ YearBuilt', data=df).fit(cov_type='HC3')
model.summary().tables[1]
 | coef | std err | z | P>|z| | [0.025 | 0.975]
---|---|---|---|---|---|---
Intercept | -2.53e+06 | 1.36e+05 | -18.667 | 0.000 | -2.8e+06 | -2.26e+06 |
YearBuilt | 1375.3735 | 68.973 | 19.941 | 0.000 | 1240.189 | 1510.558 |
The regression slope test checks whether newer houses tend to sell for more.
Hypothesis:
- H0: The population correlation is zero.
- H1: The correlation is not zero.
from scipy import stats
sub = df[['LotArea','SalePrice']].dropna()
stats.pearsonr(sub['LotArea'], sub['SalePrice'])
PearsonRResult(statistic=np.float64(0.2638433538714058), pvalue=np.float64(1.123139154918551e-24))
A Pearson correlation tests linear association. A Spearman rank correlation can be used when distributions are skewed.
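A minimal sketch of the rank-based alternative on the same columns:
# Spearman correlation uses ranks, so it is less sensitive to skew and outliers
stats.spearmanr(sub['LotArea'], sub['SalePrice'])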
Hypothesis:
- H0: The mean sale price is the same for houses with and without a fireplace.
- H1: The mean sale prices differ.
sub = df[['SalePrice','Fireplaces']].dropna()
sub['has_fp'] = (sub['Fireplaces'] > 0).astype(int)

g1 = sub.loc[sub['has_fp']==1, 'SalePrice']
g0 = sub.loc[sub['has_fp']==0, 'SalePrice']

stats.ttest_ind(g1, g0, equal_var=False)
TtestResult(statistic=np.float64(21.105376324953664), pvalue=np.float64(4.666259945494159e-84), df=np.float64(1171.6295727321062))
Welch’s t-test compares means when variances differ.
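Because sale prices are skewed, the nonparametric counterpart can serve as a robustness check; a sketch reusing the groups defined above:
# Mann–Whitney U compares the two groups without assuming normal distributions
stats.mannwhitneyu(g1, g0, alternative='two-sided')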
Hypothesis:
- H0: Garage type and neighborhood are independent.
- H1: Garage type and neighborhood are associated.
import pandas as pd
ct = pd.crosstab(df['GarageType'], df['Neighborhood'])
stats.chi2_contingency(ct)
Chi2ContingencyResult(statistic=np.float64(794.6871326886048), pvalue=np.float64(5.179037846292559e-100), dof=120, expected_freq=array([[7.39666425e-02, 8.70195794e-03, 6.52646846e-02, 2.08846991e-01,
1.17476432e-01, 6.43944888e-01, 2.21899927e-01, 3.39376360e-01,
3.43727339e-01, 1.26178390e-01, 5.22117476e-02, 1.91443075e-01,
9.52864394e-01, 3.91588107e-02, 3.17621465e-01, 1.78390138e-01,
3.35025381e-01, 4.39448876e-01, 8.70195794e-02, 3.08919507e-01,
2.52356780e-01, 3.74184191e-01, 1.08774474e-01, 1.65337201e-01,
4.78607687e-02],
[1.07251632e+01, 1.26178390e+00, 9.46337926e+00, 3.02828136e+01,
1.70340827e+01, 9.33720087e+01, 3.21754895e+01, 4.92095722e+01,
4.98404641e+01, 1.82958666e+01, 7.57070341e+00, 2.77592458e+01,
1.38165337e+02, 5.67802756e+00, 4.60551124e+01, 2.58665700e+01,
4.85786802e+01, 6.37200870e+01, 1.26178390e+01, 4.47933285e+01,
3.65917331e+01, 5.42567078e+01, 1.57722988e+01, 2.39738941e+01,
6.93981146e+00],
[2.34227701e-01, 2.75562001e-02, 2.06671501e-01, 6.61348803e-01,
3.72008702e-01, 2.03915881e+00, 7.02683104e-01, 1.07469181e+00,
1.08846991e+00, 3.99564902e-01, 1.65337201e-01, 6.06236403e-01,
3.01740392e+00, 1.24002901e-01, 1.00580131e+00, 5.64902103e-01,
1.06091371e+00, 1.39158811e+00, 2.75562001e-01, 9.78245105e-01,
7.99129804e-01, 1.18491661e+00, 3.44452502e-01, 5.23567803e-01,
1.51559101e-01],
[1.08484409e+00, 1.27628716e-01, 9.57215373e-01, 3.06308920e+00,
1.72298767e+00, 9.44452502e+00, 3.25453227e+00, 4.97751994e+00,
5.04133430e+00, 1.85061639e+00, 7.65772299e-01, 2.80783176e+00,
1.39753445e+01, 5.74329224e-01, 4.65844815e+00, 2.61638869e+00,
4.91370558e+00, 6.44525018e+00, 1.27628716e+00, 4.53081943e+00,
3.70123278e+00, 5.48803481e+00, 1.59535896e+00, 2.42494561e+00,
7.01957941e-01],
[1.10949964e-01, 1.30529369e-02, 9.78970268e-02, 3.13270486e-01,
1.76214648e-01, 9.65917331e-01, 3.32849891e-01, 5.09064540e-01,
5.15591008e-01, 1.89267585e-01, 7.83176215e-02, 2.87164612e-01,
1.42929659e+00, 5.87382161e-02, 4.76432197e-01, 2.67585207e-01,
5.02538071e-01, 6.59173314e-01, 1.30529369e-01, 4.63379260e-01,
3.78535170e-01, 5.61276287e-01, 1.63161711e-01, 2.48005801e-01,
7.17911530e-02],
[4.77084844e+00, 5.61276287e-01, 4.20957215e+00, 1.34706309e+01,
7.57722988e+00, 4.15344453e+01, 1.43125453e+01, 2.18897752e+01,
2.21704133e+01, 8.13850616e+00, 3.36765772e+00, 1.23480783e+01,
6.14597534e+01, 2.52574329e+00, 2.04865845e+01, 1.15061639e+01,
2.16091371e+01, 2.83444525e+01, 5.61276287e+00, 1.99253082e+01,
1.62770123e+01, 2.41348803e+01, 7.01595359e+00, 1.06642495e+01,
3.08701958e+00]]))
A chi-square test checks for association between two categorical variables.
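P-values alone say little about the strength of the association, so an effect size such as Cramér's V is a useful companion (a sketch; the formula is standard, but the code below is our own):
import numpy as np
# Cramér's V: scale the chi-square statistic by sample size and table dimensions
chi2, p, dof, expected = stats.chi2_contingency(ct)
n = ct.to_numpy().sum()
cramers_v = np.sqrt((chi2 / n) / (min(ct.shape) - 1))
cramers_v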
Situation | Null hypothesis | Test | Python tool |
---|---|---|---|
k-group mean comparison | All group means equal | One-way ANOVA | anova_lm |
k-group, nonparametric | All group distributions equal | Kruskal–Wallis | stats.kruskal |
Two means, unequal variance | Means equal | Welch’s t-test | stats.ttest_ind |
Two groups, nonparametric | Distributions equal | Mann–Whitney U | stats.mannwhitneyu |
Linear relationship | Slope = 0 | OLS slope test | ols + robust SE |
Continuous association | Correlation = 0 | Pearson correlation | stats.pearsonr |
Monotone association | Correlation = 0 | Spearman correlation | stats.spearmanr |
Categorical association | Independence | Chi-square test | stats.chi2_contingency |
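As one further illustration from the table, a minimal sketch of the Kruskal–Wallis test across neighborhoods, the nonparametric counterpart of the ANOVA above:
# Kruskal–Wallis compares neighborhood price distributions using ranks
groups = [g['SalePrice'].to_numpy() for _, g in df.groupby('Neighborhood')]
stats.kruskal(*groups)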
These examples illustrate how hypotheses guide exploration. Each test produces a statistic, a p-value, and often an effect size. Results are provisional and informal, but they shape which relationships merit deeper investigation.
Exploration is rarely linear. Cleaning, summarizing, and testing feed back into each other. Each new discovery can prompt a return to earlier steps.
Initial inspection may suggest that many values of GarageYrBlt are missing. Documentation indicates that missing means no garage.
# Count missing garage years
df['GarageYrBlt'].isna().sum()
np.int64(81)
When checking integrity, we may notice that some garage years precede the house year built.
# Garage built before house built
(df['GarageYrBlt'] < df['YearBuilt']).sum()
np.int64(9)
This prompts a decision: treat as data entry error, keep with caution, or exclude in certain analyses.
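Whichever option is chosen, it is worth making the decision explicit in the data rather than silently editing values. A minimal sketch that flags the affected rows (the column name is our own):
# Flag suspicious garage years so later analyses can include or exclude them deliberately
df['garage_year_suspect'] = df['GarageYrBlt'] < df['YearBuilt']
df['garage_year_suspect'].sum()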
A strong correlation between GrLivArea and SalePrice may surface.
# Correlation
sub = df[['GrLivArea','SalePrice']].dropna()
sub.corr()
 | GrLivArea | SalePrice
---|---|---
GrLivArea | 1.000000 | 0.708624
SalePrice | 0.708624 | 1.000000
If a few extremely large houses are driving the correlation, it may be necessary to investigate further.
# Identify extreme values
sub[sub['GrLivArea'] > 4000][['GrLivArea','SalePrice']]
 | GrLivArea | SalePrice
---|---|---
523 | 4676 | 184750
691 | 4316 | 755000
1182 | 4476 | 745000
1298 | 5642 | 160000
These observations may be genuine luxury properties, or they may distort summary statistics. The decision is context-dependent and should be documented.
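A simple sensitivity check, sketched below, recomputes the correlation without the largest houses and compares it to the full-sample value:
# Does the association depend on the four houses above 4000 square feet?
sub[sub['GrLivArea'] <= 4000][['GrLivArea','SalePrice']].corr()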
Exploration is not a one-pass process. Findings in one step often require revisiting previous steps. Clear documentation ensures that these iterations are transparent and reproducible.
Clear habits in exploration make later analysis more reliable and easier to share. The following practices help ensure quality and reproducibility.
# Example: set a random seed for reproducibility
import numpy as np
np.random.seed(20250923)
Document structural-missingness conventions, for example that a missing PoolQC means no pool.
# Example: create an indicator for presence of a pool
df['HasPool'] = df['PoolArea'] > 0
# Example: compute Cohen's d for fireplace vs no fireplace
sub = df[['SalePrice','Fireplaces']].dropna()
sub['has_fp'] = (sub['Fireplaces'] > 0).astype(int)

mean_diff = sub.groupby('has_fp')['SalePrice'].mean().diff().iloc[-1]
pooled_sd = sub.groupby('has_fp')['SalePrice'].std().mean()
cohens_d = mean_diff / pooled_sd
cohens_d
np.float64(1.144008229281349)
Good practices keep exploration structured and reproducible. They also create a record of reasoning that improves collaboration and supports later analysis.