5  Data Visualization

In this chapter we embrace visualization as a cornerstone of modern data-analysis workflows: turning raw numbers into meaningful visuals that support insight, decision-making and communication. We begin by exploring static and information-rich graphics through plotnine, building on the grammar‐of-graphics approach. Next we extend into spatial data-visualisation using GeoPandas, equipping you to map, project and interpret geospatial patterns. Later sections will introduce further tools and techniques (for example interactive maps or dashboards), but throughout we emphasise the same core questions: Which visual form fits the data and question? How do our design and implementation choices influence what the viewer sees — and what they don’t see? With the tools and principles in hand, you’ll be prepared to insert clear, effective visualisation into your data-science project workflow.

5.1 Spatial Data with Geopandas

This section is by Alejandro Haerter, a junior majoring in Statistical Data Science and Economics.

5.1.1 Spatial Data and Python

Spatial data is any information which describes the geographic location and shape of features. We might represent these as:

  • Points (address, cities)
  • Lines (Roads, rivers)
  • Polygons (Property parcels, city boundaries, ZIP codes)

Spatial data is everywhere; we see it on maps, it has legal implications; and it often delineates demographic information. In short, a data scientist is certain to encounter spatial data in their career, and should know the tools to work with it.

Traditional Geographic Information Systems (GIS) tools (e.g., ArcGIS, QGIS) are proprietary, require steep learning curves, and do not implement well with the data science workflow. Luckily, with GeoPandas, we can use Python for spatial data analysis, preserving the data science workflow.

5.1.2 Introduction to GeoPandas

GeoPandas is an open source package which adds support for geographic data to pandas objects, first released in 2014. GeoPandas’ two main data structures are the geopandas.GeoSeries, an extension of the pandas.Series, and the geopandas.GeoDataFrame, an extension of the pandas.DataFrame.

GeoPandas is capable of geometric operations, transformations, and plotting, all relevant tools for for operating with spatial data.

GeoPandas requires Folium and Matplotlib as dependencies, both of which for plotting.

5.1.2.1 GeoSeries

geopandas.GeoSeries a subclass of pandas.Series which can only store geometries. Potential geometries include Points, Lines, Polygons, etc. Not all geometries need be the same type, but must be some form of Shapely geometric object.

The GeoSeries.crs attribute stores the coordinate reference system (CRS) information of the GeoSeries. A CRS relates how map coordinates relate to real locations on Earth by specifying a datum (a model of the Earth’s shape), a coordinate system (e.g., Latitude/Longitude, UTM) and units of measurement (e.g., degrees, meters).

5.1.2.2 GeoDataFrame

A GeoDataFrame is the core data stucture of GeoPandas. It can store one or more geometry columns and perform spatial operations. Essentially, it is a pandas.DataFrame combined with one or more GeoSeries.

A mock GeoDataFrame might look like:

    city       population                   geometry
0  NYC        8800000      POINT (-74.0060 40.7128)
1  Boston      675000      POINT (-71.0589 42.3601)
2  Chicago    2700000      POINT (-87.6298 41.8781)

Importantly, while we can have multiple GeoSeries in a GeoDataFrame, only one GeoSeries at a time is the active geometry column. All geometric operations act on this column; it’s accessed by GeoDataFrame.geometry attribute.

5.1.3 Basic Operations

A file containg both data and geometry can be read by geopandas.read_file(). For this example, I use a dataset which contains geometric information for each of NYC’s five boroughs.

import geopandas as gpd
from geodatasets import get_path

path_to_data = get_path("nybb")  # map of NYC boroughs
gdf = gpd.read_file(path_to_data)

gdf
BoroCode BoroName Shape_Leng Shape_Area geometry
0 5 Staten Island 330470.010332 1.623820e+09 MULTIPOLYGON (((970217.022 145643.332, 970227....
1 4 Queens 896344.047763 3.045213e+09 MULTIPOLYGON (((1029606.077 156073.814, 102957...
2 3 Brooklyn 741080.523166 1.937479e+09 MULTIPOLYGON (((1021176.479 151374.797, 102100...
3 1 Manhattan 359299.096471 6.364715e+08 MULTIPOLYGON (((981219.056 188655.316, 980940....
4 2 Bronx 464392.991824 1.186925e+09 MULTIPOLYGON (((1012821.806 229228.265, 101278...

5.1.3.1 Inspecting a GeoDataFrame

GeoPandas syntax is just like pandas. Methods like .head(), .info(), and .shape, .rename, .drop, etc., apply, all work the same. For example, I rename the column geometry to poly, so that I don’t confuse it with the .geometry attribute. (This is an example of good naming practice to avoid conflicts with built-in attributes or methods.)

gdf = gdf.rename(columns={"geometry": "poly"})
gdf.head(1)
BoroCode BoroName Shape_Leng Shape_Area poly
0 5 Staten Island 330470.010332 1.623820e+09 MULTIPOLYGON (((970217.022 145643.332, 970227....

GeoPandas also has its own functions, methods, and attributes which are specific to it. Recall .geometry, which gives us the active geometry column. This GeoDataFrame is still pointing to a column called "geometry", even though its renamed and doesn’t exist. This can be fixed with .set_geometry.

gdf = gdf.set_geometry('poly')
print(gdf.geometry)
0    MULTIPOLYGON (((970217.022 145643.332, 970227....
1    MULTIPOLYGON (((1029606.077 156073.814, 102957...
2    MULTIPOLYGON (((1021176.479 151374.797, 102100...
3    MULTIPOLYGON (((981219.056 188655.316, 980940....
4    MULTIPOLYGON (((1012821.806 229228.265, 101278...
Name: poly, dtype: geometry

Recall how to access CRS:

print(gdf.crs)
EPSG:2263

EPSG:2263 is a CRS specific for New York City. It uses feet for distance operations.

5.1.3.2 Area

If I wanted to find the area enclosed by the polygons, I’d use the .area attribute, which gives the area enclosed by each polygon.

gdf = gdf.set_index("BoroName") # for legibility
gdf["area"] = gdf.area
gdf["area"]
BoroName
Staten Island    1.623822e+09
Queens           3.045214e+09
Brooklyn         1.937478e+09
Manhattan        6.364712e+08
Bronx            1.186926e+09
Name: area, dtype: float64

Because of EPSG:2263, area is given in square footage. For example, Manhattan is 6.364712e+08 = 636,471,200 ft2 = 22.9mi2.

5.1.3.3 Boundaries and Centroids

Right now, the active geometry column contains polygons. We can access the perimeters and the centroids of these polygons:

gdf["boundary"] = gdf.boundary
gdf["centroid_ft"] = gdf.centroid

gdf[['boundary','centroid_ft']].head()
boundary centroid_ft
BoroName
Staten Island MULTILINESTRING ((970217.022 145643.332, 97022... POINT (941639.45 150931.991)
Queens MULTILINESTRING ((1029606.077 156073.814, 1029... POINT (1034578.078 197116.604)
Brooklyn MULTILINESTRING ((1021176.479 151374.797, 1021... POINT (998769.115 174169.761)
Manhattan MULTILINESTRING ((981219.056 188655.316, 98094... POINT (993336.965 222451.437)
Bronx MULTILINESTRING ((1012821.806 229228.265, 1012... POINT (1021174.79 249937.98)

gdf now has boundary and centroid columns as additional geometry columns, but the active column wont change unless specified.

5.1.3.4 Distance Operation

If I wanted to find the distance between the center of Brooklyn and the center of the Bronx, that’s taking the .distance() between two centroids.

# active geometry set to centroid info
gdf = gdf.set_geometry('centroid_ft')

#finds distance between indeces given
gdf.geometry['Bronx'].distance(gdf.geometry['Brooklyn'])
79011.6278663779

Recall the current EPSG:2263, which is a projection in feet. ~79,000ft \(\approx\) ~15mi.

We have all the pandas functionality available here too, for example, .mean():

from shapely.geometry import Point

cx = gdf["centroid_ft"].x.mean()
cy = gdf["centroid_ft"].y.mean()
geo_center_ft = Point(cx, cy) # Shapely

print(geo_center_ft)
POINT (997899.6796377342 198921.55457843072)

Gives us a Point position of the centroid of centroids, i.e., the geographic center of NYC. Although, that doesn’t tell us very much…

5.1.3.5 Changing CRS

Rule of thumb: Operations which rely on distance and area should use a Projected CRS (m, ft, km, etc). Geographic CRS (degrees) is better for position information, like Lat/Lon of a location. We use .to_crs().

# new GeoSeries is just a different projection of existing GeoSeries
gdf['centroid_ll'] = gdf['centroid_ft'].to_crs(4326)
gdf['centroid_ll']
BoroName
Staten Island     POINT (-74.1534 40.58085)
Queens           POINT (-73.81847 40.70757)
Brooklyn         POINT (-73.94768 40.64472)
Manhattan        POINT (-73.96719 40.77725)
Bronx            POINT (-73.86653 40.85262)
Name: centroid_ll, dtype: geometry

centroid_ll is centroid reprojected EPSG:4326, now giving coordinates. E.g., the center of the Bronx is at coordinates 40.85°N, 73.87°W.

clon = gdf["centroid_ll"].x.mean()
clat = gdf["centroid_ll"].y.mean()
geo_center_ll = Point(clon, clat) # Shapely

print(geo_center_ll)
POINT (-73.95065406867823 40.7126019492116)

5.1.3.6 File Writing

When I want to save my GeoDataFrame to the computer, we use GeoDataFrame.to_file. GeoPandas will infer by the file format by the file extension.

# This won't run because geoJSON doesn't support multiple GeoSeries!
gdf.to_file("nyc_boroughs.geojson")

I recommend using feather.

# Use Feather!
import pyarrow
gdf.to_feather("nyc_boroughs.feather")

Just as pandas can handle file types .csv, .feather, .html, etc., GeoPandas data can also be stored in multipe file types, like .shp, .geojson, and also .feather! These file types are are used by a variety of different GIS software, but they all contain information which can read as a GeoDataFrame.

5.1.3.7 Other Useful Methods

  • length() Returns length of geometries (useful for LineStrings like roads).

  • intersects(other)
    True if two geometries overlap or touch.

  • contains(other) True if one geometry fully contains another.

  • buffer(distance)
    Creates a new geometry expanded outward (or inward if negative) by the given distance.

  • equals(other) Checks geometric equality

  • is_valid Boolean check: are geometries valid (no self-intersections, etc.)?

5.1.4 Plotting

GeoPandas is capable of plotting/mapping spatial data on both static and interactive figures, using Matplotlib and Folium respectively.

5.1.4.1 Static Maps

Plotting operations are done on the active geometry column with Matplotlib syntax, .plot().

This code plots the polygons and colors them by their total area:

# active geometry column = poly
gdf = gdf.set_geometry("poly")
gdf.plot('area', legend=True)

We can map multiple GeoSeries by using one plot as an axis for another. First, we need to verify that centroid_ll and poly are in the same CRS by setting both to EPSG:4326, which gives Latitude and Longitude. I can use the .crs attribute from one GeoDataFrame as input for .to_crs to ensure a match. This code corrects the CRS and plots both centroids and polygons:

# polygons in lat/lon
ax = gdf.set_geometry("poly").to_crs(gdf["centroid_ll"].crs).plot()
gdf.set_geometry("centroid_ll").plot(ax=ax, color="red")

5.1.5 Interactive Maps

Geopandas uses a Folium/Leaflet backend to make interactive maps very easy, using .explore().

This code gives an interactive map of the same data:

gdf = gdf.set_geometry("poly")
gdf.explore("area", legend=False, zoom_to_layer=True)
Make this Notebook Trusted to load map: File -> Trust Notebook

5.1.6 Worked Out Example: NYC Collision Data

This section demonstrates the start-to-finish workflow of using GeoPandas on a real-world dataset. We will be using the cleaned NYC collision data, courtesy of Wilson Tang. Additionally, for our spatial data, I’ll be using the .shp file from the NYC Modified Zip Code Tabulation Areas (MODZCTA) dataset. ZIP codes can be reassigned and their boundaries changed;the goal of this dataset is to preserve the ZIP code shapes for geospatial analysis.

Goal: overlay crash locations on a map of NYC.

5.1.6.1 Load and Inspect Data

We begin by installing our required dependencies, and loading our datasets.

import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import pyarrow

crash_df = pd.read_feather('data/nyc_crashes_cleaned.feather')

zip_gdf = gpd.read_feather("data/nyc_modzcta_gp.feather")

zip_gdf.head()
modzcta label zcta pop_est geometry
0 10001 10001, 10118 10001, 10119, 10199 23072.0 POLYGON ((-73.98774 40.74407, -73.98819 40.743...
1 10002 10002 10002 74993.0 POLYGON ((-73.9975 40.71407, -73.99709 40.7146...
2 10003 10003 10003 54682.0 POLYGON ((-73.98864 40.72293, -73.98876 40.722...
3 10026 10026 10026 39363.0 MULTIPOLYGON (((-73.96201 40.80551, -73.96007 ...
4 10004 10004 10004 3028.0 MULTIPOLYGON (((-74.00827 40.70772, -74.00937 ...

crash_df is our pandas.DataFrame which contains collision data, and zip_gdf is our GeoPandas.GeoDataFrame which contains ZIP code data.

crash_df = crash_df.dropna(subset=["longitude", "latitude"])
zip_gdf = zip_gdf.drop(columns=["label", "zcta"])

5.1.6.2 Data Overlay

crash_df doesn’t yet have an active geometry column to use for plotting, but does have Latitude and Longitude information. Function gpd.points_from_xy() can take these inputs (which use EPSG:4326) to produce a GeoSeries of geometric shapely.Point objects. The new crash_gdf GeoDataFrame combines these two.

# Create gdf so we can visualize
crash_gdf = gpd.GeoDataFrame(
    crash_df,
    geometry=gpd.points_from_xy(crash_df["longitude"], 
                                crash_df["latitude"]),
    crs="EPSG:4326"
)

# Double-check: ensure geodataframes same CRS
zip_gdf = zip_gdf.to_crs(crash_gdf.crs)

Using Matplotlib syntax, I use add zip_gdf to the .plot(), specifying how I want them to appear. I do the same for crash_gdf.

# Overlay crashes on Borough Polygons
fig, ax = plt.subplots(figsize=(7, 7))
zip_gdf.plot(ax=ax, facecolor="none", edgecolor="black", linewidth=0.8)
crash_gdf.plot(ax=ax, markersize=1, color="red", alpha=0.5)
ax.set_title("NYC Crashes, Labor Day Week 2025")
plt.show()

This overlay uses two GeoSeries, each from a different GeoDataFrame. In this case, it was necessary to keep the two seperate, as they have fundamentally different structures.

5.1.6.3 Spatial Joins

There are advantages of using just one GeoDataFrame. We use spatial join function gpd.sjoin(), which parallels pd.merge. This function combines two dataframes by matching keys and a join condition type.

# Crashes put within zips
# predicate="within" requires all of a geometry's points to be within
# the interior of the spatially joined geometry (and none on the exterior)
joined = gpd.sjoin(crash_gdf, zip_gdf, predicate="within", how="left")

# count number of crashes per ZIP; creates new Series
# "modzcta" same as zip code. it says to group by zip code.
counts = joined.groupby("modzcta").size().rename("n_crashes")

# Attach crash counts back to the polygon GeoDataFrame
zip_counts = zip_gdf.merge(counts, on="modzcta", how="left").fillna({"n_crashes": 0})
zip_counts = zip_counts.set_geometry("geometry").to_crs(4326)

zip_counts[["modzcta", "n_crashes"]].head()
modzcta n_crashes
0 10001 8.0
1 10002 17.0
2 10003 9.0
3 10026 2.0
4 10004 0.0

The new GeoDataFrame zip_counts gives us crash count by ZIP code, which allows us new plotting opportunities.

5.1.6.4 Plotting Joined Data

Choropleth maps provide an easy way to visualize how a variable varies across a geographic area or show the level of variability within a region.

# Plot polygons colored by crash counts
fig, ax = plt.subplots(figsize=(7, 7))
zip_counts.plot(ax=ax, column="n_crashes", legend=True)
ax.set_title("Crashes per NYC ZIP")
plt.show()

Interactive choropleth maps are especially helpful on websites. tooltip specifies which two variables appear when I hover the mouse over a given polygon.

# render and auto-fit to layer
zip_counts.explore(column="n_crashes",
                   legend=True,
                   tooltip=["modzcta","n_crashes"],
                   zoom_to_layer=True)
Make this Notebook Trusted to load map: File -> Trust Notebook

5.1.7 Further Readings

  • GeoPandas: See more examples of GeoPandas uses.
  • Pandas: Not technically a dependency, but complete understanding of Pandas syntax is necessary to be successful with GeoPandas. See Pandas section in the classnotes.
  • Shapely: GeoPandas leverages Shapely for geometric object types and operations. You won’t interface much with Shapely directly, but is helpful to have a basic understanding of.
  • Matplotlib: Static plotting operations use Matplotlib syntax.
  • Folium: Interactive mapping uses a Folium backend. Folium is a very powerful tool for spatial visualization, which warrants its own topic presentation.
  • EPSG: Familiarize yourself with the most common CRS, which are given by unique EPSG codes. The EPSG database currently contrains over 5000 unique entries.