In this chapter we embrace visualization as a cornerstone of modern data-analysis workflows: turning raw numbers into meaningful visuals that support insight, decision-making and communication. We begin by exploring static and information-rich graphics through plotnine, building on the grammar‐of-graphics approach. Next we extend into spatial data-visualisation using GeoPandas, equipping you to map, project and interpret geospatial patterns. Later sections will introduce further tools and techniques (for example interactive maps or dashboards), but throughout we emphasise the same core questions: Which visual form fits the data and question? How do our design and implementation choices influence what the viewer sees — and what they don’t see? With the tools and principles in hand, you’ll be prepared to insert clear, effective visualisation into your data-science project workflow.
5.1 Spatial Data with Geopandas
This section is by Alejandro Haerter, a junior majoring in Statistical Data Science and Economics.
5.1.1 Spatial Data and Python
Spatial data is any information which describes the geographic location and shape of features. We might represent these as:
Points (address, cities)
Lines (Roads, rivers)
Polygons (Property parcels, city boundaries, ZIP codes)
Spatial data is everywhere; we see it on maps, it has legal implications; and it often delineates demographic information. In short, a data scientist is certain to encounter spatial data in their career, and should know the tools to work with it.
Traditional Geographic Information Systems (GIS) tools (e.g., ArcGIS, QGIS) are proprietary, require steep learning curves, and do not implement well with the data science workflow. Luckily, with GeoPandas, we can use Python for spatial data analysis, preserving the data science workflow.
5.1.2 Introduction to GeoPandas
GeoPandas is an open source package which adds support for geographic data to pandas objects, first released in 2014. GeoPandas’ two main data structures are the geopandas.GeoSeries, an extension of the pandas.Series, and the geopandas.GeoDataFrame, an extension of the pandas.DataFrame.
GeoPandas is capable of geometric operations, transformations, and plotting, all relevant tools for for operating with spatial data.
GeoPandas requires Folium and Matplotlib as dependencies, both of which for plotting.
5.1.2.1 GeoSeries
geopandas.GeoSeries a subclass of pandas.Series which can only store geometries. Potential geometries include Points, Lines, Polygons, etc. Not all geometries need be the same type, but must be some form of Shapely geometric object.
The GeoSeries.crs attribute stores the coordinate reference system (CRS) information of the GeoSeries. A CRS relates how map coordinates relate to real locations on Earth by specifying a datum (a model of the Earth’s shape), a coordinate system (e.g., Latitude/Longitude, UTM) and units of measurement (e.g., degrees, meters).
5.1.2.2 GeoDataFrame
A GeoDataFrame is the core data stucture of GeoPandas. It can store one or more geometry columns and perform spatial operations. Essentially, it is a pandas.DataFrame combined with one or more GeoSeries.
A mock GeoDataFrame might look like:
city population geometry
0 NYC 8800000 POINT (-74.0060 40.7128)
1 Boston 675000 POINT (-71.0589 42.3601)
2 Chicago 2700000 POINT (-87.6298 41.8781)
Importantly, while we can have multiple GeoSeries in a GeoDataFrame, only one GeoSeries at a time is the active geometry column. All geometric operations act on this column; it’s accessed by GeoDataFrame.geometry attribute.
5.1.3 Basic Operations
A file containg both data and geometry can be read by geopandas.read_file(). For this example, I use a dataset which contains geometric information for each of NYC’s five boroughs.
import geopandas as gpdfrom geodatasets import get_pathpath_to_data = get_path("nybb") # map of NYC boroughsgdf = gpd.read_file(path_to_data)gdf
BoroCode
BoroName
Shape_Leng
Shape_Area
geometry
0
5
Staten Island
330470.010332
1.623820e+09
MULTIPOLYGON (((970217.022 145643.332, 970227....
1
4
Queens
896344.047763
3.045213e+09
MULTIPOLYGON (((1029606.077 156073.814, 102957...
2
3
Brooklyn
741080.523166
1.937479e+09
MULTIPOLYGON (((1021176.479 151374.797, 102100...
3
1
Manhattan
359299.096471
6.364715e+08
MULTIPOLYGON (((981219.056 188655.316, 980940....
4
2
Bronx
464392.991824
1.186925e+09
MULTIPOLYGON (((1012821.806 229228.265, 101278...
5.1.3.1 Inspecting a GeoDataFrame
GeoPandas syntax is just like pandas. Methods like .head(), .info(), and .shape, .rename, .drop, etc., apply, all work the same. For example, I rename the column geometry to poly, so that I don’t confuse it with the .geometry attribute. (This is an example of good naming practice to avoid conflicts with built-in attributes or methods.)
GeoPandas also has its own functions, methods, and attributes which are specific to it. Recall .geometry, which gives us the active geometry column. This GeoDataFrame is still pointing to a column called "geometry", even though its renamed and doesn’t exist. This can be fixed with .set_geometry.
gdf now has boundary and centroid columns as additional geometry columns, but the active column wont change unless specified.
5.1.3.4 Distance Operation
If I wanted to find the distance between the center of Brooklyn and the center of the Bronx, that’s taking the .distance() between two centroids.
# active geometry set to centroid infogdf = gdf.set_geometry('centroid_ft')#finds distance between indeces givengdf.geometry['Bronx'].distance(gdf.geometry['Brooklyn'])
79011.6278663779
Recall the current EPSG:2263, which is a projection in feet. ~79,000ft \(\approx\) ~15mi.
We have all the pandas functionality available here too, for example, .mean():
Gives us a Point position of the centroid of centroids, i.e., the geographic center of NYC. Although, that doesn’t tell us very much…
5.1.3.5 Changing CRS
Rule of thumb: Operations which rely on distance and area should use a Projected CRS (m, ft, km, etc). Geographic CRS (degrees) is better for position information, like Lat/Lon of a location. We use .to_crs().
# new GeoSeries is just a different projection of existing GeoSeriesgdf['centroid_ll'] = gdf['centroid_ft'].to_crs(4326)gdf['centroid_ll']
BoroName
Staten Island POINT (-74.1534 40.58085)
Queens POINT (-73.81847 40.70757)
Brooklyn POINT (-73.94768 40.64472)
Manhattan POINT (-73.96719 40.77725)
Bronx POINT (-73.86653 40.85262)
Name: centroid_ll, dtype: geometry
centroid_ll is centroid reprojected EPSG:4326, now giving coordinates. E.g., the center of the Bronx is at coordinates 40.85°N, 73.87°W.
When I want to save my GeoDataFrame to the computer, we use GeoDataFrame.to_file. GeoPandas will infer by the file format by the file extension.
# This won't run because geoJSON doesn't support multiple GeoSeries!gdf.to_file("nyc_boroughs.geojson")
I recommend using feather.
# Use Feather!import pyarrowgdf.to_feather("nyc_boroughs.feather")
Just as pandas can handle file types .csv, .feather, .html, etc., GeoPandas data can also be stored in multipe file types, like .shp, .geojson, and also .feather! These file types are are used by a variety of different GIS software, but they all contain information which can read as a GeoDataFrame.
5.1.3.7 Other Useful Methods
length() Returns length of geometries (useful for LineStrings like roads).
intersects(other)
True if two geometries overlap or touch.
contains(other) True if one geometry fully contains another.
buffer(distance)
Creates a new geometry expanded outward (or inward if negative) by the given distance.
equals(other) Checks geometric equality
is_valid Boolean check: are geometries valid (no self-intersections, etc.)?
5.1.4 Plotting
GeoPandas is capable of plotting/mapping spatial data on both static and interactive figures, using Matplotlib and Folium respectively.
5.1.4.1 Static Maps
Plotting operations are done on the active geometry column with Matplotlib syntax, .plot().
This code plots the polygons and colors them by their total area:
# active geometry column = polygdf = gdf.set_geometry("poly")gdf.plot('area', legend=True)
We can map multiple GeoSeries by using one plot as an axis for another. First, we need to verify that centroid_ll and poly are in the same CRS by setting both to EPSG:4326, which gives Latitude and Longitude. I can use the .crs attribute from one GeoDataFrame as input for .to_crs to ensure a match. This code corrects the CRS and plots both centroids and polygons:
# polygons in lat/lonax = gdf.set_geometry("poly").to_crs(gdf["centroid_ll"].crs).plot()gdf.set_geometry("centroid_ll").plot(ax=ax, color="red")
5.1.5 Interactive Maps
Geopandas uses a Folium/Leaflet backend to make interactive maps very easy, using .explore().
This code gives an interactive map of the same data:
Make this Notebook Trusted to load map: File -> Trust Notebook
5.1.6 Worked Out Example: NYC Collision Data
This section demonstrates the start-to-finish workflow of using GeoPandas on a real-world dataset. We will be using the cleaned NYC collision data, courtesy of Wilson Tang. Additionally, for our spatial data, I’ll be using the .shp file from the NYC Modified Zip Code Tabulation Areas (MODZCTA) dataset. ZIP codes can be reassigned and their boundaries changed;the goal of this dataset is to preserve the ZIP code shapes for geospatial analysis.
Goal: overlay crash locations on a map of NYC.
5.1.6.1 Load and Inspect Data
We begin by installing our required dependencies, and loading our datasets.
import numpy as npimport pandas as pdimport geopandas as gpdimport matplotlib.pyplot as pltimport pyarrowcrash_df = pd.read_feather('data/nyc_crashes_cleaned.feather')zip_gdf = gpd.read_feather("data/nyc_modzcta_gp.feather")zip_gdf.head()
modzcta
label
zcta
pop_est
geometry
0
10001
10001, 10118
10001, 10119, 10199
23072.0
POLYGON ((-73.98774 40.74407, -73.98819 40.743...
1
10002
10002
10002
74993.0
POLYGON ((-73.9975 40.71407, -73.99709 40.7146...
2
10003
10003
10003
54682.0
POLYGON ((-73.98864 40.72293, -73.98876 40.722...
3
10026
10026
10026
39363.0
MULTIPOLYGON (((-73.96201 40.80551, -73.96007 ...
4
10004
10004
10004
3028.0
MULTIPOLYGON (((-74.00827 40.70772, -74.00937 ...
crash_df is our pandas.DataFrame which contains collision data, and zip_gdf is our GeoPandas.GeoDataFrame which contains ZIP code data.
crash_df doesn’t yet have an active geometry column to use for plotting, but does have Latitude and Longitude information. Function gpd.points_from_xy() can take these inputs (which use EPSG:4326) to produce a GeoSeries of geometric shapely.Point objects. The new crash_gdfGeoDataFrame combines these two.
# Create gdf so we can visualizecrash_gdf = gpd.GeoDataFrame( crash_df, geometry=gpd.points_from_xy(crash_df["longitude"], crash_df["latitude"]), crs="EPSG:4326")# Double-check: ensure geodataframes same CRSzip_gdf = zip_gdf.to_crs(crash_gdf.crs)
Using Matplotlib syntax, I use add zip_gdf to the .plot(), specifying how I want them to appear. I do the same for crash_gdf.
This overlay uses two GeoSeries, each from a different GeoDataFrame. In this case, it was necessary to keep the two seperate, as they have fundamentally different structures.
5.1.6.3 Spatial Joins
There are advantages of using just one GeoDataFrame. We use spatial join function gpd.sjoin(), which parallels pd.merge. This function combines two dataframes by matching keys and a join condition type.
# Crashes put within zips# predicate="within" requires all of a geometry's points to be within# the interior of the spatially joined geometry (and none on the exterior)joined = gpd.sjoin(crash_gdf, zip_gdf, predicate="within", how="left")# count number of crashes per ZIP; creates new Series# "modzcta" same as zip code. it says to group by zip code.counts = joined.groupby("modzcta").size().rename("n_crashes")# Attach crash counts back to the polygon GeoDataFramezip_counts = zip_gdf.merge(counts, on="modzcta", how="left").fillna({"n_crashes": 0})zip_counts = zip_counts.set_geometry("geometry").to_crs(4326)zip_counts[["modzcta", "n_crashes"]].head()
modzcta
n_crashes
0
10001
8.0
1
10002
17.0
2
10003
9.0
3
10026
2.0
4
10004
0.0
The new GeoDataFramezip_counts gives us crash count by ZIP code, which allows us new plotting opportunities.
5.1.6.4 Plotting Joined Data
Choropleth maps provide an easy way to visualize how a variable varies across a geographic area or show the level of variability within a region.
# Plot polygons colored by crash countsfig, ax = plt.subplots(figsize=(7, 7))zip_counts.plot(ax=ax, column="n_crashes", legend=True)ax.set_title("Crashes per NYC ZIP")plt.show()
Interactive choropleth maps are especially helpful on websites. tooltip specifies which two variables appear when I hover the mouse over a given polygon.
# render and auto-fit to layerzip_counts.explore(column="n_crashes", legend=True, tooltip=["modzcta","n_crashes"], zoom_to_layer=True)
Make this Notebook Trusted to load map: File -> Trust Notebook
Pandas: Not technically a dependency, but complete understanding of Pandas syntax is necessary to be successful with GeoPandas. See Pandas section in the classnotes.
Shapely: GeoPandas leverages Shapely for geometric object types and operations. You won’t interface much with Shapely directly, but is helpful to have a basic understanding of.
Matplotlib: Static plotting operations use Matplotlib syntax.
Folium: Interactive mapping uses a Folium backend. Folium is a very powerful tool for spatial visualization, which warrants its own topic presentation.
EPSG: Familiarize yourself with the most common CRS, which are given by unique EPSG codes. The EPSG database currently contrains over 5000 unique entries.