6  Geospatial Data

6.1 Handling Spatial Data with GeoPandas (by Kaitlyn Bedard)

GeoPandas is a python library created as an extension of pandas to offer support for geographic data. Like pandas, GeoPandas has a series type and a dataframe type: GeoSeries and GeoDataFrame. It allows users to do work that would otherwise need a GIS database. Note that since GeoPandas is an extension of Pandas, it inherits all its attributes and methods. Please review the pandas presentations for information on these tools, if needed.

6.1.1 Installation

You can install GeoPandas using the below commands in terminal. The documentation recommends the first method.

conda install -c conda-forge geopandas

conda install geopandas

pip install geopandas

6.1.2 Basic Concepts

The GeoPandas GeoDataFrame is essentially a pandas dataframe that supports typical data, however, it also supports geometries. Though the dataframe can have multiple geometry columns, there is one “active” column on which all operations are applied to.

The types of geometries are:

  • Points: coordinates
  • Lines: set of two coordinates
  • Polygons: list of coordinate tuples, first and last must be the same (closed shape)

These geometries are often represented by shapely.geometry objects. Note, we can also have multi-points, multi-lines, and multi-polygons. Below are examples of creating these geometries using shapely. Each GeoSeries has a specified CRS (Coordinate Reference System) that stores information about the data.

from shapely.geometry import LineString, Point, Polygon
import geopandas as gpd

# point example
point = Point(0.5, 0.5)
gdf1 = gpd.GeoDataFrame(geometry=[point])

# line example
line = LineString([(0, 0), (1, 1)])
gdf2 = gpd.GeoDataFrame(geometry=[line])

# polygon example
polygon = Polygon([(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)])
gdf3 = gpd.GeoDataFrame(geometry=[polygon])

The following are some examples of basic attributes of a GeoSeries:

  • length: returns the length of a line
gdf2.length
0    1.414214
dtype: float64
  • area: returns the area of the shape
gdf3.area
0    1.0
dtype: float64
  • bounds: gives the bounds of each row in a geometry column

  • total_bounds: gives the total bounds of a geometry series

  • geom_type: gives the geometry type

gdf3.geom_type
0    Polygon
dtype: object
  • is_valid: returns True for valid geometries and False otherwise

Below are some examples of basic methods that can be applied to a GeoSeries:

  • distance(): returns the (minimum) distance of each row of a geometry to a specified paramater
    • parameter other: can be a single geometry, or an entire geometry series
    • parameter align: True if you want to align GeoSeries by index, false otherwise
gdf2.distance(Point((1,0)))
0    0.707107
dtype: float64
  • centroid: returns a new GeoSeries with the centers of each row in the geometry
gdf3.centroid
0    POINT (0.50000 0.50000)
dtype: geometry

Below are examples of some relationship tests that can be applied to a GeoSeries:

  • contains(): returns true if shape contains a specified other
    • parameter other: can be a single geometry, or an entire geometry series
    • parameter align: True if you want to align GeoSeries by index, false otherwise
gdf3.contains(gdf1)
0    True
dtype: bool
  • intersects(): returns true if shape intersects a specified other
    • parameter other: can be a single geometry, or an entire geometry series
    • parameter align: True if you want to align GeoSeries by index, false otherwise
gdf2.intersects(gdf3)
0    True
dtype: bool

6.1.3 Reading Files

If you have a file that contains data and geometry information, you can read it directly with geopandas using the geopandas.read_file() command. Examples of these files are GeoPackage, GeoJSON, Shapefile. However, we can convert other types of files to a GeoDataFrame. For example, we can transform the NYC crash data. The below code creates a point geometry. The points are the coordinates of the crashes.

# Reading csv file 
import pandas as pd 
import numpy as np
# Shapely for converting latitude/longtitude to geometry
from shapely.geometry import Point 
# To create GeodataFrame
import geopandas as gpd 

jan23 = pd.read_csv('data/nyc_crashes_202301_cleaned.csv')

# creating geometry using shapely (removing empty points)
geometry = [Point(xy) for xy in zip(jan23["LONGITUDE"], \
            jan23["LATITUDE"]) if not Point(xy).is_empty]

# creating geometry column to be used by geopandas
geometry2 = gpd.points_from_xy(jan23["LONGITUDE"], jan23["LATITUDE"])

# coordinate reference system (epsg:4326 implies geographic coordinates)
crs = {'init': 'epsg:4326'}

# create Geographic data frame (removing rows with missing coordinates)
jan23_gdf = gpd.GeoDataFrame(jan23.loc[~pd.isna(\
            jan23["LATITUDE"]) & ~pd.isna(\
            jan23["LONGITUDE"])],crs=crs, geometry=geometry)

jan23_gdf.head()
/usr/local/lib/python3.11/site-packages/pyproj/crs/crs.py:141: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
  in_crs_string = _prepare_from_proj_string(in_crs_string)
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... Unnamed: 31 Unnamed: 32 Unnamed: 33 Unnamed: 34 Unnamed: 35 Unnamed: 36 Unnamed: 37 Unnamed: 38 Unnamed: 39 geometry
0 1/1/23 14:38 BROOKLYN 11211.0 40.719094 -73.946108 (40.7190938,-73.9461082) BROOKLYN QUEENS EXPRESSWAY RAMP NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN POINT (-73.94611 40.71909)
1 1/1/23 8:04 QUEENS 11430.0 40.659508 -73.773687 (40.6595077,-73.7736867) NASSAU EXPRESSWAY NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN POINT (-73.77369 40.65951)
2 1/1/23 18:05 MANHATTAN 10011.0 40.742454 -74.008686 (40.7424543,-74.008686) 10 AVENUE 11 AVENUE NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN POINT (-74.00869 40.74245)
3 1/1/23 23:45 QUEENS 11103.0 40.769737 -73.912440 (40.769737, -73.91244) ASTORIA BOULEVARD 37 STREET NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN POINT (-73.91244 40.76974)
4 1/1/23 4:50 BRONX 10462.0 40.830555 -73.850720 (40.830555, -73.85072) CASTLE HILL AVENUE EAST 177 STREET NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN POINT (-73.85072 40.83055)

5 rows × 41 columns

6.1.4 Plotting

We can easily plot our data now that has been transformed to a geometric data frame.

# Basic Plot
jan23_gdf.plot()
# Color the plot by borough
jan23_gdf.plot(column = 'BOROUGH',legend=True)

# Color the plot by number persons injuried
jan23_gdf.plot(column = 'NUMBER OF PERSONS INJURED',legend=True, \
               cmap= "OrRd")

# Plotting missing information 
jan23_gdf.plot(column='BOROUGH', missing_kwds={'color': 'lightgrey'})
<AxesSubplot: >

6.1.5 Interactive Maps

We can also easily create an interactive plot, using the .explore() method.

# interactive map of just the latitude and longitude points
jan23_gdf.explore()
Make this Notebook Trusted to load map: File -> Trust Notebook