8.2. Plotnine

plotnine is a Python implementation of a grammar of graphics, based on ggplot2. The grammar lets users compose plots by explicitly mapping data to the visual objects that make up the plot. Plotting with a grammar is powerful: custom (and otherwise complex) plots become easy to think about and then create, while simple plots remain simple.

8.2.1. Install Plotnine

To install plotnine with pip, use the command:

pip install plotnine

To install plotnine with conda, use the command:

conda install -c conda-forge plotnine

8.2.2. Import Plotnine

First, we import pandas, numpy, and plotnine. We then read the data and inspect its columns.

import pandas as pd
import numpy as np
from plotnine import *

url = 'https://raw.githubusercontent.com/statds/ids-s22/main/notes/data/nyc_mv_collisions_202201.csv'
nyc = pd.read_csv(url)
nyc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7659 entries, 0 to 7658
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   CRASH DATE                     7659 non-null   object 
 1   CRASH TIME                     7659 non-null   object 
 2   BOROUGH                        5025 non-null   object 
 3   ZIP CODE                       5025 non-null   float64
 4   LATITUDE                       7097 non-null   float64
 5   LONGITUDE                      7097 non-null   float64
 6   LOCATION                       7097 non-null   object 
 7   ON STREET NAME                 5625 non-null   object 
 8   CROSS STREET NAME              3620 non-null   object 
 9   OFF STREET NAME                2034 non-null   object 
 10  NUMBER OF PERSONS INJURED      7659 non-null   int64  
 11  NUMBER OF PERSONS KILLED       7659 non-null   int64  
 12  NUMBER OF PEDESTRIANS INJURED  7659 non-null   int64  
 13  NUMBER OF PEDESTRIANS KILLED   7659 non-null   int64  
 14  NUMBER OF CYCLIST INJURED      7659 non-null   int64  
 15  NUMBER OF CYCLIST KILLED       7659 non-null   int64  
 16  NUMBER OF MOTORIST INJURED     7659 non-null   int64  
 17  NUMBER OF MOTORIST KILLED      7659 non-null   int64  
 18  CONTRIBUTING FACTOR VEHICLE 1  7615 non-null   object 
 19  CONTRIBUTING FACTOR VEHICLE 2  5624 non-null   object 
 20  CONTRIBUTING FACTOR VEHICLE 3  824 non-null    object 
 21  CONTRIBUTING FACTOR VEHICLE 4  225 non-null    object 
 22  CONTRIBUTING FACTOR VEHICLE 5  80 non-null     object 
 23  COLLISION_ID                   7659 non-null   int64  
 24  VEHICLE TYPE CODE 1            7539 non-null   object 
 25  VEHICLE TYPE CODE 2            4748 non-null   object 
 26  VEHICLE TYPE CODE 3            752 non-null    object 
 27  VEHICLE TYPE CODE 4            207 non-null    object 
 28  VEHICLE TYPE CODE 5            78 non-null     object 
dtypes: float64(3), int64(9), object(17)
memory usage: 1.7+ MB

It is easy to add elements to a plot with plotnine. Each component (geometry, title, facet, theme, and so on) is a function call, and the calls are joined one by one with a '+' sign.

8.2.3. Histograms

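With plotnine, we can create histograms with geom_histogram. The y aesthetic controls how the count is displayed. By default it is the raw count, but it can be set to ncount (the count scaled to a maximum of 1), density, or the proportion (width*density). Note that percent_format is not a statistic; it is a label formatter (from mizani) that can be applied to the y scale to print proportions as percentages. We should set either binwidth or the number of bins; if both are given, binwidth takes precedence.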

# Extract the hour (the part before ':') from the crash time as an integer
nyc['hour'] = nyc['CRASH TIME'].str.split(':').str[0].astype(int)
(
    ggplot(nyc, aes(x = 'hour', y = after_stat('count')))
    # binwidth alone is enough; it would override bins if both were given
    + geom_histogram(binwidth = 1)
    + ggtitle("Hourly Crash Count")
)
../_images/plotnine_3_0.png

We can easily overlay multiple layers in one plot. Here we add a second histogram of the same data with wider bins. We can make the histograms transparent with alpha, and we can set the fill color of the bars with fill.

(
    ggplot(nyc, aes(x = 'hour', y = after_stat('count')))
    + geom_histogram(binwidth = 2, alpha = 0.5, fill = 'green')
    + geom_histogram(binwidth = 1, alpha = 0.5)
    + ggtitle("Hourly and Bihourly Crash Count")
)
../_images/plotnine_5_0.png

We can also break the counts down by another variable. For the NYC example, we map fill to BOROUGH so that each bar is colored by borough.

(
    ggplot(nyc, aes(x = 'hour', y = after_stat('count'), fill = 'BOROUGH'))
    + geom_histogram(binwidth = 4)
    + ggtitle("Hourly Crash Count for each Borough")
)
../_images/plotnine_7_0.png

Alternatively, we can split one plot into multiple plots using facet_wrap.

(
    ggplot(nyc, aes(x = 'hour', y = after_stat('count')))
    + geom_histogram(binwidth = 4)
    + facet_wrap("BOROUGH")
    + ggtitle("Hourly Crash Count for each Borough")
)
../_images/plotnine_9_1.png

8.2.4. Boxplot

We can make boxplots with geom_boxplot. In this example, we rotate the x-axis labels by 45 degrees so that they do not overlap.

theme_update(axis_text_x = element_text(angle = 45))
(
    ggplot(nyc, aes(x = 'BOROUGH', y = 'NUMBER OF PERSONS INJURED'))
    + geom_boxplot()
    + ggtitle("Boxplot of # of Persons Injured for each Borough")
)
../_images/plotnine_11_0.png
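As a rough numeric companion to the boxplot, the quartiles that a box summarizes can be computed per group with pandas. This is a sketch on made-up data: the `toy` frame and its `injured` column are hypothetical, and pandas' default quantile interpolation may differ slightly from the method plotnine uses when drawing the box.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "BOROUGH": ["A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "injured": [0, 0, 1, 2, 5, 0, 1, 1, 3],
})

# First quartile, median, and third quartile for each group:
# the lower edge, middle line, and upper edge of each box.
summary = (
    toy.groupby("BOROUGH")["injured"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# Interquartile range, which determines the height of each box.
iqr = summary[0.75] - summary[0.25]
```

A table like `summary` is a quick way to check whether a flat-looking box really reflects quartiles that coincide.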

8.2.5. Violin Plot

Violin plots are similar to boxplots, but they also display the estimated density of the numeric variable.

# Switch to the xkcd theme and widen the figure
theme_set(theme_xkcd())
theme_update(figure_size = (10, 4))
(
    ggplot(nyc, aes(x = 'BOROUGH', y = 'NUMBER OF PERSONS INJURED'))
    # The layer inherits the data and mapping from ggplot()
    + geom_violin()
    + geom_point()
    + ggtitle("Violin Plot of # of Persons Injured for each Borough")
)
../_images/plotnine_13_3.png

8.2.6. Time Series

We can also make time series plots with plotnine. In our example, we group the data by day and borough, count the crashes in each group, and then reset the index so that day and borough become regular columns that can be mapped to aesthetics.

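theme_set(theme_classic())
# Extract the day of the month (the second '/'-separated field of the date)
nyc['day'] = nyc['CRASH DATE'].str.split('/').str[1].astype(int)
# Count crashes per day and borough; rows with a missing borough are dropped
daily_counts = nyc.groupby(['day', 'BOROUGH'])['BOROUGH'].count()
# Turn the grouped index back into columns and name the count column
daily_counts = daily_counts.reset_index(name = 'counts')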

(
    ggplot(daily_counts, aes(x = 'day', y = 'counts', color = "BOROUGH"))
    + geom_line()
    + ggtitle("Daily Crash Count for Each Borough")
)
../_images/plotnine_15_0.png
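The group-count-reset pattern used for the time series can be seen on a small example. The `toy` frame below is made up for illustration, and the sketch uses `size()` rather than the `count()` call in the text; for this purpose the two give the same counts, since groups with a missing key are dropped either way.

```python
import pandas as pd

# Toy data, made up for illustration only; one row has no borough.
toy = pd.DataFrame({
    "day": [1, 1, 1, 2, 2],
    "BOROUGH": ["BRONX", "BRONX", "QUEENS", "QUEENS", None],
})

# Group by both keys and count rows; the result is a Series whose
# index holds (day, BOROUGH) pairs.
grouped = toy.groupby(["day", "BOROUGH"]).size()

# reset_index moves the keys back into ordinary columns and names
# the counts, giving one row per (day, BOROUGH) pair.
counts = grouped.reset_index(name="counts")
```

The long format of `counts`, with one row per day and borough, is what lets `color = "BOROUGH"` draw a separate line for each borough.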

8.2.7. Scatter Plot

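With geom_point, we can make a simple scatter plot of our data. We set LONGITUDE as x and LATITUDE as y, and we color each point by BOROUGH. We also set alpha so that overlapping points reveal the density of crashes. Some records have coordinates of 0, which are placeholders rather than real locations, so we treat them as missing.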

theme_update(figure_size = (8, 8))
# Treat zero coordinates as missing so that those points are not plotted
nyc.loc[nyc['LONGITUDE'] == 0, ['LONGITUDE', 'LATITUDE']] = np.nan
nyc_plot = ggplot(nyc, aes(x = 'LONGITUDE', y = 'LATITUDE', color = 'BOROUGH'))
nyc_plot + geom_point(alpha = 0.1) + ggtitle("Coordinates of Car Crashes")
/home/runner/work/ids-s22/ids-s22/env/lib/python3.9/site-packages/plotnine/layer.py:401: PlotnineWarning: geom_point : Removed 582 rows containing missing values.
../_images/plotnine_17_1.png