8.2. Plotnine
plotnine is a Python implementation of a grammar of graphics, based on ggplot2. The grammar lets users compose plots by explicitly mapping data to the visual objects that make up the plot. Plotting with a grammar is powerful: custom (and otherwise complex) plots become easy to think about and then create, while simple plots remain simple.
8.2.1. Install Plotnine
To install plotnine with pip, use the command:
pip install plotnine
To install plotnine with conda, use the command:
conda install -c conda-forge plotnine
8.2.2. Import Plotnine
First, we import pandas, numpy, and plotnine with the following code, and then read the data.
import pandas as pd
import numpy as np
from plotnine import *
url = 'https://raw.githubusercontent.com/statds/ids-s22/main/notes/data/nyc_mv_collisions_202201.csv'
nyc = pd.read_csv(url)
nyc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7659 entries, 0 to 7658
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   CRASH DATE                     7659 non-null   object
 1   CRASH TIME                     7659 non-null   object
 2   BOROUGH                        5025 non-null   object
 3   ZIP CODE                       5025 non-null   float64
 4   LATITUDE                       7097 non-null   float64
 5   LONGITUDE                      7097 non-null   float64
 6   LOCATION                       7097 non-null   object
 7   ON STREET NAME                 5625 non-null   object
 8   CROSS STREET NAME              3620 non-null   object
 9   OFF STREET NAME                2034 non-null   object
 10  NUMBER OF PERSONS INJURED      7659 non-null   int64
 11  NUMBER OF PERSONS KILLED       7659 non-null   int64
 12  NUMBER OF PEDESTRIANS INJURED  7659 non-null   int64
 13  NUMBER OF PEDESTRIANS KILLED   7659 non-null   int64
 14  NUMBER OF CYCLIST INJURED      7659 non-null   int64
 15  NUMBER OF CYCLIST KILLED       7659 non-null   int64
 16  NUMBER OF MOTORIST INJURED     7659 non-null   int64
 17  NUMBER OF MOTORIST KILLED      7659 non-null   int64
 18  CONTRIBUTING FACTOR VEHICLE 1  7615 non-null   object
 19  CONTRIBUTING FACTOR VEHICLE 2  5624 non-null   object
 20  CONTRIBUTING FACTOR VEHICLE 3  824 non-null    object
 21  CONTRIBUTING FACTOR VEHICLE 4  225 non-null    object
 22  CONTRIBUTING FACTOR VEHICLE 5  80 non-null     object
 23  COLLISION_ID                   7659 non-null   int64
 24  VEHICLE TYPE CODE 1            7539 non-null   object
 25  VEHICLE TYPE CODE 2            4748 non-null   object
 26  VEHICLE TYPE CODE 3            752 non-null    object
 27  VEHICLE TYPE CODE 4            207 non-null    object
 28  VEHICLE TYPE CODE 5            78 non-null     object
dtypes: float64(3), int64(9), object(17)
memory usage: 1.7+ MB
plotnine builds a plot by adding components together: each layer, title, or theme is a function call joined to the plot with the '+' operator.
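8.2.3. Histograms
With plotnine, we can create histograms and choose which computed statistic is drawn on the y axis. The default is the raw count; alternatives include ncount (the count scaled to a maximum of 1), density, and the proportion width*density. Note that percent_format is a label formatter that can display such proportions as percentages, not a statistic. We should set either binwidth or the number of bins; if both are given, binwidth is used.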
# Extract the hour as an integer: the text before the ':' in 'CRASH TIME'
nyc['hour'] = nyc['CRASH TIME'].str.split(':').str[0].astype(int)
(
ggplot(nyc, aes(x = 'hour', y = after_stat('count')))
# binwidth alone defines the bins: one bin per hour
+ geom_histogram(binwidth = 1)
+ ggtitle("Hourly Crash Count")
)
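To make the alternative y values concrete, the sketch below computes them by hand for a handful of made-up hours. It illustrates the arithmetic only and is not plotnine's internal code; in particular, the left-closed bin alignment used here is an assumption, and plotnine may place its bin edges differently.

```python
from collections import Counter

# Made-up crash hours, used only to illustrate the arithmetic
hours = [0, 1, 2, 3, 3, 3, 5, 8, 8, 9]
binwidth = 2

# Assumed bin alignment: left-closed bins [0, 2), [2, 4), ...
counts = Counter((h // binwidth) * binwidth for h in hours)
n = len(hours)
peak = max(counts.values())

stats = {}
for left in sorted(counts):
    count = counts[left]
    density = count / (n * binwidth)        # bar areas sum to 1
    stats[left] = {
        "count": count,                     # raw number of values in the bin
        "ncount": count / peak,             # tallest bar is scaled to 1
        "density": density,
        "proportion": binwidth * density,   # bar heights sum to 1
    }

for left, row in stats.items():
    print(left, row)
```

The proportion is simply the count divided by the total number of observations, which is why it is a natural quantity to label as a percentage.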
We can easily layer several geoms in one plot. Here we add a second histogram of the same data with wider bins. The alpha argument makes the histograms transparent, and fill sets their color.
(
ggplot(nyc, aes(x = 'hour', y = after_stat('count')))
+ geom_histogram(binwidth = 2, alpha = 0.5, fill = 'green')
+ geom_histogram(binwidth = 1, alpha = 0.5)
+ ggtitle("Hourly and Bihourly Crash Count")
)
We can also break the counts down by another variable. For the NYC example, we map fill to BOROUGH so that each bar is split by borough.
(
ggplot(nyc, aes(x = 'hour', y = after_stat('count'), fill = 'BOROUGH'))
+ geom_histogram(binwidth = 4)
+ ggtitle("Hourly Crash Count for each Borough")
)
Alternatively, we can split one plot into multiple plots using facet_wrap.
(
ggplot(nyc, aes(x = 'hour', y = after_stat('count')))
+ geom_histogram(binwidth = 4)
+ facet_wrap("BOROUGH")
+ ggtitle("Hourly Crash Count for each Borough")
)
8.2.4. Boxplot
Boxplots are just as easy to make. In this example, we rotate the x-axis labels so that they do not overlap. Note that theme_update changes the default theme, so the rotation also applies to later plots.
theme_update(axis_text_x = element_text(angle = 45))
(
ggplot(nyc, aes(x = 'BOROUGH', y = 'NUMBER OF PERSONS INJURED'))
+ geom_boxplot()
+ ggtitle("Boxplot of # of Persons Injured for each Borough")
)
8.2.5. Violin Plot
Violin plots are similar to boxplots, but they also show the estimated density of the numeric data.
theme_set(theme_xkcd)
theme_update(figure_size = (10, 4))
(
ggplot(nyc, aes(x = 'BOROUGH', y = 'NUMBER OF PERSONS INJURED'))
+ geom_violin()
+ geom_point()
+ ggtitle("Violin Plot of # of Persons Injured for each Borough")
)
8.2.6. Time Series
We can also plot time series with plotnine. In our example, we count the crashes by day and borough, and then reset the index so that both become ordinary columns.
theme_set(theme_classic)
nyc['day'] = [x.split('/')[1] for x in nyc['CRASH DATE']]
nyc['day'] = [int(x) for x in nyc['day']]
daily_counts = nyc.groupby(['day', 'BOROUGH'])['BOROUGH'].count()
daily_counts = daily_counts.reset_index(name = 'counts')
(
ggplot(daily_counts, aes(x = 'day', y = 'counts', color = "BOROUGH"))
+ geom_line()
+ ggtitle("Daily Crash Count for Each Borough")
)
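The group-and-reset step is easier to follow on a small frame. The sketch below applies the same two pandas calls to five made-up rows (the toy data is hypothetical) and shows the long-format result that geom_line expects: one row per day and borough.

```python
import pandas as pd

# Hypothetical toy data; the last row has no borough recorded
toy = pd.DataFrame({
    "day": [1, 1, 1, 2, 2],
    "BOROUGH": ["BRONX", "BRONX", "QUEENS", "BRONX", None],
})

# Step 1: a Series of counts indexed by (day, BOROUGH)
grouped = toy.groupby(["day", "BOROUGH"])["BOROUGH"].count()

# Step 2: move both index levels into columns and name the counts
daily = grouped.reset_index(name="counts")
print(daily)
```

Rows with a missing borough drop out of the grouping, so crashes without a recorded borough do not appear in any line of the plot.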
8.2.7. Scatter Plot
With geom_point, we can make a simple scatter plot of our data. We map LONGITUDE to x and LATITUDE to y, and color each point by BOROUGH. Setting alpha makes the points transparent so that dense areas stand out. Rows with a longitude of 0 are treated as missing before plotting.
theme_update(figure_size = (8, 8))
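# A longitude of 0 cannot be in NYC: treat it as missing, but blank out only the coordinates
nyc.loc[nyc['LONGITUDE'] == 0, ['LONGITUDE', 'LATITUDE']] = np.nan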
nyc_plot = ggplot(nyc, aes(x = 'LONGITUDE', y = 'LATITUDE', color = 'BOROUGH'))
nyc_plot + geom_point(alpha = 0.1) + ggtitle("Coordinates of Car Crashes")
/home/runner/work/ids-s22/ids-s22/env/lib/python3.9/site-packages/plotnine/layer.py:401: PlotnineWarning: geom_point : Removed 582 rows containing missing values.
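When marking impossible coordinates as missing, it matters whether the assignment targets whole rows or only the coordinate columns. The sketch below contrasts the two on a made-up three-row frame (the toy data is hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the second row has placeholder coordinates of 0
toy = pd.DataFrame({
    "LONGITUDE": [-73.9, 0.0, -73.8],
    "LATITUDE": [40.7, 0.0, 40.8],
    "BOROUGH": ["BRONX", "QUEENS", "BROOKLYN"],
})

# Row mask: every column of the matching row becomes NaN, including BOROUGH
whole_rows = toy.copy()
whole_rows[whole_rows["LONGITUDE"] == 0] = np.nan

# .loc with a column list: only the coordinates become NaN
coords_only = toy.copy()
coords_only.loc[coords_only["LONGITUDE"] == 0, ["LONGITUDE", "LATITUDE"]] = np.nan

print(whole_rows)
print(coords_only)
```

Either way the row is left out of the scatter plot, but the narrower assignment keeps the remaining columns usable for other summaries.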