7 Visualization
7.1 Data Visualization with Plotnine
This section was written by Julia Mazzola
7.1.1 Introduction
Hi! My name is Julia, and I am a senior double majoring in Statistical Data Science and Economics. I’m excited to show you the power of data visualization with Plotnine, a Python library inspired by R’s ggplot2. Visualization is a crucial tool for communicating your findings to your audience, and Plotnine is a useful library for doing so.
7.1.2 What is Plotnine?
Plotnine uses the grammar of graphics to create layered, customizable visualizations. The grammar of graphics is a framework that provides a systematic approach to creating visual representations of data by breaking a plot down into its fundamental components. Just as a sentence is assembled from grammatical parts, a graphic is assembled by layering these components to create complex and detailed visualizations.
Components of the layered grammar of graphics:
- Layer: used to create the objects on a plot
- Data: defines the source of the information to be visualized
- Mapping: defines how the variables are represented in the plot
- Statistical transformation (stat): transforms the data, generally by summarizing the information
- Geometric object (geom): determines the type of plot (e.g., points, lines, bars)
- Position adjustment (position): adjusts the display of overlapping points to improve clarity
- Scale: controls how values are mapped to aesthetic attributes (e.g., color, size)
- Coordinate system (coord): maps the position of objects onto the plane of the plot, and controls how the axes and grid lines are drawn
- Faceting (facet): used to split the data up into subsets of the entire dataset
You can make a wide array of different graphics with Plotnine. Some common examples are:
- Scatterplot: geom_point()
- Bar Chart: geom_bar()
- Histogram: geom_histogram()
- Line Chart: geom_line()
7.1.3 Installing Plotnine
To use Plotnine, you must first install it into your virtual environment (venv). Run one of the following commands in your terminal, Git Bash, conda prompt, or whichever shell you use to install packages for that environment.
For pip:
pip install plotnine
For conda:
conda install -c conda-forge plotnine
You can import Plotnine without a prefix:
from plotnine import *
Or with a prefix to access each component, such as:
import plotnine as p9
The prefixed import is generally recommended for larger projects or when collaborating with others, because it keeps the code easier to maintain. For simplicity, this section uses the first method.
For the examples we will be using NYC open data to visualize motor vehicle crashes from the week of June 30, 2024.
import pandas as pd
nyc_crash = pd.read_feather('data/nyccrashes_cleaned.feather').dropna(subset=['borough'])
7.1.4 Scatterplot
First, we will create a scatterplot. This can be done with geom_point(). Our scatterplot will display crash locations based on the longitude and latitude of the crash sites.
Creating a Basic Scatterplot
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
(ggplot(nyc_crash, aes(x='longitude', y='latitude')) +
    # Specifies graph type
    geom_point() +
    # Creates labels for graphic
    labs(title='Crash Locations',
         x='Longitude',
         y='Latitude') +
    # Because we are plotting maps we want 1:1 ratio
    # coord_fixed(): changes the ratio of the x and y axis
    coord_fixed(ratio = 1))
Customizing a Scatterplot
You can customize your plot further by changing the color, edge color, transparency, size, or shape of your points. This is done in geom_point().
(ggplot(nyc_crash, aes(x='longitude', y='latitude')) +
    # Changes what our points look like
    # color= changes the outline color
    # fill= changes the fill color
    # alpha= changes transparency
    # size= changes size
    # shape= changes shape (s = square)
    geom_point(color = 'black', fill = 'purple',
               alpha = 0.5, size = 2, shape = 's') +
    labs(title='Crash Locations',
         x='Longitude',
         y='Latitude') +
    coord_fixed(ratio = 1))
This scatterplot provides a lot of information, yet there are ways we can customize our plot to be more informative for our audience. We can create a scatterplot that differentiates by contributing factor.
Changing Shape by Variables
Changing shape of points by contributing_factor_vehicle_1:
# List of top 5 reasons for the contributing factor
# Abbreviating names for clarity
factor1 = {"Driver Inattention/Distraction": "Distraction",
           "Failure to Yield Right-of-Way": "Failure to Yield",
           "Following Too Closely": "Tailgating",
           "Unsafe Speed": "Unsafe Speed",
           "Passing or Lane Usage Improper": "Improper Lane Use"}
# Filter the data to only include valid contributing factors
confact = nyc_crash.loc[nyc_crash['contributing_factor_vehicle_1'].isin(factor1)].copy()
# Change to shortened names for better visibility
confact.loc[:, 'contributing_factor_vehicle_1'] = confact[
    'contributing_factor_vehicle_1'
].replace(factor1)
# Changes shape of point according to 'contributing_factor_vehicle_1'
(ggplot(confact, aes(x='longitude', y='latitude',
                     shape='contributing_factor_vehicle_1')) +
    geom_point(alpha = 0.7) +
    labs(title='Crash Locations by Top 5 Contributing Factors',
         x='Longitude',
         y='Latitude',
         shape = 'Contributing Factor',
         color= 'Contributing Factor') +
    coord_fixed(ratio = 1) +
    theme(figure_size = (7,5)))
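The filter-and-rename step used to build confact can be checked on a tiny example. This is a sketch on made-up rows (the `toy` DataFrame and the `factor` column are hypothetical), using the dictionary keys to keep rows and the dictionary values to shorten the names.

```python
import pandas as pd

# Made-up rows standing in for 'contributing_factor_vehicle_1'
toy = pd.DataFrame({
    "factor": ["Unsafe Speed", "Following Too Closely",
               "Unspecified", "Following Too Closely"],
})

# Keys are the values to keep, values are the shortened labels
short_names = {"Following Too Closely": "Tailgating",
               "Unsafe Speed": "Unsafe Speed"}

# Keep only rows whose factor is one of the dictionary keys
kept = toy.loc[toy["factor"].isin(list(short_names))].copy()

# Swap each long name for its shortened label
kept["factor"] = kept["factor"].replace(short_names)

print(kept["factor"].tolist())
# ['Unsafe Speed', 'Tailgating', 'Tailgating']
```

The "Unspecified" row is dropped because it is not a key in the dictionary, which is the same reason confact contains only the top five factors.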
Changing Color by Variables
To add color coordination to your plot in Plotnine, specify the variable you want to use for coloring by including color='variable' within the aes() function. This enables you to visually distinguish different categories in your dataset, enhancing the clarity and interpretability of your plot.
Changing color of point according to borough:
# color= changes color according to 'borough'
(ggplot(nyc_crash, aes(x='longitude', y='latitude', color = 'borough')) +
    geom_point() +
    labs(title='Crash Locations',
         x='Longitude',
         y='Latitude',
         # Changes key title to 'Borough'
         color = 'Borough') +
    coord_fixed(ratio = 1) +
    theme(figure_size = (7,5)))
As you can see, each borough is represented by its own color, allowing the audience to easily identify which borough the crash occurred in.
Changing color of points by contributing_factor_vehicle_1:
# color= changes color according to 'contributing_factor_vehicle_1'
(ggplot(confact, aes(x='longitude', y='latitude',
                     color='contributing_factor_vehicle_1')) +
    geom_point() +
    labs(title='Crash Locations by Top 5 Contributing Factors',
         x='Longitude',
         y='Latitude',
         color = 'Contributing Factor') +
    coord_fixed(ratio = 1) +
    # Changes plot size to be larger
    theme(figure_size = (7,5)))
This graph uses color to distinguish what contributing factor caused the crash.
Adding Linear Regression Line to Plot
If you want to fit a linear regression line, use geom_smooth(). Adding this to your plot can make trends in your data easier to see. To add a linear regression line to your scatterplot, include the following line of code:
geom_smooth(method='lm', se=False, color='red')
7.1.5 Bar Chart
Another common way to display data is a bar chart. You can create one with geom_bar(). We will start with a simple chart of crashes by borough.
Creating a Basic Bar Chart
(ggplot(nyc_crash, aes(x='borough')) + # Use 'borough' for the x-axis
    geom_bar(fill='purple') +
    labs(title='Number of Crashes by Borough',
         x='Borough',
         y='Crash Count'))
Customizing your Bar Chart
You can customize your bar chart in several ways: handpick the colors you want, map the fill to a variable, flip the orientation, and more:
# Designate your preferred colors (pastel color codes)
colors = ['#B3FFBA', '#E1C6FF', '#FFB3BA', '#BAE1FF', '#FFD5BA']
# Adding fill= changes the color of bar according to variable
(ggplot(nyc_crash, aes(x='borough', fill = 'borough')) +
    # Assigns your preferred colors
    geom_bar(fill = colors) +
    # Flips orientation of the chart
    coord_flip() +
    labs(title='Number of Crashes by Borough',
         x='Borough',
         y='Crash Count'))
Multivariable Bar Chart
You can also split up a bar chart to make it visually easier to understand.
# Using 'confact' dataset again for better visualization
(ggplot(confact, aes(x='contributing_factor_vehicle_1', fill='borough')) +
    geom_bar() +
    labs(title='Top 5 Contributing Factors by Borough',
         x='Top 5 Contributing Factor Vehicle 1',
         y='Number of Crashes',
         # Changes key name to "Borough"
         fill='Borough') +
    # size= creates smaller text
    # angle= rotates x-axis text for readability
    # figure_size= creates a larger image
    theme(axis_text_x=element_text(size=9, angle=65), figure_size= (7,7)))
7.1.6 Histogram
Another useful way to display data is a histogram. You can create one with geom_histogram(). A histogram is especially useful for displaying the distribution of numeric data.
Basic Histogram
(ggplot(nyc_crash, aes(x='number_of_persons_injured')) +
    # bins= sets the amount of bars in your histogram
    geom_histogram(bins=10, alpha=0.8, fill='green') +
    labs(title='Distribution of Persons Injured',
         x='Number of Persons Injured',
         y='Count of Crashes'))
A histogram makes the shape of a distribution easy to see. Here, the number of persons injured per crash in our NYC crash data is positively (right) skewed.
Multivariable Histogram
Similar to bar charts, you can make histograms that display more than one variable.
(ggplot(confact, aes(x='number_of_persons_injured', fill = 'borough')) +
    # binwidth= changes width of your bars
    # color= changes outline color for better visibility
    geom_histogram(binwidth=1, color = 'black') +
    labs(title='Distribution of Persons Injured',
         x='Number of Persons Injured',
         y='Count of Crashes',
         fill = 'Borough'))
Overlapping Histogram
Histograms can also be useful when comparing multiple categories. Here we are comparing Manhattan and Brooklyn’s number of persons injured with an overlapping histogram.
# Creating plot if crash is in 'MANHATTAN' or 'BROOKLYN'
(ggplot(nyc_crash[nyc_crash['borough'].isin(['MANHATTAN', 'BROOKLYN'])],
        aes(x='number_of_persons_injured', fill='borough')) +
    geom_histogram(bins=10) +
    labs(title='Persons Injured: Manhattan vs Brooklyn',
         x='Number of Persons Injured',
         y='Count',
         fill='Borough'))
7.1.7 Line Chart
Line charts are great for time-series data and can be created with geom_line(). This type of chart is particularly useful for identifying patterns, fluctuations, and trends, making it easier to understand how a variable changes over a specified period. We will create one analyzing Number of Crashes by Hour.
Basic Line Chart
# Finding crashes per hour
nyc_crash['crash_datetime'] = pd.to_datetime(nyc_crash['crash_datetime'])
# Extract hour
nyc_crash['crash_hour'] = nyc_crash['crash_datetime'].dt.hour
# Count crashes per hour
crash_counts = (nyc_crash.groupby(['crash_hour'])
                .size().reset_index(name='crash_count'))
# Plot crashes by hour
(ggplot(crash_counts, aes(x='crash_hour', y='crash_count')) +
    # Creates the line chart
    geom_line() +
    # Adds points for better visibility
    geom_point() +
    labs(title='Number of Crashes by Hour',
         x='Hour',
         y='Crashes') +
    # Formats the x-axis to display ticks by every 2 hours
    scale_x_continuous(breaks=range(0, 24, 2)))
This example is excellent for understanding the grammar of graphics. As you can see, we use geom_line() to create the line chart, while also adding geom_point(), which is typically used for scatterplots, to make the figure clearer by layering additional details.
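The counting step that feeds the line chart can be checked on a few made-up timestamps. This sketch uses a hypothetical `toy` DataFrame with the same column names as above and shows what the hour extraction and groupby produce.

```python
import pandas as pd

# Made-up crash timestamps
toy = pd.DataFrame({
    "crash_datetime": ["2024-06-30 00:15", "2024-06-30 00:45",
                       "2024-06-30 13:05", "2024-07-01 13:30",
                       "2024-07-02 13:59"],
})

# Parse the strings, then pull out the hour of the day (0 to 23)
toy["crash_datetime"] = pd.to_datetime(toy["crash_datetime"])
toy["crash_hour"] = toy["crash_datetime"].dt.hour

# One row per hour, with the number of crashes in that hour
counts = toy.groupby("crash_hour").size().reset_index(name="crash_count")

print(counts["crash_hour"].tolist())   # [0, 13]
print(counts["crash_count"].tolist())  # [2, 3]
```

Note that hours with no crashes do not appear in the result at all, so a line chart will connect the neighboring hours directly.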
Multivariable Line Chart
As with the other figures, you can create a line chart with multiple variables. Now we will create a chart of the number of crashes per hour for each borough.
# Setting crash counts to also include borough
crash_counts = nyc_crash.groupby(['crash_hour',
    'borough']).size().reset_index(name='crash_count')
# Plots crashes by hour with different lines for each borough
(ggplot(crash_counts, aes(x='crash_hour', y='crash_count',
                          color='borough')) +
    # size= changes the thickness of the lines
    geom_line(size=0.5) +
    labs(title='Number of Crashes by Hour and Borough',
         x='Hour of the Day',
         y='Number of Crashes',
         color = 'Borough') +
    scale_x_continuous(breaks=range(0, 24, 2)))
7.1.8 Faceting Your Plots
To organize your data in a way that enhances interpretability, you can use facet_grid() or facet_wrap(). This approach creates separate plots based on categorical variables, making it easier to identify trends and patterns. You can facet any type of plot (scatterplots, bar charts, histograms, line charts, etc.) using one or two variables.
Scatterplots per Facet
Scatterplot of Crash Locations by Contributing Factor with facet_wrap():
(ggplot(confact, aes(x='longitude', y='latitude')) +
    geom_point(alpha=0.5) +
    # Creates separate plots for each contributing factor
    facet_wrap('contributing_factor_vehicle_1') +
    labs(title='Crash Locations by Top 5 Contributing Factor',
         x='Longitude',
         y='Latitude') +
    coord_fixed(ratio = 1))
Scatterplot of Two Variables, Crash Locations by Contributing Factor and Borough, with facet_grid():
(ggplot(confact, aes(x='longitude', y='latitude')) +
    geom_point(alpha = 0.5) +
    # Creates a grid of subplots based on the values of two variables
    # ~'contributing_factor_vehicle_1' by 'borough'
    facet_grid('contributing_factor_vehicle_1 ~ borough') +
    labs(title='Crash Locations by Top 5 Contributing Factor',
         x='Longitude',
         y='Latitude') +
    # Changes angle of text and size of the graphic
    theme(axis_text_x=element_text(angle=90),
          # strip_text=element_text changes text size of the facet titles
          strip_text=element_text(size=5.5)) +
    coord_fixed(ratio = 1))
Bar Chart per Facet
Bar chart of Contributing Factors by Borough with facet_wrap():
(ggplot(confact, aes(x='contributing_factor_vehicle_1', fill='borough')) +
    geom_bar() +
    labs(title='Top 5 Contributing Factors by Borough',
         x='Top 5 Contributing Factor Vehicle 1',
         y='Number of Crashes',
         fill = 'Borough') +
    facet_wrap('~ borough') +
    theme(axis_text_x=element_text(size=9, angle=65), figure_size= (7,7)))
Histograms per Facet
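Histogram of Crashes per Hour by Borough with facet_wrap(). Because crash_counts already holds the count for each hour, the bars are drawn with geom_bar(stat='identity'):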
(ggplot(crash_counts, aes(x='crash_hour', y='crash_count', fill = 'borough')) +
    geom_bar(stat='identity') +
    labs(x='Crash Hour', y='Number of Crashes', title = "Crashes by Hour") +
    facet_wrap('~ borough'))
Line Chart per Facet
You can plot each group on a separate panel with facet_wrap().
(ggplot(crash_counts, aes(x='crash_hour', y='crash_count')) +
    geom_line() +
    # Breaks the figure up by borough
    facet_wrap("borough") +
    labs(title='Number of Crashes by Hour and Borough',
         x='Hour of the Day',
         y='Number of Crashes'))
7.1.9 Conclusion
Plotnine is a very powerful tool to make impactful and detailed graphics. The flexibility of its grammar of graphics approach means there are endless ways to modify, enhance, and be creative with your plots. You can layer geoms, adjust aesthetics, and apply scales, facets, and themes.
Creating Specific Plots
- Scatterplot: geom_point()
- Boxplot: geom_boxplot()
- Histogram: geom_histogram()
- Line Chart: geom_line()
- Bar Chart: geom_bar()
- Density Plot: geom_density()
Formatting and Customizing Your Figure:
- fill: to change the color of the data
- color: to change the color of the borders
- alpha: to change the transparency
- bins: to change the number of bins
- figure_size: to change size of graphic
- geom_smooth: to add a smoothed line
- facet: plot each group on a separate panel
  - facet_wrap(): creates a series of plots arranged in a grid, wrapping into new rows or columns as needed
  - facet_grid(): allows you to create a grid layout based on two categorical variables, organizing plots in a matrix format
- theme: change overall theme
There are many other features and customizations you can do with Plotnine. For more information on how to leverage the full potential of this package for your data visualization needs check out Plotnine’s Graph Gallery.
Happy plotting!
Sources
Python Graph Gallery. (2024). Plotnine: ggplot in python. Python Graph Gallery. https://python-graph-gallery.com/plotnine/
Sarker, D. (2018). A comprehensive guide to the grammar of graphics for effective visualization of multi-dimensional data. Towards Data Science. https://towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149
Wilkinson, L. (2012). The grammar of graphics (pp. 375-414). Springer Berlin Heidelberg.