Data manipulation is crucial for transforming raw data into a more analyzable format, essential for uncovering patterns and ensuring accurate analysis. This chapter introduces the core techniques for data manipulation in Python, utilizing the Pandas library, a cornerstone for data handling within Python’s data science toolkit.
Python’s ecosystem is rich with libraries that facilitate not just data manipulation but comprehensive data analysis. Pandas, in particular, provides extensive functionality for data manipulation tasks including reading, cleaning, transforming, and summarizing data. Using real-world datasets, we will explore how to leverage Python for practical data manipulation tasks.
By the end of this chapter, you will learn to:
Import/export data from/to diverse sources.
Clean and preprocess data efficiently.
Transform and aggregate data to derive insights.
Merge and concatenate datasets from various origins.
Analyze real-world datasets using these techniques.
6.2 Example: NYC Crash Data
Consider a subset of the NYC Crash Data, which contains all NYC motor vehicle collisions data with documentation from NYC Open Data. We downloaded the crash data for the week of August 31, 2025, on September 11, 2025, in CSV format.
import numpy as np
import pandas as pd

# Load the dataset
file_path = 'data/nyc_crashes_lbdwk_2025.csv'
df = pd.read_csv(file_path,
                 dtype={'LATITUDE': np.float32,
                        'LONGITUDE': np.float32,
                        'ZIP CODE': str})

# Replace column names: convert to lowercase and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Check for missing values
df.isnull().sum()
# Replace invalid coordinates (latitude=0 or longitude=0) with NaN
df.loc[(df['latitude'] == 0) & (df['longitude'] == 0),
       ['latitude', 'longitude']] = np.nan
df['latitude'] = df['latitude'].replace(0, np.nan)
df['longitude'] = df['longitude'].replace(0, np.nan)

# Drop the redundant `location` column (it duplicates latitude and longitude)
df = df.drop(columns=['location'])

# Combine 'crash_date' and 'crash_time' into a single datetime column
df['crash_datetime'] = pd.to_datetime(
    df['crash_date'] + ' ' + df['crash_time'],
    format='%m/%d/%Y %H:%M', errors='coerce')

# Drop the original 'crash_date' and 'crash_time' columns
df = df.drop(columns=['crash_date', 'crash_time'])
Let’s get some basic frequency tables of borough and zip_code, whose values can be checked against the legitimate values.
# Frequency table for 'borough' without filling missing values
borough_freq = df['borough'].value_counts(dropna=False).reset_index()
borough_freq.columns = ['borough', 'count']

# Frequency table for 'zip_code' without filling missing values
zip_code_freq = df['zip_code'].value_counts(dropna=False).reset_index()
zip_code_freq.columns = ['zip_code', 'count']
zip_code_freq

     zip_code  count
0         NaN    284
1       11207     33
2       11203     29
3       11212     23
4       11233     21
...       ...    ...
159     10028      1
160     11426      1
161     10307      1
162     11365      1
163     11694      1

164 rows × 2 columns
A comprehensive list of ZIP codes by borough can be obtained, for example, from the New York City Department of Health’s UHF Codes. We can use this list to check the validity of the zip codes in the data.
As it turns out, the collection of valid NYC zip codes differs across sources. For example, according to United States Zip Codes, 10065 appears to be a valid NYC zip code. Under this circumstance, it might be safer not to remove any zip code from the data.
To be safe, let’s take the union of the valid and invalid zips.
# Convert invalid ZIP codes to a set of strings
invalid_zips_set = set(invalid_zip_freq['zip_code'].dropna().astype(str))

# Convert all_valid_zips to a set of strings (if not already)
valid_zips_set = set(map(str, all_valid_zips))

# Merge both sets
merged_zips = invalid_zips_set | valid_zips_set  # Union of both sets
Do missing values in zip_code and borough always co-occur?
# Check if missing values in 'zip_code' and 'borough' always co-occur
# Count rows where both are missing
missing_cooccur = df[['zip_code', 'borough']].isnull().all(axis=1).sum()

# Count total missing in 'zip_code' and 'borough', respectively
total_missing_zip_code = df['zip_code'].isnull().sum()
total_missing_borough = df['borough'].isnull().sum()

# If missing in both columns always co-occur, the number of missing
# co-occurrences should be equal to the total missing in either column
np.array([missing_cooccur, total_missing_zip_code, total_missing_borough])
array([284, 284, 284])
Are there cases where zip_code and borough are missing but the geo codes are not missing? If so, fill in zip_code and borough using the geo codes by reverse geocoding.
First make sure geopy is installed.
pip install geopy
Now we use the Nominatim geocoder from the geopy package to reverse geocode.
from geopy.geocoders import Nominatim
import time

# Initialize the geocoder; the `user_agent` is your identifier
# when using the service. Be mindful not to overwhelm the server
# with an unlimited number of queries, especially invalid ones.
geolocator = Nominatim(user_agent="jyGeopyTry")
We write a function to do the reverse geocoding given latitude and longitude.
# Function to fill missing zip_code
def get_zip_code(latitude, longitude):
    try:
        location = geolocator.reverse((latitude, longitude), timeout=10)
        if location:
            address = location.raw['address']
            zip_code = address.get('postcode', None)
            return zip_code
        else:
            return None
    except Exception as e:
        print(f"Error: {e} for coordinates {latitude}, {longitude}")
        return None
    finally:
        time.sleep(1)  # Delay to avoid overwhelming the service
Let’s try it out:
# Example usage
latitude = 40.730610
longitude = -73.935242
get_zip_code(latitude, longitude)
'11101'
The function get_zip_code can then be applied to rows where zip code is missing but geocodes are not to fill the missing zip code.
Once the zip code is known, figuring out the borough is simple because the valid zip codes for each borough are known.
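As a minimal sketch of the fill-in step, the helper below looks up zip codes only for rows where the zip code is missing but both coordinates are present. The function name fill_missing_zip, the toy data frame, and the stand-in fake_lookup are hypothetical and not part of the original analysis; in practice, get_zip_code from above would be passed as the lookup, at the cost of about one second per queried row.

```python
import numpy as np
import pandas as pd


def fill_missing_zip(df, lookup):
    """Fill missing zip_code values using a reverse-geocoding function.

    `lookup` takes (latitude, longitude) and returns a zip code string
    or None, e.g., the get_zip_code function defined earlier.
    """
    out = df.copy()
    # Rows where zip_code is missing but both coordinates are available
    mask = (out['zip_code'].isnull()
            & out['latitude'].notnull()
            & out['longitude'].notnull())
    if mask.any():
        out.loc[mask, 'zip_code'] = [
            lookup(lat, lon)
            for lat, lon in zip(out.loc[mask, 'latitude'],
                                out.loc[mask, 'longitude'])
        ]
    return out


# Toy data (hypothetical) and a stand-in lookup that avoids network calls
toy = pd.DataFrame({
    'zip_code': ['11207', None, None],
    'latitude': [40.67, 40.730610, np.nan],
    'longitude': [-73.89, -73.935242, np.nan],
})


def fake_lookup(latitude, longitude):
    return '11101'


filled = fill_missing_zip(toy, fake_lookup)
print(filled)
```

The third toy row stays missing because it has no coordinates to geocode. Passing the lookup as an argument keeps the logic testable without querying the geocoding service.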
6.3 Accessing Census Data
The U.S. Census Bureau provides extensive demographic, economic, and social data through multiple surveys, including the decennial Census, the American Community Survey (ACS), and the Economic Census. These datasets offer valuable insights into population trends, economic conditions, and community characteristics at multiple geographic levels.
There are multiple ways to access Census data. For example:
Census API: The Census API allows programmatic access to various datasets. It supports queries for different geographic levels and time periods.
data.census.gov: The official web interface for searching and downloading Census data.
IPUMS USA: Provides harmonized microdata for longitudinal research. Available at IPUMS USA.
NHGIS: Offers historical Census data with geographic information. Visit NHGIS.
In addition, Python tools simplify API access and data retrieval.
6.3.1 Python Tools for Accessing Census Data
Several Python libraries facilitate Census data retrieval:
census: A high-level interface to the Census API, supporting ACS and decennial Census queries. See census on PyPI.
censusdis: Provides richer functionality: automatic discovery of variables, geographies, and datasets. Helpful if you don’t want to manually look up variable codes. See censusdis on PyPI.
us: Often used alongside census libraries to handle U.S. state and territory information (e.g., FIPS codes). See us on PyPI.
6.3.2 Zip-Code Level for NYC Crash Data
Now that we have NYC crash data, we might want to analyze patterns at the zip-code level to understand whether certain demographic or economic factors correlate with traffic incidents. While the crash dataset provides details about individual accidents, such as location, time, and severity, it does not contain contextual information about the neighborhoods where these crashes occur.
To perform meaningful zip-code-level analysis, we need additional data sources that provide relevant demographic, economic, and geographic variables. For example, understanding whether high-income areas experience fewer accidents, or whether population density influences crash frequency, requires integrating Census data. Key variables such as population size, median household income, employment rate, and population density can provide valuable context for interpreting crash trends across different zip codes.
Since the Census Bureau provides detailed estimates for these variables at the zip-code level, we can use the Census API or other tools to retrieve relevant data and merge it with the NYC crash dataset. To access the Census API, you need an API key, which is free and easy to obtain. Visit the Census API Request page and submit your email address to receive a key. Once you have the key, you must include it in your API requests to access Census data. The following demonstration assumes that you have registered, obtained your API key, and saved it in a file called censusAPIkey.txt.
# Import modules
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
from census import Census
from us import states
import os
import io

# Read the API key from a local file and create the client
with open("censusAPIkey.txt") as f:
    api_key = f.read().strip()
c = Census(api_key)
Suppose that we want to get some basic information from the ACS 5-year data of 2024 for all the NYC zip codes. The variable names can be found in the ACS variable documentation.
ACS_YEAR = 2024
ACS_DATASET = "acs/acs5"

# Important ACS variables
ACS_VARIABLES = {
    "B01003_001E": "Total Population",
    "B19013_001E": "Median Household Income",
    "B02001_002E": "White Population",
    "B02001_003E": "Black Population",
    "B02001_005E": "Asian Population",
    "B15003_022E": "Bachelor’s Degree Holders",
    "B15003_025E": "Graduate Degree Holders",
    "B23025_002E": "Labor Force",
    "B23025_005E": "Unemployed",
    "B25077_001E": "Median Home Value"
}

# Convert set to list of strings
merged_zips = list(map(str, merged_zips))
Let’s set up the query to request the ACS data, and process the returned data.
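The sketch below illustrates only the processing half of this step, under stated assumptions. It assumes the census package returns a list of dictionaries keyed by variable code plus a "zip code tabulation area" key, as from a call shaped like c.acs5.get(list(ACS_VARIABLES), {'for': 'zip code tabulation area:*'}, year=ACS_YEAR); that call shape is an assumption and is not executed here. The two records are made up for illustration, and the helper name acs_records_to_frame is hypothetical. Large negative values such as -666666666 are treated as missing because the Census API uses them as placeholders for unavailable estimates.

```python
import numpy as np
import pandas as pd

# Subset of the variable map used in this chapter
ACS_VARIABLES = {
    "B01003_001E": "Total Population",
    "B19013_001E": "Median Household Income",
}

# Made-up records in the assumed shape of the API response
records = [
    {"B01003_001E": "1000", "B19013_001E": "50000",
     "zip code tabulation area": "00001"},
    {"B01003_001E": "0", "B19013_001E": "-666666666",
     "zip code tabulation area": "00002"},
]


def acs_records_to_frame(records, variables, keep_zips=None):
    """Convert ACS records into a data frame with readable column names."""
    df = pd.DataFrame(records)
    df = df.rename(columns={"zip code tabulation area": "ZIP Code"})
    df = df.rename(columns=variables)
    value_cols = list(variables.values())
    # Convert to numbers; unparseable entries become NaN
    df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")
    # Negative placeholders mark unavailable estimates
    df[value_cols] = df[value_cols].mask(df[value_cols].lt(0))
    if keep_zips is not None:
        # Optionally restrict to a list of zip codes such as merged_zips
        df = df[df["ZIP Code"].isin(keep_zips)]
    return df.reset_index(drop=True)


df_acs_demo = acs_records_to_frame(records, ACS_VARIABLES)
print(df_acs_demo)
```

The "ZIP Code" column name is chosen to match the key used when merging with the shapefile later in this section.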
We can save the ACS data df_acs in Feather format (see the next section).
df_acs.to_feather("data/acs2023.feather")
The population density could be an important factor for crash likelihood. To obtain the population densities, we need the areas of the zip codes. The shapefiles can be obtained from NYC Open Data.
import requests
import zipfile
import geopandas as gpd

# Define the NYC MODZCTA shapefile URL and extraction directory
shapefile_url = "https://data.cityofnewyork.us/api/geospatial/pri4-ifjk?method=export&format=Shapefile"
extract_dir = "tmp/MODZCTA_Shapefile"

# Create the directory if it doesn't exist
os.makedirs(extract_dir, exist_ok=True)

# Step 1: Download and extract the shapefile
print("Downloading MODZCTA shapefile...")
response = requests.get(shapefile_url)
response.raise_for_status()  # Stop early if the download failed
with zipfile.ZipFile(io.BytesIO(response.content), "r") as z:
    z.extractall(extract_dir)

print(f"Shapefile extracted to: {extract_dir}")
Downloading MODZCTA shapefile...
Shapefile extracted to: tmp/MODZCTA_Shapefile
Now we process the shapefile to calculate the areas of the polygons.
# Step 2: Automatically detect the correct .shp file
shapefile_path = None
for file in os.listdir(extract_dir):
    if file.endswith(".shp"):
        shapefile_path = os.path.join(extract_dir, file)
        break  # Use the first .shp file found

if not shapefile_path:
    raise FileNotFoundError("No .shp file found in extracted directory.")

print(f"Using shapefile: {shapefile_path}")

# Step 3: Load the shapefile into GeoPandas
gdf = gpd.read_file(shapefile_path)

# Step 4: Project to UTM zone 18N (EPSG:32618, units in meters).
# Web Mercator (EPSG:3857) would inflate areas at NYC's latitude.
gdf = gdf.to_crs(epsg=32618)

# Step 5: Compute land area in square miles
# 1 square mile = 2,589,988.11 square meters
gdf['land_area_sq_miles'] = gdf['geometry'].area / 2_589_988.11

print(gdf[['modzcta', 'land_area_sq_miles']].head())
# Merge ACS data (`df_acs`) directly with MODZCTA land area (`gdf`)
gdf = gdf.merge(df_acs, left_on='modzcta', right_on='ZIP Code', how='left')

# Calculate population density (people per square mile)
gdf['popdensity_per_sq_mile'] = (
    gdf['Total Population'] / gdf['land_area_sq_miles']
)

# Display first few rows
print(gdf[['modzcta', 'Total Population', 'land_area_sq_miles',
           'popdensity_per_sq_mile']].head())
import matplotlib.pyplot as plt
import geopandas as gpd

# Set up figure and axis
fig, ax = plt.subplots(figsize=(10, 12))

# Plot the choropleth map
gdf.plot(column='popdensity_per_sq_mile',
         cmap='viridis',  # Use a visually appealing color map
         linewidth=0.8,
         edgecolor='black',
         legend=True,
         legend_kwds={'label': "Population Density (per sq mile)",
                      'orientation': "horizontal"},
         ax=ax)

# Add a title
ax.set_title("Population Density by ZCTA in NYC", fontsize=14)

# Remove axes
ax.set_xticks([])
ax.set_yticks([])
ax.set_frame_on(False)

# Show the plot
plt.show()
6.4 Cross-platform Data Format Arrow
The CSV format (and related formats like TSV, tab-separated values) for data tables is ubiquitous, convenient, and can be read or written by many different data analysis environments, including spreadsheets. An advantage of the textual representation of the data in a CSV file is that the entire data table, or portions of it, can be previewed in a text editor. However, the textual representation can be ambiguous and inconsistent. The type of a particular column (Boolean, integer, floating-point, text, factor, etc.) must be inferred from the text representation, often at the expense of reading the entire file before these inferences can be made. Experienced data scientists are aware that a substantial part of an analysis or report generation is often the “data cleaning” involved in preparing the data for analysis. This can be an open-ended task; it required numerous trial-and-error iterations to create the list of different missing data representations we use for the sample CSV file, and even now we are not sure we have them all.
To read and export data efficiently, leveraging the Apache Arrow library can significantly improve performance and storage efficiency, especially with large datasets. The IPC (Inter-Process Communication) file format in the context of Apache Arrow is a key component for efficiently sharing data between different processes, potentially written in different programming languages. Arrow’s IPC mechanism is designed around two main file formats:
Stream Format: For sending an arbitrary length sequence of Arrow record batches (tables). The stream format is useful for real-time data exchange where the size of the data is not known upfront and can grow indefinitely.
File (or Feather) Format: Optimized for storage and memory-mapped access, allowing for fast random access to different sections of the data. This format is ideal for scenarios where the entire dataset is available upfront and can be stored in a file system for repeated reads and writes.
Apache Arrow provides a columnar memory format for flat and hierarchical data, optimized for efficient data analytics. It can be used in Python through the pyarrow package. Here’s how you can use Arrow to read, manipulate, and export data, including a demonstration of storage savings.
First, ensure you have pyarrow installed on your computer (and preferably, in your current virtual environment):
pip install pyarrow
Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames, optimized for speed and efficiency, particularly for IPC and data sharing between Python and R or Julia.
The following code processes the raw data in CSV format and writes it out in Arrow format.
# File paths
csv_file = 'data/nyc_crashes_lbdwk_2025.csv'
feather_file = 'tmp/nyc_crashes_lbdwk_2025.feather'

import pandas as pd

# Move 'crash_datetime' to the first column
df = df[['crash_datetime'] + df.drop(columns=['crash_datetime']).columns.tolist()]

# Remove a trailing '.0' if present. Note that str.rstrip('.0') would strip
# every trailing '.' and '0' character, turning '10010' into '1001'.
df['zip_code'] = df['zip_code'].astype(str).str.replace(r'\.0$', '', regex=True)

df = df.sort_values(by='crash_datetime')
df.to_feather(feather_file)
Let’s compare the file sizes of the feather format and the CSV format.
import os

# Get file sizes in bytes
csv_size = os.path.getsize(csv_file)
feather_size = os.path.getsize(feather_file)

# Convert bytes to a more readable format (e.g., MB)
csv_size_mb = csv_size / (1024 * 1024)
feather_size_mb = feather_size / (1024 * 1024)

# Print the file sizes
print(f"CSV file size: {csv_size_mb:.2f} MB")
print(f"Feather file size: {feather_size_mb:.2f} MB")
Structured Query Language (SQL) is one of the standard languages that is used to work with large databases. It uses tables to store and display data, creating an organized and comprehensible interface that makes it far easier to track and view your data.
Some advantages of using SQL include its ability to handle large amounts of data while simultaneously simplifying the process of creating, updating, and retrieving any data you may want.
The biggest advantage of using SQL for the purposes of this class is that it can very easily connect with Python and R. This makes it so that we can have all of the benefits of working with SQL while still working in the Python environment we already have set up.
6.5.2 Setting up Databases
In this section, we will be working with two databases: one built from the tables in a Python package (nycflights13) and one that we used for our midterm project (311 Service Requests).
6.5.2.1 nycflights13
This database is built from the nycflights13 package, which provides its tables as pandas data frames, and includes data for all flights that left New York City airports in 2013. The data are spread across several tables, including ones that describe each flight, airline, and airport. Data stored like this would typically be tedious to deal with, but SQL makes it simple.
To set up this database, we need to import the proper packages:
import sqlite3
import pandas as pd
from nycflights13 import flights, airlines, airports
Here we imported “sqlite3”, Python’s built-in module for working with SQLite databases. We also imported “pandas” and, from the nycflights13 package, the three tables we will be working with.
Now, we want to establish the connection between Python and our SQL database:
nycconn = sqlite3.connect("nycflights13.db")
This snippet both creates the nycflights13.db database and establishes nycconn as our connection to this database.
The next step is to add the three tables we imported from nycflights13 to the database:
Writing each data frame with its to_sql method converts the three pandas data frames into tables in the SQL database. The if_exists argument controls what happens if there is already a table in the database with the same name. The index argument determines whether the data frame’s index is written to the table as an additional column.
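A minimal sketch of this step follows, using two small stand-in data frames and an in-memory database so that it does not touch nycflights13.db. The names flights_demo, airlines_demo, and conn, as well as the toy rows, are hypothetical; the to_sql arguments shown are the ones discussed above.

```python
import sqlite3

import pandas as pd

# Toy stand-ins for the nycflights13 data frames (hypothetical rows)
flights_demo = pd.DataFrame({
    "carrier": ["UA", "AA"],
    "origin": ["EWR", "JFK"],
    "dest": ["IAH", "MIA"],
})
airlines_demo = pd.DataFrame({
    "carrier": ["UA", "AA"],
    "name": ["United Air Lines Inc.", "American Airlines Inc."],
})

# An in-memory database keeps the sketch from writing any file
conn = sqlite3.connect(":memory:")

# if_exists="replace" drops and recreates a table of the same name;
# index=False keeps the data frame's index out of the table
flights_demo.to_sql("flights", conn, if_exists="replace", index=False)
airlines_demo.to_sql("airlines", conn, if_exists="replace", index=False)

# List the tables now stored in the database
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;",
    conn,
)
print(tables)
```

With if_exists="replace", rerunning the cell rebuilds the tables instead of raising an error; "fail" (the default) and "append" are the other options.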
6.5.2.2 serviceRequests Database
First, we need to import our serviceRequests data from the csv in the data folder:
Here we imported the dataframe we got from the csv file as the only table in the serviceRequests.db database.
6.5.3 Query Basics
Now we can move onto working with the data in our SQL databases. The most common use for SQL is writing a “query” which is a statement sent to the SQL database that returns a selection of rows and columns from the tables in a database.
We will be looking at the “flights” table in our nycflights13 database for this section.
6.5.3.1 SELECT and FROM
The following is the most basic possible query one can perform:
This snippet represents the basic form of writing SQL queries in Python. We create a variable ‘query’ that contains the statement we intend to pass into SQL. The last line then uses the nycconn connection we created earlier to pass our query into SQL and it returns the head() of the result we get back.
This query is the most basic query, as it returns the entire flights table. The SELECT line of the query is where you put the names of the columns you want from your table. You specify which table you want to work with on the FROM line. We put “*” in our SELECT line, which returns all columns. It is good practice to end every SQL query with “;”, which marks the end of the statement and is required by some SQL tools.
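The pattern described above can be sketched as follows. To stay self-contained, the snippet builds a three-row stand-in for the flights table in an in-memory database; the connection name conn and the toy rows are hypothetical, while the query itself is the basic SELECT and FROM form discussed in the text.

```python
import sqlite3

import pandas as pd

# Toy flights table (hypothetical rows) in an in-memory database
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "origin": ["EWR", "LGA", "JFK"],
    "dest": ["IAH", "IAH", "MIA"],
    "carrier": ["UA", "UA", "AA"],
}).to_sql("flights", conn, index=False)

# SELECT lists the columns ("*" means all); FROM names the table
query = """
SELECT *
FROM flights;
"""
result = pd.read_sql_query(query, conn)
print(result.head())
```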
Now let’s try to simplify what we’re seeing by only looking at the origins and destinations of the flights:
Here we replaced the “*” in our SELECT statement with “origin, dest”. This told SQL to return only those two columns from the table.
6.5.3.2 ORDER BY
These columns are quite messy to look at so let’s try sorting them by origin:
query = """
SELECT origin, dest
FROM flights
ORDER BY origin;
"""
pd.read_sql_query(query, nycconn).head()

  origin dest
0    EWR  IAH
1    EWR  ORD
2    EWR  FLL
3    EWR  SFO
4    EWR  LAS
Here we added a new “ORDER BY” line. This line tells SQL what columns to sort the list by.
You can sort by multiple columns just by listing them with commas in between:
query = """
SELECT origin, dest
FROM flights
ORDER BY origin, dest;
"""
pd.read_sql_query(query, nycconn).head()

  origin dest
0    EWR  ALB
1    EWR  ALB
2    EWR  ALB
3    EWR  ALB
4    EWR  ALB
6.5.3.3 SELECT DISTINCT
Now that the list is properly sorted, we can see that there are multiple flights for each origin and destination combination. If we want to see only the unique combinations, we can do this:
query = """
SELECT DISTINCT origin, dest
FROM flights
ORDER BY origin, dest;
"""
pd.read_sql_query(query, nycconn).head()

  origin dest
0    EWR  ALB
1    EWR  ANC
2    EWR  ATL
3    EWR  AUS
4    EWR  AVL

Here we replaced our “SELECT” statement with a “SELECT DISTINCT” statement. This tells SQL to return only the unique rows from the query.
6.5.4 Conditionals
Usually you wouldn’t want to just return all rows or all unique rows from a table. Instead, you will have conditions that determine which rows are relevant to your query.
6.5.4.1 WHERE
The way you can add conditionals to your query is by adding a “WHERE” line. Let’s take the same list from the last section and filter it so that only the flights that departed from LGA are returned:
You add conditionals in the “WHERE” line by using the following comparators:
‘=’
‘<’
‘>’
‘<=’
‘>=’
‘!=’
Be careful of the type of the data in a particular column!
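As a minimal sketch of a WHERE filter, and of why column types matter, the snippet below builds a small hypothetical flights table in an in-memory SQLite database. It uses only the standard-library sqlite3 module so that it runs without extra packages; the same query strings could be passed to pd.read_sql_query. Text values are compared against quoted strings, while numeric columns are compared against bare numbers. The rows and dep_delay values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, dest TEXT, dep_delay INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("LGA", "ATL", 5), ("LGA", "ORD", 45),
     ("EWR", "ATL", 60), ("JFK", "MIA", -3)],
)

# Text comparison: the value is a quoted string
query_text = """
SELECT origin, dest
FROM flights
WHERE origin = 'LGA'
ORDER BY origin, dest;
"""
lga_rows = conn.execute(query_text).fetchall()

# Numeric comparison: the value is a bare number
query_num = """
SELECT origin, dest, dep_delay
FROM flights
WHERE dep_delay >= 45
ORDER BY dep_delay;
"""
delayed_rows = conn.execute(query_num).fetchall()

print(lga_rows)
print(delayed_rows)
```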
6.5.4.2 AND, OR, and NOT
You can add multiple conditionals by using AND, OR, and parentheses.
query = """
SELECT origin, dest
FROM flights
WHERE origin = 'LGA' AND dest = 'ATL'
ORDER BY origin, dest;
"""
pd.read_sql_query(query, nycconn).head()

  origin dest
0    LGA  ATL
1    LGA  ATL
2    LGA  ATL
3    LGA  ATL
4    LGA  ATL

This uses AND to return all flights that departed from LGA and arrived in ATL.
We can also use OR to return all flights that either departed from LGA or arrived in ATL:
query = """
SELECT DISTINCT origin, dest
FROM flights
WHERE origin = 'LGA' OR dest = 'ATL'
ORDER BY origin, dest;
"""
pd.read_sql_query(query, nycconn).head()

  origin dest
0    EWR  ATL
1    JFK  ATL
2    LGA  ATL
3    LGA  AVL
4    LGA  BGR

When you want to get more complicated with your conditionals, parentheses can be used to ensure SQL combines the AND and OR statements in the order you intend.
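To illustrate why parentheses matter, the sketch below runs the same three conditions with and without them. SQL evaluates AND before OR, so the two queries return different rows. The table is a hypothetical four-row stand-in for flights in an in-memory database, queried with the standard-library sqlite3 module; the same query strings could be passed to pd.read_sql_query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, dest TEXT, carrier TEXT)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("LGA", "ATL", "DL"), ("LGA", "ORD", "UA"),
     ("EWR", "ATL", "UA"), ("JFK", "MIA", "AA")],
)

# Without parentheses, AND is evaluated before OR, so this reads as
# origin = 'LGA' OR (dest = 'ATL' AND carrier = 'UA')
no_parens = conn.execute("""
SELECT origin, dest, carrier
FROM flights
WHERE origin = 'LGA' OR dest = 'ATL' AND carrier = 'UA'
ORDER BY origin, dest;
""").fetchall()

# Parentheses force the OR to be evaluated first
with_parens = conn.execute("""
SELECT origin, dest, carrier
FROM flights
WHERE (origin = 'LGA' OR dest = 'ATL') AND carrier = 'UA'
ORDER BY origin, dest;
""").fetchall()

print(no_parens)
print(with_parens)
```

The first query keeps the Delta flight from LGA because the carrier condition applies only to the ATL branch; the second query drops it.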
Use NOT to return the opposite of a statement:
query = """
SELECT DISTINCT origin, dest
FROM flights
WHERE NOT (origin = 'LGA' OR dest = 'ATL')
ORDER BY origin, dest;
"""
pd.read_sql_query(query, nycconn).head()

  origin dest
0    EWR  ALB
1    EWR  ANC
2    EWR  AUS
3    EWR  AVL
4    EWR  BDL
6.5.4.3 COUNT
COUNT is an easy way in SQL to see the total number of rows that fit the conditions you’ve specified:
query = """
SELECT COUNT(DISTINCT dest)
FROM flights
WHERE NOT (origin = 'LGA' OR dest = 'ATL');
"""
pd.read_sql_query(query, nycconn).head()

   COUNT(DISTINCT dest)
0                    97

This returns the number of distinct destinations that fit the specified criteria.
6.5.4.4 LIMIT
If you don’t want to use .head() to only display the first few rows, this can be done in SQL using a LIMIT statement:
This query added the “carrier” column, which displays the airline that operated the flight. We specified that we want all flights from the “UA” airline and limited the result to 5 rows.
An interesting thing you can do with conditionals in SQL is to filter by values in a column that you are not displaying:
Here we still filter by flights from the “UA” airline but we don’t display the column as that would be very redundant.
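A minimal sketch of both ideas, LIMIT and filtering on a column that is not displayed, is shown below. It uses a hypothetical five-row stand-in for flights in an in-memory database and the standard-library sqlite3 module; the same query string could be passed to pd.read_sql_query. The limit of 2 is chosen only to suit the toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, dest TEXT, carrier TEXT)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("EWR", "IAH", "UA"), ("LGA", "IAH", "UA"), ("JFK", "MIA", "AA"),
     ("EWR", "ORD", "UA"), ("LGA", "ATL", "DL")],
)

# carrier is used in WHERE but is not listed in SELECT;
# LIMIT caps the number of rows returned after sorting
query = """
SELECT origin, dest
FROM flights
WHERE carrier = 'UA'
ORDER BY origin, dest
LIMIT 2;
"""
ua_rows = conn.execute(query).fetchall()
print(ua_rows)
```

Including ORDER BY before LIMIT makes the returned rows predictable; without it, which rows come back first is up to the database.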
6.5.5 Joins
The last query that we did was useful; however, it isn’t realistic to expect users to memorize the two-character codes for all airlines. Thankfully, there is the airlines table in our nycflights13.db database. Let’s take a look at it:
This table is much simpler than the “flights” table as it only has two columns.
6.5.5.1 INNER JOIN
Logically, we wouldn’t want our output table to display the two-character code that represents each airline; instead, we’d want to see the name of the airline. Thankfully, SQL has a way to join the data from the two tables together:
query = """
SELECT DISTINCT a.name AS airline_name, f.origin, f.dest
FROM flights AS f
INNER JOIN airlines AS a
ON f.carrier = a.carrier;
"""
pd.read_sql_query(query, nycconn).head()

             airline_name  origin  dest
0   United Air Lines Inc.     EWR   IAH
1   United Air Lines Inc.     LGA   IAH
2  American Airlines Inc.     JFK   MIA
3         JetBlue Airways     JFK   BQN
4    Delta Air Lines Inc.     LGA   ATL
This is a much more complicated query that returns a table of the distinct airline, origin, and destination in the flights database. We introduced three new statements in this query:
AS is similar to how “as” is used when importing packages in Python. It gives us an opportunity to use a shorthand instead of needing to type out the full table names every time we mention a column. It can also be used in our SELECT line to name columns in our resultant table.
INNER JOIN connects a new table to our query (in this case, the new table is airlines).
ON lets SQL know what column in each table it should use to connect the tables. Here we told SQL that in every row, when it reaches the carrier column in flights, it should use that as its reference for what row in airlines to use for values in this row. For example, in the first row, SQL saw that the carrier code in flights was “UA”. SQL then looked in the carrier column in airlines and found the row in which “UA” was the value in that table’s carrier column. So when SQL was determining the value for airline_name in the first row, it knew which row to use in airlines to find “United Air Lines Inc.”
JOIN statements are key when using SQL to display data. This is what allows SQL databases to be in such nice and concise structures.
Now let’s also add in the values from the “airports” table so that we get the full names of the airports instead of their three-letter codes:
query = """
SELECT DISTINCT a.name AS airline_name,
       orig.name AS origin_name,
       dest.name AS dest_name
FROM flights AS f
INNER JOIN airlines AS a
    ON f.carrier = a.carrier
INNER JOIN airports AS orig
    ON f.origin = orig.faa
INNER JOIN airports AS dest
    ON f.dest = dest.faa;
"""
pd.read_sql_query(query, nycconn).head()

             airline_name          origin_name                        dest_name
0   United Air Lines Inc.  Newark Liberty Intl     George Bush Intercontinental
1   United Air Lines Inc.           La Guardia     George Bush Intercontinental
2  American Airlines Inc.  John F Kennedy Intl                       Miami Intl
3    Delta Air Lines Inc.           La Guardia  Hartsfield Jackson Atlanta Intl
4   United Air Lines Inc.  Newark Liberty Intl               Chicago Ohare Intl
Here we use three different INNER JOIN statements to connect the three tables properly. The reason we use two JOIN statements to connect the same “airports” table is to have SQL be able to look separately for the origin and destination airport names. This query also utilizes indenting to make the query far more readable.
6.5.5.2 The difference between the three JOINs
This section covers three versions of the JOIN statement:
INNER JOIN
LEFT JOIN
RIGHT JOIN
This allows the user to determine what rows get included as a result of joining two tables. Let’s take a look at the differences:
query = """
SELECT DISTINCT f.flight,
       a.name AS airline_name,
       orig.name AS origin_name,
       dest.name AS dest_name
FROM flights AS f
INNER JOIN airlines AS a
    ON f.carrier = a.carrier
INNER JOIN airports AS orig
    ON f.origin = orig.faa
INNER JOIN airports AS dest
    ON f.dest = dest.faa;
"""
pd.read_sql_query(query, nycconn).head()

   flight            airline_name          origin_name                        dest_name
0    1545   United Air Lines Inc.  Newark Liberty Intl     George Bush Intercontinental
1    1714   United Air Lines Inc.           La Guardia     George Bush Intercontinental
2    1141  American Airlines Inc.  John F Kennedy Intl                       Miami Intl
3     461    Delta Air Lines Inc.           La Guardia  Hartsfield Jackson Atlanta Intl
4    1696   United Air Lines Inc.  Newark Liberty Intl               Chicago Ohare Intl
INNER JOIN is the most common. This is because the result will only include rows in which the values in question are included in both tables. For example, if there was a row in flights that included an airline code that was not present in airlines, then SQL will not include that row in the result.
query = """
SELECT DISTINCT f.flight,
       a.name AS airline_name,
       orig.name AS origin_name,
       dest.name AS dest_name
FROM flights AS f
LEFT JOIN airlines AS a
    ON f.carrier = a.carrier
LEFT JOIN airports AS orig
    ON f.origin = orig.faa
LEFT JOIN airports AS dest
    ON f.dest = dest.faa;
"""
pd.read_sql_query(query, nycconn).head()

   flight            airline_name          origin_name                        dest_name
0    1545   United Air Lines Inc.  Newark Liberty Intl     George Bush Intercontinental
1    1714   United Air Lines Inc.           La Guardia     George Bush Intercontinental
2    1141  American Airlines Inc.  John F Kennedy Intl                       Miami Intl
3     725         JetBlue Airways  John F Kennedy Intl                             None
4     461    Delta Air Lines Inc.           La Guardia  Hartsfield Jackson Atlanta Intl

LEFT JOIN returns all rows that are in flights regardless of whether SQL was able to find matching rows in the other tables. The fourth row of the output illustrates this. In flights, the destination for flight 725 is “BQN”, which is not an airport in the airports table. The INNER JOIN result dropped this row, but it is present here because flights is the only reference for which rows to include; the destination name is left as “None”.
query = """
SELECT DISTINCT f.flight, f.origin, dest.name AS dest_name
FROM flights AS f
RIGHT JOIN airports AS dest
    ON f.dest = dest.faa;
"""
pd.read_sql_query(query, nycconn)

       flight  origin                        dest_name
0      1545.0     EWR     George Bush Intercontinental
1      1714.0     LGA     George Bush Intercontinental
2      1141.0     JFK                       Miami Intl
3       461.0     LGA  Hartsfield Jackson Atlanta Intl
4      1696.0     EWR               Chicago Ohare Intl
...       ...     ...                              ...
13275     NaN    None          Boston Back Bay Station
13276     NaN    None                       Black Rock
13277     NaN    None           New Haven Rail Station
13278     NaN    None        Wilmington Amtrak Station
13279     NaN    None         Washington Union Station

13280 rows × 3 columns

RIGHT JOIN is basically the opposite of LEFT JOIN. Instead of using flights as its basis for what rows to include, RIGHT JOIN uses the joined table. As you can see, the row for flight 725 is skipped again because the airports table is used as the reference. Also, the table ends with rows for airports in the airports table that never appear as a destination in flights. SQL includes them because every row of the airports table is kept, and the flight and origin columns are left missing for these rows.
6.5.6 Creating Tables
Let’s take a look at our serviceRequests.db’s requests table:
This table is incredibly complicated and contains redundant information, such as each row having both “Agency” and “Agency Name”. This redundancy clutters the table, making it difficult to read. It would be much better if the database wasn’t just one table and instead was structured like nycflights13.db.
Take note that we have begun using our other connection (srconn instead of nycconn) because we are working with our other database.
Thankfully, SQL makes it very simple to create new tables from the results of a query. Looking back at the redundancy with agencies, we can create the following query to see all the agency codes and their corresponding names:
query = """
SELECT DISTINCT Agency, "Agency Name"
FROM requests
ORDER BY Agency;
"""
pd.read_sql_query(query, srconn).head()

  Agency                                   Agency Name
0   DCWP  Department of Consumer and Worker Protection
1    DEP        Department of Environmental Protection
2    DHS               Department of Homeless Services
3    DOB                       Department of Buildings
4    DOE                       Department of Education
Something to note here is how one can handle a column name that includes a space or a comma. If you put the name of the column in quotes, then SQL will handle everything inside the quotes as the name of the column.
Now that we have created this query, we can make the result into its own table in the database:
query = """
CREATE TABLE IF NOT EXISTS agencies AS
SELECT DISTINCT Agency, "Agency Name"
FROM requests
ORDER BY Agency;
"""
srconn.execute(query)
srconn.commit()
There are a few key parts to this query:
Firstly, we have the first line which tells SQL to create a table named agencies and the AS works here to tell SQL to make the table be the result of the query that follows the first line.
Second, there is the clause “IF NOT EXISTS”. This is useful to include because it makes SQL skip the statement, rather than raise an error, when a table with that name already exists, so any tables that you have previously made are left untouched.
Lastly, we use different commands outside of creating the query variable. Instead of pd.read_sql_query(query, srconn), we have our connection to SQL execute the query we created, then commit the changes to the database.
Now that we have added the table to our database, we can query it:
This process can be repeated to clean up messy tables and messy databases.
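The whole create-then-query cycle can be sketched end to end on toy data. The snippet below builds a hypothetical three-row requests table in an in-memory database (the rows and the "Complaint Type" column are illustrative stand-ins), creates the agencies table from it, and queries the new table. It uses only the standard-library sqlite3 module; the final query string could equally be passed to pd.read_sql_query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE requests (Agency TEXT, "Agency Name" TEXT, "Complaint Type" TEXT)'
)
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [("DOB", "Department of Buildings", "Elevator"),
     ("DEP", "Department of Environmental Protection", "Noise"),
     ("DOB", "Department of Buildings", "Plumbing")],
)

create = """
CREATE TABLE IF NOT EXISTS agencies AS
SELECT DISTINCT Agency, "Agency Name"
FROM requests
ORDER BY Agency;
"""
conn.execute(create)
conn.execute(create)  # Running it again is harmless thanks to IF NOT EXISTS
conn.commit()

# Query the new table like any other table
agencies = conn.execute("""
SELECT Agency, "Agency Name"
FROM agencies
ORDER BY Agency;
""").fetchall()
print(agencies)
```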
6.5.7 Statement Order
Statements in a SQL query must go in a certain order, otherwise the query will return an error. The order is as follows:
SELECT
FROM
JOIN
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT/OFFSET
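The ordering above can be seen in a single query that uses every clause once. GROUP BY collapses rows into one group per airline, and HAVING filters those groups after aggregation, whereas WHERE filters individual rows before grouping. The two tables below are hypothetical stand-ins built in an in-memory database with the standard-library sqlite3 module; the same query string could be passed to pd.read_sql_query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, carrier TEXT, dep_delay INTEGER)")
conn.execute("CREATE TABLE airlines (carrier TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("EWR", "UA", 10), ("EWR", "UA", 30), ("LGA", "UA", 50),
     ("JFK", "AA", 5), ("JFK", "AA", 15),
     ("LGA", "DL", -2), ("JFK", "DL", 20)],
)
conn.executemany(
    "INSERT INTO airlines VALUES (?, ?)",
    [("UA", "United Air Lines Inc."),
     ("AA", "American Airlines Inc."),
     ("DL", "Delta Air Lines Inc.")],
)

# Clauses appear in the required order:
# SELECT, FROM, JOIN, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT
query = """
SELECT a.name AS airline_name, COUNT(*) AS n_flights
FROM flights AS f
INNER JOIN airlines AS a
    ON f.carrier = a.carrier
WHERE f.dep_delay > 0
GROUP BY a.name
HAVING COUNT(*) >= 2
ORDER BY n_flights DESC, airline_name
LIMIT 2;
"""
delayed_by_airline = conn.execute(query).fetchall()
print(delayed_by_airline)
```

Delta has only one delayed flight in the toy data, so HAVING removes its group even though the row passed the WHERE filter.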
6.5.8 Conclusion
SQL makes working with large datasets much easier by organizing databases and simplifying the process of displaying data. Using the queries shown here, as well as much more complicated ones, one can turn complex tables into databases that don’t unnecessarily repeat data and that consist of easily read tables.