5 Data Manipulation

5.1 Introduction

Data manipulation is crucial for transforming raw data into a more analyzable format, essential for uncovering patterns and ensuring accurate analysis. This chapter introduces the core techniques for data manipulation in Python, utilizing the Pandas library, a cornerstone for data handling within Python’s data science toolkit.

Python’s ecosystem is rich with libraries that facilitate not just data manipulation but comprehensive data analysis. Pandas, in particular, provides extensive functionality for data manipulation tasks including reading, cleaning, transforming, and summarizing data. Using real-world datasets, we will explore how to leverage Python for practical data manipulation tasks.

By the end of this chapter, you will be able to:

Import/export data from/to diverse sources.
Clean and preprocess data efficiently.
Transform and aggregate data to derive insights.
Merge and concatenate datasets from various origins.
Analyze real-world datasets using these techniques.

5.2 Example: NYC Crash Data

Consider a subset of the NYC crash data, which records all motor vehicle collisions in NYC and is documented at NYC Open Data. We downloaded the crash data for the week of August 31, 2025, on September 11, 2025, in CSV format.

import numpy as np
import pandas as pd

# Load the dataset
file_path = 'data/nyc_crashes_lbdwk_2025.csv'
df = pd.read_csv(file_path,
                 dtype={'LATITUDE': np.float32,
                        'LONGITUDE': np.float32,
                        'ZIP CODE': str})

# Replace column names: convert to lowercase and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Check for missing values
df.isnull().sum()

crash_date                          0
crash_time                          0
borough                           284
zip_code                          284
latitude                           12
longitude                          12
location                           12
on_street_name                    456
cross_street_name                 587
off_street_name                  1031
number_of_persons_injured           0
number_of_persons_killed            0
number_of_pedestrians_injured       0
number_of_pedestrians_killed        0
number_of_cyclist_injured           0
number_of_cyclist_killed            0
number_of_motorist_injured          0
number_of_motorist_killed           0
contributing_factor_vehicle_1       9
contributing_factor_vehicle_2     355
contributing_factor_vehicle_3    1358
contributing_factor_vehicle_4    1447
contributing_factor_vehicle_5    1474
collision_id                        0
vehicle_type_code_1                17
vehicle_type_code_2               475
vehicle_type_code_3              1363
vehicle_type_code_4              1452
vehicle_type_code_5              1474
dtype: int64

Take a peek at the first five rows:

df.head()

	crash_date	crash_time	borough	zip_code	latitude	longitude	location	on_street_name	cross_street_name	off_street_name	...	contributing_factor_vehicle_2	contributing_factor_vehicle_3	contributing_factor_vehicle_4	contributing_factor_vehicle_5	collision_id	vehicle_type_code_1	vehicle_type_code_2	vehicle_type_code_3	vehicle_type_code_4	vehicle_type_code_5
0	08/31/2025	12:49	QUEENS	11101	40.753113	-73.933701	(40.753113, -73.9337)	30 ST	39 AVE	NaN	...	NaN	NaN	NaN	NaN	4838875	Station Wagon/Sport Utility Vehicle	NaN	NaN	NaN	NaN
1	08/31/2025	15:30	MANHATTAN	10022	40.760601	-73.964317	(40.7606, -73.96432)	E 59 ST	2 AVE	NaN	...	NaN	NaN	NaN	NaN	4839110	Station Wagon/Sport Utility Vehicle	NaN	NaN	NaN	NaN
2	08/31/2025	19:00	NaN	NaN	40.734234	-73.722748	(40.734234, -73.72275)	CROSS ISLAND PARKWAY	HILLSIDE AVENUE	NaN	...	Unspecified	Unspecified	NaN	NaN	4838966	Sedan	Sedan	NaN	NaN	NaN
3	08/31/2025	1:19	BROOKLYN	11220	40.648075	-74.007034	(40.648075, -74.007034)	NaN	NaN	4415 5 AVE	...	Unspecified	NaN	NaN	NaN	4838563	Sedan	E-Bike	NaN	NaN	NaN
4	08/31/2025	2:41	MANHATTAN	10036	40.756561	-73.986107	(40.75656, -73.98611)	W 43 ST	BROADWAY	NaN	...	Unspecified	NaN	NaN	NaN	4838922	Station Wagon/Sport Utility Vehicle	Bike	NaN	NaN	NaN

5 rows × 29 columns

A quick summary of the data types of the columns:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1487 entries, 0 to 1486
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   crash_date                     1487 non-null   object 
 1   crash_time                     1487 non-null   object 
 2   borough                        1203 non-null   object 
 3   zip_code                       1203 non-null   object 
 4   latitude                       1475 non-null   float32
 5   longitude                      1475 non-null   float32
 6   location                       1475 non-null   object 
 7   on_street_name                 1031 non-null   object 
 8   cross_street_name              900 non-null    object 
 9   off_street_name                456 non-null    object 
 10  number_of_persons_injured      1487 non-null   int64  
 11  number_of_persons_killed       1487 non-null   int64  
 12  number_of_pedestrians_injured  1487 non-null   int64  
 13  number_of_pedestrians_killed   1487 non-null   int64  
 14  number_of_cyclist_injured      1487 non-null   int64  
 15  number_of_cyclist_killed       1487 non-null   int64  
 16  number_of_motorist_injured     1487 non-null   int64  
 17  number_of_motorist_killed      1487 non-null   int64  
 18  contributing_factor_vehicle_1  1478 non-null   object 
 19  contributing_factor_vehicle_2  1132 non-null   object 
 20  contributing_factor_vehicle_3  129 non-null    object 
 21  contributing_factor_vehicle_4  40 non-null     object 
 22  contributing_factor_vehicle_5  13 non-null     object 
 23  collision_id                   1487 non-null   int64  
 24  vehicle_type_code_1            1470 non-null   object 
 25  vehicle_type_code_2            1012 non-null   object 
 26  vehicle_type_code_3            124 non-null    object 
 27  vehicle_type_code_4            35 non-null     object 
 28  vehicle_type_code_5            13 non-null     object 
dtypes: float32(2), int64(9), object(18)
memory usage: 325.4+ KB

After this quick look at the data, we can do some cleaning.

# Replace invalid coordinates (latitude or longitude equal to 0) with NaN;
# np.nan keeps the columns in their float32 dtype
df['latitude'] = df['latitude'].replace(0, np.nan)
df['longitude'] = df['longitude'].replace(0, np.nan)

# Drop the redundant `location` column, which duplicates latitude and longitude
df = df.drop(columns=['location'])

# Converting 'crash_date' and 'crash_time' columns into a single datetime column
df['crash_datetime'] = pd.to_datetime(df['crash_date'] + ' ' 
                       + df['crash_time'], format='%m/%d/%Y %H:%M', errors='coerce')

# Drop the original 'crash_date' and 'crash_time' columns
df = df.drop(columns=['crash_date', 'crash_time'])
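
To see what `errors='coerce'` does in the conversion above, here is a minimal sketch on a made-up three-row data frame (not the crash data). The last time value is deliberately malformed; it becomes `NaT` instead of raising an error, so counting `NaT` values afterwards is a quick check of how many rows failed to parse.

```python
import pandas as pd

# Toy data, invented for illustration; the last time is malformed on purpose
toy = pd.DataFrame({
    'crash_date': ['08/31/2025', '09/01/2025', '09/02/2025'],
    'crash_time': ['1:19', '15:30', '25:61'],
})

toy['crash_datetime'] = pd.to_datetime(
    toy['crash_date'] + ' ' + toy['crash_time'],
    format='%m/%d/%Y %H:%M',
    errors='coerce',   # unparseable strings become NaT instead of raising
)

# Count the rows that could not be parsed
n_bad = toy['crash_datetime'].isna().sum()
print(toy['crash_datetime'])
print(f"Rows that failed to parse: {n_bad}")
```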

Let’s get some basic frequency tables of borough and zip_code, whose values can then be checked against the legitimate values.

# Frequency table for 'borough' without filling missing values
borough_freq = df['borough'].value_counts(dropna=False).reset_index()
borough_freq.columns = ['borough', 'count']

# Frequency table for 'zip_code' without filling missing values
zip_code_freq = df['zip_code'].value_counts(dropna=False).reset_index()
zip_code_freq.columns = ['zip_code', 'count']
zip_code_freq

	zip_code	count
0	NaN	284
1	11207	33
2	11203	29
3	11212	23
4	11233	21
...	...	...
159	11379	1
160	10007	1
161	10308	1
162	11362	1
163	11694	1

164 rows × 2 columns

A comprehensive list of ZIP codes by borough can be obtained, for example, from the New York City Department of Health’s UHF Codes. We can use this list to check the validity of the zip codes in the data.

# List of valid NYC ZIP codes compiled from UHF codes
# Define all_valid_zips based on the earlier extracted ZIP codes
all_valid_zips = {
    10463, 10471, 10466, 10469, 10470, 10475, 10458, 10467, 10468,
    10461, 10462, 10464, 10465, 10472, 10473, 10453, 10457, 10460,
    10451, 10452, 10456, 10454, 10455, 10459, 10474, 11211, 11222,
    11201, 11205, 11215, 11217, 11231, 11213, 11212, 11216, 11233,
    11238, 11207, 11208, 11220, 11232, 11204, 11218, 11219, 11230,
    11203, 11210, 11225, 11226, 11234, 11236, 11239, 11209, 11214,
    11228, 11223, 11224, 11229, 11235, 11206, 11221, 11237, 10031,
    10032, 10033, 10034, 10040, 10026, 10027, 10030, 10037, 10039,
    10029, 10035, 10023, 10024, 10025, 10021, 10028, 10044, 10128,
    10001, 10011, 10018, 10019, 10020, 10036, 10010, 10016, 10017,
    10022, 10012, 10013, 10014, 10002, 10003, 10009, 10004, 10005,
    10006, 10007, 10038, 10280, 11101, 11102, 11103, 11104, 11105,
    11106, 11368, 11369, 11370, 11372, 11373, 11377, 11378, 11354,
    11355, 11356, 11357, 11358, 11359, 11360, 11361, 11362, 11363,
    11364, 11374, 11375, 11379, 11385, 11365, 11366, 11367, 11414,
    11415, 11416, 11417, 11418, 11419, 11420, 11421, 11412, 11423,
    11432, 11433, 11434, 11435, 11436, 11004, 11005, 11411, 11413,
    11422, 11426, 11427, 11428, 11429, 11691, 11692, 11693, 11694,
    11695, 11697, 10302, 10303, 10310, 10301, 10304, 10305, 10314,
    10306, 10307, 10308, 10309, 10312
}

    
# Convert set to list of strings
all_valid_zips = list(map(str, all_valid_zips))

# Identify invalid ZIP codes (including NaN)
invalid_zips = df[
    df['zip_code'].isna() | ~df['zip_code'].isin(all_valid_zips)
    ]['zip_code']

# Calculate frequency of invalid ZIP codes
invalid_zip_freq = invalid_zips.value_counts(dropna=False).reset_index()
invalid_zip_freq.columns = ['zip_code', 'frequency']

invalid_zip_freq

	zip_code	frequency
0	NaN	284
1	10000	4
2	10065	3
3	10075	2
4	11430	1

As it turns out, the set of valid NYC zip codes differs across sources. According to United States Zip Codes, 10065 appears to be a valid NYC zip code. Under this circumstance, it might be safer not to remove any zip code from the data.

To be safe, let’s take the union of the valid and invalid zip codes.

# Convert invalid ZIP codes to a set of strings
invalid_zips_set = set(invalid_zip_freq['zip_code'].dropna().astype(str))

# Convert all_valid_zips to a set of strings (if not already)
valid_zips_set = set(map(str, all_valid_zips))

# Merge both sets
merged_zips = invalid_zips_set | valid_zips_set  # Union of both sets

Do missing values in zip_code and borough always co-occur?

# Check if missing values in 'zip_code' and 'borough' always co-occur
# Count rows where both are missing
missing_cooccur = df[['zip_code', 'borough']].isnull().all(axis=1).sum()
# Count total missing in 'zip_code' and 'borough', respectively
total_missing_zip_code = df['zip_code'].isnull().sum()
total_missing_borough = df['borough'].isnull().sum()

# If missing in both columns always co-occur, the number of missing
# co-occurrences should be equal to the total missing in either column
np.array([missing_cooccur, total_missing_zip_code, total_missing_borough])

array([284, 284, 284])
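
An alternative way to look at the same question is to cross-tabulate the two missingness indicators. The sketch below uses a small made-up data frame; the name `toy` and its values are for illustration only. If the off-diagonal cells of the table are empty, missingness in the two columns always co-occurs.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration
toy = pd.DataFrame({
    'zip_code': ['11101', np.nan, '10022', np.nan],
    'borough':  ['QUEENS', np.nan, 'MANHATTAN', np.nan],
})

# Cross-tabulate the missingness indicators of the two columns
miss_table = pd.crosstab(toy['zip_code'].isna(), toy['borough'].isna(),
                         rownames=['zip_code missing'],
                         colnames=['borough missing'])
print(miss_table)

# Missingness co-occurs exactly when the two indicators agree on every row
always_cooccur = (toy['zip_code'].isna() == toy['borough'].isna()).all()
print(always_cooccur)
```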

Are there cases where zip_code and borough are missing but the geo codes are not missing? If so, fill in zip_code and borough using the geo codes by reverse geocoding.

First make sure geopy is installed.

pip install geopy

Now we use the Nominatim geocoder class from the geopy package to reverse geocode.

from geopy.geocoders import Nominatim
import time

# Initialize the geocoder; the `user_agent` is your identifier
# when using the service. Be mindful not to overload the server
# with too many queries, especially queries with invalid coordinates.
geolocator = Nominatim(user_agent="jyGeopyTry")

We write a function to do the reverse geocoding given latitude and longitude.

# Function to fill missing zip_code
def get_zip_code(latitude, longitude):
    try:
        location = geolocator.reverse((latitude, longitude), timeout=10)
        if location:
            address = location.raw['address']
            zip_code = address.get('postcode', None)
            return zip_code
        else:
            return None
    except Exception as e:
        print(f"Error: {e} for coordinates {latitude}, {longitude}")
        return None
    finally:
        time.sleep(1)  # Delay to avoid overwhelming the service

Let’s try it out:

# Example usage
latitude = 40.730610
longitude = -73.935242
get_zip_code(latitude, longitude)

'11101'

The function get_zip_code can then be applied to rows where zip code is missing but geocodes are not to fill the missing zip code.

Once the zip code is known, figuring out the borough is simple because the valid zip codes of each borough are known.
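
These two steps could be sketched as follows. This is only an illustration under stated assumptions: the helper names `fill_missing_zip` and `fill_missing_borough` are hypothetical, the ZIP-to-borough table is deliberately tiny, and a stand-in lookup replaces the real geocoder so that the example runs offline. In practice, `get_zip_code` from above would be passed as `lookup`, at the cost of about one second per queried row.

```python
import numpy as np
import pandas as pd

def fill_missing_zip(df, lookup):
    """Fill missing zip_code where coordinates are available.

    `lookup` is any function (latitude, longitude) -> zip code or None,
    for example the `get_zip_code` function defined earlier.
    """
    out = df.copy()
    needs_fill = (out['zip_code'].isna()
                  & out['latitude'].notna()
                  & out['longitude'].notna())
    out.loc[needs_fill, 'zip_code'] = [
        lookup(lat, lon)
        for lat, lon in zip(out.loc[needs_fill, 'latitude'],
                            out.loc[needs_fill, 'longitude'])
    ]
    return out

def fill_missing_borough(df, zip_to_borough):
    """Fill missing borough from a ZIP-code-to-borough dictionary."""
    out = df.copy()
    out['borough'] = out['borough'].fillna(out['zip_code'].map(zip_to_borough))
    return out

# Hypothetical, deliberately tiny table; a real one would list
# every valid ZIP code of each borough
zip_to_borough = {'11101': 'QUEENS', '10022': 'MANHATTAN'}

# Toy data and a stand-in lookup so that the sketch runs offline
toy = pd.DataFrame({
    'zip_code': ['10022', np.nan, np.nan],
    'borough': ['MANHATTAN', np.nan, np.nan],
    'latitude': [40.7606, 40.7531, np.nan],
    'longitude': [-73.9643, -73.9337, np.nan],
})

def fake_lookup(lat, lon):
    # Stand-in for a real reverse geocoder; always returns the same ZIP code
    return '11101'

filled = fill_missing_borough(fill_missing_zip(toy, fake_lookup), zip_to_borough)
print(filled)
```

Rows with missing coordinates are left untouched, since there is nothing to reverse geocode.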

5.3 Accessing Census Data

The U.S. Census Bureau provides extensive demographic, economic, and social data through multiple surveys, including the decennial Census, the American Community Survey (ACS), and the Economic Census. These datasets offer valuable insights into population trends, economic conditions, and community characteristics at multiple geographic levels.

There are multiple ways to access Census data. For example:

Census API: The Census API allows programmatic access to various datasets. It supports queries for different geographic levels and time periods.
data.census.gov: The official web interface for searching and downloading Census data.
IPUMS USA: Provides harmonized microdata for longitudinal research. Available at IPUMS USA.
NHGIS: Offers historical Census data with geographic information. Visit NHGIS.
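
To give a sense of what a Census API request looks like, the sketch below only builds a query URL for the ACS 5-year estimates without sending it. The helper name `build_acs5_url` is hypothetical, the year 2023 and the two ZIP codes are arbitrary choices for illustration, and the endpoint layout reflects our understanding of the public API; consult the Census API documentation before relying on it.

```python
from urllib.parse import urlencode

def build_acs5_url(year, variables, zctas, api_key=None):
    """Build a Census API query URL for ACS 5-year estimates by ZCTA."""
    base = f"https://api.census.gov/data/{year}/acs/acs5"
    params = {
        "get": ",".join(["NAME"] + list(variables)),
        "for": "zip code tabulation area:" + ",".join(zctas),
    }
    if api_key:
        params["key"] = api_key  # the key is appended as a query parameter
    return base + "?" + urlencode(params)

# Total population (B01003_001E) for two ZCTAs; nothing is sent over the network
url = build_acs5_url(2023, ["B01003_001E"], ["10001", "10002"])
print(url)
```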

In addition, Python tools simplify API access and data retrieval.

5.3.1 Python Tools for Accessing Census Data

Several Python libraries facilitate Census data retrieval:

census: A high-level interface to the Census API, supporting ACS and decennial Census queries. See census on PyPI.
censusdis: Provides richer functionality: automatic discovery of variables, geographies, and datasets. Helpful if you don’t want to manually look up variable codes. See censusdis on PyPI.
us: Often used alongside census libraries to handle U.S. state and territory information (e.g., FIPS codes). See us on PyPI.

5.3.2 Zip-Code Level for NYC Crash Data

Now that we have NYC crash data, we might want to analyze patterns at the zip-code level to understand whether certain demographic or economic factors correlate with traffic incidents. While the crash dataset provides details about individual accidents, such as location, time, and severity, it does not contain contextual information about the neighborhoods where these crashes occur.

To perform meaningful zip-code-level analysis, we need additional data sources that provide relevant demographic, economic, and geographic variables. For example, understanding whether high-income areas experience fewer accidents, or whether population density influences crash frequency, requires integrating Census data. Key variables such as population size, median household income, employment rate, and population density can provide valuable context for interpreting crash trends across different zip codes.

Since the Census Bureau provides detailed estimates for these variables at the zip-code level, we can use the Census API or other tools to retrieve relevant data and merge it with the NYC crash dataset. To access the Census API, you need an API key, which is free and easy to obtain. Visit the Census API Request page and submit your email address to receive a key. Once you have the key, you must include it in your API requests to access Census data. The following demonstration assumes that you have registered, obtained your API key, and saved it in a file called censusAPIkey.txt.

# Import modules
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
from census import Census
from us import states
import os
import io

api_key = open("censusAPIkey.txt").read().strip()
c = Census(api_key)

Suppose that we want to get some basic info from ACS data of the year of 2024 for all the NYC zip codes. The variable names can be found in the ACS variable documentation.

ACS_YEAR = 2024
ACS_DATASET = "acs/acs5"

# Selected ACS variables (land area is not among them; it is
# computed later from the shapefile for the density calculation)
ACS_VARIABLES = {
    "B01003_001E": "Total Population",
    "B19013_001E": "Median Household Income",
    "B02001_002E": "White Population",
    "B02001_003E": "Black Population",
    "B02001_005E": "Asian Population",
    "B15003_022E": "Bachelor’s Degree Holders",
    "B15003_025E": "Doctorate Degree Holders",
    "B23025_002E": "Labor Force",
    "B23025_005E": "Unemployed",
    "B25077_001E": "Median Home Value"
}

# Convert set to list of strings
merged_zips = list(map(str, merged_zips))

Let’s set up the query to request the ACS data, and process the returned data.

acs_data = c.acs5.get(
    list(ACS_VARIABLES.keys()), 
    {'for': f'zip code tabulation area:{",".join(merged_zips)}'}
    )

# Convert to DataFrame
df_acs = pd.DataFrame(acs_data)

# Rename columns
df_acs.rename(columns=ACS_VARIABLES, inplace=True)
df_acs.rename(columns={"zip code tabulation area": "ZIP Code"}, inplace=True)
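
Before using the returned values, it may be worth checking their types and missing-value codes. To our knowledge, the ACS reports large negative numbers such as -666666666 when an estimate cannot be computed, for example the median household income of an area with too few households. The sketch below shows one way to handle this on a made-up stand-in for `df_acs`; the name `toy_acs` and the income figures are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_acs; the values are invented for illustration
toy_acs = pd.DataFrame({
    'ZIP Code': ['10001', '10004', '11430'],
    'Total Population': [29079.0, 3875.0, 0.0],
    'Median Household Income': [100000.0, 200000.0, -666666666.0],
})

# Make sure the value columns are numeric
value_cols = ['Total Population', 'Median Household Income']
toy_acs[value_cols] = toy_acs[value_cols].apply(pd.to_numeric, errors='coerce')

# Negative values cannot be valid counts or incomes; treat them as missing
toy_acs[value_cols] = toy_acs[value_cols].mask(toy_acs[value_cols] < 0)
print(toy_acs)
```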

We could save the ACS data df_acs in feather format (see next Section).

df_acs.to_feather("data/acs2023.feather")

Population density could be an important factor for crash likelihood. To obtain the population densities, we need the areas of the zip codes. The shapefiles can be obtained from NYC Open Data.

import requests
import zipfile
import geopandas as gpd

# Define the NYC MODZCTA shapefile URL and extraction directory
shapefile_url = "https://data.cityofnewyork.us/api/geospatial/pri4-ifjk?method=export&format=Shapefile"
extract_dir = "MODZCTA_Shapefile"

# Create the directory if it doesn't exist
os.makedirs(extract_dir, exist_ok=True)

# Step 1: Download and extract the shapefile
print("Downloading MODZCTA shapefile...")
response = requests.get(shapefile_url)
response.raise_for_status()  # Stop early if the download failed
with zipfile.ZipFile(io.BytesIO(response.content), "r") as z:
    z.extractall(extract_dir)

print(f"Shapefile extracted to: {extract_dir}")

Downloading MODZCTA shapefile...
Shapefile extracted to: MODZCTA_Shapefile

Now we process the shapefile to calculate the areas of the polygons.

# Step 2: Automatically detect the correct .shp file
shapefile_path = None
for file in os.listdir(extract_dir):
    if file.endswith(".shp"):
        shapefile_path = os.path.join(extract_dir, file)
        break  # Use the first .shp file found

if not shapefile_path:
    raise FileNotFoundError("No .shp file found in extracted directory.")

print(f"Using shapefile: {shapefile_path}")

# Step 3: Load the shapefile into GeoPandas
gdf = gpd.read_file(shapefile_path)

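# Step 4: Convert to a CRS measured in meters. Caution: EPSG:3857
# (Web Mercator) inflates areas away from the equator, so the areas
# below are overestimates; an equal-area or local projected CRS
# (e.g., UTM zone 18N, EPSG:32618) would be more accurate for NYC.
gdf = gdf.to_crs(epsg=3857)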

# Step 5: Compute land area in square miles
gdf['land_area_sq_miles'] = gdf['geometry'].area / 2_589_988.11
# 1 square mile = 2,589,988.11 square meters

print(gdf[['modzcta', 'land_area_sq_miles']].head())

Using shapefile: MODZCTA_Shapefile/geo_export_81477944-a458-47f2-951a-38422a2e648a.shp
  modzcta  land_area_sq_miles
0   10001            1.153516
1   10002            1.534509
2   10003            1.008318
3   10026            0.581848
4   10004            0.256876
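
A caveat on these areas: the Web Mercator projection (EPSG:3857) stretches distances by a factor of about 1/cos(latitude) in each direction, so small areas are inflated by about 1/cos²(latitude). The sketch below computes this approximate factor near NYC; the function name `web_mercator_area_factor` is ours, and 40.7 degrees is used as a rough latitude for the city.

```python
import math

def web_mercator_area_factor(latitude_deg):
    """Approximate factor by which EPSG:3857 inflates small areas."""
    return 1.0 / math.cos(math.radians(latitude_deg)) ** 2

factor = web_mercator_area_factor(40.7)   # roughly NYC's latitude
print(f"Area inflation factor near NYC: {factor:.2f}")

# An area computed in EPSG:3857 can be roughly corrected by dividing
reported_sq_miles = 1.153516              # value shown above for 10001
corrected_sq_miles = reported_sq_miles / factor
print(f"Approximate corrected area: {corrected_sq_miles:.2f} sq miles")
```

Dividing by this factor is only a rough correction. Reprojecting to an equal-area or local projected CRS before calling `.area` is the cleaner approach.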

Let’s export this data frame for future usage in feather format (see next Section).

gdf[['modzcta', 'land_area_sq_miles']].to_feather('data/nyc_zip_areas.feather')

Now we are ready to merge the two data frames.

# Merge ACS data (`df_acs`) directly with MODZCTA land area (`gdf`)
gdf = gdf.merge(df_acs, left_on='modzcta', right_on='ZIP Code', how='left')

# Calculate Population Density (people per square mile)
gdf['popdensity_per_sq_mile'] = (
    gdf['Total Population'] / gdf['land_area_sq_miles']
    )

# Display first few rows
print(gdf[['modzcta', 'Total Population', 'land_area_sq_miles',
    'popdensity_per_sq_mile']].head())

  modzcta  Total Population  land_area_sq_miles  popdensity_per_sq_mile
0   10001           29079.0            1.153516            25209.019713
1   10002           75517.0            1.534509            49212.471465
2   10003           53825.0            1.008318            53380.992071
3   10026           37113.0            0.581848            63784.749994
4   10004            3875.0            0.256876            15085.082190

We can visualize the population density with a choropleth map.

import matplotlib.pyplot as plt
import geopandas as gpd

# Set up figure and axis
fig, ax = plt.subplots(figsize=(10, 12))

# Plot the choropleth map
gdf.plot(column='popdensity_per_sq_mile', 
         cmap='viridis',  # Use a visually appealing color map
         linewidth=0.8, 
         edgecolor='black',
         legend=True,
         legend_kwds={'label': "Population Density (per sq mile)",
             'orientation': "horizontal"},
         ax=ax)

# Add a title
ax.set_title("Population Density by ZCTA in NYC", fontsize=14)

# Remove axes
ax.set_xticks([])
ax.set_yticks([])
ax.set_frame_on(False)

# Show the plot
plt.show()
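
With zip-code-level variables in hand, the crash data can be aggregated to the same level and merged in. The following is a minimal sketch on made-up stand-ins for the crash data and the merged ZCTA table; the names `crashes`, `zcta`, and `crashes_per_10k` are assumptions for illustration, and the populations are the ones printed above.

```python
import pandas as pd

# Toy stand-ins, invented for illustration
crashes = pd.DataFrame({'zip_code': ['10001', '10001', '10002', None]})
zcta = pd.DataFrame({
    'modzcta': ['10001', '10002', '10003'],
    'Total Population': [29079.0, 75517.0, 53825.0],
})

# Count crashes per ZIP code (rows with a missing ZIP code are dropped)
crash_counts = (crashes.groupby('zip_code').size()
                .rename('n_crashes').reset_index())

# Left-join onto the ZCTA table so that areas with no crashes are kept
merged = zcta.merge(crash_counts, left_on='modzcta', right_on='zip_code',
                    how='left').drop(columns='zip_code')
merged['n_crashes'] = merged['n_crashes'].fillna(0).astype(int)

# Crashes per 10,000 residents
merged['crashes_per_10k'] = (
    merged['n_crashes'] / merged['Total Population'] * 10_000
)
print(merged)
```

A left join from the ZCTA table is used so that zip codes without any crash in the week appear with a count of zero rather than disappearing.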

5.4 Cross-platform Data Format `Arrow`

The CSV format (and related formats like TSV, tab-separated values) for data tables is ubiquitous, convenient, and can be read or written by many different data analysis environments, including spreadsheets. An advantage of the textual representation of the data in a CSV file is that the entire data table, or portions of it, can be previewed in a text editor. However, the textual representation can be ambiguous and inconsistent. The type of a particular column (Boolean, integer, floating-point, text, factor, etc.) must be inferred from the text representation, often at the expense of reading the entire file before these inferences can be made. Experienced data scientists are aware that a substantial part of an analysis or report generation is often the “data cleaning” involved in preparing the data for analysis. This can be an open-ended task: compiling the list of different missing-data representations in a CSV file may take numerous trial-and-error iterations, and even then one cannot be sure that the list is complete.
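
The ambiguity is easy to reproduce. In the small sketch below, the CSV text is made up for illustration: with default type inference, a ZIP code with leading zeros is read as an integer, and an integer column with a blank cell is read as floating-point.

```python
import io
import pandas as pd

# Made-up CSV text: a ZIP code with leading zeros and a blank count
csv_text = "zip_code,flag,count\n00501,TRUE,3\n10001,FALSE,\n"

# Default inference: the ZIP code becomes an integer and loses its leading zeros
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)
print(inferred['zip_code'].tolist())

# Declaring the type up front keeps the ZIP code as text
declared = pd.read_csv(io.StringIO(csv_text), dtype={'zip_code': str})
print(declared['zip_code'].tolist())
```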

The Apache Arrow library can significantly improve the performance and storage efficiency of reading and exporting data, especially with large datasets. The IPC (Inter-Process Communication) file format in the context of Apache Arrow is a key component for efficiently sharing data between different processes, potentially written in different programming languages. Arrow’s IPC mechanism is designed around two main file formats:

Stream Format: For sending an arbitrary length sequence of Arrow record batches (tables). The stream format is useful for real-time data exchange where the size of the data is not known upfront and can grow indefinitely.
File (or Feather) Format: Optimized for storage and memory-mapped access, allowing for fast random access to different sections of the data. This format is ideal for scenarios where the entire dataset is available upfront and can be stored in a file system for repeated reads and writes.

Apache Arrow provides a columnar memory format for flat and hierarchical data, optimized for efficient data analytics. It can be used in Python through the pyarrow package. Here’s how you can use Arrow to read, manipulate, and export data, including a demonstration of storage savings.

First, ensure you have pyarrow installed on your computer (and preferably, in your current virtual environment):

pip install pyarrow

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames, optimized for speed and efficiency, particularly for IPC and data sharing between Python and R or Julia.

The following code processes the cleaned data in CSV format from Mohammad Mundiwala and writes it out in Arrow format.

#| eval: false

import pandas as pd

# Read CSV, ensuring 'zip_code' is string and 'crash_datetime' is parsed as datetime
df = pd.read_csv('data/nyc_crashes_cleaned_mm.csv',
                 dtype={'zip_code': str},
                 parse_dates=['crash_datetime'])

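# Drop the original 'crash_date' and 'crash_time' columns
df = df.drop(columns=['crash_date', 'crash_time'])

# Move 'crash_datetime' to the first column
df = df[['crash_datetime'] + df.drop(columns=['crash_datetime']).columns.tolist()]

# Remove a trailing '.0' from ZIP codes that were stored as floats.
# An anchored regex is used because str.rstrip('.0') strips any trailing
# '.' or '0' characters (e.g., '10010.0' would become '1001').
df['zip_code'] = df['zip_code'].str.replace(r'\.0$', '', regex=True)

# Sort by time and reset the index before writing
df = df.sort_values(by='crash_datetime').reset_index(drop=True)

# Write in Feather format to the data folder
df.to_feather('data/nyccrashes_cleaned.feather')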

Let’s compare the file sizes of the feather format and the CSV format.

import os

# File paths
csv_file = 'data/nyccrashes_2024w0630_by20250212.csv'
feather_file = 'data/nyccrashes_cleaned.feather'

# Get file sizes in bytes
csv_size = os.path.getsize(csv_file)
feather_size = os.path.getsize(feather_file)

# Convert bytes to a more readable format (e.g., MB)
csv_size_mb = csv_size / (1024 * 1024)
feather_size_mb = feather_size / (1024 * 1024)

# Print the file sizes
print(f"CSV file size: {csv_size_mb:.2f} MB")
print(f"Feather file size: {feather_size_mb:.2f} MB")

Read the feather file back in:

#| eval: false
dff = pd.read_feather("data/nyccrashes_cleaned.feather")
dff.shape