6  Data Manipulation

6.1 Introduction

Data manipulation is crucial for transforming raw data into a more analyzable format, essential for uncovering patterns and ensuring accurate analysis. This chapter introduces the core techniques for data manipulation in Python, utilizing the Pandas library, a cornerstone for data handling within Python’s data science toolkit.

Python’s ecosystem is rich with libraries that facilitate not just data manipulation but comprehensive data analysis. Pandas, in particular, provides extensive functionality for data manipulation tasks including reading, cleaning, transforming, and summarizing data. Using real-world datasets, we will explore how to leverage Python for practical data manipulation tasks.

By the end of this chapter, you will learn to:

  • Import/export data from/to diverse sources.
  • Clean and preprocess data efficiently.
  • Transform and aggregate data to derive insights.
  • Merge and concatenate datasets from various origins.
  • Analyze real-world datasets using these techniques.

6.2 Import/Export Data

This section was written by Deyu Xu, an MS student in Statistics at the time.

6.2.1 Summary

I divide the content into five parts: exporting data to a .csv file; importing data in common formats; importing data from other software; viewing basic information about the imported data; and finding null values.

6.2.2 Package Pandas

6.2.2.1 Import data based on Package Pandas

We use the pandas package to import data, so the first step is to install it. The following command installs pandas if it is not already present.

## install Pandas
pip install pandas

The command for upgrading an existing installation to the latest version is listed below.

## upgrade to the latest version of Pandas
pip install --upgrade pandas

Different versions can require different code to achieve the same goal; we will see a specific example in the part on importing .xlsx files. You can check the installed version as shown below.
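
A minimal check of the installed version from within Python:

## Print the installed pandas version
import pandas as pd
print(pd.__version__)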

6.2.3 Export the data to a .csv file

6.2.3.1 Import the cleaned crashes data

First, we import the data file named "nyccrashes_cleaned.feather". A .feather file is a binary file format for storing and sharing data, especially suitable for large-scale data analysis and data science workflows. It uses Apache Arrow's columnar storage format, which stores data in binary form; the main advantage of this format is fast reading and writing. We use the function read_feather from pandas to import the crashes data.

## Import the pandas package
import pandas as pd
import os
## Import the cleaned crashes data
## Choose the function read_feather from pandas
## Use a relative path so collaborators can run the code smoothly
df_feather = pd.read_feather("data/nyccrashes_cleaned.feather")
## Show the top 5 rows of the data
## to check that the import succeeded
print(df_feather.head(5)) # Use the function "head()"
        crash_datetime    borough  zip_code  latitude  longitude  \
0  2024-06-30 17:30:00       None       NaN       NaN        NaN   
1  2024-06-30 00:32:00       None       NaN       NaN        NaN   
2  2024-06-30 07:05:00   BROOKLYN   11235.0  40.58106  -73.96744   
3  2024-06-30 20:47:00  MANHATTAN   10021.0  40.76363  -73.95330   
4  2024-06-30 10:14:00   BROOKLYN   11222.0  40.73046  -73.95149   

                location     on_street_name      cross_street_name  \
0             (0.0, 0.0)               None                   None   
1                   None  BELT PARKWAY RAMP                   None   
2  (40.58106, -73.96744)               None                   None   
3   (40.76363, -73.9533)          FDR DRIVE                   None   
4  (40.73046, -73.95149)  GREENPOINT AVENUE  MC GUINNESS BOULEVARD   

           off_street_name  number_of_persons_injured  ...  \
0              GOLD STREET                          0  ...   
1                     None                          0  ...   
2  2797      OCEAN PARKWAY                          0  ...   
3                     None                          0  ...   
4                     None                          0  ...   

   contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                    Unspecified                           None   
1                    Unspecified                           None   
2                           None                           None   
3                           None                           None   
4                    Unspecified                           None   

   contributing_factor_vehicle_4  contributing_factor_vehicle_5  collision_id  \
0                           None                           None       4736746   
1                           None                           None       4736768   
2                           None                           None       4737060   
3                           None                           None       4737510   
4                           None                           None       4736759   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   
2  Station Wagon/Sport Utility Vehicle                                 None   
3                                Sedan                                 None   
4                                  Bus                            Box Truck   

  vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5  
0                None                None                None  
1                None                None                None  
2                None                None                None  
3                None                None                None  
4                None                None                None  

[5 rows x 28 columns]
  • We have imported the cleaned crashes data successfully.
  • We use the function head(5) to show the top 5 rows of the data.

6.2.3.2 Export the crashes data to a .csv file

It is easy to export the data; the function that achieves this is to_csv from pandas.

## Choose the function "to_csv" from pandas
## "df_feather" is the DataFrame storing the data
## Export the data to the "foreign" folder under the working directory
df_feather.to_csv("foreign/nyccrashes_cleaned.csv") # Add the name of the CSV file
  • We can check that the corresponding .csv file has been generated in the "foreign" folder.

We have exported the data to a .csv file and will use this file later.
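
One caveat worth knowing: to_csv writes the DataFrame index as an extra leading column by default, which is why an "Unnamed: 0" column shows up when we read this file back below. We keep the default here so the later outputs match, but if you do not need the index, a minimal variant is:

## Drop the index on export to avoid an extra "Unnamed: 0" column
df_feather.to_csv("foreign/nyccrashes_cleaned.csv", index=False)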

6.2.4 Import files in common formats: .csv/.xlsx/.txt

6.2.4.1 .csv files

We are familiar with .csv files, having used them to make charts in R in previous courses. Now let us import the .csv file we just generated, using the function read_csv from pandas. The following code shows how.

## Import the pandas package
import pandas as pd
## Choose the function "read_csv"
## Add the relative path of the generated CSV file
df_csv = pd.read_csv("foreign/nyccrashes_cleaned.csv")
## Check the imported data
## Use the function "head()" introduced above
print(df_csv.head(2))
   Unnamed: 0       crash_datetime borough  zip_code  latitude  longitude  \
0           0  2024-06-30 17:30:00     NaN       NaN       NaN        NaN   
1           1  2024-06-30 00:32:00     NaN       NaN       NaN        NaN   

     location     on_street_name cross_street_name off_street_name  ...  \
0  (0.0, 0.0)                NaN               NaN     GOLD STREET  ...   
1         NaN  BELT PARKWAY RAMP               NaN             NaN  ...   

   contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                    Unspecified                            NaN   
1                    Unspecified                            NaN   

   contributing_factor_vehicle_4  contributing_factor_vehicle_5  collision_id  \
0                            NaN                            NaN       4736746   
1                            NaN                            NaN       4736768   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   

   vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5  
0                  NaN                 NaN                 NaN  
1                  NaN                 NaN                 NaN  

[2 rows x 29 columns]

6.2.4.2 .xlsx files

We want to import .xlsx files, but we do not have a suitable one yet. We can convert the CSV file to a .xlsx file with the function to_excel from pandas. Let's see how to achieve this with the following code.

## Import the pandas package
import pandas as pd
## Use the function "to_excel"
## Export the data to the "foreign" folder
df_csv.to_excel("foreign/nyccrashes_cleaned.xlsx") # Add the name of the Excel file
  • Check whether the corresponding .xlsx file has been generated in the "foreign" folder

Now we have generated a .xlsx file containing the same data, and we can learn how to import .xlsx files. The function we use is read_excel, whatever the pandas version.

  • Recent versions of pandas use the following code.
import pandas as pd
## Choose the function "read_excel"
## Add the command "engine" to read the file smoothly
df_excel = pd.read_excel("foreign/nyccrashes_cleaned.xlsx", engine = "openpyxl")
## Print top 2 rows of the data
print(df_excel.head(2))
   Unnamed: 0.1  Unnamed: 0       crash_datetime borough  zip_code  latitude  \
0             0           0  2024-06-30 17:30:00     NaN       NaN       NaN   
1             1           1  2024-06-30 00:32:00     NaN       NaN       NaN   

   longitude    location     on_street_name cross_street_name  ...  \
0        NaN  (0.0, 0.0)                NaN               NaN  ...   
1        NaN         NaN  BELT PARKWAY RAMP               NaN  ...   

  contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                   Unspecified                            NaN   
1                   Unspecified                            NaN   

   contributing_factor_vehicle_4  contributing_factor_vehicle_5  collision_id  \
0                            NaN                            NaN       4736746   
1                            NaN                            NaN       4736768   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   

   vehicle_type_code_3  vehicle_type_code_4 vehicle_type_code_5  
0                  NaN                  NaN                 NaN  
1                  NaN                  NaN                 NaN  

[2 rows x 30 columns]

Older pandas releases also accepted an encoding argument in read_excel (it was deprecated in pandas 1.0 and later removed, so current versions reject it). The old-style call looked like this:

df_excel = pd.read_excel("foreign/nyccrashes_cleaned.xlsx", engine = "openpyxl",
                         encoding = "utf-8")

6.2.4.3 .txt files

The last common file type is the .txt file. We can generate a .txt file in much the same way as the .xlsx file, using the function to_csv from pandas with the argument sep="\t". We also pass index=False to omit the DataFrame index. The specific code follows.

import pandas as pd
df_csv = pd.read_csv("foreign/nyccrashes_cleaned.csv")
## Choose the function "to_csv"
## Add the command "sep='\t'"
## Add the command "index=False"
## Export the data to the default working directory
df_csv.to_csv("foreign/nyccrashes_cleaned.txt", sep = "\t", index = False)

Now we have the corresponding .txt file. The next step is to determine its encoding, because the computer cannot read the file reliably without the correct encoding. The code below obtains it: we open the file inside a with statement and then call the function detect from the chardet package, whose purpose is to detect the character encoding of text files.

import chardet
## Use "with" statement 
with open("foreign/nyccrashes_cleaned.txt", "rb") as f:
    # Execute the command "open"
    # And then assign the result to variable "f"
    raw_data = f.read() # Read the content from "f"
    result = chardet.detect(raw_data) 
    encoding = result["encoding"]
    print(str(encoding))
ascii
  • Warning: the detected encoding may differ from "utf-8". Here chardet reports "ascii", which is a subset of utf-8, so reading with utf-8 still works; in general, pass whatever encoding was detected.
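
Rather than hard-coding the encoding, a minimal sketch that reuses the encoding variable detected in the block above:

## Reuse the detected encoding instead of hard-coding it
import pandas as pd
df_txt = pd.read_table("foreign/nyccrashes_cleaned.txt", encoding = encoding)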

Now we have the .txt file and its correct encoding. The last step is to use the function read_table to import the .txt file, passing that encoding. The corresponding code follows.

import pandas as pd
## Choose the function "read_table"
## Add the encoding behind the relative address
df_txt = pd.read_table("foreign/nyccrashes_cleaned.txt", encoding = "utf-8")
## The defualt of function "head()" is top five rows 
print(df_txt.head())
   Unnamed: 0       crash_datetime    borough  zip_code  latitude  longitude  \
0           0  2024-06-30 17:30:00        NaN       NaN       NaN        NaN   
1           1  2024-06-30 00:32:00        NaN       NaN       NaN        NaN   
2           2  2024-06-30 07:05:00   BROOKLYN   11235.0  40.58106  -73.96744   
3           3  2024-06-30 20:47:00  MANHATTAN   10021.0  40.76363  -73.95330   
4           4  2024-06-30 10:14:00   BROOKLYN   11222.0  40.73046  -73.95149   

                location     on_street_name      cross_street_name  \
0             (0.0, 0.0)                NaN                    NaN   
1                    NaN  BELT PARKWAY RAMP                    NaN   
2  (40.58106, -73.96744)                NaN                    NaN   
3   (40.76363, -73.9533)          FDR DRIVE                    NaN   
4  (40.73046, -73.95149)  GREENPOINT AVENUE  MC GUINNESS BOULEVARD   

           off_street_name  ...  contributing_factor_vehicle_2  \
0              GOLD STREET  ...                    Unspecified   
1                      NaN  ...                    Unspecified   
2  2797      OCEAN PARKWAY  ...                            NaN   
3                      NaN  ...                            NaN   
4                      NaN  ...                    Unspecified   

   contributing_factor_vehicle_3  contributing_factor_vehicle_4  \
0                            NaN                            NaN   
1                            NaN                            NaN   
2                            NaN                            NaN   
3                            NaN                            NaN   
4                            NaN                            NaN   

   contributing_factor_vehicle_5  collision_id  \
0                            NaN       4736746   
1                            NaN       4736768   
2                            NaN       4737060   
3                            NaN       4737510   
4                            NaN       4736759   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   
2  Station Wagon/Sport Utility Vehicle                                  NaN   
3                                Sedan                                  NaN   
4                                  Bus                            Box Truck   

   vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5  
0                  NaN                 NaN                 NaN  
1                  NaN                 NaN                 NaN  
2                  NaN                 NaN                 NaN  
3                  NaN                 NaN                 NaN  
4                  NaN                 NaN                 NaN  

[5 rows x 29 columns]

6.2.5 Import the data from other software

In this section, all files generated by the code are stored in the folder named "foreign" and accessed via relative paths.
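
The export code below assumes the "foreign" folder already exists. If it does not, a minimal sketch that creates it once:

import os
## Create the "foreign" folder if it is missing; do nothing if it exists
os.makedirs("foreign", exist_ok=True)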

6.2.5.1 SAS files

6.2.5.1.1 Transfer the .csv file to a .xpt file

We choose the .xpt format to keep data types consistent during conversion. First, we must process the data so that column names contain no spaces; otherwise the export will fail. The code for cleaning the data follows.

import pandas as pd
df = pd.read_csv("foreign/nyccrashes_cleaned.csv")
## Drop the index column written by to_csv
df_csv_without_unnamed = df.loc[:, ~df.columns.str.contains("^Unnamed")]
## Replace any spaces in column names with underscores
df_csv_without_unnamed.columns = df_csv_without_unnamed.columns.str.replace(" ", "_")
df_csv_without_unnamed.to_csv("foreign/Without_Unamed.csv", index=False)

We use the pyreadstat package and its function write_xport to convert the .csv file. The corresponding code follows.

import pandas as pd
import pyreadstat
df_without_unnamed = pd.read_csv("foreign/Without_Unamed.csv")
sas_file = "foreign/SAS.xpt"
## Export the data to the "foreign" folder
pyreadstat.write_xport(df_without_unnamed, sas_file)

6.2.5.1.2 Import the generated .xpt file

We use the pyreadstat package again, this time choosing the function read_xport to import the data. Here is the code.

import pyreadstat
## Unpack the DataFrame and the metadata
df_1, meta = pyreadstat.read_xport("foreign/SAS.xpt")
## Show the DataFrame
print(df_1.head(2))
## Show the metadata
print(meta)
        crash_datetime borough  zip_code  latitude  longitude    location  \
0  2024-06-30 17:30:00               NaN       NaN        NaN  (0.0, 0.0)   
1  2024-06-30 00:32:00               NaN       NaN        NaN               

      on_street_name cross_street_name off_street_name  \
0                                          GOLD STREET   
1  BELT PARKWAY RAMP                                     

   number_of_persons_injured  ...  contributing_factor_vehicle_2  \
0                        0.0  ...                    Unspecified   
1                        0.0  ...                    Unspecified   

   contributing_factor_vehicle_3  contributing_factor_vehicle_4  \
0                                                                 
1                                                                 

   contributing_factor_vehicle_5  collision_id  \
0                                    4736746.0   
1                                    4736768.0   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   

  vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5  
0                                                              
1                                                              

[2 rows x 28 columns]
<pyreadstat._readstat_parser.metadata_container object at 0x115625400>
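
The printed metadata object is just a container; its attributes hold the useful details. A small sketch, using attribute names from pyreadstat's metadata container:

## Inspect the metadata returned alongside the DataFrame
print(meta.column_names[:5])  # first five column names
print(meta.number_columns)    # number of columns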

6.2.5.2 RData files (the suffix of this file is .RData)

6.2.5.2.1 Transfer the .csv file to an .RData file

We need to install the package rpy2.

pip install rpy2

Then we use the conversion interface pandas2ri from rpy2. The following code achieves the goal.

import pandas as pd
## Use the Package rpy2
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
## Activate conversion between Pandas and R
pandas2ri.activate()
df_csv = pd.read_csv("foreign/nyccrashes_cleaned.csv")
## Convert the pandas DataFrame to an R data frame
df_r = pandas2ri.py2rpy(df_csv)
## Save as an .RData file
## Export the data to the "foreign" folder
ro.globalenv["R"] = df_r
ro.r("save(R, file = 'foreign/nyccrashes_cleaned.Rdata')")
print("The CSV file has been converted to an .RData file successfully.")
  • If conversion warnings appear, they mean: (1) some columns could not be converted to the expected R data type and were coerced to strings; (2) the data is still saved, though with potential data type issues; (3) these warnings do not prevent importing the .RData file.

  • If you want to perform the necessary type conversions explicitly, code like the following is suitable (note the column is named "borough").

df_csv["borough"] = df_csv["borough"].astype(str)

6.2.5.2.2 Import the generated .RData file

We again use the rpy2 package, and we need the pandas2ri module too. The code follows.

import pandas as pd
import rpy2.robjects as ro
## Load the .Rdata file
r_file_path = "foreign/nyccrashes_cleaned.Rdata"
ro.r["load"](r_file_path)
## View loaded variables
loaded_objects = ro.r("ls()")
## Show the loaded variables
print("Loaded R objects:", loaded_objects)
## We named the data frame "R" above
r_dataframe = ro.r["R"]
from rpy2.robjects import pandas2ri
## Convert the R data frame to a pandas DataFrame
## so the data is convenient to work with
pandas2ri.activate()
df_2 = pandas2ri.rpy2py(r_dataframe)
print(df_2)
Loaded R objects: [1] "R"

      Unnamed: 0       crash_datetime    borough  zip_code   latitude  \
0              0  2024-06-30 17:30:00       None       NaN        NaN   
1              1  2024-06-30 00:32:00       None       NaN        NaN   
2              2  2024-06-30 07:05:00   BROOKLYN   11235.0  40.581060   
3              3  2024-06-30 20:47:00  MANHATTAN   10021.0  40.763630   
4              4  2024-06-30 10:14:00   BROOKLYN   11222.0  40.730460   
...          ...                  ...        ...       ...        ...   
1870        1870  2024-07-07 21:25:00      BRONX   10457.0  40.852520   
1871        1871  2024-07-07 10:31:00      BRONX   10460.0  40.843945   
1872        1872  2024-07-07 20:15:00     QUEENS   11436.0  40.677982   
1873        1873  2024-07-07 14:45:00      BRONX   10452.0  40.843822   
1874        1874  2024-07-07 14:12:00      BRONX   10468.0  40.861084   

      longitude                 location           on_street_name  \
0           NaN               (0.0, 0.0)                     None   
1           NaN                     None        BELT PARKWAY RAMP   
2    -73.967440    (40.58106, -73.96744)                     None   
3    -73.953300     (40.76363, -73.9533)                FDR DRIVE   
4    -73.951490    (40.73046, -73.95149)        GREENPOINT AVENUE   
...         ...                      ...                      ...   
1870 -73.900020    (40.85252, -73.90002)          EAST 180 STREET   
1871 -73.885800    (40.843945, -73.8858)                     None   
1872 -73.791214  (40.677982, -73.791214)        SUTPHIN BOULEVARD   
1873 -73.927500    (40.843822, -73.9275)  MAJOR DEEGAN EXPRESSWAY   
1874 -73.911490   (40.861084, -73.91149)                     None   

          cross_street_name            off_street_name  ...  \
0                      None                GOLD STREET  ...   
1                      None                       None  ...   
2                      None    2797      OCEAN PARKWAY  ...   
3                      None                       None  ...   
4     MC GUINNESS BOULEVARD                       None  ...   
...                     ...                        ...  ...   
1870                   None                       None  ...   
1871                   None  855       EAST 178 STREET  ...   
1872             120 AVENUE                       None  ...   
1873                   None                       None  ...   
1874                   None    2258      HAMPDEN PLACE  ...   

      contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                       Unspecified                           None   
1                       Unspecified                           None   
2                              None                           None   
3                              None                           None   
4                       Unspecified                           None   
...                             ...                            ...   
1870                    Unspecified                           None   
1871                    Unspecified                           None   
1872                    Unspecified                           None   
1873                    Unspecified                           None   
1874                           None                           None   

      contributing_factor_vehicle_4  contributing_factor_vehicle_5  \
0                              None                           None   
1                              None                           None   
2                              None                           None   
3                              None                           None   
4                              None                           None   
...                             ...                            ...   
1870                           None                           None   
1871                           None                           None   
1872                           None                           None   
1873                           None                           None   
1874                           None                           None   

      collision_id                  vehicle_type_code_1  \
0          4736746                                Sedan   
1          4736768  Station Wagon/Sport Utility Vehicle   
2          4737060  Station Wagon/Sport Utility Vehicle   
3          4737510                                Sedan   
4          4736759                                  Bus   
...            ...                                  ...   
1870       4744144                        Pick-up Truck   
1871       4744576  Station Wagon/Sport Utility Vehicle   
1872       4745391                                Sedan   
1873       4746540                                Sedan   
1874       4746320                                Sedan   

                      vehicle_type_code_2  vehicle_type_code_3  \
0                                   Sedan                 None   
1     Station Wagon/Sport Utility Vehicle                 None   
2                                    None                 None   
3                                    None                 None   
4                               Box Truck                 None   
...                                   ...                  ...   
1870                                Sedan                 None   
1871                                 None                 None   
1872                                Sedan                 None   
1873                                Sedan                 None   
1874                                 None                 None   

     vehicle_type_code_4 vehicle_type_code_5  
0                   None                None  
1                   None                None  
2                   None                None  
3                   None                None  
4                   None                None  
...                  ...                 ...  
1870                None                None  
1871                None                None  
1872                None                None  
1873                None                None  
1874                None                None  

[1875 rows x 29 columns]

6.2.5.3 Stata data (the suffix of this file is .dta)

6.2.5.3.1 Transfer the .csv file to a .dta file

Pandas alone suffices here; we choose the function to_stata to save the .dta file.

import pandas as pd
df_csv = pd.read_csv("foreign/nyccrashes_cleaned.csv")
## Export the data to the "foreign" folder
df_csv.to_stata("foreign/stata.dta")

6.2.5.3.2 Import the .dta file

We use the function read_stata from pandas. Here is the specific code.

import pandas as pd
df_3 = pd.read_stata("foreign/stata.dta")
print(df_3.head(2))
   index  Unnamed__0       crash_datetime borough  zip_code  latitude  \
0      0           0  2024-06-30 17:30:00               NaN       NaN   
1      1           1  2024-06-30 00:32:00               NaN       NaN   

   longitude    location     on_street_name cross_street_name  ...  \
0        NaN  (0.0, 0.0)                                       ...   
1        NaN              BELT PARKWAY RAMP                    ...   

  contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                   Unspecified                                  
1                   Unspecified                                  

   contributing_factor_vehicle_4  contributing_factor_vehicle_5  collision_id  \
0                                                                     4736746   
1                                                                     4736768   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   

   vehicle_type_code_3  vehicle_type_code_4 vehicle_type_code_5  
0                                                                
1                                                                

[2 rows x 30 columns]
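
Two quirks are visible in this output: to_stata wrote the DataFrame index as an extra "index" column, and "Unnamed: 0" became "Unnamed__0" because Stata variable names cannot contain spaces or colons. If you do not want the extra index column, a minimal variant is:

## Skip writing the DataFrame index into the .dta file
df_csv.to_stata("foreign/stata.dta", write_index=False)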

6.2.5.4 SPSS data (the suffix of this file is .sav)

6.2.5.4.1 Transfer the .csv file to a .sav file

We need the pyreadstat package and its function write_sav, and we should use the CSV file whose column names contain no spaces. We can use the following code.

import pandas as pd
import pyreadstat
import os
df_csv = pd.read_csv("foreign/Without_Unamed.csv")
output_file = os.path.join("foreign", "SPSS.sav")
## Export the data to the "foreign" folder
pyreadstat.write_sav(df_csv, output_file)

6.2.5.4.2 Import the generated .sav file

We also use the Package pyreadstat. We utilize the function read_sav from pyreadstat. The following code helps us import the .sav file.

import pandas as pd
import pyreadstat
df_4, meta = pyreadstat.read_sav("foreign/SPSS.sav")
print(df_4.head(2))
print(meta)
        crash_datetime borough  zip_code  latitude  longitude    location  \
0  2024-06-30 17:30:00               NaN       NaN        NaN  (0.0, 0.0)   
1  2024-06-30 00:32:00               NaN       NaN        NaN               

      on_street_name cross_street_name off_street_name  \
0                                          GOLD STREET   
1  BELT PARKWAY RAMP                                     

   number_of_persons_injured  ...  contributing_factor_vehicle_2  \
0                        0.0  ...                    Unspecified   
1                        0.0  ...                    Unspecified   

   contributing_factor_vehicle_3  contributing_factor_vehicle_4  \
0                                                                 
1                                                                 

   contributing_factor_vehicle_5  collision_id  \
0                                    4736746.0   
1                                    4736768.0   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   

  vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5  
0                                                              
1                                                              

[2 rows x 28 columns]
<pyreadstat._readstat_parser.metadata_container object at 0x118b65af0>

6.2.5.5 MATLAB files (the suffix of this file is .mat)

6.2.5.5.1 Transfer the .csv file to a .mat file

We need to install the scipy package (scipy.io is the I/O module inside it).

pip install scipy

Then we choose the function savemat from scipy.io. The specific code follows.

import pandas as pd
from scipy.io import savemat
import os
df_csv = pd.read_csv("foreign/nyccrashes_cleaned.csv")
## Convert the DataFrame to dictionary form,
## since MATLAB .mat files require a dictionary format
data_dict = {"data": df_csv.to_dict("list")}
## Save the dictionary as a .mat file
output_file = os.path.join("foreign", "MATLAB.mat")
## Export the data to the "foreign" folder
savemat(output_file, data_dict)

6.2.5.5.2 Import the generated .mat file

We use scipy.io again, this time choosing the function loadmat. The corresponding code follows.

import pandas as pd
from scipy.io import loadmat
df_5 = loadmat("foreign/MATLAB.mat")
## Show the keys of the loaded dictionary
print(df_5.keys())
## Show the contents of the "data" key
print(df_5["data"])
dict_keys(['__header__', '__version__', '__globals__', 'data'])
[[(array([[   0,    1,    2, ..., 1872, 1873, 1874]]), array(['2024-06-30 17:30:00', '2024-06-30 00:32:00',
         '2024-06-30 07:05:00', ..., '2024-07-07 20:15:00',
         '2024-07-07 14:45:00', '2024-07-07 14:12:00'], dtype='<U19'), array(['nan                             ',
         'nan                             ',
         'BROOKLYN                        ', ...,
         'QUEENS                          ',
         'BRONX                           ',
         'BRONX                           '], dtype='<U32'), array([[   nan,    nan, 11235., ..., 11436., 10452., 10468.]]), array([[      nan,       nan, 40.58106 , ..., 40.677982, 40.843822,
          40.861084]]), array([[       nan,        nan, -73.96744 , ..., -73.791214, -73.9275  ,
          -73.91149 ]]), array(['(0.0, 0.0)                      ',
         'nan                             ',
         '(40.58106, -73.96744)           ', ...,
         '(40.677982, -73.791214)         ',
         '(40.843822, -73.9275)           ',
         '(40.861084, -73.91149)          '], dtype='<U32'), array(['nan                             ',
         'BELT PARKWAY RAMP               ',
         'nan                             ', ...,
         'SUTPHIN BOULEVARD               ',
         'MAJOR DEEGAN EXPRESSWAY         ',
         'nan                             '], dtype='<U32'), array(['nan                             ',
         'nan                             ',
         'nan                             ', ...,
         '120 AVENUE                      ',
         'nan                             ',
         'nan                             '], dtype='<U32'), array(['GOLD STREET                        ',
         'nan                                ',
         '2797      OCEAN PARKWAY            ', ...,
         'nan                                ',
         'nan                                ',
         '2258      HAMPDEN PLACE            '], dtype='<U35'), array([[0, 0, 0, ..., 1, 0, 0]]), array([[0, 0, 0, ..., 0, 0, 0]]), array([[0, 0, 0, ..., 0, 0, 0]]), array([[0, 0, 0, ..., 0, 0, 0]]), array([[0, 0, 0, ..., 0, 0, 0]]), array([[0, 0, 0, ..., 0, 0, 0]]), array([[0, 0, 0, ..., 1, 0, 0]]), array([[0, 0, 0, ..., 0, 0, 0]]), array(['Passing Too Closely                                  ',
         'Unspecified                                          ',
         'Unspecified                                          ', ...,
         'Passing or Lane Usage Improper                       ',
         'Driver Inexperience                                  ',
         'Unspecified                                          '],
        dtype='<U53'), array(['Unspecified                                          ',
         'Unspecified                                          ',
         'nan                                                  ', ...,
         'Unspecified                                          ',
         'Unspecified                                          ',
         'nan                                                  '],
        dtype='<U53'), array(['nan                             ',
         'nan                             ',
         'nan                             ', ...,
         'nan                             ',
         'nan                             ',
         'nan                             '], dtype='<U32'), array(['nan                             ',
         'nan                             ',
         'nan                             ', ...,
         'nan                             ',
         'nan                             ',
         'nan                             '], dtype='<U32'), array(['nan                             ',
         'nan                             ',
         'nan                             ', ...,
         'nan                             ',
         'nan                             ',
         'nan                             '], dtype='<U32'), array([[4736746, 4736768, 4737060, ..., 4745391, 4746540, 4746320]]), array(['Sedan                              ',
         'Station Wagon/Sport Utility Vehicle',
         'Station Wagon/Sport Utility Vehicle', ...,
         'Sedan                              ',
         'Sedan                              ',
         'Sedan                              '], dtype='<U35'), array(['Sedan                              ',
         'Station Wagon/Sport Utility Vehicle',
         'nan                                ', ...,
         'Sedan                              ',
         'Sedan                              ',
         'nan                                '], dtype='<U35'), array(['nan                                ',
         'nan                                ',
         'nan                                ', ...,
         'nan                                ',
         'nan                                ',
         'nan                                '], dtype='<U35'), array(['nan                                ',
         'nan                                ',
         'nan                                ', ...,
         'nan                                ',
         'nan                                ',
         'nan                                '], dtype='<U35'), array(['nan                                ',
         'nan                                ',
         'nan                                ', ...,
         'nan                                ',
         'nan                                ',
         'nan                                '], dtype='<U35'))                                                                                                                                                                                                                                                                                                                                                ]]

6.2.5.6 HDF5 files (the suffix of this file is .h5)

6.2.5.6.1 Transfer the .csv file to a .h5 file

Pandas suffices here, together with the PyTables package (imported as tables), which to_hdf needs under the hood. The function to_hdf achieves the goal; the code follows.

import pandas as pd
import tables
import os
df_csv = pd.read_csv("foreign/nyccrashes_cleaned.csv")
output_file = os.path.join("foreign", "HDF5.h5")
## Export the data to the "foreign" folder
df_csv.to_hdf(output_file, key = "data", mode = "w")

6.2.5.6.2 Import the generated .h5 file

We again use only pandas, this time with the function read_hdf. The code for importing the .h5 file follows.

import pandas as pd
df_6 = pd.read_hdf("foreign/HDF5.h5", key = "data")
print(df_6.head(2))
   Unnamed: 0       crash_datetime borough  zip_code  latitude  longitude  \
0           0  2024-06-30 17:30:00     NaN       NaN       NaN        NaN   
1           1  2024-06-30 00:32:00     NaN       NaN       NaN        NaN   

     location     on_street_name cross_street_name off_street_name  ...  \
0  (0.0, 0.0)                NaN               NaN     GOLD STREET  ...   
1         NaN  BELT PARKWAY RAMP               NaN             NaN  ...   

   contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                    Unspecified                            NaN   
1                    Unspecified                            NaN   

   contributing_factor_vehicle_4  contributing_factor_vehicle_5  collision_id  \
0                            NaN                            NaN       4736746   
1                            NaN                            NaN       4736768   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   

   vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5  
0                  NaN                 NaN                 NaN  
1                  NaN                 NaN                 NaN  

[2 rows x 29 columns]
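
By default, to_hdf stores the DataFrame in a "fixed" layout, which is fast but must be read back whole. If you expect to query subsets, a sketch using the queryable "table" format is below; declaring zip_code as a data column is an illustrative choice, and pandas may warn about wide object columns:

## Store in the queryable "table" format, indexing zip_code
df_csv.to_hdf("foreign/HDF5.h5", key = "data", mode = "w",
              format = "table", data_columns = ["zip_code"])
## Read back only the rows matching a condition
df_sub = pd.read_hdf("foreign/HDF5.h5", key = "data",
                     where = "zip_code == 11235")
print(df_sub.shape)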

6.2.5.7 Import multiple files and merge them into a new file

So far we have imported one file at a time. Python also allows us to import multiple files simultaneously, using the glob module together with pandas. Unlike the packages above, glob is part of Python's standard library, so there is nothing to pip install.

The job of glob is to find files and directories that match a specified pattern; the function glob.glob returns a list of matching file paths. Note that "foreign/*.csv" matches every CSV file in the folder, so the matched files should share the same columns; otherwise concat fills the gaps with NaN (which is why "Unnamed: 0" prints as a float below). The corresponding code follows.

## Use the glob module and the pandas package
import glob 
import pandas as pd
## Find all CSV files in the "foreign" folder
## "*" matches any number of characters (including none)
all_files = glob.glob("foreign/*.csv") 
## Create a list to store the data
all_data = []
## Use a "for" loop to import all of the CSV files
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    all_data.append(df)
## Concatenate the pandas objects along the row axis into one DataFrame
data_merge = pd.concat(all_data, axis=0, ignore_index=True)
## Check the result
print(data_merge.head(2))
   Unnamed: 0       crash_datetime borough  zip_code  latitude  longitude  \
0         0.0  2024-06-30 17:30:00     NaN       NaN       NaN        NaN   
1         1.0  2024-06-30 00:32:00     NaN       NaN       NaN        NaN   

     location     on_street_name cross_street_name off_street_name  ...  \
0  (0.0, 0.0)                NaN               NaN     GOLD STREET  ...   
1         NaN  BELT PARKWAY RAMP               NaN             NaN  ...   

   contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                    Unspecified                            NaN   
1                    Unspecified                            NaN   

   contributing_factor_vehicle_4  contributing_factor_vehicle_5  collision_id  \
0                            NaN                            NaN       4736746   
1                            NaN                            NaN       4736768   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   

   vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5  
0                  NaN                 NaN                 NaN  
1                  NaN                 NaN                 NaN  

[2 rows x 29 columns]
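
For reference, the standard library's pathlib can do the same matching without glob; a minimal equivalent sketch:

from pathlib import Path
import pandas as pd
## Path.glob yields the same matches as glob.glob("foreign/*.csv")
all_data = [pd.read_csv(p, index_col=None, header=0)
            for p in Path("foreign").glob("*.csv")]
data_merge = pd.concat(all_data, axis=0, ignore_index=True)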

6.2.6 View data information

It is natural to be curious about the fundamental properties of the data we have imported, so this subsection lists some useful functions and attributes for getting to know a dataset.

The following code tells us how large the data is, using the DataFrame attribute shape.

## Show the dimensions of the crashes data
df_csv.shape
(1875, 29)
  • There are 1875 rows and 29 columns in the file.

The following code checks the type of each variable in the data, using the DataFrame attribute dtypes.

## Show all types of the crashes' variables
df_csv.dtypes
Unnamed: 0                         int64
crash_datetime                    object
borough                           object
zip_code                         float64
latitude                         float64
longitude                        float64
location                          object
on_street_name                    object
cross_street_name                 object
off_street_name                   object
number_of_persons_injured          int64
number_of_persons_killed           int64
number_of_pedestrians_injured      int64
number_of_pedestrians_killed       int64
number_of_cyclist_injured          int64
number_of_cyclist_killed           int64
number_of_motorist_injured         int64
number_of_motorist_killed          int64
contributing_factor_vehicle_1     object
contributing_factor_vehicle_2     object
contributing_factor_vehicle_3     object
contributing_factor_vehicle_4     object
contributing_factor_vehicle_5     object
collision_id                       int64
vehicle_type_code_1               object
vehicle_type_code_2               object
vehicle_type_code_3               object
vehicle_type_code_4               object
vehicle_type_code_5               object
dtype: object
  • The 29 variables and their corresponding types are listed.

The following code gives an overall view of the data, using the DataFrame method info.

df_csv.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1875 entries, 0 to 1874
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Unnamed: 0                     1875 non-null   int64  
 1   crash_datetime                 1875 non-null   object 
 2   borough                        1749 non-null   object 
 3   zip_code                       1749 non-null   float64
 4   latitude                       1722 non-null   float64
 5   longitude                      1722 non-null   float64
 6   location                       1729 non-null   object 
 7   on_street_name                 1329 non-null   object 
 8   cross_street_name              943 non-null    object 
 9   off_street_name                546 non-null    object 
 10  number_of_persons_injured      1875 non-null   int64  
 11  number_of_persons_killed       1875 non-null   int64  
 12  number_of_pedestrians_injured  1875 non-null   int64  
 13  number_of_pedestrians_killed   1875 non-null   int64  
 14  number_of_cyclist_injured      1875 non-null   int64  
 15  number_of_cyclist_killed       1875 non-null   int64  
 16  number_of_motorist_injured     1875 non-null   int64  
 17  number_of_motorist_killed      1875 non-null   int64  
 18  contributing_factor_vehicle_1  1864 non-null   object 
 19  contributing_factor_vehicle_2  1425 non-null   object 
 20  contributing_factor_vehicle_3  174 non-null    object 
 21  contributing_factor_vehicle_4  52 non-null     object 
 22  contributing_factor_vehicle_5  14 non-null     object 
 23  collision_id                   1875 non-null   int64  
 24  vehicle_type_code_1            1842 non-null   object 
 25  vehicle_type_code_2            1230 non-null   object 
 26  vehicle_type_code_3            162 non-null    object 
 27  vehicle_type_code_4            48 non-null     object 
 28  vehicle_type_code_5            14 non-null     object 
dtypes: float64(3), int64(10), object(16)
memory usage: 424.9+ KB
  • The basic information of crashes data has been listed.

The function describe from Pandas helps generate descriptive statistics.

## Show the basic descriptive statistics of crashes data
df_csv.describe()
        Unnamed: 0      zip_code     latitude    longitude  \
count  1875.000000   1749.000000  1722.000000  1722.000000   
mean    937.000000  10892.563179    40.719287   -73.919898   
std     541.410196    525.579066     0.081315     0.085191   
min       0.000000  10000.000000    40.513510   -74.237366   
25%     468.500000  10455.000000    40.662752   -73.968543   
50%     937.000000  11208.000000    40.712778   -73.922933   
75%    1405.500000  11239.000000    40.767641   -73.869405   
max    1874.000000  11694.000000    40.907246   -73.702190   

       number_of_persons_injured  number_of_persons_killed  \
count                1875.000000               1875.000000   
mean                    0.617067                  0.004267   
std                     0.915610                  0.103219   
min                     0.000000                  0.000000   
25%                     0.000000                  0.000000   
50%                     0.000000                  0.000000   
75%                     1.000000                  0.000000   
max                    11.000000                  4.000000   

       number_of_pedestrians_injured  number_of_pedestrians_killed  \
count                    1875.000000                   1875.000000   
mean                        0.093333                      0.002667   
std                         0.338452                      0.095207   
min                         0.000000                      0.000000   
25%                         0.000000                      0.000000   
50%                         0.000000                      0.000000   
75%                         0.000000                      0.000000   
max                         7.000000                      4.000000   

       number_of_cyclist_injured  number_of_cyclist_killed  \
count                1875.000000                    1875.0   
mean                    0.065067                       0.0   
std                     0.246709                       0.0   
min                     0.000000                       0.0   
25%                     0.000000                       0.0   
50%                     0.000000                       0.0   
75%                     0.000000                       0.0   
max                     1.000000                       0.0   

       number_of_motorist_injured  number_of_motorist_killed  collision_id  
count                 1875.000000                1875.000000  1.875000e+03  
mean                     0.433067                   0.001600  4.738587e+06  
std                      0.891185                   0.039979  1.659947e+03  
min                      0.000000                   0.000000  4.736561e+06  
25%                      0.000000                   0.000000  4.737666e+06  
50%                      0.000000                   0.000000  4.738257e+06  
75%                      1.000000                   0.000000  4.738883e+06  
max                     11.000000                   1.000000  4.752444e+06  

If we want to list the names of all columns, the DataFrame attribute columns is a good choice.

## Get all columns' names of crashes data
df_csv.columns
Index(['Unnamed: 0', 'crash_datetime', 'borough', 'zip_code', 'latitude',
       'longitude', 'location', 'on_street_name', 'cross_street_name',
       'off_street_name', 'number_of_persons_injured',
       'number_of_persons_killed', 'number_of_pedestrians_injured',
       'number_of_pedestrians_killed', 'number_of_cyclist_injured',
       'number_of_cyclist_killed', 'number_of_motorist_injured',
       'number_of_motorist_killed', 'contributing_factor_vehicle_1',
       'contributing_factor_vehicle_2', 'contributing_factor_vehicle_3',
       'contributing_factor_vehicle_4', 'contributing_factor_vehicle_5',
       'collision_id', 'vehicle_type_code_1', 'vehicle_type_code_2',
       'vehicle_type_code_3', 'vehicle_type_code_4', 'vehicle_type_code_5'],
      dtype='object')

The method tail from pandas allows us to view the last rows.

## Show the last 2 rows of crashes data
df_csv.tail(n = 2)
      Unnamed: 0       crash_datetime borough  zip_code   latitude  longitude  \
1873        1873  2024-07-07 14:45:00   BRONX   10452.0  40.843822  -73.92750   
1874        1874  2024-07-07 14:12:00   BRONX   10468.0  40.861084  -73.91149   

                    location           on_street_name cross_street_name  \
1873   (40.843822, -73.9275)  MAJOR DEEGAN EXPRESSWAY               NaN   
1874  (40.861084, -73.91149)                      NaN               NaN   

             off_street_name  ...  contributing_factor_vehicle_2  \
1873                     NaN  ...                    Unspecified   
1874  2258      HAMPDEN PLACE  ...                           NaN   

      contributing_factor_vehicle_3  contributing_factor_vehicle_4  \
1873                            NaN                            NaN   
1874                            NaN                            NaN   

      contributing_factor_vehicle_5  collision_id  vehicle_type_code_1  \
1873                            NaN       4746540                Sedan   
1874                            NaN       4746320                Sedan   

     vehicle_type_code_2 vehicle_type_code_3 vehicle_type_code_4  \
1873               Sedan                 NaN                 NaN   
1874                 NaN                 NaN                 NaN   

     vehicle_type_code_5  
1873                 NaN  
1874                 NaN  

[2 rows x 29 columns]

The method unique from pandas returns the distinct values of a single column.

## Show the unique values in the column named "crash_datetime"
df_csv["crash_datetime"].unique()
array(['2024-06-30 17:30:00', '2024-06-30 00:32:00',
       '2024-06-30 07:05:00', ..., '2024-07-07 09:13:00',
       '2024-07-07 10:31:00', '2024-07-07 14:12:00'], dtype=object)

We can also get all values of a column (without deduplication), by using the attribute values instead of unique.

## Show all values of the column named "crash_datetime"
df_csv["crash_datetime"].values
array(['2024-06-30 17:30:00', '2024-06-30 00:32:00',
       '2024-06-30 07:05:00', ..., '2024-07-07 20:15:00',
       '2024-07-07 14:45:00', '2024-07-07 14:12:00'], dtype=object)

6.2.7 Find Null Values

It is necessary for us to find null values before we clean and preprocess the data. The following content covers how to find null values.

6.2.7.1 Determine whether there are missing values.

We first use the function isnull, whose aim is to detect missing values in the data. We then chain the method any to determine whether any values are missing.

  • Determine whether there are missing values in columns
## Determine whether there are missing values in columns of crashes data
df_csv.isnull().any(axis = 0) # "axis=0" means columns
Unnamed: 0                       False
crash_datetime                   False
borough                           True
zip_code                          True
latitude                          True
longitude                         True
location                          True
on_street_name                    True
cross_street_name                 True
off_street_name                   True
number_of_persons_injured        False
number_of_persons_killed         False
number_of_pedestrians_injured    False
number_of_pedestrians_killed     False
number_of_cyclist_injured        False
number_of_cyclist_killed         False
number_of_motorist_injured       False
number_of_motorist_killed        False
contributing_factor_vehicle_1     True
contributing_factor_vehicle_2     True
contributing_factor_vehicle_3     True
contributing_factor_vehicle_4     True
contributing_factor_vehicle_5     True
collision_id                     False
vehicle_type_code_1               True
vehicle_type_code_2               True
vehicle_type_code_3               True
vehicle_type_code_4               True
vehicle_type_code_5               True
dtype: bool
  • Determine whether there are missing values in rows.
## Determine whether there are missing values in rows of crashes data
df_csv.isnull().any(axis = 1) # "axis=1" means rows
0       True
1       True
2       True
3       True
4       True
        ... 
1870    True
1871    True
1872    True
1873    True
1874    True
Length: 1875, dtype: bool

6.2.7.2 Locate the missing values of rows/columns

We utilize the function loc from pandas, which selects or modifies data in a DataFrame or Series based on labels.

  • Locate the missing values of rows
## Locate the missing values in crashes data rows
df_csv.loc[df_csv.isnull().any(axis = 1)]
      Unnamed: 0       crash_datetime    borough  zip_code   latitude  \
0              0  2024-06-30 17:30:00        NaN       NaN        NaN   
1              1  2024-06-30 00:32:00        NaN       NaN        NaN   
2              2  2024-06-30 07:05:00   BROOKLYN   11235.0  40.581060   
3              3  2024-06-30 20:47:00  MANHATTAN   10021.0  40.763630   
4              4  2024-06-30 10:14:00   BROOKLYN   11222.0  40.730460   
...          ...                  ...        ...       ...        ...   
1870        1870  2024-07-07 21:25:00      BRONX   10457.0  40.852520   
1871        1871  2024-07-07 10:31:00      BRONX   10460.0  40.843945   
1872        1872  2024-07-07 20:15:00     QUEENS   11436.0  40.677982   
1873        1873  2024-07-07 14:45:00      BRONX   10452.0  40.843822   
1874        1874  2024-07-07 14:12:00      BRONX   10468.0  40.861084   

      longitude                 location           on_street_name  \
0           NaN               (0.0, 0.0)                      NaN   
1           NaN                      NaN        BELT PARKWAY RAMP   
2    -73.967440    (40.58106, -73.96744)                      NaN   
3    -73.953300     (40.76363, -73.9533)                FDR DRIVE   
4    -73.951490    (40.73046, -73.95149)        GREENPOINT AVENUE   
...         ...                      ...                      ...   
1870 -73.900020    (40.85252, -73.90002)          EAST 180 STREET   
1871 -73.885800    (40.843945, -73.8858)                      NaN   
1872 -73.791214  (40.677982, -73.791214)        SUTPHIN BOULEVARD   
1873 -73.927500    (40.843822, -73.9275)  MAJOR DEEGAN EXPRESSWAY   
1874 -73.911490   (40.861084, -73.91149)                      NaN   

          cross_street_name            off_street_name  ...  \
0                       NaN                GOLD STREET  ...   
1                       NaN                        NaN  ...   
2                       NaN    2797      OCEAN PARKWAY  ...   
3                       NaN                        NaN  ...   
4     MC GUINNESS BOULEVARD                        NaN  ...   
...                     ...                        ...  ...   
1870                    NaN                        NaN  ...   
1871                    NaN  855       EAST 178 STREET  ...   
1872             120 AVENUE                        NaN  ...   
1873                    NaN                        NaN  ...   
1874                    NaN    2258      HAMPDEN PLACE  ...   

      contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                       Unspecified                            NaN   
1                       Unspecified                            NaN   
2                               NaN                            NaN   
3                               NaN                            NaN   
4                       Unspecified                            NaN   
...                             ...                            ...   
1870                    Unspecified                            NaN   
1871                    Unspecified                            NaN   
1872                    Unspecified                            NaN   
1873                    Unspecified                            NaN   
1874                            NaN                            NaN   

      contributing_factor_vehicle_4  contributing_factor_vehicle_5  \
0                               NaN                            NaN   
1                               NaN                            NaN   
2                               NaN                            NaN   
3                               NaN                            NaN   
4                               NaN                            NaN   
...                             ...                            ...   
1870                            NaN                            NaN   
1871                            NaN                            NaN   
1872                            NaN                            NaN   
1873                            NaN                            NaN   
1874                            NaN                            NaN   

      collision_id                  vehicle_type_code_1  \
0          4736746                                Sedan   
1          4736768  Station Wagon/Sport Utility Vehicle   
2          4737060  Station Wagon/Sport Utility Vehicle   
3          4737510                                Sedan   
4          4736759                                  Bus   
...            ...                                  ...   
1870       4744144                        Pick-up Truck   
1871       4744576  Station Wagon/Sport Utility Vehicle   
1872       4745391                                Sedan   
1873       4746540                                Sedan   
1874       4746320                                Sedan   

                      vehicle_type_code_2  vehicle_type_code_3  \
0                                   Sedan                  NaN   
1     Station Wagon/Sport Utility Vehicle                  NaN   
2                                     NaN                  NaN   
3                                     NaN                  NaN   
4                               Box Truck                  NaN   
...                                   ...                  ...   
1870                                Sedan                  NaN   
1871                                  NaN                  NaN   
1872                                Sedan                  NaN   
1873                                Sedan                  NaN   
1874                                  NaN                  NaN   

     vehicle_type_code_4 vehicle_type_code_5  
0                    NaN                 NaN  
1                    NaN                 NaN  
2                    NaN                 NaN  
3                    NaN                 NaN  
4                    NaN                 NaN  
...                  ...                 ...  
1870                 NaN                 NaN  
1871                 NaN                 NaN  
1872                 NaN                 NaN  
1873                 NaN                 NaN  
1874                 NaN                 NaN  

[1875 rows x 29 columns]

6.2.7.3 Determine the number of missing values.

We again use the function isnull, but this time we chain sum rather than any.

## Calculate the number of missing values in crashes data columns
df_csv.isnull().sum(axis = 0)
Unnamed: 0                          0
crash_datetime                      0
borough                           126
zip_code                          126
latitude                          153
longitude                         153
location                          146
on_street_name                    546
cross_street_name                 932
off_street_name                  1329
number_of_persons_injured           0
number_of_persons_killed            0
number_of_pedestrians_injured       0
number_of_pedestrians_killed        0
number_of_cyclist_injured           0
number_of_cyclist_killed            0
number_of_motorist_injured          0
number_of_motorist_killed           0
contributing_factor_vehicle_1      11
contributing_factor_vehicle_2     450
contributing_factor_vehicle_3    1701
contributing_factor_vehicle_4    1823
contributing_factor_vehicle_5    1861
collision_id                        0
vehicle_type_code_1                33
vehicle_type_code_2               645
vehicle_type_code_3              1713
vehicle_type_code_4              1827
vehicle_type_code_5              1861
dtype: int64
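Once missing values are located and counted, they can be dropped or filled. A brief sketch of the two standard remedies, assuming df_csv from above; the 'Unspecified' label mirrors the convention already used in the contributing-factor columns:

## Drop rows that are missing latitude or longitude
df_geo = df_csv.dropna(subset=['latitude', 'longitude'])

## Fill missing contributing factors with an explicit label
df_csv['contributing_factor_vehicle_2'] = (
    df_csv['contributing_factor_vehicle_2'].fillna('Unspecified')
)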

6.3 SQL

This section was written by Thea Johnson, a senior in Statistics at the time.

6.3.1 Table of Contents

  • What is a Database?
  • What is SQL?
  • CRUD Model
  • Creating Tables with SQL
  • Inserting into SQL Tables
  • Updating/Deleting SQL Tables
  • Using SQL to work with NYC Open Data
  • Queries

6.3.2 What is a Database?

  • Collection of related information
  • Ex: Phone books, grocery list, student records
  • Relational and non-relational
  • Relational databases are structured into columns and rows
  • Each row represents an observation and each column represents an attribute
  • A key uniquely identifies each row
  • Inserting a duplicate key causes an error

6.3.3 Example Relational Database

import sqlite3
import pandas as pd

# creates a database file called phonebook.db and lets you connect
connection = sqlite3.connect("phonebook.db")

# creates a cursor object using the cursor function from the sqlite3 module
cursor = connection.cursor()

cursor.execute("DROP TABLE IF EXISTS phonebook")

# Execute function allows you to send commands in the SQL language as strings
cursor.execute("""
CREATE TABLE IF NOT EXISTS phonebook (
    phonebook_id INTEGER PRIMARY KEY,
    name TEXT,
    phone_num TEXT UNIQUE,
    address TEXT
)
""")

cursor.execute("INSERT INTO phonebook VALUES (1, 'Greta Colic', '2035452367', '1 Hartsdale road')")
cursor.execute("INSERT INTO phonebook VALUES(2, 'Carlos Alavarez', '9145652761', '13 Porter street')")
cursor.execute("INSERT INTO phonebook VALUES(3, 'Marin Yanko', '5753917568', '110 Ocean avenue')")
cursor.execute("INSERT INTO phonebook VALUES(4, 'Mira Watson', '9146522761', '12 Hindmarsh avenue')")
cursor.execute("INSERT INTO phonebook VALUES(5, 'Harry Smith', '2036658279', '180 Wallace road')")

connection.commit()
 

# alternative way of displaying output
"""cursor.execute("SELECT * FROM phonebook")
rows = cursor.fetchall()
for row in rows:
    print(row)"""
 
output = pd.read_sql_query("SELECT * FROM phonebook", connection)
print()
print()
print(output)

connection.close()


   phonebook_id             name   phone_num              address
0             1      Greta Colic  2035452367     1 Hartsdale road
1             2  Carlos Alavarez  9145652761     13 Porter street
2             3      Marin Yanko  5753917568     110 Ocean avenue
3             4      Mira Watson  9146522761  12 Hindmarsh avenue
4             5      Harry Smith  2036658279     180 Wallace road

6.3.4 What is SQL?

  • Structured Query Language
  • Allows users to interact with databases to store and retrieve data
  • The essential operations follow the CRUD acronym

6.3.5 CRUD

  • Create, read, update and delete
  • Essential operations for SQL to manage a database
  • Create: adds a new record (row) to a database with unique attributes
  • Read: Returns records based on specified search criteria
  • Update: Allows you to change attribute(s) of the record
  • Delete: Allows you to remove records from the database

6.3.6 Implementing SQL Through Python to Create a Table

import sqlite3

# creates a database file called phonebook.db and lets you connect
connection = sqlite3.connect("phonebook.db")

# creates a cursor object using the cursor function from the sqlite3 module
cursor = connection.cursor()

"""
CREATE TABLE tablename
(
    attribute1 datatype PRIMARY KEY,
    attribute2 datatype,
    attribute3 datatype
);
"""
# You MUST include a primary key to uniquely identify entries

# Execute function allows you to send commands in the SQL language as strings
cursor.execute("""
CREATE TABLE IF NOT EXISTS phonebook (
    phonebook_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    phone_num TEXT UNIQUE,
    address TEXT
)
""")
connection.commit()
connection.close()

6.3.7 How to Insert into an SQL Table?

import sqlite3
import pandas as pd

# Connects to the previously created phonebook.db
connection = sqlite3.connect("phonebook.db")

# creates a cursor object using the cursor function from the sqlite3 module
cursor = connection.cursor()

"""
INSERT INTO tablename VALUES("value1", "value2", "value3");

"""

cursor.execute("INSERT INTO phonebook VALUES (1, 'Greta Colic', '2035452367', '1 Hartsdale road')")
cursor.execute("INSERT INTO phonebook VALUES(2, 'Carlos Alavarez', '9145652761', '13 Porter street')")
cursor.execute("INSERT INTO phonebook VALUES(3, 'Marin Yanko', '5753917568', '110 Ocean avenue')")
cursor.execute("INSERT INTO phonebook VALUES(4, 'Mira Watson', '9146522761', '12 Hindmarsh avenue')")
cursor.execute("INSERT INTO phonebook VALUES(5, 'Harry Smith', '2036658279', '180 Wallace road')")

# How to input data if there's a missing value?
"""
INSERT INTO tablename(attribute1, attribute2) VALUES(val1, val2);
"""
# only works if the missing value is not a primary key
cursor.execute("INSERT INTO phonebook(phonebook_id, name, phone_num) VALUES(6, 'Stacy Yang', '9178852765')")

connection.commit()

# Allows you to see the created table
output = pd.read_sql_query("SELECT * FROM phonebook", connection)
print(output)
connection.close()
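Hard-coding values into INSERT strings is fine for a demo, but for values coming from users or files the idiomatic, injection-safe pattern is parameterized queries with ? placeholders, optionally batched with executemany. A minimal sketch against the same phonebook table (the two new entries are made up for illustration):

import sqlite3

connection = sqlite3.connect("phonebook.db")
cursor = connection.cursor()

new_rows = [
    (7, 'Ana Petrov', '8605550101', '22 Birch lane'),
    (8, 'Liam Chen', '8605550102', '7 Maple court'),
]
# Placeholders let sqlite3 escape the values safely
cursor.executemany("INSERT INTO phonebook VALUES (?, ?, ?, ?)", new_rows)

connection.commit()
connection.close()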

6.3.8 How to Update/Delete Using SQL?

# Updating an attribute (WHERE statement is optional)
connection = sqlite3.connect("phonebook.db")

# creates a cursor object using the cursor function from the sqlite3 module
cursor = connection.cursor()

# Updates Greta's number
cursor.execute("UPDATE phonebook SET phone_num = '2035151234' WHERE name = 'Greta Colic';")

# Deletes Harry Smith from the phonebook
cursor.execute("DELETE FROM phonebook WHERE name = 'Harry Smith';")

# Changes Carlos's last name
cursor.execute("UPDATE phonebook SET name = 'Carlos Ramos' WHERE name = 'Carlos Alavarez';")

# Updating multiple columns
update_multiple_query = """
UPDATE phonebook 
SET phone_num = '7777777777', address = '45 Main St' 
WHERE name = 'Marin Yanko';
"""
cursor.execute(update_multiple_query)

# deleting a table 
cursor.execute("DROP TABLE phonebook;")

connection.commit()
connection.close()
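A related design note: a sqlite3 connection can itself be used as a context manager, committing the transaction if the block succeeds and rolling it back if an exception is raised, so a batch of updates and deletes is applied all-or-nothing. A small sketch, assuming the phonebook table exists (it is recreated in the next section after the DROP above), with a made-up address:

import sqlite3

connection = sqlite3.connect("phonebook.db")
try:
    # Commits on success; rolls back automatically on an exception
    with connection:
        connection.execute(
            "UPDATE phonebook SET address = ? WHERE name = ?",
            ('9 River road', 'Mira Watson'))
finally:
    connection.close()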

6.3.9 Full Example: Create, Insert, Update, and Delete

import sqlite3
import pandas as pd

# creates a database file called phonebook.db and lets you connect
connection = sqlite3.connect("phonebook.db")

# creates a cursor object using the cursor function from the sqlite3 module
cursor = connection.cursor()

cursor.execute("DROP TABLE IF EXISTS phonebook;")
# Execute function allows you to send commands in the SQL language as strings
cursor.execute("""
CREATE TABLE IF NOT EXISTS phonebook (
    phonebook_id INT,
    name VARCHAR(30), 
    phone_num VARCHAR(15) PRIMARY KEY,
    address VARCHAR(30)
)
""")

cursor.execute("INSERT INTO phonebook VALUES (1, 'Greta Colic', '2035452367', '1 Hartsdale road')")
cursor.execute("INSERT INTO phonebook VALUES(2, 'Carlos Alavarez', '9145652761', '13 Porter street')")
cursor.execute("INSERT INTO phonebook VALUES(3, 'Marin Yanko', '5753917568', '110 Ocean avenue')")
cursor.execute("INSERT INTO phonebook VALUES(4, 'Mira Watson', '9146522761', '12 Hindmarsh avenue')")
cursor.execute("INSERT INTO phonebook VALUES(5, 'Harry Smith', '2036658279', '180 Wallace road')")
cursor.execute("INSERT INTO phonebook(phonebook_id, name, phone_num) VALUES(6, 'Stacy Yang', '9178852765')")

connection.commit()
 

# alternative way of displaying output
"""cursor.execute("SELECT * FROM phonebook")
rows = cursor.fetchall()
for row in rows:
    print(row)"""

output = pd.read_sql_query("SELECT * FROM phonebook", connection)
print(output)
print(" ")
print(" ")

cursor.execute("UPDATE phonebook SET phone_num = '2035151234' WHERE name = 'Greta Colic';")

# Deletes Harry Smith from the phonebook
cursor.execute("DELETE FROM phonebook WHERE name = 'Harry Smith';")

# Updating multiple columns
update_multiple_query = """
UPDATE phonebook 
SET phone_num = '7777777777', address = '45 Main St' 
WHERE name = 'Marin Yanko';
"""
cursor.execute(update_multiple_query)

cursor.execute("UPDATE phonebook SET name = 'Carlos Ramos' WHERE name = 'Carlos Alavarez';")

connection.commit()
output = pd.read_sql_query("SELECT * FROM phonebook", connection)
print(output)

connection.close()
   phonebook_id             name   phone_num              address
0             1      Greta Colic  2035452367     1 Hartsdale road
1             2  Carlos Alavarez  9145652761     13 Porter street
2             3      Marin Yanko  5753917568     110 Ocean avenue
3             4      Mira Watson  9146522761  12 Hindmarsh avenue
4             5      Harry Smith  2036658279     180 Wallace road
5             6       Stacy Yang  9178852765                 None
 
 
   phonebook_id          name   phone_num              address
0             1   Greta Colic  2035151234     1 Hartsdale road
1             2  Carlos Ramos  9145652761     13 Porter street
2             3   Marin Yanko  7777777777           45 Main St
3             4   Mira Watson  9146522761  12 Hindmarsh avenue
4             6    Stacy Yang  9178852765                 None

6.3.10 Using SQL to Work on a CSV file

import pandas as pd
import sqlite3

# Creates an SQL Database
conn = sqlite3.connect("nyc.db")

# Reads the CSV file using the path
data = pd.read_csv(r"data/nyccrashes_2024w0630_by20240916.csv")

cursor = conn.cursor()

# automatically converts data to an SQL table
data.to_sql('nyccrashes', conn, if_exists='replace', index=False)
1875

The to_sql call returns the number of rows written, confirming that all 1875 records were loaded into the nyccrashes table.

6.3.11 Queries

  • Commands used to pull needed data out of a database
# Query for everything in the table
query1 = pd.read_sql_query("SELECT * FROM nyccrashes;", conn)
print(query1.head(3))

# Query to count total crashes
total_crashes = pd.read_sql_query("SELECT COUNT(*) FROM nyccrashes;", conn)
print(total_crashes)

# Query to retrieve only specific attributes
specific_columns = pd.read_sql_query("SELECT \"ZIP CODE\", \"CRASH TIME\" FROM nyccrashes;", conn)
print(specific_columns.head(3))

# Groups Crashes by borough
crashes_by_borough = pd.read_sql_query("""
    SELECT BOROUGH, COUNT(*) AS Total_Crashes
    FROM nyccrashes
    GROUP BY BOROUGH;
""", conn)
print(crashes_by_borough)

# Query to show the fatal crashes
fatal_crashes = pd.read_sql_query("SELECT * FROM nyccrashes WHERE \"NUMBER OF PERSONS KILLED\" > 0;", conn)
print(fatal_crashes.head())

6.3.12 Queries Output

# Query for everything in the table
# query1 = pd.read_sql_query("SELECT * FROM nyccrashes;", conn)
# print(query1.head(3))

# Query to count total crashes
print("Here's the output for the count query.")
total_crashes = pd.read_sql_query("SELECT COUNT(*) FROM nyccrashes;", conn)
print(total_crashes)
print()

# Groups Crashes by borough
print("Here's the output for the query to group crashes by borough.")
crashes_by_borough = pd.read_sql_query("""
    SELECT BOROUGH, COUNT(*) AS Total_Crashes
    FROM nyccrashes
    GROUP BY BOROUGH;
""", conn)
print(crashes_by_borough)
print()
# Query to retrieve only specific attributes
print("Here's the output for the specific columns query.")
specific_columns = pd.read_sql_query("SELECT \"ZIP CODE\", \"CRASH TIME\" FROM nyccrashes;", conn)
print(specific_columns.head(5))

# Query to show the fatal crashes
# fatal_crashes = pd.read_sql_query(
#     "SELECT * FROM nyccrashes WHERE \"NUMBER OF PERSONS KILLED\" > 0;", conn)
# print(fatal_crashes.head())
conn.commit()
Here's the output for the count query.
   COUNT(*)
0      1875

Here's the output for the query to group crashes by borough.
         BOROUGH  Total_Crashes
0           None            541
1          BRONX            213
2       BROOKLYN            462
3      MANHATTAN            228
4         QUEENS            381
5  STATEN ISLAND             50

Here's the output for the specific columns query.
   ZIP CODE CRASH TIME
0       NaN      17:30
1       NaN       0:32
2   11235.0       7:05
3       NaN      20:47
4   11222.0      10:14
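Query parameters can also be passed through pandas instead of being pasted into the SQL string, which is safer and avoids quoting headaches. A minimal sketch, assuming the conn connection to nyc.db is still open:

# Parameterized query: sqlite3-style ? placeholders via the params argument
borough_count = pd.read_sql_query(
    "SELECT COUNT(*) AS total FROM nyccrashes WHERE BOROUGH = ?;",
    conn, params=("BROOKLYN",))
print(borough_count)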

6.3.13 Conclusion

  • SQL works with relational databases
  • SQL performs the CRUD functions: create, read, update, and delete to work with databases
  • A query is a request from a database for information
  • SQL can be used to manipulate data from various formats including CSV files

6.3.14 Further reading

6.4 NYC Crash Data

Consider a subset of the NYC Crash Data, which contains all NYC motor vehicle collisions data with documentation from NYC Open Data. We downloaded the crash data for the week of June 30, 2024, on September 16, 2024, in CSV format.

import pandas as pd

# Load the dataset
file_path = 'data/nyccrashes_2024w0630_by20240916.csv'
df = pd.read_csv(file_path)

# Replace column names: convert to lowercase and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Display the first few rows of the dataset to understand its structure
df.head()
(Wide output truncated: the first 5 rows of the raw 29-column table. Columns include crash_date, crash_time, borough, zip_code, latitude, longitude, location, street names, injury and fatality counts, contributing factors, collision_id, and the vehicle type codes; note that row 0 carries the invalid coordinates (0.0, 0.0).)

5 rows × 29 columns

Now we can do some cleaning after a quick browse.

# Replace invalid coordinates (latitude=0, longitude=0 or NaN) with NaN
df.loc[(df['latitude'] == 0) & (df['longitude'] == 0), 
       ['latitude', 'longitude']] = pd.NA
df['latitude'] = df['latitude'].replace(0, pd.NA)
df['longitude'] = df['longitude'].replace(0, pd.NA)

# Longitude/latitude don't need double precision
df['latitude'] = df['latitude'].astype('float32', errors='ignore')
df['longitude'] = df['longitude'].astype('float32', errors='ignore')

# Drop the redundant 'location' column
df = df.drop(columns=['location'])

# Converting 'crash_date' and 'crash_time' columns into a single datetime column
df['crash_datetime'] = pd.to_datetime(df['crash_date'] + ' ' 
                       + df['crash_time'], format='%m/%d/%Y %H:%M', errors='coerce')

# Drop the original 'crash_date' and 'crash_time' columns
df = df.drop(columns=['crash_date', 'crash_time'])

Do missing values in zip_code and borough always co-occur?

# Check if missing values in 'zip_code' and 'borough' always co-occur
# Count rows where both are missing
missing_cooccur = df[['zip_code', 'borough']].isnull().all(axis=1).sum()
# Count total missing in 'zip_code' and 'borough', respectively
total_missing_zip_code = df['zip_code'].isnull().sum()
total_missing_borough = df['borough'].isnull().sum()

# If missing in both columns always co-occur, the number of missing
# co-occurrences should be equal to the total missing in either column
missing_cooccur, total_missing_zip_code, total_missing_borough
(np.int64(541), np.int64(541), np.int64(541))

Are there cases where zip_code and borough are missing but the geo codes are not missing? If so, fill in zip_code and borough using the geo codes by reverse geocoding.

First make sure geopy is installed.

pip install geopy

Now we use the Nominatim geocoder from the geopy package to reverse geocode.

from geopy.geocoders import Nominatim
import time

# Initialize the geocoder; the `user_agent` is your identifier 
# when using the service. Be mindful not to crash the server
# by unlimited number of queries, especially invalid code.
geolocator = Nominatim(user_agent="jyGeopyTry")

We write a function to do the reverse geocoding given latitude and longitude.

# Function to fill missing zip_code
def get_zip_code(latitude, longitude):
    try:
        location = geolocator.reverse((latitude, longitude), timeout=10)
        if location:
            address = location.raw['address']
            zip_code = address.get('postcode', None)
            return zip_code
        else:
            return None
    except Exception as e:
        print(f"Error: {e} for coordinates {latitude}, {longitude}")
        return None
    finally:
        time.sleep(1)  # Delay to avoid overwhelming the service

Let’s try it out:

# Example usage
latitude = 40.730610
longitude = -73.935242
zip_code = get_zip_code(latitude, longitude)

The function get_zip_code can then be applied to rows where the zip code is missing but the geocodes are not, to fill in the missing zip codes.
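A minimal sketch of that step, assuming df and get_zip_code as defined above; note that the one-second delay in the function makes this slow for many rows:

# Rows with a missing zip code but valid coordinates
mask = df['zip_code'].isna() & df['latitude'].notna() & df['longitude'].notna()

# Fill zip codes by reverse geocoding those rows
df.loc[mask, 'zip_code'] = df.loc[mask].apply(
    lambda row: get_zip_code(row['latitude'], row['longitude']), axis=1)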

Once zip code is known, figuring out borough is simple because the valid zip codes for each borough are known, as sketched below.
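A sketch of that lookup with a deliberately tiny, illustrative mapping (a complete version would enumerate every NYC ZIP code per borough; most of the entries below come from rows seen earlier in this chapter):

# Illustrative (incomplete) mapping from ZIP code to borough
zip_to_borough = {
    '10021': 'MANHATTAN',
    '10468': 'BRONX',
    '11222': 'BROOKLYN',
    '11436': 'QUEENS',
    '10301': 'STATEN ISLAND',
}

mask = df['borough'].isna() & df['zip_code'].notna()
zips = df.loc[mask, 'zip_code'].astype(int).astype(str).str.zfill(5)
df.loc[mask, 'borough'] = zips.map(zip_to_borough)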

6.5 Cross-platform Data Format Arrow

The CSV format (and related formats like TSV - tab-separated values) for data tables is ubiquitous, convenient, and can be read or written by many different data analysis environments, including spreadsheets. An advantage of the textual representation of the data in a CSV file is that the entire data table, or portions of it, can be previewed in a text editor. However, the textual representation can be ambiguous and inconsistent. The type of a particular column (Boolean, integer, floating-point, text, factor, etc.) must be inferred from the text representation, often at the expense of reading the entire file before these inferences can be made. Experienced data scientists are aware that a substantial part of an analysis or report generation is often the “data cleaning” involved in preparing the data for analysis. This can be an open-ended task; it took numerous trial-and-error iterations to create the list of different missing-data representations used for the sample CSV file, and even now we are not sure we have them all.

To read and export data efficiently, leveraging the Apache Arrow library can significantly improve performance and storage efficiency, especially with large datasets. The IPC (Inter-Process Communication) file format in the context of Apache Arrow is a key component for efficiently sharing data between different processes, potentially written in different programming languages. Arrow’s IPC mechanism is designed around two main file formats:

  • Stream Format: For sending an arbitrary length sequence of Arrow record batches (tables). The stream format is useful for real-time data exchange where the size of the data is not known upfront and can grow indefinitely.
  • File (or Feather) Format: Optimized for storage and memory-mapped access, allowing for fast random access to different sections of the data. This format is ideal for scenarios where the entire dataset is available upfront and can be stored in a file system for repeated reads and writes.

Apache Arrow provides a columnar memory format for flat and hierarchical data, optimized for efficient data analytics. It can be used in Python through the pyarrow package. Here’s how you can use Arrow to read, manipulate, and export data, including a demonstration of storage savings.

First, ensure you have pyarrow installed on your computer (and preferably, in your current virtual environment):

pip install pyarrow

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames, optimized for speed and efficiency, particularly for IPC and data sharing between Python and R or Julia.

df.to_feather('data/nyccrashes_cleaned.feather')

# Compare the file sizes of the feather format and the CSV format
import os

# File paths
csv_file = 'data/nyccrashes_2024w0630_by20240916.csv'
feather_file = 'data/nyccrashes_cleaned.feather'

# Get file sizes in bytes
csv_size = os.path.getsize(csv_file)
feather_size = os.path.getsize(feather_file)

# Convert bytes to a more readable format (e.g., MB)
csv_size_mb = csv_size / (1024 * 1024)
feather_size_mb = feather_size / (1024 * 1024)

# Print the file sizes
print(f"CSV file size: {csv_size_mb:.2f} MB")
print(f"Feather file size: {feather_size_mb:.2f} MB")

Read the feather file back in:

dff = pd.read_feather("data/nyccrashes_cleaned.feather")
dff.shape
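The Feather file above exercises Arrow’s file format; the stream format described earlier takes only a few more lines. A minimal sketch, assuming pyarrow is installed and df is the cleaned crash DataFrame:

import pyarrow as pa

# Convert the cleaned pandas DataFrame to an Arrow table
table = pa.Table.from_pandas(df)

# Write the table as a stream of record batches to an in-memory buffer
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, table.schema)
writer.write_table(table)
writer.close()

# Read the stream back; unlike the file format, no random access is available
reader = pa.ipc.open_stream(sink.getvalue())
table_back = reader.read_all()
print(table_back.num_rows)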

6.6 Using the Census Data

The US Census provides a lot of useful data that could be merged with the NYC crash data for further analytics.

First, ensure the DataFrame (df) is ready for merging with census data. Specifically, check that the zip_code column is clean and consistent.

import pandas as pd
df = pd.read_feather("data/nyccrashes_cleaned.feather")

valid_zip_df = df.dropna(subset=['zip_code']).copy()
valid_zip_df['zip_code'] = valid_zip_df['zip_code'].astype(int).astype(str).str.zfill(5)
unique_zips = valid_zip_df['zip_code'].unique()

We can use the uszipcode package to get basic demographic data for each zip code. For more detailed or specific census data, consider the CensusData package or direct calls to the Census Bureau’s API.

The uszipcode package provides a range of information about ZIP codes in the United States. When you query a ZIP code using uszipcode, you can access various attributes related to demographic data, housing, geographic location, and more. Here are some of the key variables available at the ZIP code level:

Demographic Information

  • population: The total population.
  • population_density: The population per square kilometer.
  • housing_units: The total number of housing units.
  • occupied_housing_units: The number of occupied housing units.
  • median_home_value: The median value of homes.
  • median_household_income: The median household income.
  • age_distribution: A breakdown of the population by age.

Geographic Information

  • zipcode: The ZIP code.
  • zipcode_type: The type of ZIP code (e.g., Standard, PO Box).
  • major_city: The major city associated with the ZIP code.
  • post_office_city: The city name recognized by the U.S. Postal Service.
  • common_city_list: A list of common city names for the ZIP code.
  • county: The county in which the ZIP code is located.
  • state: The state in which the ZIP code is located.
  • lat: The latitude of the approximate center of the ZIP code.
  • lng: The longitude of the approximate center of the ZIP code.
  • timezone: The timezone of the ZIP code.

Economic and Housing Data

  • land_area_in_sqmi: The land area in square miles.
  • water_area_in_sqmi: The water area in square miles.
  • occupancy_rate: The rate of occupancy for housing units.
  • median_age: The median age of the population.

Install the uszipcode package into the current virtual environment, if it has not been installed yet.

pip install uszipcode


We will first clean the zip_code column to ensure it only contains valid ZIP codes. Then, we will use a vectorized approach to fetch the required data for each unique zip code and merge this information back into the original DataFrame.

# Remove rows where 'zip_code' is missing or not a valid ZIP code format
valid_zip_df = df.dropna(subset=['zip_code']).copy()
valid_zip_df['zip_code'] = valid_zip_df['zip_code'].astype(str).str.zfill(5)

Since uszipcode doesn’t currently support vectorized operations for multiple ZIP code queries, we’ll optimize the process by querying each unique ZIP code once, then merging the results with the original DataFrame. This approach minimizes redundant queries for ZIP codes that appear multiple times.

from uszipcode import SearchEngine

# Initialize the SearchEngine
search = SearchEngine()

# Fetch median home value and median household income for each unique ZIP code
zip_data = []
for zip_code in unique_zips:
    result = search.by_zipcode(zip_code)
    if result:  # Check if the result is not None
        zip_data.append({
            "zip_code": zip_code,
            "median_home_value": result.median_home_value,
            "median_household_income": result.median_household_income
        })
    else:  # Handle the case where the result is None
        zip_data.append({
            "zip_code": zip_code,
            "median_home_value": None,
            "median_household_income": None
        })

# Convert to DataFrame
zip_info_df = pd.DataFrame(zip_data)

# Merge this info back into the original DataFrame based on 'zip_code'
merged_df = pd.merge(valid_zip_df, zip_info_df, how="left", on="zip_code")
merged_df.to_feather('data/nyccrashes_merged.feather')

merged_df.head()
(Wide output truncated: the first 5 rows of merged_df, now 30 columns. The original crash columns are followed by the two new columns median_home_value and median_household_income; for example, ZIP 11235 shows 531000.0 and 41639.0.)

5 rows × 30 columns

6.7 Extracting and Analyzing Census Tract Data

This section was authored by Mohammad Shahriyar Parvez, a Ph.D. scholar in the Department of Geography, Sustainability, and Urban Studies at the University of Connecticut. For any questions or difficulties in executing the content of this section, please feel free to reach out at shahriyar@uconn.edu.

This section explains how to access data from the US Decennial Census and the American Community Survey (ACS). Basic operations will be applied to calculate and map poverty rates in Storrs, CT. Data will be retrieved from the US Census Bureau’s 2022 American Community Survey (ACS) (refer to this page for the dataset).

# Import modules
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
from census import Census
from us import states
import os

6.7.1 Accessing the Census Data

6.7.1.1 Import census data

Importing census data into Python requires a Census API key. A key can be obtained from the Census API Key request page. You have to provide your institution name and email address. A unique 40-character API key will be sent to you. Keep track of this key and store it in a safe place.

# Set up the API key for accessing Census data
with open("CencusAPI.txt") as f:
    api_key = f.read().strip()
c = Census(api_key)

Once the Census API key is configured, access census data at the tract level for Connecticut using the 2022 ACS. Focus on these variables: total income-to-poverty ratio (C17002_001E), income less than 0.50 of the poverty level (C17002_002E), income between 0.50 and 0.99 of the poverty level (C17002_003E), and total population (B01003_001E). For details on why these variables are relevant to poverty analysis, refer to the Census Bureau’s guidelines on poverty measurement and the ACS variable documentation.

Use the census package’s convenience methods to pull data for different geographic levels. The FIPS code for Connecticut is 09, but if necessary, use the us library to identify FIPS codes for other states.

# Obtain Census variables from the ACS at the tract level for Storrs, CT
ct_census = c.acs5.state_county_tract(
    fields=('NAME', 'C17002_001E', 'C17002_002E', 'C17002_003E', 'B01003_001E'),
    state_fips=states.CT.fips,
    county_fips="*",
    tract="*",
    year=2022
)

With the data retrieved and assigned to a variable, load it into a DataFrame using the pandas library. This will allow for easy manipulation and analysis of the data.

# Create a dataframe from the census data
ct_df = pd.DataFrame(ct_census)

# Show the dataframe
print(ct_df.head(20))
print('Shape: ', ct_df.shape)
                                                 NAME  C17002_001E  \
0   Census Tract 4001.01; Capitol Planning Region;...       2916.0   
1   Census Tract 4001.02; Capitol Planning Region;...       4266.0   
2   Census Tract 4002; Capitol Planning Region; Co...       6213.0   
3   Census Tract 4003; Capitol Planning Region; Co...       6677.0   
4   Census Tract 4153; Capitol Planning Region; Co...       2615.0   
5   Census Tract 4154; Capitol Planning Region; Co...       6046.0   
6   Census Tract 4155; Capitol Planning Region; Co...       3245.0   
7   Census Tract 4156; Capitol Planning Region; Co...       4201.0   
8   Census Tract 4157; Capitol Planning Region; Co...       3284.0   
9   Census Tract 4158; Capitol Planning Region; Co...       2606.0   
10  Census Tract 4159; Capitol Planning Region; Co...       1427.0   
11  Census Tract 4160; Capitol Planning Region; Co...       4738.0   
12  Census Tract 4161; Capitol Planning Region; Co...       4995.0   
13  Census Tract 4162; Capitol Planning Region; Co...       2839.0   
14  Census Tract 4163; Capitol Planning Region; Co...       4533.0   
15  Census Tract 4164; Capitol Planning Region; Co...       3180.0   
16  Census Tract 4165; Capitol Planning Region; Co...       4364.0   
17  Census Tract 4166; Capitol Planning Region; Co...       3110.0   
18  Census Tract 4167; Capitol Planning Region; Co...       6364.0   
19  Census Tract 4168; Capitol Planning Region; Co...       3394.0   

    C17002_002E  C17002_003E  B01003_001E state county   tract  
0          41.0         22.0       2945.0    09    110  400101  
1          73.0        305.0       4266.0    09    110  400102  
2          55.0        136.0       6250.0    09    110  400200  
3         135.0        197.0       6688.0    09    110  400300  
4         465.0        360.0       2621.0    09    110  415300  
5         431.0        318.0       6046.0    09    110  415400  
6         408.0        353.0       3245.0    09    110  415500  
7         300.0        667.0       4303.0    09    110  415600  
8         134.0        169.0       3290.0    09    110  415700  
9         459.0        313.0       2775.0    09    110  415800  
10        131.0        405.0       1437.0    09    110  415900  
11        382.0        646.0       4738.0    09    110  416000  
12        509.0        992.0       5017.0    09    110  416100  
13        133.0        633.0       2911.0    09    110  416200  
14        373.0        796.0       4533.0    09    110  416300  
15         50.0        110.0       3185.0    09    110  416400  
16        350.0        180.0       4404.0    09    110  416500  
17        403.0        196.0       3171.0    09    110  416600  
18        424.0       1109.0       6481.0    09    110  416700  
19        401.0         13.0       3394.0    09    110  416800  
Shape:  (884, 8)

Displaying the DataFrame reveals 884 rows, indicating 884 census tracts, along with 8 columns. This structure provides a comprehensive view of the data across all selected tracts.

6.7.1.2 Import Shapefile

Read the Connecticut census tracts shapefile into Python and reproject it to the UTM Zone 18N projection. The shapefile is available for download from the Census Bureau’s website, specifically on the Cartographic Boundary Files page or the TIGER/Line Shapefiles page.

# Access shapefile of Connecticut census tracts
ct_tract = gpd.read_file("https://www2.census.gov/geo/tiger/TIGER2024/TRACT/"
                         "tl_2024_09_tract.zip")

# Reproject shapefile to UTM Zone 18N
ct_tract = ct_tract.to_crs(epsg=32618)

# Print GeoDataFrame of shapefile
print(ct_tract.head(2))
print('Shape: ', ct_tract.shape)
print("\nThe shapefile projection is: {}".format(ct_tract.crs))
  STATEFP COUNTYFP TRACTCE        GEOID               GEOIDFQ  NAME  \
0      09      110  535200  09110535200  1400000US09110535200  5352   
1      09      110  881200  09110881200  1400000US09110881200  8812   

            NAMELSAD  MTFCC FUNCSTAT     ALAND   AWATER     INTPTLAT  \
0  Census Tract 5352  G5020        S  43206583  1355239  +41.9294855   
1  Census Tract 8812  G5020        S   2273066    28584  +41.8089384   

       INTPTLON                                           geometry  
0  -072.4062911  POLYGON ((710429.8 4638466.647, 710437.611 463...  
1  -072.2516171  POLYGON ((727399.183 4633086.018, 727416.326 4...  
Shape:  (884, 14)

The shapefile projection is: EPSG:32618

Printing the shapefile confirms it also contains 884 rows, representing 884 census tracts. This matches the number of census records, which is a good start!

However, there’s a potential issue: while we have the census data and the shapefile of corresponding census tracts, they are currently stored in two separate variables (ct_df and ct_tract). This separation complicates mapping, as the datasets are not yet linked.

6.7.2 Performing Dataframe Operations

6.7.2.1 Create new column from old columns

To address this issue, join the two DataFrames using a shared field, or “key,” that is common to both datasets.

After reviewing the two datasets, the GEOID column in ct_tract and the combination of the state, county, and tract columns in ct_df can serve as a unique key for joining. Currently, these columns are not in a format that can be directly matched, so the state, county, and tract columns from ct_df will need to be merged into a single column to align with the GEOID column in ct_tract.

To create this new column—or access any existing column in a DataFrame—use indexing with [] and the column name as a string. (Alternatively, columns can be accessed by index position; see the pandas documentation on indexing and selecting data for more details.)

# Combine state, county, and tract columns together
ct_df["GEOID"] = ct_df["state"] + ct_df["county"] + ct_df["tract"]

# Print head of dataframe
ct_df.head(10)
NAME C17002_001E C17002_002E C17002_003E B01003_001E state county tract GEOID
0 Census Tract 4001.01; Capitol Planning Region;... 2916.0 41.0 22.0 2945.0 09 110 400101 09110400101
1 Census Tract 4001.02; Capitol Planning Region;... 4266.0 73.0 305.0 4266.0 09 110 400102 09110400102
2 Census Tract 4002; Capitol Planning Region; Co... 6213.0 55.0 136.0 6250.0 09 110 400200 09110400200
3 Census Tract 4003; Capitol Planning Region; Co... 6677.0 135.0 197.0 6688.0 09 110 400300 09110400300
4 Census Tract 4153; Capitol Planning Region; Co... 2615.0 465.0 360.0 2621.0 09 110 415300 09110415300
5 Census Tract 4154; Capitol Planning Region; Co... 6046.0 431.0 318.0 6046.0 09 110 415400 09110415400
6 Census Tract 4155; Capitol Planning Region; Co... 3245.0 408.0 353.0 3245.0 09 110 415500 09110415500
7 Census Tract 4156; Capitol Planning Region; Co... 4201.0 300.0 667.0 4303.0 09 110 415600 09110415600
8 Census Tract 4157; Capitol Planning Region; Co... 3284.0 134.0 169.0 3290.0 09 110 415700 09110415700
9 Census Tract 4158; Capitol Planning Region; Co... 2606.0 459.0 313.0 2775.0 09 110 415800 09110415800

By printing the first few rows of the DataFrame, the newly created GEOID column is visible, showing the combined values from the state, county, and tract columns. This new column now matches the format of the GEOID column in ct_tract, setting up the data for a successful join.

6.7.2.2 Remove dataframe columns that are no longer needed

To minimize clutter, delete the state, county, and tract columns from ct_df, as they’re no longer needed after creating the GEOID column. When modifying a DataFrame, reassign the modified DataFrame back to the original variable (or a new variable if preferred) to save the changes. Alternatively, use inplace=True within the drop function to apply changes directly without reassignment. For additional details, refer to the pandas documentation on drop.

# Remove columns
ct_df = ct_df.drop(columns = ["state", "county", "tract"])

# Show updated dataframe
ct_df.head(2)
NAME C17002_001E C17002_002E C17002_003E B01003_001E GEOID
0 Census Tract 4001.01; Capitol Planning Region;... 2916.0 41.0 22.0 2945.0 09110400101
1 Census Tract 4001.02; Capitol Planning Region;... 4266.0 73.0 305.0 4266.0 09110400102

6.7.2.3 Check column data types

Ensure that the key column (GEOID) in both DataFrames is of the same data type to allow for a successful join. Start by checking the data type of the GEOID column in each DataFrame. If the data types differ, adjust one or both columns as needed to make them consistent.

# Check column data types for census data
print(ct_df.dtypes)

# Check column data types for census shapefile
print(ct_tract.dtypes)
NAME            object
C17002_001E    float64
C17002_002E    float64
C17002_003E    float64
B01003_001E    float64
GEOID           object
dtype: object
STATEFP       object
COUNTYFP      object
TRACTCE       object
GEOID         object
GEOIDFQ       object
NAME          object
NAMELSAD      object
MTFCC         object
FUNCSTAT      object
ALAND          int64
AWATER         int64
INTPTLAT      object
INTPTLON      object
geometry    geometry
dtype: object
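Here both GEOID columns are already object (string) dtype, so no conversion is needed. If they ever disagreed (for example, if one were read in as integers), a cast would reconcile them; a one-line sketch:

# Cast both key columns to string so the join keys match
ct_df["GEOID"] = ct_df["GEOID"].astype(str)
ct_tract["GEOID"] = ct_tract["GEOID"].astype(str)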

6.7.2.4 Merge dataframes

Now it’s time to merge the two DataFrames using the GEOID columns as the primary key. Use the merge method in GeoPandas, applying it to the ct_tract shapefile dataset to combine the data based on the GEOID column.

# Join the attributes of the dataframes together
ct_merge = ct_tract.merge(ct_df, on = "GEOID")

# Show result
print(ct_merge.head(2))
print('Shape: ', ct_merge.shape)
  STATEFP COUNTYFP TRACTCE        GEOID               GEOIDFQ NAME_x  \
0      09      110  535200  09110535200  1400000US09110535200   5352   
1      09      110  881200  09110881200  1400000US09110881200   8812   

            NAMELSAD  MTFCC FUNCSTAT     ALAND   AWATER     INTPTLAT  \
0  Census Tract 5352  G5020        S  43206583  1355239  +41.9294855   
1  Census Tract 8812  G5020        S   2273066    28584  +41.8089384   

       INTPTLON                                           geometry  \
0  -072.4062911  POLYGON ((710429.8 4638466.647, 710437.611 463...   
1  -072.2516171  POLYGON ((727399.183 4633086.018, 727416.326 4...   

                                              NAME_y  C17002_001E  \
0  Census Tract 5352; Capitol Planning Region; Co...       6650.0   
1  Census Tract 8812; Capitol Planning Region; Co...        502.0   

   C17002_002E  C17002_003E  B01003_001E  
0        102.0        213.0       6664.0  
1        318.0         15.0      10590.0  
Shape:  (884, 19)

The merged DataFrame still has 884 rows, indicating that all (or nearly all) rows were matched correctly. The census data has now been appended to the shapefile data within the DataFrame.

Additional Notes on Joining DataFrames

  • The columns used as keys do not need to have the same name, as long as they contain matching data.
  • This join was a one-to-one relationship, where each attribute in one DataFrame matched exactly one attribute in the other. Joins with many-to-one, one-to-many, or many-to-many relationships are also possible, though they may require additional considerations. For more details, refer to the Esri ArcGIS documentation on joins and relates.

6.7.2.5 Subset dataframe

With the DataFrames merged, further clean up the data by removing unnecessary columns. Instead of using the drop method, select only the columns needed and create a new DataFrame containing just those selected columns. This approach streamlines the DataFrame and keeps only the relevant data for analysis.

# Create new dataframe from select columns
ct_poverty_tract = ct_merge[["STATEFP", "COUNTYFP", "TRACTCE", 
                            "GEOID", "geometry", "C17002_001E", 
                            "C17002_002E", "C17002_003E", 
                            "B01003_001E"]]

# Show dataframe
print(ct_poverty_tract.head(2))
print('Shape: ', ct_poverty_tract.shape)
  STATEFP COUNTYFP TRACTCE        GEOID  \
0      09      110  535200  09110535200   
1      09      110  881200  09110881200   

                                            geometry  C17002_001E  \
0  POLYGON ((710429.8 4638466.647, 710437.611 463...       6650.0   
1  POLYGON ((727399.183 4633086.018, 727416.326 4...        502.0   

   C17002_002E  C17002_003E  B01003_001E  
0        102.0        213.0       6664.0  
1        318.0         15.0      10590.0  
Shape:  (884, 9)

The number of columns has reduced from 19 to 9, streamlining the data for analysis.

6.7.2.6 Calculate Poverty Rates Using Column Math

To estimate the poverty rate, divide the sum of C17002_002E (ratio of income to poverty in the past 12 months, < 0.50) and C17002_003E (ratio of income to poverty in the past 12 months, 0.50 - 0.99) by B01003_001E (total population). This calculation provides the proportion of the population below the poverty line.

Note: C17002_001E (ratio of income to poverty in the past 12 months, total) should theoretically represent the entire population, but it does not exactly match B01003_001E (total population). This discrepancy is minor, so it will be disregarded for the purpose of this calculation.

# Get poverty rate and store values in new column
ct_poverty_tract["Poverty_Rate"] = (
    (ct_poverty_tract["C17002_002E"] + ct_poverty_tract["C17002_003E"])
    / ct_poverty_tract["B01003_001E"] * 100
)

# Show dataframe
ct_poverty_tract.head(2)
/Users/junyan/work/teaching/ids-f24/ids-f24/.ids-f24-venv/lib/python3.12/site-packages/geopandas/geodataframe.py:1819: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
STATEFP COUNTYFP TRACTCE GEOID geometry C17002_001E C17002_002E C17002_003E B01003_001E Poverty_Rate
0 09 110 535200 09110535200 POLYGON ((710429.8 4638466.647, 710437.611 463... 6650.0 102.0 213.0 6664.0 4.726891
1 09 110 881200 09110881200 POLYGON ((727399.183 4633086.018, 727416.326 4... 502.0 318.0 15.0 10590.0 3.144476
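The SettingWithCopyWarning appears because ct_poverty_tract was built by selecting columns from ct_merge, so pandas cannot tell whether the assignment writes to a view or a copy. One way to avoid it, shown here as a sketch that repeats the subsetting step with an explicit copy:

# Take an explicit copy so the new column is added to an independent frame
ct_poverty_tract = ct_merge[["STATEFP", "COUNTYFP", "TRACTCE",
                            "GEOID", "geometry", "C17002_001E",
                            "C17002_002E", "C17002_003E",
                            "B01003_001E"]].copy()

ct_poverty_tract["Poverty_Rate"] = (
    (ct_poverty_tract["C17002_002E"] + ct_poverty_tract["C17002_003E"])
    / ct_poverty_tract["B01003_001E"] * 100
)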

6.7.2.7 Plotting the Results

With the spatial component linked to the census data, the results can now be visualized on a map. Plot the data to display poverty rates across counties, taking advantage of the merged spatial information for a clear geographic representation of poverty distribution.

# Create subplots
fig, ax = plt.subplots(1, 1, figsize = (12, 8))

# Plot data
ct_poverty_tract.plot(column = "Poverty_Rate",
                       ax = ax,
                       cmap = "plasma",
                       legend = True)

# Stylize plots
plt.style.use('bmh')

# Set title
ax.set_title('Poverty Rates (%) in Connecticut', fontdict = {
    'fontsize': '25', 'fontweight' : '3'})
Text(0.5, 1.0, 'Poverty Rates (%) in Connecticut')
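To reuse the map outside the notebook, the figure can be written to disk; a one-line sketch with an illustrative file name:

# Save the choropleth to a PNG file (file name is illustrative)
fig.savefig("ct_poverty_rates.png", dpi=300, bbox_inches="tight")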