6  Data Manipulation

6.1 Introduction

Data manipulation is crucial for transforming raw data into a more analyzable format, essential for uncovering patterns and ensuring accurate analysis. This chapter introduces the core techniques for data manipulation in Python, utilizing the Pandas library, a cornerstone for data handling within Python’s data science toolkit.

Python’s ecosystem is rich with libraries that facilitate not just data manipulation but comprehensive data analysis. Pandas, in particular, provides extensive functionality for data manipulation tasks including reading, cleaning, transforming, and summarizing data. Using real-world datasets, we will explore how to leverage Python for practical data manipulation tasks.

By the end of this chapter, you will learn to:

  • Import/export data from/to diverse sources.
  • Clean and preprocess data efficiently.
  • Transform and aggregate data to derive insights.
  • Merge and concatenate datasets from various origins.
  • Analyze real-world datasets using these techniques.

6.2 Import/Export Data

This section was written by Deyu Xu, an MS student in Statistics at the time.

6.2.1 Summary

I divide the content into five parts. The first is exporting data to a .csv file; the second, importing common data formats; the third, importing data from other software; the fourth, viewing basic information about the imported data; and the last, finding null values.

6.2.2 Package Pandas

6.2.2.1 Import data based on Package Pandas

We need the Pandas package provided in Python's ecosystem to import data. The first step is to install Pandas. Python allows us to install different versions of Pandas; we can use the following code to install the commonly used version.

## install the common version of Pandas
pip install pandas

The code for installing the latest version is listed below.

## install the latest version of Pandas
pip install --upgrade pandas

Different versions can require slightly different code to achieve the same goal. We will see a specific example in the part on importing .xlsx files.
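Before relying on version-specific behavior, it is worth checking which version is installed; a quick check:

## Check the installed Pandas version
import pandas as pd
print(pd.__version__)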

6.2.3 Export the data to a .csv file

6.2.3.1 Import the cleaned crashes data

First, we need to import the data file named “nyccrashes_cleaned.feather”. A .feather file is a binary file format for storing and sharing data, especially suitable for large-scale data analysis and data science workflows. It is built on Apache Arrow’s columnar storage format and stores data in binary form, whose main advantage is fast reading and writing. We choose the read_feather function from Pandas to import the crashes data.

## Load the Pandas package
import pandas as pd
## Import the cleaned crashes data
## using the read_feather function from Pandas
## A relative path lets collaborators run the code smoothly
df_feather = pd.read_feather("data/nyccrashes_cleaned.feather")
## Show the top 5 rows of the data
## to confirm that the import succeeded
print(df_feather.head(5)) # Use the function "head()"
        crash_datetime    borough  zip_code  latitude  longitude  \
0  2024-06-30 17:30:00       None       NaN       NaN        NaN   
1  2024-06-30 00:32:00       None       NaN       NaN        NaN   
2  2024-06-30 07:05:00   BROOKLYN   11235.0  40.58106  -73.96744   
3  2024-06-30 20:47:00  MANHATTAN   10021.0  40.76363  -73.95330   
4  2024-06-30 10:14:00   BROOKLYN   11222.0  40.73046  -73.95149   

                location     on_street_name      cross_street_name  \
0             (0.0, 0.0)               None                   None   
1                   None  BELT PARKWAY RAMP                   None   
2  (40.58106, -73.96744)               None                   None   
3   (40.76363, -73.9533)          FDR DRIVE                   None   
4  (40.73046, -73.95149)  GREENPOINT AVENUE  MC GUINNESS BOULEVARD   

           off_street_name  number_of_persons_injured  ...  \
0              GOLD STREET                          0  ...   
1                     None                          0  ...   
2  2797      OCEAN PARKWAY                          0  ...   
3                     None                          0  ...   
4                     None                          0  ...   

   contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                    Unspecified                           None   
1                    Unspecified                           None   
2                           None                           None   
3                           None                           None   
4                    Unspecified                           None   

   contributing_factor_vehicle_4  contributing_factor_vehicle_5  collision_id  \
0                           None                           None       4736746   
1                           None                           None       4736768   
2                           None                           None       4737060   
3                           None                           None       4737510   
4                           None                           None       4736759   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   
2  Station Wagon/Sport Utility Vehicle                                 None   
3                                Sedan                                 None   
4                                  Bus                            Box Truck   

  vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5  
0                None                None                None  
1                None                None                None  
2                None                None                None  
3                None                None                None  
4                None                None                None  

[5 rows x 28 columns]
  • We have imported the cleaned crashes data successfully.
  • The function head(5) shows the top 5 rows of the data.

6.2.3.2 Export the crashes data to a .csv file

It is easy to export the data: the to_csv function from Pandas accomplishes this goal.

## Use the to_csv function from Pandas
## "df_feather" is the DataFrame holding the data
## Export the data to the default working directory
df_feather.to_csv("nyccrashes_cleaned.csv") # Give the CSV file a name
  • We can check whether the corresponding .csv file is generated in the default working directory.

We have exported the data to a .csv file in the default working directory.

We will use this .csv file later.
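Note that to_csv writes the DataFrame index as an extra column by default, which is why an “Unnamed: 0” column will appear when we re-import this file later. We keep the default here so the later outputs match, but a sketch of the alternative (the filename is just an example):

## Export without the DataFrame index
df_feather.to_csv("nyccrashes_cleaned_noindex.csv", index=False)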

6.2.4 Import files in common formats: .csv/.xlsx/.txt

6.2.4.1 .csv files

We are familiar with .csv files from using them to make charts in R in past courses. Now let us import the .csv file we just generated. We use the read_csv function from Pandas. The following code shows how to import it.

## Choose the Package Pandas
import pandas as pd
## Choose the function "read_csv"
## Add the relative address of the generated CSV file
df_csv = pd.read_csv("nyccrashes_cleaned.csv")
## Check the data we have imported
## Use the function "head()" introduced above
print(df_csv.head(2))
   Unnamed: 0       crash_datetime borough  zip_code  latitude  longitude  \
0           0  2024-06-30 17:30:00     NaN       NaN       NaN        NaN   
1           1  2024-06-30 00:32:00     NaN       NaN       NaN        NaN   

     location     on_street_name cross_street_name off_street_name  ...  \
0  (0.0, 0.0)                NaN               NaN     GOLD STREET  ...   
1         NaN  BELT PARKWAY RAMP               NaN             NaN  ...   

   contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                    Unspecified                            NaN   
1                    Unspecified                            NaN   

   contributing_factor_vehicle_4  contributing_factor_vehicle_5  collision_id  \
0                            NaN                            NaN       4736746   
1                            NaN                            NaN       4736768   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   

   vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5  
0                  NaN                 NaN                 NaN  
1                  NaN                 NaN                 NaN  

[2 rows x 29 columns]
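read_csv also accepts arguments that control parsing at import time. For example, parse_dates converts a column to datetime and dtype pins a column's type; a sketch (the variable name df_typed is our own choice):

## Parse the crash timestamp as datetime; keep zip codes as strings
df_typed = pd.read_csv("nyccrashes_cleaned.csv",
                       parse_dates=["crash_datetime"],
                       dtype={"zip_code": str})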

6.2.4.2 .xlsx files

We want to import .xlsx files, but we do not have a suitable one yet. We can convert the CSV file to a .xlsx file with the to_excel function from Pandas. Let us see how to achieve this goal with the following code.

## Choose the Package Pandas
import pandas as pd
## Use the function "to_excel"
## Export the data to the default working directory
df_csv.to_excel("nyccrashes_cleaned.xlsx") # Add the name of the Excel file
  • Check whether the corresponding .xlsx file is generated in the working directory

Now we have generated a .xlsx file containing the same data, so we can learn how to import .xlsx files. The function we use is read_excel, regardless of the Pandas version.

  • With a recent version of Pandas, the following code works.
import pandas as pd
## Choose the function "read_excel"
## Pass the "engine" argument to read the file smoothly
df_excel = pd.read_excel("nyccrashes_cleaned.xlsx", engine = "openpyxl")
## Print top 2 rows of the data
print(df_excel.head(2))
   Unnamed: 0.1  Unnamed: 0       crash_datetime borough  zip_code  latitude  \
0             0           0  2024-06-30 17:30:00     NaN       NaN       NaN   
1             1           1  2024-06-30 00:32:00     NaN       NaN       NaN   

   longitude    location     on_street_name cross_street_name  ...  \
0        NaN  (0.0, 0.0)                NaN               NaN  ...   
1        NaN         NaN  BELT PARKWAY RAMP               NaN  ...   

  contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                   Unspecified                            NaN   
1                   Unspecified                            NaN   

   contributing_factor_vehicle_4  contributing_factor_vehicle_5  collision_id  \
0                            NaN                            NaN       4736746   
1                            NaN                            NaN       4736768   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   

   vehicle_type_code_3  vehicle_type_code_4 vehicle_type_code_5  
0                  NaN                  NaN                 NaN  
1                  NaN                  NaN                 NaN  

[2 rows x 30 columns]

With older Pandas versions, the only adjustment needed is that the engine must be passed explicitly, as above; newer versions select openpyxl automatically for .xlsx files. Note that, unlike read_csv, current versions of read_excel take no encoding argument, because .xlsx is a binary format and Pandas handles text encoding internally.

6.2.4.3 .txt files

The last common file type is the .txt file. We can generate a .txt file in much the same way as the .xlsx file, again using the to_csv function from Pandas, this time with the argument sep="\t". We also pass index=False to avoid writing the DataFrame index. The specific code follows.

import pandas as pd
df_csv = pd.read_csv("nyccrashes_cleaned.csv")
## Choose the function "to_csv"
## Add the command "sep='\t'"
## Add the command "index=False"
## Export the data to the default working directory
df_csv.to_csv("nyccrashes_cleaned.txt", sep = "\t", index = False)

Now we have the corresponding .txt file. The next step is to determine its character encoding, because the file may not be read correctly with the wrong encoding. The code below obtains the encoding: we open the file inside a with statement and pass its contents to the detect function from the chardet package, which detects the character encoding of text files.

import chardet
## Use "with" statement 
with open("nyccrashes_cleaned.txt", "rb") as f:
    # Execute the command "open"
    # And then assign the result to variable "f"
    raw_data = f.read() # Read the content from "f"
    result = chardet.detect(raw_data) 
    encoding = result["encoding"]
    print(str(encoding))
ascii
  • Note: the detected encoding is not always “utf-8”. Here it is ASCII, which is a subset of UTF-8, so reading the file with encoding="utf-8" still works.

Now we have the .txt file and its encoding. The last step is to import the .txt file with the read_table function, passing the appropriate encoding. The corresponding code follows.

import pandas as pd
## Choose the function "read_table"
## Add the encoding behind the relative address
df_txt = pd.read_table("nyccrashes_cleaned.txt", encoding = "utf-8")
## The default of "head()" is the first five rows
print(df_txt.head())
   Unnamed: 0       crash_datetime    borough  zip_code  latitude  longitude  \
0           0  2024-06-30 17:30:00        NaN       NaN       NaN        NaN   
1           1  2024-06-30 00:32:00        NaN       NaN       NaN        NaN   
2           2  2024-06-30 07:05:00   BROOKLYN   11235.0  40.58106  -73.96744   
3           3  2024-06-30 20:47:00  MANHATTAN   10021.0  40.76363  -73.95330   
4           4  2024-06-30 10:14:00   BROOKLYN   11222.0  40.73046  -73.95149   

                location     on_street_name      cross_street_name  \
0             (0.0, 0.0)                NaN                    NaN   
1                    NaN  BELT PARKWAY RAMP                    NaN   
2  (40.58106, -73.96744)                NaN                    NaN   
3   (40.76363, -73.9533)          FDR DRIVE                    NaN   
4  (40.73046, -73.95149)  GREENPOINT AVENUE  MC GUINNESS BOULEVARD   

           off_street_name  ...  contributing_factor_vehicle_2  \
0              GOLD STREET  ...                    Unspecified   
1                      NaN  ...                    Unspecified   
2  2797      OCEAN PARKWAY  ...                            NaN   
3                      NaN  ...                            NaN   
4                      NaN  ...                    Unspecified   

   contributing_factor_vehicle_3  contributing_factor_vehicle_4  \
0                            NaN                            NaN   
1                            NaN                            NaN   
2                            NaN                            NaN   
3                            NaN                            NaN   
4                            NaN                            NaN   

   contributing_factor_vehicle_5  collision_id  \
0                            NaN       4736746   
1                            NaN       4736768   
2                            NaN       4737060   
3                            NaN       4737510   
4                            NaN       4736759   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   
2  Station Wagon/Sport Utility Vehicle                                  NaN   
3                                Sedan                                  NaN   
4                                  Bus                            Box Truck   

   vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5  
0                  NaN                 NaN                 NaN  
1                  NaN                 NaN                 NaN  
2                  NaN                 NaN                 NaN  
3                  NaN                 NaN                 NaN  
4                  NaN                 NaN                 NaN  

[5 rows x 29 columns]

6.2.5 Import the data from other software

6.2.5.1 SAS files

6.2.5.1.1 Transfer the .csv file to a .xpt file

We choose the .xpt format because it keeps data types consistent during conversion. First, we must make sure the column names contain no spaces; otherwise the export will fail. The following code drops the auto-generated “Unnamed” index column and replaces any spaces in column names with underscores.

import pandas as pd
df = pd.read_csv("nyccrashes_cleaned.csv")
df_csv_without_unnamed = df.loc[:, ~df.columns.str.contains("^Unnamed")]
df_csv_without_unnamed.columns=df_csv_without_unnamed.columns.str.replace(" ", "_")
df_csv_without_unnamed.to_csv("Without_Unamed.csv", index=False)

We use the pyreadstat package and its write_xport function to convert the .csv file. The corresponding code follows.

import pandas as pd
import pyreadstat
df_without_unnamed = pd.read_csv("Without_Unamed.csv")
sas_file = "SAS.xpt"
## Export the data to the default working directory
pyreadstat.write_xport(df_without_unnamed, "SAS.xpt")
6.2.5.1.2 Import the generated .xpt file

We use the package pyreadstat too. We choose the function read_xport to import the data. Here is the code.

import pyreadstat
## Define the Dataframe and metadata
df_1, meta = pyreadstat.read_xport("SAS.xpt")
## Show the Dataframe
print(df_1.head(2))
## Show the metadata
print(meta)
        crash_datetime borough  zip_code  latitude  longitude    location  \
0  2024-06-30 17:30:00               NaN       NaN        NaN  (0.0, 0.0)   
1  2024-06-30 00:32:00               NaN       NaN        NaN               

      on_street_name cross_street_name off_street_name  \
0                                          GOLD STREET   
1  BELT PARKWAY RAMP                                     

   number_of_persons_injured  ...  contributing_factor_vehicle_2  \
0                        0.0  ...                    Unspecified   
1                        0.0  ...                    Unspecified   

   contributing_factor_vehicle_3  contributing_factor_vehicle_4  \
0                                                                 
1                                                                 

   contributing_factor_vehicle_5  collision_id  \
0                                    4736746.0   
1                                    4736768.0   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   

  vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5  
0                                                              
1                                                              

[2 rows x 28 columns]
<pyreadstat._readstat_parser.metadata_container object at 0x10f3d3ce0>
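The metadata object printed above is more useful than its repr suggests. If memory serves, pyreadstat's metadata container exposes attributes such as column_names, number_rows, and number_columns; a hedged sketch:

## Inspect a few metadata attributes (attribute names per pyreadstat's docs)
print(meta.number_rows, meta.number_columns)
print(meta.column_names[:5])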

6.2.5.2 RData files (the suffix of this file is .RData)

6.2.5.2.1 Transfer the .csv file to a .Rdata file

We need to install the rpy2 package first.

pip install rpy2

Then we use the pandas2ri conversion interface. The following code achieves the goal.

import pandas as pd
## Use the Package rpy2
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
## Activate conversion between Pandas and R
pandas2ri.activate()
df_csv = pd.read_csv("nyccrashes_cleaned.csv")
## Transfer the Pandas DataFrame to R DataFrame
df_r = pandas2ri.py2rpy(df_csv)
## Save as .Rdata file
## Export the data to the default working directory
ro.globalenv["R"] = df_r
ro.r("save(R, file = 'nyccrashes_cleaned.Rdata')")
print("The CSV file has been transfered to a .Rdata file successfully.")
Loading custom .RprofileThe CSV file has been transfered to a .Rdata file successfully.
/Users/junyan/work/teaching/ids-f24/ids-f24/.ids-f24-venv/lib/python3.12/site-packages/rpy2/robjects/pandas2ri.py:65: UserWarning:

Error while trying to convert the column "borough". Fall back to string conversion. The error is: Series can only be of one type, or None (and here we have <class 'float'> and <class 'str'>). If happening with a pandas DataFrame the method infer_objects() will normalize data types before conversion.

(The same UserWarning is raised for "on_street_name", "cross_street_name", "contributing_factor_vehicle_3" through "contributing_factor_vehicle_5", and "vehicle_type_code_3" through "vehicle_type_code_5".)
  • The warnings mean: (1) type conversion failed for some columns — they were not converted to the expected R types and were coerced to strings instead; (2) the data is still saved, but with potential data type issues; (3) these warnings do not prevent importing the .Rdata file.

  • If you want to perform the necessary type conversion for a column yourself beforehand, the following code is suitable.

df_csv["borough"] = df_csv["borough"].astype(str)
6.2.5.2.2 Import the generated .Rdata file

We again use the rpy2 package and the pandas2ri interface. The code follows.

import pandas as pd
import rpy2.robjects as ro
## Load the .Rdata file
r_file_path = "nyccrashes_cleaned.Rdata"
ro.r["load"](r_file_path)
## View loaded variables
loaded_objects = ro.r("ls()")
## Show the loaded variables
print("Loaded R objects:", loaded_objects)
## We have set the name of dataframe as "R" above
r_dataframe = ro.r["R"]
from rpy2.robjects import pandas2ri
## Transfer R Dataframe to Pandas Dataframe 
## Aim to deal with the data conveniently
pandas2ri.activate()
df_2 = pandas2ri.rpy2py(r_dataframe)
print(df_2)
Loaded R objects: ['R']
      Unnamed: 0       crash_datetime    borough  zip_code   latitude  \
0              0  2024-06-30 17:30:00       None       NaN        NaN   
1              1  2024-06-30 00:32:00       None       NaN        NaN   
2              2  2024-06-30 07:05:00   BROOKLYN   11235.0  40.581060   
3              3  2024-06-30 20:47:00  MANHATTAN   10021.0  40.763630   
4              4  2024-06-30 10:14:00   BROOKLYN   11222.0  40.730460   
...          ...                  ...        ...       ...        ...   
1870        1870  2024-07-07 21:25:00      BRONX   10457.0  40.852520   
1871        1871  2024-07-07 10:31:00      BRONX   10460.0  40.843945   
1872        1872  2024-07-07 20:15:00     QUEENS   11436.0  40.677982   
1873        1873  2024-07-07 14:45:00      BRONX   10452.0  40.843822   
1874        1874  2024-07-07 14:12:00      BRONX   10468.0  40.861084   

      longitude                 location           on_street_name  \
0           NaN               (0.0, 0.0)                     None   
1           NaN                     None        BELT PARKWAY RAMP   
2    -73.967440    (40.58106, -73.96744)                     None   
3    -73.953300     (40.76363, -73.9533)                FDR DRIVE   
4    -73.951490    (40.73046, -73.95149)        GREENPOINT AVENUE   
...         ...                      ...                      ...   
1870 -73.900020    (40.85252, -73.90002)          EAST 180 STREET   
1871 -73.885800    (40.843945, -73.8858)                     None   
1872 -73.791214  (40.677982, -73.791214)        SUTPHIN BOULEVARD   
1873 -73.927500    (40.843822, -73.9275)  MAJOR DEEGAN EXPRESSWAY   
1874 -73.911490   (40.861084, -73.91149)                     None   

          cross_street_name            off_street_name  ...  \
0                      None                GOLD STREET  ...   
1                      None                       None  ...   
2                      None    2797      OCEAN PARKWAY  ...   
3                      None                       None  ...   
4     MC GUINNESS BOULEVARD                       None  ...   
...                     ...                        ...  ...   
1870                   None                       None  ...   
1871                   None  855       EAST 178 STREET  ...   
1872             120 AVENUE                       None  ...   
1873                   None                       None  ...   
1874                   None    2258      HAMPDEN PLACE  ...   

      contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                       Unspecified                           None   
1                       Unspecified                           None   
2                              None                           None   
3                              None                           None   
4                       Unspecified                           None   
...                             ...                            ...   
1870                    Unspecified                           None   
1871                    Unspecified                           None   
1872                    Unspecified                           None   
1873                    Unspecified                           None   
1874                           None                           None   

      contributing_factor_vehicle_4  contributing_factor_vehicle_5  \
0                              None                           None   
1                              None                           None   
2                              None                           None   
3                              None                           None   
4                              None                           None   
...                             ...                            ...   
1870                           None                           None   
1871                           None                           None   
1872                           None                           None   
1873                           None                           None   
1874                           None                           None   

      collision_id                  vehicle_type_code_1  \
0          4736746                                Sedan   
1          4736768  Station Wagon/Sport Utility Vehicle   
2          4737060  Station Wagon/Sport Utility Vehicle   
3          4737510                                Sedan   
4          4736759                                  Bus   
...            ...                                  ...   
1870       4744144                        Pick-up Truck   
1871       4744576  Station Wagon/Sport Utility Vehicle   
1872       4745391                                Sedan   
1873       4746540                                Sedan   
1874       4746320                                Sedan   

                      vehicle_type_code_2  vehicle_type_code_3  \
0                                   Sedan                 None   
1     Station Wagon/Sport Utility Vehicle                 None   
2                                    None                 None   
3                                    None                 None   
4                               Box Truck                 None   
...                                   ...                  ...   
1870                                Sedan                 None   
1871                                 None                 None   
1872                                Sedan                 None   
1873                                Sedan                 None   
1874                                 None                 None   

     vehicle_type_code_4 vehicle_type_code_5  
0                   None                None  
1                   None                None  
2                   None                None  
3                   None                None  
4                   None                None  
...                  ...                 ...  
1870                None                None  
1871                None                None  
1872                None                None  
1873                None                None  
1874                None                None  

[1875 rows x 29 columns]

6.2.5.3 Stata data (the suffix of this file is .dta)

6.2.5.3.1 Transfer the .csv file to a .dta file

Pandas alone suffices here. We choose the to_stata function to save the .dta file.

import pandas as pd
df_csv = pd.read_csv("nyccrashes_cleaned.csv")
## Export the data to the default working directory
df_csv.to_stata("stata.dta")
/var/folders/cq/5ysgnwfn7c3g0h46xyzvpj800000gn/T/ipykernel_70286/1977170459.py:4: InvalidColumnName:


Not all pandas column names were valid Stata variable names.
The following replacements have been made:

    Unnamed: 0   ->   Unnamed__0

If this is not what you expect, please make sure you have Stata-compliant
column names in your DataFrame (strings only, max 32 characters, only
alphanumerics and underscores, no Stata reserved words)
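The warning is harmless here, but it can be avoided by making the column names Stata-compliant before exporting; a sketch that simply drops the auto-generated index column:

## Drop the "Unnamed: 0" column so every name is Stata-compliant
df_stata = df_csv.loc[:, ~df_csv.columns.str.contains("^Unnamed")]
df_stata.to_stata("stata.dta", write_index=False)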

6.2.5.3.2 Import the .dta file

We use the function read_stata from Pandas. And here is the specific code.

import pandas as pd
df_3 = pd.read_stata("stata.dta")
print(df_3.head(2))
   index  Unnamed__0       crash_datetime borough  zip_code  latitude  \
0      0           0  2024-06-30 17:30:00               NaN       NaN   
1      1           1  2024-06-30 00:32:00               NaN       NaN   

   longitude    location     on_street_name cross_street_name  ...  \
0        NaN  (0.0, 0.0)                                       ...   
1        NaN              BELT PARKWAY RAMP                    ...   

  contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                   Unspecified                                  
1                   Unspecified                                  

   contributing_factor_vehicle_4  contributing_factor_vehicle_5  collision_id  \
0                                                                     4736746   
1                                                                     4736768   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   

   vehicle_type_code_3  vehicle_type_code_4 vehicle_type_code_5  
0                                                                
1                                                                

[2 rows x 30 columns]

6.2.5.4 SPSS data (the suffix of this file is .sav)

6.2.5.4.1 Transfer the .csv file to a .sav file

We use the pyreadstat package again, this time choosing its write_sav function. We use the CSV file whose column names contain no spaces. The following code works.

import pandas as pd
import pyreadstat
df_csv = pd.read_csv("Without_Unamed.csv")
## Export the data to the default working directory
pyreadstat.write_sav(df_csv, "SPSS.sav")
6.2.5.4.2 Import the generated .sav file

We also use the Package pyreadstat. We utilize the function read_sav from pyreadstat. The following code helps us import the .sav file.

import pandas as pd
import pyreadstat
df_4, meta = pyreadstat.read_sav("SPSS.sav")
print(df_4.head(2))
print(meta)
        crash_datetime borough  zip_code  latitude  longitude    location  \
0  2024-06-30 17:30:00               NaN       NaN        NaN  (0.0, 0.0)   
1  2024-06-30 00:32:00               NaN       NaN        NaN               

      on_street_name cross_street_name off_street_name  \
0                                          GOLD STREET   
1  BELT PARKWAY RAMP                                     

   number_of_persons_injured  ...  contributing_factor_vehicle_2  \
0                        0.0  ...                    Unspecified   
1                        0.0  ...                    Unspecified   

   contributing_factor_vehicle_3  contributing_factor_vehicle_4  \
0                                                                 
1                                                                 

   contributing_factor_vehicle_5  collision_id  \
0                                    4736746.0   
1                                    4736768.0   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   

  vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5  
0                                                              
1                                                              

[2 rows x 28 columns]
<pyreadstat._readstat_parser.metadata_container object at 0x10971d760>

6.2.5.5 MATLAB files (the suffix of this file is .mat)

6.2.5.5.1 Transfer the .csv file to a .mat file

The scipy.io module is part of the SciPy package, which we install first.

pip install scipy

Then we choose the savemat function from scipy.io. The specific code follows.

import pandas as pd
from scipy.io import savemat
df_csv = pd.read_csv("nyccrashes_cleaned.csv")
## Convert the DataFrame to dictionary form,
## since savemat requires a dictionary
data_dict = {"data": df_csv.to_dict("list")}
## Save the dictionary as a .mat file
## Export the data to the default working directory
savemat("MATLAB.mat", data_dict)
6.2.5.5.2 Import the generated .mat file

We use the scipy.io module again, choosing the loadmat function. The corresponding code follows.

from scipy.io import loadmat
## Load the .mat file into a dictionary
df_5 = loadmat("MATLAB.mat")
## Show the data keys
print(df_5.keys())
## Show the contents of the "data" key
print(df_5["data"])
dict_keys(['__header__', '__version__', '__globals__', 'data'])
[[(array([[   0,    1,    2, ..., 1872, 1873, 1874]]), array(['2024-06-30 17:30:00', '2024-06-30 00:32:00',
         '2024-06-30 07:05:00', ..., '2024-07-07 20:15:00',
         '2024-07-07 14:45:00', '2024-07-07 14:12:00'], dtype='<U19'), array(['nan                             ',
         'nan                             ',
         'BROOKLYN                        ', ...,
         'QUEENS                          ',
         'BRONX                           ',
         'BRONX                           '], dtype='<U32'), array([[   nan,    nan, 11235., ..., 11436., 10452., 10468.]]), array([[      nan,       nan, 40.58106 , ..., 40.677982, 40.843822,
          40.861084]]), array([[       nan,        nan, -73.96744 , ..., -73.791214, -73.9275  ,
          -73.91149 ]]), array(['(0.0, 0.0)                      ',
         'nan                             ',
         '(40.58106, -73.96744)           ', ...,
         '(40.677982, -73.791214)         ',
         '(40.843822, -73.9275)           ',
         '(40.861084, -73.91149)          '], dtype='<U32'), array(['nan                             ',
         'BELT PARKWAY RAMP               ',
         'nan                             ', ...,
         'SUTPHIN BOULEVARD               ',
         'MAJOR DEEGAN EXPRESSWAY         ',
         'nan                             '], dtype='<U32'), array(['nan                             ',
         'nan                             ',
         'nan                             ', ...,
         '120 AVENUE                      ',
         'nan                             ',
         'nan                             '], dtype='<U32'), array(['GOLD STREET                        ',
         'nan                                ',
         '2797      OCEAN PARKWAY            ', ...,
         'nan                                ',
         'nan                                ',
         '2258      HAMPDEN PLACE            '], dtype='<U35'), array([[0, 0, 0, ..., 1, 0, 0]]), array([[0, 0, 0, ..., 0, 0, 0]]), array([[0, 0, 0, ..., 0, 0, 0]]), array([[0, 0, 0, ..., 0, 0, 0]]), array([[0, 0, 0, ..., 0, 0, 0]]), array([[0, 0, 0, ..., 0, 0, 0]]), array([[0, 0, 0, ..., 1, 0, 0]]), array([[0, 0, 0, ..., 0, 0, 0]]), array(['Passing Too Closely                                  ',
         'Unspecified                                          ',
         'Unspecified                                          ', ...,
         'Passing or Lane Usage Improper                       ',
         'Driver Inexperience                                  ',
         'Unspecified                                          '],
        dtype='<U53'), array(['Unspecified                                          ',
         'Unspecified                                          ',
         'nan                                                  ', ...,
         'Unspecified                                          ',
         'Unspecified                                          ',
         'nan                                                  '],
        dtype='<U53'), array(['nan                             ',
         'nan                             ',
         'nan                             ', ...,
         'nan                             ',
         'nan                             ',
         'nan                             '], dtype='<U32'), array(['nan                             ',
         'nan                             ',
         'nan                             ', ...,
         'nan                             ',
         'nan                             ',
         'nan                             '], dtype='<U32'), array(['nan                             ',
         'nan                             ',
         'nan                             ', ...,
         'nan                             ',
         'nan                             ',
         'nan                             '], dtype='<U32'), array([[4736746, 4736768, 4737060, ..., 4745391, 4746540, 4746320]]), array(['Sedan                              ',
         'Station Wagon/Sport Utility Vehicle',
         'Station Wagon/Sport Utility Vehicle', ...,
         'Sedan                              ',
         'Sedan                              ',
         'Sedan                              '], dtype='<U35'), array(['Sedan                              ',
         'Station Wagon/Sport Utility Vehicle',
         'nan                                ', ...,
         'Sedan                              ',
         'Sedan                              ',
         'nan                                '], dtype='<U35'), array(['nan                                ',
         'nan                                ',
         'nan                                ', ...,
         'nan                                ',
         'nan                                ',
         'nan                                '], dtype='<U35'), array(['nan                                ',
         'nan                                ',
         'nan                                ', ...,
         'nan                                ',
         'nan                                ',
         'nan                                '], dtype='<U35'), array(['nan                                ',
         'nan                                ',
         'nan                                ', ...,
         'nan                                ',
         'nan                                ',
         'nan                                '], dtype='<U35'))                                                                                                                                                                                                                                                                                                                                                ]]
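The nested arrays above are awkward to work with. Recent SciPy releases offer a simplify_cells option for loadmat (added around SciPy 1.5, if memory serves) that unpacks the MATLAB struct into plain Python objects, after which a DataFrame can be rebuilt; a hedged sketch:

import pandas as pd
from scipy.io import loadmat
## simplify_cells unpacks the 1x1 struct into a dict of plain arrays
mat = loadmat("MATLAB.mat", simplify_cells=True)
df_mat = pd.DataFrame(mat["data"])
print(df_mat.shape)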

6.2.5.6 HDF5 files (the suffix of this file is .h5)

6.2.5.6.1 Transfer the .csv file to a .h5 file

Pandas alone suffices here, backed by the PyTables package (imported as tables) for HDF5 support. The to_hdf function achieves the goal. The code follows.

import pandas as pd
import tables
df_csv = pd.read_csv("nyccrashes_cleaned.csv")
## Export the data to the default working directory
df_csv.to_hdf("HDF5.h5", key = "data", mode = "w")
/var/folders/cq/5ysgnwfn7c3g0h46xyzvpj800000gn/T/ipykernel_70286/3411576893.py:5: PerformanceWarning:


your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->Index(['crash_datetime', 'borough', 'location', 'on_street_name',
       'cross_street_name', 'off_street_name', 'contributing_factor_vehicle_1',
       'contributing_factor_vehicle_2', 'contributing_factor_vehicle_3',
       'contributing_factor_vehicle_4', 'contributing_factor_vehicle_5',
       'vehicle_type_code_1', 'vehicle_type_code_2', 'vehicle_type_code_3',
       'vehicle_type_code_4', 'vehicle_type_code_5'],
      dtype='object')]

6.2.5.6.2 Import the generated .h5 file

We again use only Pandas, this time the read_hdf function. The code for importing the .h5 file follows.

import pandas as pd
df_6 = pd.read_hdf("HDF5.h5", key = "data")
print(df_6.head(2))
   Unnamed: 0       crash_datetime borough  zip_code  latitude  longitude  \
0           0  2024-06-30 17:30:00     NaN       NaN       NaN        NaN   
1           1  2024-06-30 00:32:00     NaN       NaN       NaN        NaN   

     location     on_street_name cross_street_name off_street_name  ...  \
0  (0.0, 0.0)                NaN               NaN     GOLD STREET  ...   
1         NaN  BELT PARKWAY RAMP               NaN             NaN  ...   

   contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                    Unspecified                            NaN   
1                    Unspecified                            NaN   

   contributing_factor_vehicle_4  contributing_factor_vehicle_5  collision_id  \
0                            NaN                            NaN       4736746   
1                            NaN                            NaN       4736768   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   

   vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5  
0                  NaN                 NaN                 NaN  
1                  NaN                 NaN                 NaN  

[2 rows x 29 columns]

6.2.5.7 Import multiple files and merge them into a new file

So far we have imported one file at a time. Python also allows us to import multiple files simultaneously. We use the glob module together with Pandas; glob is part of Python’s standard library, so no installation is needed.

The glob module finds files and directories that match a specified pattern. Its glob function returns all matching file paths as a list. The corresponding code follows.

## Use the glob module and the Pandas package
import glob 
import pandas as pd
## Find all CSV files in the working directory
## * matches any number of characters (including none)
all_files = glob.glob("*.csv") 
## Create a list to store the data
all_data = []
## Use "for" statement to import all of the csv files
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    all_data.append(df)
## Concatenate the DataFrames into one along the row axis
data_merge = pd.concat(all_data, axis=0, ignore_index=True)
## Check the result
print(data_merge.head(2))
   Unnamed: 0       crash_datetime borough  zip_code  latitude  longitude  \
0         0.0  2024-06-30 17:30:00     NaN       NaN       NaN        NaN   
1         1.0  2024-06-30 00:32:00     NaN       NaN       NaN        NaN   

     location     on_street_name cross_street_name off_street_name  ...  \
0  (0.0, 0.0)                NaN               NaN     GOLD STREET  ...   
1         NaN  BELT PARKWAY RAMP               NaN             NaN  ...   

   contributing_factor_vehicle_2  contributing_factor_vehicle_3  \
0                    Unspecified                            NaN   
1                    Unspecified                            NaN   

   contributing_factor_vehicle_4  contributing_factor_vehicle_5  collision_id  \
0                            NaN                            NaN       4736746   
1                            NaN                            NaN       4736768   

                   vehicle_type_code_1                  vehicle_type_code_2  \
0                                Sedan                                Sedan   
1  Station Wagon/Sport Utility Vehicle  Station Wagon/Sport Utility Vehicle   

   vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5  
0                  NaN                 NaN                 NaN  
1                  NaN                 NaN                 NaN  

[2 rows x 29 columns]
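When concatenating many files, it often helps to record which file each row came from. A small sketch adding that bookkeeping inside the loop (the column name source_file is our own choice):

## Import all CSV files, tagging each row with its source file
all_data = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    df["source_file"] = filename  # remember where the row came from
    all_data.append(df)
data_merge = pd.concat(all_data, axis=0, ignore_index=True)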

6.2.6 View data information

It is natural to be curious about the fundamental characteristics of the data we have just imported, so here are some useful functions and attributes for getting basic knowledge of the data.

The following code tells us how large the data is, using the shape attribute of the DataFrame.

## How large is the crashes data
df_csv.shape
(1875, 29)
  • There are 1875 rows and 29 columns in the file.

The following code checks the type of each variable in the data, using the dtypes attribute from Pandas.

## Show all types of the crashes' variables
df_csv.dtypes
Unnamed: 0                         int64
crash_datetime                    object
borough                           object
zip_code                         float64
latitude                         float64
longitude                        float64
location                          object
on_street_name                    object
cross_street_name                 object
off_street_name                   object
number_of_persons_injured          int64
number_of_persons_killed           int64
number_of_pedestrians_injured      int64
number_of_pedestrians_killed       int64
number_of_cyclist_injured          int64
number_of_cyclist_killed           int64
number_of_motorist_injured         int64
number_of_motorist_killed          int64
contributing_factor_vehicle_1     object
contributing_factor_vehicle_2     object
contributing_factor_vehicle_3     object
contributing_factor_vehicle_4     object
contributing_factor_vehicle_5     object
collision_id                       int64
vehicle_type_code_1               object
vehicle_type_code_2               object
vehicle_type_code_3               object
vehicle_type_code_4               object
vehicle_type_code_5               object
dtype: object
  • Pandas has listed the 29 variables and their corresponding types.

The following code is suitable for viewing overall data information. We use the info method from Pandas.

df_csv.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1875 entries, 0 to 1874
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Unnamed: 0                     1875 non-null   int64  
 1   crash_datetime                 1875 non-null   object 
 2   borough                        1749 non-null   object 
 3   zip_code                       1749 non-null   float64
 4   latitude                       1722 non-null   float64
 5   longitude                      1722 non-null   float64
 6   location                       1729 non-null   object 
 7   on_street_name                 1329 non-null   object 
 8   cross_street_name              943 non-null    object 
 9   off_street_name                546 non-null    object 
 10  number_of_persons_injured      1875 non-null   int64  
 11  number_of_persons_killed       1875 non-null   int64  
 12  number_of_pedestrians_injured  1875 non-null   int64  
 13  number_of_pedestrians_killed   1875 non-null   int64  
 14  number_of_cyclist_injured      1875 non-null   int64  
 15  number_of_cyclist_killed       1875 non-null   int64  
 16  number_of_motorist_injured     1875 non-null   int64  
 17  number_of_motorist_killed      1875 non-null   int64  
 18  contributing_factor_vehicle_1  1864 non-null   object 
 19  contributing_factor_vehicle_2  1425 non-null   object 
 20  contributing_factor_vehicle_3  174 non-null    object 
 21  contributing_factor_vehicle_4  52 non-null     object 
 22  contributing_factor_vehicle_5  14 non-null     object 
 23  collision_id                   1875 non-null   int64  
 24  vehicle_type_code_1            1842 non-null   object 
 25  vehicle_type_code_2            1230 non-null   object 
 26  vehicle_type_code_3            162 non-null    object 
 27  vehicle_type_code_4            48 non-null     object 
 28  vehicle_type_code_5            14 non-null     object 
dtypes: float64(3), int64(10), object(16)
memory usage: 424.9+ KB
  • The basic information of crashes data has been listed.

The function describe from Pandas helps generate descriptive statistics.

## Show the basic descriptive statistics of crashes data
df_csv.describe()
       Unnamed: 0      zip_code     latitude    longitude  \
count  1875.000000   1749.000000  1722.000000  1722.000000
mean    937.000000  10892.563179    40.719287   -73.919898
std     541.410196    525.579066     0.081315     0.085191
min       0.000000  10000.000000    40.513510   -74.237366
25%     468.500000  10455.000000    40.662752   -73.968543
50%     937.000000  11208.000000    40.712778   -73.922933
75%    1405.500000  11239.000000    40.767641   -73.869405
max    1874.000000  11694.000000    40.907246   -73.702190

       number_of_persons_injured  number_of_persons_killed  \
count                1875.000000               1875.000000
mean                    0.617067                  0.004267
std                     0.915610                  0.103219
min                     0.000000                  0.000000
25%                     0.000000                  0.000000
50%                     0.000000                  0.000000
75%                     1.000000                  0.000000
max                    11.000000                  4.000000

       number_of_pedestrians_injured  number_of_pedestrians_killed  \
count                    1875.000000                   1875.000000
mean                        0.093333                      0.002667
std                         0.338452                      0.095207
min                         0.000000                      0.000000
25%                         0.000000                      0.000000
50%                         0.000000                      0.000000
75%                         0.000000                      0.000000
max                         7.000000                      4.000000

       number_of_cyclist_injured  number_of_cyclist_killed  \
count                1875.000000                    1875.0
mean                    0.065067                       0.0
std                     0.246709                       0.0
min                     0.000000                       0.0
25%                     0.000000                       0.0
50%                     0.000000                       0.0
75%                     0.000000                       0.0
max                     1.000000                       0.0

       number_of_motorist_injured  number_of_motorist_killed  collision_id
count                 1875.000000                1875.000000  1.875000e+03
mean                     0.433067                   0.001600  4.738587e+06
std                      0.891185                   0.039979  1.659947e+03
min                      0.000000                   0.000000  4.736561e+06
25%                      0.000000                   0.000000  4.737666e+06
50%                      0.000000                   0.000000  4.738257e+06
75%                      1.000000                   0.000000  4.738883e+06
max                     11.000000                   1.000000  4.752444e+06
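By default, describe summarizes only the numeric columns. Passing include="object" extends it to the text columns (and include="all" covers both):

## Descriptive statistics for the non-numeric columns
df_csv.describe(include="object")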

If we want to see the names of all columns, the columns attribute from Pandas is a good choice.

## Get all columns' names of crashes data
df_csv.columns
Index(['Unnamed: 0', 'crash_datetime', 'borough', 'zip_code', 'latitude',
       'longitude', 'location', 'on_street_name', 'cross_street_name',
       'off_street_name', 'number_of_persons_injured',
       'number_of_persons_killed', 'number_of_pedestrians_injured',
       'number_of_pedestrians_killed', 'number_of_cyclist_injured',
       'number_of_cyclist_killed', 'number_of_motorist_injured',
       'number_of_motorist_killed', 'contributing_factor_vehicle_1',
       'contributing_factor_vehicle_2', 'contributing_factor_vehicle_3',
       'contributing_factor_vehicle_4', 'contributing_factor_vehicle_5',
       'collision_id', 'vehicle_type_code_1', 'vehicle_type_code_2',
       'vehicle_type_code_3', 'vehicle_type_code_4', 'vehicle_type_code_5'],
      dtype='object')

The tail function from Pandas allows us to view the last rows.

## Show the last 2 rows of crashes data
df_csv.tail(n = 2)
      Unnamed: 0       crash_datetime borough  zip_code   latitude  longitude  \
1873        1873  2024-07-07 14:45:00   BRONX   10452.0  40.843822  -73.92750
1874        1874  2024-07-07 14:12:00   BRONX   10468.0  40.861084  -73.91149

                     location           on_street_name cross_street_name  \
1873    (40.843822, -73.9275)  MAJOR DEEGAN EXPRESSWAY               NaN
1874   (40.861084, -73.91149)                      NaN               NaN

             off_street_name  ...  contributing_factor_vehicle_2  \
1873                     NaN  ...                    Unspecified
1874  2258      HAMPDEN PLACE  ...                           NaN

      contributing_factor_vehicle_3  contributing_factor_vehicle_4  \
1873                            NaN                            NaN
1874                            NaN                            NaN

      contributing_factor_vehicle_5  collision_id vehicle_type_code_1  \
1873                            NaN       4746540               Sedan
1874                            NaN       4746320               Sedan

     vehicle_type_code_2 vehicle_type_code_3 vehicle_type_code_4  \
1873               Sedan                 NaN                 NaN
1874                 NaN                 NaN                 NaN

     vehicle_type_code_5
1873                 NaN
1874                 NaN

[2 rows x 29 columns]

The unique function from Pandas returns the unique values of one column.

## Show the unique values in the column named "crash_datetime"
df_csv["crash_datetime"].unique()
array(['2024-06-30 17:30:00', '2024-06-30 00:32:00',
       '2024-06-30 07:05:00', ..., '2024-07-07 09:13:00',
       '2024-07-07 10:31:00', '2024-07-07 14:12:00'], dtype=object)

We can also get all values of one column (without deduplication) by using the values attribute from Pandas instead of unique.

## Show all values of the column named "crash_datetime"
df_csv["crash_datetime"].values
array(['2024-06-30 17:30:00', '2024-06-30 00:32:00',
       '2024-06-30 07:05:00', ..., '2024-07-07 20:15:00',
       '2024-07-07 14:45:00', '2024-07-07 14:12:00'], dtype=object)
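A related and often more informative view is value_counts, which tallies how many times each value occurs in a column:

## Count how many crashes fall in each borough
df_csv["borough"].value_counts()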

6.2.7 Find Null Values

It is necessary for us to find null values before we clean and preprocess the data. The following content covers how to find null values.

6.2.7.1 Determine whether there are missing values.

We first use the isnull function, whose aim is to detect missing values in the data, and then chain any to determine whether any missing values exist.

  • Determine whether there are missing values in columns
## Determine whether there are missing values in columns of crashes data
df_csv.isnull().any(axis = 0) # "axis=0" means columns
Unnamed: 0                       False
crash_datetime                   False
borough                           True
zip_code                          True
latitude                          True
longitude                         True
location                          True
on_street_name                    True
cross_street_name                 True
off_street_name                   True
number_of_persons_injured        False
number_of_persons_killed         False
number_of_pedestrians_injured    False
number_of_pedestrians_killed     False
number_of_cyclist_injured        False
number_of_cyclist_killed         False
number_of_motorist_injured       False
number_of_motorist_killed        False
contributing_factor_vehicle_1     True
contributing_factor_vehicle_2     True
contributing_factor_vehicle_3     True
contributing_factor_vehicle_4     True
contributing_factor_vehicle_5     True
collision_id                     False
vehicle_type_code_1               True
vehicle_type_code_2               True
vehicle_type_code_3               True
vehicle_type_code_4               True
vehicle_type_code_5               True
dtype: bool
  • Determine whether there are missing values in rows.
## Determine whether there are missing values in rows of crashes data
df_csv.isnull().any(axis = 1) # "axis=1" means rows
0       True
1       True
2       True
3       True
4       True
        ... 
1870    True
1871    True
1872    True
1873    True
1874    True
Length: 1875, dtype: bool

6.2.7.2 Locate the missing values of rows/columns

We utilize the loc indexer from Pandas, which selects or modifies data in a DataFrame or Series based on labels.

  • Locate the missing values of rows
## Locate the missing values in crashes data rows
df_csv.loc[df_csv.isnull().any(axis = 1)]
Unnamed: 0 crash_datetime borough zip_code latitude longitude location on_street_name cross_street_name off_street_name ... contributing_factor_vehicle_2 contributing_factor_vehicle_3 contributing_factor_vehicle_4 contributing_factor_vehicle_5 collision_id vehicle_type_code_1 vehicle_type_code_2 vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5
0 0 2024-06-30 17:30:00 NaN NaN NaN NaN (0.0, 0.0) NaN NaN GOLD STREET ... Unspecified NaN NaN NaN 4736746 Sedan Sedan NaN NaN NaN
1 1 2024-06-30 00:32:00 NaN NaN NaN NaN NaN BELT PARKWAY RAMP NaN NaN ... Unspecified NaN NaN NaN 4736768 Station Wagon/Sport Utility Vehicle Station Wagon/Sport Utility Vehicle NaN NaN NaN
2 2 2024-06-30 07:05:00 BROOKLYN 11235.0 40.581060 -73.967440 (40.58106, -73.96744) NaN NaN 2797 OCEAN PARKWAY ... NaN NaN NaN NaN 4737060 Station Wagon/Sport Utility Vehicle NaN NaN NaN NaN
3 3 2024-06-30 20:47:00 MANHATTAN 10021.0 40.763630 -73.953300 (40.76363, -73.9533) FDR DRIVE NaN NaN ... NaN NaN NaN NaN 4737510 Sedan NaN NaN NaN NaN
4 4 2024-06-30 10:14:00 BROOKLYN 11222.0 40.730460 -73.951490 (40.73046, -73.95149) GREENPOINT AVENUE MC GUINNESS BOULEVARD NaN ... Unspecified NaN NaN NaN 4736759 Bus Box Truck NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1870 1870 2024-07-07 21:25:00 BRONX 10457.0 40.852520 -73.900020 (40.85252, -73.90002) EAST 180 STREET NaN NaN ... Unspecified NaN NaN NaN 4744144 Pick-up Truck Sedan NaN NaN NaN
1871 1871 2024-07-07 10:31:00 BRONX 10460.0 40.843945 -73.885800 (40.843945, -73.8858) NaN NaN 855 EAST 178 STREET ... Unspecified NaN NaN NaN 4744576 Station Wagon/Sport Utility Vehicle NaN NaN NaN NaN
1872 1872 2024-07-07 20:15:00 QUEENS 11436.0 40.677982 -73.791214 (40.677982, -73.791214) SUTPHIN BOULEVARD 120 AVENUE NaN ... Unspecified NaN NaN NaN 4745391 Sedan Sedan NaN NaN NaN
1873 1873 2024-07-07 14:45:00 BRONX 10452.0 40.843822 -73.927500 (40.843822, -73.9275) MAJOR DEEGAN EXPRESSWAY NaN NaN ... Unspecified NaN NaN NaN 4746540 Sedan Sedan NaN NaN NaN
1874 1874 2024-07-07 14:12:00 BRONX 10468.0 40.861084 -73.911490 (40.861084, -73.91149) NaN NaN 2258 HAMPDEN PLACE ... NaN NaN NaN NaN 4746320 Sedan NaN NaN NaN NaN

1875 rows × 29 columns
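
The column version is symmetric; as a brief sketch, passing the column-wise mask as the second argument of loc keeps only the columns that contain missing values.

## Locate the columns of crashes data that contain missing values
df_csv.loc[:, df_csv.isnull().any(axis = 0)]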

6.2.7.3 Determine the number of missing values.

We also use the function isnull, but this time we chain the method sum rather than any.

## Calculate the number of missing values in crashes data columns
df_csv.isnull().sum(axis = 0)
Unnamed: 0                          0
crash_datetime                      0
borough                           126
zip_code                          126
latitude                          153
longitude                         153
location                          146
on_street_name                    546
cross_street_name                 932
off_street_name                  1329
number_of_persons_injured           0
number_of_persons_killed            0
number_of_pedestrians_injured       0
number_of_pedestrians_killed        0
number_of_cyclist_injured           0
number_of_cyclist_killed            0
number_of_motorist_injured          0
number_of_motorist_killed           0
contributing_factor_vehicle_1      11
contributing_factor_vehicle_2     450
contributing_factor_vehicle_3    1701
contributing_factor_vehicle_4    1823
contributing_factor_vehicle_5    1861
collision_id                        0
vehicle_type_code_1                33
vehicle_type_code_2               645
vehicle_type_code_3              1713
vehicle_type_code_4              1827
vehicle_type_code_5              1861
dtype: int64
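
By symmetry, a short sketch with the same df_csv: passing axis=1 counts the missing values in each row, and chaining sum twice yields the total number of missing cells.

## Calculate the number of missing values in each row
df_csv.isnull().sum(axis = 1)
## Calculate the total number of missing cells in the whole DataFrame
df_csv.isnull().sum().sum()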

6.3 SQL

This section was written by Thea Johnson, a senior in Statistics at the time.

6.3.1 Subsection 1

6.3.2 Subsection 2

6.3.3 Extended Reading

Give some references here.

6.4 NYC Crash Data

Consider a subset of the NYC Crash Data, which contains all NYC motor vehicle collision data, with documentation, from NYC Open Data. We downloaded the crash data for the week of June 30, 2024, on September 16, 2024, in CSV format.

import pandas as pd

# Load the dataset
file_path = 'data/nyccrashes_2024w0630_by20240916.csv'
df = pd.read_csv(file_path)

# Replace column names: convert to lowercase and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Display the first few rows of the dataset to understand its structure
df.head()
crash_date crash_time borough zip_code latitude longitude location on_street_name cross_street_name off_street_name ... contributing_factor_vehicle_2 contributing_factor_vehicle_3 contributing_factor_vehicle_4 contributing_factor_vehicle_5 collision_id vehicle_type_code_1 vehicle_type_code_2 vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5
0 06/30/2024 17:30 NaN NaN 0.00000 0.00000 (0.0, 0.0) NaN NaN GOLD STREET ... Unspecified NaN NaN NaN 4736746 Sedan Sedan NaN NaN NaN
1 06/30/2024 0:32 NaN NaN NaN NaN NaN BELT PARKWAY RAMP NaN NaN ... Unspecified NaN NaN NaN 4736768 Station Wagon/Sport Utility Vehicle Station Wagon/Sport Utility Vehicle NaN NaN NaN
2 06/30/2024 7:05 BROOKLYN 11235.0 40.58106 -73.96744 (40.58106, -73.96744) NaN NaN 2797 OCEAN PARKWAY ... NaN NaN NaN NaN 4737060 Station Wagon/Sport Utility Vehicle NaN NaN NaN NaN
3 06/30/2024 20:47 NaN NaN 40.76363 -73.95330 (40.76363, -73.9533) FDR DRIVE NaN NaN ... NaN NaN NaN NaN 4737510 Sedan NaN NaN NaN NaN
4 06/30/2024 10:14 BROOKLYN 11222.0 40.73046 -73.95149 (40.73046, -73.95149) GREENPOINT AVENUE MC GUINNESS BOULEVARD NaN ... Unspecified NaN NaN NaN 4736759 Bus Box Truck NaN NaN NaN

5 rows × 29 columns

Now we can do some cleaning after a quick browse.

# Replace invalid coordinates (latitude=0, longitude=0 or NaN) with NaN
df.loc[(df['latitude'] == 0) & (df['longitude'] == 0), 
       ['latitude', 'longitude']] = pd.NA
df['latitude'] = df['latitude'].replace(0, pd.NA)
df['longitude'] = df['longitude'].replace(0, pd.NA)

# Longitude/latitude don't need double precision
df['latitude'] = df['latitude'].astype('float32', errors='ignore')
df['longitude'] = df['longitude'].astype('float32', errors='ignore')

# Drop the redundant 'location' column
df = df.drop(columns=['location'])

# Converting 'crash_date' and 'crash_time' columns into a single datetime column
df['crash_datetime'] = pd.to_datetime(df['crash_date'] + ' ' 
                       + df['crash_time'], format='%m/%d/%Y %H:%M', errors='coerce')

# Drop the original 'crash_date' and 'crash_time' columns
df = df.drop(columns=['crash_date', 'crash_time'])

Do missing values in zip_code and borough always co-occur?

# Check if missing values in 'zip_code' and 'borough' always co-occur
# Count rows where both are missing
missing_cooccur = df[['zip_code', 'borough']].isnull().all(axis=1).sum()
# Count total missing in 'zip_code' and 'borough', respectively
total_missing_zip_code = df['zip_code'].isnull().sum()
total_missing_borough = df['borough'].isnull().sum()

# If missing in both columns always co-occur, the number of missing
# co-occurrences should be equal to the total missing in either column
missing_cooccur, total_missing_zip_code, total_missing_borough
(np.int64(541), np.int64(541), np.int64(541))

Are there cases where zip_code and borough are missing but the geo codes are not missing? If so, fill in zip_code and borough using the geo codes by reverse geocoding.
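
The check itself is short; here is a sketch, reusing the cleaned df from above:

# Rows where both zip_code and borough are missing but coordinates exist
both_missing = df[['zip_code', 'borough']].isnull().all(axis=1)
geo_present = df['latitude'].notnull() & df['longitude'].notnull()
(both_missing & geo_present).sum()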

First make sure geopy is installed.

pip install geopy

Now we use the Nominatim geocoder from the geopy package to reverse geocode.

from geopy.geocoders import Nominatim
import time

# Initialize the geocoder; the `user_agent` is your identifier 
# when using the service. Be mindful not to overwhelm the server
# with an unlimited number of queries, especially from invalid code.
geolocator = Nominatim(user_agent="jyGeopyTry")

We write a function to do the reverse geocoding given latitude and longitude.

# Function to fill missing zip_code
def get_zip_code(latitude, longitude):
    try:
        location = geolocator.reverse((latitude, longitude), timeout=10)
        if location:
            address = location.raw['address']
            zip_code = address.get('postcode', None)
            return zip_code
        else:
            return None
    except Exception as e:
        print(f"Error: {e} for coordinates {latitude}, {longitude}")
        return None
    finally:
        time.sleep(1)  # Delay to avoid overwhelming the service

Let’s try it out:

# Example usage
latitude = 40.730610
longitude = -73.935242
zip_code = get_zip_code(latitude, longitude)
print(zip_code)

The function get_zip_code can then be applied to the rows where the zip code is missing but the geocodes are not, to fill in the missing zip codes.

Once the zip code is known, figuring out the borough is simple because the valid zip codes from each borough are known. A sketch of both steps follows.
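
Here is a minimal sketch of both steps, assuming the cleaned df from above; the prefix-to-borough map below is an illustrative approximation based on common NYC ZIP-prefix conventions, not an official crosswalk.

# Fill missing zip codes where coordinates are available
mask = (df['zip_code'].isnull()
        & df['latitude'].notnull() & df['longitude'].notnull())
df.loc[mask, 'zip_code'] = df.loc[mask].apply(
    lambda row: get_zip_code(row['latitude'], row['longitude']), axis=1)

# Map ZIP prefixes to boroughs (illustrative, not an official crosswalk)
prefix_to_borough = {
    '100': 'MANHATTAN', '101': 'MANHATTAN', '102': 'MANHATTAN',
    '103': 'STATEN ISLAND', '104': 'BRONX',
    '110': 'QUEENS', '111': 'QUEENS', '112': 'BROOKLYN',
    '113': 'QUEENS', '114': 'QUEENS', '116': 'QUEENS',
}
df['borough'] = df['borough'].fillna(
    df['zip_code'].astype(str).str[:3].map(prefix_to_borough))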

6.5 Cross-platform Data Format Arrow

The CSV format (and related formats like TSV, tab-separated values) for data tables is ubiquitous, convenient, and can be read or written by many different data analysis environments, including spreadsheets. An advantage of the textual representation of the data in a CSV file is that the entire data table, or portions of it, can be previewed in a text editor. However, the textual representation can be ambiguous and inconsistent. The type of a particular column (Boolean, integer, floating-point, text, factor, etc.) must be inferred from the text representation, often at the expense of reading the entire file before these inferences can be made. Experienced data scientists are aware that a substantial part of an analysis or report generation is often the “data cleaning” involved in preparing the data for analysis. This can be an open-ended task; it took numerous trial-and-error iterations to create the list of different missing-data representations we use for the sample CSV file, and even now we are not sure we have them all.

To read and export data efficiently, leveraging the Apache Arrow library can significantly improve performance and storage efficiency, especially with large datasets. The IPC (Inter-Process Communication) file format in the context of Apache Arrow is a key component for efficiently sharing data between different processes, potentially written in different programming languages. Arrow’s IPC mechanism is designed around two main file formats:

  • Stream Format: For sending an arbitrary length sequence of Arrow record batches (tables). The stream format is useful for real-time data exchange where the size of the data is not known upfront and can grow indefinitely.
  • File (or Feather) Format: Optimized for storage and memory-mapped access, allowing for fast random access to different sections of the data. This format is ideal for scenarios where the entire dataset is available upfront and can be stored in a file system for repeated reads and writes.

Apache Arrow provides a columnar memory format for flat and hierarchical data, optimized for efficient data analytics. It can be used in Python through the pyarrow package. Here’s how you can use Arrow to read, manipulate, and export data, including a demonstration of storage savings.

First, ensure you have pyarrow installed on your computer (and preferably, in your current virtual environment):

pip install pyarrow
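
Before turning to Feather, here is a brief sketch of writing the two IPC formats with pyarrow; the output paths are illustrative, and df is the cleaned DataFrame from above.

import pyarrow as pa

# Convert the pandas DataFrame to an Arrow table
table = pa.Table.from_pandas(df)

# File format: supports random access; suited for storage
with pa.OSFile('data/nyccrashes.arrow', 'wb') as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Stream format: a sequence of record batches; suited for pipes and sockets
with pa.OSFile('data/nyccrashes.arrows', 'wb') as sink:
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)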

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames, optimized for speed and efficiency, particularly for IPC and data sharing between Python and R or Julia.

df.to_feather('data/nyccrashes_cleaned.feather')

# Compare the file sizes of the feather format and the CSV format
import os

# File paths
csv_file = 'data/nyccrashes_2024w0630_by20240916.csv'
feather_file = 'data/nyccrashes_cleaned.feather'

# Get file sizes in bytes
csv_size = os.path.getsize(csv_file)
feather_size = os.path.getsize(feather_file)

# Convert bytes to a more readable format (e.g., MB)
csv_size_mb = csv_size / (1024 * 1024)
feather_size_mb = feather_size / (1024 * 1024)

# Print the file sizes
print(f"CSV file size: {csv_size_mb:.2f} MB")
print(f"Feather file size: {feather_size_mb:.2f} MB")

Read the feather file back in:

dff = pd.read_feather("data/nyccrashes_cleaned.feather")
dff.shape
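
As a quick sanity check on the round trip (Feather stores column types explicitly, though Arrow may normalize some pandas dtypes):

# Confirm the reloaded frame matches what we wrote in shape and columns
assert dff.shape == df.shape
assert list(dff.columns) == list(df.columns)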

6.6 Using the Census Data

The US Census provides a lot of useful data that could be merged with the NYC crash data for further analytics.

First, ensure the DataFrame (df) is ready for merging with census data. Specifically, check that the zip_code column is clean and consistent.

import pandas as pd
df = pd.read_feather("data/nyccrashes_cleaned.feather")

valid_zip_df = df.dropna(subset=['zip_code']).copy()
valid_zip_df['zip_code'] = valid_zip_df['zip_code'].astype(int).astype(str).str.zfill(5)
unique_zips = valid_zip_df['zip_code'].unique()

We can use the uszipcode package to get basic demographic data for each zip code. For more detailed or specific census data, consider using the CensusData package or direct API calls to the Census Bureau’s API.

The uszipcode package provides a range of information about ZIP codes in the United States. When you query a ZIP code using uszipcode, you can access various attributes related to demographic data, housing, geographic location, and more. Here are some of the key variables available at the ZIP code level:

Demographic Information

  • population: The total population.
  • population_density: The population per square kilometer.
  • housing_units: The total number of housing units.
  • occupied_housing_units: The number of occupied housing units.
  • median_home_value: The median value of homes.
  • median_household_income: The median household income.
  • age_distribution: A breakdown of the population by age.

Geographic Information

  • zipcode: The ZIP code.
  • zipcode_type: The type of ZIP code (e.g., Standard, PO Box).
  • major_city: The major city associated with the ZIP code.
  • post_office_city: The city name recognized by the U.S. Postal Service.
  • common_city_list: A list of common city names for the ZIP code.
  • county: The county in which the ZIP code is located.
  • state: The state in which the ZIP code is located.
  • lat: The latitude of the approximate center of the ZIP code.
  • lng: The longitude of the approximate center of the ZIP code.
  • timezone: The timezone of the ZIP code.

Economic and Housing Data

  • land_area_in_sqmi: The land area in square miles.
  • water_area_in_sqmi: The water area in square miles.
  • occupancy_rate: The rate of occupancy for housing units.
  • median_age: The median age of the population.

Install the uszipcode package into the current virtual environment, if it has not been installed yet.

pip install uszipcode
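
As a quick sanity check of the package (a one-off query; the attribute names follow the list above):

from uszipcode import SearchEngine

search = SearchEngine()
res = search.by_zipcode("11222")  # Greenpoint, Brooklyn
print(res.median_home_value, res.median_household_income)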


We will first clean the zip_code column to ensure it only contains valid ZIP codes. Then, we will use a vectorized approach to fetch the required data for each unique zip code and merge this information back into the original DataFrame.

# Remove rows where 'zip_code' is missing or not a valid ZIP code format
valid_zip_df = df.dropna(subset=['zip_code']).copy()
valid_zip_df['zip_code'] = valid_zip_df['zip_code'].astype(int).astype(str).str.zfill(5)

Since uszipcode doesn’t currently support vectorized operations for multiple ZIP code queries, we’ll optimize the process by querying each unique ZIP code once, then merging the results with the original DataFrame. This approach minimizes redundant queries for ZIP codes that appear multiple times.

from uszipcode import SearchEngine

# Initialize the SearchEngine
search = SearchEngine()

# Fetch median home value and median household income for each unique ZIP code
zip_data = []
for zip_code in unique_zips:
    result = search.by_zipcode(zip_code)
    if result:  # Check if the result is not None
        zip_data.append({
            "zip_code": zip_code,
            "median_home_value": result.median_home_value,
            "median_household_income": result.median_household_income
        })
    else:  # Handle the case where the result is None
        zip_data.append({
            "zip_code": zip_code,
            "median_home_value": None,
            "median_household_income": None
        })

# Convert to DataFrame
zip_info_df = pd.DataFrame(zip_data)

# Merge this info back into the original DataFrame based on 'zip_code'
merged_df = pd.merge(valid_zip_df, zip_info_df, how="left", on="zip_code")

merged_df.head()
crash_datetime borough zip_code latitude longitude location on_street_name cross_street_name off_street_name number_of_persons_injured ... contributing_factor_vehicle_4 contributing_factor_vehicle_5 collision_id vehicle_type_code_1 vehicle_type_code_2 vehicle_type_code_3 vehicle_type_code_4 vehicle_type_code_5 median_home_value median_household_income
0 2024-06-30 07:05:00 BROOKLYN 11235 40.58106 -73.96744 (40.58106, -73.96744) None None 2797 OCEAN PARKWAY 0 ... None None 4737060 Station Wagon/Sport Utility Vehicle None None None None 531000.0 41639.0
1 2024-06-30 20:47:00 MANHATTAN 10021 40.76363 -73.95330 (40.76363, -73.9533) FDR DRIVE None None 0 ... None None 4737510 Sedan None None None None 1000001.0 107907.0
2 2024-06-30 10:14:00 BROOKLYN 11222 40.73046 -73.95149 (40.73046, -73.95149) GREENPOINT AVENUE MC GUINNESS BOULEVARD None 0 ... None None 4736759 Bus Box Truck None None None 726500.0 63739.0
3 2024-06-30 15:52:00 BRONX 10468 40.86685 -73.89597 (40.86685, -73.89597) None None 60 EAST KINGSBRIDGE ROAD 1 ... None None 4736781 Station Wagon/Sport Utility Vehicle Moped None None None 171200.0 33776.0
4 2024-06-30 16:30:00 BROOKLYN 11226 40.63969 -73.95321 (40.63969, -73.95321) NEWKIRK AVENUE EAST 25 STREET None 1 ... None None 4737299 Sedan Moped None None None 480500.0 40734.0

5 rows × 30 columns