import pandas as pd
import numpy as np
5 Data Manipulation with Pandas
This chapter was prepared by Garrick Ho and Ginamarie Mastrorilli.
Pandas is an open-source Python library for data analysis. It gives Python the ability to work with spreadsheet-like data, supporting fast data loading, manipulation, alignment, merging, and more. The name is derived from ‘panel data’, an econometrics term for multidimensional structured datasets.
Pandas introduces two new data types to Python: Series and DataFrame.
5.1 Series
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.
To create a Series, use pd.Series().
s = pd.Series([4, 7, -5, 3])
s
0 4
1 7
2 -5
3 3
dtype: int64
The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through n-1 (where n is the length of the data) was created.
s = pd.Series([4, 7, -5, 3], index = ['a', 'b', 'c', 'd'])
s
a 4
b 7
c -5
d 3
dtype: int64
s.values
array([ 4, 7, -5, 3])
s.index
Index(['a', 'b', 'c', 'd'], dtype='object')
5.1.1 Selecting a single value or a set of values using the index
To select a single value, use [ ].
s['b']
7
To select multiple values, pass a list of labels, separating each label with a comma inside double brackets.
s[['c', 'a', 'b']]
c -5
a 4
b 7
dtype: int64
Selecting by integer position is also possible.
s[2]
-5
To select a range of values, use a slice with :. The first index is where the slice starts and the second is where it stops; the stop index is not included.
s[1:3]
b 7
c -5
dtype: int64
Selecting multiple values by integer position works the same way: use double brackets.
s[[0,3]]
a 4
d 3
dtype: int64
5.1.2 Filtering
Filtering values in a Series can be done with comparison operators such as <, >, and ==.
s > 0
a True
b True
c False
d True
dtype: bool
s[s > 0]
a 4
b 7
d 3
dtype: int64
5.1.3 Math operations
Math functions can be applied to a Series.
s**2
a 16
b 49
c 25
d 9
dtype: int64
np.exp(s)
a 54.598150
b 1096.633158
c 0.006738
d 20.085537
dtype: float64
s.mean()
2.25
Series are aligned by index label in arithmetic operations.
s2 = pd.Series([1, 2, 3, 4], index = ['a', 'c', 'd', 'e'])
zero = s + s2
zero
a 5.0
b NaN
c -3.0
d 6.0
e NaN
dtype: float64
Note: “NaN” stands for a missing value in pandas.
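If the NaN values produced by non-overlapping labels are unwanted, the add method with its fill_value option treats a missing label as a given value instead. A minimal sketch with the same two Series:

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([1, 2, 3, 4], index=['a', 'c', 'd', 'e'])

# add() aligns on labels like +, but substitutes 0 for labels
# that are missing from one of the operands
total = s.add(s2, fill_value=0)
```

Labels b and e now carry their single observed value rather than NaN.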
5.1.4 Finding NaN values
To find all the missing values in a series, use .isnull().
zero.isnull()
a False
b True
c False
d False
e True
dtype: bool
To find all the non-missing values, use .notnull().
zero.notnull()
a True
b False
c True
d True
e False
dtype: bool
zero[zero.notnull()]
a 5.0
c -3.0
d 6.0
dtype: float64
5.1.5 Replacing NaN
To change NaN to 0, use .fillna().
zero.fillna(0)
a 5.0
b 0.0
c -3.0
d 6.0
e 0.0
dtype: float64
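fillna accepts any scalar, not just 0; a common choice is the mean of the non-missing values. A sketch that rebuilds the zero Series from above:

```python
import numpy as np
import pandas as pd

zero = pd.Series([5.0, np.nan, -3.0, 6.0, np.nan],
                 index=['a', 'b', 'c', 'd', 'e'])

# replace each NaN with the mean of the observed values
filled = zero.fillna(zero.mean())
```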
5.1.6 Forward-fill
Fill all the NaN values with the previous value in the series.
zero.fillna(method = 'ffill')
a 5.0
b 5.0
c -3.0
d 6.0
e 6.0
dtype: float64
5.1.7 Back-fill
Fill all the NaN values with the next value in the series.
Note that e is still NaN because there is no next value in the series.
zero.fillna(method = 'bfill')
a 5.0
b -3.0
c -3.0
d 6.0
e NaN
dtype: float64
5.1.8 Drop
To drop all NaN values, use .dropna().
zero.dropna()
a 5.0
c -3.0
d 6.0
dtype: float64
Notice that zero hasn’t changed at all. To apply the result to the original Series, reassign it to itself.
zero
a 5.0
b NaN
c -3.0
d 6.0
e NaN
dtype: float64
zero = zero.dropna()
zero
a 5.0
c -3.0
d 6.0
dtype: float64
Change the index of s to match s2 so there are no missing values.
s.index = ['a', 'c', 'd', 'e']
s + s2
a 5
c 9
d -2
e 7
dtype: int64
5.2 DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns. The DataFrame has both a row and column index.
Since each column of a DataFrame is essentially a Series with its column index, it can be thought of as a dictionary of Series all sharing the same index.
Each column (Series) has to be the same type, whereas, each row can contain mixed types.
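The dictionary-of-Series view can be made concrete: a DataFrame may be built directly from Series that share an index (the area numbers below are illustrative, not from the chapter's data):

```python
import pandas as pd

pop = pd.Series([1.5, 1.7], index=['Ohio', 'Nevada'])
area = pd.Series([116, 286], index=['Ohio', 'Nevada'])

# each dict value becomes a column; the shared index
# becomes the row index of the resulting DataFrame
states = pd.DataFrame({'pop': pop, 'area': area})
```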
5.2.1 Creating DataFrame
5.2.1.1 from a dict of equal-length lists
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
d = pd.DataFrame(data)
d
| state | year | pop | |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |
5.2.1.2 from an empty DataFrame
Starting with an empty DataFrame, columns can be added to it one at a time.
d1 = pd.DataFrame()
d1['state'] = ['Ohio', 'Nevada']
d1['year'] = [2001, 2001]
d1['pop'] = [1.7, 2.4]
d1
| state | year | pop | |
|---|---|---|---|
| 0 | Ohio | 2001 | 1.7 |
| 1 | Nevada | 2001 | 2.4 |
5.2.2 select columns
d
| state | year | pop | |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |
d['state']
0 Ohio
1 Ohio
2 Ohio
3 Nevada
4 Nevada
5 Nevada
Name: state, dtype: object
d[['state','pop']]
| state | pop | |
|---|---|---|
| 0 | Ohio | 1.5 |
| 1 | Ohio | 1.7 |
| 2 | Ohio | 3.6 |
| 3 | Nevada | 2.4 |
| 4 | Nevada | 2.9 |
| 5 | Nevada | 3.2 |
5.2.3 select rows
d2 = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
d2
| one | two | three | four | |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
d2.loc['Colorado': 'Utah']
| one | two | three | four | |
|---|---|---|---|---|
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
d2.iloc[1:3]
| one | two | three | four | |
|---|---|---|---|---|
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
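Note the difference in slice endpoints: a loc label slice includes the stop label, while an iloc integer slice excludes the stop position. A small check, rebuilding d2 as above:

```python
import numpy as np
import pandas as pd

d2 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
                  columns=['one', 'two', 'three', 'four'])

by_label = d2.loc['Colorado':'Utah']   # stop label 'Utah' is included
by_position = d2.iloc[1:3]             # stop position 3 is excluded
```

Both expressions select the same two rows, which is why the two tables above match.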
5.2.4 change row index and column name
Use .rename to rename any row or column.
d2.rename(index={'Colorado':'Connecticut'},columns={'one':'five'})
| five | two | three | four | |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Connecticut | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
Notice how d2 does not change.
d2
| one | two | three | four | |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
You can use inplace = True to change the original DataFrame.
d2.rename(index = {'Colorado':'Connecticut'}, columns = {'one':'five'}, inplace = True)
d2
| five | two | three | four | |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Connecticut | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
5.2.5 basic attributes and methods
d2.index
Index(['Ohio', 'Connecticut', 'Utah', 'New York'], dtype='object')
d2.columns
Index(['five', 'two', 'three', 'four'], dtype='object')
d2.values
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
d2.shape
(4, 4)
d2.mean()
five 6.0
two 7.0
three 8.0
four 9.0
dtype: float64
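By default mean() computes one value per column (axis=0); passing axis=1 computes row means instead. A minimal sketch reproducing the renamed d2:

```python
import numpy as np
import pandas as pd

d2 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=['Ohio', 'Connecticut', 'Utah', 'New York'],
                  columns=['five', 'two', 'three', 'four'])

col_means = d2.mean()        # one value per column (the default, axis=0)
row_means = d2.mean(axis=1)  # one value per row
```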
5.2.6 add and delete rows and columns
To drop a row or column, use .drop.
d2.drop(index = "Connecticut", columns = "five") # adding inplace=True would change the original DataFrame
| two | three | four | |
|---|---|---|---|
| Ohio | 1 | 2 | 3 |
| Utah | 9 | 10 | 11 |
| New York | 13 | 14 | 15 |
d2
| five | two | three | four | |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Connecticut | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
To delete a column, use del.
del d2['five']
d2
| two | three | four | |
|---|---|---|---|
| Ohio | 1 | 2 | 3 |
| Connecticut | 5 | 6 | 7 |
| Utah | 9 | 10 | 11 |
| New York | 13 | 14 | 15 |
d2['one'] = [1, 2, 3, 4]
d2
| two | three | four | one | |
|---|---|---|---|---|
| Ohio | 1 | 2 | 3 | 1 |
| Connecticut | 5 | 6 | 7 | 2 |
| Utah | 9 | 10 | 11 | 3 |
| New York | 13 | 14 | 15 | 4 |
.pop returns a column and removes it from the original DataFrame.
d2.pop('one')
Ohio 1
Connecticut 2
Utah 3
New York 4
Name: one, dtype: int64
d2
| two | three | four | |
|---|---|---|---|
| Ohio | 1 | 2 | 3 |
| Connecticut | 5 | 6 | 7 |
| Utah | 9 | 10 | 11 |
| New York | 13 | 14 | 15 |
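Adding a row was not shown above; assigning to a new label with .loc is one way to do it. A sketch using a fresh frame matching d2's final state:

```python
import pandas as pd

d2 = pd.DataFrame({'two': [1, 5, 9, 13],
                   'three': [2, 6, 10, 14],
                   'four': [3, 7, 11, 15]},
                  index=['Ohio', 'Connecticut', 'Utah', 'New York'])

# assigning to a label that does not exist yet appends a new row
d2.loc['Vermont'] = [17, 18, 19]
```

The 'Vermont' label here is hypothetical, chosen only to illustrate the assignment.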
5.2.7 Common methods
You can import datasets as well.
5.2.7.1 csv file
import pandas as pd
crashes = pd.read_csv("data/nyc_crashes_202301.csv")
sub_set_1 = crashes.iloc[0:35, 0:8]
5.2.7.2 Head and Tail
These two methods show the first and the last few records of a DataFrame; the default is 5.
sub_set_1.head()
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
|---|---|---|---|---|---|---|---|---|
| 0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
| 1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
| 2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
| 3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.91244 | (40.769737, -73.91244) | ASTORIA BOULEVARD |
| 4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.85072 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
sub_set_1.tail()
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
|---|---|---|---|---|---|---|---|---|
| 30 | 01/01/2023 | 17:05 | NaN | NaN | 40.795250 | -73.97321 | (40.79525, -73.97321) | WEST 96 STREET |
| 31 | 01/01/2023 | 5:45 | QUEENS | 11411.0 | 40.690640 | -73.72832 | (40.69064, -73.72832) | NaN |
| 32 | 01/01/2023 | 0:55 | NaN | NaN | 40.818470 | -73.94140 | (40.81847, -73.9414) | WEST 140 STREET |
| 33 | 01/01/2023 | 10:16 | QUEENS | 11369.0 | 40.764633 | -73.86537 | (40.764633, -73.86537) | NaN |
| 34 | 01/01/2023 | 3:45 | QUEENS | 11694.0 | 40.580510 | -73.84773 | (40.58051, -73.84773) | NaN |
sub_set_1.head(3)
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
|---|---|---|---|---|---|---|---|---|
| 0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
| 1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
| 2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
5.2.8 unique and nunique
To show only the unique values, use .unique.
sub_set_1['BOROUGH'].unique()
array([nan, 'BROOKLYN', 'QUEENS', 'MANHATTAN', 'STATEN ISLAND', 'BRONX'],
dtype=object)
To get the number of unique values, use .nunique.
sub_set_1['BOROUGH'].nunique()
5
5.2.9 count and value_counts
To count the non-missing values, use .count.
sub_set_1['BOROUGH'].count()
21
To count the number in each category, use .value_counts.
sub_set_1['BOROUGH'].value_counts()
QUEENS 12
BROOKLYN 6
MANHATTAN 1
STATEN ISLAND 1
BRONX 1
Name: BOROUGH, dtype: int64
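Note that the counts above sum to 21, not 35: value_counts drops missing values by default. Passing dropna=False includes them, sketched here on a small hypothetical Series rather than the crash data:

```python
import numpy as np
import pandas as pd

boro = pd.Series(['QUEENS', 'QUEENS', np.nan, 'BRONX', np.nan])

# with dropna=False, NaN gets its own row in the counts
counts = boro.value_counts(dropna=False)
```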
5.2.10 describe and info
sub_set_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRASH DATE 35 non-null object
1 CRASH TIME 35 non-null object
2 BOROUGH 21 non-null object
3 ZIP CODE 21 non-null float64
4 LATITUDE 30 non-null float64
5 LONGITUDE 30 non-null float64
6 LOCATION 30 non-null object
7 ON STREET NAME 27 non-null object
dtypes: float64(3), object(5)
memory usage: 2.3+ KB
Use .describe to get a quick summary of the data; by default it reports summary statistics for the numeric columns.
sub_set_1.describe()
| ZIP CODE | LATITUDE | LONGITUDE | |
|---|---|---|---|
| count | 21.000000 | 30.000000 | 30.000000 |
| mean | 11188.095238 | 40.728138 | -73.885913 |
| std | 416.736476 | 0.086772 | 0.076496 |
| min | 10026.000000 | 40.580510 | -74.093414 |
| 25% | 11207.000000 | 40.664881 | -73.937413 |
| 50% | 11354.000000 | 40.747835 | -73.886998 |
| 75% | 11369.000000 | 40.793048 | -73.848292 |
| max | 11694.000000 | 40.861810 | -73.728320 |
sub_set_1.describe(percentiles=[x/10 for x in list(range(1, 10, 1))])
| ZIP CODE | LATITUDE | LONGITUDE | |
|---|---|---|---|
| count | 21.000000 | 30.000000 | 30.000000 |
| mean | 11188.095238 | 40.728138 | -73.885913 |
| std | 416.736476 | 0.086772 | 0.076496 |
| min | 10026.000000 | 40.580510 | -74.093414 |
| 10% | 10468.000000 | 40.598224 | -73.959692 |
| 20% | 11207.000000 | 40.649558 | -73.943165 |
| 30% | 11223.000000 | 40.673481 | -73.925244 |
| 40% | 11233.000000 | 40.708116 | -73.906082 |
| 50% | 11354.000000 | 40.747835 | -73.886998 |
| 60% | 11357.000000 | 40.760311 | -73.871886 |
| 70% | 11369.000000 | 40.776043 | -73.850365 |
| 80% | 11372.000000 | 40.808170 | -73.828438 |
| 90% | 11411.000000 | 40.833641 | -73.805664 |
| max | 11694.000000 | 40.861810 | -73.728320 |
Choose a specific column to get a summary for.
sub_set_1['BOROUGH'].describe()
count 21
unique 5
top QUEENS
freq 12
Name: BOROUGH, dtype: object
5.2.11 idxmax and nlargest
.idxmax() returns the index of the largest value.
sub_set_1['ZIP CODE'].idxmax()
34
.idxmin() returns the index of the smallest value.
sub_set_1['ZIP CODE'].idxmin()
7
.nlargest returns the largest values with their index (default is 5).
sub_set_1['ZIP CODE'].nlargest()
34 11694.0
11 11691.0
31 11411.0
13 11375.0
26 11372.0
Name: ZIP CODE, dtype: float64
.nsmallest returns the smallest values with their index (default is 5).
sub_set_1['ZIP CODE'].nsmallest()
7 10026.0
17 10305.0
28 10468.0
6 11101.0
12 11207.0
Name: ZIP CODE, dtype: float64
5.2.12 sort
Use .sort_values to sort by one or more columns.
sub_set_1.sort_values(by = 'BOROUGH')
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
|---|---|---|---|---|---|---|---|---|
| 28 | 01/01/2023 | 7:40 | BRONX | 10468.0 | 40.861810 | -73.912320 | (40.86181, -73.91232) | NaN |
| 5 | 01/01/2023 | 0:35 | BROOKLYN | 11229.0 | 40.601310 | -73.954720 | (40.60131, -73.95472) | NaN |
| 27 | 01/01/2023 | 5:36 | BROOKLYN | 11207.0 | 40.663136 | -73.883250 | (40.663136, -73.88325) | NaN |
| 22 | 01/01/2023 | 4:20 | BROOKLYN | 11233.0 | 40.670116 | -73.922480 | (40.670116, -73.92248) | SAINT JOHNS PLACE |
| 15 | 01/01/2023 | 3:30 | BROOKLYN | 11236.0 | 40.636720 | -73.887695 | (40.63672, -73.887695) | NaN |
| 19 | 01/01/2023 | 1:10 | BROOKLYN | 11223.0 | 40.593760 | -73.982740 | (40.59376, -73.98274) | AVENUE V |
| 12 | 01/01/2023 | 1:45 | BROOKLYN | 11207.0 | 40.652767 | -73.886300 | (40.652767, -73.8863) | PENNSYLVANIA AVENUE |
| 7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
| 14 | 01/01/2023 | 17:35 | QUEENS | 11354.0 | 40.771587 | -73.810070 | (40.771587, -73.81007) | BAYSIDE AVENUE |
| 16 | 01/01/2023 | 10:30 | QUEENS | 11355.0 | 40.751800 | -73.817314 | (40.7518, -73.817314) | HOLLY AVENUE |
| 33 | 01/01/2023 | 10:16 | QUEENS | 11369.0 | 40.764633 | -73.865370 | (40.764633, -73.86537) | NaN |
| 21 | 01/01/2023 | 20:25 | QUEENS | 11357.0 | 40.786440 | -73.829155 | (40.78644, -73.829155) | 140 STREET |
| 11 | 01/01/2023 | 23:06 | QUEENS | 11691.0 | 40.598720 | -73.766010 | (40.59872, -73.76601) | BEACH 32 STREET |
| 23 | 01/01/2023 | 6:45 | QUEENS | 11368.0 | 40.739280 | -73.850530 | (40.73928, -73.85053) | SAULTELL AVENUE |
| 24 | 01/01/2023 | 23:15 | QUEENS | 11369.0 | 40.757430 | -73.876230 | (40.75743, -73.87623) | NaN |
| 26 | 01/01/2023 | 4:00 | QUEENS | 11372.0 | 40.751003 | -73.892130 | (40.751003, -73.89213) | 74 STREET |
| 6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
| 31 | 01/01/2023 | 5:45 | QUEENS | 11411.0 | 40.690640 | -73.728320 | (40.69064, -73.72832) | NaN |
| 13 | 01/01/2023 | 17:55 | QUEENS | 11375.0 | 40.710320 | -73.849980 | (40.71032, -73.84998) | METROPOLITAN AVENUE |
| 34 | 01/01/2023 | 3:45 | QUEENS | 11694.0 | 40.580510 | -73.847730 | (40.58051, -73.84773) | NaN |
| 17 | 01/01/2023 | 6:40 | STATEN ISLAND | 10305.0 | 40.585240 | -74.093414 | (40.58524, -74.093414) | HYLAN BOULEVARD |
| 0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
| 1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
| 2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
| 3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.912440 | (40.769737, -73.91244) | ASTORIA BOULEVARD |
| 4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
| 8 | 01/01/2023 | 19:53 | NaN | NaN | NaN | NaN | NaN | EAST 222 STREET |
| 9 | 01/01/2023 | 3:11 | NaN | NaN | 40.842110 | -73.825570 | (40.84211, -73.82557) | BRUCKNER EXPRESSWAY |
| 10 | 01/01/2023 | 18:49 | NaN | NaN | 40.674923 | -73.736940 | (40.674923, -73.73694) | MERRICK BOULEVARD |
| 18 | 01/01/2023 | 5:30 | NaN | NaN | 40.704810 | -73.939320 | (40.70481, -73.93932) | SEIGEL STREET |
| 20 | 01/01/2023 | 0:04 | NaN | NaN | 40.858090 | -73.901924 | (40.85809, -73.901924) | EAST 183 STREET |
| 25 | 01/01/2023 | 3:35 | NaN | NaN | NaN | NaN | NaN | GRAND CENTRAL PARKWAY |
| 29 | 01/01/2023 | 10:50 | NaN | NaN | 40.832700 | -73.950226 | (40.8327, -73.950226) | HENRY HUDSON PARKWAY |
| 30 | 01/01/2023 | 17:05 | NaN | NaN | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST 96 STREET |
| 32 | 01/01/2023 | 0:55 | NaN | NaN | 40.818470 | -73.941400 | (40.81847, -73.9414) | WEST 140 STREET |
sub_set_1.sort_values(by = ['CRASH DATE', 'ZIP CODE'], ascending = True)
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
|---|---|---|---|---|---|---|---|---|
| 7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
| 17 | 01/01/2023 | 6:40 | STATEN ISLAND | 10305.0 | 40.585240 | -74.093414 | (40.58524, -74.093414) | HYLAN BOULEVARD |
| 28 | 01/01/2023 | 7:40 | BRONX | 10468.0 | 40.861810 | -73.912320 | (40.86181, -73.91232) | NaN |
| 6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
| 12 | 01/01/2023 | 1:45 | BROOKLYN | 11207.0 | 40.652767 | -73.886300 | (40.652767, -73.8863) | PENNSYLVANIA AVENUE |
| 27 | 01/01/2023 | 5:36 | BROOKLYN | 11207.0 | 40.663136 | -73.883250 | (40.663136, -73.88325) | NaN |
| 19 | 01/01/2023 | 1:10 | BROOKLYN | 11223.0 | 40.593760 | -73.982740 | (40.59376, -73.98274) | AVENUE V |
| 5 | 01/01/2023 | 0:35 | BROOKLYN | 11229.0 | 40.601310 | -73.954720 | (40.60131, -73.95472) | NaN |
| 22 | 01/01/2023 | 4:20 | BROOKLYN | 11233.0 | 40.670116 | -73.922480 | (40.670116, -73.92248) | SAINT JOHNS PLACE |
| 15 | 01/01/2023 | 3:30 | BROOKLYN | 11236.0 | 40.636720 | -73.887695 | (40.63672, -73.887695) | NaN |
| 14 | 01/01/2023 | 17:35 | QUEENS | 11354.0 | 40.771587 | -73.810070 | (40.771587, -73.81007) | BAYSIDE AVENUE |
| 16 | 01/01/2023 | 10:30 | QUEENS | 11355.0 | 40.751800 | -73.817314 | (40.7518, -73.817314) | HOLLY AVENUE |
| 21 | 01/01/2023 | 20:25 | QUEENS | 11357.0 | 40.786440 | -73.829155 | (40.78644, -73.829155) | 140 STREET |
| 23 | 01/01/2023 | 6:45 | QUEENS | 11368.0 | 40.739280 | -73.850530 | (40.73928, -73.85053) | SAULTELL AVENUE |
| 24 | 01/01/2023 | 23:15 | QUEENS | 11369.0 | 40.757430 | -73.876230 | (40.75743, -73.87623) | NaN |
| 33 | 01/01/2023 | 10:16 | QUEENS | 11369.0 | 40.764633 | -73.865370 | (40.764633, -73.86537) | NaN |
| 26 | 01/01/2023 | 4:00 | QUEENS | 11372.0 | 40.751003 | -73.892130 | (40.751003, -73.89213) | 74 STREET |
| 13 | 01/01/2023 | 17:55 | QUEENS | 11375.0 | 40.710320 | -73.849980 | (40.71032, -73.84998) | METROPOLITAN AVENUE |
| 31 | 01/01/2023 | 5:45 | QUEENS | 11411.0 | 40.690640 | -73.728320 | (40.69064, -73.72832) | NaN |
| 11 | 01/01/2023 | 23:06 | QUEENS | 11691.0 | 40.598720 | -73.766010 | (40.59872, -73.76601) | BEACH 32 STREET |
| 34 | 01/01/2023 | 3:45 | QUEENS | 11694.0 | 40.580510 | -73.847730 | (40.58051, -73.84773) | NaN |
| 0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
| 1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
| 2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
| 3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.912440 | (40.769737, -73.91244) | ASTORIA BOULEVARD |
| 4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
| 8 | 01/01/2023 | 19:53 | NaN | NaN | NaN | NaN | NaN | EAST 222 STREET |
| 9 | 01/01/2023 | 3:11 | NaN | NaN | 40.842110 | -73.825570 | (40.84211, -73.82557) | BRUCKNER EXPRESSWAY |
| 10 | 01/01/2023 | 18:49 | NaN | NaN | 40.674923 | -73.736940 | (40.674923, -73.73694) | MERRICK BOULEVARD |
| 18 | 01/01/2023 | 5:30 | NaN | NaN | 40.704810 | -73.939320 | (40.70481, -73.93932) | SEIGEL STREET |
| 20 | 01/01/2023 | 0:04 | NaN | NaN | 40.858090 | -73.901924 | (40.85809, -73.901924) | EAST 183 STREET |
| 25 | 01/01/2023 | 3:35 | NaN | NaN | NaN | NaN | NaN | GRAND CENTRAL PARKWAY |
| 29 | 01/01/2023 | 10:50 | NaN | NaN | 40.832700 | -73.950226 | (40.8327, -73.950226) | HENRY HUDSON PARKWAY |
| 30 | 01/01/2023 | 17:05 | NaN | NaN | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST 96 STREET |
| 32 | 01/01/2023 | 0:55 | NaN | NaN | 40.818470 | -73.941400 | (40.81847, -73.9414) | WEST 140 STREET |
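sort_values also supports descending order and control over where missing values land via na_position; by default NaN rows sort last, as seen above. A sketch on a small Series:

```python
import numpy as np
import pandas as pd

zips = pd.Series([11229.0, np.nan, 10026.0, 11691.0])

# descending sort, with missing values placed first instead of last
ordered = zips.sort_values(ascending=False, na_position='first')
```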
5.2.13 [] method
The [] method can be used to select column(s) by passing column name(s).
sub_set_1['ZIP CODE'].head()
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: ZIP CODE, dtype: float64
sub_set_1[['BOROUGH', 'ZIP CODE', 'LOCATION']].head()
| BOROUGH | ZIP CODE | LOCATION | |
|---|---|---|---|
| 0 | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | NaN | (40.769737, -73.91244) |
| 4 | NaN | NaN | (40.830555, -73.85072) |
5.2.14 loc method
loc can be used to index row(s) and column(s) by providing the row and column labels.
df.loc[row_label(s)] Selects single row or subset of rows from the DataFrame by label.
Index single row
sub_set_1.loc[7]
CRASH DATE 01/01/2023
CRASH TIME 1:00
BOROUGH MANHATTAN
ZIP CODE 10026.0
LATITUDE 40.805595
LONGITUDE -73.95819
LOCATION (40.805595, -73.95819)
ON STREET NAME WEST 116 STREET
Name: 7, dtype: object
Index multiple rows
sub_set_1.loc[:8]
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
|---|---|---|---|---|---|---|---|---|
| 0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
| 1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
| 2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
| 3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.912440 | (40.769737, -73.91244) | ASTORIA BOULEVARD |
| 4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
| 5 | 01/01/2023 | 0:35 | BROOKLYN | 11229.0 | 40.601310 | -73.954720 | (40.60131, -73.95472) | NaN |
| 6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
| 7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
| 8 | 01/01/2023 | 19:53 | NaN | NaN | NaN | NaN | NaN | EAST 222 STREET |
sub_set_1.loc[[0, 7, 4, 6]]
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
|---|---|---|---|---|---|---|---|---|
| 0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
| 7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
| 4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
| 6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
df.loc[:, col_labels] Selects single column or subset of columns by label.
sub_set_1.loc[:, 'LOCATION'].head()
0 NaN
1 NaN
2 NaN
3 (40.769737, -73.91244)
4 (40.830555, -73.85072)
Name: LOCATION, dtype: object
sub_set_1.loc[:, 'LATITUDE': 'LOCATION'].head()
| LATITUDE | LONGITUDE | LOCATION | |
|---|---|---|---|
| 0 | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | 40.769737 | -73.91244 | (40.769737, -73.91244) |
| 4 | 40.830555 | -73.85072 | (40.830555, -73.85072) |
sub_set_1.loc[:, ['BOROUGH', 'ZIP CODE', 'LOCATION']].head()
| BOROUGH | ZIP CODE | LOCATION | |
|---|---|---|---|
| 0 | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | NaN | (40.769737, -73.91244) |
| 4 | NaN | NaN | (40.830555, -73.85072) |
df.loc[row_label(s), col_label(s)] Selects both rows and columns by label.
sub_set_1.loc[7, 'BOROUGH']
'MANHATTAN'
sub_set_1.loc[:8, ['BOROUGH', 'LOCATION']]
| BOROUGH | LOCATION | |
|---|---|---|
| 0 | NaN | NaN |
| 1 | NaN | NaN |
| 2 | NaN | NaN |
| 3 | NaN | (40.769737, -73.91244) |
| 4 | NaN | (40.830555, -73.85072) |
| 5 | BROOKLYN | (40.60131, -73.95472) |
| 6 | QUEENS | (40.744667, -73.931694) |
| 7 | MANHATTAN | (40.805595, -73.95819) |
| 8 | NaN | NaN |
Index by Boolean Series
sub_set_1['BOROUGH'].isin(['MANHATTAN','QUEENS']).head()
0 False
1 False
2 False
3 False
4 False
Name: BOROUGH, dtype: bool
sub_set_1.loc[sub_set_1['BOROUGH'].isin(['MANHATTAN','QUEENS'])].head()
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
|---|---|---|---|---|---|---|---|---|
| 6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
| 7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
| 11 | 01/01/2023 | 23:06 | QUEENS | 11691.0 | 40.598720 | -73.766010 | (40.59872, -73.76601) | BEACH 32 STREET |
| 13 | 01/01/2023 | 17:55 | QUEENS | 11375.0 | 40.710320 | -73.849980 | (40.71032, -73.84998) | METROPOLITAN AVENUE |
| 14 | 01/01/2023 | 17:35 | QUEENS | 11354.0 | 40.771587 | -73.810070 | (40.771587, -73.81007) | BAYSIDE AVENUE |
Use “&” (and), “|” (or), and “~” (not) to combine Boolean conditions in pandas.
sub_set_1.loc[(sub_set_1["BOROUGH"] == "MANHATTAN") & (sub_set_1["ZIP CODE"] >= 1000)]
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
|---|---|---|---|---|---|---|---|---|
| 7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.95819 | (40.805595, -73.95819) | WEST 116 STREET |
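The other two operators work the same way: "|" keeps rows matching either condition, and "~" negates a condition. A sketch on a toy frame (hypothetical rows, since the crash data file may not be at hand):

```python
import pandas as pd

df = pd.DataFrame({'BOROUGH': ['QUEENS', 'BRONX', 'QUEENS', 'MANHATTAN'],
                   'ZIP CODE': [11101.0, 10468.0, 11691.0, 10026.0]})

# rows that match at least one of the two conditions
either = df.loc[(df['BOROUGH'] == 'BRONX') | (df['ZIP CODE'] > 11500)]

# rows where the condition is NOT true
not_queens = df.loc[~(df['BOROUGH'] == 'QUEENS')]
```

Each condition must be wrapped in parentheses, as in the example above, because & and | bind more tightly than the comparisons.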
5.2.15 iloc method
iloc can be used to index row(s) and column(s) by providing the row and column integer(s).
df.iloc[row_integer(s)] Selects single row or subset of rows from the DataFrame by integer position
Note: this is the same as indexing a Python sequence, and differs from loc: it is 0-based and the slice is half-open on the right (the last item is excluded).
sub_set_1.iloc[3]
CRASH DATE 01/01/2023
CRASH TIME 23:45
BOROUGH NaN
ZIP CODE NaN
LATITUDE 40.769737
LONGITUDE -73.91244
LOCATION (40.769737, -73.91244)
ON STREET NAME ASTORIA BOULEVARD
Name: 3, dtype: object
sub_set_1.iloc[:8]
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
|---|---|---|---|---|---|---|---|---|
| 0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
| 1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
| 2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
| 3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.912440 | (40.769737, -73.91244) | ASTORIA BOULEVARD |
| 4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
| 5 | 01/01/2023 | 0:35 | BROOKLYN | 11229.0 | 40.601310 | -73.954720 | (40.60131, -73.95472) | NaN |
| 6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
| 7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
sub_set_1.iloc[:, 1:3].head()
| CRASH TIME | BOROUGH | |
|---|---|---|
| 0 | 14:38 | NaN |
| 1 | 8:04 | NaN |
| 2 | 18:05 | NaN |
| 3 | 23:45 | NaN |
| 4 | 4:50 | NaN |
df.iloc[row_integer(s), col_integer(s)] Selects row and columns from the DataFrame by integer positions.
sub_set_1.iloc[0:5, :6]
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | |
|---|---|---|---|---|---|---|
| 0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN |
| 1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN |
| 2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN |
| 3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.91244 |
| 4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.85072 |
5.2.16 concat method
pd.concat([df1, df2], axis = 0) can be used to combine two DataFrames either row-wise or column-wise, depending on the value of axis:
- 0 (default, row-wise)
- 1 (column-wise)
sub_set_1
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
|---|---|---|---|---|---|---|---|---|
| 0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
| 1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
| 2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
| 3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.912440 | (40.769737, -73.91244) | ASTORIA BOULEVARD |
| 4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
| 5 | 01/01/2023 | 0:35 | BROOKLYN | 11229.0 | 40.601310 | -73.954720 | (40.60131, -73.95472) | NaN |
| 6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
| 7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
| 8 | 01/01/2023 | 19:53 | NaN | NaN | NaN | NaN | NaN | EAST 222 STREET |
| 9 | 01/01/2023 | 3:11 | NaN | NaN | 40.842110 | -73.825570 | (40.84211, -73.82557) | BRUCKNER EXPRESSWAY |
| 10 | 01/01/2023 | 18:49 | NaN | NaN | 40.674923 | -73.736940 | (40.674923, -73.73694) | MERRICK BOULEVARD |
| 11 | 01/01/2023 | 23:06 | QUEENS | 11691.0 | 40.598720 | -73.766010 | (40.59872, -73.76601) | BEACH 32 STREET |
| 12 | 01/01/2023 | 1:45 | BROOKLYN | 11207.0 | 40.652767 | -73.886300 | (40.652767, -73.8863) | PENNSYLVANIA AVENUE |
| 13 | 01/01/2023 | 17:55 | QUEENS | 11375.0 | 40.710320 | -73.849980 | (40.71032, -73.84998) | METROPOLITAN AVENUE |
| 14 | 01/01/2023 | 17:35 | QUEENS | 11354.0 | 40.771587 | -73.810070 | (40.771587, -73.81007) | BAYSIDE AVENUE |
| 15 | 01/01/2023 | 3:30 | BROOKLYN | 11236.0 | 40.636720 | -73.887695 | (40.63672, -73.887695) | NaN |
| 16 | 01/01/2023 | 10:30 | QUEENS | 11355.0 | 40.751800 | -73.817314 | (40.7518, -73.817314) | HOLLY AVENUE |
| 17 | 01/01/2023 | 6:40 | STATEN ISLAND | 10305.0 | 40.585240 | -74.093414 | (40.58524, -74.093414) | HYLAN BOULEVARD |
| 18 | 01/01/2023 | 5:30 | NaN | NaN | 40.704810 | -73.939320 | (40.70481, -73.93932) | SEIGEL STREET |
| 19 | 01/01/2023 | 1:10 | BROOKLYN | 11223.0 | 40.593760 | -73.982740 | (40.59376, -73.98274) | AVENUE V |
| 20 | 01/01/2023 | 0:04 | NaN | NaN | 40.858090 | -73.901924 | (40.85809, -73.901924) | EAST 183 STREET |
| 21 | 01/01/2023 | 20:25 | QUEENS | 11357.0 | 40.786440 | -73.829155 | (40.78644, -73.829155) | 140 STREET |
| 22 | 01/01/2023 | 4:20 | BROOKLYN | 11233.0 | 40.670116 | -73.922480 | (40.670116, -73.92248) | SAINT JOHNS PLACE |
| 23 | 01/01/2023 | 6:45 | QUEENS | 11368.0 | 40.739280 | -73.850530 | (40.73928, -73.85053) | SAULTELL AVENUE |
| 24 | 01/01/2023 | 23:15 | QUEENS | 11369.0 | 40.757430 | -73.876230 | (40.75743, -73.87623) | NaN |
| 25 | 01/01/2023 | 3:35 | NaN | NaN | NaN | NaN | NaN | GRAND CENTRAL PARKWAY |
| 26 | 01/01/2023 | 4:00 | QUEENS | 11372.0 | 40.751003 | -73.892130 | (40.751003, -73.89213) | 74 STREET |
| 27 | 01/01/2023 | 5:36 | BROOKLYN | 11207.0 | 40.663136 | -73.883250 | (40.663136, -73.88325) | NaN |
| 28 | 01/01/2023 | 7:40 | BRONX | 10468.0 | 40.861810 | -73.912320 | (40.86181, -73.91232) | NaN |
| 29 | 01/01/2023 | 10:50 | NaN | NaN | 40.832700 | -73.950226 | (40.8327, -73.950226) | HENRY HUDSON PARKWAY |
| 30 | 01/01/2023 | 17:05 | NaN | NaN | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST 96 STREET |
| 31 | 01/01/2023 | 5:45 | QUEENS | 11411.0 | 40.690640 | -73.728320 | (40.69064, -73.72832) | NaN |
| 32 | 01/01/2023 | 0:55 | NaN | NaN | 40.818470 | -73.941400 | (40.81847, -73.9414) | WEST 140 STREET |
| 33 | 01/01/2023 | 10:16 | QUEENS | 11369.0 | 40.764633 | -73.865370 | (40.764633, -73.86537) | NaN |
| 34 | 01/01/2023 | 3:45 | QUEENS | 11694.0 | 40.580510 | -73.847730 | (40.58051, -73.84773) | NaN |
sub_set_2 = crashes.iloc[35:60, 0:8]
sub_set_2
| | CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME |
|---|---|---|---|---|---|---|---|---|
| 35 | 01/01/2023 | 1:20 | QUEENS | 11377.0 | 40.748720 | -73.896250 | (40.74872, -73.89625) | BROOKLYN QUEENS EXPRESSWAY |
| 36 | 01/01/2023 | 18:26 | BROOKLYN | 11208.0 | 40.678246 | -73.870186 | (40.678246, -73.870186) | NaN |
| 37 | 01/01/2023 | 0:00 | NaN | NaN | NaN | NaN | NaN | WILLIAMSBURG BRIDGE OUTER ROADWA |
| 38 | 01/01/2023 | 22:35 | NaN | NaN | 40.755780 | -74.001990 | (40.75578, -74.00199) | WEST 34 STREET |
| 39 | 01/01/2023 | 12:00 | BROOKLYN | 11211.0 | 40.706690 | -73.958840 | (40.70669, -73.95884) | NaN |
| 40 | 01/01/2023 | 8:45 | NaN | NaN | 40.828114 | -73.931070 | (40.828114, -73.93107) | MAJOR DEEGAN EXPRESSWAY |
| 41 | 01/01/2023 | 1:20 | MANHATTAN | 10022.0 | 40.758980 | -73.962440 | (40.75898, -73.96244) | 1 AVENUE |
| 42 | 01/01/2023 | 16:45 | QUEENS | 11420.0 | 40.679070 | -73.831540 | (40.67907, -73.83154) | NaN |
| 43 | 01/01/2023 | 0:10 | BROOKLYN | 11236.0 | 40.638350 | -73.908890 | (40.63835, -73.90889) | FLATLANDS AVENUE |
| 44 | 01/01/2023 | 14:53 | BROOKLYN | 11208.0 | 40.660797 | -73.871830 | (40.660797, -73.87183) | ATKINS AVENUE |
| 45 | 01/01/2023 | 13:10 | NaN | NaN | 40.831290 | -73.929040 | (40.83129, -73.92904) | WOODYCREST AVENUE |
| 46 | 01/01/2023 | 14:16 | NaN | NaN | NaN | NaN | NaN | VERRAZANO BRIDGE UPPER |
| 47 | 01/01/2023 | 15:12 | NaN | NaN | 40.712780 | -74.011690 | (40.71278, -74.01169) | GREENWICH STREET |
| 48 | 01/01/2023 | 0:00 | NaN | NaN | 40.688370 | -73.944916 | (40.68837, -73.944916) | LEXINGTON AVENUE |
| 49 | 01/01/2023 | 3:35 | NaN | NaN | 40.825935 | -73.859130 | (40.825935, -73.85913) | BRUCKNER EXPRESSWAY |
| 50 | 01/01/2023 | 9:45 | NaN | NaN | 40.882645 | -73.886566 | (40.882645, -73.886566) | SEDGWICK AVENUE |
| 51 | 01/01/2023 | 12:20 | NaN | NaN | NaN | NaN | NaN | 102 CROSS DRIVE |
| 52 | 01/01/2023 | 12:00 | QUEENS | 11354.0 | 40.764650 | -73.823494 | (40.76465, -73.823494) | NORTHERN BOULEVARD |
| 53 | 01/01/2023 | 1:17 | QUEENS | 11375.0 | 40.724308 | -73.842575 | (40.724308, -73.842575) | 110 STREET |
| 54 | 01/01/2023 | 7:15 | MANHATTAN | 10025.0 | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST END AVENUE |
| 55 | 01/01/2023 | 1:30 | BRONX | 10457.0 | 40.836480 | -73.897736 | (40.83648, -73.897736) | CLAREMONT PARKWAY |
| 56 | 01/01/2023 | 8:08 | NaN | NaN | 40.686085 | -73.982666 | (40.686085, -73.982666) | ATLANTIC AVENUE |
| 57 | 01/01/2023 | 6:08 | QUEENS | 11420.0 | 40.677242 | -73.816720 | (40.677242, -73.81672) | NaN |
| 58 | 01/01/2023 | 8:20 | BRONX | 10451.0 | NaN | NaN | NaN | EAST 138 STREET |
| 59 | 01/01/2023 | 10:11 | QUEENS | 11358.0 | 40.760440 | -73.804924 | (40.76044, -73.804924) | 161 STREET |
Combining by rows:
sub_set_3 = pd.concat([sub_set_1, sub_set_2])
sub_set_3
| | CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME |
|---|---|---|---|---|---|---|---|---|
| 0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
| 1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
| 2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
| 3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.912440 | (40.769737, -73.91244) | ASTORIA BOULEVARD |
| 4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
| 5 | 01/01/2023 | 0:35 | BROOKLYN | 11229.0 | 40.601310 | -73.954720 | (40.60131, -73.95472) | NaN |
| 6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
| 7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
| 8 | 01/01/2023 | 19:53 | NaN | NaN | NaN | NaN | NaN | EAST 222 STREET |
| 9 | 01/01/2023 | 3:11 | NaN | NaN | 40.842110 | -73.825570 | (40.84211, -73.82557) | BRUCKNER EXPRESSWAY |
| 10 | 01/01/2023 | 18:49 | NaN | NaN | 40.674923 | -73.736940 | (40.674923, -73.73694) | MERRICK BOULEVARD |
| 11 | 01/01/2023 | 23:06 | QUEENS | 11691.0 | 40.598720 | -73.766010 | (40.59872, -73.76601) | BEACH 32 STREET |
| 12 | 01/01/2023 | 1:45 | BROOKLYN | 11207.0 | 40.652767 | -73.886300 | (40.652767, -73.8863) | PENNSYLVANIA AVENUE |
| 13 | 01/01/2023 | 17:55 | QUEENS | 11375.0 | 40.710320 | -73.849980 | (40.71032, -73.84998) | METROPOLITAN AVENUE |
| 14 | 01/01/2023 | 17:35 | QUEENS | 11354.0 | 40.771587 | -73.810070 | (40.771587, -73.81007) | BAYSIDE AVENUE |
| 15 | 01/01/2023 | 3:30 | BROOKLYN | 11236.0 | 40.636720 | -73.887695 | (40.63672, -73.887695) | NaN |
| 16 | 01/01/2023 | 10:30 | QUEENS | 11355.0 | 40.751800 | -73.817314 | (40.7518, -73.817314) | HOLLY AVENUE |
| 17 | 01/01/2023 | 6:40 | STATEN ISLAND | 10305.0 | 40.585240 | -74.093414 | (40.58524, -74.093414) | HYLAN BOULEVARD |
| 18 | 01/01/2023 | 5:30 | NaN | NaN | 40.704810 | -73.939320 | (40.70481, -73.93932) | SEIGEL STREET |
| 19 | 01/01/2023 | 1:10 | BROOKLYN | 11223.0 | 40.593760 | -73.982740 | (40.59376, -73.98274) | AVENUE V |
| 20 | 01/01/2023 | 0:04 | NaN | NaN | 40.858090 | -73.901924 | (40.85809, -73.901924) | EAST 183 STREET |
| 21 | 01/01/2023 | 20:25 | QUEENS | 11357.0 | 40.786440 | -73.829155 | (40.78644, -73.829155) | 140 STREET |
| 22 | 01/01/2023 | 4:20 | BROOKLYN | 11233.0 | 40.670116 | -73.922480 | (40.670116, -73.92248) | SAINT JOHNS PLACE |
| 23 | 01/01/2023 | 6:45 | QUEENS | 11368.0 | 40.739280 | -73.850530 | (40.73928, -73.85053) | SAULTELL AVENUE |
| 24 | 01/01/2023 | 23:15 | QUEENS | 11369.0 | 40.757430 | -73.876230 | (40.75743, -73.87623) | NaN |
| 25 | 01/01/2023 | 3:35 | NaN | NaN | NaN | NaN | NaN | GRAND CENTRAL PARKWAY |
| 26 | 01/01/2023 | 4:00 | QUEENS | 11372.0 | 40.751003 | -73.892130 | (40.751003, -73.89213) | 74 STREET |
| 27 | 01/01/2023 | 5:36 | BROOKLYN | 11207.0 | 40.663136 | -73.883250 | (40.663136, -73.88325) | NaN |
| 28 | 01/01/2023 | 7:40 | BRONX | 10468.0 | 40.861810 | -73.912320 | (40.86181, -73.91232) | NaN |
| 29 | 01/01/2023 | 10:50 | NaN | NaN | 40.832700 | -73.950226 | (40.8327, -73.950226) | HENRY HUDSON PARKWAY |
| 30 | 01/01/2023 | 17:05 | NaN | NaN | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST 96 STREET |
| 31 | 01/01/2023 | 5:45 | QUEENS | 11411.0 | 40.690640 | -73.728320 | (40.69064, -73.72832) | NaN |
| 32 | 01/01/2023 | 0:55 | NaN | NaN | 40.818470 | -73.941400 | (40.81847, -73.9414) | WEST 140 STREET |
| 33 | 01/01/2023 | 10:16 | QUEENS | 11369.0 | 40.764633 | -73.865370 | (40.764633, -73.86537) | NaN |
| 34 | 01/01/2023 | 3:45 | QUEENS | 11694.0 | 40.580510 | -73.847730 | (40.58051, -73.84773) | NaN |
| 35 | 01/01/2023 | 1:20 | QUEENS | 11377.0 | 40.748720 | -73.896250 | (40.74872, -73.89625) | BROOKLYN QUEENS EXPRESSWAY |
| 36 | 01/01/2023 | 18:26 | BROOKLYN | 11208.0 | 40.678246 | -73.870186 | (40.678246, -73.870186) | NaN |
| 37 | 01/01/2023 | 0:00 | NaN | NaN | NaN | NaN | NaN | WILLIAMSBURG BRIDGE OUTER ROADWA |
| 38 | 01/01/2023 | 22:35 | NaN | NaN | 40.755780 | -74.001990 | (40.75578, -74.00199) | WEST 34 STREET |
| 39 | 01/01/2023 | 12:00 | BROOKLYN | 11211.0 | 40.706690 | -73.958840 | (40.70669, -73.95884) | NaN |
| 40 | 01/01/2023 | 8:45 | NaN | NaN | 40.828114 | -73.931070 | (40.828114, -73.93107) | MAJOR DEEGAN EXPRESSWAY |
| 41 | 01/01/2023 | 1:20 | MANHATTAN | 10022.0 | 40.758980 | -73.962440 | (40.75898, -73.96244) | 1 AVENUE |
| 42 | 01/01/2023 | 16:45 | QUEENS | 11420.0 | 40.679070 | -73.831540 | (40.67907, -73.83154) | NaN |
| 43 | 01/01/2023 | 0:10 | BROOKLYN | 11236.0 | 40.638350 | -73.908890 | (40.63835, -73.90889) | FLATLANDS AVENUE |
| 44 | 01/01/2023 | 14:53 | BROOKLYN | 11208.0 | 40.660797 | -73.871830 | (40.660797, -73.87183) | ATKINS AVENUE |
| 45 | 01/01/2023 | 13:10 | NaN | NaN | 40.831290 | -73.929040 | (40.83129, -73.92904) | WOODYCREST AVENUE |
| 46 | 01/01/2023 | 14:16 | NaN | NaN | NaN | NaN | NaN | VERRAZANO BRIDGE UPPER |
| 47 | 01/01/2023 | 15:12 | NaN | NaN | 40.712780 | -74.011690 | (40.71278, -74.01169) | GREENWICH STREET |
| 48 | 01/01/2023 | 0:00 | NaN | NaN | 40.688370 | -73.944916 | (40.68837, -73.944916) | LEXINGTON AVENUE |
| 49 | 01/01/2023 | 3:35 | NaN | NaN | 40.825935 | -73.859130 | (40.825935, -73.85913) | BRUCKNER EXPRESSWAY |
| 50 | 01/01/2023 | 9:45 | NaN | NaN | 40.882645 | -73.886566 | (40.882645, -73.886566) | SEDGWICK AVENUE |
| 51 | 01/01/2023 | 12:20 | NaN | NaN | NaN | NaN | NaN | 102 CROSS DRIVE |
| 52 | 01/01/2023 | 12:00 | QUEENS | 11354.0 | 40.764650 | -73.823494 | (40.76465, -73.823494) | NORTHERN BOULEVARD |
| 53 | 01/01/2023 | 1:17 | QUEENS | 11375.0 | 40.724308 | -73.842575 | (40.724308, -73.842575) | 110 STREET |
| 54 | 01/01/2023 | 7:15 | MANHATTAN | 10025.0 | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST END AVENUE |
| 55 | 01/01/2023 | 1:30 | BRONX | 10457.0 | 40.836480 | -73.897736 | (40.83648, -73.897736) | CLAREMONT PARKWAY |
| 56 | 01/01/2023 | 8:08 | NaN | NaN | 40.686085 | -73.982666 | (40.686085, -73.982666) | ATLANTIC AVENUE |
| 57 | 01/01/2023 | 6:08 | QUEENS | 11420.0 | 40.677242 | -73.816720 | (40.677242, -73.81672) | NaN |
| 58 | 01/01/2023 | 8:20 | BRONX | 10451.0 | NaN | NaN | NaN | EAST 138 STREET |
| 59 | 01/01/2023 | 10:11 | QUEENS | 11358.0 | 40.760440 | -73.804924 | (40.76044, -73.804924) | 161 STREET |
Combining by columns (axis = 1):
sub_set_4 = pd.concat([sub_set_1, sub_set_2], axis = 1)
sub_set_4
| | CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.912440 | (40.769737, -73.91244) | ASTORIA BOULEVARD | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 01/01/2023 | 0:35 | BROOKLYN | 11229.0 | 40.601310 | -73.954720 | (40.60131, -73.95472) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | 01/01/2023 | 19:53 | NaN | NaN | NaN | NaN | NaN | EAST 222 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 01/01/2023 | 3:11 | NaN | NaN | 40.842110 | -73.825570 | (40.84211, -73.82557) | BRUCKNER EXPRESSWAY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | 01/01/2023 | 18:49 | NaN | NaN | 40.674923 | -73.736940 | (40.674923, -73.73694) | MERRICK BOULEVARD | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11 | 01/01/2023 | 23:06 | QUEENS | 11691.0 | 40.598720 | -73.766010 | (40.59872, -73.76601) | BEACH 32 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12 | 01/01/2023 | 1:45 | BROOKLYN | 11207.0 | 40.652767 | -73.886300 | (40.652767, -73.8863) | PENNSYLVANIA AVENUE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 13 | 01/01/2023 | 17:55 | QUEENS | 11375.0 | 40.710320 | -73.849980 | (40.71032, -73.84998) | METROPOLITAN AVENUE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 14 | 01/01/2023 | 17:35 | QUEENS | 11354.0 | 40.771587 | -73.810070 | (40.771587, -73.81007) | BAYSIDE AVENUE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15 | 01/01/2023 | 3:30 | BROOKLYN | 11236.0 | 40.636720 | -73.887695 | (40.63672, -73.887695) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 16 | 01/01/2023 | 10:30 | QUEENS | 11355.0 | 40.751800 | -73.817314 | (40.7518, -73.817314) | HOLLY AVENUE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 17 | 01/01/2023 | 6:40 | STATEN ISLAND | 10305.0 | 40.585240 | -74.093414 | (40.58524, -74.093414) | HYLAN BOULEVARD | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 18 | 01/01/2023 | 5:30 | NaN | NaN | 40.704810 | -73.939320 | (40.70481, -73.93932) | SEIGEL STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19 | 01/01/2023 | 1:10 | BROOKLYN | 11223.0 | 40.593760 | -73.982740 | (40.59376, -73.98274) | AVENUE V | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 20 | 01/01/2023 | 0:04 | NaN | NaN | 40.858090 | -73.901924 | (40.85809, -73.901924) | EAST 183 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 21 | 01/01/2023 | 20:25 | QUEENS | 11357.0 | 40.786440 | -73.829155 | (40.78644, -73.829155) | 140 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 22 | 01/01/2023 | 4:20 | BROOKLYN | 11233.0 | 40.670116 | -73.922480 | (40.670116, -73.92248) | SAINT JOHNS PLACE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 23 | 01/01/2023 | 6:45 | QUEENS | 11368.0 | 40.739280 | -73.850530 | (40.73928, -73.85053) | SAULTELL AVENUE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24 | 01/01/2023 | 23:15 | QUEENS | 11369.0 | 40.757430 | -73.876230 | (40.75743, -73.87623) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | 01/01/2023 | 3:35 | NaN | NaN | NaN | NaN | NaN | GRAND CENTRAL PARKWAY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 26 | 01/01/2023 | 4:00 | QUEENS | 11372.0 | 40.751003 | -73.892130 | (40.751003, -73.89213) | 74 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 27 | 01/01/2023 | 5:36 | BROOKLYN | 11207.0 | 40.663136 | -73.883250 | (40.663136, -73.88325) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 28 | 01/01/2023 | 7:40 | BRONX | 10468.0 | 40.861810 | -73.912320 | (40.86181, -73.91232) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 29 | 01/01/2023 | 10:50 | NaN | NaN | 40.832700 | -73.950226 | (40.8327, -73.950226) | HENRY HUDSON PARKWAY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 30 | 01/01/2023 | 17:05 | NaN | NaN | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST 96 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 31 | 01/01/2023 | 5:45 | QUEENS | 11411.0 | 40.690640 | -73.728320 | (40.69064, -73.72832) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 32 | 01/01/2023 | 0:55 | NaN | NaN | 40.818470 | -73.941400 | (40.81847, -73.9414) | WEST 140 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 33 | 01/01/2023 | 10:16 | QUEENS | 11369.0 | 40.764633 | -73.865370 | (40.764633, -73.86537) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 34 | 01/01/2023 | 3:45 | QUEENS | 11694.0 | 40.580510 | -73.847730 | (40.58051, -73.84773) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 35 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 1:20 | QUEENS | 11377.0 | 40.748720 | -73.896250 | (40.74872, -73.89625) | BROOKLYN QUEENS EXPRESSWAY |
| 36 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 18:26 | BROOKLYN | 11208.0 | 40.678246 | -73.870186 | (40.678246, -73.870186) | NaN |
| 37 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 0:00 | NaN | NaN | NaN | NaN | NaN | WILLIAMSBURG BRIDGE OUTER ROADWA |
| 38 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 22:35 | NaN | NaN | 40.755780 | -74.001990 | (40.75578, -74.00199) | WEST 34 STREET |
| 39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 12:00 | BROOKLYN | 11211.0 | 40.706690 | -73.958840 | (40.70669, -73.95884) | NaN |
| 40 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 8:45 | NaN | NaN | 40.828114 | -73.931070 | (40.828114, -73.93107) | MAJOR DEEGAN EXPRESSWAY |
| 41 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 1:20 | MANHATTAN | 10022.0 | 40.758980 | -73.962440 | (40.75898, -73.96244) | 1 AVENUE |
| 42 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 16:45 | QUEENS | 11420.0 | 40.679070 | -73.831540 | (40.67907, -73.83154) | NaN |
| 43 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 0:10 | BROOKLYN | 11236.0 | 40.638350 | -73.908890 | (40.63835, -73.90889) | FLATLANDS AVENUE |
| 44 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 14:53 | BROOKLYN | 11208.0 | 40.660797 | -73.871830 | (40.660797, -73.87183) | ATKINS AVENUE |
| 45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 13:10 | NaN | NaN | 40.831290 | -73.929040 | (40.83129, -73.92904) | WOODYCREST AVENUE |
| 46 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 14:16 | NaN | NaN | NaN | NaN | NaN | VERRAZANO BRIDGE UPPER |
| 47 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 15:12 | NaN | NaN | 40.712780 | -74.011690 | (40.71278, -74.01169) | GREENWICH STREET |
| 48 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 0:00 | NaN | NaN | 40.688370 | -73.944916 | (40.68837, -73.944916) | LEXINGTON AVENUE |
| 49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 3:35 | NaN | NaN | 40.825935 | -73.859130 | (40.825935, -73.85913) | BRUCKNER EXPRESSWAY |
| 50 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 9:45 | NaN | NaN | 40.882645 | -73.886566 | (40.882645, -73.886566) | SEDGWICK AVENUE |
| 51 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 12:20 | NaN | NaN | NaN | NaN | NaN | 102 CROSS DRIVE |
| 52 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 12:00 | QUEENS | 11354.0 | 40.764650 | -73.823494 | (40.76465, -73.823494) | NORTHERN BOULEVARD |
| 53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 1:17 | QUEENS | 11375.0 | 40.724308 | -73.842575 | (40.724308, -73.842575) | 110 STREET |
| 54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 7:15 | MANHATTAN | 10025.0 | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST END AVENUE |
| 55 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 1:30 | BRONX | 10457.0 | 40.836480 | -73.897736 | (40.83648, -73.897736) | CLAREMONT PARKWAY |
| 56 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 8:08 | NaN | NaN | 40.686085 | -73.982666 | (40.686085, -73.982666) | ATLANTIC AVENUE |
| 57 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 6:08 | QUEENS | 11420.0 | 40.677242 | -73.816720 | (40.677242, -73.81672) | NaN |
| 58 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 8:20 | BRONX | 10451.0 | NaN | NaN | NaN | EAST 138 STREET |
| 59 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 10:11 | QUEENS | 11358.0 | 40.760440 | -73.804924 | (40.76044, -73.804924) | 161 STREET |
Use .fillna() to fill in the missing values introduced by the column-wise concatenation.
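As a quick illustration of that cleanup step, here is a minimal sketch with two toy frames (not the crash data) whose indexes do not overlap, so the column-wise concat produces NaN blocks that .fillna() then replaces:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})                # rows 0-1
b = pd.DataFrame({'y': [3, 4]}, index=[2, 3])  # rows 2-3

# non-overlapping indexes, so each frame contributes NaN for the other's rows
side = pd.concat([a, b], axis=1)

# replace every missing value with 0
filled = side.fillna(0)
print(filled)
```

.fillna() also accepts a per-column dict when a single constant is not appropriate for every column.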
5.3 Merge and Join
pd.merge( ): allows a user to do one-to-one, one-to-many, and many-to-many joins
- One-to-one joins mean each row in one table is related to a single row in a different table using a key column
- One-to-many joins mean each row in one table is related to one or more rows in a different table using a key column
- Many-to-many joins mean one or more rows in one table are related to one or more rows in a separate table using a key column
5.3.1 Ex: One-to-One Join
- Create 2 new Data Frames from the January 2023 data with a common column (Zip Code)
- For this example, we are using .drop_duplicates() to get unique ZIP CODE values
- Using pd.merge( ) the two data frames are combined using that common column as a key
jan23 = pd.read_csv("data/nyc_crashes_202301.csv")
crash_zip = jan23[["CRASH DATE", "ZIP CODE"]].copy().drop_duplicates(subset = [ "ZIP CODE"])
crash_zip.tail() # to view DF of crash date & zip code
| | CRASH DATE | ZIP CODE |
|---|---|---|
| 3590 | 01/16/2023 | 10023.0 |
| 4304 | 01/19/2023 | 11363.0 |
| 5530 | 01/24/2023 | 11109.0 |
| 5732 | 01/25/2023 | 10280.0 |
| 7140 | 01/31/2023 | 10169.0 |
borough_zip = jan23[["ZIP CODE", "BOROUGH"]].copy().drop_duplicates(subset = [ "ZIP CODE"])
borough_zip.tail() # to view DF of zip code and borough
| | ZIP CODE | BOROUGH |
|---|---|---|
| 3590 | 10023.0 | MANHATTAN |
| 4304 | 11363.0 | QUEENS |
| 5530 | 11109.0 | QUEENS |
| 5732 | 10280.0 | MANHATTAN |
| 7140 | 10169.0 | MANHATTAN |
merge_w_zip = pd.merge(crash_zip, borough_zip)
merge_w_zip.tail() # to view joined data frames
| | CRASH DATE | ZIP CODE | BOROUGH |
|---|---|---|---|
| 176 | 01/16/2023 | 10023.0 | MANHATTAN |
| 177 | 01/19/2023 | 11363.0 | QUEENS |
| 178 | 01/24/2023 | 11109.0 | QUEENS |
| 179 | 01/25/2023 | 10280.0 | MANHATTAN |
| 180 | 01/31/2023 | 10169.0 | MANHATTAN |
We can specify the name of the key column using on:
pd.merge(crash_zip, borough_zip, on = 'ZIP CODE').tail()
| | CRASH DATE | ZIP CODE | BOROUGH |
|---|---|---|---|
| 176 | 01/16/2023 | 10023.0 | MANHATTAN |
| 177 | 01/19/2023 | 11363.0 | QUEENS |
| 178 | 01/24/2023 | 11109.0 | QUEENS |
| 179 | 01/25/2023 | 10280.0 | MANHATTAN |
| 180 | 01/31/2023 | 10169.0 | MANHATTAN |
5.3.2 Ex: One-to-Many Join
- Create a new DataFrame consisting of Employee and Department
- Create a new DataFrame consisting of Employee and Hire Year
emp_dept = pd.DataFrame({'employee': ['Emily', 'Jake', 'Paul', 'Jackie'],
'department': ['Accounting', 'HR', 'Engineering', 'Accounting']})
emp_hire = pd.DataFrame({'employee': ['Emily', 'Jake', 'Paul', 'Jackie'],
'hire_year': ['2020', '2019', '2023', '2011']})
# combining to create a one to one join
dept_hire = pd.merge(emp_dept,emp_hire) # to merge Employee, Department and Hire Year into one dataframe
dept_hire
| | employee | department | hire_year |
|---|---|---|---|
| 0 | Emily | Accounting | 2020 |
| 1 | Jake | HR | 2019 |
| 2 | Paul | Engineering | 2023 |
| 3 | Jackie | Accounting | 2011 |
- Create a new DataFrame that consists of each department and its supervisor
- Merge this new DataFrame with ‘dept_hire’ to create a one-to-many join using department as the key
dept_sup = pd.DataFrame({'supervisor': ['Lily', 'Angela', 'Steven'],
'department': ['Accounting', 'HR', 'Engineering']})
sup_emp_dept_hire = pd.merge(dept_sup,dept_hire) # merge using Department as the key
sup_emp_dept_hire
| | supervisor | department | employee | hire_year |
|---|---|---|---|---|
| 0 | Lily | Accounting | Emily | 2020 |
| 1 | Lily | Accounting | Jackie | 2011 |
| 2 | Angela | HR | Jake | 2019 |
| 3 | Steven | Engineering | Paul | 2023 |
5.3.3 Ex: Merging when the Key has Different Variable Names
- We will merge two data frames whose key columns contain the same information but have different names
- We use the employee data from above, but rename ‘employee’ in ‘emp_hire’ to ‘employee_name’
- We need to drop either ‘employee’ or ‘employee_name’ after merging to avoid redundant information
emp_dept_names = pd.DataFrame({'employee': ['Emily', 'Jake', 'Paul', 'Jackie'],
'department': ['Accounting', 'HR', 'Engineering', 'Accounting']})
emp_hire_names = pd.DataFrame({'employee_name': ['Emily', 'Jake', 'Paul', 'Jackie'],
'hire_year': ['2020', '2019', '2023', '2011']})
# to merge Employee, Department and Hire Year into one dataframe & drop column 'employee_name'
dept_hire_names = pd.merge(emp_dept_names,emp_hire_names, left_on = 'employee',
right_on = 'employee_name').drop('employee_name', axis = 1)
dept_hire_names
| | employee | department | hire_year |
|---|---|---|---|
| 0 | Emily | Accounting | 2020 |
| 1 | Jake | HR | 2019 |
| 2 | Paul | Engineering | 2023 |
| 3 | Jackie | Accounting | 2011 |
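All the merges so far use pd.merge's default inner join, which keeps only keys present in both tables. The how argument ('left', 'right', 'outer') changes which keys are kept; a small sketch with hypothetical employee names:

```python
import pandas as pd

left = pd.DataFrame({'employee': ['Emily', 'Jake'],
                     'department': ['Accounting', 'HR']})
right = pd.DataFrame({'employee': ['Emily', 'Sara'],
                      'hire_year': ['2020', '2022']})

# an outer join keeps keys from both sides; unmatched cells become NaN,
# and indicator=True adds a _merge column showing where each row came from
outer = pd.merge(left, right, on='employee', how='outer', indicator=True)
print(outer)
```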
5.3.4 Ex: Joining ‘uszipcode’
- Create a subset of zip codes from the jan23 data
- Use the ‘uszipcode’ package to join those zip codes with the city and state information it provides
from uszipcode import SearchEngine
search = SearchEngine()
# create a DF of zip codes from jan23 & convert to integers
zipcodes = pd.DataFrame(jan23["ZIP CODE"].tail(15).dropna().reset_index(drop = True))
zipcodes["ZIP CODE"] = zipcodes["ZIP CODE"].astype(int)
# create a new, empty column in the df to store the address information
zipcodes['Address'] = None
# using the uszipcode library to retrieve address info
for index, row in zipcodes.iterrows():
result = search.by_zipcode(row['ZIP CODE'])
zipcodes.at[index, 'Address'] = result.major_city + ', ' + result.state
print(zipcodes)
 ZIP CODE Address
0 11228 Brooklyn, NY
1 10027 New York, NY
2 10040 New York, NY
3 10035 New York, NY
4 10035 New York, NY
5 10457 Bronx, NY
6 11203 Brooklyn, NY
7 11208 Brooklyn, NY
8 11230 Brooklyn, NY
9 10033 New York, NY
10 11435 Jamaica, NY
5.4 Aggregation and Grouping
Built-In Pandas Aggregations (for DataFrame & Series objects):
- count( ) - Total number of items
- first( ), last( ) - First and last item
- mean( ), median( ) - Mean and median
- min( ), max( ) - Minimum and maximum
- std( ), var( ) - Standard deviation and variance
- mad( ) - Mean absolute deviation
- prod( ) - Product of all items
- sum( ) - Sum of all items
- groupby( ) - Compute aggregates on subsets of data
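A quick sketch of a few of these built-ins on a small hand-made Series:

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8])

print(s.count())         # number of items -> 4
print(s.sum())           # 2 + 4 + 6 + 8 -> 20
print(s.mean())          # -> 5.0
print(s.min(), s.max())  # -> 2 8
print(s.prod())          # 2 * 4 * 6 * 8 -> 384
```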
5.4.1 Ex: Titanic Groupby
- We will use the Titanic Data Set from the ‘seaborn’ library
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
- Below, the data is grouped by ‘sex’ and the count of non-missing values in each column is displayed
titanic.groupby('sex').count()
| | survived | pclass | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sex | ||||||||||||||
| female | 314 | 314 | 261 | 314 | 314 | 314 | 312 | 314 | 314 | 314 | 97 | 312 | 314 | 314 |
| male | 577 | 577 | 453 | 577 | 577 | 577 | 577 | 577 | 577 | 577 | 106 | 577 | 577 | 577 |
5.4.2 Ex: Crash Data Group By
- Using the crash_zip DataFrame from above, grouping by ZIP CODE and applying count() shows how many rows correspond to each zip code; because crash_zip was de-duplicated on ZIP CODE, each count here is 1
crash_zip.groupby('ZIP CODE').count()
| | CRASH DATE |
|---|---|
| ZIP CODE | |
| 10001.0 | 1 |
| 10002.0 | 1 |
| 10003.0 | 1 |
| 10004.0 | 1 |
| 10005.0 | 1 |
| ... | ... |
| 11436.0 | 1 |
| 11691.0 | 1 |
| 11692.0 | 1 |
| 11693.0 | 1 |
| 11694.0 | 1 |
180 rows × 1 columns
5.5 Pivot Tables
- Creates a two-dimensional table using column data
- An easy way to summarize data and visualize patterns
- Can be built with ‘groupby( )’ or with the dedicated pivot_table( ) method
5.5.1 Ex: Pivot Table using Titanic Data Set
- We will again use the Titanic data set, but now we will create a pivot table
We can use groupby() to help create a pivot table:
- Group the data by ‘sex’ and ‘class’ and select ‘survived’
- Then apply the aggregate function mean() and unstack the result into a table
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
| class | First | Second | Third |
|---|---|---|---|
| sex | |||
| female | 0.968085 | 0.921053 | 0.500000 |
| male | 0.368852 | 0.157407 | 0.135447 |
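The same result can be produced in one call with the pivot_table() method, whose default aggregate is the mean. A toy sketch with hand-made data, since the equivalence does not depend on the Titanic set:

```python
import pandas as pd

df = pd.DataFrame({'sex': ['female', 'female', 'male', 'male'],
                   'grp': ['A', 'B', 'A', 'B'],
                   'val': [1.0, 0.0, 0.5, 0.5]})

# groupby + unstack, as above
via_groupby = df.groupby(['sex', 'grp'])['val'].mean().unstack()

# pivot_table does the grouping, aggregation, and unstacking in one step
via_pivot = df.pivot_table('val', index='sex', columns='grp')
print(via_pivot)
```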
5.5.2 Ex: Pivot Table of Crash Data
- Create a pivot table that shows the Number of Persons Injured for every Crash Date per Zip Code
jan23.pivot_table('NUMBER OF PERSONS INJURED', index = 'ZIP CODE', columns = 'CRASH DATE')
| CRASH DATE | 01/01/2023 | 01/02/2023 | 01/03/2023 | 01/04/2023 | 01/05/2023 | 01/06/2023 | 01/07/2023 | 01/08/2023 | 01/09/2023 | 01/10/2023 | ... | 01/22/2023 | 01/23/2023 | 01/24/2023 | 01/25/2023 | 01/26/2023 | 01/27/2023 | 01/28/2023 | 01/29/2023 | 01/30/2023 | 01/31/2023 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ZIP CODE | |||||||||||||||||||||
| 10001.0 | 0.0 | NaN | 0.000000 | NaN | 0.0 | NaN | NaN | 0.0 | 0.0 | NaN | ... | NaN | NaN | 0.0 | 0.500000 | 0.0 | 0.0 | 1.00 | 1.0 | 0.500000 | 0.5 |
| 10002.0 | 0.0 | 0.5 | 0.666667 | 0.0 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | ... | 0.5 | NaN | 0.0 | 1.333333 | 0.0 | 0.0 | 0.75 | 0.0 | 0.333333 | 1.0 |
| 10003.0 | 0.5 | 0.0 | 1.000000 | 1.0 | NaN | NaN | 0.5 | 0.0 | 0.0 | NaN | ... | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.50 | 0.0 | 0.000000 | 1.0 |
| 10004.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | NaN | NaN | NaN |
| 10005.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11436.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 0.000000 | NaN | NaN | 0.00 | NaN | 0.000000 | NaN |
| 11691.0 | 0.0 | 0.5 | 1.000000 | 1.0 | 1.0 | NaN | NaN | 0.0 | NaN | 0.0 | ... | 0.5 | 1.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.50 | NaN | 0.000000 | 0.0 |
| 11692.0 | NaN | NaN | 1.000000 | 1.0 | 0.0 | 1.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | 0.0 |
| 11693.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0.0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 11694.0 | 0.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.0 | 1.0 | NaN | 1.000000 | 0.0 | NaN | NaN | NaN | NaN | NaN |
180 rows × 31 columns
5.6 Vectorized String Operations
Use the ‘str’ attribute of Pandas Series and Index objects to access vectorized string operations
Some examples of operations:
- len()
- lower()
- translate()
- islower()
- ljust()
- upper()
- startswith()
- isupper()
- rjust()
- find()
- endswith()
- isnumeric()
- capitalize()
- swapcase()
- istitle()
- rpartition()
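A few of these in action, on a small hand-made Series:

```python
import pandas as pd

names = pd.Series(['brooklyn', 'queens', 'manhattan'])

print(names.str.upper())          # BROOKLYN, QUEENS, MANHATTAN
print(names.str.len())            # 8, 6, 9
print(names.str.startswith('b'))  # True, False, False
print(names.str.capitalize())     # Brooklyn, Queens, Manhattan
```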
5.6.1 Ex: Create a Pandas Series of ‘BOROUGH’ names and apply string methods.
bname = pd.Series(jan23['BOROUGH'])
bname.head(15).dropna() # original
5 BROOKLYN
6 QUEENS
7 MANHATTAN
11 QUEENS
12 BROOKLYN
13 QUEENS
14 QUEENS
Name: BOROUGH, dtype: object
bname.str.capitalize().head(15).dropna() # to make first letter capital
5 Brooklyn
6 Queens
7 Manhattan
11 Queens
12 Brooklyn
13 Queens
14 Queens
Name: BOROUGH, dtype: object
bname.str.swapcase().head(15).dropna() # to swap the case of every letter
5 brooklyn
6 queens
7 manhattan
11 queens
12 brooklyn
13 queens
14 queens
Name: BOROUGH, dtype: object
bname.str.len().head(15).dropna() # to return the length of each name
5 8.0
6 6.0
7 9.0
11 6.0
12 8.0
13 6.0
14 6.0
Name: BOROUGH, dtype: float64
bname.str.startswith('B').head(15).dropna() # to see if each name starts with the letter B
5 True
6 False
7 False
11 False
12 True
13 False
14 False
Name: BOROUGH, dtype: object
5.7 Time Series
- Time Stamps : Moments in time
- Ex: July 4th, 2023 at 8:00 AM
- Pandas provides the Timestamp type
- Time Intervals: Reference a length of time with a beginning and end
- Ex: The year of 2022
- Pandas provides the Period type
- Time Deltas/ Durations: Reference an exact length of time
- Ex: 0.3 seconds
- Pandas provides the Timedelta type
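Each of the three types above can be constructed directly; a minimal sketch:

```python
import pandas as pd

stamp = pd.Timestamp('2023-07-04 08:00')  # a moment in time
period = pd.Period('2022', freq='Y')      # the whole year 2022
delta = pd.Timedelta(milliseconds=300)    # an exact duration of 0.3 seconds

print(stamp, period, delta)
```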
5.7.1 Can create a Timestamp object
- combines the functionality of ‘datetime’ and ‘dateutil’; the result can be used to index a Series or DataFrame
date = pd.to_datetime("2nd of February, 2023")
date
Timestamp('2023-02-02 00:00:00')
type(date)
pandas._libs.tslibs.timestamps.Timestamp
5.7.2 Can create a Series with time-indexed data
ind = pd.DatetimeIndex(['2022-07-04', '2022-08-04',
'2022-07-04', '2022-08-04'])
inddata = pd.Series([0, 1, 2, 3], index = ind)
inddata
2022-07-04 0
2022-08-04 1
2022-07-04 2
2022-08-04 3
dtype: int64
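A time-indexed Series supports selection by date strings, including partial-string indexing by year; a sketch reusing the Series above:

```python
import pandas as pd

ind = pd.DatetimeIndex(['2022-07-04', '2022-08-04',
                        '2022-07-04', '2022-08-04'])
inddata = pd.Series([0, 1, 2, 3], index=ind)

# Select all rows stamped July 4th (both duplicate entries are returned)
july = inddata['2022-07-04']

# Partial-string indexing: every entry in 2022
y2022 = inddata['2022']
```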
5.7.3 Frequencies and Offsets
The following are the main frequency codes available:
- D Calendar day
- B Business day
- W Weekly
- M Month end
- BM Business month end
- Q Quarter end
- BQ Business quarter end
- A Year end
- BA Business year end
- H Hours
- BH Business hours
- T Minutes
- S Seconds
- L Milliseconds
- U Microseconds
- N Nanoseconds
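These codes plug directly into functions like pd.date_range. A brief sketch contrasting calendar days ('D') with business days ('B'), starting from an arbitrarily chosen Friday:

```python
import pandas as pd

# Starting on a Friday (2023-01-06): 'D' counts every calendar day,
# while 'B' skips the weekend and jumps to Monday
days = pd.date_range('2023-01-06', periods=4, freq='D')
bdays = pd.date_range('2023-01-06', periods=4, freq='B')
```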
5.7.4 Ex: Timedelta
- Create a Timedelta index starting at 00:00:00 using a frequency of 2 hours and 30 minutes (2H30T) over 5 periods.
pd.timedelta_range(0, periods = 5, freq = "2H30T")
TimedeltaIndex(['0 days 00:00:00', '0 days 02:30:00', '0 days 05:00:00',
'0 days 07:30:00', '0 days 10:00:00'],
dtype='timedelta64[ns]', freq='150T')
5.8 High Performance Pandas: eval()
eval() uses string expressions to efficiently compute operations on DataFrames.
- supports all arithmetic operations, comparison operators, bitwise operators (& and |), and the use of and and or in Boolean expressions
nrows, ncols = 10, 5 # creating 2 DataFrames of 10 rows and 5 columns
rand = np.random.RandomState(7)
dfa, dfb = (pd.DataFrame(rand.rand(nrows, ncols))
            for i in range(2))
# to compute sum of dfa and dfb and place into one table
print("dfa", dfa)
print("dfb", dfb)
pd.eval('dfa + dfb')
dfa 0 1 2 3 4
0 0.076308 0.779919 0.438409 0.723465 0.977990
1 0.538496 0.501120 0.072051 0.268439 0.499883
2 0.679230 0.803739 0.380941 0.065936 0.288146
3 0.909594 0.213385 0.452124 0.931206 0.024899
4 0.600549 0.950130 0.230303 0.548490 0.909128
5 0.133169 0.523413 0.750410 0.669013 0.467753
6 0.204849 0.490766 0.372385 0.477401 0.365890
7 0.837918 0.768648 0.313995 0.572625 0.276049
8 0.452843 0.352978 0.657399 0.370351 0.459093
9 0.719324 0.412992 0.906423 0.180452 0.741119
dfb 0 1 2 3 4
0 0.422374 0.426454 0.634380 0.522906 0.414886
1 0.001427 0.092262 0.709394 0.524346 0.696160
2 0.955468 0.682914 0.053129 0.308853 0.592595
3 0.235120 0.964971 0.945048 0.848401 0.472324
4 0.841477 0.131111 0.308734 0.462996 0.741847
5 0.485825 0.136876 0.343537 0.324426 0.300419
6 0.165501 0.414902 0.448121 0.774900 0.796391
7 0.522390 0.460630 0.778214 0.887289 0.674919
8 0.800479 0.939111 0.040656 0.875672 0.276563
9 0.475764 0.796761 0.717242 0.147148 0.658748
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| 0 | 0.498682 | 1.206372 | 1.072789 | 1.246371 | 1.392875 |
| 1 | 0.539923 | 0.593383 | 0.781446 | 0.792785 | 1.196043 |
| 2 | 1.634698 | 1.486653 | 0.434070 | 0.374789 | 0.880740 |
| 3 | 1.144714 | 1.178356 | 1.397172 | 1.779607 | 0.497223 |
| 4 | 1.442026 | 1.081240 | 0.539037 | 1.011486 | 1.650976 |
| 5 | 0.618995 | 0.660289 | 1.093946 | 0.993439 | 0.768172 |
| 6 | 0.370350 | 0.905668 | 0.820505 | 1.252302 | 1.162281 |
| 7 | 1.360308 | 1.229278 | 1.092208 | 1.459914 | 0.950968 |
| 8 | 1.253322 | 1.292090 | 0.698055 | 1.246023 | 0.735656 |
| 9 | 1.195089 | 1.209753 | 1.623666 | 0.327599 | 1.399867 |
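Beyond the top-level pd.eval(), the DataFrame.eval() and query() methods accept the same string syntax and can refer to columns by name. A small sketch with an invented DataFrame:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(5, 3), columns=['A', 'B', 'C'])

# DataFrame.eval can assign a new column from a string expression
df.eval('D = (A + B) / C', inplace=True)

# query() filters rows using the same string syntax
small = df.query('D < 2')
```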