import pandas as pd
import numpy as np
5 Data Manipulation with Pandas
This chapter was prepared by Garrick Ho and Ginamarie Mastrorilli.
Pandas is an open source Python library for data analysis. It gives Python the ability to work with spreadsheet-like data for fast data loading, manipulating, aligning, merging, etc. The name is derived from ‘panel data’, an econometrics term for multidimensional structured datasets.
Pandas introduces two new data types to Python: Series and DataFrame
5.1 Series
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index
1, 3, 4] [
[1, 3, 4]
To create a series, usepd.Series()
.
= pd.Series([4, 7, -5, 3])
s s
0 4
1 7
2 -5
3 3
dtype: int64
The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through n-1 (where n is the length of the data).
= pd.Series([4, 7, -5, 3], index = ['a', 'b', 'c', 'd'])
s s
a 4
b 7
c -5
d 3
dtype: int64
s.values
array([ 4, 7, -5, 3])
s.index
Index(['a', 'b', 'c', 'd'], dtype='object')
5.1.1 Selecting single or a set of values using index
To select a single value, use [ ]
.
'b'] s[
7
To select multiple values, add a comma between each value and use double brackets.
'c', 'a', 'b']] s[[
c -5
a 4
b 7
dtype: int64
Selecting by the index is also possible.
2] s[
-5
To select multiple values in a row, use :
and the first index is where it starts and the second index is where it stops. It does not include the second index.
1:3] s[
b 7
c -5
dtype: int64
To select multiple values, add a comma between each value and use double brackets.
0,3]] s[[
a 4
d 3
dtype: int64
5.1.2 Filtering
Filtering values in a series can be done with <, >, =
.
> 0 s
a True
b True
c False
d True
dtype: bool
> 0] s[s
a 4
b 7
d 3
dtype: int64
5.1.3 Math operation
Math functions are able to be apply to a series.
**2 s
a 16
b 49
c 25
d 9
dtype: int64
np.exp(s)
a 54.598150
b 1096.633158
c 0.006738
d 20.085537
dtype: float64
s.mean()
2.25
Series are aligned by index label in arithmetic operations.
= pd.Series([1, 2, 3, 4], index = ['a', 'c', 'd', 'e']) s2
= s + s2
zero zero
a 5.0
b NaN
c -3.0
d 6.0
e NaN
dtype: float64
Note: “NaN” stands for missing values in pandas
5.1.4 Finding NaN values
To find all the missing values in a series, use .isnull()
.
zero.isnull()
a False
b True
c False
d False
e True
dtype: bool
To find all the non-missing vales, use .notnull()
.
zero.notnull()
a True
b False
c True
d True
e False
dtype: bool
zero[zero.notnull()]
a 5.0
c -3.0
d 6.0
dtype: float64
5.1.5 Replacing NaN
To change NaN to 0, use .fillna()
.
0) zero.fillna(
a 5.0
b 0.0
c -3.0
d 6.0
e 0.0
dtype: float64
5.1.6 Forward-fill
Fill all the NaN values with the previous value in the series.
= 'ffill') zero.fillna(method
a 5.0
b 5.0
c -3.0
d 6.0
e 6.0
dtype: float64
5.1.7 Back-fill
Fill all the NaN values with the next value in the series.
Note that e is Nan because there is no next value in the series.
= 'bfill') zero.fillna(method
a 5.0
b -3.0
c -3.0
d 6.0
e NaN
dtype: float64
5.1.8 Drop
To drop all NaN, use .dropna()
.
zero.dropna()
a 5.0
c -3.0
d 6.0
dtype: float64
Notice how that zero hasn’t change at all. If the function wants to be applied to the original series, set it to its self.
zero
a 5.0
b NaN
c -3.0
d 6.0
e NaN
dtype: float64
= zero.dropna()
zero zero
a 5.0
c -3.0
d 6.0
dtype: float64
change the index to be the same as s2 so there is no missing value
= ['a', 'c', 'd', 'e']
s.index + s2 s
a 5
c 9
d -2
e 7
dtype: int64
5.2 DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns. The DataFrame has both a row and column index.
Since each column of a DataFrame is essentially a Series with its column index, it can be thought of as a dictionary of Series all sharing the same index.
Each column (Series) has to be the same type, whereas, each row can contain mixed types.
5.2.1 Creating DataFrame
5.2.1.1 from a dict of equal-length lists
= {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
data 'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
= pd.DataFrame(data)
d d
state | year | pop | |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
5 | Nevada | 2003 | 3.2 |
5.2.1.2 from an DataFrame
Starting with an empty dataframe, series are able to be added to the Dataframe.
= pd.DataFrame() d1
'state'] = ['Ohio', 'Nevada']
d1['year'] = [2001, 2001]
d1['pop'] = [1.7, 2.4] d1[
d1
state | year | pop | |
---|---|---|---|
0 | Ohio | 2001 | 1.7 |
1 | Nevada | 2001 | 2.4 |
5.2.2 select columns
d
state | year | pop | |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
5 | Nevada | 2003 | 3.2 |
'state'] d[
0 Ohio
1 Ohio
2 Ohio
3 Nevada
4 Nevada
5 Nevada
Name: state, dtype: object
'state','pop']] d[[
state | pop | |
---|---|---|
0 | Ohio | 1.5 |
1 | Ohio | 1.7 |
2 | Ohio | 3.6 |
3 | Nevada | 2.4 |
4 | Nevada | 2.9 |
5 | Nevada | 3.2 |
5.2.3 select rows
= pd.DataFrame(np.arange(16).reshape((4, 4)),
d2 =['Ohio', 'Colorado', 'Utah', 'New York'],
index=['one', 'two', 'three', 'four'])
columns d2
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
'Colorado': 'Utah'] d2.loc[
one | two | three | four | |
---|---|---|---|---|
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
1:3] d2.iloc[
one | two | three | four | |
---|---|---|---|---|
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
5.2.4 change row index and column name
Use .rename
ti rename any row or column.
={'Colorado':'Connecticut'},columns={'one':'five'}) d2.rename(index
five | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Connecticut | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
Notice how d2 does not change.
d2
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
You can use inplace = True
to change the original Dataframe.
= {'Colorado':'Connecticut'}, columns = {'one':'five'}, inplace = True) d2.rename(index
d2
five | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Connecticut | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
5.2.5 basics attributes and methods
d2.index
Index(['Ohio', 'Connecticut', 'Utah', 'New York'], dtype='object')
d2.columns
Index(['five', 'two', 'three', 'four'], dtype='object')
d2.values
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
d2.shape
(4, 4)
d2.mean()
five 6.0
two 7.0
three 8.0
four 9.0
dtype: float64
5.2.6 add and delete rows and columns
For dropping a row or column use .drop
.
= "Connecticut", columns = "five") # add "inplace=True" will change the original DataFrame d2.drop(index
two | three | four | |
---|---|---|---|
Ohio | 1 | 2 | 3 |
Utah | 9 | 10 | 11 |
New York | 13 | 14 | 15 |
d2
five | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Connecticut | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
For deleting a column use del
.
del d2['five']
d2
two | three | four | |
---|---|---|---|
Ohio | 1 | 2 | 3 |
Connecticut | 5 | 6 | 7 |
Utah | 9 | 10 | 11 |
New York | 13 | 14 | 15 |
'one'] = [1, 2, 3, 4]
d2[ d2
two | three | four | one | |
---|---|---|---|---|
Ohio | 1 | 2 | 3 | 1 |
Connecticut | 5 | 6 | 7 | 2 |
Utah | 9 | 10 | 11 | 3 |
New York | 13 | 14 | 15 | 4 |
.pop
returns the values and removes it from the original Dataframe.
'one') d2.pop(
Ohio 1
Connecticut 2
Utah 3
New York 4
Name: one, dtype: int64
d2
two | three | four | |
---|---|---|---|
Ohio | 1 | 2 | 3 |
Connecticut | 5 | 6 | 7 |
Utah | 9 | 10 | 11 |
New York | 13 | 14 | 15 |
5.2.7 Common method
You can import dataset as well
5.2.7.1 csv file
import pandas as pd
= pd.read_csv("data/nyc_crashes_202301.csv") crashes
= crashes.iloc[0:35, 0:8] sub_set_1
5.2.7.2 Head and Tail
These two methods show the first and the last a few records from a DataFrame, default is 5.
sub_set_1.head()
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
---|---|---|---|---|---|---|---|---|
0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.91244 | (40.769737, -73.91244) | ASTORIA BOULEVARD |
4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.85072 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
sub_set_1.tail()
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
---|---|---|---|---|---|---|---|---|
30 | 01/01/2023 | 17:05 | NaN | NaN | 40.795250 | -73.97321 | (40.79525, -73.97321) | WEST 96 STREET |
31 | 01/01/2023 | 5:45 | QUEENS | 11411.0 | 40.690640 | -73.72832 | (40.69064, -73.72832) | NaN |
32 | 01/01/2023 | 0:55 | NaN | NaN | 40.818470 | -73.94140 | (40.81847, -73.9414) | WEST 140 STREET |
33 | 01/01/2023 | 10:16 | QUEENS | 11369.0 | 40.764633 | -73.86537 | (40.764633, -73.86537) | NaN |
34 | 01/01/2023 | 3:45 | QUEENS | 11694.0 | 40.580510 | -73.84773 | (40.58051, -73.84773) | NaN |
3) sub_set_1.head(
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
---|---|---|---|---|---|---|---|---|
0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
5.2.8 unique and nunique
To show only unique values, use .unique
.
'BOROUGH'].unique() sub_set_1[
array([nan, 'BROOKLYN', 'QUEENS', 'MANHATTAN', 'STATEN ISLAND', 'BRONX'],
dtype=object)
To get the number of unique values, use .nunique
.
'BOROUGH'].nunique() sub_set_1[
5
5.2.9 count and value_counts
To count the non missing values, use .count
.
'BOROUGH'].count() sub_set_1[
21
To count the number in each categroy, use .value_counts
.
'BOROUGH'].value_counts() sub_set_1[
QUEENS 12
BROOKLYN 6
MANHATTAN 1
STATEN ISLAND 1
BRONX 1
Name: BOROUGH, dtype: int64
5.2.10 describe and info
sub_set_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRASH DATE 35 non-null object
1 CRASH TIME 35 non-null object
2 BOROUGH 21 non-null object
3 ZIP CODE 21 non-null float64
4 LATITUDE 30 non-null float64
5 LONGITUDE 30 non-null float64
6 LOCATION 30 non-null object
7 ON STREET NAME 27 non-null object
dtypes: float64(3), object(5)
memory usage: 2.3+ KB
Summary statistics for numeric type columns.
Use .describe
to get an quick summary of the data.
sub_set_1.describe()
ZIP CODE | LATITUDE | LONGITUDE | |
---|---|---|---|
count | 21.000000 | 30.000000 | 30.000000 |
mean | 11188.095238 | 40.728138 | -73.885913 |
std | 416.736476 | 0.086772 | 0.076496 |
min | 10026.000000 | 40.580510 | -74.093414 |
25% | 11207.000000 | 40.664881 | -73.937413 |
50% | 11354.000000 | 40.747835 | -73.886998 |
75% | 11369.000000 | 40.793048 | -73.848292 |
max | 11694.000000 | 40.861810 | -73.728320 |
=[x/10 for x in list(range(1, 10, 1))]) sub_set_1.describe(percentiles
ZIP CODE | LATITUDE | LONGITUDE | |
---|---|---|---|
count | 21.000000 | 30.000000 | 30.000000 |
mean | 11188.095238 | 40.728138 | -73.885913 |
std | 416.736476 | 0.086772 | 0.076496 |
min | 10026.000000 | 40.580510 | -74.093414 |
10% | 10468.000000 | 40.598224 | -73.959692 |
20% | 11207.000000 | 40.649558 | -73.943165 |
30% | 11223.000000 | 40.673481 | -73.925244 |
40% | 11233.000000 | 40.708116 | -73.906082 |
50% | 11354.000000 | 40.747835 | -73.886998 |
60% | 11357.000000 | 40.760311 | -73.871886 |
70% | 11369.000000 | 40.776043 | -73.850365 |
80% | 11372.000000 | 40.808170 | -73.828438 |
90% | 11411.000000 | 40.833641 | -73.805664 |
max | 11694.000000 | 40.861810 | -73.728320 |
choose a specific column to get a summary for.
'BOROUGH'].describe() sub_set_1[
count 21
unique 5
top QUEENS
freq 12
Name: BOROUGH, dtype: object
5.2.11 idxmax and nlargest
.idxmax()
returns the index of the largest value.
'ZIP CODE'].idxmax() sub_set_1[
34
.idxmin()
returns the index of the smallest value
'ZIP CODE'].idxmin() sub_set_1[
7
.nlargest
returns the largest values with their index (default is 5).
'ZIP CODE'].nlargest() sub_set_1[
34 11694.0
11 11691.0
31 11411.0
13 11375.0
26 11372.0
Name: ZIP CODE, dtype: float64
.nsmallest
returns the smallest 3 values with their index (default is 5).
'ZIP CODE'].nsmallest() sub_set_1[
7 10026.0
17 10305.0
28 10468.0
6 11101.0
12 11207.0
Name: ZIP CODE, dtype: float64
5.2.12 sort
use .sort_values
to sort values
= 'BOROUGH') sub_set_1.sort_values(by
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
---|---|---|---|---|---|---|---|---|
28 | 01/01/2023 | 7:40 | BRONX | 10468.0 | 40.861810 | -73.912320 | (40.86181, -73.91232) | NaN |
5 | 01/01/2023 | 0:35 | BROOKLYN | 11229.0 | 40.601310 | -73.954720 | (40.60131, -73.95472) | NaN |
27 | 01/01/2023 | 5:36 | BROOKLYN | 11207.0 | 40.663136 | -73.883250 | (40.663136, -73.88325) | NaN |
22 | 01/01/2023 | 4:20 | BROOKLYN | 11233.0 | 40.670116 | -73.922480 | (40.670116, -73.92248) | SAINT JOHNS PLACE |
15 | 01/01/2023 | 3:30 | BROOKLYN | 11236.0 | 40.636720 | -73.887695 | (40.63672, -73.887695) | NaN |
19 | 01/01/2023 | 1:10 | BROOKLYN | 11223.0 | 40.593760 | -73.982740 | (40.59376, -73.98274) | AVENUE V |
12 | 01/01/2023 | 1:45 | BROOKLYN | 11207.0 | 40.652767 | -73.886300 | (40.652767, -73.8863) | PENNSYLVANIA AVENUE |
7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
14 | 01/01/2023 | 17:35 | QUEENS | 11354.0 | 40.771587 | -73.810070 | (40.771587, -73.81007) | BAYSIDE AVENUE |
16 | 01/01/2023 | 10:30 | QUEENS | 11355.0 | 40.751800 | -73.817314 | (40.7518, -73.817314) | HOLLY AVENUE |
33 | 01/01/2023 | 10:16 | QUEENS | 11369.0 | 40.764633 | -73.865370 | (40.764633, -73.86537) | NaN |
21 | 01/01/2023 | 20:25 | QUEENS | 11357.0 | 40.786440 | -73.829155 | (40.78644, -73.829155) | 140 STREET |
11 | 01/01/2023 | 23:06 | QUEENS | 11691.0 | 40.598720 | -73.766010 | (40.59872, -73.76601) | BEACH 32 STREET |
23 | 01/01/2023 | 6:45 | QUEENS | 11368.0 | 40.739280 | -73.850530 | (40.73928, -73.85053) | SAULTELL AVENUE |
24 | 01/01/2023 | 23:15 | QUEENS | 11369.0 | 40.757430 | -73.876230 | (40.75743, -73.87623) | NaN |
26 | 01/01/2023 | 4:00 | QUEENS | 11372.0 | 40.751003 | -73.892130 | (40.751003, -73.89213) | 74 STREET |
6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
31 | 01/01/2023 | 5:45 | QUEENS | 11411.0 | 40.690640 | -73.728320 | (40.69064, -73.72832) | NaN |
13 | 01/01/2023 | 17:55 | QUEENS | 11375.0 | 40.710320 | -73.849980 | (40.71032, -73.84998) | METROPOLITAN AVENUE |
34 | 01/01/2023 | 3:45 | QUEENS | 11694.0 | 40.580510 | -73.847730 | (40.58051, -73.84773) | NaN |
17 | 01/01/2023 | 6:40 | STATEN ISLAND | 10305.0 | 40.585240 | -74.093414 | (40.58524, -74.093414) | HYLAN BOULEVARD |
0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.912440 | (40.769737, -73.91244) | ASTORIA BOULEVARD |
4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
8 | 01/01/2023 | 19:53 | NaN | NaN | NaN | NaN | NaN | EAST 222 STREET |
9 | 01/01/2023 | 3:11 | NaN | NaN | 40.842110 | -73.825570 | (40.84211, -73.82557) | BRUCKNER EXPRESSWAY |
10 | 01/01/2023 | 18:49 | NaN | NaN | 40.674923 | -73.736940 | (40.674923, -73.73694) | MERRICK BOULEVARD |
18 | 01/01/2023 | 5:30 | NaN | NaN | 40.704810 | -73.939320 | (40.70481, -73.93932) | SEIGEL STREET |
20 | 01/01/2023 | 0:04 | NaN | NaN | 40.858090 | -73.901924 | (40.85809, -73.901924) | EAST 183 STREET |
25 | 01/01/2023 | 3:35 | NaN | NaN | NaN | NaN | NaN | GRAND CENTRAL PARKWAY |
29 | 01/01/2023 | 10:50 | NaN | NaN | 40.832700 | -73.950226 | (40.8327, -73.950226) | HENRY HUDSON PARKWAY |
30 | 01/01/2023 | 17:05 | NaN | NaN | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST 96 STREET |
32 | 01/01/2023 | 0:55 | NaN | NaN | 40.818470 | -73.941400 | (40.81847, -73.9414) | WEST 140 STREET |
= ['CRASH DATE', 'ZIP CODE'], ascending = True) sub_set_1.sort_values(by
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
---|---|---|---|---|---|---|---|---|
7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
17 | 01/01/2023 | 6:40 | STATEN ISLAND | 10305.0 | 40.585240 | -74.093414 | (40.58524, -74.093414) | HYLAN BOULEVARD |
28 | 01/01/2023 | 7:40 | BRONX | 10468.0 | 40.861810 | -73.912320 | (40.86181, -73.91232) | NaN |
6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
12 | 01/01/2023 | 1:45 | BROOKLYN | 11207.0 | 40.652767 | -73.886300 | (40.652767, -73.8863) | PENNSYLVANIA AVENUE |
27 | 01/01/2023 | 5:36 | BROOKLYN | 11207.0 | 40.663136 | -73.883250 | (40.663136, -73.88325) | NaN |
19 | 01/01/2023 | 1:10 | BROOKLYN | 11223.0 | 40.593760 | -73.982740 | (40.59376, -73.98274) | AVENUE V |
5 | 01/01/2023 | 0:35 | BROOKLYN | 11229.0 | 40.601310 | -73.954720 | (40.60131, -73.95472) | NaN |
22 | 01/01/2023 | 4:20 | BROOKLYN | 11233.0 | 40.670116 | -73.922480 | (40.670116, -73.92248) | SAINT JOHNS PLACE |
15 | 01/01/2023 | 3:30 | BROOKLYN | 11236.0 | 40.636720 | -73.887695 | (40.63672, -73.887695) | NaN |
14 | 01/01/2023 | 17:35 | QUEENS | 11354.0 | 40.771587 | -73.810070 | (40.771587, -73.81007) | BAYSIDE AVENUE |
16 | 01/01/2023 | 10:30 | QUEENS | 11355.0 | 40.751800 | -73.817314 | (40.7518, -73.817314) | HOLLY AVENUE |
21 | 01/01/2023 | 20:25 | QUEENS | 11357.0 | 40.786440 | -73.829155 | (40.78644, -73.829155) | 140 STREET |
23 | 01/01/2023 | 6:45 | QUEENS | 11368.0 | 40.739280 | -73.850530 | (40.73928, -73.85053) | SAULTELL AVENUE |
24 | 01/01/2023 | 23:15 | QUEENS | 11369.0 | 40.757430 | -73.876230 | (40.75743, -73.87623) | NaN |
33 | 01/01/2023 | 10:16 | QUEENS | 11369.0 | 40.764633 | -73.865370 | (40.764633, -73.86537) | NaN |
26 | 01/01/2023 | 4:00 | QUEENS | 11372.0 | 40.751003 | -73.892130 | (40.751003, -73.89213) | 74 STREET |
13 | 01/01/2023 | 17:55 | QUEENS | 11375.0 | 40.710320 | -73.849980 | (40.71032, -73.84998) | METROPOLITAN AVENUE |
31 | 01/01/2023 | 5:45 | QUEENS | 11411.0 | 40.690640 | -73.728320 | (40.69064, -73.72832) | NaN |
11 | 01/01/2023 | 23:06 | QUEENS | 11691.0 | 40.598720 | -73.766010 | (40.59872, -73.76601) | BEACH 32 STREET |
34 | 01/01/2023 | 3:45 | QUEENS | 11694.0 | 40.580510 | -73.847730 | (40.58051, -73.84773) | NaN |
0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.912440 | (40.769737, -73.91244) | ASTORIA BOULEVARD |
4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
8 | 01/01/2023 | 19:53 | NaN | NaN | NaN | NaN | NaN | EAST 222 STREET |
9 | 01/01/2023 | 3:11 | NaN | NaN | 40.842110 | -73.825570 | (40.84211, -73.82557) | BRUCKNER EXPRESSWAY |
10 | 01/01/2023 | 18:49 | NaN | NaN | 40.674923 | -73.736940 | (40.674923, -73.73694) | MERRICK BOULEVARD |
18 | 01/01/2023 | 5:30 | NaN | NaN | 40.704810 | -73.939320 | (40.70481, -73.93932) | SEIGEL STREET |
20 | 01/01/2023 | 0:04 | NaN | NaN | 40.858090 | -73.901924 | (40.85809, -73.901924) | EAST 183 STREET |
25 | 01/01/2023 | 3:35 | NaN | NaN | NaN | NaN | NaN | GRAND CENTRAL PARKWAY |
29 | 01/01/2023 | 10:50 | NaN | NaN | 40.832700 | -73.950226 | (40.8327, -73.950226) | HENRY HUDSON PARKWAY |
30 | 01/01/2023 | 17:05 | NaN | NaN | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST 96 STREET |
32 | 01/01/2023 | 0:55 | NaN | NaN | 40.818470 | -73.941400 | (40.81847, -73.9414) | WEST 140 STREET |
5.2.13 [] method
[]
method can be used to select column(s) by passing column name(s).
'ZIP CODE'].head() sub_set_1[
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: ZIP CODE, dtype: float64
'BOROUGH', 'ZIP CODE', 'LOCATION']].head() sub_set_1[[
BOROUGH | ZIP CODE | LOCATION | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | NaN | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | NaN | (40.769737, -73.91244) |
4 | NaN | NaN | (40.830555, -73.85072) |
5.2.14 loc
method
loc can be used to index row(s) and column(s) by providing the row and column labels.
df.loc[row_label(s)]
Selects single row or subset of rows from the DataFrame by label.
Index single row
7] sub_set_1.loc[
CRASH DATE 01/01/2023
CRASH TIME 1:00
BOROUGH MANHATTAN
ZIP CODE 10026.0
LATITUDE 40.805595
LONGITUDE -73.95819
LOCATION (40.805595, -73.95819)
ON STREET NAME WEST 116 STREET
Name: 7, dtype: object
Index multiple rows
8] sub_set_1.loc[:
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
---|---|---|---|---|---|---|---|---|
0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.912440 | (40.769737, -73.91244) | ASTORIA BOULEVARD |
4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
5 | 01/01/2023 | 0:35 | BROOKLYN | 11229.0 | 40.601310 | -73.954720 | (40.60131, -73.95472) | NaN |
6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
8 | 01/01/2023 | 19:53 | NaN | NaN | NaN | NaN | NaN | EAST 222 STREET |
0, 7, 4, 6]] sub_set_1.loc[[
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
---|---|---|---|---|---|---|---|---|
0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
df.loc[:, col_labels]
Selects single column or subset of columns by label.
'LOCATION'].head() sub_set_1.loc[:,
0 NaN
1 NaN
2 NaN
3 (40.769737, -73.91244)
4 (40.830555, -73.85072)
Name: LOCATION, dtype: object
'LATITUDE': 'LOCATION'].head() sub_set_1.loc[:,
LATITUDE | LONGITUDE | LOCATION | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | NaN | NaN | NaN |
2 | NaN | NaN | NaN |
3 | 40.769737 | -73.91244 | (40.769737, -73.91244) |
4 | 40.830555 | -73.85072 | (40.830555, -73.85072) |
'BOROUGH', 'ZIP CODE', 'LOCATION']].head() sub_set_1.loc[:, [
BOROUGH | ZIP CODE | LOCATION | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | NaN | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | NaN | (40.769737, -73.91244) |
4 | NaN | NaN | (40.830555, -73.85072) |
df.loc[row_label(s), col_label(s)]
Select both rows and columns by label.
7, 'BOROUGH'] sub_set_1.loc[
'MANHATTAN'
8, ['BOROUGH', 'LOCATION']] sub_set_1.loc[:
BOROUGH | LOCATION | |
---|---|---|
0 | NaN | NaN |
1 | NaN | NaN |
2 | NaN | NaN |
3 | NaN | (40.769737, -73.91244) |
4 | NaN | (40.830555, -73.85072) |
5 | BROOKLYN | (40.60131, -73.95472) |
6 | QUEENS | (40.744667, -73.931694) |
7 | MANHATTAN | (40.805595, -73.95819) |
8 | NaN | NaN |
Index by Boolean Series
'BOROUGH'].isin(['MANHATTAN','QUEENS']).head() sub_set_1[
0 False
1 False
2 False
3 False
4 False
Name: BOROUGH, dtype: bool
'BOROUGH'].isin(['MANHATTAN','QUEENS'])].head() sub_set_1.loc[sub_set_1[
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
---|---|---|---|---|---|---|---|---|
6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
11 | 01/01/2023 | 23:06 | QUEENS | 11691.0 | 40.598720 | -73.766010 | (40.59872, -73.76601) | BEACH 32 STREET |
13 | 01/01/2023 | 17:55 | QUEENS | 11375.0 | 40.710320 | -73.849980 | (40.71032, -73.84998) | METROPOLITAN AVENUE |
14 | 01/01/2023 | 17:35 | QUEENS | 11354.0 | 40.771587 | -73.810070 | (40.771587, -73.81007) | BAYSIDE AVENUE |
Use “&” (and), “|” (or) “~” (not) for Pandas
"BOROUGH"] == "MANHATTAN") & (sub_set_1["ZIP CODE"]
sub_set_1.loc[(sub_set_1[>= 1000)]
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
---|---|---|---|---|---|---|---|---|
7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.95819 | (40.805595, -73.95819) | WEST 116 STREET |
5.2.15 iloc
method
iloc can be used to index row(s) and column(s) by providing the row and column integer(s).
df.iloc[row_integer(s)]
Selects single row or subset of rows from the DataFrame by integer position
Note: same as indexing for sequence (but different with loc
, it is 0 basis and the selection is close to the left and open to the right (the last item is excluded).
3] sub_set_1.iloc[
CRASH DATE 01/01/2023
CRASH TIME 23:45
BOROUGH NaN
ZIP CODE NaN
LATITUDE 40.769737
LONGITUDE -73.91244
LOCATION (40.769737, -73.91244)
ON STREET NAME ASTORIA BOULEVARD
Name: 3, dtype: object
8] sub_set_1.iloc[:
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
---|---|---|---|---|---|---|---|---|
0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.912440 | (40.769737, -73.91244) | ASTORIA BOULEVARD |
4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
5 | 01/01/2023 | 0:35 | BROOKLYN | 11229.0 | 40.601310 | -73.954720 | (40.60131, -73.95472) | NaN |
6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
1:3].head() sub_set_1.iloc[:,
CRASH TIME | BOROUGH | |
---|---|---|
0 | 14:38 | NaN |
1 | 8:04 | NaN |
2 | 18:05 | NaN |
3 | 23:45 | NaN |
4 | 4:50 | NaN |
df.iloc[row_integer(s), col_integer(s)]
Selects row and columns from the DataFrame by integer positions.
0:5, :6] sub_set_1.iloc[
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|
0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN |
1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN |
2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN |
3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.91244 |
4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.85072 |
5.2.16 concat
method
pd.concat([df1, df2], axis = 0)
can be used to combine two dataframe either row-wise or column-wise depends on value of axis:
- 0 (default, row-wise)
- 1 (column-wise)
sub_set_1
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
---|---|---|---|---|---|---|---|---|
0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.912440 | (40.769737, -73.91244) | ASTORIA BOULEVARD |
4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
5 | 01/01/2023 | 0:35 | BROOKLYN | 11229.0 | 40.601310 | -73.954720 | (40.60131, -73.95472) | NaN |
6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
8 | 01/01/2023 | 19:53 | NaN | NaN | NaN | NaN | NaN | EAST 222 STREET |
9 | 01/01/2023 | 3:11 | NaN | NaN | 40.842110 | -73.825570 | (40.84211, -73.82557) | BRUCKNER EXPRESSWAY |
10 | 01/01/2023 | 18:49 | NaN | NaN | 40.674923 | -73.736940 | (40.674923, -73.73694) | MERRICK BOULEVARD |
11 | 01/01/2023 | 23:06 | QUEENS | 11691.0 | 40.598720 | -73.766010 | (40.59872, -73.76601) | BEACH 32 STREET |
12 | 01/01/2023 | 1:45 | BROOKLYN | 11207.0 | 40.652767 | -73.886300 | (40.652767, -73.8863) | PENNSYLVANIA AVENUE |
13 | 01/01/2023 | 17:55 | QUEENS | 11375.0 | 40.710320 | -73.849980 | (40.71032, -73.84998) | METROPOLITAN AVENUE |
14 | 01/01/2023 | 17:35 | QUEENS | 11354.0 | 40.771587 | -73.810070 | (40.771587, -73.81007) | BAYSIDE AVENUE |
15 | 01/01/2023 | 3:30 | BROOKLYN | 11236.0 | 40.636720 | -73.887695 | (40.63672, -73.887695) | NaN |
16 | 01/01/2023 | 10:30 | QUEENS | 11355.0 | 40.751800 | -73.817314 | (40.7518, -73.817314) | HOLLY AVENUE |
17 | 01/01/2023 | 6:40 | STATEN ISLAND | 10305.0 | 40.585240 | -74.093414 | (40.58524, -74.093414) | HYLAN BOULEVARD |
18 | 01/01/2023 | 5:30 | NaN | NaN | 40.704810 | -73.939320 | (40.70481, -73.93932) | SEIGEL STREET |
19 | 01/01/2023 | 1:10 | BROOKLYN | 11223.0 | 40.593760 | -73.982740 | (40.59376, -73.98274) | AVENUE V |
20 | 01/01/2023 | 0:04 | NaN | NaN | 40.858090 | -73.901924 | (40.85809, -73.901924) | EAST 183 STREET |
21 | 01/01/2023 | 20:25 | QUEENS | 11357.0 | 40.786440 | -73.829155 | (40.78644, -73.829155) | 140 STREET |
22 | 01/01/2023 | 4:20 | BROOKLYN | 11233.0 | 40.670116 | -73.922480 | (40.670116, -73.92248) | SAINT JOHNS PLACE |
23 | 01/01/2023 | 6:45 | QUEENS | 11368.0 | 40.739280 | -73.850530 | (40.73928, -73.85053) | SAULTELL AVENUE |
24 | 01/01/2023 | 23:15 | QUEENS | 11369.0 | 40.757430 | -73.876230 | (40.75743, -73.87623) | NaN |
25 | 01/01/2023 | 3:35 | NaN | NaN | NaN | NaN | NaN | GRAND CENTRAL PARKWAY |
26 | 01/01/2023 | 4:00 | QUEENS | 11372.0 | 40.751003 | -73.892130 | (40.751003, -73.89213) | 74 STREET |
27 | 01/01/2023 | 5:36 | BROOKLYN | 11207.0 | 40.663136 | -73.883250 | (40.663136, -73.88325) | NaN |
28 | 01/01/2023 | 7:40 | BRONX | 10468.0 | 40.861810 | -73.912320 | (40.86181, -73.91232) | NaN |
29 | 01/01/2023 | 10:50 | NaN | NaN | 40.832700 | -73.950226 | (40.8327, -73.950226) | HENRY HUDSON PARKWAY |
30 | 01/01/2023 | 17:05 | NaN | NaN | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST 96 STREET |
31 | 01/01/2023 | 5:45 | QUEENS | 11411.0 | 40.690640 | -73.728320 | (40.69064, -73.72832) | NaN |
32 | 01/01/2023 | 0:55 | NaN | NaN | 40.818470 | -73.941400 | (40.81847, -73.9414) | WEST 140 STREET |
33 | 01/01/2023 | 10:16 | QUEENS | 11369.0 | 40.764633 | -73.865370 | (40.764633, -73.86537) | NaN |
34 | 01/01/2023 | 3:45 | QUEENS | 11694.0 | 40.580510 | -73.847730 | (40.58051, -73.84773) | NaN |
= crashes.iloc[35:60, 0:8]
sub_set_2 sub_set_2
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
---|---|---|---|---|---|---|---|---|
35 | 01/01/2023 | 1:20 | QUEENS | 11377.0 | 40.748720 | -73.896250 | (40.74872, -73.89625) | BROOKLYN QUEENS EXPRESSWAY |
36 | 01/01/2023 | 18:26 | BROOKLYN | 11208.0 | 40.678246 | -73.870186 | (40.678246, -73.870186) | NaN |
37 | 01/01/2023 | 0:00 | NaN | NaN | NaN | NaN | NaN | WILLIAMSBURG BRIDGE OUTER ROADWA |
38 | 01/01/2023 | 22:35 | NaN | NaN | 40.755780 | -74.001990 | (40.75578, -74.00199) | WEST 34 STREET |
39 | 01/01/2023 | 12:00 | BROOKLYN | 11211.0 | 40.706690 | -73.958840 | (40.70669, -73.95884) | NaN |
40 | 01/01/2023 | 8:45 | NaN | NaN | 40.828114 | -73.931070 | (40.828114, -73.93107) | MAJOR DEEGAN EXPRESSWAY |
41 | 01/01/2023 | 1:20 | MANHATTAN | 10022.0 | 40.758980 | -73.962440 | (40.75898, -73.96244) | 1 AVENUE |
42 | 01/01/2023 | 16:45 | QUEENS | 11420.0 | 40.679070 | -73.831540 | (40.67907, -73.83154) | NaN |
43 | 01/01/2023 | 0:10 | BROOKLYN | 11236.0 | 40.638350 | -73.908890 | (40.63835, -73.90889) | FLATLANDS AVENUE |
44 | 01/01/2023 | 14:53 | BROOKLYN | 11208.0 | 40.660797 | -73.871830 | (40.660797, -73.87183) | ATKINS AVENUE |
45 | 01/01/2023 | 13:10 | NaN | NaN | 40.831290 | -73.929040 | (40.83129, -73.92904) | WOODYCREST AVENUE |
46 | 01/01/2023 | 14:16 | NaN | NaN | NaN | NaN | NaN | VERRAZANO BRIDGE UPPER |
47 | 01/01/2023 | 15:12 | NaN | NaN | 40.712780 | -74.011690 | (40.71278, -74.01169) | GREENWICH STREET |
48 | 01/01/2023 | 0:00 | NaN | NaN | 40.688370 | -73.944916 | (40.68837, -73.944916) | LEXINGTON AVENUE |
49 | 01/01/2023 | 3:35 | NaN | NaN | 40.825935 | -73.859130 | (40.825935, -73.85913) | BRUCKNER EXPRESSWAY |
50 | 01/01/2023 | 9:45 | NaN | NaN | 40.882645 | -73.886566 | (40.882645, -73.886566) | SEDGWICK AVENUE |
51 | 01/01/2023 | 12:20 | NaN | NaN | NaN | NaN | NaN | 102 CROSS DRIVE |
52 | 01/01/2023 | 12:00 | QUEENS | 11354.0 | 40.764650 | -73.823494 | (40.76465, -73.823494) | NORTHERN BOULEVARD |
53 | 01/01/2023 | 1:17 | QUEENS | 11375.0 | 40.724308 | -73.842575 | (40.724308, -73.842575) | 110 STREET |
54 | 01/01/2023 | 7:15 | MANHATTAN | 10025.0 | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST END AVENUE |
55 | 01/01/2023 | 1:30 | BRONX | 10457.0 | 40.836480 | -73.897736 | (40.83648, -73.897736) | CLAREMONT PARKWAY |
56 | 01/01/2023 | 8:08 | NaN | NaN | 40.686085 | -73.982666 | (40.686085, -73.982666) | ATLANTIC AVENUE |
57 | 01/01/2023 | 6:08 | QUEENS | 11420.0 | 40.677242 | -73.816720 | (40.677242, -73.81672) | NaN |
58 | 01/01/2023 | 8:20 | BRONX | 10451.0 | NaN | NaN | NaN | EAST 138 STREET |
59 | 01/01/2023 | 10:11 | QUEENS | 11358.0 | 40.760440 | -73.804924 | (40.76044, -73.804924) | 161 STREET |
combining by rows
= pd.concat([sub_set_1, sub_set_2])
sub_set_3 sub_set_3
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
---|---|---|---|---|---|---|---|---|
0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP |
1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY |
2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE |
3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.912440 | (40.769737, -73.91244) | ASTORIA BOULEVARD |
4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE |
5 | 01/01/2023 | 0:35 | BROOKLYN | 11229.0 | 40.601310 | -73.954720 | (40.60131, -73.95472) | NaN |
6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD |
7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET |
8 | 01/01/2023 | 19:53 | NaN | NaN | NaN | NaN | NaN | EAST 222 STREET |
9 | 01/01/2023 | 3:11 | NaN | NaN | 40.842110 | -73.825570 | (40.84211, -73.82557) | BRUCKNER EXPRESSWAY |
10 | 01/01/2023 | 18:49 | NaN | NaN | 40.674923 | -73.736940 | (40.674923, -73.73694) | MERRICK BOULEVARD |
11 | 01/01/2023 | 23:06 | QUEENS | 11691.0 | 40.598720 | -73.766010 | (40.59872, -73.76601) | BEACH 32 STREET |
12 | 01/01/2023 | 1:45 | BROOKLYN | 11207.0 | 40.652767 | -73.886300 | (40.652767, -73.8863) | PENNSYLVANIA AVENUE |
13 | 01/01/2023 | 17:55 | QUEENS | 11375.0 | 40.710320 | -73.849980 | (40.71032, -73.84998) | METROPOLITAN AVENUE |
14 | 01/01/2023 | 17:35 | QUEENS | 11354.0 | 40.771587 | -73.810070 | (40.771587, -73.81007) | BAYSIDE AVENUE |
15 | 01/01/2023 | 3:30 | BROOKLYN | 11236.0 | 40.636720 | -73.887695 | (40.63672, -73.887695) | NaN |
16 | 01/01/2023 | 10:30 | QUEENS | 11355.0 | 40.751800 | -73.817314 | (40.7518, -73.817314) | HOLLY AVENUE |
17 | 01/01/2023 | 6:40 | STATEN ISLAND | 10305.0 | 40.585240 | -74.093414 | (40.58524, -74.093414) | HYLAN BOULEVARD |
18 | 01/01/2023 | 5:30 | NaN | NaN | 40.704810 | -73.939320 | (40.70481, -73.93932) | SEIGEL STREET |
19 | 01/01/2023 | 1:10 | BROOKLYN | 11223.0 | 40.593760 | -73.982740 | (40.59376, -73.98274) | AVENUE V |
20 | 01/01/2023 | 0:04 | NaN | NaN | 40.858090 | -73.901924 | (40.85809, -73.901924) | EAST 183 STREET |
21 | 01/01/2023 | 20:25 | QUEENS | 11357.0 | 40.786440 | -73.829155 | (40.78644, -73.829155) | 140 STREET |
22 | 01/01/2023 | 4:20 | BROOKLYN | 11233.0 | 40.670116 | -73.922480 | (40.670116, -73.92248) | SAINT JOHNS PLACE |
23 | 01/01/2023 | 6:45 | QUEENS | 11368.0 | 40.739280 | -73.850530 | (40.73928, -73.85053) | SAULTELL AVENUE |
24 | 01/01/2023 | 23:15 | QUEENS | 11369.0 | 40.757430 | -73.876230 | (40.75743, -73.87623) | NaN |
25 | 01/01/2023 | 3:35 | NaN | NaN | NaN | NaN | NaN | GRAND CENTRAL PARKWAY |
26 | 01/01/2023 | 4:00 | QUEENS | 11372.0 | 40.751003 | -73.892130 | (40.751003, -73.89213) | 74 STREET |
27 | 01/01/2023 | 5:36 | BROOKLYN | 11207.0 | 40.663136 | -73.883250 | (40.663136, -73.88325) | NaN |
28 | 01/01/2023 | 7:40 | BRONX | 10468.0 | 40.861810 | -73.912320 | (40.86181, -73.91232) | NaN |
29 | 01/01/2023 | 10:50 | NaN | NaN | 40.832700 | -73.950226 | (40.8327, -73.950226) | HENRY HUDSON PARKWAY |
30 | 01/01/2023 | 17:05 | NaN | NaN | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST 96 STREET |
31 | 01/01/2023 | 5:45 | QUEENS | 11411.0 | 40.690640 | -73.728320 | (40.69064, -73.72832) | NaN |
32 | 01/01/2023 | 0:55 | NaN | NaN | 40.818470 | -73.941400 | (40.81847, -73.9414) | WEST 140 STREET |
33 | 01/01/2023 | 10:16 | QUEENS | 11369.0 | 40.764633 | -73.865370 | (40.764633, -73.86537) | NaN |
34 | 01/01/2023 | 3:45 | QUEENS | 11694.0 | 40.580510 | -73.847730 | (40.58051, -73.84773) | NaN |
35 | 01/01/2023 | 1:20 | QUEENS | 11377.0 | 40.748720 | -73.896250 | (40.74872, -73.89625) | BROOKLYN QUEENS EXPRESSWAY |
36 | 01/01/2023 | 18:26 | BROOKLYN | 11208.0 | 40.678246 | -73.870186 | (40.678246, -73.870186) | NaN |
37 | 01/01/2023 | 0:00 | NaN | NaN | NaN | NaN | NaN | WILLIAMSBURG BRIDGE OUTER ROADWA |
38 | 01/01/2023 | 22:35 | NaN | NaN | 40.755780 | -74.001990 | (40.75578, -74.00199) | WEST 34 STREET |
39 | 01/01/2023 | 12:00 | BROOKLYN | 11211.0 | 40.706690 | -73.958840 | (40.70669, -73.95884) | NaN |
40 | 01/01/2023 | 8:45 | NaN | NaN | 40.828114 | -73.931070 | (40.828114, -73.93107) | MAJOR DEEGAN EXPRESSWAY |
41 | 01/01/2023 | 1:20 | MANHATTAN | 10022.0 | 40.758980 | -73.962440 | (40.75898, -73.96244) | 1 AVENUE |
42 | 01/01/2023 | 16:45 | QUEENS | 11420.0 | 40.679070 | -73.831540 | (40.67907, -73.83154) | NaN |
43 | 01/01/2023 | 0:10 | BROOKLYN | 11236.0 | 40.638350 | -73.908890 | (40.63835, -73.90889) | FLATLANDS AVENUE |
44 | 01/01/2023 | 14:53 | BROOKLYN | 11208.0 | 40.660797 | -73.871830 | (40.660797, -73.87183) | ATKINS AVENUE |
45 | 01/01/2023 | 13:10 | NaN | NaN | 40.831290 | -73.929040 | (40.83129, -73.92904) | WOODYCREST AVENUE |
46 | 01/01/2023 | 14:16 | NaN | NaN | NaN | NaN | NaN | VERRAZANO BRIDGE UPPER |
47 | 01/01/2023 | 15:12 | NaN | NaN | 40.712780 | -74.011690 | (40.71278, -74.01169) | GREENWICH STREET |
48 | 01/01/2023 | 0:00 | NaN | NaN | 40.688370 | -73.944916 | (40.68837, -73.944916) | LEXINGTON AVENUE |
49 | 01/01/2023 | 3:35 | NaN | NaN | 40.825935 | -73.859130 | (40.825935, -73.85913) | BRUCKNER EXPRESSWAY |
50 | 01/01/2023 | 9:45 | NaN | NaN | 40.882645 | -73.886566 | (40.882645, -73.886566) | SEDGWICK AVENUE |
51 | 01/01/2023 | 12:20 | NaN | NaN | NaN | NaN | NaN | 102 CROSS DRIVE |
52 | 01/01/2023 | 12:00 | QUEENS | 11354.0 | 40.764650 | -73.823494 | (40.76465, -73.823494) | NORTHERN BOULEVARD |
53 | 01/01/2023 | 1:17 | QUEENS | 11375.0 | 40.724308 | -73.842575 | (40.724308, -73.842575) | 110 STREET |
54 | 01/01/2023 | 7:15 | MANHATTAN | 10025.0 | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST END AVENUE |
55 | 01/01/2023 | 1:30 | BRONX | 10457.0 | 40.836480 | -73.897736 | (40.83648, -73.897736) | CLAREMONT PARKWAY |
56 | 01/01/2023 | 8:08 | NaN | NaN | 40.686085 | -73.982666 | (40.686085, -73.982666) | ATLANTIC AVENUE |
57 | 01/01/2023 | 6:08 | QUEENS | 11420.0 | 40.677242 | -73.816720 | (40.677242, -73.81672) | NaN |
58 | 01/01/2023 | 8:20 | BRONX | 10451.0 | NaN | NaN | NaN | EAST 138 STREET |
59 | 01/01/2023 | 10:11 | QUEENS | 11358.0 | 40.760440 | -73.804924 | (40.76044, -73.804924) | 161 STREET |
combining by columns
= pd.concat([sub_set_1, sub_set_2], axis = 1)
sub_set_4 sub_set_4
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01/01/2023 | 14:38 | NaN | NaN | NaN | NaN | NaN | BROOKLYN QUEENS EXPRESSWAY RAMP | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 01/01/2023 | 8:04 | NaN | NaN | NaN | NaN | NaN | NASSAU EXPRESSWAY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 01/01/2023 | 18:05 | NaN | NaN | NaN | NaN | NaN | 10 AVENUE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 01/01/2023 | 23:45 | NaN | NaN | 40.769737 | -73.912440 | (40.769737, -73.91244) | ASTORIA BOULEVARD | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 01/01/2023 | 4:50 | NaN | NaN | 40.830555 | -73.850720 | (40.830555, -73.85072) | CASTLE HILL AVENUE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 01/01/2023 | 0:35 | BROOKLYN | 11229.0 | 40.601310 | -73.954720 | (40.60131, -73.95472) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | 01/01/2023 | 5:45 | QUEENS | 11101.0 | 40.744667 | -73.931694 | (40.744667, -73.931694) | QUEENS BOULEVARD | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | 01/01/2023 | 1:00 | MANHATTAN | 10026.0 | 40.805595 | -73.958190 | (40.805595, -73.95819) | WEST 116 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | 01/01/2023 | 19:53 | NaN | NaN | NaN | NaN | NaN | EAST 222 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9 | 01/01/2023 | 3:11 | NaN | NaN | 40.842110 | -73.825570 | (40.84211, -73.82557) | BRUCKNER EXPRESSWAY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 | 01/01/2023 | 18:49 | NaN | NaN | 40.674923 | -73.736940 | (40.674923, -73.73694) | MERRICK BOULEVARD | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11 | 01/01/2023 | 23:06 | QUEENS | 11691.0 | 40.598720 | -73.766010 | (40.59872, -73.76601) | BEACH 32 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
12 | 01/01/2023 | 1:45 | BROOKLYN | 11207.0 | 40.652767 | -73.886300 | (40.652767, -73.8863) | PENNSYLVANIA AVENUE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
13 | 01/01/2023 | 17:55 | QUEENS | 11375.0 | 40.710320 | -73.849980 | (40.71032, -73.84998) | METROPOLITAN AVENUE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14 | 01/01/2023 | 17:35 | QUEENS | 11354.0 | 40.771587 | -73.810070 | (40.771587, -73.81007) | BAYSIDE AVENUE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 | 01/01/2023 | 3:30 | BROOKLYN | 11236.0 | 40.636720 | -73.887695 | (40.63672, -73.887695) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16 | 01/01/2023 | 10:30 | QUEENS | 11355.0 | 40.751800 | -73.817314 | (40.7518, -73.817314) | HOLLY AVENUE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
17 | 01/01/2023 | 6:40 | STATEN ISLAND | 10305.0 | 40.585240 | -74.093414 | (40.58524, -74.093414) | HYLAN BOULEVARD | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
18 | 01/01/2023 | 5:30 | NaN | NaN | 40.704810 | -73.939320 | (40.70481, -73.93932) | SEIGEL STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
19 | 01/01/2023 | 1:10 | BROOKLYN | 11223.0 | 40.593760 | -73.982740 | (40.59376, -73.98274) | AVENUE V | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20 | 01/01/2023 | 0:04 | NaN | NaN | 40.858090 | -73.901924 | (40.85809, -73.901924) | EAST 183 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21 | 01/01/2023 | 20:25 | QUEENS | 11357.0 | 40.786440 | -73.829155 | (40.78644, -73.829155) | 140 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
22 | 01/01/2023 | 4:20 | BROOKLYN | 11233.0 | 40.670116 | -73.922480 | (40.670116, -73.92248) | SAINT JOHNS PLACE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
23 | 01/01/2023 | 6:45 | QUEENS | 11368.0 | 40.739280 | -73.850530 | (40.73928, -73.85053) | SAULTELL AVENUE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
24 | 01/01/2023 | 23:15 | QUEENS | 11369.0 | 40.757430 | -73.876230 | (40.75743, -73.87623) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25 | 01/01/2023 | 3:35 | NaN | NaN | NaN | NaN | NaN | GRAND CENTRAL PARKWAY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
26 | 01/01/2023 | 4:00 | QUEENS | 11372.0 | 40.751003 | -73.892130 | (40.751003, -73.89213) | 74 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
27 | 01/01/2023 | 5:36 | BROOKLYN | 11207.0 | 40.663136 | -73.883250 | (40.663136, -73.88325) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
28 | 01/01/2023 | 7:40 | BRONX | 10468.0 | 40.861810 | -73.912320 | (40.86181, -73.91232) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
29 | 01/01/2023 | 10:50 | NaN | NaN | 40.832700 | -73.950226 | (40.8327, -73.950226) | HENRY HUDSON PARKWAY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
30 | 01/01/2023 | 17:05 | NaN | NaN | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST 96 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
31 | 01/01/2023 | 5:45 | QUEENS | 11411.0 | 40.690640 | -73.728320 | (40.69064, -73.72832) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
32 | 01/01/2023 | 0:55 | NaN | NaN | 40.818470 | -73.941400 | (40.81847, -73.9414) | WEST 140 STREET | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
33 | 01/01/2023 | 10:16 | QUEENS | 11369.0 | 40.764633 | -73.865370 | (40.764633, -73.86537) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
34 | 01/01/2023 | 3:45 | QUEENS | 11694.0 | 40.580510 | -73.847730 | (40.58051, -73.84773) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
35 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 1:20 | QUEENS | 11377.0 | 40.748720 | -73.896250 | (40.74872, -73.89625) | BROOKLYN QUEENS EXPRESSWAY |
36 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 18:26 | BROOKLYN | 11208.0 | 40.678246 | -73.870186 | (40.678246, -73.870186) | NaN |
37 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 0:00 | NaN | NaN | NaN | NaN | NaN | WILLIAMSBURG BRIDGE OUTER ROADWA |
38 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 22:35 | NaN | NaN | 40.755780 | -74.001990 | (40.75578, -74.00199) | WEST 34 STREET |
39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 12:00 | BROOKLYN | 11211.0 | 40.706690 | -73.958840 | (40.70669, -73.95884) | NaN |
40 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 8:45 | NaN | NaN | 40.828114 | -73.931070 | (40.828114, -73.93107) | MAJOR DEEGAN EXPRESSWAY |
41 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 1:20 | MANHATTAN | 10022.0 | 40.758980 | -73.962440 | (40.75898, -73.96244) | 1 AVENUE |
42 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 16:45 | QUEENS | 11420.0 | 40.679070 | -73.831540 | (40.67907, -73.83154) | NaN |
43 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 0:10 | BROOKLYN | 11236.0 | 40.638350 | -73.908890 | (40.63835, -73.90889) | FLATLANDS AVENUE |
44 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 14:53 | BROOKLYN | 11208.0 | 40.660797 | -73.871830 | (40.660797, -73.87183) | ATKINS AVENUE |
45 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 13:10 | NaN | NaN | 40.831290 | -73.929040 | (40.83129, -73.92904) | WOODYCREST AVENUE |
46 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 14:16 | NaN | NaN | NaN | NaN | NaN | VERRAZANO BRIDGE UPPER |
47 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 15:12 | NaN | NaN | 40.712780 | -74.011690 | (40.71278, -74.01169) | GREENWICH STREET |
48 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 0:00 | NaN | NaN | 40.688370 | -73.944916 | (40.68837, -73.944916) | LEXINGTON AVENUE |
49 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 3:35 | NaN | NaN | 40.825935 | -73.859130 | (40.825935, -73.85913) | BRUCKNER EXPRESSWAY |
50 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 9:45 | NaN | NaN | 40.882645 | -73.886566 | (40.882645, -73.886566) | SEDGWICK AVENUE |
51 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 12:20 | NaN | NaN | NaN | NaN | NaN | 102 CROSS DRIVE |
52 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 12:00 | QUEENS | 11354.0 | 40.764650 | -73.823494 | (40.76465, -73.823494) | NORTHERN BOULEVARD |
53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 1:17 | QUEENS | 11375.0 | 40.724308 | -73.842575 | (40.724308, -73.842575) | 110 STREET |
54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 7:15 | MANHATTAN | 10025.0 | 40.795250 | -73.973210 | (40.79525, -73.97321) | WEST END AVENUE |
55 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 1:30 | BRONX | 10457.0 | 40.836480 | -73.897736 | (40.83648, -73.897736) | CLAREMONT PARKWAY |
56 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 8:08 | NaN | NaN | 40.686085 | -73.982666 | (40.686085, -73.982666) | ATLANTIC AVENUE |
57 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 6:08 | QUEENS | 11420.0 | 40.677242 | -73.816720 | (40.677242, -73.81672) | NaN |
58 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 8:20 | BRONX | 10451.0 | NaN | NaN | NaN | EAST 138 STREET |
59 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 01/01/2023 | 10:11 | QUEENS | 11358.0 | 40.760440 | -73.804924 | (40.76044, -73.804924) | 161 STREET |
use .fillna()
to fill in the missing values
5.3 Merge and Join
pd.merge( )
: allows a user to do one-to-one, one-to-many, and many-to-many joins- One-to-one joins mean each row is related on a single row in a different table using a key column
- One-to-many joins mean each row in one table in related to one or more rows in a different table using a key column
- Many-to-many joins mean one or more rows in one table is related to one or more rows in a seperate table using a key column
5.3.1 Ex: One-to One Join
- Create 2 new Data Frames from the January 2023 data with a common column (Zip Code)
- For this example, we are using .drop_duplicates() to get unique ZIP CODE values
- Using pd.merge( ) the two data frames are combined using that common column as a key
= pd.read_csv("data/nyc_crashes_202301.csv")
jan23 = jan23[["CRASH DATE", "ZIP CODE"]].copy().drop_duplicates(subset = [ "ZIP CODE"])
crash_zip # to view DF of crash date & zip code crash_zip.tail()
CRASH DATE | ZIP CODE | |
---|---|---|
3590 | 01/16/2023 | 10023.0 |
4304 | 01/19/2023 | 11363.0 |
5530 | 01/24/2023 | 11109.0 |
5732 | 01/25/2023 | 10280.0 |
7140 | 01/31/2023 | 10169.0 |
= jan23[["ZIP CODE", "BOROUGH"]].copy().drop_duplicates(subset = [ "ZIP CODE"])
borough_zip # to view DF of zip code and borough borough_zip.tail()
ZIP CODE | BOROUGH | |
---|---|---|
3590 | 10023.0 | MANHATTAN |
4304 | 11363.0 | QUEENS |
5530 | 11109.0 | QUEENS |
5732 | 10280.0 | MANHATTAN |
7140 | 10169.0 | MANHATTAN |
= pd.merge(crash_zip, borough_zip)
merge_w_zip # to view joined data frames merge_w_zip.tail()
CRASH DATE | ZIP CODE | BOROUGH | |
---|---|---|---|
176 | 01/16/2023 | 10023.0 | MANHATTAN |
177 | 01/19/2023 | 11363.0 | QUEENS |
178 | 01/24/2023 | 11109.0 | QUEENS |
179 | 01/25/2023 | 10280.0 | MANHATTAN |
180 | 01/31/2023 | 10169.0 | MANHATTAN |
We can specify the name of the key column using on
:
= 'ZIP CODE').tail() pd.merge(crash_zip, borough_zip, on
CRASH DATE | ZIP CODE | BOROUGH | |
---|---|---|---|
176 | 01/16/2023 | 10023.0 | MANHATTAN |
177 | 01/19/2023 | 11363.0 | QUEENS |
178 | 01/24/2023 | 11109.0 | QUEENS |
179 | 01/25/2023 | 10280.0 | MANHATTAN |
180 | 01/31/2023 | 10169.0 | MANHATTAN |
5.3.2 Ex: One-to-Many Join
- Create a new DataFrame consisting of Employee and Department
- Create a new DataFrame consisting of Employee and Hire Year
= pd.DataFrame({'employee': ['Emily', 'Jake', 'Paul', 'Jackie'],
emp_dept 'department': ['Accounting', 'HR', 'Engineering', 'Accounting']})
= pd.DataFrame({'employee': ['Emily', 'Jake', 'Paul', 'Jackie'],
emp_hire 'hire_year': ['2020', '2019', '2023', '2011']})
# combining to create a one to one join
= pd.merge(emp_dept,emp_hire) # to merge Employee, Department and Hire Year into one dataframe
dept_hire dept_hire
employee | department | hire_year | |
---|---|---|---|
0 | Emily | Accounting | 2020 |
1 | Jake | HR | 2019 |
2 | Paul | Engineering | 2023 |
3 | Jackie | Accounting | 2011 |
- Create a new DataFrame that consists of Department and the Supervisor for that department
- Merge this new DataFrame with ‘dept_hire’ to create a Many-to-One join using Department as a key
= pd.DataFrame({'supervisor': ['Lily', 'Angela', 'Steven'],
dept_sup 'department': ['Accounting', 'HR', 'Engineering']})
= pd.merge(dept_sup,dept_hire) # merge using Department as the key
sup_emp_dept_hire sup_emp_dept_hire
supervisor | department | employee | hire_year | |
---|---|---|---|---|
0 | Lily | Accounting | Emily | 2020 |
1 | Lily | Accounting | Jackie | 2011 |
2 | Angela | HR | Jake | 2019 |
3 | Steven | Engineering | Paul | 2023 |
5.3.3 Ex: Merging when the Key has Different Variable Names
- We will merge two dataframes that have a similar column containing the same information, but are named differently
- Using the employee data from above, but changing
employee
inemp_hire
toemployee_name
- Need to drop either
employee
oremployee_name
after merging to not have redundant information
= pd.DataFrame({'employee': ['Emily', 'Jake', 'Paul', 'Jackie'],
emp_dept_names 'department': ['Accounting', 'HR', 'Engineering', 'Accounting']})
= pd.DataFrame({'employee_name': ['Emily', 'Jake', 'Paul', 'Jackie'],
emp_hire_names 'hire_year': ['2020', '2019', '2023', '2011']})
# to merge Employee, Department and Hire Year into one dataframe & drop column 'employee_name'
= pd.merge(emp_dept_names,emp_hire_names, left_on = 'employee',
dept_hire_names = 'employee_name').drop('employee_name', axis = 1)
right_on
dept_hire_names
employee | department | hire_year | |
---|---|---|---|
0 | Emily | Accounting | 2020 |
1 | Jake | HR | 2019 |
2 | Paul | Engineering | 2023 |
3 | Jackie | Accounting | 2011 |
5.3.4 Ex: Joining ‘uszipcode’
- Create a subset of jan23 data with 7 zipcodes
- Using ‘uszipcode’ data to join the zip codes from jan23 with data provided in this package
from uszipcode import SearchEngine
= SearchEngine()
search
# create a DF of zip codes from jan23 & convert to integers
= pd.DataFrame(jan23["ZIP CODE"].tail(15).dropna().reset_index(drop = True))
zipcodes "ZIP CODE"] = zipcodes["ZIP CODE"].astype(int)
zipcodes[
# create new,empty column in the df to store the address information
'Address'] = None
zipcodes[
# using uszipcode library to retreive address info
for index, row in zipcodes.iterrows():
= search.by_zipcode(row['ZIP CODE'])
result 'Address'] = result.major_city + ', ' + result.state
zipcodes.at[index,
print(zipcodes)
ZIP CODE Address
0 11228 Brooklyn, NY
1 10027 New York, NY
2 10040 New York, NY
3 10035 New York, NY
4 10035 New York, NY
5 10457 Bronx, NY
6 11203 Brooklyn, NY
7 11208 Brooklyn, NY
8 11230 Brooklyn, NY
9 10033 New York, NY
10 11435 Jamaica, NY
5.4 Aggregation and Grouping
Built-In Pandas Aggregations (for DataFrame & Series objects): - count( ) - Total number of items - first( ), last( ) - First and last item - mean( ), median( )
- Mean and median - min( ), max( )
- Minimum and maximum - std( ), var( )
- Standard deviation and variance - mad( )
- Mean absolute deviation - prod( )
- Product of all items - sum( )
- Sum of all items - groupby() - compute aggregates on subsets of data
5.4.1 Ex: Titanic Groupby
- We will use the Titanic Data Set from the ‘seaborn’ library
import seaborn as sns
= sns.load_dataset('titanic')
titanic titanic.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
- Below the data is groupby ‘sex’ and the counts for each row are displayed
'sex').count() titanic.groupby(
survived | pclass | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sex | ||||||||||||||
female | 314 | 314 | 261 | 314 | 314 | 314 | 312 | 314 | 314 | 314 | 97 | 312 | 314 | 314 |
male | 577 | 577 | 453 | 577 | 577 | 577 | 577 | 577 | 577 | 577 | 106 | 577 | 577 | 577 |
5.4.2 Ex: Crash Data Group By
- Using the crash_zip DataFrame from above, grouping by ZIP CODE and using the count() method, we can see how many counts for each listed zip code
'ZIP CODE').count() crash_zip.groupby(
CRASH DATE | |
---|---|
ZIP CODE | |
10001.0 | 1 |
10002.0 | 1 |
10003.0 | 1 |
10004.0 | 1 |
10005.0 | 1 |
... | ... |
11436.0 | 1 |
11691.0 | 1 |
11692.0 | 1 |
11693.0 | 1 |
11694.0 | 1 |
180 rows × 1 columns
5.5 Pivot Tables
- Creates a two dimensional table using column data
- Easy way to visualize data to see patterns and summarize data
- Use ‘groupby( )’ to create a pivot table
5.5.1 Ex: Pivot Table using Titanic Data Set
- We will again use the Titanic data set, but now we will create a pivot table
We can use groupby() to help create a pivot table - Group the data by ‘sex’ and ‘class’ to select survival. - Then use the aggregate function mean() to show within the table
'sex', 'class'])['survived'].aggregate('mean').unstack() titanic.groupby([
class | First | Second | Third |
---|---|---|---|
sex | |||
female | 0.968085 | 0.921053 | 0.500000 |
male | 0.368852 | 0.157407 | 0.135447 |
5.5.2 Ex: Pivot Table of Crash Data
- Create a pivot table that shows the Number of Persons Injured for every Crash Date per Zip Code
'NUMBER OF PERSONS INJURED', index = 'ZIP CODE', columns = 'CRASH DATE') jan23.pivot_table(
CRASH DATE | 01/01/2023 | 01/02/2023 | 01/03/2023 | 01/04/2023 | 01/05/2023 | 01/06/2023 | 01/07/2023 | 01/08/2023 | 01/09/2023 | 01/10/2023 | ... | 01/22/2023 | 01/23/2023 | 01/24/2023 | 01/25/2023 | 01/26/2023 | 01/27/2023 | 01/28/2023 | 01/29/2023 | 01/30/2023 | 01/31/2023 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ZIP CODE | |||||||||||||||||||||
10001.0 | 0.0 | NaN | 0.000000 | NaN | 0.0 | NaN | NaN | 0.0 | 0.0 | NaN | ... | NaN | NaN | 0.0 | 0.500000 | 0.0 | 0.0 | 1.00 | 1.0 | 0.500000 | 0.5 |
10002.0 | 0.0 | 0.5 | 0.666667 | 0.0 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | ... | 0.5 | NaN | 0.0 | 1.333333 | 0.0 | 0.0 | 0.75 | 0.0 | 0.333333 | 1.0 |
10003.0 | 0.5 | 0.0 | 1.000000 | 1.0 | NaN | NaN | 0.5 | 0.0 | 0.0 | NaN | ... | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.50 | 0.0 | 0.000000 | 1.0 |
10004.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | NaN | NaN | NaN |
10005.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11436.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 0.000000 | NaN | NaN | 0.00 | NaN | 0.000000 | NaN |
11691.0 | 0.0 | 0.5 | 1.000000 | 1.0 | 1.0 | NaN | NaN | 0.0 | NaN | 0.0 | ... | 0.5 | 1.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.50 | NaN | 0.000000 | 0.0 |
11692.0 | NaN | NaN | 1.000000 | 1.0 | 0.0 | 1.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | 0.0 |
11693.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0.0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN |
11694.0 | 0.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.0 | 1.0 | NaN | 1.000000 | 0.0 | NaN | NaN | NaN | NaN | NaN |
180 rows × 31 columns
5.6 Vectorized String Operations
Must use the ‘str’ attribute of a Pandas Series and Index objects to access operations
Some examples of operations:
- len()
- lower()
- translate()
- slower()
- ljust()
- upper()
- startswith()
- isupper()
- rjust()
- find()
- endswith()
- isnumeric()
- capitalize()
- swapcase()
- istitle()
- rpartition()
5.6.1 Ex: Create a Panda Series of ‘BOROUGH’ names and swap cases.
= pd.Series(jan23['BOROUGH'])
bname 15).dropna() # original bname.head(
5 BROOKLYN
6 QUEENS
7 MANHATTAN
11 QUEENS
12 BROOKLYN
13 QUEENS
14 QUEENS
Name: BOROUGH, dtype: object
str.capitalize().head(15).dropna()# to make first letter capital bname.
5 Brooklyn
6 Queens
7 Manhattan
11 Queens
12 Brooklyn
13 Queens
14 Queens
Name: BOROUGH, dtype: object
str.swapcase().head(15).dropna() # to make all lower case bname.
5 brooklyn
6 queens
7 manhattan
11 queens
12 brooklyn
13 queens
14 queens
Name: BOROUGH, dtype: object
str.len().head(15).dropna() # to return the length of the name and data type bname.
5 8.0
6 6.0
7 9.0
11 6.0
12 8.0
13 6.0
14 6.0
Name: BOROUGH, dtype: float64
str.startswith('B').head(15).dropna() # to see if starts with a letter B bname.
5 True
6 False
7 False
11 False
12 True
13 False
14 False
Name: BOROUGH, dtype: object
5.7 Time Series
- Time Stamps : Moments in time
- Ex: July 4th, 2023 at 8:00 AM
- Pandas provides the Timestamp type
- Time Intervals: Reference a length of time with a beginning and end
- Ex: The year of 2022
- Pandas provides the Period type
- Time Deltas/ Durations: Reference an exact length of time
- Ex: 0.3 seconds
- Pandas provides the Timedelta type
5.7.1 Can create a Timestamp object
- combines ‘datetime’ and ‘dateutil’ to be used as a Series or DataFrame
= pd.to_datetime("2nd of February, 2023")
date
datetype(date)
pandas._libs.tslibs.timestamps.Timestamp
5.7.2 Can create Series that has time indexed data
= pd.DatetimeIndex(['2022-07-04', '2022-08-04',
ind '2022-07-04', '2022-08-04'])
= pd.Series([0,1,2,3], index = ind)
inddata inddata
2022-07-04 0
2022-08-04 1
2022-07-04 2
2022-08-04 3
dtype: int64
5.7.3 Frequencies and Offsets
The following are the main codes avaiable:
- D Calendar day
- B Business day
- W Weekly
- M Month end
- BM Business month end
- Q Quarter end
- BQ Business quarter end
- A Year end
- BA Business year end
- H Hours
- BH Business hours
- T Minutes
- S Seconds
- L Milliseonds
- U Microseconds
- N nanoseconds
5.7.4 Ex: TimeDelta
- create a TimeDelta data type starting at 00:00:00 using frequency of 2 hours and 30 minutes (2H30T) over 5 periods.
0, periods = 5, freq = "2H30T") pd.timedelta_range(
TimedeltaIndex(['0 days 00:00:00', '0 days 02:30:00', '0 days 05:00:00',
'0 days 07:30:00', '0 days 10:00:00'],
dtype='timedelta64[ns]', freq='150T')
5.8 High Performance Pandas: eval()
eval()
uses string expressions to compute operations using DataFrames- supports all arithmetic operations, comparison operators, bitwise operators (
&
and|
), and the use ofand
andor
in Boolean expressions
- supports all arithmetic operations, comparison operators, bitwise operators (
= 10, 5 # creating 2 DF of 5 rows and 10 columns
nrows, ncols = np.random.RandomState(7)
rand = (pd.DataFrame(rand.rand(nrows, ncols))
dfa, dfb for i in range (2))
# to compute sum of dfa and dfb and place into one table
print("dfa", dfa)
print("dfb", dfb)
eval('dfa + dfb') pd.
dfa 0 1 2 3 4
0 0.076308 0.779919 0.438409 0.723465 0.977990
1 0.538496 0.501120 0.072051 0.268439 0.499883
2 0.679230 0.803739 0.380941 0.065936 0.288146
3 0.909594 0.213385 0.452124 0.931206 0.024899
4 0.600549 0.950130 0.230303 0.548490 0.909128
5 0.133169 0.523413 0.750410 0.669013 0.467753
6 0.204849 0.490766 0.372385 0.477401 0.365890
7 0.837918 0.768648 0.313995 0.572625 0.276049
8 0.452843 0.352978 0.657399 0.370351 0.459093
9 0.719324 0.412992 0.906423 0.180452 0.741119
dfb 0 1 2 3 4
0 0.422374 0.426454 0.634380 0.522906 0.414886
1 0.001427 0.092262 0.709394 0.524346 0.696160
2 0.955468 0.682914 0.053129 0.308853 0.592595
3 0.235120 0.964971 0.945048 0.848401 0.472324
4 0.841477 0.131111 0.308734 0.462996 0.741847
5 0.485825 0.136876 0.343537 0.324426 0.300419
6 0.165501 0.414902 0.448121 0.774900 0.796391
7 0.522390 0.460630 0.778214 0.887289 0.674919
8 0.800479 0.939111 0.040656 0.875672 0.276563
9 0.475764 0.796761 0.717242 0.147148 0.658748
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 0.498682 | 1.206372 | 1.072789 | 1.246371 | 1.392875 |
1 | 0.539923 | 0.593383 | 0.781446 | 0.792785 | 1.196043 |
2 | 1.634698 | 1.486653 | 0.434070 | 0.374789 | 0.880740 |
3 | 1.144714 | 1.178356 | 1.397172 | 1.779607 | 0.497223 |
4 | 1.442026 | 1.081240 | 0.539037 | 1.011486 | 1.650976 |
5 | 0.618995 | 0.660289 | 1.093946 | 0.993439 | 0.768172 |
6 | 0.370350 | 0.905668 | 0.820505 | 1.252302 | 1.162281 |
7 | 1.360308 | 1.229278 | 1.092208 | 1.459914 | 0.950968 |
8 | 1.253322 | 1.292090 | 0.698055 | 1.246023 | 0.735656 |
9 | 1.195089 | 1.209753 | 1.623666 | 0.327599 | 1.399867 |