5 Data Manipulation with Pandas

This chapter was prepared by Garrick Ho and Ginamarie Mastrorilli.

Pandas is an open source Python library for data analysis. It gives Python the ability to work with spreadsheet-like data for fast data loading, manipulating, aligning, merging, etc. The name is derived from ‘panel data’, an econometrics term for multidimensional structured datasets.

import pandas as pd
import numpy as np

Pandas introduces two new data types to Python: Series and DataFrame

5.1 Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index

[1, 3, 4]

[1, 3, 4]

To create a series, usepd.Series().

s = pd.Series([4, 7, -5, 3])
s

0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through n-1 (where n is the length of the data).

s = pd.Series([4, 7, -5, 3], index = ['a', 'b', 'c', 'd'])
s

a    4
b    7
c   -5
d    3
dtype: int64

s.values

array([ 4,  7, -5,  3])

s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

5.1.1 Selecting single or a set of values using index

To select a single value, use [ ].

s['b']

To select multiple values, add a comma between each value and use double brackets.

s[['c', 'a', 'b']]

c   -5
a    4
b    7
dtype: int64

Selecting by the index is also possible.

s[2]

-5

To select multiple values in a row, use : and the first index is where it starts and the second index is where it stops. It does not include the second index.

s[1:3]

b    7
c   -5
dtype: int64

To select multiple values, add a comma between each value and use double brackets.

s[[0,3]]

a    4
d    3
dtype: int64

5.1.2 Filtering

Filtering values in a series can be done with <, >, =.

s > 0

a     True
b     True
c    False
d     True
dtype: bool

s[s > 0]

a    4
b    7
d    3
dtype: int64

5.1.3 Math operation

Math functions are able to be apply to a series.

s**2

a    16
b    49
c    25
d     9
dtype: int64

np.exp(s)

a      54.598150
b    1096.633158
c       0.006738
d      20.085537
dtype: float64

s.mean()

2.25

Series are aligned by index label in arithmetic operations.

s2 = pd.Series([1, 2, 3, 4], index = ['a', 'c', 'd', 'e'])

zero = s + s2
zero

a    5.0
b    NaN
c   -3.0
d    6.0
e    NaN
dtype: float64

Note: “NaN” stands for missing values in pandas

5.1.4 Finding NaN values

To find all the missing values in a series, use .isnull().

zero.isnull()

a    False
b     True
c    False
d    False
e     True
dtype: bool

To find all the non-missing vales, use .notnull().

zero.notnull()

a     True
b    False
c     True
d     True
e    False
dtype: bool

zero[zero.notnull()]

a    5.0
c   -3.0
d    6.0
dtype: float64

5.1.5 Replacing NaN

To change NaN to 0, use .fillna().

zero.fillna(0)

a    5.0
b    0.0
c   -3.0
d    6.0
e    0.0
dtype: float64

5.1.6 Forward-fill

Fill all the NaN values with the previous value in the series.

zero.fillna(method = 'ffill')

a    5.0
b    5.0
c   -3.0
d    6.0
e    6.0
dtype: float64

5.1.7 Back-fill

Fill all the NaN values with the next value in the series.

Note that e is Nan because there is no next value in the series.

zero.fillna(method = 'bfill')

a    5.0
b   -3.0
c   -3.0
d    6.0
e    NaN
dtype: float64

5.1.8 Drop

To drop all NaN, use .dropna().

zero.dropna()

a    5.0
c   -3.0
d    6.0
dtype: float64

Notice how that zero hasn’t change at all. If the function wants to be applied to the original series, set it to its self.

zero

a    5.0
b    NaN
c   -3.0
d    6.0
e    NaN
dtype: float64

zero = zero.dropna()
zero

a    5.0
c   -3.0
d    6.0
dtype: float64

change the index to be the same as s2 so there is no missing value

s.index = ['a', 'c', 'd', 'e']
s + s2

a    5
c    9
d   -2
e    7
dtype: int64

5.2 DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns. The DataFrame has both a row and column index.

Since each column of a DataFrame is essentially a Series with its column index, it can be thought of as a dictionary of Series all sharing the same index.
Each column (Series) has to be the same type, whereas, each row can contain mixed types.

5.2.1 Creating DataFrame

5.2.1.1 from a dict of equal-length lists

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
d = pd.DataFrame(data)
d

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9
5	Nevada	2003	3.2

5.2.1.2 from an DataFrame

Starting with an empty dataframe, series are able to be added to the Dataframe.

d1 = pd.DataFrame()

d1['state'] = ['Ohio', 'Nevada']
d1['year'] = [2001, 2001]
d1['pop'] = [1.7, 2.4]

d1

	state	year	pop
0	Ohio	2001	1.7
1	Nevada	2001	2.4

5.2.2 select columns

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9
5	Nevada	2003	3.2

d['state']

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

d[['state','pop']]

	state	pop
0	Ohio	1.5
1	Ohio	1.7
2	Ohio	3.6
3	Nevada	2.4
4	Nevada	2.9
5	Nevada	3.2

5.2.3 select rows

d2 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])
d2

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

d2.loc['Colorado': 'Utah']

	one	two	three	four
Colorado	4	5	6	7
Utah	8	9	10	11

d2.iloc[1:3]

	one	two	three	four
Colorado	4	5	6	7
Utah	8	9	10	11

5.2.4 change row index and column name

Use .rename ti rename any row or column.

d2.rename(index={'Colorado':'Connecticut'},columns={'one':'five'})

	five	two	three	four
Ohio	0	1	2	3
Connecticut	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

Notice how d2 does not change.

d2

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

You can use inplace = True to change the original Dataframe.

d2.rename(index = {'Colorado':'Connecticut'}, columns = {'one':'five'}, inplace = True)

d2

	five	two	three	four
Ohio	0	1	2	3
Connecticut	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

5.2.5 basics attributes and methods

d2.index

Index(['Ohio', 'Connecticut', 'Utah', 'New York'], dtype='object')

d2.columns

Index(['five', 'two', 'three', 'four'], dtype='object')

d2.values

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

d2.shape

(4, 4)

d2.mean()

five     6.0
two      7.0
three    8.0
four     9.0
dtype: float64

5.2.6 add and delete rows and columns

For dropping a row or column use .drop.

d2.drop(index = "Connecticut", columns = "five") # add "inplace=True" will change the original DataFrame

	two	three	four
Ohio	1	2	3
Utah	9	10	11
New York	13	14	15

d2

	five	two	three	four
Ohio	0	1	2	3
Connecticut	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

For deleting a column use del.

del d2['five']
d2

	two	three	four
Ohio	1	2	3
Connecticut	5	6	7
Utah	9	10	11
New York	13	14	15

d2['one'] = [1, 2, 3, 4]
d2

	two	three	four	one
Ohio	1	2	3	1
Connecticut	5	6	7	2
Utah	9	10	11	3
New York	13	14	15	4

.pop returns the values and removes it from the original Dataframe.

d2.pop('one')

Ohio           1
Connecticut    2
Utah           3
New York       4
Name: one, dtype: int64

d2

	two	three	four
Ohio	1	2	3
Connecticut	5	6	7
Utah	9	10	11
New York	13	14	15

5.2.7 Common method

You can import dataset as well

5.2.7.1 csv file

import pandas as pd
crashes = pd.read_csv("data/nyc_crashes_202301.csv")

sub_set_1 = crashes.iloc[0:35, 0:8]

5.2.7.2 Head and Tail

These two methods show the first and the last a few records from a DataFrame, default is 5.

sub_set_1.head()

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME
0	01/01/2023	14:38	NaN	NaN	NaN	NaN	NaN	BROOKLYN QUEENS EXPRESSWAY RAMP
1	01/01/2023	8:04	NaN	NaN	NaN	NaN	NaN	NASSAU EXPRESSWAY
2	01/01/2023	18:05	NaN	NaN	NaN	NaN	NaN	10 AVENUE
3	01/01/2023	23:45	NaN	NaN	40.769737	-73.91244	(40.769737, -73.91244)	ASTORIA BOULEVARD
4	01/01/2023	4:50	NaN	NaN	40.830555	-73.85072	(40.830555, -73.85072)	CASTLE HILL AVENUE

sub_set_1.tail()

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME
30	01/01/2023	17:05	NaN	NaN	40.795250	-73.97321	(40.79525, -73.97321)	WEST 96 STREET
31	01/01/2023	5:45	QUEENS	11411.0	40.690640	-73.72832	(40.69064, -73.72832)	NaN
32	01/01/2023	0:55	NaN	NaN	40.818470	-73.94140	(40.81847, -73.9414)	WEST 140 STREET
33	01/01/2023	10:16	QUEENS	11369.0	40.764633	-73.86537	(40.764633, -73.86537)	NaN
34	01/01/2023	3:45	QUEENS	11694.0	40.580510	-73.84773	(40.58051, -73.84773)	NaN

sub_set_1.head(3)

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME
0	01/01/2023	14:38	NaN	NaN	NaN	NaN	NaN	BROOKLYN QUEENS EXPRESSWAY RAMP
1	01/01/2023	8:04	NaN	NaN	NaN	NaN	NaN	NASSAU EXPRESSWAY
2	01/01/2023	18:05	NaN	NaN	NaN	NaN	NaN	10 AVENUE

5.2.8 unique and nunique

To show only unique values, use .unique.

sub_set_1['BOROUGH'].unique()

array([nan, 'BROOKLYN', 'QUEENS', 'MANHATTAN', 'STATEN ISLAND', 'BRONX'],
      dtype=object)

To get the number of unique values, use .nunique.

sub_set_1['BOROUGH'].nunique()

5.2.9 count and value_counts

To count the non missing values, use .count.

sub_set_1['BOROUGH'].count()

To count the number in each categroy, use .value_counts.

sub_set_1['BOROUGH'].value_counts()

QUEENS           12
BROOKLYN          6
MANHATTAN         1
STATEN ISLAND     1
BRONX             1
Name: BOROUGH, dtype: int64

5.2.10 describe and info

sub_set_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   CRASH DATE      35 non-null     object 
 1   CRASH TIME      35 non-null     object 
 2   BOROUGH         21 non-null     object 
 3   ZIP CODE        21 non-null     float64
 4   LATITUDE        30 non-null     float64
 5   LONGITUDE       30 non-null     float64
 6   LOCATION        30 non-null     object 
 7   ON STREET NAME  27 non-null     object 
dtypes: float64(3), object(5)
memory usage: 2.3+ KB

Summary statistics for numeric type columns.

Use .describe to get an quick summary of the data.

sub_set_1.describe()

	ZIP CODE	LATITUDE	LONGITUDE
count	21.000000	30.000000	30.000000
mean	11188.095238	40.728138	-73.885913
std	416.736476	0.086772	0.076496
min	10026.000000	40.580510	-74.093414
25%	11207.000000	40.664881	-73.937413
50%	11354.000000	40.747835	-73.886998
75%	11369.000000	40.793048	-73.848292
max	11694.000000	40.861810	-73.728320

sub_set_1.describe(percentiles=[x/10 for x in list(range(1, 10, 1))])

	ZIP CODE	LATITUDE	LONGITUDE
count	21.000000	30.000000	30.000000
mean	11188.095238	40.728138	-73.885913
std	416.736476	0.086772	0.076496
min	10026.000000	40.580510	-74.093414
10%	10468.000000	40.598224	-73.959692
20%	11207.000000	40.649558	-73.943165
30%	11223.000000	40.673481	-73.925244
40%	11233.000000	40.708116	-73.906082
50%	11354.000000	40.747835	-73.886998
60%	11357.000000	40.760311	-73.871886
70%	11369.000000	40.776043	-73.850365
80%	11372.000000	40.808170	-73.828438
90%	11411.000000	40.833641	-73.805664
max	11694.000000	40.861810	-73.728320

choose a specific column to get a summary for.

sub_set_1['BOROUGH'].describe()

count         21
unique         5
top       QUEENS
freq          12
Name: BOROUGH, dtype: object

5.2.11 idxmax and nlargest

.idxmax() returns the index of the largest value.

sub_set_1['ZIP CODE'].idxmax()

.idxmin() returns the index of the smallest value

sub_set_1['ZIP CODE'].idxmin()

.nlargest returns the largest values with their index (default is 5).

sub_set_1['ZIP CODE'].nlargest()

34    11694.0
11    11691.0
31    11411.0
13    11375.0
26    11372.0
Name: ZIP CODE, dtype: float64

.nsmallest returns the smallest 3 values with their index (default is 5).

sub_set_1['ZIP CODE'].nsmallest()

7     10026.0
17    10305.0
28    10468.0
6     11101.0
12    11207.0
Name: ZIP CODE, dtype: float64

5.2.12 sort

use .sort_values to sort values

sub_set_1.sort_values(by = 'BOROUGH')

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME
28	01/01/2023	7:40	BRONX	10468.0	40.861810	-73.912320	(40.86181, -73.91232)	NaN
5	01/01/2023	0:35	BROOKLYN	11229.0	40.601310	-73.954720	(40.60131, -73.95472)	NaN
27	01/01/2023	5:36	BROOKLYN	11207.0	40.663136	-73.883250	(40.663136, -73.88325)	NaN
22	01/01/2023	4:20	BROOKLYN	11233.0	40.670116	-73.922480	(40.670116, -73.92248)	SAINT JOHNS PLACE
15	01/01/2023	3:30	BROOKLYN	11236.0	40.636720	-73.887695	(40.63672, -73.887695)	NaN
19	01/01/2023	1:10	BROOKLYN	11223.0	40.593760	-73.982740	(40.59376, -73.98274)	AVENUE V
12	01/01/2023	1:45	BROOKLYN	11207.0	40.652767	-73.886300	(40.652767, -73.8863)	PENNSYLVANIA AVENUE
7	01/01/2023	1:00	MANHATTAN	10026.0	40.805595	-73.958190	(40.805595, -73.95819)	WEST 116 STREET
14	01/01/2023	17:35	QUEENS	11354.0	40.771587	-73.810070	(40.771587, -73.81007)	BAYSIDE AVENUE
16	01/01/2023	10:30	QUEENS	11355.0	40.751800	-73.817314	(40.7518, -73.817314)	HOLLY AVENUE
33	01/01/2023	10:16	QUEENS	11369.0	40.764633	-73.865370	(40.764633, -73.86537)	NaN
21	01/01/2023	20:25	QUEENS	11357.0	40.786440	-73.829155	(40.78644, -73.829155)	140 STREET
11	01/01/2023	23:06	QUEENS	11691.0	40.598720	-73.766010	(40.59872, -73.76601)	BEACH 32 STREET
23	01/01/2023	6:45	QUEENS	11368.0	40.739280	-73.850530	(40.73928, -73.85053)	SAULTELL AVENUE
24	01/01/2023	23:15	QUEENS	11369.0	40.757430	-73.876230	(40.75743, -73.87623)	NaN
26	01/01/2023	4:00	QUEENS	11372.0	40.751003	-73.892130	(40.751003, -73.89213)	74 STREET
6	01/01/2023	5:45	QUEENS	11101.0	40.744667	-73.931694	(40.744667, -73.931694)	QUEENS BOULEVARD
31	01/01/2023	5:45	QUEENS	11411.0	40.690640	-73.728320	(40.69064, -73.72832)	NaN
13	01/01/2023	17:55	QUEENS	11375.0	40.710320	-73.849980	(40.71032, -73.84998)	METROPOLITAN AVENUE
34	01/01/2023	3:45	QUEENS	11694.0	40.580510	-73.847730	(40.58051, -73.84773)	NaN
17	01/01/2023	6:40	STATEN ISLAND	10305.0	40.585240	-74.093414	(40.58524, -74.093414)	HYLAN BOULEVARD
0	01/01/2023	14:38	NaN	NaN	NaN	NaN	NaN	BROOKLYN QUEENS EXPRESSWAY RAMP
1	01/01/2023	8:04	NaN	NaN	NaN	NaN	NaN	NASSAU EXPRESSWAY
2	01/01/2023	18:05	NaN	NaN	NaN	NaN	NaN	10 AVENUE
3	01/01/2023	23:45	NaN	NaN	40.769737	-73.912440	(40.769737, -73.91244)	ASTORIA BOULEVARD
4	01/01/2023	4:50	NaN	NaN	40.830555	-73.850720	(40.830555, -73.85072)	CASTLE HILL AVENUE
8	01/01/2023	19:53	NaN	NaN	NaN	NaN	NaN	EAST 222 STREET
9	01/01/2023	3:11	NaN	NaN	40.842110	-73.825570	(40.84211, -73.82557)	BRUCKNER EXPRESSWAY
10	01/01/2023	18:49	NaN	NaN	40.674923	-73.736940	(40.674923, -73.73694)	MERRICK BOULEVARD
18	01/01/2023	5:30	NaN	NaN	40.704810	-73.939320	(40.70481, -73.93932)	SEIGEL STREET
20	01/01/2023	0:04	NaN	NaN	40.858090	-73.901924	(40.85809, -73.901924)	EAST 183 STREET
25	01/01/2023	3:35	NaN	NaN	NaN	NaN	NaN	GRAND CENTRAL PARKWAY
29	01/01/2023	10:50	NaN	NaN	40.832700	-73.950226	(40.8327, -73.950226)	HENRY HUDSON PARKWAY
30	01/01/2023	17:05	NaN	NaN	40.795250	-73.973210	(40.79525, -73.97321)	WEST 96 STREET
32	01/01/2023	0:55	NaN	NaN	40.818470	-73.941400	(40.81847, -73.9414)	WEST 140 STREET

sub_set_1.sort_values(by = ['CRASH DATE', 'ZIP CODE'], ascending = True)

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME
7	01/01/2023	1:00	MANHATTAN	10026.0	40.805595	-73.958190	(40.805595, -73.95819)	WEST 116 STREET
17	01/01/2023	6:40	STATEN ISLAND	10305.0	40.585240	-74.093414	(40.58524, -74.093414)	HYLAN BOULEVARD
28	01/01/2023	7:40	BRONX	10468.0	40.861810	-73.912320	(40.86181, -73.91232)	NaN
6	01/01/2023	5:45	QUEENS	11101.0	40.744667	-73.931694	(40.744667, -73.931694)	QUEENS BOULEVARD
12	01/01/2023	1:45	BROOKLYN	11207.0	40.652767	-73.886300	(40.652767, -73.8863)	PENNSYLVANIA AVENUE
27	01/01/2023	5:36	BROOKLYN	11207.0	40.663136	-73.883250	(40.663136, -73.88325)	NaN
19	01/01/2023	1:10	BROOKLYN	11223.0	40.593760	-73.982740	(40.59376, -73.98274)	AVENUE V
5	01/01/2023	0:35	BROOKLYN	11229.0	40.601310	-73.954720	(40.60131, -73.95472)	NaN
22	01/01/2023	4:20	BROOKLYN	11233.0	40.670116	-73.922480	(40.670116, -73.92248)	SAINT JOHNS PLACE
15	01/01/2023	3:30	BROOKLYN	11236.0	40.636720	-73.887695	(40.63672, -73.887695)	NaN
14	01/01/2023	17:35	QUEENS	11354.0	40.771587	-73.810070	(40.771587, -73.81007)	BAYSIDE AVENUE
16	01/01/2023	10:30	QUEENS	11355.0	40.751800	-73.817314	(40.7518, -73.817314)	HOLLY AVENUE
21	01/01/2023	20:25	QUEENS	11357.0	40.786440	-73.829155	(40.78644, -73.829155)	140 STREET
23	01/01/2023	6:45	QUEENS	11368.0	40.739280	-73.850530	(40.73928, -73.85053)	SAULTELL AVENUE
24	01/01/2023	23:15	QUEENS	11369.0	40.757430	-73.876230	(40.75743, -73.87623)	NaN
33	01/01/2023	10:16	QUEENS	11369.0	40.764633	-73.865370	(40.764633, -73.86537)	NaN
26	01/01/2023	4:00	QUEENS	11372.0	40.751003	-73.892130	(40.751003, -73.89213)	74 STREET
13	01/01/2023	17:55	QUEENS	11375.0	40.710320	-73.849980	(40.71032, -73.84998)	METROPOLITAN AVENUE
31	01/01/2023	5:45	QUEENS	11411.0	40.690640	-73.728320	(40.69064, -73.72832)	NaN
11	01/01/2023	23:06	QUEENS	11691.0	40.598720	-73.766010	(40.59872, -73.76601)	BEACH 32 STREET
34	01/01/2023	3:45	QUEENS	11694.0	40.580510	-73.847730	(40.58051, -73.84773)	NaN
0	01/01/2023	14:38	NaN	NaN	NaN	NaN	NaN	BROOKLYN QUEENS EXPRESSWAY RAMP
1	01/01/2023	8:04	NaN	NaN	NaN	NaN	NaN	NASSAU EXPRESSWAY
2	01/01/2023	18:05	NaN	NaN	NaN	NaN	NaN	10 AVENUE
3	01/01/2023	23:45	NaN	NaN	40.769737	-73.912440	(40.769737, -73.91244)	ASTORIA BOULEVARD
4	01/01/2023	4:50	NaN	NaN	40.830555	-73.850720	(40.830555, -73.85072)	CASTLE HILL AVENUE
8	01/01/2023	19:53	NaN	NaN	NaN	NaN	NaN	EAST 222 STREET
9	01/01/2023	3:11	NaN	NaN	40.842110	-73.825570	(40.84211, -73.82557)	BRUCKNER EXPRESSWAY
10	01/01/2023	18:49	NaN	NaN	40.674923	-73.736940	(40.674923, -73.73694)	MERRICK BOULEVARD
18	01/01/2023	5:30	NaN	NaN	40.704810	-73.939320	(40.70481, -73.93932)	SEIGEL STREET
20	01/01/2023	0:04	NaN	NaN	40.858090	-73.901924	(40.85809, -73.901924)	EAST 183 STREET
25	01/01/2023	3:35	NaN	NaN	NaN	NaN	NaN	GRAND CENTRAL PARKWAY
29	01/01/2023	10:50	NaN	NaN	40.832700	-73.950226	(40.8327, -73.950226)	HENRY HUDSON PARKWAY
30	01/01/2023	17:05	NaN	NaN	40.795250	-73.973210	(40.79525, -73.97321)	WEST 96 STREET
32	01/01/2023	0:55	NaN	NaN	40.818470	-73.941400	(40.81847, -73.9414)	WEST 140 STREET

5.2.13 [] method

[] method can be used to select column(s) by passing column name(s).

sub_set_1['ZIP CODE'].head()

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
Name: ZIP CODE, dtype: float64

sub_set_1[['BOROUGH', 'ZIP CODE', 'LOCATION']].head()

	BOROUGH	ZIP CODE	LOCATION
0	NaN	NaN	NaN
1	NaN	NaN	NaN
2	NaN	NaN	NaN
3	NaN	NaN	(40.769737, -73.91244)
4	NaN	NaN	(40.830555, -73.85072)

5.2.14 `loc` method

loc can be used to index row(s) and column(s) by providing the row and column labels.

df.loc[row_label(s)] Selects single row or subset of rows from the DataFrame by label.

Index single row

sub_set_1.loc[7]

CRASH DATE                    01/01/2023
CRASH TIME                          1:00
BOROUGH                        MANHATTAN
ZIP CODE                         10026.0
LATITUDE                       40.805595
LONGITUDE                      -73.95819
LOCATION          (40.805595, -73.95819)
ON STREET NAME           WEST 116 STREET
Name: 7, dtype: object

Index multiple rows

sub_set_1.loc[:8]

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME
0	01/01/2023	14:38	NaN	NaN	NaN	NaN	NaN	BROOKLYN QUEENS EXPRESSWAY RAMP
1	01/01/2023	8:04	NaN	NaN	NaN	NaN	NaN	NASSAU EXPRESSWAY
2	01/01/2023	18:05	NaN	NaN	NaN	NaN	NaN	10 AVENUE
3	01/01/2023	23:45	NaN	NaN	40.769737	-73.912440	(40.769737, -73.91244)	ASTORIA BOULEVARD
4	01/01/2023	4:50	NaN	NaN	40.830555	-73.850720	(40.830555, -73.85072)	CASTLE HILL AVENUE
5	01/01/2023	0:35	BROOKLYN	11229.0	40.601310	-73.954720	(40.60131, -73.95472)	NaN
6	01/01/2023	5:45	QUEENS	11101.0	40.744667	-73.931694	(40.744667, -73.931694)	QUEENS BOULEVARD
7	01/01/2023	1:00	MANHATTAN	10026.0	40.805595	-73.958190	(40.805595, -73.95819)	WEST 116 STREET
8	01/01/2023	19:53	NaN	NaN	NaN	NaN	NaN	EAST 222 STREET

sub_set_1.loc[[0, 7, 4, 6]]

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME
0	01/01/2023	14:38	NaN	NaN	NaN	NaN	NaN	BROOKLYN QUEENS EXPRESSWAY RAMP
7	01/01/2023	1:00	MANHATTAN	10026.0	40.805595	-73.958190	(40.805595, -73.95819)	WEST 116 STREET
4	01/01/2023	4:50	NaN	NaN	40.830555	-73.850720	(40.830555, -73.85072)	CASTLE HILL AVENUE
6	01/01/2023	5:45	QUEENS	11101.0	40.744667	-73.931694	(40.744667, -73.931694)	QUEENS BOULEVARD

df.loc[:, col_labels] Selects single column or subset of columns by label.

sub_set_1.loc[:, 'LOCATION'].head()

0                       NaN
1                       NaN
2                       NaN
3    (40.769737, -73.91244)
4    (40.830555, -73.85072)
Name: LOCATION, dtype: object

sub_set_1.loc[:, 'LATITUDE': 'LOCATION'].head()

	LATITUDE	LONGITUDE	LOCATION
0	NaN	NaN	NaN
1	NaN	NaN	NaN
2	NaN	NaN	NaN
3	40.769737	-73.91244	(40.769737, -73.91244)
4	40.830555	-73.85072	(40.830555, -73.85072)

sub_set_1.loc[:, ['BOROUGH', 'ZIP CODE', 'LOCATION']].head()

	BOROUGH	ZIP CODE	LOCATION
0	NaN	NaN	NaN
1	NaN	NaN	NaN
2	NaN	NaN	NaN
3	NaN	NaN	(40.769737, -73.91244)
4	NaN	NaN	(40.830555, -73.85072)

df.loc[row_label(s), col_label(s)] Select both rows and columns by label.

sub_set_1.loc[7, 'BOROUGH']

'MANHATTAN'

sub_set_1.loc[:8, ['BOROUGH', 'LOCATION']]

	BOROUGH	LOCATION
0	NaN	NaN
1	NaN	NaN
2	NaN	NaN
3	NaN	(40.769737, -73.91244)
4	NaN	(40.830555, -73.85072)
5	BROOKLYN	(40.60131, -73.95472)
6	QUEENS	(40.744667, -73.931694)
7	MANHATTAN	(40.805595, -73.95819)
8	NaN	NaN

Index by Boolean Series

sub_set_1['BOROUGH'].isin(['MANHATTAN','QUEENS']).head()

0    False
1    False
2    False
3    False
4    False
Name: BOROUGH, dtype: bool

sub_set_1.loc[sub_set_1['BOROUGH'].isin(['MANHATTAN','QUEENS'])].head()

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME
6	01/01/2023	5:45	QUEENS	11101.0	40.744667	-73.931694	(40.744667, -73.931694)	QUEENS BOULEVARD
7	01/01/2023	1:00	MANHATTAN	10026.0	40.805595	-73.958190	(40.805595, -73.95819)	WEST 116 STREET
11	01/01/2023	23:06	QUEENS	11691.0	40.598720	-73.766010	(40.59872, -73.76601)	BEACH 32 STREET
13	01/01/2023	17:55	QUEENS	11375.0	40.710320	-73.849980	(40.71032, -73.84998)	METROPOLITAN AVENUE
14	01/01/2023	17:35	QUEENS	11354.0	40.771587	-73.810070	(40.771587, -73.81007)	BAYSIDE AVENUE

Use “&” (and), “|” (or) “~” (not) for Pandas

sub_set_1.loc[(sub_set_1["BOROUGH"] == "MANHATTAN") & (sub_set_1["ZIP CODE"] 
>= 1000)]

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME
7	01/01/2023	1:00	MANHATTAN	10026.0	40.805595	-73.95819	(40.805595, -73.95819)	WEST 116 STREET

5.2.15 `iloc` method

iloc can be used to index row(s) and column(s) by providing the row and column integer(s).

df.iloc[row_integer(s)] Selects single row or subset of rows from the DataFrame by integer position

Note: same as indexing for sequence (but different with loc, it is 0 basis and the selection is close to the left and open to the right (the last item is excluded).

sub_set_1.iloc[3]

CRASH DATE                    01/01/2023
CRASH TIME                         23:45
BOROUGH                              NaN
ZIP CODE                             NaN
LATITUDE                       40.769737
LONGITUDE                      -73.91244
LOCATION          (40.769737, -73.91244)
ON STREET NAME         ASTORIA BOULEVARD
Name: 3, dtype: object

sub_set_1.iloc[:8]

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME
0	01/01/2023	14:38	NaN	NaN	NaN	NaN	NaN	BROOKLYN QUEENS EXPRESSWAY RAMP
1	01/01/2023	8:04	NaN	NaN	NaN	NaN	NaN	NASSAU EXPRESSWAY
2	01/01/2023	18:05	NaN	NaN	NaN	NaN	NaN	10 AVENUE
3	01/01/2023	23:45	NaN	NaN	40.769737	-73.912440	(40.769737, -73.91244)	ASTORIA BOULEVARD
4	01/01/2023	4:50	NaN	NaN	40.830555	-73.850720	(40.830555, -73.85072)	CASTLE HILL AVENUE
5	01/01/2023	0:35	BROOKLYN	11229.0	40.601310	-73.954720	(40.60131, -73.95472)	NaN
6	01/01/2023	5:45	QUEENS	11101.0	40.744667	-73.931694	(40.744667, -73.931694)	QUEENS BOULEVARD
7	01/01/2023	1:00	MANHATTAN	10026.0	40.805595	-73.958190	(40.805595, -73.95819)	WEST 116 STREET

sub_set_1.iloc[:, 1:3].head()

	CRASH TIME	BOROUGH
0	14:38	NaN
1	8:04	NaN
2	18:05	NaN
3	23:45	NaN
4	4:50	NaN

df.iloc[row_integer(s), col_integer(s)] Selects row and columns from the DataFrame by integer positions.

sub_set_1.iloc[0:5, :6]

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE
0	01/01/2023	14:38	NaN	NaN	NaN	NaN
1	01/01/2023	8:04	NaN	NaN	NaN	NaN
2	01/01/2023	18:05	NaN	NaN	NaN	NaN
3	01/01/2023	23:45	NaN	NaN	40.769737	-73.91244
4	01/01/2023	4:50	NaN	NaN	40.830555	-73.85072

5.2.16 `concat` method

pd.concat([df1, df2], axis = 0) can be used to combine two dataframe either row-wise or column-wise depends on value of axis:

0 (default, row-wise)
1 (column-wise)

sub_set_1

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME
0	01/01/2023	14:38	NaN	NaN	NaN	NaN	NaN	BROOKLYN QUEENS EXPRESSWAY RAMP
1	01/01/2023	8:04	NaN	NaN	NaN	NaN	NaN	NASSAU EXPRESSWAY
2	01/01/2023	18:05	NaN	NaN	NaN	NaN	NaN	10 AVENUE
3	01/01/2023	23:45	NaN	NaN	40.769737	-73.912440	(40.769737, -73.91244)	ASTORIA BOULEVARD
4	01/01/2023	4:50	NaN	NaN	40.830555	-73.850720	(40.830555, -73.85072)	CASTLE HILL AVENUE
5	01/01/2023	0:35	BROOKLYN	11229.0	40.601310	-73.954720	(40.60131, -73.95472)	NaN
6	01/01/2023	5:45	QUEENS	11101.0	40.744667	-73.931694	(40.744667, -73.931694)	QUEENS BOULEVARD
7	01/01/2023	1:00	MANHATTAN	10026.0	40.805595	-73.958190	(40.805595, -73.95819)	WEST 116 STREET
8	01/01/2023	19:53	NaN	NaN	NaN	NaN	NaN	EAST 222 STREET
9	01/01/2023	3:11	NaN	NaN	40.842110	-73.825570	(40.84211, -73.82557)	BRUCKNER EXPRESSWAY
10	01/01/2023	18:49	NaN	NaN	40.674923	-73.736940	(40.674923, -73.73694)	MERRICK BOULEVARD
11	01/01/2023	23:06	QUEENS	11691.0	40.598720	-73.766010	(40.59872, -73.76601)	BEACH 32 STREET
12	01/01/2023	1:45	BROOKLYN	11207.0	40.652767	-73.886300	(40.652767, -73.8863)	PENNSYLVANIA AVENUE
13	01/01/2023	17:55	QUEENS	11375.0	40.710320	-73.849980	(40.71032, -73.84998)	METROPOLITAN AVENUE
14	01/01/2023	17:35	QUEENS	11354.0	40.771587	-73.810070	(40.771587, -73.81007)	BAYSIDE AVENUE
15	01/01/2023	3:30	BROOKLYN	11236.0	40.636720	-73.887695	(40.63672, -73.887695)	NaN
16	01/01/2023	10:30	QUEENS	11355.0	40.751800	-73.817314	(40.7518, -73.817314)	HOLLY AVENUE
17	01/01/2023	6:40	STATEN ISLAND	10305.0	40.585240	-74.093414	(40.58524, -74.093414)	HYLAN BOULEVARD
18	01/01/2023	5:30	NaN	NaN	40.704810	-73.939320	(40.70481, -73.93932)	SEIGEL STREET
19	01/01/2023	1:10	BROOKLYN	11223.0	40.593760	-73.982740	(40.59376, -73.98274)	AVENUE V
20	01/01/2023	0:04	NaN	NaN	40.858090	-73.901924	(40.85809, -73.901924)	EAST 183 STREET
21	01/01/2023	20:25	QUEENS	11357.0	40.786440	-73.829155	(40.78644, -73.829155)	140 STREET
22	01/01/2023	4:20	BROOKLYN	11233.0	40.670116	-73.922480	(40.670116, -73.92248)	SAINT JOHNS PLACE
23	01/01/2023	6:45	QUEENS	11368.0	40.739280	-73.850530	(40.73928, -73.85053)	SAULTELL AVENUE
24	01/01/2023	23:15	QUEENS	11369.0	40.757430	-73.876230	(40.75743, -73.87623)	NaN
25	01/01/2023	3:35	NaN	NaN	NaN	NaN	NaN	GRAND CENTRAL PARKWAY
26	01/01/2023	4:00	QUEENS	11372.0	40.751003	-73.892130	(40.751003, -73.89213)	74 STREET
27	01/01/2023	5:36	BROOKLYN	11207.0	40.663136	-73.883250	(40.663136, -73.88325)	NaN
28	01/01/2023	7:40	BRONX	10468.0	40.861810	-73.912320	(40.86181, -73.91232)	NaN
29	01/01/2023	10:50	NaN	NaN	40.832700	-73.950226	(40.8327, -73.950226)	HENRY HUDSON PARKWAY
30	01/01/2023	17:05	NaN	NaN	40.795250	-73.973210	(40.79525, -73.97321)	WEST 96 STREET
31	01/01/2023	5:45	QUEENS	11411.0	40.690640	-73.728320	(40.69064, -73.72832)	NaN
32	01/01/2023	0:55	NaN	NaN	40.818470	-73.941400	(40.81847, -73.9414)	WEST 140 STREET
33	01/01/2023	10:16	QUEENS	11369.0	40.764633	-73.865370	(40.764633, -73.86537)	NaN
34	01/01/2023	3:45	QUEENS	11694.0	40.580510	-73.847730	(40.58051, -73.84773)	NaN

sub_set_2 = crashes.iloc[35:60, 0:8]
sub_set_2

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME
35	01/01/2023	1:20	QUEENS	11377.0	40.748720	-73.896250	(40.74872, -73.89625)	BROOKLYN QUEENS EXPRESSWAY
36	01/01/2023	18:26	BROOKLYN	11208.0	40.678246	-73.870186	(40.678246, -73.870186)	NaN
37	01/01/2023	0:00	NaN	NaN	NaN	NaN	NaN	WILLIAMSBURG BRIDGE OUTER ROADWA
38	01/01/2023	22:35	NaN	NaN	40.755780	-74.001990	(40.75578, -74.00199)	WEST 34 STREET
39	01/01/2023	12:00	BROOKLYN	11211.0	40.706690	-73.958840	(40.70669, -73.95884)	NaN
40	01/01/2023	8:45	NaN	NaN	40.828114	-73.931070	(40.828114, -73.93107)	MAJOR DEEGAN EXPRESSWAY
41	01/01/2023	1:20	MANHATTAN	10022.0	40.758980	-73.962440	(40.75898, -73.96244)	1 AVENUE
42	01/01/2023	16:45	QUEENS	11420.0	40.679070	-73.831540	(40.67907, -73.83154)	NaN
43	01/01/2023	0:10	BROOKLYN	11236.0	40.638350	-73.908890	(40.63835, -73.90889)	FLATLANDS AVENUE
44	01/01/2023	14:53	BROOKLYN	11208.0	40.660797	-73.871830	(40.660797, -73.87183)	ATKINS AVENUE
45	01/01/2023	13:10	NaN	NaN	40.831290	-73.929040	(40.83129, -73.92904)	WOODYCREST AVENUE
46	01/01/2023	14:16	NaN	NaN	NaN	NaN	NaN	VERRAZANO BRIDGE UPPER
47	01/01/2023	15:12	NaN	NaN	40.712780	-74.011690	(40.71278, -74.01169)	GREENWICH STREET
48	01/01/2023	0:00	NaN	NaN	40.688370	-73.944916	(40.68837, -73.944916)	LEXINGTON AVENUE
49	01/01/2023	3:35	NaN	NaN	40.825935	-73.859130	(40.825935, -73.85913)	BRUCKNER EXPRESSWAY
50	01/01/2023	9:45	NaN	NaN	40.882645	-73.886566	(40.882645, -73.886566)	SEDGWICK AVENUE
51	01/01/2023	12:20	NaN	NaN	NaN	NaN	NaN	102 CROSS DRIVE
52	01/01/2023	12:00	QUEENS	11354.0	40.764650	-73.823494	(40.76465, -73.823494)	NORTHERN BOULEVARD
53	01/01/2023	1:17	QUEENS	11375.0	40.724308	-73.842575	(40.724308, -73.842575)	110 STREET
54	01/01/2023	7:15	MANHATTAN	10025.0	40.795250	-73.973210	(40.79525, -73.97321)	WEST END AVENUE
55	01/01/2023	1:30	BRONX	10457.0	40.836480	-73.897736	(40.83648, -73.897736)	CLAREMONT PARKWAY
56	01/01/2023	8:08	NaN	NaN	40.686085	-73.982666	(40.686085, -73.982666)	ATLANTIC AVENUE
57	01/01/2023	6:08	QUEENS	11420.0	40.677242	-73.816720	(40.677242, -73.81672)	NaN
58	01/01/2023	8:20	BRONX	10451.0	NaN	NaN	NaN	EAST 138 STREET
59	01/01/2023	10:11	QUEENS	11358.0	40.760440	-73.804924	(40.76044, -73.804924)	161 STREET

combining by rows

sub_set_3 = pd.concat([sub_set_1, sub_set_2])
sub_set_3

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME
0	01/01/2023	14:38	NaN	NaN	NaN	NaN	NaN	BROOKLYN QUEENS EXPRESSWAY RAMP
1	01/01/2023	8:04	NaN	NaN	NaN	NaN	NaN	NASSAU EXPRESSWAY
2	01/01/2023	18:05	NaN	NaN	NaN	NaN	NaN	10 AVENUE
3	01/01/2023	23:45	NaN	NaN	40.769737	-73.912440	(40.769737, -73.91244)	ASTORIA BOULEVARD
4	01/01/2023	4:50	NaN	NaN	40.830555	-73.850720	(40.830555, -73.85072)	CASTLE HILL AVENUE
5	01/01/2023	0:35	BROOKLYN	11229.0	40.601310	-73.954720	(40.60131, -73.95472)	NaN
6	01/01/2023	5:45	QUEENS	11101.0	40.744667	-73.931694	(40.744667, -73.931694)	QUEENS BOULEVARD
7	01/01/2023	1:00	MANHATTAN	10026.0	40.805595	-73.958190	(40.805595, -73.95819)	WEST 116 STREET
8	01/01/2023	19:53	NaN	NaN	NaN	NaN	NaN	EAST 222 STREET
9	01/01/2023	3:11	NaN	NaN	40.842110	-73.825570	(40.84211, -73.82557)	BRUCKNER EXPRESSWAY
10	01/01/2023	18:49	NaN	NaN	40.674923	-73.736940	(40.674923, -73.73694)	MERRICK BOULEVARD
11	01/01/2023	23:06	QUEENS	11691.0	40.598720	-73.766010	(40.59872, -73.76601)	BEACH 32 STREET
12	01/01/2023	1:45	BROOKLYN	11207.0	40.652767	-73.886300	(40.652767, -73.8863)	PENNSYLVANIA AVENUE
13	01/01/2023	17:55	QUEENS	11375.0	40.710320	-73.849980	(40.71032, -73.84998)	METROPOLITAN AVENUE
14	01/01/2023	17:35	QUEENS	11354.0	40.771587	-73.810070	(40.771587, -73.81007)	BAYSIDE AVENUE
15	01/01/2023	3:30	BROOKLYN	11236.0	40.636720	-73.887695	(40.63672, -73.887695)	NaN
16	01/01/2023	10:30	QUEENS	11355.0	40.751800	-73.817314	(40.7518, -73.817314)	HOLLY AVENUE
17	01/01/2023	6:40	STATEN ISLAND	10305.0	40.585240	-74.093414	(40.58524, -74.093414)	HYLAN BOULEVARD
18	01/01/2023	5:30	NaN	NaN	40.704810	-73.939320	(40.70481, -73.93932)	SEIGEL STREET
19	01/01/2023	1:10	BROOKLYN	11223.0	40.593760	-73.982740	(40.59376, -73.98274)	AVENUE V
20	01/01/2023	0:04	NaN	NaN	40.858090	-73.901924	(40.85809, -73.901924)	EAST 183 STREET
21	01/01/2023	20:25	QUEENS	11357.0	40.786440	-73.829155	(40.78644, -73.829155)	140 STREET
22	01/01/2023	4:20	BROOKLYN	11233.0	40.670116	-73.922480	(40.670116, -73.92248)	SAINT JOHNS PLACE
23	01/01/2023	6:45	QUEENS	11368.0	40.739280	-73.850530	(40.73928, -73.85053)	SAULTELL AVENUE
24	01/01/2023	23:15	QUEENS	11369.0	40.757430	-73.876230	(40.75743, -73.87623)	NaN
25	01/01/2023	3:35	NaN	NaN	NaN	NaN	NaN	GRAND CENTRAL PARKWAY
26	01/01/2023	4:00	QUEENS	11372.0	40.751003	-73.892130	(40.751003, -73.89213)	74 STREET
27	01/01/2023	5:36	BROOKLYN	11207.0	40.663136	-73.883250	(40.663136, -73.88325)	NaN
28	01/01/2023	7:40	BRONX	10468.0	40.861810	-73.912320	(40.86181, -73.91232)	NaN
29	01/01/2023	10:50	NaN	NaN	40.832700	-73.950226	(40.8327, -73.950226)	HENRY HUDSON PARKWAY
30	01/01/2023	17:05	NaN	NaN	40.795250	-73.973210	(40.79525, -73.97321)	WEST 96 STREET
31	01/01/2023	5:45	QUEENS	11411.0	40.690640	-73.728320	(40.69064, -73.72832)	NaN
32	01/01/2023	0:55	NaN	NaN	40.818470	-73.941400	(40.81847, -73.9414)	WEST 140 STREET
33	01/01/2023	10:16	QUEENS	11369.0	40.764633	-73.865370	(40.764633, -73.86537)	NaN
34	01/01/2023	3:45	QUEENS	11694.0	40.580510	-73.847730	(40.58051, -73.84773)	NaN
35	01/01/2023	1:20	QUEENS	11377.0	40.748720	-73.896250	(40.74872, -73.89625)	BROOKLYN QUEENS EXPRESSWAY
36	01/01/2023	18:26	BROOKLYN	11208.0	40.678246	-73.870186	(40.678246, -73.870186)	NaN
37	01/01/2023	0:00	NaN	NaN	NaN	NaN	NaN	WILLIAMSBURG BRIDGE OUTER ROADWA
38	01/01/2023	22:35	NaN	NaN	40.755780	-74.001990	(40.75578, -74.00199)	WEST 34 STREET
39	01/01/2023	12:00	BROOKLYN	11211.0	40.706690	-73.958840	(40.70669, -73.95884)	NaN
40	01/01/2023	8:45	NaN	NaN	40.828114	-73.931070	(40.828114, -73.93107)	MAJOR DEEGAN EXPRESSWAY
41	01/01/2023	1:20	MANHATTAN	10022.0	40.758980	-73.962440	(40.75898, -73.96244)	1 AVENUE
42	01/01/2023	16:45	QUEENS	11420.0	40.679070	-73.831540	(40.67907, -73.83154)	NaN
43	01/01/2023	0:10	BROOKLYN	11236.0	40.638350	-73.908890	(40.63835, -73.90889)	FLATLANDS AVENUE
44	01/01/2023	14:53	BROOKLYN	11208.0	40.660797	-73.871830	(40.660797, -73.87183)	ATKINS AVENUE
45	01/01/2023	13:10	NaN	NaN	40.831290	-73.929040	(40.83129, -73.92904)	WOODYCREST AVENUE
46	01/01/2023	14:16	NaN	NaN	NaN	NaN	NaN	VERRAZANO BRIDGE UPPER
47	01/01/2023	15:12	NaN	NaN	40.712780	-74.011690	(40.71278, -74.01169)	GREENWICH STREET
48	01/01/2023	0:00	NaN	NaN	40.688370	-73.944916	(40.68837, -73.944916)	LEXINGTON AVENUE
49	01/01/2023	3:35	NaN	NaN	40.825935	-73.859130	(40.825935, -73.85913)	BRUCKNER EXPRESSWAY
50	01/01/2023	9:45	NaN	NaN	40.882645	-73.886566	(40.882645, -73.886566)	SEDGWICK AVENUE
51	01/01/2023	12:20	NaN	NaN	NaN	NaN	NaN	102 CROSS DRIVE
52	01/01/2023	12:00	QUEENS	11354.0	40.764650	-73.823494	(40.76465, -73.823494)	NORTHERN BOULEVARD
53	01/01/2023	1:17	QUEENS	11375.0	40.724308	-73.842575	(40.724308, -73.842575)	110 STREET
54	01/01/2023	7:15	MANHATTAN	10025.0	40.795250	-73.973210	(40.79525, -73.97321)	WEST END AVENUE
55	01/01/2023	1:30	BRONX	10457.0	40.836480	-73.897736	(40.83648, -73.897736)	CLAREMONT PARKWAY
56	01/01/2023	8:08	NaN	NaN	40.686085	-73.982666	(40.686085, -73.982666)	ATLANTIC AVENUE
57	01/01/2023	6:08	QUEENS	11420.0	40.677242	-73.816720	(40.677242, -73.81672)	NaN
58	01/01/2023	8:20	BRONX	10451.0	NaN	NaN	NaN	EAST 138 STREET
59	01/01/2023	10:11	QUEENS	11358.0	40.760440	-73.804924	(40.76044, -73.804924)	161 STREET

combining by columns

sub_set_4 = pd.concat([sub_set_1, sub_set_2], axis = 1)
sub_set_4

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME
0	01/01/2023	14:38	NaN	NaN	NaN	NaN	NaN	BROOKLYN QUEENS EXPRESSWAY RAMP	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	01/01/2023	8:04	NaN	NaN	NaN	NaN	NaN	NASSAU EXPRESSWAY	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	01/01/2023	18:05	NaN	NaN	NaN	NaN	NaN	10 AVENUE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	01/01/2023	23:45	NaN	NaN	40.769737	-73.912440	(40.769737, -73.91244)	ASTORIA BOULEVARD	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	01/01/2023	4:50	NaN	NaN	40.830555	-73.850720	(40.830555, -73.85072)	CASTLE HILL AVENUE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	01/01/2023	0:35	BROOKLYN	11229.0	40.601310	-73.954720	(40.60131, -73.95472)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	01/01/2023	5:45	QUEENS	11101.0	40.744667	-73.931694	(40.744667, -73.931694)	QUEENS BOULEVARD	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7	01/01/2023	1:00	MANHATTAN	10026.0	40.805595	-73.958190	(40.805595, -73.95819)	WEST 116 STREET	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	01/01/2023	19:53	NaN	NaN	NaN	NaN	NaN	EAST 222 STREET	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9	01/01/2023	3:11	NaN	NaN	40.842110	-73.825570	(40.84211, -73.82557)	BRUCKNER EXPRESSWAY	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10	01/01/2023	18:49	NaN	NaN	40.674923	-73.736940	(40.674923, -73.73694)	MERRICK BOULEVARD	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11	01/01/2023	23:06	QUEENS	11691.0	40.598720	-73.766010	(40.59872, -73.76601)	BEACH 32 STREET	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
12	01/01/2023	1:45	BROOKLYN	11207.0	40.652767	-73.886300	(40.652767, -73.8863)	PENNSYLVANIA AVENUE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
13	01/01/2023	17:55	QUEENS	11375.0	40.710320	-73.849980	(40.71032, -73.84998)	METROPOLITAN AVENUE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14	01/01/2023	17:35	QUEENS	11354.0	40.771587	-73.810070	(40.771587, -73.81007)	BAYSIDE AVENUE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
15	01/01/2023	3:30	BROOKLYN	11236.0	40.636720	-73.887695	(40.63672, -73.887695)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
16	01/01/2023	10:30	QUEENS	11355.0	40.751800	-73.817314	(40.7518, -73.817314)	HOLLY AVENUE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
17	01/01/2023	6:40	STATEN ISLAND	10305.0	40.585240	-74.093414	(40.58524, -74.093414)	HYLAN BOULEVARD	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
18	01/01/2023	5:30	NaN	NaN	40.704810	-73.939320	(40.70481, -73.93932)	SEIGEL STREET	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
19	01/01/2023	1:10	BROOKLYN	11223.0	40.593760	-73.982740	(40.59376, -73.98274)	AVENUE V	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
20	01/01/2023	0:04	NaN	NaN	40.858090	-73.901924	(40.85809, -73.901924)	EAST 183 STREET	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
21	01/01/2023	20:25	QUEENS	11357.0	40.786440	-73.829155	(40.78644, -73.829155)	140 STREET	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
22	01/01/2023	4:20	BROOKLYN	11233.0	40.670116	-73.922480	(40.670116, -73.92248)	SAINT JOHNS PLACE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
23	01/01/2023	6:45	QUEENS	11368.0	40.739280	-73.850530	(40.73928, -73.85053)	SAULTELL AVENUE	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
24	01/01/2023	23:15	QUEENS	11369.0	40.757430	-73.876230	(40.75743, -73.87623)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
25	01/01/2023	3:35	NaN	NaN	NaN	NaN	NaN	GRAND CENTRAL PARKWAY	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
26	01/01/2023	4:00	QUEENS	11372.0	40.751003	-73.892130	(40.751003, -73.89213)	74 STREET	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
27	01/01/2023	5:36	BROOKLYN	11207.0	40.663136	-73.883250	(40.663136, -73.88325)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
28	01/01/2023	7:40	BRONX	10468.0	40.861810	-73.912320	(40.86181, -73.91232)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
29	01/01/2023	10:50	NaN	NaN	40.832700	-73.950226	(40.8327, -73.950226)	HENRY HUDSON PARKWAY	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
30	01/01/2023	17:05	NaN	NaN	40.795250	-73.973210	(40.79525, -73.97321)	WEST 96 STREET	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31	01/01/2023	5:45	QUEENS	11411.0	40.690640	-73.728320	(40.69064, -73.72832)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
32	01/01/2023	0:55	NaN	NaN	40.818470	-73.941400	(40.81847, -73.9414)	WEST 140 STREET	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
33	01/01/2023	10:16	QUEENS	11369.0	40.764633	-73.865370	(40.764633, -73.86537)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
34	01/01/2023	3:45	QUEENS	11694.0	40.580510	-73.847730	(40.58051, -73.84773)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
35	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	1:20	QUEENS	11377.0	40.748720	-73.896250	(40.74872, -73.89625)	BROOKLYN QUEENS EXPRESSWAY
36	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	18:26	BROOKLYN	11208.0	40.678246	-73.870186	(40.678246, -73.870186)	NaN
37	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	0:00	NaN	NaN	NaN	NaN	NaN	WILLIAMSBURG BRIDGE OUTER ROADWA
38	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	22:35	NaN	NaN	40.755780	-74.001990	(40.75578, -74.00199)	WEST 34 STREET
39	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	12:00	BROOKLYN	11211.0	40.706690	-73.958840	(40.70669, -73.95884)	NaN
40	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	8:45	NaN	NaN	40.828114	-73.931070	(40.828114, -73.93107)	MAJOR DEEGAN EXPRESSWAY
41	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	1:20	MANHATTAN	10022.0	40.758980	-73.962440	(40.75898, -73.96244)	1 AVENUE
42	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	16:45	QUEENS	11420.0	40.679070	-73.831540	(40.67907, -73.83154)	NaN
43	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	0:10	BROOKLYN	11236.0	40.638350	-73.908890	(40.63835, -73.90889)	FLATLANDS AVENUE
44	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	14:53	BROOKLYN	11208.0	40.660797	-73.871830	(40.660797, -73.87183)	ATKINS AVENUE
45	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	13:10	NaN	NaN	40.831290	-73.929040	(40.83129, -73.92904)	WOODYCREST AVENUE
46	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	14:16	NaN	NaN	NaN	NaN	NaN	VERRAZANO BRIDGE UPPER
47	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	15:12	NaN	NaN	40.712780	-74.011690	(40.71278, -74.01169)	GREENWICH STREET
48	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	0:00	NaN	NaN	40.688370	-73.944916	(40.68837, -73.944916)	LEXINGTON AVENUE
49	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	3:35	NaN	NaN	40.825935	-73.859130	(40.825935, -73.85913)	BRUCKNER EXPRESSWAY
50	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	9:45	NaN	NaN	40.882645	-73.886566	(40.882645, -73.886566)	SEDGWICK AVENUE
51	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	12:20	NaN	NaN	NaN	NaN	NaN	102 CROSS DRIVE
52	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	12:00	QUEENS	11354.0	40.764650	-73.823494	(40.76465, -73.823494)	NORTHERN BOULEVARD
53	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	1:17	QUEENS	11375.0	40.724308	-73.842575	(40.724308, -73.842575)	110 STREET
54	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	7:15	MANHATTAN	10025.0	40.795250	-73.973210	(40.79525, -73.97321)	WEST END AVENUE
55	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	1:30	BRONX	10457.0	40.836480	-73.897736	(40.83648, -73.897736)	CLAREMONT PARKWAY
56	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	8:08	NaN	NaN	40.686085	-73.982666	(40.686085, -73.982666)	ATLANTIC AVENUE
57	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	6:08	QUEENS	11420.0	40.677242	-73.816720	(40.677242, -73.81672)	NaN
58	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	8:20	BRONX	10451.0	NaN	NaN	NaN	EAST 138 STREET
59	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	01/01/2023	10:11	QUEENS	11358.0	40.760440	-73.804924	(40.76044, -73.804924)	161 STREET

use .fillna() to fill in the missing values

5.3 Merge and Join

pd.merge( ): allows a user to do one-to-one, one-to-many, and many-to-many joins
- One-to-one joins mean each row is related on a single row in a different table using a key column
- One-to-many joins mean each row in one table in related to one or more rows in a different table using a key column
- Many-to-many joins mean one or more rows in one table is related to one or more rows in a seperate table using a key column

5.3.1 Ex: One-to One Join

Create 2 new Data Frames from the January 2023 data with a common column (Zip Code)
- For this example, we are using .drop_duplicates() to get unique ZIP CODE values
Using pd.merge( ) the two data frames are combined using that common column as a key

jan23 = pd.read_csv("data/nyc_crashes_202301.csv") 
crash_zip = jan23[["CRASH DATE", "ZIP CODE"]].copy().drop_duplicates(subset = [ "ZIP CODE"])
crash_zip.tail() # to view DF of crash date & zip code

	CRASH DATE	ZIP CODE
3590	01/16/2023	10023.0
4304	01/19/2023	11363.0
5530	01/24/2023	11109.0
5732	01/25/2023	10280.0
7140	01/31/2023	10169.0

borough_zip = jan23[["ZIP CODE", "BOROUGH"]].copy().drop_duplicates(subset = [ "ZIP CODE"])
borough_zip.tail() # to view DF of zip code and borough

	ZIP CODE	BOROUGH
3590	10023.0	MANHATTAN
4304	11363.0	QUEENS
5530	11109.0	QUEENS
5732	10280.0	MANHATTAN
7140	10169.0	MANHATTAN

merge_w_zip = pd.merge(crash_zip, borough_zip)
merge_w_zip.tail() # to view joined data frames

	CRASH DATE	ZIP CODE	BOROUGH
176	01/16/2023	10023.0	MANHATTAN
177	01/19/2023	11363.0	QUEENS
178	01/24/2023	11109.0	QUEENS
179	01/25/2023	10280.0	MANHATTAN
180	01/31/2023	10169.0	MANHATTAN

We can specify the name of the key column using on:

pd.merge(crash_zip, borough_zip, on = 'ZIP CODE').tail()

	CRASH DATE	ZIP CODE	BOROUGH
176	01/16/2023	10023.0	MANHATTAN
177	01/19/2023	11363.0	QUEENS
178	01/24/2023	11109.0	QUEENS
179	01/25/2023	10280.0	MANHATTAN
180	01/31/2023	10169.0	MANHATTAN

5.3.2 Ex: One-to-Many Join

Create a new DataFrame consisting of Employee and Department
Create a new DataFrame consisting of Employee and Hire Year

emp_dept = pd.DataFrame({'employee': ['Emily', 'Jake', 'Paul', 'Jackie'],
                        'department': ['Accounting', 'HR', 'Engineering', 'Accounting']})

emp_hire = pd.DataFrame({'employee': ['Emily', 'Jake', 'Paul', 'Jackie'],
                        'hire_year': ['2020', '2019', '2023', '2011']})

# combining to create a one to one join
dept_hire = pd.merge(emp_dept,emp_hire) # to merge Employee, Department and Hire Year into one dataframe
dept_hire

	employee	department	hire_year
0	Emily	Accounting	2020
1	Jake	HR	2019
2	Paul	Engineering	2023
3	Jackie	Accounting	2011

Create a new DataFrame that consists of Department and the Supervisor for that department
Merge this new DataFrame with ‘dept_hire’ to create a Many-to-One join using Department as a key

dept_sup = pd.DataFrame({'supervisor': ['Lily', 'Angela', 'Steven'],
                        'department': ['Accounting', 'HR', 'Engineering']})

sup_emp_dept_hire = pd.merge(dept_sup,dept_hire) # merge using Department as the key
sup_emp_dept_hire

	supervisor	department	employee	hire_year
0	Lily	Accounting	Emily	2020
1	Lily	Accounting	Jackie	2011
2	Angela	HR	Jake	2019
3	Steven	Engineering	Paul	2023

5.3.3 Ex: Merging when the Key has Different Variable Names

We will merge two dataframes that have a similar column containing the same information, but are named differently
Using the employee data from above, but changing employee in emp_hire to employee_name
Need to drop either employee or employee_name after merging to not have redundant information

emp_dept_names = pd.DataFrame({'employee': ['Emily', 'Jake', 'Paul', 'Jackie'],
                        'department': ['Accounting', 'HR', 'Engineering', 'Accounting']})

emp_hire_names = pd.DataFrame({'employee_name': ['Emily', 'Jake', 'Paul', 'Jackie'],
                        'hire_year': ['2020', '2019', '2023', '2011']})

# to merge Employee, Department and Hire Year into one dataframe & drop column 'employee_name'
dept_hire_names = pd.merge(emp_dept_names,emp_hire_names, left_on = 'employee', 
                           right_on = 'employee_name').drop('employee_name', axis = 1) 

dept_hire_names

	employee	department	hire_year
0	Emily	Accounting	2020
1	Jake	HR	2019
2	Paul	Engineering	2023
3	Jackie	Accounting	2011

5.3.4 Ex: Joining ‘uszipcode’

Create a subset of jan23 data with 7 zipcodes
Using ‘uszipcode’ data to join the zip codes from jan23 with data provided in this package

from uszipcode import SearchEngine

search = SearchEngine()

# create a DF of zip codes from jan23 & convert to integers
zipcodes = pd.DataFrame(jan23["ZIP CODE"].tail(15).dropna().reset_index(drop = True))
zipcodes["ZIP CODE"] = zipcodes["ZIP CODE"].astype(int)

# create new,empty column in the df to store the address information
zipcodes['Address'] = None


# using uszipcode library to retreive address info
for index, row in zipcodes.iterrows():
    result = search.by_zipcode(row['ZIP CODE'])
    zipcodes.at[index, 'Address'] = result.major_city + ', ' + result.state

print(zipcodes)

    ZIP CODE       Address
0      11228  Brooklyn, NY
1      10027  New York, NY
2      10040  New York, NY
3      10035  New York, NY
4      10035  New York, NY
5      10457     Bronx, NY
6      11203  Brooklyn, NY
7      11208  Brooklyn, NY
8      11230  Brooklyn, NY
9      10033  New York, NY
10     11435   Jamaica, NY

5.4 Aggregation and Grouping

Built-In Pandas Aggregations (for DataFrame & Series objects): - count( ) - Total number of items - first( ), last( ) - First and last item - mean( ), median( )
- Mean and median - min( ), max( )
- Minimum and maximum - std( ), var( )
- Standard deviation and variance - mad( )
- Mean absolute deviation - prod( )
- Product of all items - sum( )
- Sum of all items - groupby() - compute aggregates on subsets of data

5.4.1 Ex: Titanic Groupby

We will use the Titanic Data Set from the ‘seaborn’ library

import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

Below the data is groupby ‘sex’ and the counts for each row are displayed

titanic.groupby('sex').count()

	survived	pclass	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
sex
female	314	314	261	314	314	314	312	314	314	314	97	312	314	314
male	577	577	453	577	577	577	577	577	577	577	106	577	577	577

5.4.2 Ex: Crash Data Group By

Using the crash_zip DataFrame from above, grouping by ZIP CODE and using the count() method, we can see how many counts for each listed zip code

crash_zip.groupby('ZIP CODE').count()

	CRASH DATE
ZIP CODE
10001.0	1
10002.0	1
10003.0	1
10004.0	1
10005.0	1
...	...
11436.0	1
11691.0	1
11692.0	1
11693.0	1
11694.0	1

180 rows × 1 columns

5.5 Pivot Tables

Creates a two dimensional table using column data
Easy way to visualize data to see patterns and summarize data
Use ‘groupby( )’ to create a pivot table

5.5.1 Ex: Pivot Table using Titanic Data Set

We will again use the Titanic data set, but now we will create a pivot table

We can use groupby() to help create a pivot table - Group the data by ‘sex’ and ‘class’ to select survival. - Then use the aggregate function mean() to show within the table

titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()

class	First	Second	Third
sex
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

5.5.2 Ex: Pivot Table of Crash Data

Create a pivot table that shows the Number of Persons Injured for every Crash Date per Zip Code

jan23.pivot_table('NUMBER OF PERSONS INJURED', index = 'ZIP CODE', columns = 'CRASH DATE')

CRASH DATE	01/01/2023	01/02/2023	01/03/2023	01/04/2023	01/05/2023	01/06/2023	01/07/2023	01/08/2023	01/09/2023	01/10/2023	...	01/22/2023	01/23/2023	01/24/2023	01/25/2023	01/26/2023	01/27/2023	01/28/2023	01/29/2023	01/30/2023	01/31/2023
ZIP CODE
10001.0	0.0	NaN	0.000000	NaN	0.0	NaN	NaN	0.0	0.0	NaN	...	NaN	NaN	0.0	0.500000	0.0	0.0	1.00	1.0	0.500000	0.5
10002.0	0.0	0.5	0.666667	0.0	NaN	NaN	NaN	0.0	NaN	0.0	...	0.5	NaN	0.0	1.333333	0.0	0.0	0.75	0.0	0.333333	1.0
10003.0	0.5	0.0	1.000000	1.0	NaN	NaN	0.5	0.0	0.0	NaN	...	0.0	NaN	NaN	NaN	0.0	0.0	0.50	0.0	0.000000	1.0
10004.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.0	...	NaN	NaN	NaN	NaN	NaN	NaN	0.00	NaN	NaN	NaN
10005.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	1.0	NaN	NaN	NaN	0.0	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
11436.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	0.000000	NaN	NaN	0.00	NaN	0.000000	NaN
11691.0	0.0	0.5	1.000000	1.0	1.0	NaN	NaN	0.0	NaN	0.0	...	0.5	1.0	0.0	0.000000	0.0	0.0	0.50	NaN	0.000000	0.0
11692.0	NaN	NaN	1.000000	1.0	0.0	1.0	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	0.0	NaN	NaN	NaN	NaN	0.0
11693.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	0.0	0.000000	NaN	NaN	NaN	NaN	NaN	NaN
11694.0	0.0	NaN	NaN	0.0	NaN	NaN	NaN	NaN	NaN	NaN	...	2.0	1.0	NaN	1.000000	0.0	NaN	NaN	NaN	NaN	NaN

180 rows × 31 columns

5.6 Vectorized String Operations

Must use the ‘str’ attribute of a Pandas Series and Index objects to access operations
Some examples of operations:
- len()
- lower()
- translate()
- slower()
- ljust()
- upper()
- startswith()
- isupper()
- rjust()
- find()
- endswith()
- isnumeric()
- capitalize()
- swapcase()
- istitle()
- rpartition()

5.6.1 Ex: Create a Panda Series of ‘BOROUGH’ names and swap cases.

bname = pd.Series(jan23['BOROUGH'])
bname.head(15).dropna() # original

5      BROOKLYN
6        QUEENS
7     MANHATTAN
11       QUEENS
12     BROOKLYN
13       QUEENS
14       QUEENS
Name: BOROUGH, dtype: object

bname.str.capitalize().head(15).dropna()# to make first letter capital

5      Brooklyn
6        Queens
7     Manhattan
11       Queens
12     Brooklyn
13       Queens
14       Queens
Name: BOROUGH, dtype: object

bname.str.swapcase().head(15).dropna() # to make all lower case

5      brooklyn
6        queens
7     manhattan
11       queens
12     brooklyn
13       queens
14       queens
Name: BOROUGH, dtype: object

bname.str.len().head(15).dropna() # to return the length of the name and data type

5     8.0
6     6.0
7     9.0
11    6.0
12    8.0
13    6.0
14    6.0
Name: BOROUGH, dtype: float64

bname.str.startswith('B').head(15).dropna() # to see if starts with a letter B

5      True
6     False
7     False
11    False
12     True
13    False
14    False
Name: BOROUGH, dtype: object

5.7 Time Series

Time Stamps : Moments in time
- Ex: July 4th, 2023 at 8:00 AM
- Pandas provides the Timestamp type
Time Intervals: Reference a length of time with a beginning and end
- Ex: The year of 2022
- Pandas provides the Period type
Time Deltas/ Durations: Reference an exact length of time
- Ex: 0.3 seconds
- Pandas provides the Timedelta type

5.7.1 Can create a Timestamp object

combines ‘datetime’ and ‘dateutil’ to be used as a Series or DataFrame

date = pd.to_datetime("2nd of February, 2023")
date
type(date)

pandas._libs.tslibs.timestamps.Timestamp

5.7.2 Can create Series that has time indexed data

ind = pd.DatetimeIndex(['2022-07-04', '2022-08-04',
                          '2022-07-04', '2022-08-04'])
inddata = pd.Series([0,1,2,3], index = ind)
inddata

2022-07-04    0
2022-08-04    1
2022-07-04    2
2022-08-04    3
dtype: int64

5.7.3 Frequencies and Offsets

The following are the main codes avaiable:

- D Calendar day    
- B Business day
- W Weekly      
- M Month end   
- BM   Business month end
- Q Quarter end 
- BQ   Business quarter end
- A Year end    
- BA   Business year end
- H Hours   
- BH   Business hours
- T Minutes     
- S Seconds     
- L Milliseonds     
- U Microseconds        
- N nanoseconds

5.7.4 Ex: TimeDelta

create a TimeDelta data type starting at 00:00:00 using frequency of 2 hours and 30 minutes (2H30T) over 5 periods.

pd.timedelta_range(0, periods = 5, freq = "2H30T")

TimedeltaIndex(['0 days 00:00:00', '0 days 02:30:00', '0 days 05:00:00',
                '0 days 07:30:00', '0 days 10:00:00'],
               dtype='timedelta64[ns]', freq='150T')

5.8 High Performance Pandas: `eval()`

eval() uses string expressions to compute operations using DataFrames
- supports all arithmetic operations, comparison operators, bitwise operators (& and |), and the use of and and or in Boolean expressions

nrows, ncols = 10, 5 # creating 2 DF of 5 rows and 10 columns
rand = np.random.RandomState(7)
dfa, dfb = (pd.DataFrame(rand.rand(nrows, ncols))
            for i in range (2))

# to compute sum of dfa and dfb and place into one table

print("dfa", dfa)
print("dfb", dfb)
pd.eval('dfa + dfb')

dfa           0         1         2         3         4
0  0.076308  0.779919  0.438409  0.723465  0.977990
1  0.538496  0.501120  0.072051  0.268439  0.499883
2  0.679230  0.803739  0.380941  0.065936  0.288146
3  0.909594  0.213385  0.452124  0.931206  0.024899
4  0.600549  0.950130  0.230303  0.548490  0.909128
5  0.133169  0.523413  0.750410  0.669013  0.467753
6  0.204849  0.490766  0.372385  0.477401  0.365890
7  0.837918  0.768648  0.313995  0.572625  0.276049
8  0.452843  0.352978  0.657399  0.370351  0.459093
9  0.719324  0.412992  0.906423  0.180452  0.741119
dfb           0         1         2         3         4
0  0.422374  0.426454  0.634380  0.522906  0.414886
1  0.001427  0.092262  0.709394  0.524346  0.696160
2  0.955468  0.682914  0.053129  0.308853  0.592595
3  0.235120  0.964971  0.945048  0.848401  0.472324
4  0.841477  0.131111  0.308734  0.462996  0.741847
5  0.485825  0.136876  0.343537  0.324426  0.300419
6  0.165501  0.414902  0.448121  0.774900  0.796391
7  0.522390  0.460630  0.778214  0.887289  0.674919
8  0.800479  0.939111  0.040656  0.875672  0.276563
9  0.475764  0.796761  0.717242  0.147148  0.658748

	0	1	2	3	4
0	0.498682	1.206372	1.072789	1.246371	1.392875
1	0.539923	0.593383	0.781446	0.792785	1.196043
2	1.634698	1.486653	0.434070	0.374789	0.880740
3	1.144714	1.178356	1.397172	1.779607	0.497223
4	1.442026	1.081240	0.539037	1.011486	1.650976
5	0.618995	0.660289	1.093946	0.993439	0.768172
6	0.370350	0.905668	0.820505	1.252302	1.162281
7	1.360308	1.229278	1.092208	1.459914	0.950968
8	1.253322	1.292090	0.698055	1.246023	0.735656
9	1.195089	1.209753	1.623666	0.327599	1.399867