5  Data Manipulation with Pandas

This chapter was prepared by Garrick Ho and Ginamarie Mastrorilli.

Pandas is an open source Python library for data analysis. It gives Python the ability to work with spreadsheet-like data for fast data loading, manipulating, aligning, merging, etc. The name is derived from ‘panel data’, an econometrics term for multidimensional structured datasets.

import pandas as pd
import numpy as np

Pandas introduces two new data types to Python: Series and DataFrame

5.1 Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index

[1, 3, 4]
[1, 3, 4]

To create a series, usepd.Series().

s = pd.Series([4, 7, -5, 3])
s
0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through n-1 (where n is the length of the data).

s = pd.Series([4, 7, -5, 3], index = ['a', 'b', 'c', 'd'])
s
a    4
b    7
c   -5
d    3
dtype: int64
s.values
array([ 4,  7, -5,  3])
s.index
Index(['a', 'b', 'c', 'd'], dtype='object')

5.1.1 Selecting single or a set of values using index

To select a single value, use [ ].

s['b']
7

To select multiple values, add a comma between each value and use double brackets.

s[['c', 'a', 'b']]
c   -5
a    4
b    7
dtype: int64

Selecting by the index is also possible.

s[2]
-5

To select multiple values in a row, use : and the first index is where it starts and the second index is where it stops. It does not include the second index.

s[1:3]
b    7
c   -5
dtype: int64

To select multiple values, add a comma between each value and use double brackets.

s[[0,3]]
a    4
d    3
dtype: int64

5.1.2 Filtering

Filtering values in a series can be done with <, >, =.

s > 0
a     True
b     True
c    False
d     True
dtype: bool
s[s > 0]
a    4
b    7
d    3
dtype: int64

5.1.3 Math operation

Math functions are able to be apply to a series.

s**2
a    16
b    49
c    25
d     9
dtype: int64
np.exp(s)
a      54.598150
b    1096.633158
c       0.006738
d      20.085537
dtype: float64
s.mean()
2.25

Series are aligned by index label in arithmetic operations.

s2 = pd.Series([1, 2, 3, 4], index = ['a', 'c', 'd', 'e'])
zero = s + s2
zero
a    5.0
b    NaN
c   -3.0
d    6.0
e    NaN
dtype: float64

Note: “NaN” stands for missing values in pandas

5.1.4 Finding NaN values

To find all the missing values in a series, use .isnull().

zero.isnull()
a    False
b     True
c    False
d    False
e     True
dtype: bool

To find all the non-missing vales, use .notnull().

zero.notnull()
a     True
b    False
c     True
d     True
e    False
dtype: bool
zero[zero.notnull()]
a    5.0
c   -3.0
d    6.0
dtype: float64

5.1.5 Replacing NaN

To change NaN to 0, use .fillna().

zero.fillna(0)
a    5.0
b    0.0
c   -3.0
d    6.0
e    0.0
dtype: float64

5.1.6 Forward-fill

Fill all the NaN values with the previous value in the series.

zero.fillna(method = 'ffill')
a    5.0
b    5.0
c   -3.0
d    6.0
e    6.0
dtype: float64

5.1.7 Back-fill

Fill all the NaN values with the next value in the series.

Note that e is Nan because there is no next value in the series.

zero.fillna(method = 'bfill')
a    5.0
b   -3.0
c   -3.0
d    6.0
e    NaN
dtype: float64

5.1.8 Drop

To drop all NaN, use .dropna().

zero.dropna()
a    5.0
c   -3.0
d    6.0
dtype: float64

Notice how that zero hasn’t change at all. If the function wants to be applied to the original series, set it to its self.

zero
a    5.0
b    NaN
c   -3.0
d    6.0
e    NaN
dtype: float64
zero = zero.dropna()
zero
a    5.0
c   -3.0
d    6.0
dtype: float64

change the index to be the same as s2 so there is no missing value

s.index = ['a', 'c', 'd', 'e']
s + s2
a    5
c    9
d   -2
e    7
dtype: int64

5.2 DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns. The DataFrame has both a row and column index.

  • Since each column of a DataFrame is essentially a Series with its column index, it can be thought of as a dictionary of Series all sharing the same index.

  • Each column (Series) has to be the same type, whereas, each row can contain mixed types.

5.2.1 Creating DataFrame

5.2.1.1 from a dict of equal-length lists

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
d = pd.DataFrame(data)
d
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2

5.2.1.2 from an DataFrame

Starting with an empty dataframe, series are able to be added to the Dataframe.

d1 = pd.DataFrame()
d1['state'] = ['Ohio', 'Nevada']
d1['year'] = [2001, 2001]
d1['pop'] = [1.7, 2.4]
d1
state year pop
0 Ohio 2001 1.7
1 Nevada 2001 2.4

5.2.2 select columns

d
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
d['state']
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object
d[['state','pop']]
state pop
0 Ohio 1.5
1 Ohio 1.7
2 Ohio 3.6
3 Nevada 2.4
4 Nevada 2.9
5 Nevada 3.2

5.2.3 select rows

d2 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])
d2
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
d2.loc['Colorado': 'Utah']
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
d2.iloc[1:3]
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11

5.2.4 change row index and column name

Use .rename ti rename any row or column.

d2.rename(index={'Colorado':'Connecticut'},columns={'one':'five'})
five two three four
Ohio 0 1 2 3
Connecticut 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

Notice how d2 does not change.

d2
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

You can use inplace = True to change the original Dataframe.

d2.rename(index = {'Colorado':'Connecticut'}, columns = {'one':'five'}, inplace = True)
d2
five two three four
Ohio 0 1 2 3
Connecticut 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

5.2.5 basics attributes and methods

d2.index
Index(['Ohio', 'Connecticut', 'Utah', 'New York'], dtype='object')
d2.columns
Index(['five', 'two', 'three', 'four'], dtype='object')
d2.values
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
d2.shape
(4, 4)
d2.mean()
five     6.0
two      7.0
three    8.0
four     9.0
dtype: float64

5.2.6 add and delete rows and columns

For dropping a row or column use .drop.

d2.drop(index = "Connecticut", columns = "five") # add "inplace=True" will change the original DataFrame
two three four
Ohio 1 2 3
Utah 9 10 11
New York 13 14 15
d2
five two three four
Ohio 0 1 2 3
Connecticut 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

For deleting a column use del.

del d2['five']
d2
two three four
Ohio 1 2 3
Connecticut 5 6 7
Utah 9 10 11
New York 13 14 15
d2['one'] = [1, 2, 3, 4]
d2
two three four one
Ohio 1 2 3 1
Connecticut 5 6 7 2
Utah 9 10 11 3
New York 13 14 15 4

.pop returns the values and removes it from the original Dataframe.

d2.pop('one')
Ohio           1
Connecticut    2
Utah           3
New York       4
Name: one, dtype: int64
d2
two three four
Ohio 1 2 3
Connecticut 5 6 7
Utah 9 10 11
New York 13 14 15

5.2.7 Common method

You can import dataset as well

5.2.7.1 csv file

import pandas as pd
crashes = pd.read_csv("data/nyc_crashes_202301.csv")
sub_set_1 = crashes.iloc[0:35, 0:8]

5.2.7.2 Head and Tail

These two methods show the first and the last a few records from a DataFrame, default is 5.

sub_set_1.head()
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME
0 01/01/2023 14:38 NaN NaN NaN NaN NaN BROOKLYN QUEENS EXPRESSWAY RAMP
1 01/01/2023 8:04 NaN NaN NaN NaN NaN NASSAU EXPRESSWAY
2 01/01/2023 18:05 NaN NaN NaN NaN NaN 10 AVENUE
3 01/01/2023 23:45 NaN NaN 40.769737 -73.91244 (40.769737, -73.91244) ASTORIA BOULEVARD
4 01/01/2023 4:50 NaN NaN 40.830555 -73.85072 (40.830555, -73.85072) CASTLE HILL AVENUE
sub_set_1.tail()
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME
30 01/01/2023 17:05 NaN NaN 40.795250 -73.97321 (40.79525, -73.97321) WEST 96 STREET
31 01/01/2023 5:45 QUEENS 11411.0 40.690640 -73.72832 (40.69064, -73.72832) NaN
32 01/01/2023 0:55 NaN NaN 40.818470 -73.94140 (40.81847, -73.9414) WEST 140 STREET
33 01/01/2023 10:16 QUEENS 11369.0 40.764633 -73.86537 (40.764633, -73.86537) NaN
34 01/01/2023 3:45 QUEENS 11694.0 40.580510 -73.84773 (40.58051, -73.84773) NaN
sub_set_1.head(3)
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME
0 01/01/2023 14:38 NaN NaN NaN NaN NaN BROOKLYN QUEENS EXPRESSWAY RAMP
1 01/01/2023 8:04 NaN NaN NaN NaN NaN NASSAU EXPRESSWAY
2 01/01/2023 18:05 NaN NaN NaN NaN NaN 10 AVENUE

5.2.8 unique and nunique

To show only unique values, use .unique.

sub_set_1['BOROUGH'].unique()
array([nan, 'BROOKLYN', 'QUEENS', 'MANHATTAN', 'STATEN ISLAND', 'BRONX'],
      dtype=object)

To get the number of unique values, use .nunique.

sub_set_1['BOROUGH'].nunique()
5

5.2.9 count and value_counts

To count the non missing values, use .count.

sub_set_1['BOROUGH'].count()
21

To count the number in each categroy, use .value_counts.

sub_set_1['BOROUGH'].value_counts()
QUEENS           12
BROOKLYN          6
MANHATTAN         1
STATEN ISLAND     1
BRONX             1
Name: BOROUGH, dtype: int64

5.2.10 describe and info

sub_set_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   CRASH DATE      35 non-null     object 
 1   CRASH TIME      35 non-null     object 
 2   BOROUGH         21 non-null     object 
 3   ZIP CODE        21 non-null     float64
 4   LATITUDE        30 non-null     float64
 5   LONGITUDE       30 non-null     float64
 6   LOCATION        30 non-null     object 
 7   ON STREET NAME  27 non-null     object 
dtypes: float64(3), object(5)
memory usage: 2.3+ KB

Summary statistics for numeric type columns.

Use .describe to get an quick summary of the data.

sub_set_1.describe()
ZIP CODE LATITUDE LONGITUDE
count 21.000000 30.000000 30.000000
mean 11188.095238 40.728138 -73.885913
std 416.736476 0.086772 0.076496
min 10026.000000 40.580510 -74.093414
25% 11207.000000 40.664881 -73.937413
50% 11354.000000 40.747835 -73.886998
75% 11369.000000 40.793048 -73.848292
max 11694.000000 40.861810 -73.728320
sub_set_1.describe(percentiles=[x/10 for x in list(range(1, 10, 1))])
ZIP CODE LATITUDE LONGITUDE
count 21.000000 30.000000 30.000000
mean 11188.095238 40.728138 -73.885913
std 416.736476 0.086772 0.076496
min 10026.000000 40.580510 -74.093414
10% 10468.000000 40.598224 -73.959692
20% 11207.000000 40.649558 -73.943165
30% 11223.000000 40.673481 -73.925244
40% 11233.000000 40.708116 -73.906082
50% 11354.000000 40.747835 -73.886998
60% 11357.000000 40.760311 -73.871886
70% 11369.000000 40.776043 -73.850365
80% 11372.000000 40.808170 -73.828438
90% 11411.000000 40.833641 -73.805664
max 11694.000000 40.861810 -73.728320

choose a specific column to get a summary for.

sub_set_1['BOROUGH'].describe()
count         21
unique         5
top       QUEENS
freq          12
Name: BOROUGH, dtype: object

5.2.11 idxmax and nlargest

.idxmax() returns the index of the largest value.

sub_set_1['ZIP CODE'].idxmax()
34

.idxmin() returns the index of the smallest value

sub_set_1['ZIP CODE'].idxmin()
7

.nlargest returns the largest values with their index (default is 5).

sub_set_1['ZIP CODE'].nlargest()
34    11694.0
11    11691.0
31    11411.0
13    11375.0
26    11372.0
Name: ZIP CODE, dtype: float64

.nsmallest returns the smallest 3 values with their index (default is 5).

sub_set_1['ZIP CODE'].nsmallest()
7     10026.0
17    10305.0
28    10468.0
6     11101.0
12    11207.0
Name: ZIP CODE, dtype: float64

5.2.12 sort

use .sort_values to sort values

sub_set_1.sort_values(by = 'BOROUGH')
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME
28 01/01/2023 7:40 BRONX 10468.0 40.861810 -73.912320 (40.86181, -73.91232) NaN
5 01/01/2023 0:35 BROOKLYN 11229.0 40.601310 -73.954720 (40.60131, -73.95472) NaN
27 01/01/2023 5:36 BROOKLYN 11207.0 40.663136 -73.883250 (40.663136, -73.88325) NaN
22 01/01/2023 4:20 BROOKLYN 11233.0 40.670116 -73.922480 (40.670116, -73.92248) SAINT JOHNS PLACE
15 01/01/2023 3:30 BROOKLYN 11236.0 40.636720 -73.887695 (40.63672, -73.887695) NaN
19 01/01/2023 1:10 BROOKLYN 11223.0 40.593760 -73.982740 (40.59376, -73.98274) AVENUE V
12 01/01/2023 1:45 BROOKLYN 11207.0 40.652767 -73.886300 (40.652767, -73.8863) PENNSYLVANIA AVENUE
7 01/01/2023 1:00 MANHATTAN 10026.0 40.805595 -73.958190 (40.805595, -73.95819) WEST 116 STREET
14 01/01/2023 17:35 QUEENS 11354.0 40.771587 -73.810070 (40.771587, -73.81007) BAYSIDE AVENUE
16 01/01/2023 10:30 QUEENS 11355.0 40.751800 -73.817314 (40.7518, -73.817314) HOLLY AVENUE
33 01/01/2023 10:16 QUEENS 11369.0 40.764633 -73.865370 (40.764633, -73.86537) NaN
21 01/01/2023 20:25 QUEENS 11357.0 40.786440 -73.829155 (40.78644, -73.829155) 140 STREET
11 01/01/2023 23:06 QUEENS 11691.0 40.598720 -73.766010 (40.59872, -73.76601) BEACH 32 STREET
23 01/01/2023 6:45 QUEENS 11368.0 40.739280 -73.850530 (40.73928, -73.85053) SAULTELL AVENUE
24 01/01/2023 23:15 QUEENS 11369.0 40.757430 -73.876230 (40.75743, -73.87623) NaN
26 01/01/2023 4:00 QUEENS 11372.0 40.751003 -73.892130 (40.751003, -73.89213) 74 STREET
6 01/01/2023 5:45 QUEENS 11101.0 40.744667 -73.931694 (40.744667, -73.931694) QUEENS BOULEVARD
31 01/01/2023 5:45 QUEENS 11411.0 40.690640 -73.728320 (40.69064, -73.72832) NaN
13 01/01/2023 17:55 QUEENS 11375.0 40.710320 -73.849980 (40.71032, -73.84998) METROPOLITAN AVENUE
34 01/01/2023 3:45 QUEENS 11694.0 40.580510 -73.847730 (40.58051, -73.84773) NaN
17 01/01/2023 6:40 STATEN ISLAND 10305.0 40.585240 -74.093414 (40.58524, -74.093414) HYLAN BOULEVARD
0 01/01/2023 14:38 NaN NaN NaN NaN NaN BROOKLYN QUEENS EXPRESSWAY RAMP
1 01/01/2023 8:04 NaN NaN NaN NaN NaN NASSAU EXPRESSWAY
2 01/01/2023 18:05 NaN NaN NaN NaN NaN 10 AVENUE
3 01/01/2023 23:45 NaN NaN 40.769737 -73.912440 (40.769737, -73.91244) ASTORIA BOULEVARD
4 01/01/2023 4:50 NaN NaN 40.830555 -73.850720 (40.830555, -73.85072) CASTLE HILL AVENUE
8 01/01/2023 19:53 NaN NaN NaN NaN NaN EAST 222 STREET
9 01/01/2023 3:11 NaN NaN 40.842110 -73.825570 (40.84211, -73.82557) BRUCKNER EXPRESSWAY
10 01/01/2023 18:49 NaN NaN 40.674923 -73.736940 (40.674923, -73.73694) MERRICK BOULEVARD
18 01/01/2023 5:30 NaN NaN 40.704810 -73.939320 (40.70481, -73.93932) SEIGEL STREET
20 01/01/2023 0:04 NaN NaN 40.858090 -73.901924 (40.85809, -73.901924) EAST 183 STREET
25 01/01/2023 3:35 NaN NaN NaN NaN NaN GRAND CENTRAL PARKWAY
29 01/01/2023 10:50 NaN NaN 40.832700 -73.950226 (40.8327, -73.950226) HENRY HUDSON PARKWAY
30 01/01/2023 17:05 NaN NaN 40.795250 -73.973210 (40.79525, -73.97321) WEST 96 STREET
32 01/01/2023 0:55 NaN NaN 40.818470 -73.941400 (40.81847, -73.9414) WEST 140 STREET
sub_set_1.sort_values(by = ['CRASH DATE', 'ZIP CODE'], ascending = True)
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME
7 01/01/2023 1:00 MANHATTAN 10026.0 40.805595 -73.958190 (40.805595, -73.95819) WEST 116 STREET
17 01/01/2023 6:40 STATEN ISLAND 10305.0 40.585240 -74.093414 (40.58524, -74.093414) HYLAN BOULEVARD
28 01/01/2023 7:40 BRONX 10468.0 40.861810 -73.912320 (40.86181, -73.91232) NaN
6 01/01/2023 5:45 QUEENS 11101.0 40.744667 -73.931694 (40.744667, -73.931694) QUEENS BOULEVARD
12 01/01/2023 1:45 BROOKLYN 11207.0 40.652767 -73.886300 (40.652767, -73.8863) PENNSYLVANIA AVENUE
27 01/01/2023 5:36 BROOKLYN 11207.0 40.663136 -73.883250 (40.663136, -73.88325) NaN
19 01/01/2023 1:10 BROOKLYN 11223.0 40.593760 -73.982740 (40.59376, -73.98274) AVENUE V
5 01/01/2023 0:35 BROOKLYN 11229.0 40.601310 -73.954720 (40.60131, -73.95472) NaN
22 01/01/2023 4:20 BROOKLYN 11233.0 40.670116 -73.922480 (40.670116, -73.92248) SAINT JOHNS PLACE
15 01/01/2023 3:30 BROOKLYN 11236.0 40.636720 -73.887695 (40.63672, -73.887695) NaN
14 01/01/2023 17:35 QUEENS 11354.0 40.771587 -73.810070 (40.771587, -73.81007) BAYSIDE AVENUE
16 01/01/2023 10:30 QUEENS 11355.0 40.751800 -73.817314 (40.7518, -73.817314) HOLLY AVENUE
21 01/01/2023 20:25 QUEENS 11357.0 40.786440 -73.829155 (40.78644, -73.829155) 140 STREET
23 01/01/2023 6:45 QUEENS 11368.0 40.739280 -73.850530 (40.73928, -73.85053) SAULTELL AVENUE
24 01/01/2023 23:15 QUEENS 11369.0 40.757430 -73.876230 (40.75743, -73.87623) NaN
33 01/01/2023 10:16 QUEENS 11369.0 40.764633 -73.865370 (40.764633, -73.86537) NaN
26 01/01/2023 4:00 QUEENS 11372.0 40.751003 -73.892130 (40.751003, -73.89213) 74 STREET
13 01/01/2023 17:55 QUEENS 11375.0 40.710320 -73.849980 (40.71032, -73.84998) METROPOLITAN AVENUE
31 01/01/2023 5:45 QUEENS 11411.0 40.690640 -73.728320 (40.69064, -73.72832) NaN
11 01/01/2023 23:06 QUEENS 11691.0 40.598720 -73.766010 (40.59872, -73.76601) BEACH 32 STREET
34 01/01/2023 3:45 QUEENS 11694.0 40.580510 -73.847730 (40.58051, -73.84773) NaN
0 01/01/2023 14:38 NaN NaN NaN NaN NaN BROOKLYN QUEENS EXPRESSWAY RAMP
1 01/01/2023 8:04 NaN NaN NaN NaN NaN NASSAU EXPRESSWAY
2 01/01/2023 18:05 NaN NaN NaN NaN NaN 10 AVENUE
3 01/01/2023 23:45 NaN NaN 40.769737 -73.912440 (40.769737, -73.91244) ASTORIA BOULEVARD
4 01/01/2023 4:50 NaN NaN 40.830555 -73.850720 (40.830555, -73.85072) CASTLE HILL AVENUE
8 01/01/2023 19:53 NaN NaN NaN NaN NaN EAST 222 STREET
9 01/01/2023 3:11 NaN NaN 40.842110 -73.825570 (40.84211, -73.82557) BRUCKNER EXPRESSWAY
10 01/01/2023 18:49 NaN NaN 40.674923 -73.736940 (40.674923, -73.73694) MERRICK BOULEVARD
18 01/01/2023 5:30 NaN NaN 40.704810 -73.939320 (40.70481, -73.93932) SEIGEL STREET
20 01/01/2023 0:04 NaN NaN 40.858090 -73.901924 (40.85809, -73.901924) EAST 183 STREET
25 01/01/2023 3:35 NaN NaN NaN NaN NaN GRAND CENTRAL PARKWAY
29 01/01/2023 10:50 NaN NaN 40.832700 -73.950226 (40.8327, -73.950226) HENRY HUDSON PARKWAY
30 01/01/2023 17:05 NaN NaN 40.795250 -73.973210 (40.79525, -73.97321) WEST 96 STREET
32 01/01/2023 0:55 NaN NaN 40.818470 -73.941400 (40.81847, -73.9414) WEST 140 STREET

5.2.13 [] method

[] method can be used to select column(s) by passing column name(s).

sub_set_1['ZIP CODE'].head()
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
Name: ZIP CODE, dtype: float64
sub_set_1[['BOROUGH', 'ZIP CODE', 'LOCATION']].head()
BOROUGH ZIP CODE LOCATION
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN (40.769737, -73.91244)
4 NaN NaN (40.830555, -73.85072)

5.2.14 loc method

loc can be used to index row(s) and column(s) by providing the row and column labels.

df.loc[row_label(s)] Selects single row or subset of rows from the DataFrame by label.

Index single row

sub_set_1.loc[7]
CRASH DATE                    01/01/2023
CRASH TIME                          1:00
BOROUGH                        MANHATTAN
ZIP CODE                         10026.0
LATITUDE                       40.805595
LONGITUDE                      -73.95819
LOCATION          (40.805595, -73.95819)
ON STREET NAME           WEST 116 STREET
Name: 7, dtype: object

Index multiple rows

sub_set_1.loc[:8]
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME
0 01/01/2023 14:38 NaN NaN NaN NaN NaN BROOKLYN QUEENS EXPRESSWAY RAMP
1 01/01/2023 8:04 NaN NaN NaN NaN NaN NASSAU EXPRESSWAY
2 01/01/2023 18:05 NaN NaN NaN NaN NaN 10 AVENUE
3 01/01/2023 23:45 NaN NaN 40.769737 -73.912440 (40.769737, -73.91244) ASTORIA BOULEVARD
4 01/01/2023 4:50 NaN NaN 40.830555 -73.850720 (40.830555, -73.85072) CASTLE HILL AVENUE
5 01/01/2023 0:35 BROOKLYN 11229.0 40.601310 -73.954720 (40.60131, -73.95472) NaN
6 01/01/2023 5:45 QUEENS 11101.0 40.744667 -73.931694 (40.744667, -73.931694) QUEENS BOULEVARD
7 01/01/2023 1:00 MANHATTAN 10026.0 40.805595 -73.958190 (40.805595, -73.95819) WEST 116 STREET
8 01/01/2023 19:53 NaN NaN NaN NaN NaN EAST 222 STREET
sub_set_1.loc[[0, 7, 4, 6]]
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME
0 01/01/2023 14:38 NaN NaN NaN NaN NaN BROOKLYN QUEENS EXPRESSWAY RAMP
7 01/01/2023 1:00 MANHATTAN 10026.0 40.805595 -73.958190 (40.805595, -73.95819) WEST 116 STREET
4 01/01/2023 4:50 NaN NaN 40.830555 -73.850720 (40.830555, -73.85072) CASTLE HILL AVENUE
6 01/01/2023 5:45 QUEENS 11101.0 40.744667 -73.931694 (40.744667, -73.931694) QUEENS BOULEVARD

df.loc[:, col_labels] Selects single column or subset of columns by label.

sub_set_1.loc[:, 'LOCATION'].head()
0                       NaN
1                       NaN
2                       NaN
3    (40.769737, -73.91244)
4    (40.830555, -73.85072)
Name: LOCATION, dtype: object
sub_set_1.loc[:, 'LATITUDE': 'LOCATION'].head()
LATITUDE LONGITUDE LOCATION
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 40.769737 -73.91244 (40.769737, -73.91244)
4 40.830555 -73.85072 (40.830555, -73.85072)
sub_set_1.loc[:, ['BOROUGH', 'ZIP CODE', 'LOCATION']].head()
BOROUGH ZIP CODE LOCATION
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN (40.769737, -73.91244)
4 NaN NaN (40.830555, -73.85072)

df.loc[row_label(s), col_label(s)] Select both rows and columns by label.

sub_set_1.loc[7, 'BOROUGH']
'MANHATTAN'
sub_set_1.loc[:8, ['BOROUGH', 'LOCATION']]
BOROUGH LOCATION
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN (40.769737, -73.91244)
4 NaN (40.830555, -73.85072)
5 BROOKLYN (40.60131, -73.95472)
6 QUEENS (40.744667, -73.931694)
7 MANHATTAN (40.805595, -73.95819)
8 NaN NaN

Index by Boolean Series

sub_set_1['BOROUGH'].isin(['MANHATTAN','QUEENS']).head()
0    False
1    False
2    False
3    False
4    False
Name: BOROUGH, dtype: bool
sub_set_1.loc[sub_set_1['BOROUGH'].isin(['MANHATTAN','QUEENS'])].head()
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME
6 01/01/2023 5:45 QUEENS 11101.0 40.744667 -73.931694 (40.744667, -73.931694) QUEENS BOULEVARD
7 01/01/2023 1:00 MANHATTAN 10026.0 40.805595 -73.958190 (40.805595, -73.95819) WEST 116 STREET
11 01/01/2023 23:06 QUEENS 11691.0 40.598720 -73.766010 (40.59872, -73.76601) BEACH 32 STREET
13 01/01/2023 17:55 QUEENS 11375.0 40.710320 -73.849980 (40.71032, -73.84998) METROPOLITAN AVENUE
14 01/01/2023 17:35 QUEENS 11354.0 40.771587 -73.810070 (40.771587, -73.81007) BAYSIDE AVENUE

Use “&” (and), “|” (or) “~” (not) for Pandas

sub_set_1.loc[(sub_set_1["BOROUGH"] == "MANHATTAN") & (sub_set_1["ZIP CODE"] 
>= 1000)]
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME
7 01/01/2023 1:00 MANHATTAN 10026.0 40.805595 -73.95819 (40.805595, -73.95819) WEST 116 STREET

5.2.15 iloc method

iloc can be used to index row(s) and column(s) by providing the row and column integer(s).

df.iloc[row_integer(s)] Selects single row or subset of rows from the DataFrame by integer position

Note: same as indexing for sequence (but different with loc, it is 0 basis and the selection is close to the left and open to the right (the last item is excluded).

sub_set_1.iloc[3]
CRASH DATE                    01/01/2023
CRASH TIME                         23:45
BOROUGH                              NaN
ZIP CODE                             NaN
LATITUDE                       40.769737
LONGITUDE                      -73.91244
LOCATION          (40.769737, -73.91244)
ON STREET NAME         ASTORIA BOULEVARD
Name: 3, dtype: object
sub_set_1.iloc[:8]
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME
0 01/01/2023 14:38 NaN NaN NaN NaN NaN BROOKLYN QUEENS EXPRESSWAY RAMP
1 01/01/2023 8:04 NaN NaN NaN NaN NaN NASSAU EXPRESSWAY
2 01/01/2023 18:05 NaN NaN NaN NaN NaN 10 AVENUE
3 01/01/2023 23:45 NaN NaN 40.769737 -73.912440 (40.769737, -73.91244) ASTORIA BOULEVARD
4 01/01/2023 4:50 NaN NaN 40.830555 -73.850720 (40.830555, -73.85072) CASTLE HILL AVENUE
5 01/01/2023 0:35 BROOKLYN 11229.0 40.601310 -73.954720 (40.60131, -73.95472) NaN
6 01/01/2023 5:45 QUEENS 11101.0 40.744667 -73.931694 (40.744667, -73.931694) QUEENS BOULEVARD
7 01/01/2023 1:00 MANHATTAN 10026.0 40.805595 -73.958190 (40.805595, -73.95819) WEST 116 STREET
sub_set_1.iloc[:, 1:3].head()
CRASH TIME BOROUGH
0 14:38 NaN
1 8:04 NaN
2 18:05 NaN
3 23:45 NaN
4 4:50 NaN

df.iloc[row_integer(s), col_integer(s)] Selects row and columns from the DataFrame by integer positions.

sub_set_1.iloc[0:5, :6] 
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE
0 01/01/2023 14:38 NaN NaN NaN NaN
1 01/01/2023 8:04 NaN NaN NaN NaN
2 01/01/2023 18:05 NaN NaN NaN NaN
3 01/01/2023 23:45 NaN NaN 40.769737 -73.91244
4 01/01/2023 4:50 NaN NaN 40.830555 -73.85072

5.2.16 concat method

pd.concat([df1, df2], axis = 0) can be used to combine two dataframe either row-wise or column-wise depends on value of axis:

  • 0 (default, row-wise)
  • 1 (column-wise)
sub_set_1
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME
0 01/01/2023 14:38 NaN NaN NaN NaN NaN BROOKLYN QUEENS EXPRESSWAY RAMP
1 01/01/2023 8:04 NaN NaN NaN NaN NaN NASSAU EXPRESSWAY
2 01/01/2023 18:05 NaN NaN NaN NaN NaN 10 AVENUE
3 01/01/2023 23:45 NaN NaN 40.769737 -73.912440 (40.769737, -73.91244) ASTORIA BOULEVARD
4 01/01/2023 4:50 NaN NaN 40.830555 -73.850720 (40.830555, -73.85072) CASTLE HILL AVENUE
5 01/01/2023 0:35 BROOKLYN 11229.0 40.601310 -73.954720 (40.60131, -73.95472) NaN
6 01/01/2023 5:45 QUEENS 11101.0 40.744667 -73.931694 (40.744667, -73.931694) QUEENS BOULEVARD
7 01/01/2023 1:00 MANHATTAN 10026.0 40.805595 -73.958190 (40.805595, -73.95819) WEST 116 STREET
8 01/01/2023 19:53 NaN NaN NaN NaN NaN EAST 222 STREET
9 01/01/2023 3:11 NaN NaN 40.842110 -73.825570 (40.84211, -73.82557) BRUCKNER EXPRESSWAY
10 01/01/2023 18:49 NaN NaN 40.674923 -73.736940 (40.674923, -73.73694) MERRICK BOULEVARD
11 01/01/2023 23:06 QUEENS 11691.0 40.598720 -73.766010 (40.59872, -73.76601) BEACH 32 STREET
12 01/01/2023 1:45 BROOKLYN 11207.0 40.652767 -73.886300 (40.652767, -73.8863) PENNSYLVANIA AVENUE
13 01/01/2023 17:55 QUEENS 11375.0 40.710320 -73.849980 (40.71032, -73.84998) METROPOLITAN AVENUE
14 01/01/2023 17:35 QUEENS 11354.0 40.771587 -73.810070 (40.771587, -73.81007) BAYSIDE AVENUE
15 01/01/2023 3:30 BROOKLYN 11236.0 40.636720 -73.887695 (40.63672, -73.887695) NaN
16 01/01/2023 10:30 QUEENS 11355.0 40.751800 -73.817314 (40.7518, -73.817314) HOLLY AVENUE
17 01/01/2023 6:40 STATEN ISLAND 10305.0 40.585240 -74.093414 (40.58524, -74.093414) HYLAN BOULEVARD
18 01/01/2023 5:30 NaN NaN 40.704810 -73.939320 (40.70481, -73.93932) SEIGEL STREET
19 01/01/2023 1:10 BROOKLYN 11223.0 40.593760 -73.982740 (40.59376, -73.98274) AVENUE V
20 01/01/2023 0:04 NaN NaN 40.858090 -73.901924 (40.85809, -73.901924) EAST 183 STREET
21 01/01/2023 20:25 QUEENS 11357.0 40.786440 -73.829155 (40.78644, -73.829155) 140 STREET
22 01/01/2023 4:20 BROOKLYN 11233.0 40.670116 -73.922480 (40.670116, -73.92248) SAINT JOHNS PLACE
23 01/01/2023 6:45 QUEENS 11368.0 40.739280 -73.850530 (40.73928, -73.85053) SAULTELL AVENUE
24 01/01/2023 23:15 QUEENS 11369.0 40.757430 -73.876230 (40.75743, -73.87623) NaN
25 01/01/2023 3:35 NaN NaN NaN NaN NaN GRAND CENTRAL PARKWAY
26 01/01/2023 4:00 QUEENS 11372.0 40.751003 -73.892130 (40.751003, -73.89213) 74 STREET
27 01/01/2023 5:36 BROOKLYN 11207.0 40.663136 -73.883250 (40.663136, -73.88325) NaN
28 01/01/2023 7:40 BRONX 10468.0 40.861810 -73.912320 (40.86181, -73.91232) NaN
29 01/01/2023 10:50 NaN NaN 40.832700 -73.950226 (40.8327, -73.950226) HENRY HUDSON PARKWAY
30 01/01/2023 17:05 NaN NaN 40.795250 -73.973210 (40.79525, -73.97321) WEST 96 STREET
31 01/01/2023 5:45 QUEENS 11411.0 40.690640 -73.728320 (40.69064, -73.72832) NaN
32 01/01/2023 0:55 NaN NaN 40.818470 -73.941400 (40.81847, -73.9414) WEST 140 STREET
33 01/01/2023 10:16 QUEENS 11369.0 40.764633 -73.865370 (40.764633, -73.86537) NaN
34 01/01/2023 3:45 QUEENS 11694.0 40.580510 -73.847730 (40.58051, -73.84773) NaN
sub_set_2 = crashes.iloc[35:60, 0:8]
sub_set_2
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME
35 01/01/2023 1:20 QUEENS 11377.0 40.748720 -73.896250 (40.74872, -73.89625) BROOKLYN QUEENS EXPRESSWAY
36 01/01/2023 18:26 BROOKLYN 11208.0 40.678246 -73.870186 (40.678246, -73.870186) NaN
37 01/01/2023 0:00 NaN NaN NaN NaN NaN WILLIAMSBURG BRIDGE OUTER ROADWA
38 01/01/2023 22:35 NaN NaN 40.755780 -74.001990 (40.75578, -74.00199) WEST 34 STREET
39 01/01/2023 12:00 BROOKLYN 11211.0 40.706690 -73.958840 (40.70669, -73.95884) NaN
40 01/01/2023 8:45 NaN NaN 40.828114 -73.931070 (40.828114, -73.93107) MAJOR DEEGAN EXPRESSWAY
41 01/01/2023 1:20 MANHATTAN 10022.0 40.758980 -73.962440 (40.75898, -73.96244) 1 AVENUE
42 01/01/2023 16:45 QUEENS 11420.0 40.679070 -73.831540 (40.67907, -73.83154) NaN
43 01/01/2023 0:10 BROOKLYN 11236.0 40.638350 -73.908890 (40.63835, -73.90889) FLATLANDS AVENUE
44 01/01/2023 14:53 BROOKLYN 11208.0 40.660797 -73.871830 (40.660797, -73.87183) ATKINS AVENUE
45 01/01/2023 13:10 NaN NaN 40.831290 -73.929040 (40.83129, -73.92904) WOODYCREST AVENUE
46 01/01/2023 14:16 NaN NaN NaN NaN NaN VERRAZANO BRIDGE UPPER
47 01/01/2023 15:12 NaN NaN 40.712780 -74.011690 (40.71278, -74.01169) GREENWICH STREET
48 01/01/2023 0:00 NaN NaN 40.688370 -73.944916 (40.68837, -73.944916) LEXINGTON AVENUE
49 01/01/2023 3:35 NaN NaN 40.825935 -73.859130 (40.825935, -73.85913) BRUCKNER EXPRESSWAY
50 01/01/2023 9:45 NaN NaN 40.882645 -73.886566 (40.882645, -73.886566) SEDGWICK AVENUE
51 01/01/2023 12:20 NaN NaN NaN NaN NaN 102 CROSS DRIVE
52 01/01/2023 12:00 QUEENS 11354.0 40.764650 -73.823494 (40.76465, -73.823494) NORTHERN BOULEVARD
53 01/01/2023 1:17 QUEENS 11375.0 40.724308 -73.842575 (40.724308, -73.842575) 110 STREET
54 01/01/2023 7:15 MANHATTAN 10025.0 40.795250 -73.973210 (40.79525, -73.97321) WEST END AVENUE
55 01/01/2023 1:30 BRONX 10457.0 40.836480 -73.897736 (40.83648, -73.897736) CLAREMONT PARKWAY
56 01/01/2023 8:08 NaN NaN 40.686085 -73.982666 (40.686085, -73.982666) ATLANTIC AVENUE
57 01/01/2023 6:08 QUEENS 11420.0 40.677242 -73.816720 (40.677242, -73.81672) NaN
58 01/01/2023 8:20 BRONX 10451.0 NaN NaN NaN EAST 138 STREET
59 01/01/2023 10:11 QUEENS 11358.0 40.760440 -73.804924 (40.76044, -73.804924) 161 STREET

combining by rows

sub_set_3 = pd.concat([sub_set_1, sub_set_2])
sub_set_3
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME
0 01/01/2023 14:38 NaN NaN NaN NaN NaN BROOKLYN QUEENS EXPRESSWAY RAMP
1 01/01/2023 8:04 NaN NaN NaN NaN NaN NASSAU EXPRESSWAY
2 01/01/2023 18:05 NaN NaN NaN NaN NaN 10 AVENUE
3 01/01/2023 23:45 NaN NaN 40.769737 -73.912440 (40.769737, -73.91244) ASTORIA BOULEVARD
4 01/01/2023 4:50 NaN NaN 40.830555 -73.850720 (40.830555, -73.85072) CASTLE HILL AVENUE
5 01/01/2023 0:35 BROOKLYN 11229.0 40.601310 -73.954720 (40.60131, -73.95472) NaN
6 01/01/2023 5:45 QUEENS 11101.0 40.744667 -73.931694 (40.744667, -73.931694) QUEENS BOULEVARD
7 01/01/2023 1:00 MANHATTAN 10026.0 40.805595 -73.958190 (40.805595, -73.95819) WEST 116 STREET
8 01/01/2023 19:53 NaN NaN NaN NaN NaN EAST 222 STREET
9 01/01/2023 3:11 NaN NaN 40.842110 -73.825570 (40.84211, -73.82557) BRUCKNER EXPRESSWAY
10 01/01/2023 18:49 NaN NaN 40.674923 -73.736940 (40.674923, -73.73694) MERRICK BOULEVARD
11 01/01/2023 23:06 QUEENS 11691.0 40.598720 -73.766010 (40.59872, -73.76601) BEACH 32 STREET
12 01/01/2023 1:45 BROOKLYN 11207.0 40.652767 -73.886300 (40.652767, -73.8863) PENNSYLVANIA AVENUE
13 01/01/2023 17:55 QUEENS 11375.0 40.710320 -73.849980 (40.71032, -73.84998) METROPOLITAN AVENUE
14 01/01/2023 17:35 QUEENS 11354.0 40.771587 -73.810070 (40.771587, -73.81007) BAYSIDE AVENUE
15 01/01/2023 3:30 BROOKLYN 11236.0 40.636720 -73.887695 (40.63672, -73.887695) NaN
16 01/01/2023 10:30 QUEENS 11355.0 40.751800 -73.817314 (40.7518, -73.817314) HOLLY AVENUE
17 01/01/2023 6:40 STATEN ISLAND 10305.0 40.585240 -74.093414 (40.58524, -74.093414) HYLAN BOULEVARD
18 01/01/2023 5:30 NaN NaN 40.704810 -73.939320 (40.70481, -73.93932) SEIGEL STREET
19 01/01/2023 1:10 BROOKLYN 11223.0 40.593760 -73.982740 (40.59376, -73.98274) AVENUE V
20 01/01/2023 0:04 NaN NaN 40.858090 -73.901924 (40.85809, -73.901924) EAST 183 STREET
21 01/01/2023 20:25 QUEENS 11357.0 40.786440 -73.829155 (40.78644, -73.829155) 140 STREET
22 01/01/2023 4:20 BROOKLYN 11233.0 40.670116 -73.922480 (40.670116, -73.92248) SAINT JOHNS PLACE
23 01/01/2023 6:45 QUEENS 11368.0 40.739280 -73.850530 (40.73928, -73.85053) SAULTELL AVENUE
24 01/01/2023 23:15 QUEENS 11369.0 40.757430 -73.876230 (40.75743, -73.87623) NaN
25 01/01/2023 3:35 NaN NaN NaN NaN NaN GRAND CENTRAL PARKWAY
26 01/01/2023 4:00 QUEENS 11372.0 40.751003 -73.892130 (40.751003, -73.89213) 74 STREET
27 01/01/2023 5:36 BROOKLYN 11207.0 40.663136 -73.883250 (40.663136, -73.88325) NaN
28 01/01/2023 7:40 BRONX 10468.0 40.861810 -73.912320 (40.86181, -73.91232) NaN
29 01/01/2023 10:50 NaN NaN 40.832700 -73.950226 (40.8327, -73.950226) HENRY HUDSON PARKWAY
30 01/01/2023 17:05 NaN NaN 40.795250 -73.973210 (40.79525, -73.97321) WEST 96 STREET
31 01/01/2023 5:45 QUEENS 11411.0 40.690640 -73.728320 (40.69064, -73.72832) NaN
32 01/01/2023 0:55 NaN NaN 40.818470 -73.941400 (40.81847, -73.9414) WEST 140 STREET
33 01/01/2023 10:16 QUEENS 11369.0 40.764633 -73.865370 (40.764633, -73.86537) NaN
34 01/01/2023 3:45 QUEENS 11694.0 40.580510 -73.847730 (40.58051, -73.84773) NaN
35 01/01/2023 1:20 QUEENS 11377.0 40.748720 -73.896250 (40.74872, -73.89625) BROOKLYN QUEENS EXPRESSWAY
36 01/01/2023 18:26 BROOKLYN 11208.0 40.678246 -73.870186 (40.678246, -73.870186) NaN
37 01/01/2023 0:00 NaN NaN NaN NaN NaN WILLIAMSBURG BRIDGE OUTER ROADWA
38 01/01/2023 22:35 NaN NaN 40.755780 -74.001990 (40.75578, -74.00199) WEST 34 STREET
39 01/01/2023 12:00 BROOKLYN 11211.0 40.706690 -73.958840 (40.70669, -73.95884) NaN
40 01/01/2023 8:45 NaN NaN 40.828114 -73.931070 (40.828114, -73.93107) MAJOR DEEGAN EXPRESSWAY
41 01/01/2023 1:20 MANHATTAN 10022.0 40.758980 -73.962440 (40.75898, -73.96244) 1 AVENUE
42 01/01/2023 16:45 QUEENS 11420.0 40.679070 -73.831540 (40.67907, -73.83154) NaN
43 01/01/2023 0:10 BROOKLYN 11236.0 40.638350 -73.908890 (40.63835, -73.90889) FLATLANDS AVENUE
44 01/01/2023 14:53 BROOKLYN 11208.0 40.660797 -73.871830 (40.660797, -73.87183) ATKINS AVENUE
45 01/01/2023 13:10 NaN NaN 40.831290 -73.929040 (40.83129, -73.92904) WOODYCREST AVENUE
46 01/01/2023 14:16 NaN NaN NaN NaN NaN VERRAZANO BRIDGE UPPER
47 01/01/2023 15:12 NaN NaN 40.712780 -74.011690 (40.71278, -74.01169) GREENWICH STREET
48 01/01/2023 0:00 NaN NaN 40.688370 -73.944916 (40.68837, -73.944916) LEXINGTON AVENUE
49 01/01/2023 3:35 NaN NaN 40.825935 -73.859130 (40.825935, -73.85913) BRUCKNER EXPRESSWAY
50 01/01/2023 9:45 NaN NaN 40.882645 -73.886566 (40.882645, -73.886566) SEDGWICK AVENUE
51 01/01/2023 12:20 NaN NaN NaN NaN NaN 102 CROSS DRIVE
52 01/01/2023 12:00 QUEENS 11354.0 40.764650 -73.823494 (40.76465, -73.823494) NORTHERN BOULEVARD
53 01/01/2023 1:17 QUEENS 11375.0 40.724308 -73.842575 (40.724308, -73.842575) 110 STREET
54 01/01/2023 7:15 MANHATTAN 10025.0 40.795250 -73.973210 (40.79525, -73.97321) WEST END AVENUE
55 01/01/2023 1:30 BRONX 10457.0 40.836480 -73.897736 (40.83648, -73.897736) CLAREMONT PARKWAY
56 01/01/2023 8:08 NaN NaN 40.686085 -73.982666 (40.686085, -73.982666) ATLANTIC AVENUE
57 01/01/2023 6:08 QUEENS 11420.0 40.677242 -73.816720 (40.677242, -73.81672) NaN
58 01/01/2023 8:20 BRONX 10451.0 NaN NaN NaN EAST 138 STREET
59 01/01/2023 10:11 QUEENS 11358.0 40.760440 -73.804924 (40.76044, -73.804924) 161 STREET

combining by columns

sub_set_4 = pd.concat([sub_set_1, sub_set_2], axis = 1)
sub_set_4
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME
0 01/01/2023 14:38 NaN NaN NaN NaN NaN BROOKLYN QUEENS EXPRESSWAY RAMP NaN NaN NaN NaN NaN NaN NaN NaN
1 01/01/2023 8:04 NaN NaN NaN NaN NaN NASSAU EXPRESSWAY NaN NaN NaN NaN NaN NaN NaN NaN
2 01/01/2023 18:05 NaN NaN NaN NaN NaN 10 AVENUE NaN NaN NaN NaN NaN NaN NaN NaN
3 01/01/2023 23:45 NaN NaN 40.769737 -73.912440 (40.769737, -73.91244) ASTORIA BOULEVARD NaN NaN NaN NaN NaN NaN NaN NaN
4 01/01/2023 4:50 NaN NaN 40.830555 -73.850720 (40.830555, -73.85072) CASTLE HILL AVENUE NaN NaN NaN NaN NaN NaN NaN NaN
5 01/01/2023 0:35 BROOKLYN 11229.0 40.601310 -73.954720 (40.60131, -73.95472) NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 01/01/2023 5:45 QUEENS 11101.0 40.744667 -73.931694 (40.744667, -73.931694) QUEENS BOULEVARD NaN NaN NaN NaN NaN NaN NaN NaN
7 01/01/2023 1:00 MANHATTAN 10026.0 40.805595 -73.958190 (40.805595, -73.95819) WEST 116 STREET NaN NaN NaN NaN NaN NaN NaN NaN
8 01/01/2023 19:53 NaN NaN NaN NaN NaN EAST 222 STREET NaN NaN NaN NaN NaN NaN NaN NaN
9 01/01/2023 3:11 NaN NaN 40.842110 -73.825570 (40.84211, -73.82557) BRUCKNER EXPRESSWAY NaN NaN NaN NaN NaN NaN NaN NaN
10 01/01/2023 18:49 NaN NaN 40.674923 -73.736940 (40.674923, -73.73694) MERRICK BOULEVARD NaN NaN NaN NaN NaN NaN NaN NaN
11 01/01/2023 23:06 QUEENS 11691.0 40.598720 -73.766010 (40.59872, -73.76601) BEACH 32 STREET NaN NaN NaN NaN NaN NaN NaN NaN
12 01/01/2023 1:45 BROOKLYN 11207.0 40.652767 -73.886300 (40.652767, -73.8863) PENNSYLVANIA AVENUE NaN NaN NaN NaN NaN NaN NaN NaN
13 01/01/2023 17:55 QUEENS 11375.0 40.710320 -73.849980 (40.71032, -73.84998) METROPOLITAN AVENUE NaN NaN NaN NaN NaN NaN NaN NaN
14 01/01/2023 17:35 QUEENS 11354.0 40.771587 -73.810070 (40.771587, -73.81007) BAYSIDE AVENUE NaN NaN NaN NaN NaN NaN NaN NaN
15 01/01/2023 3:30 BROOKLYN 11236.0 40.636720 -73.887695 (40.63672, -73.887695) NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 01/01/2023 10:30 QUEENS 11355.0 40.751800 -73.817314 (40.7518, -73.817314) HOLLY AVENUE NaN NaN NaN NaN NaN NaN NaN NaN
17 01/01/2023 6:40 STATEN ISLAND 10305.0 40.585240 -74.093414 (40.58524, -74.093414) HYLAN BOULEVARD NaN NaN NaN NaN NaN NaN NaN NaN
18 01/01/2023 5:30 NaN NaN 40.704810 -73.939320 (40.70481, -73.93932) SEIGEL STREET NaN NaN NaN NaN NaN NaN NaN NaN
19 01/01/2023 1:10 BROOKLYN 11223.0 40.593760 -73.982740 (40.59376, -73.98274) AVENUE V NaN NaN NaN NaN NaN NaN NaN NaN
20 01/01/2023 0:04 NaN NaN 40.858090 -73.901924 (40.85809, -73.901924) EAST 183 STREET NaN NaN NaN NaN NaN NaN NaN NaN
21 01/01/2023 20:25 QUEENS 11357.0 40.786440 -73.829155 (40.78644, -73.829155) 140 STREET NaN NaN NaN NaN NaN NaN NaN NaN
22 01/01/2023 4:20 BROOKLYN 11233.0 40.670116 -73.922480 (40.670116, -73.92248) SAINT JOHNS PLACE NaN NaN NaN NaN NaN NaN NaN NaN
23 01/01/2023 6:45 QUEENS 11368.0 40.739280 -73.850530 (40.73928, -73.85053) SAULTELL AVENUE NaN NaN NaN NaN NaN NaN NaN NaN
24 01/01/2023 23:15 QUEENS 11369.0 40.757430 -73.876230 (40.75743, -73.87623) NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 01/01/2023 3:35 NaN NaN NaN NaN NaN GRAND CENTRAL PARKWAY NaN NaN NaN NaN NaN NaN NaN NaN
26 01/01/2023 4:00 QUEENS 11372.0 40.751003 -73.892130 (40.751003, -73.89213) 74 STREET NaN NaN NaN NaN NaN NaN NaN NaN
27 01/01/2023 5:36 BROOKLYN 11207.0 40.663136 -73.883250 (40.663136, -73.88325) NaN NaN NaN NaN NaN NaN NaN NaN NaN
28 01/01/2023 7:40 BRONX 10468.0 40.861810 -73.912320 (40.86181, -73.91232) NaN NaN NaN NaN NaN NaN NaN NaN NaN
29 01/01/2023 10:50 NaN NaN 40.832700 -73.950226 (40.8327, -73.950226) HENRY HUDSON PARKWAY NaN NaN NaN NaN NaN NaN NaN NaN
30 01/01/2023 17:05 NaN NaN 40.795250 -73.973210 (40.79525, -73.97321) WEST 96 STREET NaN NaN NaN NaN NaN NaN NaN NaN
31 01/01/2023 5:45 QUEENS 11411.0 40.690640 -73.728320 (40.69064, -73.72832) NaN NaN NaN NaN NaN NaN NaN NaN NaN
32 01/01/2023 0:55 NaN NaN 40.818470 -73.941400 (40.81847, -73.9414) WEST 140 STREET NaN NaN NaN NaN NaN NaN NaN NaN
33 01/01/2023 10:16 QUEENS 11369.0 40.764633 -73.865370 (40.764633, -73.86537) NaN NaN NaN NaN NaN NaN NaN NaN NaN
34 01/01/2023 3:45 QUEENS 11694.0 40.580510 -73.847730 (40.58051, -73.84773) NaN NaN NaN NaN NaN NaN NaN NaN NaN
35 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 1:20 QUEENS 11377.0 40.748720 -73.896250 (40.74872, -73.89625) BROOKLYN QUEENS EXPRESSWAY
36 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 18:26 BROOKLYN 11208.0 40.678246 -73.870186 (40.678246, -73.870186) NaN
37 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 0:00 NaN NaN NaN NaN NaN WILLIAMSBURG BRIDGE OUTER ROADWA
38 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 22:35 NaN NaN 40.755780 -74.001990 (40.75578, -74.00199) WEST 34 STREET
39 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 12:00 BROOKLYN 11211.0 40.706690 -73.958840 (40.70669, -73.95884) NaN
40 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 8:45 NaN NaN 40.828114 -73.931070 (40.828114, -73.93107) MAJOR DEEGAN EXPRESSWAY
41 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 1:20 MANHATTAN 10022.0 40.758980 -73.962440 (40.75898, -73.96244) 1 AVENUE
42 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 16:45 QUEENS 11420.0 40.679070 -73.831540 (40.67907, -73.83154) NaN
43 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 0:10 BROOKLYN 11236.0 40.638350 -73.908890 (40.63835, -73.90889) FLATLANDS AVENUE
44 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 14:53 BROOKLYN 11208.0 40.660797 -73.871830 (40.660797, -73.87183) ATKINS AVENUE
45 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 13:10 NaN NaN 40.831290 -73.929040 (40.83129, -73.92904) WOODYCREST AVENUE
46 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 14:16 NaN NaN NaN NaN NaN VERRAZANO BRIDGE UPPER
47 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 15:12 NaN NaN 40.712780 -74.011690 (40.71278, -74.01169) GREENWICH STREET
48 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 0:00 NaN NaN 40.688370 -73.944916 (40.68837, -73.944916) LEXINGTON AVENUE
49 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 3:35 NaN NaN 40.825935 -73.859130 (40.825935, -73.85913) BRUCKNER EXPRESSWAY
50 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 9:45 NaN NaN 40.882645 -73.886566 (40.882645, -73.886566) SEDGWICK AVENUE
51 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 12:20 NaN NaN NaN NaN NaN 102 CROSS DRIVE
52 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 12:00 QUEENS 11354.0 40.764650 -73.823494 (40.76465, -73.823494) NORTHERN BOULEVARD
53 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 1:17 QUEENS 11375.0 40.724308 -73.842575 (40.724308, -73.842575) 110 STREET
54 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 7:15 MANHATTAN 10025.0 40.795250 -73.973210 (40.79525, -73.97321) WEST END AVENUE
55 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 1:30 BRONX 10457.0 40.836480 -73.897736 (40.83648, -73.897736) CLAREMONT PARKWAY
56 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 8:08 NaN NaN 40.686085 -73.982666 (40.686085, -73.982666) ATLANTIC AVENUE
57 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 6:08 QUEENS 11420.0 40.677242 -73.816720 (40.677242, -73.81672) NaN
58 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 8:20 BRONX 10451.0 NaN NaN NaN EAST 138 STREET
59 NaN NaN NaN NaN NaN NaN NaN NaN 01/01/2023 10:11 QUEENS 11358.0 40.760440 -73.804924 (40.76044, -73.804924) 161 STREET

use .fillna() to fill in the missing values

5.3 Merge and Join

  • pd.merge( ): allows a user to do one-to-one, one-to-many, and many-to-many joins
    • One-to-one joins mean each row is related on a single row in a different table using a key column
    • One-to-many joins mean each row in one table in related to one or more rows in a different table using a key column
    • Many-to-many joins mean one or more rows in one table is related to one or more rows in a seperate table using a key column

5.3.1 Ex: One-to One Join

  • Create 2 new Data Frames from the January 2023 data with a common column (Zip Code)
    • For this example, we are using .drop_duplicates() to get unique ZIP CODE values
  • Using pd.merge( ) the two data frames are combined using that common column as a key
jan23 = pd.read_csv("data/nyc_crashes_202301.csv") 
crash_zip = jan23[["CRASH DATE", "ZIP CODE"]].copy().drop_duplicates(subset = [ "ZIP CODE"])
crash_zip.tail() # to view DF of crash date & zip code
CRASH DATE ZIP CODE
3590 01/16/2023 10023.0
4304 01/19/2023 11363.0
5530 01/24/2023 11109.0
5732 01/25/2023 10280.0
7140 01/31/2023 10169.0
borough_zip = jan23[["ZIP CODE", "BOROUGH"]].copy().drop_duplicates(subset = [ "ZIP CODE"])
borough_zip.tail() # to view DF of zip code and borough
ZIP CODE BOROUGH
3590 10023.0 MANHATTAN
4304 11363.0 QUEENS
5530 11109.0 QUEENS
5732 10280.0 MANHATTAN
7140 10169.0 MANHATTAN
merge_w_zip = pd.merge(crash_zip, borough_zip)
merge_w_zip.tail() # to view joined data frames
CRASH DATE ZIP CODE BOROUGH
176 01/16/2023 10023.0 MANHATTAN
177 01/19/2023 11363.0 QUEENS
178 01/24/2023 11109.0 QUEENS
179 01/25/2023 10280.0 MANHATTAN
180 01/31/2023 10169.0 MANHATTAN

We can specify the name of the key column using on:

pd.merge(crash_zip, borough_zip, on = 'ZIP CODE').tail()
CRASH DATE ZIP CODE BOROUGH
176 01/16/2023 10023.0 MANHATTAN
177 01/19/2023 11363.0 QUEENS
178 01/24/2023 11109.0 QUEENS
179 01/25/2023 10280.0 MANHATTAN
180 01/31/2023 10169.0 MANHATTAN

5.3.2 Ex: One-to-Many Join

  • Create a new DataFrame consisting of Employee and Department
  • Create a new DataFrame consisting of Employee and Hire Year
emp_dept = pd.DataFrame({'employee': ['Emily', 'Jake', 'Paul', 'Jackie'],
                        'department': ['Accounting', 'HR', 'Engineering', 'Accounting']})

emp_hire = pd.DataFrame({'employee': ['Emily', 'Jake', 'Paul', 'Jackie'],
                        'hire_year': ['2020', '2019', '2023', '2011']})

# combining to create a one to one join
dept_hire = pd.merge(emp_dept,emp_hire) # to merge Employee, Department and Hire Year into one dataframe
dept_hire
employee department hire_year
0 Emily Accounting 2020
1 Jake HR 2019
2 Paul Engineering 2023
3 Jackie Accounting 2011
  • Create a new DataFrame that consists of Department and the Supervisor for that department
  • Merge this new DataFrame with ‘dept_hire’ to create a Many-to-One join using Department as a key
dept_sup = pd.DataFrame({'supervisor': ['Lily', 'Angela', 'Steven'],
                        'department': ['Accounting', 'HR', 'Engineering']})

sup_emp_dept_hire = pd.merge(dept_sup,dept_hire) # merge using Department as the key
sup_emp_dept_hire
supervisor department employee hire_year
0 Lily Accounting Emily 2020
1 Lily Accounting Jackie 2011
2 Angela HR Jake 2019
3 Steven Engineering Paul 2023

5.3.3 Ex: Merging when the Key has Different Variable Names

  • We will merge two dataframes that have a similar column containing the same information, but are named differently
  • Using the employee data from above, but changing employee in emp_hire to employee_name
  • Need to drop either employee or employee_name after merging to not have redundant information
emp_dept_names = pd.DataFrame({'employee': ['Emily', 'Jake', 'Paul', 'Jackie'],
                        'department': ['Accounting', 'HR', 'Engineering', 'Accounting']})

emp_hire_names = pd.DataFrame({'employee_name': ['Emily', 'Jake', 'Paul', 'Jackie'],
                        'hire_year': ['2020', '2019', '2023', '2011']})

# to merge Employee, Department and Hire Year into one dataframe & drop column 'employee_name'
dept_hire_names = pd.merge(emp_dept_names,emp_hire_names, left_on = 'employee', 
                           right_on = 'employee_name').drop('employee_name', axis = 1) 

dept_hire_names
employee department hire_year
0 Emily Accounting 2020
1 Jake HR 2019
2 Paul Engineering 2023
3 Jackie Accounting 2011

5.3.4 Ex: Joining ‘uszipcode’

  • Create a subset of jan23 data with 7 zipcodes
  • Using ‘uszipcode’ data to join the zip codes from jan23 with data provided in this package
from uszipcode import SearchEngine
search = SearchEngine()

# create a DF of zip codes from jan23 & convert to integers
zipcodes = pd.DataFrame(jan23["ZIP CODE"].tail(15).dropna().reset_index(drop = True))
zipcodes["ZIP CODE"] = zipcodes["ZIP CODE"].astype(int)

# create new,empty column in the df to store the address information
zipcodes['Address'] = None


# using uszipcode library to retreive address info
for index, row in zipcodes.iterrows():
    result = search.by_zipcode(row['ZIP CODE'])
    zipcodes.at[index, 'Address'] = result.major_city + ', ' + result.state

print(zipcodes)
    ZIP CODE       Address
0      11228  Brooklyn, NY
1      10027  New York, NY
2      10040  New York, NY
3      10035  New York, NY
4      10035  New York, NY
5      10457     Bronx, NY
6      11203  Brooklyn, NY
7      11208  Brooklyn, NY
8      11230  Brooklyn, NY
9      10033  New York, NY
10     11435   Jamaica, NY

5.4 Aggregation and Grouping

Built-In Pandas Aggregations (for DataFrame & Series objects): - count( ) - Total number of items - first( ), last( ) - First and last item - mean( ), median( )
- Mean and median - min( ), max( )
- Minimum and maximum - std( ), var( )
- Standard deviation and variance - mad( )
- Mean absolute deviation - prod( )
- Product of all items - sum( )
- Sum of all items - groupby() - compute aggregates on subsets of data

5.4.1 Ex: Titanic Groupby

  • We will use the Titanic Data Set from the ‘seaborn’ library
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
  • Below the data is groupby ‘sex’ and the counts for each row are displayed
titanic.groupby('sex').count()
survived pclass age sibsp parch fare embarked class who adult_male deck embark_town alive alone
sex
female 314 314 261 314 314 314 312 314 314 314 97 312 314 314
male 577 577 453 577 577 577 577 577 577 577 106 577 577 577

5.4.2 Ex: Crash Data Group By

  • Using the crash_zip DataFrame from above, grouping by ZIP CODE and using the count() method, we can see how many counts for each listed zip code
crash_zip.groupby('ZIP CODE').count()
CRASH DATE
ZIP CODE
10001.0 1
10002.0 1
10003.0 1
10004.0 1
10005.0 1
... ...
11436.0 1
11691.0 1
11692.0 1
11693.0 1
11694.0 1

180 rows × 1 columns

5.5 Pivot Tables

  • Creates a two dimensional table using column data
  • Easy way to visualize data to see patterns and summarize data
  • Use ‘groupby( )’ to create a pivot table

5.5.1 Ex: Pivot Table using Titanic Data Set

  • We will again use the Titanic data set, but now we will create a pivot table

We can use groupby() to help create a pivot table - Group the data by ‘sex’ and ‘class’ to select survival. - Then use the aggregate function mean() to show within the table

titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447

5.5.2 Ex: Pivot Table of Crash Data

  • Create a pivot table that shows the Number of Persons Injured for every Crash Date per Zip Code
jan23.pivot_table('NUMBER OF PERSONS INJURED', index = 'ZIP CODE', columns = 'CRASH DATE')
CRASH DATE 01/01/2023 01/02/2023 01/03/2023 01/04/2023 01/05/2023 01/06/2023 01/07/2023 01/08/2023 01/09/2023 01/10/2023 ... 01/22/2023 01/23/2023 01/24/2023 01/25/2023 01/26/2023 01/27/2023 01/28/2023 01/29/2023 01/30/2023 01/31/2023
ZIP CODE
10001.0 0.0 NaN 0.000000 NaN 0.0 NaN NaN 0.0 0.0 NaN ... NaN NaN 0.0 0.500000 0.0 0.0 1.00 1.0 0.500000 0.5
10002.0 0.0 0.5 0.666667 0.0 NaN NaN NaN 0.0 NaN 0.0 ... 0.5 NaN 0.0 1.333333 0.0 0.0 0.75 0.0 0.333333 1.0
10003.0 0.5 0.0 1.000000 1.0 NaN NaN 0.5 0.0 0.0 NaN ... 0.0 NaN NaN NaN 0.0 0.0 0.50 0.0 0.000000 1.0
10004.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 ... NaN NaN NaN NaN NaN NaN 0.00 NaN NaN NaN
10005.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.0 NaN NaN NaN 0.0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11436.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 0.000000 NaN NaN 0.00 NaN 0.000000 NaN
11691.0 0.0 0.5 1.000000 1.0 1.0 NaN NaN 0.0 NaN 0.0 ... 0.5 1.0 0.0 0.000000 0.0 0.0 0.50 NaN 0.000000 0.0
11692.0 NaN NaN 1.000000 1.0 0.0 1.0 NaN NaN NaN NaN ... NaN NaN NaN NaN 0.0 NaN NaN NaN NaN 0.0
11693.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN 0.0 0.000000 NaN NaN NaN NaN NaN NaN
11694.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN NaN NaN ... 2.0 1.0 NaN 1.000000 0.0 NaN NaN NaN NaN NaN

180 rows × 31 columns

5.6 Vectorized String Operations

  • Must use the ‘str’ attribute of a Pandas Series and Index objects to access operations

  • Some examples of operations:

    • len()
    • lower()
    • translate()
    • slower()
    • ljust()
    • upper()
    • startswith()
    • isupper()
    • rjust()
    • find()
    • endswith()
    • isnumeric()
    • capitalize()
    • swapcase()
    • istitle()
    • rpartition()

5.6.1 Ex: Create a Panda Series of ‘BOROUGH’ names and swap cases.

bname = pd.Series(jan23['BOROUGH'])
bname.head(15).dropna() # original 
5      BROOKLYN
6        QUEENS
7     MANHATTAN
11       QUEENS
12     BROOKLYN
13       QUEENS
14       QUEENS
Name: BOROUGH, dtype: object
bname.str.capitalize().head(15).dropna()# to make first letter capital
5      Brooklyn
6        Queens
7     Manhattan
11       Queens
12     Brooklyn
13       Queens
14       Queens
Name: BOROUGH, dtype: object
bname.str.swapcase().head(15).dropna() # to make all lower case
5      brooklyn
6        queens
7     manhattan
11       queens
12     brooklyn
13       queens
14       queens
Name: BOROUGH, dtype: object
bname.str.len().head(15).dropna() # to return the length of the name and data type
5     8.0
6     6.0
7     9.0
11    6.0
12    8.0
13    6.0
14    6.0
Name: BOROUGH, dtype: float64
bname.str.startswith('B').head(15).dropna() # to see if starts with a letter B
5      True
6     False
7     False
11    False
12     True
13    False
14    False
Name: BOROUGH, dtype: object

5.7 Time Series

  • Time Stamps : Moments in time
    • Ex: July 4th, 2023 at 8:00 AM
    • Pandas provides the Timestamp type
  • Time Intervals: Reference a length of time with a beginning and end
    • Ex: The year of 2022
    • Pandas provides the Period type
  • Time Deltas/ Durations: Reference an exact length of time
    • Ex: 0.3 seconds
    • Pandas provides the Timedelta type

5.7.1 Can create a Timestamp object

  • combines ‘datetime’ and ‘dateutil’ to be used as a Series or DataFrame
date = pd.to_datetime("2nd of February, 2023")
date
type(date)
pandas._libs.tslibs.timestamps.Timestamp

5.7.2 Can create Series that has time indexed data

ind = pd.DatetimeIndex(['2022-07-04', '2022-08-04',
                          '2022-07-04', '2022-08-04'])
inddata = pd.Series([0,1,2,3], index = ind)
inddata
2022-07-04    0
2022-08-04    1
2022-07-04    2
2022-08-04    3
dtype: int64

5.7.3 Frequencies and Offsets

The following are the main codes avaiable:

- D Calendar day    
- B Business day
- W Weekly      
- M Month end   
- BM   Business month end
- Q Quarter end 
- BQ   Business quarter end
- A Year end    
- BA   Business year end
- H Hours   
- BH   Business hours
- T Minutes     
- S Seconds     
- L Milliseonds     
- U Microseconds        
- N nanoseconds 

5.7.4 Ex: TimeDelta

  • create a TimeDelta data type starting at 00:00:00 using frequency of 2 hours and 30 minutes (2H30T) over 5 periods.
pd.timedelta_range(0, periods = 5, freq = "2H30T")
TimedeltaIndex(['0 days 00:00:00', '0 days 02:30:00', '0 days 05:00:00',
                '0 days 07:30:00', '0 days 10:00:00'],
               dtype='timedelta64[ns]', freq='150T')

5.8 High Performance Pandas: eval()

  • eval() uses string expressions to compute operations using DataFrames
    • supports all arithmetic operations, comparison operators, bitwise operators (& and |), and the use of and and or in Boolean expressions
nrows, ncols = 10, 5 # creating 2 DF of 5 rows and 10 columns
rand = np.random.RandomState(7)
dfa, dfb = (pd.DataFrame(rand.rand(nrows, ncols))
            for i in range (2))

# to compute sum of dfa and dfb and place into one table

print("dfa", dfa)
print("dfb", dfb)
pd.eval('dfa + dfb')
dfa           0         1         2         3         4
0  0.076308  0.779919  0.438409  0.723465  0.977990
1  0.538496  0.501120  0.072051  0.268439  0.499883
2  0.679230  0.803739  0.380941  0.065936  0.288146
3  0.909594  0.213385  0.452124  0.931206  0.024899
4  0.600549  0.950130  0.230303  0.548490  0.909128
5  0.133169  0.523413  0.750410  0.669013  0.467753
6  0.204849  0.490766  0.372385  0.477401  0.365890
7  0.837918  0.768648  0.313995  0.572625  0.276049
8  0.452843  0.352978  0.657399  0.370351  0.459093
9  0.719324  0.412992  0.906423  0.180452  0.741119
dfb           0         1         2         3         4
0  0.422374  0.426454  0.634380  0.522906  0.414886
1  0.001427  0.092262  0.709394  0.524346  0.696160
2  0.955468  0.682914  0.053129  0.308853  0.592595
3  0.235120  0.964971  0.945048  0.848401  0.472324
4  0.841477  0.131111  0.308734  0.462996  0.741847
5  0.485825  0.136876  0.343537  0.324426  0.300419
6  0.165501  0.414902  0.448121  0.774900  0.796391
7  0.522390  0.460630  0.778214  0.887289  0.674919
8  0.800479  0.939111  0.040656  0.875672  0.276563
9  0.475764  0.796761  0.717242  0.147148  0.658748
0 1 2 3 4
0 0.498682 1.206372 1.072789 1.246371 1.392875
1 0.539923 0.593383 0.781446 0.792785 1.196043
2 1.634698 1.486653 0.434070 0.374789 0.880740
3 1.144714 1.178356 1.397172 1.779607 0.497223
4 1.442026 1.081240 0.539037 1.011486 1.650976
5 0.618995 0.660289 1.093946 0.993439 0.768172
6 0.370350 0.905668 0.820505 1.252302 1.162281
7 1.360308 1.229278 1.092208 1.459914 0.950968
8 1.253322 1.292090 0.698055 1.246023 0.735656
9 1.195089 1.209753 1.623666 0.327599 1.399867