First Glance
Consider the NYC Crash Data from January 2023.
import pandas as pd

jan23 = pd.read_csv("data/nyc_crashes_202301.csv")
jan23.head()
jan23.describe()
| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ZIP CODE | 4796 | 10893.521685 | 526.392428 | 10001 | 10457 | 11208 | 11239 | 11694 |
| LATITUDE | 6764 | 40.234660 | 4.430595 | 0.000000 | 40.663230 | 40.711880 | 40.775640 | 40.912827 |
| LONGITUDE | 6764 | -73.033403 | 8.041457 | -74.250150 | -73.964888 | -73.921230 | -73.865389 | 0.000000 |
| NUMBER OF PERSONS INJURED | 7244 | 0.502761 | 0.813641 | 0 | 0 | 0 | 1 | 21 |
| NUMBER OF PERSONS KILLED | 7244 | 0.002347 | 0.051164 | 0 | 0 | 0 | 0 | 2 |
| NUMBER OF PEDESTRIANS INJURED | 7244 | 0.116372 | 0.397927 | 0 | 0 | 0 | 0 | 19 |
| NUMBER OF PEDESTRIANS KILLED | 7244 | 0.000690 | 0.026265 | 0 | 0 | 0 | 0 | 1 |
| NUMBER OF CYCLIST INJURED | 7244 | 0.033269 | 0.180118 | 0 | 0 | 0 | 0 | 2 |
| NUMBER OF CYCLIST KILLED | 7244 | 0.000414 | 0.020348 | 0 | 0 | 0 | 0 | 1 |
| NUMBER OF MOTORIST INJURED | 7244 | 0.333103 | 0.749174 | 0 | 0 | 0 | 0 | 8 |
| NUMBER OF MOTORIST KILLED | 7244 | 0.001242 | 0.038951 | 0 | 0 | 0 | 0 | 2 |
| COLLISION_ID | 7244 | 4.599022e+06 | 2.365885e+03 | 4.594332e+06 | 4.597113e+06 | 4.599058e+06 | 4.600953e+06 | 4.605324e+06 |
Frequency tables for categorical variables.
jan23["BOROUGH" ].value_counts(dropna= False )
NaN 2448
BROOKLYN 1653
QUEENS 1316
BRONX 821
MANHATTAN 811
STATEN ISLAND 195
Name: BOROUGH, dtype: int64
Some tables are too long.
# jan23["CONTRIBUTING FACTOR VEHICLE 1"].value_counts(dropna=False)
with pd.option_context('display.max_rows', None):
    print(jan23["VEHICLE TYPE CODE 1"].value_counts(dropna=False))
Sedan 3478
Station Wagon/Sport Utility Vehicle 2505
Taxi 177
Pick-up Truck 162
NaN 136
Bus 135
Box Truck 110
Bike 84
Ambulance 55
Tractor Truck Diesel 53
E-Bike 50
Van 43
Motorcycle 28
E-Scooter 28
Dump 18
Garbage or Refuse 17
Flat Bed 17
Moped 17
PK 16
Convertible 11
Chassis Cab 9
Carry All 9
Tanker 7
Tow Truck / Wrecker 7
Tractor Truck Gasoline 7
LIMO 5
AMBULANCE 5
Motorscooter 5
Flat Rack 3
Motorbike 3
4 dr sedan 3
Concrete Mixer 2
Firetruck 2
MOTOR SCOO 2
STAK 2
SANITATION 1
Beverage Truck 1
SCHOOL BUS 1
Waste truc 1
Forklift 1
AMBU 1
UTILITY 1
Log 1
PAS 1
3-Door 1
FDNY AMBUL 1
Multi-Wheeled Vehicle 1
Scooter 1
Stake or Rack 1
Lift Boom 1
Gas Moped 1
Bulk Agriculture 1
Armored Truck 1
TRAILER 1
electric s 1
Motorized 1
VAn 1
ELECTRIC S 1
UTIL 1
FORK LIFT 1
STREET SWE 1
GARBAGE TR 1
Chevy 1
cart 1
Tow Truck 1
NYC FDNY # 1
Scooter ga 1
VAN TRUCK 1
Name: VEHICLE TYPE CODE 1, dtype: int64
Cross-tables
pd.crosstab(index=jan23["CONTRIBUTING FACTOR VEHICLE 1"],
            columns=jan23["BOROUGH"], dropna=False)
| CONTRIBUTING FACTOR VEHICLE 1 | BRONX | BROOKLYN | MANHATTAN | QUEENS | STATEN ISLAND |
| --- | --- | --- | --- | --- | --- |
| Accelerator Defective | 3 | 4 | 1 | 1 | 0 |
| Aggressive Driving/Road Rage | 9 | 12 | 6 | 12 | 0 |
| Alcohol Involvement | 7 | 32 | 9 | 30 | 8 |
| Animals Action | 0 | 0 | 0 | 0 | 1 |
| Backing Unsafely | 31 | 57 | 29 | 54 | 10 |
| Brakes Defective | 3 | 6 | 2 | 4 | 1 |
| Cell Phone (hand-Held) | 0 | 1 | 0 | 0 | 0 |
| Driver Inattention/Distraction | 177 | 415 | 211 | 307 | 45 |
| Driver Inexperience | 15 | 41 | 20 | 18 | 6 |
| Driverless/Runaway Vehicle | 1 | 1 | 2 | 3 | 0 |
| Drugs (illegal) | 2 | 5 | 1 | 0 | 0 |
| Failure to Keep Right | 1 | 2 | 0 | 2 | 0 |
| Failure to Yield Right-of-Way | 58 | 108 | 51 | 157 | 12 |
| Fatigued/Drowsy | 0 | 0 | 0 | 3 | 1 |
| Fell Asleep | 3 | 8 | 3 | 10 | 0 |
| Following Too Closely | 30 | 56 | 38 | 69 | 10 |
| Glare | 2 | 2 | 1 | 2 | 1 |
| Illnes | 1 | 1 | 0 | 5 | 1 |
| Lane Marking Improper/Inadequate | 1 | 0 | 0 | 0 | 0 |
| Lost Consciousness | 2 | 3 | 1 | 4 | 2 |
| Obstruction/Debris | 1 | 1 | 1 | 1 | 1 |
| Other Electronic Device | 0 | 0 | 1 | 1 | 0 |
| Other Lighting Defects | 1 | 0 | 0 | 0 | 0 |
| Other Vehicular | 40 | 33 | 28 | 29 | 3 |
| Outside Car Distraction | 5 | 5 | 0 | 3 | 0 |
| Oversized Vehicle | 5 | 7 | 2 | 3 | 1 |
| Passenger Distraction | 1 | 4 | 5 | 4 | 0 |
| Passing Too Closely | 41 | 70 | 36 | 44 | 1 |
| Passing or Lane Usage Improper | 26 | 68 | 33 | 80 | 9 |
| Pavement Defective | 0 | 2 | 0 | 0 | 0 |
| Pavement Slippery | 1 | 6 | 5 | 5 | 0 |
| Pedestrian/Bicyclist/Other Pedestrian Error/Confusion | 10 | 14 | 17 | 3 | 1 |
| Physical Disability | 1 | 0 | 0 | 1 | 1 |
| Reaction to Uninvolved Vehicle | 1 | 6 | 10 | 12 | 4 |
| Steering Failure | 5 | 4 | 0 | 7 | 1 |
| Texting | 0 | 0 | 0 | 0 | 1 |
| Tinted Windows | 0 | 1 | 1 | 0 | 0 |
| Tire Failure/Inadequate | 0 | 2 | 0 | 3 | 1 |
| Traffic Control Device Improper/Non-Working | 0 | 0 | 0 | 1 | 0 |
| Traffic Control Disregarded | 25 | 39 | 16 | 50 | 8 |
| Turning Improperly | 22 | 48 | 20 | 35 | 11 |
| Unsafe Lane Changing | 7 | 16 | 14 | 9 | 2 |
| Unsafe Speed | 29 | 32 | 22 | 56 | 8 |
| Unspecified | 229 | 508 | 209 | 273 | 38 |
| View Obstructed/Limited | 15 | 15 | 7 | 12 | 4 |
| Windshield Inadequate | 1 | 0 | 0 | 0 | 0 |
Some Cleaning
Questions from Dr. Douglas Bates:

- The CRASH_DATEs are all in the correct month and there are no missing values.
- There are no missing values in the CRASH_TIMEs, but there are 117 values of exactly 00:00:00. Is this a matter of bad luck when the clock strikes midnight?
- Over 1/3 of the ZIP_CODE and BOROUGH values are missing. There are the same number of missing values in these columns; do they always co-occur? If LATITUDE and LONGITUDE are available, can they be used to infer the ZIP_CODE?
- There are 178 unique non-missing ZIP_CODE values, as stated in the Jamboree description. ("Trust, but verify.") Is there really a zip code of 10000 in New York?
- There are 20 values of 0.0 for LATITUDE and LONGITUDE. These are obviously incorrect; should they be coded as missing?
- Is it redundant to keep the LOCATION in addition to LATITUDE and LONGITUDE?
- The COLLISION_ID is unique to each row and can be used as a key. The values are not consecutive; why not?
- The NUMBER_OF_... columns seem reasonable. A further consistency check is suggested in the Jamboree tasks.
- In the CONTRIBUTING_FACTOR_... columns, is Unspecified different from missing?
- The codes in the VEHICLE_TYPE_CODE_... columns are the usual hodge-podge of results from "free-form" data entry. Should unk, UNK, UNKNOWN, and Unknown be converted to missing?
- In contrast, the codes in the CONTRIBUTING_FACTOR_... columns appear to be standardized (not sure why Illnes isn't Illness).

Several of these checks can be scripted directly, as sketched below.
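A minimal sketch of such checks, using the column names as they appear in the raw file (with spaces rather than underscores):

# do missing ZIP CODE and BOROUGH values always co-occur?
print((jan23["ZIP CODE"].isna() == jan23["BOROUGH"].isna()).all())

# how many unique non-missing zip codes, and is 10000 among them?
zips = jan23["ZIP CODE"].dropna().unique()
print(len(zips), (zips == 10000).any())

# how many LATITUDE/LONGITUDE values are exactly 0.0?
print((jan23["LATITUDE"] == 0.0).sum(), (jan23["LONGITUDE"] == 0.0).sum())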
with pd.option_context('display.max_rows', None):
    print(jan23["CRASH TIME"].value_counts())
0:00 116
15:00 75
18:00 72
17:00 69
10:00 65
8:00 61
8:30 59
13:00 58
7:00 58
14:00 56
19:00 55
16:00 54
23:00 54
12:00 54
9:00 53
20:00 53
17:30 51
14:30 51
10:30 51
12:30 51
11:00 49
9:30 42
15:30 40
16:30 39
18:30 38
21:00 38
6:00 36
11:30 35
20:30 35
6:30 33
4:00 32
1:00 31
5:00 31
22:00 31
13:30 30
23:30 27
22:30 27
6:50 27
21:30 26
18:20 26
15:20 25
17:20 25
15:15 24
18:45 24
14:50 23
7:30 23
15:45 23
1:30 23
14:40 23
8:20 23
19:30 23
9:45 22
13:20 22
3:00 22
16:20 22
7:50 22
18:40 21
21:45 21
16:45 21
17:45 21
10:50 21
16:10 21
8:40 20
10:40 20
8:15 20
20:50 19
1:20 19
13:40 19
18:35 19
11:20 19
17:40 19
6:20 19
9:40 19
2:00 19
18:50 18
18:10 18
9:20 18
9:15 18
20:15 18
12:50 18
16:35 18
0:30 18
12:20 18
14:45 18
22:40 18
12:15 18
16:40 18
12:45 18
19:40 18
10:45 17
17:50 17
19:50 17
19:20 17
8:45 17
15:50 17
17:55 17
10:20 17
12:10 17
3:30 16
13:45 16
17:35 16
7:45 16
11:40 16
15:10 16
16:15 16
11:45 16
14:15 16
15:40 16
13:15 15
9:50 15
17:15 15
1:45 15
6:45 15
19:10 15
12:40 15
18:05 15
22:20 14
16:50 14
17:25 14
21:40 14
20:40 14
5:30 14
8:50 14
11:15 14
22:45 14
12:25 14
11:50 14
16:55 14
11:10 14
14:20 13
14:35 13
8:55 13
19:45 13
17:10 13
11:55 13
8:25 13
4:30 13
7:20 13
14:55 13
6:40 13
23:40 12
6:55 12
19:15 12
6:05 12
11:05 12
10:15 12
7:05 12
8:35 12
7:15 12
13:10 12
15:35 12
22:15 12
23:45 12
5:45 12
19:05 12
21:50 11
7:10 11
16:25 11
9:25 11
14:10 11
19:55 11
14:05 11
17:05 11
5:35 11
22:10 11
20:10 11
10:25 11
13:25 11
15:05 10
20:45 10
6:15 10
19:35 10
23:15 10
3:50 10
20:35 10
0:40 10
5:10 10
18:15 10
23:20 10
21:15 10
2:20 10
7:25 10
9:05 9
4:50 9
23:50 9
18:29 9
1:50 9
20:20 9
9:10 9
13:05 9
2:30 9
19:25 9
16:05 9
11:04 9
13:50 9
20:25 9
22:50 9
7:40 9
15:25 9
0:20 9
16:49 9
10:35 9
18:25 9
1:10 8
22:41 8
21:20 8
0:55 8
3:45 8
0:10 8
12:35 8
3:10 8
1:55 8
23:05 8
8:17 8
12:05 8
5:40 8
18:55 8
8:10 8
8:05 8
17:16 8
5:50 8
10:55 8
23:35 7
18:08 7
18:51 7
11:34 7
21:10 7
6:10 7
2:40 7
22:25 7
10:05 7
17:54 7
3:20 7
16:48 7
18:34 7
5:20 7
15:55 7
4:20 7
20:05 7
1:15 7
10:28 7
4:40 7
12:55 7
18:18 7
1:05 7
23:25 7
15:43 7
4:15 7
3:15 7
7:35 7
22:35 7
21:35 7
19:19 6
14:51 6
11:25 6
10:10 6
17:13 6
20:37 6
20:23 6
14:58 6
17:08 6
19:49 6
17:39 6
21:08 6
23:13 6
22:55 6
12:08 6
10:19 6
23:23 6
7:48 6
16:28 6
14:22 6
18:22 6
8:44 6
18:58 6
16:33 6
6:54 6
7:55 6
7:06 6
8:08 6
2:15 6
21:04 6
21:25 6
19:18 6
5:55 6
3:25 6
5:05 6
0:45 6
20:55 6
17:29 6
15:33 6
2:35 6
0:50 6
14:54 6
15:57 6
13:24 6
13:36 6
18:28 6
2:50 6
8:14 6
6:25 6
10:37 6
21:01 5
19:51 5
0:06 5
17:11 5
3:40 5
16:37 5
18:16 5
14:57 5
20:14 5
19:57 5
1:25 5
7:19 5
18:06 5
13:34 5
13:52 5
20:59 5
8:28 5
19:36 5
13:28 5
14:11 5
12:46 5
17:44 5
2:27 5
16:38 5
17:46 5
8:58 5
19:26 5
15:26 5
22:18 5
11:56 5
14:25 5
11:32 5
18:48 5
19:11 5
8:47 5
13:04 5
18:07 5
20:24 5
9:35 5
21:58 5
23:06 5
8:16 5
19:52 5
12:51 5
17:28 5
4:37 5
18:04 5
17:26 5
13:48 5
13:55 5
10:54 5
18:52 5
17:58 5
21:06 5
19:21 5
16:16 5
14:28 5
17:12 5
18:37 5
22:12 5
11:35 5
18:44 5
14:44 5
1:40 5
13:07 5
18:24 5
0:25 5
12:48 5
0:09 5
21:05 5
13:35 5
0:35 5
21:55 5
9:22 5
20:48 5
21:29 5
21:07 5
11:17 5
16:58 4
13:56 4
6:37 4
14:21 4
7:41 4
20:18 4
5:44 4
6:49 4
16:02 4
20:49 4
14:39 4
5:12 4
22:05 4
13:09 4
8:43 4
4:10 4
12:54 4
19:28 4
17:14 4
5:38 4
1:21 4
15:21 4
0:18 4
0:15 4
18:57 4
21:03 4
19:06 4
17:48 4
23:38 4
12:18 4
6:53 4
21:23 4
19:59 4
16:31 4
10:04 4
13:54 4
8:02 4
6:52 4
16:51 4
23:33 4
0:42 4
17:17 4
9:44 4
19:12 4
19:13 4
22:17 4
16:27 4
15:48 4
10:48 4
8:52 4
7:18 4
17:52 4
7:08 4
15:04 4
9:59 4
14:26 4
20:17 4
17:57 4
6:38 4
20:32 4
23:19 4
18:14 4
11:24 4
19:47 4
12:03 4
10:18 4
14:06 4
14:38 4
19:16 4
3:35 4
4:25 4
16:13 4
8:12 4
14:53 4
21:13 4
6:58 4
14:56 4
8:36 4
18:26 4
13:53 4
11:23 4
22:14 4
15:41 4
9:41 4
13:57 4
13:02 4
15:16 4
15:13 4
10:11 4
4:05 4
20:06 4
11:42 4
14:37 4
5:07 4
0:05 4
10:56 4
8:21 4
21:47 4
9:55 4
16:04 4
17:36 4
22:58 4
8:34 4
8:46 4
16:14 4
15:58 4
15:42 4
6:27 4
19:23 4
16:44 4
8:59 4
23:10 4
12:37 4
18:43 4
21:16 4
5:15 4
22:23 4
11:18 4
17:49 4
4:55 4
15:08 4
10:38 4
9:28 4
13:18 4
20:31 4
16:47 4
13:43 4
9:11 4
9:57 3
12:28 3
12:07 3
2:45 3
3:17 3
3:02 3
0:13 3
12:47 3
12:29 3
23:39 3
23:48 3
12:33 3
18:38 3
20:41 3
16:12 3
16:43 3
14:14 3
16:32 3
23:09 3
21:44 3
18:09 3
22:39 3
17:51 3
6:59 3
1:18 3
7:42 3
23:54 3
17:34 3
11:19 3
5:27 3
8:41 3
23:55 3
23:07 3
21:51 3
14:24 3
9:03 3
23:01 3
7:39 3
22:13 3
22:46 3
3:05 3
18:17 3
19:46 3
7:51 3
15:52 3
9:07 3
20:51 3
11:38 3
14:33 3
14:02 3
8:48 3
8:57 3
15:22 3
18:56 3
18:47 3
15:44 3
21:21 3
4:45 3
17:06 3
16:59 3
17:21 3
17:42 3
20:44 3
18:12 3
12:01 3
17:43 3
18:02 3
7:52 3
22:53 3
10:39 3
14:01 3
19:33 3
17:41 3
20:42 3
12:34 3
6:35 3
2:55 3
20:11 3
1:42 3
17:47 3
12:38 3
23:57 3
18:42 3
16:24 3
4:34 3
9:12 3
0:19 3
17:18 3
2:10 3
2:07 3
0:47 3
22:08 3
0:57 3
21:09 3
7:38 3
14:03 3
4:36 3
20:39 3
9:54 3
13:44 3
19:04 3
14:16 3
11:29 3
2:21 3
15:12 3
13:46 3
1:29 3
11:53 3
14:27 3
18:03 3
20:52 3
12:11 3
18:41 3
7:26 3
10:33 3
21:33 3
8:13 3
19:09 3
0:11 3
1:35 3
14:29 3
17:38 3
6:07 3
19:03 3
3:23 3
0:01 3
16:26 3
22:52 3
19:53 3
11:51 3
17:31 3
21:54 3
17:07 3
3:11 3
8:42 3
9:26 3
9:42 3
16:21 3
17:22 3
13:37 3
9:53 3
20:07 3
17:33 3
23:31 3
19:32 3
0:04 3
18:32 3
19:48 3
22:11 3
20:26 3
7:12 3
16:36 3
16:46 3
8:49 3
11:13 3
15:32 3
14:12 3
21:11 3
12:39 3
9:51 3
1:34 3
18:19 3
14:04 3
17:59 3
20:12 3
9:18 3
10:17 3
21:38 3
6:06 3
3:29 3
21:02 3
22:36 3
4:35 3
10:57 3
22:24 3
19:38 3
9:29 3
9:47 3
8:24 3
11:16 3
17:53 3
13:31 3
18:27 3
17:02 3
13:14 3
10:36 3
16:41 3
3:16 3
19:54 3
12:59 3
2:26 3
9:01 3
21:42 2
9:06 2
11:52 2
0:07 2
11:49 2
18:13 2
17:24 2
12:04 2
5:59 2
15:51 2
11:48 2
17:19 2
3:37 2
19:31 2
7:01 2
8:38 2
4:51 2
21:56 2
15:19 2
19:56 2
21:22 2
18:39 2
20:38 2
15:11 2
14:31 2
0:24 2
3:53 2
9:04 2
20:57 2
20:16 2
15:47 2
20:28 2
7:09 2
9:49 2
20:29 2
16:22 2
14:47 2
9:17 2
9:19 2
11:33 2
17:37 2
18:23 2
8:27 2
21:52 2
5:49 2
1:51 2
14:13 2
20:09 2
11:44 2
12:23 2
9:08 2
22:51 2
10:32 2
13:49 2
16:08 2
16:17 2
6:46 2
22:38 2
12:16 2
9:31 2
19:17 2
7:56 2
12:52 2
2:53 2
14:36 2
22:27 2
15:49 2
13:27 2
19:58 2
15:29 2
10:34 2
12:53 2
12:13 2
18:53 2
5:25 2
12:14 2
14:52 2
10:46 2
4:57 2
15:54 2
23:14 2
14:59 2
1:44 2
1:47 2
15:38 2
15:06 2
6:31 2
23:12 2
19:14 2
7:44 2
10:12 2
11:28 2
13:59 2
0:12 2
23:18 2
15:39 2
21:27 2
23:28 2
4:19 2
7:04 2
21:41 2
20:13 2
4:28 2
23:24 2
3:55 2
19:22 2
7:46 2
14:17 2
21:26 2
16:42 2
12:12 2
9:43 2
2:52 2
17:23 2
16:18 2
4:42 2
8:06 2
6:19 2
15:01 2
15:34 2
23:44 2
5:47 2
13:17 2
15:28 2
18:36 2
10:41 2
10:24 2
17:27 2
19:08 2
1:13 2
21:32 2
2:29 2
1:52 2
4:16 2
1:24 2
17:04 2
16:53 2
20:47 2
21:59 2
20:22 2
3:48 2
0:34 2
2:41 2
12:44 2
3:14 2
18:46 2
9:39 2
6:21 2
14:19 2
4:54 2
14:42 2
0:02 2
10:51 2
3:07 2
5:42 2
3:04 2
0:26 2
2:11 2
16:01 2
20:53 2
9:23 2
8:04 2
10:53 2
16:29 2
15:03 2
22:47 2
0:17 2
7:17 2
7:34 2
10:26 2
18:11 2
18:21 2
21:18 2
10:01 2
8:29 2
15:36 2
23:11 2
2:25 2
11:57 2
2:42 2
17:56 2
20:02 2
16:56 2
6:18 2
22:49 2
6:44 2
0:49 2
13:13 2
17:09 2
23:59 2
2:51 2
17:32 2
8:33 2
13:41 2
0:16 2
23:21 2
14:48 2
9:46 2
10:22 2
12:41 2
13:12 2
19:41 2
4:43 2
4:53 2
3:12 2
0:43 2
21:19 2
16:07 2
13:06 2
21:37 2
20:36 2
14:09 2
14:23 2
23:02 2
18:54 2
23:42 2
13:26 2
5:01 2
23:08 2
0:32 2
16:03 2
1:32 2
23:41 2
8:09 2
3:54 2
11:47 2
8:26 2
22:04 2
10:16 2
23:52 2
15:23 2
16:34 2
2:08 2
8:01 2
2:36 2
5:36 2
4:32 2
0:27 2
20:54 2
15:18 2
7:07 2
22:01 2
22:26 2
18:49 2
6:36 2
20:04 2
9:32 2
22:16 2
1:17 2
12:57 2
15:02 2
1:01 2
2:05 2
1:38 2
7:03 2
16:11 2
0:38 2
13:22 2
10:07 2
13:01 2
0:08 2
5:09 2
13:38 2
23:16 2
11:09 1
2:56 1
19:43 1
1:36 1
7:13 1
16:09 1
13:21 1
3:57 1
12:36 1
22:42 1
22:03 1
3:56 1
6:17 1
11:07 1
4:46 1
13:32 1
1:16 1
2:22 1
10:42 1
4:07 1
23:51 1
9:36 1
2:34 1
20:19 1
16:52 1
11:46 1
4:31 1
10:23 1
18:33 1
7:58 1
10:29 1
3:46 1
1:11 1
1:22 1
1:09 1
22:33 1
5:19 1
11:36 1
9:34 1
12:17 1
3:59 1
23:53 1
2:04 1
13:39 1
4:47 1
10:02 1
10:03 1
6:43 1
0:23 1
21:53 1
0:03 1
2:54 1
2:18 1
6:33 1
2:38 1
0:58 1
6:08 1
14:32 1
22:07 1
13:47 1
19:37 1
5:37 1
1:59 1
6:14 1
22:34 1
5:22 1
20:58 1
1:19 1
3:08 1
0:41 1
18:01 1
9:16 1
10:21 1
7:29 1
13:16 1
1:12 1
18:59 1
2:59 1
1:57 1
11:03 1
5:34 1
5:33 1
6:24 1
0:37 1
1:49 1
1:56 1
2:23 1
11:22 1
13:19 1
4:48 1
2:17 1
0:28 1
19:29 1
7:37 1
19:02 1
2:47 1
4:39 1
4:27 1
0:53 1
4:09 1
2:48 1
3:47 1
3:36 1
11:43 1
3:09 1
4:49 1
1:08 1
6:47 1
14:49 1
0:33 1
4:17 1
16:54 1
6:09 1
12:42 1
22:29 1
23:22 1
22:31 1
0:31 1
5:11 1
5:32 1
5:39 1
7:31 1
3:28 1
8:32 1
19:34 1
5:41 1
10:13 1
12:27 1
1:26 1
6:23 1
1:02 1
22:56 1
19:44 1
22:54 1
7:24 1
6:29 1
12:24 1
8:31 1
3:26 1
22:37 1
9:24 1
23:46 1
23:27 1
1:31 1
3:52 1
7:54 1
8:51 1
9:27 1
1:14 1
5:24 1
22:09 1
3:43 1
10:59 1
5:02 1
3:19 1
6:51 1
1:58 1
5:46 1
1:37 1
15:07 1
13:33 1
12:06 1
16:19 1
11:59 1
6:03 1
22:02 1
11:08 1
20:01 1
6:04 1
15:14 1
23:32 1
21:36 1
7:57 1
19:01 1
10:14 1
12:56 1
15:17 1
10:43 1
16:39 1
3:03 1
11:39 1
12:49 1
2:28 1
9:58 1
5:29 1
11:58 1
8:18 1
11:12 1
22:59 1
6:41 1
5:03 1
7:33 1
22:21 1
6:56 1
8:56 1
17:03 1
1:03 1
7:59 1
7:43 1
15:31 1
12:19 1
15:24 1
7:49 1
6:16 1
20:43 1
20:21 1
10:06 1
6:22 1
14:43 1
15:46 1
11:11 1
7:23 1
15:53 1
5:28 1
2:44 1
10:08 1
9:33 1
19:07 1
20:56 1
23:47 1
5:21 1
7:22 1
9:02 1
3:44 1
13:29 1
4:58 1
2:33 1
2:13 1
5:52 1
11:37 1
17:01 1
23:17 1
2:58 1
6:13 1
6:26 1
7:28 1
3:24 1
11:06 1
19:27 1
4:08 1
10:09 1
13:08 1
21:14 1
0:56 1
12:32 1
12:02 1
5:06 1
15:27 1
0:48 1
11:54 1
3:38 1
4:23 1
7:53 1
14:46 1
6:01 1
21:12 1
8:37 1
2:02 1
21:43 1
3:42 1
6:48 1
23:49 1
22:48 1
1:39 1
0:44 1
22:43 1
20:08 1
5:56 1
8:53 1
13:58 1
4:06 1
2:03 1
5:26 1
11:14 1
8:03 1
0:36 1
2:14 1
23:56 1
6:42 1
8:23 1
14:18 1
Name: CRASH TIME, dtype: int64
For example, here are some cleaning steps:
import numpy as np

# treat "Unspecified" contributing factors as missing
for i in range(1, 6):
    col = f"CONTRIBUTING FACTOR VEHICLE {i}"
    jan23[col] = jan23[col].replace("Unspecified", np.nan)

# coordinates of exactly 0.0 are clearly invalid; recode them as missing
jan23["LATITUDE"] = jan23["LATITUDE"].replace(0.0, np.nan)
jan23["LONGITUDE"] = jan23["LONGITUDE"].replace(0.0, np.nan)
jan23.describe()
| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ZIP CODE | 4796 | 10893.521685 | 526.392428 | 10001 | 10457 | 11208 | 11239 | 11694 |
| LATITUDE | 6683 | 40.722317 | 0.081602 | 40.504658 | 40.665350 | 40.713196 | 40.777754 | 40.912827 |
| LONGITUDE | 6683 | -73.918590 | 0.085654 | -74.250150 | -73.965530 | -73.922740 | -73.867714 | -73.702095 |
| NUMBER OF PERSONS INJURED | 7244 | 0.502761 | 0.813641 | 0 | 0 | 0 | 1 | 21 |
| NUMBER OF PERSONS KILLED | 7244 | 0.002347 | 0.051164 | 0 | 0 | 0 | 0 | 2 |
| NUMBER OF PEDESTRIANS INJURED | 7244 | 0.116372 | 0.397927 | 0 | 0 | 0 | 0 | 19 |
| NUMBER OF PEDESTRIANS KILLED | 7244 | 0.000690 | 0.026265 | 0 | 0 | 0 | 0 | 1 |
| NUMBER OF CYCLIST INJURED | 7244 | 0.033269 | 0.180118 | 0 | 0 | 0 | 0 | 2 |
| NUMBER OF CYCLIST KILLED | 7244 | 0.000414 | 0.020348 | 0 | 0 | 0 | 0 | 1 |
| NUMBER OF MOTORIST INJURED | 7244 | 0.333103 | 0.749174 | 0 | 0 | 0 | 0 | 8 |
| NUMBER OF MOTORIST KILLED | 7244 | 0.001242 | 0.038951 | 0 | 0 | 0 | 0 | 2 |
| COLLISION_ID | 7244 | 4.599022e+06 | 2.365885e+03 | 4.594332e+06 | 4.597113e+06 | 4.599058e+06 | 4.600953e+06 | 4.605324e+06 |
By the data dictionary, OFF STREET NAME is the street address of the collision site. Some records have an OFF STREET NAME but missing LATITUDE and LONGITUDE. The missing coordinates could be filled in by geocoding the street address with a geocoding package.
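The text does not name a specific package, so as one illustration (an assumption, not the author's choice), the geopy package can geocode an address through the free Nominatim service. A minimal sketch; the sample address is hypothetical:

from geopy.geocoders import Nominatim

# Nominatim (OpenStreetMap) requires a descriptive user_agent string
geolocator = Nominatim(user_agent="nyc_crash_cleaning")

# a hypothetical OFF STREET NAME value; real use would loop over the
# records that have an address but no coordinates
location = geolocator.geocode("100 MAIN STREET, BROOKLYN, NY")
if location is not None:
    print(location.latitude, location.longitude)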
Filling the Missing Zip Codes by Reverse Geocoding
The package uszipcode is a powerful and easy-to-use programmable zip code database for Python. It provides information about 42,724 zip codes in the US, with data crawled from <data.census.gov>. See its documentation for details.
from uszipcode import SearchEngine

sr = SearchEngine()
sr.by_zipcode("10001")
SimpleZipcode(zipcode='10001', zipcode_type='STANDARD', major_city='New York', post_office_city='New York, NY', common_city_list=['New York'], county='New York County', state='NY', lat=40.75, lng=-74.0, timezone='America/New_York', radius_in_miles=0.9090909090909091, area_code_list='718,917,347,646', population=21102, population_density=33959.0, land_area_in_sqmi=0.62, water_area_in_sqmi=0.0, housing_units=12476, occupied_housing_units=11031, median_home_value=650200, median_household_income=81671, bounds_west=-74.008621, bounds_east=-73.984076, bounds_north=40.759731, bounds_south=40.743451)
We can use uszipcode to reverse geocode a point from its coordinates. The returned zip code can then be used to fill in a missing zip code.
z = sr.by_coordinates(40.769993, -73.915825, radius=1)
z[0].zipcode
z[0].median_home_value
Once we have found the zip code, we can find its borough. See the complete NYC zip code list.
def nyczip2borough(zipcode):
    # map a single NYC zip code to its borough
    nzip = int(zipcode)
    if 10001 <= nzip <= 10282:
        return "MANHATTAN"
    elif 10301 <= nzip <= 10314:
        return "STATEN ISLAND"
    elif 10451 <= nzip <= 10475:
        return "BRONX"
    elif 11004 <= nzip <= 11109:
        return "QUEENS"
    elif 11351 <= nzip <= 11697:
        return "QUEENS"
    elif 11201 <= nzip <= 11256:
        return "BROOKLYN"
    else:
        return np.nan
Let’s try it out:
nyczip2borough(z[0].zipcode)
Here is a vectorized version:
import numpy as np
import pandas as pd
from typing import Union

def nyczip2borough(zips: Union[np.ndarray, pd.Series]) -> np.ndarray:
    # work on the underlying numpy array so the comparisons broadcast
    vals = zips.values if isinstance(zips, pd.Series) else zips
    condlist = [
        (vals >= 10001) & (vals <= 10282),
        (vals >= 10301) & (vals <= 10314),
        (vals >= 10451) & (vals <= 10475),
        (vals >= 11004) & (vals <= 11109),
        (vals >= 11351) & (vals <= 11697),
        (vals >= 11201) & (vals <= 11256),
    ]
    choicelist = [
        "MANHATTAN",
        "STATEN ISLAND",
        "BRONX",
        "QUEENS",
        "QUEENS",
        "BROOKLYN",
    ]
    # with string choices, np.select promotes the result to a string dtype,
    # so unmatched zip codes come back as the string 'nan' rather than np.nan
    return np.select(condlist, choicelist, default=np.nan)
Try it out:

nyczip2borough(jan23["ZIP CODE"].dropna().head(10))
array(['BROOKLYN', 'QUEENS', 'MANHATTAN', 'QUEENS', 'BROOKLYN', 'QUEENS',
'QUEENS', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND'], dtype='<U32')
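To actually fill in the missing zip codes, we can reverse geocode the rows that have coordinates but no ZIP CODE. A minimal sketch, assuming the sr search engine created above (it issues one lookup per row, so it is slow on large data):

# fill missing ZIP CODE values by reverse geocoding the coordinates
def fill_zip(row):
    if pd.isna(row["ZIP CODE"]) and pd.notna(row["LATITUDE"]):
        hits = sr.by_coordinates(row["LATITUDE"], row["LONGITUDE"], radius=5)
        if hits:  # take the nearest zip code returned
            return float(hits[0].zipcode)
    return row["ZIP CODE"]

jan23["ZIP CODE"] = jan23.apply(fill_zip, axis=1)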
The uszipcode package provides zip-code-level databases from the US Census. Such information could be merged with the NYC crash data for further analysis.
from uszipcode import SearchEngine, SimpleZipcode
import os

# set the default database file location
db_file = os.path.join(os.getenv("HOME"), "simple_db.sqlite")
search = SearchEngine(db_file_path=db_file)
search.by_zipcode("10030")
SimpleZipcode(zipcode='10030', zipcode_type='STANDARD', major_city='New York', post_office_city='New York, NY', common_city_list=['New York'], county='New York County', state='NY', lat=40.82, lng=-73.94, timezone='America/New_York', radius_in_miles=0.5681818181818182, area_code_list='212,646,917', population=26999, population_density=96790.0, land_area_in_sqmi=0.28, water_area_in_sqmi=0.0, housing_units=12976, occupied_housing_units=11395, median_home_value=509000, median_household_income=31925, bounds_west=-73.948677, bounds_east=-73.936232, bounds_north=40.824032, bounds_south=40.812791)
The SQLite database of US zip codes is stored in $HOME/.uszipcode. It can be imported as a pandas DataFrame.
import sqlite3
import pandas as pd

# change to your own path after installing uszipcode
con = sqlite3.connect(db_file)
zipdf = pd.read_sql_query("SELECT * from simple_zipcode", con)
zipdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42724 entries, 0 to 42723
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 zipcode 42724 non-null object
1 zipcode_type 42724 non-null object
2 major_city 42724 non-null object
3 post_office_city 33104 non-null object
4 common_city_list 41877 non-null object
5 county 42724 non-null object
6 state 42724 non-null object
7 lat 42724 non-null float64
8 lng 42724 non-null float64
9 timezone 42724 non-null object
10 radius_in_miles 33104 non-null float64
11 area_code_list 42724 non-null object
12 population 31448 non-null float64
13 population_density 31448 non-null float64
14 land_area_in_sqmi 31448 non-null float64
15 water_area_in_sqmi 31448 non-null float64
16 housing_units 31448 non-null float64
17 occupied_housing_units 31448 non-null float64
18 median_home_value 31448 non-null float64
19 median_household_income 31448 non-null float64
20 bounds_west 33104 non-null float64
21 bounds_east 33104 non-null float64
22 bounds_north 33104 non-null float64
23 bounds_south 33104 non-null float64
dtypes: float64(15), object(9)
memory usage: 7.8+ MB
The zip code dataframe can be merged with the crash dataframe.
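A minimal sketch of such a merge (the prediction section below does the same thing restricted to NY zip codes). The key dtypes must first agree, since ZIP CODE is read as float while zipcode is stored as text:

# align the key dtypes, then attach zip-level covariates to each crash
zipdf["zipcode"] = zipdf["zipcode"].astype(float)
merged = pd.merge(jan23, zipdf, how="left",
                  left_on="ZIP CODE", right_on="zipcode")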
Map the Crash Sites
We can do this with the package gmplot. See instructions from this tutorial.
import gmplot

# prepare the geocodes, dropping rows where either coordinate is missing
# so that the latitude and longitude arrays stay aligned
coords = jan23[["LATITUDE", "LONGITUDE"]].dropna()
latitude = coords["LATITUDE"].values
longitude = coords["LONGITUDE"].values

# center of the map and zoom level
gmap = gmplot.GoogleMapPlotter(40.769737, -73.91244, 14)

# plot heatmap
gmap.heatmap(latitude, longitude)
gmap.scatter(latitude, longitude, c='r', marker=True)
# gmap.scatter(latitude, longitude, '#FF0000', size=50, marker=False)

# Your Google_API_Key
# gmap.apikey = "put your key here"

# save it to html
gmap.draw("nycrashmap.html")
Predicting Injuries
Let’s start by importing the cleaned data. We define INJURY as an indicator of personal injury and create an HOUR variable for the crash hour.
jan23 = pd.read_csv("data/nyc_crashes_202301_cleaned.csv")
jan23["INJURY"] = [1 if x > 0 else 0 for x in jan23["NUMBER OF PERSONS INJURED"]]
jan23["HOUR"] = jan23["CRASH TIME"].str.split(":").str[0].astype(int)
We import zip-code-level information from the package uszipcode and merge it in.
zipdf = pd.read_sql_query("SELECT * from simple_zipcode WHERE state = 'NY'", con)
zipdf["zipcode"] = zipdf["zipcode"].astype(float)
crashdf = pd.merge(jan23, zipdf, left_on="ZIP CODE", right_on="zipcode")
Let’s select a few columns that might be useful:
crashdf = crashdf[["INJURY", "HOUR", "BOROUGH", "LATITUDE", "LONGITUDE",
                   "zipcode", "population_density", "median_home_value",
                   "median_household_income"]]
crashdf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7240 entries, 0 to 7239
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INJURY 7240 non-null int64
1 HOUR 7240 non-null int64
2 BOROUGH 7219 non-null object
3 LATITUDE 7240 non-null float64
4 LONGITUDE 7240 non-null float64
5 zipcode 7240 non-null float64
6 population_density 7105 non-null float64
7 median_home_value 7105 non-null float64
8 median_household_income 7105 non-null float64
dtypes: float64(6), int64(2), object(1)
memory usage: 565.6+ KB
Preparation
Categorical variables need to be coded appropriately. Ordinal variables can be coded with OrdinalEncoder, and nominal variables with OneHotEncoder. We want to treat HOUR, BOROUGH, and zipcode as nominal categorical variables. One way to encode them is pd.get_dummies.
catVars = ["HOUR", "BOROUGH"]  # , "zipcode"]
crashdf = pd.get_dummies(crashdf, columns=catVars, prefix=catVars)
crashdf.shape
It is important to note that we get one dummy variable for each level of a categorical variable: 5 dummies for BOROUGH and 24 for HOUR. If all of these dummies were used in a regression model, the model matrix would not be of full rank; in other words, there would be perfect collinearity among the covariates, and the model fitting would fail. One remedy is to drop one reference level per variable (e.g., drop_first=True in pd.get_dummies), as illustrated below. For regularized regression models and many machine learning methods (e.g., decision trees, random forests, SVMs), this is less of an issue.
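A toy illustration of the reference-level remedy (the small data frame is made up):

# with drop_first=True, one level is dropped as the reference category,
# so the remaining dummies are linearly independent of the intercept
toy = pd.DataFrame({"BOROUGH": ["BRONX", "QUEENS", "BRONX", "MANHATTAN"]})
print(pd.get_dummies(toy, columns=["BOROUGH"], drop_first=True))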
We need to standardize the continuous covariates so that their impacts are comparable. Standardization also makes the numerical computation more stable: if one variable has a variance 100 times larger than the others, it may dominate the objective function and make it hard for the optimization algorithm to find the optimum. This preprocessing can be done with the sklearn.preprocessing module.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
conVars = ["population_density", "median_home_value", "median_household_income"]
crashdf[conVars] = scaler.fit_transform(crashdf[conVars])
crashdf.columns
Index(['INJURY', 'LATITUDE', 'LONGITUDE', 'zipcode', 'population_density',
'median_home_value', 'median_household_income', 'HOUR_0', 'HOUR_1',
'HOUR_2', 'HOUR_3', 'HOUR_4', 'HOUR_5', 'HOUR_6', 'HOUR_7', 'HOUR_8',
'HOUR_9', 'HOUR_10', 'HOUR_11', 'HOUR_12', 'HOUR_13', 'HOUR_14',
'HOUR_15', 'HOUR_16', 'HOUR_17', 'HOUR_18', 'HOUR_19', 'HOUR_20',
'HOUR_21', 'HOUR_22', 'HOUR_23', 'BOROUGH_BRONX', 'BOROUGH_BROOKLYN',
'BOROUGH_MANHATTAN', 'BOROUGH_QUEENS', 'BOROUGH_STATEN ISLAND'],
dtype='object')