First Glance
Consider the NYC Crash Data from January 2023.
import pandas as pd

jan23 = pd.read_csv("data/nyc_crashes_202301.csv")
jan23.head()
jan23.describe()
| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ZIP CODE | 4796 | 10893.521685 | 526.392428 | 10001 | 10457 | 11208 | 11239 | 11694 |
| LATITUDE | 6764 | 40.234660 | 4.430595 | 0.000000 | 40.663230 | 40.711880 | 40.775640 | 40.912827 |
| LONGITUDE | 6764 | -73.033403 | 8.041457 | -74.250150 | -73.964888 | -73.921230 | -73.865389 | 0.000000 |
| NUMBER OF PERSONS INJURED | 7244 | 0.502761 | 0.813641 | 0 | 0 | 0 | 1 | 21 |
| NUMBER OF PERSONS KILLED | 7244 | 0.002347 | 0.051164 | 0 | 0 | 0 | 0 | 2 |
| NUMBER OF PEDESTRIANS INJURED | 7244 | 0.116372 | 0.397927 | 0 | 0 | 0 | 0 | 19 |
| NUMBER OF PEDESTRIANS KILLED | 7244 | 0.000690 | 0.026265 | 0 | 0 | 0 | 0 | 1 |
| NUMBER OF CYCLIST INJURED | 7244 | 0.033269 | 0.180118 | 0 | 0 | 0 | 0 | 2 |
| NUMBER OF CYCLIST KILLED | 7244 | 0.000414 | 0.020348 | 0 | 0 | 0 | 0 | 1 |
| NUMBER OF MOTORIST INJURED | 7244 | 0.333103 | 0.749174 | 0 | 0 | 0 | 0 | 8 |
| NUMBER OF MOTORIST KILLED | 7244 | 0.001242 | 0.038951 | 0 | 0 | 0 | 0 | 2 |
| COLLISION_ID | 7244 | 4.599022e+06 | 2.365885e+03 | 4.594332e+06 | 4.597113e+06 | 4.599058e+06 | 4.600953e+06 | 4.605324e+06 |
Frequency tables for categorical variables.
jan23["BOROUGH" ].value_counts(dropna= False )
NaN 2448
BROOKLYN 1653
QUEENS 1316
BRONX 821
MANHATTAN 811
STATEN ISLAND 195
Name: BOROUGH, dtype: int64
Some tables are too long.
# jan23["CONTRIBUTING FACTOR VEHICLE 1"].value_counts(dropna=False)
with pd.option_context('display.max_rows', None):
    print(jan23["VEHICLE TYPE CODE 1"].value_counts(dropna=False))
Sedan 3478
Station Wagon/Sport Utility Vehicle 2505
Taxi 177
Pick-up Truck 162
NaN 136
Bus 135
Box Truck 110
Bike 84
Ambulance 55
Tractor Truck Diesel 53
E-Bike 50
Van 43
Motorcycle 28
E-Scooter 28
Dump 18
Garbage or Refuse 17
Flat Bed 17
Moped 17
PK 16
Convertible 11
Chassis Cab 9
Carry All 9
Tanker 7
Tow Truck / Wrecker 7
Tractor Truck Gasoline 7
LIMO 5
AMBULANCE 5
Motorscooter 5
Flat Rack 3
Motorbike 3
4 dr sedan 3
Concrete Mixer 2
Firetruck 2
MOTOR SCOO 2
STAK 2
SANITATION 1
Beverage Truck 1
SCHOOL BUS 1
Waste truc 1
Forklift 1
AMBU 1
UTILITY 1
Log 1
PAS 1
3-Door 1
FDNY AMBUL 1
Multi-Wheeled Vehicle 1
Scooter 1
Stake or Rack 1
Lift Boom 1
Gas Moped 1
Bulk Agriculture 1
Armored Truck 1
TRAILER 1
electric s 1
Motorized 1
VAn 1
ELECTRIC S 1
UTIL 1
FORK LIFT 1
STREET SWE 1
GARBAGE TR 1
Chevy 1
cart 1
Tow Truck 1
NYC FDNY # 1
Scooter ga 1
VAN TRUCK 1
Name: VEHICLE TYPE CODE 1, dtype: int64
Cross-tables
pd.crosstab(index=jan23["CONTRIBUTING FACTOR VEHICLE 1"],
            columns=jan23["BOROUGH"], dropna=False)
| CONTRIBUTING FACTOR VEHICLE 1 | BRONX | BROOKLYN | MANHATTAN | QUEENS | STATEN ISLAND |
| --- | --- | --- | --- | --- | --- |
| Accelerator Defective | 3 | 4 | 1 | 1 | 0 |
| Aggressive Driving/Road Rage | 9 | 12 | 6 | 12 | 0 |
| Alcohol Involvement | 7 | 32 | 9 | 30 | 8 |
| Animals Action | 0 | 0 | 0 | 0 | 1 |
| Backing Unsafely | 31 | 57 | 29 | 54 | 10 |
| Brakes Defective | 3 | 6 | 2 | 4 | 1 |
| Cell Phone (hand-Held) | 0 | 1 | 0 | 0 | 0 |
| Driver Inattention/Distraction | 177 | 415 | 211 | 307 | 45 |
| Driver Inexperience | 15 | 41 | 20 | 18 | 6 |
| Driverless/Runaway Vehicle | 1 | 1 | 2 | 3 | 0 |
| Drugs (illegal) | 2 | 5 | 1 | 0 | 0 |
| Failure to Keep Right | 1 | 2 | 0 | 2 | 0 |
| Failure to Yield Right-of-Way | 58 | 108 | 51 | 157 | 12 |
| Fatigued/Drowsy | 0 | 0 | 0 | 3 | 1 |
| Fell Asleep | 3 | 8 | 3 | 10 | 0 |
| Following Too Closely | 30 | 56 | 38 | 69 | 10 |
| Glare | 2 | 2 | 1 | 2 | 1 |
| Illnes | 1 | 1 | 0 | 5 | 1 |
| Lane Marking Improper/Inadequate | 1 | 0 | 0 | 0 | 0 |
| Lost Consciousness | 2 | 3 | 1 | 4 | 2 |
| Obstruction/Debris | 1 | 1 | 1 | 1 | 1 |
| Other Electronic Device | 0 | 0 | 1 | 1 | 0 |
| Other Lighting Defects | 1 | 0 | 0 | 0 | 0 |
| Other Vehicular | 40 | 33 | 28 | 29 | 3 |
| Outside Car Distraction | 5 | 5 | 0 | 3 | 0 |
| Oversized Vehicle | 5 | 7 | 2 | 3 | 1 |
| Passenger Distraction | 1 | 4 | 5 | 4 | 0 |
| Passing Too Closely | 41 | 70 | 36 | 44 | 1 |
| Passing or Lane Usage Improper | 26 | 68 | 33 | 80 | 9 |
| Pavement Defective | 0 | 2 | 0 | 0 | 0 |
| Pavement Slippery | 1 | 6 | 5 | 5 | 0 |
| Pedestrian/Bicyclist/Other Pedestrian Error/Confusion | 10 | 14 | 17 | 3 | 1 |
| Physical Disability | 1 | 0 | 0 | 1 | 1 |
| Reaction to Uninvolved Vehicle | 1 | 6 | 10 | 12 | 4 |
| Steering Failure | 5 | 4 | 0 | 7 | 1 |
| Texting | 0 | 0 | 0 | 0 | 1 |
| Tinted Windows | 0 | 1 | 1 | 0 | 0 |
| Tire Failure/Inadequate | 0 | 2 | 0 | 3 | 1 |
| Traffic Control Device Improper/Non-Working | 0 | 0 | 0 | 1 | 0 |
| Traffic Control Disregarded | 25 | 39 | 16 | 50 | 8 |
| Turning Improperly | 22 | 48 | 20 | 35 | 11 |
| Unsafe Lane Changing | 7 | 16 | 14 | 9 | 2 |
| Unsafe Speed | 29 | 32 | 22 | 56 | 8 |
| Unspecified | 229 | 508 | 209 | 273 | 38 |
| View Obstructed/Limited | 15 | 15 | 7 | 12 | 4 |
| Windshield Inadequate | 1 | 0 | 0 | 0 | 0 |
Some Cleaning
Questions from Dr. Douglas Bates:

- The CRASH_DATEs are all in the correct month and there are no missing values.
- There are no missing values in the CRASH_TIMEs, but there are 117 values of exactly 00:00:00. Is this a matter of bad luck when the clock strikes midnight?
- Over 1/3 of the ZIP_CODE and BOROUGH values are missing. There are the same number of missing values in these columns; do they always co-occur? If LATITUDE and LONGITUDE are available, can they be used to infer the ZIP_CODE?
- There are 178 unique non-missing ZIP_CODE values, as stated in the Jamboree description. ("Trust, but verify.") Is there really a zip code of 10000 in New York?
- There are 20 values of 0.0 for LATITUDE and LONGITUDE. These are obviously incorrect; should they be coded as missing?
- Is it redundant to keep the LOCATION in addition to LATITUDE and LONGITUDE?
- The COLLISION_ID is unique to each row and can be used as a key. The values are not consecutive; why not?
- The NUMBER_OF_... columns seem reasonable. A further consistency check is suggested in the Jamboree tasks.
- In the CONTRIBUTING_FACTOR_... columns, is Unspecified different from missing?
- The codes in the VEHICLE_TYPE_CODE_... columns are the usual hodge-podge of results from "free-form" data entry. Should unk, UNK, UNKNOWN, and Unknown be converted to missing?
- In contrast, the codes in the CONTRIBUTING_FACTOR_... columns appear to be standardized (not sure why Illnes isn't Illness).

Several of these checks can be scripted directly, as sketched below.
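A minimal sketch of such checks, using the column names as they appear in the raw file (with spaces rather than underscores):

# do missing ZIP CODE and BOROUGH values always co-occur?
print((jan23["ZIP CODE"].isna() == jan23["BOROUGH"].isna()).all())

# how many unique non-missing zip codes, and is 10000 among them?
zips = jan23["ZIP CODE"].dropna().unique()
print(len(zips), (zips == 10000).any())

# how many LATITUDE/LONGITUDE values are exactly 0.0?
print((jan23["LATITUDE"] == 0.0).sum(), (jan23["LONGITUDE"] == 0.0).sum())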
with pd.option_context('display.max_rows', None):
    print(jan23["CRASH TIME"].value_counts())
0:00 116
15:00 75
18:00 72
17:00 69
10:00 65
8:00 61
8:30 59
13:00 58
7:00 58
14:00 56
19:00 55
16:00 54
23:00 54
12:00 54
9:00 53
20:00 53
17:30 51
14:30 51
10:30 51
12:30 51
11:00 49
9:30 42
15:30 40
16:30 39
18:30 38
21:00 38
6:00 36
11:30 35
20:30 35
6:30 33
4:00 32
1:00 31
5:00 31
22:00 31
13:30 30
23:30 27
22:30 27
6:50 27
21:30 26
18:20 26
15:20 25
17:20 25
15:15 24
18:45 24
14:50 23
7:30 23
15:45 23
1:30 23
14:40 23
8:20 23
19:30 23
9:45 22
13:20 22
3:00 22
16:20 22
7:50 22
18:40 21
21:45 21
16:45 21
17:45 21
10:50 21
16:10 21
8:40 20
10:40 20
8:15 20
20:50 19
1:20 19
13:40 19
18:35 19
11:20 19
17:40 19
6:20 19
9:40 19
2:00 19
18:50 18
18:10 18
9:20 18
9:15 18
20:15 18
12:50 18
16:35 18
0:30 18
12:20 18
14:45 18
22:40 18
12:15 18
16:40 18
12:45 18
19:40 18
10:45 17
17:50 17
19:50 17
19:20 17
8:45 17
15:50 17
17:55 17
10:20 17
12:10 17
3:30 16
13:45 16
17:35 16
7:45 16
11:40 16
15:10 16
16:15 16
11:45 16
14:15 16
15:40 16
13:15 15
9:50 15
17:15 15
1:45 15
6:45 15
19:10 15
12:40 15
18:05 15
22:20 14
16:50 14
17:25 14
21:40 14
20:40 14
5:30 14
8:50 14
11:15 14
22:45 14
12:25 14
11:50 14
16:55 14
11:10 14
14:20 13
14:35 13
8:55 13
19:45 13
17:10 13
11:55 13
8:25 13
4:30 13
7:20 13
14:55 13
6:40 13
23:40 12
6:55 12
19:15 12
6:05 12
11:05 12
10:15 12
7:05 12
8:35 12
7:15 12
13:10 12
15:35 12
22:15 12
23:45 12
5:45 12
19:05 12
21:50 11
7:10 11
16:25 11
9:25 11
14:10 11
19:55 11
14:05 11
17:05 11
5:35 11
22:10 11
20:10 11
10:25 11
13:25 11
15:05 10
20:45 10
6:15 10
19:35 10
23:15 10
3:50 10
20:35 10
0:40 10
5:10 10
18:15 10
23:20 10
21:15 10
2:20 10
7:25 10
9:05 9
4:50 9
23:50 9
18:29 9
1:50 9
20:20 9
9:10 9
13:05 9
2:30 9
19:25 9
16:05 9
11:04 9
13:50 9
20:25 9
22:50 9
7:40 9
15:25 9
0:20 9
16:49 9
10:35 9
18:25 9
1:10 8
22:41 8
21:20 8
0:55 8
3:45 8
0:10 8
12:35 8
3:10 8
1:55 8
23:05 8
8:17 8
12:05 8
5:40 8
18:55 8
8:10 8
8:05 8
17:16 8
5:50 8
10:55 8
23:35 7
18:08 7
18:51 7
11:34 7
21:10 7
6:10 7
2:40 7
22:25 7
10:05 7
17:54 7
3:20 7
16:48 7
18:34 7
5:20 7
15:55 7
4:20 7
20:05 7
1:15 7
10:28 7
4:40 7
12:55 7
18:18 7
1:05 7
23:25 7
15:43 7
4:15 7
3:15 7
7:35 7
22:35 7
21:35 7
19:19 6
14:51 6
11:25 6
10:10 6
17:13 6
20:37 6
20:23 6
14:58 6
17:08 6
19:49 6
17:39 6
21:08 6
23:13 6
22:55 6
12:08 6
10:19 6
23:23 6
7:48 6
16:28 6
14:22 6
18:22 6
8:44 6
18:58 6
16:33 6
6:54 6
7:55 6
7:06 6
8:08 6
2:15 6
21:04 6
21:25 6
19:18 6
5:55 6
3:25 6
5:05 6
0:45 6
20:55 6
17:29 6
15:33 6
2:35 6
0:50 6
14:54 6
15:57 6
13:24 6
13:36 6
18:28 6
2:50 6
8:14 6
6:25 6
10:37 6
21:01 5
19:51 5
0:06 5
17:11 5
3:40 5
16:37 5
18:16 5
14:57 5
20:14 5
19:57 5
1:25 5
7:19 5
18:06 5
13:34 5
13:52 5
20:59 5
8:28 5
19:36 5
13:28 5
14:11 5
12:46 5
17:44 5
2:27 5
16:38 5
17:46 5
8:58 5
19:26 5
15:26 5
22:18 5
11:56 5
14:25 5
11:32 5
18:48 5
19:11 5
8:47 5
13:04 5
18:07 5
20:24 5
9:35 5
21:58 5
23:06 5
8:16 5
19:52 5
12:51 5
17:28 5
4:37 5
18:04 5
17:26 5
13:48 5
13:55 5
10:54 5
18:52 5
17:58 5
21:06 5
19:21 5
16:16 5
14:28 5
17:12 5
18:37 5
22:12 5
11:35 5
18:44 5
14:44 5
1:40 5
13:07 5
18:24 5
0:25 5
12:48 5
0:09 5
21:05 5
13:35 5
0:35 5
21:55 5
9:22 5
20:48 5
21:29 5
21:07 5
11:17 5
16:58 4
13:56 4
6:37 4
14:21 4
7:41 4
20:18 4
5:44 4
6:49 4
16:02 4
20:49 4
14:39 4
5:12 4
22:05 4
13:09 4
8:43 4
4:10 4
12:54 4
19:28 4
17:14 4
5:38 4
1:21 4
15:21 4
0:18 4
0:15 4
18:57 4
21:03 4
19:06 4
17:48 4
23:38 4
12:18 4
6:53 4
21:23 4
19:59 4
16:31 4
10:04 4
13:54 4
8:02 4
6:52 4
16:51 4
23:33 4
0:42 4
17:17 4
9:44 4
19:12 4
19:13 4
22:17 4
16:27 4
15:48 4
10:48 4
8:52 4
7:18 4
17:52 4
7:08 4
15:04 4
9:59 4
14:26 4
20:17 4
17:57 4
6:38 4
20:32 4
23:19 4
18:14 4
11:24 4
19:47 4
12:03 4
10:18 4
14:06 4
14:38 4
19:16 4
3:35 4
4:25 4
16:13 4
8:12 4
14:53 4
21:13 4
6:58 4
14:56 4
8:36 4
18:26 4
13:53 4
11:23 4
22:14 4
15:41 4
9:41 4
13:57 4
13:02 4
15:16 4
15:13 4
10:11 4
4:05 4
20:06 4
11:42 4
14:37 4
5:07 4
0:05 4
10:56 4
8:21 4
21:47 4
9:55 4
16:04 4
17:36 4
22:58 4
8:34 4
8:46 4
16:14 4
15:58 4
15:42 4
6:27 4
19:23 4
16:44 4
8:59 4
23:10 4
12:37 4
18:43 4
21:16 4
5:15 4
22:23 4
11:18 4
17:49 4
4:55 4
15:08 4
10:38 4
9:28 4
13:18 4
20:31 4
16:47 4
13:43 4
9:11 4
9:57 3
12:28 3
12:07 3
2:45 3
3:17 3
3:02 3
0:13 3
12:47 3
12:29 3
23:39 3
23:48 3
12:33 3
18:38 3
20:41 3
16:12 3
16:43 3
14:14 3
16:32 3
23:09 3
21:44 3
18:09 3
22:39 3
17:51 3
6:59 3
1:18 3
7:42 3
23:54 3
17:34 3
11:19 3
5:27 3
8:41 3
23:55 3
23:07 3
21:51 3
14:24 3
9:03 3
23:01 3
7:39 3
22:13 3
22:46 3
3:05 3
18:17 3
19:46 3
7:51 3
15:52 3
9:07 3
20:51 3
11:38 3
14:33 3
14:02 3
8:48 3
8:57 3
15:22 3
18:56 3
18:47 3
15:44 3
21:21 3
4:45 3
17:06 3
16:59 3
17:21 3
17:42 3
20:44 3
18:12 3
12:01 3
17:43 3
18:02 3
7:52 3
22:53 3
10:39 3
14:01 3
19:33 3
17:41 3
20:42 3
12:34 3
6:35 3
2:55 3
20:11 3
1:42 3
17:47 3
12:38 3
23:57 3
18:42 3
16:24 3
4:34 3
9:12 3
0:19 3
17:18 3
2:10 3
2:07 3
0:47 3
22:08 3
0:57 3
21:09 3
7:38 3
14:03 3
4:36 3
20:39 3
9:54 3
13:44 3
19:04 3
14:16 3
11:29 3
2:21 3
15:12 3
13:46 3
1:29 3
11:53 3
14:27 3
18:03 3
20:52 3
12:11 3
18:41 3
7:26 3
10:33 3
21:33 3
8:13 3
19:09 3
0:11 3
1:35 3
14:29 3
17:38 3
6:07 3
19:03 3
3:23 3
0:01 3
16:26 3
22:52 3
19:53 3
11:51 3
17:31 3
21:54 3
17:07 3
3:11 3
8:42 3
9:26 3
9:42 3
16:21 3
17:22 3
13:37 3
9:53 3
20:07 3
17:33 3
23:31 3
19:32 3
0:04 3
18:32 3
19:48 3
22:11 3
20:26 3
7:12 3
16:36 3
16:46 3
8:49 3
11:13 3
15:32 3
14:12 3
21:11 3
12:39 3
9:51 3
1:34 3
18:19 3
14:04 3
17:59 3
20:12 3
9:18 3
10:17 3
21:38 3
6:06 3
3:29 3
21:02 3
22:36 3
4:35 3
10:57 3
22:24 3
19:38 3
9:29 3
9:47 3
8:24 3
11:16 3
17:53 3
13:31 3
18:27 3
17:02 3
13:14 3
10:36 3
16:41 3
3:16 3
19:54 3
12:59 3
2:26 3
9:01 3
21:42 2
9:06 2
11:52 2
0:07 2
11:49 2
18:13 2
17:24 2
12:04 2
5:59 2
15:51 2
11:48 2
17:19 2
3:37 2
19:31 2
7:01 2
8:38 2
4:51 2
21:56 2
15:19 2
19:56 2
21:22 2
18:39 2
20:38 2
15:11 2
14:31 2
0:24 2
3:53 2
9:04 2
20:57 2
20:16 2
15:47 2
20:28 2
7:09 2
9:49 2
20:29 2
16:22 2
14:47 2
9:17 2
9:19 2
11:33 2
17:37 2
18:23 2
8:27 2
21:52 2
5:49 2
1:51 2
14:13 2
20:09 2
11:44 2
12:23 2
9:08 2
22:51 2
10:32 2
13:49 2
16:08 2
16:17 2
6:46 2
22:38 2
12:16 2
9:31 2
19:17 2
7:56 2
12:52 2
2:53 2
14:36 2
22:27 2
15:49 2
13:27 2
19:58 2
15:29 2
10:34 2
12:53 2
12:13 2
18:53 2
5:25 2
12:14 2
14:52 2
10:46 2
4:57 2
15:54 2
23:14 2
14:59 2
1:44 2
1:47 2
15:38 2
15:06 2
6:31 2
23:12 2
19:14 2
7:44 2
10:12 2
11:28 2
13:59 2
0:12 2
23:18 2
15:39 2
21:27 2
23:28 2
4:19 2
7:04 2
21:41 2
20:13 2
4:28 2
23:24 2
3:55 2
19:22 2
7:46 2
14:17 2
21:26 2
16:42 2
12:12 2
9:43 2
2:52 2
17:23 2
16:18 2
4:42 2
8:06 2
6:19 2
15:01 2
15:34 2
23:44 2
5:47 2
13:17 2
15:28 2
18:36 2
10:41 2
10:24 2
17:27 2
19:08 2
1:13 2
21:32 2
2:29 2
1:52 2
4:16 2
1:24 2
17:04 2
16:53 2
20:47 2
21:59 2
20:22 2
3:48 2
0:34 2
2:41 2
12:44 2
3:14 2
18:46 2
9:39 2
6:21 2
14:19 2
4:54 2
14:42 2
0:02 2
10:51 2
3:07 2
5:42 2
3:04 2
0:26 2
2:11 2
16:01 2
20:53 2
9:23 2
8:04 2
10:53 2
16:29 2
15:03 2
22:47 2
0:17 2
7:17 2
7:34 2
10:26 2
18:11 2
18:21 2
21:18 2
10:01 2
8:29 2
15:36 2
23:11 2
2:25 2
11:57 2
2:42 2
17:56 2
20:02 2
16:56 2
6:18 2
22:49 2
6:44 2
0:49 2
13:13 2
17:09 2
23:59 2
2:51 2
17:32 2
8:33 2
13:41 2
0:16 2
23:21 2
14:48 2
9:46 2
10:22 2
12:41 2
13:12 2
19:41 2
4:43 2
4:53 2
3:12 2
0:43 2
21:19 2
16:07 2
13:06 2
21:37 2
20:36 2
14:09 2
14:23 2
23:02 2
18:54 2
23:42 2
13:26 2
5:01 2
23:08 2
0:32 2
16:03 2
1:32 2
23:41 2
8:09 2
3:54 2
11:47 2
8:26 2
22:04 2
10:16 2
23:52 2
15:23 2
16:34 2
2:08 2
8:01 2
2:36 2
5:36 2
4:32 2
0:27 2
20:54 2
15:18 2
7:07 2
22:01 2
22:26 2
18:49 2
6:36 2
20:04 2
9:32 2
22:16 2
1:17 2
12:57 2
15:02 2
1:01 2
2:05 2
1:38 2
7:03 2
16:11 2
0:38 2
13:22 2
10:07 2
13:01 2
0:08 2
5:09 2
13:38 2
23:16 2
11:09 1
2:56 1
19:43 1
1:36 1
7:13 1
16:09 1
13:21 1
3:57 1
12:36 1
22:42 1
22:03 1
3:56 1
6:17 1
11:07 1
4:46 1
13:32 1
1:16 1
2:22 1
10:42 1
4:07 1
23:51 1
9:36 1
2:34 1
20:19 1
16:52 1
11:46 1
4:31 1
10:23 1
18:33 1
7:58 1
10:29 1
3:46 1
1:11 1
1:22 1
1:09 1
22:33 1
5:19 1
11:36 1
9:34 1
12:17 1
3:59 1
23:53 1
2:04 1
13:39 1
4:47 1
10:02 1
10:03 1
6:43 1
0:23 1
21:53 1
0:03 1
2:54 1
2:18 1
6:33 1
2:38 1
0:58 1
6:08 1
14:32 1
22:07 1
13:47 1
19:37 1
5:37 1
1:59 1
6:14 1
22:34 1
5:22 1
20:58 1
1:19 1
3:08 1
0:41 1
18:01 1
9:16 1
10:21 1
7:29 1
13:16 1
1:12 1
18:59 1
2:59 1
1:57 1
11:03 1
5:34 1
5:33 1
6:24 1
0:37 1
1:49 1
1:56 1
2:23 1
11:22 1
13:19 1
4:48 1
2:17 1
0:28 1
19:29 1
7:37 1
19:02 1
2:47 1
4:39 1
4:27 1
0:53 1
4:09 1
2:48 1
3:47 1
3:36 1
11:43 1
3:09 1
4:49 1
1:08 1
6:47 1
14:49 1
0:33 1
4:17 1
16:54 1
6:09 1
12:42 1
22:29 1
23:22 1
22:31 1
0:31 1
5:11 1
5:32 1
5:39 1
7:31 1
3:28 1
8:32 1
19:34 1
5:41 1
10:13 1
12:27 1
1:26 1
6:23 1
1:02 1
22:56 1
19:44 1
22:54 1
7:24 1
6:29 1
12:24 1
8:31 1
3:26 1
22:37 1
9:24 1
23:46 1
23:27 1
1:31 1
3:52 1
7:54 1
8:51 1
9:27 1
1:14 1
5:24 1
22:09 1
3:43 1
10:59 1
5:02 1
3:19 1
6:51 1
1:58 1
5:46 1
1:37 1
15:07 1
13:33 1
12:06 1
16:19 1
11:59 1
6:03 1
22:02 1
11:08 1
20:01 1
6:04 1
15:14 1
23:32 1
21:36 1
7:57 1
19:01 1
10:14 1
12:56 1
15:17 1
10:43 1
16:39 1
3:03 1
11:39 1
12:49 1
2:28 1
9:58 1
5:29 1
11:58 1
8:18 1
11:12 1
22:59 1
6:41 1
5:03 1
7:33 1
22:21 1
6:56 1
8:56 1
17:03 1
1:03 1
7:59 1
7:43 1
15:31 1
12:19 1
15:24 1
7:49 1
6:16 1
20:43 1
20:21 1
10:06 1
6:22 1
14:43 1
15:46 1
11:11 1
7:23 1
15:53 1
5:28 1
2:44 1
10:08 1
9:33 1
19:07 1
20:56 1
23:47 1
5:21 1
7:22 1
9:02 1
3:44 1
13:29 1
4:58 1
2:33 1
2:13 1
5:52 1
11:37 1
17:01 1
23:17 1
2:58 1
6:13 1
6:26 1
7:28 1
3:24 1
11:06 1
19:27 1
4:08 1
10:09 1
13:08 1
21:14 1
0:56 1
12:32 1
12:02 1
5:06 1
15:27 1
0:48 1
11:54 1
3:38 1
4:23 1
7:53 1
14:46 1
6:01 1
21:12 1
8:37 1
2:02 1
21:43 1
3:42 1
6:48 1
23:49 1
22:48 1
1:39 1
0:44 1
22:43 1
20:08 1
5:56 1
8:53 1
13:58 1
4:06 1
2:03 1
5:26 1
11:14 1
8:03 1
0:36 1
2:14 1
23:56 1
6:42 1
8:23 1
14:18 1
Name: CRASH TIME, dtype: int64
For example, here are some cleaning steps:
import numpy as np

# treat "Unspecified" contributing factors as missing
for i in range(1, 6):
    col = f"CONTRIBUTING FACTOR VEHICLE {i}"
    jan23[col] = jan23[col].replace("Unspecified", np.nan)

# coordinates of exactly 0.0 are clearly invalid; recode them as missing
jan23["LATITUDE"] = jan23["LATITUDE"].replace(0.0, np.nan)
jan23["LONGITUDE"] = jan23["LONGITUDE"].replace(0.0, np.nan)
jan23.describe()
| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ZIP CODE | 4796 | 10893.521685 | 526.392428 | 10001 | 10457 | 11208 | 11239 | 11694 |
| LATITUDE | 6683 | 40.722317 | 0.081602 | 40.504658 | 40.665350 | 40.713196 | 40.777754 | 40.912827 |
| LONGITUDE | 6683 | -73.918590 | 0.085654 | -74.250150 | -73.965530 | -73.922740 | -73.867714 | -73.702095 |
| NUMBER OF PERSONS INJURED | 7244 | 0.502761 | 0.813641 | 0 | 0 | 0 | 1 | 21 |
| NUMBER OF PERSONS KILLED | 7244 | 0.002347 | 0.051164 | 0 | 0 | 0 | 0 | 2 |
| NUMBER OF PEDESTRIANS INJURED | 7244 | 0.116372 | 0.397927 | 0 | 0 | 0 | 0 | 19 |
| NUMBER OF PEDESTRIANS KILLED | 7244 | 0.000690 | 0.026265 | 0 | 0 | 0 | 0 | 1 |
| NUMBER OF CYCLIST INJURED | 7244 | 0.033269 | 0.180118 | 0 | 0 | 0 | 0 | 2 |
| NUMBER OF CYCLIST KILLED | 7244 | 0.000414 | 0.020348 | 0 | 0 | 0 | 0 | 1 |
| NUMBER OF MOTORIST INJURED | 7244 | 0.333103 | 0.749174 | 0 | 0 | 0 | 0 | 8 |
| NUMBER OF MOTORIST KILLED | 7244 | 0.001242 | 0.038951 | 0 | 0 | 0 | 0 | 2 |
| COLLISION_ID | 7244 | 4.599022e+06 | 2.365885e+03 | 4.594332e+06 | 4.597113e+06 | 4.599058e+06 | 4.600953e+06 | 4.605324e+06 |
By the data dictionary, OFF STREET NAME is the street address of the collision site. Some records have an OFF STREET NAME but missing LATITUDE and LONGITUDE. The missing coordinates could be filled in by geocoding the street address with a geocoding package.
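The text does not name a specific package, so as one illustration (an assumption, not the author's choice), the geopy package can geocode an address through the free Nominatim service. A minimal sketch; the sample address is hypothetical:

from geopy.geocoders import Nominatim

# Nominatim (OpenStreetMap) requires a descriptive user_agent string
geolocator = Nominatim(user_agent="nyc_crash_cleaning")

# a hypothetical OFF STREET NAME value; real use would loop over the
# records that have an address but no coordinates
location = geolocator.geocode("100 MAIN STREET, BROOKLYN, NY")
if location is not None:
    print(location.latitude, location.longitude)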
Filling the Missing Zip Codes by Reverse Geocoding
The package uszipcode is a powerful and easy-to-use programmable zip code database for Python. It provides information about 42,724 zip codes in the US, with data crawled from <data.census.gov>. See its documentation for details.
from uszipcode import SearchEngine

sr = SearchEngine()
sr.by_zipcode("10001")
SimpleZipcode(zipcode='10001', zipcode_type='STANDARD', major_city='New York', post_office_city='New York, NY', common_city_list=['New York'], county='New York County', state='NY', lat=40.75, lng=-74.0, timezone='America/New_York', radius_in_miles=0.9090909090909091, area_code_list='718,917,347,646', population=21102, population_density=33959.0, land_area_in_sqmi=0.62, water_area_in_sqmi=0.0, housing_units=12476, occupied_housing_units=11031, median_home_value=650200, median_household_income=81671, bounds_west=-74.008621, bounds_east=-73.984076, bounds_north=40.759731, bounds_south=40.743451)
We can use uszipcode to reverse geocode a point from its coordinates. The returned zip code can then be used to fill in a missing zip code.
z = sr.by_coordinates(40.769993, -73.915825, radius=1)
z[0].zipcode
z[0].median_home_value
Once we have found the zip code, we can find its borough. See the complete NYC zip code list.
def nyczip2borough(zipcode):
    # map a single NYC zip code to its borough
    nzip = int(zipcode)
    if 10001 <= nzip <= 10282:
        return "MANHATTAN"
    elif 10301 <= nzip <= 10314:
        return "STATEN ISLAND"
    elif 10451 <= nzip <= 10475:
        return "BRONX"
    elif 11004 <= nzip <= 11109:
        return "QUEENS"
    elif 11351 <= nzip <= 11697:
        return "QUEENS"
    elif 11201 <= nzip <= 11256:
        return "BROOKLYN"
    else:
        return np.nan
Let’s try it out:
nyczip2borough(z[0].zipcode)
Here is a vectorized version:
import numpy as np
import pandas as pd
from typing import Union

def nyczip2borough(zips: Union[np.ndarray, pd.Series]) -> np.ndarray:
    # work on the underlying numpy array so the comparisons broadcast
    vals = zips.values if isinstance(zips, pd.Series) else zips
    condlist = [
        (vals >= 10001) & (vals <= 10282),
        (vals >= 10301) & (vals <= 10314),
        (vals >= 10451) & (vals <= 10475),
        (vals >= 11004) & (vals <= 11109),
        (vals >= 11351) & (vals <= 11697),
        (vals >= 11201) & (vals <= 11256),
    ]
    choicelist = [
        "MANHATTAN",
        "STATEN ISLAND",
        "BRONX",
        "QUEENS",
        "QUEENS",
        "BROOKLYN",
    ]
    # with string choices, np.select promotes the result to a string dtype,
    # so unmatched zip codes come back as the string 'nan' rather than np.nan
    return np.select(condlist, choicelist, default=np.nan)
Try it out:

nyczip2borough(jan23["ZIP CODE"].dropna().head(10))
array(['BROOKLYN', 'QUEENS', 'MANHATTAN', 'QUEENS', 'BROOKLYN', 'QUEENS',
'QUEENS', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND'], dtype='<U32')
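To actually fill in the missing zip codes, we can reverse geocode the rows that have coordinates but no ZIP CODE. A minimal sketch, assuming the sr search engine created above (it issues one lookup per row, so it is slow on large data):

# fill missing ZIP CODE values by reverse geocoding the coordinates
def fill_zip(row):
    if pd.isna(row["ZIP CODE"]) and pd.notna(row["LATITUDE"]):
        hits = sr.by_coordinates(row["LATITUDE"], row["LONGITUDE"], radius=5)
        if hits:  # take the nearest zip code returned
            return float(hits[0].zipcode)
    return row["ZIP CODE"]

jan23["ZIP CODE"] = jan23.apply(fill_zip, axis=1)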
The uszipcode package provides zip-code-level databases from the US Census. Such information could be merged with the NYC crash data for further analysis.
from uszipcode import SearchEngine, SimpleZipcode
import os

# set the default database file location
db_file = os.path.join(os.getenv("HOME"), "simple_db.sqlite")
search = SearchEngine(db_file_path=db_file)
search.by_zipcode("10030")
SimpleZipcode(zipcode='10030', zipcode_type='STANDARD', major_city='New York', post_office_city='New York, NY', common_city_list=['New York'], county='New York County', state='NY', lat=40.82, lng=-73.94, timezone='America/New_York', radius_in_miles=0.5681818181818182, area_code_list='212,646,917', population=26999, population_density=96790.0, land_area_in_sqmi=0.28, water_area_in_sqmi=0.0, housing_units=12976, occupied_housing_units=11395, median_home_value=509000, median_household_income=31925, bounds_west=-73.948677, bounds_east=-73.936232, bounds_north=40.824032, bounds_south=40.812791)
The SQLite database of US zip codes is stored in $HOME/.uszipcode. It can be imported as a pandas DataFrame.
import sqlite3
import pandas as pd

# change to your own path after installing uszipcode
con = sqlite3.connect(db_file)
zipdf = pd.read_sql_query("SELECT * from simple_zipcode", con)
zipdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42724 entries, 0 to 42723
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 zipcode 42724 non-null object
1 zipcode_type 42724 non-null object
2 major_city 42724 non-null object
3 post_office_city 33104 non-null object
4 common_city_list 41877 non-null object
5 county 42724 non-null object
6 state 42724 non-null object
7 lat 42724 non-null float64
8 lng 42724 non-null float64
9 timezone 42724 non-null object
10 radius_in_miles 33104 non-null float64
11 area_code_list 42724 non-null object
12 population 31448 non-null float64
13 population_density 31448 non-null float64
14 land_area_in_sqmi 31448 non-null float64
15 water_area_in_sqmi 31448 non-null float64
16 housing_units 31448 non-null float64
17 occupied_housing_units 31448 non-null float64
18 median_home_value 31448 non-null float64
19 median_household_income 31448 non-null float64
20 bounds_west 33104 non-null float64
21 bounds_east 33104 non-null float64
22 bounds_north 33104 non-null float64
23 bounds_south 33104 non-null float64
dtypes: float64(15), object(9)
memory usage: 7.8+ MB
The zip code dataframe can be merged with the crash dataframe.
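A minimal sketch of such a merge (the prediction section below does the same thing restricted to NY zip codes). The key dtypes must first agree, since ZIP CODE is read as float while zipcode is stored as text:

# align the key dtypes, then attach zip-level covariates to each crash
zipdf["zipcode"] = zipdf["zipcode"].astype(float)
merged = pd.merge(jan23, zipdf, how="left",
                  left_on="ZIP CODE", right_on="zipcode")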
Map the Crash Sites
We can do this with the package gmplot. See instructions from this tutorial.
import gmplot

# prepare the geocodes, dropping rows where either coordinate is missing
# so that the latitude and longitude arrays stay aligned
coords = jan23[["LATITUDE", "LONGITUDE"]].dropna()
latitude = coords["LATITUDE"].values
longitude = coords["LONGITUDE"].values

# center of the map and zoom level
gmap = gmplot.GoogleMapPlotter(40.769737, -73.91244, 14)

# plot heatmap
gmap.heatmap(latitude, longitude)
gmap.scatter(latitude, longitude, c='r', marker=True)
# gmap.scatter(latitude, longitude, '#FF0000', size=50, marker=False)

# Your Google_API_Key
# gmap.apikey = "put your key here"

# save it to html
gmap.draw("nycrashmap.html")
Predicting Injuries
Let’s start by importing the cleaned data. We define INJURY as an indicator of personal injury and create an HOUR variable for the crash hour.
jan23 = pd.read_csv("data/nyc_crashes_202301_cleaned.csv")
jan23["INJURY"] = [1 if x > 0 else 0 for x in jan23["NUMBER OF PERSONS INJURED"]]
jan23["HOUR"] = jan23["CRASH TIME"].str.split(":").str[0].astype(int)
We import zip-code-level information from the package uszipcode and merge it in.
zipdf = pd.read_sql_query("SELECT * from simple_zipcode WHERE state = 'NY'", con)
zipdf["zipcode"] = zipdf["zipcode"].astype(float)
crashdf = pd.merge(jan23, zipdf, left_on="ZIP CODE", right_on="zipcode")
Let’s select a few columns that might be useful:
crashdf = crashdf[["INJURY", "HOUR", "BOROUGH", "LATITUDE", "LONGITUDE",
                   "zipcode", "population_density", "median_home_value",
                   "median_household_income"]]
crashdf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7240 entries, 0 to 7239
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INJURY 7240 non-null int64
1 HOUR 7240 non-null int64
2 BOROUGH 7219 non-null object
3 LATITUDE 7240 non-null float64
4 LONGITUDE 7240 non-null float64
5 zipcode 7240 non-null float64
6 population_density 7105 non-null float64
7 median_home_value 7105 non-null float64
8 median_household_income 7105 non-null float64
dtypes: float64(6), int64(2), object(1)
memory usage: 565.6+ KB
Preparation
Categorical variables need to be coded appropriately. Ordinal variables can be coded with OrdinalEncoder, and nominal variables with OneHotEncoder. We want to treat HOUR, BOROUGH, and zipcode as nominal categorical variables. One way to encode them is pd.get_dummies.
catVars = ["HOUR", "BOROUGH"]  # , "zipcode"]
crashdf = pd.get_dummies(crashdf, columns=catVars, prefix=catVars)
crashdf.shape
It is important to note that we get one dummy variable for each level of a categorical variable: 5 dummies for BOROUGH and 24 for HOUR. If all of these dummies were used in a regression model, the model matrix would not be of full rank; in other words, there would be perfect collinearity among the covariates, and the model fitting would fail. One remedy is to drop one reference level per variable (e.g., drop_first=True in pd.get_dummies), as illustrated below. For regularized regression models and many machine learning methods (e.g., decision trees, random forests, SVMs), this is less of an issue.
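A toy illustration of the reference-level remedy (the small data frame is made up):

# with drop_first=True, one level is dropped as the reference category,
# so the remaining dummies are linearly independent of the intercept
toy = pd.DataFrame({"BOROUGH": ["BRONX", "QUEENS", "BRONX", "MANHATTAN"]})
print(pd.get_dummies(toy, columns=["BOROUGH"], drop_first=True))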
We need to standardize the continuous covariates so that their impacts are comparable. Standardization also makes the numerical computation more stable: if one variable has a variance 100 times larger than the others, it may dominate the objective function and make it hard for the optimization algorithm to find the optimum. This preprocessing can be done with the sklearn.preprocessing module.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
conVars = ["population_density", "median_home_value", "median_household_income"]
crashdf[conVars] = scaler.fit_transform(crashdf[conVars])
crashdf.columns
Index(['INJURY', 'LATITUDE', 'LONGITUDE', 'zipcode', 'population_density',
'median_home_value', 'median_household_income', 'HOUR_0', 'HOUR_1',
'HOUR_2', 'HOUR_3', 'HOUR_4', 'HOUR_5', 'HOUR_6', 'HOUR_7', 'HOUR_8',
'HOUR_9', 'HOUR_10', 'HOUR_11', 'HOUR_12', 'HOUR_13', 'HOUR_14',
'HOUR_15', 'HOUR_16', 'HOUR_17', 'HOUR_18', 'HOUR_19', 'HOUR_20',
'HOUR_21', 'HOUR_22', 'HOUR_23', 'BOROUGH_BRONX', 'BOROUGH_BROOKLYN',
'BOROUGH_MANHATTAN', 'BOROUGH_QUEENS', 'BOROUGH_STATEN ISLAND'],
dtype='object')