First Glance
Consider the NYC Crash Data from January 2023.
import pandas as pd
jan23 = pd.read_csv("data/nyc_crashes_202301.csv")
jan23.head()
jan23.describe()
Summary statistics (transposed; variables as rows):

                                      count          mean          std           min           25%           50%           75%           max
ZIP CODE                        4796.000000  10893.521685   526.392428  10001.000000  10457.000000  11208.000000  11239.000000  11694.000000
LATITUDE                        6764.000000     40.234660     4.430595      0.000000     40.663230     40.711880     40.775640     40.912827
LONGITUDE                       6764.000000    -73.033403     8.041457    -74.250150    -73.964888    -73.921230    -73.865389      0.000000
NUMBER OF PERSONS INJURED       7244.000000      0.502761     0.813641      0.000000      0.000000      0.000000      1.000000     21.000000
NUMBER OF PERSONS KILLED        7244.000000      0.002347     0.051164      0.000000      0.000000      0.000000      0.000000      2.000000
NUMBER OF PEDESTRIANS INJURED   7244.000000      0.116372     0.397927      0.000000      0.000000      0.000000      0.000000     19.000000
NUMBER OF PEDESTRIANS KILLED    7244.000000      0.000690     0.026265      0.000000      0.000000      0.000000      0.000000      1.000000
NUMBER OF CYCLIST INJURED       7244.000000      0.033269     0.180118      0.000000      0.000000      0.000000      0.000000      2.000000
NUMBER OF CYCLIST KILLED        7244.000000      0.000414     0.020348      0.000000      0.000000      0.000000      0.000000      1.000000
NUMBER OF MOTORIST INJURED      7244.000000      0.333103     0.749174      0.000000      0.000000      0.000000      0.000000      8.000000
NUMBER OF MOTORIST KILLED       7244.000000      0.001242     0.038951      0.000000      0.000000      0.000000      0.000000      2.000000
COLLISION_ID                    7.244000e+03  4.599022e+06  2.365885e+03  4.594332e+06  4.597113e+06  4.599058e+06  4.600953e+06  4.605324e+06
Frequency tables for categorical variables.
jan23["BOROUGH"].value_counts(dropna=False)
NaN 2448
BROOKLYN 1653
QUEENS 1316
BRONX 821
MANHATTAN 811
STATEN ISLAND 195
Name: BOROUGH, dtype: int64
Some tables are too long.
# jan23["CONTRIBUTING FACTOR VEHICLE 1"].value_counts(dropna=False)
with pd.option_context('display.max_rows', None):
    print(jan23["VEHICLE TYPE CODE 1"].value_counts(dropna=False))
Sedan 3478
Station Wagon/Sport Utility Vehicle 2505
Taxi 177
Pick-up Truck 162
NaN 136
Bus 135
Box Truck 110
Bike 84
Ambulance 55
Tractor Truck Diesel 53
E-Bike 50
Van 43
Motorcycle 28
E-Scooter 28
Dump 18
Garbage or Refuse 17
Flat Bed 17
Moped 17
PK 16
Convertible 11
Chassis Cab 9
Carry All 9
Tanker 7
Tow Truck / Wrecker 7
Tractor Truck Gasoline 7
LIMO 5
AMBULANCE 5
Motorscooter 5
Flat Rack 3
Motorbike 3
4 dr sedan 3
Concrete Mixer 2
Firetruck 2
MOTOR SCOO 2
STAK 2
SANITATION 1
Beverage Truck 1
SCHOOL BUS 1
Waste truc 1
Forklift 1
AMBU 1
UTILITY 1
Log 1
PAS 1
3-Door 1
FDNY AMBUL 1
Multi-Wheeled Vehicle 1
Scooter 1
Stake or Rack 1
Lift Boom 1
Gas Moped 1
Bulk Agriculture 1
Armored Truck 1
TRAILER 1
electric s 1
Motorized 1
VAn 1
ELECTRIC S 1
UTIL 1
FORK LIFT 1
STREET SWE 1
GARBAGE TR 1
Chevy 1
cart 1
Tow Truck 1
NYC FDNY # 1
Scooter ga 1
VAN TRUCK 1
Name: VEHICLE TYPE CODE 1, dtype: int64
Cross-tables
pd.crosstab(index=jan23["CONTRIBUTING FACTOR VEHICLE 1"],
            columns=jan23["BOROUGH"], dropna=False)
BOROUGH                                                BRONX  BROOKLYN  MANHATTAN  QUEENS  STATEN ISLAND
CONTRIBUTING FACTOR VEHICLE 1
Accelerator Defective                                      3         4          1       1              0
Aggressive Driving/Road Rage                               9        12          6      12              0
Alcohol Involvement                                        7        32          9      30              8
Animals Action                                             0         0          0       0              1
Backing Unsafely                                          31        57         29      54             10
Brakes Defective                                           3         6          2       4              1
Cell Phone (hand-Held)                                     0         1          0       0              0
Driver Inattention/Distraction                           177       415        211     307             45
Driver Inexperience                                       15        41         20      18              6
Driverless/Runaway Vehicle                                 1         1          2       3              0
Drugs (illegal)                                            2         5          1       0              0
Failure to Keep Right                                      1         2          0       2              0
Failure to Yield Right-of-Way                             58       108         51     157             12
Fatigued/Drowsy                                            0         0          0       3              1
Fell Asleep                                                3         8          3      10              0
Following Too Closely                                     30        56         38      69             10
Glare                                                      2         2          1       2              1
Illnes                                                     1         1          0       5              1
Lane Marking Improper/Inadequate                           1         0          0       0              0
Lost Consciousness                                         2         3          1       4              2
Obstruction/Debris                                         1         1          1       1              1
Other Electronic Device                                    0         0          1       1              0
Other Lighting Defects                                     1         0          0       0              0
Other Vehicular                                           40        33         28      29              3
Outside Car Distraction                                    5         5          0       3              0
Oversized Vehicle                                          5         7          2       3              1
Passenger Distraction                                      1         4          5       4              0
Passing Too Closely                                       41        70         36      44              1
Passing or Lane Usage Improper                            26        68         33      80              9
Pavement Defective                                         0         2          0       0              0
Pavement Slippery                                          1         6          5       5              0
Pedestrian/Bicyclist/Other Pedestrian Error/Confusion     10        14         17       3              1
Physical Disability                                        1         0          0       1              1
Reaction to Uninvolved Vehicle                             1         6         10      12              4
Steering Failure                                           5         4          0       7              1
Texting                                                    0         0          0       0              1
Tinted Windows                                             0         1          1       0              0
Tire Failure/Inadequate                                    0         2          0       3              1
Traffic Control Device Improper/Non-Working                0         0          0       1              0
Traffic Control Disregarded                               25        39         16      50              8
Turning Improperly                                        22        48         20      35             11
Unsafe Lane Changing                                       7        16         14       9              2
Unsafe Speed                                              29        32         22      56              8
Unspecified                                              229       508        209     273             38
View Obstructed/Limited                                   15        15          7      12              4
Windshield Inadequate                                      1         0          0       0              0
Some Cleaning
Questions from Dr. Douglas Bates:
- The CRASH_DATEs are all in the correct month and there are no missing values.
- There are no missing values in the CRASH_TIMEs, but there are 117 values of exactly 00:00:00. Is this a matter of bad luck when the clock strikes midnight?
- Over 1/3 of the ZIP_CODE and BOROUGH values are missing. There are the same number of missing values in these two columns; do they always co-occur? If LATITUDE and LONGITUDE are available, can they be used to infer the ZIP_CODE?
- There are 178 unique non-missing ZIP_CODE values, as stated in the Jamboree description. ("Trust, but verify.") Is there really a zip code of 10000 in New York?
- There are 20 values of 0.0 for LATITUDE and LONGITUDE. These are obviously incorrect; should they be coded as missing?
- Is it redundant to keep the LOCATION in addition to LATITUDE and LONGITUDE?
- The COLLISION_ID is unique to each row and can be used as a key. The values are not consecutive; why not?
- The NUMBER_OF_... columns seem reasonable. A further consistency check is suggested in the Jamboree tasks.
- In the CONTRIBUTING_FACTOR_... columns, is Unspecified different from missing?
- The codes in the VEHICLE_TYPE_CODE_... columns are the usual hodge-podge of results from "free-form" data entry. Should unk, UNK, UNKNOWN, and Unknown be converted to missing?
- In contrast, the codes in the CONTRIBUTING_FACTOR_... columns appear to be standardized (though it is not clear why Illnes isn't Illness).
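The co-occurrence question for the missing ZIP_CODE and BOROUGH values can be answered with a cross-tabulation of missingness indicators. A minimal sketch on a toy frame (the column names follow the crash data; the values are made up):

```python
import numpy as np
import pandas as pd

# toy frame mimicking the crash data's missingness pattern (values made up)
toy = pd.DataFrame({
    "ZIP CODE": [11208.0, np.nan, 10001.0, np.nan],
    "BOROUGH": ["BROOKLYN", np.nan, "MANHATTAN", np.nan],
})

# cross-tabulate the missingness indicators of the two columns;
# off-diagonal counts would mean the missing values do not co-occur
tab = pd.crosstab(toy["ZIP CODE"].isna(), toy["BOROUGH"].isna())
print(tab)
```

On the real data, counts appearing only on the diagonal of this table would confirm that the two columns are always missing together.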
with pd.option_context('display.max_rows', None):
    print(jan23["CRASH TIME"].value_counts())
0:00 116
15:00 75
18:00 72
17:00 69
10:00 65
8:00 61
8:30 59
13:00 58
7:00 58
14:00 56
19:00 55
16:00 54
23:00 54
12:00 54
9:00 53
20:00 53
17:30 51
14:30 51
10:30 51
12:30 51
11:00 49
9:30 42
15:30 40
16:30 39
18:30 38
21:00 38
6:00 36
11:30 35
20:30 35
6:30 33
4:00 32
1:00 31
5:00 31
22:00 31
13:30 30
23:30 27
22:30 27
6:50 27
21:30 26
18:20 26
15:20 25
17:20 25
15:15 24
18:45 24
14:50 23
7:30 23
15:45 23
1:30 23
14:40 23
8:20 23
19:30 23
9:45 22
13:20 22
3:00 22
16:20 22
7:50 22
18:40 21
21:45 21
16:45 21
17:45 21
10:50 21
16:10 21
8:40 20
10:40 20
8:15 20
20:50 19
1:20 19
13:40 19
18:35 19
11:20 19
17:40 19
6:20 19
9:40 19
2:00 19
18:50 18
18:10 18
9:20 18
9:15 18
20:15 18
12:50 18
16:35 18
0:30 18
12:20 18
14:45 18
22:40 18
12:15 18
16:40 18
12:45 18
19:40 18
10:45 17
17:50 17
19:50 17
19:20 17
8:45 17
15:50 17
17:55 17
10:20 17
12:10 17
3:30 16
13:45 16
17:35 16
7:45 16
11:40 16
15:10 16
16:15 16
11:45 16
14:15 16
15:40 16
13:15 15
9:50 15
17:15 15
1:45 15
6:45 15
19:10 15
12:40 15
18:05 15
22:20 14
16:50 14
17:25 14
21:40 14
20:40 14
5:30 14
8:50 14
11:15 14
22:45 14
12:25 14
11:50 14
16:55 14
11:10 14
14:20 13
14:35 13
8:55 13
19:45 13
17:10 13
11:55 13
8:25 13
4:30 13
7:20 13
14:55 13
6:40 13
23:40 12
6:55 12
19:15 12
6:05 12
11:05 12
10:15 12
7:05 12
8:35 12
7:15 12
13:10 12
15:35 12
22:15 12
23:45 12
5:45 12
19:05 12
21:50 11
7:10 11
16:25 11
9:25 11
14:10 11
19:55 11
14:05 11
17:05 11
5:35 11
22:10 11
20:10 11
10:25 11
13:25 11
15:05 10
20:45 10
6:15 10
19:35 10
23:15 10
3:50 10
20:35 10
0:40 10
5:10 10
18:15 10
23:20 10
21:15 10
2:20 10
7:25 10
9:05 9
4:50 9
23:50 9
18:29 9
1:50 9
20:20 9
9:10 9
13:05 9
2:30 9
19:25 9
16:05 9
11:04 9
13:50 9
20:25 9
22:50 9
7:40 9
15:25 9
0:20 9
16:49 9
10:35 9
18:25 9
1:10 8
22:41 8
21:20 8
0:55 8
3:45 8
0:10 8
12:35 8
3:10 8
1:55 8
23:05 8
8:17 8
12:05 8
5:40 8
18:55 8
8:10 8
8:05 8
17:16 8
5:50 8
10:55 8
23:35 7
18:08 7
18:51 7
11:34 7
21:10 7
6:10 7
2:40 7
22:25 7
10:05 7
17:54 7
3:20 7
16:48 7
18:34 7
5:20 7
15:55 7
4:20 7
20:05 7
1:15 7
10:28 7
4:40 7
12:55 7
18:18 7
1:05 7
23:25 7
15:43 7
4:15 7
3:15 7
7:35 7
22:35 7
21:35 7
19:19 6
14:51 6
11:25 6
10:10 6
17:13 6
20:37 6
20:23 6
14:58 6
17:08 6
19:49 6
17:39 6
21:08 6
23:13 6
22:55 6
12:08 6
10:19 6
23:23 6
7:48 6
16:28 6
14:22 6
18:22 6
8:44 6
18:58 6
16:33 6
6:54 6
7:55 6
7:06 6
8:08 6
2:15 6
21:04 6
21:25 6
19:18 6
5:55 6
3:25 6
5:05 6
0:45 6
20:55 6
17:29 6
15:33 6
2:35 6
0:50 6
14:54 6
15:57 6
13:24 6
13:36 6
18:28 6
2:50 6
8:14 6
6:25 6
10:37 6
21:01 5
19:51 5
0:06 5
17:11 5
3:40 5
16:37 5
18:16 5
14:57 5
20:14 5
19:57 5
1:25 5
7:19 5
18:06 5
13:34 5
13:52 5
20:59 5
8:28 5
19:36 5
13:28 5
14:11 5
12:46 5
17:44 5
2:27 5
16:38 5
17:46 5
8:58 5
19:26 5
15:26 5
22:18 5
11:56 5
14:25 5
11:32 5
18:48 5
19:11 5
8:47 5
13:04 5
18:07 5
20:24 5
9:35 5
21:58 5
23:06 5
8:16 5
19:52 5
12:51 5
17:28 5
4:37 5
18:04 5
17:26 5
13:48 5
13:55 5
10:54 5
18:52 5
17:58 5
21:06 5
19:21 5
16:16 5
14:28 5
17:12 5
18:37 5
22:12 5
11:35 5
18:44 5
14:44 5
1:40 5
13:07 5
18:24 5
0:25 5
12:48 5
0:09 5
21:05 5
13:35 5
0:35 5
21:55 5
9:22 5
20:48 5
21:29 5
21:07 5
11:17 5
16:58 4
13:56 4
6:37 4
14:21 4
7:41 4
20:18 4
5:44 4
6:49 4
16:02 4
20:49 4
14:39 4
5:12 4
22:05 4
13:09 4
8:43 4
4:10 4
12:54 4
19:28 4
17:14 4
5:38 4
1:21 4
15:21 4
0:18 4
0:15 4
18:57 4
21:03 4
19:06 4
17:48 4
23:38 4
12:18 4
6:53 4
21:23 4
19:59 4
16:31 4
10:04 4
13:54 4
8:02 4
6:52 4
16:51 4
23:33 4
0:42 4
17:17 4
9:44 4
19:12 4
19:13 4
22:17 4
16:27 4
15:48 4
10:48 4
8:52 4
7:18 4
17:52 4
7:08 4
15:04 4
9:59 4
14:26 4
20:17 4
17:57 4
6:38 4
20:32 4
23:19 4
18:14 4
11:24 4
19:47 4
12:03 4
10:18 4
14:06 4
14:38 4
19:16 4
3:35 4
4:25 4
16:13 4
8:12 4
14:53 4
21:13 4
6:58 4
14:56 4
8:36 4
18:26 4
13:53 4
11:23 4
22:14 4
15:41 4
9:41 4
13:57 4
13:02 4
15:16 4
15:13 4
10:11 4
4:05 4
20:06 4
11:42 4
14:37 4
5:07 4
0:05 4
10:56 4
8:21 4
21:47 4
9:55 4
16:04 4
17:36 4
22:58 4
8:34 4
8:46 4
16:14 4
15:58 4
15:42 4
6:27 4
19:23 4
16:44 4
8:59 4
23:10 4
12:37 4
18:43 4
21:16 4
5:15 4
22:23 4
11:18 4
17:49 4
4:55 4
15:08 4
10:38 4
9:28 4
13:18 4
20:31 4
16:47 4
13:43 4
9:11 4
9:57 3
12:28 3
12:07 3
2:45 3
3:17 3
3:02 3
0:13 3
12:47 3
12:29 3
23:39 3
23:48 3
12:33 3
18:38 3
20:41 3
16:12 3
16:43 3
14:14 3
16:32 3
23:09 3
21:44 3
18:09 3
22:39 3
17:51 3
6:59 3
1:18 3
7:42 3
23:54 3
17:34 3
11:19 3
5:27 3
8:41 3
23:55 3
23:07 3
21:51 3
14:24 3
9:03 3
23:01 3
7:39 3
22:13 3
22:46 3
3:05 3
18:17 3
19:46 3
7:51 3
15:52 3
9:07 3
20:51 3
11:38 3
14:33 3
14:02 3
8:48 3
8:57 3
15:22 3
18:56 3
18:47 3
15:44 3
21:21 3
4:45 3
17:06 3
16:59 3
17:21 3
17:42 3
20:44 3
18:12 3
12:01 3
17:43 3
18:02 3
7:52 3
22:53 3
10:39 3
14:01 3
19:33 3
17:41 3
20:42 3
12:34 3
6:35 3
2:55 3
20:11 3
1:42 3
17:47 3
12:38 3
23:57 3
18:42 3
16:24 3
4:34 3
9:12 3
0:19 3
17:18 3
2:10 3
2:07 3
0:47 3
22:08 3
0:57 3
21:09 3
7:38 3
14:03 3
4:36 3
20:39 3
9:54 3
13:44 3
19:04 3
14:16 3
11:29 3
2:21 3
15:12 3
13:46 3
1:29 3
11:53 3
14:27 3
18:03 3
20:52 3
12:11 3
18:41 3
7:26 3
10:33 3
21:33 3
8:13 3
19:09 3
0:11 3
1:35 3
14:29 3
17:38 3
6:07 3
19:03 3
3:23 3
0:01 3
16:26 3
22:52 3
19:53 3
11:51 3
17:31 3
21:54 3
17:07 3
3:11 3
8:42 3
9:26 3
9:42 3
16:21 3
17:22 3
13:37 3
9:53 3
20:07 3
17:33 3
23:31 3
19:32 3
0:04 3
18:32 3
19:48 3
22:11 3
20:26 3
7:12 3
16:36 3
16:46 3
8:49 3
11:13 3
15:32 3
14:12 3
21:11 3
12:39 3
9:51 3
1:34 3
18:19 3
14:04 3
17:59 3
20:12 3
9:18 3
10:17 3
21:38 3
6:06 3
3:29 3
21:02 3
22:36 3
4:35 3
10:57 3
22:24 3
19:38 3
9:29 3
9:47 3
8:24 3
11:16 3
17:53 3
13:31 3
18:27 3
17:02 3
13:14 3
10:36 3
16:41 3
3:16 3
19:54 3
12:59 3
2:26 3
9:01 3
21:42 2
9:06 2
11:52 2
0:07 2
11:49 2
18:13 2
17:24 2
12:04 2
5:59 2
15:51 2
11:48 2
17:19 2
3:37 2
19:31 2
7:01 2
8:38 2
4:51 2
21:56 2
15:19 2
19:56 2
21:22 2
18:39 2
20:38 2
15:11 2
14:31 2
0:24 2
3:53 2
9:04 2
20:57 2
20:16 2
15:47 2
20:28 2
7:09 2
9:49 2
20:29 2
16:22 2
14:47 2
9:17 2
9:19 2
11:33 2
17:37 2
18:23 2
8:27 2
21:52 2
5:49 2
1:51 2
14:13 2
20:09 2
11:44 2
12:23 2
9:08 2
22:51 2
10:32 2
13:49 2
16:08 2
16:17 2
6:46 2
22:38 2
12:16 2
9:31 2
19:17 2
7:56 2
12:52 2
2:53 2
14:36 2
22:27 2
15:49 2
13:27 2
19:58 2
15:29 2
10:34 2
12:53 2
12:13 2
18:53 2
5:25 2
12:14 2
14:52 2
10:46 2
4:57 2
15:54 2
23:14 2
14:59 2
1:44 2
1:47 2
15:38 2
15:06 2
6:31 2
23:12 2
19:14 2
7:44 2
10:12 2
11:28 2
13:59 2
0:12 2
23:18 2
15:39 2
21:27 2
23:28 2
4:19 2
7:04 2
21:41 2
20:13 2
4:28 2
23:24 2
3:55 2
19:22 2
7:46 2
14:17 2
21:26 2
16:42 2
12:12 2
9:43 2
2:52 2
17:23 2
16:18 2
4:42 2
8:06 2
6:19 2
15:01 2
15:34 2
23:44 2
5:47 2
13:17 2
15:28 2
18:36 2
10:41 2
10:24 2
17:27 2
19:08 2
1:13 2
21:32 2
2:29 2
1:52 2
4:16 2
1:24 2
17:04 2
16:53 2
20:47 2
21:59 2
20:22 2
3:48 2
0:34 2
2:41 2
12:44 2
3:14 2
18:46 2
9:39 2
6:21 2
14:19 2
4:54 2
14:42 2
0:02 2
10:51 2
3:07 2
5:42 2
3:04 2
0:26 2
2:11 2
16:01 2
20:53 2
9:23 2
8:04 2
10:53 2
16:29 2
15:03 2
22:47 2
0:17 2
7:17 2
7:34 2
10:26 2
18:11 2
18:21 2
21:18 2
10:01 2
8:29 2
15:36 2
23:11 2
2:25 2
11:57 2
2:42 2
17:56 2
20:02 2
16:56 2
6:18 2
22:49 2
6:44 2
0:49 2
13:13 2
17:09 2
23:59 2
2:51 2
17:32 2
8:33 2
13:41 2
0:16 2
23:21 2
14:48 2
9:46 2
10:22 2
12:41 2
13:12 2
19:41 2
4:43 2
4:53 2
3:12 2
0:43 2
21:19 2
16:07 2
13:06 2
21:37 2
20:36 2
14:09 2
14:23 2
23:02 2
18:54 2
23:42 2
13:26 2
5:01 2
23:08 2
0:32 2
16:03 2
1:32 2
23:41 2
8:09 2
3:54 2
11:47 2
8:26 2
22:04 2
10:16 2
23:52 2
15:23 2
16:34 2
2:08 2
8:01 2
2:36 2
5:36 2
4:32 2
0:27 2
20:54 2
15:18 2
7:07 2
22:01 2
22:26 2
18:49 2
6:36 2
20:04 2
9:32 2
22:16 2
1:17 2
12:57 2
15:02 2
1:01 2
2:05 2
1:38 2
7:03 2
16:11 2
0:38 2
13:22 2
10:07 2
13:01 2
0:08 2
5:09 2
13:38 2
23:16 2
11:09 1
2:56 1
19:43 1
1:36 1
7:13 1
16:09 1
13:21 1
3:57 1
12:36 1
22:42 1
22:03 1
3:56 1
6:17 1
11:07 1
4:46 1
13:32 1
1:16 1
2:22 1
10:42 1
4:07 1
23:51 1
9:36 1
2:34 1
20:19 1
16:52 1
11:46 1
4:31 1
10:23 1
18:33 1
7:58 1
10:29 1
3:46 1
1:11 1
1:22 1
1:09 1
22:33 1
5:19 1
11:36 1
9:34 1
12:17 1
3:59 1
23:53 1
2:04 1
13:39 1
4:47 1
10:02 1
10:03 1
6:43 1
0:23 1
21:53 1
0:03 1
2:54 1
2:18 1
6:33 1
2:38 1
0:58 1
6:08 1
14:32 1
22:07 1
13:47 1
19:37 1
5:37 1
1:59 1
6:14 1
22:34 1
5:22 1
20:58 1
1:19 1
3:08 1
0:41 1
18:01 1
9:16 1
10:21 1
7:29 1
13:16 1
1:12 1
18:59 1
2:59 1
1:57 1
11:03 1
5:34 1
5:33 1
6:24 1
0:37 1
1:49 1
1:56 1
2:23 1
11:22 1
13:19 1
4:48 1
2:17 1
0:28 1
19:29 1
7:37 1
19:02 1
2:47 1
4:39 1
4:27 1
0:53 1
4:09 1
2:48 1
3:47 1
3:36 1
11:43 1
3:09 1
4:49 1
1:08 1
6:47 1
14:49 1
0:33 1
4:17 1
16:54 1
6:09 1
12:42 1
22:29 1
23:22 1
22:31 1
0:31 1
5:11 1
5:32 1
5:39 1
7:31 1
3:28 1
8:32 1
19:34 1
5:41 1
10:13 1
12:27 1
1:26 1
6:23 1
1:02 1
22:56 1
19:44 1
22:54 1
7:24 1
6:29 1
12:24 1
8:31 1
3:26 1
22:37 1
9:24 1
23:46 1
23:27 1
1:31 1
3:52 1
7:54 1
8:51 1
9:27 1
1:14 1
5:24 1
22:09 1
3:43 1
10:59 1
5:02 1
3:19 1
6:51 1
1:58 1
5:46 1
1:37 1
15:07 1
13:33 1
12:06 1
16:19 1
11:59 1
6:03 1
22:02 1
11:08 1
20:01 1
6:04 1
15:14 1
23:32 1
21:36 1
7:57 1
19:01 1
10:14 1
12:56 1
15:17 1
10:43 1
16:39 1
3:03 1
11:39 1
12:49 1
2:28 1
9:58 1
5:29 1
11:58 1
8:18 1
11:12 1
22:59 1
6:41 1
5:03 1
7:33 1
22:21 1
6:56 1
8:56 1
17:03 1
1:03 1
7:59 1
7:43 1
15:31 1
12:19 1
15:24 1
7:49 1
6:16 1
20:43 1
20:21 1
10:06 1
6:22 1
14:43 1
15:46 1
11:11 1
7:23 1
15:53 1
5:28 1
2:44 1
10:08 1
9:33 1
19:07 1
20:56 1
23:47 1
5:21 1
7:22 1
9:02 1
3:44 1
13:29 1
4:58 1
2:33 1
2:13 1
5:52 1
11:37 1
17:01 1
23:17 1
2:58 1
6:13 1
6:26 1
7:28 1
3:24 1
11:06 1
19:27 1
4:08 1
10:09 1
13:08 1
21:14 1
0:56 1
12:32 1
12:02 1
5:06 1
15:27 1
0:48 1
11:54 1
3:38 1
4:23 1
7:53 1
14:46 1
6:01 1
21:12 1
8:37 1
2:02 1
21:43 1
3:42 1
6:48 1
23:49 1
22:48 1
1:39 1
0:44 1
22:43 1
20:08 1
5:56 1
8:53 1
13:58 1
4:06 1
2:03 1
5:26 1
11:14 1
8:03 1
0:36 1
2:14 1
23:56 1
6:42 1
8:23 1
14:18 1
Name: CRASH TIME, dtype: int64
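To follow up on the midnight question, the time strings can be parsed and the exact-midnight records counted. A sketch on made-up times in the same H:MM format (not taken from the crash file):

```python
import pandas as pd

# made-up CRASH TIME strings in the file's H:MM format
times = pd.Series(["0:00", "0:00", "15:00", "8:30", "0:00", "12:45"])

# parse into datetimes so hour and minute can be extracted reliably
parsed = pd.to_datetime(times, format="%H:%M")

# count records that fall exactly on midnight
n_midnight = (parsed.dt.hour.eq(0) & parsed.dt.minute.eq(0)).sum()
print(n_midnight)
```

Comparing the midnight count with the counts at other on-the-hour times gives a sense of whether 00:00 is being used as a default placeholder.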
For example, here are some cleaning steps:
import numpy as np

# map "Unspecified" to missing in all contributing factor columns
for i in range(1, 6):
    col = f"CONTRIBUTING FACTOR VEHICLE {i}"
    jan23[col] = jan23[col].replace("Unspecified", np.nan)

# zero coordinates are invalid; mark them as missing
jan23["LATITUDE"] = jan23["LATITUDE"].replace(0.0, np.nan)
jan23["LONGITUDE"] = jan23["LONGITUDE"].replace(0.0, np.nan)
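The same replace idea extends to the free-form vehicle type codes. A hedged sketch that upper-cases the codes and collapses the unknown variants to missing, on toy values (the set of variants to collapse is an assumption suggested by the frequency table above):

```python
import numpy as np
import pandas as pd

# toy values drawn from the kinds of entries seen in VEHICLE TYPE CODE 1
codes = pd.Series(["Sedan", "unk", "UNKNOWN", "Taxi", "Unknown", "UNK"])

# normalize case and whitespace, then map the unknown variants to missing
normalized = codes.str.strip().str.upper()
normalized = normalized.replace(["UNK", "UNKNOWN"], np.nan)
print(normalized)
```

Upper-casing first shrinks the variant list to maintain; whether other near-duplicates (e.g., abbreviations) should also be collapsed is a judgment call.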
jan23.describe()
Summary statistics after cleaning (transposed; variables as rows):

                                      count          mean          std           min           25%           50%           75%           max
ZIP CODE                        4796.000000  10893.521685   526.392428  10001.000000  10457.000000  11208.000000  11239.000000  11694.000000
LATITUDE                        6683.000000     40.722317     0.081602     40.504658     40.665350     40.713196     40.777754     40.912827
LONGITUDE                       6683.000000    -73.918590     0.085654    -74.250150    -73.965530    -73.922740    -73.867714    -73.702095
NUMBER OF PERSONS INJURED       7244.000000      0.502761     0.813641      0.000000      0.000000      0.000000      1.000000     21.000000
NUMBER OF PERSONS KILLED        7244.000000      0.002347     0.051164      0.000000      0.000000      0.000000      0.000000      2.000000
NUMBER OF PEDESTRIANS INJURED   7244.000000      0.116372     0.397927      0.000000      0.000000      0.000000      0.000000     19.000000
NUMBER OF PEDESTRIANS KILLED    7244.000000      0.000690     0.026265      0.000000      0.000000      0.000000      0.000000      1.000000
NUMBER OF CYCLIST INJURED       7244.000000      0.033269     0.180118      0.000000      0.000000      0.000000      0.000000      2.000000
NUMBER OF CYCLIST KILLED        7244.000000      0.000414     0.020348      0.000000      0.000000      0.000000      0.000000      1.000000
NUMBER OF MOTORIST INJURED      7244.000000      0.333103     0.749174      0.000000      0.000000      0.000000      0.000000      8.000000
NUMBER OF MOTORIST KILLED       7244.000000      0.001242     0.038951      0.000000      0.000000      0.000000      0.000000      2.000000
COLLISION_ID                    7.244000e+03  4.599022e+06  2.365885e+03  4.594332e+06  4.597113e+06  4.599058e+06  4.600953e+06  4.605324e+06
By the data dictionary, OFF STREET NAME is the street address of the collision site. Some records have an OFF STREET NAME but missing LATITUDE and LONGITUDE; the coordinates could be filled in by geocoding the street address with a geocoding package.
Filling the Missing Zip Codes by Reverse Geocoding
The uszipcode package bills itself as the most powerful and easy-to-use programmable zip code database in Python. It provides information about 42,724 US zip codes, with data crawled from <data.census.gov>. See its documentation for details.
from uszipcode import SearchEngine
sr = SearchEngine()
sr.by_zipcode("10001")
SimpleZipcode(zipcode='10001', zipcode_type='STANDARD', major_city='New York', post_office_city='New York, NY', common_city_list=['New York'], county='New York County', state='NY', lat=40.75, lng=-74.0, timezone='America/New_York', radius_in_miles=0.9090909090909091, area_code_list='718,917,347,646', population=21102, population_density=33959.0, land_area_in_sqmi=0.62, water_area_in_sqmi=0.0, housing_units=12476, occupied_housing_units=11031, median_home_value=650200, median_household_income=81671, bounds_west=-74.008621, bounds_east=-73.984076, bounds_north=40.759731, bounds_south=40.743451)
We can use uszipcode to reverse geocode a point from its coordinates. The returned zip code can then be used to fill in missing zip codes.
z = sr.by_coordinates(40.769993, -73.915825, radius=1)
z[0].zipcode
z[0].median_home_value
Once we have found the zip code, we can find its borough. See the complete NYC zip code list.
def nyczip2burough(zipcode):
    nzip = int(zipcode)
    if 10001 <= nzip <= 10282:
        return "MANHATTAN"
    elif 10301 <= nzip <= 10314:
        return "STATEN ISLAND"
    elif 10451 <= nzip <= 10475:
        return "BRONX"
    elif 11004 <= nzip <= 11109:
        return "QUEENS"
    elif 11351 <= nzip <= 11697:
        return "QUEENS"
    elif 11201 <= nzip <= 11256:
        return "BROOKLYN"
    else:
        return np.nan
Let’s try it out:
nyczip2burough(z[0].zipcode)
Here is a vectorized version:
import numpy as np
import pandas as pd
from typing import Union

def nyczip2borough(zips: Union[np.ndarray, pd.Series]) -> Union[np.ndarray, pd.Series]:
    is_series = isinstance(zips, pd.Series)
    vals = zips.values if is_series else zips
    condlist = [
        (vals >= 10001) & (vals <= 10282),
        (vals >= 10301) & (vals <= 10314),
        (vals >= 10451) & (vals <= 10475),
        (vals >= 11004) & (vals <= 11109),
        (vals >= 11351) & (vals <= 11697),
        (vals >= 11201) & (vals <= 11256),
    ]
    choicelist = [
        "MANHATTAN",
        "STATEN ISLAND",
        "BRONX",
        "QUEENS",
        "QUEENS",
        "BROOKLYN",
    ]
    result = np.select(condlist, choicelist, default=np.nan)
    return pd.Series(result, index=zips.index) if is_series else result
Try it out:
nyczip2borough(jan23["ZIP CODE"].dropna().head(10).values)
array(['BROOKLYN', 'QUEENS', 'MANHATTAN', 'QUEENS', 'BROOKLYN', 'QUEENS',
'QUEENS', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND'], dtype='<U32')
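A mapping like this can also fill missing BOROUGH values from ZIP CODE where the latter is present. A minimal sketch on a toy frame, using an abbreviated scalar mapping (Manhattan and Brooklyn ranges only) so the example is self-contained:

```python
import numpy as np
import pandas as pd

def zip2borough(z):
    # abbreviated mapping: Manhattan and Brooklyn zip ranges only
    if pd.isna(z):
        return np.nan
    z = int(z)
    if 10001 <= z <= 10282:
        return "MANHATTAN"
    if 11201 <= z <= 11256:
        return "BROOKLYN"
    return np.nan

# toy frame: one row with a borough, one fillable from its zip, one unfillable
toy = pd.DataFrame({
    "ZIP CODE": [10001.0, 11208.0, np.nan],
    "BOROUGH": ["MANHATTAN", np.nan, np.nan],
})

# fillna only touches the missing entries, so existing boroughs are kept
toy["BOROUGH"] = toy["BOROUGH"].fillna(toy["ZIP CODE"].map(zip2borough))
print(toy["BOROUGH"].tolist())
```

Rows with neither a borough nor a zip code stay missing; those would need the reverse geocoding step above.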
The uszipcode package provides zip-code-level databases from the US Census. Such information can be merged with the NYC crash data for further analysis.
from uszipcode import SearchEngine, SimpleZipcode
import os

# set the default database file location
db_file = os.path.join(os.getenv("HOME"), "simple_db.sqlite")
search = SearchEngine(db_file_path=db_file)
search.by_zipcode("10030")
SimpleZipcode(zipcode='10030', zipcode_type='STANDARD', major_city='New York', post_office_city='New York, NY', common_city_list=['New York'], county='New York County', state='NY', lat=40.82, lng=-73.94, timezone='America/New_York', radius_in_miles=0.5681818181818182, area_code_list='212,646,917', population=26999, population_density=96790.0, land_area_in_sqmi=0.28, water_area_in_sqmi=0.0, housing_units=12976, occupied_housing_units=11395, median_home_value=509000, median_household_income=31925, bounds_west=-73.948677, bounds_east=-73.936232, bounds_north=40.824032, bounds_south=40.812791)
The SQLite database of US zip codes is stored in $HOME/.uszipcode by default (above we pointed it at $HOME/simple_db.sqlite instead). It can be imported as a pandas DataFrame.
import sqlite3
import pandas as pd
# change db_file to your own path after installing uszipcode
con = sqlite3.connect(db_file)
zipdf = pd.read_sql_query("SELECT * FROM simple_zipcode", con)
zipdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42724 entries, 0 to 42723
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 zipcode 42724 non-null object
1 zipcode_type 42724 non-null object
2 major_city 42724 non-null object
3 post_office_city 33104 non-null object
4 common_city_list 41877 non-null object
5 county 42724 non-null object
6 state 42724 non-null object
7 lat 42724 non-null float64
8 lng 42724 non-null float64
9 timezone 42724 non-null object
10 radius_in_miles 33104 non-null float64
11 area_code_list 42724 non-null object
12 population 31448 non-null float64
13 population_density 31448 non-null float64
14 land_area_in_sqmi 31448 non-null float64
15 water_area_in_sqmi 31448 non-null float64
16 housing_units 31448 non-null float64
17 occupied_housing_units 31448 non-null float64
18 median_home_value 31448 non-null float64
19 median_household_income 31448 non-null float64
20 bounds_west 33104 non-null float64
21 bounds_east 33104 non-null float64
22 bounds_north 33104 non-null float64
23 bounds_south 33104 non-null float64
dtypes: float64(15), object(9)
memory usage: 7.8+ MB
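The read_sql_query pattern above works against any SQLite file. A self-contained sketch with an in-memory database standing in for the uszipcode file (the table and column names follow the schema shown in the info() output; the rows are made up):

```python
import sqlite3
import pandas as pd

# in-memory stand-in for the uszipcode database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE simple_zipcode (zipcode TEXT, state TEXT)")
con.executemany("INSERT INTO simple_zipcode VALUES (?, ?)",
                [("10001", "NY"), ("06269", "CT")])
con.commit()

# filtering in SQL avoids loading all 42,724 rows when only NY is needed
nydf = pd.read_sql_query("SELECT * FROM simple_zipcode WHERE state = 'NY'", con)
print(len(nydf))
```

Pushing the state filter into the query is cheaper than reading the whole table and subsetting in pandas.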
The zip code dataframe can be merged with the crash dataframe.
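A sketch of such a merge on toy frames: the crash side carries ZIP CODE as a float, the zip code side a zipcode string, so the key must be aligned first (column names follow the two data sets; the toy values, including the 11208 income, are made up):

```python
import pandas as pd

crash = pd.DataFrame({
    "COLLISION_ID": [1, 2],
    "ZIP CODE": [10001.0, 11208.0],
})
zipdf = pd.DataFrame({
    "zipcode": ["10001", "11208"],
    "median_household_income": [81671, 40000],  # toy values
})

# align the key types: float zip -> int -> zero-padded string
crash["zipcode"] = crash["ZIP CODE"].astype(int).astype(str).str.zfill(5)

# left merge keeps every crash row even when a zip code has no census data
merged = crash.merge(zipdf, on="zipcode", how="left")
print(merged[["COLLISION_ID", "median_household_income"]])
```

A left merge is the safer default here: crashes with unmatched or missing zip codes survive with NaN census fields rather than being dropped.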
Map the Crash Sites
We can do this with the gmplot package; see the instructions in this tutorial.
import gmplot
import numpy as np
# prepare the geocodes, dropping rows where either coordinate is missing
ll = jan23[["LATITUDE", "LONGITUDE"]].dropna()
latitude = ll["LATITUDE"].values
longitude = ll["LONGITUDE"].values
# center of the map and zoom level
gmap = gmplot.GoogleMapPlotter(40.769737, -73.91244, 14)
# plot heatmap
gmap.heatmap(latitude, longitude)
gmap.scatter(latitude, longitude, c='r', marker=True)
# gmap.scatter(latitude, longitude, '#FF0000', size=50, marker=False)
# your Google API key
# gmap.apikey = "put your key here"
# save it to html
gmap.draw("nycrashmap.html")