7  Descriptive Statistics

This chapter was prepared by Courtney Jones.

In basic terms, descriptive statistics are how we describe the data. Descriptive Statistics is extremely important to exploratory data analysis, as it allows us to describe and summarize the data to put it into context and visualize it. If we just were looking at a bunch of raw data, what use is that to us? We use terms to describe center, spread, correlation, counts, and more to to give us context to the raw data we have.

7.1 Different Python methods and which to use

7.1.1 Explanation of Methods

There are many methods to perform descriptive statistics operations. After briefly describing them, we will perform example operations to put into context how they work.

Python’s built-in functions: These built-in operations are in the Python library, where we would not have to import any packages. There are not many operations already built-in, and it cannot compute large datasets well.

Statistics package: Includes some additional functions for computation. NumPy is more compatible for using opertions than this package.

NumPy: NumPy is a very common package to import. It is beneficial when working with single and multi dimensional arrays.

Pandas: Pandas is based off of the same numerical computing as NumPy and works with series and dataframes.

7.1.2 Example: mean

Below is an example of computing mean with all of the above methods to express their differences.

Just for this example, I will create my own datasets (as the NYC data does not portray the differences as easily).

import math
import numpy as np
import pandas as pd

x = [2, 3.5, 7, 4]
xnan = [2, 3.5, 7, 4, math.nan]
y = np.array(xnan)
z = pd.Series(xnan)

Python’s built-in functions

Here, we create the formula for mean by only using the built-in Python operations.

mean = sum(x) / len(x) # sum and length are built-in, whereas mean is not
mean # uses just the list
4.125
mean_xnan = sum(xnan) / len(xnan)
mean_xnan
nan

This method cannot skip NaN’s in the list, so the user would have to find a way to eliminate all NaN’s from their list before computing.

Statistics Package

import statistics
statistics.mean(x) # uses just the list, x, rather than the array or series
4.125
statistics.mean(xnan)
nan

Similarly, will just output “nan” if there are any nan’s in the list.

NumPy

import numpy as np
np.mean(y) # uses the array y = np.array(xnan)
nan

Notice that nan occurs. To avoid this, we can use nanmean() instead.

np.nanmean(y)
4.125

Pandas

import pandas as pd
z.mean() # uses the series z = pd.Series(xnan)
4.125

nan does not occur due to the default parameter in the pandas mean skipna = True.

z.mean(skipna = False)
nan

7.1.3 So what do we use?

As shown above, pandas is nice as it automatically ignores nan by default when computing numeric operations, rather than just outputting nan. This is faster, cleaner, and preferrable to me when I am calculating operations. So, outside of the context of this class, I prefer using pandas if I had the choice.

Moreover, in the context of this class, the data we will be analyzing is typically in the form of a dataframe. Pandas will typically be the best option when working with a dataframe, so it is best to continue using pandas.

7.2 Data

The data I will pull from is the January 2023 NYC Crash Data (cleaned).

jan23 = pd.read_csv("data/nyc_crashes_202301_cleaned.csv")

jan23 = jan23.loc[:,['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'ZIP CODE', 'LATITUDE',
       'LONGITUDE', 'LOCATION', 'ON STREET NAME', 'CROSS STREET NAME',
       'OFF STREET NAME', 'NUMBER OF PERSONS INJURED',
       'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
       'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
       'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
       'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1',
       'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
       'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5',
       'COLLISION_ID', 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2',
       'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5']]

7.2.1 Isolating Parts of the Dataframe

Descriptive statistics do not make sense in context with all aspects of the dataframe we will be using. Most of the descriptive statistics shown below will only make sense with continuous variables. Thus, I will briefly show how to isolate certain aspects of the dataframe, so that we can do so later.

7.2.1.1 Columns

jan23["BOROUGH"] # isolating BOROUGH column
0        BROOKLYN
1          QUEENS
2       MANHATTAN
3          QUEENS
4           BRONX
          ...    
7239     BROOKLYN
7240     BROOKLYN
7241     BROOKLYN
7242    MANHATTAN
7243       QUEENS
Name: BOROUGH, Length: 7244, dtype: object
type(jan23["BOROUGH"])
pandas.core.series.Series

Notice that the individual columns are classified as series. Pandas can be used on dataframes and series.

jan23["BOROUGH"].value_counts(dropna = False) # categorical / discrete
# "dropna = True" is the default and drops the missing (NaN) values
BROOKLYN         2386
QUEENS           1980
MANHATTAN        1290
BRONX            1179
STATEN ISLAND     384
NaN                25
Name: BOROUGH, dtype: int64

value_counts() does not work on dataframes, as it is a series operation. Moreover, it allows us to explore individual columns in more detail.

jan23["NUMBER OF PEDESTRIANS KILLED"].value_counts(dropna = False) # numeric / continous
# works with both categorical and numeric values
0    7239
1       5
Name: NUMBER OF PEDESTRIANS KILLED, dtype: int64
jan23[["BOROUGH", "NUMBER OF PEDESTRIANS KILLED"]] # isolating multiple columns
BOROUGH NUMBER OF PEDESTRIANS KILLED
0 BROOKLYN 0
1 QUEENS 0
2 MANHATTAN 0
3 QUEENS 0
4 BRONX 0
... ... ...
7239 BROOKLYN 0
7240 BROOKLYN 0
7241 BROOKLYN 0
7242 MANHATTAN 0
7243 QUEENS 0

7244 rows × 2 columns

7.2.1.2 Rows

Descriptive Statistics on rows are not very beneficial, as comparing the variables in rows of this NYC dataframe do not make much sense. Often, looking at rows is not very ideal, and the outputs are not always useful. However, here are a few ways that rows can be isolated from the dataframe if necessary.

jan23.iloc[6543:6547]
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 COLLISION_ID VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
6543 1/28/23 5:25 BROOKLYN 11206.0 40.701077 -73.94043 (40.701077, -73.94043) HUMBOLDT STREET FLUSHING AVENUE NaN ... Other Vehicular NaN NaN NaN 4602244 Station Wagon/Sport Utility Vehicle NaN NaN NaN NaN
6544 1/28/23 10:55 STATEN ISLAND 10301.0 40.640907 -74.08134 (40.640907, -74.08134) NaN NaN 25 SHERMAN AVENUE ... Unspecified NaN NaN NaN 4602219 Sedan Sedan NaN NaN NaN
6545 1/28/23 0:09 QUEENS 11372.0 40.755030 -73.88242 (40.75503, -73.88242) NaN NaN 33-11 85 STREET ... Unspecified NaN NaN NaN 4602365 Sedan Box Truck NaN NaN NaN
6546 1/28/23 13:00 BROOKLYN 11220.0 40.644955 -74.01611 (40.644955, -74.01611) NaN NaN 325 54 STREET ... Unspecified NaN NaN NaN 4602449 Sedan Station Wagon/Sport Utility Vehicle NaN NaN NaN

4 rows × 29 columns

jan23[jan23["CRASH DATE"] == "01/01/2023"]
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 COLLISION_ID VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5

0 rows × 29 columns

jan23[jan23["COLLISION_ID"] == 4594599]
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 COLLISION_ID VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
1 1/1/23 8:04 QUEENS 11430.0 40.659508 -73.773687 (40.6595077,-73.7736867) NASSAU EXPRESSWAY NaN NaN ... Unspecified NaN NaN NaN 4594599 Sedan Sedan NaN NaN NaN

1 rows × 29 columns

type(jan23[jan23["COLLISION_ID"] == 4594599])
pandas.core.frame.DataFrame

7.2.2 Data Isolated

Only the continuous variables will make sense for most of the descriptive statistics below, so we will use the following dataframe of just the continous variables, when applicable.

cjan23 = jan23[["NUMBER OF PEDESTRIANS INJURED", "NUMBER OF PEDESTRIANS KILLED", "NUMBER OF CYCLIST INJURED", 
                "NUMBER OF CYCLIST KILLED", "NUMBER OF MOTORIST INJURED", "NUMBER OF MOTORIST KILLED"]]

7.3 Common Operations

7.3.1 Descriptive Statistics with Pandas

7.3.2 center

  • mean(): mean
  • median(): median
  • mode(): mode

7.3.3 spread

  • min(): minimum
  • max(): maximum
  • std(): standard deviation
  • var(): variance
  • quantile(): quantiles

7.3.4 shape

  • skew(): adjusted Fisher-Pearson standardized moment

7.3.5 correlation (deals with two variables)

  • corr(): correlation coefficient
  • cov(): covariance

7.3.6 other important operations

  • count(): total count
  • sum(): summation
  • value_counts(): individual counts
  • describe(): describe the data with many descriptive statistics

Below I worked on a few specific descriptive statistics operators to give a general idea of how the operators work. If an operator is not used below, it is listed under where it would be used similarly.

7.3.7 Important operators

Sum

sum(axis = None, skipna = False). Below we focus on the usage of axis.

cjan23.sum() # or cjan23.sum(0) or cjan23.sum(None)
# takes the indvidual sums of the numeric columns
NUMBER OF PEDESTRIANS INJURED     843
NUMBER OF PEDESTRIANS KILLED        5
NUMBER OF CYCLIST INJURED         241
NUMBER OF CYCLIST KILLED            3
NUMBER OF MOTORIST INJURED       2413
NUMBER OF MOTORIST KILLED           9
dtype: int64
# compute "axis = 1", rows
cjan23.sum(1)
0       1
1       1
2       0
3       2
4       0
       ..
7239    0
7240    1
7241    0
7242    0
7243    2
Length: 7244, dtype: int64

These functions: mean(), median(), mode(), min(), max() std(), var(), quantile(), skew(), and count() are used similarly, where their default will operate on the columns, and a specification of axis = 1 will operate on the rows. Any of these operators not shown below, give a similar looking output as sum() does above.

7.3.8 Center

Mode

jan23.mode() # lists the most frequent value
# mode is relevant for discrete and continuous variables
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 COLLISION_ID VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
0 1/13/23 0:00 BROOKLYN 11207.0 40.606566 -74.044983 (0.0, 0.0) BELT PARKWAY 3 AVENUE 49-21 METROPOLITAN AVENUE ... Unspecified Unspecified Unspecified Unspecified 4594332 Sedan Sedan Sedan Sedan Sedan
1 NaN NaN NaN NaN NaN NaN NaN NaN BROADWAY 560 WINTHROP STREET ... NaN NaN NaN NaN 4594347 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN 985 RICHMOND AVENUE ... NaN NaN NaN NaN 4594350 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN ATLANTIC AVENUE ... NaN NaN NaN NaN 4594351 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 4594359 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7239 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 4605213 NaN NaN NaN NaN NaN
7240 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 4605214 NaN NaN NaN NaN NaN
7241 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 4605246 NaN NaN NaN NaN NaN
7242 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 4605289 NaN NaN NaN NaN NaN
7243 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 4605324 NaN NaN NaN NaN NaN

7244 rows × 29 columns

The mode outputs the most frequent value. If there are multiple values that are the most frequent, then all of those values will be outputted. For example, “OFF STREET NAME” has four values that are the most frequent. Thus, four values are outputted. Since mode() was outputted in the format of a dataframe, the NaN values just represent empty spaces, where other columns have a value in that row. See: “COLLISION_ID”. There are 7244 rows because there are 7244 unique collision ID’s, so they all are the most frequent value (one occurrence of each). This explains all the empty spaces with all of the other variables, since the dataframe format needed a filler to still output a dataframe.

7.3.9 Spread

Quantile

cjan23.quantile([.65, .9]) # can specify specific quantiles
NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED
0.65 0.0 0.0 0.0 0.0 0.0 0.0
0.90 1.0 0.0 0.0 0.0 1.0 0.0

7.3.10 Shape

Skew

cjan23.skew() # negative skew means left skewness, positive means right
NUMBER OF PEDESTRIANS INJURED    16.357461
NUMBER OF PEDESTRIANS KILLED     38.031561
NUMBER OF CYCLIST INJURED         5.276883
NUMBER OF CYCLIST KILLED         49.118899
NUMBER OF MOTORIST INJURED        3.230788
NUMBER OF MOTORIST KILLED        34.958763
dtype: float64

7.3.11 Correlation

Correlation Coefficient

jan23["NUMBER OF PEDESTRIANS INJURED"].corr(jan23["NUMBER OF PEDESTRIANS KILLED"])  
# the correlation coefficient of these two variables
-0.007686375916216244

cov() would be calculated in the same way.

7.3.12 Describe

Above calculates each chosen operation indivudally. Is there one operation that can show multiple descriptive statistics at once?

Just for the purpose of showing how to make changes to the default function where the character values are needed to portray, I will be using all variables (dataframe jan23). Later, I will use cjan23 when delving more into editing the describe function, as the descriptie statistics automatically count numeric values as continuous (which is not true for many of these numeric variables).

jan23.describe() # default omits character and string values
ZIP CODE LATITUDE LONGITUDE NUMBER OF PERSONS INJURED NUMBER OF PERSONS KILLED NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED COLLISION_ID
count 7240.000000 7240.000000 7240.000000 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000 7.244000e+03
mean 10876.268785 40.723872 -73.917446 0.502761 0.002347 0.116372 0.000690 0.033269 0.000414 0.333103 0.001242 4.599022e+06
std 532.816111 0.087734 0.088494 0.813641 0.051164 0.397927 0.026265 0.180118 0.020348 0.749174 0.038951 2.365885e+03
min 10001.000000 40.504658 -74.250150 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.594332e+06
25% 10453.000000 40.665374 -73.966253 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.597113e+06
50% 11208.000000 40.714790 -73.922485 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.599058e+06
75% 11239.000000 40.784210 -73.865596 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.600953e+06
max 12134.000000 43.299428 -73.051978 21.000000 2.000000 19.000000 1.000000 2.000000 1.000000 8.000000 2.000000 4.605324e+06

It may be useful to edit the describe() feature to show moreso the values that we wish to see. The default .describe() output is shown above.

The default describe() input:

DataFrame.describe(percentiles = None, include = None, exclude = None, datetime_is_numeric = False)

7.3.12.1 Changing the default

# changing percentile default
jan23.describe([.2, .45, .9])
ZIP CODE LATITUDE LONGITUDE NUMBER OF PERSONS INJURED NUMBER OF PERSONS KILLED NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED COLLISION_ID
count 7240.000000 7240.000000 7240.000000 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000 7.244000e+03
mean 10876.268785 40.723872 -73.917446 0.502761 0.002347 0.116372 0.000690 0.033269 0.000414 0.333103 0.001242 4.599022e+06
std 532.816111 0.087734 0.088494 0.813641 0.051164 0.397927 0.026265 0.180118 0.020348 0.749174 0.038951 2.365885e+03
min 10001.000000 40.504658 -74.250150 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.594332e+06
20% 10305.000000 40.651722 -73.978789 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.596668e+06
45% 11204.000000 40.704138 -73.931070 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.598665e+06
50% 11208.000000 40.714790 -73.922485 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.599058e+06
90% 11415.000000 40.843662 -73.802865 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 4.602072e+06
max 12134.000000 43.299428 -73.051978 21.000000 2.000000 19.000000 1.000000 2.000000 1.000000 8.000000 2.000000 4.605324e+06

Replaces the default .25 and .75, but keeps the median (.5).

# including all columns, rather than just "number" default
jan23.describe(include = 'all')
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 COLLISION_ID VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
count 7244 7244 7219 7240.000000 7240.000000 7240.000000 7244 5341 3453 1903 ... 5378 689 191 62 7.244000e+03 7108 4553 634 179 59
unique 31 1245 5 NaN NaN NaN 6140 1580 1562 1877 ... 30 13 5 3 NaN 67 81 17 11 5
top 1/13/23 0:00 BROOKLYN NaN NaN NaN (0.0, 0.0) BELT PARKWAY BROADWAY 560 WINTHROP STREET ... Unspecified Unspecified Unspecified Unspecified NaN Sedan Sedan Sedan Sedan Sedan
freq 294 116 2386 NaN NaN NaN 81 124 37 3 ... 4550 637 183 60 NaN 3478 1969 327 93 29
mean NaN NaN NaN 10876.268785 40.723872 -73.917446 NaN NaN NaN NaN ... NaN NaN NaN NaN 4.599022e+06 NaN NaN NaN NaN NaN
std NaN NaN NaN 532.816111 0.087734 0.088494 NaN NaN NaN NaN ... NaN NaN NaN NaN 2.365885e+03 NaN NaN NaN NaN NaN
min NaN NaN NaN 10001.000000 40.504658 -74.250150 NaN NaN NaN NaN ... NaN NaN NaN NaN 4.594332e+06 NaN NaN NaN NaN NaN
25% NaN NaN NaN 10453.000000 40.665374 -73.966253 NaN NaN NaN NaN ... NaN NaN NaN NaN 4.597113e+06 NaN NaN NaN NaN NaN
50% NaN NaN NaN 11208.000000 40.714790 -73.922485 NaN NaN NaN NaN ... NaN NaN NaN NaN 4.599058e+06 NaN NaN NaN NaN NaN
75% NaN NaN NaN 11239.000000 40.784210 -73.865596 NaN NaN NaN NaN ... NaN NaN NaN NaN 4.600953e+06 NaN NaN NaN NaN NaN
max NaN NaN NaN 12134.000000 43.299428 -73.051978 NaN NaN NaN NaN ... NaN NaN NaN NaN 4.605324e+06 NaN NaN NaN NaN NaN

11 rows × 29 columns

My only con with this, is that most of these numerical values are not included in the “unique”, “top”, and “frequency” rows, even though in context they are discrete, and would make sense to be included in these.

# excluding numerical columns
# gives just "object" i.e. categorical
jan23.describe(exclude = 'number')
CRASH DATE CRASH TIME BOROUGH LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME CONTRIBUTING FACTOR VEHICLE 1 CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
count 7244 7244 7219 7244 5341 3453 1903 7189 5378 689 191 62 7108 4553 634 179 59
unique 31 1245 5 6140 1580 1562 1877 48 30 13 5 3 67 81 17 11 5
top 1/13/23 0:00 BROOKLYN (0.0, 0.0) BELT PARKWAY BROADWAY 560 WINTHROP STREET Driver Inattention/Distraction Unspecified Unspecified Unspecified Unspecified Sedan Sedan Sedan Sedan Sedan
freq 294 116 2386 81 124 37 3 1776 4550 637 183 60 3478 1969 327 93 29

Interesting note: rather than outputting as an empty set, the function decided to use the columns that are typically omitted instead, which is the same as jan23.describe("include = object").

# making datetime numeric
jan23.describe(include = 'all', datetime_is_numeric = True)
CRASH DATE CRASH TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 COLLISION_ID VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
count 7244 7244 7219 7240.000000 7240.000000 7240.000000 7244 5341 3453 1903 ... 5378 689 191 62 7.244000e+03 7108 4553 634 179 59
unique 31 1245 5 NaN NaN NaN 6140 1580 1562 1877 ... 30 13 5 3 NaN 67 81 17 11 5
top 1/13/23 0:00 BROOKLYN NaN NaN NaN (0.0, 0.0) BELT PARKWAY BROADWAY 560 WINTHROP STREET ... Unspecified Unspecified Unspecified Unspecified NaN Sedan Sedan Sedan Sedan Sedan
freq 294 116 2386 NaN NaN NaN 81 124 37 3 ... 4550 637 183 60 NaN 3478 1969 327 93 29
mean NaN NaN NaN 10876.268785 40.723872 -73.917446 NaN NaN NaN NaN ... NaN NaN NaN NaN 4.599022e+06 NaN NaN NaN NaN NaN
std NaN NaN NaN 532.816111 0.087734 0.088494 NaN NaN NaN NaN ... NaN NaN NaN NaN 2.365885e+03 NaN NaN NaN NaN NaN
min NaN NaN NaN 10001.000000 40.504658 -74.250150 NaN NaN NaN NaN ... NaN NaN NaN NaN 4.594332e+06 NaN NaN NaN NaN NaN
25% NaN NaN NaN 10453.000000 40.665374 -73.966253 NaN NaN NaN NaN ... NaN NaN NaN NaN 4.597113e+06 NaN NaN NaN NaN NaN
50% NaN NaN NaN 11208.000000 40.714790 -73.922485 NaN NaN NaN NaN ... NaN NaN NaN NaN 4.599058e+06 NaN NaN NaN NaN NaN
75% NaN NaN NaN 11239.000000 40.784210 -73.865596 NaN NaN NaN NaN ... NaN NaN NaN NaN 4.600953e+06 NaN NaN NaN NaN NaN
max NaN NaN NaN 12134.000000 43.299428 -73.051978 NaN NaN NaN NaN ... NaN NaN NaN NaN 4.605324e+06 NaN NaN NaN NaN NaN

11 rows × 29 columns

Including datetime as numeric works if the date times are inputted in a different style (typically, YYYY-MM-DD 00:00:00.000000), which our data is not. Thus, as we see, the date and time is still treated as an object.

7.3.12.2 Changing rows with describe()

Above were specific ways to change the function that were already built into the function itself. What if we want to add more rows describing another descriptive statistic? I will be using just the discrete values for the following examples.

# adding sum to the dataframe
cjan23.describe().append(pd.Series(cjan23.sum(), name = 'sum'))
/var/folders/cq/5ysgnwfn7c3g0h46xyzvpj800000gn/T/ipykernel_34745/3080308836.py:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  cjan23.describe().append(pd.Series(cjan23.sum(), name = 'sum'))
NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED
count 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000
mean 0.116372 0.000690 0.033269 0.000414 0.333103 0.001242
std 0.397927 0.026265 0.180118 0.020348 0.749174 0.038951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 19.000000 1.000000 2.000000 1.000000 8.000000 2.000000
sum 843.000000 5.000000 241.000000 3.000000 2413.000000 9.000000
# adding a row counting nan's
cjan23.describe().append(pd.Series(cjan23.isna().sum(), name = 'nans'))
/var/folders/cq/5ysgnwfn7c3g0h46xyzvpj800000gn/T/ipykernel_34745/4231082202.py:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  cjan23.describe().append(pd.Series(cjan23.isna().sum(), name = 'nans'))
NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED
count 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000
mean 0.116372 0.000690 0.033269 0.000414 0.333103 0.001242
std 0.397927 0.026265 0.180118 0.020348 0.749174 0.038951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 19.000000 1.000000 2.000000 1.000000 8.000000 2.000000
nans 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
# removing a row
cjan23.describe().drop(labels = "max", axis = 0)
NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED
count 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000
mean 0.116372 0.000690 0.033269 0.000414 0.333103 0.001242
std 0.397927 0.026265 0.180118 0.020348 0.749174 0.038951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

7.3.12.3 Changing columns with describe()

# removing a column
cjan23.describe().drop(columns = "NUMBER OF CYCLIST INJURED")
NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED
count 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000
mean 0.116372 0.000690 0.000414 0.333103 0.001242
std 0.397927 0.026265 0.020348 0.749174 0.038951
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000
max 19.000000 1.000000 1.000000 8.000000 2.000000
# note that the manual changes made above are not permanent unless the variable is reassigned
cjan23.describe()
NUMBER OF PEDESTRIANS INJURED NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED
count 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000 7244.000000
mean 0.116372 0.000690 0.033269 0.000414 0.333103 0.001242
std 0.397927 0.026265 0.180118 0.020348 0.749174 0.038951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 19.000000 1.000000 2.000000 1.000000 8.000000 2.000000

All of the above ways were manipulating the describe() operator to potentially make visualizing descriptive statistics easier, by putting certain desirable traits in or out of the table.

Describe on Individual Columns

jan23["BOROUGH"].describe() # character and discrete
count         7219
unique           5
top       BROOKLYN
freq          2386
Name: BOROUGH, dtype: object
jan23["NUMBER OF PEDESTRIANS KILLED"].describe() # numeric and continuous
count    7244.000000
mean        0.000690
std         0.026265
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: NUMBER OF PEDESTRIANS KILLED, dtype: float64

Note that numeric and discrete would still be treated as continuous, so descriptive statistics are not very beneficial fot these variables. Regardless, descriptive statistics are typically more of interest to us if they are continuous.

7.4 Conclusion

In this presentation we looked into different methods of performing descriptive statistics, and saw how to use many of these operators. There are many ways to compute descriptive statistics, and we explored how to do so with pandas. We then focused on how to maniputlate the describe() function in many ways that may help us to visualize the data much easier. Afterwards, we looked at isolating columns and rows to perform descriptive statistics on. Analyzing the descriptive statistics is extremely important to understaning data. Another way to possibly put data into a more digestible form is to visualize it, which other presentations touch on.