7 Descriptive Statistics

This chapter was prepared by Courtney Jones.

In basic terms, descriptive statistics are how we describe the data. Descriptive statistics are central to exploratory data analysis: they let us describe and summarize the data, put it into context, and visualize it. A pile of raw data on its own is of little use to us. We use measures of center, spread, correlation, counts, and more to give context to the raw data we have.

7.1 Different Python methods and which to use

7.1.1 Explanation of Methods

There are many methods to perform descriptive statistics operations. After briefly describing them, we will perform example operations to put into context how they work.

Python’s built-in functions: These operations are part of Python itself, so we do not have to import any packages. Few statistical operations are built in, and they do not handle large datasets well.

Statistics package: Part of the standard library; it includes some additional functions for computation. NumPy is generally better suited to numerical operations than this package.

NumPy: NumPy is a very common package to import. It is beneficial when working with single and multidimensional arrays.

Pandas: Pandas is built on top of NumPy and works with Series and DataFrames.

7.1.2 Example: mean

Below is an example of computing mean with all of the above methods to express their differences.

Just for this example, I will create my own datasets (as the NYC data does not show the differences as clearly).

import math
import numpy as np
import pandas as pd

x = [2, 3.5, 7, 4]
xnan = [2, 3.5, 7, 4, math.nan]
y = np.array(xnan)
z = pd.Series(xnan)

Python’s built-in functions

Here, we create the formula for mean by only using the built-in Python operations.

mean = sum(x) / len(x) # sum and len are built-in, whereas mean is not
mean # uses just the list

4.125

mean_xnan = sum(xnan) / len(xnan)
mean_xnan

nan

This method cannot skip NaNs in the list, so the user has to remove all NaNs from the list before computing.
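One way to do this with only the standard library is to filter the list with math.isnan() before averaging. This is a minimal sketch that reuses the xnan list from above; the names clean and mean_clean are my own.

```python
import math

xnan = [2, 3.5, 7, 4, math.nan]

# keep only the entries that are not NaN
clean = [v for v in xnan if not math.isnan(v)]

# same formula as before, now on the filtered list
mean_clean = sum(clean) / len(clean)
print(mean_clean)  # 4.125
```

The filtering step is exactly what np.nanmean() and the pandas default do for us.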

Statistics Package

import statistics
statistics.mean(x) # uses just the list, x, rather than the array or series

4.125

statistics.mean(xnan)

nan

Similarly, statistics.mean() will just output nan if there are any NaNs in the list.

NumPy

import numpy as np
np.mean(y) # uses the array y = np.array(xnan)

nan

Notice that nan occurs. To avoid this, we can use nanmean() instead.

np.nanmean(y)

4.125

Pandas

import pandas as pd
z.mean() # uses the series z = pd.Series(xnan)

4.125

nan does not occur because the pandas mean() uses skipna = True by default.

z.mean(skipna = False)

nan

7.1.3 So what do we use?

As shown above, pandas is convenient because it ignores nan by default when computing numeric operations, rather than just outputting nan. This is quicker to work with, cleaner, and preferable to me when I am running calculations. So, even outside of the context of this class, I prefer using pandas when I have the choice.

Moreover, in the context of this class, the data we will be analyzing is typically in the form of a dataframe. Pandas will typically be the best option when working with a dataframe, so it is best to continue using pandas.
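The same default carries over from Series to DataFrames. Below is a small sketch on a made-up dataframe (toy, with hypothetical columns a and b, not the NYC data) showing that mean() skips NaN column by column unless skipna = False is given.

```python
import numpy as np
import pandas as pd

# hypothetical toy dataframe, not the NYC crash data
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, np.nan, 8.0]})

col_means = toy.mean()                     # skipna = True by default
col_means_strict = toy.mean(skipna=False)  # NaN propagates to the result

print(col_means)         # a: 1.5, b: 6.0
print(col_means_strict)  # a: NaN, b: NaN
```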

7.2 Data

The data I will pull from is the January 2023 NYC Crash Data (cleaned).

jan23 = pd.read_csv("data/nyc_crashes_202301_cleaned.csv")

jan23 = jan23.loc[:,['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'ZIP CODE', 'LATITUDE',
       'LONGITUDE', 'LOCATION', 'ON STREET NAME', 'CROSS STREET NAME',
       'OFF STREET NAME', 'NUMBER OF PERSONS INJURED',
       'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
       'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
       'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
       'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1',
       'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
       'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5',
       'COLLISION_ID', 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2',
       'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5']]

7.2.1 Isolating Parts of the Dataframe

Descriptive statistics do not make sense for every part of the dataframe we will be using. Most of the descriptive statistics shown below only make sense for continuous variables. Thus, I will briefly show how to isolate parts of the dataframe, so that we can do so later.

7.2.1.1 Columns

jan23["BOROUGH"] # isolating BOROUGH column

0        BROOKLYN
1          QUEENS
2       MANHATTAN
3          QUEENS
4           BRONX
          ...    
7239     BROOKLYN
7240     BROOKLYN
7241     BROOKLYN
7242    MANHATTAN
7243       QUEENS
Name: BOROUGH, Length: 7244, dtype: object

type(jan23["BOROUGH"])

pandas.core.series.Series

Notice that individual columns are Series. Pandas methods can be used on both DataFrames and Series.

jan23["BOROUGH"].value_counts(dropna = False) # categorical / discrete
# "dropna = True" is the default and drops the missing (NaN) values

BROOKLYN         2386
QUEENS           1980
MANHATTAN        1290
BRONX            1179
STATEN ISLAND     384
NaN                25
Name: BOROUGH, dtype: int64

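Here value_counts() is used as a Series operation, which allows us to explore individual columns in more detail. (Recent versions of pandas also provide DataFrame.value_counts(), but it counts unique rows rather than the values within one column.)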

jan23["NUMBER OF PEDESTRIANS KILLED"].value_counts(dropna = False) # numeric / continuous
# works with both categorical and numeric values

0    7239
1       5
Name: NUMBER OF PEDESTRIANS KILLED, dtype: int64
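If proportions are more useful than raw counts, value_counts() also accepts normalize = True. The sketch below uses a made-up Series (borough) rather than the real column, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# made-up values standing in for the BOROUGH column
borough = pd.Series(["BROOKLYN", "QUEENS", "BROOKLYN", np.nan])

counts = borough.value_counts(dropna=False)                  # raw counts, NaN included
shares = borough.value_counts(normalize=True, dropna=False)  # fractions that sum to 1

print(counts)
print(shares)
```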

jan23[["BOROUGH", "NUMBER OF PEDESTRIANS KILLED"]] # isolating multiple columns

	BOROUGH	NUMBER OF PEDESTRIANS KILLED
0	BROOKLYN	0
1	QUEENS	0
2	MANHATTAN	0
3	QUEENS	0
4	BRONX	0
...	...	...
7239	BROOKLYN	0
7240	BROOKLYN	0
7241	BROOKLYN	0
7242	MANHATTAN	0
7243	QUEENS	0

7244 rows × 2 columns

7.2.1.2 Rows

Descriptive statistics on rows are not very beneficial, as comparing the variables within a row of this NYC dataframe does not make much sense. Looking at rows is often not ideal, and the outputs are not always useful. However, here are a few ways that rows can be isolated from the dataframe if necessary.

jan23.iloc[6543:6547]

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME	CROSS STREET NAME	OFF STREET NAME	...	CONTRIBUTING FACTOR VEHICLE 2	CONTRIBUTING FACTOR VEHICLE 3	CONTRIBUTING FACTOR VEHICLE 4	CONTRIBUTING FACTOR VEHICLE 5	COLLISION_ID	VEHICLE TYPE CODE 1	VEHICLE TYPE CODE 2	VEHICLE TYPE CODE 3	VEHICLE TYPE CODE 4	VEHICLE TYPE CODE 5
6543	1/28/23	5:25	BROOKLYN	11206.0	40.701077	-73.94043	(40.701077, -73.94043)	HUMBOLDT STREET	FLUSHING AVENUE	NaN	...	Other Vehicular	NaN	NaN	NaN	4602244	Station Wagon/Sport Utility Vehicle	NaN	NaN	NaN	NaN
6544	1/28/23	10:55	STATEN ISLAND	10301.0	40.640907	-74.08134	(40.640907, -74.08134)	NaN	NaN	25 SHERMAN AVENUE	...	Unspecified	NaN	NaN	NaN	4602219	Sedan	Sedan	NaN	NaN	NaN
6545	1/28/23	0:09	QUEENS	11372.0	40.755030	-73.88242	(40.75503, -73.88242)	NaN	NaN	33-11 85 STREET	...	Unspecified	NaN	NaN	NaN	4602365	Sedan	Box Truck	NaN	NaN	NaN
6546	1/28/23	13:00	BROOKLYN	11220.0	40.644955	-74.01611	(40.644955, -74.01611)	NaN	NaN	325 54 STREET	...	Unspecified	NaN	NaN	NaN	4602449	Sedan	Station Wagon/Sport Utility Vehicle	NaN	NaN	NaN

4 rows × 29 columns

jan23[jan23["CRASH DATE"] == "01/01/2023"] # no match: dates in this file are strings such as "1/1/23"

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME	CROSS STREET NAME	OFF STREET NAME	...	CONTRIBUTING FACTOR VEHICLE 2	CONTRIBUTING FACTOR VEHICLE 3	CONTRIBUTING FACTOR VEHICLE 4	CONTRIBUTING FACTOR VEHICLE 5	COLLISION_ID	VEHICLE TYPE CODE 1	VEHICLE TYPE CODE 2	VEHICLE TYPE CODE 3	VEHICLE TYPE CODE 4	VEHICLE TYPE CODE 5

0 rows × 29 columns

jan23[jan23["COLLISION_ID"] == 4594599]

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME	CROSS STREET NAME	OFF STREET NAME	...	CONTRIBUTING FACTOR VEHICLE 2	CONTRIBUTING FACTOR VEHICLE 3	CONTRIBUTING FACTOR VEHICLE 4	CONTRIBUTING FACTOR VEHICLE 5	COLLISION_ID	VEHICLE TYPE CODE 1	VEHICLE TYPE CODE 2	VEHICLE TYPE CODE 3	VEHICLE TYPE CODE 4	VEHICLE TYPE CODE 5
1	1/1/23	8:04	QUEENS	11430.0	40.659508	-73.773687	(40.6595077,-73.7736867)	NASSAU EXPRESSWAY	NaN	NaN	...	Unspecified	NaN	NaN	NaN	4594599	Sedan	Sedan	NaN	NaN	NaN

1 rows × 29 columns

type(jan23[jan23["COLLISION_ID"] == 4594599])

pandas.core.frame.DataFrame

7.2.2 Data Isolated

Only the continuous variables will make sense for most of the descriptive statistics below, so we will use the following dataframe of just the continuous variables when applicable.

cjan23 = jan23[["NUMBER OF PEDESTRIANS INJURED", "NUMBER OF PEDESTRIANS KILLED", "NUMBER OF CYCLIST INJURED", 
                "NUMBER OF CYCLIST KILLED", "NUMBER OF MOTORIST INJURED", "NUMBER OF MOTORIST KILLED"]]

7.3 Common Operations

7.3.1 Descriptive Statistics with Pandas

7.3.2 center

mean(): mean
median(): median
mode(): mode

7.3.3 spread

min(): minimum
max(): maximum
std(): standard deviation
var(): variance
quantile(): quantiles

7.3.4 shape

skew(): skewness (adjusted Fisher-Pearson standardized moment coefficient)

7.3.5 correlation (deals with two variables)

corr(): correlation coefficient
cov(): covariance

7.3.6 other important operations

count(): total count
sum(): summation
value_counts(): individual counts
describe(): describe the data with many descriptive statistics

Below I work through a few specific descriptive statistics operators to give a general idea of how they work. If an operator is not demonstrated below, it is listed under the operator whose usage it most resembles.
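To make the lists above concrete, here is a quick sketch that calls several of these operators on a tiny made-up Series (s is hypothetical, chosen so the results are easy to check by hand).

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 7])  # made-up values

print(s.mean())            # 3.0
print(s.median())          # 2.0
print(s.mode().tolist())   # [2]
print(s.min(), s.max())    # 1 7
print(s.var())             # 5.5 (sample variance, ddof = 1 by default)
print(s.std())             # square root of the variance
print(s.count(), s.sum())  # 5 15
```

Note that var() and std() divide by n - 1 by default, which differs from the NumPy default of dividing by n.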

7.3.7 Important operators

Sum

sum(axis = None, skipna = True). Below we focus on the usage of axis.

cjan23.sum() # or cjan23.sum(0); cjan23.sum(None) gave the same result in the pandas version used here
# takes the individual sums of the numeric columns

NUMBER OF PEDESTRIANS INJURED     843
NUMBER OF PEDESTRIANS KILLED        5
NUMBER OF CYCLIST INJURED         241
NUMBER OF CYCLIST KILLED            3
NUMBER OF MOTORIST INJURED       2413
NUMBER OF MOTORIST KILLED           9
dtype: int64

# compute "axis = 1", rows
cjan23.sum(1)

0       1
1       1
2       0
3       2
4       0
       ..
7239    0
7240    1
7241    0
7242    0
7243    2
Length: 7244, dtype: int64

The functions mean(), median(), mode(), min(), max(), std(), var(), quantile(), skew(), and count() are used similarly: by default they operate on the columns, and specifying axis = 1 operates on the rows. Any of these operators not shown below give output that looks similar to that of sum() above.
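The effect of axis is easiest to see on a dataframe small enough to add up by hand. The sketch below uses a made-up dataframe (toy, with hypothetical columns a and b).

```python
import pandas as pd

# hypothetical toy dataframe
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

by_column = toy.sum()         # axis = 0: one value per column
by_row = toy.sum(axis=1)      # axis = 1: one value per row
row_means = toy.mean(axis=1)  # other operators take axis the same way

print(by_column.tolist())  # [6, 60]
print(by_row.tolist())     # [11, 22, 33]
print(row_means.tolist())  # [5.5, 11.0, 16.5]
```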

7.3.8 Center

Mode

jan23.mode() # lists the most frequent value
# mode is relevant for discrete and continuous variables

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME	CROSS STREET NAME	OFF STREET NAME	...	CONTRIBUTING FACTOR VEHICLE 2	CONTRIBUTING FACTOR VEHICLE 3	CONTRIBUTING FACTOR VEHICLE 4	CONTRIBUTING FACTOR VEHICLE 5	COLLISION_ID	VEHICLE TYPE CODE 1	VEHICLE TYPE CODE 2	VEHICLE TYPE CODE 3	VEHICLE TYPE CODE 4	VEHICLE TYPE CODE 5
0	1/13/23	0:00	BROOKLYN	11207.0	40.606566	-74.044983	(0.0, 0.0)	BELT PARKWAY	3 AVENUE	49-21 METROPOLITAN AVENUE	...	Unspecified	Unspecified	Unspecified	Unspecified	4594332	Sedan	Sedan	Sedan	Sedan	Sedan
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	BROADWAY	560 WINTHROP STREET	...	NaN	NaN	NaN	NaN	4594347	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	985 RICHMOND AVENUE	...	NaN	NaN	NaN	NaN	4594350	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	ATLANTIC AVENUE	...	NaN	NaN	NaN	NaN	4594351	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4594359	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7239	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4605213	NaN	NaN	NaN	NaN	NaN
7240	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4605214	NaN	NaN	NaN	NaN	NaN
7241	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4605246	NaN	NaN	NaN	NaN	NaN
7242	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4605289	NaN	NaN	NaN	NaN	NaN
7243	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4605324	NaN	NaN	NaN	NaN	NaN

7244 rows × 29 columns

The mode is the most frequent value. If several values tie for most frequent, all of them are returned. For example, “OFF STREET NAME” has four values that are the most frequent, so four values are returned. Because mode() returns a dataframe, the NaN values are just padding in rows where other columns still have a value. See “COLLISION_ID”: there are 7244 rows because there are 7244 unique collision IDs, so every one of them is a most frequent value (one occurrence each). This explains the NaN padding in all of the other columns, since the dataframe format needs a filler to still output a dataframe.
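The padding behavior can be reproduced on a small made-up dataframe. In this sketch (toy, x, and y are hypothetical names), x has a two-way tie while y has a single mode, so y is padded with NaN in the second row.

```python
import pandas as pd

# hypothetical toy dataframe: x has two modes (1 and 2), y has one ("a")
toy = pd.DataFrame({"x": [1, 1, 2, 2, 3],
                    "y": ["a", "a", "b", "c", "d"]})

modes = toy.mode()
print(modes)
# row 0 holds the first mode of each column;
# row 1 holds the second mode of x and NaN padding for y
```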

7.3.9 Spread

Quantile

cjan23.quantile([.65, .9]) # can specify specific quantiles

	NUMBER OF PEDESTRIANS INJURED	NUMBER OF PEDESTRIANS KILLED	NUMBER OF CYCLIST INJURED	NUMBER OF CYCLIST KILLED	NUMBER OF MOTORIST INJURED	NUMBER OF MOTORIST KILLED
0.65	0.0	0.0	0.0	0.0	0.0	0.0
0.90	1.0	0.0	0.0	0.0	1.0	0.0
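By default, quantile() interpolates linearly between neighboring data points, which is why a quantile is not always a value that occurs in the data. The sketch below uses a made-up, zero-heavy Series (s) loosely resembling the crash counts; the interpolation argument shown is a standard pandas parameter.

```python
import pandas as pd

s = pd.Series([0, 0, 0, 1, 4])  # made-up, mostly zeros

q = s.quantile([0.5, 0.9])                         # linear interpolation (default)
q90_lower = s.quantile(0.9, interpolation="lower") # take the lower neighbor instead

print(q)          # 0.5 -> 0.0, 0.9 -> 2.8
print(q90_lower)  # 1
```

The 0.9 quantile sits at position 0.9 * 4 = 3.6 in the sorted data, so the default result is 1 + 0.6 * (4 - 1) = 2.8.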

7.3.10 Shape

Skew

cjan23.skew() # negative skew means left skewness, positive means right

NUMBER OF PEDESTRIANS INJURED    16.357461
NUMBER OF PEDESTRIANS KILLED     38.031561
NUMBER OF CYCLIST INJURED         5.276883
NUMBER OF CYCLIST KILLED         49.118899
NUMBER OF MOTORIST INJURED        3.230788
NUMBER OF MOTORIST KILLED        34.958763
dtype: float64
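The large positive values above reflect columns that are mostly zeros with a few larger counts. A quick sketch with made-up Series (right, left, and sym are hypothetical) shows how the sign of skew() relates to the direction of the tail.

```python
import pandas as pd

right = pd.Series([0, 0, 0, 0, 10])     # long tail to the right
left = pd.Series([10, 10, 10, 10, 0])   # long tail to the left
sym = pd.Series([1, 2, 3, 4, 5])        # symmetric

print(right.skew())  # positive
print(left.skew())   # negative
print(sym.skew())    # 0 (up to floating point error)
```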

7.3.11 Correlation

Correlation Coefficient

jan23["NUMBER OF PEDESTRIANS INJURED"].corr(jan23["NUMBER OF PEDESTRIANS KILLED"])  
# the correlation coefficient of these two variables

-0.007686375916216244

cov() would be calculated in the same way.
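As a sketch of both calls side by side, here are corr() and cov() on two made-up Series (x and y are hypothetical, with y exactly twice x so the correlation is 1).

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4])
y = pd.Series([2, 4, 6, 8])  # y = 2 * x, a perfect linear relationship

r = x.corr(y)  # Pearson correlation coefficient by default
c = x.cov(y)   # sample covariance (divides by n - 1)

print(r)  # 1.0 (up to floating point error)
print(c)  # 10 / 3, about 3.33

# on a dataframe, corr() returns the full pairwise matrix
print(pd.DataFrame({"x": x, "y": y}).corr())
```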

7.3.12 Describe

The operations above are each computed individually. Is there one operation that can show multiple descriptive statistics at once?

To show how to change the function's defaults in cases where the character columns are needed, I will use all variables (dataframe jan23). Later, when editing the describe() output further, I will use cjan23, as describe() automatically treats numeric values as continuous (which is not true for many of these numeric variables).

jan23.describe() # default omits character and string values

	ZIP CODE	LATITUDE	LONGITUDE	NUMBER OF PERSONS INJURED	NUMBER OF PERSONS KILLED	NUMBER OF PEDESTRIANS INJURED	NUMBER OF PEDESTRIANS KILLED	NUMBER OF CYCLIST INJURED	NUMBER OF CYCLIST KILLED	NUMBER OF MOTORIST INJURED	NUMBER OF MOTORIST KILLED	COLLISION_ID
count	7240.000000	7240.000000	7240.000000	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000	7.244000e+03
mean	10876.268785	40.723872	-73.917446	0.502761	0.002347	0.116372	0.000690	0.033269	0.000414	0.333103	0.001242	4.599022e+06
std	532.816111	0.087734	0.088494	0.813641	0.051164	0.397927	0.026265	0.180118	0.020348	0.749174	0.038951	2.365885e+03
min	10001.000000	40.504658	-74.250150	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	4.594332e+06
25%	10453.000000	40.665374	-73.966253	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	4.597113e+06
50%	11208.000000	40.714790	-73.922485	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	4.599058e+06
75%	11239.000000	40.784210	-73.865596	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	4.600953e+06
max	12134.000000	43.299428	-73.051978	21.000000	2.000000	19.000000	1.000000	2.000000	1.000000	8.000000	2.000000	4.605324e+06

It may be useful to customize describe() so that it shows more of the values we wish to see. The default describe() output is shown above.

The default describe() input:

DataFrame.describe(percentiles = None, include = None, exclude = None, datetime_is_numeric = False)

7.3.12.1 Changing the default

# changing percentile default
jan23.describe([.2, .45, .9])

	ZIP CODE	LATITUDE	LONGITUDE	NUMBER OF PERSONS INJURED	NUMBER OF PERSONS KILLED	NUMBER OF PEDESTRIANS INJURED	NUMBER OF PEDESTRIANS KILLED	NUMBER OF CYCLIST INJURED	NUMBER OF CYCLIST KILLED	NUMBER OF MOTORIST INJURED	NUMBER OF MOTORIST KILLED	COLLISION_ID
count	7240.000000	7240.000000	7240.000000	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000	7.244000e+03
mean	10876.268785	40.723872	-73.917446	0.502761	0.002347	0.116372	0.000690	0.033269	0.000414	0.333103	0.001242	4.599022e+06
std	532.816111	0.087734	0.088494	0.813641	0.051164	0.397927	0.026265	0.180118	0.020348	0.749174	0.038951	2.365885e+03
min	10001.000000	40.504658	-74.250150	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	4.594332e+06
20%	10305.000000	40.651722	-73.978789	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	4.596668e+06
45%	11204.000000	40.704138	-73.931070	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	4.598665e+06
50%	11208.000000	40.714790	-73.922485	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	4.599058e+06
90%	11415.000000	40.843662	-73.802865	1.000000	0.000000	1.000000	0.000000	0.000000	0.000000	1.000000	0.000000	4.602072e+06
max	12134.000000	43.299428	-73.051978	21.000000	2.000000	19.000000	1.000000	2.000000	1.000000	8.000000	2.000000	4.605324e+06

This replaces the default .25 and .75, but keeps the median (.5).

# including all columns, rather than just "number" default
jan23.describe(include = 'all')

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME	CROSS STREET NAME	OFF STREET NAME	...	CONTRIBUTING FACTOR VEHICLE 2	CONTRIBUTING FACTOR VEHICLE 3	CONTRIBUTING FACTOR VEHICLE 4	CONTRIBUTING FACTOR VEHICLE 5	COLLISION_ID	VEHICLE TYPE CODE 1	VEHICLE TYPE CODE 2	VEHICLE TYPE CODE 3	VEHICLE TYPE CODE 4	VEHICLE TYPE CODE 5
count	7244	7244	7219	7240.000000	7240.000000	7240.000000	7244	5341	3453	1903	...	5378	689	191	62	7.244000e+03	7108	4553	634	179	59
unique	31	1245	5	NaN	NaN	NaN	6140	1580	1562	1877	...	30	13	5	3	NaN	67	81	17	11	5
top	1/13/23	0:00	BROOKLYN	NaN	NaN	NaN	(0.0, 0.0)	BELT PARKWAY	BROADWAY	560 WINTHROP STREET	...	Unspecified	Unspecified	Unspecified	Unspecified	NaN	Sedan	Sedan	Sedan	Sedan	Sedan
freq	294	116	2386	NaN	NaN	NaN	81	124	37	3	...	4550	637	183	60	NaN	3478	1969	327	93	29
mean	NaN	NaN	NaN	10876.268785	40.723872	-73.917446	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4.599022e+06	NaN	NaN	NaN	NaN	NaN
std	NaN	NaN	NaN	532.816111	0.087734	0.088494	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	2.365885e+03	NaN	NaN	NaN	NaN	NaN
min	NaN	NaN	NaN	10001.000000	40.504658	-74.250150	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4.594332e+06	NaN	NaN	NaN	NaN	NaN
25%	NaN	NaN	NaN	10453.000000	40.665374	-73.966253	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4.597113e+06	NaN	NaN	NaN	NaN	NaN
50%	NaN	NaN	NaN	11208.000000	40.714790	-73.922485	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4.599058e+06	NaN	NaN	NaN	NaN	NaN
75%	NaN	NaN	NaN	11239.000000	40.784210	-73.865596	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4.600953e+06	NaN	NaN	NaN	NaN	NaN
max	NaN	NaN	NaN	12134.000000	43.299428	-73.051978	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4.605324e+06	NaN	NaN	NaN	NaN	NaN

11 rows × 29 columns

My only complaint here is that most of the numerical columns are left out of the “unique”, “top”, and “freq” rows, even though in context they are discrete and it would make sense to include them.
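One possible workaround, offered as a sketch rather than the recommended approach, is to cast a discrete numeric column to the object dtype before calling describe(), so that it is summarized like a categorical column. The dataframe toy and its column killed are made up for illustration.

```python
import pandas as pd

# hypothetical discrete numeric column
toy = pd.DataFrame({"killed": [0, 0, 0, 1]})

# casting to object makes describe() report count / unique / top / freq
summary = toy.astype("object").describe()
print(summary)
```

The trade-off is that the mean, standard deviation, and quantiles are no longer reported for that column.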

# excluding numerical columns
# gives just "object" i.e. categorical
jan23.describe(exclude = 'number')

	CRASH DATE	CRASH TIME	BOROUGH	LOCATION	ON STREET NAME	CROSS STREET NAME	OFF STREET NAME	CONTRIBUTING FACTOR VEHICLE 1	CONTRIBUTING FACTOR VEHICLE 2	CONTRIBUTING FACTOR VEHICLE 3	CONTRIBUTING FACTOR VEHICLE 4	CONTRIBUTING FACTOR VEHICLE 5	VEHICLE TYPE CODE 1	VEHICLE TYPE CODE 2	VEHICLE TYPE CODE 3	VEHICLE TYPE CODE 4	VEHICLE TYPE CODE 5
count	7244	7244	7219	7244	5341	3453	1903	7189	5378	689	191	62	7108	4553	634	179	59
unique	31	1245	5	6140	1580	1562	1877	48	30	13	5	3	67	81	17	11	5
top	1/13/23	0:00	BROOKLYN	(0.0, 0.0)	BELT PARKWAY	BROADWAY	560 WINTHROP STREET	Driver Inattention/Distraction	Unspecified	Unspecified	Unspecified	Unspecified	Sedan	Sedan	Sedan	Sedan	Sedan
freq	294	116	2386	81	124	37	3	1776	4550	637	183	60	3478	1969	327	93	29

Interesting note: rather than returning an empty result, describe() falls back to the columns that are omitted by default, which is the same as jan23.describe(include = 'object').

# making datetime numeric
jan23.describe(include = 'all', datetime_is_numeric = True)

	CRASH DATE	CRASH TIME	BOROUGH	ZIP CODE	LATITUDE	LONGITUDE	LOCATION	ON STREET NAME	CROSS STREET NAME	OFF STREET NAME	...	CONTRIBUTING FACTOR VEHICLE 2	CONTRIBUTING FACTOR VEHICLE 3	CONTRIBUTING FACTOR VEHICLE 4	CONTRIBUTING FACTOR VEHICLE 5	COLLISION_ID	VEHICLE TYPE CODE 1	VEHICLE TYPE CODE 2	VEHICLE TYPE CODE 3	VEHICLE TYPE CODE 4	VEHICLE TYPE CODE 5
count	7244	7244	7219	7240.000000	7240.000000	7240.000000	7244	5341	3453	1903	...	5378	689	191	62	7.244000e+03	7108	4553	634	179	59
unique	31	1245	5	NaN	NaN	NaN	6140	1580	1562	1877	...	30	13	5	3	NaN	67	81	17	11	5
top	1/13/23	0:00	BROOKLYN	NaN	NaN	NaN	(0.0, 0.0)	BELT PARKWAY	BROADWAY	560 WINTHROP STREET	...	Unspecified	Unspecified	Unspecified	Unspecified	NaN	Sedan	Sedan	Sedan	Sedan	Sedan
freq	294	116	2386	NaN	NaN	NaN	81	124	37	3	...	4550	637	183	60	NaN	3478	1969	327	93	29
mean	NaN	NaN	NaN	10876.268785	40.723872	-73.917446	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4.599022e+06	NaN	NaN	NaN	NaN	NaN
std	NaN	NaN	NaN	532.816111	0.087734	0.088494	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	2.365885e+03	NaN	NaN	NaN	NaN	NaN
min	NaN	NaN	NaN	10001.000000	40.504658	-74.250150	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4.594332e+06	NaN	NaN	NaN	NaN	NaN
25%	NaN	NaN	NaN	10453.000000	40.665374	-73.966253	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4.597113e+06	NaN	NaN	NaN	NaN	NaN
50%	NaN	NaN	NaN	11208.000000	40.714790	-73.922485	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4.599058e+06	NaN	NaN	NaN	NaN	NaN
75%	NaN	NaN	NaN	11239.000000	40.784210	-73.865596	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4.600953e+06	NaN	NaN	NaN	NaN	NaN
max	NaN	NaN	NaN	12134.000000	43.299428	-73.051978	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	4.605324e+06	NaN	NaN	NaN	NaN	NaN

11 rows × 29 columns

Including datetime as numeric only works if the column has a datetime dtype (typically displayed as YYYY-MM-DD 00:00:00.000000), which ours does not: the dates and times in our data are stored as strings. Thus, as we see, the date and time are still treated as objects.
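To get a true datetime column, the strings would first need to be parsed with pd.to_datetime(). The sketch below uses a made-up dataframe (toy) with dates written in the same style as our data; the format string is my assumption about that style. As far as I know, pandas 2.0 and later removed the datetime_is_numeric argument and always treat datetime columns as numeric in describe(), so the sketch only checks the parsed values.

```python
import pandas as pd

# made-up dates in the same "month/day/two-digit year" style as the crash data
toy = pd.DataFrame({"CRASH DATE": ["1/1/23", "1/13/23", "1/28/23"]})

# parse the strings; the format is an assumption about how the dates are written
parsed = pd.to_datetime(toy["CRASH DATE"], format="%m/%d/%y")

print(parsed.dtype)               # a datetime64 dtype instead of object
print(parsed.min(), parsed.max()) # earliest and latest date
```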

7.3.12.2 Changing rows with `describe()`

Above were ways to change the output that are built into the function itself. What if we want to add rows for another descriptive statistic? I will use just the discrete variables (cjan23) for the following examples.

# adding sum to the dataframe
# DataFrame.append is deprecated (and removed in pandas 2.0), so use pd.concat
pd.concat([cjan23.describe(), cjan23.sum().to_frame(name = 'sum').T])

	NUMBER OF PEDESTRIANS INJURED	NUMBER OF PEDESTRIANS KILLED	NUMBER OF CYCLIST INJURED	NUMBER OF CYCLIST KILLED	NUMBER OF MOTORIST INJURED	NUMBER OF MOTORIST KILLED
count	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000
mean	0.116372	0.000690	0.033269	0.000414	0.333103	0.001242
std	0.397927	0.026265	0.180118	0.020348	0.749174	0.038951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
max	19.000000	1.000000	2.000000	1.000000	8.000000	2.000000
sum	843.000000	5.000000	241.000000	3.000000	2413.000000	9.000000

# adding a row counting nan's
# DataFrame.append is deprecated (and removed in pandas 2.0), so use pd.concat
pd.concat([cjan23.describe(), cjan23.isna().sum().to_frame(name = 'nans').T])

	NUMBER OF PEDESTRIANS INJURED	NUMBER OF PEDESTRIANS KILLED	NUMBER OF CYCLIST INJURED	NUMBER OF CYCLIST KILLED	NUMBER OF MOTORIST INJURED	NUMBER OF MOTORIST KILLED
count	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000
mean	0.116372	0.000690	0.033269	0.000414	0.333103	0.001242
std	0.397927	0.026265	0.180118	0.020348	0.749174	0.038951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
max	19.000000	1.000000	2.000000	1.000000	8.000000	2.000000
nans	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000

# removing a row
cjan23.describe().drop(labels = "max", axis = 0)

	NUMBER OF PEDESTRIANS INJURED	NUMBER OF PEDESTRIANS KILLED	NUMBER OF CYCLIST INJURED	NUMBER OF CYCLIST KILLED	NUMBER OF MOTORIST INJURED	NUMBER OF MOTORIST KILLED
count	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000
mean	0.116372	0.000690	0.033269	0.000414	0.333103	0.001242
std	0.397927	0.026265	0.180118	0.020348	0.749174	0.038951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000

7.3.12.3 Changing columns with `describe()`

# removing a column
cjan23.describe().drop(columns = "NUMBER OF CYCLIST INJURED")

	NUMBER OF PEDESTRIANS INJURED	NUMBER OF PEDESTRIANS KILLED	NUMBER OF CYCLIST KILLED	NUMBER OF MOTORIST INJURED	NUMBER OF MOTORIST KILLED
count	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000
mean	0.116372	0.000690	0.000414	0.333103	0.001242
std	0.397927	0.026265	0.020348	0.749174	0.038951
min	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.000000	0.000000	0.000000	0.000000	0.000000
75%	0.000000	0.000000	0.000000	0.000000	0.000000
max	19.000000	1.000000	1.000000	8.000000	2.000000

# note that the manual changes made above are not permanent unless the variable is reassigned
cjan23.describe()

	NUMBER OF PEDESTRIANS INJURED	NUMBER OF PEDESTRIANS KILLED	NUMBER OF CYCLIST INJURED	NUMBER OF CYCLIST KILLED	NUMBER OF MOTORIST INJURED	NUMBER OF MOTORIST KILLED
count	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000	7244.000000
mean	0.116372	0.000690	0.033269	0.000414	0.333103	0.001242
std	0.397927	0.026265	0.180118	0.020348	0.749174	0.038951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
max	19.000000	1.000000	2.000000	1.000000	8.000000	2.000000

All of the approaches above manipulate the describe() output to potentially make the descriptive statistics easier to read, by adding or removing rows and columns of the table.
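An alternative to editing the describe() table is to build a summary from scratch with agg(), listing only the statistics of interest. This is a sketch on a made-up dataframe (toy, with hypothetical columns a and b), not something taken from the analysis above.

```python
import pandas as pd

# hypothetical toy dataframe
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# one row per requested statistic, one column per variable
summary = toy.agg(["count", "mean", "sum"])
print(summary)
```

This avoids adding and dropping rows after the fact, at the cost of having to name every statistic explicitly.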

Describe on Individual Columns

jan23["BOROUGH"].describe() # character and discrete

count         7219
unique           5
top       BROOKLYN
freq          2386
Name: BOROUGH, dtype: object

jan23["NUMBER OF PEDESTRIANS KILLED"].describe() # numeric and continuous

count    7244.000000
mean        0.000690
std         0.026265
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: NUMBER OF PEDESTRIANS KILLED, dtype: float64

Note that numeric discrete variables are still treated as continuous, so these descriptive statistics are not very informative for them. Regardless, descriptive statistics are typically of more interest to us when the variables are continuous.

7.4 Conclusion

In this presentation we looked into different methods of computing descriptive statistics and saw how to use many of these operators. There are many ways to compute descriptive statistics, and we explored how to do so with pandas. We also looked at isolating columns and rows to compute descriptive statistics on, and then focused on how to manipulate the describe() function in ways that may make the data easier to summarize. Analyzing descriptive statistics is extremely important to understanding data. Another way to put data into a more digestible form is to visualize it, which other presentations touch on.
