import math
import numpy as np
import pandas as pd
= [2, 3.5, 7, 4]
x = [2, 3.5, 7, 4, math.nan]
xnan = np.array(xnan)
y = pd.Series(xnan) z
7 Descriptive Statistics
This chapter was prepared by Courtney Jones.
In basic terms, descriptive statistics are how we describe the data. Descriptive Statistics is extremely important to exploratory data analysis, as it allows us to describe and summarize the data to put it into context and visualize it. If we just were looking at a bunch of raw data, what use is that to us? We use terms to describe center, spread, correlation, counts, and more to to give us context to the raw data we have.
7.1 Different Python methods and which to use
7.1.1 Explanation of Methods
There are many methods to perform descriptive statistics operations. After briefly describing them, we will perform example operations to put into context how they work.
Python’s built-in functions: These built-in operations are in the Python library, where we would not have to import any packages. There are not many operations already built-in, and it cannot compute large datasets well.
Statistics package: Includes some additional functions for computation. NumPy is more compatible for using opertions than this package.
NumPy: NumPy is a very common package to import. It is beneficial when working with single and multi dimensional arrays.
Pandas: Pandas is based off of the same numerical computing as NumPy and works with series and dataframes.
7.1.2 Example: mean
Below is an example of computing mean with all of the above methods to express their differences.
Just for this example, I will create my own datasets (as the NYC data does not portray the differences as easily).
Python’s built-in functions
Here, we create the formula for mean by only using the built-in Python operations.
= sum(x) / len(x) # sum and length are built-in, whereas mean is not
mean # uses just the list mean
4.125
= sum(xnan) / len(xnan)
mean_xnan mean_xnan
nan
This method cannot skip NaN’s in the list, so the user would have to find a way to eliminate all NaN’s from their list before computing.
Statistics Package
import statistics
# uses just the list, x, rather than the array or series statistics.mean(x)
4.125
statistics.mean(xnan)
nan
Similarly, will just output “nan” if there are any nan’s in the list.
NumPy
import numpy as np
# uses the array y = np.array(xnan) np.mean(y)
nan
Notice that nan occurs. To avoid this, we can use nanmean()
instead.
np.nanmean(y)
4.125
Pandas
import pandas as pd
# uses the series z = pd.Series(xnan) z.mean()
4.125
nan does not occur due to the default parameter in the pandas mean skipna = True
.
= False) z.mean(skipna
nan
7.1.3 So what do we use?
As shown above, pandas is nice as it automatically ignores nan by default when computing numeric operations, rather than just outputting nan. This is faster, cleaner, and preferrable to me when I am calculating operations. So, outside of the context of this class, I prefer using pandas if I had the choice.
Moreover, in the context of this class, the data we will be analyzing is typically in the form of a dataframe. Pandas will typically be the best option when working with a dataframe, so it is best to continue using pandas.
7.2 Data
The data I will pull from is the January 2023 NYC Crash Data (cleaned).
= pd.read_csv("data/nyc_crashes_202301_cleaned.csv")
jan23
= jan23.loc[:,['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'ZIP CODE', 'LATITUDE',
jan23 'LONGITUDE', 'LOCATION', 'ON STREET NAME', 'CROSS STREET NAME',
'OFF STREET NAME', 'NUMBER OF PERSONS INJURED',
'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1',
'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5',
'COLLISION_ID', 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2',
'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5']]
7.2.1 Isolating Parts of the Dataframe
Descriptive statistics do not make sense in context with all aspects of the dataframe we will be using. Most of the descriptive statistics shown below will only make sense with continuous variables. Thus, I will briefly show how to isolate certain aspects of the dataframe, so that we can do so later.
7.2.1.1 Columns
"BOROUGH"] # isolating BOROUGH column jan23[
0 BROOKLYN
1 QUEENS
2 MANHATTAN
3 QUEENS
4 BRONX
...
7239 BROOKLYN
7240 BROOKLYN
7241 BROOKLYN
7242 MANHATTAN
7243 QUEENS
Name: BOROUGH, Length: 7244, dtype: object
type(jan23["BOROUGH"])
pandas.core.series.Series
Notice that the individual columns are classified as series. Pandas can be used on dataframes and series.
"BOROUGH"].value_counts(dropna = False) # categorical / discrete
jan23[# "dropna = True" is the default and drops the missing (NaN) values
BROOKLYN 2386
QUEENS 1980
MANHATTAN 1290
BRONX 1179
STATEN ISLAND 384
NaN 25
Name: BOROUGH, dtype: int64
value_counts()
does not work on dataframes, as it is a series operation. Moreover, it allows us to explore individual columns in more detail.
"NUMBER OF PEDESTRIANS KILLED"].value_counts(dropna = False) # numeric / continous
jan23[# works with both categorical and numeric values
0 7239
1 5
Name: NUMBER OF PEDESTRIANS KILLED, dtype: int64
"BOROUGH", "NUMBER OF PEDESTRIANS KILLED"]] # isolating multiple columns jan23[[
BOROUGH | NUMBER OF PEDESTRIANS KILLED | |
---|---|---|
0 | BROOKLYN | 0 |
1 | QUEENS | 0 |
2 | MANHATTAN | 0 |
3 | QUEENS | 0 |
4 | BRONX | 0 |
... | ... | ... |
7239 | BROOKLYN | 0 |
7240 | BROOKLYN | 0 |
7241 | BROOKLYN | 0 |
7242 | MANHATTAN | 0 |
7243 | QUEENS | 0 |
7244 rows × 2 columns
7.2.1.2 Rows
Descriptive Statistics on rows are not very beneficial, as comparing the variables in rows of this NYC dataframe do not make much sense. Often, looking at rows is not very ideal, and the outputs are not always useful. However, here are a few ways that rows can be isolated from the dataframe if necessary.
6543:6547] jan23.iloc[
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6543 | 1/28/23 | 5:25 | BROOKLYN | 11206.0 | 40.701077 | -73.94043 | (40.701077, -73.94043) | HUMBOLDT STREET | FLUSHING AVENUE | NaN | ... | Other Vehicular | NaN | NaN | NaN | 4602244 | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN | NaN |
6544 | 1/28/23 | 10:55 | STATEN ISLAND | 10301.0 | 40.640907 | -74.08134 | (40.640907, -74.08134) | NaN | NaN | 25 SHERMAN AVENUE | ... | Unspecified | NaN | NaN | NaN | 4602219 | Sedan | Sedan | NaN | NaN | NaN |
6545 | 1/28/23 | 0:09 | QUEENS | 11372.0 | 40.755030 | -73.88242 | (40.75503, -73.88242) | NaN | NaN | 33-11 85 STREET | ... | Unspecified | NaN | NaN | NaN | 4602365 | Sedan | Box Truck | NaN | NaN | NaN |
6546 | 1/28/23 | 13:00 | BROOKLYN | 11220.0 | 40.644955 | -74.01611 | (40.644955, -74.01611) | NaN | NaN | 325 54 STREET | ... | Unspecified | NaN | NaN | NaN | 4602449 | Sedan | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN |
4 rows × 29 columns
"CRASH DATE"] == "01/01/2023"] jan23[jan23[
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 |
---|
0 rows × 29 columns
"COLLISION_ID"] == 4594599] jan23[jan23[
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1/1/23 | 8:04 | QUEENS | 11430.0 | 40.659508 | -73.773687 | (40.6595077,-73.7736867) | NASSAU EXPRESSWAY | NaN | NaN | ... | Unspecified | NaN | NaN | NaN | 4594599 | Sedan | Sedan | NaN | NaN | NaN |
1 rows × 29 columns
type(jan23[jan23["COLLISION_ID"] == 4594599])
pandas.core.frame.DataFrame
7.2.2 Data Isolated
Only the continuous variables will make sense for most of the descriptive statistics below, so we will use the following dataframe of just the continous variables, when applicable.
= jan23[["NUMBER OF PEDESTRIANS INJURED", "NUMBER OF PEDESTRIANS KILLED", "NUMBER OF CYCLIST INJURED",
cjan23 "NUMBER OF CYCLIST KILLED", "NUMBER OF MOTORIST INJURED", "NUMBER OF MOTORIST KILLED"]]
7.3 Common Operations
7.3.1 Descriptive Statistics with Pandas
7.3.2 center
mean()
: meanmedian()
: medianmode()
: mode
7.3.3 spread
min()
: minimummax()
: maximumstd()
: standard deviationvar()
: variancequantile()
: quantiles
7.3.4 shape
skew()
: adjusted Fisher-Pearson standardized moment
7.3.5 correlation (deals with two variables)
corr()
: correlation coefficientcov()
: covariance
7.3.6 other important operations
count()
: total countsum()
: summationvalue_counts()
: individual countsdescribe()
: describe the data with many descriptive statistics
Below I worked on a few specific descriptive statistics operators to give a general idea of how the operators work. If an operator is not used below, it is listed under where it would be used similarly.
7.3.7 Important operators
Sum
sum(axis = None, skipna = False)
. Below we focus on the usage of axis.
sum() # or cjan23.sum(0) or cjan23.sum(None)
cjan23.# takes the indvidual sums of the numeric columns
NUMBER OF PEDESTRIANS INJURED 843
NUMBER OF PEDESTRIANS KILLED 5
NUMBER OF CYCLIST INJURED 241
NUMBER OF CYCLIST KILLED 3
NUMBER OF MOTORIST INJURED 2413
NUMBER OF MOTORIST KILLED 9
dtype: int64
# compute "axis = 1", rows
sum(1) cjan23.
0 1
1 1
2 0
3 2
4 0
..
7239 0
7240 1
7241 0
7242 0
7243 2
Length: 7244, dtype: int64
These functions: mean()
, median()
, mode()
, min()
, max()
std()
, var()
, quantile()
, skew()
, and count()
are used similarly, where their default will operate on the columns, and a specification of axis = 1
will operate on the rows. Any of these operators not shown below, give a similar looking output as sum()
does above.
7.3.8 Center
Mode
# lists the most frequent value
jan23.mode() # mode is relevant for discrete and continuous variables
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1/13/23 | 0:00 | BROOKLYN | 11207.0 | 40.606566 | -74.044983 | (0.0, 0.0) | BELT PARKWAY | 3 AVENUE | 49-21 METROPOLITAN AVENUE | ... | Unspecified | Unspecified | Unspecified | Unspecified | 4594332 | Sedan | Sedan | Sedan | Sedan | Sedan |
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | BROADWAY | 560 WINTHROP STREET | ... | NaN | NaN | NaN | NaN | 4594347 | NaN | NaN | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 985 RICHMOND AVENUE | ... | NaN | NaN | NaN | NaN | 4594350 | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ATLANTIC AVENUE | ... | NaN | NaN | NaN | NaN | 4594351 | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4594359 | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7239 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4605213 | NaN | NaN | NaN | NaN | NaN |
7240 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4605214 | NaN | NaN | NaN | NaN | NaN |
7241 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4605246 | NaN | NaN | NaN | NaN | NaN |
7242 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4605289 | NaN | NaN | NaN | NaN | NaN |
7243 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4605324 | NaN | NaN | NaN | NaN | NaN |
7244 rows × 29 columns
The mode outputs the most frequent value. If there are multiple values that are the most frequent, then all of those values will be outputted. For example, “OFF STREET NAME” has four values that are the most frequent. Thus, four values are outputted. Since mode()
was outputted in the format of a dataframe, the NaN values just represent empty spaces, where other columns have a value in that row. See: “COLLISION_ID”. There are 7244 rows because there are 7244 unique collision ID’s, so they all are the most frequent value (one occurrence of each). This explains all the empty spaces with all of the other variables, since the dataframe format needed a filler to still output a dataframe.
7.3.9 Spread
Quantile
.65, .9]) # can specify specific quantiles cjan23.quantile([
NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLIST INJURED | NUMBER OF CYCLIST KILLED | NUMBER OF MOTORIST INJURED | NUMBER OF MOTORIST KILLED | |
---|---|---|---|---|---|---|
0.65 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
0.90 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
7.3.10 Shape
Skew
# negative skew means left skewness, positive means right cjan23.skew()
NUMBER OF PEDESTRIANS INJURED 16.357461
NUMBER OF PEDESTRIANS KILLED 38.031561
NUMBER OF CYCLIST INJURED 5.276883
NUMBER OF CYCLIST KILLED 49.118899
NUMBER OF MOTORIST INJURED 3.230788
NUMBER OF MOTORIST KILLED 34.958763
dtype: float64
7.3.11 Correlation
Correlation Coefficient
"NUMBER OF PEDESTRIANS INJURED"].corr(jan23["NUMBER OF PEDESTRIANS KILLED"])
jan23[# the correlation coefficient of these two variables
-0.007686375916216244
cov()
would be calculated in the same way.
7.3.12 Describe
Above calculates each chosen operation indivudally. Is there one operation that can show multiple descriptive statistics at once?
Just for the purpose of showing how to make changes to the default function where the character values are needed to portray, I will be using all variables (dataframe jan23). Later, I will use cjan23 when delving more into editing the describe function, as the descriptie statistics automatically count numeric values as continuous (which is not true for many of these numeric variables).
# default omits character and string values jan23.describe()
ZIP CODE | LATITUDE | LONGITUDE | NUMBER OF PERSONS INJURED | NUMBER OF PERSONS KILLED | NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLIST INJURED | NUMBER OF CYCLIST KILLED | NUMBER OF MOTORIST INJURED | NUMBER OF MOTORIST KILLED | COLLISION_ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7240.000000 | 7240.000000 | 7240.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7.244000e+03 |
mean | 10876.268785 | 40.723872 | -73.917446 | 0.502761 | 0.002347 | 0.116372 | 0.000690 | 0.033269 | 0.000414 | 0.333103 | 0.001242 | 4.599022e+06 |
std | 532.816111 | 0.087734 | 0.088494 | 0.813641 | 0.051164 | 0.397927 | 0.026265 | 0.180118 | 0.020348 | 0.749174 | 0.038951 | 2.365885e+03 |
min | 10001.000000 | 40.504658 | -74.250150 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.594332e+06 |
25% | 10453.000000 | 40.665374 | -73.966253 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.597113e+06 |
50% | 11208.000000 | 40.714790 | -73.922485 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.599058e+06 |
75% | 11239.000000 | 40.784210 | -73.865596 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.600953e+06 |
max | 12134.000000 | 43.299428 | -73.051978 | 21.000000 | 2.000000 | 19.000000 | 1.000000 | 2.000000 | 1.000000 | 8.000000 | 2.000000 | 4.605324e+06 |
It may be useful to edit the describe()
feature to show moreso the values that we wish to see. The default .describe()
output is shown above.
The default describe()
input:
DataFrame.describe(percentiles = None, include = None, exclude = None, datetime_is_numeric = False)
7.3.12.1 Changing the default
# changing percentile default
.2, .45, .9]) jan23.describe([
ZIP CODE | LATITUDE | LONGITUDE | NUMBER OF PERSONS INJURED | NUMBER OF PERSONS KILLED | NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLIST INJURED | NUMBER OF CYCLIST KILLED | NUMBER OF MOTORIST INJURED | NUMBER OF MOTORIST KILLED | COLLISION_ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7240.000000 | 7240.000000 | 7240.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7.244000e+03 |
mean | 10876.268785 | 40.723872 | -73.917446 | 0.502761 | 0.002347 | 0.116372 | 0.000690 | 0.033269 | 0.000414 | 0.333103 | 0.001242 | 4.599022e+06 |
std | 532.816111 | 0.087734 | 0.088494 | 0.813641 | 0.051164 | 0.397927 | 0.026265 | 0.180118 | 0.020348 | 0.749174 | 0.038951 | 2.365885e+03 |
min | 10001.000000 | 40.504658 | -74.250150 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.594332e+06 |
20% | 10305.000000 | 40.651722 | -73.978789 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.596668e+06 |
45% | 11204.000000 | 40.704138 | -73.931070 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.598665e+06 |
50% | 11208.000000 | 40.714790 | -73.922485 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.599058e+06 |
90% | 11415.000000 | 40.843662 | -73.802865 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 4.602072e+06 |
max | 12134.000000 | 43.299428 | -73.051978 | 21.000000 | 2.000000 | 19.000000 | 1.000000 | 2.000000 | 1.000000 | 8.000000 | 2.000000 | 4.605324e+06 |
Replaces the default .25 and .75, but keeps the median (.5).
# including all columns, rather than just "number" default
= 'all') jan23.describe(include
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7244 | 7244 | 7219 | 7240.000000 | 7240.000000 | 7240.000000 | 7244 | 5341 | 3453 | 1903 | ... | 5378 | 689 | 191 | 62 | 7.244000e+03 | 7108 | 4553 | 634 | 179 | 59 |
unique | 31 | 1245 | 5 | NaN | NaN | NaN | 6140 | 1580 | 1562 | 1877 | ... | 30 | 13 | 5 | 3 | NaN | 67 | 81 | 17 | 11 | 5 |
top | 1/13/23 | 0:00 | BROOKLYN | NaN | NaN | NaN | (0.0, 0.0) | BELT PARKWAY | BROADWAY | 560 WINTHROP STREET | ... | Unspecified | Unspecified | Unspecified | Unspecified | NaN | Sedan | Sedan | Sedan | Sedan | Sedan |
freq | 294 | 116 | 2386 | NaN | NaN | NaN | 81 | 124 | 37 | 3 | ... | 4550 | 637 | 183 | 60 | NaN | 3478 | 1969 | 327 | 93 | 29 |
mean | NaN | NaN | NaN | 10876.268785 | 40.723872 | -73.917446 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4.599022e+06 | NaN | NaN | NaN | NaN | NaN |
std | NaN | NaN | NaN | 532.816111 | 0.087734 | 0.088494 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 2.365885e+03 | NaN | NaN | NaN | NaN | NaN |
min | NaN | NaN | NaN | 10001.000000 | 40.504658 | -74.250150 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4.594332e+06 | NaN | NaN | NaN | NaN | NaN |
25% | NaN | NaN | NaN | 10453.000000 | 40.665374 | -73.966253 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4.597113e+06 | NaN | NaN | NaN | NaN | NaN |
50% | NaN | NaN | NaN | 11208.000000 | 40.714790 | -73.922485 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4.599058e+06 | NaN | NaN | NaN | NaN | NaN |
75% | NaN | NaN | NaN | 11239.000000 | 40.784210 | -73.865596 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4.600953e+06 | NaN | NaN | NaN | NaN | NaN |
max | NaN | NaN | NaN | 12134.000000 | 43.299428 | -73.051978 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4.605324e+06 | NaN | NaN | NaN | NaN | NaN |
11 rows × 29 columns
My only con with this, is that most of these numerical values are not included in the “unique”, “top”, and “frequency” rows, even though in context they are discrete, and would make sense to be included in these.
# excluding numerical columns
# gives just "object" i.e. categorical
= 'number') jan23.describe(exclude
CRASH DATE | CRASH TIME | BOROUGH | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | CONTRIBUTING FACTOR VEHICLE 1 | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7244 | 7244 | 7219 | 7244 | 5341 | 3453 | 1903 | 7189 | 5378 | 689 | 191 | 62 | 7108 | 4553 | 634 | 179 | 59 |
unique | 31 | 1245 | 5 | 6140 | 1580 | 1562 | 1877 | 48 | 30 | 13 | 5 | 3 | 67 | 81 | 17 | 11 | 5 |
top | 1/13/23 | 0:00 | BROOKLYN | (0.0, 0.0) | BELT PARKWAY | BROADWAY | 560 WINTHROP STREET | Driver Inattention/Distraction | Unspecified | Unspecified | Unspecified | Unspecified | Sedan | Sedan | Sedan | Sedan | Sedan |
freq | 294 | 116 | 2386 | 81 | 124 | 37 | 3 | 1776 | 4550 | 637 | 183 | 60 | 3478 | 1969 | 327 | 93 | 29 |
Interesting note: rather than outputting as an empty set, the function decided to use the columns that are typically omitted instead, which is the same as jan23.describe("include = object")
.
# making datetime numeric
= 'all', datetime_is_numeric = True) jan23.describe(include
CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7244 | 7244 | 7219 | 7240.000000 | 7240.000000 | 7240.000000 | 7244 | 5341 | 3453 | 1903 | ... | 5378 | 689 | 191 | 62 | 7.244000e+03 | 7108 | 4553 | 634 | 179 | 59 |
unique | 31 | 1245 | 5 | NaN | NaN | NaN | 6140 | 1580 | 1562 | 1877 | ... | 30 | 13 | 5 | 3 | NaN | 67 | 81 | 17 | 11 | 5 |
top | 1/13/23 | 0:00 | BROOKLYN | NaN | NaN | NaN | (0.0, 0.0) | BELT PARKWAY | BROADWAY | 560 WINTHROP STREET | ... | Unspecified | Unspecified | Unspecified | Unspecified | NaN | Sedan | Sedan | Sedan | Sedan | Sedan |
freq | 294 | 116 | 2386 | NaN | NaN | NaN | 81 | 124 | 37 | 3 | ... | 4550 | 637 | 183 | 60 | NaN | 3478 | 1969 | 327 | 93 | 29 |
mean | NaN | NaN | NaN | 10876.268785 | 40.723872 | -73.917446 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4.599022e+06 | NaN | NaN | NaN | NaN | NaN |
std | NaN | NaN | NaN | 532.816111 | 0.087734 | 0.088494 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 2.365885e+03 | NaN | NaN | NaN | NaN | NaN |
min | NaN | NaN | NaN | 10001.000000 | 40.504658 | -74.250150 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4.594332e+06 | NaN | NaN | NaN | NaN | NaN |
25% | NaN | NaN | NaN | 10453.000000 | 40.665374 | -73.966253 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4.597113e+06 | NaN | NaN | NaN | NaN | NaN |
50% | NaN | NaN | NaN | 11208.000000 | 40.714790 | -73.922485 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4.599058e+06 | NaN | NaN | NaN | NaN | NaN |
75% | NaN | NaN | NaN | 11239.000000 | 40.784210 | -73.865596 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4.600953e+06 | NaN | NaN | NaN | NaN | NaN |
max | NaN | NaN | NaN | 12134.000000 | 43.299428 | -73.051978 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4.605324e+06 | NaN | NaN | NaN | NaN | NaN |
11 rows × 29 columns
Including datetime as numeric works if the date times are inputted in a different style (typically, YYYY-MM-DD 00:00:00.000000), which our data is not. Thus, as we see, the date and time is still treated as an object.
7.3.12.2 Changing rows with describe()
Above were specific ways to change the function that were already built into the function itself. What if we want to add more rows describing another descriptive statistic? I will be using just the discrete values for the following examples.
# adding sum to the dataframe
sum(), name = 'sum')) cjan23.describe().append(pd.Series(cjan23.
/var/folders/cq/5ysgnwfn7c3g0h46xyzvpj800000gn/T/ipykernel_34745/3080308836.py:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
cjan23.describe().append(pd.Series(cjan23.sum(), name = 'sum'))
NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLIST INJURED | NUMBER OF CYCLIST KILLED | NUMBER OF MOTORIST INJURED | NUMBER OF MOTORIST KILLED | |
---|---|---|---|---|---|---|
count | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 |
mean | 0.116372 | 0.000690 | 0.033269 | 0.000414 | 0.333103 | 0.001242 |
std | 0.397927 | 0.026265 | 0.180118 | 0.020348 | 0.749174 | 0.038951 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 19.000000 | 1.000000 | 2.000000 | 1.000000 | 8.000000 | 2.000000 |
sum | 843.000000 | 5.000000 | 241.000000 | 3.000000 | 2413.000000 | 9.000000 |
# adding a row counting nan's
sum(), name = 'nans')) cjan23.describe().append(pd.Series(cjan23.isna().
/var/folders/cq/5ysgnwfn7c3g0h46xyzvpj800000gn/T/ipykernel_34745/4231082202.py:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
cjan23.describe().append(pd.Series(cjan23.isna().sum(), name = 'nans'))
NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLIST INJURED | NUMBER OF CYCLIST KILLED | NUMBER OF MOTORIST INJURED | NUMBER OF MOTORIST KILLED | |
---|---|---|---|---|---|---|
count | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 |
mean | 0.116372 | 0.000690 | 0.033269 | 0.000414 | 0.333103 | 0.001242 |
std | 0.397927 | 0.026265 | 0.180118 | 0.020348 | 0.749174 | 0.038951 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 19.000000 | 1.000000 | 2.000000 | 1.000000 | 8.000000 | 2.000000 |
nans | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
# removing a row
= "max", axis = 0) cjan23.describe().drop(labels
NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLIST INJURED | NUMBER OF CYCLIST KILLED | NUMBER OF MOTORIST INJURED | NUMBER OF MOTORIST KILLED | |
---|---|---|---|---|---|---|
count | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 |
mean | 0.116372 | 0.000690 | 0.033269 | 0.000414 | 0.333103 | 0.001242 |
std | 0.397927 | 0.026265 | 0.180118 | 0.020348 | 0.749174 | 0.038951 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
7.3.12.3 Changing columns with describe()
# removing a column
= "NUMBER OF CYCLIST INJURED") cjan23.describe().drop(columns
NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLIST KILLED | NUMBER OF MOTORIST INJURED | NUMBER OF MOTORIST KILLED | |
---|---|---|---|---|---|
count | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 |
mean | 0.116372 | 0.000690 | 0.000414 | 0.333103 | 0.001242 |
std | 0.397927 | 0.026265 | 0.020348 | 0.749174 | 0.038951 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 19.000000 | 1.000000 | 1.000000 | 8.000000 | 2.000000 |
# note that the manual changes made above are not permanent unless the variable is reassigned
cjan23.describe()
NUMBER OF PEDESTRIANS INJURED | NUMBER OF PEDESTRIANS KILLED | NUMBER OF CYCLIST INJURED | NUMBER OF CYCLIST KILLED | NUMBER OF MOTORIST INJURED | NUMBER OF MOTORIST KILLED | |
---|---|---|---|---|---|---|
count | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 | 7244.000000 |
mean | 0.116372 | 0.000690 | 0.033269 | 0.000414 | 0.333103 | 0.001242 |
std | 0.397927 | 0.026265 | 0.180118 | 0.020348 | 0.749174 | 0.038951 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 19.000000 | 1.000000 | 2.000000 | 1.000000 | 8.000000 | 2.000000 |
All of the above ways were manipulating the describe()
operator to potentially make visualizing descriptive statistics easier, by putting certain desirable traits in or out of the table.
Describe on Individual Columns
"BOROUGH"].describe() # character and discrete jan23[
count 7219
unique 5
top BROOKLYN
freq 2386
Name: BOROUGH, dtype: object
"NUMBER OF PEDESTRIANS KILLED"].describe() # numeric and continuous jan23[
count 7244.000000
mean 0.000690
std 0.026265
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: NUMBER OF PEDESTRIANS KILLED, dtype: float64
Note that numeric and discrete would still be treated as continuous, so descriptive statistics are not very beneficial fot these variables. Regardless, descriptive statistics are typically more of interest to us if they are continuous.
7.4 Conclusion
In this presentation we looked into different methods of performing descriptive statistics, and saw how to use many of these operators. There are many ways to compute descriptive statistics, and we explored how to do so with pandas. We then focused on how to maniputlate the describe()
function in many ways that may help us to visualize the data much easier. Afterwards, we looked at isolating columns and rows to perform descriptive statistics on. Analyzing the descriptive statistics is extremely important to understaning data. Another way to possibly put data into a more digestible form is to visualize it, which other presentations touch on.