10.3. Random Forests
Random forests are an ensemble learning method that can be used for classification or regression by constructing many individual decision trees rather than a single one. Each tree is trained on a random bootstrap sample of the data and considers a random subset of the features at each split, which helps decorrelate the trees.
For classification:
Each decision tree in the forest gives a class prediction, and whichever class is selected most often is the output of the random forest.
For regression:
The output of the random forest is the average prediction of the individual trees.
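As a minimal sketch of these two aggregation rules, here are some made-up per-tree predictions for a single sample (the class labels and numeric values below are purely illustrative):
import numpy as np
## Hypothetical predictions from the five trees of a small forest
tree_classes = np.array([1, 0, 1, 1, 2])           # class labels (classification)
tree_values = np.array([3.1, 2.8, 3.4, 2.9, 3.0])  # numeric outputs (regression)
## Classification: the forest returns the most frequent class (majority vote)
print(np.bincount(tree_classes).argmax())   # 1
## Regression: the forest returns the average of the tree predictions
print(tree_values.mean())                   # 3.04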
Advantages of Random Forests:
Reduces the overfitting that often occurs in individual decision trees and improves accuracy (see the comparison sketch after these lists)
Can handle large datasets as well as missing data
Can perform both classification and regression tasks
Produces good predictions, along with feature importance scores that help interpret the model
Disadvantages of Random Forests:
Requires more computational power and resources
Takes longer to train and to predict than a single decision tree
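To make the first advantage concrete, once the packages from the next section are installed we can compare the cross-validated accuracy of a single decision tree against a random forest on the same data. This is only a sketch; it uses scikit-learn's built-in wine dataset, which we also use in the Simple Example below:
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
X, y = load_wine(return_X_y = True)
## Cross-validated accuracy of one tree vs. a 50-tree forest
tree_scores = cross_val_score(DecisionTreeClassifier(random_state = 99), X, y, cv = 5)
forest_scores = cross_val_score(RandomForestClassifier(n_estimators = 50,
                                                       random_state = 99), X, y, cv = 5)
print("Single tree:", tree_scores.mean())
print("Random forest:", forest_scores.mean())
The forest typically scores noticeably higher here, reflecting the variance reduction that comes from averaging many decorrelated trees.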
10.3.1. Installing Packages
To use random forests in Python, we first need to install the scikit-learn package.
pip:
pip install scikit-learn
conda:
conda install scikit-learn
We'll also want to install the graphviz package to improve the visualizations, which can be done using conda install python-graphviz.
10.3.2. Simple Example
## Configure the inline figures to svg format
%config InlineBackend.figure_formats = ['svg']

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
import graphviz

## Load the wine dataset
wine = load_wine()
X, y = wine.data, wine.target

## Build a random forest with 50 trees in it
clf = RandomForestClassifier(n_estimators = 50, random_state = 99)
clf = clf.fit(X, y)

## Select a single tree from the forest to visualize
estimator = clf.estimators_[5]

dot_data = tree.export_graphviz(estimator, out_file = None,
                                feature_names = wine.feature_names,
                                class_names = wine.target_names,
                                filled = True, rounded = True,
                                proportion = False, precision = 2,
                                special_characters = True)

## Draw the graph
graph = graphviz.Source(dot_data, format = "svg")
graph
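Since the forest's output is built from its trees, we can also ask each individual tree for its prediction on a sample and compare it with the forest's overall answer. The snippet below is just an illustration using the clf fitted above; note that scikit-learn actually combines the trees by averaging their predicted class probabilities rather than by a strict hard vote, so the two usually, but not always, agree.
## Compare per-tree predictions with the forest's prediction for one sample
sample = X[:1]
votes = [int(est.predict(sample)[0]) for est in clf.estimators_]
print("First 10 tree predictions:", votes[:10])
print("Forest prediction:", clf.predict(sample)[0])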
10.3.3. Random Forest for Classification Using NYC Crash Data
Using the NYC Motor Vehicle Crash data, we'll create a random forest model to predict whether a crash results in an injury and/or death to any person. Before fitting the model, however, we need to decide which features to include. For this example, we'll use the following features and outcome variable:
injured_dead_status (outcome): indicates whether at least one person was injured or killed in a crash (1 = yes, 0 = no)
timeframe: the time of day in which the crash occurred, split into 6-hour intervals over 24 hours (1 = 12AM-5:59AM, 2 = 6AM-11:59AM, 3 = 12PM-5:59PM, 4 = 6PM-11:59PM)
borough: the borough in which the crash took place
num_vehicles_involved: the number of vehicles involved in the crash
time_of_week: the day of the week of the crash, which we'll split into weekdays and weekends
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
## Read in NYC crash data
nyc_crash = pd.read_csv("../data/nyc_mv_collisions_202201.csv")
## Create new variables relating to people injured and/or killed
## in a crash
nyc_crash['num_injured_dead'] = (nyc_crash['NUMBER OF PERSONS KILLED'] +
                                 nyc_crash['NUMBER OF PERSONS INJURED'])
nyc_crash['injured_dead_status'] = (nyc_crash['num_injured_dead'] > 0).astype(int)
nyc_crash['injured_dead_status'] = nyc_crash['injured_dead_status'].astype('category')
## Extract the hour of each crash from the crash time
nyc_crash["hour"] = [int(x.split(":")[0]) for x in nyc_crash["CRASH TIME"]]

def timeframes(x):
    if x <= 5:
        return 1
    elif x <= 11:
        return 2
    elif x <= 17:
        return 3
    else:
        return 4

## Take crash hours and put them into specific intervals
nyc_crash['timeframe'] = nyc_crash['hour'].apply(timeframes)
nyc_crash['timeframe'] = nyc_crash['timeframe'].astype('category')
contributing_factors = ['CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2',
                        'CONTRIBUTING FACTOR VEHICLE 3', 'CONTRIBUTING FACTOR VEHICLE 4',
                        'CONTRIBUTING FACTOR VEHICLE 5']
## Number of vehicles involved in each crash (count of non-missing
## contributing factor columns)
nyc_crash['num_vehicles_involved'] = nyc_crash[contributing_factors].notnull().sum(axis = 1)
## Convert 'CRASH DATE' to datetime format
nyc_crash['CRASH DATE'] = pd.to_datetime(nyc_crash['CRASH DATE'])
## Column to indicate day of the week
nyc_crash['day_of_week'] = nyc_crash['CRASH DATE'].dt.day_name()
def is_weekend(x):
    if x == 'Saturday' or x == 'Sunday':
        return 'weekend'
    else:
        return 'weekday'
## Categorize days of week to weekday or weekend
nyc_crash['time_of_week'] = nyc_crash['day_of_week'].apply(is_weekend)
nyc_crash.head()
| | CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | num_injured_dead | injured_dead_status | hour | timeframe | num_vehicles_involved | day_of_week | time_of_week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-01-01 | 7:05 | NaN | NaN | NaN | NaN | NaN | EAST 128 STREET | 3 AVENUE BRIDGE | NaN | ... | NaN | NaN | NaN | 0 | 0 | 7 | 2 | 1 | Saturday | weekend |
| 1 | 2022-01-01 | 14:43 | NaN | NaN | 40.769993 | -73.915825 | (40.769993, -73.915825) | GRAND CENTRAL PKWY | NaN | NaN | ... | NaN | NaN | NaN | 0 | 0 | 14 | 3 | 1 | Saturday | weekend |
| 2 | 2022-01-01 | 21:20 | QUEENS | 11414.0 | 40.657230 | -73.841380 | (40.65723, -73.84138) | 91 STREET | 160 AVENUE | NaN | ... | NaN | NaN | NaN | 0 | 0 | 21 | 4 | 1 | Saturday | weekend |
| 3 | 2022-01-01 | 4:30 | NaN | NaN | NaN | NaN | NaN | Southern parkway | Jfk expressway | NaN | ... | NaN | NaN | NaN | 0 | 0 | 4 | 1 | 2 | Saturday | weekend |
| 4 | 2022-01-01 | 7:57 | NaN | NaN | NaN | NaN | NaN | WESTCHESTER AVENUE | SHERIDAN EXPRESSWAY | NaN | ... | NaN | NaN | NaN | 0 | 0 | 7 | 2 | 1 | Saturday | weekend |

5 rows × 36 columns
nyc_crash.rename(columns = {'BOROUGH': 'borough'}, inplace = True)
features = nyc_crash[['borough', 'timeframe', 'num_vehicles_involved', 'time_of_week']]
We now have all of the features we want to include in our model. Before we set up our random forest, we'll want to take our categorical features and transform them into binary data for each category without any arbitrary ordering, a process known as one-hot encoding. This can be done very easily in pandas using the pd.get_dummies() function, where we pass in our features of interest.
labels = np.array(nyc_crash['injured_dead_status'])
features = pd.get_dummies(features)
features.head()
| | num_vehicles_involved | borough_BRONX | borough_BROOKLYN | borough_MANHATTAN | borough_QUEENS | borough_STATEN ISLAND | timeframe_1 | timeframe_2 | timeframe_3 | timeframe_4 | time_of_week_weekday | time_of_week_weekend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
Now our categorical features have separate columns with binary values for each category. For example, the borough variable now has five separate columns, one per borough, with either a 1 or a 0 for each observation. Now we can begin building our random forest model.
## Start building the random forest model:
x_train, x_test, y_train, y_test = train_test_split(features, labels,
                                                    test_size = 0.2,
                                                    random_state = 42)
rfclf = RandomForestClassifier(n_estimators = 100, random_state = 42)
rfclf = rfclf.fit(x_train, y_train)
# Extract a single tree from the random forest
estimator = rfclf.estimators_[33]
dot_data = tree.export_graphviz(estimator, out_file = None,
                                feature_names = features.columns,
                                class_names = ['0', '1'],
                                filled = True, rounded = True,
                                proportion = False, precision = 2,
                                special_characters = True)
# Draw graph
graph = graphviz.Source(dot_data, format = "svg")
graph
This individual decision tree that we selected from the random forest is quite large, with many splits occurring below the top node. If we count the number of levels, we can see that this tree has a depth of 13 before reaching the bottom. For this particular tree, the initial node splits on the num_vehicles_involved variable being less than or equal to 1.5 vehicles.
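Rather than counting levels by hand, we can also ask the extracted tree for its depth directly; fitted scikit-learn trees expose a get_depth() method:
## Check the depth of the tree we extracted above
print(estimator.get_depth())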
Many of these trees are going to look different, and some might suffer from overfitting or be very inaccurate, but the vote across all of the decision trees is still likely to give an accurate classification overall. We can check the accuracy of our random forest model on our test data using scikit-learn's accuracy_score() and confusion_matrix() functions. We can also limit the depth of the forest's trees by passing the max_depth parameter to the random forest classifier, which makes an individual tree easier to visualize.
y_pred = rfclf.predict(x_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
Accuracy: 0.7003916449086162
[[1041 41]
[ 418 32]]
rfclf2 = RandomForestClassifier(n_estimators = 100, random_state = 42, max_depth = 4)
rfclf2 = rfclf2.fit(x_train, y_train)
# Extract a single tree from the random forest
estimator = rfclf2.estimators_[5]
dot_data = tree.export_graphviz(estimator, out_file = None,
                                feature_names = features.columns,
                                class_names = ['0', '1'],
                                filled = True, rounded = True,
                                proportion = False, precision = 2,
                                special_characters = True)
# Draw graph
graph = graphviz.Source(dot_data, format = "svg")
graph
for i in range(1, 11):
    rfclf = RandomForestClassifier(n_estimators = 100, random_state = 42, max_depth = i)
    rfclf = rfclf.fit(x_train, y_train)
    y_pred = rfclf.predict(x_test)
    print("Accuracy for Forest with Max Depth {i}:".format(i = i),
          metrics.accuracy_score(y_test, y_pred))
Accuracy for Forest with Max Depth 1: 0.706266318537859
Accuracy for Forest with Max Depth 2: 0.706266318537859
Accuracy for Forest with Max Depth 3: 0.706266318537859
Accuracy for Forest with Max Depth 4: 0.7095300261096605
Accuracy for Forest with Max Depth 5: 0.7095300261096605
Accuracy for Forest with Max Depth 6: 0.7095300261096605
Accuracy for Forest with Max Depth 7: 0.7056135770234987
Accuracy for Forest with Max Depth 8: 0.706266318537859
Accuracy for Forest with Max Depth 9: 0.7043080939947781
Accuracy for Forest with Max Depth 10: 0.7023498694516971
Looking at the depth of the trees in our random forest, the accuracy of our model is at its highest for maximum depths between 4 and 6, at about 70.95%. Note that this is only slightly better than the roughly 70.6% we would get by always predicting the majority class (1082 of the 1532 test crashes involved no injuries or deaths), which is exactly the accuracy of the depth 1 to 3 forests. We can also see some other performance metrics using the classification_report() function in scikit-learn, here applied to the predictions from the last model fit in the loop above (max_depth = 10).
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.71 0.97 0.82 1082
1 0.46 0.07 0.12 450
accuracy 0.70 1532
macro avg 0.58 0.52 0.47 1532
weighted avg 0.64 0.70 0.61 1532
The report confirms that recall for the positive class is very low (0.07): the model predicts no injury or death for the vast majority of crashes. We might also want to take a look at the importance of each of the features our model considered when splitting nodes. This can be done using the feature_importances_ attribute, which we can plot in a simple bar graph.
importances = rfclf.feature_importances_

## Plot the importance of each feature
plt.bar(features.columns, importances)
plt.xticks(rotation = 90)
plt.title('Importance of Each Feature')
plt.show()
[Figure: bar chart titled "Importance of Each Feature" showing the forest's feature importances]
10.3.4. Random Forest Model for Regression

Random forests can also be used for regression. Here, instead of classifying whether anyone was hurt, we'll use scikit-learn's RandomForestRegressor to predict num_injured_dead, the number of people injured or killed in each crash.
from sklearn.ensemble import RandomForestRegressor

labels2 = np.array(nyc_crash['num_injured_dead'])

## Start building the random forest regression model:
x_train, x_test, y_train, y_test = train_test_split(features, labels2,
                                                    test_size = 0.2,
                                                    random_state = 42)
rfreg = RandomForestRegressor(n_estimators = 100, random_state = 42, max_depth = 4)
rfreg = rfreg.fit(x_train, y_train)

## Extract a single tree from the random forest
estimator = rfreg.estimators_[33]

dot_data = tree.export_graphviz(estimator, out_file = None,
                                feature_names = features.columns,
                                filled = True, rounded = True,
                                proportion = False, precision = 2,
                                special_characters = True)

## Draw the graph
graph = graphviz.Source(dot_data, format = "svg")
graph
y_pred = rfreg.predict(x_test)

## Compare the actual and predicted counts, sorted by largest absolute error
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df['diff'] = abs(df['Actual'] - df['Predicted'])
df.sort_values(by = ['diff'], ascending = False, inplace = True)

print('Mean Absolute Error:',
      metrics.mean_absolute_error(y_test, y_pred))
df.head()
Mean Absolute Error: 0.5603409520733241
| | Actual | Predicted | diff |
|---|---|---|---|
| 1291 | 5 | 0.642135 | 4.357865 |
| 750 | 4 | 0.327590 | 3.672410 |
| 791 | 4 | 0.330216 | 3.669784 |
| 2 | 4 | 0.332213 | 3.667787 |
| 442 | 4 | 0.337047 | 3.662953 |
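Besides the mean absolute error, we might also compute the root mean squared error, which penalizes large misses like the ones in the table above more heavily, and the R-squared score. A short sketch using the same metrics module and the y_test and y_pred arrays from above:
## Additional regression metrics (illustrative)
print('Root Mean Squared Error:',
      np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R-squared:', metrics.r2_score(y_test, y_pred))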
10.3.5. Sources and Additional Information
https://scikit-learn.org/stable/modules/ensemble.html#forest
https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76
https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
https://towardsdatascience.com/how-to-visualize-a-decision-tree-from-a-random-forest-in-python-using-scikit-learn-38ad2d75f21c
https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/