Decision Trees | Ruojia Sun

Social Disparity in Impacts of Climate Disasters in the United States

Decision Trees: Overview

A decision tree is a supervised learning model that can be used for classification or regression tasks. Like other supervised learning algorithms, decision trees are trained on labeled data in order to predict the value of a target attribute for new data vectors. A decision tree has a directed, hierarchical structure consisting of a root node, internal (decision) nodes, and leaf nodes. Below is an example of a decision tree that aims to predict someone’s risk of heart attack based on attributes such as their age and weight.

(Image Credit: Avinash Navlani)

The tree begins with the root node, which has no incoming edges. The tree splits into branches based on the conditions represented by internal nodes. A branch ends at a leaf node, which does not split any further and represents a decision: in this case, whether someone has a low or high risk of heart attack.

One important aspect of decision tree learning is deciding which features to choose and what conditions to use for node splitting. Different node splitting methods use different metrics to evaluate how well each test condition separates samples into classes. Three common metrics for determining the best split are GINI, Entropy, and Information Gain. GINI and Entropy are given by the following formulas:

GINI = 1 - \sum_i p_i^2

Entropy = -\sum_i p_i \log_2 p_i

where p_i is the proportion of samples in the set that belong to class i (the logarithm is conventionally taken base 2). GINI and Entropy both measure the impurity of the data samples in a set. If all samples in a set belong to one class, then the GINI and Entropy of that set both equal zero.
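As a quick check on these definitions, here is a minimal Python sketch of the two impurity measures. It is only an illustration, not the code used for this project (which is in R); the function names and the example probabilities are made up for this purpose, and the entropy uses a base-2 logarithm by assumption.

```python
import math

def gini(probs):
    """GINI impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy in bits; terms with p = 0 are skipped (0 * log 0 is taken as 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

pure = [1.0, 0.0]    # every sample belongs to one class
mixed = [0.5, 0.5]   # two classes, evenly mixed

print(gini(pure), entropy(pure))    # 0.0 0.0
print(gini(mixed), entropy(mixed))  # 0.5 1.0
```

A pure set gives zero for both measures, while an even two-class mix gives the maximum values for two classes (0.5 for GINI, 1 bit for Entropy).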

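Information gain measures the difference between the impurity at a node before splitting and the weighted average of the impurity of the resulting nodes after the split. The formula is given below, where I is the impurity measure, N is the number of samples at the node before the split, j indexes the nodes produced by the split, and N_j is the number of samples in node j.

Information Gain = I(parent) - \sum_j (N_j / N) I(j)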
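The sketch below works through information gain on a toy example, using entropy as the impurity measure. The label lists are invented for illustration and are not drawn from the survey data; the helper names are likewise hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent impurity minus the size-weighted average impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes completely.
perfect = [["yes"] * 4, ["no"] * 4]
# A split whose children are as mixed as the parent.
useless = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect))  # 1.0
print(information_gain(parent, useless))  # 0.0
```

The perfect split recovers all of the parent's 1 bit of entropy, while the uninformative split gains nothing.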

Below is an example of evaluating a split using GINI:

In this example, the split on the left is better as determined by a lower GINI value.
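The same kind of comparison can be sketched in Python. The class counts below are hypothetical and are not the numbers from the figure; the point is only to show how the size-weighted GINI of the child nodes is computed and compared between two candidate splits.

```python
def gini_from_counts(counts):
    """GINI impurity of a node given its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Size-weighted average GINI over the child nodes of a split."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini_from_counts(child) for child in children)

# Each tuple is (count of class A, count of class B) in one child node.
split_a = [(8, 2), (1, 9)]   # children are mostly one class each
split_b = [(5, 5), (4, 6)]   # children are nearly as mixed as the parent

print(weighted_gini(split_a))
print(weighted_gini(split_b))
print("better split:", "a" if weighted_gini(split_a) < weighted_gini(split_b) else "b")
```

With these made-up counts, the first split has a weighted GINI of about 0.25 versus about 0.49 for the second, so the first split would be preferred.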

Decision trees have many advantages: they are easy to understand and interpret, relatively fast to compute, and simple to implement. They can be used in a variety of classification problems to assign data to class labels, as well as in regression problems to predict a continuous value. However, they also have disadvantages, such as being prone to overfitting and having high variance. In general, a very large number of distinct decision trees can be built from the same dataset (especially when the dataset and feature space are large), since different trees result from different choices of splitting attributes, ordering of splitting attributes, tree structure, stopping criteria, pruning, and more. Furthermore, decision tree training relies on greedy heuristics that pick the locally best split at each node, so the resulting tree is at best close to optimal rather than guaranteed to be globally optimal.

In this work, I apply decision trees to classify whether individuals have been able to recover from hurricane impacts a year later. I use the Kaiser Family Foundation/Episcopal Health Foundation Poll: Harvey Anniversary Survey, which collected survey data one year after Hurricane Harvey (2017) on how individuals were impacted by the storm. I use features describing storm impacts (home damage and reduced work hours) and demographics (race, gender, and income) to predict whether a respondent reported that their day-to-day life was largely back to normal or still disrupted one year later. Such a model could help determine how to allocate resources in climate relief so that recovery aid reaches those who need it the most. Including demographics in the model also makes it possible to examine how these socioeconomic vulnerabilities affect an individual’s ability to recover after a climate disaster.

Data Preparation

As is the case with supervised learning algorithms, decision trees require labeled data. I used the Harvey Anniversary Survey dataset, which has many different features corresponding to survey questions. A sample of the raw data is shown below.

Raw Harvey Anniversary Survey data

From the Harvey Anniversary Survey dataset, I was interested in predicting individuals’ abilities to recover from the hurricane. Thus, I chose the attribute corresponding to hurricane recovery as the label for the data. For the predictors, I was interested in a combination of storm impacts and demographics, and chose the following features: whether the respondent sustained home damage as a result of Hurricane Harvey, whether the respondent had hours cut back at work as a result of Hurricane Harvey, as well as the respondent’s race, gender, and income.

For the hurricane recovery attribute I used as the label, the survey question asks “Which of the following best describes your personal situation in terms of recovering from Hurricane Harvey?” and there are 4 possible outcomes: largely back to normal, almost back to normal, still somewhat disrupted, and still very disrupted. In supervised learning, it helps to have balanced data, meaning similar numbers of samples for each value of the label and, ideally, similar numbers of samples for each value of the features. The responses for the recovery label were unbalanced, with the counts of each response from the raw data shown below:

In order to balance the data while retaining as many of the “somewhat disrupted” and “very disrupted” data vectors as possible, I combined the responses into two labels when preparing the data. I titled the label “recovery” and the classes “yes” and “no,” with “yes” including “largely/almost back to normal” responses and “no” corresponding to “still somewhat disrupted” and “still very disrupted” responses.
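The relabeling step can be sketched as a simple lookup. This is a Python illustration only: the actual data preparation code is linked in the Code section, the exact response strings in the raw survey file may be coded differently, and the sample responses below are invented.

```python
from collections import Counter

# Assumed response wording, taken from the survey question text above.
RECOVERY_MAP = {
    "largely back to normal": "yes",
    "almost back to normal": "yes",
    "still somewhat disrupted": "no",
    "still very disrupted": "no",
}

def to_binary_recovery(response):
    """Collapse a four-level recovery response into the binary 'recovery' label."""
    return RECOVERY_MAP[response.strip().lower()]

# Hypothetical responses, not taken from the dataset.
responses = [
    "Largely back to normal",
    "Almost back to normal",
    "Still somewhat disrupted",
    "Still very disrupted",
    "Largely back to normal",
]
labels = [to_binary_recovery(r) for r in responses]
print(Counter(labels))  # Counter({'yes': 3, 'no': 2})
```

Merging the two disrupted categories keeps every "somewhat" and "very" disrupted respondent in the smaller class instead of discarding them.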

Although there were imbalances to varying extents for each attribute, since my project focuses on the social impacts of climate disasters, I prioritized balancing the demographic attributes so that the resulting model isn’t biased to make more accurate predictions for more highly represented identities. Among these attributes, the race attribute was the most unbalanced, with counts from raw data shown below.

Since there were relatively few respondents who identified as Hispanic, mixed race, or Asian, I omitted these groups when cleaning the data. This decision stemmed from my goal of keeping more data samples for white and black/African-American respondents when balancing the data, in an attempt to improve prediction using those variables; the trade-off is that the model applies only to those two races. More generally, this is a challenge of applying machine learning to minority identities, since either a large overall sample or an oversample of the minority groups is needed.

After preparing and balancing the recovery label and the race attribute, the final balance of the data is as follows:

A sample of the cleaned data is shown below, and the prepared data can be found here.

For supervised learning algorithms such as decision trees, we need to split the data into a training set to train the model and a testing set to test the accuracy of the model. The training and testing sets must be disjoint, so that the model does not see the testing data during training. This ensures that the testing data can be used to accurately evaluate the model’s performance when making predictions on new data. To prepare the data for decision trees, I used a 3-to-1 training-to-testing data split (i.e. I sampled 75% of the data to use as training data and kept the remaining 25% for testing). Samples of the training and testing datasets are shown below.

Training data

Testing data
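A 75/25 split into disjoint sets can be sketched as follows. This Python version is an illustration rather than the R code used in the project; the function name, the seed, and the placeholder rows are assumptions.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle row indices and cut them into disjoint training and testing sets."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

# Placeholder rows standing in for survey respondents.
rows = [{"id": i} for i in range(20)]
train, test = train_test_split(rows)

train_ids = {r["id"] for r in train}
test_ids = {r["id"] for r in test}
print(len(train), len(test))          # 15 5
print(train_ids.isdisjoint(test_ids)) # True
```

Because each index lands on exactly one side of the cut, no respondent appears in both sets. Changing the seed gives a different split, which is one reason the fitted trees can vary from run to run.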

Code

Code for Decision Trees in R can be found here. I used R because its decision tree functions can work with categorical data directly.

Code to prepare the data can be found here.

Results

I used decision trees to model an individual’s ability to recover after a storm, and I report results for 4 different parameter settings. The resulting decision trees can vary significantly depending on the sampled data, so I provide a representative example tree for each setting.

Decision Tree 1

The first decision tree algorithm uses the default stopping parameter of cp = 0.01 (a smaller cp results in a larger tree) and the default splitting method of GINI. This algorithm has an accuracy of 73.27% averaged over 5 models. The example tree below only uses the home damage feature for splitting. If the individual’s home is damaged, then the tree will give a result of no recovery. The tree and confusion matrix are shown below.

Decision Tree 2

The second decision tree algorithm uses a smaller stopping parameter of cp = 0.005 (a smaller cp results in a larger tree) and the default splitting method of GINI. This algorithm has an accuracy of 72.45% averaged over 5 models. This tree uses more features for splitting, with nodes split by home damage, followed by income, race, etc. For example, if the individual has home damage and their income is below the poverty line, then the tree will yield a result of no recovery.

The confusion matrix is shown below. Even though the overall accuracy decreased from tree 1 to tree 2, in this example tree 2 correctly identified a more balanced number of “yes” and “no” cases. This indicates that there are tradeoffs between models, and the choice among them depends on the performance goals.
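To make this tradeoff concrete, the Python sketch below computes overall accuracy and per-class accuracy from a confusion matrix. The two matrices are hypothetical and are not the matrices from the figures; they only illustrate how one model can have slightly higher overall accuracy while another is more balanced across classes.

```python
def summarize(cm):
    """Return (overall accuracy, per-class accuracy) for cm[actual][predicted] counts."""
    total = sum(sum(row.values()) for row in cm.values())
    correct = sum(cm[label][label] for label in cm)
    per_class = {label: cm[label][label] / sum(cm[label].values()) for label in cm}
    return correct / total, per_class

# Hypothetical counts: 60 actual "yes" and 40 actual "no" in each case.
model_a = {"yes": {"yes": 55, "no": 5}, "no": {"yes": 22, "no": 18}}
model_b = {"yes": {"yes": 44, "no": 16}, "no": {"yes": 12, "no": 28}}

for name, cm in [("A", model_a), ("B", model_b)]:
    accuracy, per_class = summarize(cm)
    print(name, round(accuracy, 2), {k: round(v, 2) for k, v in per_class.items()})
```

With these invented counts, model A has the higher overall accuracy (0.73 versus 0.72) but correctly identifies only 45% of the "no" cases, while model B identifies 70% of them.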

Decision Tree 3

The third decision tree algorithm uses a stopping parameter of cp = 0.005 and the splitting method of information gain. This algorithm has an accuracy of 71.02% averaged over 5 models. Despite having the same value for the stopping parameter, this tree has many more nodes than the previous one.

Confusion matrix:

Decision Tree 4

The fourth decision tree algorithm uses a stopping parameter of cp = 0 and the splitting method of information gain. This algorithm has an accuracy of 70.20% averaged over 5 models and produces the most complex tree.

Confusion matrix:

Overall, there’s a progressive decrease in accuracy from the simpler to the more complex trees, from 73.27% to 70.20%, which suggests overfitting in the trees with more nodes.

Conclusion

I generated decision trees to predict whether an individual would recover from storm impacts 1 year following a hurricane, using attributes of home damage, reduced work hours, and respondent’s race, gender, and income. With different parameter values, I created decision trees of varying complexity. With this dataset and problem, the trees generated were very sensitive to parameter values and varied significantly depending on how the data were sampled, which should be taken into consideration when building the model.

The models had accuracies between 70.20% and 73.27%. Given that the accuracy is not very high for any of the models, it’s possible that the prediction could be improved with a larger dataset, by using different features, or by applying a different model altogether. I observed a decrease in accuracy from the simpler to the more complex trees, which suggests overfitting with the bigger trees. Interestingly, the simplest tree, which only split nodes based on home damage, had the highest average accuracy. This suggests that when maximizing overall accuracy, the other attributes may not be important in modeling hurricane recovery, at least in this model implementation.

However, depending on the performance goals of the model, another model could be preferable (for example, one that correctly predicts a more balanced number of “yes” and “no” classes). While home damage was clearly the most useful feature for splitting, the demographic features of income and race were the next most consequential and still affected the prediction.