
Social Disparity in Impacts of Climate Disasters in the United States

Overview: Clustering

Clustering is the process of grouping or categorizing unlabeled data. In data science, clustering can be used to create labels for data and to discover underlying patterns and structures in the data. The general goal of clustering is to group data such that data vectors in the same group are more similar to each other than to those in other groups. There are various methods for clustering, including partitional clustering and hierarchical clustering.

clustering.png

(Image credit: Amit Chauhan)

Partitional clustering methods, such as k-means, divide the data into disjoint groups based on a distance metric and require specifying the number of clusters to be generated. There are various ways to determine the optimal number of clusters, including the silhouette method, which measures how well a data point fits its assigned cluster compared to the nearest neighboring cluster. Hierarchical clustering instead determines clusters by building a hierarchy, either bottom-up (agglomerative, repeatedly merging the closest clusters) or top-down (divisive, repeatedly splitting clusters), and does not require specifying the number of clusters in advance. A few common hierarchical clustering methods are visualized below.

clustering-methods.png

(Image Credit: Prof. Ami Gates)
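For context, the sketch below shows how both families of methods can be run in Python with scikit-learn; the toy data, cluster counts, and linkage choices are illustrative only and are not the project data.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic 2-dimensional data with three natural groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Partitional: k-means requires the number of clusters up front
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print("k-means silhouette:", silhouette_score(X, kmeans_labels))

# Hierarchical (agglomerative): builds a bottom-up hierarchy; the linkage
# argument selects the merge criterion
for linkage in ["ward", "complete", "average", "single"]:
    agg_labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(linkage, "linkage silhouette:", silhouette_score(X, agg_labels))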

Both partitional and hierarchical clustering methods rely on distance metrics to measure similarity between data vectors, and therefore operate on numeric data only. Two commonly used measures are Euclidean distance (the magnitude of the difference between two vectors) and cosine similarity (based on the angle between two vectors), which are visualized below for 2-dimensional vectors. With high-dimensional data, Euclidean distances become less discriminative, and angle-based measures such as cosine similarity are often preferred.

euclidean-cosine.png

(Image Credit: Manu Nellutla)
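As a quick numeric illustration of the difference between the two measures, consider two arbitrary 2-dimensional vectors that point in the same direction but have different lengths:

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])

# Euclidean distance: magnitude of the difference of the two vectors
euclidean = np.linalg.norm(a - b)                                  # 5.0

# Cosine similarity: cosine of the angle between the two vectors
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))    # 1.0

# Euclidean distance sees the vectors as far apart; cosine similarity sees
# them as identical in direction
print(euclidean, cosine)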

I will use k-means and hierarchical clustering to group U.S. counties based on their climate change risk and socially vulnerable demographics. Grouping counties by their levels of climate risk and social vulnerability can help inform how to allocate resources in disaster recovery and other climate-related initiatives. Clustering may also reveal relationships between where socially vulnerable populations live and where climate risk is high (for example, if two clusters emerge corresponding to (1) high social vulnerability with high climate risk and (2) low social vulnerability with low climate risk). In this work, I consider multiple socioeconomic vulnerabilities, specifically race and income, and how they interact with climate risk through an intersectional lens, since different dimensions of an individual’s identity interact in shaping one’s experience in the world.

 

I sourced the climate risk data from FEMA’s National Risk Index (NRI) and included features for total risk (the aggregated risk across 18 natural hazards), coastal flooding risk, and heat wave risk, which are calculated from factors such as historical natural disaster data. Although the NRI reports risk scores for 18 hazards, I chose coastal flooding and heat waves because these hazards are likely to be exacerbated by climate change in the future and directly impact people’s livelihoods. I also used the CDC’s Social Vulnerability Index (SVI), from which I included the percentage of people below 150% of the poverty estimate, the percentage of housing cost-burdened housing units, the percentage of people who are African-American, and the percentage of people who are Hispanic.

Data Preparation

Clustering requires unlabeled, numerical data. For the data preparation, I isolated the dimensions of interest (all numerical) from the NRI and SVI datasets and merged them on the county ID. Below are the NRI and SVI data before preparation.

nri.png

Raw NRI data

svi.png

Raw SVI data

The following is the processed data, with 3,143 data vectors, each corresponding to a U.S. county. The 7 features are: percentage of people below 150% of the poverty estimate, percentage of housing cost-burdened housing units, percentage of people who are African-American, percentage of people who are Hispanic, total climate risk, coastal flooding risk, and heat wave risk. I also normalized the data by column to avoid features with larger values dominating the distance calculations. The prepared data can be found here.

nri-svi-cleaned.png
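The preparation steps above can be sketched as follows; this is a minimal illustration assuming hypothetical file names and column names (e.g. STCOFIPS, EP_POV150), which may differ from the actual NRI and SVI downloads.

import pandas as pd

# Assumed file names for the raw downloads
nri = pd.read_csv("NRI_Table_Counties.csv")   # FEMA National Risk Index
svi = pd.read_csv("SVI_Counties.csv")         # CDC Social Vulnerability Index

# Keep only the features of interest plus the county FIPS identifier
# (column names are assumptions for illustration)
nri_cols = ["STCOFIPS", "RISK_SCORE", "CFLD_RISKS", "HWAV_RISKS"]
svi_cols = ["FIPS", "EP_POV150", "EP_HBURD", "EP_AFAM", "EP_HISP"]
merged = svi[svi_cols].merge(nri[nri_cols], left_on="FIPS",
                             right_on="STCOFIPS", how="inner").drop(columns=["STCOFIPS"])

# Min-max normalize each feature column so larger-valued features do not dominate
features = merged.drop(columns=["FIPS"])
normalized = (features - features.min()) / (features.max() - features.min())
normalized.insert(0, "FIPS", merged["FIPS"])
normalized.to_csv("nri_svi_cleaned.csv", index=False)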

Code

Code for k-means clustering in Python, application of the silhouette method, and visualization of clusters can be found here.

 

Code for hierarchical clustering in R and dendrogram visualizations can be found here.

Code to prepare the data can be found here.

Results

I ran k-means clustering using Euclidean distance as the distance metric, with values of k (the number of clusters) from 2 to 8. Using the silhouette method to compute silhouette scores, which represent how well each data point belongs to its assigned cluster compared to the nearest neighboring cluster, I determined that the best value of k (the one with the highest silhouette score) was 3. Since k = 2 and k = 4 had the next highest silhouette scores, I compared k-means clusterings of the data with k = 2, 3, and 4.

silhouette.png
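The sweep over k can be reproduced with a few lines of scikit-learn. This sketch assumes the prepared feature matrix from the data preparation step (file name hypothetical) rather than the exact project code linked above.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Normalized county features (file name assumed from the preparation sketch)
X = pd.read_csv("nri_svi_cleaned.csv").drop(columns=["FIPS"]).values

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric="euclidean")

best_k = max(scores, key=scores.get)
print(scores)
print("best k by average silhouette score:", best_k)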

The following are plots of the k-means clusters with k = 2, 3, and 4, using Principal Component Analysis (PCA) to reduce the dimensionality of the data to 2 dimensions, Principal Components 1 (PC1) and 2 (PC2).

kmeans-clusters-2.png
kmeans-clusters-3.png
kmeans-clusters-4.png

The matrix below expresses the 2 principal components in terms of the 7 features. The income-related (poverty and housing-burden) and African-American demographic features dominate principal component 1, while the income-related, Hispanic demographic, and total climate risk features dominate principal component 2. Based on the locations of the cluster centroids in the plots, it appears that 2 principal components capture the structure of the data only to a limited extent.

[poverty, housing-burdened, African-American, Hispanic, total climate risk, coastal flooding risk, heat wave risk]

pca-components.png
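The projection and the component loadings can be inspected directly with scikit-learn’s PCA. The sketch below assumes the same prepared feature matrix as above (file name hypothetical) and also prints the explained variance ratio, which quantifies the caveat that two components capture the data only to a limited extent.

import pandas as pd
from sklearn.decomposition import PCA

feature_names = ["poverty", "housing-burdened", "African-American", "Hispanic",
                 "total climate risk", "coastal flooding risk", "heat wave risk"]

# Normalized county features (file name assumed from the preparation sketch)
X = pd.read_csv("nri_svi_cleaned.csv").drop(columns=["FIPS"]).values

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                     # county coordinates in PC1/PC2 for plotting
loadings = pd.DataFrame(pca.components_, columns=feature_names, index=["PC1", "PC2"])
print(loadings.round(2))                        # how each feature weights each component
print("explained variance ratio:", pca.explained_variance_ratio_)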

Below are the centroids of the clusters (in the original 7 dimensions) for k = 2, 3, and 4, along with my analysis of what the groups represent for k = 2 and 3 (a description for k = 4 is more convoluted). I use low, medium, and high descriptors relative to the other centroids, except for the race/ethnicity features, where “high” means > 0.4, “medium” means between 0.1 and 0.4 (inclusive), and “low” means < 0.1, as expressed in the small helper below.
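def describe_share(value: float) -> str:
    # Descriptor rule for the race/ethnicity centroid values
    # (thresholds taken from the text above)
    if value > 0.4:
        return "high"
    if value >= 0.1:
        return "medium"
    return "low"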

k = 2:

[poverty, housing-burdened, African-American, Hispanic, total climate risk, coastal flooding risk, heat wave risk]

kmeans-centroids-2.png
  • Group 1: Low poverty and housing-burdened, low African-American population, low Hispanic population, low total risk, low coastal flooding risk, low heat-wave risk

  • Group 2: High poverty and housing-burdened, high African-American population, low Hispanic population, high total risk, high coastal flooding risk, high heat-wave risk

k = 3:

[poverty, housing-burdened, African-American, Hispanic, total climate risk, coastal flooding risk, heat wave risk]

kmeans-centroids-3.png
  • Group 1: High poverty and housing-burdened, high African-American population, low Hispanic population, medium total risk, medium coastal flooding risk, high heat wave risk

  • Group 2: Medium poverty and housing-burdened, low African-American population, high Hispanic population, high total risk, high coastal flooding risk, medium heat wave risk

  • Group 3: Low poverty and housing-burdened, low African-American population, low Hispanic population, low total risk, low coastal flooding risk, low heat wave risk

k = 4:

kmeans-centroids-4.png

I also ran hierarchical clustering in R using Ward’s minimum variance method, an agglomerative algorithm that aims to minimize the total within-cluster variance, with cosine distance (1 − cosine similarity) as the dissimilarity measure. The resulting dendrogram is shown below (due to the large number of data points, data labels are not shown).

hclust.png
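The project’s hierarchical clustering was done in R (linked above); for readers working in Python, a roughly equivalent SciPy sketch is shown below. Note that Ward linkage is formally defined for Euclidean distances, so applying it to cosine dissimilarities mirrors the R Ward-on-cosine workflow rather than a strict Ward method; the file name is assumed from the preparation sketch.

import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Normalized county features (file name assumed from the preparation sketch)
X = pd.read_csv("nri_svi_cleaned.csv").drop(columns=["FIPS"]).values

cos_dist = pdist(X, metric="cosine")          # 1 - cosine similarity, condensed form
Z = linkage(cos_dist, method="ward")          # agglomerative merge tree

dendrogram(Z, no_labels=True)                 # labels suppressed: too many counties
plt.show()

clusters_3 = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
clusters_2 = fcluster(Z, t=2, criterion="maxclust")   # and into 2 clusters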

Through visual inspection, the data appears to be best clustered into 3 clusters, which corresponds to cutting the dendrogram at approximately height = 0.23; a 2-cluster split is also plausible. This aligns with the results from k-means clustering and the silhouette method. The clusters are shown below.

hclust-3.png
hclust-2.png

Conclusion

In this part of the project, I used k-means and hierarchical clustering to group U.S. counties based on their climate change risk and socially vulnerable demographics. By applying the silhouette method with k-means clustering, I found that the structure of the data is best described by 3 clusters. These clusters not only help identify the areas with both high social vulnerability and high climate risk, but also suggest some correlation between where socially vulnerable populations live and where climate disaster risk is high. Specifically, the 3 clusters suggest that (1) counties with high poverty rates and large African-American populations have medium total climate risk, medium coastal flooding risk, and high heat wave risk; (2) counties with medium poverty rates and large Hispanic populations have high total climate risk, high coastal flooding risk, and medium heat wave risk; and (3) counties with low poverty rates and smaller African-American and Hispanic populations have comparably lower risk on all the climate dimensions. This information could potentially guide policy decisions and the allocation of resources, for example, preparing for heat wave-related impacts in the counties belonging to the first group. These findings also support the understanding that climate disasters, which are exacerbated by climate change, disproportionately affect certain socially vulnerable populations, not only in terms of their ability to respond and recover, but also in terms of where they live.
