Data Prep & EDA | Ruojia Sun

Social Disparity in Impacts of Climate Disasters in the United States

Overview: Data Preparation and Exploratory Data Analysis

This project combines data about climate disaster events, quality-of-life measures before and after these events, datasets that measure the social outcomes of natural disasters, and predicted climate disaster risks, gathered from a variety of sources.

Data gathered:

Weather and Climate Billion-Dollar Disasters in the US (1980-2023) (dataset, source: NOAA National Centers for Environmental Information (NCEI))
Household income data by race and age for the year of and following each included climate disaster in the affected state(s) (gathered using an API, source: US Census Bureau)
Displacement From Home Because of Natural Disaster, by Select Characteristics (2022 - 2023) (dataset, source: US Census Bureau)
National Risk Index, natural hazard risk data across the US (dataset, source: Federal Emergency Management Agency (FEMA))

I detail the data preparation and include visualizations for Exploratory Data Analysis for each dataset. All code for all data preparation and cleaning can be found here.

1. Weather and Climate Billion-Dollar Disasters in the US (1980-2023)

Dataset, source: NOAA National Centers for Environmental Information (NCEI)

This dataset is record data of billion-dollar climate disasters in the US from 1980 to 2023, using CPI-adjusted cost. The columns are Name, Disaster, Begin Date, End Date, Total CPI-Adjusted Cost (Millions of Dollars), and Deaths. The dataset is available for all disasters in the US, by region, or by state. I started with the dataset of all disasters in the US, but I needed location data for each disaster in order to find the quality-of-life indicators for my 2nd dataset. I therefore downloaded the data for disasters in every state and combined these datasets into one, keeping track of the state(s) affected in a new column, Location. I may gather more data later to give the location data a higher spatial resolution (i.e. by county).

Raw data: Example - climate disasters by state (Alaska)

Prepared + cleaned data

In addition to combining the datasets for each state, one cost value listed as "TBD" was manually replaced with an estimate found through a Google search. At this stage, all columns and rows were kept to retain as much potentially useful data as possible.
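The combining step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function name combine_state_records, the choice of (Name, Begin Date) as the key that identifies a disaster, and the sample rows are all assumptions made for the example.

```python
def combine_state_records(records_by_state):
    """Merge per-state disaster rows into one row per disaster.

    records_by_state maps a state label to a list of row dicts.
    A disaster is identified by (Name, Begin Date), which is an
    assumption of this sketch. The first row seen for a disaster is
    kept, and every state that lists it is appended to Location.
    """
    combined = {}
    for state, rows in records_by_state.items():
        for row in rows:
            key = (row["Name"], row["Begin Date"])
            if key not in combined:
                combined[key] = dict(row, Location=[])
            combined[key]["Location"].append(state)
    return list(combined.values())


# Placeholder rows, not real NOAA records.
example = {
    "AL": [{"Name": "Example Storm A", "Begin Date": "20200101"}],
    "GA": [
        {"Name": "Example Storm A", "Begin Date": "20200101"},
        {"Name": "Example Flood B", "Begin Date": "20210505"},
    ],
}
merged = combine_state_records(example)
```

Keeping Location as a list makes it straightforward to count the states affected by each disaster later on.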

Visualization and exploratory data analysis:

These visualizations show that climate disasters have been more concentrated in the southeastern, central, and southern US, and that the number of billion-dollar disasters each year is trending upwards.
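A yearly count like the one behind the trend plot could be computed roughly as below. This sketch assumes that Begin Date starts with a four-digit year (for example a YYYYMMDD value) and that rows may repeat once per affected state, so each disaster is counted only once. The function name and sample rows are hypothetical.

```python
from collections import Counter


def disasters_per_year(rows):
    """Count unique disasters per year from their Begin Date."""
    seen = set()
    counts = Counter()
    for row in rows:
        key = (row["Name"], row["Begin Date"])
        if key in seen:
            continue  # same disaster listed again for another state
        seen.add(key)
        year = int(str(row["Begin Date"])[:4])
        counts[year] += 1
    return dict(sorted(counts.items()))


# Placeholder rows, not real NOAA records.
sample_rows = [
    {"Name": "Example Storm A", "Begin Date": "20200101"},
    {"Name": "Example Storm A", "Begin Date": "20200101"},
    {"Name": "Example Flood B", "Begin Date": "20210505"},
    {"Name": "Example Fire C", "Begin Date": "20210801"},
]
yearly = disasters_per_year(sample_rows)
```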

2. Household income data by race and age for the year of and following each included climate disaster in the affected state(s)

Gathered using an API, source: US Census Bureau

The US Census Bureau offers several types of record datasets. I used the American Community Survey, which has estimates for each year. The survey also contains different types of tables with varying levels of detail (from 65K variables to broad profiles). I searched through the variables to find relevant ones, and used the variables for estimated median household income by race and age from the subject tables. For each included disaster, I queried this data for the year of and the year following the disaster, in the affected state(s). I only included disasters that affected 3 or fewer states, to better spatially resolve the location of the impact while still retaining sufficient data points. I also only included disasters where both the year of the disaster and the following year had the relevant census data available (i.e. disasters from 2009-2019 and 2021; 2020 data was not released because of the impacts of COVID-19 on data collection).
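The selection criteria and the query can be sketched as below. This is an illustration under stated assumptions, not the project's actual code. In particular, the dataset path acs/acs1/subject (the 1-year ACS subject tables) is a guess, since the write-up does not say which ACS product was queried, and the set of available years is passed in rather than hard-coded. The helper names are hypothetical, and no network request is made here.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data"


def is_eligible(disaster_year, states, available_years, max_states=3):
    """Apply the two inclusion criteria described in the text."""
    return (
        len(states) <= max_states
        and disaster_year in available_years
        and disaster_year + 1 in available_years
    )


def acs_subject_url(year, state_fips, group="S1903",
                    dataset="acs/acs1/subject"):
    """Build a query URL for one table group, one state, one year.

    The dataset path is an assumption of this sketch.
    """
    query = urlencode(
        {"get": f"group({group})", "for": f"state:{state_fips}"},
        safe="():",
    )
    return f"{BASE_URL}/{year}/{dataset}?{query}"


# Illustrative inputs only.
years_with_data = {2016, 2017, 2018}
eligible = is_eligible(2017, ["TX"], years_with_data)
url = acs_subject_url(2017, "48")
```

Each eligible disaster would need one such URL per affected state for the disaster year and another for the following year.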

Raw data: Each query returned JSON data like this, where the first list contains all the variables in the group S1903, and the second list contains the values.
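A response in that shape (a header list followed by a list of values) can be turned into a dictionary keyed by variable code. The sketch below assumes the response body is a JSON array of arrays, as described above. The values in the sample string are placeholders, not real census estimates.

```python
import json


def parse_census_response(text):
    """Convert a [header, row, ...] JSON payload into a list of dicts."""
    header, *rows = json.loads(text)
    return [dict(zip(header, row)) for row in rows]


# Placeholder payload in the same two-list shape.
sample = '[["NAME", "S1903_C01_001E", "state"], ["Alabama", "100", "01"]]'
records = parse_census_response(sample)
```

Values arrive as strings, so numeric columns still need to be converted before any analysis.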

Prepared + cleaned data:

I cleaned the data by keeping only the columns for estimated median household income by race and age. This took some trial and error as the variable codes changed after 2016. I also replaced not-applicable data (labeled with negative values) with the estimated income for all groups.
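The replacement of not-applicable values could look roughly like this. The column names used here (all_households, group_a, group_b) are hypothetical stand-ins for the actual S1903 variable codes, which differ between years, and the function name is also an assumption.

```python
def fill_not_applicable(record, overall_key, group_keys):
    """Replace negative (not-applicable) group values with the overall value.

    record maps column names to numeric strings or numbers. Only the
    overall column and the listed group columns are kept.
    """
    overall = float(record[overall_key])
    cleaned = {overall_key: overall}
    for key in group_keys:
        value = float(record[key])
        cleaned[key] = overall if value < 0 else value
    return cleaned


# Placeholder values, with a negative number marking missing data.
raw = {"all_households": "50000", "group_a": "-666666666",
       "group_b": "42000", "unused_column": "1"}
clean = fill_not_applicable(raw, "all_households", ["group_a", "group_b"])
```

One consequence of this choice is that filled-in groups become indistinguishable from the overall estimate, which is worth keeping in mind when comparing groups.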

Visualization and exploratory data analysis:

The top 2 visualizations were used to check that the data seems correct overall. These plots helped me correct some mistakes I initially made in the data gathering. All estimated median household income values seem reasonable. The included climate disasters do not necessarily represent the frequency of disaster events overall; rather, they are the set of events I was able to include when gathering the census income data, based on the criteria stated previously. The final visualization gives an example of comparing estimated median household income by race and age across 2 climate events. There are no clear patterns from this, as can be expected from a sample of 2.

3. Displacement From Home Because of Natural Disaster, by Select Characteristics (2022 - 2023)

Dataset, source: US Census Bureau

The US Census Bureau's Household Pulse Survey was designed to quickly collect data on the impacts of COVID-19, and has since expanded to include data about the effects of natural disasters, such as displacement from home because of natural disaster, by demographic. This data is available at a national or state level. Although these measures are highly applicable to this project, the data is only available from December 2022 to August 2023, and is thus best used as an addendum to the previous census data.

Raw data:

I did not clean this data at this stage since it is already well-processed, in order to retain as much potentially useful data as possible.

Visualization and exploratory data analysis:

These simple visualizations are already helpful for showing some of the social disparities we can expect to see in the aftermath of climate disasters. Rates of displacement from home seem to depend, at least to some extent, on all 3 social dimensions of income level, race, and sexual orientation. Being able to connect this dataset with others, such as the specific climate events, may reveal interesting patterns.

4. National Risk Index, natural hazard risk data across the US

Dataset, source: Federal Emergency Management Agency (FEMA)

FEMA publishes data on the natural hazard risk for each US county that helps illustrate which US communities are most at risk for 18 natural disasters, including hurricanes and heat waves, as well as presenting an aggregated risk score.

Raw data:

I cleaned this data by manually deleting columns for natural disasters that are outside the scope of this study, such as avalanches, earthquakes, and volcanic activity.
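The same column removal could also be done programmatically, which makes the step repeatable. The sketch below assumes that each hazard's columns share a common prefix. The prefixes AVLN_, ERQK_, and VLCN_ and the column names in the sample row are assumptions for illustration and should be checked against the actual file's data dictionary.

```python
def drop_hazard_columns(rows, excluded_prefixes):
    """Return copies of the rows without columns for excluded hazards."""
    prefixes = tuple(excluded_prefixes)
    return [
        {col: val for col, val in row.items()
         if not col.startswith(prefixes)}
        for row in rows
    ]


# Placeholder row with hypothetical column names and values.
nri_rows = [{
    "COUNTY": "Example County",
    "HRCN_RISKS": 10.0,
    "AVLN_RISKS": 1.0,
    "ERQK_RISKS": 2.0,
    "VLCN_RISKS": 3.0,
}]
in_scope = drop_hazard_columns(nri_rows, ["AVLN_", "ERQK_", "VLCN_"])
```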

Visualization and exploratory data analysis:

Visualization of this dataset makes it immediately apparent which areas of the United States could be most impacted by climate disasters. The second visualization in particular reveals how a few locations may be disproportionately affected by certain climate events.

Concluding Thoughts

I am still in the process of gathering more and higher quality data to support my analysis, as well as working on refining my decision making behind gathering and cleaning the data. Data gathering and cleaning is an iterative process, and the quality of the data that emerges from this stage will heavily impact the quality of the analysis.