Association Rule Mining

Social Disparity in Impacts of Climate Disasters in the United States

Association Rule Mining: Overview

Association Rule Mining (ARM) is an unsupervised learning method for discovering correlations (or associations) among items in a dataset. A common application is Market Basket Analysis, which uses purchase transaction datasets to find associations between a customer buying one item (or set of items) and that customer also buying another item (or set of items). ARM is also powerful for identifying correlations in text data.

(Image Credit: Mathworks)

ARM is run on unlabeled transaction data, which is data that is captured from a transaction or event, such as the fabricated example below.

ARM generates rules of the form {A} -> {B}, where {A} and {B} are sets of items drawn from the set of all possible transaction items. The rule states that if {A} occurs in a transaction, {B} also occurs in that transaction. Example rules with the transaction dataset above include {coffee} -> {chocolate} and {coffee, bread} -> {lettuce, bacon}. ARM generates several measures for each rule:

Support is how often A and B occur together compared to all transactions:
Sup(A,B) = P(A,B) = (Count of A, B together) / (# transactions)
Confidence is how often A and B occur together compared to transactions with A. This measure is useful because it shows whether B is likely to occur when A occurs, even when A itself occurs infrequently.
Conf(A,B) = P(B|A) = P(A,B) / P(A) = (Count of A, B together) / (Count of A)
Lift is a correlation measure defined as
Lift(A,B) = P(A,B) / [P(A)P(B)] = (# transactions)(Count of A, B together) / [(Count of A)(Count of B)]
This measure tells us:
If Lift(A,B) = 1: A and B are independent
If Lift(A,B) > 1: A and B are positively correlated
If Lift(A,B) < 1: A and B are negatively correlated
We are looking for rules with Lift > 1, which indicate a positive correlation (or association).
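
As a quick check on the three definitions above, here is a minimal Python sketch that computes them directly from counts. The transactions are invented for illustration (they are not the STRONG survey data), and the function names are my own rather than part of any library.

```python
# Toy transactions, invented for illustration only.
transactions = [
    {"coffee", "chocolate"},
    {"coffee", "chocolate", "bread"},
    {"coffee", "bread"},
    {"bread", "lettuce"},
    {"chocolate"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """P(rhs | lhs): support of the combined itemset divided by support of lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    """P(lhs, rhs) / (P(lhs) * P(rhs)), written as confidence / support(rhs)."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

# Measures for the rule {coffee} -> {chocolate}
sup = support({"coffee", "chocolate"}, transactions)
conf = confidence({"coffee"}, {"chocolate"}, transactions)
lft = lift({"coffee"}, {"chocolate"}, transactions)
print(f"support={sup:.3f} confidence={conf:.3f} lift={lft:.3f}")
```

In this toy data, coffee and chocolate each appear in 3 of 5 transactions and together in 2, so the lift is slightly above 1, which reads as a weak positive association.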

ARM is commonly implemented with the Apriori algorithm, which prunes the search space: if an itemset does not meet the minimum support threshold that we establish, none of its supersets can meet it either, because a superset can occur in at most as many transactions as any of its subsets (e.g. {coffee, bread, lettuce, bacon} can be no more frequent than {coffee, bread}). The Apriori algorithm is visualized below.

(Image Credit: Prof. Ami Gates)
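
The pruning idea behind Apriori can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions (toy transactions, a hypothetical function name, support as the only threshold), not the implementation used in this project, which relies on R.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = sup(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop any candidate with an infrequent (k-1)-subset,
        # since such a candidate cannot meet the support threshold.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = sup(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Toy transactions, invented for illustration only.
transactions = [
    {"coffee", "chocolate"},
    {"coffee", "chocolate", "bread"},
    {"coffee", "bread"},
    {"bread", "lettuce"},
    {"chocolate"},
]
frequent = apriori_frequent_itemsets(transactions, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

Here {chocolate, bread} falls below the threshold, so the three-item candidate {coffee, chocolate, bread} is discarded in the prune step without its support ever being counted.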

In this project, I apply association rule mining to uncover associations in data from the Survey of Trauma, Resilience, and Opportunity Among Neighborhoods in the Gulf (STRONG) II (2018) with respondents residing in Louisiana, Alabama, Mississippi, and Florida following hurricanes Harvey, Irma, and Nate in 2017. This dataset includes individual survey responses on aspects such as whether the respondent was living in an area affected by the hurricane(s), whether they experienced home damage due to the hurricane(s), job loss due to the hurricane(s), adversities due to the hurricane(s) such as being unable to meet rent or mortgage; assessments of efficacy, depression, and post-traumatic stress disorder (PTSD); as well as demographic data such as sex, age, income, and race/ethnicity. I use association rule mining to discover associations between demographic groups and reported impacts of the hurricane.

Data Preparation

The raw STRONG survey data was primarily in numerical format, with integer codes for different responses to survey questions. A sample of the raw data is shown below.

Raw STRONG survey data

Since ARM requires unlabeled transaction data, I determined which survey data I was interested in mining and how to translate the coded responses into transaction items, giving each response its own unique item label. I decided to use the following features and values. Some correspond directly to individual survey questions and their response codes, whereas others are aggregated from multiple related questions.

For some questions, I converted all the possible values into transaction items, and for others, I kept and converted only a subset of values. For example, for whether the respondent was injured, I included a transaction item for the injured response but not for the non-injured response. The non-injured response was so prevalent that it would likely show up in many rules during association rule mining, and it wasn’t something I was interested in when trying to understand the impacts of the hurricanes. I also dropped rows where the respondents were neither living in an affected area nor present for the 2017 hurricane(s). Below are the data dimensions and the transactional data values I included for each storm impact dimension.

This is what the data looked like after preparation. In addition to the hurricane impacts above, it also includes demographic information (“aian” corresponding to American Indian or Alaska Native). In total, there are 254 transactions. The prepared data can be found here.
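
The recoding step described above, from integer response codes to transaction items, can be sketched as follows. The column names, integer codes, and item labels here are all hypothetical stand-ins, not the actual STRONG codebook or the linked preparation code.

```python
# Hypothetical codebook: column names and integer codes are invented
# for illustration and do not reflect the real STRONG survey coding.
ITEM_MAP = {
    "sex": {1: "male", 2: "female"},
    # Only the "injured" response becomes an item; the highly prevalent
    # "not injured" code (assumed to be 2 here) is deliberately left unmapped.
    "injured": {1: "injured"},
    "home_damage": {1: "home destroyed", 2: "home damaged but livable"},
}

def row_to_transaction(row):
    """Convert one coded survey row (a dict) into a list of item labels."""
    items = []
    for column, codes in ITEM_MAP.items():
        label = codes.get(row.get(column))
        if label is not None:  # unmapped or missing codes produce no item
            items.append(label)
    return items

# Two invented respondents.
rows = [
    {"sex": 2, "injured": 2, "home_damage": 2},
    {"sex": 1, "injured": 1, "home_damage": 3},
]
transactions = [row_to_transaction(r) for r in rows]
for t in transactions:
    print(",".join(t))
```

Leaving a code out of the mapping is what keeps a response from becoming an item, so the choice of which values to keep lives entirely in the codebook dictionary.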

Code

Code for ARM in R, with top rules, and visualization of the results can be found here.

Code to prepare the data can be found here.

Results

For Association Rule Mining, I used thresholds of support = 0.05, confidence = 0.5, and minlen = 2. I chose a low threshold for support because there were some values that were really infrequent (such as medium or high adversity due to the hurricane occurring with a particular demographic), but interesting and relevant for this analysis. Below is a plot of the top 20 most frequent data items, which indicates that female, white, and age 51-75 are the most represented demographics in this dataset:

The top 15 rules, sorted by support, were:

These are not particularly insightful, as they’re mostly rules from one demographic group to another, and I am interested in rules that map between demographics and storm impact. Furthermore, a majority of the rules have lift less than 1, which means their items co-occur less often than would be expected if they were independent, so they do not indicate a positive association. The large number of occurrences of female, white, and 51-75 aligns with the highest frequency items from the frequency plot. However, we can learn from this that the storm impact and demographic with the highest co-occurrence, based on this survey, were female respondents who faced adversity due to the hurricane(s).

The top 15 rules, sorted by confidence, were:

These rules mostly state that when certain demographics and/or medium adversity occur, the female or white demographic also occurs. Again, this may be a result of female and white demographics having a higher representation in the dataset. Some interesting rules are rule 1 {10k-20k, medium adversity} => {female}, rule 3 {medium adversity, over 75} => {female}, and rule 9 {black, medium adversity} => {female}. These show high confidence that a respondent is female, given that the respondent experienced medium adversity due to the hurricane(s) AND belongs to a socially vulnerable demographic group (low income, over 75, or black).

The top 15 rules, sorted by lift, were:

The top 4 rules sorted by lift all have lift > 2 and all include the black demographic, even though this demographic was largely absent from the top rules sorted by support or confidence. The top 3 rules indicate strong associations between (being aged 51-75 AND experiencing high adversity due to the hurricane(s)) and being black; experiencing high adversity and being black; and (being female AND experiencing high adversity) and being black.

Below is a visualization of the top 15 rules sorted by lift as a network, which further illustrates the strong associations between high adversity and black demographic, as well as the intersectionality between being black, aged 51-75, and female in experiencing high adversity due to the hurricane(s).

The only measures of hurricane impact that appeared during this analysis were medium and high adversity, so I experimented with running ARM again with support = 0.01, confidence = 0.5, and minlen = 2, specifying that “lost job” occurs on the right-hand side.
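
To illustrate what fixing the right-hand side of a rule means, here is a small brute-force Python sketch. It is not the R code used for the results in this section; the transactions are invented, and the function name and the `max_lhs` parameter are hypothetical.

```python
from itertools import combinations

def rules_with_rhs(transactions, rhs_item, min_support, min_confidence, max_lhs=2):
    """Enumerate rules {lhs} -> {rhs_item} that meet both thresholds."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def count(itemset):
        return sum(itemset <= t for t in transactions)

    rhs = frozenset([rhs_item])
    rhs_support = count(rhs) / n
    lhs_items = sorted({i for t in transactions for i in t} - rhs)

    rules = []
    for size in range(1, max_lhs + 1):  # a rule has at least 2 items (minlen = 2)
        for combo in combinations(lhs_items, size):
            lhs = frozenset(combo)
            lhs_count = count(lhs)
            rule_count = count(lhs | rhs)
            if lhs_count == 0 or rule_count / n < min_support:
                continue
            conf = rule_count / lhs_count
            if conf < min_confidence:
                continue
            rules.append({
                "lhs": set(lhs),
                "rhs": rhs_item,
                "support": rule_count / n,
                "confidence": conf,
                "lift": conf / rhs_support,
            })
    return sorted(rules, key=lambda r: r["lift"], reverse=True)

# Invented transactions, not drawn from the survey.
toy = [
    {"male", "51-75", "lost job"},
    {"male", "51-75"},
    {"female", "51-75"},
    {"female", "18-30", "lost job"},
    {"female", "18-30"},
    {"male", "18-30"},
]
rules = rules_with_rhs(toy, "lost job", min_support=0.1, min_confidence=0.5)
for r in rules:
    print(sorted(r["lhs"]), "=>", r["rhs"], round(r["lift"], 2))
```

Every rule in this toy example rests on a single transaction, which is the same small-count caveat that applies whenever the support threshold is set very low.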

Below are the top 15 rules related to job loss, sorted by lift, and visualization of these rules as a network:

This shows that job loss is most highly correlated with individuals who are 51-75, black, and male. However, because the support and total counts are low, this result should be taken with a grain of salt: there are likely too few samples to establish a strong association.

Conclusion

I applied Association Rule Mining (ARM) to identify associations between hurricane impacts and demographics based on survey data of people affected by hurricanes Harvey, Irma, and/or Nate in 2017. The rules with highest support and confidence were dominated by the demographics that were most represented in the dataset (female, white, and age 51-75). However, these measures still revealed information in their own ways. For example, the highest support rules showed that the storm impacts and demographics with the highest co-occurrence, based on this survey, were females who faced medium adversity due to the hurricane(s) and white individuals whose homes were damaged but still livable. The most useful rules for associations were the rules with high lift. The top 3 rules with the highest lift support that there are strong associations between (being aged 51-75 AND experiencing high adversity due to the hurricane(s)) and being black; experiencing high adversity and being black; and (being female AND experiencing high adversity) and being black. These rules together reveal strong associations between being black, aged 51-75, and female in experiencing high adversity due to the hurricane(s), where high adversity corresponds to experiencing at least 3 hurricane impacts, including not meeting all essential expenses, not paying the full rent or mortgage, being evicted from a home or apartment, not having adequate food, and more. These results contribute to an intersectional understanding of the impacts of hurricanes Harvey, Irma, and Nate in the Gulf states.