
Social Disparity in Impacts of Climate Disasters in the United States

Support Vector Machines: Overview

Support vector machines (SVMs) are linear models used for classification and regression problems. SVMs are a supervised learning approach, meaning they use labeled data to train a model that predicts values for new data vectors. The goal of an SVM is to find an optimal linear separator, or hyperplane, that divides the data vectors into two classes. A hyperplane (of dimension n-1) separates a vector space (of dimension n) into two subspaces. For example, a hyperplane in a 2-dimensional space is a line, and a hyperplane in a 3-dimensional space is a plane; this generalizes to higher dimensions.

hyperplane.png

(Image credit: Rohith Gandhi)

 

There can be an infinite number of hyperplanes that successfully separate the data into classes. To find the optimal separator, SVMs maximize the margin between the hyperplane and the data vectors closest to it, which are called the support vectors. Finding the equation of this hyperplane can be formulated as an optimization problem. The figure below illustrates multiple possible hyperplanes on the left, and the optimal hyperplane, margin, and support vectors (filled-in points) on the right.

svm.png

(Image credit: Rohith Gandhi)
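
To make the margin and support vectors concrete, here is a minimal sketch using scikit-learn's SVC on toy 2D data; the points and the large-C (near hard-margin) setting are invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data: two linearly separable clusters (invented for illustration)
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin (little tolerance for error)
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# The support vectors are the points closest to the separating hyperplane
print("Support vectors:\n", clf.support_vectors_)

# For a linear kernel, the margin width is 2 / ||w||
w = clf.coef_[0]
print("Margin width:", 2 / np.linalg.norm(w))
```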


For data that is not linearly separable in the original vector space, SVMs apply kernel functions, which map the data into higher-dimensional spaces where it becomes linearly separable. Kernels do not actually transform the data into the projected space; rather, they compute the dot product between two points in the projected space. Geometrically, the dot product of vectors a and b, where θ is the angle between them, is defined as

\[ \mathbf{a} \cdot \mathbf{b} = \lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert \cos\theta \]

This can be understood as the magnitude of the projection of one vector onto a second vector, multiplied by the magnitude of the second vector. This is significant because the dot products between pairs of points in the projected space, which the kernel computes directly, are all the SVM needs to determine the linear classifier; the data never has to be explicitly transformed into the higher-dimensional space.
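
As a quick numeric sanity check of the geometric definition, the following sketch (with arbitrary example vectors) measures the angle between two vectors independently and confirms that |a||b|cos θ matches the algebraic dot product:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([2.0, 1.0])

# Algebraic dot product
dot = a @ b  # 10.0

# Geometric form: measure the angle between the vectors independently,
# then apply |a| |b| cos(theta)
theta = np.arctan2(a[1], a[0]) - np.arctan2(b[1], b[0])
geometric = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta)

print(dot, geometric)  # both 10.0
```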

Some examples of kernels include the linear, polynomial, and radial (rbf) kernels. The polynomial kernel is given by

\[ K(\mathbf{a}, \mathbf{b}) = (\mathbf{a} \cdot \mathbf{b} + r)^d \]

where a, b are vectors, r is the coefficient of the polynomial, and d is the degree of the polynomial.

An example of casting 2D vectors into a higher dimension using a polynomial with r = 1 and d = 2 is as follows:

\[ (\mathbf{a} \cdot \mathbf{b} + 1)^2 = a_1^2 b_1^2 + a_2^2 b_2^2 + 2a_1 a_2 b_1 b_2 + 2a_1 b_1 + 2a_2 b_2 + 1 = \varphi(\mathbf{a}) \cdot \varphi(\mathbf{b}), \quad \varphi(\mathbf{x}) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; 1\right) \]
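
This identity is easy to verify numerically. The sketch below, with made-up 2D vectors, checks that the kernel value (a·b + 1)² equals the dot product of the explicitly mapped 6-dimensional vectors:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for a 2D vector (r=1, d=2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

kernel_value = (a @ b + 1) ** 2      # computed without leaving 2D
explicit_value = phi(a) @ phi(b)     # dot product in the 6D feature space

print(kernel_value, explicit_value)  # identical (up to floating point): 25.0
```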

The rbf kernel comes from the Gaussian equation, and is given by

\[ K(\mathbf{a}, \mathbf{b}) = e^{-\gamma \lVert \mathbf{a} - \mathbf{b} \rVert^2} \]

where a, b are vectors, and γ scales the influence the two points have on each other. Below is a visualization of the rbf kernel in the original 2D space and the mapped feature space.

rbf-plot.png

(Image credit: Suvigya Saxena)
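
As a sketch (with arbitrary vectors and an illustrative γ), the rbf value can be computed by hand and checked against scikit-learn's rbf_kernel:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([[1.0, 2.0]])
b = np.array([[2.0, 0.0]])
gamma = 0.5

# By hand: exp(-gamma * ||a - b||^2)
manual = np.exp(-gamma * np.sum((a - b) ** 2))

# scikit-learn's implementation of the same kernel
library = rbf_kernel(a, b, gamma=gamma)[0, 0]

print(manual, library)  # both ~0.0821
```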

 

A comparison of linear, polynomial, and rbf kernels is visualized below.

svm-scikit.png
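
A comparison along these lines can be reproduced with a sketch like the following (toy circular data and simplified plotting, not the exact figure above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data that is not linearly separable in 2D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, kernel in zip(axes, ["linear", "poly", "rbf"]):
    clf = SVC(kernel=kernel, degree=3, gamma="scale").fit(X, y)

    # Evaluate the decision function on a grid to draw the boundary
    xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 200),
                         np.linspace(-1.5, 1.5, 200))
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    ax.contourf(xx, yy, Z > 0, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, s=15)
    ax.set_title(f"{kernel} kernel")
plt.show()
```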

In this portion of the project, I apply SVMs to classify news headlines about hurricanes into two classes: ones that address (social) inequality, and ones that do not. Such a classifier can reveal how (and whether) climate disasters are reported differently when coverage addresses social inequality, and use those differences to label new articles. This headline classification can be further applied to understand trends in climate reporting or to recommend news articles on the intersection of climate disasters and social disparity.

Data Preparation

With supervised learning algorithms such as SVMs, labeled data is required in order to train the model to classify data or make predictions. Furthermore, when applying SVMs, the data must be numeric, because SVMs fundamentally solve a mathematical optimization problem on numerical vectors.

 

For my SVM models, I use vectorized text data of news headlines scraped from Google News search results. Headlines returned for the query “hurricane AND inequality” are labeled “inequality,” and those returned for “hurricane NOT inequality” are labeled “not inequality.” The data is balanced, with 100 data vectors per class, which is important for supervised learning. Below is a sample of the vectorized data, which contains 100 features (I originally used 50, but this did not yield good SVM performance). This data can be found here.
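The exact pipeline is in the linked code; as a rough sketch of the vectorization step, assuming a CountVectorizer capped at 100 features (the example headlines and variable names here are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the scraped headlines and their query-based labels
headlines = [
    "Hurricane recovery costs fall hardest on poor survivors",
    "Tropical storm strengthens into hurricane ahead of landfall",
]
labels = ["inequality", "not inequality"]

# Cap the vocabulary at the 100 most frequent terms, matching the 100 features
vectorizer = CountVectorizer(max_features=100, stop_words="english")
X = vectorizer.fit_transform(headlines)

print(X.shape)  # (n_headlines, n_features)
print(vectorizer.get_feature_names_out()[:10])
```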

For supervised learning algorithms such as SVMs, we need to split the data into a training set to train the model and a testing set to evaluate its accuracy. The training and testing sets must be disjoint, so that the model does not see the testing data during training. This ensures that the testing data can accurately evaluate the model's performance on new data. To prepare the data for SVM modeling, I used a 70%-30% training-to-testing split. Samples of the training and testing datasets are shown below.

Training data

Testing data
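
The split itself might look like the following sketch, assuming the vectorized features X and labels y from the preparation step (the random_state is illustrative):

```python
from sklearn.model_selection import train_test_split

# 70%-30% split; stratify keeps both classes balanced in each set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```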

Code

Code to model SVMs in Python can be found here.

Code for scraping and vectorizing the news headlines can be found here.

Results

I ran SVMs with linear, rbf, and polynomial kernels (degrees 2 and 3). While creating each model, I experimented with different values of C, the tradeoff parameter between margin and error in a soft-margin SVM, to optimize accuracy. A soft-margin SVM allows a certain number of points to be misclassified in order to maximize the margin.
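
The tuning process amounts to a loop like this sketch (shown for the linear kernel, assuming the training/testing split from above):

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Try several values of the soft-margin tradeoff parameter C
for C in [0.1, 1, 10]:
    clf = SVC(kernel="linear", C=C)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"C={C}: accuracy={acc:.4f}")
```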

 

For the SVM with the linear kernel, I used C=0.1, 1, and 10. The resulting models had accuracies of 78.33% (C=0.1), 80% (C=1), and 78.33% (C=10). The confusion matrix for the best C, C = 1, is:

linear-cm.png
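
A confusion matrix like this can be generated with scikit-learn, e.g. in a sketch along these lines (assuming the split from above and the best linear model):

```python
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

clf = SVC(kernel="linear", C=1).fit(X_train, y_train)
cm = confusion_matrix(y_test, clf.predict(X_test))
ConfusionMatrixDisplay(cm, display_labels=clf.classes_).plot()
```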

For the SVM with the rbf kernel, I used C=100, 200, and 500. The resulting models had accuracies of 81.67% (C=100), 83.33% (C=200), and 73.33% (C=500). The confusion matrix for the best C, C = 200, is:

For the SVM with the polynomial kernel of degree 2, I used C=1000, 5000, and 10000. The resulting models had accuracies of 80% (C=1000), 81.67% (C=5000), and 73.33% (C=10000). The confusion matrix for the best C, C = 5000, is:

polynomial-cm.png

For the SVM with the polynomial kernel of degree 3, I used C=1000, 5000, and 10000. The resulting models had accuracies of 80% (C=1000), 81.67% (C=5000), and 73.33% (C=10000). The confusion matrix for the best C, C = 5000, is:

polynomial3-cm.png

A comparison of all the kernels is shown below:

All kernels achieved accuracies of at least 80% with their best C. In this case, the rbf model had the best performance. The best C parameter was larger for the more complex (i.e. higher-degree) models. The top features for the two labels (negative for inequality, positive for not inequality) are visualized below:

Some of the top features for the inequality class are “cost,” “destruction,” “recovery,” and “survivors,” which deal with the impact of the disaster. Others, like “climate” and “disaster,” contextualize the event as a climate issue. Revealingly, “racial” also appeared as a top word, indicating that race is often involved when inequality is mentioned in hurricane reporting. For the not inequality class, some of the top features are “ahead,” “storm,” “enters,” and “cyclone,” which relate more to the occurrence of the climate event itself.
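
For the linear kernel, top features like these can be read off the model's coefficients; a sketch, assuming the fitted linear-kernel model and the vectorizer from the earlier sketches:

```python
import numpy as np

# Linear-kernel SVC coefficients; coef_ is sparse when the input is sparse
coefs = clf.coef_.toarray().ravel() if hasattr(clf.coef_, "toarray") else clf.coef_.ravel()
feature_names = vectorizer.get_feature_names_out()

# Sort features by coefficient: most negative -> inequality,
# most positive -> not inequality
order = np.argsort(coefs)
print("Top 'inequality' features:", feature_names[order[:10]])
print("Top 'not inequality' features:", feature_names[order[-10:]])
```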

Conclusion

I used SVMs to classify news headlines about hurricanes into ones that addressed inequality and ones that did not. I created SVMs with 4 different kernels: linear, rbf, and polynomial with degrees 2 and 3. All 4 SVM models reached at least 80% accuracy on this classification problem, with the rbf kernel performing best at 83.33%. This indicates that there are patterns in how hurricanes are reported differently depending on whether coverage addresses inequality, and that these patterns can be modeled fairly accurately by SVMs. The top features for the inequality class, such as “cost,” “destruction,” “recovery,” and “survivors,” were more related to the impact of the disaster, whereas the top features for the not inequality class were more related to the occurrence of the climate event. Furthermore, top words for the inequality class such as “climate” and “disaster” contextualize the event as a climate issue, and “racial” also appeared as a top word, possibly suggesting the significance of racial inequality in hurricane impacts.
