Neha kumari
4 min read · Jan 9, 2020

Is Income Inequality Related to High Cancer Mortality Risk?

Predicting Cancer Mortality Risk Based on Socioeconomic Factors

Cancer is the second leading cause of death in the USA, after heart disease. The data set reveals that death rates aren’t uniformly distributed across the United States. In other words, some states reported higher death rates (per 100,000 people) than others. So I was curious: apart from well-known factors such as alcohol consumption and smoking, does this high rate have any relation to income inequality and education level?

Data Cleaning and Feature Engineering

To try to answer this question, I collected data on key socioeconomic factors for each county, including indicators for education (percent of adults 25 and older with at least a high school degree), diversity (percent non-white population and percent non-citizen), and economic health (median household income, unemployment rate, and percent poverty). (You can find more about the data on GitHub here.)

To turn this into a classification problem, I derived my own target: I calculated the mean of the average death rates, then labeled every county above the mean as high mortality risk and every county below the mean as low mortality risk.
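As a minimal sketch of that labeling step (not the project's actual code), the example below uses a tiny made-up table. The column names `avg_death_rate` and `high_risk` and all of the values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the county data; column names and values are made up.
counties = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E"],
    "avg_death_rate": [150.0, 170.0, 180.0, 200.0, 210.0],  # deaths per 100,000
})

# The threshold is the mean of the county-level average death rates.
threshold = counties["avg_death_rate"].mean()

# 1 = high mortality risk (above the mean), 0 = low mortality risk.
counties["high_risk"] = (counties["avg_death_rate"] > threshold).astype(int)

print(threshold)
print(counties)
```

With these toy numbers the mean is 182, so only counties D and E end up in the high-risk class.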

Baseline Model

As a baseline, I used logistic regression with only the numerical features to predict mortality risk. It reached 70% accuracy and a 77% ROC AUC score.
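A baseline of this kind might look roughly like the sketch below. It runs on synthetic data from scikit-learn's `make_classification` rather than the county data, and the feature scaling step and the 80/20 split are my assumptions, not details from the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numerical features standing in for the socioeconomic columns.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scaling is an assumption; it usually helps logistic regression converge.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

pred = baseline.predict(X_test)               # hard class labels
proba = baseline.predict_proba(X_test)[:, 1]  # probability of the positive class

baseline_accuracy = accuracy_score(y_test, pred)
baseline_auc = roc_auc_score(y_test, proba)
print(f"accuracy={baseline_accuracy:.3f}, roc_auc={baseline_auc:.3f}")
```

Accuracy is computed from the hard labels, while ROC AUC needs the predicted probabilities, which is why both `predict` and `predict_proba` appear.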

Predictive Model

I used a tree-based XGBoost classifier to build a model that predicts cancer mortality risk. The data for this model has only 3,080 county death rate observations. To reduce the risk of overfitting on such a small data set, I used nested cross-validation: the inner loop handles hyperparameter optimization and the outer loop handles performance evaluation.
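To make the nested cross-validation idea concrete, here is a hedged sketch. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for the XGBoost classifier, synthetic data instead of the county data, and a small illustrative parameter grid. None of these settings come from the original project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data standing in for the county observations.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# Inner folds pick hyperparameters; outer folds estimate performance.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Illustrative grid only; the real search space is not given in the article.
param_grid = {"max_depth": [2, 3], "n_estimators": [25, 50]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=inner_cv,
)

# Each outer fold reruns the whole search on its training part, so the
# held-out part is never seen during hyperparameter tuning.
outer_scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
print(outer_scores, outer_scores.mean())
```

The key design choice is passing the `GridSearchCV` object itself to `cross_val_score`, so tuning happens inside every outer fold instead of once on the full data.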

This confusion matrix describes the model's performance on a set of test data. Of 309 actual high mortality risk cases, the model predicted high mortality risk 236 times and low mortality risk 73 times. Overall it reached 76.62% accuracy and an 85% ROC AUC score, which beats the 26% accuracy of a simple majority classifier.
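The two counts quoted for the high-risk row are enough to work out the recall for that class. The short calculation below uses only those two numbers; the overall accuracy also depends on the low-risk row of the matrix, which is not reproduced here.

```python
# Counts for the actual high mortality risk row of the confusion matrix.
true_positives = 236   # predicted high risk, actually high risk
false_negatives = 73   # predicted low risk, actually high risk

actual_high = true_positives + false_negatives
recall_high = true_positives / actual_high

print(f"{actual_high} actual high-risk cases, recall = {recall_high:.1%}")
# prints: 309 actual high-risk cases, recall = 76.4%
```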

How the Model Works

We have seen how well the model performs, but what exactly causes it to make a certain prediction? Taking a quick look at the model’s feature importance graph, we can see which features matter most in the model’s determination of mortality risk.

Only about ten features were predictive, with percentage of the population with only a high school degree, percentage Black population, poverty percent, and median household income being the most predictive.

The permutation importance graph shows how much the score changes when each feature's values are randomly shuffled, which effectively replaces that feature with noise.
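For readers who want to try permutation importance themselves, the sketch below uses `sklearn.inspection.permutation_importance` on synthetic data with a gradient boosting stand-in model. The data, the model settings, and the choice of ROC AUC as the score are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 3 informative features plus 3 pure-noise features.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column of the validation set 10 times and record the
# average drop in ROC AUC relative to the unshuffled score.
result = permutation_importance(model, X_val, y_val, scoring="roc_auc",
                                n_repeats=10, random_state=0)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for i in ranking:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

Computing the importances on held-out data rather than the training data keeps the ranking from rewarding features the model has merely memorized.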

Partial Dependence Plots

Let’s see how these important features are related to mortality risk.

In the lower range of poverty percent, the predicted likelihood of high cancer mortality risk is decreasing, while in counties with higher poverty percent the feature has less impact on the prediction.

Here we see that as the percentage of the population with only a high school degree increases, the predicted probability of high cancer mortality risk decreases.

In the lower range of median household income, the predicted likelihood of high cancer mortality risk is high, whereas the higher range of income has very little impact on the prediction.
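Curves like the ones described in this section can be computed by hand: fix one feature at a given value for every row, average the model's predicted probability, and repeat over a grid of values. The sketch below does this on synthetic data with a gradient boosting stand-in model. The helper `partial_dependence_curve` is a hypothetical name of my own, not a library function, and the 5th to 95th percentile grid is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data standing in for the county features.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)


def partial_dependence_curve(model, X, feature_idx, n_points=10):
    """Average predicted probability as one feature sweeps over a grid."""
    column = X[:, feature_idx]
    # Grid between the 5th and 95th percentiles to avoid sparse extremes.
    grid = np.linspace(np.percentile(column, 5), np.percentile(column, 95),
                       n_points)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # force every row to this value
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return grid, np.array(averages)


grid, curve = partial_dependence_curve(model, X, feature_idx=0)
for value, prob in zip(grid, curve):
    print(f"feature value {value:+.2f} -> mean predicted probability {prob:.3f}")
```

A flat stretch of the curve corresponds to a range where the feature has little impact on the prediction, which is how the plots above should be read.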

Conclusion

An analysis of National Institutes of Health data revealed one factor that stood out as a predictor of cancer mortality risk in a given county: poverty percent. Another interesting point is that education level plays a significant role in the model’s determination of mortality risk. Intuitively, this makes sense: on average, high-school-educated individuals aren’t able to earn as much as college-educated individuals. This, in combination with median household income, raises questions about access to care, prevention efforts, treatment, and other issues.