Exploratory Data Analysis & Machine Learning Application

Are You Suffering from Yesterday’s Dinner?

US Foodborne Outbreak Incident in 1998–2018

8 min read · Dec 21, 2020

Food is a daily necessity, and as the economy grows, people place more value on food safety. Although the government has passed many food safety acts over the years, roughly one in six people in the United States (about 48 million) still get sick from eating contaminated food every year. More than 250 pathogens and toxins are known to cause foodborne diseases, and almost all of them can cause an outbreak. This analysis project reviews the status of food safety in the US.

Photo from Yale School of Public Health (shorturl.at/beghC)

Outline

Introduction
Goal
Data Preparation
Exploratory Data Analysis
Model Training & Evaluation
US Heatmap
Conclusion

Introduction

A foodborne disease outbreak occurs when two or more people get the same illness from the same contaminated food or drink. While most foodborne illnesses are not part of a recognized outbreak, outbreaks provide important information on how germs spread, which foods cause illness, and how to prevent infection.

In this project, all the code was run in a Google Colab notebook. The source code has been uploaded to GitHub.

FoodBorne Event Dataset

Public health agencies in all 50 states, the District of Columbia, U.S. territories, and Freely Associated States have primary responsibility for identifying and investigating outbreaks, and they use a standard form to report outbreaks voluntarily to CDC. During 1998–2018, reporting was made through the electronic Foodborne Outbreak Reporting System (eFORS).

This dataset provides data on foodborne disease outbreaks reported to CDC from 1998 through 2015. Data fields include year, state (outbreaks occurring in more than one state are listed as “multistate”), the location where the food was prepared, reported food vehicle and contaminated ingredient, etiology (the pathogen, toxin, or chemical that caused the illnesses), status (whether the etiology was confirmed or suspected), total illnesses, hospitalizations, and fatalities. In many outbreak investigations, a specific food vehicle is not identified; for these outbreaks, the food vehicle variable is blank.

US Population dataset

The US population dataset comes from the US Census Bureau.

Goal

There are some questions and goals we want to address through this project:

Do foodborne disease outbreaks increase or decrease over time?
Is it possible to predict the illnesses and hospitalizations caused by foodborne disease outbreaks?
Use visualization to get a clearer picture of foodborne disease outbreaks.

Data Preparation

Importing the Required Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Displaying the Original data

# foodOutBreak is assumed to be the raw outbreak DataFrame loaded earlier in the notebook.
# Drop the columns that are not used in this analysis.
data = foodOutBreak.drop(["Serotype or Genotype",
                          "Food Vehicle",
                          "Food Contaminated Ingredient",
                          "IFSAC Category", "Water Exposure",
                          "Water Type", "Animal Type",
                          "Animal Type Specify",
                          "Water Status"], axis=1)

# Share of illnesses that led to hospitalization or death in each outbreak.
data["ratio_h"] = data["Hospitalizations"] / data["Illnesses"]
data["ratio_d"] = data["Deaths"] / data["Illnesses"]

Data Wrangling

1. Set a new category of Etiology

Multiple microorganisms are recorded in the Etiology column. We need to group microorganisms that are close to each other into one category for further exploration. The following are the new categories, and the original class is noted by #.

Then create a new column for each etiology category. If the incident involves a specific microorganism, the corresponding category's cell shows True; otherwise it shows False.
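As a minimal sketch of the step above, the flag columns could be built with a keyword match on the Etiology text. The category names and keywords below are hypothetical placeholders, not the mapping used in this project, and the column name "Etiology" is assumed to match the dataset.

```python
import pandas as pd

# Hypothetical categories and keywords, for illustration only.
CATEGORY_KEYWORDS = {
    "Salmonella": ["salmonella"],
    "E_coli": ["escherichia coli"],
    "Norovirus": ["norovirus"],
}

def add_etiology_flags(df, column="Etiology"):
    """Return a copy of df with one True/False column per category."""
    out = df.copy()
    for category, keywords in CATEGORY_KEYWORDS.items():
        # Match any keyword of the category, ignoring case; missing values count as False.
        pattern = "|".join(keywords)
        out[category] = out[column].str.contains(pattern, case=False, na=False)
    return out

# Toy rows, including an outbreak with two pathogens and one with no etiology.
sample = pd.DataFrame({"Etiology": [
    "Salmonella enterica",
    "Norovirus genogroup II",
    "Escherichia coli, Shiga toxin-producing; Salmonella enterica",
    None,
]})
flagged = add_etiology_flags(sample)
print(flagged)
```

An outbreak that lists several pathogens gets True in every matching column, which is why separate flag columns are used rather than a single category label.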

True/False values are not convenient for the later numeric analysis. Therefore, we convert them into 1/0.

def replace_boolean(data):
   # Convert True/False to 1/0 in every column of the DataFrame.
   for col in data:
      data[col] = data[col].replace({True: 1, False: 0})

replace_boolean(data)

2. Set a new category of Setting

The same applies here: multiple locations are recorded in the Setting column, so we repeat the steps used for the etiology categories above.

After finishing the data cleaning process, we move on to the exploratory data analysis.

Exploratory Data Analysis

First of all, we review the cases and illnesses by month to find out whether there is a seasonal pattern.

Months

Both cases and illnesses increase from February to May, decrease gently until September, and then start to rise again.
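A monthly summary like the one described above can be sketched with a pandas groupby. The column names "Month" and "Illnesses" and the toy rows are assumptions for illustration; the real dataset may store months differently.

```python
import pandas as pd

# Toy records; each row stands for one reported outbreak.
outbreaks = pd.DataFrame({
    "Month": ["January", "May", "May", "September", "January"],
    "Illnesses": [12, 30, 8, 5, 20],
})

month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Number of outbreaks (cases) and total illnesses per month, in calendar order.
monthly = (
    outbreaks.groupby("Month")
    .agg(cases=("Illnesses", "count"), illnesses=("Illnesses", "sum"))
    .reindex(month_order, fill_value=0)
)
print(monthly)
```

Reindexing by calendar order keeps months with no outbreaks in the table as zeros, so a line plot of the result does not skip them.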

Import US Population Data

The Incidence Rate in epidemiology is a measure of the probability of a given medical condition in a population within a specified period of time. We need population data to calculate the incidence rate.

After importing the population data, we use it to calculate the incidence rates of illnesses and hospitalizations, grouped by year and by state.

Years

The incidence rate we adopt here is per 1,000 population.
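The calculation can be sketched as follows: sum the illnesses per year and state, join the population, and scale to 1,000 people. The column names and all numbers below are made-up toy values, not figures from the CDC or Census Bureau data.

```python
import pandas as pd

# Toy outbreak records (one row per outbreak) and toy population figures.
outbreaks = pd.DataFrame({
    "Year": [2005, 2005, 2006],
    "State": ["Ohio", "Ohio", "Ohio"],
    "Illnesses": [40, 60, 250],
})
population = pd.DataFrame({
    "Year": [2005, 2006],
    "State": ["Ohio", "Ohio"],
    "Population": [10_000_000, 12_500_000],
})

def incidence_per_1000(outbreaks, population, value="Illnesses"):
    """Total `value` per year and state, divided by population, per 1,000 people."""
    totals = outbreaks.groupby(["Year", "State"], as_index=False)[value].sum()
    merged = totals.merge(population, on=["Year", "State"], how="left")
    merged["incidence_per_1000"] = merged[value] / merged["Population"] * 1000
    return merged

rates = incidence_per_1000(outbreaks, population)
print(rates)
```

Passing value="Hospitalizations" would give the hospitalization incidence rate in the same way, assuming that column exists in the outbreak table.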

If we group the data by year, the cases, illnesses, and incidence rate of illnesses show a very similar pattern.

There is a drop in 2005. The Food Allergen Labeling and Consumer Protection Act, passed in 2004, or the Sanitary Food Transportation Act, passed in 2005, may have contributed.

There is a comparatively high peak in 2006. The Dole baby spinach E. coli outbreak and the Taco Bell E. coli outbreak may be the reasons.

States

Since different states have different populations, incidence rates give a clearer and more accurate view of the situation.

North Dakota has the highest incidence rate of illnesses, followed by Wyoming, Oregon, and Washington, DC.

Hawaii and Alaska have the highest incidence rates of hospitalization, followed by Wyoming, Minnesota, and Wisconsin.

Although multistate outbreaks have the highest death rate and the second-highest hospitalization rate, the values drop significantly once population is taken into account.

Top 5 Illnesses States

California, Multistate, Illinois, Ohio, and Florida are the top 5 by number of illnesses.

Let’s review them separately.

California

Multistate

Illinois

Ohio

Florida

Model Training & Evaluation

Correlation

Before building the model, we need to review the correlation between columns. The correlation matrix of the original data shows only weak correlation between columns. Therefore, we group the data by state to eliminate the effect of population. After grouping, the correlation between columns increases considerably, which gives a better angle from which to review the relationships in the data.

# Aggregate per state, then plot the correlation matrix of the aggregated columns.
data_state = data.groupby('State').sum()
plt.figure(figsize=(25,25))
sns.heatmap(data_state[data_state.columns[2:]].corr(), annot=True, square=True)

Import Library

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans

Prediction of Illnesses

# etiology and setting are the lists of flag columns created during data wrangling.
features = etiology + setting
X = data[features]
y = data['Illnesses']
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

Model Evaluation

svr = svm.SVR()
svr.fit(train_X, train_y)
y_pred_svr = svr.predict(val_X)

# Distribution of prediction errors on the validation set.
errors_svr = val_y - y_pred_svr
errors_svr.hist(bins=100)

print('MAE= ', metrics.mean_absolute_error(val_y, y_pred_svr))
print('MSE= ', metrics.mean_squared_error(val_y, y_pred_svr))
print('RMSE= ', np.sqrt(metrics.mean_squared_error(val_y, y_pred_svr)))
print('EVS= ', metrics.explained_variance_score(val_y, y_pred_svr))
# MSLE is only defined when targets and predictions are non-negative.
print('MSLE= ', metrics.mean_squared_log_error(val_y, y_pred_svr))
print('R^2= ', metrics.r2_score(val_y, y_pred_svr))

Prediction of Hospitalizations

Judging from the error plots, the prediction errors are quite concentrated, yet the evaluation metrics of the model are not ideal; outliers may be the cause.
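The hospitalization model can be evaluated with the same train/validation split and metrics as the illness model. The sketch below is not the notebook's code: it uses synthetic 0/1 features and synthetic counts as stand-ins, and the `evaluate` helper and the feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn import metrics, svm
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the 1/0 etiology and setting columns.
X = pd.DataFrame(
    rng.integers(0, 2, size=(200, 4)),
    columns=["etiology_a", "etiology_b", "setting_a", "setting_b"],
)
# Synthetic hospitalization counts that depend loosely on one feature.
y = pd.Series(rng.poisson(2 + 3 * X["etiology_a"].to_numpy()), name="Hospitalizations")

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

def evaluate(model, train_X, train_y, val_X, val_y):
    """Fit the model and return regression metrics on the validation set."""
    model.fit(train_X, train_y)
    pred = model.predict(val_X)
    mse = metrics.mean_squared_error(val_y, pred)
    return {
        "MAE": metrics.mean_absolute_error(val_y, pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": metrics.r2_score(val_y, pred),
    }

scores = evaluate(svm.SVR(), train_X, train_y, val_X, val_y)
print(scores)
```

Because RMSE penalizes large errors more heavily than MAE, a wide gap between the two is one sign that a few outlying outbreaks dominate the error.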

US Heatmap

Alaska stands out for its average hospitalizations but has a considerably low death rate. Roughly speaking, this suggests that the medical system in Alaska performs well compared with other states.

Illnesses and hospitalizations in Utah and New Mexico are not high; however, their death rates are significantly high.

Conclusion

Generally speaking, the frequency of foodborne outbreak events has decreased over time, likely thanks to more integrated government regulation.

The prediction models for illnesses and hospitalizations are not ideal. Black swan events, meaning rare and extreme outbreaks, may cause the inaccuracy.

The purpose of a model that predicts illnesses and hospitalizations is to help governments and hospitals plan better once an incident occurs. With information on location and etiology, a hospital may be able to estimate the capacity it needs more accurately.

Combining weather data may make it possible to predict the probability of an incident caused by a specific etiology.

Sanitation where food is processed, the supply chain, weather, and other factors all contribute to foodborne illness. Therefore, comprehensive regulation and sound implementation, which the government has been working on, can improve the situation.