Exploratory Data Analysis on Behavioral Risk Factor Surveillance System (BRFSS) dataset

Published

May 23, 2021

Key ideas:

Before beginning with any sort of model building work it is important you get to know the data you’re handling. This includes not just understand the columns and row entries in the dataset but also the simple understanding on what each of the variable in the data means.

This step is a vital foundation for any successful analytics, modeling, and prediction task. It is called as the data exploratory analysis.

These are few some steps you can keep in mind when starting with an exploration of new dataset being presented: 1. Understand each descriptor in the data, type of data being encoded 2. Cleaning the data – look at null and NaN 3. Visualize variable distributions – use (appropriate) summary statistics * Primary analysis of data spread – histograms * Distribution functions 1. Probability mass functions 2. Cumulative distribution function 3. Kernel density estimates

1. Explore relationship between variables
• Scatter plots
• Simple (linear) correlations (Pearson statistics)
• Simple (linear) regression
2. Explore multivariate relationships
• Multiple regression (for continuous variables)
• Logistic regression (for categorical variables)

In this notebook I will cover some of these steps as we explore the Behavioral Risk Factor Surveillance Survey (BRFSS) dataset and try to tease out simple correlations with the variables.

BRFSS data obtained from: * https://www.kaggle.com/sakinak/behavioral-risk-factor-surveillance-survey-201619

Code book: This is an extremely important document helping us make sense of the data we’ve imported * https://www.cdc.gov/brfss/annual_data/2019/pdf/codebook19_llcp-v2-508.HTML

``````import os
import numpy as np
import copy
import pandas as pd``````
``````#----- PLOTTING PARAMS ----#
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
import seaborn as sns

# High DPI rendering for mac
%config InlineBackend.figure_format = 'retina'

# Plot matplotlib plots with white background:
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}

plot_params = {
'font.size' : 15,
'axes.titlesize' : 20,
'axes.labelsize' : 15,
'axes.labelweight' : 'bold',
'xtick.labelsize' : 10,
'ytick.labelsize' : 10,
}

plt.rcParams.update(plot_params)
sns.set_palette("colorblind")
sns.color_palette('colorblind')``````
``````# Functions for PMF and CDF, we will come to those later in the notebook
def pmf(pandas_series):
values, counts = np.unique(pandas_series, return_counts = True)
pmf = np.c_[ values, counts / sum(counts) ]
return pmf

def cdf(pandas_series):
values, counts = np.unique(pandas_series, return_counts = True)
pmf = np.c_[ values, counts / sum(counts) ]
cdf = np.zeros(shape=pmf.shape)

for i in range(0, pmf.shape[0]):
cdf[i] = [pmf[i][0], np.sum(pmf[:i+1], axis=0)[-1]]

return cdf ``````
``````# Read archive file -- considering only the 2019 dataset
``df.columns``
``````Index(['Unnamed: 0', '_STATE', 'FMONTH', 'IDATE', 'IMONTH', 'IDAY', 'IYEAR',
'DISPCODE', 'SEQNO', '_PSU',
...
'_VEGESU1', '_FRTLT1A', '_VEGLT1A', '_FRT16A', '_VEG23A', '_FRUITE1',
'_VEGETE1', '_FLSHOT7', '_PNEUMO3', '_AIDTST4'],
dtype='object', length=343)``````
``df.shape``
``(418268, 343)``
``df.info()``
``````<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418268 entries, 0 to 418267
Columns: 343 entries, Unnamed: 0 to _AIDTST4
dtypes: float64(280), int64(63)
memory usage: 1.1 GB``````
``df.columns[ df.isna().any() ]``
``````Index(['CTELENM1', 'PVTRESD1', 'COLGHOUS', 'STATERE1', 'CELPHONE', 'LADULT1',
...
'FRUTDA2_', 'GRENDA1_', 'FRNCHDA_', 'POTADA1_', 'VEGEDA2_', '_FRUTSU1',
'_VEGESU1', '_FLSHOT7', '_PNEUMO3', '_AIDTST4'],
dtype='object', length=275)``````

The data is too big to be visualized and described using traditional `NaN` and summary statistics, rather let’s just checkout key columns and understand their distribution

1. Looking at individual columns

Sex of the respondent

Data is provided in the column marked as `_SEX` and the label are: * MALE = 1 * FEMALE = 2

``df['_SEX'].value_counts()``
``````2    228433
1    189835
Name: _SEX, dtype: int64``````

Annual Income of the respondent

`INCOME2` Question: Is your annual household income from all sources

Value of `77` or `99` in the row is either Refused or Not sure. So take them out

`_INCOMG` Computed income categories

``df['INCOME2'].value_counts()``
``````8.0     117793
7.0      54252
6.0      46572
99.0     40246
5.0      34496
77.0     32654
4.0      30001
3.0      23391
2.0      16122
1.0      15860
Name: INCOME2, dtype: int64``````
``df['_INCOMG'].value_counts()``
``````5    172045
9     79781
2     53392
4     46572
3     34496
1     31982
Name: _INCOMG, dtype: int64``````
``````# Replace 77 99 with NaN
df['INCOME2'].replace([77, 99], np.nan, inplace=True)``````
``df['INCOME2'].value_counts()``
``````8.0    117793
7.0     54252
6.0     46572
5.0     34496
4.0     30001
3.0     23391
2.0     16122
1.0     15860
Name: INCOME2, dtype: int64``````
``df.shape``
``(418268, 343)``

Drop the entries with NaN

``````df_income_no_nan = df.dropna(subset=['INCOME2'])
df_income_no_nan.shape``````
``(338487, 343)``
``df_income_no_nan['INCOME2'].value_counts()``
``````8.0    117793
7.0     54252
6.0     46572
5.0     34496
4.0     30001
3.0     23391
2.0     16122
1.0     15860
Name: INCOME2, dtype: int64``````
``pmf_income = pmf(df_income_no_nan['INCOME2'])``
``pmf_income``
``````array([[1.        , 0.04685557],
[2.        , 0.0476296 ],
[3.        , 0.06910457],
[4.        , 0.08863265],
[5.        , 0.10191233],
[6.        , 0.13758874],
[7.        , 0.16027794],
[8.        , 0.34799859]])``````
``````plt.plot(pmf_income[:,0], pmf_income[:,1], marker='o')
plt.xlabel('Income Classes')
plt.ylabel('Frequency (normalized)')``````
``Text(0, 0.5, 'Frequency (normalized)')``

``cdf_income = cdf(df_income_no_nan['INCOME2'])``
``cdf_income``
``````array([[1.        , 0.04685557],
[2.        , 0.09448516],
[3.        , 0.16358974],
[4.        , 0.25222239],
[5.        , 0.35413472],
[6.        , 0.49172346],
[7.        , 0.65200141],
[8.        , 1.        ]])``````
``````plt.plot(cdf_income[:,0], cdf_income[:,1], marker='o')
plt.xlabel('Income Classes')
plt.ylabel('Cumulative Frequency (normalized)')``````
``Text(0, 0.5, 'Cumulative Frequency (normalized)')``

Looking at annual income classes as per sex of the respondent

``df_income_no_nan_male = df_income_no_nan.loc[ df_income_no_nan['_SEX'] == 1 ]``
``df_income_no_nan_female = df_income_no_nan.loc[ df_income_no_nan['_SEX'] == 2 ]``
``````cdf_income_male = cdf(df_income_no_nan_male['INCOME2'])
cdf_income_female = cdf(df_income_no_nan_female['INCOME2'])``````
``````plt.plot(cdf_income_male[:,0], cdf_income_male[:,1], marker='o', label='Male')
plt.plot(cdf_income_female[:,0], cdf_income_female[:,1], marker='o', label='Female')
plt.plot(cdf_income[:,0], cdf_income[:,1], linestyle='--', label='Total')
plt.xlabel('Income Classes')
plt.ylabel('Cumulative Frequency (normalized)')
plt.legend()``````
``<matplotlib.legend.Legend at 0x7fb1a8eb5520>``

Considering the education level of the respondents

`EDUCA` What is the highest grade or year of school you completed?

9 and BLANK == Missing and Refused

`_EDUCAG` Computed level of education completed categories

``df['EDUCA'].value_counts().sort_index()``
``````1.0       619
2.0      9940
3.0     19506
4.0    111890
5.0    116591
6.0    157887
9.0      1809
Name: EDUCA, dtype: int64``````
``df['EDUCA'].replace([9], np.nan, inplace=True)``
``df['_EDUCAG'].value_counts()``
``````4    157887
3    116591
2    111890
1     30065
9      1835
Name: _EDUCAG, dtype: int64``````
``df['_EDUCAG'].value_counts(normalize=True).sort_index()``
``````1    0.071880
2    0.267508
3    0.278747
4    0.377478
9    0.004387
Name: _EDUCAG, dtype: float64``````
``df_education_no_na = df.dropna(subset = ['EDUCA'])``
``df_education_no_na.shape``
``(416433, 343)``
``````df_education_no_na_male = df_education_no_na.loc[ df_education_no_na['_SEX'] == 1 ]
df_education_no_na_female = df_education_no_na.loc[ df_education_no_na['_SEX'] == 2 ]``````
``````cdf_educa_male = cdf(df_education_no_na_male['EDUCA'])
cdf_educa_female = cdf(df_education_no_na_female['EDUCA'])``````
``cdf_educa = cdf(df_education_no_na['EDUCA'])``
``cdf_educa``
``````array([[1.00000000e+00, 1.48643359e-03],
[2.00000000e+00, 2.53558195e-02],
[3.00000000e+00, 7.21964878e-02],
[4.00000000e+00, 3.40883167e-01],
[5.00000000e+00, 6.20858577e-01],
[6.00000000e+00, 1.00000000e+00]])``````
``````plt.plot(cdf_educa_male[:,0], cdf_educa_male[:,1], marker='o', label='Male')
plt.plot(cdf_educa_female[:,0], cdf_educa_female[:,1], marker='o', label='Female')
plt.plot(cdf_educa[:,0], cdf_educa[:,1], linestyle='--', label='Total')
plt.xlabel('Level of education')
plt.ylabel('Cumulative Frequency (normalized)')
plt.legend()``````
``<matplotlib.legend.Legend at 0x7fb0dcfd2160>``

Height and Weight

``````## Height in meters
df.hist('HTM4')``````
``array([[<AxesSubplot:title={'center':'HTM4'}>]], dtype=object)``

``````## Weight in kilograms:
df.hist('WTKG3')``````
``array([[<AxesSubplot:title={'center':'WTKG3'}>]], dtype=object)``

``df['WTKG3'] = df['WTKG3'] / 100``
``````## Weight in kilograms:
df.hist('WTKG3')``````
``array([[<AxesSubplot:title={'center':'WTKG3'}>]], dtype=object)``

``````wt_male = df.loc[ df['_SEX'] == 1 ]['WTKG3']
wt_female = df.loc[ df['_SEX'] == 2 ]['WTKG3']``````
``````plt.hist(wt_male, alpha=0.6, label='male')
plt.hist(wt_female, alpha=0.6, label='female')
plt.xlabel('Weight (kg)')
plt.legend()``````
``<matplotlib.legend.Legend at 0x7fafcf2eabe0>``

``````## Age category -- divided in 5 year interval
df.hist('_AGEG5YR')``````
``array([[<AxesSubplot:title={'center':'_AGEG5YR'}>]], dtype=object)``

Quick plot for weight vs age-group

``````## Quick plot for weight as per age
df_no_nans = df.dropna( subset=['_AGEG5YR', 'WTKG3'] )``````
``````fig, ax = plt.subplots(1,1, figsize=(6,6))
ax.scatter( df_no_nans['_AGEG5YR'], df_no_nans['WTKG3'], marker='o', alpha=0.5)
ax.set_xlabel('Age group')
ax.set_ylabel('Weight (kg)')``````
``Text(0, 0.5, 'Weight (kg)')``

It makes much more sense to show these as either violin or box plot:

• Violin plot – KDE for that column around the y Each column is a graphical representation of the distribution of weight in one age group. The width of these shapes is proportional to the estimated density, so it’s like two vertical PDFs plotted back to back.
• Box plot – Each box represents the interquartile range, or IQR, from the 25th to the 75th percentile. The line in the middle of each box is the median. The spines sticking out of the top and bottom show the minimum and maximum values. Looking at the medians, it seems like people in their 40s are the heaviest; younger and older people are lighter. Looking at the sizes of the boxes, it seems like people in their 40s have the most variability in weight, too. These plots also show how skewed the distribution of weight is; that is, the heaviest people are much farther from the median than the lightest people.
``````import seaborn as sns
ax = sns.boxplot(x = '_AGEG5YR', y = 'WTKG3', whis=10, data = df_no_nans)
ax.set_xlabel('AGE GROUP')
ax.set_ylabel('WEIGHT (Kgs)')``````
``Text(0, 0.5, 'WEIGHT (Kgs)')``

Quick plot for height vs weight

``df_height_wt = df.dropna( subset=['HTM4', 'WTKG3'] ).sample(50000)``
``````fig, ax = plt.subplots(1,1, figsize=(6,6))
ax.scatter( df_height_wt['HTM4'], df_height_wt['WTKG3'], marker='o', alpha=0.5)
ax.set_xlabel('Height (m)')
ax.set_ylabel('Weight (kg)')``````
``Text(0, 0.5, 'Weight (kg)')``

``````height_jitters = df_height_wt['HTM4'] + np.random.normal(0, 2, size=len(df_height_wt))
weight_jitters = df_height_wt['WTKG3'] + np.random.normal(0, 2, size=len(df_height_wt))``````
``````fig, ax = plt.subplots(1,1, figsize=(6,6))
ax.scatter( height_jitters, weight_jitters, marker='o', alpha=0.01)
ax.set_xlabel('Height (m)')
ax.set_ylabel('Weight (kg)')``````
``Text(0, 0.5, 'Weight (kg)')``

Fit a linear regression model

``````from sklearn.linear_model import LinearRegression
x = height_jitters.values.reshape(-1,1)
y = weight_jitters.values.reshape(-1,1)
reg = LinearRegression().fit(x, y)``````
``x.min()``
``90.14298210689765``
``print(reg.coef_, reg.intercept_, reg.score(x,y))``
``[[0.90102916]] [-71.05446729] 0.2163310658640062``
``height_test = np.linspace(x.min(), x.max()).reshape(-1, 1)``
``````fig, ax = plt.subplots(1,1, figsize=(6,6))
ax.plot( height_test, reg.predict(height_test), 'r--', alpha=0.5)
ax.scatter( height_jitters, weight_jitters, marker='o', alpha=0.01)
ax.set_xlabel('Height (m)')
ax.set_ylabel('Weight (kg)')``````
``Text(0, 0.5, 'Weight (kg)')``