Key ideas:
Before beginning with any sort of model building work it is important you get to know the data you're handling. This includes not just understand the columns and row entries in the dataset but also the simple understanding on what each of the variable in the data means.
This step is a vital foundation for any successful analytics, modeling, and prediction task. It is called as the data exploratory analysis.
These are few some steps you can keep in mind when starting with an exploration of new dataset being presented:
- Understand each descriptor in the data, type of data being encoded
- Cleaning the data -- look at null and NaN
-
Visualize variable distributions -- use (appropriate) summary statistics
- Primary analysis of data spread -- histograms
- Distribution functions
- Probability mass functions
- Cumulative distribution function
- Kernel density estimates
-
Explore relationship between variables
- Scatter plots
- Simple (linear) correlations (Pearson statistics)
- Simple (linear) regression
-
Explore multivariate relationships
- Multiple regression (for continuous variables)
- Logistic regression (for categorical variables)
In this notebook I will cover some of these steps as we explore the Behavioral Risk Factor Surveillance Survey (BRFSS) dataset and try to tease out simple correlations with the variables.
BRFSS data obtained from:
Code book: This is an extremely important document helping us make sense of the data we've imported
import os
import numpy as np
import copy
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
import seaborn as sns
# High DPI rendering for mac
%config InlineBackend.figure_format = 'retina'
# Plot matplotlib plots with white background:
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
plot_params = {
'font.size' : 15,
'axes.titlesize' : 20,
'axes.labelsize' : 15,
'axes.labelweight' : 'bold',
'xtick.labelsize' : 10,
'ytick.labelsize' : 10,
}
plt.rcParams.update(plot_params)
sns.set_palette("colorblind")
sns.color_palette('colorblind')
def pmf(pandas_series):
values, counts = np.unique(pandas_series, return_counts = True)
pmf = np.c_[ values, counts / sum(counts) ]
return pmf
def cdf(pandas_series):
values, counts = np.unique(pandas_series, return_counts = True)
pmf = np.c_[ values, counts / sum(counts) ]
cdf = np.zeros(shape=pmf.shape)
for i in range(0, pmf.shape[0]):
cdf[i] = [pmf[i][0], np.sum(pmf[:i+1], axis=0)[-1]]
return cdf
df = pd.read_csv('./archive/2019.csv')
df.columns
df.shape
df.info()
df.columns[ df.isna().any() ]
The data is too big to be visualized and described using traditional NaN
and summary statistics, rather let's just checkout key columns and understand their distribution
df['_SEX'].value_counts()
Annual Income of the respondent
INCOME2
Question: Is your annual household income from all sources
Value of 77
or 99
in the row is either Refused or Not sure. So take them out
_INCOMG
Computed income categories
df['INCOME2'].value_counts()
df['_INCOMG'].value_counts()
df['INCOME2'].replace([77, 99], np.nan, inplace=True)
df['INCOME2'].value_counts()
df.shape
Drop the entries with NaN
df_income_no_nan = df.dropna(subset=['INCOME2'])
df_income_no_nan.shape
df_income_no_nan['INCOME2'].value_counts()
pmf_income = pmf(df_income_no_nan['INCOME2'])
pmf_income
plt.plot(pmf_income[:,0], pmf_income[:,1], marker='o')
plt.xlabel('Income Classes')
plt.ylabel('Frequency (normalized)')
cdf_income = cdf(df_income_no_nan['INCOME2'])
cdf_income
plt.plot(cdf_income[:,0], cdf_income[:,1], marker='o')
plt.xlabel('Income Classes')
plt.ylabel('Cumulative Frequency (normalized)')
Looking at annual income classes as per sex of the respondent
df_income_no_nan_male = df_income_no_nan.loc[ df_income_no_nan['_SEX'] == 1 ]
df_income_no_nan_female = df_income_no_nan.loc[ df_income_no_nan['_SEX'] == 2 ]
cdf_income_male = cdf(df_income_no_nan_male['INCOME2'])
cdf_income_female = cdf(df_income_no_nan_female['INCOME2'])
plt.plot(cdf_income_male[:,0], cdf_income_male[:,1], marker='o', label='Male')
plt.plot(cdf_income_female[:,0], cdf_income_female[:,1], marker='o', label='Female')
plt.plot(cdf_income[:,0], cdf_income[:,1], linestyle='--', label='Total')
plt.xlabel('Income Classes')
plt.ylabel('Cumulative Frequency (normalized)')
plt.legend()
Considering the education level of the respondents
EDUCA
What is the highest grade or year of school you completed?
9 and BLANK == Missing and Refused
_EDUCAG
Computed level of education completed categories
df['EDUCA'].value_counts().sort_index()
df['EDUCA'].replace([9], np.nan, inplace=True)
df['_EDUCAG'].value_counts()
df['_EDUCAG'].value_counts(normalize=True).sort_index()
df_education_no_na = df.dropna(subset = ['EDUCA'])
df_education_no_na.shape
df_education_no_na_male = df_education_no_na.loc[ df_education_no_na['_SEX'] == 1 ]
df_education_no_na_female = df_education_no_na.loc[ df_education_no_na['_SEX'] == 2 ]
cdf_educa_male = cdf(df_education_no_na_male['EDUCA'])
cdf_educa_female = cdf(df_education_no_na_female['EDUCA'])
cdf_educa = cdf(df_education_no_na['EDUCA'])
cdf_educa
plt.plot(cdf_educa_male[:,0], cdf_educa_male[:,1], marker='o', label='Male')
plt.plot(cdf_educa_female[:,0], cdf_educa_female[:,1], marker='o', label='Female')
plt.plot(cdf_educa[:,0], cdf_educa[:,1], linestyle='--', label='Total')
plt.xlabel('Level of education')
plt.ylabel('Cumulative Frequency (normalized)')
plt.legend()
Height and Weight
df.hist('HTM4')
df.hist('WTKG3')
df['WTKG3'] = df['WTKG3'] / 100
df.hist('WTKG3')
wt_male = df.loc[ df['_SEX'] == 1 ]['WTKG3']
wt_female = df.loc[ df['_SEX'] == 2 ]['WTKG3']
plt.hist(wt_male, alpha=0.6, label='male')
plt.hist(wt_female, alpha=0.6, label='female')
plt.xlabel('Weight (kg)')
plt.legend()
df.hist('_AGEG5YR')
df_no_nans = df.dropna( subset=['_AGEG5YR', 'WTKG3'] )
fig, ax = plt.subplots(1,1, figsize=(6,6))
ax.scatter( df_no_nans['_AGEG5YR'], df_no_nans['WTKG3'], marker='o', alpha=0.5)
ax.set_xlabel('Age group')
ax.set_ylabel('Weight (kg)')
It makes much more sense to show these as either violin or box plot:
- Violin plot -- KDE for that column around the y Each column is a graphical representation of the distribution of weight in one age group. The width of these shapes is proportional to the estimated density, so it's like two vertical PDFs plotted back to back.
- Box plot -- Each box represents the interquartile range, or IQR, from the 25th to the 75th percentile. The line in the middle of each box is the median. The spines sticking out of the top and bottom show the minimum and maximum values. Looking at the medians, it seems like people in their 40s are the heaviest; younger and older people are lighter. Looking at the sizes of the boxes, it seems like people in their 40s have the most variability in weight, too. These plots also show how skewed the distribution of weight is; that is, the heaviest people are much farther from the median than the lightest people.
import seaborn as sns
ax = sns.boxplot(x = '_AGEG5YR', y = 'WTKG3', whis=10, data = df_no_nans)
ax.set_xlabel('AGE GROUP')
ax.set_ylabel('WEIGHT (Kgs)')
df_height_wt = df.dropna( subset=['HTM4', 'WTKG3'] ).sample(50000)
fig, ax = plt.subplots(1,1, figsize=(6,6))
ax.scatter( df_height_wt['HTM4'], df_height_wt['WTKG3'], marker='o', alpha=0.5)
ax.set_xlabel('Height (m)')
ax.set_ylabel('Weight (kg)')
height_jitters = df_height_wt['HTM4'] + np.random.normal(0, 2, size=len(df_height_wt))
weight_jitters = df_height_wt['WTKG3'] + np.random.normal(0, 2, size=len(df_height_wt))
fig, ax = plt.subplots(1,1, figsize=(6,6))
ax.scatter( height_jitters, weight_jitters, marker='o', alpha=0.01)
ax.set_xlabel('Height (m)')
ax.set_ylabel('Weight (kg)')
Fit a linear regression model
from sklearn.linear_model import LinearRegression
x = height_jitters.values.reshape(-1,1)
y = weight_jitters.values.reshape(-1,1)
reg = LinearRegression().fit(x, y)
x.min()
print(reg.coef_, reg.intercept_, reg.score(x,y))
height_test = np.linspace(x.min(), x.max()).reshape(-1, 1)
fig, ax = plt.subplots(1,1, figsize=(6,6))
ax.plot( height_test, reg.predict(height_test), 'r--', alpha=0.5)
ax.scatter( height_jitters, weight_jitters, marker='o', alpha=0.01)
ax.set_xlabel('Height (m)')
ax.set_ylabel('Weight (kg)')