Model building checklist
Things to consider when building an ML model
Exploratory Data Analysis (EDA)
 Learn about your data
 Type of the data being analyzed
 Is the dataset / sensor reading measuring what you think it is measuring?
 Distribution of the data / target
 Outliers in the data
 Single variables
 Type – continuous, discrete, categorical
 Distribution in the data
 Scale of each variable
 What is the resolution I am interested in?
 Outliers in the variables
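The single-variable checks above can be sketched with pandas: inspect each variable's type, scale, and distribution, then flag outliers with a simple IQR rule. The column names here are hypothetical stand-ins for your own data.

```python
import numpy as np
import pandas as pd

# Toy data frame standing in for your dataset (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "temperature": rng.normal(20.0, 5.0, 500),  # continuous
    "error_code": rng.integers(0, 4, 500),      # discrete / categorical-like
})
df.loc[0, "temperature"] = 95.0  # inject an obvious outlier

# Per-variable type, scale, and distribution summary.
print(df.dtypes)
print(df.describe())

# Simple IQR rule to flag outliers in a continuous variable.
q1, q3 = df["temperature"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["temperature"] < q1 - 1.5 * iqr) |
              (df["temperature"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} outlier rows flagged")
```

The IQR rule is only a first pass; whether a flagged point is noise or signal is a domain-knowledge question.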
 Feature correlations
 Start simple – linear correlation
 Use domain knowledge and see if they make sense
 Look at subset of the data to make it tractable / subsampling
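A minimal sketch of the correlation step: subsample first to keep things tractable, compute a Pearson correlation matrix, then sanity-check the strong pairs against domain knowledge. The column names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
df = pd.DataFrame({
    "pressure": x,
    "flow": 0.8 * x + rng.normal(scale=0.3, size=n),  # related to pressure
    "noise": rng.normal(size=n),                      # unrelated
})

# Subsample first if the full dataset is too large to be tractable.
sample = df.sample(n=500, random_state=0)

# Start simple: linear (Pearson) correlation matrix.
corr = sample.corr()
print(corr.round(2))
# Sanity-check against domain knowledge: pressure and flow should be
# strongly related, while "noise" should sit near zero.
```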
 Selection and feature engineering
 Make new (better?) features by combining the original features
 Recast, resample, forward difference, simple arithmetic operations
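The recast / resample / forward-difference / arithmetic operations above can be sketched in a few lines of pandas; the sensor columns here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hourly sensor readings (hypothetical); engineer new candidate features.
idx = pd.date_range("2024-01-01", periods=48, freq="h")
df = pd.DataFrame({"power": np.linspace(100.0, 194.0, 48),
                   "voltage": np.full(48, 230.0)}, index=idx)

# Recast / resample: hourly readings -> daily means.
daily = df.resample("D").mean()

# Forward difference: change in power from one reading to the next.
df["power_diff"] = df["power"].diff()

# Simple arithmetic combination: current = power / voltage.
df["current"] = df["power"] / df["voltage"]
print(df.head())
```

Whether any engineered feature is actually "better" is an empirical question answered by the validation scores later in the checklist.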
Get your feet wet
 Split into train and test – look at target proportion statistics
 Split train into CV or train / validation
 Train models on the training data:
  Linear models
  Non-linear models
  Ensemble models
  Decision tree models
 Model hyperparameters
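The steps above can be sketched with scikit-learn on a synthetic dataset: a stratified split (so target proportions match across train and test), then cross-validation of a few model families on the training data only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Stratified split keeps target proportions comparable between sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print("train positives:", y_tr.mean(), "test positives:", y_te.mean())

# Cross-validate several model families on the training data only.
models = {
    "linear":   LogisticRegression(max_iter=1000),
    "tree":     DecisionTreeClassifier(random_state=0),
    "ensemble": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X_tr, y_tr, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The held-out test set stays untouched until the very end; model selection and hyperparameter tuning happen inside the cross-validation loop.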
Understand the model predictions and hyperparameters
 Train on full training dataset
 Avenues of data bleed
 Split quality – is the train/validation data representative of the test data / real-life data?
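One common avenue of data bleed is rows that share a grouping key (the `patient_id` column here is a hypothetical example) landing on both sides of a split. A group-aware splitter closes that avenue:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Rows that share a patient_id (hypothetical grouping key) must not be
# split across train and validation, or near-duplicates leak across.
rng = np.random.default_rng(0)
patient_id = np.repeat(np.arange(50), 4)  # 4 rows per patient
X = rng.normal(size=(200, 3))

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(splitter.split(X, groups=patient_id))

# Verify no group appears on both sides of the split.
overlap = set(patient_id[train_idx]) & set(patient_id[val_idx])
print("leaked groups:", len(overlap))
```

Temporal data has the analogous failure mode: validating on timestamps earlier than some training rows lets the model peek at the future.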
Points to think about when reviewing an ML method:

What is the source of the data (database, publication, direct experiment)?

How many data points are in the training, validation and test sets?

How were the sets split? Is any bias being introduced based on the type of split?

Are the data, including the data splits used, released in a public forum?

How were the data encoded and preprocessed for the ML algorithm?

How many parameters (p) are used in the model?

How many features (f) are used as input?

Is p much larger than the number of training points and/or is f large?

Which overfitting prevention techniques were used?

Are the hyperparameter configurations, optimization schedule, model files and optimization parameters reported?

Is the model a black box or interpretable?

Is the model performing classification or regression?

How much time does a single representative prediction require on a standard machine?

Is the source code released?

How was the method evaluated?

Which performance metrics are reported?

Was a comparison to publicly available methods performed on benchmark datasets?

Do the performance metrics have confidence intervals?
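When a paper reports a point metric without uncertainty, a bootstrap over the evaluation set is one common way to attach a confidence interval. A minimal sketch on synthetic predictions (any metric can stand in for accuracy):

```python
import numpy as np

# Bootstrap a 95% confidence interval for accuracy (synthetic example).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 300)
y_pred = np.where(rng.random(300) < 0.85, y_true, 1 - y_true)  # ~85% correct

accs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))  # resample w/ replacement
    accs.append((y_true[idx] == y_pred[idx]).mean())
lo, hi = np.percentile(accs, [2.5, 97.5])
point = (y_true == y_pred).mean()
print(f"accuracy {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```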

Are the raw evaluation files available?