Things to consider when building an ML model
Exploratory Data Analysis (EDA)
- Learn about your data
- Type of the data being analyzed
- Is the dataset (e.g. a sensor reading) measuring what you think it is measuring?
- Distribution of the data / target
- Outliers in the data
- Single variables
- Type – continuous, discrete, categorical
- Distribution in the data
- Scale of each variable
- What is the resolution I am interested in?
- Outliers in the variables
- Feature correlations
- Start simple – linear correlation
- Use domain knowledge and see if they make sense
- Look at subset of the data to make it tractable / subsampling
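The single-variable and correlation checks above can be sketched with pandas; the dataset and column names here are purely illustrative stand-ins for your real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical toy dataset standing in for the real data under analysis.
df = pd.DataFrame({
    "temp": rng.normal(20, 5, 200),       # continuous variable
    "pressure": rng.normal(1.0, 0.1, 200),
})
df["target"] = 2 * df["temp"] + rng.normal(0, 1, 200)

summary = df.describe()              # per-variable scale, spread, range
corr = df.corr(numeric_only=True)    # start simple: linear (Pearson) correlation

# Flag outliers with a simple z-score cut (3 standard deviations)
z = (df - df.mean()) / df.std()
outliers = (z.abs() > 3).any(axis=1)
```

Checking `corr` against domain knowledge (does the sign and strength make sense?) is often more informative than the numbers alone.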
- Selection and feature engineering
- Make new (better?) features by combining the original features
- Recast, resample, forward difference, simple arithmetic operations
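A minimal sketch of the resample / forward-difference / arithmetic operations listed above, assuming a hypothetical hourly sensor series (the column names are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly sensor readings.
idx = pd.date_range("2024-01-01", periods=48, freq="h")
df = pd.DataFrame({"flow": np.linspace(10, 20, 48),
                   "area": np.full(48, 2.0)}, index=idx)

# Forward difference: change since the previous reading.
df["flow_diff"] = df["flow"].diff()

# Resample: recast hourly readings to a daily mean.
daily = df["flow"].resample("D").mean()

# Simple arithmetic combination of original features.
df["velocity"] = df["flow"] / df["area"]
```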
Get your feet wet
- Split into train and test – look at target proportion statistics
- Split train into CV or train / validation
- Train models on the training data:
  - Linear models
  - Non-linear models
  - Ensemble models
  - Decision models
- Model hyperparameters
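The split-then-cross-validate workflow above can be sketched with scikit-learn on a synthetic dataset (the dataset and model choice here are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Hold out a test set; stratify keeps target proportions comparable across splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Cross-validate on the training portion only; the test set stays untouched.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_tr, y_tr, cv=5)
```

Stratifying the split is one easy way to "look at target proportion statistics" by construction rather than after the fact.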
Understand the model predictions, hyperparameters
- Train on full training dataset
- Avenues of data bleed (leakage)
- Split quality - is the train/validation data representative of test data / real-life data?
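One quick check of split quality is a two-sample test on a feature's distribution across train and validation; this sketch uses a Kolmogorov–Smirnov test on synthetic, deliberately shifted data (the feature and threshold are illustrative assumptions):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Hypothetical feature values from the two splits; the validation
# split is deliberately shifted to simulate a non-representative split.
feat_train = rng.normal(0.0, 1.0, 500)
feat_valid = rng.normal(0.5, 1.0, 200)

# Two-sample KS test: a small p-value flags a distribution mismatch.
stat, pvalue = ks_2samp(feat_train, feat_valid)
mismatch = pvalue < 0.05
```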
Points to think about when reviewing the ML method:
- What is the source of the data (database, publication, direct experiment)?
- How many data points are in the training, validation and test sets?
- How were the sets split? Is any bias being introduced based on the type of split?
- Are the data, including the data splits used, released in a public forum?
- How were the data encoded and preprocessed for the ML algorithm?
- How many parameters (p) are used in the model?
- How many features (f) are used as input?
- Is p much larger than the number of training points and/or is f large?
- Which overfitting prevention techniques were used?
- Are the hyperparameter configurations, optimization schedule, model files and optimization parameters reported?
- Is the model black box or interpretable?
- Is the model classification or regression?
- How much time does a single representative prediction require on a standard machine?
- Is the source code released?
- How was the method evaluated?
- Which performance metrics are reported?
- Was a comparison to publicly available methods performed on benchmark datasets?
- Do the performance metrics have confidence intervals?
- Are the raw evaluation files available?
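On the question of confidence intervals for performance metrics: a common approach is a bootstrap over the held-out predictions. A minimal sketch on hypothetical labels and predictions (the 85% accuracy and resample count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical held-out labels and model predictions (~85% accurate).
y_true = rng.integers(0, 2, 200)
y_pred = np.where(rng.random(200) < 0.85, y_true, 1 - y_true)

correct = (y_true == y_pred).astype(float)

# Bootstrap: resample the per-example correctness with replacement
# and collect the accuracy of each resample.
boots = [rng.choice(correct, size=correct.size, replace=True).mean()
         for _ in range(1000)]
lo, hi = np.percentile(boots, [2.5, 97.5])  # 95% percentile interval
```

Reporting `accuracy = mean ± (hi - lo)/2`, or the interval itself, is far more informative than a point estimate alone.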