Things to consider when building an ML model
Exploratory Data Analysis (EDA)
- Learn about your data
- Type of the data being analyzed
- Does the data set or sensor reading measure what you think it is measuring?
- Distribution of the data / target
- Outliers in the data
- Single variables
- Type: continuous, discrete, categorical
- Distribution in the data
- Scale of each variable
- What is the resolution I am interested in?
- Outliers in the variables
- Feature correlations
- Start simple – linear correlation
- Use domain knowledge and see if they make sense
- Look at a subset of the data to keep it tractable (subsampling)
- Feature selection and feature engineering
- Make new (better?) features by combining the original features
- Recast, resample, forward difference, simple arithmetic operations
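
A few of the EDA steps above (per-variable scale, outliers, linear correlation, a forward-difference feature) can be sketched with NumPy. This is a minimal illustration, not a complete EDA workflow: the `temperature` and `sensor` arrays are hypothetical toy data, the injected outlier is artificial, and Tukey's IQR rule is just one of several reasonable outlier definitions.

```python
import numpy as np

def iqr_outlier_mask(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def summarize(x):
    """Basic per-variable summary: scale, spread and outlier count."""
    x = np.asarray(x, dtype=float)
    return {
        "min": float(x.min()),
        "max": float(x.max()),
        "mean": float(x.mean()),
        "std": float(x.std(ddof=1)),
        "n_outliers": int(iqr_outlier_mask(x).sum()),
    }

# Hypothetical toy data: a slowly rising signal and a correlated sensor reading.
rng = np.random.default_rng(0)
temperature = np.linspace(10.0, 30.0, 200) + rng.normal(0.0, 0.5, 200)
sensor = 2.0 * temperature + rng.normal(0.0, 1.0, 200)
sensor[5] = 500.0  # injected outlier, for illustration only

summary = summarize(sensor)
mask = iqr_outlier_mask(sensor)

# Start simple: linear (Pearson) correlation, computed without flagged outliers.
r = np.corrcoef(temperature[~mask], sensor[~mask])[0, 1]

# A simple engineered feature: forward difference x[t+1] - x[t].
forward_diff = np.diff(temperature)

print(summary)
print(f"Pearson r without outliers: {r:.3f}")
```

Whether a flagged point is an error or a real event is a domain question; the mask only marks candidates for a closer look.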
Get your feet wet
- Split into train and test sets; compare the target statistics across the splits
- Split train into CV or train / validation
- Train models on the training data: linear models, nonlinear models, ensemble models, decision models
- Model hyperparameters
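
The first-pass workflow above might look roughly like the following sketch. It assumes a hypothetical toy regression problem and uses only NumPy; ordinary least squares stands in for the linear model and k-nearest-neighbour regression for a nonlinear one (ensemble and decision models are left out for brevity). The helper names `fit_predict_linear`, `fit_predict_knn` and `cv_rmse` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy regression problem: y depends nonlinearly on x.
X = rng.uniform(-3.0, 3.0, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1, size=300)

# 1) Hold out a test set once; it is not used during model selection.
idx = rng.permutation(len(X))
n_test = len(X) // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# 2) Compare target statistics across the splits.
print(f"train target mean/std: {y_train.mean():.3f} / {y_train.std():.3f}")
print(f"test  target mean/std: {y_test.mean():.3f} / {y_test.std():.3f}")

def fit_predict_linear(X_tr, y_tr, X_va):
    """Ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_va)), X_va]) @ coef

def fit_predict_knn(X_tr, y_tr, X_va, k=5):
    """k-nearest-neighbour regression; k is the hyperparameter."""
    d = np.linalg.norm(X_va[:, None, :] - X_tr[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return y_tr[nearest].mean(axis=1)

def cv_rmse(fit_predict, X, y, n_folds=5, **kwargs):
    """Mean RMSE over n_folds folds of the (already shuffled) training data."""
    folds = np.array_split(np.arange(len(X)), n_folds)
    scores = []
    for va in folds:
        tr = np.setdiff1d(np.arange(len(X)), va)
        pred = fit_predict(X[tr], y[tr], X[va], **kwargs)
        scores.append(np.sqrt(np.mean((pred - y[va]) ** 2)))
    return float(np.mean(scores))

# 3) Cross-validate each candidate on the training data only.
linear_rmse = cv_rmse(fit_predict_linear, X_train, y_train)
knn_rmse = {k: cv_rmse(fit_predict_knn, X_train, y_train, k=k) for k in (1, 5, 25)}

print(f"linear CV RMSE: {linear_rmse:.3f}")
for k, score in knn_rmse.items():
    print(f"kNN (k={k}) CV RMSE: {score:.3f}")
```

The hyperparameter `k` is chosen from the cross-validation scores; the held-out test set is only scored once, after that choice is made.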
Understand the model predictions and hyperparameters
- Train on full training dataset
- Avenues of data bleed (information leaking from validation / test data into training)
- Split quality - is the train/validation data representative of test data / real-life data?
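
Two common avenues of data bleed, and one rough check of split quality, can be sketched as follows. The arrays are hypothetical toy data with an injected duplicate row; the `ks_statistic` helper is a hand-written two-sample Kolmogorov-Smirnov statistic used here only as one possible measure of how far apart the train and validation distributions are.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, size=(200, 3))
X_valid = rng.normal(0.0, 1.0, size=(50, 3))
X_valid[0] = X_train[0]  # injected duplicate, for illustration only

# Leak check 1: rows that appear in both splits.
train_rows = {row.tobytes() for row in X_train}
n_shared = sum(row.tobytes() in train_rows for row in X_valid)

# Leak check 2: fit preprocessing statistics on the training split only,
# then apply those same statistics to the validation split.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_scaled = (X_train - mu) / sigma
X_valid_scaled = (X_valid - mu) / sigma

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Split quality: per-feature distribution gap between train and validation.
ks_per_feature = [
    ks_statistic(X_train[:, j], X_valid[:, j]) for j in range(X_train.shape[1])
]

print(f"rows shared between splits: {n_shared}")
print("KS statistic per feature:", [round(s, 3) for s in ks_per_feature])
```

A large per-feature gap suggests the validation data may not be representative of the training data; what counts as "large" depends on sample size and is a judgment call.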
Points to think about when reviewing the ML method:
What is the source of the data (database, publication, direct experiment)?
How many data points are in the training, validation and test sets?
How were the sets split? Is any bias being introduced based on the type of split?
Are the data, including the data splits used, released in a public forum?
How were the data encoded and preprocessed for the ML algorithm?
How many parameters (p) are used in the model?
How many features (f) are used as input?
Is p much larger than the number of training points and/or is f large?
Which overfitting prevention techniques were used?
Are the hyperparameter configurations, optimization schedule, model files and optimization parameters reported?
Is the model a black box or interpretable?
Is the model a classifier or a regressor?
How much time does a single representative prediction require on a standard machine?
Is the source code released?
How was the method evaluated?
Which performance metrics are reported?
Was a comparison to publicly available methods performed on benchmark datasets?
Do the performance metrics have confidence intervals?
Are the raw evaluation files available?
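
For the question about confidence intervals on performance metrics, one common option is a percentile bootstrap over the test set. The sketch below assumes hypothetical labels and predictions, and the function names `bootstrap_ci` and `accuracy` are made up for this example; other interval constructions may be more appropriate for small test sets.

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for metric(y_true, y_pred)."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample test points with replacement
        stats[b] = metric(y_true[idx], y_pred[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return float(np.mean(y_true == y_pred))

# Hypothetical test labels and predictions: 15 of 100 predictions are wrong.
y_true = np.array([0, 1] * 50)
y_pred = y_true.copy()
y_pred[:15] = 1 - y_pred[:15]

point = accuracy(y_true, y_pred)
lo, hi = bootstrap_ci(y_true, y_pred, accuracy)
print(f"accuracy = {point:.2f}, 95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```

Reporting the interval alongside the point estimate makes it easier to judge whether a difference between two methods is larger than the test-set sampling noise.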