Things to consider when building an ML model
Exploratory Data Analysis (EDA)
- Learn about your data
- Type of the data being analyzed
- Is the dataset (e.g. a sensor reading) measuring what you think it is measuring?
- Distribution of the data / target
- Outliers in the data
- Single variables
- Type – continuous, discrete, categorical
- Distribution in the data
- Scale of each variable
- What is the resolution I am interested in?
- Outliers in the variables
- Feature correlations
- Start simple – linear correlation
- Use domain knowledge and see if they make sense
- Look at subset of the data to make it tractable / subsampling
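The single-variable and correlation checks above can be sketched with pandas; the dataset and column names here are purely illustrative stand-ins for your real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical toy dataset standing in for the real data under analysis.
df = pd.DataFrame({
    "temp": rng.normal(20, 5, 200),       # continuous variable
    "pressure": rng.normal(1.0, 0.1, 200),
})
df["target"] = 2 * df["temp"] + rng.normal(0, 1, 200)

summary = df.describe()              # per-variable scale, spread, range
corr = df.corr(numeric_only=True)    # start simple: linear (Pearson) correlation

# Flag outliers with a simple z-score cut (3 standard deviations)
z = (df - df.mean()) / df.std()
outliers = (z.abs() > 3).any(axis=1)
```

Checking `corr` against domain knowledge (does the sign and strength make sense?) is often more informative than the numbers alone.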
- Selection and feature engineering
- Make new (better?) features by combining the original features
- Recast, resample, forward difference, simple arithmetic operations
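A minimal sketch of the resample / forward-difference / arithmetic operations listed above, assuming a hypothetical hourly sensor series (the column names are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly sensor readings.
idx = pd.date_range("2024-01-01", periods=48, freq="h")
df = pd.DataFrame({"flow": np.linspace(10, 20, 48),
                   "area": np.full(48, 2.0)}, index=idx)

# Forward difference: change since the previous reading.
df["flow_diff"] = df["flow"].diff()

# Resample: recast hourly readings to a daily mean.
daily = df["flow"].resample("D").mean()

# Simple arithmetic combination of original features.
df["velocity"] = df["flow"] / df["area"]
```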
Get your feet wet
- Split into train and test – look at target proportion statistics
- Split train into CV or train / validation
- Train models on the training data:
  - Linear models
  - Non-linear models
  - Ensemble models
  - Decision models
- Model hyperparameters
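The split-then-cross-validate workflow above can be sketched with scikit-learn on a synthetic dataset (the dataset and model choice here are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Hold out a test set; stratify keeps target proportions comparable across splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Cross-validate on the training portion only; the test set stays untouched.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_tr, y_tr, cv=5)
```

Stratifying the split is one easy way to "look at target proportion statistics" by construction rather than after the fact.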
Understand the model predictions, hyperparameters
- Train on full training dataset
- Avenues of data bleed (leakage)
- Split quality - is the train/validation data representative of test data / real-life data?
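One quick check of split quality is a two-sample test on a feature's distribution across train and validation; this sketch uses a Kolmogorov–Smirnov test on synthetic, deliberately shifted data (the feature and threshold are illustrative assumptions):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Hypothetical feature values from the two splits; the validation
# split is deliberately shifted to simulate a non-representative split.
feat_train = rng.normal(0.0, 1.0, 500)
feat_valid = rng.normal(0.5, 1.0, 200)

# Two-sample KS test: a small p-value flags a distribution mismatch.
stat, pvalue = ks_2samp(feat_train, feat_valid)
mismatch = pvalue < 0.05
```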
Points to think about when reviewing the ML method:
- What is the source of the data (database, publication, direct experiment)?
- How many data points are in the training, validation and test sets?
- How were the sets split? Is any bias being introduced based on the type of split?
- Are the data, including the data splits used, released in a public forum?
- How were the data encoded and preprocessed for the ML algorithm?
- How many parameters (p) are used in the model?
- How many features (f) are used as input?
- Is p much larger than the number of training points and/or is f large?
- Which overfitting prevention techniques were used?
- Are the hyperparameter configurations, optimization schedule, model files and optimization parameters reported?
- Is the model black box or interpretable?
- Is the model classification or regression?
- How much time does a single representative prediction require on a standard machine?
- Is the source code released?
- How was the method evaluated?
- Which performance metrics are reported?
- Was a comparison to publicly available methods performed on benchmark datasets?
- Do the performance metrics have confidence intervals?
- Are the raw evaluation files available?
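On the question of confidence intervals for performance metrics: a common approach is a bootstrap over the held-out predictions. A minimal sketch on hypothetical labels and predictions (the 85% accuracy and resample count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical held-out labels and model predictions (~85% accurate).
y_true = rng.integers(0, 2, 200)
y_pred = np.where(rng.random(200) < 0.85, y_true, 1 - y_true)

correct = (y_true == y_pred).astype(float)

# Bootstrap: resample the per-example correctness with replacement
# and collect the accuracy of each resample.
boots = [rng.choice(correct, size=correct.size, replace=True).mean()
         for _ in range(1000)]
lo, hi = np.percentile(boots, [2.5, 97.5])  # 95% percentile interval
```

Reporting `accuracy = mean ± (hi - lo)/2`, or the interval itself, is far more informative than a point estimate alone.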