import os
import copy
import numpy as np
42) np.random.seed(
What is classification?
Some times the data in a supervised learning task is more qualitative than quantitative such as high, medium, low; or the task has categorical outputs such as colors - red, blue, green, or type of fruits: orange or apple. In such cases we look at classification models for training a model predictor. Various types of classification tasks exists:
Type of classification:
1. Binary classifier:
Distinguish between two classes - high, low, or cat and a dog. A simplest algorithm to train for such a task is logistic regression. Most of the off-the-shelf algorithms work directly on the such tasks – SVM, Random Forests, Naive Bayes.
2. Multi-class
classifier:
Distinguish between more than two classes - digits classifier is an example of multi-class classification since for a given digit image the answer could any number from 0-9. It is still one
class but multiple options exists. Certain algorithms like Random Forest, Naive Bayes are capable of handling multi-class classifier. Others like SVM and linear classifiers are strictly binary classifier.
There are ways you can convert a binary classifier: * One-versus-all:
Train n
classifier for n
classes such that each classifier ONLY predicts whether that class is present or not. Eg: Train a classifier to predict if a digit is 2 or not.
- One-versus-one:
Pair-wise classifier, in this case models are trained in a binary fashion for as many pair there can be between n
classes. This can become computationally expensive since for n
classes: \(n(n-1)/2\) classifiers are needed.
Main advantage in this approach is that size of training data is small as only pair-wise data is required. Such an approach is useful when models dont scale well with large data – such as SVM or gaussian based
3. Multi-label
or Multi-output classifier:
Distinguish between more than class but also the answer is not just one value but a list of possiblities. When the output amongst the list is only Binary it is usually refered to as Multi-label
classification. For example:
If we train a model to classify a digit image as:
1. Is it smaller than 4? (1:yes; 0:no)
2. Is it odd number? (1:yes; 0:no)
3. Is it is greater than 7? (1:yes; 0:no)
Then the output would be a list - for 5: [0, 1, 0] ; 4: [0, 0, 0]; 3: [1, 1, 0]
K-Nearest neighbor is a type of classifier which supports such a classification.
Scikit-learn has a wonderful documentation on metrics to be used for different types of classification tasks (Link here)
From the Scikit-learn documentation:
Some metrics are essentially defined for binary classification tasks (e.g. f1_score, roc_auc_score). In these cases, by default only the positive label is evaluated, assuming by default that the positive class is labelled 1 (though this may be configurable through the pos_label parameter).
In extending a binary metric to multiclass or multilabel problems, the data is treated as a collection of binary problems, one for each class. There are then a number of ways to average binary metric calculations across the set of classes, each of which may be useful in some scenario. Where available, you should select among these using the average parameter.
The code in this notebook is adapted from Aurélien Geron’s hands-on machine learning tutorial on Classifications Github Link
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
# High DPI rendering for mac
%config InlineBackend.figure_format = 'retina'
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
= {
plot_params 'font.size' : 22,
'axes.titlesize' : 24,
'axes.labelsize' : 20,
'axes.labelweight' : 'bold',
'xtick.labelsize' : 16,
'ytick.labelsize' : 16,
}
plt.rcParams.update(plot_params)
MNIST dataset
70,000 small images of hand-written numbers. Each image has 784 features. Those features are split in 28x28 pixels and each feature is simply that pixel gray-scale intensity. Value for each pixel ranges from 0 to 255.
from sklearn.datasets import load_digits
= load_digits()
mnist print(mnist.data.shape)
(1797, 64)
# Using MNIST data from openML website
from sklearn.datasets import fetch_openml
= fetch_openml('mnist_784', version=1, cache=True)
mnist = mnist.target.astype(np.int8) # fetch_openml() returns targets as strings mnist.target
def sort_by_target(mnist):
= np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
reorder_test
#minist.data is a pandas DataFrame
= np.array(mnist.data.values)
mnist_data_numpy = np.array(mnist.target.values)
mnist_target_numpy
60000] = mnist_data_numpy[reorder_train]
mnist_data_numpy[:60000] = mnist_target_numpy[reorder_train]
mnist_target_numpy[:60000:] = mnist_data_numpy[reorder_test + 60000]
mnist_data_numpy[60000:] = mnist_target_numpy[reorder_test + 60000]
mnist_target_numpy[
return mnist_data_numpy, mnist_target_numpy
= sort_by_target(mnist) X, y
= 62123
random_idx
= X[random_idx], y[random_idx]
digit_image, digit_label print('The {0} entry is a photo of {1}'.format(random_idx, digit_label))
=digit_image.reshape(28,28)
random_digit_image=cm.binary,
plt.imshow(random_digit_image, cmap="nearest")
interpolation"off"); plt.axis(
The 62123 entry is a photo of 2
#Plot digit convenience function
def plot_digit(data):
= data.reshape(28, 28)
image = mpl.cm.binary,
plt.imshow(image, cmap ="nearest")
interpolation"off") plt.axis(
= X[:60000], X[60000:], y[:60000], y[60000:] X_train, X_test, y_train, y_test
Before training we shuffle the data to ensure all cross-validation folds to be similar. Moreover some classficiation algorithms are sensitive to the order of training instances, and they perform poorly if they get many similar instances in a row.
#Index shuffling
import numpy as np
= np.random.permutation(60000)
index_shuffle = X_train[index_shuffle], y_train[index_shuffle] X_train, y_train
Binary classification
Here we will build a single digit classifier – for example looking at just 2. Hence in total there will be only 2 classes – Those which are 2 and those which are not.
= (y_train == 2) #True for all 2s, False for all other digits
y_train_2 = (y_test == 2) y_test_2
Using Stochastic Gradient Descent
classifier. Known to handle large datasets very well.
from sklearn.linear_model import SGDClassifier
= SGDClassifier(max_iter=5, tol=-np.infty, random_state=42)
sgd_clf sgd_clf.fit(X_train, y_train_2)
SGDClassifier(max_iter=5, random_state=42, tol=-inf)
sgd_clf.predict([digit_image])
array([ True])
Performance metrics
Evaluating classifiers is often significantly challenging than the case for a regressor wherein we can use RMSE or MAE. Let’s look at some usual metrics used to gauge the classifier performance.
1. Accuracy using Cross-validation
It involves splitting your training data in K-folds. Training the model on K-1 folds and testing it on the left out fold. Scikit learn has in-built method to do so: cross_val_score()
. We can implement our own version as well.
\[ \% Accuracy = \frac{Correct}{Total} * 100 \]
#Inside the scikit-learn's crossvalidation accuracy method
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
= StratifiedKFold(n_splits=3, shuffle=False)
skfolds
for train_index, test_index in skfolds.split(X_train,y_train_2):
= clone(sgd_clf)
clone_clf = X_train[train_index]
X_train_folds = y_train_2[train_index]
y_train_folds
= X_train[test_index]
X_test_folds = y_train_2[test_index]
y_test_folds
clone_clf.fit(X_train_folds, y_train_folds)=clone_clf.predict(X_test_folds)
y_pred
= sum(y_pred == y_test_folds)
n_correct print(n_correct/len(y_pred))
0.97365
0.96255
0.96165
#Using scikit-learn's in-built method
from sklearn.model_selection import cross_val_score
=3, scoring='accuracy') cross_val_score(sgd_clf, X_train, y_train_2, cv
array([0.97365, 0.96255, 0.96165])
Does this high accuracy tell us anything?
Is the sample space we are looking at uniform enough for this accuracy?
Maybe we have way less one-digit samples for training in the first place.
=0.
_countfor i in range(len(y_train)):
if y_train[i] == 2:
=_count+1.
_countprint(_count/len(y_train)*100)
9.93
So ~9% of the sample are actually 2. So even if we guess ALWAYS that image is not 2 we will be right 90% of the time!
The dumb classifier
To check whether classifier accuracy of ~95% is good enough so just a over-exagerration
from sklearn.base import BaseEstimator
class Never2(BaseEstimator):
def fit(self, X, y=None):
pass
def predict(self, X):
return(np.zeros((len(X),1),dtype=bool))
= Never2()
never2 =3,scoring='accuracy') cross_val_score(never2,X_train,y_train_2,cv
array([0.9017, 0.9001, 0.9003])
This shows our data is skewed!
2. Confusion Matrix
General idea is to count the number of times instances of Class A are classified as Class B.
Table that describes the performance of a classification model by grouping predictions into 4 categories. - True Positives: we correctly predicted they do have diabetes - True Negatives: we correctly predicted they don’t have diabetes - False Positives: we incorrectly predicted they do have diabetes (Type I error) - False Negatives: we incorrectly predicted they don’t have diabetes (Type II error)
The ROWS in the matrix are the real class-labels i.e. the TRUTH values while COLUMNS are the predicted values.
<td colspan="2"><p><b>Actual class (observation)</b></p></td>
Predicted class (expectation) |
tp (true positive) Correct result |
fp (false positive) Unexpected result |
fn (false negative) Missing result |
tn (true negative) Correct absence of result |
from sklearn.model_selection import cross_val_predict
= cross_val_predict(sgd_clf, X_train, y_train_2, cv=3) y_train_pred
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_2, y_train_pred)
array([[53159, 883],
[ 1160, 4798]])
Each row in the confusion matrix represent actual class
, while each column represents a predicted class
. Following are the terms of the confusion matrix: 1. First row of this matrix is the non-2 images – - (0,0) instances were correctly classified as non 2 (True Negative) - (0,1) instances were wrongly classified as 2s (False Positive) 2. Second row considers the images of 5 – - (1,0) instances were wrongly classified as non 2s (False negatives) - (1,1) instances were correctly classified as 2s (True positives)
An ideal classifier would be a diagonal matrix with no false positives or false negatives
Precision
Think of this as precision of the model in estimating the binary class. FROM POV of the MODEL
It is the ratio of the total classification whether as True
or Wrongly classified as True
to True
. That is, \[Precision = \frac{TP}{TP+FP}\]
This is looking at +ve classification and how many are really +ve and how many are wrongly shown as +ve. So Precision looks at the prediction of +ve results.
Recall
Think of this as comparing to the actual data in the model training. FROM POV of the REAL DATA
It is the ratio of total classification on the +ve samples from where they are classified correctly (TP) to wrongly classified as negative (FN).
\[Recall = \frac{TP}{FN+TP}\]
So Recall looks at the prediction of the +ve samples.
This is by just comparing the +ve samples in the binary classification. To check how many of them are correctly recalled as +ve.
F1 score
Harmonic mean of recall and precision. Higher the Precision and Recall, lower are the instances of FP and FN. So we want to have higher Recall and Precision both.
F1 favors classifiers with similar recall and precision.
from sklearn.metrics import precision_score, recall_score, f1_score
print('Precision score: {}'.format(precision_score(y_train_2, y_train_pred)))
print('Recall score: {}'.format(recall_score(y_train_2, y_train_pred)))
print('F1 score: {}'.format(f1_score(y_train_2, y_train_pred)))
Precision score: 0.8445696180249956
Recall score: 0.8053037932192011
F1 score: 0.8244694561388435
Recall/Precision tradeoff
Unfortunately increasing precision reduces recall and vise-versa. However sometimes one of the qualities could be desirable in a model.
Recall looks at lowering the False Negatives
so culling +ve cases. That could be detrimental in catching robberies. So we need classifiers with high recall and we are okay with low Precision
wherein we would get False alarms.
Meanwhile, if we are censoring videos we need high Precision
to ensure unsafe videos categorised as Safe ones. While we could be removing good videos by wrongly classifying them to be Unsafe (low recall).
More discussion here: Link
Decision functions evaluate a decision_score
we can manually set the threshold
for the score to whether that will accpted or rejected for the binary case.
Increasing threshold reduces recall, but increases precision.
Why? The more Precise
you want to be i.e. more True Positive than False Positives – the higher the threshold for passing the case of accepting the data as a given class. However doing so we are strict in what we define as a ideal class and can neglect samples which are positive but are not closest to ideal. Hence we do incorrectly mark them as Negative thus increasing the case of False Negaitives and hence lowering Recall
.
#Decision scores for all instnces in the training set --
= cross_val_predict(sgd_clf, X_train, y_train_2, cv=3,
y_scores ="decision_function") method
from sklearn.metrics import precision_recall_curve
= precision_recall_curve(y_train_2, y_scores) precisions, recalls, thresholds
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
-1], "b--", label="Precision", linewidth=2)
plt.plot(thresholds, precisions[:-1], "g-", label="Recall", linewidth=2)
plt.plot(thresholds, recalls[:"Threshold", fontsize=10)
plt.xlabel(="best", fontsize=12)
plt.legend(loc
0, 1])
plt.ylim([
=(8, 4))
plt.figure(figsize
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)-700000, 700000])
plt.xlim([ plt.show()
=(8, 4))
plt.figure(figsize-1],precisions[:-1], "b--", label="Precision", linewidth=2)
plt.plot(recalls[:"Precision", fontsize=16)
plt.ylabel("Recall", fontsize=16) plt.xlabel(
Text(0.5, 0, 'Recall')
If someone says let’s reach 99% PRECISION, we must ALWAYS ask at what RECALL?
Manually set the Recall/Precision using threshold
= sgd_clf.decision_function([digit_image])
y_scores print(y_scores)
= sgd_clf.predict([digit_image])
y_pred_thresh print(y_pred_thresh)
#Setting threshold higher than the y_score
= y_scores + 1.0
threshold = (y_scores > threshold)
y_pred_thresh print(y_pred_thresh)
[247991.40436599]
[ True]
[False]
#Changing the threshold for everyother training example -- more than 90% precision
= cross_val_predict(sgd_clf, X_train, y_train_2, cv=3,
y_scores ="decision_function")
method= (y_scores > 200000)
y_train_pred_90 print('Precision score: {}'.format(precision_score(y_train_2, y_train_pred_90)))
print('Recall score: {}'.format(recall_score(y_train_2, y_train_pred_90)))
print('F1 score: {}'.format(f1_score(y_train_2, y_train_pred_90)))
Precision score: 0.9637028700056275
Recall score: 0.5748573346760658
F1 score: 0.720142977291842
We have made classifier with an arbitrary Precision score: 97% However doing so we reduced the Recall.
The ROC curve
Another common tool used for binary classifiers apart from Precision/Recall. Instead of plotting precision vs recall we plot True Positive Rate
(TPR) i.e. Recall
against False Positive Rate
(FPR). FPR is the ratio of negative instances that are incorrectly classified as positive.
ROC plots sensitivity vs 1-specificty
from sklearn.metrics import roc_curve
#Decision scores for all instnces in the training set --
= cross_val_predict(sgd_clf, X_train, y_train_2, cv=3,
y_scores ="decision_function")
method
= roc_curve(y_train_2, y_scores)
fpr, tpr, thresholds
def plot_roc_curve(fpr, tpr, label=None):
=2, label=label)
plt.plot(fpr, tpr, linewidth0, 1], [0, 1], 'k--')
plt.plot([0, 1, 0, 1])
plt.axis(['False Positive Rate', fontsize=16)
plt.xlabel('True Positive Rate', fontsize=16)
plt.ylabel(
=(8, 6))
plt.figure(figsize
plot_roc_curve(fpr, tpr) plt.show()
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_2, y_scores)
0.9651158581307573
PR curve when we care of precision – getting False +ve and not so much of getting False -ve. We are okay with losing some +ve cases but for sure do not want to neglect any -ve ones.
Random forest classifier
from sklearn.ensemble import RandomForestClassifier
= RandomForestClassifier(n_estimators=10, random_state=42)
forest_clf = cross_val_predict(forest_clf, X_train, y_train_2, cv=3,
y_probas_forest ="predict_proba") method
= y_probas_forest[:, 1] # score = proba of positive class
y_scores_forest = roc_curve(y_train_2, y_scores_forest) fpr_forest, tpr_forest, thresholds_forest
=(8, 6))
plt.figure(figsize"b:", linewidth=2, label="SGD")
plt.plot(fpr, tpr, "Random Forest")
plot_roc_curve(fpr_forest, tpr_forest, ="lower right", fontsize=16)
plt.legend(loc plt.show()
Multiclass classification
Multiclass classifiers are able to label and distinguish between more than two classes. Some algorithms such as Random Forest and Näive Bayes are capable of handling this directly. Having said that, Naive Baye’s has shortcomming of considering class conditional independence and having discrete entries in the input.
OvA (One-versus-all classifiers): Herein, we would train n binary classifiers for n type of labels and see which n-th classifier has highest decision score.
OvO (One-versus-one strategy): Binary classifier for every pair. So for n labels we will have n(n-1)/2 classifiers.
Error analysis
= X[:60000], X[60000:], y[:60000], y[60000:]
X_train, X_test, y_train, y_test = np.random.permutation(60000)
index_shuffle = X_train[index_shuffle], y_train[index_shuffle]
X_train, y_train
= SGDClassifier(max_iter=5, tol=-np.infty, random_state=42)
sgd_clf
sgd_clf.fit(X_train, y_train)
= cross_val_predict(sgd_clf, X_train, y_train, cv=3)
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3,
y_scores ="decision_function")
method= confusion_matrix(y_train, y_train_pred) conf_mx
=plt.cm.gray)
plt.matshow(conf_mx, cmap plt.show()
Plotting heat-map for the errors in the classification
= conf_mx.sum(axis=1, keepdims=True)
row_sums = conf_mx / row_sums
norm_conf_mx #Diagonals are filled to be zero to concentrate only at the errors
0)
np.fill_diagonal(norm_conf_mx, =plt.cm.gray)
plt.matshow(norm_conf_mx, cmap plt.show()
ROWS
in the confusion matrix are the REAL labels. COLUMNS
in the confusion matrix are the PREDICTED values. It can seen that in the case of row 3 and column 5: - 5 is most of the times confused with 3 and 8 - 9 is confused with 4 and 7
def plot_digits(instances, images_per_row=10, **options):
= 28
size = min(len(instances), images_per_row)
images_per_row = [instance.reshape(size,size) for instance in instances]
images = (len(instances) - 1) // images_per_row + 1
n_rows = []
row_images = n_rows * images_per_row - len(instances)
n_empty * n_empty)))
images.append(np.zeros((size, size for row in range(n_rows):
= images[row * images_per_row : (row + 1) * images_per_row]
rimages =1))
row_images.append(np.concatenate(rimages, axis= np.concatenate(row_images, axis=0)
image = cm.binary, **options)
plt.imshow(image, cmap "off") plt.axis(
= 3, 5
cl_a, cl_b = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ab = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_b)] X_bb
=(8,8))
plt.figure(figsize221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
plt.subplot( plt.show()
Given above are two sets of ‘3’ and ‘5’ – the boxes to the left are 3 and 5 classified as 3. Top left are the images of 3 classified as 3 while Bottom left are the images of 5 classified as 3. It can seen that some imags of 5 quite poor and the algorithm (which is linear in this case) will have difficulty predicting it.
Multi-label classifier
Determine a label such that it is a list for a every digit answering two questions: 1. Is this number odd? 1: Yes, 0: No 2. Is this number greater than 7? 1: Yes, 0: No
Creating new y_label for model
# Setting two conditions
= (y % 2 == 0).astype(int)
is_odd = (y > 7).astype(int)
is_greater_7
= np.c_[is_odd, is_greater_7] y_multilabel
from sklearn.neighbors import KNeighborsClassifier
= KNeighborsClassifier() knn_clf
= X[:60000], X[60000:], y_multilabel[:60000], y_multilabel[60000:]
X_train, X_test, y_ml_train, y_ml_test = np.random.permutation(60000)
index_shuffle = X_train[index_shuffle], y_ml_train[index_shuffle] X_train, y_ml_train
knn_clf.fit(X_train, y_ml_train)
KNeighborsClassifier()
# Predict
12]]) knn_clf.predict([X_train[
array([[0, 0]])
12] y_ml_train[
array([0, 0])
# Metric
= cross_val_predict(knn_clf, X_train, y_ml_train, cv=3) y_knn_ml_pred
# macro assumes all labels are equally important
# another option is the 'weighted'
='macro') f1_score(y_ml_train, y_knn_ml_pred, average
0.9719059410724892