 What is classification?
 MNIST dataset
 Binary classification
 Performance metrics
 Random forest classifier
 Multiclass classification
 Error analysis
 Plotting heatmap for the errors in the classification
 Multilabel classifier
What is classification?
Some times the data in a supervised learning task is more qualitative than quantitative such as high, medium, low; or the task has categorical outputs such as colors  red, blue, green, or type of fruits: orange or apple. In such cases we look at classification models for training a model predictor. Various types of classification tasks exists:
Type of classification:
1. Binary classifier:
Distinguish between two classes  high, low, or cat and a dog. A simplest algorithm to train for such a task is logistic regression. Most of the offtheshelf algorithms work directly on the such tasks  SVM, Random Forests, Naive Bayes.
class
classifier:
2. MultiDistinguish between more than two classes  digits classifier is an example of multiclass classification since for a given digit image the answer could any number from 09. It is still one
class but multiple options exists. Certain algorithms like Random Forest, Naive Bayes are capable of handling multiclass classifier. Others like SVM and linear classifiers are strictly binary classifier.
There are ways you can convert a binary classifier:
 Oneversusall:
Train n
classifier for n
classes such that each classifier ONLY predicts whether that class is present or not. Eg: Train a classifier to predict if a digit is 2 or not.
 Oneversusone:
Pairwise classifier, in this case models are trained in a binary fashion for as many pair there can be between n
classes. This can become computationally expensive since for n
classes: $n(n1)/2$ classifiers are needed.
Main advantage in this approach is that size of training data is small as only pairwise data is required. Such an approach is useful when models dont scale well with large data  such as SVM or gaussian based
label
or Multioutput classifier:
3. MultiDistinguish between more than class but also the answer is not just one value but a list of possiblities. When the output amongst the list is only Binary it is usually refered to as Multilabel
classification. For example:
If we train a model to classify a digit image as:
1. Is it smaller than 4? (1:yes; 0:no)
2. Is it odd number? (1:yes; 0:no)
3. Is it is greater than 7? (1:yes; 0:no)
Then the output would be a list  for 5: [0, 1, 0] ; 4: [0, 0, 0]; 3: [1, 1, 0]
KNearest neighbor is a type of classifier which supports such a classification.
Scikitlearn has a wonderful documentation on metrics to be used for different types of classification tasks (Link here)
From the Scikitlearn documentation:
Some metrics are essentially defined for binary classification tasks (e.g. f1_score, roc_auc_score). In these cases, by default only the positive label is evaluated, assuming by default that the positive class is labelled 1 (though this may be configurable through the pos_label parameter).
In extending a binary metric to multiclass or multilabel problems, the data is treated as a collection of binary problems, one for each class. There are then a number of ways to average binary metric calculations across the set of classes, each of which may be useful in some scenario. Where available, you should select among these using the average parameter.
The code in this notebook is adapted from Aurélien Geron's handson machine learning tutorial on Classifications Github Link
import os
import copy
import numpy as np
np.random.seed(42)
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
# High DPI rendering for mac
%config InlineBackend.figure_format = 'retina'
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
plot_params = {
'font.size' : 22,
'axes.titlesize' : 24,
'axes.labelsize' : 20,
'axes.labelweight' : 'bold',
'xtick.labelsize' : 16,
'ytick.labelsize' : 16,
}
plt.rcParams.update(plot_params)
from sklearn.datasets import load_digits
mnist = load_digits()
print(mnist.data.shape)
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, cache=True)
mnist.target = mnist.target.astype(np.int8) # fetch_openml() returns targets as strings
def sort_by_target(mnist):
reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
#minist.data is a pandas DataFrame
mnist_data_numpy = np.array(mnist.data.values)
mnist_target_numpy = np.array(mnist.target.values)
mnist_data_numpy[:60000] = mnist_data_numpy[reorder_train]
mnist_target_numpy[:60000] = mnist_target_numpy[reorder_train]
mnist_data_numpy[60000:] = mnist_data_numpy[reorder_test + 60000]
mnist_target_numpy[60000:] = mnist_target_numpy[reorder_test + 60000]
return mnist_data_numpy, mnist_target_numpy
X, y = sort_by_target(mnist)
random_idx = 62123
digit_image, digit_label = X[random_idx], y[random_idx]
print('The {0} entry is a photo of {1}'.format(random_idx, digit_label))
random_digit_image=digit_image.reshape(28,28)
plt.imshow(random_digit_image, cmap=cm.binary,
interpolation="nearest")
plt.axis("off");
def plot_digit(data):
image = data.reshape(28, 28)
plt.imshow(image, cmap = mpl.cm.binary,
interpolation="nearest")
plt.axis("off")
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
Before training we shuffle the data to ensure all crossvalidation folds to be similar. Moreover some classficiation algorithms are sensitive to the order of training instances, and they perform poorly if they get many similar instances in a row.
import numpy as np
index_shuffle = np.random.permutation(60000)
X_train, y_train = X_train[index_shuffle], y_train[index_shuffle]
y_train_2 = (y_train == 2) #True for all 2s, False for all other digits
y_test_2 = (y_test == 2)
Using Stochastic Gradient Descent
classifier. Known to handle large datasets very well.
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(max_iter=5, tol=np.infty, random_state=42)
sgd_clf.fit(X_train, y_train_2)
sgd_clf.predict([digit_image])
Performance metrics
Evaluating classifiers is often significantly challenging than the case for a regressor wherein we can use RMSE or MAE. Let's look at some usual metrics used to gauge the classifier performance.
1. Accuracy using Crossvalidation
It involves splitting your training data in Kfolds. Training the model on K1 folds and testing it on the left out fold. Scikit learn has inbuilt method to do so: cross_val_score()
. We can implement our own version as well.
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
skfolds = StratifiedKFold(n_splits=3, shuffle=False)
for train_index, test_index in skfolds.split(X_train,y_train_2):
clone_clf = clone(sgd_clf)
X_train_folds = X_train[train_index]
y_train_folds = y_train_2[train_index]
X_test_folds = X_train[test_index]
y_test_folds = y_train_2[test_index]
clone_clf.fit(X_train_folds, y_train_folds)
y_pred=clone_clf.predict(X_test_folds)
n_correct = sum(y_pred == y_test_folds)
print(n_correct/len(y_pred))
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_2, cv=3, scoring='accuracy')
Does this high accuracy tell us anything?
Is the sample space we are looking at uniform enough for this accuracy?
Maybe we have way less onedigit samples for training in the first place.
_count=0.
for i in range(len(y_train)):
if y_train[i] == 2:
_count=_count+1.
print(_count/len(y_train)*100)
So ~9% of the sample are actually 2. So even if we guess ALWAYS that image is not 2 we will be right 90% of the time!
from sklearn.base import BaseEstimator
class Never2(BaseEstimator):
def fit(self, X, y=None):
pass
def predict(self, X):
return(np.zeros((len(X),1),dtype=bool))
never2 = Never2()
cross_val_score(never2,X_train,y_train_2,cv=3,scoring='accuracy')
This shows our data is skewed!
2. Confusion Matrix
General idea is to count the number of times instances of Class A are classified as Class B.
Table that describes the performance of a classification model by grouping predictions into 4 categories.
 True Positives: we correctly predicted they do have diabetes
 True Negatives: we correctly predicted they don’t have diabetes
 False Positives: we incorrectly predicted they do have diabetes (Type I error)
 False Negatives: we incorrectly predicted they don’t have diabetes (Type II error)
The ROWS in the matrix are the real classlabels i.e. the TRUTH values while COLUMNS are the predicted values.
Actual class (observation) 

Predicted class (expectation) 
tp (true positive) Correct result 
fp (false positive) Unexpected result 
fn (false negative) Missing result 
tn (true negative) Correct absence of result 
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_2, cv=3)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_2, y_train_pred)
Each row in the confusion matrix represent actual class
, while each column represents a predicted class
. Following are the terms of the confusion matrix:
 First row of this matrix is the non2 images 
 (0,0) instances were correctly classified as non 2 (True Negative)
 (0,1) instances were wrongly classified as 2s (False Positive)
 Second row considers the images of 5 
 (1,0) instances were wrongly classified as non 2s (False negatives)
 (1,1) instances were correctly classified as 2s (True positives)
An ideal classifier would be a diagonal matrix with no false positives or false negatives
Precision
Think of this as precision of the model in estimating the binary class. FROM POV of the MODEL
It is the ratio of the total classification whether as True
or Wrongly classified as True
to True
. That is,
$$Precision = \frac{TP}{TP+FP}$$
This is looking at +ve classification and how many are really +ve and how many are wrongly shown as +ve. So Precision looks at the prediction of +ve results.
Recall
Think of this as comparing to the actual data in the model training. FROM POV of the REAL DATA
It is the ratio of total classification on the +ve samples from where they are classified correctly (TP) to wrongly classified as negative (FN).
$$Recall = \frac{TP}{FN+TP}$$
So Recall looks at the prediction of the +ve samples.
This is by just comparing the +ve samples in the binary classification. To check how many of them are correctly recalled as +ve.
F1 score
Harmonic mean of recall and precision. Higher the Precision and Recall, lower are the instances of FP and FN. So we want to have higher Recall and Precision both.
F1 favors classifiers with similar recall and precision.
from sklearn.metrics import precision_score, recall_score, f1_score
print('Precision score: {}'.format(precision_score(y_train_2, y_train_pred)))
print('Recall score: {}'.format(recall_score(y_train_2, y_train_pred)))
print('F1 score: {}'.format(f1_score(y_train_2, y_train_pred)))
Recall/Precision tradeoff
Unfortunately increasing precision reduces recall and viseversa. However sometimes one of the qualities could be desirable in a model.
Recall looks at lowering the False Negatives
so culling +ve cases. That could be detrimental in catching robberies. So we need classifiers with high recall and we are okay with low Precision
wherein we would get False alarms.
Meanwhile, if we are censoring videos we need high Precision
to ensure unsafe videos categorised as Safe ones. While we could be removing good videos by wrongly classifying them to be Unsafe (low recall).
More discussion here: Link
Decision functions evaluate a decision_score
we can manually set the threshold
for the score to whether that will accpted or rejected for the binary case.
Increasing threshold reduces recall, but increases precision.
Why?
The more Precise
you want to be i.e. more True Positive than False Positives  the higher the threshold for passing the case of accepting the data as a given class. However doing so we are strict in what we define as a ideal class and can neglect samples which are positive but are not closest to ideal. Hence we do incorrectly mark them as Negative thus increasing the case of False Negaitives and hence lowering Recall
.
y_scores = cross_val_predict(sgd_clf, X_train, y_train_2, cv=3,
method="decision_function")
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_2, y_scores)
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:1], "b", label="Precision", linewidth=2)
plt.plot(thresholds, recalls[:1], "g", label="Recall", linewidth=2)
plt.xlabel("Threshold", fontsize=10)
plt.legend(loc="best", fontsize=12)
plt.ylim([0, 1])
plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.xlim([700000, 700000])
plt.show()
plt.figure(figsize=(8, 4))
plt.plot(recalls[:1],precisions[:1], "b", label="Precision", linewidth=2)
plt.ylabel("Precision", fontsize=16)
plt.xlabel("Recall", fontsize=16)
If someone says let's reach 99% PRECISION, we must ALWAYS ask at what RECALL?
y_scores = sgd_clf.decision_function([digit_image])
print(y_scores)
y_pred_thresh = sgd_clf.predict([digit_image])
print(y_pred_thresh)
#Setting threshold higher than the y_score
threshold = y_scores + 1.0
y_pred_thresh = (y_scores > threshold)
print(y_pred_thresh)
y_scores = cross_val_predict(sgd_clf, X_train, y_train_2, cv=3,
method="decision_function")
y_train_pred_90 = (y_scores > 200000)
print('Precision score: {}'.format(precision_score(y_train_2, y_train_pred_90)))
print('Recall score: {}'.format(recall_score(y_train_2, y_train_pred_90)))
print('F1 score: {}'.format(f1_score(y_train_2, y_train_pred_90)))
We have made classifier with an arbitrary Precision score: 97% However doing so we reduced the Recall.
The ROC curve
Another common tool used for binary classifiers apart from Precision/Recall. Instead of plotting precision vs recall we plot True Positive Rate
(TPR) i.e. Recall
against False Positive Rate
(FPR). FPR is the ratio of negative instances that are incorrectly classified as positive.
ROC plots sensitivity vs 1specificty
from sklearn.metrics import roc_curve
#Decision scores for all instnces in the training set 
y_scores = cross_val_predict(sgd_clf, X_train, y_train_2, cv=3,
method="decision_function")
fpr, tpr, thresholds = roc_curve(y_train_2, y_scores)
def plot_roc_curve(fpr, tpr, label=None):
plt.plot(fpr, tpr, linewidth=2, label=label)
plt.plot([0, 1], [0, 1], 'k')
plt.axis([0, 1, 0, 1])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
plt.show()
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_2, y_scores)
PR curve when we care of precision  getting False +ve and not so much of getting False ve. We are okay with losing some +ve cases but for sure do not want to neglect any ve ones.
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(n_estimators=10, random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_2, cv=3,
method="predict_proba")
y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_2, y_scores_forest)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, "b:", linewidth=2, label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right", fontsize=16)
plt.show()
Multiclass classification
Multiclass classifiers are able to label and distinguish between more than two classes. Some algorithms such as Random Forest and Näive Bayes are capable of handling this directly. Having said that, Naive Baye's has shortcomming of considering class conditional independence and having discrete entries in the input.

OvA (Oneversusall classifiers): Herein, we would train n binary classifiers for n type of labels and see which nth classifier has highest decision score.

OvO (Oneversusone strategy): Binary classifier for every pair. So for n labels we will have n(n1)/2 classifiers.
Error analysis
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
index_shuffle = np.random.permutation(60000)
X_train, y_train = X_train[index_shuffle], y_train[index_shuffle]
sgd_clf = SGDClassifier(max_iter=5, tol=np.infty, random_state=42)
sgd_clf.fit(X_train, y_train)
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)
y_scores = cross_val_predict(sgd_clf, X_train, y_train, cv=3,
method="decision_function")
conf_mx = confusion_matrix(y_train, y_train_pred)
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
#Diagonals are filled to be zero to concentrate only at the errors
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
ROWS
in the confusion matrix are the REAL labels. COLUMNS
in the confusion matrix are the PREDICTED values. It can seen that in the case of row 3 and column 5:
 5 is most of the times confused with 3 and 8
 9 is confused with 4 and 7
def plot_digits(instances, images_per_row=10, **options):
size = 28
images_per_row = min(len(instances), images_per_row)
images = [instance.reshape(size,size) for instance in instances]
n_rows = (len(instances)  1) // images_per_row + 1
row_images = []
n_empty = n_rows * images_per_row  len(instances)
images.append(np.zeros((size, size * n_empty)))
for row in range(n_rows):
rimages = images[row * images_per_row : (row + 1) * images_per_row]
row_images.append(np.concatenate(rimages, axis=1))
image = np.concatenate(row_images, axis=0)
plt.imshow(image, cmap = cm.binary, **options)
plt.axis("off")
cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]
plt.figure(figsize=(8,8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
plt.show()
Given above are two sets of '3' and '5'  the boxes to the left are 3 and 5 classified as 3. Top left are the images of 3 classified as 3 while Bottom left are the images of 5 classified as 3. It can seen that some imags of 5 quite poor and the algorithm (which is linear in this case) will have difficulty predicting it.
Creating new y_label for model
is_odd = (y % 2 == 0).astype(int)
is_greater_7 = (y > 7).astype(int)
y_multilabel = np.c_[is_odd, is_greater_7]
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
X_train, X_test, y_ml_train, y_ml_test = X[:60000], X[60000:], y_multilabel[:60000], y_multilabel[60000:]
index_shuffle = np.random.permutation(60000)
X_train, y_ml_train = X_train[index_shuffle], y_ml_train[index_shuffle]
knn_clf.fit(X_train, y_ml_train)
knn_clf.predict([X_train[12]])
y_ml_train[12]
y_knn_ml_pred= cross_val_predict(knn_clf, X_train, y_ml_train, cv=3)
# another option is the 'weighted'
f1_score(y_ml_train, y_knn_ml_pred, average='macro')