- Adapted from Deep Learning online course notes from NYU. Note link
- Paper about using Dropout as a Bayesian Approximation
Another notebook which uses PyTorch dropout: Link
In addition to predicting a value from a model it is also important to know the confidence in that prediction. Dropout is one way of estimating this. After multiple rounds of predictions, the mean and standard deviation in the prediction can be viewed as the prediction value and the corresponding confidence in the prediction. It is important to note that this is different from the error in the prediction. The model may have error in the prediction but could be precise in that value. It is similar to the idea of accuracy vs precision.
When done with dropout -- the weights in the NN are scale by $\frac{1}{1-r}$ to account for dropping of the weights
Type of uncertainties: Aleaotric and Epistemic uncertainty
- Aleatoric uncertainty captures noise inherent in the observations
- Epistemic uncertainty accounts for uncertainty in the model
The ideal way to measure epistemic uncertainty is to train many different models, each time using a different random seed and possibly varying hyperparameters. Then use all of them for each input and see how much the predictions vary. This is very expensive to do, since it involves repeating the whole training process many times. Fortunately, we can approximate the same effect in a less expensive way: by using dropout -- effectively training a huge ensemble of different models all at once. Each training sample is evaluated with a different dropout mask, corresponding to a different random subset of the connections in the full model. Usually we only perform dropout during training and use a single averaged mask for prediction. But instead, let's use dropout for prediction too. We can compute the output for lots of different dropout masks, then see how much the predictions vary. This turns out to give a reasonable estimate of the epistemic uncertainty in the outputs
import torch
from torch import nn, optim
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
%config InlineBackend.figure_format = 'retina'
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
plot_params = {
'font.size' : 22,
'axes.titlesize' : 24,
'axes.labelsize' : 20,
'axes.labelweight' : 'bold',
'lines.linewidth' : 3,
'lines.markersize' : 10,
'xtick.labelsize' : 16,
'ytick.labelsize' : 16,
}
plt.rcParams.update(plot_params)
print(torch.cuda.device_count())
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
m = 40
x = (torch.rand(m) - 0.5) * 20 #Returns a tensor filled with random numbers from a uniform distribution on the interval [0, 1)
y = x * torch.sin(x)
#y = 2 * torch.exp( - torch.sin( (x/2)**2 ))
fig, ax = plt.subplots(1,1, figsize=(5,5))
ax.plot(x.numpy(), y.numpy(), 'o')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.axis('equal');
class MLP(nn.Module):
def __init__(self, hidden_layers=[20, 20], droprate=0.2, activation='relu'):
super(MLP, self).__init__()
self.model = nn.Sequential()
self.model.add_module('input', nn.Linear(1, hidden_layers[0]))
if activation == 'relu':
self.model.add_module('relu0', nn.ReLU())
elif activation == 'tanh':
self.model.add_module('tanh0', nn.Tanh())
for i in range(len(hidden_layers)-1):
self.model.add_module('dropout'+str(i+1), nn.Dropout(p=droprate))
self.model.add_module('hidden'+str(i+1), nn.Linear(hidden_layers[i], hidden_layers[i+1]))
if activation == 'relu':
self.model.add_module('relu'+str(i+1), nn.ReLU())
elif activation == 'tanh':
self.model.add_module('tanh'+str(i+1), nn.Tanh())
self.model.add_module('dropout'+str(i+2), nn.Dropout(p=droprate))
self.model.add_module('final', nn.Linear(hidden_layers[i+1], 1))
def forward(self, x):
return self.model(x)
net = MLP(hidden_layers=[200, 100, 80], droprate=0.1).to(device) #Move model to the GPU
print(net)
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.005, weight_decay=0.00001)
x_dev = x.view(-1, 1).to(device)
for epoch in range(6000):
x_dev = x.view(-1, 1).to(device)
y_dev = y.view(-1, 1).to(device)
y_hat = net(x_dev)
loss = criterion(y_hat, y_dev)
optimizer.zero_grad()
loss.backward()
optimizer.step()
if epoch % 500 == 0:
print('Epoch[{}] - Loss:{}'.format(epoch, loss.item()))
Define a separate continuous vector XX
XX = torch.linspace(-11, 11, 1000)
def predict_reg(model, X, T=10):
'''
Running the model in training mode.
model = torch.model: NN implemented in pytorch
X = torch.tensor: Input vector
T = int: number of samples run
OUT:
Y_hat = sample of predictions from NN model
Y_eval = average prediction value from NN model
'''
model = model.train()
Y_hat = list()
with torch.no_grad():
for t in range(T):
X_out = model(X.view(-1,1).to(device))
Y_hat.append(X_out.cpu().squeeze())
Y_hat = torch.stack(Y_hat)
model = model.eval()
with torch.no_grad():
X_out = model(X.view(-1,1).to(device))
Y_eval = X_out.cpu().squeeze()
return Y_hat, Y_eval
%%time
y_hat, y_eval = predict_reg(net, XX, T=1000)
mean_y_hat = y_hat.mean(axis=0)
std_y_hat = y_hat.std(axis=0)
fig, ax = plt.subplots(1,1, figsize=(10,10))
ax.plot(XX.numpy(), mean_y_hat.numpy(), 'C1', label='prediction')
ax.fill_between(XX.numpy(), (mean_y_hat + std_y_hat).numpy(), (mean_y_hat - std_y_hat).numpy(), alpha=0.5, color='C2', label='confidence')
ax.plot(x.numpy(), y.numpy(), 'oC0', zorder=1, label='ground truth')
ax.plot(XX.numpy(), (XX * torch.sin(XX)).numpy(), 'k--', alpha=0.4, zorder=0, label='original function')
ax.set_title('Plotting the NN predictions and ground truth')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.axis('equal')
plt.legend(loc='best', fontsize=10)
The plot above is the combination of ground truth
points (blue), original function
(grey) and the predicted function
(orange) with corresponding confidence interval at each ML-prediction (green). It is seen that the ML model does well for the region of X in the range of original points. Another interesting observation is the uncertainity of the model prediction, which is seen to increase in the space where training points dont exist e.g. X=(5,10) and X > 10 or X < -10. This shows that the model is highly uncertain outside the domain of training data (extrapolation) but does a decent job within the range of training data (interpolation).