- Step 1: Formulate the problem
- Step 2: Get the data
- Create a test-set
- Preliminary visualization of the data
- Experimenting with attributes
- Data cleaning and prepping
- Handling Text and Categorical Attribute
- Feature scaling
- Transformation pipelines
- Step 3: Select and train a model
- Step 4: Fine-tune the model
- Step 5: Analyze the Best Models and their Errors
- Step 6: Evaluate the model on the Test Set
This project is adapted from Aurelien Geron's ML book (Github link) . The aim to predict median house values in Californian districts, given a number of features from these districts.
Main steps we will go through:
- Formulate the problem
- Get the data
- Discover and visualize data / Data exploration to gain insight
- Prep data for ML algorithm testing
- Select model and train it
- Fine-tuning the model
Step 1: Formulate the problem
Prediction of district's median housing price given all other metrics. A supervised learning task is where we are given 'labelled' data for training purpose. Regression model to predict a continuous variable i.e. district median housing price. Given multiple features, this is a multi-class regression type problem. Univariate regression since a single output is estimated.
Step 2: Get the data
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
import seaborn as sns
sns.set(style="whitegrid")
sns.color_palette("husl")
%config InlineBackend.figure_format = 'retina'
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
plot_params = {
'font.size' : 30,
'axes.titlesize' : 24,
'axes.labelsize' : 20,
'axes.labelweight' : 'bold',
'lines.linewidth' : 3,
'lines.markersize' : 10,
'xtick.labelsize' : 16,
'ytick.labelsize' : 16,
}
plt.rcParams.update(plot_params)
housing = pd.read_csv('./data/housing.csv')
housing.sample(7)
Each row presents one district. Each of these districts has 10 attributes (features).
housing.info()
One thing to notice in this dataset is the number of total_bedroom entries is different from other entries. This suggests there are some missing entries or null in the dataset.
housing[housing.total_bedrooms.isnull()]
For categorical entries (here, ocean_proximity entries) we can find out the entries and their number using the value_counts(). We can do this for any entry we wish but makes more sense for categorical entries.
housing["ocean_proximity"].value_counts()
housing.describe().round(2)
Describe is powerful subroutine since that allows us to check the stat summary of numerical attributes
The 25%-50%-75% entries for each column show corresponding percentiles. It indicates the value below which a given percentage of observations in a group of observations fall. For example, 25% of observation have median income below 2.56, 50% observations have median income below 3.53, and 75% observations have median income below 4.74. 25% --> 1st Quartile, 50% --> Median, 75% --> 3rd Quartile
housing.hist(bins=50,figsize=(20,20));
Few observations from the Histogram plots, again remember each row is an entry for an ENTIRE district:
NOTICE:From the dataset's source disclaimer: The housing_median_value, housing_median_age, median_income_value are capped at an arbitrary value.
- From latitute and longitude plots there seems to be lots of district in four particular locations (34,37 -- latitude) and (-120,-118 -- longitude). We cannot comment on the exact location but only one on these pairs giving most data.
- We see a tighter distribution for total_rooms, total_bedrooms, and population but spread for house_value and an intresting spike at its end.
- Small spike at the end of median_income plot suggests presence of small group of affluent families but interestingly that spike does not correlate with the spike in the house_value (More high-end property entries than more "rich" people in a district)
Finally, the dataset is tail-heavy that is they extend further to the right from the median which might make modeling using some ML algorithm a bit chanellenging. Few entries should be scaled such that the distribution is more normal.
Create a test-set
This ensures that this is the data on which training, testing occurs and we do not try overfitting to account for all the variance in the data. Typical 20% of data-points are randomly chosen.
def split_train_test(data,test_ratio):
shuffled_indices=np.random.permutation(len(data))
test_set_size=int(len(data)*test_ratio)
test_indices=shuffled_indices[:test_set_size]
train_indices=shuffled_indices[test_set_size:]
return(data.iloc[train_indices],data.iloc[test_indices])
#shuffled indices risking the possibility of the algo seeing the entire dataset!
np.random.seed(42)
#Random seed set to 42 for no particular reason
#but just cause its the answer to the Ultimate Question of Life, The Universe, and Everything
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")
Better way is to have an instance identifier (like id) for each entry to distingusih each entry and see if its sampled or not.
from zlib import crc32
def test_set_check(identifier, test_ratio):
return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
def split_train_test_by_id(data, test_ratio, id_column):
ids = data[id_column]
in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
return data.loc[~in_test_set], data.loc[in_test_set]
The dataset currently doesnt have inherent id. We could use the row index as id. Or we could use an ad-hoc unique identifier as an interim id.
housing_with_id = housing.reset_index() # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
#HOUSING DATA WITH ID AS COMBO OF LAT AND LONG.
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")
#SCIKIT-LEARN IMPLEMENTATION
#from sklearn.model_selection import train_test_split
#train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
test_set.head()
However the sampling we have considered here or the one used in Scikit-learn is random sampling by default. This is fine for large dataset however for smaller dataset it is utmost important that the sampled data is representative of the main population data or else we will introduce sampling bias.
This is an important bias that could be introduced without prior knowledge and could be overlooked at multiple occassion leading to wrong conclusions. To ensure the sampled dataset is representative of the population set we use stratified sampling (pseudo-random sampling). To make the stratified sampling tractable we first divide the main data into multiple 'stratas' based on an variable which we feel is an feature that should be replicated in our test set. The sample is divided into strata and right number of instances are chosen from each strata. We must not have too many stratas and the each strate must have appropriate number of instances.
For the case of property pricing in the district, median_income variable is chosen as the variable whose distribution in the main population and the randomly chosen test sample is same. This attribute is an important attribute to predict the final median housing price. So we can think of converting the continuous variable of median_variable into categorical variable -- that is stratas.
Stratified sampling using median income
housing["median_income"].hist()
From the median_income histogram it is seen that most of the entries are clustered in the range of 2-5 (arbitrary units). We can then use this information to make stratas around these instances. Cut routine in the pandas is used for this purpose. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges.
housing["income_cat"] = pd.cut(housing["median_income"],
bins=[0., 1.5, 3.0, 4.5, 6., np.inf], #bins around 2-5 income bracket
labels=[1, 2, 3, 4, 5])
housing["income_cat"].value_counts()
housing["income_cat"].hist()
Now with the population categorised into various median income groups we can use stratified sampling routine (as implemented in scikit-learn) to make our test-set. As an additional proof let's compare this to a randomly sampled test_case. We will redo the random sampling we did prviously but with the new population with categorised median_income.
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
strat_train_set = housing.loc[train_index]
strat_test_set = housing.loc[test_index]
#Using random sampling
rand_train_set, rand_test_set = train_test_split(housing, test_size=0.2, random_state=42)
Let's check the distribution of the income_cat variable in the strat_test, random_test, and the main sample.
housing["income_cat"].value_counts()/len(housing["income_cat"])
rand_test_set["income_cat"].value_counts()/len(rand_test_set["income_cat"])
strat_test_set["income_cat"].value_counts()/len(strat_test_set["income_cat"])
def income_cat_proportions(data):
return data["income_cat"].value_counts() / len(data)
compare_props = pd.DataFrame({
"Overall": income_cat_proportions(housing),
"Stratified": income_cat_proportions(strat_test_set),
"Random": income_cat_proportions(rand_test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100
compare_props
Now, we can remove the income_cat column
for set_ in (strat_train_set, strat_test_set):
set_.drop("income_cat", axis=1, inplace=True)
housing=strat_train_set.copy()
housing.info()
housing.plot(kind='scatter',x='longitude',y='latitude');
This look's like California however, we cannot infer anything more out of this. Let's play around a little bit more...
- Playing with the alpha value in the plotting routine allows us to see the frequency of THAT datapoint in the plot
housing.plot(kind='scatter',x='longitude',y='latitude',alpha=0.1);
From here, we can see the high density of listings in the Bay area and LA also around Sacramento and Fresco.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
s=housing["population"]/100, label="population", figsize=(8,5),
c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
sharex=False)
plt.legend()
This is more interesting! We have now plotted the data with more information. Each data-point has two additional set of info apart of frequency of occurence. First being the color of the point is the median_house_value entry (option c). The radius of the data-point is the population of that district (option s). It can be seen that the housing prices are very much related to the location. The ones closer to the bay area are more expensive but need not be densely populated.
Looking for simple correlations
In addition to looking at the plot of housing price, we can check for simpler correaltions. Pearson's correlation matrix is something which is in-built in pandas and can be directly used. It checks for correlation between every pair of feature provided in the data-set. It estimates the covariance of the two features and estimates whether the correlation is inverse, direct, or none.
corr_matrix=housing.corr()
corr_matrix.style.background_gradient(cmap='coolwarm').set_precision(2)
corr_matrix['median_house_value'].sort_values(ascending=True)
The correlation matrix suggests the amount of correlation between a pair of variables. When close to 1 it means a strong +ve correlation whereas, -1 means an inverse correlation. Looking at the correlation between median_house_values and other variables, we can see that there's some correlation with median_income (0.68 -- so +ve), and with the latitude (-0.14 -- so an inverse relation).
Another to check this relation is to plot scatter plots for each pair of variables in the dataset. Below we plot this for few potential/interesting variables
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
"housing_median_age"]
scatter_axes = scatter_matrix(housing[attributes], figsize=(12, 8));
#y labels
[plt.setp(item.yaxis.get_label(), 'size', 10) for item in scatter_axes.ravel()];
#x labels
[plt.setp(item.xaxis.get_label(), 'size', 10) for item in scatter_axes.ravel()];
The diagonal entries show the histogram for each variable. We saw this previously for some variables. The most promising variable from this analysis seems to be the median_income
.
housing.plot(kind="scatter", x="median_income", y="median_house_value",
alpha=0.1);
Plotting it shows the stronger correlation with the target variable i.e. median_house_value however we can see horizontal lines (especially at USD 500k, 450k 350k) these could be due to some stratifying done in the dataset implicitly. It would be better to remove those to ensure our model does not spuriously fit for those since they are some of the quirks in the data.
Experimenting with attributes
Before we began proposing models for the data. We can play around with the variables and try different combinations of them to see if we get better trends. Let's look at a few. First, the total_room and/or total_bedroom variable could be changed to average_bedroom_per_house to better for bedrooms rather than looking for total bedroom in that district we would be looking at avg_bedroom per district and similarly we would do it for rooms.
housing['avg_bedroom']=housing['total_bedrooms']/housing['households']
#Average room per house-holds in the district
housing['avg_room']=housing['total_rooms']/housing['households']
#Average bedrooms per rooms in a given district
housing['bedroom_per_room']=housing['total_bedrooms']/housing['total_rooms']
#Average population per household in a given district
housing['population_per_household']=housing['population']/housing['households']
#Average room per population in a given district
housing['room_per_popoulation']=housing['total_rooms']/housing['population']
#Average room per population in a given district
housing['room_per_popoulation']=housing['total_rooms']/housing['population']
corr_matrix=housing.corr()
corr_matrix.style.background_gradient(cmap='coolwarm').set_precision(2)
corr_matrix['median_house_value'].sort_values(ascending=True)
This is interesting! We see that bedroom_per_room
is another potential descriptor with negative corelation moreover we get room_per_population
and avg_room
to be decent new descriptors for the median_house_value
. Not bad for a simple math manipulation to better represent the data. This is a crucial step and where domain knowledge and intuition would come handy.
Data cleaning and prepping
- Separate the predictors and the target values
- Write functions to conduct various data transformations ensuring consistency and ease
- Make sure the data is devoid of any NaN values since that would raise warning and errors. We have three strategies we can implement here: a. Get rid of those points (districts) entirely b. Get rid of whole attribute c. Set missing values to either zero or one of the averages (mean, median, or mode)
In our case, total bedrooms had some missing values.
# Option a:
housing.dropna(subset=["total_bedrooms"])
# Option b:
housing.drop("total_bedrooms", axis=1)
# Option c:
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median, inplace=True) # option 3
Before we do any of this let's first separate the predictor and target_values
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows
sample_incomplete_rows.dropna(subset=["total_bedrooms"]) # option 1
sample_incomplete_rows.drop("total_bedrooms", axis=1) #option 2
median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True) # option 3
sample_incomplete_rows
Scikit-learn imputer class
This is a handy class to take of missing values. First, we create an instance of that class with specifying what is to be replaced and what strategy is used. Before doing so, we need to make srue the entire data-set has ONLY numerical entries and Imputer will evaluate the given average for all the dataset and store it in the statistics_
instance
What the imputer will do is,
- Evaluate an specified type of average.
- For a given numerical data-set look for NaN or Null entires in a given attribute and replace it with the computed avearge for that attribute
try:
from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
from sklearn.preprocessing import Imputer as SimpleImputer
imputer = SimpleImputer(strategy="median") #We define the strategy here
housing_num = housing.drop('ocean_proximity', axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)
imputer.statistics_
housing_num.median().values
We can now use this as a training variables set for our model
X = imputer.transform(housing_num)
We convert the Pandas dataframe entries to a numpy array which is transformed with appropriately replacing the missing entries with median.
print(type(X), type(housing_num))
print(np.shape(X), housing_num.shape)
'''
If we need the data-frame back
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
index=housing.index)
'''
housing_cat = housing[['ocean_proximity']]
type(housing_cat)
Converting the categorical entries to integers
try:
from sklearn.preprocessing import OrdinalEncoder
except ImportError:
from future_encoders import OrdinalEncoder # Scikit-Learn < 0.20
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]
housing["ocean_proximity"].value_counts()
Now, housing_cat_encoded
has converted the categorical entries to purely numerical values for each category
ordinal_encoder.categories_
Now, this is helpful with the string categories getting converted to numerical categories. However, there is still an small issue. The categorical numbering may introduce some bias in the final model. ML algorithms will assume that two nearby values are more similar than two distant values.
In the above case, '<1H OCEAN' and 'INLAND'have category values as 0 and 1 however '<1H OCEAN' is more closer to 'NEAR OCEAN' with category value 4.
To fix this issue, one solution is to create one binary attribute per category. This is called One-hot encoding as ONLY one of the attribute in the vector is 1 (hot) and others are 0 (cold).
Scikit-learn provides a OneHotEncoder
encoder to convert integer categorical values to one-hot vectors.
try:
from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20
from sklearn.preprocessing import OneHotEncoder
except ImportError:
from future_encoders import OneHotEncoder # Scikit-Learn < 0.20
cat_encoder = OneHotEncoder()
#1-Hot encoded vector for the housing training data-set
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
type(housing_cat_1hot)
A sparse array is declared in this case which has the position of the non-zero value and not necessarily the entire numpy matrix. This is helpful in the cases where there are too many categories and also many datapoints. For examples, if we have 4 categories and 1000 datapoints the final one-hot matrix would be 1000x4 size. Most of that would be full of 0s, with only one 1 per row for a particular category.
The housing_cat_1hot
can be converted to numpy array by using the housing_cat_1hot.toarray()
Alternatively, you can set sparse=False
when creating the OneHotEncoder
cat_encoder.categories_
Let's create a custom transformer to add extra attributes:
housing.columns
from sklearn.base import BaseEstimator, TransformerMixin
# get the right column indices: safer than hard-coding indices
rooms_ix, bedrooms_ix, population_ix, household_ix = [
list(housing.columns).index(col) for col in ("total_rooms", "total_bedrooms", "population", "households")]
#Here we convert the housing.columns to list and
#then find the index for the entry which matches the string in the loop
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
def __init__(self, add_bedrooms_per_room = True): # no *args or **kwargs
self.add_bedrooms_per_room = add_bedrooms_per_room
def fit(self, X, y=None):
return self # nothing else to do
def transform(self, X, y=None):
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
population_per_household = X[:, population_ix] / X[:, household_ix]
if self.add_bedrooms_per_room:
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
return np.c_[X, rooms_per_household, population_per_household,
bedrooms_per_room]
else:
return np.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=True)
housing_extra_attribs = attr_adder.transform(housing.values)
From the above class, there's only one hyperparameter in the class. add_bedrooms_per_room
is the only option and is set True by default. Let's check the new feature space by converting it to a Pandas Dataframe
housing_extra_attribs = pd.DataFrame(
housing_extra_attribs,
columns=list(housing.columns)+["bedrooms_per_room","rooms_per_household", "population_per_household"],
index=housing.index)
housing_extra_attribs.head()
Feature scaling
Feaature scaling is an important transformation needed to be applied to the data. With some exceptions, ML algorithms dont perform well when the input numerical entries have very different scales. Eg: One variable has range 0-1 but other variable has range 1-1000. This is the case in our data-base where the total number of rooms range from 6 to 39,320 while the objective variable i.e. median income only ranges from 0-15. Two common ways of scaling:
-
Min-max scaling (also called Normalization) Values are shifted such that they are normalized. They are rescaled in the range of 0-1. We do this by subtracting the min value and dividing by the range in the data \begin{equation} x_{i} = \frac{X_{i}-min}{max-min} \end{equation}
-
Standardization This is when the mean of the dataset is subtracted from each entry so that the data has mean as 0 and then divided by the standard deviation so that the resulting distribution has a unit variance. Unlike min-max scaling, standardization does not bound to a particular range like 0-1. However, standardisation is much less affected by outliers. If a particular values is extremely high or low that could affect the other inputs in the case of min-max scaling. However that effect is reduced in the case of standardization given it does not directly account for the range in the scaling but the mean and variance. \begin{equation} x_{i} = \frac{X_{i}-\mu}{\sigma} \end{equation}
NOTE: It is important that these scaling operations are performed on the training data only and not on the full dataset
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
('imputer', SimpleImputer(strategy="median")), #Fill in missing values using the median of each entry
('attribs_adder', CombinedAttributesAdder(add_bedrooms_per_room=True)), #Add additional columns entrys
('std_scaler', StandardScaler()), #Feature scaling -- using standardisation here
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
housing_num_tr
This Pipeline
constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last step/estimator must be the trasnformers (like feature scaling).
try:
from sklearn.compose import ColumnTransformer
except ImportError:
from future_encoders import ColumnTransformer # Scikit-Learn < 0.20
Now let's join all these components into a big pipeline that will preprocess both the numerical and the categorical features (again, we could use CombinedAttributesAdder()
instead of FunctionTransformer(...)
if we preferred):
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
full_pipeline = ColumnTransformer([
("num", num_pipeline, num_attribs),
("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
housing_prepared.shape
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
trial_input = housing.iloc[:5] #First 5 entries
trial_label = housing_labels.iloc[:5] #First 5 labels corresponding to the entries
prep_trial_input = full_pipeline.transform(trial_input) #Transforming the entries to suit the trained input
print('Predictions:',lin_reg.predict(prep_trial_input))
print('Labels:',list(trial_label))
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
housing_labels.describe()
As seen the RMSE is 68628 which is better than nothing but still it is quite high when the range of the median_house_values
range from 15000 to 500000
from sklearn.metrics import mean_absolute_error
lin_mae = mean_absolute_error(housing_labels, housing_predictions)
lin_mae
Here the model is underfitting the data since the RMSE is so high.
When this happens either the features do not provide enough information to make good predictions or the model is not powerful enough.
Let's try using a more complex model, DecisionTreeRegressor
which is capable of finding non-linear relationships in the data
Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
LOL!
This model is badly overfitting the data! Let's proof this hypothesis using cross-validation schemes. We dont want to touch the test set, JUST WORK WITH THE TRAINING SET. Only touch the test set when the model we are using is good enough. We will use the part of the training set for training and other part for model validation.
Fine-tune the model
Cross-validation
Cross-validation is a method of getting reliable estimate of model performance using only the training data 10-fold cross-validation --- breaking training data in 10 equal parts creating 10 miniature test/train splits. Out of the 10 folds, train data on 9 and test on 10th. Do this 10 times each time holding out different fold.
Scikit-learn cross-validation feature expects a utility function (greater the better) rather than a cost function (lower the better), so to score the functions we use opposite of MSE, which is why we again compute
-scores
before calculating the square root
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
def display_scores(scores):
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard deviation:", scores.std())
display_scores(tree_rmse_scores)
Cross-validation not only allows us to get an estimate of the performance of the model but also the measure of how precise this estimate is (i.e. standard deviation). The Decision tree has high std-dev. This information could not be obtained with just one validation set. However, caveat is that cross-validation comes at the cost of training the model several times, so it is not always possible.
Let's compute the same score for LinearRegresson
model.
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
Here it can be seen that DecisionTree model performs much worse than the LinearRegression model.
Random Forrest Regressor
Random forrest works by employing many decision trees on random subsets of the features, then averaging out their predictions. Building a model on top of many other models is called Ensemble Learning
and it is often great way to push ML algorithms even further.
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)
Note:we specify
n_estimators=10
to avoid a warning about the fact that the default value is going to change to 100 in Scikit-Learn 0.22.
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
from sklearn.model_selection import cross_val_score
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
Random forest regressor looks better than DecisionTree
and LinearRegression
. The RMSE is still quite high for production quality code and could be due to overfitting. We can try other algorithms before spending time on a particular algorithm tweaking the hyperparameters. The goal is to shortlist 2-3 methods that are promising then fine-tune the model. Before we move ahead we can take a look at one more ML algorithm which is commonly employed for such supervised learning cases: Support Vector Regression
from sklearn.svm import SVR
svm_reg = SVR(kernel="linear")
svm_reg.fit(housing_prepared, housing_labels)
housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_labels, housing_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_rmse
Step 4: Fine-tune the model
Once we settle for an algorithm we can fine-tune them efficiently using some of the in-built scikit-learn routines.
Grid Search
GridSearchCV
is a faster way of tweaking the hyper-parameters for a given algorithm. It needs the hyper-parameters you want to experiment with, what values to try out, and it will evaluate possible combination of hyperparameters values using cross-validation. We can do that step for RandomForrestRegressor
which we found to have lowesst RMSE of the three methods we tried
from sklearn.model_selection import GridSearchCV
param_grid = [
# try 12 (3×4) combinations of hyperparameters
{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
# then try 6 (2×3) combinations with bootstrap set as False
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
The param_grid
tells Scikit-learn to:
- First evaluate 3x4 combinations of n_estimators and max_features with bootstrap
True
which is the default. - Then with bootstrap set as
False
we look for 2x3 combinations of n_estimators and max_featurs for the random forest - Finally, both these models are trained 5 times for the cross validation purposes in a 5-fold cross-validation fashion.
Total of (12+6)x5=90 round of training are conducted. Finally when it is done we get the best model parameters which give lowest RMSE.
grid_search.best_params_
grid_search.best_estimator_
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(np.sqrt(-mean_score), params)
pd.DataFrame(grid_search.cv_results_)
Randomised search
Grid search approach is fine when we are exploring relatively few combinations. But when hyperparameter space is large it is often preferrable to use RandomizedSearchCV
instead. Here instead of doing all the possible combinationes of hyperparameters, it evaluates a given number of random combinations by selecting a random value for each hyper parameter at every iteration.
Ensemble search
Combine models that perform best. The group or 'ensemble' will often perform better than the best individual model just like RandomForest peforms better than Decision Trees especially if we have individual models make different types of errors.
Step 5: Analyze the Best Models and their Errors
RandomForestRegressor
can indicate the relative importance of each attribute for making the accurate predictions.
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
fig, ax = plt.subplots(1,1, figsize=(8,8))
ax.scatter(y_test, final_predictions, s=100)
lims = [np.min([ax.get_xlim(), ax.get_ylim()]), # min of both axes
np.max([ax.get_xlim(), ax.get_ylim()]), # max of both axes
]
ax.plot(lims, lims, 'k--', linewidth=2.0, alpha=0.75, zorder=0)
ax.set_aspect('equal')
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.set_xlabel('ML Prediction')
ax.set_ylabel('Actual Value')
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print(final_rmse)
We can compute a 95% confidence interval for the test RMSE:
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
mean = squared_errors.mean()
m = len(squared_errors)
np.sqrt(stats.t.interval(confidence, m - 1,
loc=np.mean(squared_errors),
scale=stats.sem(squared_errors)))
Alternatively, we could use a z-scores rather than t-scores:
zscore = stats.norm.ppf((1 + confidence) / 2)
zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)