Import Libraries

In [1]:
# Import required libraries
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# Dataset and evaluation metric
from sklearn.datasets import load_boston   # deprecated in scikit-learn 1.0, removed in 1.2
from sklearn.metrics import mean_squared_error

Get Data

In [2]:
# Load data
boston_data = load_boston()
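
Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. If you are on a newer version, the same data can be pulled from the original StatLib source, roughly along the lines of the snippet suggested in the deprecation notice (a hedged sketch; the column names are taken from the DESCR printed further down):

# Sketch for newer scikit-learn versions where load_boston is unavailable
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])   # 13 features
target = raw_df.values[1::2, 2]                                      # MEDV
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
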
In [3]:
# Define features and target
X = boston_data['data']        # Features
y = boston_data['target']      # Labels (median home value, MEDV)
In [4]:
# Collect the features into a DataFrame
boston = pd.DataFrame(boston_data['data'], columns=boston_data['feature_names'])
boston.head()
Out[4]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Explore Data

In [5]:
boston.shape
Out[5]:
(506, 13)
In [6]:
# Check for missing values in the features
boston.isnull().sum()
Out[6]:
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
dtype: int64
In [7]:
# Check for missing values in the labels
pd.DataFrame(y).isnull().sum()
Out[7]:
0    0
dtype: int64
In [8]:
boston.describe()
Out[8]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000
In [9]:
print(boston_data['DESCR'])
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Overview of Scikit-Learn's API

Data Preprocessing

In [10]:
from sklearn.preprocessing import StandardScaler

# create and fit scaler
scaler = StandardScaler()
scaler.fit(X)

# scale data set
Xt = scaler.transform(X)
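
Note that the models below are trained on the raw X; Xt is computed here mainly to demonstrate the transformer API. A quick sanity check (not in the original notebook): every standardized column should have mean ~0 and standard deviation ~1.

print(Xt.mean(axis=0).round(6))   # all ~0
print(Xt.std(axis=0).round(6))    # all ~1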

Model Training

Using Linear Regression

In [11]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge

# Train Linear Regression model
model = LinearRegression()
model.fit(X, y) # training

# Predict label values on X
y_pred = model.predict(X) # predicting

print(y_pred[:10])
print(f'R2 Score: {model.score(X, y):0.4}')
[30.00384338 25.02556238 30.56759672 28.60703649 27.94352423 25.25628446
 23.00180827 19.53598843 11.52363685 18.92026211]
R2 Score: 0.7406

How to improve the R2 score

  • Remove outliers
  • Test for multicollinearity and remove redundant features (see the sketch after this list)
  • Try a different predictor (model)
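
As a rough illustration of the multicollinearity point (an aside, not part of the original notebook), highly correlated feature pairs can be surfaced from the correlation matrix; variance inflation factors (e.g. via statsmodels) are a more formal alternative:

corr = boston.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)     # keep each pair once
pairs = corr.where(mask).stack()
print(pairs[pairs > 0.7].sort_values(ascending=False))   # strongly correlated pairs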

Using Gradient Boosting (Ensemble method)

In [12]:
from sklearn.ensemble import GradientBoostingRegressor

# create model and train/fit
model = GradientBoostingRegressor()
model.fit(X, y)

# predict label values on X
y_pred = model.predict(X)

print(y_pred[:10])
print(f'R2 Score: {model.score(X, y):0.4}')
[25.90772604 21.96320179 33.92712155 34.14528061 35.41267912 26.7925396
 21.48031031 20.87839556 16.95411564 18.45898255]
R2 Score: 0.9761
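
Both R2 scores above are measured on the same data used for training, which flatters flexible models like gradient boosting. A held-out split (an aside, not in the original notebook) gives a less optimistic estimate:

from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
gbr = GradientBoostingRegressor().fit(X_tr, y_tr)
print(f'Held-out R2: {gbr.score(X_te, y_te):0.4}')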

Pipeline and FeatureUnion

Pipeline

  • Predictor is always the last step in a pipeline
  • Pipeline arranges estimators in a series

FeatureUnion

  • FeatureUnion arranges transformers in parallel
  • It concatenates the outputs of each of the transformers to generate one output matrix

In the code below, pca_pipe gives us 4 features and select_k gives us 2. Hence, we train our LinearRegression model on a total of 6 features.

In [13]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import f_regression, SelectKBest

# pca_pipe combines the scaling and dimensionality reduction steps
pca_pipe = Pipeline([('scaler', StandardScaler()), ('dim_red', PCA(n_components=4))])

# union combines pca_pipe with select_k in parallel to give us 6 features
union = FeatureUnion([('pca_pipe', pca_pipe), ('select_k', SelectKBest(f_regression, k=2))])

Training the model using a new pipeline final_pipe

In [14]:
# final_pipe passes the 6 features to the predictor
final_pipe = Pipeline([('union', union), ('regressor', LinearRegression())]) # predictor is the last step
final_pipe.fit(X, y)
Out[14]:
Pipeline(memory=None,
         steps=[('union',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('pca_pipe',
                                                 Pipeline(memory=None,
                                                          steps=[('scaler',
                                                                  StandardScaler(copy=True,
                                                                                 with_mean=True,
                                                                                 with_std=True)),
                                                                 ('dim_red',
                                                                  PCA(copy=True,
                                                                      iterated_power='auto',
                                                                      n_components=4,
                                                                      random_state=None,
                                                                      svd_solver='auto',
                                                                      tol=0.0,
                                                                      whiten=False))],
                                                          verbose=False)),
                                                ('select_k',
                                                 SelectKBest(k=2,
                                                             score_func=<function f_regression at 0x0000013D840D9C10>))],
                              transformer_weights=None, verbose=False)),
                ('regressor',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)
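
Because the predictor is the last step, the fitted pipeline exposes the usual predict() and score() methods. A quick check (not in the original notebook):

print(f'R2 Score: {final_pipe.score(X, y):0.4}')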

Overview of the Pipeline and FeatureUnion objects

Printing outputs of pca_pipe and union

In [15]:
pca_pipe.fit_transform(X)
Out[15]:
array([[-2.09829747,  0.77311275,  0.34294273, -0.89177403],
       [-1.45725167,  0.59198521, -0.69519931, -0.48745853],
       [-2.07459756,  0.5996394 ,  0.1671216 , -0.73920419],
       ...,
       [-0.31236047,  1.15524644, -0.40859759, -0.78630409],
       [-0.27051907,  1.04136158, -0.58545406, -0.67813391],
       [-0.12580322,  0.76197805, -1.294882  , -0.2883292 ]])

As shown above, pca_pipe reduces each instance to 4 features (i.e. dimensions). For the first instance in the dataset, these are -2.09829747, 0.77311275, 0.34294273 and -0.89177403.
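
As an aside (not in the original notebook), we can ask how much of the variance in the scaled features these 4 principal components retain:

pca = pca_pipe.named_steps['dim_red']         # the fitted PCA step
print(pca.explained_variance_ratio_)          # per-component share of variance
print(pca.explained_variance_ratio_.sum())    # total variance retained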

In [16]:
union.fit_transform(X, y)
Out[16]:
array([[-2.09829747,  0.77311275,  0.34294273, -0.89177403,  6.575     ,
         4.98      ],
       [-1.45725167,  0.59198521, -0.69519931, -0.48745853,  6.421     ,
         9.14      ],
       [-2.07459756,  0.5996394 ,  0.1671216 , -0.73920419,  7.185     ,
         4.03      ],
       ...,
       [-0.31236047,  1.15524644, -0.40859759, -0.78630409,  6.976     ,
         5.64      ],
       [-0.27051907,  1.04136158, -0.58545406, -0.67813391,  6.794     ,
         6.48      ],
       [-0.12580322,  0.76197805, -1.294882  , -0.2883292 ,  6.03      ,
         7.88      ]])

Here, we end up with six features: the first four come from pca_pipe, and the remaining two come from select_k.
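
Comparing with the head() output earlier, the last two columns (e.g. 6.575 and 4.98 for the first row) look like RM and LSTAT. We can confirm which features SelectKBest kept (an aside, not in the original notebook); after fitting, the fitted transformers are available in union.transformer_list:

select_k = dict(union.transformer_list)['select_k']
print(boston.columns[select_k.get_support()])   # names of the 2 selected features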

Custom Transformer

Scikit-learn provides a base class called BaseEstimator that all estimators inherit from.

Depending on the estimator type, they also inherit mixin classes such as RegressorMixin (for regressors), ClassifierMixin (for classifiers), and TransformerMixin (for transformers).

We can build custom estimators by inheriting from these classes, which lets them plug into scikit-learn machinery such as Pipeline and GridSearchCV.

Description

In the code sample below, we create three variables X1, X2, and y, where y = X1 + 2*np.sqrt(X2).

If we fit a linear regression model directly to X1 and X2, it will not fit the data perfectly, because y is not a linear function of X2.

However, if we transform X2 to X2 = 2*np.sqrt(X2), we have a perfect linear relationship, y = X1 + X2. Training a linear regression model on the transformed X1, X2, and y then gives a perfect fit (RMSE of essentially 0).

The code below demonstrates how to create a custom transformer to transform X2 to X2 = 2*np.sqrt(X2).

Code

In [17]:
# Creating X1, X2, and y and adding them to a DataFrame df

X1 = np.arange(1,21)
X2 = np.arange(11,31)
y = X1 + 2*np.sqrt(X2)

df = pd.DataFrame({'X1':X1, 'X2':X2, 'y': y}, columns = ['X1', 'X2', 'y'])
print(df.head())
   X1  X2          y
0   1  11   7.633250
1   2  12   8.928203
2   3  13  10.211103
3   4  14  11.483315
4   5  15  12.745967
In [18]:
# Split into training and test subsets (80:20)

train_X = df.iloc[:16, :-1]
test_X = df.iloc[16:, :-1]

train_y = df.iloc[:16, -1]
test_y = df.iloc[16:, -1]
In [19]:
# Training the model

lr1 = LinearRegression()
lr1.fit(train_X, train_y)
pred_y1 = lr1.predict(test_X)

print(f'{pred_y1}')
print(f'RMSE: {np.sqrt(mean_squared_error(test_y, pred_y1))}')
[27.5384685 28.7743983 30.0103281 31.2462579]
RMSE: 0.2240231957964702
In [20]:
# Creating a custom transformer to transform X2 and train the linear regression model

from sklearn.base import BaseEstimator, TransformerMixin

# Build custom transformer SQRTTransformer that inherits BaseEstimator and TransformerMixin
class SQRTTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, feature):
        self.feature = feature
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        Xt = X.copy()
        Xt[self.feature] = 2 * np.sqrt(Xt[self.feature])
        return Xt
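
Before wiring the transformer into a pipeline, we can sanity-check it on its own (an aside, not in the original notebook). TransformerMixin supplies fit_transform() from the fit() and transform() methods defined above:

sqrt_t = SQRTTransformer('X2')
print(sqrt_t.fit_transform(train_X).head())   # the X2 column becomes 2*sqrt(X2)
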
In [21]:
# Training the model again

pipe = Pipeline([('sqrt_trans', SQRTTransformer('X2')), # user defined
                 ('regressor', LinearRegression())]) # from sklearn

pipe.fit(train_X, train_y)
pred_y2 = pipe.predict(test_X)

print(f'{pred_y2}')
print(f'RMSE: {np.sqrt(mean_squared_error(test_y, pred_y2)):.0f}')
[27.39230485 28.58300524 29.77032961 30.95445115]
RMSE: 0
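
Inheriting from BaseEstimator also gives SQRTTransformer working get_params()/set_params() methods, which is what allows Pipeline (and tools such as GridSearchCV) to inspect, clone, and configure the custom step. A small illustration (not in the original notebook):

print(pipe.get_params()['sqrt_trans__feature'])   # -> 'X2'
pipe.set_params(sqrt_trans__feature='X2')         # nested parameter access via '__'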

Custom Regressor

In the code sample below, we build a custom regressor that always predicts the mean of the y values passed to its fit() method.

In [22]:
from sklearn.base import RegressorMixin

# Build regressor
class MeanRegressor(BaseEstimator, RegressorMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y):
        self.y_mean = np.mean(y)
        return self
    
    def predict(self, X):
        return self.y_mean * np.ones(X.shape[0])
In [23]:
# Training a MeanRegressor model

mr = MeanRegressor()

mr.fit(train_X, train_y)
results = mr.predict(test_X)

results
Out[23]:
array([17.0330652, 17.0330652, 17.0330652, 17.0330652])

Let's verify that 17.0330652 is indeed the mean of train_y

In [24]:
train_y.mean()
Out[24]:
17.033065200308354
In [25]:
# Evaluating the performance of mr

print(f"R2 score: {mr.score(train_X, train_y)}")   # score method inherited from base class
R2 score: 0.0

As expected, mr scores an R2 of exactly 0: predicting the mean of y is the baseline against which R2 is measured, and that is all this regressor does.
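
MeanRegressor is essentially a hand-rolled version of scikit-learn's built-in DummyRegressor baseline; a quick comparison (an aside, not in the original notebook):

from sklearn.dummy import DummyRegressor

dummy = DummyRegressor(strategy='mean')
dummy.fit(train_X, train_y)
print(dummy.predict(test_X))           # same constant predictions as mr
print(dummy.score(train_X, train_y))   # 0.0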