This tutorial discusses five ways to select features for machine learning projects with scikit-learn. The five methods are:
- Variance Inflation Factor (VIF)
- SelectKBest
- Recursive Feature Elimination (RFE)
- Recursive Feature Elimination CV (RFECV)
- SHapley Additive exPlanations (SHAP)
Variance Inflation Factor (VIF)
VIF measures the strength of the correlation between the independent variables of a model. It is computed by taking one independent variable and regressing it against every other independent variable. The $R^2$ score of that regression is then used to compute the VIF score for the variable.
The VIF score of an independent variable represents how well that variable is explained by the other independent variables and is given by the formula:
VIF = \frac{1}{1-R^2}
- VIF starts at 1 and has no upper limit
- VIF = 1 indicates no correlation between the independent variable and the other variables
- VIF exceeding 5 or 10 indicates high multicollinearity between the independent variable and the others
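As a minimal sketch, the VIF of each feature can be computed with statsmodels' `variance_inflation_factor`; the diabetes dataset here is only an illustrative choice:

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Load a sample regression dataset as a DataFrame
X = load_diabetes(as_frame=True).data

# Add an intercept column so each auxiliary regression includes a constant term
X_const = add_constant(X)

# Compute the VIF of every independent variable (index 0 is the constant, so skip it)
vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X_const.values, i + 1)
            for i in range(X.shape[1])],
})
print(vif)
```

Features whose VIF exceeds the chosen threshold (5 or 10) are candidates for removal.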
SelectKBest
SelectKBest keeps the k highest-scoring features according to a univariate statistical test (such as chi-squared or the ANOVA F-test).
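A short example, assuming the iris dataset and the chi-squared score function as illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(X.shape, "->", X_new.shape)          # (150, 4) -> (150, 2)
print(selector.get_support(indices=True))  # indices of the kept features
```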
Recursive Feature Elimination (RFE) and Recursive Feature Elimination CV (RFECV)
RFE and RFECV select features by recursively dropping the least important feature.
First, the estimator is trained on the initial set of features and the importance of each feature is obtained. Then, the least important features are pruned from the current set of features. The procedure is repeated recursively on the pruned set until the desired number of features is reached.
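A minimal RFE sketch; the breast cancer dataset and decision-tree estimator are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Recursively drop the least important feature until 5 remain
rfe = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=5)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of the selected features
print(rfe.ranking_)  # rank 1 marks a selected feature
```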
RFECV performs the same elimination inside a cross-validation loop (hence the CV suffix), which lets it choose the optimal number of features automatically.
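The same idea with RFECV, letting 5-fold cross-validation choose how many features to keep (dataset and estimator again assumed for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = load_breast_cancer(return_X_y=True)

# Let cross-validation pick the number of features to keep
rfecv = RFECV(RandomForestClassifier(random_state=0),
              step=1, cv=5, scoring="accuracy")
rfecv.fit(X, y)

print(rfecv.n_features_)  # number of features chosen by CV
print(rfecv.support_)     # boolean mask of the selected features
```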
SHapley Additive exPlanations (SHAP)
SHAP is based on the classic Shapley values from cooperative game theory, which measure the contribution of each individual player in a coalition to the final payout. Applied to a model, the features are the players and the prediction is the payout, so each feature's SHAP value measures its contribution to an individual prediction.
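A minimal sketch of ranking features by their mean absolute SHAP values, assuming the `shap` package is installed; the diabetes dataset and random-forest model are illustrative choices:

```python
import numpy as np
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes()
X, y = data.data, data.target

model = RandomForestRegressor(random_state=0).fit(X, y)

# Shapley values: one contribution per feature per prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape (n_samples, n_features)

# Rank features by mean absolute SHAP value; low-ranked features are candidates to drop
importance = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(importance)[::-1]
print([data.feature_names[i] for i in ranking])
```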
See the SHAP documentation for further details and examples.