At a recent data science meetup in Houston, I spoke with one of the panel speakers, a data scientist at an oil and gas consulting company. His team mostly uses tree-based methods such as random forests, bagging, and boosting, and one of the challenges he said they frequently encounter when presenting their results is putting confidence intervals on variable importance measures.

Oftentimes, industry experts want not only good predictive models but also models that are somewhat interpretable. It reminds me of a very useful graphic from the ISLR book (an excellent book, by the way, for anyone who hasn't read it yet).

From ISLR

As you can see, tree-based methods are highly flexible (and thus have the potential for better predictions) but score low on the interpretability scale.

I went looking for recent papers and techniques for interpreting random forest models. A recent publication by Dr. Hemant Ishwaran addresses exactly this: https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.7803

Below, I'll implement the techniques from this paper using the randomForestSRC package on the ever-famous Ames housing dataset!

Import the required libraries

library(tidyverse)
library(naniar)
library(caret)
library(e1071)
library(ranger)
library(corrplot)

Import the training data (downloaded from Kaggle: https://www.kaggle.com/datasets)

train <- read_csv("/Users/dilsherdhillon/Documents/Website/DilsherDhillon/static/files/train.csv")

The naniar package provides a nice function that can be used to visualize the completeness of the data in one simple line of code

gg_miss_var(train,show_pct = TRUE)

## Other functions that may be useful for your specific problem
#vis_miss(train,show_perc_col = TRUE,warn_large_data = TRUE)
#miss_case_table(train)

The plot shows ~17% of 'LotFrontage' missing; 'Alley', 'PoolQC', 'Fence', and 'MiscFeature' are missing more than 70% of their data, and about 48% of 'FireplaceQu' is missing.

At this point it's fair to say we should remove all of these variables. LotFrontage could reasonably be imputed, but since this tutorial is focused on showing how to put confidence intervals on variable importance for random forest models, we'll work with complete data and not worry too much about imputation - so it gets dropped as well.

For imputing missing data, the mice package in R and its accompanying text are an excellent resource for the interested reader.
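
A minimal sketch of what that could look like (an illustration only, not used anywhere below; I convert characters to factors first and let mice pick its default methods):

## Illustration only - multiple imputation with mice's default methods
## (pmm for numeric, logreg/polyreg for factors)
library(mice)
imp<-train%>%
  mutate_if(is.character,as.factor)%>%
  mice(m=5,seed=123)
train_imputed<-mice::complete(imp,1)  ## first of the 5 completed datasets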

Remove variables with high % of missing data

dta<-train%>%
  select(-c(Alley,FireplaceQu,PoolQC,Fence,MiscFeature,Id,LotFrontage))

Near Zero Variance Predictors

Before we start fitting the random forest model, it's a good idea to run some quality checks on the variables. We know how the saying goes: garbage in, garbage out. One such check is for variables with near-zero variance - they are essentially constant across all samples and would simply add unwanted noise to our model.

The caret package has a nice function that lets us check for variables that may have zero or near zero variance

nzv<-nearZeroVar(dta,saveMetrics = TRUE)
nzv
##                 freqRatio percentUnique zeroVar   nzv
## MSSubClass       1.792642     1.0273973   FALSE FALSE
## MSZoning         5.279817     0.3424658   FALSE FALSE
## LotArea          1.041667    73.4931507   FALSE FALSE
## Street         242.333333     0.1369863   FALSE  TRUE
## LotShape         1.911157     0.2739726   FALSE FALSE
## LandContour     20.809524     0.2739726   FALSE  TRUE
## Utilities     1459.000000     0.1369863   FALSE  TRUE
## LotConfig        4.000000     0.3424658   FALSE FALSE
## LandSlope       21.261538     0.2054795   FALSE  TRUE
## Neighborhood     1.500000     1.7123288   FALSE FALSE
## Condition1      15.555556     0.6164384   FALSE FALSE
## Condition2     240.833333     0.5479452   FALSE  TRUE
## BldgType        10.701754     0.3424658   FALSE FALSE
## HouseStyle       1.631461     0.5479452   FALSE FALSE
## OverallQual      1.061497     0.6849315   FALSE FALSE
## OverallCond      3.257937     0.6164384   FALSE FALSE
## YearBuilt        1.046875     7.6712329   FALSE FALSE
## YearRemodAdd     1.835052     4.1780822   FALSE FALSE
## RoofStyle        3.989510     0.4109589   FALSE FALSE
## RoofMatl       130.363636     0.5479452   FALSE  TRUE
## Exterior1st      2.319820     1.0273973   FALSE FALSE
## Exterior2nd      2.355140     1.0958904   FALSE FALSE
## MasVnrType       1.941573     0.2739726   FALSE FALSE
## MasVnrArea     107.625000    22.3972603   FALSE FALSE
## ExterQual        1.856557     0.2739726   FALSE FALSE
## ExterCond        8.780822     0.3424658   FALSE FALSE
## Foundation       1.020505     0.4109589   FALSE FALSE
## BsmtQual         1.050162     0.2739726   FALSE FALSE
## BsmtCond        20.169231     0.2739726   FALSE  TRUE
## BsmtExposure     4.312217     0.2739726   FALSE FALSE
## BsmtFinType1     1.028708     0.4109589   FALSE FALSE
## BsmtFinSF1      38.916667    43.6301370   FALSE FALSE
## BsmtFinType2    23.259259     0.4109589   FALSE  TRUE
## BsmtFinSF2     258.600000     9.8630137   FALSE  TRUE
## BsmtUnfSF       13.111111    53.4246575   FALSE FALSE
## TotalBsmtSF      1.057143    49.3835616   FALSE FALSE
## Heating         79.333333     0.4109589   FALSE  TRUE
## HeatingQC        1.731308     0.3424658   FALSE FALSE
## CentralAir      14.368421     0.1369863   FALSE FALSE
## Electrical      14.191489     0.3424658   FALSE FALSE
## 1stFlrSF         1.562500    51.5753425   FALSE FALSE
## 2ndFlrSF        82.900000    28.5616438   FALSE FALSE
## LowQualFinSF   478.000000     1.6438356   FALSE  TRUE
## GrLivArea        1.571429    58.9726027   FALSE FALSE
## BsmtFullBath     1.455782     0.2739726   FALSE FALSE
## BsmtHalfBath    17.225000     0.2054795   FALSE FALSE
## FullBath         1.181538     0.2739726   FALSE FALSE
## HalfBath         1.706542     0.2054795   FALSE FALSE
## BedroomAbvGr     2.245810     0.5479452   FALSE FALSE
## KitchenAbvGr    21.415385     0.2739726   FALSE  TRUE
## KitchenQual      1.254266     0.2739726   FALSE FALSE
## TotRmsAbvGrd     1.221884     0.8219178   FALSE FALSE
## Functional      40.000000     0.4794521   FALSE  TRUE
## Fireplaces       1.061538     0.2739726   FALSE FALSE
## GarageType       2.248062     0.4109589   FALSE FALSE
## GarageYrBlt      1.101695     6.6438356   FALSE FALSE
## GarageFinish     1.433649     0.2054795   FALSE FALSE
## GarageCars       2.233062     0.3424658   FALSE FALSE
## GarageArea       1.653061    30.2054795   FALSE FALSE
## GarageQual      27.312500     0.3424658   FALSE  TRUE
## GarageCond      37.885714     0.3424658   FALSE  TRUE
## PavedDrive      14.888889     0.2054795   FALSE FALSE
## WoodDeckSF      20.026316    18.7671233   FALSE FALSE
## OpenPorchSF     22.620690    13.8356164   FALSE FALSE
## EnclosedPorch   83.466667     8.2191781   FALSE  TRUE
## 3SsnPorch      478.666667     1.3698630   FALSE  TRUE
## ScreenPorch    224.000000     5.2054795   FALSE  TRUE
## PoolArea      1453.000000     0.5479452   FALSE  TRUE
## MiscVal        128.000000     1.4383562   FALSE  TRUE
## MoSold           1.081197     0.8219178   FALSE FALSE
## YrSold           1.027356     0.3424658   FALSE FALSE
## SaleType        10.385246     0.6164384   FALSE FALSE
## SaleCondition    9.584000     0.4109589   FALSE FALSE
## SalePrice        1.176471    45.4109589   FALSE FALSE
dim(nzv)
## [1] 74  4

There seem to be several predictors with very low variance. We use the default thresholds here (freqCut = 95/5 and uniqueCut = 10) to decide whether a predictor is classified as near zero-variance or not.

nzv<-nearZeroVar(dta) ## don't save metrics here since we need the index of the near zero variance variables
dta_v2<-dta[,-nzv]
dim(dta_v2)
## [1] 1460   54

20 variables with near-zero variance were removed, leaving 54 of the original 74 columns.

Another quick check we should perform is for multicollinearity. Let's look at the correlations between the numeric variables to avoid multicollinearity issues.

dta_v2%>%
  select_if(.,is.integer)%>%
  cor(.,use="pairwise.complete.obs",method="spearman")%>%
    corrplot(., type = "upper",method="circle",title="Spearman  Correlation",mar=c(0,0,1,0),number.cex = .2)

Some very interesting things come out of the plot above. The variable 'YearBuilt' is very highly correlated with 'GarageYrBlt', which makes sense: the garage was most likely built at the same time as the house. Similarly, 'GarageCars' is highly correlated with 'GarageArea', which again makes sense - the larger the garage, the more cars it can hold.

Encouragingly, several of the variables are moderately to highly correlated with the sale price.

We also see some variables that are categorical in nature but are stored as integers - these should really be converted to factors. For example, MoSold is the month of the year in which the house was sold, running from 1 through 12. In addition, the character vectors need to be converted to factors as well to be used in the random forest model.

dta_v2<-dta_v2%>%
  mutate_at(.,vars(MSSubClass,MoSold,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,YrSold),as.factor)%>%
  map_if(is.character,as.factor)%>%
  as_tibble()

Which predictors have pairwise correlations above 0.80? (findCorrelation returns the column indices of the predictors it flags for removal at the given cutoff.)

dta_v2%>%
  select_if(.,is.integer)%>%
  cor(.,method="spearman",use="pairwise.complete.obs")%>%
  findCorrelation(.,cutoff = 0.80)
## [1] 21 12 17  4  9

How many have correlations above 0.90?

dta_v2%>%
  select_if(.,is.integer)%>%
  cor(.,method="spearman",use="pairwise.complete.obs")%>%
  findCorrelation(.,cutoff = 0.90)
## integer(0)

No predictors have pairwise correlations above 0.90, so we keep all of them and let the random forest handle the remaining collinearity.

A random forest model trained through caret needs complete data. For methods that can handle missing data natively, you can pass the argument 'na.action = na.pass' to the train function (check the caret documentation for which models can work with missing data).
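
As an illustration only (this sketch is not run in this post, and it assumes the chosen method can tolerate missing predictor values - rpart, for instance, handles them with surrogate splits):

## Illustration only: pass NAs through to a model that handles them itself
## (rpart uses surrogate splits) instead of dropping or imputing rows first
rpart_na<-train(log(SalePrice)~.,method="rpart",data=dta_v2,na.action=na.pass)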

But for our purposes, we drop all cases with any missing values

dta_rf<-dta_v2%>%
  drop_na()
dim(dta_rf)
## [1] 1339   54

We’re down to 1339 samples from 1460 originally

Let's fit a vanilla random forest model. By default, caret runs the specified model over 25 bootstrap samples across 3 values of the mtry tuning parameter.

rf_vanilla<-train(log(SalePrice)~.,method="ranger",data=dta_rf)
rf_vanilla
## Random Forest 
## 
## 1339 samples
##   53 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1339, 1339, 1339, 1339, 1339, 1339, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE       Rsquared   MAE       
##     2   variance    0.2107885  0.7992672  0.14917042
##     2   extratrees  0.2282024  0.7586043  0.16510928
##   104   variance    0.1366434  0.8714597  0.09229815
##   104   extratrees  0.1432452  0.8619109  0.09671694
##   207   variance    0.1435587  0.8556214  0.09824293
##   207   extratrees  0.1430987  0.8600771  0.09689750
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 104, splitrule =
##  variance and min.node.size = 5.

The above was a 'vanilla' model; let's fine-tune it by trying out different values of the hyperparameters.

tuning<-expand.grid(mtry = c(90:110),
                      splitrule = c("variance"),
                      min.node.size = c(3:8))
fit_control<-trainControl(method = "oob",number = 5) ## out of bag estimation for computational efficiency 
rf_upgrade<-train(log(SalePrice)~.,method="ranger",data=dta_rf,trControl=fit_control,tuneGrid=tuning)
plot(rf_upgrade)

The final values used for the model were mtry = 93, splitrule = variance and min.node.size = 3.

We will use the optimal parameters to fit a random forest model with the randomForestSRC package, assess variable importance, and put confidence intervals on the importance estimates.

library(randomForestSRC)
## 
##  randomForestSRC 2.8.0 
##  
##  Type rfsrc.news() to see new features, changes, and bug fixes. 
## 
## 
## Attaching package: 'randomForestSRC'
## The following objects are masked from 'package:e1071':
## 
##     impute, tune
## The following object is masked from 'package:purrr':
## 
##     partial
dta_v3<-dta_rf%>%
 mutate(SalePrice=log(SalePrice))

rf_ciMod<-rfsrc(SalePrice ~ .,data=as.data.frame(dta_v3),mtry = 93,nodesize = 3)
## Print the variable importance 
#var_imp<-vimp(rf_ciMod)
#print(vimp(rf_ciMod))

Variable Importance

The bootstrap is a popular method for estimating the variance of an estimator, so why not use the bootstrap to estimate the standard error of VIMP? One problem is that running a bootstrap on a forest is computationally expensive. Another, more serious, problem is that a direct application of the bootstrap will not work for VIMP: RF trees already use bootstrap data, and applying the bootstrap on top of that creates double-bootstrap data that breaks the coherence of being OOB (out of bag).

A solution to this problem is the .164 bootstrap estimator, which is careful to use only OOB data. One problem with the .164 bootstrap estimator, however, is that its OOB data set is smaller than that of a typical OOB estimator: truly OOB data from a double bootstrap can be less than half the size of the OOB data used in a standard VIMP calculation (16.4% versus 36.8%). Thus, in a forest of 1000 trees, the .164 estimator uses about 164 trees on average to calculate VIMP for a case, compared with 368 trees in a standard calculation. This reduces the efficiency of the .164 estimator. Another problem is computational expense: the .164 estimator requires repeatedly fitting RF to bootstrap data, which becomes expensive as n increases.

The paper referenced above offers another solution, called subsampling. The idea rests on calculating VIMP over small iid subsets of the data. Because sampling is without replacement, this avoids the ties in the OOB data that create problems for the bootstrap. Also, because each calculation is fast, the procedure is computationally efficient, especially in big-n settings.

We've already fit a random forest model, rf_ciMod, using the randomForestSRC package - now we calculate the variable importance.
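
The subsampled importance estimates are computed with randomForestSRC's subsample() function, along these lines (B = 100 subsample replicates here, to match the 100 replicates gathered later; as I understand it, setting bootstrap = TRUE instead would give the double-bootstrap .164 estimator):

## Subsampled variable importance, B = 100 subsample replicates
rf_sub<-subsample(rf_ciMod,B=100)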

rf_sub contains point estimates as well as the bootstrap estimates

var_imps<-as.data.frame(rf_sub$vmp)
var_imps$Predictors<-rownames(var_imps)
var_imps%>%
  ggplot(.,aes(y=reorder(Predictors,SalePrice),x=SalePrice))+geom_point()+labs(x="Variable Importance",y="Predictors")

How to extract the bootstrap estimates?

The object rf_sub contains the bootstrap estimates for all predictors in the list vmpS

boots<-rf_sub$vmpS
## bind all the subsample replicates into a single data frame

## Create an empty list, one slot per replicate
boots_2<-vector("list",length(boots))
for (i in seq_along(boots)){
  boots_2[[i]]<-t(as.data.frame(boots[i]))
}

# Bind all the rows into one data frame
boot_df<-do.call(rbind,boots_2)
rownames(boot_df)<-seq_len(nrow(boot_df))
boot_df<-as_tibble(boot_df)

## R prefixes an "X" to column names that start with a digit - rename them back
boot_df<-boot_df%>%
  mutate(`1stFlrSF`=X1stFlrSF,`2ndFlrSF`=X2ndFlrSF)%>%
  select(-c(X1stFlrSF,X2ndFlrSF))

boot_df now contains all the bootstrap estimates in a convenient dataframe

head(boot_df)
## # A tibble: 6 x 53
##   MSSubClass MSZoning LotArea LotShape LotConfig Neighborhood Condition1
##        <dbl>    <dbl>   <dbl>    <dbl>     <dbl>        <dbl>      <dbl>
## 1   0.000286  6.65e-5 0.00133  7.99e-4  -2.21e-5     0.000407 -0.0000697
## 2   0.00226  -1.29e-5 0.00294  2.97e-4  -1.03e-4     0.00192  -0.0000112
## 3   0.000132  4.65e-4 0.00672  1.29e-3   1.55e-4     0.000660  0        
## 4   0.000643  2.19e-6 0.00185 -7.49e-5  -1.15e-5     0.000812  0.000372 
## 5   0.00187   2.33e-5 0.00221 -1.30e-4   2.20e-4     0.00150   0.000137 
## 6   0.00141   8.83e-4 0.00145 -5.57e-5  -6.07e-5     0.000785 -0.0000403
## # … with 46 more variables: BldgType <dbl>, HouseStyle <dbl>,
## #   OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## #   YearRemodAdd <dbl>, RoofStyle <dbl>, Exterior1st <dbl>,
## #   Exterior2nd <dbl>, MasVnrType <dbl>, MasVnrArea <dbl>,
## #   ExterQual <dbl>, ExterCond <dbl>, Foundation <dbl>, BsmtQual <dbl>,
## #   BsmtExposure <dbl>, BsmtFinType1 <dbl>, BsmtFinSF1 <dbl>,
## #   BsmtUnfSF <dbl>, TotalBsmtSF <dbl>, HeatingQC <dbl>, CentralAir <dbl>,
## #   Electrical <dbl>, `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, GrLivArea <dbl>,
## #   BsmtFullBath <dbl>, BsmtHalfBath <dbl>, FullBath <dbl>,
## #   HalfBath <dbl>, BedroomAbvGr <dbl>, KitchenQual <dbl>,
## #   TotRmsAbvGrd <dbl>, Fireplaces <dbl>, GarageType <dbl>,
## #   GarageYrBlt <dbl>, GarageFinish <dbl>, GarageCars <dbl>,
## #   GarageArea <dbl>, PavedDrive <dbl>, WoodDeckSF <dbl>,
## #   OpenPorchSF <dbl>, MoSold <dbl>, YrSold <dbl>, SaleType <dbl>,
## #   SaleCondition <dbl>

Calculate the lower and upper quantiles of the bootstrap estimates for each predictor

cis<-boot_df%>%
  gather(var,value,MSSubClass:SaleCondition)%>%
  group_by(var)%>%
  summarise(lower_ci=quantile(value,0.025,na.rm = TRUE),
            upper_ci=quantile(value,0.975,na.rm=TRUE))



ci_data<-var_imps%>%
  mutate(var=Predictors,importance=SalePrice)%>%
  select(-c(Predictors,SalePrice))%>%
  inner_join(.,cis)
## Joining, by = "var"

Plot the confidence regions

ci_data%>%
  ggplot(.,aes(x=reorder(var,importance),y=importance))+
  geom_point()+
  geom_errorbar(aes(ymin=lower_ci,ymax=upper_ci),colour="black",width=.1)+
  labs(y="Variable Importance",x="Predictors")+
  coord_flip()

And here we have it - 95% confidence intervals for the importance measures for all the predictors.

We note that the confidence intervals for 'OverallQual', 'GrLivArea', and 'YearBuilt' do not contain zero. P-values and the frequentist interpretation of confidence intervals are often misused in both academic and business settings, but when used in an appropriate manner and setting, confidence intervals give the end user a better handle on the model than point estimates of importance alone.

And that’s it, folks!
