Title: Unbiased Variable Importance for Random Forests
Description: Computes a novel variable importance for random forests: impurity-reduction importance scores for out-of-bag (OOB) data that complement the existing inbag Gini importance; see also <doi:10.1080/03610926.2020.1764042>. The Gini impurities for inbag and OOB data are combined in three different ways, after which the information gain is computed at each split. This gain is aggregated for each split variable in a tree and averaged across trees.
Authors: Markus Loecher <[email protected]>
Maintainer: Markus Loecher <[email protected]>
License: GPL (>= 2)
Version: 1.0.3
Built: 2025-02-21 04:15:41 UTC
Source: https://github.com/cran/rfVarImpOOB
Accuracy: the proportion of correct labels.

Usage:
  Accuracy(y, yHat, dig = 8)

Arguments:
  y     vector of categorical/nominal values
  yHat  prediction/estimate
  dig   number of digits to round to

Value: the accuracy, defined as the proportion of values equal to the majority

Author: Markus Loecher <[email protected]>

Examples:
  Accuracy(c(rep(0,9), 1), 1)
  Accuracy(c(rep(0,9), 1), 0)
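As a concrete illustration of the definition above, here is a minimal sketch of the computation; `acc_sketch` is a hypothetical stand-in for illustration, not the package's `Accuracy()` itself:

```r
# Proportion of entries of y that equal the prediction yHat,
# rounded to `dig` digits (mirroring the documented `dig` argument).
acc_sketch <- function(y, yHat, dig = 8) {
  round(mean(y == yHat), dig)
}

acc_sketch(c(rep(0, 9), 1), 1)  # 0.1: only one of ten labels equals 1
acc_sketch(c(rep(0, 9), 1), 0)  # 0.9
```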
RNA editing is the process whereby RNA is modified from the sequence of the corresponding DNA template [1]. For instance, cytidine-to-uridine conversion (abbreviated C-to-U conversion) is common in plant mitochondria. The mechanisms of this conversion remain largely unknown, although the role of neighboring nucleotides is emphasized. Cummings and Myers [1] suggest using information from sequence regions flanking the sites of interest to predict editing in Arabidopsis thaliana, Brassica napus, and Oryza sativa based on random forests. The Arabidopsis thaliana data of [1] can be loaded from the journal Web site.
For each of the 876 observations, the data set gives
the response at the site of interest (binary: edited/not edited) and, as potential predictor variables, the 40 nucleotides at positions -20 to 20 relative to the edited site (4 categories), cp: the codon position (4 categories), fe: the estimated folding energy (continuous), and dfe: the difference in estimated folding energy between pre-edited and edited sequences (continuous).
arabidopsis

Usage:
  arabidopsis

Format: data frame with columns
  - the response at the site of interest (binary: edited/not edited)
  - nucleotides at positions -k, relative to the edited site (4 categories)
  - nucleotides at positions k, relative to the edited site (4 categories)
  - cp: the codon position (4 categories)
  - fe: the estimated folding energy (continuous)
  - dfe: the difference in estimated folding energy between pre-edited and edited sequences (continuous)

[1] Cummings, Michael P., and Daniel S. Myers. "Simple Statistical Models Predict C-to-U Edited Sites in Plant Mitochondrial RNA." BMC Bioinformatics, 2004.

Examples:
  arabidopsis
gini_index: computes the simple or penalized Gini impurity.

The "penalty" compares the class probabilities pHat with a reference estimate pEst, which would typically serve as a prediction (e.g. in a tree node).

Usage:
  gini_index(pHat, pEst = NULL, k = 2, kind = 1, w = 2)

Arguments:
  pHat  class probabilities from the current data
  pEst  estimated class probabilities (typically from an earlier inbag estimation); pass only if you intend to compute the "validation-penalized Gini"
  k     exponent of the penalty term: abs(pHat - pEst)^k
  kind  kind of penalty
  w     weights; default is 2 if you pass just a single probability instead of the vector (p, 1-p)

Value: the simple or penalized Gini impurity

Author: Markus Loecher <[email protected]>

Examples:
  # Test binary case:
  gini_index(0.5, 0.5, kind = 1)
  gini_index(0.9, 0.1, kind = 1)
  gini_index(0.1, 0.9, kind = 1)
  gini_index(0.5, 0.5, kind = 2)
  gini_index(0.9, 0.1, kind = 2)
  gini_index(0.1, 0.9, kind = 2)
  gini_index(0.5, 0.5, kind = 3)
  gini_index(0.9, 0.1, kind = 3)
  gini_index(0.1, 0.9, kind = 3)
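To make the quantities concrete, the following sketch shows a plain Gini impurity and one plausible penalized variant built from the abs(pHat - pEst)^k term described above. How the package actually combines impurity and penalty depends on `kind` and `w`, so `gini_sketch` and `gini_pen_sketch` are illustrative assumptions, not the package's `gini_index()`:

```r
# Plain Gini impurity for a probability vector pHat: sum of p * (1 - p).
gini_sketch <- function(pHat) {
  if (length(pHat) == 1) pHat <- c(pHat, 1 - pHat)  # expand binary shorthand
  sum(pHat * (1 - pHat))
}

# One plausible penalized variant: plain Gini plus sum of |pHat - pEst|^k.
gini_pen_sketch <- function(pHat, pEst, k = 2) {
  if (length(pHat) == 1) pHat <- c(pHat, 1 - pHat)
  if (length(pEst) == 1) pEst <- c(pEst, 1 - pEst)
  gini_sketch(pHat) + sum(abs(pHat - pEst)^k)
}

gini_sketch(0.5)           # 0.5: maximal binary impurity
gini_pen_sketch(0.9, 0.1)  # Gini 0.18 plus penalty 2 * 0.8^2 = 1.28
```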
gini_process: computes the Gini index.

Usage:
  gini_process(classes, splitvar = NULL)

Arguments:
  classes   vector of factors/categorical variables
  splitvar  split variable

Value: the Gini index

Author: Markus Loecher <[email protected]>

Examples:
  # Test binary case:
  # 50/50 split
  gini_process(c(rep(0,10), rep(1,10)))  # 0.5, correct
  # 10/90 split
  gini_process(c(rep(0,1), rep(1,9)))    # 0.18, correct
  # 0/100 split
  gini_process(factor(c(rep(0,0), rep(1,10)), levels = c(0,1)))  # 0

  # Test 4-class case:
  # 25/25/25/25 split
  gini_process(factor(c(rep(0,5), rep(1,5), rep(2,5), rep(3,5)),
                      levels = 0:3))  # 0.75 = 4*0.25*0.75, correct
  # 10/10/10/70 split
  gini_process(factor(c(rep(0,1), rep(1,1), rep(2,1), rep(3,7)),
                      levels = 0:3))  # 0.48 = 3*0.1*0.9 + 0.7*0.3, correct
  # 0/0/0/100 split
  gini_process(factor(c(rep(0,0), rep(1,0), rep(2,0), rep(3,20)),
                      levels = 0:3))  # 0, correct
GiniImportanceForest: the workhorse function of this package.

Usage:
  GiniImportanceForest(RF, data, ylabel = "Survived", zeroLeaf = TRUE,
                       agg = c("mean", "median", "none")[1],
                       score = c("PMDI21", "MDI", "MDA", "MIA")[1],
                       Predictor = Mode, verbose = 0)

Arguments:
  RF         object returned by a call to randomForest()
  data       data which was used to train the RF. NOTE: assumes the forest was trained with keep.inbag=TRUE
  ylabel     name of the dependent variable
  zeroLeaf   if TRUE, discard the information gain due to splits resulting in n=1
  agg        method of aggregating importance scores across trees; if "none", return the raw arrays (for debugging)
  score      scoring method: PMDI = mean decrease penalized Gini impurity (note: the last digit is the exponent of the penalty), MDI = mean decrease impurity (Gini), MDA = mean decrease accuracy (permutation), MIA = mean increase accuracy
  Predictor  function to estimate the node prediction, such as Mode, mean, or median; alternatively, pass an array of numbers as a replacement for the yHat column of the tree
  verbose    level of verbosity

Value: matrix with variable importance scores and their standard deviations

Author: Markus Loecher <[email protected]>

Examples:
  data("titanic_train", package = "rfVarImpOOB", envir = environment())
  set.seed(123)
  ranRows = sample(nrow(titanic_train), 300)
  data = titanic_train[ranRows, ]
  RF = randomForest::randomForest(formula = Survived ~ Sex + Pclass + PassengerId,
                                  data = data, ntree = 5, importance = TRUE,
                                  mtry = 3, keep.inbag = TRUE, nodesize = 20)
  data$Survived = as.numeric(data$Survived) - 1
  VI_Titanic = GiniImportanceForest(RF, data, ylab = "Survived")
GiniImportanceTree: computes importance scores for an individual tree. These can be based on Gini impurity, accuracy, or log loss.

Usage:
  GiniImportanceTree(bag, RF, k, ylabel = "Survived", returnTree = FALSE,
                     zeroLeaf = TRUE, score = c("PMDI21", "MDI", "MDA", "MIA")[1],
                     Predictor = Mode, verbose = 0)

Arguments:
  bag         data to compute the Gini gain for
  RF          object returned by a call to randomForest()
  k           which tree
  ylabel      name of the dependent variable
  returnTree  if TRUE, returns the tree data frame; otherwise the aggregated Gini importance grouped by split variables
  zeroLeaf    if TRUE, discard the information gain due to splits resulting in n=1
  score       scoring method: PMDI = mean decrease penalized Gini impurity (note: the last digit is the exponent of the penalty), MDI = mean decrease impurity (Gini), MDA = mean decrease accuracy (permutation), MIA = mean increase accuracy
  Predictor   function to estimate the node prediction, such as Mode, mean, or median; alternatively, pass an array of numbers as a replacement for the yHat column of the tree
  verbose     level of verbosity

Value: if returnTree == TRUE, returns the tree data frame; otherwise the aggregated Gini importance grouped by split variables

Author: Markus Loecher <[email protected]>

Examples:
  rfTit = rfTitanic(nRows = 500, nodesize = 10)
  rfTit$data$Survived = as.numeric(rfTit$data$Survived) - 1
  k = 1
  tmp <- InOutBags(rfTit$RF, rfTit$data, k)
  IndivTree = getTree(rfTit$RF, k)
  # plot(as.party(tmp))  # does not work
  InTree = GiniImportanceTree(tmp$inbag, rfTit$RF, k, returnTree = TRUE)
  OutTree = GiniImportanceTree(tmp$outbag, rfTit$RF, k, returnTree = TRUE)
InOutBags: convenience function to mitigate the risk of improperly disentangling train/test data.
NOTE: the original row names (too dangerous for repeated rows) are not kept but instead recorded in a separate column.

Usage:
  InOutBags(RF, data, k, inclRowNames = TRUE, NullRowNames = TRUE, verbose = 0)

Arguments:
  RF            object returned by a call to randomForest()
  data          data which was used to train the RF. NOTE: assumes the forest was trained with keep.inbag=TRUE
  k             tree number
  inclRowNames  create an extra column of original row names
  NullRowNames  if TRUE, set row names to NULL
  verbose       level of verbosity

Value: inbag and outbag subsets of the original data

Author: Markus Loecher <[email protected]>

Examples:
  rfTit = rfTitanic(nRows = 200, nodesize = 10, ntree = 5)
  k = 1
  tmp <- InOutBags(rfTit$RF, rfTit$data, k)
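The underlying bookkeeping can be sketched from the `keep.inbag = TRUE` matrix that randomForest stores: `RF$inbag[i, k]` counts how often row i was sampled for tree k, with 0 meaning out-of-bag. `inout_sketch` below is a simplified illustration of that idea (no row-name column, no safety checks), not the package's `InOutBags()`:

```r
# Recover in-bag (rows repeated per their bootstrap count) and
# out-of-bag (count == 0) subsets for tree k from RF$inbag.
inout_sketch <- function(RF, data, k) {
  counts <- RF$inbag[, k]
  list(inbag  = data[rep(seq_len(nrow(data)), times = counts), , drop = FALSE],
       outbag = data[counts == 0, , drop = FALSE])
}

# Toy stand-in for a fitted forest: row 1 drawn twice, row 2 never, row 3 once.
RF_toy <- list(inbag = matrix(c(2L, 0L, 1L), ncol = 1))
inout_sketch(RF_toy, data.frame(x = 1:3), 1)
```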
lpnorm: computes the Lp norm of a vector.

Usage:
  lpnorm(x, p = 2)

Arguments:
  x  vector to compute the Lp norm of
  p  parameter of the Lp norm

Value: the Lp norm of a vector, or NA

Author: Markus Loecher <[email protected]>

Examples:
  lpnorm(1:10)
  lpnorm(matrix(1:25, 5, 5))
  lpnorm(split(1:25, rep(1:5, each = 5)))
  lpnorm(1:10, 1)
  lpnorm(matrix(1:25, 5, 5), 1)
  lpnorm(split(1:25, rep(1:5, each = 5)), 1)
  lpnorm(rnorm(10), 0)
  lpnorm(matrix(rnorm(25), 5, 5), 0)
  lpnorm(split(rnorm(25), rep(1:5, each = 5)), 0)
  lpnorm(-5:5, Inf)
  lpnorm(matrix(-25:-1, 5, 5), Inf)
  lpnorm(split(-25:-1, rep(1:5, each = 5)), Inf)
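For reference, the standard definition behind these calls is (sum |x_i|^p)^(1/p), with p = Inf giving max(|x|) and p = 0 conventionally counting nonzero entries. A minimal sketch for a plain numeric vector (`lpnorm_sketch` is a hypothetical illustration; the package function also handles matrices and lists, and may return NA for invalid input):

```r
# Lp norm of a numeric vector for the three regimes of p.
lpnorm_sketch <- function(x, p = 2) {
  if (is.infinite(p)) return(max(abs(x)))  # sup norm
  if (p == 0) return(sum(x != 0))          # "L0": count of nonzero entries
  sum(abs(x)^p)^(1 / p)
}

lpnorm_sketch(3:4)        # 5: the 3-4-5 right triangle
lpnorm_sketch(-5:5, Inf)  # 5
```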
mlogloss: computes the log loss for a multiclass problem.

Usage:
  mlogloss(actual, pred_m, eps = 0.001)

Arguments:
  actual  integer vector of true labels, with values ranging from 0 to (number of classes - 1)
  pred_m  predicted probabilities: column 1 => label 0, column 2 => label 1, and so on
  eps     numerical cutoff that keeps predicted probabilities away from 0 and 1, so the log loss stays finite

Author: Markus Loecher <[email protected]>

Examples:
  # require(nnet)
  # set.seed(1)
  # actual = as.integer(iris$Species) - 1
  # fit = nnet(Species ~ ., data = iris, size = 2)
  # pred = predict(fit, iris)  # note: this is a 3-column prediction matrix!
  # mlogloss(actual, pred)  # 0.03967

  # library(titanic)
  # baseline prediction
  # data(titanic_train, package = "titanic")
  yHat = mean(titanic_train$Survived)  # 0.383838
  mlogloss(titanic_train$Survived, yHat)
  # try factors
  titanic_train$Survived = as.factor(titanic_train$Survived)
  mlogloss(titanic_train$Survived, yHat)
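The quantity being computed is the mean of -log(probability assigned to the true class), with probabilities clipped by eps. The sketch below illustrates that definition; `mlogloss_sketch`, including its handling of factors and of a single baseline probability, is an assumption for illustration, not the package implementation:

```r
# Multiclass log loss: mean of -log(p of the true class), with
# probabilities clipped to [eps, 1 - eps] so log() stays finite.
mlogloss_sketch <- function(actual, pred_m, eps = 1e-3) {
  if (is.factor(actual)) actual <- as.integer(actual) - 1L  # levels -> 0, 1, ...
  if (is.null(dim(pred_m))) pred_m <- cbind(1 - pred_m, pred_m)  # binary shorthand
  if (nrow(pred_m) == 1)    # single baseline probability: recycle for every row
    pred_m <- pred_m[rep(1, length(actual)), , drop = FALSE]
  p <- pmin(pmax(pred_m, eps), 1 - eps)
  idx <- cbind(seq_along(actual), actual + 1L)  # (row, column of true class)
  -mean(log(p[idx]))
}

# Two observations, classes 0 and 1, with a 2-column probability matrix:
mlogloss_sketch(c(0, 1), matrix(c(0.8, 0.3, 0.2, 0.7), ncol = 2))
```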
Mode: returns the mode of a vector.

Usage:
  Mode(x)

Arguments:
  x  vector to find the mode of

Author: Markus Loecher <[email protected]>

Examples:
  Mode(rep(letters[1:3], 1:3))
  Mode(c(TRUE, TRUE, FALSE))
  Mode(c(TRUE, TRUE, FALSE, FALSE))
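A common way to compute the mode in base R, shown here as an illustrative sketch (`mode_sketch` is hypothetical; in particular, how the package's `Mode()` breaks ties is not specified above, while this sketch picks the first-occurring value):

```r
# Most frequent value of x; ties go to the value that appears first in unique(x).
mode_sketch <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

mode_sketch(rep(letters[1:3], 1:3))  # "c" (appears three times)
mode_sketch(c(TRUE, TRUE, FALSE))    # TRUE
```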
plotVI: creates barplots for variable importances.

Usage:
  plotVI(VIbench, order_by = "Gini_OOB", decreasing = TRUE)

Arguments:
  VIbench     matrix with importance scores as returned by GiniImportanceForest
  order_by    column to order by
  decreasing  sort direction

Author: Markus Loecher <[email protected]>

Examples:
  data("titanic_train", package = "rfVarImpOOB", envir = environment())
  set.seed(123)
  ranRows = sample(nrow(titanic_train), 300)
  data = titanic_train[ranRows, ]
  RF = randomForest::randomForest(formula = Survived ~ Sex + Pclass + PassengerId,
                                  data = data, ntree = 5, importance = TRUE,
                                  mtry = 3, keep.inbag = TRUE, nodesize = 20)
  data$Survived = as.numeric(data$Survived) - 1
  VI_Titanic = GiniImportanceForest(RF, data, ylab = "Survived")
  plotVI(VI_Titanic, decreasing = TRUE)
plotVI2: creates barplots for variable importances, including permutation scores.

Usage:
  plotVI2(VIbench, decreasing = TRUE, with_MDA = TRUE, ordered_by = "inbag",
          score = "Gini Importance", horizontal = TRUE, fill = "order",
          labelSize = 10, nrow = 3)

Arguments:
  VIbench     matrix with importance scores as returned by GiniImportanceForest
  decreasing  sort direction
  with_MDA    also visualize the mean decrease in accuracy (permutation importance)
  ordered_by  how to order
  score       type of importance score: Gini, MIA, ...
  horizontal  horizontal barplot instead of vertical?
  fill        fill style for barplots; use e.g. shQuote("blue") to pass color strings
  labelSize   size of axis labels
  nrow        number of rows of the plot arrangement

Author: Markus Loecher <[email protected]>

Examples:
  data("titanic_train", package = "rfVarImpOOB", envir = environment())
  set.seed(123)
  ranRows = sample(nrow(titanic_train), 300)
  data = titanic_train[ranRows, ]
  RF = randomForest::randomForest(formula = Survived ~ Sex + Pclass + PassengerId,
                                  data = data, ntree = 5, importance = TRUE,
                                  mtry = 3, keep.inbag = TRUE, nodesize = 20)
  data$Survived = as.numeric(data$Survived) - 1
  VI_Titanic = GiniImportanceForest(RF, data, ylab = "Survived")
  plotVI2(VI_Titanic, decreasing = TRUE)
preorder2: recursive tree traversal; the recursion stops at a leaf, after which the function propagates back up the tree.

Usage:
  preorder2(treeRow, bag, tree, verbose = 0)

Arguments:
  treeRow  current row of the tree data frame to be processed
  bag      the data for the current row
  tree     tree (from randomForest::getTree) to be traversed
  verbose  level of verbosity

Value: tree with row names in column node

Author: Markus Loecher <[email protected]>

Examples:
  data("titanic_train", package = "rfVarImpOOB", envir = environment())
  set.seed(123)
  ranRows = sample(nrow(titanic_train), 300)
  RF = randomForest::randomForest(formula = Survived ~ Sex + Pclass + PassengerId,
                                  data = titanic_train[ranRows, ],
                                  ntree = 5, importance = TRUE, mtry = 3,
                                  keep.inbag = TRUE, nodesize = 1)
  k = 2
  tree = randomForest::getTree(RF, k, labelVar = TRUE)
  tree$node = NA
  attr(tree, "rflib") = "randomForest"
  inbag = rep(rownames(RF$inbag), times = RF$inbag[, k])
  trainBag = titanic_train[ranRows, ][inbag, ]
  tree = preorder2(1, trainBag, tree)
rfTitanic: convenience function to reduce the overhead of repeatedly fitting a random forest to the titanic data.

Usage:
  rfTitanic(formel = Survived ~ Sex + Pclass + PassengerId, nRows = 500,
            ntree = 10, mtry = 3, nodesize = 1)

Arguments:
  formel    formula
  nRows     subsample size
  ntree     number of trees
  mtry      mtry (number of variables tried at each split)
  nodesize  nodesize (minimum size of terminal nodes)

Author: Markus Loecher <[email protected]>

Examples:
  rfTit = rfTitanic(nRows = 500, nodesize = 10)
splitBag: properly splits the data on factor levels.

Usage:
  splitBag(treeRow, bag, tree)

Arguments:
  treeRow  current row of the tree data frame to be processed
  bag      the data for the current row
  tree     tree (from randomForest::getTree)

Value: list with elements left_daughter and right_daughter

Author: Markus Loecher <[email protected]>
titanic_train: Titanic train data.

Usage:
  titanic_train

Format: data frame with columns
  PassengerId  Passenger ID
  Survived     Passenger Survival Indicator
  Pclass       Passenger Class
  Name         Name
  Sex          Sex
  Age          Age
  SibSp        Number of Siblings/Spouses Aboard
  Parch        Number of Parents/Children Aboard
  Ticket       Ticket Number
  Fare         Passenger Fare
  Cabin        Cabin
  Embarked     Port of Embarkation

Source: https://www.kaggle.com/c/titanic/data

Examples:
  titanic_train