Author: Damjan Krstajic
Available on CRAN as the aloom(1) package.
Instead of creating one binary classifier from a training set containing N samples, ALOOM(2)(3)(4) creates N binary classifiers in exactly the same way, i.e. using the same hyper-parameters, but each on (N-1) samples, leaving out one training sample at a time. The model trained on all N samples is here referred to as the original model.
For a single test sample ALOOM thus produces N predicted probabilities, and one may create an ALOOM individual prediction interval(4) for the test sample: (min of the ALOOM probabilities, max of the ALOOM probabilities).
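As an illustration, here is a minimal conceptual sketch of the procedure (not the package internals), assuming a random forest base learner with fixed hyper-parameters, x/y as the training set and newx as the test samples:

library(randomForest)

aloom.sketch <- function(x, y, newx, ntree = 1000) {
  n <- nrow(x)
  # One model per left-out training sample, all with identical hyper-parameters
  probs <- sapply(seq_len(n), function(i) {
    fit <- randomForest(x[-i, , drop = FALSE], y[-i], ntree = ntree)
    predict(fit, newx, type = "prob")[, 2]  # probability of the second class
  })
  probs  # n_test x n matrix: row j holds the n ALOOM probabilities of test sample j
}

# Individual prediction interval of test sample j: range(probs[j, ])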
ALOOM provides a solution for assessing the reliability(5) of a single binary prediction. As shown below, the widths of ALOOM individual prediction intervals vary between test samples. Therefore, the width of the ALOOM individual prediction interval may be used as a measure of the reliability of the original model's single predicted probability.
ALOOM also provides a solution for assessing the decidability(5) of a single binary prediction. If the leave-one-out models do not all agree on the predicted category for the test sample, then we suggest returning NotAvailable.
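A minimal sketch of this rule, assuming probs holds the N leave-one-out probabilities of one test sample, 0.5 is the classification threshold, and the class labels are illustrative:

aloom.decision <- function(probs, labels = c("neg", "pos"), threshold = 0.5) {
  # Models disagree: some predict below and some above the threshold
  if (min(probs) < threshold && max(probs) > threshold) return("NotAvailable")
  if (min(probs) >= threshold) labels[2] else labels[1]
}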
ALOOM is a non-parametric approach in which the binary model and the data define the NotAvailable predictions.
In our experience, ALOOM predictions which are available, i.e. not NotAvailable, have on average higher accuracy than the original model's predictions as a whole. Otherwise, there would be no point in using the ALOOM approach.
An initial simulation study(3) has shown that ALOOM may be very useful in active learning. That is, if one has a binary model and wants to update the training set with new samples, we suggest updating it with the samples that ALOOM currently predicts as NotAvailable, as sketched below.
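A sketch of one such active-learning step, assuming aloom.probs holds the ALOOM probabilities of the current candidate pool (computed as shown later in this document):

# Select the candidates whose ALOOM prediction is NotAvailable as the next batch
is.undecided <- apply(aloom.probs, 1, function(p) min(p) < 0.5 && max(p) > 0.5)
next.batch <- rownames(test.x)[is.undecided]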
ALOOM is not meaningful for model-building algorithms whose output depends on the value of a random seed. It is therefore not suitable for deep learning models. However, for random forests it works fine with a large-ish number of trees, since the seed effect on the predicted probabilities then becomes negligible.
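One way to check this (a sketch, using the train/test objects defined in the next section) is to compare the seed-to-seed variation of the predicted probabilities with the typical ALOOM interval widths:

library(randomForest)

set.seed(1)
f1 <- randomForest(train.x, train.y, ntree = 1000)
set.seed(2)
f2 <- randomForest(train.x, train.y, ntree = 1000)

p1 <- predict(f1, test.x, type = "prob")[, 1]
p2 <- predict(f2, test.x, type = "prob")[, 1]
summary(abs(p1 - p2))  # should be small relative to the ALOOM interval widths below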
ALOOM is a simple idea, but its application is very computer-intensive and thus not really suitable for personal computers.
We use the publicly available mutagenicity dataset from Kazius et al. (2005)(6). It contains 4335 compounds: 2400 categorised as “mutagen” and the remaining 1935 as “nonmutagen”. The dataset is available in the QSARdata(7) R package, and each compound comes with 1579 descriptors. Half of the dataset is used for training and the remaining half for testing.
library(QSARdata)
library(caret)

data(Mutagen)

x <- as.matrix(Mutagen_Dragon)  # 4335 compounds x 1579 Dragon descriptors
y <- Mutagen_Outcome            # outcome factor: mutagen / nonmutagen

set.seed(1)
lvFolds <- createFolds(y, k = 2)  # split the compounds into two halves
train.x <- x[-lvFolds[[1]], ]
train.y <- y[-lvFolds[[1]]]
test.x  <- x[lvFolds[[1]], ]
test.y  <- y[lvFolds[[1]]]
The training set consists of 2168 samples, while the test set has 2167. They are also available here as csv files: train_x.csv, train_y.csv, test_x.csv, test_y.csv.
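A quick check of the split:

dim(train.x)  # 2168 1579
dim(test.x)   # 2167 1579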
NOTE 1: On a machine with 48 AMD Opteron 6168 CPUs and 15 GB RAM, using all 48 CPUs, the following random forest run takes 23 hours.
NOTE 2: On a machine with 24 Intel Xeon 6240 CPUs @ 2.60 GHz and 4 GB RAM, using all 24 CPUs, it takes 8 hours.
library(aloom)
library(parallel)
library(randomForest)

ntree <- 1000
num.cores <- detectCores()

fit <- aloom(train.x, train.y, test.x, method = "rf",
             list(ntree = ntree), mc.cores = num.cores)
NOTE 1: On a machine with 48 AMD Opteron 6168 CPUs and 15 GB RAM, using all 48 CPUs, the following glmnet run takes 4 hours.
NOTE 2: On a machine with 24 Intel Xeon 6240 CPUs @ 2.60 GHz and 4 GB RAM, using all 24 CPUs, it takes 2 hours.
Prior to calling aloom() we execute cv.glmnet() to find the optimal lambda.
library(aloom)
library(parallel)
library(glmnet)

cv.fit <- cv.glmnet(train.x, train.y, family = "binomial", type.measure = "auc")
selected.lambda <- cv.fit$lambda.1se  # largest lambda within one SE of the best AUC
lambda <- cv.fit$lambda               # full lambda path
model.params <- list(lambda = lambda, alpha = 1, selected.lambda = selected.lambda)

num.cores <- detectCores()
fit <- aloom(train.x, train.y, test.x, method = "glmnet",
             model.params, mc.cores = num.cores)
All Leave-One-Out Models, as well as the original model, are created during the execution of aloom(). Their predictions for the test samples are returned as the result.
An aloom object is a list containing the original model's predicted classes (predicted.y), its predicted probabilities (predicted.prob.y), and the matrix of ALOOM predicted probabilities (aloom.probs):
predicted.y <- fit$predicted.y
predicted.prob.y <- fit$predicted.prob.y
aloom.probs <- fit$aloom.probs
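A quick sanity check of the shapes, assuming the 2168/2167 split above (one row of aloom.probs per test sample, one column per leave-one-out model):

length(predicted.y)       # 2167: one predicted class per test sample
length(predicted.prob.y)  # 2167: one predicted probability per test sample
dim(aloom.probs)          # 2167 x 2168: test samples x leave-one-out models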
Calculate the original model's misclassification error, the proportion of NotAvailable ALOOM predictions, and the ALOOM misclassification error on the available predictions.
original.misclassification <- mean(predicted.y != test.y)

# A test sample is NotAvailable when the leave-one-out models disagree at 0.5
find.na <- function(x) min(x) < 0.5 && max(x) > 0.5
predicted.na <- apply(aloom.probs, 1, find.na)
aloom.proportion.na <- mean(predicted.na)

# Misclassification among the available (not NotAvailable) predictions only
aloom.misclassification <- mean(predicted.y[!predicted.na] != test.y[!predicted.na])
| | Original misclassification | NotAvailable (%) | ALOOM misclassification |
|---|---|---|---|
| glmnet | 0.196 | 7.24% | 0.170 |
| randomForest | 0.182 | 15.32% | 0.152 |
min.aloom <- apply(aloom.probs, 1, min)  # lower bound of each test sample's interval
max.aloom <- apply(aloom.probs, 1, max)  # upper bound of each test sample's interval
Calculate the width of every ALOOM individual prediction interval and examine its distribution.
width <- max.aloom - min.aloom
summary(width)
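A histogram gives a quick visual impression of the spread:

hist(width, breaks = 50, xlab = "ALOOM interval width",
     main = "Widths of ALOOM individual prediction intervals")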
The summary statistics of the ALOOM individual prediction interval widths are:
| | Min | Q1 | Median | Mean | Q3 | Max |
|---|---|---|---|---|---|---|
| glmnet | 0 | 0.019 | 0.034 | 0.053 | 0.062 | 0.766 |
| randomForest | 0.009 | 0.085 | 0.104 | 0.109 | 0.119 | 0.571 |
id.with.max.width <- rownames(test.x)[which.max(width)]  # compound with the widest interval
max.width.range <- c(min.aloom[which.max(width)], max.aloom[which.max(width)])
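One may also inspect the full distribution of the leave-one-out probabilities for that compound:

hist(aloom.probs[which.max(width), ], breaks = 50,
     xlab = "ALOOM predicted probability", main = paste("Compound", id.with.max.width))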
| | Compound ID | ALOOM interval min | ALOOM interval max |
|---|---|---|---|
| glmnet | '3224' | 0.135 | 0.901 |
| randomForest | '782' | 0.261 | 0.832 |