A critical assessment of conformal prediction methods applied in binary classification settings

In a recent article (currently on arXiv) I, Damjan Krstajic, have presented my critical assessment of conformal prediction methods applied in binary classification settings. Here I summarise my main points. As I am critical of concepts, not people, I present here only the concepts with which I disagree; references are available in the article.

Background

In conformal prediction, a binary classifier may return one of four prediction regions for a single test sample: {positive}, {negative}, {positive, negative} and {} (the empty set). The last two predictions are usually referred to as 'both' and 'empty'.

Conformal prediction requires a nonconformity measure to be specified. The nonconformity measure is a real-valued function which measures how different a test sample is from the training samples. Given the nonconformity measure, the conformal algorithm produces a prediction region for any specified confidence level N%.
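To make the mechanics concrete, below is a minimal sketch (my own illustration, not taken from the article or any particular paper) of how a split/inductive conformal predictor turns calibration-set nonconformity scores into a prediction region for a single test sample. The function names, the label-conditional (Mondrian) p-value and the significance level epsilon = 1 - N/100 are all assumptions of the sketch.

```python
import numpy as np

def prediction_region(cal_scores, cal_labels, test_score_per_label, epsilon):
    """Return the set of labels whose conformal p-value exceeds epsilon.

    cal_scores           : NumPy array of nonconformity scores of the calibration samples
    cal_labels           : NumPy array of labels (0/1) of the calibration samples
    test_score_per_label : dict {label: nonconformity score of the test sample
                           when it is tentatively assigned that label}
    epsilon              : significance level, i.e. 1 - N/100
    """
    region = set()
    for label, test_score in test_score_per_label.items():
        # Label-conditional (Mondrian) p-value: compare the test score only
        # with calibration scores of the same label.
        scores = cal_scores[cal_labels == label]
        p_value = (np.sum(scores >= test_score) + 1) / (len(scores) + 1)
        if p_value > epsilon:
            region.add(label)
    return region  # {0}, {1}, {0, 1} ('both') or set() ('empty')
```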

Conformal prediction defines a new concept of validity for prediction with confidence. An N% conformal predictor is valid if N% of its prediction regions contain the correct label.
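In code, this notion of validity is simply the fraction of prediction regions that contain the true label, whether the region is a singleton or 'both'; a minimal sketch:

```python
def empirical_validity(regions, true_labels):
    # A region counts as a 'hit' whenever it contains the correct label,
    # so 'both' always counts and 'empty' never does.
    return sum(y in region for region, y in zip(regions, true_labels)) / len(true_labels)
```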

My main points

1. What does it mean in practice that a binary classifier provides a valid N% prediction region? In the binary classification setting it does not mean that N% of the predictions will be correct, but that N% of the predictions will contain the correct label. Therefore, if a conformal prediction model produces valid 86% prediction regions, we could obtain, for example:
- 86% 'both' predictions and 14% 'empty'
- 50% 'both', 36% correct, 10% false and 4% 'empty'
- 15% 'both', 71% correct, 2% false and 12% 'empty'
- 86% correct and 14% false single predictions
I see a substantial difference in the practical value of the above cases, all of which have the same valid 86% prediction regions (a short check of the arithmetic follows below).
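The arithmetic can be checked directly: a prediction region contributes to validity when it is 'both' or a correct singleton, while 'empty' and false singletons never contain the true label, so all four breakdowns give exactly 86%. (The percentages below are the illustrative ones listed above.)

```python
cases = [
    {"both": 0.86, "correct": 0.00, "false": 0.00, "empty": 0.14},
    {"both": 0.50, "correct": 0.36, "false": 0.10, "empty": 0.04},
    {"both": 0.15, "correct": 0.71, "false": 0.02, "empty": 0.12},
    {"both": 0.00, "correct": 0.86, "false": 0.14, "empty": 0.00},
]
for c in cases:
    # validity = 'both' predictions + correct singletons
    print(c["both"] + c["correct"])  # prints 0.86 for every case
```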

If our conformal prediction model predicted 'both' for every sample, i.e. every prediction region were {positive, negative}, we would have 100% validity, but in my view we would have 0% useful predictions. Therefore, I question the practical value of validity as a measure in binary classification settings.

2. In my view, saying that the nonconformity measure is a real-valued function which measures how different a test sample is from the training samples requires additional clarification. My initial understanding was that it measures how the markers/descriptors {X1,..,Xn} of a test sample differ from the markers/descriptors {X1,..,Xn} of the training samples. However, the values of the output variable {Y} are also used in the calculation of the nonconformity measure.

3. Currently in the cheminformatics community, a common practice is to use a random forest model to generate a nonconformity measure. One builds a random forest model with Y as the output variable (e.g. solubility) and {X1,..,Xn} as the input variables (e.g. chemical descriptors) and defines the nonconformity score to be the probability of the prediction of Y from the decision trees in the forest. In other words, we presume that the measure of the difference between a test sample and the training samples is equal to the percentage of correct predictions of Y (e.g. solubility) given by the individual decision trees in the forest.
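A minimal sketch of my reading of this construction, using scikit-learn (illustrative code, not taken from any specific paper; the score for a tentative label is typically taken as one minus the forest's probability estimate for that label):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for chemical descriptors (X) and a binary label such as soluble/insoluble (y)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

def rf_nonconformity(x, label):
    # Forest's class-probability estimate for x; with fully grown trees this is
    # essentially the fraction of trees voting for each class.
    proba = rf.predict_proba(x.reshape(1, -1))[0]
    class_index = int(np.where(rf.classes_ == label)[0][0])
    return 1.0 - proba[class_index]  # fraction of trees voting against the tentative label

print(rf_nonconformity(X[0], y[0]))
```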

I am puzzled how predicting the probability of Y (e.g. the solubility of a compound) can be a real-valued function which measures how different a test sample is from the training samples. I cannot find any arguments to support this, yet there are dozens of papers in leading cheminformatics journals which do exactly that.

How can the predicted probability of Y be a measure of how different a test sample is from the training samples? Isn't it rather a measure of the test sample being positive (e.g. soluble)?

4. In my opinion, there is still an unresolved problem in conformal prediction as to how to deal with 'both' and 'empty' prediction regions. Some authors presume that 'both' predictions may be treated as correct classifications. Thus, if a sample has a positive output value and is predicted as 'both', it is treated as correctly predicted; if it has a negative output value, it is again treated as correctly predicted. This implies that a CP model which predicts 'both' for every sample would achieve 100% correct predictions. In my opinion, this does not make sense.
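To illustrate the accounting I object to (a toy example of my own, not taken from any cited paper):

```python
# A degenerate model that predicts 'both' for every sample scores 100% if
# 'both' regions are counted as correct classifications.
regions = [{0, 1}] * 100            # every prediction region is 'both'
true_labels = [0] * 40 + [1] * 60   # arbitrary ground truth
accuracy_if_both_counts = sum(y in r for r, y in zip(regions, true_labels)) / len(true_labels)
print(accuracy_if_both_counts)      # 1.0
```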

There are others who indirectly presume that 'both' predictions should be treated as correct classifications. They discuss and examine false positives in conformal prediction, where 'both' predictions of positive samples are treated as true positives. I think that such practice is misleading.