In a recent article (currently on arXiv) I, Damjan Krstajic, presented a critical assessment of conformal prediction methods applied in binary classification settings. Here I summarise my main points. As I am critical of concepts, not of people, I present here only the concepts with which I disagree; references are available in the article.

Conformal prediction requires a *nonconformity measure* to be specified.
The nonconformity measure is a real-valued function
which measures how different a test sample is from the training samples. Given the nonconformity
measure, the conformal algorithm produces a prediction region for any specified confidence level N%.

Conformal prediction defines a new concept of *validity* for prediction with confidence. An N%
prediction region is valid if N% of the prediction regions contain the correct label.
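For concreteness, here is a minimal sketch of an inductive (split) conformal predictor on hypothetical one-dimensional data. The nonconformity measure (distance from an assumed class centre) and the data are my own illustrative choices, not from the article:

```python
import random

random.seed(0)

# Toy data: class 0 centred at -1, class 1 centred at +1 (hypothetical).
def make_data(n):
    out = []
    for _ in range(n):
        y = random.randint(0, 1)
        out.append(((1.0 if y else -1.0) + random.gauss(0, 1), y))
    return out

calibration = make_data(200)
test = make_data(500)

# An illustrative nonconformity measure: distance of x from the class centre.
def nonconformity(x, y):
    return abs(x - (1.0 if y else -1.0))

cal_scores = [nonconformity(x, y) for x, y in calibration]

def prediction_region(x, epsilon):
    """Keep every label whose p-value exceeds the significance level epsilon."""
    region = set()
    for y in (0, 1):
        a = nonconformity(x, y)
        p = (sum(s >= a for s in cal_scores) + 1) / (len(cal_scores) + 1)
        if p > epsilon:
            region.add(y)
    return region

epsilon = 0.14  # aiming for 86% validity
hits = sum(y in prediction_region(x, epsilon) for x, y in test)
validity = hits / len(test)
print(f"empirical validity: {validity:.2f}")  # typically close to the target 0.86

# A region may contain both labels, a single label, or no label at all.
sizes = [len(prediction_region(x, epsilon)) for x, _ in test]
print("both:", sizes.count(2), "single:", sizes.count(1), "empty:", sizes.count(0))
```

Note that a prediction region here is a *set* of labels: it can be a singleton, 'both' labels, or 'empty', which is exactly what the discussion below turns on.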

1. What does it mean in practice that a binary classifier provides a valid N%
prediction region?

In the binary classification setting a prediction region can be a single label ({positive} or
{negative}), 'both' labels ({positive, negative}), or 'empty' ({}). Validity does not mean that
N% of the predictions will be correct, but that N% of the prediction regions will contain the
correct label. Therefore, if a conformal prediction model produces a valid 86% prediction region
we could obtain, for example:

- 86% 'both' predictions and 14% 'empty'

- 50% 'both', 36% correct, 10% false and 4% 'empty'

- 15% 'both', 71% correct, 2% false and 12% 'empty'

- 86% correct and 14% false single predictions

I see a substantial difference in the practical value of the above cases, even though they all
have the same valid 86% prediction region.

If our conformal prediction model predicted 'both' for every sample, i.e.
every prediction region were {positive, negative}, we would have 100% validity, but in my view
we would have 0% useful predictions. Therefore, I question the practical value of validity as a measure
in binary classification settings.
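The degenerate case is easy to check in a few lines of Python (the labels here are hypothetical):

```python
# Degenerate conformal predictor: every prediction region is 'both',
# i.e. {positive, negative}, so it trivially contains the true label.
labels = ["positive"] * 60 + ["negative"] * 40   # hypothetical test labels
regions = [{"positive", "negative"} for _ in labels]
validity = sum(y in r for y, r in zip(labels, regions)) / len(labels)
print(validity)  # 1.0 -- perfectly 'valid', yet no prediction is informative
```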

2. In my view, saying that the nonconformity measure is a real-valued function which measures how different a test sample is from training samples requires additional clarification. My initial understanding was that it measures how the markers/descriptors {X1,..,Xn} of a test sample differ from the markers/descriptors {X1,..,Xn} of the training samples. However, in the calculation of the nonconformity measure the value of the output variable {Y} is also used.

3. Currently in the cheminformatics community, a common practice is to use a random forest model to generate a nonconformity measure. One builds a random forest model with Y as the output variable (e.g. solubility) and {X1,..,Xn} as the input variables (e.g. chemical descriptors), and defines the nonconformity score to be the probability for the prediction of Y derived from the decision trees in the forest. In other words, we presume that the measure of the difference between a test sample and the training samples is equal to the percentage of correct predictions of Y (e.g. solubility) given by the individual decision trees in the forest.

I am puzzled how predicting the probability of Y (e.g. the solubility of a compound) can be
*a real-valued function
which measures how different a test sample is from training samples*. I cannot find any argument which would support that, yet there are dozens of papers in leading cheminformatics journals which do exactly this.

How can predicting the probability of Y be a measure of how different a test sample is from the training samples? Isn't it rather a measure of a test sample being positive (e.g. soluble)?
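The worry can be made concrete with a toy sketch (pure Python, hypothetical data, decision stumps standing in for the trees of a random forest): a test sample far from everything in the training set can still receive a nonconformity score of zero, because the vote-based score tracks class membership rather than distance from the training data.

```python
import random

random.seed(0)

# Toy stand-ins for a chemical descriptor x and a binary label y
# (e.g. soluble / insoluble); purely illustrative.
def make_data(n):
    out = []
    for _ in range(n):
        y = random.randint(0, 1)
        out.append(((0.8 if y else -0.8) + random.gauss(0, 1), y))
    return out

train = make_data(200)

# A "forest" of decision stumps, each fit on a bootstrap resample;
# a stump predicts label 1 when x exceeds its threshold
# (the midpoint of the two class means in the resample).
def fit_stump(sample):
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

forest = [fit_stump([random.choice(train) for _ in train]) for _ in range(100)]

def vote_fraction(x, y):
    """Fraction of stumps that predict label y for sample x."""
    return sum((x > t) == (y == 1) for t in forest) / len(forest)

# The practice under discussion: nonconformity of (x, y) taken as
# one minus the fraction of trees voting for y.
def nonconformity(x, y):
    return 1.0 - vote_fraction(x, y)

print(nonconformity(10.0, 1))  # 0.0: x=10 lies far outside the training data,
                               # yet is deemed perfectly 'conforming' for label 1
```

Every stump threshold sits near zero, so a wildly out-of-distribution sample at x = 10 gets a unanimous vote for the positive class and hence a nonconformity score of zero, which illustrates why I read this score as a measure of being positive rather than of being different from the training samples.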

4. There are others who indirectly presume that 'both' predictions should be treated as correct classifications. They discuss and examine false positives in conformal prediction, where 'both' predictions for positive samples are treated as true positives. I think that such a practice is misleading.