The P4 metric[1][2] (also known as FS or Symmetric F[3]) enables performance evaluation of a binary classifier. It is calculated from precision, recall, specificity and NPV (negative predictive value). P4 is designed in a similar way to the F1 metric, but addresses the criticisms leveled against F1; it may be perceived as its extension.
Like the other known metrics, P4 is a function of: TP (true positives), TN (true negatives), FP (false positives), FN (false negatives).
The key concept of P4 is to leverage the four key conditional probabilities:

- $P(+\mid C+)$ — the probability that a sample is positive, provided the classifier result was positive (precision);
- $P(C+\mid +)$ — the probability of a correct classifier result, provided the sample is positive (recall);
- $P(C-\mid -)$ — the probability of a correct classifier result, provided the sample is negative (specificity);
- $P(-\mid C-)$ — the probability that a sample is negative, provided the classifier result was negative (negative predictive value).

The main assumption behind this metric is that a properly designed binary classifier should give results for which all the probabilities mentioned above are close to 1. P4 is designed so that $P4 = 1$ requires all four of them to be equal to 1.
P4 is defined as the harmonic mean of the four key conditional probabilities:

$$P4 = \frac{4}{\dfrac{1}{P(+\mid C+)} + \dfrac{1}{P(C+\mid +)} + \dfrac{1}{P(C-\mid -)} + \dfrac{1}{P(-\mid C-)}}$$
In terms of TP, TN, FP, FN it can be calculated as follows:

$$P4 = \frac{4 \cdot TP \cdot TN}{4 \cdot TP \cdot TN + (TP + TN) \cdot (FP + FN)}$$
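The count-based formula can be sketched in Python as follows; the function names `p4_score` and `p4_harmonic` are hypothetical, not part of any library, and the second form simply re-derives the same value through the harmonic mean of the four conditional probabilities:

```python
def p4_score(tp, tn, fp, fn):
    """P4 = 4*TP*TN / (4*TP*TN + (TP+TN)*(FP+FN))."""
    num = 4 * tp * tn
    den = num + (tp + tn) * (fp + fn)
    return num / den if den else 0.0

def p4_harmonic(tp, tn, fp, fn):
    """Same value via the harmonic mean of the four key conditional probabilities."""
    ppv = tp / (tp + fp)  # P(+|C+), precision
    tpr = tp / (tp + fn)  # P(C+|+), recall
    tnr = tn / (tn + fp)  # P(C-|-), specificity
    npv = tn / (tn + fn)  # P(-|C-), negative predictive value
    return 4 / (1 / ppv + 1 / tpr + 1 / tnr + 1 / npv)
```

Both forms agree whenever all four probabilities are nonzero; a perfect classifier (FP = FN = 0) yields exactly 1.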
Evaluating the performance of a binary classifier is a multidisciplinary concept. It spans from the evaluation of medical and psychiatric tests to machine learning classifiers in a variety of fields. Thus, many metrics are in use under several names, some of them defined independently.
Properties of the P4 metric:

- $P4 \in [0, 1]$;
- $P4 \approx 1$ when all four key conditional probabilities are close to 1;
- $P4 \approx 0$ when at least one of the key conditional probabilities is close to 0.
Dependency table for selected metrics ("true" means the metric depends on the given probability, "false" — it does not):

| Metric | $P(+\mid C+)$ | $P(C+\mid +)$ | $P(C-\mid -)$ | $P(-\mid C-)$ |
|---|---|---|---|---|
| P4 | true | true | true | true |
| F1 | true | true | false | false |
| J | false | true | true | false |
| MK | true | false | false | true |
Metrics that do not depend on a given probability are prone to misrepresentation when it approaches 0.
Let us consider a medical test aimed at detecting a rare disease. The population size is 100,000, of which 0.05% are infected. Test performance: 95% of all positive individuals are classified correctly (TPR = 0.95) and 95% of all negative individuals are classified correctly (TNR = 0.95). In such a case, due to the high population imbalance, despite the high test accuracy (0.95), the probability that an individual classified as positive is in fact positive is very low:
$P(+\mid C+) = 0.0095$
And now we can observe how this low probability is reflected in some of the metrics:
- $P4 = 0.0370$
- $F1 = 0.0188$
- $J = 0.9000$
- $MK = 0.0095$
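These numbers can be reproduced with a short sketch. The fractional counts below are assumptions derived directly from the stated rates, and the results may differ slightly in the last digit from the figures above, which quote the rounded $P(+\mid C+) = 0.0095$:

```python
# Assumed counts for the rare-disease example:
# 100,000 people, 0.05% infected, TPR = TNR = 0.95.
pos, neg = 50.0, 99_950.0            # 50 infected, 99,950 healthy
tp, fn = pos * 0.95, pos * 0.05      # 47.5 true positives, 2.5 false negatives
tn, fp = neg * 0.95, neg * 0.05      # 94,952.5 true negatives, 4,997.5 false positives

ppv = tp / (tp + fp)                 # P(+|C+), about 0.0094
npv = tn / (tn + fn)                 # P(-|C-), nearly 1.0
tpr, tnr = 0.95, 0.95

p4 = 4 / (1 / ppv + 1 / tpr + 1 / tnr + 1 / npv)  # harmonic mean of all four
f1 = 2 / (1 / ppv + 1 / tpr)                      # harmonic mean of PPV, TPR
j = tpr + tnr - 1                                 # Youden's J (informedness)
mk = ppv + npv - 1                                # markedness
```

Note how J, which ignores both predictive values, stays high while P4, F1 and MK all collapse toward 0.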
We are training a neural-network-based image classifier. We consider only two types of images: those containing dogs (labeled as 0) and those containing cats (labeled as 1). Thus, our goal is to distinguish between cats and dogs. The classifier overpredicts in favor of cats (the "positive" samples): 99.99% of cats are classified correctly and only 1% of dogs are classified correctly. The image dataset consists of 100,000 images, 90% of which are pictures of cats and 10% of which are pictures of dogs. In such a situation, the probability that a picture containing a dog will be classified correctly is quite low:

$P(C-\mid -) = 0.01$
The corresponding metric values are:

- $P4 = 0.0388$
- $F1 = 0.9478$
- $J = 0.0099$
- $MK = 0.8183$
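A sketch reproducing these values; the explicit confusion-matrix counts are assumptions derived from the stated class split and rates:

```python
# Assumed counts for the cats-vs-dogs example:
# 100,000 images: 90,000 cats (positive), 10,000 dogs (negative),
# with TPR = 0.9999 and TNR = 0.01.
tp, fn = 89_991, 9        # 99.99% of the 90,000 cats classified correctly
tn, fp = 100, 9_900       # 1% of the 10,000 dogs classified correctly

ppv, tpr = tp / (tp + fp), tp / (tp + fn)
tnr, npv = tn / (tn + fp), tn / (tn + fn)

p4 = 4 / (1 / ppv + 1 / tpr + 1 / tnr + 1 / npv)  # ~0.0388
f1 = 2 / (1 / ppv + 1 / tpr)                      # ~0.9478
j = tpr + tnr - 1                                 # ~0.0099
mk = ppv + npv - 1                                # ~0.8183
```

Here F1 and MK ignore $P(C-\mid -)$ and remain high, J happens to stay low, and only P4 reacts to the near-zero probability regardless of which of the four it is.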