The promise of deep neural networks (DNNs) has had the science community abuzz in recent years. Their reputation seems well founded given the impressive results these networks have achieved in computer vision. Researchers have tracked advances in DNN architectures using image classification, object detection and segmentation datasets as benchmarks, and their findings confirm that DNN-based models deliver on that promise in the lab. In real-world applications, however, their performance has been less impressive, particularly in safety-critical systems.
Trained to excel on a given dataset, these models tend to make overconfident predictions and frequently fail to recognize out-of-distribution inputs. What's more, most frameworks cannot express uncertainty, which is why softmax confidence scores remain the baseline method for judging how reliable network outputs are.
Object detection and uncertainty
Rather than using fixed weights as standard DNNs do, Bayesian neural networks (BNNs), a probabilistic variant of DNNs, inherently express uncertainty by learning a distribution over their weights. There is a caveat, though: this awareness comes at considerable computational cost.
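To make the idea of a distribution over weights concrete, here is a minimal NumPy sketch of a single bias-free linear layer with a factorized Gaussian over its weights. The class `GaussianLinear`, the softplus parameterization of the standard deviation and the helper `predictive_stats` are illustrative assumptions, not any particular BNN library; in practice the means and spreads would be learned, for example by variational inference. Each forward pass draws fresh weights, and the spread of the resulting outputs serves as the uncertainty estimate.

```python
import numpy as np


class GaussianLinear:
    """Hypothetical bias-free linear layer with a factorized Gaussian over its weights.

    mu and rho would normally be learned; here they are simply passed in.
    """

    def __init__(self, mu: np.ndarray, rho: np.ndarray) -> None:
        self.mu = mu    # mean of each weight
        self.rho = rho  # unconstrained parameter; sigma = softplus(rho) > 0

    def sigma(self) -> np.ndarray:
        return np.log1p(np.exp(self.rho))

    def sample_forward(self, x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
        # Reparameterization: w = mu + sigma * eps with eps ~ N(0, 1)
        eps = rng.standard_normal(self.mu.shape)
        w = self.mu + self.sigma() * eps
        return x @ w


def predictive_stats(layer: GaussianLinear, x: np.ndarray, n_samples: int = 200, seed: int = 0):
    """Mean and variance of the layer's outputs over n_samples weight draws."""
    rng = np.random.default_rng(seed)
    outs = np.stack([layer.sample_forward(x, rng) for _ in range(n_samples)])
    return outs.mean(axis=0), outs.var(axis=0)


# Usage: same weight means, narrow vs. wide weight distributions
x = np.ones((1, 3))
confident = GaussianLinear(mu=np.ones((3, 1)), rho=np.full((3, 1), -3.0))
uncertain = GaussianLinear(mu=np.ones((3, 1)), rho=np.full((3, 1), 0.0))
mean_c, var_c = predictive_stats(confident, x)
mean_u, var_u = predictive_stats(uncertain, x)
```

The wider the weight distribution, the larger the predictive variance: `var_u` comes out far larger than `var_c`. Drawing many weight samples per input is also where the computational cost mentioned above comes from.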
Sampling-based methods such as Monte Carlo Dropout and Deep Ensembles approximate BNNs by computing variance over multiple predictions for the same input. This requires multiple forward passes, which increases inference time.
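A minimal sketch of Monte Carlo Dropout on a toy two-layer network, assuming NumPy and random weights standing in for a trained model: dropout stays active at inference, the same input goes through `n_passes` stochastic forward passes, and the per-output variance is the uncertainty estimate. The function names and the default values of `n_passes` and `p_drop` are hypothetical.

```python
import numpy as np


def mlp_forward(x, w1, w2, rng=None, p_drop=0.0):
    """Two-layer ReLU network with optional (inverted) dropout on the hidden layer."""
    h = np.maximum(x @ w1, 0.0)
    if rng is not None and p_drop > 0.0:
        # Keep each hidden unit with probability 1 - p_drop and rescale the survivors
        mask = rng.random(h.shape) >= p_drop
        h = h * mask / (1.0 - p_drop)
    return h @ w2


def mc_dropout_predict(x, w1, w2, n_passes=50, p_drop=0.2, seed=0):
    """Keep dropout active at test time and aggregate n_passes stochastic predictions."""
    rng = np.random.default_rng(seed)
    # Each pass is a full forward pass, so inference cost grows linearly with n_passes
    preds = np.stack([mlp_forward(x, w1, w2, rng, p_drop) for _ in range(n_passes)])
    return preds.mean(axis=0), preds.var(axis=0)


# Usage with random stand-in weights: 4 inputs, 8 features, 2 outputs
rng = np.random.default_rng(42)
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 2))
x = rng.standard_normal((4, 8))
mean, var = mc_dropout_predict(x, w1, w2)
```

Deep Ensembles follow the same aggregation pattern, except that the random dropout masks are replaced by several independently trained networks; either way, inference cost scales with the number of predictions per input.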
Other methods build uncertainty estimation into the model itself, so that a single forward pass suffices. For example, a detector whose loss function incorporates the Kullback-Leibler (KL) divergence can predict variances alongside bounding box coordinates.
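As an illustration of how a KL term can enter the loss, the sketch below assumes a common formulation: each box coordinate is modeled as a Gaussian with predicted mean x_e and variance σ², and the ground truth x_g as a Dirac delta. Up to constants, the KL divergence then reduces to (x_g - x_e)² / (2σ²) + (1/2) log σ². The network predicts α = log σ² for numerical stability, and errors larger than 1 switch to a smooth-L1-style branch. The article does not specify the exact loss its models use, so treat `kl_box_loss` and its arguments as hypothetical.

```python
import numpy as np


def kl_box_loss(pred: np.ndarray, log_var: np.ndarray, target: np.ndarray) -> float:
    """Hypothetical KL-style regression loss, averaged over boxes and coordinates.

    pred, log_var, target: arrays of shape (N, 4) holding predicted coordinates,
    predicted alpha = log(sigma^2), and ground-truth coordinates.
    """
    diff = np.abs(target - pred)
    inv_var = np.exp(-log_var)
    # Squared-error branch for small errors, linear (smooth-L1-style) branch for large ones;
    # both equal 0.5 * inv_var at diff == 1, so the loss is continuous
    quad = 0.5 * inv_var * diff ** 2
    lin = inv_var * (diff - 0.5)
    data_term = np.where(diff <= 1.0, quad, lin)
    # alpha / 2 keeps the network from predicting arbitrarily large variances
    return float(np.mean(data_term + 0.5 * log_var))
```

The 1/σ² weighting lets the network lower the loss on hard boxes by admitting higher uncertainty, while the α/2 term makes that admission cost something, so variances end up tracking actual localization error.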
Confidence versus variance
Uncertainty estimates do more than help spot false positive detections: evidence suggests they are also better indicators than confidence scores of how well predicted boxes match the true objects in an image.
Researchers tested this hypothesis on the KITTI dataset (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago). For both confidence-based and variance-based models, they computed the intersection over union (IoU) of each predicted bounding box with its ground truth. The experiment compared vanilla EfficientDet-D0 models with EfficientDet-D0 models trained using the KL-modified loss function, which enables them to predict variances as well as box coordinates.
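The IoU at the core of this experiment can be computed as in the sketch below, which assumes axis-aligned boxes in continuous (x1, y1, x2, y2) pixel coordinates and pairs each prediction with the ground-truth box it overlaps most. The helper names are illustrative; the study's actual matching procedure is not described here.

```python
from typing import Sequence

Box = Sequence[float]  # (x1, y1, x2, y2), assuming x2 > x1 and y2 > y1


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the intersection rectangle (may be empty)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_iou(pred: Box, ground_truths: Sequence[Box]) -> float:
    """IoU of a prediction with its best-matching ground-truth box (0 if there is none)."""
    return max((iou(pred, gt) for gt in ground_truths), default=0.0)
```

For example, two 2x2 boxes offset by one pixel in each direction share a 1x1 region, giving an IoU of 1/7.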
The results, shown in the figures below, confirm that variances are indeed a better indicator of the spatial accuracy of detections. Their correlation with IoU is also stronger; it is negative because higher variance translates to greater spatial uncertainty. The correlation for confidences is still there, but weaker. Poor estimates are also more frequent for confidences, as scatter plots (a) and (b) show: points form denser clusters at the top-right and top-left corners, respectively, indicating good estimates for high IoU values, but the confidences of the remaining points are distributed almost uniformly. Variances, on the other hand, proved more accurate, with points concentrated more tightly around the first-order polynomial (straight line) fitted to the samples. The histogram in (d) supports this conclusion with its steep drop in IoU for high-variance detections.
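One way to quantify the relationship seen in the scatter plots is a Pearson correlation coefficient plus a least-squares straight line, i.e. the first-order polynomial mentioned above. The sketch assumes NumPy; the article does not say which correlation measure was used, and the numbers in the usage line are toy values, not data from the study.

```python
import numpy as np


def score_iou_relationship(scores, ious):
    """Pearson correlation and least-squares line of IoU against a per-detection score.

    scores: confidence or predicted variance of each detection
    ious:   IoU of each detection with its matched ground truth
    """
    scores = np.asarray(scores, dtype=float)
    ious = np.asarray(ious, dtype=float)
    r = float(np.corrcoef(scores, ious)[0, 1])
    # np.polyfit returns coefficients highest degree first: (slope, intercept)
    slope, intercept = np.polyfit(scores, ious, deg=1)
    return r, float(slope), float(intercept)


# Toy variances whose IoU falls exactly linearly, so r and the slope are (numerically) -1
r, slope, intercept = score_iou_relationship([0.1, 0.2, 0.4, 0.6], [0.9, 0.8, 0.6, 0.4])
```

For variances, a useful estimator yields a negative r; for confidences, a positive one. The closer |r| is to 1, the more tightly the points hug the fitted line.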
Evaluating uncertainty in object detection tasks
We know that assessing uncertainty matters for safer decision-making, and we now know that variances better express the spatial accuracy of detected bounding boxes. How, then, do we evaluate uncertainty estimates?
mAP (mean average precision), the most frequently used metric in object detection, does not take uncertainty into account. It simply counts every detection whose intersection over union with a ground-truth box reaches a given threshold as a true positive. With a threshold of 0.5, mAP therefore treats a detection with an IoU of just 0.5 the same as a perfect match with the ground truth.
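The sketch below computes all-point-interpolated average precision for a single class; mAP averages this value over classes (and, in some benchmarks, over several IoU thresholds as well). It assumes each detection has already been paired with at most one ground-truth box and carries that pair's IoU, whereas real evaluators perform the matching themselves. The function name and inputs are illustrative.

```python
import numpy as np


def average_precision(scores, ious, num_gt, iou_threshold=0.5):
    """All-point-interpolated AP for one class.

    scores: confidence of each detection
    ious:   IoU of each detection with its (pre-matched) ground truth, 0 if none
    num_gt: number of ground-truth objects of this class
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    # A detection is a true positive once its IoU reaches the threshold;
    # how far above the threshold it lies is ignored
    tp = (np.asarray(ious, dtype=float)[order] >= iou_threshold).astype(float)
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # Add sentinels, make precision monotonically non-increasing, then integrate over recall
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))


# A barely acceptable and a near-perfect detector get the same AP of 1.0
loose = average_precision([0.9, 0.8], [0.5, 0.5], num_gt=2)
tight = average_precision([0.9, 0.8], [0.95, 0.99], num_gt=2)
```

Only a drop below the threshold changes the result: with IoUs of 0.49 and 0.5, the same call returns 0.25.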
Researchers have proposed an evaluation measure to address this issue: Probability-based Detection Quality (PDQ). PDQ expects each detection to provide a categorical distribution over all class labels of the classification task and, for localization, a probability map that assigns each pixel a probability of belonging to the object. The metric consists of two components, label quality and spatial quality. Label quality is the probability the detector assigns to the true label. Spatial quality is based on the log probabilities the detector assigns to the pixels of a ground-truth segment, and it penalizes a detection heavily when the detector assigns a high probability to a pixel that does not belong to the ground truth.
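A simplified sketch of the two PDQ components for one detection/ground-truth pair, assuming NumPy, a per-pixel probability map and a boolean ground-truth mask. Following the published metric as we understand it, spatial quality is exp(L_FG + L_BG), with both log-likelihood terms normalized by the size of the ground-truth segment, and the pairwise score is the geometric mean of spatial and label quality. Two simplifications are ours: every pixel outside the mask counts as background (the published definition treats this set more carefully), and probabilities are clipped to avoid log(0). All function names are hypothetical.

```python
import numpy as np


def spatial_quality(prob_map: np.ndarray, gt_mask: np.ndarray, eps: float = 1e-12) -> float:
    """prob_map: per-pixel probability of belonging to the detected object.
    gt_mask: boolean mask of the ground-truth segment (must contain at least one pixel).
    """
    # Clip to avoid log(0); without clipping a single confident miss would drive the score to 0
    p = np.clip(prob_map, eps, 1.0 - eps)
    n_fg = gt_mask.sum()
    # Foreground term: log probability assigned to the true object pixels
    l_fg = np.log(p[gt_mask]).sum() / n_fg
    # Background term (simplified): penalizes probability placed on pixels outside the mask
    l_bg = np.log(1.0 - p[~gt_mask]).sum() / n_fg
    return float(np.exp(l_fg + l_bg))


def label_quality(class_probs: np.ndarray, true_label: int) -> float:
    """Probability the detector assigns to the correct class."""
    return float(class_probs[true_label])


def pairwise_pdq(prob_map, gt_mask, class_probs, true_label) -> float:
    """Geometric mean of spatial and label quality for one matched pair."""
    return float(np.sqrt(spatial_quality(prob_map, gt_mask) * label_quality(class_probs, true_label)))
```

The full PDQ additionally assigns detections to ground truths optimally across an image and averages the pairwise scores over true positives, false positives and false negatives; that bookkeeping is omitted here.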
Expectations and prospects
DNNs have proven their potential. The challenge now is how to tap it safely so that these networks can be integrated into critical systems. The research community has already set out in pursuit of new advances in AI with uncertainty, explainability and robustness in mind. With uncertainty-aware object detectors and a suitable metric for evaluating uncertainty estimates in its kitbag, it is all the better equipped to make strides down that road.
This activity is being funded by the Bavarian Ministry for Economic Affairs, Regional Development and Energy as part of a project to support the thematic development of the Institute for Cognitive Systems.