Fig. 3

Evolution of Expected Calibration Error (ECE) for Tox21 (left, averaged across 12 tasks and 3 seeds) and ClinTox (right, averaged across 10 seeds). Lower ECE indicates better-calibrated uncertainty estimates. EPIG with BERT features (solid red) converges to low ECE values fastest, indicating better-calibrated uncertainty than the other methods in early iterations. All methods eventually converge to similar ECE values after sufficient iterations, but ECFP features require substantially more labeled data to reach good calibration, highlighting the importance of informative feature representations for reliable uncertainty estimation.
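
For readers unfamiliar with the metric, the following is a minimal sketch of how ECE is commonly computed for a binary classifier. The caption does not state the binning scheme used for the figure, so the equal-width bins, the default of 10 bins, the 0.5 decision threshold, and the function name `expected_calibration_error` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: weighted mean of |accuracy - confidence| over confidence bins.

    Assumptions (not taken from the source): equal-width bins on [0, 1],
    a 0.5 decision threshold, and confidence defined as the predicted
    probability of the predicted class.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)

    preds = (probs >= 0.5).astype(int)
    # Confidence in the predicted class, which lies in [0.5, 1].
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Right-inclusive bins (lo, hi].
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # mask.mean() is the fraction of samples falling in this bin.
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)
```

Under these assumptions, a model that predicts 0.9 and is right 90% of the time scores an ECE near zero, while a model that predicts 1.0 and is always wrong scores an ECE of 1. For a multi-task dataset such as Tox21, one plausible reading of "averaged across 12 tasks" is that a per-task value like this is computed and then averaged, though the caption does not confirm the exact procedure.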