Fig. 5

Performance gains of EPIG (top) and BALD (bottom) over the random-sampling baseline for Tox21 (left) and ClinTox (right). BERT features (light colors) yield consistently higher gains than ECFP (dark colors), and EPIG shows more stable improvements than BALD across iterations. The y-axis shows the difference in average precision between each acquisition function and its corresponding random baseline, averaged across 12 tasks and 3 seeds for Tox21 and across 10 seeds for ClinTox.
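
The y-axis quantity described in the caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `mean_gain_over_random`, the array layout `(n_seeds, n_tasks, n_iterations)`, and the toy values are all assumptions made for the example.

```python
import numpy as np


def mean_gain_over_random(ap_acq, ap_rand):
    """Per-iteration gain in average precision over the random baseline.

    Both inputs are assumed (hypothetically) to have shape
    (n_seeds, n_tasks, n_iterations) and to hold average-precision scores.
    """
    ap_acq = np.asarray(ap_acq, dtype=float)
    ap_rand = np.asarray(ap_rand, dtype=float)
    if ap_acq.shape != ap_rand.shape:
        raise ValueError("acquisition and baseline arrays must share a shape")
    # Subtract the baseline, then average over seeds (axis 0) and tasks (axis 1).
    return (ap_acq - ap_rand).mean(axis=(0, 1))


# Toy placeholder scores (not results from the paper): 3 seeds, 12 tasks, 4 iterations.
ap_rand = np.full((3, 12, 4), 0.30)
ap_acq = ap_rand + np.array([0.00, 0.01, 0.02, 0.03])
gain = mean_gain_over_random(ap_acq, ap_rand)
```

Because both arrays share a shape, averaging the differences gives the same curve as differencing the two averaged curves, so the order of the two steps does not matter here.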