Comparative evaluation of methods for the prediction of protein–ligand binding sites

Utgés, Javier S.; Barton, Geoffrey J.

doi:10.1186/s13321-024-00923-z

Journal of Cheminformatics

Table 2 Summary statistics of the different datasets analysed in this study

Dataset	Type	# Structures	# Sites	# Ligands	Overlap (%)	Methods
LIGYSIS	NEW	3448	8244	65,116⁺	–	–
LIGYSIS_NI	NEW	2275	4572	38,595	–	–
sc-PDB_FULL	TRAIN	17,594⁺	17,594⁺	17,594	801⁻ (9.7)	VN-EGNN, GrASP, PUResNet, DeepPocket
bMOAD_SUB	TRAIN	5899	11,184	11,184	606 (7.6)	IF-SitePred
CHEN11	TRAIN	244⁻	479⁻	479⁻	40⁺ (0.5)	PRANK, P2Rank
PDBbind_REF	TEST	5316	5316	5316	310 (3.8)	VN-EGNN
SC6K	TEST	6147	6147	6147	259 (3.1)	DeepPocket
HOLO4K	TEST	4009	10,175	10,175	207 (2.5)	ALL*
COACH420	TEST	413	624	624	41 (0.5)	VN-EGNN, GrASP, DeepPocket, P2Rank, PUResNet
JOINED	TEST	557	752	752	110 (1.3)	PRANK

LIGYSIS is our reference dataset and LIGYSIS_NI is its subset without ion (NI) ligand binding sites. sc-PDB_FULL, bMOAD_SUB and CHEN11 are the training datasets, whereas PDBbind_REF, SC6K, HOLO4K, COACH420 and JOINED are test sets. # Structures, # Sites and # Ligands are the number of PDB structures, ligand sites and ligands in each dataset. Note that for LIGYSIS and LIGYSIS_NI, 3448 and 2275 are the numbers of human structural segments considered, each represented by a single chain. For each segment, all biologically relevant ligand-binding structures were considered: N = 23,321 (LIGYSIS) and N = 19,012 (LIGYSIS_NI). The number of ligands, or protein–ligand complexes, is not equal to the number of sites for LIGYSIS because data from multiple structures of the same protein are aggregated into unique sites, i.e., a LIGYSIS site often includes multiple ligands. Overlap is the number of LIGYSIS binding sites represented by at least one protein–ligand complex in a given dataset; the percentage relative to the total number of LIGYSIS sites is given in parentheses. Methods lists the ligand site predictors that use each dataset for training or testing. Only the original version of each dataset is considered in the analysis, e.g., HOLO4K is analysed, but not HOLO4K_Mlig, HOLO4K_Mlig+, HAP or HAP-small. The same goes for the Mlig and Mlig+ versions of COACH420, sc-PDB_SUB and sc-PDB_RICH. ALL* represents all the methods compared in this work except for PRANK, fpocket, PocketFinder⁺, Ligsite⁺ and Surfnet⁺. For # Structures, # Sites and # Ligands, the highest values are marked with a “⁺” superscript and the lowest with “⁻”. For Overlap, the markers are reversed.
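The Overlap percentage can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not code from the study: it assumes the percentage is the overlap count divided by the 8244 LIGYSIS sites, rounded to one decimal place, and applies this to a few rows of the table. The names `LIGYSIS_SITES`, `overlap_counts` and `overlap_percentage` are hypothetical.

```python
# Denominator assumed for the Overlap percentage: total LIGYSIS binding sites.
LIGYSIS_SITES = 8244

# Overlap counts copied from a few rows of the table
# (number of LIGYSIS sites represented in each dataset).
overlap_counts = {
    "sc-PDB_FULL": 801,
    "PDBbind_REF": 310,
    "HOLO4K": 207,
    "JOINED": 110,
}


def overlap_percentage(n_overlap, n_reference=LIGYSIS_SITES):
    """Return n_overlap as a percentage of n_reference, rounded to one decimal."""
    return round(100 * n_overlap / n_reference, 1)


# Percentage of LIGYSIS sites covered by each dataset.
percentages = {name: overlap_percentage(n) for name, n in overlap_counts.items()}

for name, pct in percentages.items():
    print(f"{name}\t{overlap_counts[name]} ({pct})")
```

For the rows used here, the computed values agree with the percentages given in parentheses in the Overlap column.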

ISSN: 1758-2946
