Table 2 Summary statistics of the different datasets analysed in this study

From: Comparative evaluation of methods for the prediction of protein–ligand binding sites

| Dataset | Type | # Structures | # Sites | # Ligands | Overlap (%) | Methods |
|---|---|---|---|---|---|---|
| LIGYSIS | NEW | 3448 | 8244 | 65,116+ | | |
| LIGYSISNI | NEW | 2275 | 4572 | 38,595 | | |
| sc-PDBFULL | TRAIN | 17,594+ | 17,594+ | 17,594 | 801 (9.7) | VN-EGNN, GrASP, PUResNet, DeepPocket |
| bMOADSUB | TRAIN | 5899 | 11,184 | 11,184 | 606 (7.6) | IF-SitePred |
| CHEN11 | TRAIN | 244 | 479 | 479 | 40+ (0.5) | PRANK, P2Rank |
| PDBbindREF | TEST | 5316 | 5316 | 5316 | 310 (3.8) | VN-EGNN |
| SC6K | TEST | 6147 | 6147 | 6147 | 259 (3.1) | DeepPocket |
| HOLO4K | TEST | 4009 | 10,175 | 10,175 | 207 (2.5) | ALL* |
| COACH420 | TEST | 413 | 624 | 624 | 41 (0.5) | VN-EGNN, GrASP, DeepPocket, P2Rank, PUResNet |
| JOINED | TEST | 557 | 752 | 752 | 110 (1.3) | PRANK |

LIGYSIS is our reference dataset, and LIGYSISNI is a subset with no ion (NI) ligand-binding sites. sc-PDBFULL, bMOADSUB and CHEN11 constitute the training datasets, whereas PDBbindREF, SC6K, HOLO4K, COACH420 and JOINED represent test sets. # Structures, # Sites and # Ligands are the numbers of PDB structures, ligand sites and ligands in each dataset. Note that for LIGYSIS and LIGYSISNI, 3448 and 2775 are the numbers of human structural segments considered, each represented by a single chain. For each segment, all biologically relevant ligand-binding structures were considered: N = 23,321 (LIGYSIS) and N = 19,012 (LIGYSISNI). The number of ligands, or protein–ligand complexes, differs from the number of sites for LIGYSIS because data from multiple structures of the same protein are aggregated into unique sites, i.e., a LIGYSIS site often includes multiple ligands. Overlap is the number of LIGYSIS binding sites represented by at least one protein–ligand complex in a given dataset; the percentage relative to LIGYSIS is given in parentheses. Methods lists the ligand site predictors that use each dataset for training or testing. Only the original version of each dataset is considered in the analysis, e.g., HOLO4K is analysed, but not HOLO4KMlig, nor HOLO4KMlig+ HAP, or HAP-small. The same goes for Mlig, Mlig+ versions of COACH420, sc-PDBSUB and sc-PDBRICH. ALL* represents all the methods compared in this work except for PRANK, fpocket, PocketFinder+, Ligsite+ and Surfnet+. For # Structures, # Sites and # Ligands, the highest values are marked with a bold superscript “+” and the lowest with “”. The markers are reversed for Overlap.
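As a rough illustration of how the Overlap percentages relate to the counts, the sketch below divides an overlap count by the number of LIGYSIS sites. It assumes the denominator is the 8244 sites listed for LIGYSIS under # Sites; the note only says the percentage is "relative to LIGYSIS", so this reading is an assumption. The function name `overlap_percent` is hypothetical and not taken from the study.

```python
# Assumption: percentages are taken relative to the 8244 LIGYSIS sites
# reported in the "# Sites" column of the table.
LIGYSIS_SITES = 8244


def overlap_percent(overlap_sites: int, reference_sites: int = LIGYSIS_SITES) -> float:
    """Return overlap_sites as a percentage of reference_sites, to one decimal place."""
    if reference_sites <= 0:
        raise ValueError("reference_sites must be positive")
    return round(100 * overlap_sites / reference_sites, 1)


# Counts taken from the Overlap column of the table.
print(overlap_percent(801))  # sc-PDBFULL
print(overlap_percent(310))  # PDBbindREF
```

Under this assumption, the sc-PDBFULL and PDBbindREF counts give 9.7 and 3.8, matching the values shown in parentheses for those rows.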