Fig. 2

Pipeline for assembly and curation of the UniProtSMB dataset. First, we collected 14,064 experimentally supported proteins with 3D structures and small-molecule binding sites from among 248,805,733 total proteins in the UniProtKB database as of April 17, 2024. After removing proteins longer than 1,024 amino acids, we examined UniProtKB annotations to collect binding site information, including the relevant residues, drugs, cofactors, ATP, and other small-molecule ligands. A total of 7,828 small-molecule-binding protein sequences were collected in this step. Residues involved in small-molecule binding (pink) or not involved in it (blue) were labeled in the sequence of each protein. We then clustered proteins at a sequence similarity cutoff of 50% using UCLUST, which resulted in 4,964 sequence clusters. All proteins within each cluster were subsequently aligned with MAFFT, and all binding sites in each cluster were merged onto the longest sequence in that cluster, yielding a final set of 4,964 proteins (one per cluster). Finally, the resulting UniProtSMB dataset was divided into a training set (3,972 proteins), a validation set (496 proteins), and a test set (496 proteins).
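The merging step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the cluster alignment is already available as equal-length gapped strings (for example, parsed from MAFFT output), that binding sites are given as 1-based residue positions in each ungapped sequence, and that a representative residue is labeled as binding whenever any cluster member has a binding residue in the same alignment column. The function names `column_map` and `merge_binding_sites` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def column_map(aligned: str) -> Dict[int, int]:
    """Map 1-based ungapped residue positions to 0-based alignment columns."""
    mapping: Dict[int, int] = {}
    pos = 0
    for col, ch in enumerate(aligned):
        if ch != "-":  # '-' marks a gap in the alignment
            pos += 1
            mapping[pos] = col
    return mapping


def merge_binding_sites(
    aligned: Dict[str, str], sites: Dict[str, Set[int]]
) -> Tuple[str, List[int]]:
    """Project every member's binding sites onto the longest sequence.

    Returns the representative's name and a per-residue label list
    (1 = binding, 0 = non-binding). The merge rule is an assumption.
    """
    # Representative: the member with the most non-gap residues.
    rep = max(aligned, key=lambda name: len(aligned[name].replace("-", "")))

    # Collect alignment columns that hold a binding residue in any member.
    bound_cols: Set[int] = set()
    for name, seq in aligned.items():
        cmap = column_map(seq)
        for p in sites.get(name, set()):
            bound_cols.add(cmap[p])

    # Label representative residues whose column is in the merged set.
    rep_map = column_map(aligned[rep])
    labels = [0] * len(rep_map)
    for pos, col in rep_map.items():
        if col in bound_cols:
            labels[pos - 1] = 1
    return rep, labels


# Toy cluster: protein "A" (7 residues) and protein "B" (4 residues).
toy_alignment = {"A": "MKTALVG", "B": "-KT-LV-"}
toy_sites = {"A": {2}, "B": {3}}  # A: K2 binds; B: L3 binds
rep_name, rep_labels = merge_binding_sites(toy_alignment, toy_sites)
print(rep_name, rep_labels)
```

In this toy cluster, the binding residue L3 of the shorter protein falls in the same alignment column as L5 of the representative, so the representative carries both its own annotated site and the transferred one.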