- Research
- Open access
Publishing neural networks in drug discovery might compromise training data privacy
Journal of Cheminformatics volume 17, Article number: 38 (2025)
Abstract
This study investigates the risks of exposing confidential chemical structures when machine learning models trained on these structures are made publicly available. We use membership inference attacks, a common method to assess privacy that is largely unexplored in the context of drug discovery, to examine neural networks for molecular property prediction in a black-box setting. Our results reveal significant privacy risks across all evaluated datasets and neural network architectures. Combining multiple attacks increases these risks. Molecules from minority classes, often the most valuable in drug discovery, are particularly vulnerable. We also found that representing molecules as graphs and using message-passing neural networks may mitigate these risks. We provide a framework to assess privacy risks of classification models and molecular representations, available at https://github.com/FabianKruger/molprivacy. Our findings highlight the need for careful consideration when sharing neural networks trained on proprietary chemical structures, informing organisations and researchers about the trade-offs between data confidentiality and model openness.
Scientific contribution
This study presents the first systematic assessment of the privacy risks associated with the sharing of neural networks trained to predict molecular properties. We are the first to develop a comprehensive framework for assessing these privacy risks in the context of cheminformatics, enabling the evaluation of vulnerabilities across different molecular representations and model architectures. Our work bridges the gap between privacy research and cheminformatics, providing a foundation for safer data sharing practices in drug discovery.
Introduction
The use of neural networks has gained significant traction in early drug discovery, with organisations increasingly relying on these models for a range of important modelling tasks [1]. One of the most common applications is the prediction of molecular properties [2, 3]. The performance of these models is heavily dependent on the quality and quantity of available datasets [2]. However, generating these datasets in drug discovery is an expensive and resource-intensive process, often requiring significant investment in both time and money [4]. As a result, organisations that have invested significant resources in generating proprietary datasets are highly protective of their data and accordingly reluctant to make this information publicly available.
While organisations are interested in keeping their proprietary datasets private due to the significant investments involved, they still recognise the value of engaging with the broader drug discovery and artificial intelligence (AI) communities [5]. In the AI research field, it is common practice to share models through open-source platforms or alternatively to offer them as secure web services, fostering collaboration and innovation [6]. This interaction is mutually beneficial, as it allows for the refinement and validation of models while also advancing the field as a whole [7]. However, this type of collaboration inevitably raises concerns about data security, an issue of growing importance in AI research [8]. As organisations seek to balance the advantages of community engagement with the need to protect valuable data, the issue of privacy is becoming increasingly important.
In this work, we adopt an interdisciplinary approach that bridges the fields of drug discovery and data privacy research. This bridge has largely been missing and we firmly believe that there are great opportunities for scientific progress by bringing the two fields closer to each other. To empirically evaluate the privacy of machine learning models, membership inference attacks have become the most widely used method [9,10,11]. These attacks can be conceptualized as a privacy game, where the adversary seeks to determine whether a specific sample was part of the model’s training data (Algorithm 1). There are various levels of information the adversary might have access to regarding the model [12]. In our study, we focus on the so-called black-box scenario, where the adversary is provided with the output logits of the trained model, rather than the model’s weights, which would correspond to a white-box scenario. This black-box scenario is similar to making machine learning models available as web services.
Membership Inference Attack. This algorithm formalizes the membership inference attack game we use to evaluate the privacy of our neural networks. The attack assumes knowledge about the underlying data distribution (chemical space) \(\Pi\) from which the training dataset is sampled. Given an adversary A, a training algorithm T, and the data distribution \(\Pi\), the process involves sampling points from the data distribution, training a model on these samples, and then using the adversary to infer whether a specific data point (chemical structure) was part of the training set or not. The algorithm tests the adversary’s ability to distinguish between data points sampled from the training set and those not included, thereby evaluating potential information leakage from the model.
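As a minimal illustration of this game, the following Python sketch plays one round of Algorithm 1. The callables `distribution`, `train_algorithm`, and `adversary` are placeholders standing in for the chemical space, the training procedure, and the attack; they are not components of our released framework.

```python
import random

def membership_inference_game(distribution, train_algorithm, adversary, n_train):
    """One round of the membership inference game (sketch of Algorithm 1).

    `distribution` is a callable sampling one labelled molecule from the
    chemical space, `train_algorithm` maps a dataset to a trained model,
    and `adversary` returns True if it believes the challenge point was a
    training set member. All three are placeholders.
    """
    # Sample a training set from the chemical space and train the target model.
    train_set = [distribution() for _ in range(n_train)]
    model = train_algorithm(train_set)

    # Flip a fair coin: challenge the adversary with a member or a fresh non-member.
    b = random.randint(0, 1)
    challenge = random.choice(train_set) if b == 1 else distribution()

    # The adversary wins the round if it guesses membership correctly.
    return adversary(model, challenge) == bool(b)
```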
Building on the growing body of research on membership inference attacks, Hu et al. conducted an extensive survey, highlighting that they have been studied in the domains of image data, text data, tabular data, as well as node classification in graph data [13]. Among the different implementations of attacks, likelihood ratio attacks (LiRA) and robust membership inference attacks (RMIA) have been shown to be the most effective in identifying training data samples, setting state-of-the-art performance benchmarks for the most commonly used benchmark datasets [11, 14]. Despite the growing interest in membership inference attacks, their application to molecular property prediction in drug discovery remains largely unexplored. To the best of our knowledge, Pejo et al. conducted the only study about membership inference attacks in the context of molecular property prediction, but they focused on federated learning scenarios using attacks tailored to this approach [15]. The broader implications and potential risks of membership inference attacks in molecular property prediction, particularly in traditional centralised machine learning models, still require investigation.
In this study, we provide the first comprehensive analysis of membership inference attacks against neural networks trained to predict molecular properties. We thereby highlight the risk that releasing machine learning models may expose proprietary chemical structures to the public, a challenge that organisations in particular must consider. To our knowledge, this is the first study to investigate how different molecular representations affect the privacy of the resulting models. Additionally, we create a framework where the privacy risks of classification model architectures and representation algorithms can be assessed and compared. A scheme of our workflow is described in Fig. 1. Our study also explores whether different membership inference attacks can be used together, and we present some characteristics of the identified chemical structures that provide insights into the specific privacy risks. The approaches and findings of this study have relevance beyond the pharmaceutical sector, offering applicability to any field that relies on predictions of molecular properties, such as materials science or toxicology. Our framework also allows for the systematic assessment of privacy threats associated with predictive models in these fields.
Overview of our workflow to evaluate privacy risks of neural networks for molecular property prediction. Two random, non-overlapping subsets are created from each dataset. One subset is transformed into the desired molecular representation and used to train a neural network, optimised through Bayesian hyperparameter tuning [16]. We then apply membership inference attacks (Algorithm 1) to determine if chemical structures in the training data can be distinguished from those in the other subset. We evaluate this using two different attack implementations. This process is repeated 20 times for each dataset and molecular representation. We assess the results by analyzing true positive rates at fixed false positive rates, comparing them to random guessing, and examining the impact of the molecular representations
Results
In this section, we present the results of membership inference attacks on different neural networks trained on different datasets for specific tasks: Blood-Brain Barrier crossing (BBB) to predict the ability of molecules to cross the blood-brain barrier [17], Ames mutagenicity prediction (Ames) to assess potential mutagenicity [18, 19], DNA-encoded library enrichment (DEL) to predict whether molecules are enriched for target binding [20], and inhibition of the potassium ion channel encoded by the human ether-à-go-go-related gene (hERG) to assess cardiac toxicity risks [21]. The datasets differ in size, with BBB and Ames being relatively small (859 and 3,264 training data molecules) and DEL and hERG being relatively large (48,837 and 137,853 training data molecules). We explore the potential of combining different attacks to identify additional molecules contained in the training data. We also investigate whether the identified molecules have distinct properties that distinguish them from the rest of the training data. Finally, we provide a detailed example of a specific attack to illustrate our findings.
Membership inference attacks
We wanted to see if we could identify whether a molecule was part of the training data by querying a neural network and analysing its outputs. To achieve this, we used two different membership inference attacks: likelihood ratio attacks (LiRA) and robust membership inference attacks (RMIA) [11, 14]. We evaluated their ability to distinguish between molecules in the training data and those outside it by measuring the true positive rate (TPR) at a false positive rate (FPR) of 0. In this context, we refer to molecules that were part of the training data as positives. Evaluating membership inference attacks at low FPRs was recommended by Carlini et al. [11]. Here we examine the TPR at an FPR of 0, which is the most conservative approach. For models trained on smaller datasets, we observed significantly higher TPRs than would be observed when randomly guessing if the chemical structure was part of the training dataset (Fig. 2). For example, in the blood-brain barrier crossing dataset, median TPRs were between 0.01 and 0.03 for most representations, corresponding to the identification of between 9 and 26 of the 859 training molecules. The baseline in our experimental setup for identifying molecules by chance is identifying 2 molecules of the training data (see Supplementary Information for a comprehensive derivation of this baseline). Models trained on larger datasets also showed TPRs significantly above this baseline, but only for one of the attacks, which varied between datasets (Fig. 2). The observed TPRs decreased with increasing dataset size.
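For readers who want to reproduce this evaluation on their own attack scores, the following sketch shows one way to compute the TPR at a fixed FPR. It assumes higher scores indicate higher membership confidence; it is an illustration, not code taken from our framework.

```python
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.0):
    """True positive rate at a fixed false positive rate (sketch).

    Assumes higher scores mean higher membership confidence. For
    target_fpr=0 the threshold sits at the highest non-member score,
    the most conservative evaluation used in the paper.
    """
    member_scores = np.asarray(member_scores)
    nonmembers = np.sort(np.asarray(nonmember_scores))[::-1]  # descending

    # At most n_fp non-members may score strictly above the threshold.
    n_fp = int(target_fpr * len(nonmembers))
    threshold = nonmembers[min(n_fp, len(nonmembers) - 1)]
    return float(np.mean(member_scores > threshold))
```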
True positive rates for identifying training data molecules at a false positive rate of 0. The distributions of 20 experimental repetitions are shown for each representation and dataset, for both the likelihood ratio attack (LiRA) and the robust membership inference attack (RMIA). Distributions with significantly higher true positive rates than the baseline are indicated by red stars. A single star represents a p-value less than 0.05, two stars represent a p-value less than 0.01, and three stars represent a p-value less than 0.001. Training dataset sizes (total amount of positives) are: 859 molecules for the blood-brain barrier permeability dataset; 3,264 for the Ames mutagenicity prediction dataset; 48,837 for the DNA-encoded library enrichment dataset; and 137,853 for the hERG channel inhibition dataset
To verify the consistency of our trends, we repeated our analysis of the TPR at an FPR of \(10^{-3}\), as shown in Supplementary Fig. 1. We observed similar trends at this FPR. One notable difference was that RMIA always performed at least as well as LiRA across every dataset and representation. Specifically, RMIA was significantly better in half of the cases. For the other half, no significant difference was observed. In addition, even for the larger datasets, RMIA consistently provided higher TPRs than the baseline. We also investigated the corresponding ROC curves for all datasets and representations, which show our trends are consistent even for larger FPRs (Supplementary Fig. 2). The high TPRs across all four datasets at both FPRs indicate significant information leakage, showing that chemical structures from the training data can be identified. The amount of information leakage seems to be higher for models trained on smaller datasets.
When comparing different molecular representations for neural networks, we found that models trained on graph representations showed the least information leakage across all datasets (Fig. 2). The graph representation consistently had the lowest TPRs across all datasets and attacks, with a median TPR that was on average \(66\% \pm 6\%\) lower than median TPRs of the other representations at an FPR of 0. In fact, for our larger datasets (DEL enrichment and hERG channel inhibition), models trained on graph representations were the only ones for which it was not possible to identify more training data molecules than by random guessing (Fig. 2). We observed the same trend for an FPR of \(10^{-3}\), where the graph representation consistently had the lowest TPRs (Supplementary Fig. 2). We tested whether this was due to differences in model performance (Fig. 3), but found no clear correlation between model performance and information leakage. For the small datasets, most of the models trained on different representations performed similarly. For the larger datasets, there were some outliers in model performances. In the DNA encoded library enrichment dataset, this included models trained on MACCS keys, which performed significantly worse than the other representations. In the hERG channel inhibition dataset this included models trained on graph and SMILES representations, which performed significantly better than the other representations. Our findings suggest that graph representations combined with message passing neural networks may offer the safest architecture in terms of data privacy, without sacrificing model performance.
Classification performance of neural networks trained on different molecular representations in molecular property prediction tasks. The performance is measured as the area under the receiver operating characteristic curve (AUROC). The performance is displayed as the distribution over 20 experiment repetitions
Combining membership inference attacks
After confirming that both membership inference attacks could identify molecules from the training data, we investigated whether they identified the same molecules or whether they could be used together to gain more information about the training data. To do this, we calculated the percentage of maximum possible overlap between the sets of molecules identified by each attack (Fig. 4). For our small datasets, we observed significantly higher overlap than would have been observed by chance if the attacks were completely uncorrelated. However, the overlap was still well below 100%, indicating that using both attacks can identify a wider range of molecules in the training data. For our larger datasets (DEL enrichment and hERG inhibition), there was no significant overlap, which is reasonable given our earlier findings that only one of the attacks significantly outperformed random guessing in each dataset. How much the observed overlap deviated from overlap occurring due to chance is shown in Supplementary Fig. 3. Our results suggest that using multiple different membership inference attacks is advantageous and allows the identification of more molecules from the training data.
We also investigated the overlap of identified molecules in models trained on different representations. We found a consistently large overlap between models trained on ECFP4 and ECFP6. For other representations, the overlap varied depending on the dataset and the attacks used. Detailed results can be found in Supplementary Fig. 4.
Overlap between the sets of molecules identified by the likelihood ratio attack (LiRA) and the robust membership inference attack (RMIA). The percentage of possible overlap is defined as the proportion of molecules from the smaller set that are also present in the larger set. The less overlap exists between the attacks, the more information is gained when combining them. Overlap that was significantly higher than observed when randomly drawing two uncorrelated subsets is indicated by red stars. A single star represents a p-value less than 0.05, two stars represent a p-value less than 0.01, and three stars represent a p-value less than 0.001
Analysing the identified training data molecules
Next, we wanted to see if the molecules identified from the training data shared any common characteristics. To do this, we analysed whether they differed in their distributions of property labels and molecular sizes compared to the overall training data. For the property labels, we found that the identified molecules had a significantly higher proportion of minority class molecules compared to the overall dataset (Table 1). The minority class refers to the less frequently occurring label category within a dataset, such as active compounds in a screening assay where the majority are inactive. This significant difference in label distribution was observed in all our imbalanced datasets and held true for both small datasets (blood-brain barrier crossing) and larger ones (DNA encoded library enrichment, hERG channel inhibition) across both membership inference attacks. We confirmed this finding by examining the TPRs of minority class molecules and discovered that their TPRs were consistently higher than the overall TPRs (Supplementary Fig. 5). Specifically, the median TPR of the minority class was approximately three times greater for all representations of the blood-brain barrier crossing dataset and up to 20 times greater for some representations of the DNA encoded library enrichment and hERG channel inhibition datasets. Detailed TPR distributions for all datasets and representations can be found in Supplementary Fig. 5. Regarding molecular sizes, we only found differences between identified and not identified structures in models trained on ECFP representations (Supplementary Figure 6). For models trained on other representations, we did not find any significant differences. While the identified structures do not seem to show a clear trend regarding their molecular size, our findings do indicate that it is easier to identify molecules from the minority class.
We also investigated whether molecules with uncommon structural features are easier to identify. Uncommon structures were defined based on both their highest (nearest neighbour) and average Tanimoto similarity to the rest of the training data, and identification was assessed at an FPR of 0. For the highest Tanimoto similarity, Mann–Whitney U tests revealed that in more than 80% of dataset-representation combinations, the identified molecules had significantly lower similarity to their nearest neighbour in the training set compared to non-identified molecules. In addition, we examined whether the fraction of identified molecules varied systematically with Tanimoto similarity (following trends such as linear or exponential relationships), but no consistent pattern emerged across all combinations of datasets and representations (Supplementary Figures 7 and 8). Similar results were observed for average Tanimoto similarity. In 75% of cases, identified molecules had a significantly lower average similarity to the rest of the training data compared to non-identified molecules. However, when analysing the fraction of identified molecules across different similarity values, we again did not observe a consistent relationship between Tanimoto similarity and identification rates (Supplementary Figures 9 and 10). Overall, these results show that molecules with lower structural similarity to the training data tend to be easier to identify, but their identification rates do not follow a simple, systematic trend based on similarity alone.
Case study
To illustrate our results, we present a specific example of attacking one neural network model trained to predict whether molecules can pass the blood-brain barrier. Molecules are represented by ECFP4s, a common representation in many related applications. This particular model was chosen because it is representative of the 20 experimental repetitions we conducted, with its TPR falling within the interquartile range of our results. Figure 5 shows the chemical structures identified using LiRA on this model under the most stringent conditions (an FPR of 0). It was possible to identify 23 of the 859 structures from the training data (Fig. 5). The baseline for random guessing in that case is identifying 2 of 859 structures (See Supplementary information). 21 of the 23 identified structures are from the minority class (Fig. 5). When we relaxed the FPR to \(1.1 \times 10^{-2}\) (allowing for 10 false positives among the identified structures), we were able to identify 100 structures from the training data (baseline for random guessing is 10 structures in that case). This illustrates the rapid increase in identified structures as the restrictions on the FPR are relaxed. Additionally, when combining both LiRA and RMIA, we identified 53 structures at an FPR of 0. We hope that this concrete illustration shows the potential risks that membership inference attacks pose to neural network models used in drug discovery.
Chemical structures identified using the likelihood ratio attack (LiRA) against a neural network model trained to predict whether molecules pass the blood-brain barrier. Molecules were represented using ECFP4s in this model. Structures that are from the minority class have the label 0 and are surrounded by a solid line. These structures correspond to molecules that cannot pass the blood-brain barrier. It was possible to identify 23 of the 859 training structures at an FPR of 0
Discussion
We investigated if it is possible to identify molecules from the training data using only the output of trained neural networks, a so-called black-box attack scenario. To investigate this question, we applied state-of-the-art membership inference attacks to neural networks trained on different machine learning tasks for molecular property prediction. We showed that it is possible to confidently identify a subset of the training data. We also showed that combining multiple different membership inference attacks allows us to identify even more molecules, since each attack identifies different molecules. Furthermore, we investigated the identified molecules and found that they contain a much higher proportion of molecules from the minority class. Thus, this investigation presents evidence that there can be significant information leakage of chemical structures from the training data when publishing a trained neural network model, which we discuss in the following paragraphs.
It is important to note that our results focus on membership inference attacks against neural networks trained on classification tasks. This investigation does not cover regression tasks. Further research is needed to explore this area.
A limitation of the membership inference attacks we used is that they require the adversary to have data similar to the training data of the target model. This assumption may hold in some real-world scenarios (for instance, many organisations have comparable internal datasets or can leverage publicly available datasets [22]), but it does not fully capture the complexity of real-world applications. While assuming that the adversary has data from a similar distribution is a useful starting point for exploring privacy vulnerabilities, private datasets in drug discovery often contain novel chemistries or rare scaffolds that lie outside common public libraries, potentially degrading the efficacy of these attacks when the adversary's data distribution diverges from that of the target model. Although Shokri et al. [9] showed that synthetic data generated by the target model can still be used to perform attacks requiring shadow models, the feasibility of this approach for molecular data, where the gap between known and unexplored chemical space can be substantial, requires further investigation. Future work should address how these attacks perform under distribution shifts in order to better assess their applicability under these conditions.
We also want to emphasize that membership inference attacks assess whether it is possible to identify samples from the training data, not whether it is possible to reconstruct the training data from the model. These attacks are commonly used to assess information leakage in privacy assessments and viewed as a building block towards other attacks, e.g. reconstruction attacks [12]. In the context of drug discovery, they may have even more practical applications. For example, if an organisation offers neural network based molecular property predictions as an online service, membership inference attacks could determine whether specific molecules were part of the model’s training data. Since the presence of a molecule in the training data suggests that it is being actively researched, a competitor could use this information to gain valuable insights that could give them a strategic advantage.
Our study shows that neural networks trained for molecular property prediction in drug discovery can leak training data information, as demonstrated through membership inference attacks. However, message-passing neural networks using graph representations of molecules showed significantly reduced vulnerability to these attacks. We argue that this shows that these models are the safest architecture in terms of preserving the privacy of the training data in our setting. An alternative interpretation could be that message-passing neural networks are not inherently safer, but rather that the specific membership inference attacks we used were less effective against this particular combination of model and representation. However, we think this is very unlikely, as the results hold across two different attacks, both of which rely only on model outputs rather than architecture-specific features. The only way the attacks are influenced by the specific architecture is through the training of shadow models that share the architecture of the target model. Notably, LiRA and RMIA are robust to mismatches in shadow model architectures, as shown by Carlini et al. and Zarifzadeh et al. [11, 14], meaning that variations in shadow model architectures do not significantly affect the success of the attacks. This supports our claim that graph representations of molecules with message-passing neural networks are the safest architecture in terms of protecting training data privacy in drug discovery.
We are confident that our results would be similar even if attacks tailored to graph classification neural networks, such as those proposed by Wu et al. [23], were used. Our conclusion is supported by Zarifzadeh et al. [14], who showed both theoretically and empirically that the Attack-P method of Ye et al. [24] — which is essentially identical to the threshold-based attack of Wu et al. — is less effective than both LiRA and RMIA. Therefore, we focused on the use of RMIA and LiRA, as they are widely recognised as state-of-the-art techniques in the field and can be applied to any model architecture.
Our findings align with those of Zarifzadeh et al., who investigated membership inference attacks in the domains of computer vision (using CIFAR-10, CIFAR-100, and CINIC-10 datasets) and tabular data (using the Purchase-100 dataset) [14]. At a false positive rate (FPR) of 0, they reported true positive rates ranging from 0.0082 to 0.0778, which is in the same range as our results. This indicates that the findings of attacks on neural networks in other deep learning fields translate to the field of molecular property prediction. Another finding of Zarifzadeh et al. was that RMIA consistently outperformed LiRA [14]. They derived this both theoretically and empirically. Our results generally support this, with one exception: for attacks on the hERG channel inhibition dataset, LiRA outperformed RMIA at an FPR of 0. However, at an FPR of \(10^{-3}\), this was not the case, and our results completely agreed with the findings of Zarifzadeh et al. [14]. The small discrepancy at an FPR of 0 may be due to the computational constraints we faced with the hERG dataset, which was the largest in our study. Due to its size, we had to use a small number of samples Z from the underlying distribution against which to perform the likelihood ratio test for RMIA. This limitation arose because comparing all data points against many points Z across our models was computationally prohibitive. In contrast, LiRA does not have these constraints, which may explain its better performance compared to RMIA in this case. While RMIA generally outperforms LiRA, the latter remains a valuable approach, as it identifies different molecules, making it a complementary method, which we discuss in a later paragraph.
Our results also show that membership inference attacks are most effective on smaller datasets. This is consistent with the findings of Shokri et al. [9], who link the success of attacks to the generalisability of the model and the diversity of the training data, both of which improve with larger datasets. It is important to note that our neural networks are by no means designed in a way that makes them vulnerable to attack. On the contrary, we have implemented robust regularisation techniques that have been shown to make neural networks more resilient to membership inference attacks and improve privacy guarantees. In particular, our models use early stopping, dropout, and L2 weight regularisation; the latter two have been specifically shown to reduce the efficacy of membership inference attacks [9, 25].
In practice, it could even be possible to further increase the effectiveness of the attacks by augmenting the attack query with similar data, as was shown by Zarifzadeh et al. [14]. We did not explore this due to computational limitations and the broader scope of our study.
Another way to further increase the effectiveness of privacy attacks could be to incorporate scaffold-based inference strategies. Shifting from identifying complete molecular structures to detecting the presence or absence of specific molecular scaffolds within the training data could be an easier task that still provides information about sensitive intellectual property. Future research in this direction could potentially uncover additional privacy vulnerabilities in molecular property prediction.
We found that by applying multiple membership inference attacks, we were able to identify more molecules within the training data. This is consistent with previous work by Ye et al. [24], which demonstrated that some data points are only identified by certain attacks. We extended this by investigating the current state-of-the-art methods, LiRA and RMIA, and explicitly quantifying the overlap between these attacks across different datasets. From a practical point of view, using both attacks makes sense because it is possible to reuse the same shadow models between attacks, allowing more training data to be identified with limited computational overhead. In addition, the attacks remain feasible even when minimal computational resources are available. For example, RMIA has been shown to perform effectively with as few as two shadow models [14]. In such cases, the attacks can be run on any device capable of training neural networks with architectures similar to the target model.
Our finding that molecules in the minority class are more likely to be identified could be explained by the lower diversity in the training data for these compounds, as discussed above. This observation has important implications for drug discovery. In many datasets, the pharmacologically relevant compounds often belong to the minority class. For example, in high-throughput screening assays such as DNA-encoded library enrichment, researchers focus on the few molecules that bind to the target protein, while the majority that do not bind are of less interest [26]. This pattern is also seen in various cell-based screening assays, such as phenotypic assays aimed at identifying molecules that inhibit cancer cell proliferation [27]. In these scenarios, the minority class contains the compounds of greatest interest, making their identification far more valuable.
Our findings also indicate that molecules with low similarity to the rest of the training data are easier to identify, which has potential implications for drug discovery. Unique molecular structures that differ from established library compounds may correspond to proprietary lead compounds or novel scaffolds under development. Our results suggest that models might be more likely to memorize and subsequently reveal information about such structures. This observation aligns with previous research indicating that models tend to memorize outliers [28], which is suspected to contribute to the easier identification of these molecules [11].
To address these privacy concerns, we have developed a Python package to assess the privacy of training data for molecular property prediction (available at https://github.com/FabianKruger/molprivacy). This package allows users to evaluate their own data by applying our workflow to determine the extent to which training data can be identified when using different molecular representation methods. In addition, the package supports the testing of new representation methods by providing insight into their training data privacy and model performance on both user-provided and pre-supplied datasets. We hope that this tool will help researchers assess privacy risks before publishing their models.
Our research shows the potential dangers of information leakage from training data when publishing a trained neural network for drug discovery tasks. This risk exists even when the weights of the neural network are not published, and the model is offered as a supposedly safe web service. This has significant implications for organisations, which must constantly balance the need to make scientific discoveries openly available with the imperative to protect confidential data. We have shown that information leakage is consistently observed, but it can be mitigated by representing molecules as graphs and using message-passing neural networks, which also proved to be among the best performing models on our datasets. However, when planning to publish a model, it is crucial to consider not only performance but also the privacy implications of different model architectures. Our findings also open up new research questions, such as how to adapt reconstruction attacks to the domain of molecules and how to develop models that are safer in terms of training data privacy in this field. The baseline for developing safer models might be to represent molecules as graphs and use message-passing neural networks for predictions. Our research highlights the essential balance between publicly available innovation and privacy, a balance that will impact the future of AI-driven drug discovery.
Methods
In this section, we first describe how we trained neural networks on biological datasets to predict molecular properties. Then, we outline the membership inference attacks used to evaluate the vulnerabilities of the models. Finally, we explain the methods used to compare and analyse the molecules leaked by these attacks. A high-level overview of our workflow is presented in Fig. 1. The code for our models and membership inference attacks, along with the datasets used in this study, is available on GitHub.
https://github.com/FabianKruger/molprivacy
Datasets
We used four different datasets to predict pharmacologically relevant molecular properties. The datasets differ in size, task, and class imbalance. The first dataset is used for mutagenicity prediction [18, 19]. It contains Ames test results for 7,255 drugs. Of these, 54% show positive results. The second dataset assesses blood-brain barrier permeability [17]. It contains 1,909 molecules, with 76% able to penetrate the barrier. The third dataset provides information on the inhibition of the potassium ion channel encoded by the human Ether-à-go-go-Related Gene (hERG) [21]. Inhibition is defined as a half-maximal inhibitory concentration of less than 10 \(\mu\)M. This dataset contains 306,341 compounds, with 4.5% being inhibitors. These three datasets were obtained from Therapeutics Data Commons [22]. The fourth dataset contains information on whether a molecule is enriched in a DNA-encoded library (DEL) for binding to carbonic anhydrase IX [20]. Positive enrichment is defined as the top 5% of enrichment scores. This dataset includes 108,528 molecules, with 4.9% showing enrichment after cleaning the data.
We pre-processed all datasets to remove ambiguities and incorrect compounds. Molecules were standardised for correct bonding, aromaticity, and hybridisation. Salts were removed to isolate the primary compound and Simplified Molecular Input Line Entry System (SMILES) [29] strings were converted to their canonical forms. Duplicate molecules and those with conflicting labels were removed. Molecules with canonical SMILES strings longer than 200 characters were also excluded. These steps were performed using the RDKit package version 2024.03.1. The reported dataset sizes are after cleaning. The cleaned datasets were randomly divided into a training set (45%), a validation set (10%), and a population subset (45%). The population subset was used for membership inference attacks, while the training and validation sets were used for model training and hyperparameter optimisation.
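The following RDKit sketch illustrates the kind of cleaning pipeline described above; the exact standardisation calls in our released pipeline may differ, so this should be read as an approximation rather than the implementation.

```python
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover

def clean_dataset(smiles_to_label):
    """Sketch of the preprocessing described above; the exact
    standardisation steps in our pipeline may differ."""
    remover = SaltRemover()
    seen, conflicting = {}, set()
    for smi, label in smiles_to_label.items():
        mol = Chem.MolFromSmiles(smi)  # sanitises bonding/aromaticity on parse
        if mol is None:
            continue  # drop structures RDKit cannot parse
        mol = remover.StripMol(mol)  # remove salts, keep the primary compound
        can = Chem.MolToSmiles(mol)  # canonical SMILES by default
        if len(can) > 200:
            continue  # exclude canonical SMILES longer than 200 characters
        if can in seen and seen[can] != label:
            conflicting.add(can)  # conflicting labels: remove these molecules
        seen[can] = label
    return {s: l for s, l in seen.items() if s not in conflicting}
```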
Model architectures
To capture the variety in molecular representation approaches, we trained neural networks on a range of commonly used representations. Our study included extended-connectivity fingerprints (ECFPs) [30], molecular access system (MACCS) keys [31], graph representations, RDKit fingerprints (RDKitFPs) [32], and SMILES [29] representations. We chose these representations to cover various conceptually different approaches to molecular representation. For ECFPs, we investigated fingerprints with radii of 2 and 3, both mapped to 2048-bit vectors. MACCS keys were represented as binary vectors, indicating the presence or absence of 166 structural patterns. RDKitFPs identified all subgraphs in the molecule up to a length of 7, hashed into 2048-bit vectors. These three representations were generated using RDKit [32]. The graph representation was generated using Chemprop version 1.6.1 [33].
The type of neural network we used varied depending on the specific molecular representation. We used multi-layer perceptrons (MLPs) for ECFPs, MACCS keys, and RDKitFPs. We employed message passing neural networks implemented in Chemprop for the graph representation. For the SMILES representation, we used a pre-trained transformer encoder combined with a convolutional neural network based on Karpov et al. [34]. All our models were implemented in Pytorch version 2.2.2 [35]. We pre-trained the transformer encoder to convert non-canonical SMILES strings to their canonical counterparts for 20 epochs using the ChEMBL_V29 dataset from Therapeutics Data Commons [36]. We randomly split this dataset into 90% training data and 10% validation data. For the transformer encoder, we used the same hyperparameters as in the original publication but increased the context length of the transformer from 110 to 202 tokens in order to also generate encodings for larger molecules. We determined the hyperparameters for the MLPs, message passing neural networks, and convolutional neural networks using a Bayesian optimisation method, which we will describe in the next paragraph. All our models had one output node to predict the logits for our binary classification problems.
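As an illustration of the fingerprint-based models, the sketch below defines a minimal PyTorch MLP with a single output logit; the hidden dimensions and dropout rate shown are placeholders for the values found by the hyperparameter optimisation described next.

```python
import torch.nn as nn

class FingerprintMLP(nn.Module):
    """Minimal sketch of the fingerprint MLPs; hidden sizes and dropout
    are placeholders for the optimised hyperparameters."""

    def __init__(self, in_dim=2048, hidden_dims=(512, 128), dropout=0.2):
        super().__init__()
        layers, prev = [], in_dim
        for h in hidden_dims:
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(dropout)]
            prev = h
        layers.append(nn.Linear(prev, 1))  # one output node: the binary logit
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)  # raw logits; the loss applies the sigmoid
```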
Hyperparameter optimization
To avoid introducing subjective bias into our models, we automatically optimised the hyperparameters of the neural networks using a tree-structured Parzen estimator [16]. This was done using Optuna version 3.6.0 [37]. We optimised dropout rate, number and dimension of hidden layers, learning rate, and weight decay for MLPs. For message passing neural networks, we optimised message passing steps, dropout, encoder hidden dimension, bias addition in the encoder, aggregation function, number and dimension of classifier hidden layers, learning rate, and weight decay. For convolutional neural networks, we kept the filter sizes from the original publication and optimised dropout, learning rate, and weight decay. Detailed ranges for the hyperparameter search spaces are shown in Supplementary Table 1. We optimised each neural network architecture for three hours on an NVIDIA Volta V100 GPU. During this time, we evaluated the validation cross-entropy loss for different hyperparameter combinations. Each training run was performed for a maximum of 20 epochs. We stopped runs early if the validation loss did not improve for two consecutive epochs or if, after 15 epochs, the validation loss was above the median value for that epoch.
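A condensed sketch of this optimisation loop is shown below. `build_and_train` and `validation_cross_entropy` are hypothetical helpers, and the parameter ranges are illustrative stand-ins for the search spaces in Supplementary Table 1.

```python
import optuna

def objective(trial):
    """One hyperparameter trial for the fingerprint MLPs (sketch)."""
    params = {
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
        "n_layers": trial.suggest_int("n_layers", 1, 4),
        "hidden_dim": trial.suggest_int("hidden_dim", 64, 1024, log=True),
        "lr": trial.suggest_float("lr", 1e-5, 1e-2, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True),
    }
    model = build_and_train(params, trial)  # hypothetical helper; may call trial.report()
    return validation_cross_entropy(model)  # minimised by the study

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(),  # tree-structured Parzen estimator
    pruner=optuna.pruners.MedianPruner(),  # prunes trials reporting worse-than-median loss
)
study.optimize(objective, timeout=3 * 60 * 60)  # three-hour budget per architecture
```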
Model training
After finding the optimised hyperparameters, we trained the final models until their performance converged on the validation set. We used early stopping with a patience of 10 epochs and saved the model weight of the epoch with the lowest validation loss. For all our models, we used a weighted binary cross-entropy loss as a loss function. The weights accounted for the class imbalance and were inversely proportional to the frequency of the classes. We used the adaptive moment estimation with decoupled weight decay regularization (AdamW) optimiser for MLPs and message passing neural networks [38]. For convolutional neural networks, we used the original adaptive moment estimation (Adam) optimiser to remain consistent with the original implementation [39]. Training was done in batch sizes of 64 samples. We repeated our experiment 20 times for each dataset and representation to capture the marginal distribution of all randomness in the experiment, including dataset splitting, hyperparameter optimisation, and model weight initialisation. We examined the performance of each model on the population sample dataset (Fig. 1), as it was not used in any way for training or hyperparameter optimisation, and testing the performance of the model is independent of the membership inference attacks.
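The following sketch outlines this training setup. The `model`, `loader`, and `labels` objects are assumed inputs, the hyperparameter defaults are placeholders, and early stopping on the validation loss is omitted for brevity.

```python
import torch
import torch.nn as nn

def train_weighted(model, loader, labels, lr=1e-3, weight_decay=1e-4, epochs=20):
    """Sketch of the training loop with a class-weighted BCE loss.

    `labels` is a 1-D tensor of all 0/1 training labels, used to weight
    the positive class inversely to its frequency.
    """
    n_pos = labels.float().sum()
    n_neg = len(labels) - n_pos
    criterion = nn.BCEWithLogitsLoss(pos_weight=n_neg / n_pos)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

    for _ in range(epochs):  # early stopping on validation loss omitted here
        for xb, yb in loader:  # batches of 64, as in the paper
            optimizer.zero_grad()
            loss = criterion(model(xb), yb.float())
            loss.backward()
            optimizer.step()
    return model
```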
Membership inference attacks
To determine whether an adversary can discriminate between molecules that are in the training data and those that are not, we applied two state-of-the-art membership inference attacks: likelihood ratio attacks (LiRA) [11] and robust membership inference attacks (RMIA) [14]. Both methods assign a score to each sample, indicating the confidence that it was part of the training dataset. LiRA performs a likelihood ratio test by comparing the likelihood of the model output when the sample is included in the training dataset against when it is not (Algorithm 2). To approximate these likelihoods, so-called shadow models are trained on data from the same distribution. Some shadow models include the target sample in their training data, while others do not. We used random subsets containing 50% of our population subset dataset to train 10 shadow models for each target model. Each target model training data sample was included in some shadow models and excluded from others. The shadow models had the same hyperparameters as the target model and were trained for 15 epochs. For each target sample, two Gaussian distributions of the rescaled output logits are modelled: one for shadow models that included the target sample in their training data and one for those that did not. The likelihood of observing the rescaled output logits of the target model is then calculated for each distribution. The ratio between these likelihoods represents the likelihood ratio that the target sample was in the training data. For more details on LiRA, we refer the reader to the original publication [11].
Likelihood Ratio Attack (LiRA) tests whether a specific target data point m - in our case, a molecular structure x with the corresponding label y - was part of the training data for a target neural network model \(f_{\theta }\). In this attack, shadow models \(s_i, \; i = 1, \dots , N\) are trained on data drawn from a distribution similar to that of \(f_{\theta }\)’s training data (in our case, a similar chemical space). Some shadow models include m in their training data, while others do not. The re-scaled confidence of each shadow model when predicting m is then calculated. These confidences are modeled as two Gaussian distributions: one for the shadow models that included m, and one for those that did not. Finally, we determine whether the confidence of the target model \(f_{\theta }\) is more likely to belong to the distribution of models that included m or the distribution of models that did not. The likelihood ratio between these distributions, combined with a decision threshold t, determines whether m is predicted to have been part of \(f_{\theta }\)’s training data.
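A per-sample version of this scoring step might look as follows; the sketch assumes the model confidences for the correct class have already been collected from the target and shadow models, and is not taken from our released implementation.

```python
import numpy as np
from scipy.stats import norm

def lira_score(target_conf, in_confs, out_confs, eps=1e-6):
    """Per-sample LiRA score (sketch).

    `target_conf` is the target model's probability for the correct class
    of the candidate molecule; `in_confs`/`out_confs` are the same quantity
    under shadow models trained with/without it.
    """
    def rescale(p):
        # Logit rescaling so the confidences are approximately Gaussian [11].
        p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
        return np.log(p) - np.log(1 - p)

    mu_in, sd_in = norm.fit(rescale(in_confs))     # shadows that saw the molecule
    mu_out, sd_out = norm.fit(rescale(out_confs))  # shadows that did not

    # Log-likelihood ratio: is the target model's confidence more consistent
    # with the "member" Gaussian or the "non-member" Gaussian?
    phi = rescale(target_conf)
    return norm.logpdf(phi, mu_in, sd_in) - norm.logpdf(phi, mu_out, sd_out)
```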
RMIA compares the likelihood of observing the target model \(f_{\theta }\) after applying the training algorithm T under two different conditions: first, when the target sample m is included in the training dataset D, and second, when a random, different sample z is included instead. This process is repeated with many different random samples. RMIA then calculates the probability that these likelihood ratios exceed a threshold \(\gamma\):

$$\text{Score}(m) = \Pr _{z}\left( \frac{P(f_{\theta } \mid m)}{P(f_{\theta } \mid z)} \ge \gamma \right)$$

In our experiments, we chose a gamma value of 2; the attack has been shown to be robust to different values of \(\gamma\) [14]. Each likelihood in the ratio is calculated using Bayes' rule (for brevity, we abbreviate the conditions on both probabilities with m and z here):

$$\frac{P(f_{\theta } \mid m)}{P(f_{\theta } \mid z)} = \frac{P(m \mid f_{\theta })/P(m)}{P(z \mid f_{\theta })/P(z)}$$
The probability \(P(m|f_{\theta })\) is approximated by the probability of the correct class prediction, and the probability P(m) is approximated as the empirical mean of this quantity over all shadow models (Algorithm 3). The probabilities for the random points Z are computed similarly. The complete implementation of this attack is shown in Algorithm 3. For more details on RMIA, we refer readers to the original publication [14]. For this attack, we reused the shadow models from LiRA and used the 50% of the population sample dataset not included in their training as random sample points Z in the attack. We based our implementation of LiRA and RMIA on the implementation in the LeakPro repository of AI Sweden.
Robust Membership Inference Attack (RMIA) tests whether a specific target data point m-in our case, a molecular structure x with the corresponding label y-was part of the training data for a target neural network model \(f_{\theta }\). In this attack, shadow models \(s_i, \; i = 1, \dots , N\) are trained on data drawn from a distribution similar to that of \(f_{\theta }\)’s training data (in our case, a similar chemical space). Some shadow models include m in their training data, while others do not. The probability of m is approximated by averaging the correct class probability over all shadow models. Similarly, the probability of m given \(f_{\theta }\) is approximated as the probability of the correct class assignment by model \(f_{\theta }\). The ratio between these probabilities is then calculated and compared to the ratios obtained for other points z. The final score is the proportion of points z for which the ratio is at least \(\gamma\) times higher for data point m. This score, combined with a decision threshold t, determines whether m is predicted to have been part of \(f_{\theta }\)’s training data.
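A per-sample sketch of this score is shown below, assuming the correct-class probabilities have already been collected from the target model and the shadow models for both m and the reference points z; it is an illustration under those assumptions, not our released implementation.

```python
import numpy as np

def rmia_score(p_m_target, p_m_shadows, p_z_target, p_z_shadows, gamma=2.0):
    """Per-sample RMIA score (sketch of Algorithm 3).

    `p_m_target` is the target model's probability for m's correct class,
    `p_m_shadows` the same probability under each shadow model, and the
    `p_z_*` arguments the analogous quantities for the reference points z
    (`p_z_shadows` has shape (n_z, n_shadows)).
    """
    # P(m) and P(z) are approximated by averaging over the shadow models.
    ratio_m = p_m_target / np.mean(np.asarray(p_m_shadows))
    ratio_z = np.asarray(p_z_target) / np.mean(np.asarray(p_z_shadows), axis=1)

    # Fraction of reference points z that m dominates by at least gamma.
    return float(np.mean(ratio_m / ratio_z >= gamma))
```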
We evaluated the success of our attacks by determining the true positive rates (TPRs) for identifying training data molecules at different false positive rates (FPRs). We focused our evaluation on low FPRs, as was recommended by Carlini et al. [11] and is discussed in their paper in more detail. In both of our attacks, we give the adversary a training data sample with a probability of 0.67 and a non-training data sample with a probability of 0.33. The reason for this is that we did not want the training datasets for the models to become too small, while still using all the data points for the attack. This approach allowed us to use 45% of the dataset size as training data for the target model. With a membership probability of 0.67, the baseline TPR at an FPR of 0 is \(\frac{2}{N}\), where N is the size of the training dataset. A detailed derivation of this baseline is provided in the Supplementary information. To determine if the attacks leak training data information, we compared the TPRs of our attacks to the baseline TPR of \(\frac{2}{N}\). We tested for significance using Wilcoxon signed-rank tests over the 20 repetitions of each experiment. We repeated this experiment with the TPR at an FPR of \(10^{-3}\) to see if we observe similar trends. We also investigated the ROC curves for identifying training data molecules to see the trends at all possible FPRs.
Leaked molecule analysis
We investigated whether our two membership inference attacks identify the same molecules or can be used complementarily to gain more information about the training data. To do this, we analysed the overlap between the identified molecules from each attack. In our setting, we have the training dataset \(\Omega\), from which we identify two subsets, \(A \subseteq \Omega\) and \(B \subseteq \Omega\), each corresponding to one attack. These subsets can have different sizes and can overlap. We define the percentage of the maximum possible overlap as

$$\text{overlap}(A, B) = \frac{|A \cap B|}{\min (|A|, |B|)}$$
This scalar value ranges from 0 to 1, where 1 indicates that the larger subset contains all molecules of the smaller subset. We examined the percentage of maximum possible overlap between the two membership inference attacks for every dataset and representation. For each combination, we plotted the distribution of the 20 experiment repetitions. To determine whether the overlap is significantly different from what would occur by chance when drawing two uncorrelated subsets, we calculated the difference between the expected overlap by chance and the observed overlap for each of the 20 experiment repetitions. We used a Wilcoxon signed-rank test to assess whether this difference is significantly different from 0. The overlap by chance can be modelled as a random variable following a hypergeometric distribution: when we independently draw the smaller subset of size n, we draw without replacement from the training dataset \(\Omega\) of size \(N = |\Omega |\), which contains the larger subset of size K as possible successes:

$$P(|A \cap B| = k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}$$

The mean of the hypergeometric distribution is \(\mathbb {E}[|A \cap B|] = \frac{nK}{N}\), which is the overlap expected by chance. We also calculated the overlap between the identified molecules from models trained on different molecular representations of the same training data.
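The sketch below illustrates this comparison for a single experiment repetition; `identified_a` and `identified_b` stand for the sets of molecules identified by the two attacks, and the signed-rank test is then applied across repetitions.

```python
from scipy.stats import wilcoxon

def overlap_excess(identified_a, identified_b, n_training):
    """Observed minus expected-by-chance overlap for one repetition (sketch).

    `identified_a` and `identified_b` are sets of training molecules
    identified by the two attacks; the expected chance overlap is the
    hypergeometric mean nK/N derived above.
    """
    small, large = sorted((identified_a, identified_b), key=len)
    observed = len(small & large)
    expected = len(small) * len(large) / n_training  # E[|A intersect B|] = nK/N
    return observed - expected

def overlap_significance(excesses):
    """Wilcoxon signed-rank test over the per-repetition excesses, as in the paper."""
    return wilcoxon(excesses)
```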
In addition, we investigated characteristics of the molecules that could be identified. To do this, we compared the distributions of property labels of identified molecules with the underlying property label distribution. For each of the 20 experiment repetitions, we calculated the percentage of positive compounds for both identified and not identified molecules. We then used Mann–Whitney U tests to analyse whether the two distributions differed significantly. We also calculated the TPRs of the minority class in the same way as before, but considering only the training data molecules of the minority class. We also assessed whether the identified molecules differed in size compared to the rest of the training dataset. To do this, we calculated the number of atoms in each molecule and pooled the counts across all 20 experiment repetitions for both identified and not identified molecules. We compared the distributions of molecule sizes and determined significance using Mann–Whitney U tests.
To investigate whether molecules with low similarity to the rest of the training data are easier to identify, we computed pairwise Tanimoto similarity scores for all molecules in the training set. Specifically, we generated ECFP fingerprints with a radius of 2 and a size of 2048, using RDKit for both fingerprint computation and Tanimoto similarity calculations [32]. For each molecule, we determined the highest similarity to any other molecule in the training set (nearest neighbour similarity), as well as the average similarity across all training molecules. This approach allowed us to account for individual outliers as well as local clusters distant from the majority of the training data. To analyse the highest similarity values, we divided the range from 0 to 1 into ten equal-width bins, each covering a 0.1 increment, and then calculated the fraction of molecules identified at an FPR of 0 within each bin. In contrast, for average similarity, where values were more closely distributed, we used quantile binning to divide the data into deciles, ensuring that each bin contained an equal number of samples. We then calculated the fraction of molecules identified at an FPR of 0 within each bin. In addition, we performed one-tailed Mann–Whitney U tests to determine whether the similarity values of identified molecules were significantly lower than those of non-identified molecules, considering both the highest and average similarity metrics. Our analysis was performed on all molecular representations for the BBB and Ames datasets. However, due to computational constraints, we were unable to perform the same analysis for the DEL and hERG datasets, as computing all pairwise similarity values would have been infeasible given the quadratic growth in the number of comparisons with dataset size.
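A sketch of this similarity computation is given below; the double loop over all pairs is exactly the quadratic cost that made the analysis infeasible for the larger DEL and hERG datasets.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def similarity_profile(smiles_list):
    """Nearest-neighbour and average Tanimoto similarity of every molecule
    to the rest of the training set (sketch; O(n^2) pairwise comparisons)."""
    fps = [
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
        for s in smiles_list
    ]
    nearest, average = [], []
    for i, fp in enumerate(fps):
        others = fps[:i] + fps[i + 1:]  # all other training molecules
        sims = DataStructs.BulkTanimotoSimilarity(fp, others)
        nearest.append(max(sims))              # nearest-neighbour similarity
        average.append(float(np.mean(sims)))   # average similarity
    return nearest, average
```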
Data availability
All data and code are hosted on GitHub at https://github.com/FabianKruger/molprivacy and can be installed via pip.
Abbreviations
- AI: Artificial intelligence
- LiRA: Likelihood ratio attack
- RMIA: Robust membership inference attack
- BBB: Blood-brain barrier
- DEL: DNA-encoded library
- hERG: Human ether-à-go-go-related gene
- AUROC: Area under the receiver operating characteristic curve
- TPR: True positive rate
- FPR: False positive rate
- SMILES: Simplified molecular input line entry system
- ECFP: Extended-connectivity fingerprints
- MACCS: Molecular access system
- RDKitFP: RDKit fingerprints
- MLP: Multi-layer perceptron
- Adam: Adaptive moment estimation
- AdamW: Adaptive moment estimation with decoupled weight decay regularization
References
Chen Hongming, Engkvist Ola, Wang Yinhai, Olivecrona Marcus, Blaschke Thomas (2018) The rise of deep learning in drug discovery. Drug Discov Today 23(6):1241–1250
Muratov Eugene N, Bajorath Jürgen, Sheridan Robert P, Tetko Igor V, Filimonov Dmitry, Poroikov Vladimir, Oprea Tudor I, Baskin Igor I, Varnek Alexandre, Roitberg Adrian et al (2020) QSAR without borders. Chem Soc Rev 49(11):3525–3564
Dara Suresh, Dhamercherla Swetha, Jadav Surender Singh, Madhu Babu CH, Ahsan Mohamed Jawed (2022) Machine learning in drug discovery: a review. Artif Intell Rev 55(3):1947–1999
Vamathevan Jessica, Clark Dominic, Czodrowski Paul, Dunham Ian, Ferran Edgardo, Lee George, Li Bin, Madabhushi Anant, Shah Parantu, Spitzer Michaela et al (2019) Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 18(6):463–477
Oldenhof Martijn, Ács Gergely, Pejó Balázs, Schuffenhauer Ansgar, Holway Nicholas, Sturm Noé, Dieckmann Arne, Fortmeier Oliver, Boniface Eric, Mayer Clément et al (2023) Industry-scale orchestrated federated learning for drug discovery. Proc AAAI Conf Artif Intell 37:15576–15584
Zuckerberg Mark (2024) Open-source ai is the path forward. https://about.fb.com/news/2024/07/open-source-ai-is-the-path-forward/. Accessed 25-09-2025
Shrestha Yash Raj, von Krogh Georg, Feuerriegel Stefan (2023) Building open-source ai. Nat Comput Sci 3(11):908–911
Murdoch Blake (2021) Privacy and artificial intelligence: challenges for protecting health information in a new era. BMC Med Ethics 22:1–5
Shokri Reza, Stronati Marco, Song Congzheng, Shmatikov Vitaly (2017) Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), pages 3–18. IEEE
Murakonda Sasi Kumar, Shokri Reza (2020) Ml privacy meter: Aiding regulatory compliance by quantifying the privacy risks of machine learning. arXiv preprint arXiv:2007.09339
Carlini Nicholas, Chien Steve, Nasr Milad, Song Shuang, Terzis Andreas, Tramer Florian (2022) Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP), pages 1897–1914. IEEE
Salem Ahmed, Cherubin Giovanni, Evans David, Köpf Boris, Paverd Andrew, Suri Anshuman, Tople Shruti, Zanella-Béguelin Santiago (2023) Sok: Let the privacy games begin! a unified treatment of data inference privacy in machine learning. In 2023 IEEE Symposium on Security and Privacy (SP), pages 327–345. IEEE
Hu Hongsheng, Salcic Zoran, Sun Lichao, Dobbie Gillian, Yu Philip S, Zhang Xuyun (2022) Membership inference attacks on machine learning: A survey. ACM Comput Surv (CSUR) 54(11):1–37
Zarifzadeh Sajjad, Liu Philippe, Shokri Reza (2024) Low-cost high-power membership inference attacks. In Forty-first International Conference on Machine Learning
Pejo Balazs, Remeli Mina, Arany Adam, Galtier Mathieu, Acs Gergely (2022) Collaborative drug discovery: Inference-level data protection perspective. arXiv preprint arXiv:2205.06506
Bergstra James, Bardenet Rémi, Bengio Yoshua, Kégl Balázs (2011) Algorithms for hyper-parameter optimization. Adv Neural Inf Process Syst 24
Martins Ines Filipa, Teixeira Ana L, Pinheiro Luis, Falcao Andre O (2012) A bayesian approach to in silico blood-brain barrier penetration modeling. J Chem Inf Model 52(6):1686–1697
Hansen Katja, Mika Sebastian, Schroeter Timon, Sutter Andreas, Ter Laak Antonius, Steger-Hartmann Thomas, Heinrich Nikolaus, Muller Klaus-Robert (2009) Benchmark data set for in silico prediction of ames mutagenicity. J Chem Inf Model 49(9):2077–2081
Xu Congying, Cheng Feixiong, Chen Lei, Du Zheng, Li Weihua, Liu Guixia, Lee Philip W, Tang Yun (2012) In silico prediction of chemical ames mutagenicity. J Chem Inf Model 52(11):2840–2847
Lim Katherine S, Reidenbach Andrew G, Hua Bruce K, Mason Jeremy W, Gerry Christopher J, Clemons Paul A, Coley Connor W (2022) Machine learning on dna-encoded library count data using an uncertainty-aware probabilistic loss function. J Chem Inf Model 62(10):2316–2331
Du Fang, Yu Haibo, Zou Beiyan, Babcock Joseph, Long Shunyou, Li Min (2011) Hergcentral: a large database to store, retrieve, and analyze compound-human ether-a-go-go related gene channel interactions to facilitate cardiotoxicity assessment in drug development. Assay Drug Dev Technol 9(6):580–588
Huang Kexin, Tianfan Fu, Gao Wenhao, Zhao Yue, Roohani Yusuf, Leskovec Jure, Coley Connor W, Xiao Cao, Sun Jimeng, Zitnik Marinka (2021) Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. arXiv preprint arXiv:2102.09548
Wu Bang, Yang Xiangwen, Pan Shirui, Yuan Xingliang (2021) Adapting membership inference attacks to gnn for graph classification: Approaches and implications. In 2021 IEEE International Conference on Data Mining (ICDM), pages 1421–1426. IEEE
Ye Jiayuan, Maddi Aadyaa, Murakonda Sasi Kumar, Bindschaedler Vincent, Shokri Reza (2022) Enhanced membership inference attacks against machine learning models. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 3093–3106
Jain Prateek, Kulkarni Vivek, Thakurta Abhradeep, Williams Oliver (2015) To drop or not to drop: Robustness, consistency and differential privacy properties of dropout. arXiv preprint arXiv:1503.02031
Satz Alexander L, Brunschweiger Andreas, Flanagan Mark E, Gloger Andreas, Hansen Nils JV, Kuai Letian, Kunig Verena BK, Xiaojie Lu, Madsen Daniel, Marcaurelle Lisa A et al (2022) DNA-encoded chemical libraries. Nature Rev Methods Primers 2(1):3
Zheng Wei, Thorne Natasha, McKew John C (2013) Phenotypic screens as a renewed approach for drug discovery. Drug Discov Today 18(21):1067–1073
Feldman Vitaly, Zhang Chiyuan (2020) What neural networks memorize and why: discovering the long tail via influence estimation. Adv Neural Inf Process Syst 33:2881–2891
Weininger David (1988) Smiles, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31–36
Rogers David, Hahn Mathew (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742–754
Durant Joseph L, Leland Burton A, Henry Douglas R, Nourse James G (2002) Reoptimization of mdl keys for use in drug discovery. J Chem Inf Comput Sci 42(6):1273–1280
Landrum Greg, Tosco Paolo, Kelley Brian, Rodriguez Ricardo, Cosgrove David, Vianello Riccardo et al (2024) rdkit/rdkit: 2024_09_1 (Q3 2024) release. https://www.rdkit.org
Yang Kevin, Swanson Kyle, Jin Wengong, Coley Connor, Eiden Philipp, Gao Hua, Guzman-Perez Angel, Hopper Timothy, Kelley Brian, Mathea Miriam et al (2019) Analyzing learned molecular representations for property prediction. J Chem Inf Model 59(8):3370–3388
Karpov Pavel, Godin Guillaume, Tetko Igor V (2020) Transformer-CNN: Swiss knife for QSAR modeling and interpretation. J Cheminform 12:1–12
Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca et al (2019) Pytorch: An imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32
Zdrazil Barbara, Felix Eloy, Hunter Fiona, Manners Emma J, Blackshaw James, Corbett Sybilla, de Veij Marleen, Ioannidis Harris, Mendez David, Mosquera Juan F et al (2024) The chembl database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res 52(D1):D1180–D1192
Akiba Takuya, Sano Shotaro, Yanase Toshihiko, Ohta Takeru, Koyama Masanori (2019) Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2623–2631
Loshchilov Ilya, Hutter Frank (2017) Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101
Kingma Diederik P, Ba Jimmy (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Acknowledgements
Not applicable
Funding
This study was partially funded by the Horizon Europe funding programme under the Marie Skłodowska-Curie Actions Doctoral Networks grant agreement “Explainable AI for Molecules - AiChemist”, no. 101120466. The work of Johan Östman was funded by Vinnova, the Swedish innovation agency, under grant 2023-03000.
Author information
Contributions
F.K. conducted the primary research, including conceptualization, experimentation, and analysis, and drafted the manuscript. J.Ö., L.M., I.T., and O.E. contributed to the conceptualization of the study and provided critical feedback on the manuscript. J.Ö. additionally supplied code for the membership inference attacks, while L.M. provided data for the study.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Krüger, F.P., Östman, J., Mervin, L. et al. Publishing neural networks in drug discovery might compromise training data privacy. J Cheminform 17, 38 (2025). https://doi.org/10.1186/s13321-025-00982-w
DOI: https://doi.org/10.1186/s13321-025-00982-w