ADMET evaluation in drug discovery: 21. Application and industrial validation of machine learning algorithms for Caco-2 permeability prediction
Journal of Cheminformatics volume 17, Article number: 3 (2025)
Abstract
The Caco-2 cell model has been widely used to assess the intestinal permeability of drug candidates in vitro, owing to its morphological and functional similarity to human enterocytes. While Caco-2 cell assay is considered safe and cost-effective, it is also characterized by being time-consuming. Therefore, computational models that achieve high accuracies in predicting Caco-2 permeability are crucial for enhancing the efficiency of oral drug development. In this study, we conducted an in-depth analysis of the characteristics of an augmented Caco-2 permeability dataset, and evaluated a diverse range of machine learning algorithms in combination with different molecular representations. The results indicated that XGBoost generally provided better predictions than comparable models for the test sets. In addition, we investigated the transferability of machine learning models trained on publicly available data to internal pharmaceutical industry datasets. Our findings, based on the Shanghai Qilu’s in-house dataset, showed that the boosting models retained a degree of predictive efficacy when applied to industry data. Furthermore, Y-randomization test and applicability domain analysis were employed to assess the robustness and generalizability of these models. Matched Molecular Pair Analysis (MMPA) was utilized to extract chemical transformation rules. We believe that the model developed in this study could represent a reliable tool for assessing Caco-2 permeability during early-stage drug discovery and the chemical transformation rules derived here could provide insights for optimizing Caco-2 permeability.
Scientific contribution
A comprehensive validation of various machine learning algorithms combined with diverse molecular representations on a large dataset for predicting Caco-2 permeability was reported. The transferability of machine learning models trained on publicly available data to internal pharmaceutical industry datasets was also investigated. Matched molecular pair analysis was carried out to provide reasonable suggestions for researchers to improve the Caco-2 permeability of compounds.
Introduction
The oral route is often preferred for drug administration because of its convenience, cost-effectiveness, and high patient adherence [1]. However, about 10% of drugs fail in development due to poor pharmacokinetic properties [2]. One of the foremost challenges faced by orally administered drugs is the traversal across the intestinal epithelial barrier, which determines the rate and extent of absorption in humans, thereby critically influencing their bioavailability [3]. Furthermore, the permeability process itself is intricate. Transport across the intestinal epithelial cell barrier may occur through one or more of four different routes: paracellular and transcellular passive diffusion, the carrier-mediated route, and transcytosis [4]. Therefore, it is paramount to accurately reconstitute the differentiated human epithelial cell monolayer in vitro, as this enables precise prediction of oral drug absorption in humans.
Several in vitro models have been developed to replicate the pertinent characteristics of in vivo absorption, such as the Parallel Artificial Membrane Permeability Assay (PAMPA) [5], the human colon adenocarcinoma (Caco-2) cell line [6], the Madin–Darby canine kidney (MDCK) cell line [7], and the porcine kidney epithelial cell line (LLC-PK1) [8]. Among these models, Caco-2 cell monolayers have emerged as the “gold standard” for assessing drug permeability because they closely mimic the human intestinal epithelium [9]. Indeed, this in vitro model has been endorsed by the US Food and Drug Administration (FDA) for assessing the permeability of compounds categorized under the Biopharmaceutics Classification System (BCS) [10].
Nevertheless, high-throughput screening (HTS) with the traditional Caco-2 cell model poses challenges due to its extended culturing period (7–21 days), which is necessary for full differentiation into an enterocyte-like phenotype [11, 12]. The extended cultivation not only increases the risk of contamination but also imposes significant costs, making it less desirable in the context of drug discovery. Furthermore, there is a growing need to apply selection filters prior to designing and acquiring large compound libraries, thereby shifting the focus towards in silico predictions of ADME-related properties.
Numerous machine learning models have been developed for both classification and regression analysis of Caco-2 permeability based on a variety of molecular representations, such as fingerprints, descriptors, and embeddings extracted from neural network models [13,14,15,16,17,18,19]. Notably, there are fewer classification models than regression models for predicting Caco-2 cell permeability, primarily because there is no consensus on the optimal cut-off for modeling and practical application. This study therefore also focuses on the regression task. Prior research on Caco-2 permeability prediction has varied considerably in dataset size, descriptor types, and modeling methods. For instance, Wang et al. constructed different models based on MOE 2D/3D descriptors to predict Papp values using a structurally diverse data set of 1272 compounds [13]. Among the methods compared, the Boosting model achieved the best results, with an R2 value of 0.81 and an RMSE of 0.31 for the test set. Additionally, Wang and colleagues proposed the MESN deep learning model, which uses a curated Caco-2 dataset of 4464 compounds to predict features associated with oral bioavailability [15]. Three molecular embedding approaches were employed in the MESN model, namely Morgan fingerprints, SMILES-based embedding, and molecular-graph-based embedding. For membrane permeability prediction, the MESN model obtained an MAE of 0.410 and an RMSE of 0.545. More recently, Gabriela et al. introduced a QSPR approach developed on the KNIME analytical platform, based on a structurally diverse dataset of over 4900 molecules [16]. They employed random forest supervised recursive algorithms for data cleaning and feature selection. The final model was a conditional consensus model composed of individual regression random forests, achieving RMSE values ranging from 0.43 to 0.51 and R2 values between 0.57 and 0.61 for all validation sets.
Although progress has been made in predicting Caco-2 permeability, several issues still require discussion. Firstly, Caco-2 permeability is a highly complex process that can occur through various nonlinear routes. The characteristics of the Caco-2 permeability data have not been discussed, and the scarcity of high-quality Caco-2 permeability data has impeded the development of accurate models with a wide applicability domain. Secondly, the performance of deep learning on Caco-2 permeability data remains an open question. Thirdly, the effectiveness of machine learning models trained on publicly accessible databases in industrial settings remains unclear.
With these concerns in mind, we report a comprehensive validation of various machine learning algorithms combined with diverse molecular representations for predicting Caco-2 permeability. Initially, a larger Caco-2 permeability dataset was collected, and the structural diversity of the compounds was analyzed. Four machine learning methods (XGBoost, RF, GBM, and SVM) and two deep learning models (DMPNN and CombinedNet) were used to build prediction models. For each algorithm, the optimal model was selected for further analysis. Next, we conducted a direct comparison of the different in silico predictors through several model validation methods, including the Y-randomization test and applicability domain (AD) analysis. Additionally, we assessed the performance of the models trained on public data using the Shanghai Qilu’s in-house dataset. Lastly, based on the results of matched molecular pair analysis (MMPA), we provided reasonable suggestions for researchers to improve the Caco-2 permeability of compounds. We anticipate that our findings will shed light on future method development and offer insights for optimizing Caco-2 permeability.
Methods
Data collection and preparation
Experimental values of Caco-2 permeability were obtained from three publicly available datasets [13,14,15]. The first dataset, reported by Wang et al., comprised 1272 compounds used for developing predictive models and an additional 298 compounds for validation [13]. Wang and Cheng collected the second dataset of 1827 compounds to develop QSPR models using neural networks and other machine learning methods [14]. The third dataset, consisting of 4464 compounds, was utilized by Wang et al. for modeling Caco-2 permeability with neural networks [15]. These datasets were combined into an initial dataset of 7861 compounds for subsequent analysis. To ensure data consistency and minimize uncertainty, the following procedures were applied: (1) Permeability measurements were converted to units of 10⁻⁶ cm/s and transformed logarithmically (base 10) for modeling. (2) Entries with missing permeability values were excluded. (3) Mean values and standard deviations were calculated for duplicate entries. Only entries with a standard deviation ≤ 0.3 were retained, and the mean values of these compounds were used as standard values for model training. (4) The RDKit module MolStandardize was employed for molecular standardization to achieve consistent canonical tautomer states and final neutral forms, while preserving stereochemistry. After curation, a dataset comprising 5654 non-redundant Caco-2 permeability records was compiled. Subsequently, these records were randomly divided into training, validation, and test sets in an 8:1:1 ratio, ensuring an identical distribution across the sets. To make the model evaluation robust against data partitioning variability, the dataset was split 10 times using different random seeds. Each model was then assessed based on its average performance across these ten independent runs.
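The curation and splitting steps can be illustrated with a minimal sketch in plain Python. It is not the authors' pipeline: the record layout, the function names `curate` and `split_8_1_1`, the choice of input units (cm/s, so that the log values fall on the scale reported for logPapp), the use of the sample standard deviation on the log scale, and the seeds 0 to 9 are all assumptions made for illustration. Structure standardization with RDKit is omitted.

```python
import math
import random
import statistics
from collections import defaultdict

def curate(records, max_sd=0.3):
    """Aggregate duplicate measurements per structure key.

    records: iterable of (structure_key, papp_in_cm_per_s or None).
    Returns {structure_key: mean log10(Papp)} for the retained entries.
    """
    grouped = defaultdict(list)
    for key, papp in records:
        if papp is None or papp <= 0:          # drop missing or non-loggable values
            continue
        grouped[key].append(math.log10(papp))  # base-10 log transform
    curated = {}
    for key, values in grouped.items():
        # Assumption: sample SD on the log scale; a single measurement has SD 0.
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        if sd <= max_sd:                       # keep only consistent duplicates
            curated[key] = statistics.fmean(values)
    return curated

def split_8_1_1(keys, seed):
    """Random 8:1:1 split into training, validation and test keys."""
    keys = sorted(keys)                        # fixed starting order for reproducibility
    random.Random(seed).shuffle(keys)
    n_train = len(keys) * 8 // 10
    n_valid = len(keys) // 10
    return keys[:n_train], keys[n_train:n_train + n_valid], keys[n_train + n_valid:]

# Hypothetical records, for illustration only.
records = [
    ("mol_A", 1.0e-6), ("mol_A", 1.2e-6),      # consistent duplicates: averaged
    ("mol_B", 1.0e-5), ("mol_B", 1.0e-7),      # two log units apart: dropped
    ("mol_C", None),                           # missing value: dropped
    ("mol_D", 2.0e-5),
]
curated = curate(records)

# Ten independent splits with different seeds, as described in the text.
splits = [split_8_1_1(range(100), seed) for seed in range(10)]
```

Model performance would then be averaged over the ten splits rather than read from a single partition.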
According to the OECD principles, both internal and external validations are essential to assess model reliability and predictive capability [20, 21]. An additional set of 67 compounds from the Shanghai Qilu’s in-house collection was therefore included as an external validation set to test how well models trained on the public data perform on private data.
Molecular representations
To incorporate comprehensive chemical information at both global and local levels, three types of molecular representation methods were employed to depict the structural features of molecules. Specifically, the molecular representations utilized were as follows:
1. Morgan fingerprints with a radius of 2 and 1024 bits: The RDKit implementation of Morgan fingerprints was utilized [22].
2. RDKit2D descriptors: Normalized descriptors from descriptastorus (https://github.com/bp-kelley/descriptastorus), which wraps the RDKit implementation and normalizes values using a cumulative density function from Novartis’ compound catalog.
3. Molecular graphs: For the message-passing neural network, molecular graphs \(G=(V,E)\) served as the foundational representation, where \(V\) represents atoms (nodes) and \(E\) represents bonds (edges). This approach was implemented using the open-source ChemProp package to enhance model performance and effectively capture nuanced molecular features [23].
Furthermore, a combination of Morgan fingerprints and RDKit2D normalized descriptors was employed to train all methods, with the exception of DMPNN and CombinedNet, which utilized molecular graphs. Specifically, CombinedNet employed a hybrid approach that combined Morgan fingerprints and molecular graphs: the former providing information on substructure existence and the latter conveying connectivity knowledge.
Model construction and evaluation
In this study, we conducted a comprehensive comparison of multiple machine learning and deep learning algorithms for quantitative prediction of Caco-2 permeability. The evaluated algorithms included Random Forest (RF) [24], extreme gradient boosting (XGBoost) [25], Support Vector Machine (SVM) [26], Gradient Boosting Machine (GBM) [27], Directed Message Passing Neural Network (DMPNN) [28], and CombinedNet (Figure S1) [29]. The first four methods were implemented using the scikit-learn package, while the latter two deep learning techniques were implemented using the ChemProp package [23, 30]. Hyperparameters for all algorithms, except DMPNN and CombinedNet, were optimized through fivefold cross-validation and grid search. The DMPNN algorithm employed the Adam optimizer for gradient descent optimization, while CombinedNet utilized the LAMB optimizer with a mean squared error (MSE) loss function [31]. Hyperparameters of the neural networks were tuned using Bayesian optimization, and an early stopping strategy was applied to prevent overfitting. Further details about each algorithm can be found in Table S1.
To ensure robust generalization ability of the QSPR models in predicting Caco-2 permeability for new chemical entities, both fivefold cross-validation and independent test sets were employed. The performance of different machine learning models on the Caco-2 permeability datasets was evaluated utilizing the following metrics: Coefficient of determination (R2), Mean absolute error (MAE), Root mean squared error (RMSE), Pearson correlation coefficient (Pearson’s r), and Spearman’s rank correlation coefficient (Spearman’s rho).
Definition of applicability domain
Owing to the limitations inherent in the chemical space represented by the training set, each machine learning model is predisposed towards predicting certain types of compounds [32]. Predictions are therefore likely to be more reliable when the compounds being evaluated fall within the applicability domain of the model. In this study, we adopted the Euclidean distance-based method for applicability domain analysis [33]. This method introduces a distance cutoff value \({D}_{c}\) that defines a similarity threshold for external compounds. Morgan fingerprints were used to calculate the similarity between molecules. The cutoff is defined as follows:
\[{D}_{c}=\overline{y }+Z\sigma\]
where \(\overline{y }\) represents the average Euclidean distance of the k nearest neighbors of each compound in the training set, \(\sigma\) is the corresponding standard deviation of the Euclidean distances, and \(Z\) is an optional parameter to control the significance level. If the distance from an external compound to any of its k nearest neighbors in the training set is above the distance cutoff \({D}_{c}\), the compound is considered to be outside the domain; otherwise, it is considered to be within the domain.
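A distance-based domain check of this kind can be sketched as follows. The sketch assumes that the cutoff takes the commonly used form of mean plus Z times the standard deviation of the pooled k-nearest-neighbor distances within the training set, and that the population standard deviation is used; both are assumptions, as are the function names and the toy vectors, which stand in for fingerprint vectors.

```python
import math
import statistics

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_distances(query, reference, k):
    """Distances from query to its k nearest reference vectors, ascending."""
    return sorted(euclidean(query, r) for r in reference)[:k]

def distance_cutoff(train, k, z):
    """Assumed form: D_c = mean + z * sd of the pooled kNN distances in the training set."""
    pooled = []
    for i, vec in enumerate(train):
        others = train[:i] + train[i + 1:]      # leave the compound itself out
        pooled.extend(knn_distances(vec, others, k))
    return statistics.fmean(pooled) + z * statistics.pstdev(pooled)

def in_domain(query, train, k, cutoff):
    """Outside the domain if any of the k nearest-neighbor distances exceeds the cutoff."""
    return all(d <= cutoff for d in knn_distances(query, train, k))

# Toy two-dimensional vectors (hypothetical); train must be a list.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cutoff = distance_cutoff(train, k=2, z=0.5)
near = in_domain((0.5, 0.5), train, k=2, cutoff=cutoff)
far = in_domain((5.0, 5.0), train, k=2, cutoff=cutoff)
```

Under this form, a larger Z widens the domain, so fewer external compounds are flagged as out of domain.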
Matched molecular pairs (MMP) analysis
Matched molecular pair (MMP) analysis elucidates the impact of specific chemical transformations on activity or property changes within a molecule [34]. This approach has become widely adopted for optimizing both biological activity and ADMET properties [35, 36]. To gain deeper insights into how minor chemical modifications influence Caco-2 permeability and to guide ligand optimization, we constructed an MMP database. This database incorporates transformation rules linked to shifts in logPapp derived from our curated Caco-2 permeability dataset. To create this MMP database, we used the open-source mmpdb package [37]. The process involved two main stages: fragmentation and indexing. During the fragmentation phase, we permitted up to three cuts per molecule, with constraints on the maximum number of heavy atoms (100) and rotatable bonds (10). Chirality was preserved when cutting bonds. In the subsequent indexing phase, variable fragments were defined with a minimum of 2 and a maximum of 15 heavy atoms. The maximum radius for indexing was set to 3. The variable fragment was limited to no more than half of the heavy atoms of the molecule. Default values were applied for the other parameters as per the mmpdb package settings. We included only the transformations that were supported by at least ten matched pairs and showed a significant property shift with a p-value < 0.05. This filtering is intended to retain only transformations whose effect on Caco-2 permeability is supported by sufficient evidence.
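The aggregation behind such transformation rules can be sketched in plain Python. This is not the mmpdb implementation: the input layout, the function name `summarize_transforms`, and the transformation strings are hypothetical, and only the minimum-pair-count filter is applied. The significance test (p-value < 0.05) is left out because it requires a t-distribution that the standard library does not provide.

```python
import statistics
from collections import defaultdict

def summarize_transforms(pairs, min_pairs=10):
    """Summarize logPapp shifts per transformation.

    pairs: iterable of (transform, logPapp_before, logPapp_after).
    Returns {transform: {"nPairs", "mean_delta", "std_delta"}} for transformations
    supported by at least min_pairs matched pairs.
    """
    deltas = defaultdict(list)
    for transform, before, after in pairs:
        deltas[transform].append(after - before)
    summary = {}
    for transform, values in deltas.items():
        if len(values) < min_pairs:            # too few matched pairs: skip
            continue
        summary[transform] = {
            "nPairs": len(values),
            "mean_delta": statistics.fmean(values),
            "std_delta": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary

# Hypothetical matched pairs, for illustration only.
pairs = [("[*:1][H]>>[*:1]C", -6.0, -5.5)] * 12 + [("[*:1]C>>[*:1]O", -5.0, -5.5)] * 3
rules = summarize_transforms(pairs)
```

Only the first transformation survives the filter here, because the second is supported by just three pairs.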
Results and discussion
Data set analysis
We collected data from public databases and previous studies, resulting in 7861 compounds being included in this research. After data preparation, a total of 5654 structurally diverse compounds were obtained. The distribution of the logPapp values is shown in Fig. 1a. The logPapp values ranged from −8.778 to −3.510, with a mean of −5.340. The permeabilities were then categorized as follows: compounds with a permeability below 1 × 10⁻⁶ cm/s are considered to have low permeability, those between 1 × 10⁻⁶ and 7 × 10⁻⁶ cm/s are classified as medium permeability, and compounds with a permeability greater than 7 × 10⁻⁶ cm/s are regarded as highly permeable. The curated dataset is highly unbalanced and contains more highly permeable compounds (n = 2746) than medium (n = 1747) and low (n = 1161) permeable compounds. Figure 1b shows that most structural clusters are quite mixed in terms of permeability classes. However, these clusters are not equally representative of all permeability classes, as many are predominantly composed of highly permeable (green) compounds. Such a biased chemical space substantially complicates the accurate prediction of Caco-2 permeability. To further explore the chemical space of the Caco-2 data set, the Tanimoto similarity based on the ECFP4 fingerprints was calculated and the Murcko scaffolds were analyzed. The Murcko scaffolds of the total data set were extracted with the RDKit package by removing side chain substituents but retaining the linkers and ring systems. The overall color of the Tanimoto similarity heat map was light blue, with an average similarity of 0.5 (Fig. 1c), indicating the structural diversity of the data set. Additionally, we detected 2979 different Murcko scaffolds in the data set, corresponding to an average of ~1.9 molecules per scaffold. The most common scaffolds in the data set are shown in Fig. 1d along with their frequency.
The six-membered ring, which is common in many drug compounds, was the most frequent scaffold, and molecules with polycyclic scaffolds may be the focus of Caco-2 permeability studies. In summary, the above analysis demonstrated the structural diversity and biased distribution of the current Caco-2 permeability data set.
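On the log scale, the class boundaries of 1 × 10⁻⁶ and 7 × 10⁻⁶ cm/s correspond to logPapp values of −6 and about −5.155. A minimal sketch of the binning, with a hypothetical function name and the assumption that logPapp is the base-10 logarithm of Papp in cm/s:

```python
import math

LOW_CUT = -6.0                        # log10(1e-6 cm/s)
HIGH_CUT = math.log10(7.0) - 6.0      # log10(7e-6 cm/s), about -5.155

def permeability_class(log_papp):
    """Assign a permeability class from logPapp (log10 of Papp in cm/s)."""
    if log_papp < LOW_CUT:
        return "low"
    if log_papp <= HIGH_CUT:
        return "medium"
    return "high"

# Counts reported in the text.
class_counts = {"high": 2746, "medium": 1747, "low": 1161}
n_compounds = sum(class_counts.values())
molecules_per_scaffold = n_compounds / 2979
```

With these boundaries, the reported mean logPapp of −5.340 falls in the medium class, although nearly half of the compounds (2746 of 5654) are in the high class.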
Chemical space and scaffold analysis of the curated Caco-2 permeability data set. A The experimental logPapp value distribution, showing that most of the compounds belong to the highly permeable class (n = 2746), while the fewest structures are categorized as low permeable. B t-SNE plot of the curated Caco-2 permeability data set (green: highly permeable compounds; blue: medium permeable; orange: low permeable). C Heat map of Tanimoto similarity based on the ECFP4 fingerprints of the total data set. D Frequency of the Murcko scaffolds in the data set
Performance of Caco-2 permeability prediction models
First, we evaluated the ability of machine learning approaches to predict Caco-2 permeability. To improve the representation of the chemical and structural features of molecules for classical machine learning and DNN models, various combinations of molecular descriptors and fingerprints were considered. The detailed statistical results for all models across the training, validation, and test sets are summarized in Table S2. Performance was quantified using the R2 and RMSE metrics on the test set molecules. Notably, large differences in predictive performance were observed among the data sets, with R2 values ranging from ~0.4 to ~0.65 (Fig. 2a). This trend was also evident in the RMSE values, which ranged from ~0.5 to ~0.6 log units (Fig. 2b). The performance variations were primarily attributed to the chosen molecular descriptor, with RDKit2D descriptors yielding a relatively low average prediction error. Binary fingerprints generally performed worse than nonbinary descriptors, with higher variation across data sets. Additionally, a combined molecular representation was observed to benefit the machine learning algorithms.
Performance of different machine learning methods. The summary of A R2 and B RMSE for different machine learning models on test set (indicated by colors). C Global ranking of all methods using PCA, scaled between best and worst performance. Every point represents a different combination of the machine learning method and the descriptor it relied on. “Worst” and “Best” indicated the worst and best performance obtained across all data sets, respectively
To provide a comprehensive assessment of methodological performance across the analyzed data sets, we conducted principal component analysis (PCA) on the obtained RMSE values (Fig. 2c). To enhance the PCA results, we added additional rows representing the best and worst RMSE values for each dataset along the direction defined by both the best and worst outcomes, as observed in previous studies [38, 39]. The proximity of a method’s position to the “best” point in the PCA plot signified its superior overall performance across the considered data sets. Furthermore, the magnitude of orthogonal deviation from the best–worst line indicated the extent of variability in a method’s performance based on the data set. Compared to machine learning models, deep learning models were obviously more sensitive to the specific data used during training. Additionally, the results confirmed the higher impact of molecular descriptors than the chosen machine learning algorithm on the model performance. On average, XGB coupled with a combined molecular representation proved to be the best method for predicting Caco-2 permeability. Machine learning models coupled with ECFPs outperformed DMPNN based on graphs, which was somewhat surprising given that ECFPs were derived from molecular graphs. The finding highlights a current gap in efficiently learning features from “raw” molecular representations in the small-data regimes typical of drug discovery. However, the deep learning model coupled with ECFPs showed comparable results to machine learning models coupled with ECFPs. These results once again emphasize the importance of including both generic molecular representations and local chemical information in MPNN models, which is in alignment with the prior research findings [40].
Understanding of the generalization of Caco-2 permeability prediction models
For each machine learning algorithm, the model with the best performance across the 10 independent runs was selected for further analysis. As shown in Fig. 3, a similar tendency was observed across methods: the majority of combinations of models and molecular representations performed well in predicting highly permeable compounds, whereas medium and, in particular, low permeable compounds tended to be overestimated. We presume that the suboptimal prediction performance for low-permeable compounds is not attributable to an improper choice of algorithms and descriptors. Instead, the bias of the data towards highly permeable compounds might give rise to an overestimation of the permeability of the less permeable compounds (Fig. 1a and b).
To verify that the performance of our QSPR models did not arise from chance correlation, we employed the Y-randomization test. This method involves randomly shuffling the logPapp values to disrupt their original pairing with the compounds [41]. The resulting QSPR models are expected to perform poorly when applied to the validation dataset. This randomization process was repeated 100 times, and the collective results are illustrated in Figure S2. The performance of the randomly shuffled models was clearly different from that of the real model, with R2 values of approximately −0.5 to 0. The limited predictive ability of the randomized models demonstrated that our predictive models captured the genuine relationship between molecular properties and logPapp, rather than spurious correlations.
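The logic of the Y-randomization test can be sketched on synthetic data. A single-descriptor least-squares line stands in for the QSPR models of the study, and the data, seeds, and function names are hypothetical; only the procedure (shuffle the training responses, refit, score on the untouched validation set, repeat 100 times) follows the description above.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a * x + b with a single descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx

def r2(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def y_randomization(x_tr, y_tr, x_va, y_va, n_rounds=100, seed=0):
    """Refit on shuffled training responses and score on the true validation set."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(y_tr)
        rng.shuffle(shuffled)                  # break the descriptor-response pairing
        a, b = fit_line(x_tr, shuffled)
        scores.append(r2(y_va, [a * xi + b for xi in x_va]))
    return scores

# Synthetic data with a genuine linear relationship (hypothetical).
rng = random.Random(42)
x_train = [rng.uniform(-3, 3) for _ in range(200)]
y_train = [2.0 * xi + rng.gauss(0, 0.3) for xi in x_train]
x_valid = [rng.uniform(-3, 3) for _ in range(50)]
y_valid = [2.0 * xi + rng.gauss(0, 0.3) for xi in x_valid]

a, b = fit_line(x_train, y_train)
real_r2 = r2(y_valid, [a * xi + b for xi in x_valid])
shuffled_r2 = y_randomization(x_train, y_train, x_valid, y_valid)
```

A model that has learned a real relationship should score far above the shuffled runs, whose R2 values scatter around zero.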
Defining the domain of applicability for any predictive model is critical in order to gain an understanding of the model generalization and the origin of estimated errors [32]. Herein, we calculated the similarity between a test compound and its nearest neighbor in the training set to quantify its distance to the model’s domain of applicability and to understand further the potential link with the uncertainty in prediction. More specifically, the structural similarity between any two samples was measured using the Tanimoto coefficient based on the ECFP4 fingerprints with a folding size of 1024 bits. The resulting similarity score was binned in intervals of 0.1 units. Figure 4 highlighted the correlation between the prediction error and the binned average structural similarity for all compounds in the Caco-2 data set. A general trend was observed where the model performance, regardless of machine learning algorithms, increased as the similarity for the test compound against the training set increased, which was in agreement with the conclusions from previous studies [40]. In this case, we believed that compounds with similarities higher than 0.70 to the training set tended to obtain more reliable prediction results.
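The similarity analysis can be sketched as follows, with fingerprints represented as sets of on-bit indices. The function names, the toy fingerprints, and the error values are hypothetical; in practice the on-bits would come from a folded ECFP4 fingerprint.

```python
from collections import defaultdict

def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def nearest_similarity(query, train):
    """Similarity of a test compound to its nearest neighbor in the training set."""
    return max(tanimoto(query, fp) for fp in train)

def error_by_similarity_bin(test_fps, abs_errors, train_fps, n_bins=10):
    """Mean absolute error per similarity bin (bin i covers [i/n_bins, (i+1)/n_bins))."""
    bins = defaultdict(list)
    for fp, err in zip(test_fps, abs_errors):
        sim = nearest_similarity(fp, train_fps)
        idx = min(int(sim * n_bins), n_bins - 1)   # a similarity of 1.0 goes in the top bin
        bins[idx].append(err)
    return {idx: sum(v) / len(v) for idx, v in sorted(bins.items())}

# Toy fingerprints and absolute errors (hypothetical).
train_fps = [{1, 2, 3, 4}, {10, 11, 12, 13}]
test_fps = [{1, 2, 3, 4}, {1, 2, 5, 6, 7, 8}]
abs_errors = [0.1, 0.9]
binned = error_by_similarity_bin(test_fps, abs_errors, train_fps)
```

In this toy case the compound that is identical to a training compound lands in the top bin with a small error, while the compound with a nearest-neighbor similarity of 0.25 lands in a low bin with a large error.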
The Euclidean distance-based method was then used to evaluate the applicability domain. According to the cutoff definition, raising the Z value and reducing k is expected to increase the distance threshold and thus consistently decrease the number of compounds outside the AD. The average Euclidean distance of the compounds in the training set was 9.203, with a standard deviation of 0.920. For each model, the statistical results before and after considering the applicability domain are collected in Table 1, with R2 values for model fit and RMSE for the test set. As anticipated, the improvement in model performance correlated positively with the reduction in the distance threshold. This makes sense because decreasing the distance threshold causes a greater number of compounds with lower similarity scores to fall outside the AD (Figure S3 shows examples of out-of-domain compound structures). To minimize the loss of valuable chemical space, k and Z were set to 6 and 0.9, respectively. Accordingly, the final distance threshold was 8.375. As a result, the XGBoost and SVM models performed best. For instance, the XGB_combined and SVM_combined models yielded similar performance, with R2 values of 0.663 and 0.667 and RMSE values of 0.479 and 0.475, respectively.
Model validation on the Shanghai Qilu’s in-house dataset
We next assessed how well machine learning models trained on public data can be transferred to pharmaceutical industry data by applying them to the Shanghai Qilu’s in-house dataset. We also compared the best-performing models with the existing Caco-2 prediction model in the ADMET prediction platform ADMETlab 2.0 [42]. The regression performance statistics for the prospective validation of these prediction models on the Shanghai Qilu’s in-house dataset are given in Table 2 and visualized in Fig. 5. In general, the performance of these models on the external set was substantially lower than that on the internal test set in terms of R2. Two reasons may explain the poor performance and negative R2 values. First, the similarity between the molecules in the Shanghai Qilu’s internal dataset and those in the training set is low: as illustrated in Figure S4, the highest Tanimoto similarity was less than 0.45. Second, most compounds in the Shanghai Qilu’s internal dataset belong to the low/medium permeability class (Figure S5), while the bias towards high permeability in the curated data set tends to cause overprediction for low and medium permeable compounds. Nevertheless, particular models showed a high correlation between their predictions and the experimental values, with Pearson’s r and Spearman’s rho of at least 0.6. More specifically, the two boosting algorithms (XGBoost, GB) coupled with ECFPs outperformed RF in all cases in terms of Pearson’s r, Spearman’s rho, MAE, and RMSE. SVM worked well with all representations but did not outperform the boosting models. The best models, XGB_fp and GB_fp, achieved Pearson’s r of 0.702 and 0.719, Spearman’s rho of 0.704 and 0.725, MAE of 0.629 and 0.659, and RMSE of 0.771 and 0.797, respectively. Similarly, the deep learning approach showed improvements in Pearson’s r and Spearman’s rho compared to the other machine learning models but did not outperform XGBoost and GB.
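A negative R2 together with a high correlation coefficient is not contradictory: it is what a systematic overprediction produces. The toy example below, built on hypothetical values rather than the in-house data, shifts every prediction upwards by the same amount, so the ranking is perfect while the absolute agreement is poor.

```python
import math

def r2(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Hypothetical low/medium permeability values, uniformly overpredicted by 0.8 log units.
y_true = [-6.5, -6.2, -6.0, -5.8, -5.5]
y_pred = [t + 0.8 for t in y_true]

offset_r2 = r2(y_true, y_pred)            # strongly negative: errors exceed the data spread
offset_r = pearson_r(y_true, y_pred)      # essentially 1: the trend is fully preserved
```

This is why the correlation coefficients are informative for rank-ordering compounds within a project even when R2 is negative.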
In recent years, it has been generally believed that increasing the dimension of the descriptors gives models with better predictive performance [43]. Our findings suggest the opposite here: adding 2D descriptors increased feature redundancy and model complexity, thereby reducing prediction performance. A noticeable exception was Project 3, where the GB_combined model was the best method. Additionally, the CombinedNet model, which used a feature combination of the molecular graph and the molecular fingerprints, achieved acceptable performance, with Pearson’s r of 0.677, Spearman’s rho of 0.677, MAE of 0.730, and RMSE of 0.873. However, it did not further enhance prediction performance when compared to the boosting algorithms. We also investigated which of the machine learning methods worked best for each project (see Table S3). Only projects with more than 8 molecules were included. Boosting models were advantageous for the majority of projects. Subsequently, we used these models to predict the in-domain compounds in the in-house data set; the performance is provided in Table S4 and Figure S6. The GB_fp model had the best overall predictive performance, but this does not necessarily hold for a specific project. For example, the GB_combined model achieved the best performance in Project 3 (Table S5). However, it is worth mentioning that the overall performance of all models decreased when compared to Table 2. The results suggest that defining a strict similarity threshold that correlates with decreased generalization error is challenging and potentially highly dependent on the explored chemical space and the in vitro ADME property.
Together, these results suggest that (1) boosting models are a favorable option for the prediction of Caco-2 permeability, (2) transferring public models to industrial data comes with a certain performance decrease but is still valuable for prospective predictions, (3) deep learning models outperform a few of the methods investigated in this study on industrial data but may not be the best solution, and (4) feature combination becomes effective only when the added features indeed supply the missing information; sometimes less can be more.
Representative MMP transformations of Caco-2 permeability
The diffusion of a compound is a complex process influenced by various types of interactions [4]. To understand how different functional groups affect Caco-2 permeability, we applied the open-source package mmpdb version 2 to generate an MMP knowledge base using all 5654 compounds in the curated dataset. The three-cut method was used for the analysis, resulting in 65,440 matched molecular pairs and a total of 415,716 rules. A representative set of MMP rules reflecting common medicinal chemistry transformations is listed in Fig. 6 for the optimization of Caco-2 endpoints. For this analysis, we included only transformations that were supported by at least 10 matched pairs and showed a significant property shift, as measured by a p-value < 0.05. In Fig. 6, “MMP transformations” refers to the transformation rules of the molecular pairs, “ΔCaco-2 ± std” denotes the average difference and standard deviation in the logPapp values of the molecular pairs, and “nPairs” is the number of molecular pairs. Entries in Fig. 6 can broadly be classified into two groups. The first group involves the introduction of hydrocarbon groups, which tends to increase the logPapp values. Because compounds must cross the phospholipid bilayer, high lipophilicity is favorable for efficient permeability. The second group includes transformations of polar groups. High polarity is known to be detrimental to permeability [44]. Removing or masking hydrogen-bond donors therefore seems to be a good tactic to improve permeability, whereas the introduction of heteroatoms (e.g., O, N) tends to reduce logPapp. Additionally, halogen substitution may play an important role in increasing the polarization and thus decreasing the water solubility and absorption of the molecule.
In conclusion, the above rules are generally consistent with our chemical intuition that Caco-2 permeability is related to key physicochemical parameters of molecules, such as lipophilicity, ionization, hydrogen bonding, and molecular size [45]. We believe that these rules will be an important assistive tool for Caco-2 prediction and provide a reference for medicinal chemists to optimize lead compounds.
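As a minimal sketch of the rule filter described above (at least 10 matched pairs and p < 0.05), the following Python snippet groups pair-level ΔlogPapp values by transformation and reports the mean, standard deviation, and nPairs for the rules that pass. It does not use mmpdb. The transformation strings and Δ values are invented for illustration, and the function names `sign_test_p` and `summarize_rules` are hypothetical. Because this section does not state which significance test produced the p-value, an exact two-sided sign test stands in here so that only the standard library is needed.

```python
from collections import defaultdict
from math import comb
from statistics import mean, stdev


def sign_test_p(deltas):
    """Exact two-sided sign test for H0: the median shift is zero.

    Stand-in for the (unspecified) test used in the paper.
    """
    pos = sum(d > 0 for d in deltas)
    neg = sum(d < 0 for d in deltas)
    n = pos + neg  # zero differences (ties) are dropped
    if n == 0:
        return 1.0
    k = min(pos, neg)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def summarize_rules(pairs, min_pairs=10, alpha=0.05):
    """Group (transformation, delta_logPapp) pairs and keep significant rules."""
    by_rule = defaultdict(list)
    for transformation, delta in pairs:
        by_rule[transformation].append(delta)

    kept = []
    for transformation, deltas in by_rule.items():
        if len(deltas) < min_pairs:
            continue  # too few matched pairs to trust the average shift
        p_value = sign_test_p(deltas)
        if p_value < alpha:
            kept.append({
                "transformation": transformation,
                "mean_delta": mean(deltas),
                "std_delta": stdev(deltas),
                "n_pairs": len(deltas),
                "p_value": p_value,
            })
    # Largest average increase in logPapp first
    return sorted(kept, key=lambda r: r["mean_delta"], reverse=True)


# Invented toy data: (transformation, logPapp difference of the pair)
toy_pairs = (
    [("[*:1][H]>>[*:1]C", d) for d in
     (0.21, 0.35, 0.10, 0.42, 0.18, 0.27, 0.31, 0.15, 0.08, 0.25, 0.33, 0.12)]
    + [("[*:1][H]>>[*:1]O", d) for d in
       (-0.4, 0.3, -0.2, 0.5, -0.1, 0.2, -0.3, 0.1, -0.5, 0.4)]
    + [("[*:1]N>>[*:1]OC", d) for d in (0.6, 0.7, 0.5)]
)

rules = summarize_rules(toy_pairs)
for r in rules:
    print(f'{r["transformation"]}: {r["mean_delta"]:.2f} ± {r["std_delta"]:.2f} '
          f'(nPairs={r["n_pairs"]}, p={r["p_value"]:.4f})')
```

In this toy set, only the first transformation passes both filters: the second has 10 pairs but no consistent direction, and the third has too few pairs. With real data, the statistics exported by the MMP tool itself would replace the stand-in test.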
Conclusions
Understanding Caco-2 permeability is crucial for predicting drug behavior within the gastrointestinal tract and guiding decisions in the early stages of drug discovery and development. This study systematically evaluated Caco-2 permeability prediction using machine learning models trained on public data. Regression models for Caco-2 permeability were developed using four machine learning methods (RF, SVM, GBM, and XGBoost) and two deep learning methods (DMPNN and CombinedNet). Various molecular representations, including ECFP4, RDKit2D descriptors, and their combination (ECFP4 + RDKit2D), were explored to assess their impact on model performance. Our results reinforce previous evidence that deep learning methods do not currently outperform simple machine learning methods consistently in drug discovery applications [46]. Boosting models demonstrated superior overall performance compared to deep learning models across both training and test datasets. Approaches leveraging human-engineered molecular descriptors surpassed graph-based deep learning methods in predictive accuracy, while the deep learning model coupled with ECFPs yielded results comparable to those of machine learning methods. Furthermore, the models trained on public data were transferred to and evaluated on Shanghai Qilu’s in-house dataset. Particular models, such as XGBoost_fp, GBM_fp, and CombinedNet, retained a useful degree of predictive quality, with Pearson’s r and Spearman’s rho of about 0.65 to 0.70, despite potential differences in data distributions between public and industry databases. Additionally, the contributions of common substituents to logPapp were elucidated using Matched Molecular Pair Analysis (MMPA), offering insights into structural optimization strategies.
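For readers who want to reproduce the two correlation metrics quoted above on their own predictions, the following is a small standard-library sketch of Pearson’s r and Spearman’s rho (the latter computed as Pearson’s r on average ranks, which handles ties). The helper names and the `measured`/`predicted` values are made up for illustration and are not taken from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r applied to average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Invented logPapp values, for illustration only
measured = [-6.2, -5.8, -5.1, -4.9, -4.6, -4.4]
predicted = [-5.9, -5.9, -5.3, -4.7, -4.8, -4.5]

print(f"Pearson r    = {pearson_r(measured, predicted):.3f}")
print(f"Spearman rho = {spearman_rho(measured, predicted):.3f}")
```

Reporting both is useful because Pearson’s r rewards linear agreement in the logPapp values, while Spearman’s rho only asks whether compounds are ranked in the right order, which is often what matters when prioritizing compounds.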
Availability of data and materials
The public Caco-2 data set, the external data set, and the Python code for training and validating the models are available at https://github.com/Duke-W91/Caco2_prediction. Shanghai Qilu’s internal dataset is considered proprietary by the organization.

To demonstrate that the results on Shanghai Qilu’s internal dataset are consistent with the findings from a time-split external set, we additionally collected 271 compounds reported in 2022 and 2023 from the ChEMBL database. The external set covers a wide range of experimental Caco-2 permeability values and shares low similarity with the curated Caco-2 data set (Figure S7). We also compared the best-performing model with the existing platform ADMETlab 2.0 [42] and its updated version ADMETlab 3.0 [47]. Table S6 and Figure S8 present the models’ performance on the external dataset. The results corroborated the observations from Shanghai Qilu’s internal dataset: boosting models generally provided the most accurate predictions, and the hybrid molecular representation could enhance model performance. To address the possibility that inadequate applicability contributed to the suboptimal results of the time split, we examined the Applicability Domain (AD). As shown in Table S7 and Figure S9, our findings indicated that applying a model on the basis of the AD could be misleading, especially when the AD was defined using distance-based approaches with simple fingerprints, such as ECFPs.
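To make the idea of a distance-based applicability domain concrete, here is a hedged sketch in plain Python. It assumes fingerprints are represented as sets of on-bit indices (as one could obtain from an ECFP-style fingerprint) and flags a query compound as in-domain when its highest Tanimoto similarity to any training compound reaches a cutoff. The exact AD definition used in the study is not given in this section, so the nearest-neighbor rule, the `threshold=0.3` value, and all function names and bit sets are illustrative assumptions.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def nearest_similarity(query, training):
    """Highest Tanimoto similarity of the query to any training fingerprint."""
    return max(tanimoto(query, fp) for fp in training)


def in_domain(query, training, threshold=0.3):
    """Hypothetical AD rule: in-domain if the nearest neighbor is similar enough."""
    return nearest_similarity(query, training) >= threshold


# Invented on-bit sets standing in for real fingerprints
training_fps = [{1, 2, 3, 4}, {3, 4, 5, 6}, {10, 11, 12}]
query_a = {1, 2, 3, 5}    # shares three bits with the first training compound
query_b = {10, 20, 21}    # shares a single bit with the third training compound

for name, fp in (("query_a", query_a), ("query_b", query_b)):
    sim = nearest_similarity(fp, training_fps)
    print(f"{name}: nearest similarity = {sim:.2f}, in domain = {in_domain(fp, training_fps)}")
```

A rule of this kind depends entirely on how well fingerprint similarity tracks prediction error, which is consistent with the caution above that a distance-based AD built on simple fingerprints can be misleading.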
Abbreviations
- PAMPA: Parallel Artificial Membrane Permeability Assay
- Caco-2: Human colon adenocarcinoma cell lines
- MDCK: Madin-Darby canine kidney cell
- LLC-PK1: Porcine kidney epithelial cell lines
- RF: Random Forest
- XGBoost: Extreme gradient boosting
- SVM: Support Vector Machine
- GBM: Gradient Boosting Machine
- DMPNN: Directed Message Passing Neural Network
- MMP: Matched molecular pair
References
Alqahtani MS, Kazi M, Alsenaidy MA, Ahmad MZ (2021) Advances in oral drug delivery. Front Pharmacol 12:618411
Sun D, Gao W, Hu H, Zhou S (2022) Why 90% of clinical drug development fails and how to improve it? Acta Pharm Sin B 12(7):3049–3062
Lin L, Wong H (2017) Predicting oral drug absorption: mini review on physiologically-based pharmacokinetic models. Pharmaceutics 9(4):41
Ungell A-LB (2004) Caco-2 replace or refine? Drug Discov Today Technol 1(4):423–430
Avdeef A, Bendels S, Di L, Faller B, Kansy M, Sugano K, Yamauchi Y (2007) PAMPA—critical factors for better predictions of absorption. J Pharm Sci 96(11):2893–2909
Artursson P, Palm K, Luthman K (2001) Caco-2 monolayers in experimental and theoretical predictions of drug transport. Adv Drug Del Rev 46(1–3):27–43
Irvine JD, Takahashi L, Lockhart K, Cheong J, Tolan JW, Selick H, Grove JR (1999) MDCK (Madin–Darby canine kidney) cells: a tool for membrane permeability screening. J Pharm Sci 88(1):28–33
Bohets H, Annaert P, Mannens G, Anciaux K, Verboven P, Meuldermans W, Lavrijsen K (2001) Strategies for absorption screening in drug discovery and development. Curr Top Med Chem 1(5):367–383
Hubatsch I, Ragnarsson EGE, Artursson P (2007) Determination of drug permeability and prediction of drug absorption in Caco-2 monolayers. Nat Protoc 2(9):2111–2119
Bocci G, Oprea TI, Benet LZ (2022) State of the art and uses for the biopharmaceutics drug disposition classification system (BDDCS): new additions, revisions, and citation references. AAPS J 24(2):37
Alsenz J, Haenel E (2003) Development of a 7-day, 96-well Caco-2 permeability assay with high-throughput direct UV compound analysis. Pharm Res 20:1961–1969
Natoli M, Leoni BD, D’Agnano I, Zucco F, Felsani A (2012) Good Caco-2 cell culture practices. Toxicol In Vitro 26(8):1243–1246
Wang N-N, Dong J, Deng Y-H, Zhu M-F, Wen M, Yao Z-J, Lu A-P, Wang J-B, Cao D-S (2016) ADME properties evaluation in drug discovery: prediction of caco-2 cell permeability using a combination of NSGA-II and boosting. J Chem Inf Model 56(4):763–773
Wang Y, Chen X (2020) QSPR model for Caco-2 cell permeability prediction using a combination of HQPSO and dual-RBF neural network. RSC Adv 10(70):42938–42952
Wang X, Liu M, Zhang L, Wang Y, Li Y, Lu T (2020) Optimizing pharmacokinetic property prediction based on integrated datasets and a deep learning approach. J Chem Inf Model 60(10):4603–4613
Falcón-Cano G, Molina C, Cabrera-Pérez MÁ (2022) Reliable prediction of Caco-2 permeability by supervised recursive machine learning approaches. Pharmaceutics 14(10):1998
Pham The H, González-Álvarez I, Bermejo M, Mangas Sanjuan V, Centelles I, Garrigues TM, Cabrera-Pérez MÁ (2011) In silico prediction of Caco-2 cell permeability by a classification QSAR approach. Mol Inf 30(4):376–385
Hou TJ, Zhang W, Xia K, Qiao XB, Xu XJ (2004) ADME evaluation in drug discovery. 5. Correlation of Caco-2 permeation with simple molecular properties. J Chem Inf Comput Sci 44(5):1585–1600
Likitha S, Kamath S (2021) ML based QSAR models for prediction of pharmacological permeability of Caco-2 cell. In: 2021 IEEE 4th international conference on computing, power and communication technologies (GUCON). IEEE, pp 1–6
Gramatica P (2007) Principles of QSAR models validation: internal and external. QSAR Comb Sci 26(5):694–701
De P, Kar S, Ambure P, Roy K (2022) Prediction reliability of QSAR models: an overview of various validation tools. Arch Toxicol 96(5):1279–1295
Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742–754
Heid E, Greenman KP, Chung Y, Li S-C, Graff DE, Vermeire FH, Wu H, Green WH, McGill CJ (2024) Chemprop: a machine learning package for chemical property prediction. J Chem Inf Model 64(1):9–17
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 785–794
Zhang T (2001) An introduction to support vector machines and other kernel-based learning methods. AI Mag 22(2):103–103
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M, Palmer A, Settels V, Jaakkola T, Jensen K, Barzilay R (2019) Analyzing learned molecular representations for property prediction. J Chem Inf Model 59(8):3370–3388
Nguyen N-Q, Jang G, Kim H, Kang J (2022) Perceiver CPI: a nested cross-attention network for compound–protein interaction prediction. Bioinformatics 39(1):btac731
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
You Y, Li J, Reddi S, Hseu J, Kumar S, Bhojanapalli S, Song X, Demmel J, Keutzer K, Hsieh CJ (2019) Large batch optimization for deep learning: training BERT in 76 minutes. arXiv preprint arXiv:1904.00962
Cherkasov A, Muratov EN, Fourches D, Varnek A, Baskin II, Cronin M, Dearden J, Gramatica P, Martin YC, Todeschini R, Consonni V, Kuz’min VE, Cramer R, Benigni R, Yang C, Rathman J, Terfloth L, Gasteiger J, Richard A, Tropsha A (2014) QSAR modeling: where have you been? Where are you going to? J Med Chem 57(12):4977–5010
Shen M, LeTiran A, Xiao Y, Golbraikh A, Kohn H, Tropsha A (2002) Quantitative structure−activity relationship analysis of functionalized amino acid anticonvulsant agents using k nearest neighbor and simulated annealing PLS methods. J Med Chem 45(13):2811–2823
Hussain J, Rea C (2010) Computationally efficient algorithm to identify matched molecular pairs (MMPs) in large data sets. J Chem Inf Model 50(3):339–348
Griffen E, Leach AG, Robb GR, Warner DJ (2011) Matched molecular pairs as a medicinal chemistry tool. J Med Chem 54(22):7739–7750
Dossetter AG, Griffen EJ, Leach AG (2013) Matched molecular pair analysis in drug discovery. Drug Discov Today 18(15–16):724–731
Dalke A, Hert J, Kramer C (2018) mmpdb: an open-source matched molecular pair platform for large multiproperty data sets. J Chem Inf Model 58(5):902–910
Todeschini R, Ballabio D, Cassotti M, Consonni V (2015) N3 and BNN: two new similarity based classification methods in comparison with other classifiers. J Chem Inf Model 55(11):2365–2374
van Tilborg D, Alenicheva A, Grisoni F (2022) Exposing the limitations of molecular machine learning with activity cliffs. J Chem Inf Model 62(23):5938–5951
Fang C, Wang Y, Grater R, Kapadnis S, Black C, Trapa P, Sciabola S (2023) Prospective validation of machine learning algorithms for absorption, distribution, metabolism, and excretion prediction: an industrial perspective. J Chem Inf Model 63(11):3263–3274
Rücker C, Rücker G, Meringer M (2007) y-Randomization and its variants in QSPR/QSAR. J Chem Inf Model 47(6):2345–2357
Xiong G, Wu Z, Yi J, Fu L, Yang Z, Hsieh C, Yin M, Zeng X, Wu C, Lu A, Chen X, Hou T, Cao D (2021) ADMETlab 2.0: an integrated online platform for accurate and comprehensive predictions of ADMET properties. Nucleic Acids Res 49(W1):W5–W14
Orosz Á, Héberger K, Rácz A (2022) Comparison of descriptor-and fingerprint sets in machine learning models for ADME-Tox targets. Front Chem 10:852893
Goetz GH, Shalaeva M, Caron G, Ermondi G, Philippe L (2017) Relationship between passive permeability and molecular polarity using block relevance analysis. Mol Pharm 14(2):386–393
O’Donovan DH, De Fusco C, Kuhnke L, Reichel A (2023) Trends in molecular properties, bioavailability, and permeability across the Bayer compound collection. J Med Chem 66(4):2347–2360
Jiang D, Wu Z, Hsieh C-Y, Chen G, Liao B, Wang Z, Shen C, Cao D, Wu J, Hou T (2021) Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J Cheminf 13:1–23
Fu L, Shi S, Yi J, Wang N, He Y, Wu Z, Peng J, Deng Y, Wang W, Wu C, Lyu A, Zeng X, Zhao W, Hou T, Cao D (2024) ADMETlab 3.0: an updated comprehensive online ADMET prediction platform enhanced with broader coverage, improved performance, API functionality and decision support. Nucleic Acids Res 52(W1):W422–W431
Acknowledgements
Not applicable.
Funding
This work was financially supported by the National Natural Science Foundation of China (22220102001, 82304380) and the Shanghai Qilu Pharmaceutical R&D Center.
Author information
Contributions
T.H. and K.Y. conceptualized the study; W.D., J.J. and S.G. implemented methods. W.D., S.G., B.J., W.Z., L.S. and P.P. performed data analysis and interpretation. W.D., L.D., K.Y. and H.T. wrote the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
13321_2025_947_MOESM1_ESM.docx
Supplementary Material 1. Table S1. The optimal hyperparameters for the six machine learning approaches. Table S2. The performances of different models across the training, validation, and test sets. Table S3. The performances of different models across four of Shanghai Qilu’s internal projects. Table S4. The performances of different models considering the applicability domain on Shanghai Qilu’s internal data set. Table S5. The performances of different models considering the applicability domain across two of Shanghai Qilu’s internal projects. Table S6. The performances of different models on the external set. Table S7. The performances of different models on the external set considering the applicability domain. Figure S1. The architecture of CombinedNet. Figure S2. The distribution of R² of 100 randomized models on the validation set compared with the real model in the Y-randomization test. The orange vertical line on the right represents the R² value of the true model, and the distribution on the left represents the R² values of the prediction models after randomization. Figure S3. Structures of the five out-of-domain compounds. Figure S4. The Tanimoto similarity between Shanghai Qilu’s internal data set and the training set from the curated Caco-2 permeability data set, measured by Morgan fingerprint. Figure S5. The experimental logPapp value distribution of Shanghai Qilu’s internal data set, showing that most of the compounds belong to the low-permeability class (n=30) while the fewest structures are categorized as highly permeable (n=17). Figure S6. Correlation of machine learning model predictions (x-axis) of compounds within the applicability domain against logPapp measurements (y-axis) across 6 of Shanghai Qilu’s internal projects. Figure S7. (A) Density plot showing the distribution of logPapp values in the external data set and the curated data set. (B) Heatmap showing the Tanimoto similarity between the external set and the training set. (C) Frequency plot showing the highest Tanimoto similarity to the training set. Figure S8. Correlation of machine learning model predictions (x-axis) against logPapp measurements (y-axis) on the external set. Figure S9. Correlation of machine learning model predictions (x-axis) of compounds within the applicability domain against logPapp measurements (y-axis) on the external set.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, D., Jin, J., Shi, G. et al. ADMET evaluation in drug discovery: 21. Application and industrial validation of machine learning algorithms for Caco-2 permeability prediction. J Cheminform 17, 3 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13321-025-00947-z
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13321-025-00947-z