Molecular identification via molecular fingerprint extraction from atomic force microscopy images
Journal of Cheminformatics volume 16, Article number: 130 (2024)
Abstract
Non–Contact Atomic Force Microscopy with CO–functionalized metal tips (referred to as HR-AFM) provides access to the internal structure of individual molecules adsorbed on a surface with unprecedented resolution. Previous works have shown that deep learning (DL) models can retrieve the chemical and structural information encoded in a 3D stack of constant-height HR–AFM images, leading to molecular identification. In this work, we overcome their limitations by using a well-established description of the molecular structure in terms of topological fingerprints, the 1024–bit Extended Connectivity Chemical Fingerprints of radius 2 (ECFP4), that were developed for substructure and similarity searching. ECFPs provide local structural information of the molecule, each bit correlating with a particular substructure within the molecule. Our DL model is able to extract this optimized structural descriptor from the 3D HR–AFM stacks and use it, through virtual screening, to identify molecules from their predicted ECFP4 with a retrieval accuracy on theoretical images of 95.4%. Furthermore, this approach, unlike previous DL models, assigns a confidence score, the Tanimoto similarity, to each of the candidate molecules, thus providing information on the reliability of the identification. By construction, the number of times a given substructure is present in the molecule is lost during the hashing process that is necessary to make the fingerprints useful for machine learning applications. We show that it is possible to complement the fingerprint-based virtual screening with global information provided by another DL model that predicts the chemical formula from the same HR–AFM stacks, boosting the identification accuracy to 97.6%. Finally, we perform a limited test with experimental images, obtaining promising results towards the application of this pipeline under real conditions.
Scientific contribution
Previous works on molecular identification from AFM images used chemical descriptors that were intuitive for humans but sub–optimal for neural networks. We propose a novel method to extract the ECFP4 from AFM images and identify the molecule via a virtual screening, beating previous state-of-the-art models.
Introduction
Atomic Force Microscopy (AFM) operated in the frequency modulation (FM) mode in ultra–high vacuum conditions (commonly known as Non–Contact AFM, NCAFM) has become an essential tool for nanoscience [1, 2]. NCAFM allows us to explore and manipulate matter at the atomic scale through the interaction between a sharp apex probe and the sample. The functionalization of AFM metal tips with closed-shell molecules, in particular with CO, provides access with totally unprecedented resolution to the inner structure of small organic molecules adsorbed on surfaces [3,4,5,6]. Since the first High–Resolution (HR) AFM image of the pentacene molecule [3], this striking resolution has been exploited to disclose bond orders [7], to image frontier orbitals [5] and charge distributions, and to track the intermediate products of chemical reactions [8]. Nowadays, HR–AFM has become an essential tool for on-surface chemistry [8, 9] and fundamental catalysis studies [10].
The utmost resolution provided by HR–AFM arises from the Pauli repulsion between an inert probe such as CO and the electronic charge distribution of the sample molecule [11, 12], modified by the electrostatic interaction between the potential created by the sample and the charge distribution associated with the oxygen lone pair at the probe [13,14,15]. The flexibility of the bond between the CO and the last atom of the metal probe magnifies the saddle lines of the total potential energy surface sensed by the CO, further enhancing the resolution [16].
This exquisite sensitivity to the sample charge density immediately raises the question of whether we can go beyond structure and use HR–AFM as a molecular identification tool. Given the capability of HR–AFM to address individual molecules, such a tool would not only serve on–surface chemistry applications but also has the potential to overcome some of the fundamental limitations of the spectroscopic techniques [17] traditionally used for molecular identification, such as vibrational spectroscopy (Fourier Transform Infrared (FTIR) and Raman spectroscopies) [18], nuclear magnetic resonance (NMR) [19], or mass spectrometry [20, 21].
For molecular identification solely based on HR–AFM, the repulsive nature of the CO-sample interaction prevents the application of force spectroscopy protocols, based on the determination of maximum attractive forces, that achieved single-atom chemical identification [22].
Attempts to discriminate atoms in molecules by HR–AFM have been based so far either on differences found in the tip-sample interaction decay at the molecular sites [13, 23] or on characteristic image features associated with the chemical properties of particular molecular moieties [5, 6, 13, 24,25,26,27,28,29,30,31]. For instance, sharper vertices are displayed for substitutional N atoms on hydrocarbon aromatic rings [13, 23, 24] due to their lone pair. Furthermore, the decay of the CO-sample interaction over those substitutional N atoms is faster than over their neighboring C atoms [13, 23]. In general, due to their slower charge density decay, C atoms in aromatic rings are usually sensed as more repulsive than N, which, in turn, is more repulsive than oxygen. Halogen atoms can also be distinguished in AFM images thanks to their oval shape (associated with their \(\sigma\)-hole) and to the significantly stronger repulsion compared to atoms like nitrogen or carbon [28]. Although promising, these rules do not represent a reliable solution to the atom identification problem, as the molecular environment plays an important role: C atoms in carboxylic groups literally disappear from the image of trimesic acid (TMA) self-assembled networks as they are much less repulsive than the neighboring O atoms in the acid moiety, which strongly attract the electronic charge towards them [29]. Furthermore, small height differences can significantly modify the images [32], leading in many cases to contrast inversion with respect to the above rules.
The previous analysis suggests that not a single HR–AFM image, but a 3D stack of constant–height images covering a range of relevant tip heights is needed to provide enough information on the molecular electronic charge distribution to disentangle the contribution of the bonding topology, the chemical composition and the internal corrugation of the molecule to the contrast of the HR-AFM images. While 2D features, like the sharper vertex associated with N atoms [13], are easily recognized by a human via visual inspection, handling 3D information to discriminate, for example, among the different halogens (that produce the same oval-shape contrast but with different decays moving away from the molecule [28]) calls for the application of a Machine Learning (ML) approach. In particular, Deep Learning (DL) has proven to be a powerful tool for learning long–range, complex correlations over large sets of images using a data–centric approach. Convolutional Neural Networks (CNNs) [33] have been employed over 3D stacks of constant-height AFM images with remarkable success at different tasks. In 2020, Alldritt et al. [34] developed a CNN model that obtained information about the 3D molecular structure from a 3D image stack by predicting the van der Waals spheres representation of the molecule. They also reported a preliminary test for the prediction of the chemical composition, with modest but promising results. Later in 2022, Oinonen et al. [35] created a pipeline for obtaining the molecular graph also from 3D image stacks. This pipeline consisted of a CNN that extracted a point cloud representation of the atoms, a peak finding algorithm and a combination of Multilayer Perceptron (MLP) and Graph Neural Network (GNN) models to classify each node and assign the bonds.
The detection of atomic positions worked quite reliably even for relatively large molecules such as PTCDA (3,4,9,10-Perylenetetracarboxylic dianhydride, CID: 67,191), although there were inconsistencies for non-planar systems (error rate of \(\sim 20\)%) and the model was sensitive to the choice of coordinate system. However, the compositional analysis, which was restricted to families of atoms (1: (H), 2: (C, Si), 3: (N, P), 4: (O, S), 5: (F, Cl, Br)), showed errors of up to 30% for the (N, P) family, which was commonly mistaken for the C and O groups.
In previous work, we have addressed the problem of complete molecular identification (structure and composition) of quasi-planar organic molecules with no prior information about them using two different DL approaches, taking as input a stack of 10 constant-height HR-AFM images covering the range of tip-sample distances commonly used for AFM imaging, spanning a distance variation of 1 Å. Firstly, we framed it as an image captioning challenge and used multimodal networks [36] to solve it. Each multimodal network (M-RNN) included a CNN for image analysis and a Recurrent Neural Network (RNN) for language processing. The first network took as input the 3D image stack and provided the attributes, the IUPAC terms corresponding to all the chemical groups present in the molecule. The second M-RNN exploited both the 3D image stack and the attributes provided by the first M-RNN to predict the IUPAC name of the molecule, which completely describes the structure and composition of the molecule. The determination of the chemical groups within the molecule had a 95% accuracy, showing that HR–AFM images did carry significant chemical information and that the CNN model is able to retrieve it. For the prediction of the complete IUPAC name, although the model outperforms most applications of RNNs to language translation, the accuracy was limited to 76% using the cumulative 4-gram BLEU metric [37], the standard metric for natural language processing. This performance drop is probably related to intrinsic limitations of RNN models and to the IUPAC formulation rules, specifically designed for humans but not particularly suitable for machine learning applications.
In order to overcome this language limitation, we devised a completely new perspective using visualisation techniques that map images onto images [38]. Our Conditional Generative Adversarial Network (CGAN) converts the image stack into a ball-and-stick depiction, where balls of different color and size represent the chemical species and sticks represent the bonds, providing, in this way, complete information on the structure and chemical composition. As an additional advantage, this approach can handle images containing groups of molecules bonded by hydrogen or halogen-bond interactions or molecular fragments that cannot be described by the IUPAC formulation. To estimate the accuracy of our identification method we used a global assessment and two specific evaluations focused on either structure or composition. The CGAN model achieved a remarkable 74% of perfect predictions, which increased to 95% (96%) when considering only structure (composition). Our criterion for the total accuracy and the structure accuracy was very strict, as a prediction was considered correct only if there was a perfect match. In all the predictions, most of the structure is revealed correctly, providing valuable information about the molecule, in spite of being counted as incorrect in the determination of the accuracy.
Fig. 1. Diagram of the molecular identification pipeline. From the experiment, we obtain the 3D HR–AFM stack consisting of 10 constant–height images, which is fed to our neural network to extract the Extended Connectivity Topological Fingerprints (ECFP4). Then, we perform a virtual screening with the predicted fingerprints against a database of molecule/fingerprint pairs and rank the candidates by decreasing Tanimoto similarity
The results of the two DL models described above show the potential of HR–AFM images for the chemical and structural identification of molecules. However, they are still limited by the deficiencies of the IUPAC nomenclature as a language in the M-RNN model and by the visual character of the information retrieved by the CGAN, which is perfectly informative for a human but not suitable as input for predicting molecular properties from the chemical information stored in the HR–AFM images. Here, we seek to overcome these limitations by using an alternative, well-established description of the molecular structure in terms of topological fingerprints [39], that were developed for substructure and similarity searching. In particular, we have selected a widely used and optimized topological fingerprint, the 1024–bit Extended Connectivity Chemical Fingerprints of radius 2 (ECFP4) [40]. We show that we can design and train (with the QUAM–AFM database [41]) a DL model that is able to extract this optimized structural descriptor from the 3D HR–AFM stacks and use it, through virtual screening [42], to identify molecules from their predicted ECFP4 with very high accuracy (see Fig. 1).
ECFPs, developed specifically for structure-activity modeling, are circular fingerprints with a number of useful qualities: they can be very rapidly calculated; they are not predefined, and can represent an essentially infinite number of different molecular substructures (including stereochemical information). ECFPs have proven to be useful in several applications including virtual screening, quantitative structure–activity relationship (QSAR) modeling [43, 44] and similarity searching [45]. Particularly relevant examples are the use of ECFPs applied to different molecular datasets for the prediction of electronic properties, solubility and binding affinities for bio-molecular complexes [44], and the recent application to predict compounds with high antibiotic activity and low cytotoxicity [46].
The rest of the paper is organized as follows. After introducing our model for predicting the molecular fingerprints, the 1024–bit ECFP4 [40], of a target molecule and presenting other methodological details used in the work, we show the performance of the model for fingerprint extraction from 3D HR–AFM stacks by using the Tanimoto similarity [47, 48]. As the direct reconstruction of molecular representations from ECFPs is far from being straightforward [49], we have chosen a virtual screening process as the strategy for molecular identification. Our results show that molecules can be identified from the predicted ECFP4 with very high accuracy (95.4%). This method, unlike previous works [35, 36, 38], has the additional advantage of selecting an arbitrary number of candidate molecules and assigning a confidence score (the Tanimoto similarity [47, 48]) to each one of them, thus providing information on the reliability of the identification. This approach lets us identify the correct molecule even when the prediction of the fingerprint is partially wrong.
By construction, ECFPs provide local structural information and the frequency of the identified substructures is lost during the hashing process necessary to map them into a fixed sized vector (see Methods). To address this limitation, we complement the fingerprint-based virtual screening with global information from another deep learning model that predicts the chemical formula from the same high-resolution atomic force microscopy (HR-AFM) stacks, enhancing the identification accuracy to 97.6%. Finally, we conducted a limited test with experimental images, yielding promising results that support the feasibility of applying this pipeline under real-world conditions.
Methods
Molecular fingerprints
Molecular fingerprints [39] are representations of the chemical structure of the molecule optimized for substructure searching and machine learning tasks. In molecular fingerprints, each integer represents the presence of a particular substructure. In our work, we have chosen the Extended Connectivity Fingerprints (ECFPs) [40], a class of topological fingerprints that can be efficiently computed and represent an essentially infinite number of different molecular substructures.
The ECFP generation process begins with the assignment of an initial integer identifier to each atom in the molecule. These identifiers are typically based on atom types and incorporate properties such as the valence, atomic mass and so forth. Following this, an iterative neighborhood expansion process takes place up to a defined radius. In each iteration, a new integer identifier is created for each atom by hashing its current identifier together with those of its immediate neighbors, in order to incorporate information from the atom’s local environment. In ECFP4, the radius is set to second neighbors. Finally, we map the fingerprints to a fixed–size 1024–bit vector. To obtain the index of the “on” bits in the final bit vector, we use the modulo operator on each integer. This hashing step, although necessary for machine learning applications [40, 44], produces a loss of information: firstly, the frequency of each substructure’s occurrence within the molecule is lost; secondly, different integers can be mapped to the same index (a situation referred to as a “bit collision” [40]).
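The folding step described above can be illustrated with a minimal sketch. It is not the RDKit implementation used in this work: the function name `fold_identifiers` and the integer identifiers are hypothetical, chosen only to show how the modulo operation discards substructure counts and can produce bit collisions.

```python
N_BITS = 1024  # fingerprint length used in the text


def fold_identifiers(identifiers, n_bits=N_BITS):
    """Fold integer substructure identifiers into a fixed-size bit vector.

    Each identifier switches on the bit at index (identifier mod n_bits).
    """
    bits = [0] * n_bits
    for ident in identifiers:
        bits[ident % n_bits] = 1
    return bits


# Hypothetical identifiers (not real ECFP hashes):
# - 7 appears three times, so its occurrence count is lost after folding;
# - 1031 % 1024 == 7, so it collides with identifier 7;
# - 2500 % 1024 == 452.
identifiers = [7, 7, 7, 1031, 2500]
bits = fold_identifiers(identifiers)
on_bits = [index for index, bit in enumerate(bits) if bit]
print(on_bits)  # two on bits remain for four substructure occurrences of three kinds
```

In this toy case, five identifier occurrences (three distinct values) collapse into two on bits, which is the loss of information that later motivates complementing the fingerprint with the chemical formula.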
In our implementation, we have used the RDKit [50] library to compute the molecular fingerprints from the SMILES code of the molecules, obtained from the PubChem [51] repository.
Tanimoto similarity and virtual screening
Labelling molecules with these fingerprints allows an easy and fast quantification of the difference or similarity between two molecules A and B. We have chosen the Tanimoto similarity [47, 48], \(S_{A,B}\), calculated as:
\[S_{A,B} = \frac{c}{a + b - c}\]
where a is the number of on bits in molecule A, b the number of on bits in molecule B and c the number of bits that are on in both molecules [47]. The closer the value is to 1, the more similar molecules A and B are. Therefore, \(S_{A,B}=1\) means A and B are the same, except for the limitations due to the local character of the fingerprints and the information lost in the hashing step of the fingerprint generation process.
Using this similarity metric, we can identify and rank candidate molecules via virtual screening as described in [42]: firstly, the Tanimoto similarity [47, 48] between the predicted fingerprint and each molecule in the database is computed. Then, the candidates are ranked by decreasing order. At the end, the top-k candidates are returned as the output of the screening process, where k is an optional parameter set by the user. Here, the Tanimoto similarity serves both as a ranking metric and as the model’s confidence in the prediction.
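As an illustration of the screening procedure, the following sketch computes the Tanimoto similarity as c / (a + b - c), with a, b and c the numbers of on bits in A, in B and in both, and returns the top-k candidates. The function names, the 8-bit toy fingerprints and the molecule labels are hypothetical, and returning 0 when both fingerprints are empty is a convention assumed here, not one stated in the text.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two equal-length bit vectors (lists of 0/1)."""
    a = sum(fp_a)                                # on bits in A
    b = sum(fp_b)                                # on bits in B
    c = sum(x & y for x, y in zip(fp_a, fp_b))   # on bits shared by A and B
    denominator = a + b - c
    # Assumed convention: two empty fingerprints get similarity 0.
    return c / denominator if denominator else 0.0


def virtual_screening(predicted_fp, database, k=5):
    """Rank database entries by decreasing similarity and return the top k."""
    scored = [(name, tanimoto(predicted_fp, fp)) for name, fp in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy database with 8-bit fingerprints (real ECFP4 vectors have 1024 bits).
database = {
    "mol_A": [1, 1, 0, 0, 1, 0, 0, 0],
    "mol_B": [1, 1, 1, 0, 0, 0, 0, 0],
    "mol_C": [0, 0, 0, 1, 0, 1, 1, 0],
}
predicted = [1, 1, 0, 0, 1, 1, 0, 0]  # imperfect prediction of mol_A's fingerprint

top2 = virtual_screening(predicted, database, k=2)
print(top2)
```

Here the predicted fingerprint has one spurious on bit, yet `mol_A` still ranks first, which mirrors the observation that a partially wrong fingerprint can be enough for a correct identification.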
Architecture of models
In this work, we have developed two CNN models: (i) a multilabel classification model for the prediction of molecular fingerprints; and (ii) a regression model for the count of each atomic species within the structure, from which we construct the chemical formula.
The molecular fingerprint model is an adaptation of EfficientNet-B0 [52], where we change the first convolutional layer from 3 to 10 channels so it accepts stacks of 10 constant-height HR-AFM images as input, allowing the model to take the whole z–range at once (see Fig. S1 for a diagram of the architecture). The final layer consists of a Dense layer of size 1024 with sigmoid activation. A critical step for improving the model’s performance on experimental images was to replace the first BatchNorm layer of the EfficientNet model with a Dropout layer with dropout probability \(p=0.5\). The dropout layer prevents co-adaptation of neurons [53], which makes the model robust against experimental conditions (noise, plane tilting, etc.) that are not present in the simulated images used for the training (see below). The chemical formula model is constructed in the same way, but using as the final layer a Dense layer of 10 neurons with ReLU activation.
Training and evaluation
This work aims to create an end–to–end molecular identification tool that uses a 3D stack of experimental AFM images as input. However, training neural networks requires a large number of labeled samples and there is currently no such dataset for experimental images. For this reason, we train and evaluate our models primarily on simulated images from the QUAM–AFM [41] dataset, which contains 165 million HR-AFM images theoretically generated from a selection of 685,513 isolated quasi-planar molecules from PubChem [51] that span the most relevant bonding structures and chemical species in organic chemistry. These molecules can contain carbon, hydrogen, nitrogen, oxygen, sulphur, phosphorus and halogen atoms (fluorine, chlorine, bromine, and iodine) and range from 9 to 85 atoms. A more detailed analysis of the main functional groups in the database can be found in the Supplementary Information under section S9. The QUAM–AFM dataset can be freely downloaded [54]. For each molecule, HR-AFM images were simulated for 10 tip–sample distances considering six different values of cantilever oscillation amplitude and four values of the tilt stiffness of the CO molecule to cover a wide range of experimental operation conditions.
We performed data curation by removing the non–live molecules [55] from our dataset. These molecules were accessible through the PubChem API at the date of creation of QUAM–AFM, but have since become inaccessible. After this step, our train/validation/test split consisted of 285k/15k/280k randomly sampled molecules, respectively. Because the training set is so large, we divide each epoch into 10 virtual epochs and compute validation metrics at the end of each of these virtual epochs (see Supplementary Information, Fig. S1).
For the molecular fingerprint model, the binary cross-entropy with logits loss function, equipped with balanced positive weights, was used as the training criterion:
\[\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[ p_c\, y_i \log \sigma (x_i) + (1-y_i)\log \left( 1-\sigma (x_i)\right) \right]\]
where \(N\) is the number of bits in the fingerprint, \(x_i\) is the model output (logit) for the \(i\)-th bit, \(y_i\) is the ground truth for the \(i\)-th bit of the fingerprint, \(\sigma\) is the sigmoid function and \(\sigma (x_i)\) the probability predicted by the model for that same bit. The \(p_c\) parameter is used to give more weight to correctly predicting the on bits (see section S3). Molecular fingerprints are quite sparse and, without this term, the network could learn to predict only 0’s.
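To make the role of the positive weight concrete, the sketch below evaluates a binary cross-entropy with logits in which the on-bit term is multiplied by a weight `pos_weight` (playing the role of \(p_c\)). It is a plain-Python illustration under stated assumptions, not the training code: a single scalar weight and averaging over bits are assumed here, and the function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logits, targets, pos_weight):
    """Mean over bits of -[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].

    Uses the identities -log(sigmoid(x)) = softplus(-x) and
    -log(1 - sigmoid(x)) = softplus(x) to avoid overflow.
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += pos_weight * y * softplus(-x) + (1 - y) * softplus(x)
    return total / len(logits)


# Sparse toy target: one on bit out of four.
targets = [1, 0, 0, 0]
always_off = [-10.0, -10.0, -10.0, -10.0]  # a model that predicts 0 everywhere

loss_unweighted = weighted_bce_with_logits(always_off, targets, pos_weight=1.0)
loss_weighted = weighted_bce_with_logits(always_off, targets, pos_weight=3.0)
print(loss_unweighted, loss_weighted)
```

With the larger weight, the all-zeros prediction is penalized roughly three times more for the missed on bit, which is why the weighting discourages the trivial solution on sparse fingerprints.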
Regarding the training strategy, we initialized the fingerprint model from pre-trained weights [52], as this improves accuracy when in–domain training data is scarce [56, 57]. The bias of the last layer is initialized with a prior to accelerate the convergence of the model (see section S1 for details). Then, the model was trained until a plateau at mean Tanimoto similarity \(S=0.88\) was reached on the validation set (Figure S2a). We select the last checkpoint as our model’s weights.
For the model that predicts the chemical formula, we followed a transfer learning strategy: we cloned the weights from the backbone of the fingerprint model to the backbone of the chemical formula model and trained it end-to-end (Figure S2b). Since we had a very good pre-training, the model is almost converged after the first virtual epoch (MSE: \(0.1 \, \text {atoms}^2\)), after which the validation metric oscillates for the rest of the training. The training hyperparameters of both models were chosen to be the standard ones for classification and regression tasks (Table S1). For the chemical formula model, we define accuracy as the probability of a perfect prediction, a very demanding metric, as a single miscount in the number of hydrogens is considered a failed prediction. Finally, both models were trained on an NVIDIA A40 GPU and 12 Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz CPUs using a maximum wall time of 72 h.
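The step from predicted atom counts to a chemical formula, and the strict accuracy defined above, can be sketched as follows. Several choices are assumptions made only for illustration: the ordering of the 10 outputs in `ELEMENTS`, rounding each regressed count to the nearest integer, and writing the formula in Hill order. The function names are hypothetical.

```python
# Assumed ordering of the 10 regression outputs (the species allowed in the dataset).
ELEMENTS = ["C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"]


def counts_to_formula(predicted_counts):
    """Round regressed atom counts and write them as a Hill-order formula."""
    counts = {el: max(0, round(v)) for el, v in zip(ELEMENTS, predicted_counts)}
    # Hill order: carbon, hydrogen, then the remaining elements alphabetically.
    order = ["C", "H"] + sorted(el for el in ELEMENTS if el not in ("C", "H"))
    parts = []
    for el in order:
        n = counts[el]
        if n > 0:
            parts.append(el if n == 1 else f"{el}{n}")
    return "".join(parts)


def formula_accuracy(predicted_formulas, true_formulas):
    """Fraction of molecules whose predicted formula matches exactly."""
    hits = sum(p == t for p, t in zip(predicted_formulas, true_formulas))
    return hits / len(true_formulas)


# Toy regression outputs close to PTCDA (C24H8O6) and to a halogenated pyridine.
ptcda = counts_to_formula([24.1, 7.8, 0.2, 5.9, 0, 0, 0, 0, 0, 0])
pyridine = counts_to_formula([5.0, 2.1, 0.9, 0, 0, 0, 0, 1.1, 0.8, 1.0])
# A single miscounted hydrogen makes the whole prediction count as a failure.
accuracy = formula_accuracy([ptcda, "C5H3BrClIN"], ["C24H8O6", pyridine])
print(ptcda, pyridine, accuracy)
```

The second comparison fails on one hydrogen only, which shows how demanding the exact-match criterion is.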
It is important to stress that, during the training of both models, strong data augmentations were applied in order to regularize the model and to reproduce the effect of experimental conditions on the HR–AFM images. These augmentations include rotations, translations, shears, in/out zooms, and Gaussian noise (see Table S2 for details).
DFT calculations and simulation of HR–AFM images
The HR–AFM images in the QUAM–AFM dataset were simulated using the gas phase molecular structure available from PubChem. However, when molecules are deposited on a substrate, the interaction with the surface changes the molecular corrugation, which translates into differences in the contrast of the images.
To study the ability of the model to identify molecules on a substrate, we simulated the adsorption of PTCDA on Cu(111) and Ag(111) surfaces. For the Cu(111) slab, we used a unit cell of size 20.4 × 22.08 × 30.83 Å (including about 25.49 Å of vacuum). The slab model contained 3 layers of copper, making 238 atoms in total. As starting geometry, we placed the PTCDA molecule in its gas phase structure at 2.86 Å above the slab. For the geometry relaxation, DFT calculations were carried out using the VASP package [58] with a cutoff energy of 425 eV for the plane-wave basis set. The projector augmented wave method [59, 60] was used to build the pseudopotentials of all constituent species. We use the PBE generalized gradient approximation [61] for the exchange-correlation part of the energy and the semiempirical DFT-D3 dispersion correction [62] to model the van der Waals interaction. The PTCDA structure adsorbed on Cu(111) was converged using a conjugate gradient algorithm until forces on the atoms were smaller than 0.01 eV/Å, while each electronic self-consistent loop was calculated with a precision of \(10^{-5}~\text {eV}\). A vertical vacuum region of 22.4 Å was established between the periodic images and a dipole correction along the z-axis was also used. As we were only interested in the effect of the substrate on the molecular corrugation, we left the substrate fixed and sampled the Brillouin zone using only the \(\Gamma\) point.
The geometry of the PTCDA molecule adsorbed on Ag(111) was obtained from Ref. [63], where it was calculated with the same parameters but with an energy cutoff of 400 eV and a convergence criterion of \(10^{-6}~\text {eV}\) for the SCF calculation.
HR–AFM images are simulated with the same model used to generate the QUAM–AFM data set [41]: an approximate implementation of the full density based model (FDBM) [13] in the latest version of the PPMAFM code [41, 64]. Only the molecular structures obtained from the adsorption calculations are included in the corresponding HR–AFM simulations.
Results and discussion
Predicting ECFP4 from HR–AFM images
First, we evaluate the performance of our chemical fingerprint prediction model. We use our model to predict the chemical fingerprints from the 3D stacks of HR–AFM images for the 279,905 molecules in the test set and plot the histogram of the Tanimoto similarity (red histogram in Fig. 2) between the predicted and ground truth fingerprints. This histogram has a median Tanimoto similarity equal to 0.95, demonstrating that we can predict the chemical fingerprints from HR–AFM image stacks very accurately. To compare with a baseline, we also compute a control histogram (blue histogram in Fig. 2) corresponding to the Tanimoto similarity of pairs of randomly chosen molecules. This control histogram has median 0.11 and a very low density for Tanimoto values greater than 0.4. From Fig. 2 we can conclude that, in most cases, we will not be able to predict the molecular fingerprints perfectly (S=1). Nevertheless, an imperfectly predicted fingerprint can still store enough information about the molecule for it to be identified. Statistically, a prediction with Tanimoto similarity higher than 0.5 should be enough to identify a molecule from the HR–AFM images.
The chemical information provided in the predicted fingerprint outperforms previous models [34,35,36, 38]. Alldritt et al. [34] focused on structural elucidation and only presented a few preliminary results for chemical recognition. Later work [35] addressed molecular identification using GNNs but showed modest accuracy compared to this work. In our work with M-RNNs [36], we presented a model that was able to identify the chemical groups in a molecule from a stack of simulated HR-AFM images with 95% precision, comparable to the present work. However, the retrieved ECFP4s contain more information as they encode both molecular moieties and structural information. Finally, in our recent CGAN model for ball-and-stick prediction [38], although remarkable in its prediction for either the structure or composition (\(\sim\)95%), the combined performance dropped to \(\sim\)76%. Furthermore, the visual character of the information retrieved by the CGAN is not suitable for its subsequent use in the prediction of other molecular properties, such as structure–activity relationships and similarity searching, where the ECFPs have already shown their potential. Thus, the high precision demonstrated by our model in predicting ECFP4 marks a significant advance in HR-AFM image analysis.
Molecular identification via virtual screening
One of the main goals of this work is to automate molecular identification. Our hypothesis is that a molecule can be identified through its predicted ECFP4 performing a virtual screening: we calculate the Tanimoto similarity of the predicted fingerprint against the fingerprints of all the molecules from the reference dataset and retrieve the top candidates, i.e., those with the highest Tanimoto.
Fig. 3. Identification accuracy versus corrugation. We compute the accuracy for molecules with corrugation < 25 pm (green), 25–75 pm (orange), 75–125 pm (purple) and > 125 pm (magenta). Dashed black lines represent the accuracy over all corrugation groups. Enriching the virtual screening with the chemical formula improves accuracy across all corrugation groups
Fig. 4. Examples of identification of polycyclic aromatic hydrocarbons over theoretical 3D stacks. Columns from left to right: constant-height AFM images (1–3), ground truth molecule (4), and top (5) and second (6) candidates. Under each candidate, the Tanimoto similarity, S, and the corrugation, \(\Delta z\), are given. Molecules from first to last row are Tetrabenzo(a,c,g,s)heptaphene (CID: 143,932), Benzo[1,2,3-bc: 4,5,6-b’c’]dicoronene (CID: 636,081) and Tetramethyl-Undecacyclo-Tetraconta-Icosaene (CID: 59,721,948), where methyl groups have been highlighted. The model identifies the correct molecules with high confidence
Fig. 5. Examples of identification of molecules with nitrogen, oxygen and sulfur atoms. From first to last row: 4,4’-Bi[1,2,3-thiadiazole] (CID: 2,748,722), 5-methyl-2-(2H-triazole-4-carbonylamino)thiophene-3-carboxylic acid (CID: 63,616,469) and 5-Pyrazolo[1,5-a]pyridin-3-yl-1,2,4-oxadiazole-3-carboxylic acid (CID: 103,122,053). In the last two rows, the differences between candidate molecules have been highlighted to guide the reader
Fig. 6. Examples of identification of molecules with chemical species of the same group. Columns organized as in Fig. 4. From first to last row: 3-Methylthieno[3,2-b]furan (CID: 58,899,415), 2-Bromo-4-chloro-3-iodopyridine (CID: 59,332,995) and 4-Oxo-4-(quinolin-3-ylamino)butanoic acid (CID: 861,757)
Fig. 7. Examples of incorrect identifications. Molecules from first to last row are Methyl 12-oxobenzo[b]xanthene-9-carboxylate (CID: 135,178,930), Hexacene (CID: 123,044), Resorcinoxide (CID: 129,866,873) and N-(3-hydrazinylidene-1H-inden-2-ylidene)hydroxylamine (CID: 137,221,883). In Resorcinoxide, there is an oxygen atom inaccessible to the tip, as it is under the benzene ring (highlighted in purple). The identification fails not because of the model, but rather because the AFM does not have access to this region of the molecule
On our test dataset of simulated images of 279,905 molecules, we achieve top-1 and top-5 retrieval accuracies of 95.43% and 97.92%, respectively (see Fig. 3). This means that our model is able to correctly identify the molecule in the vast majority of the cases in our dataset, which includes a large variety of homo- and hetero- acyclic or cyclic compounds with the most relevant functional groups, including alkanes, alkenes, alkynes, alcohols, thiols, ethers, aldehydes and ketones, carboxylic acids, amines, amides, imines, esters, nitriles, nitro and azo compounds, halocarbons, and acyl halides. Figures 4, 5, 6 show a few examples of correct identification on different sets of molecules.
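The top-k retrieval accuracy quoted above can be computed as in the following sketch, given the ranked candidate list returned by the screening for each test molecule. The helper name `top_k_accuracy` and the toy identifiers are hypothetical and serve only to illustrate the metric.

```python
def top_k_accuracy(ranked_candidates, true_ids, k):
    """Fraction of queries whose true molecule appears among the first k candidates.

    ranked_candidates: one list per query, ordered by decreasing Tanimoto similarity.
    true_ids: the ground-truth identifier for each query.
    """
    hits = sum(
        1 for ranked, truth in zip(ranked_candidates, true_ids) if truth in ranked[:k]
    )
    return hits / len(true_ids)


# Toy screening results for four query stacks (identifiers are placeholders).
ranked = [
    ["a", "b", "c"],  # correct molecule ranked first
    ["d", "e", "f"],  # correct molecule ranked second
    ["g", "h", "i"],  # correct molecule not retrieved
    ["j", "k", "l"],  # correct molecule ranked third
]
truths = ["a", "e", "z", "l"]

top1 = top_k_accuracy(ranked, truths, k=1)
top3 = top_k_accuracy(ranked, truths, k=3)
print(top1, top3)
```

By construction, the top-k accuracy can only grow with k, so the gap between top-1 and top-5 measures how often the correct molecule is retrieved but outranked by a close analogue.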
Our first test is on polyaromatic hydrocarbons (PAHs) (Fig. 4), which include only two chemical species (carbon and hydrogen). Independently of the number of rings or the bond order distribution, the model correctly identifies these molecules. All these PAHs have been correctly identified, even though the predicted fingerprints weren’t 100% correct (a perfect prediction would give a Tanimoto similarity \(S=1\)). This highlights the robustness of the virtual screening, where even partially correct predictions of the fingerprints are enough to correctly identify the molecule (Fig. 2). In the first row, the top and second candidates have quite different topologies and the difference in Tanimoto similarity is \(\Delta S=0.14\). In comparison, the second and third rows show candidates with very similar topologies, which is why the difference in Tanimoto similarity is less than a third of that value (\(\Delta S = 0.04\) in both cases). Since the topologies are so similar, the model is less confident in predicting whether the correct molecule is one or the other (even if, in the end, it recognizes the correct structure).
The molecule in the last row, Tetramethyl-Undecacyclo-Tetraconta-Icosaene (CID: 59,721,948), shows that the model not only works for perfectly flat cases, but also correctly identifies the presence of methyl groups. The Tanimoto similarity (\(S=0.54\)) is relatively low compared with the two previous examples, meaning that the corrugation had an effect on the accuracy of the model.
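The screening step described here can be sketched in a few lines: the Tanimoto similarity is the number of bits set in both fingerprints divided by the number of bits set in either, and candidates are ranked by decreasing similarity to the predicted fingerprint. The sketch below represents each fingerprint as a set of on-bit indices; the library contents, bit indices and function names are illustrative assumptions, not real ECFP4 values or the authors' implementation.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def screen(predicted_bits, library, k=5):
    """Return the k library entries most similar to the predicted fingerprint."""
    scored = [(tanimoto(predicted_bits, bits), name) for name, bits in library.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Toy library with hypothetical bit indices.
library = {
    "mol_A": {1, 5, 9, 12, 20},
    "mol_B": {1, 5, 9, 12, 33},
    "mol_C": {2, 7, 40},
}
# Imperfect prediction for mol_A: one bit missing (12), one spurious bit (51).
predicted = {1, 5, 9, 20, 51}

candidates = screen(predicted, library, k=2)
delta_s = candidates[0][1] - candidates[1][1]  # confidence gap between top two
```

Even with two wrong bits the correct entry is retrieved first, and the gap `delta_s` plays the role of the \(\Delta S\) values quoted above: the more alike the top candidates are, the smaller it becomes.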
Next, we consider molecules including nitrogen, oxygen and sulfur atoms besides carbon and hydrogen. The three chosen molecules (see Fig. 5) display a variety of bonding configurations and include a number of different chemical groups: carboxylic, methyl, and amide groups, and thiophene, thiadiazole or oxadiazole rings. Despite the presence of several atomic species and the structural complexity, our model correctly identified the target molecule with high confidence. In the first row, the model correctly predicts the two thiadiazole groups. In the second row, the top two candidates only differ by a triazole vs a thiadiazole (that is, an N-H group vs an S atom), while in the third row, the model correctly discriminates between a Pyrazolo[1,5-a]pyridine and a Pyrazolopyrimidine (a C for an N).
Lastly, we test whether we can distinguish between different bonding coordinations for the same chemical species or among elements in the same chemical family (same column in the Periodic Table). Figure 6 shows (first row) how the model is able to recognize and discriminate the furan and thieno groups in the molecule 3-Methylthieno[3,2-b]furan (CID:58,899,415). We can also discern different halogens (Fig. 6, second row) as seen for the 2-Bromo-4-chloro-3-iodopyridine (CID: 59,332,995) molecule. Finally, in the third row, we see how, for the 4-Oxo-4-(quinolin-3-ylamino)butanoic acid (CID: 861,757) molecule formed by a chain with methylene, amide and carboxyl groups, the ECFP4 was perfectly predicted.
As the interaction is so sensitive to the tip–sample distance, the internal corrugation of the molecule is one of the key contributors to the HR–AFM contrast. Disentangling this effect from the bonding configuration and the chemical composition to achieve molecular identification is a major challenge. In our study, we have restricted ourselves to molecules with corrugations smaller than \(185\,\text {pm}\), which include the presence of methyl groups and are within the height range from where information can be retrieved with the common constant–height operation mode for HR–AFM [34]. This limitation arises from both the strength of the Pauli repulsion on the higher atoms and the deflection of the CO probe, which contributes to sharpening the features associated with the higher atoms but, at the same time, veils the access to the lower ones, effectively creating regions that are inaccessible to the tip. Figure 3 plots the accuracy of the model for molecular identification for molecules from our test set of 279,905 molecules with corrugations in four different ranges. We do see a drop in accuracy when we move to larger corrugations, but the reduction is rather small (\(\simeq\)3.5% for the group with the largest corrugation).
Our fingerprint–based identification pipeline has very few misidentifications (less than 5% of the cases). In Fig. 7, we explore four typical failures that help us unveil the limitations of our model and illustrate how some of them can be easily fixed.
The case shown in the first row of Fig. 7 is archetypical: the retrieval fails because the molecular fingerprints of the ground truth molecule (Methyl 12-oxobenzo[b]xanthene-9-carboxylate, CID:135,178,930) and the top candidate (Methyl 12-oxobenzo[b]xanthene-8-carboxylate, CID:135,178,929) are indeed the same. Since the radius used to create the fingerprints is limited to next nearest neighbors (the two closest atoms), they cannot capture the switching in the relative position of radicals that are far away. In this case, there is a tie in the Tanimoto similarity and the order of the candidates is arbitrary.
The second row of Fig. 7 illustrates another case where molecular identification is hampered by the local character of the fingerprints: the difference between the predicted Naphthacene (CID:7080) and the second candidate (and, in this case, ground truth) Hexacene (CID:123,044) molecules is the number of benzene rings. Our fingerprints are binary, which means that they represent the presence or absence of certain molecular substructures, but they don’t retain information about the number of times they are present. This information is lost in the hashing step (see subsection Molecular fingerprints) in the construction of the ECFP4 fingerprints. In these cases, different molecules can have the same ECFP4 while having different chemical formulas. The retrieval failure cannot be attributed to the performance of the CNN in extracting the chemical information from the HR-AFM image stack but to the fingerprint codification.
On the third row we present a failure with a completely different origin. The Resorcinoxide (CID:129,866,873) molecule is corrugated (\(175\,\text {pm}\)). In the configuration we have used to calculate the HR-AFM images, an oxygen atom is under the benzene ring, inaccessible to the tip. As the AFM tip is not able to sense the full structure, the HR-AFM images cannot provide enough chemical information, and the prediction of the network fails. In this case, the accuracy is not limited by the model or the choice of molecular descriptor, but by the intrinsic limitation of HR–AFM operated in the constant-height mode to retrieve information for complex 3D structures from a single adsorption configuration.
Finally, in the last row, we show a case where the model has problems extracting the fingerprints from the HR–AFM image stack, as shown by the low Tanimoto similarity of the two top candidates. In particular, the model did recognize the hexagonal and pentagonal rings and the presence of an OH group, but failed to identify the nitrogen atoms in the ground-truth molecule (N-(3-hydrazinylidene-1H-inden-2-ylidene)hydroxylamine, CID:137,221,883) and predicted a molecule with carbons and oxygens instead (3H-indene-1,2-dicarbaldehyde, CID: 129,814,712) as the top candidate. However, the second candidate, with a very similar Tanimoto similarity, is the ground truth. Given the success of the model with other molecules containing N atoms, we attribute the failure in this case to the fact that the presence of OH and NH2 groups linked through an additional N atom is quite rare in organic compounds, and, in particular, in our training set.
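The loss of count information can be illustrated with a toy folding scheme. This is not the ECFP algorithm itself: the local-environment labels below are hypothetical stand-ins for the atom-centred substructures that ECFP4 enumerates, and SHA-256 is used only as a convenient deterministic hash. The point is that a binary fingerprint records which environments occur, not how many times.

```python
import hashlib


def fold_to_bits(environments, n_bits=1024):
    """Hash each substructure label to a bit position; repeated labels collapse."""
    bits = set()
    for env in environments:
        digest = hashlib.sha256(env.encode("utf-8")).digest()
        bits.add(int.from_bytes(digest[:4], "big") % n_bits)
    return bits


# Hypothetical environment labels for two linear acenes of different length.
# Both contain the same kinds of environments, only in different numbers
# (18 and 26 carbon atoms in total, respectively).
short_acene = ["terminal_ring_CH"] * 8 + ["inner_CH"] * 4 + ["fused_C"] * 6
long_acene = ["terminal_ring_CH"] * 8 + ["inner_CH"] * 8 + ["fused_C"] * 10

same_bits = fold_to_bits(short_acene) == fold_to_bits(long_acene)
same_counts = len(short_acene) == len(long_acene)
```

Here `same_bits` is true while `same_counts` is false: the two toy molecules are indistinguishable by their binary fingerprint, so their Tanimoto similarity to any prediction is identical and the ranking between them is arbitrary.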
The limitations posed by the hashing step (due to the associated information loss) can be solved with an additional model trained to predict the chemical formula from the HR–AFM stack. The accuracy of this model is near perfect (above 99.5%, Table S3), except for phosphorus atoms (78.5%), which are underrepresented compared to the rest of the chemical species in the dataset. This additional model immediately solves the misidentification between Naphthacene (4 rings) and Hexacene (6 rings) (Fig. 7, second row). Thus, the final pipeline for molecular identification consists of a virtual screening using the predicted ECFP4, which outputs k candidates with decreasing Tanimoto similarity, and a posterior re–ranking of the candidates by calculating the mean squared error between the predicted chemical formula and the true chemical formula of each candidate. With this strategy, which combines local (fingerprint) and global (chemical formula) features, the identification accuracy jumps from 95.43% to 97.59%, reducing misidentifications by almost half.
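The re-ranking step can be sketched as follows. This is an assumption-laden illustration rather than the authors' implementation: the text does not specify how the mean squared error is combined with the Tanimoto score, so the sketch sorts the k screened candidates by increasing formula error and uses the Tanimoto similarity only as a tie-breaker. The element list, function names and the Tanimoto value 0.90 are hypothetical.

```python
ELEMENTS = ("C", "H", "N", "O", "Br")  # illustrative subset of species


def formula_vector(formula):
    """Turn a dict such as {"C": 18, "H": 12} into a fixed-length count vector."""
    return [formula.get(element, 0) for element in ELEMENTS]


def formula_mse(predicted, candidate):
    """Mean squared error between two chemical formulas (as element counts)."""
    p, c = formula_vector(predicted), formula_vector(candidate)
    return sum((x - y) ** 2 for x, y in zip(p, c)) / len(ELEMENTS)


def rerank(candidates, predicted_formula):
    """Sort (name, tanimoto, formula) tuples by formula error, then by Tanimoto."""
    return sorted(
        candidates,
        key=lambda cand: (formula_mse(predicted_formula, cand[2]), -cand[1]),
    )


# Two candidates tied in Tanimoto similarity because their fingerprints coincide.
candidates = [
    ("Naphthacene", 0.90, {"C": 18, "H": 12}),
    ("Hexacene", 0.90, {"C": 26, "H": 16}),
]
predicted_formula = {"C": 26, "H": 16}  # assumed output of the formula model

reranked = rerank(candidates, predicted_formula)
```

With a correct formula prediction the tie is resolved in favour of Hexacene. In terms of error rates, going from 95.43% to 97.59% accuracy lowers misidentifications from 4.57% to 2.41% of the test set, a relative reduction of about 47%.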
Effect of the adsorption-induced molecular corrugation
Chemical identification of theoretically generated PTCDA molecules in the gas phase (first row) and adsorbed on Cu(111) (second row) and Ag(111) (third row). Tanimoto similarity for each candidate and predicted chemical formula are given under the candidate images. In the gas phase, the fingerprints are predicted perfectly, while on Cu(111) the Tanimoto similarity drops by 0.1. On Ag(111), the surface pushes away the middle oxygens, increasing their contrast with respect to the gas phase image. The differences in contrast can be clearly seen at \(z_{ts}\) = 310 and 330 pm (purple). The model interprets this contrast as NH groups (blue) and predicts the Perylimid molecule as first candidate instead of PTCDA
Our final goal is to develop a model capable of retrieving the molecular fingerprints from experimental HR–AFM images. In experiments, the molecules are necessarily adsorbed on a substrate, and, due to the molecule-substrate interaction, the adsorption configuration will differ from their gas phase structure. As the data set used for the training of the model is based on HR–AFM images calculated for the gas–phase configuration, it is important to test the ability of the model to identify a molecule from images corresponding to their structure upon adsorption on different substrates or on different configurations within the same substrate.
We have addressed this question with the PTCDA molecule, considering HR-AFM images simulated for its gas phase structure and for the adsorption configurations on both Cu(111) and Ag(111) surfaces, as determined from DFT calculations (see section DFT calculations and simulation of HR–AFM images for details). Figure 8 displays some of the simulated images in the HR–AFM 3D stack for the three cases and the predictions of the model for PTCDA in the gas phase (first row), adsorbed on Cu(111) (second row) and Ag(111) (third row). In the gas phase, PTCDA has a perfectly planar geometry and the model achieves a perfect prediction of the molecular fingerprints, with a Tanimoto similarity S=1. As shown there, virtual screening produces a tie between two structures with the same fingerprints, but the addition of the chemical formula model, which accurately retrieves the chemical composition \(\text {C}_{24}\text {H}_{8}\text {O}_{6}\) from the 3D stack, leads to the proper identification.
In the case of Cu(111), the interaction with the substrate corrugates the PTCDA structure, pulling the oxygen atoms in the corner towards the surface by 11 pm with respect to the central carbon ring, while the central oxygen is pushed 6 pm above (Fig. S4). This corresponds well with the contrast of our simulated HR-AFM image stacks, where the middle oxygen is brighter in the case of the Cu(111)–adsorbed molecule than the rest of the O atoms and also brighter than in the images for the gas–phase structure. The outer hexagonal rings are also slightly deformed, with the vertex occupied by the O atom protruding beyond the real O position due to its lone pair [29], as we have also observed in the case of substitutional nitrogen atoms [13]. Our model extracts the fingerprints quite accurately (Tanimoto similarity \(S= 0.89\)), while the chemical formula model predicts \(\text {C}_{25}\text {H}_{8}\text {O}_{6}\), not perfect, but good enough to achieve an unambiguous identification.
For PTCDA on Ag(111), the central O atoms are pushed up by 5 pm (Fig. S4), making them brighter than in the images for the gas–phase structure. The fingerprint model retrieves two molecules with a high Tanimoto similarity, Perylimid (\(S=0.80\)) –where the central atoms are replaced by NH groups– as the first candidate, and 1H-2-Benzopyrano[6’,5’,4’:10,5,6]anthra[2,1,9-def]isoquinoline-1,3,8,10(9H)-tetrone (CID: 118,580) (\(S=0.71\)) as the second one, while the chemical formula model predicts the correct composition. This is a tough case, where it is difficult to disentangle the effect of corrugation and chemical composition. The simulated HR–AFM images for Perylimid (some of them are shown in Fig. S4) are very similar to those calculated for the adsorption configuration of PTCDA on Ag(111), with only subtle differences in the outer areas beyond the O (or N) position. From our experience with other molecular systems [38], O and NH substitutionals produce very similar charge density distributions and, thus, HR–AFM contrast, slightly more repulsive in the NH case. However, the small upward displacement of the central oxygen results in image features that are very difficult to distinguish from those of NH groups. In summary, this example shows the ability of our identification procedure, combining the fingerprint and chemical formula models, to cope with the corrugation induced by the molecular adsorption, although further work is needed to assess its accuracy for certain chemical groups.
Experimental images
Chemical identification on experimental images. Molecules from first to last row are 1-Bromo-3,5-dichlorobenzene [65] (CID: 29,766), 2-iodotriphenylene [66] (CID: 88,955,426), PTCDA [65] (3,4,9,10-Perylenetetracarboxylic dianhydride CID: 67,191) and 2,7-Dibromopyrene [67] (CID: 13,615,479). The predicted chemical formula correctly solves the tie for the 2-iodotriphenylene molecule, predicting the GT molecule, but fails for PTCDA. In all cases, we extract meaningful chemical information from the experimental image stack
In previous sections, we demonstrated that our strategy for molecular identification, combining the fingerprint and chemical formula models, works exceptionally well for simulated images (as illustrated in Fig. 3), achieving, on our large test set, a retrieval accuracy of 97.59%. In Fig. 9, we benchmark our model over a limited set of experimental cases: 1-Bromo-3,5-dichlorobenzene (CID: 29,766) [65], 2-iodotriphenylene (ITP, CID: 88,955,426) [66], PTCDA [65] and 2,7-Dibromopyrene [67] (CID:13,615,479). 2-Iodotriphenylene was adsorbed on an Ag(111) surface while the rest of the molecules were deposited on Cu(111). Figures S5 and S6 show the complete stack of 10 constant–height images measured in the experiments. These experimental images clearly display the changes in the molecular configuration induced by the interaction with the substrate that we have already discussed from a theoretical perspective in the section Effect of the adsorption-induced molecular corrugation. For example, HR–AFM images for PTCDA on Cu(111) in the third row of Fig. 9 clearly show a brighter contrast on the left side of the molecule, at variance with the symmetry that we could expect from the gas–phase structure. This effect stems from the non–planar adsorption of the molecule to the substrate.
A key point when applying our model to experimental images and assessing its accuracy is the height range over which the molecules are imaged. In our dataset, the tip–sample distance ranges from 280 to 370 pm. This range, where the interaction changes from being slightly attractive to strongly repulsive, covers the typical imaging conditions. In experiments, the height range explored is determined with respect to a specific set point (the position of maximum approach or where a reference value of the tunneling current is measured by STM), but the absolute tip–sample distance is not known. Figures S5 and S6 compare the experimental image stacks used in Fig. 9 to their corresponding simulations, generated with the same method as our dataset [41]. From the comparison, we can conclude that, for 1-Bromo-3,5-dichlorobenzene, the experiments explore a tip–sample distance range similar to the one considered in the training of the model, while, for ITP (with an experimental range of 72 pm) and PTCDA, images are sampled much closer than what the model expects based on the training data. For 2,7-Dibromopyrene, the experimental range is 135 pm, 45 pm greater than the theoretical range of 90 pm. Our model has generalized to distances outside its training data to correctly predict the fingerprints of the molecule.
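The height-range comparison above is simple arithmetic, sketched below for reference. The 280 to 370 pm limits and the experimental spans are taken from the text; that the simulated stack also contains 10 images at uniform spacing is an assumption made here for illustration.

```python
TRAIN_Z_MIN_PM, TRAIN_Z_MAX_PM = 280, 370  # simulated tip-sample range (from the text)
N_IMAGES = 10                              # images per stack (assumed for simulations)

# Uniform spacing between consecutive constant-height images (assumption).
step_pm = (TRAIN_Z_MAX_PM - TRAIN_Z_MIN_PM) / (N_IMAGES - 1)
heights_pm = [TRAIN_Z_MIN_PM + i * step_pm for i in range(N_IMAGES)]
train_span_pm = TRAIN_Z_MAX_PM - TRAIN_Z_MIN_PM

# Height spans covered by the experimental stacks quoted in the text.
experimental_span_pm = {"ITP": 72, "2,7-Dibromopyrene": 135}
span_excess_pm = {
    name: span - train_span_pm for name, span in experimental_span_pm.items()
}
```

A positive entry in `span_excess_pm` means the experiment covers a wider height window than the training data; a negative one means a narrower window. Neither number says where the window sits in absolute terms, since the absolute tip–sample distance is not known experimentally.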
Despite the differences in the tip height range sampled in some of the experiments and the internal corrugation induced by the substrate, the model is able to generalize and provide meaningful information about the chemical composition and bonding topology of the molecules. For 1-Bromo-3,5-dichlorobenzene (Fig. 9, 1st row), the fingerprint model correctly identifies the molecule with a very high Tanimoto similarity, \(S=0.86\). Notice that the model is capable of discriminating among the different halogen species, identifying the presence of two Cl atoms and one Br atom, and retrieving the correct molecule. For the ITP and PTCDA molecules, the fingerprint model arrives at a tie because the first and second candidates both have the same fingerprints. In the case of PTCDA, it is rather remarkable that the model is able to retrieve the fingerprints (although with a low Tanimoto similarity S=0.44) given the low quality of the experimental image. In both cases, the tie stems from the fact that the frequency (number of occurrences) of a certain substructure is removed from the fingerprints.
In the case of ITP, although the predicted chemical formula is not completely correct (predicted \(\hbox {C}_{19}\hbox {H}_{11}\) vs the true chemical formula \(\hbox {C}_{18}\hbox {H}_{11}\)I), it provides enough information to break the tie and achieve molecular identification. This is not the case for PTCDA, where the prediction is \(\hbox {C}_{29}\hbox {H}_{14}\hbox {N}_{8}\hbox {O}_{4}\) while the true chemical formula is \(\hbox {C}_{24}\hbox {H}_{8}\hbox {O}_{6}\). In the last row, the low values for the Tanimoto similarity indicate that the model has problems retrieving the fingerprints. It correctly predicts the overall topology of the molecule and the presence of two bromine atoms, but interchanges the position of one of the Br atoms with a neighboring H atom. The chemical formula model correctly predicts the presence of the two Br atoms (predicted \(\hbox {C}_{14}\hbox {H}_{7}\hbox {Br}_{2}\)N vs the true chemical formula \(\hbox {C}_{16}\hbox {H}_{8}\hbox {Br}_{2}\)), but in this case it is not useful to choose between the two top candidates.
In the case of ITP and 1-Bromo-3,5-dichlorobenzene, we found that small variations in the scan size and pixel resolution of the experimental images caused huge changes in the ability of the model to retrieve the molecular fingerprints, as shown by the changes in the Tanimoto (of the order of 0.4). This sensitivity was absent from both the remaining experimental images and the simulated images.
We have tried to understand this sensitivity by looking at the attention maps of the images generated using Grad-CAM [68] (see section S8). After a careful exploration, we found that, for the scan sizes and pixel resolutions where the model performs best, the model pays more attention to the regions where the heteroatoms are located (Fig. S7). Although further work, systematically exploring more experimental cases, is clearly needed, these two examples suggest that attention maps, which do not require any additional input such as the ground truth fingerprints, should provide a powerful protocol for the validation of the model’s predictions on experimental images.
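For readers unfamiliar with Grad-CAM, its core computation can be sketched without any deep learning framework: each feature map of a convolutional layer is weighted by the spatial average of the gradient of a chosen output score with respect to that map, the weighted maps are summed, and negative values are clipped. The sketch below assumes the activations and gradients have already been extracted from the network; which layer and which output score were used for the fingerprint model is not stated here, so those choices, like the function name, are hypothetical.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heat map from K feature maps and the matching score gradients.

    Both arguments are lists of K maps, each an H x W nested list of floats.
    """
    # Channel weights: global average of the gradient over each feature map.
    weights = []
    for grad_map in gradients:
        values = [g for row in grad_map for g in row]
        weights.append(sum(values) / len(values))

    height, width = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for weight, act_map in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * act_map[i][j]

    # ReLU: keep only regions that push the score up.
    return [[max(value, 0.0) for value in row] for row in cam]


# Toy input: two 2x2 feature maps with constant gradients of +1 and -1.
activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
gradients = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
heat = grad_cam(activations, gradients)
```

In this toy case only the pixels where the positively weighted map is active survive the clipping. In practice the resulting map is upsampled to the image size and overlaid on the input, which is how one would check whether the attention falls on the heteroatom regions.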
To conclude, the fingerprint model shows a very promising performance, while the results from the chemical formula model are more modest but good enough in some cases to break the ties associated with the loss of information in the construction of the fingerprints. Despite these good results, a larger, systematic analysis with proper experimental data is necessary to further address the accuracy of our model.
Conclusions
A pipeline for automated molecular identification has been presented in this study. The pipeline predicts both molecular structure and chemical composition from HR-AFM image stacks. To achieve this, a convolutional neural network was trained using the QUAM–AFM dataset. The network retrieves the molecular fingerprint, ECFP4, with high accuracy, 0.95 median Tanimoto similarity in the test set. This accuracy is attributed to the choice of molecular descriptor. ECFP4 captures a lot of structural information and, unlike other codifications such as IUPAC names, SMILES, or SELFIES, ECFP4 uses binary vectors, making their prediction the well-studied problem of multilabel classification. Knowledge of a molecule’s fingerprints has a wide range of applications. Designed for high-throughput screening, these fingerprints are particularly good at encoding the presence or absence of specific substructures. Beyond molecular identification, they can be useful for other downstream tasks, such as predicting quantum mechanical properties [44], thermodynamic properties [69] and even finding new antibiotics with specific properties [46]. We have shown how it is possible to determine the molecule among a list of candidates by a virtual screening process that ranks the possible candidates in decreasing order of Tanimoto similarity. To compensate for the loss of the frequency of the identified substructures during the hashing process, we can re–rank the final candidates using another CNN designed to predict the chemical formula, boosting the accuracy of the prediction up to 97.6%.
Although trained and tested with simulated HR–AFM images, the final goal of our model is to retrieve the molecular fingerprints and achieve molecular identification from experimental images. To that end, we have shown that our model can distinguish chemical contrast from the structural changes induced by molecular adsorption and performed a few identification tests with experimental images that have shown very promising results.
A systematic collaboration between theory and experiment is needed to further develop the model to work under real experimental conditions. Particularly promising in this direction is the possibility of using attention maps to improve and validate the model’s predictions on experimental images. Even with its current limitations, our model provides an accurate, straightforward method for automated molecular identification that can boost the chemical analysis and characterization of complex molecular materials such as intermediates and products of on-surface reactions, soot molecules, fuel pyrolysis products, dissolved organic carbon, or other petroleum products as well as materials of interest for catalysis or astrochemistry.
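Framing fingerprint prediction as multilabel classification means that every bit is treated as an independent yes/no output. A minimal sketch of that formulation follows, assuming the network emits one logit per bit; the binary cross-entropy loss and the 0.5 threshold are common choices for this kind of problem and are assumptions here, not details confirmed by the text. A four-bit vector stands in for the 1024-bit ECFP4.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logits, target_bits):
    """Mean binary cross-entropy over independent bits."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for logit, target in zip(logits, target_bits):
        p = min(max(sigmoid(logit), eps), 1.0 - eps)
        total -= target * math.log(p) + (1 - target) * math.log(1.0 - p)
    return total / len(target_bits)


def to_bits(logits, threshold=0.5):
    """Binarize per-bit probabilities into a predicted fingerprint."""
    return [1 if sigmoid(logit) >= threshold else 0 for logit in logits]


logits = [3.2, -4.1, 0.7, -0.3]  # hypothetical network outputs
target = [1, 0, 1, 1]            # hypothetical ground-truth bits

predicted = to_bits(logits)
loss = bce_loss(logits, target)
```

The last bit is predicted as 0 although the target is 1, so the predicted and true vectors differ in one position; such partially wrong fingerprints are exactly what the Tanimoto-based screening is meant to tolerate.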
Availability of data and materials
The code required to reproduce this work is freely available on https://github.com/SPMTH/afm-molecular-fingerprints. The data and models can be accessed through https://zenodo.org/records/11483708.
References
García R, Pérez R (2002) Dynamic atomic force microscopy methods. Surf Sci Rep 47:197–301. https://doi.org/10.1016/S0167-5729(02)00077-8
Giessibl FJ (2003) Advances in atomic force microscopy. Rev Mod Phys 75:949–983. https://doi.org/10.1103/RevModPhys.75.949
Gross L, Mohn F, Moll N, Liljeroth P, Meyer G (2009) The chemical structure of a molecule resolved by atomic force microscopy. Science 325:1110–1114. https://doi.org/10.1126/science.1176210
Jelinek P (2017) High resolution SPM imaging of organic molecules with functionalized tips. J Phys: Condens Matter 29:343002
Gross L et al (2018) Atomic force microscopy for molecular structure elucidation. Angew Chem Int Ed 57:3888–3908. https://doi.org/10.1002/anie.201703509
Zhong Q, Li X, Zhang H, Chi L (2020) Noncontact atomic force microscopy: bond imaging and beyond. Surf Sci Rep 75:100509. https://doi.org/10.1016/j.surfrep.2020.100509
Gross L et al (2012) Bond-order discrimination by atomic force microscopy. Science 337:1326–1329. https://doi.org/10.1126/science.1225621
de Oteyza DG et al (2013) Direct imaging of covalent bond structure in single-molecule chemical reactions. Science 340:1434–1437. https://doi.org/10.1126/science.1238187
Clair S, de Oteyza DG (2019) Controlling a chemical coupling reaction on a surface: tools and strategies for on-surface synthesis. Chem Rev 119:4717–4776. https://doi.org/10.1021/acs.chemrev.8b00601
Altman EI, Baykara MZ, Schwarz UD (2015) Noncontact atomic force microscopy: an emerging tool for fundamental catalysis research. Acc Chem Res 48:2640–2648. https://doi.org/10.1021/acs.accounts.5b00166
Gross L, Mohn F, Moll N, Liljeroth P, Meyer G (2009) The chemical structure of a molecule resolved by atomic force microscopy. Science 325:1110–1114
Moll N, Gross L, Mohn F, Curioni A, Meyer G (2010) The mechanisms underlying the enhanced resolution of atomic force microscopy with functionalized tips. New J Phys 12:125020. https://doi.org/10.1088/1367-2630/12/12/125020
Ellner M, Pou P, Pérez R (2019) Molecular identification, bond order discrimination, and apparent intermolecular features in atomic force microscopy studied with a charge density based method. ACS Nano 13:786–795. https://doi.org/10.1021/acsnano.8b08209
Van Der Lit J, Di Cicco F, Hapala P, Jelinek P, Swart I (2016) Submolecular resolution imaging of molecules by atomic force microscopy: the influence of the electrostatic force. Phys Rev Lett 116:096102. https://doi.org/10.1103/PhysRevLett.116.096102
Hapala P et al (2016) Mapping the electrostatic force field of single molecules from high-resolution scanning probe images. Nat Commun 7:11560
Hapala P et al (2014) Mechanism of high-resolution STM/AFM imaging with functionalized tips. Phys Rev B 90:085421. https://doi.org/10.1103/PhysRevB.90.085421
Hanssen KØ et al (2012) A combined atomic force microscopy and computational approach for the structural elucidation of breitfussin A and B: highly modified halogenated dipeptides from Thuiaria breitfussi. Angew Chem Int Ed 51:12238–12241. https://doi.org/10.1002/anie.201203960
Balan V et al (2019) Vibrational spectroscopy fingerprinting in medicine: from molecular to clinical practice. Materials 12:2884
Simpson AJ, Simpson MJ, Soong R (2012) Nuclear magnetic resonance spectroscopy and its key role in environmental research. Environ Sci Technol 46:11488–11496
Meringer M, Schymanski EL (2013) Small molecule identification with MOLGEN and mass spectrometry. Metabolites 3:440–462
De Vijlder T et al (2018) A tutorial in small molecule identification via electrospray ionization-mass spectrometry: the practical art of structural elucidation. Mass Spectrom Rev 37:607–629
Sugimoto Y et al (2007) Chemical identification of individual surface atoms by atomic force microscopy. Nature 446:64
van der Heijden NJ et al (2016) Characteristic contrast in \(\Delta f_{\text{min}}\) maps of organic molecules using atomic force microscopy. ACS Nano 10:8517–8525. https://doi.org/10.1021/acsnano.6b03644
Guo CS, Van Hove MA, Zhang RQ, Minot C (2010) Prospects for resolving chemical structure by atomic force microscopy: a first-principles study. Langmuir 26:16271–16277. https://doi.org/10.1021/la101317s
Schuler B, Meyer G, Peña D, Mullins OC, Gross L (2015) Unraveling the molecular structures of asphaltenes by atomic force microscopy. J Am Chem Soc 137:9870–9876. https://doi.org/10.1021/jacs.5b04056
Schuler B et al (2017) Characterizing aliphatic moieties in hydrocarbons with atomic force microscopy. Chem Sci 8:2315–2320. https://doi.org/10.1039/C6SC04698C
Zahl P, Zhang Y (2019) Guide for atomic force microscopy image analysis to discriminate heteroatoms in aromatic molecules. Energy Fuels 33:4775–4780. https://doi.org/10.1021/acs.energyfuels.9b00165
Tschakert J et al (2020) Surface-controlled reversal of the selectivity of halogen bonds. Nat Commun 11:5630. https://doi.org/10.1038/s41467-020-19379-4
Zahl P et al (2021) Hydrogen bonded trimesic acid networks on Cu(111) reveal how basic chemical properties are imprinted in HR-AFM images. Nanoscale 13:18473–18482. https://doi.org/10.1039/D1NR04471K
Schulz F et al (2021) Imaging Titan’s organic haze at atomic scale. Astrophys J Lett 908:L13
Kaiser K et al (2022) Visualization and identification of single meteoritic organic molecules by atomic force microscopy. Meteorit Planet Sci 57:644–656
Shimizu TK et al (2020) Effect of molecule-substrate interactions on the adsorption of meso-dibenzoporphycene tautomers studied by scanning probe microscopy and first-principles calculations. J Phys Chem C 124:26759–26768
Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60:84–90. https://doi.org/10.1145/3065386
Alldritt B et al (2020) Automated structure discovery in atomic force microscopy. Sci Adv 6:eaay6913. https://doi.org/10.1126/sciadv.aay6913
Oinonen N, Kurki L, Ilin A, Foster AS (2022) Molecule graph reconstruction from atomic force microscope images with machine learning. MRS Bull 47:1–11. https://doi.org/10.1557/s43577-022-00324-3
Carracedo-Cosme J, Romero-Muñiz C, Pou P, Pérez R (2023) Molecular identification from AFM images using the IUPAC nomenclature and attribute multimodal recurrent neural networks. ACS Appl Mater Interfaces 15:22692–22704. https://doi.org/10.1021/acsami.3c01550
Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: A method for automatic evaluation of machine translation. In 40th Proc. Annu. Meet. ACL, 311–318 (Association for Computational Linguistics, Philadelphia, Pennsylvania, 2002). https://doi.org/10.3115/1073083.1073135
Carracedo-Cosme J, Pérez R (2024) Molecular identification with atomic force microscopy and conditional generative adversarial networks. npj Comput Mater. https://doi.org/10.1038/s41524-023-01179-1
Todeschini R, Consonni V (2000) Handbook of molecular descriptors. John Wiley & Sons, Ltd., Hoboken
Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inform Model 50:742–754
Carracedo-Cosme J, Romero-Muñiz C, Pou P, Pérez R (2022) QUAM-AFM: a free database for molecular identification by atomic force microscopy. J Chem Inf Model 62:1214–1223. https://doi.org/10.1021/acs.jcim.1c01323
Willett P (2006) Similarity-based virtual screening using 2D fingerprints. Drug Discov Today 11:1046–1053
Coley CW, Barzilay R, Green WH, Jaakkola TS, Jensen KF (2017) Convolutional embedding of attributed molecular graphs for physical property prediction. J Chem Inf Model 57:1757–1772
Wu Z et al (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9:513–530
Winter R, Montanari F, Noé F, Clevert D-A (2019) Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem Sci 10:1692–1701
Wong F et al (2024) Discovery of a structural class of antibiotics with explainable deep learning. Nature 626:177–185
Bajusz D, Rácz A, Héberger K (2015) Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminf 7:1–13
Butina D (1999) Unsupervised data base clustering based on daylight’s fingerprint and tanimoto similarity: a fast and automated way to cluster small and large data sets. J Chem Inf Comput Sci 39:747–750
Ucak UV, Ashyrmamatov I, Lee J (2023) Reconstruction of lossless molecular representations from fingerprints. J Cheminf 15:1–11
Landrum G (2012) Fingerprints in the RDKit. http://rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf
Kim S et al (2023) PubChem 2023 update. Nucleic Acids Res 51:D1373–D1380
Tan M, Le Q (2019) EfficientNet: Rethinking model scaling for convolutional neural networks. In Chaudhuri, K. & Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, vol. 97 of Proceedings of Machine Learning Research, 6105–6114 (PMLR, 2019). https://proceedings.mlr.press/v97/tan19a.html
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res (JMLR) 15:1929–1958
Carracedo-Cosme J, Romero-Muñiz C, Pou P, Pérez R (2021) QUAM-AFM: a free database for molecular identification by atomic force microscopy. https://doiorg.publicaciones.saludcastillayleon.es/10.21950/UTGMZ7
Kim S (2016) Getting the most out of pubchem for virtual screening. Expert Opin Drug Discov 11:843
Girshick R, Donahue J, Darrell T, Malik J (2013) Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524
Huh M, Agrawal P, Efros AA (2016) What makes ImageNet good for transfer learning? arXiv:1608.08614
Kresse G, Furthmüller J (1996) Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Phys Rev B - Condens Matter Mater Phys 54:11169–11186. https://doiorg.publicaciones.saludcastillayleon.es/10.1103/PhysRevB.54.11169
Blöchl PE (1994) Projector augmented-wave method. Phys Rev B 50:17953. https://doiorg.publicaciones.saludcastillayleon.es/10.1103/PhysRevB.50.17953
Kresse G, Joubert D (1999) From ultrasoft pseudopotentials to the projector augmented-wave method. Phys Rev B 59:1758. https://doiorg.publicaciones.saludcastillayleon.es/10.1103/PhysRevB.59.1758
Perdew JP, Burke K, Ernzerhof M (1996) Generalized gradient approximation made simple. Phys Rev Lett 77:3865–3868. https://doiorg.publicaciones.saludcastillayleon.es/10.1103/PhysRevLett.77.3865
Grimme S, Antony J, Ehrlich S, Krieg H (2010) A consistent and accurate ab initio parametrization of density functional dispersion correction (dft-d) for the 94 elements h-pu. J Chem Phys 132:154104. https://doiorg.publicaciones.saludcastillayleon.es/10.1063/1.3382344
Ventura-Macías E (2023) Imaging molecules at surfaces: First-principles methods for Force and Tunneling Microscopy with CO tips. Ph.D. thesis, Universidad Autónoma de Madrid Departamento de Física Teórica de la Materia Condensada
Liebig A, Hapala P, Weymouth AJ, Giessibl FJ (2020) Quantifying the evolution of atomic interaction of a complex surface with a functionalized atomic force microscopy tip. Sci Rep 10:14104–14116
Oinonen N et al (2022) Electrostatic discovery atomic force microscopy. ACS Nano 16:89–97. https://doiorg.publicaciones.saludcastillayleon.es/10.1021/acsnano.1c06840
Martin-Jimenez D et al (2019) Bond-level imaging of the 3d conformation of adsorbed organic molecules using atomic force microscopy with simultaneous tunneling feedback. Phys Rev Lett 122:196101. https://doiorg.publicaciones.saludcastillayleon.es/10.1103/PhysRevLett.122.196101
Zhong Q et al (2021) Constructing covalent organic nanoarchitectures molecule by molecule via scanning probe manipulation. Nat Chem 13:1133–1139
Selvaraju RR et al. (2017) Grad-cam: Visual explanations from deep networks via gradient-based localization. In Int. Conf. Comput. Vision (ICCV), 618–626 (IEEE Computer Society Press, Piscataway, NJ, USA, 2017)
Besel V, Todorović M, Kurtén T, Rinke P, Vehkamäki H (2023) Atomic structures, conformers and thermodynamic properties of 32k atmospheric molecules. Sci Data 10:1–11
Acknowledgements
We thank Adam Foster’s and Peter Liljeroth’s groups for making the experimental images of the 1-bromo-3,5-dichlorobenzene and PTCDA molecules publicly available. We thank Sebastian Ahles and Hermann A. Wegner for providing the 2-iodotriphenylene molecules and Daniel Martin-Jiménez for performing the LT-AFM measurements.
Funding
We acknowledge support from the Spanish Ministry of Science and Innovation, through projects PID2020-115864RB-I00, TED2021-132219A-I00 and PID2023-149150OB-I00, and the “María de Maeztu” Programme for Units of Excellence in R&D (CEX2023-001316-M). We acknowledge partial funding by the Deutsche Forschungsgemeinschaft (DFG) via grants EB 535/1-1, EB 535/4-1, SCHI 619/13 and the LOEWE Program of Excellence of the Federal State of Hesse via the LOEWE Focus Group PriOSS “Principles of On-Surface Synthesis”.
Author information
Contributions
M.G.L. conceived the original idea and wrote the code for training and evaluation of the models under the supervision of R.P. and P.P. M.W. analyzed the experimental data; D.E. and A.S. designed and supervised the experiments. All authors discussed the results and contributed to the writing of the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
González Lastre, M., Pou, P., Wiche, M. et al. Molecular identification via molecular fingerprint extraction from atomic force microscopy images. J Cheminform 16, 130 (2024). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13321-024-00921-1
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13321-024-00921-1