Fig. 1

The concept of sequence representation and pre-training is illustrated. In A, a drug sequence (SMILES string) is split into tokens. In B, the tokens are converted into integer values according to a predefined dictionary, and the encoder model (in this example, ChemBERTa) is pre-trained to recover the masked tokens (colored gray) to their original values. After pre-training, the class token (CLS) is used to represent a given sequence
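
To make panels A and B concrete, the following is a minimal, self-contained sketch of the tokenize, encode, and mask steps. The vocabulary, the character-level tokenization, and the masking probability below are illustrative assumptions for this sketch; ChemBERTa itself uses a learned subword (BPE) tokenizer and its own vocabulary.

```python
import random

# Hypothetical predefined dictionary mapping tokens to integer IDs (panel B);
# the real ChemBERTa vocabulary is learned from data and much larger.
vocab = {"[PAD]": 0, "[CLS]": 1, "[MASK]": 2, "C": 3, "O": 4, "N": 5,
         "(": 6, ")": 7, "=": 8, "1": 9}

def tokenize(smiles: str) -> list[str]:
    """Panel A: split a SMILES string into tokens (simplified: one character each)."""
    return list(smiles)

def encode(tokens: list[str]) -> list[int]:
    """Panel B: map tokens to integer IDs, prepending [CLS] to represent the sequence."""
    return [vocab["[CLS]"]] + [vocab[t] for t in tokens]

def mask(ids: list[int], p: float = 0.15) -> list[int]:
    """Randomly replace ~p of the non-[CLS] positions with [MASK] for pre-training."""
    return [vocab["[MASK]"] if i > 0 and random.random() < p else tok
            for i, tok in enumerate(ids)]

smiles = "CC(=O)O"             # acetic acid
tokens = tokenize(smiles)      # ['C', 'C', '(', '=', 'O', ')', 'O']
ids = encode(tokens)           # [1, 3, 3, 6, 8, 4, 7, 4]
masked = mask(ids)             # some positions replaced by ID 2 ([MASK])
print(tokens, ids, masked, sep="\n")
```

During pre-training, the encoder receives the masked sequence and is optimized to predict the original IDs at the masked positions; afterward, the hidden state at the [CLS] position serves as the sequence-level representation.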