Custom residues from SMILES#

You will learn: how to teach Molecule about a non-canonical residue’s bond orders, formal charges, and hydrogens by templating it from a SMILES string, then run systemPrepare() over the result.

Prerequisites:

Setup#

from moleculekit.molecule import Molecule
from moleculekit.tools.preparation import systemPrepare
from moleculekit.tools.nonstandard_residues import detectNonStandardResidues

Step 1 — Load a structure with several non-canonical residues#

We use 5VBL, a zinc metalloprotease structure that carries a bound peptide inhibitor built from five different non-canonical amino acids (HRG, ALC, OIC, NLE, and 200) plus a free OLC monoglyceride ligand. The peptide is cyclized through an isopeptide bond, which makes this a good stress test for templating + bond preservation. A Zn²⁺ ion is also present but does not need templating.

After loading, strip any hydrogens so the template’s addHs=True becomes the single, deterministic source of hydrogens:

mol = Molecule("5VBL")
mol.remove("hydrogen")
array([], dtype=uint32)

If the input is already H-free the return is just an empty array([], dtype=uint32); the call is a defensive no-op in that case.

Step 2 — Inspect what needs templating#

sorted(set(mol.resname))
['200',
 'ALA',
 'ALC',
 'ARG',
 'ASN',
 'ASP',
 'CYS',
 'GLN',
 'GLU',
 'GLY',
 'HIS',
 'HOH',
 'HRG',
 'ILE',
 'LEU',
 'LYS',
 'MET',
 'NLE',
 'OIC',
 'OLC',
 'PHE',
 'PRO',
 'SER',
 'THR',
 'TRP',
 'TYR',
 'VAL',
 'ZN']

You will see the five non-canonical resnames HRG, ALC, OIC, NLE, 200 alongside the standard amino-acid codes, plus the OLC free ligand. Each non-canonical residue needs a SMILES template so the bond orders, formal charges, and hydrogens come out right.

Step 3 — Template every NCAA from its RCSB SMILES#

The RCSB-style SMILES for a canonical amino acid is written as the free form (with N and C terminal groups). When the residue sits inside a peptide chain, templateResidueFromSmiles automatically strips the unmatched terminal heavy atoms (the OXT and one peptide-NH) before the MCS match — no manual trimming is required.

smiles = {
    "HRG": "C(CCNC(=N)N)C[C@@H](C(=O)O)N",         # homoarginine
    "ALC": "C1CCC(CC1)C[C@@H](C(=O)O)N",            # cyclohexyl-Ala
    "OIC": "C1CC[C@H]2[C@@H](C1)C[C@H](N2)C(=O)O",  # octahydroindole-2-carboxylic acid
    "NLE": "CCCC[C@@H](C(=O)O)N",                   # norleucine
    "200": "c1cc(ccc1C[C@@H](C(=O)O)N)Cl",          # 4-chloro-Phe
    "OLC": "CCCCCCCC(=O)OC[C@H](O)CO",              # monoglyceride free ligand
}

for resname, smi in smiles.items():
    mask = mol.resname == resname
    if mask.any():
        mol.templateResidueFromSmiles(mask, smiles=smi, addHs=True)
rdkit - INFO - Enabling RDKit 2026.03.2 jupyter extensions
moleculekit.rdkittools - INFO - Converted Molecule to RDKit mol with SMILES: N=C(N)NCCCCC(N)C=O
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'N=C(N)NCCCC[C@H](N)C=O'
moleculekit.rdkittools - INFO - Converted Molecule to RDKit mol with SMILES: NC(C=O)CC1CCCCC1
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'N[C@H](C=O)CC1CCCCC1'
moleculekit.rdkittools - INFO - Converted Molecule to RDKit mol with SMILES: O=CC1CC2CCCCC2N1
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'O=C[C@@H]1C[C@@H]2CCCC[C@@H]2N1'
moleculekit.rdkittools - INFO - Converted Molecule to RDKit mol with SMILES: CCCCC(N)C=O
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'CCCC[C@H](N)C=O'
moleculekit.rdkittools - INFO - Converted Molecule to RDKit mol with SMILES: NC(CC1=CC=C(Cl)C=C1)C(=O)O
moleculekit.rdkittools - INFO - Converted Molecule to RDKit mol with SMILES: CCCCCCCC(=O)OCC(O)CO

Each call mutates mol in place. Cross-residue covalent bonds — the peptide bonds connecting the NCAAs to their neighbours and the isopeptide cyclization bond — are detected automatically from mol.bonds, and the boundary atoms’ H counts are reduced so they are not over-protonated. OLC is a free ligand with no covalent bonds to the protein, so its template is applied without any boundary adjustments. When a residue appears multiple times in the structure, every copy is templated individually with the same SMILES.

import numpy as np
ncaa_sel = "resname " + " ".join(f"'{r}'" for r in smiles)
np.unique(mol.resname[mol.atomselect(ncaa_sel)], return_counts=True)
(array(['200', 'ALC', 'HRG', 'NLE', 'OIC', 'OLC'], dtype=object),
 array([22, 26, 26, 19, 24, 37]))

The per-resname atom counts after templating — each NCAA now carries heavy atoms + the hydrogens the SMILES specified.

Step 4 — Run systemPrepare#

specs = detectNonStandardResidues(mol)
pmol, applied_specs = systemPrepare(mol, detect_specs=specs, verbose=False)
moleculekit.rdkittools - INFO - Converted Molecule to RDKit mol with SMILES: NC(CO)CCCO
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'N[C@H](C=O)CCC=O'
moleculekit.rdkittools - INFO - Converted Molecule to RDKit mol with SMILES: NCCCCC(N)CO
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'NCCCC[C@H](N)C=O'
moleculekit.tools.preparation - WARNING - Both chains and segments are defined in Molecule.chain / Molecule.segid, however they are inconsistent. Protein preparation will use the chain information.
moleculekit.tools.preparation - WARNING - The following residues have not been optimized: ZN, OLC
moleculekit.tools.preparation - WARNING - Dubious protonation state: the pKa of 3 residues is within 1.0 units of pH 7.4.
moleculekit.tools.preparation - WARNING - Dubious protonation state:    HIS    11 A (pKa= 6.54)
moleculekit.tools.preparation - WARNING - Dubious protonation state:    ASP    75 B (pKa= 7.78)
moleculekit.tools.preparation - WARNING - Dubious protonation state:    CYS  1039 B (pKa= 7.59)

systemPrepare captures all non-peptidic bonds (including the isopeptide cyclization) before invoking PDB2PQR, then restores them on the prepared structure. Without templating, those bonds would either be silently dropped or trigger valence errors during hydrogenation.

pmol.atomselect(ncaa_sel).sum()
np.int64(154)

The five NCAAs and the OLC free ligand all survive the preparation pipeline with their full heavy-atom topology and hydrogens.

Recap#

  • templateResidueFromSmiles() transfers bond orders, formal charges, and hydrogens from a SMILES template onto a residue’s atoms by MCS matching.

  • Strip Hs first, then template with addHs=True to get a single, deterministic hydrogen pattern.

  • One SMILES per residue type; the templater handles every copy automatically and trims terminal atoms (OXT, terminal NH) for mid-chain residues.

  • Cross-residue covalent bonds — peptide bonds, glycosidic bonds, isopeptide bonds — are detected automatically; the boundary atom’s H count is corrected so it is not over-protonated.

Next#