Custom residues from SMILES#

You will learn: how to teach Molecule about a non-canonical residue’s bond orders, formal charges, and hydrogens by templating it from a SMILES string, then run systemPrepare() over the result.

Prerequisites:

The Non-standard residues tutorial.

Setup#

from moleculekit.molecule import Molecule
from moleculekit.tools.preparation import systemPrepare
from moleculekit.tools.nonstandard_residues import detectNonStandardResidues

Step 1 — Load a structure with several non-canonical residues#

We use 5VBL, a zinc metalloprotease structure that carries a bound peptide inhibitor built from five different non-canonical amino acids (HRG, ALC, OIC, NLE, and 200), plus a free OLC monoacylglycerol lipid (monoolein) from lipidic-cubic-phase crystallization. The peptide is cyclized through an isopeptide bond, which makes this a good stress test for templating + bond preservation. A Zn²⁺ ion is also present but does not need templating.

Load the structure:

mol = Molecule("5VBL")

Step 2 — Find what needs templating#

specs = detectNonStandardResidues(mol)
sorted(set(s.resname for s in specs))

['200', 'ALC', 'GLU', 'HRG', 'LYS', 'NLE', 'OIC', 'OLC']

Rather than eyeballing every residue name, detectNonStandardResidues() walks the bond graph and returns a spec for each residue that needs special handling. Here it flags the five non-canonical amino acids (HRG, ALC, OIC, NLE, 200), a free OLC residue, and the two canonical residues (GLU, LYS) that form the isopeptide cyclization bond.

Rather than filtering the spec list yourself, you can ask directly which resnames still need a template:

from moleculekit.tools.nonstandard_residues import residuesRequiringTemplate

# The resnames you must supply a SMILES/Molecule template for before systemPrepare:
residuesRequiringTemplate(mol)

['200', 'ALC', 'HRG', 'NLE', 'OIC', 'OLC']

residuesRequiringTemplate() is the direct way to ask which residue names still need a template from you; it returns only the residues systemPrepare cannot handle on its own, so anything it omits (canonical crosslink partners, and modified residues the force field already knows) needs no action. Here it flags the same five NCAAs found above, plus the free OLC lipid; the isopeptide partners GLU and LYS are correctly left out because systemPrepare handles that crosslink natively.

OLC is monoolein — a monoacylglycerol lipid from lipidic-cubic-phase crystallization, not part of the system we want to model. Drop it from both the structure and the spec list before going further:

mol.remove("resname OLC")
specs = [s for s in specs if s.resname != "OLC"]

moleculekit.molecule - INFO - Removed 15 atoms. 2938 atoms remaining in the molecule.

The five non-canonical amino acids each need a SMILES template so their bond orders, formal charges, and hydrogens come out right (next step); the canonical isopeptide partners (GLU, LYS) don’t need a template — systemPrepare() simply preserves their crosslink.

Step 3 — Template every NCAA from its RCSB SMILES#

The RCSB-style SMILES for a canonical amino acid is written as the free form (with N and C terminal groups). When the residue sits inside a peptide chain, templateResidueFromSmiles automatically strips the unmatched terminal heavy atoms (the OXT and one peptide-NH) before the MCS match — no manual trimming is required.

smiles = {
    "HRG": "C(CCNC(=N)N)C[C@@H](C(=O)O)N",         # homoarginine
    "ALC": "C1CCC(CC1)C[C@@H](C(=O)O)N",            # cyclohexyl-Ala
    "OIC": "C1CC[C@H]2[C@@H](C1)C[C@H](N2)C(=O)O",  # octahydroindole-2-carboxylic acid
    "NLE": "CCCC[C@@H](C(=O)O)N",                   # norleucine
    "200": "c1cc(ccc1C[C@@H](C(=O)O)N)Cl",          # 4-chloro-Phe
}

for resname, smi in smiles.items():
    mask = mol.resname == resname
    if mask.any():
        mol.templateResidueFromSmiles(mask, smiles=smi, addHs=True)

rdkit - INFO - Enabling RDKit 2026.03.4 jupyter extensions
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'N=C(N)NCCCC[C@H](N)C=O'
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'N[C@H](C=O)CC1CCCCC1'
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'O=C[C@@H]1C[C@@H]2CCCC[C@@H]2N1'
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'CCCC[C@H](N)C=O'

Each call mutates mol in place. Cross-residue covalent bonds — the peptide bonds connecting the NCAAs to their neighbours and the isopeptide cyclization bond — are detected automatically from mol.bonds, and the boundary atoms’ H counts are reduced so they are not over-protonated. When a residue appears multiple times in the structure, every copy is templated individually with the same SMILES.

import numpy as np
{r: int(np.sum(mol.resname == r)) for r in smiles}

{'HRG': 26, 'ALC': 26, 'OIC': 24, 'NLE': 19, '200': 22}

The per-resname atom counts after templating — each NCAA now carries heavy atoms + the hydrogens the SMILES specified.

Alternative: template from a reference Molecule#

If you already have the residue as a small reference structure that carries correct connectivity (for example an RCSB chemical-component CIF for the ligand), you can template from that Molecule instead of writing a SMILES string, using templateResidueFromMolecule(). The reference can come from any file format Molecule can read (SDF, mol2, CIF/PDBx, …), as long as it carries formal charges and bond orders. Here we use the RCSB component CIF; an SDF exported from a cheminformatics tool works the same way:

ref = Molecule("HRG.cif")   # reference carrying correct bonds, bond orders, and formal charges
mol.templateResidueFromMolecule("resname HRG", ref, addHs=True)

When the reference’s heavy-atom names are unique and match the residue’s, the reference is matched by atom name and its bond orders and formal charges are copied straight over. This is the unambiguous case for an RCSB component CIF, whose atom names already equal the deposited residue’s. When the names do not match (for example a reference read from an SDF, whose atom names are only element symbols), the reference is matched by element and connectivity instead, as the SMILES variant does. Either way the bond orders and formal charges are taken verbatim from the reference, so they must already be correct in it: templateResidueFromMolecule transfers them and does not re-derive them. Everything else (addHs, guessBonds, automatic cross-residue bond handling, and per-copy templating) works exactly as in the SMILES variant.

Step 4 — Run systemPrepare#

pmol, applied_specs = systemPrepare(mol, detect_specs=specs, verbose=False)

moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'N[C@H](C=O)CCC=O'
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'NCCCC[C@H](N)C=O'
moleculekit.tools.preparation - WARNING - Both chains and segments are defined in Molecule.chain / Molecule.segid, however they are inconsistent. Protein preparation will use the chain information.
moleculekit.tools.preparation - WARNING - The following residues have not been optimized: ZN
moleculekit.tools.preparation - WARNING - Dubious protonation state: the pKa of 3 residues is within 1.0 units of pH 7.4.
moleculekit.tools.preparation - WARNING - Dubious protonation state:    HIS    11 A (pKa= 6.54)
moleculekit.tools.preparation - WARNING - Dubious protonation state:    ASP    75 B (pKa= 7.78)
moleculekit.tools.preparation - WARNING - Dubious protonation state:    CYS  1039 B (pKa= 7.59)

systemPrepare captures all non-peptidic bonds (including the isopeptide cyclization) before invoking PDB2PQR, then restores them on the prepared structure. Without templating, those bonds would either be silently dropped or trigger valence errors during hydrogenation.

{r: int(np.sum(pmol.resname == r)) for r in smiles}

{'HRG': 26, 'ALC': 26, 'OIC': 24, 'NLE': 19, '200': 21}

The five NCAAs all survive the preparation pipeline with their full heavy-atom topology and hydrogens.

Recap#

templateResidueFromSmiles() transfers bond orders, formal charges, and hydrogens from a SMILES template onto a residue’s atoms by MCS matching.
It removes any hydrogens already on the matched residue and re-adds them from the SMILES (addHs=True), so the hydrogen pattern is deterministic — no manual pre-stripping needed.
One SMILES per residue type; the templater handles every copy automatically and trims terminal atoms (OXT, terminal NH) for mid-chain residues.
Cross-residue covalent bonds — peptide bonds, glycosidic bonds, isopeptide bonds — are detected automatically; the boundary atom’s H count is corrected so it is not over-protonated.
templateResidueFromMolecule() is the same operation with a reference Molecule (e.g. a CIF) as the template instead of a SMILES string. It matches by atom name when the reference’s names are unique and match the residue (ideal for a CIF), and falls back to element-and-connectivity matching otherwise (so an SDF works too); either way it copies the reference’s bond orders and formal charges verbatim, so those must already be correct in the reference.