Custom residues from SMILES#
You will learn: how to teach Molecule about a non-canonical residue’s bond orders, formal charges, and hydrogens by templating it from a SMILES string, then run systemPrepare() over the result.
Prerequisites:
The Non-standard residues tutorial.
Setup#
from moleculekit.molecule import Molecule
from moleculekit.tools.preparation import systemPrepare
from moleculekit.tools.nonstandard_residues import detectNonStandardResidues
Step 1 — Load a structure with several non-canonical residues#
We use 5VBL, a zinc metalloprotease structure that carries a bound peptide inhibitor built from five different non-canonical amino acids (HRG, ALC, OIC, NLE, and 200), plus a free OLC monoacylglycerol lipid (monoolein) from lipidic-cubic-phase crystallization. The peptide is cyclized through an isopeptide bond, which makes this a good stress test for templating + bond preservation. A Zn²⁺ ion is also present but does not need templating.
Load the structure:
mol = Molecule("5VBL")
Step 2 — Find what needs templating#
specs = detectNonStandardResidues(mol)
sorted(set(s.resname for s in specs))
['200', 'ALC', 'GLU', 'HRG', 'LYS', 'NLE', 'OIC', 'OLC']
Rather than eyeballing every residue name, detectNonStandardResidues() walks the bond graph and returns a spec for each residue that needs special handling. Here it flags the five non-canonical amino acids (HRG, ALC, OIC, NLE, 200), a free OLC residue, and the two canonical residues (GLU, LYS) that form the isopeptide cyclization bond.
OLC is monoolein — a monoacylglycerol lipid from lipidic-cubic-phase crystallization, not part of the system we want to model. Drop it from both the structure and the spec list before going further:
mol.remove("resname OLC")
specs = [s for s in specs if s.resname != "OLC"]
moleculekit.molecule - INFO - Removed 15 atoms. 2938 atoms remaining in the molecule.
The five non-canonical amino acids each need a SMILES template so their bond orders, formal charges, and hydrogens come out right (next step); the canonical isopeptide partners (GLU, LYS) don’t need a template — systemPrepare() simply preserves their crosslink.
Step 3 — Template every NCAA from its RCSB SMILES#
The RCSB-style SMILES for a canonical amino acid is written as the free form (with N and C terminal groups). When the residue sits inside a peptide chain, templateResidueFromSmiles automatically strips the unmatched terminal heavy atoms (the OXT and one peptide-NH) before the MCS match — no manual trimming is required.
smiles = {
"HRG": "C(CCNC(=N)N)C[C@@H](C(=O)O)N", # homoarginine
"ALC": "C1CCC(CC1)C[C@@H](C(=O)O)N", # cyclohexyl-Ala
"OIC": "C1CC[C@H]2[C@@H](C1)C[C@H](N2)C(=O)O", # octahydroindole-2-carboxylic acid
"NLE": "CCCC[C@@H](C(=O)O)N", # norleucine
"200": "c1cc(ccc1C[C@@H](C(=O)O)N)Cl", # 4-chloro-Phe
}
for resname, smi in smiles.items():
mask = mol.resname == resname
if mask.any():
mol.templateResidueFromSmiles(mask, smiles=smi, addHs=True)
rdkit - INFO - Enabling RDKit 2026.03.3 jupyter extensions
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'N=C(N)NCCCC[C@H](N)C=O'
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'N[C@H](C=O)CC1CCCCC1'
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'O=C[C@@H]1C[C@@H]2CCCC[C@@H]2N1'
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'CCCC[C@H](N)C=O'
Each call mutates mol in place. Cross-residue covalent bonds — the peptide bonds connecting the NCAAs to their neighbours and the isopeptide cyclization bond — are detected automatically from mol.bonds, and the boundary atoms’ H counts are reduced so they are not over-protonated. When a residue appears multiple times in the structure, every copy is templated individually with the same SMILES.
import numpy as np
{r: int(np.sum(mol.resname == r)) for r in smiles}
{'HRG': 26, 'ALC': 26, 'OIC': 24, 'NLE': 19, '200': 22}
The per-resname atom counts after templating — each NCAA now carries heavy atoms + the hydrogens the SMILES specified.
Alternative: template from a reference Molecule#
If you already have the residue as a small reference structure that carries correct connectivity (for example an RCSB chemical-component CIF for the ligand), you can template from that Molecule instead of writing a SMILES string, using templateResidueFromMolecule():
ref = Molecule("HRG.cif") # reference carrying correct bonds, bond orders, and formal charges
mol.templateResidueFromMolecule("resname HRG", ref, addHs=True)
The reference is matched to the residue by atom name (not by MCS, as with SMILES), so the two must share the same set of heavy-atom names and the reference’s heavy-atom names must be unique. The bond orders and formal charges are copied straight from the reference: they must therefore already be correct in the template Molecule, because templateResidueFromMolecule transfers them verbatim and does not re-derive them. Everything else (addHs, guessBonds, automatic cross-residue bond handling, and per-copy templating) works exactly as in the SMILES variant.
Step 4 — Run systemPrepare#
pmol, applied_specs = systemPrepare(mol, detect_specs=specs, verbose=False)
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'N[C@H](C=O)CCC=O'
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'NCCCC[C@H](N)C=O'
moleculekit.tools.preparation - WARNING - Both chains and segments are defined in Molecule.chain / Molecule.segid, however they are inconsistent. Protein preparation will use the chain information.
moleculekit.tools.preparation - WARNING - The following residues have not been optimized: ZN
moleculekit.tools.preparation - WARNING - Dubious protonation state: the pKa of 3 residues is within 1.0 units of pH 7.4.
moleculekit.tools.preparation - WARNING - Dubious protonation state: HIS 11 A (pKa= 6.54)
moleculekit.tools.preparation - WARNING - Dubious protonation state: ASP 75 B (pKa= 7.78)
moleculekit.tools.preparation - WARNING - Dubious protonation state: CYS 1039 B (pKa= 7.59)
systemPrepare captures all non-peptidic bonds (including the isopeptide cyclization) before invoking PDB2PQR, then restores them on the prepared structure. Without templating, those bonds would either be silently dropped or trigger valence errors during hydrogenation.
{r: int(np.sum(pmol.resname == r)) for r in smiles}
{'HRG': 26, 'ALC': 26, 'OIC': 24, 'NLE': 19, '200': 22}
The five NCAAs all survive the preparation pipeline with their full heavy-atom topology and hydrogens.
Recap#
templateResidueFromSmiles()transfers bond orders, formal charges, and hydrogens from a SMILES template onto a residue’s atoms by MCS matching.It removes any hydrogens already on the matched residue and re-adds them from the SMILES (
addHs=True), so the hydrogen pattern is deterministic — no manual pre-stripping needed.One SMILES per residue type; the templater handles every copy automatically and trims terminal atoms (OXT, terminal NH) for mid-chain residues.
Cross-residue covalent bonds — peptide bonds, glycosidic bonds, isopeptide bonds — are detected automatically; the boundary atom’s H count is corrected so it is not over-protonated.
templateResidueFromMolecule()is the same operation with a referenceMolecule(e.g. a CIF) as the template instead of a SMILES string. It matches by atom name rather than MCS and copies the reference’s bond orders and formal charges verbatim, so those must already be correct in the reference.