Non-standard residues and covalent modifications

You will learn: how to detect non-standard residues, covalent modifications, and free ligands in a structure, and how to pass that information into systemPrepare() so it preserves the right bonds and renames residues correctly for the force field.

Prerequisites:

The Basic protonation tutorial.

Setup

from moleculekit.molecule import Molecule
from moleculekit.tools.preparation import systemPrepare
from moleculekit.tools.nonstandard_residues import (
    detectNonStandardResidues,
    ChainResidueSpec,
    ScaffoldSpec,
    CovalentLigandSpec,
    LigandSpec,
)

Step 1 — Detect non-standard residues on a representative structure

We use 1R1J, a thermolysin-like protease that carries three N-glycosylation sites (NAG sugars covalently attached to Asn residues) and a non-covalent zinc-chelating inhibitor (OIR). This gives us examples of all the important spec types in a single structure.

mol = Molecule("1R1J")
specs = detectNonStandardResidues(mol)
print(specs)

[ChainResidueSpec(resname='ASN', residue=<moleculekit.molecule.UniqueResidueID object at 0x7fefe483cfb0>
UniqueResidueID<resname: 'ASN', chain: 'A', resid: 144, insertion: '', segid: '1'>, new_resname='XX1', anchor_atom='ND2', is_n_term=False, is_c_term=False), ChainResidueSpec(resname='ASN', residue=<moleculekit.molecule.UniqueResidueID object at 0x7fefe49efcb0>
UniqueResidueID<resname: 'ASN', chain: 'A', resid: 324, insertion: '', segid: '1'>, new_resname='XX1', anchor_atom='ND2', is_n_term=False, is_c_term=False), ChainResidueSpec(resname='ASN', residue=<moleculekit.molecule.UniqueResidueID object at 0x7fefe48e5250>
UniqueResidueID<resname: 'ASN', chain: 'A', resid: 627, insertion: '', segid: '1'>, new_resname='XX1', anchor_atom='ND2', is_n_term=False, is_c_term=False), CovalentLigandSpec(resname='NAG', residue=<moleculekit.molecule.UniqueResidueID object at 0x7fefe48e6960>
UniqueResidueID<resname: 'NAG', chain: 'A', resid: 752, insertion: '', segid: '2'>), CovalentLigandSpec(resname='NAG', residue=<moleculekit.molecule.UniqueResidueID object at 0x7fefe48e6990>
UniqueResidueID<resname: 'NAG', chain: 'A', resid: 753, insertion: '', segid: '2'>), CovalentLigandSpec(resname='NAG', residue=<moleculekit.molecule.UniqueResidueID object at 0x7fefe48e69c0>
UniqueResidueID<resname: 'NAG', chain: 'A', resid: 754, insertion: '', segid: '2'>), LigandSpec(resname='OIR', residue=<moleculekit.molecule.UniqueResidueID object at 0x7fefe48e6a20>
UniqueResidueID<resname: 'OIR', chain: 'A', resid: 2001, insertion: '', segid: '4'>)]

from moleculekit.tools.nonstandard_residues import requiresTemplate

# Which of the detected specs actually need a user-supplied template:
[(s.resname, requiresTemplate(s)) for s in specs]

[('ASN', False),
 ('ASN', False),
 ('ASN', False),
 ('NAG', True),
 ('NAG', True),
 ('NAG', True),
 ('OIR', True)]

requiresTemplate() tells you, per spec, whether you must supply a template; residuesRequiringTemplate() (used in the next tutorial) is the whole-molecule shortcut.
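
Several specs can share one residue name (the three NAG entries above), so it can be handy to collapse the per-spec answers into a set of names. The following is a minimal sketch, assuming only that each spec exposes resname and that the predicate behaves like requiresTemplate(). The helper template_resnames and the stand-in objects are hypothetical and not part of moleculekit; they exist so the example runs on its own.

```python
from types import SimpleNamespace


def template_resnames(specs, needs_template):
    """Return the sorted, de-duplicated resnames for which needs_template(spec) is true."""
    return sorted({s.resname for s in specs if needs_template(s)})


# Stand-in specs mirroring the 1R1J output above (hypothetical objects,
# used only so the sketch runs without moleculekit installed).
demo_specs = [
    SimpleNamespace(resname=name) for name in ["ASN"] * 3 + ["NAG"] * 3 + ["OIR"]
]


def demo_needs_template(spec):
    """Stand-in predicate reproducing the printed requiresTemplate() results."""
    return spec.resname in {"NAG", "OIR"}


names = template_resnames(demo_specs, demo_needs_template)
print(names)
```

With the real objects from this step, the equivalent call would be template_resnames(specs, requiresTemplate).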

detectNonStandardResidues() does not mutate mol — it just walks the bond graph and returns a list of spec objects (ChainResidueSpec, CovalentLigandSpec, LigandSpec, or ScaffoldSpec) describing every residue that needs special handling.
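
Because the result is a plain Python list, ordinary tools are enough to get an overview before reading the full repr. The sketch below counts specs per class and residue name. The classes defined here are stand-ins that only borrow the library's class names so the example runs without moleculekit, and summarize_specs is a hypothetical helper.

```python
from collections import Counter
from dataclasses import dataclass


# Stand-in classes named after the library's spec types; they carry only
# the single attribute this sketch uses. With moleculekit installed you
# would import the real classes instead of defining these.
@dataclass
class ChainResidueSpec:
    resname: str


@dataclass
class CovalentLigandSpec:
    resname: str


@dataclass
class LigandSpec:
    resname: str


def summarize_specs(specs):
    """Count specs per (spec class name, resname) pair."""
    return Counter((type(s).__name__, s.resname) for s in specs)


# Same composition as the 1R1J list printed above.
demo_specs = (
    [ChainResidueSpec("ASN")] * 3
    + [CovalentLigandSpec("NAG")] * 3
    + [LigandSpec("OIR")]
)
summary = summarize_specs(demo_specs)
for (kind, resname), count in sorted(summary.items()):
    print(f"{kind:<20s} {resname:<4s} x{count}")
```

The helper relies only on type(s).__name__ and s.resname, so it should work unchanged on the list returned by detectNonStandardResidues().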

Note: Plain Cys–Cys disulfide bonds are not in this list — systemPrepare() handles those internally by renaming Cys to CYX. detectNonStandardResidues() targets non-canonical residues, sidechain crosslinks such as N-glycosylation or isopeptide bonds, and covalent or free ligands.

Step 2 — Walk through each spec subclass

ChainResidueSpec — chain-resident residue needing special handling

A ChainResidueSpec is emitted for every residue that sits inside a polypeptide chain and needs special parameterization. This includes:

Non-canonical amino acids embedded in a peptide chain (no inter-residue non-peptide bond).
Canonical amino acids whose sidechain is covalently bonded to something outside the peptide backbone — an Asn N-glycosylated by a sugar, a Glu–Lys isopeptide bond, a Cys thioether to a scaffold.

The 1R1J structure has three Asn residues, each bonded to a NAG sugar at its ND2 atom. The detector emits a ChainResidueSpec for each and proposes the same new residue name (XX1) for all three, so the parameterizer generates a single set of AMBER parameters that they share:

chain_specs = [s for s in specs if isinstance(s, ChainResidueSpec)]
for s in chain_specs:
    print(
        f"resname={s.resname!r:4s}  chain={s.residue.chain!r}  "
        f"resid={s.residue.resid:<6}  anchor_atom={s.anchor_atom!r}  "
        f"new_resname={s.new_resname!r}"
    )

resname='ASN'  chain='A'  resid=144     anchor_atom='ND2'  new_resname='XX1'
resname='ASN'  chain='A'  resid=324     anchor_atom='ND2'  new_resname='XX1'
resname='ASN'  chain='A'  resid=627     anchor_atom='ND2'  new_resname='XX1'

Each ChainResidueSpec exposes:

| Attribute | Meaning |
|---|---|
| `resname` | Residue name in the input structure |
| `residue` | `UniqueResidueID` (chain / resid / segid / insertion) |
| `new_resname` | Name to rename to before parameterization (`None` = no rename needed) |
| `anchor_atom` | Atom involved in the non-peptide bond (`None` for plain chain NCAAs) |
| `is_n_term` / `is_c_term` | Whether this residue is at the N- or C-terminus of a chain |

Canonical amino acids that participate in a non-peptide bond get renamed too, because the parameterizer needs atom names and hydrogen counts that differ from those of the standard residue. A cross-residue covalent bond between two canonical amino acids therefore produces two ChainResidueSpec entries, one per side of the bond.

5VBL’s bound peptide is cyclized through an isopeptide bond. Loading it and filtering for canonical amino-acid ChainResidueSpec entries surfaces exactly the two endpoints — each with its own new_resname and its own anchor_atom:

mol_5vbl = Molecule("5VBL")
specs_5vbl = detectNonStandardResidues(mol_5vbl)

CANONICAL_AAS = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}
isopeptide_endpoints = [
    s for s in specs_5vbl
    if isinstance(s, ChainResidueSpec) and s.resname in CANONICAL_AAS
]
for s in isopeptide_endpoints:
    print(
        f"resname={s.resname!r:4s}  chain={s.residue.chain!r}  "
        f"resid={s.residue.resid:<6}  anchor_atom={s.anchor_atom!r}  "
        f"new_resname={s.new_resname!r}"
    )

resname='GLU'  chain='A'  resid=10      anchor_atom='CD'  new_resname='XX1'
resname='LYS'  chain='A'  resid=13      anchor_atom='NZ'  new_resname='XX2'

Both partners have new_resname set; the unique names tell antechamber to build a separate prepi for each side.
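
The contrast between the shared name in 1R1J (three Asn, all XX1) and the distinct names in 5VBL (XX1 and XX2) is easiest to see by grouping specs on new_resname. The sketch below does that with stand-in objects that carry the values printed above. group_by_new_resname and fake_spec are hypothetical helpers, and the only assumption about real specs is that they expose new_resname and residue.resid.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_new_resname(specs):
    """Map each proposed new_resname to the resids that would receive it."""
    groups = defaultdict(list)
    for s in specs:
        if s.new_resname is not None:  # None means no rename is proposed
            groups[s.new_resname].append(s.residue.resid)
    return dict(groups)


def fake_spec(resname, resid, new_resname):
    """Hypothetical stand-in exposing only the attributes used above."""
    return SimpleNamespace(
        resname=resname,
        new_resname=new_resname,
        residue=SimpleNamespace(resid=resid),
    )


# Values copied from the 1R1J and 5VBL outputs shown in this step.
glycosylated = [fake_spec("ASN", r, "XX1") for r in (144, 324, 627)]
isopeptide = [fake_spec("GLU", 10, "XX1"), fake_spec("LYS", 13, "XX2")]

shared = group_by_new_resname(glycosylated)
unique = group_by_new_resname(isopeptide)
print(shared)
print(unique)
```

One key with several resids indicates residues that share one parameter set; one key per residue indicates that each side is parameterized separately.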

CovalentLigandSpec — single-anchor covalent ligand

A CovalentLigandSpec is emitted for a non-chain-resident residue with exactly one covalent bond to the rest of the structure. In 1R1J, each NAG (N-acetylglucosamine) sugar attaches to one Asn via a single C1-ND2 glycosidic bond:

cov_specs = [s for s in specs if isinstance(s, CovalentLigandSpec)]
for s in cov_specs:
    print(
        f"resname={s.resname!r}  chain={s.residue.chain!r}  "
        f"resid={s.residue.resid}"
    )

resname='NAG'  chain='A'  resid=752
resname='NAG'  chain='A'  resid=753
resname='NAG'  chain='A'  resid=754

CovalentLigandSpec has two public attributes: resname and residue.

LigandSpec — free non-covalent ligand

A LigandSpec covers non-chain-resident residues with no covalent bonds to any other residue. In 1R1J, the thiorphan-class inhibitor OIR coordinates the active-site zinc ion via O19 and S26, but those are metal-coordination contacts (not covalent bonds), so the detector correctly classifies it as a free ligand:

lig_specs = [s for s in specs if isinstance(s, LigandSpec)]
for s in lig_specs:
    print(
        f"resname={s.resname!r}  chain={s.residue.chain!r}  "
        f"resid={s.residue.resid}"
    )

resname='OIR'  chain='A'  resid=2001

LigandSpec also has two public attributes: resname and residue.

ScaffoldSpec — multi-anchor covalent scaffold

A ScaffoldSpec is emitted for a non-chain-resident residue with two or more covalent bonds going out to the polypeptide chain — typical of bicyclic peptide scaffolds or multi-anchor covalent inhibitors.

For a live example we load 8QFZ chain B, a lasso-peptide scaffold (LFI) thioether-bonded to three CYS residues:

mol_8qfz = Molecule("8QFZ")
mol_8qfz.filter("chain B", _logger=False)

specs_8qfz = detectNonStandardResidues(mol_8qfz)
scaffold_specs = [s for s in specs_8qfz if isinstance(s, ScaffoldSpec)]
for s in scaffold_specs:
    print(f"resname={s.resname!r}  chain={s.residue.chain!r}  resid={s.residue.resid}")

resname='LFI'  chain='B'  resid=101

The LFI scaffold appears as a ScaffoldSpec because it bonds covalently to three chain-resident CYS residues. Each of those CYS residues appears as a ChainResidueSpec with a unique auto-generated rename target, because they sit at different chain positions (N-terminal, mid-chain, C-terminal) and therefore carry different capping atoms in solution.

ScaffoldSpec has two public attributes: resname and residue.
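
The four definitions in this step can be condensed into a small decision rule. The function below is only a toy restatement of the prose, not the detector's actual implementation: classify_residue and its parameters are hypothetical, the caller is assumed to have already counted covalent links to other residues (excluding peptide bonds and metal-coordination contacts), and ions or water are out of scope.

```python
from typing import Optional


def classify_residue(in_chain: bool, canonical: bool, n_links: int) -> Optional[str]:
    """Toy restatement of the spec rules described in this step.

    in_chain:  the residue sits inside a polypeptide chain
    canonical: the residue is one of the 20 standard amino acids
    n_links:   covalent bonds to other residues, not counting peptide bonds
               or metal-coordination contacts
    """
    if in_chain:
        if canonical and n_links == 0:
            return None  # ordinary residue, no spec needed
        return "ChainResidueSpec"
    if n_links == 0:
        return "LigandSpec"
    if n_links == 1:
        return "CovalentLigandSpec"
    return "ScaffoldSpec"


# Cases taken from the structures discussed above.
examples = {
    "1R1J Asn 144 (glycosylated)": classify_residue(True, True, 1),
    "1R1J NAG 752": classify_residue(False, False, 1),
    "1R1J OIR 2001": classify_residue(False, False, 0),
    "8QFZ LFI 101": classify_residue(False, False, 3),
}
for label, kind in examples.items():
    print(f"{label}: {kind}")
```

Writing the rule out makes the boundary explicit: for residues outside the chain, the number of covalent anchors alone separates LigandSpec, CovalentLigandSpec, and ScaffoldSpec.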

Step 3 — Apply specs through systemPrepare

Pass the spec list to systemPrepare() via detect_specs= to apply the proposed renames and preserve the cross-residue bonds that protonation would otherwise drop:

pmol, applied_specs = systemPrepare(mol, detect_specs=specs, verbose=False)

moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'NC(=O)C[C@H](N)C=O'
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'NC(=O)C[C@H](N)C=O'
moleculekit.rdkittools - INFO - Stripped unmatched terminal heavy atoms from SMILES template (e.g. leaving group displaced by a covalent link, or carboxyl -OH on a non-terminal amino acid). Modified SMILES: 'NC(=O)C[C@H](N)C=O'
moleculekit.tools.preparation - WARNING - Both chains and segments are defined in Molecule.chain / Molecule.segid, however they are inconsistent. Protein preparation will use the chain information.
moleculekit.tools.preparation - WARNING - The following residues have not been optimized: NAG, OIR, ZN
moleculekit.tools.preparation - WARNING - Dubious protonation state: the pKa of 5 residues is within 1.0 units of pH 7.4.
moleculekit.tools.preparation - WARNING - Dubious protonation state:    HIS   437 A (pKa= 6.77)
moleculekit.tools.preparation - WARNING - Dubious protonation state:    LYS   471 A (pKa= 6.77)
moleculekit.tools.preparation - WARNING - Dubious protonation state:    ASP   591 A (pKa= 7.75)
moleculekit.tools.preparation - WARNING - Dubious protonation state:    HIS   637 A (pKa= 6.61)
moleculekit.tools.preparation - WARNING - Dubious protonation state:    HIS   733 A (pKa= 6.40)

detect_specs=specs tells systemPrepare() to rename force-field-relevant residues (Asn → shared auto-name so antechamber builds one prepi) and preserve the glycosidic C1-ND2 bonds that PDB2PQR’s hydrogenation step would otherwise sever. pmol is a new Molecule; mol is unchanged.
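
The log lines above carry prefixes such as moleculekit.rdkittools and moleculekit.tools.preparation. Assuming these are ordinary Python logging logger names (a common convention, but not something this tutorial confirms), the repeated INFO messages could be hidden while keeping the warnings by raising the level on the parent logger before calling systemPrepare():

```python
import logging

# Assumption: moleculekit emits these messages through the standard logging
# module, under logger names matching the prefixes shown in the output above.
logging.getLogger("moleculekit").setLevel(logging.WARNING)

# A child logger with no level of its own inherits the parent's effective
# level, so INFO records from it are no longer processed.
child = logging.getLogger("moleculekit.rdkittools")
print(logging.getLevelName(child.getEffectiveLevel()))
print(child.isEnabledFor(logging.INFO))
```

If the library sets explicit levels on its child loggers, the same setLevel() call would need to be applied to those loggers individually.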

Step 4 — Suppress a specific spec

You can filter the spec list before passing it in. For example, to skip preparation of the covalent NAG sugars (perhaps because you will handle them in a separate glycan-parameterization step), drop all CovalentLigandSpec entries:

specs_no_nag = [s for s in specs if not isinstance(s, CovalentLigandSpec)]
pmol_no_nag, _ = systemPrepare(mol, detect_specs=specs_no_nag, verbose=False)

You can also filter on a spec’s public attributes. For instance, to keep only the ASN entries, which drops both the NAG and OIR specs:

specs_asn_only = [s for s in specs if s.resname == "ASN"]
pmol_asn, _ = systemPrepare(mol, detect_specs=specs_asn_only, verbose=False)

Any spec you remove is simply ignored by systemPrepare(); it uses only the entries you provide.
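
When filtering, it can be useful to keep track of what was left out so the omission is deliberate rather than accidental. The sketch below partitions a spec list with a predicate and returns both halves. split_specs and the stand-in objects are hypothetical, and the only assumption about real specs is the resname attribute.

```python
from types import SimpleNamespace


def split_specs(specs, keep):
    """Partition specs into (kept, dropped) according to the predicate keep."""
    kept, dropped = [], []
    for s in specs:
        (kept if keep(s) else dropped).append(s)
    return kept, dropped


# Stand-in specs with the same residue names as the 1R1J list
# (hypothetical objects so the sketch runs without moleculekit).
demo_specs = [
    SimpleNamespace(resname=name) for name in ["ASN"] * 3 + ["NAG"] * 3 + ["OIR"]
]

kept, dropped = split_specs(demo_specs, keep=lambda s: s.resname != "OIR")
print(len(kept), "kept;", "dropped:", [s.resname for s in dropped])
```

The kept list is what would then be passed as detect_specs=, while dropped serves as a record of the residues that systemPrepare() will not be told about.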

Recap

detectNonStandardResidues() enumerates non-standard residues and covalent modifications without mutating mol.
Cys–Cys disulfides are not returned by it — systemPrepare() handles those internally.
Four spec subclasses cover chain-resident residues, including sidechain crosslinks (ChainResidueSpec), multi-anchor scaffolds (ScaffoldSpec), single-anchor covalent ligands (CovalentLigandSpec), and free ligands (LigandSpec).
Pass the spec list (or a filtered subset) into systemPrepare() with detect_specs=... to control renaming and bond preservation.