Build a protein with a ligand#

You will learn: how to detect non-standard residues in a structure, template them from SMILES, parameterize them with antechamber, and feed the result to build().

Prerequisites:

  • HTMD installed.

When you need this flow#

If your structure contains any residue your force field doesn’t know - a small-molecule ligand, a non-canonical amino acid, a covalently-bound drug, a phosphorylated residue - the canonical protein build won’t work as-is. The builder will raise a “could not find residue X” error.

The full flow adds two steps between systemPrepare and amber.build:

  1. detectNonStandardResidues() - inspect the molecule, return one spec per non-canonical residue.

  2. templateResidueFromSmiles() - fix bond orders and hydrogens on each non-canonical residue from its reference SMILES.

  3. systemPrepare() - protonate canonicals, preserve templated non-canonicals.

  4. parameterizeFromSpecs() - run antechamber on the cluster of each non-canonical, emit topology / parameter files and a custombonds list.

  5. build() - tLeap consumes those outputs.

Setup#

from moleculekit.molecule import Molecule
from moleculekit.tools.autosegment import autoSegment
from moleculekit.tools.nonstandard_residues import detectNonStandardResidues
from moleculekit.tools.preparation import systemPrepare
from htmd.builder import amber
from htmd.builder.nonstandard import parameterizeFromSpecs
from htmd.builder.solvate import solvate
Copyright by Acellera Ltd. By executing you are accepting the License. In order to register, run htmd_register on your terminal.
The registration information must be valid so that it might be verified.
rdkit - INFO - Enabling RDKit 2026.03.3 jupyter extensions

Step 1 - Load and segment#

We use trypsin in complex with benzamidine (PDB 3PTB). BEN is the non-canonical residue:

mol = Molecule("3PTB")
mol = autoSegment(mol, fields=("segid", "chain"))

Step 2 - Detect non-standard residues#

specs = detectNonStandardResidues(mol)
for spec in specs:
    print(spec)
LigandSpec(resname='BEN', residue=<moleculekit.molecule.UniqueResidueID object at 0x7f2b231e5c10>
UniqueResidueID<resname: 'BEN', chain: 'D', resid: 1, insertion: '', segid: 'P3'>)

You should see one spec for BEN - a LigandSpec. The spec records what kind of non-canonical we’re looking at (free ligand, covalent ligand, chain-resident NCAA, scaffold, …) and any anchor information needed to handle crosslinks.

Step 3 - Template the non-canonical residues from SMILES#

templateResidueFromSmiles matches the SMILES against the atoms in the selection, sets correct bond orders and formal charges, and adds the missing hydrogens. It requires the input residue to already have bonds in mol.bonds; PDBs without CONECT records need mol.guessBonds() (or load with guessBonds=True) first:

BEN_SMILES = "[NH2+]=C(N)c1ccccc1"
mol.templateResidueFromSmiles('resname "BEN"', BEN_SMILES, addHs=True)

The SMILES carries the protonated benzamidinium form (one of the amidine nitrogens is [NH2+]), which is the physiologically relevant state at pH 7.4. The RCSB chemical component for BEN is stored as the neutral amidine N=C(N)c1ccccc1 - that’s a starting point, but the SMILES you pass to templateResidueFromSmiles must encode the protonation state at your target pH with explicit formal charges. Templating the neutral form locks the wrong charges into the parameterization.

You don’t have to hand-edit the SMILES for the mid-chain case: when a residue is peptide-bonded on one or both sides, the function automatically strips the terminal -OH / -OXT that’s absent in the bonded copy and retries the match. Full heavy-atom coverage and explicit hydrogens still work best.

The viewer opens zoomed in on BEN (ball-and-stick on top of the protein cartoon). Look at the templating result: the amidine carbon now carries a double bond, the protonated [NH2+] nitrogen has two explicit hydrogens, and the +1 formal charge sits as a small black label by the protonated nitrogen. None of that connectivity / charge information was in the input PDB - templateResidueFromSmiles pulled it from the SMILES and reconciled it with the existing heavy-atom positions.

Step 4 - Prepare with the spec list#

prepared, specs = systemPrepare(mol, pH=7.4, detect_specs=specs)
---- Molecule chain report ----
Chain A:
    First residue:  ILE    16  
    Final residue:  ASN   245  
Chain B:
    First residue:  HOH   401  
    Final residue:  HOH   809  
Chain C:
          residue:  CA    480  
Chain D:
          residue:  BEN     1  
---- End of chain report ----
moleculekit.tools.preparation - WARNING - The following residues have not been optimized: CA, BEN
moleculekit.tools.preparation - INFO - Modified residue CYS    22 A to CYX
moleculekit.tools.preparation - INFO - Modified residue HIS    40 A to HIE
moleculekit.tools.preparation - INFO - Modified residue CYS    42 A to CYX
moleculekit.tools.preparation - INFO - Modified residue HIS    57 A to HIP
moleculekit.tools.preparation - INFO - Modified residue CYS    58 A to CYX
moleculekit.tools.preparation - INFO - Modified residue HIS    91 A to HID
moleculekit.tools.preparation - INFO - Modified residue CYS   128 A to CYX
moleculekit.tools.preparation - INFO - Modified residue CYS   136 A to CYX
moleculekit.tools.preparation - INFO - Modified residue CYS   157 A to CYX
moleculekit.tools.preparation - INFO - Modified residue CYS   168 A to CYX
moleculekit.tools.preparation - INFO - Modified residue CYS   182 A to CYX
moleculekit.tools.preparation - INFO - Modified residue CYS   191 A to CYX
moleculekit.tools.preparation - INFO - Modified residue CYS   201 A to CYX
moleculekit.tools.preparation - INFO - Modified residue CYS   220 A to CYX
moleculekit.tools.preparation - INFO - Modified residue CYS   232 A to CYX
moleculekit.tools.preparation - WARNING - Dubious protonation state: the pKa of 2 residues is within 1.0 units of pH 7.4.
moleculekit.tools.preparation - WARNING - Dubious protonation state:    TYR    39 A (pKa= 8.24)
moleculekit.tools.preparation - WARNING - Dubious protonation state:    HIS    57 A (pKa= 7.44)

If you don’t pass detect_specs, systemPrepare calls detectNonStandardResidues for you. Passing detect_specs=specs explicitly is useful when you want to reuse the list we already computed in step 2 (avoiding the duplicate detect call), edit it before prep (drop specs for residues you want to leave alone, tweak a new_resname, etc.), and thread the same list into parameterizeFromSpecs in step 5. systemPrepare returns the spec list unchanged as its second value; the rebind keeps the data flow visually obvious. To opt out of non-standard residue handling entirely, pass detect_specs=[].

Step 5 - Parameterize#

out = parameterizeFromSpecs(
    specs,
    prepared,
    outdir="./params",
    charge_method="gasteiger",
)
print(out)
ClusterOutputs(topo_paths=['./params/BEN.cif'], frcmod_paths=['./params/BEN.frcmod'], custombonds=[], xml_paths=['./params/gaff_combined.xml'])

For each unique (resname, terminal-position) bucket, parameterizeFromSpecs runs antechamber to assign GAFF2 atom types and per-atom partial charges, then writes:

  • out.topo_paths - one topology file per unique non-canonical bucket. Free ligands like BEN get a .cif; chain-resident NCAAs get a .prepi.

  • out.frcmod_paths - the matching BEN.frcmod with bond / angle / dihedral parameters.

  • out.custombonds - atom-selection pairs naming the inter-residue bonds tLeap should add (empty here because BEN is a free ligand with no covalent connection to the protein).

  • out.xml_paths - OpenMM force-field XML(s), in case you later want to switch the build backend. For free ligands with the default GAFF backend you get one combined gaff_combined.xml; for cluster-bonded residues you get a per-cluster XML.

The default charge method is AM1-BCC - antechamber runs an AM1-BCC calculation per parameterisation cluster (a small model compound with ACE/NME caps that closes off the chemistry around each non-canonical residue); the resulting charges are then split back to the constituent residues. We pass charge_method="gasteiger" here because it’s much faster and good enough for a tutorial; for production builds drop the argument (or set it explicitly to "am1-bcc") for higher-quality electrostatics.

Step 6 - Solvate#

solvated = solvate(prepared, pad=10)
htmd.builder.solvate - INFO - Using water pdb file at: /home/runner/work/htmd/htmd/htmd/share/solvate/wat.pdb
htmd.builder.solvate - INFO - Replicating 4 water segments, 2 by 1 by 2
htmd.builder.solvate - INFO - 6901 water molecules were added to the system.

amber.build does not wrap a water box on its own, so we solvate the prepared structure first - same as in the canonical protein tutorial.

Step 7 - Build#

built = amber.build(
    solvated,
    outdir="./build",
    custombonds=out.custombonds,
    topo=out.topo_paths,
    param=out.frcmod_paths,
    ionize=True,
    saltconc=0.15,
)
Disulfide Bond between: UniqueResidueID<resname: 'CYX', chain: 'A', resid: 26, insertion: '', segid: 'P0'>
                   and: UniqueResidueID<resname: 'CYX', chain: 'A', resid: 42, insertion: '', segid: 'P0'>

Disulfide Bond between: UniqueResidueID<resname: 'CYX', chain: 'A', resid: 117, insertion: '', segid: 'P0'>
                   and: UniqueResidueID<resname: 'CYX', chain: 'A', resid: 184, insertion: '', segid: 'P0'>

Disulfide Bond between: UniqueResidueID<resname: 'CYX', chain: 'A', resid: 174, insertion: '', segid: 'P0'>
                   and: UniqueResidueID<resname: 'CYX', chain: 'A', resid: 198, insertion: '', segid: 'P0'>

Disulfide Bond between: UniqueResidueID<resname: 'CYX', chain: 'A', resid: 149, insertion: '', segid: 'P0'>
                   and: UniqueResidueID<resname: 'CYX', chain: 'A', resid: 163, insertion: '', segid: 'P0'>

Disulfide Bond between: UniqueResidueID<resname: 'CYX', chain: 'A', resid: 8, insertion: '', segid: 'P0'>
                   and: UniqueResidueID<resname: 'CYX', chain: 'A', resid: 138, insertion: '', segid: 'P0'>

Disulfide Bond between: UniqueResidueID<resname: 'CYX', chain: 'A', resid: 110, insertion: '', segid: 'P0'>
                   and: UniqueResidueID<resname: 'CYX', chain: 'A', resid: 211, insertion: '', segid: 'P0'>
htmd.builder.amber - INFO - Detecting disulfide bonds.
htmd.builder.builder - INFO - 6 disulfide bonds were added
htmd.builder.amber - INFO - Starting the build.
htmd.builder.amber - INFO - Finished building.
htmd.builder.ionize - INFO - Adding 10 anions + 0 cations for neutralizing and 40 ions for the given salt concentration 0.15 M.
htmd.builder.amber - INFO - Starting the build.
htmd.builder.amber - INFO - Finished building.
moleculekit.tools.sequencestructuralalignment - INFO - Alignment #0 was done on 223 residues: 2-224

The three new arguments (custombonds, topo, param) wire parameterizeFromSpecs’s outputs into tLeap. Ionisation, disulfide detection, and capping happen the same way as in a canonical-only build.

Gotchas#

  • For ionisable ligands, template with the SMILES of the protonation state at your pH (e.g. [NH2+]=C(N)c1ccccc1 for benzamidinium at 7.4). Templating the neutral form locks the wrong protonation state into the parameterization.

  • If a non-canonical residue is at the C-terminus and its template already carries OXT, pass caps={"<segid>": ("ACE", "none")} (uppercase ACE; cap files in HTMD’s library are case-sensitive) to build() so tLeap doesn’t try to add an NME cap on top.

  • parameterizeFromSpecs dedupes by (resname, is_n_term, is_c_term). Two MLE residues in the same chain-position bucket share one topology/parameter set; an N-terminal MLE plus a mid-chain MLE produce two distinct topology/parameter sets.

See also#