Build a protein#

You will learn: how to take a PDB file with a canonical protein, prepare it, solvate it, and produce an AMBER topology and coordinates pair ready for MD.

Prerequisites:

HTMD installed.

The flow#

For a protein containing only canonical residues, building is four steps after loading:

autoSegment() - automatically split the structure into independent segments by walking the residue connectivity and starting a new segment at every real chain break (decided from the backbone-atom distance between consecutive residues - resid numbering is not consulted).
systemPrepare() - protonate at the chosen pH, fix protonation states, and patch a few missing heavy atoms (full missing-sidechain restoration is off by default; opt in with restore_missing_sidechains=True).
solvate() - wrap a water box around the prepared structure.
htmd.builder.amber.build() - run tLeap to produce a prmtop + pdb pair (and ionise to the requested salt concentration).

No NCAA detection or parameterization step is needed when every residue is in tLeap’s built-in ff14SB library.

Setup#

from moleculekit.molecule import Molecule
from moleculekit.tools.autosegment import autoSegment
from moleculekit.tools.preparation import systemPrepare
from htmd.builder import amber
from htmd.builder.solvate import solvate

Copyright by Acellera Ltd. By executing you are accepting the License. In order to register, run htmd_register on your terminal.
The registration information must be valid so that it might be verified.

Step 1 - Load the structure#

We use Trp-cage (PDB 1L2Y): a 20-residue mini-protein with no ligands and no non-standard residues. Molecule accepts either a local file path (PDB, mmCIF, PSF, PRMTOP, …) or a four-character PDB ID it downloads from RCSB on the fly:

mol = Molecule("1L2Y")

Step 2 - Segment the chains#

mol = autoSegment(mol, fields=("segid", "chain"))

autoSegment() walks the structure residue-by-residue and starts a new segment whenever the backbone-atom distance between consecutive residues exceeds the chain-continuity threshold (a ~2.5 Å C–N for proteins, a ~3 Å O3’–P for nucleic acids). Residue-number jumps are ignored — only backbone geometry counts. Each contiguous run of bonded residues gets its own segid. This is what stops build() from extending a protein chain through a HETATM ligand later, and from auto-capping the wrong terminus when a non-canonical residue sits at the chain end.

For Trp-cage there’s just one continuous chain, so autoSegment produces a single segment named P0.

Step 3 - Prepare#

prepared, specs = systemPrepare(mol, pH=7.4)

---- Molecule chain report ----
Chain A:
    First residue:  ASN     1  
    Final residue:  SER    20  
---- End of chain report ----

systemPrepare() runs PDB2PQR under the hood: predicts pKa values, picks protonation states at the requested pH, adds missing hydrogens, and returns the prepared molecule plus a list of non-canonical specs. When no detect_specs argument is supplied the function auto-detects and returns that list; when one is supplied it is returned unchanged. For a canonical-only protein the spec list is empty and you can ignore it.

Step 4 - Solvate#

solvated = solvate(prepared, pad=10)

htmd.builder.solvate - INFO - Using water pdb file at: /home/runner/work/htmd/htmd/htmd/share/solvate/wat.pdb
htmd.builder.solvate - INFO - Replicating 1 water segments, 1 by 1 by 1
htmd.builder.solvate - INFO - 1640 water molecules were added to the system.

solvate() wraps a pre-equilibrated water box around the prepared molecule. The actual water model that tLeap parameterizes those atoms with is whatever the leaprc.water.* entry in ff selects (default TIP3P via leaprc.water.tip3p). pad=10 adds 10 Å of water in every direction beyond the molecule’s bounding box. We keep the box small to keep the tutorial fast; for a production run you’d typically use a larger pad (15-20 Å) so that any local unfolding or large-scale motion can’t reach across the periodic boundary and interact with the protein’s own image.

Step 5 - Build under AMBER#

amber.build(solvated, outdir="./build", ionize=True, saltconc=0.15)

htmd.builder.amber - INFO - Detecting disulfide bonds.
htmd.builder.amber - INFO - Starting the build.
htmd.builder.amber - INFO - Finished building.
htmd.builder.ionize - INFO - Adding 1 anions + 0 cations for neutralizing and 10 ions for the given salt concentration 0.15 M.
htmd.builder.amber - INFO - Starting the build.
htmd.builder.amber - INFO - Finished building.
moleculekit.tools.sequencestructuralalignment - INFO - Alignment #0 was done on 20 residues: 2-21

<moleculekit.molecule.Molecule object at 0x7efde004a300>
Molecule with 5211 atoms and 1 frames
Atom field - altloc shape: (5211,)
Atom field - atomtype shape: (5211,)
Atom field - beta shape: (5211,)
Atom field - chain shape: (5211,)
Atom field - charge shape: (5211,)
Atom field - coords shape: (5211, 3, 1)
Atom field - element shape: (5211,)
Atom field - formalcharge shape: (5211,)
Atom field - insertion shape: (5211,)
Atom field - masses shape: (5211,)
Atom field - name shape: (5211,)
Atom field - occupancy shape: (5211,)
Atom field - record shape: (5211,)
Atom field - resid shape: (5211,)
Atom field - resname shape: (5211,)
Atom field - segid shape: (5211,)
Atom field - serial shape: (5211,)
Atom field - virtualsite shape: (5211,)
angles shape: (580, 3)
bonds shape: (5206, 2)
bondtype shape: (5206,)
box shape: (3, 1)
boxangles shape: (3, 1)
crystalinfo: None
dihedrals shape: (1337, 4)
fileloc shape: (1, 2)
impropers shape: (66, 4)
reps: 
step shape: (1,)
time shape: (1,)
topoloc: /tmp/tmp5nhwvgd8/build/structure.prmtop
viewname: structure.prmtop

That single call:

Writes a tLeap input script consuming ff14SB for protein, TIP3P for water, and standard ion parameters.
Detects disulfide bridges in solvated and feeds them to tLeap as bond directives - the default is auto-detect; override with disulfide=[(sel1, sel2), ...] if needed.
Adds Na⁺ / Cl⁻ counter-ions to neutralise the system and reach 0.15 M NaCl.
Returns a built Molecule and writes the AMBER files into ./build/:

build/
├── structure.prmtop      # AMBER topology
├── structure.pdb         # built coordinates as PDB
├── structure.crd         # built coordinates as CRD
├── tleap.in              # tLeap input we generated
├── leap.log              # tLeap's own log
├── log.txt               # tLeap stdout/stderr captured by HTMD
└── ff*_leaprc.*          # force-field paths sourced by tLeap

The structure.prmtop + structure.pdb pair is what acemd.protocols.setup_equilibration() (or any other MD driver) consumes.

Force-field defaults and overrides#

amber.build consumes three categories of force-field input, each with its own argument:

Argument	Default	Format
`ff`	`amber.defaultFf()`	tLeap `leaprc.*` files (the master force-field selectors).
`topo`	`amber.defaultTopo()` (empty)	Residue topology templates: `.prepi` / `.prep` / `.in` (loaded with `loadamberprep`), or `.cif` / `.mol2` (loaded with tLeap’s `loadmol2` after an internal CIF→mol2 conversion - `prepgen` is not invoked).
`param`	`amber.defaultParam()` (empty)	Parameter overlays: `.frcmod`.

Inspect the active defaults:

print(amber.defaultFf())

['leaprc.protein.ff14SB', 'leaprc.lipid21', 'leaprc.gaff2', 'leaprc.RNA.OL3', 'leaprc.DNA.bsc1', 'leaprc.water.tip3p']

To replace the force-field set, pass your own list to ff=:

amber.build(
    solvated,
    outdir="./build",
    ff=["leaprc.protein.ff19SB", "leaprc.water.opc"],   # ff19SB protein + OPC water
)

To augment the build with extra residue templates or parameters (e.g. an NCAA you parameterized with antechamber, or a custom cofactor), pass them via topo= and param=:

amber.build(
    solvated,
    outdir="./build",
    topo=["./params/MY_RES.prepi"],
    param=["./params/MY_RES.frcmod"],
)

This is exactly what parameterizeFromSpecs() produces for non-canonical residues in tutorial 02 and onwards - the function returns topo_paths and frcmod_paths ready to feed straight in.

To browse what’s bundled with HTMD’s tLeap install (leaprc files, prepi templates, frcmod overlays):

amber.listFiles()

Parameters that matter#

Argument	Effect
`outdir`	Where the build files land. The directory is created.
`ff`	Force-field overrides, e.g. `ff=["leaprc.protein.ff14SB", "leaprc.water.tip3p"]`. Defaults to `amber.defaultFf()` (protein + water + DNA + RNA + lipid21 + GAFF2).
`ionize`	Add counter-ions and salt. `True` by default.
`saltconc`	NaCl concentration in mol/L when `ionize=True`. Defaults to `0` (counter-ions only).
`disulfide`	`None` for auto-detect, or a list of `(sel1, sel2)` atom-selection pairs.
`caps`	Per-segment caps as `{"P0": ("ACE", "NME")}` (the cap names are uppercase - they map to `ACE.pdb`/`NME.pdb` in HTMD’s cap library, so lowercase raises `FileNotFoundError`). Auto by default; pass `("none", "none")` for a free terminus.

Gotchas#

Always run autoSegment before systemPrepare. The segmentation reflects covalent connectivity, and systemPrepare’s C-terminal handling relies on it.
You don’t have to strip input hydrogens before systemPrepare. If titration=True (the default) PDB2PQR re-protonates all titratable residues at the chosen pH and effectively overrides whatever Hs the input carried; passing titration=False keeps your input protonation. Either way, the prep is what reconciles the hydrogens with the rest of the build.
Solvate before amber.build, not after. amber.build does not wrap a water box on its own.
saltconc defaults to 0 (just neutralising counter-ions). Pass saltconc=0.15 (or your target) for physiological ionic strength.
For systems with bound metals like Zn or Ca, you’ll usually want to keep them by including them in the input - tLeap has parameters for the common ones.