Segments, chains, and bonds#
Moleculekit inherits two parallel notions of molecular grouping — chains from the PDB world and segments from the MD world — and stores covalent connectivity in a separate bond array. Understanding why both exist, what populates them, and when each matters helps you avoid subtle bugs in downstream preparation and parameterization workflows.
Chains: PDB heritage#
The chain field (dtype object, shape (numAtoms,)) is read directly from
column 22 of a PDB file, or from the equivalent field in mmCIF/PDBx. It is
typically a single uppercase letter: "A", "B", "C", etc.
PDB convention: each polymer chain (protein, DNA, RNA) gets its own chain
identifier. Small molecules, ions, and water are often all lumped into a single
chain (commonly "A" or left blank). Chain identifiers reset on every new
deposited structure; they are not globally unique.
After loading a PDB:
from moleculekit.molecule import Molecule
mol = Molecule("3ptb")
import numpy as np
print(np.unique(mol.chain)) # e.g. ['A']
Chain identifiers matter for:
Identifying polymer chains visually and in selection strings (
chain A).PDB output: molecules with no chain set (
"") still write a valid file, but tools that expect a populated chain column may fail silently.
Segments: MD heritage#
The segid field (dtype object, shape (numAtoms,)) carries up to four
characters and originates from the CHARMM/NAMD world. AMBER’s tleap,
CHARMM’s PSF builder, and related parameterization tools use the segment
identifier to group atoms into logical units that can receive independent
force-field parameters.
A freshly loaded PDB has empty segment identifiers (""). A PSF or
PRMTOP file — which is written by tleap or similar — almost always has segment
identifiers populated.
print(np.unique(mol.segid)) # [''] for a plain PDB
Why both fields exist#
PDB files were designed for structure deposition and exchange. Column 22 (one character, chain) is the only built-in grouping. MD topology formats needed more granularity — multiple segments per chain for long proteins split across multiple tleap units, separate segments for lipid tails vs. head-groups, etc. — so they introduced the segment concept independently.
In practice:
Use
chainwhen reading PDB output, working with visualization tools, or writing chain-aware selection strings.Use
segidwhen preparing a system for MD parameterization. Tools liketleapread the segment column and need it non-empty and consistent.
If you load a PDB and then hand it to a parameterizer without populating
segid, you will often get an error. Running autoSegment() (see below) before
systemPrepare() is the standard fix.
autoSegment: populating segments automatically#
autoSegment() walks the residue-ID sequence within a selection and assigns
a new segment ID whenever it detects a gap in residue numbering. Each
contiguous run of residues becomes one segment:
from moleculekit.tools.autosegment import autoSegment
mol_seg = autoSegment(mol) # returns a modified copy
import numpy as np
print(np.unique(mol_seg.segid)) # e.g. ['P0', 'P1', 'W0']
The function accepts a basename argument to control naming: basename="P"
produces P0, P1, P2, etc. The fields argument controls which field(s)
are written: ("segid",) (default), ("chain",), or ("segid", "chain").
autoSegment() detects gaps by checking for breaks in resid sequence and is the supported entry point.
Bonds: the connectivity layer#
Bonds live in two parallel arrays:
mol.bonds—uint32, shape(numBonds, 2). Each row[i, j]is an atom-index pair.mol.bondtype—object, shape(numBonds,). Bond-order or type string, parallel tomol.bonds. Common values:"1"(single),"2"(double),"ar"(aromatic),"un"(unknown),"mc"(metal coordination).
The bond count is mol.bonds.shape[0].
When are bonds present?#
Bonds are populated whenever the source file contains explicit connectivity:
Format |
Bonds present? |
Source |
|---|---|---|
PDB with |
Yes |
|
mmCIF / PDBx |
Yes |
|
PSF (CHARMM) |
Yes |
|
PRMTOP (AMBER) |
Yes |
|
MOL2 |
Yes |
|
SDF / MOL |
Yes |
bond table |
Plain PDB (no |
No |
— |
For plain PDB files, call guessBonds()
to infer bonds from inter-atomic distances and radii — it updates mol.bonds
and mol.bondtype together:
mol = Molecule("structure.pdb")
mol.guessBonds()
print(mol.bonds.shape[0]) # now populated
print(mol.bondtype.shape[0]) # same length, kept in lockstep
Metal coordination bonds#
The PDB and mmCIF readers populate metal-coordination bonds in addition
to covalent ones. These appear as regular rows in
mol.bonds with bondtype == "mc". They are preserved through
systemPrepare() when hold_nonpeptidic_bonds=True (the default), so metal
chelation geometries survive the preparation pipeline.
# Select all metal-coordination bonds
mc_mask = mol.bondtype == "mc"
mc_bonds = mol.bonds[mc_mask]
print(mc_bonds) # index pairs involving metal atoms
Topology versus connectivity#
It helps to think of a Molecule as two orthogonal layers:
Topology — atoms and their per-atom fields (
name,resname,resid,chain,segid, …). This layer answers: “what atoms exist and what are their identities?”Connectivity — bonds. This layer answers: “which atoms are covalently (or metal-coordination) bonded to which?”
PDB files deliver topology but typically not connectivity (unless CONECT
records are present). Connectivity-rich formats like PSF and PRMTOP deliver
both. This distinction matters:
Selection strings that rely on
same fragment asor bond-graph traversal require non-emptymol.bonds.systemPrepare()can work without explicit bonds but callsguess_bonds()internally if needed.Any visualization in VMD or the built-in Molstar viewer will draw bonds only if they are present.
Further reading#
How-to: Guess bonds
How-to: Assign segments and chains