Segments, chains, and bonds#
Moleculekit inherits two parallel notions of molecular grouping — chains from the PDB world and segments from the MD world — and stores covalent connectivity in a separate bond array. Understanding why both exist, what populates them, and when each matters helps you avoid subtle bugs in downstream preparation and parameterization workflows.
Chains: PDB heritage#
The chain field (dtype object, shape (numAtoms,)) is read directly from
column 22 of a PDB file, or from the equivalent field in mmCIF/PDBx. It is
typically a single uppercase letter: "A", "B", "C", etc.
PDB convention: each polymer chain (protein, DNA, RNA) gets its own chain
identifier. Small molecules, ions, and water are often all lumped into a single
chain (commonly "A" or left blank). Chain identifiers reset on every new
deposited structure; they are not globally unique.
After loading a PDB:
from moleculekit.molecule import Molecule
mol = Molecule("3ptb")
import numpy as np
print(np.unique(mol.chain)) # e.g. ['A']
Chain identifiers matter for:
Identifying polymer chains visually and in selection strings (
chain A).PDB output: molecules with no chain set (
"") still write a valid file, but tools that expect a populated chain column may fail silently.
Segments: MD heritage#
The segid field (dtype object, shape (numAtoms,)) carries up to four
characters and originates from the CHARMM/NAMD world. AMBER’s tleap,
CHARMM’s PSF builder, and related parameterization tools use the segment
identifier to group atoms into logical units that can receive independent
force-field parameters.
A freshly loaded PDB has empty segment identifiers (""). A PSF or
PRMTOP file — which is written by tleap or similar — almost always has segment
identifiers populated.
print(np.unique(mol.segid)) # [''] for a plain PDB
Why both fields exist#
PDB files were designed for structure deposition and exchange. Column 22 (one character, chain) is the only built-in grouping. MD topology formats needed more granularity — multiple segments per chain for long proteins split across multiple tleap units, separate segments for lipid tails vs. head-groups, etc. — so they introduced the segment concept independently.
In practice:
Use
chainwhen reading PDB output, working with visualization tools, or writing chain-aware selection strings.Use
segidwhen preparing a system for MD parameterization. Tools liketleapread the segment column and need it non-empty and consistent.
If you load a PDB and then hand it to a parameterizer without populating
segid, you will often get an error. Running autoSegment() (see below) before
systemPrepare() is the standard fix.
autoSegment: populating segments automatically#
autoSegment() assigns segments by
following the physical backbone of each polymer. It walks the residues of a
selection in file order and starts a new segment only where the backbone is
actually broken — judged from atomic coordinates, not from residue numbering —
so it is robust to structures where residues were deleted from the sequence
while the chain stayed continuous.
from moleculekit.tools.autosegment import autoSegment
mol_seg = autoSegment(mol) # returns a modified copy
import numpy as np
print(np.unique(mol_seg.segid)) # e.g. ['P0', 'P1', 'P2']
A new segment begins between two consecutive residues when any of these holds:
the backbone link distance exceeds the cutoff — for protein the
C(i)–N(i+1)peptide bond (defaultprotein_cutoff=2.0Å, falling back toCA–CAwhen the carbonyl/amide atoms are missing), for nucleic acids theO3'(i)–P(i+1)phosphodiester bond (defaultnucleic_cutoff=2.2Å);the
chainorsegidalready recorded in the file changes;the polymer type changes (protein vs nucleic).
Because continuity is judged from coordinates, a gap in residue numbering with an intact backbone (for example residues mutated out of a sequence) stays a single segment, while a genuine spatial break — even one with continuous numbering — is split into two.
Non-polymer atoms are grouped separately: all water collapses into one
segment, all ions into another, and the remaining molecules (“other”) are
split into one segment per bonded molecule. Pass single_other_segment=True to
place all of the “other” molecules into a single segment instead.
The basename argument controls naming (basename="P" produces P0, P1,
P2, …). The fields argument controls which field(s) are written:
("segid",) (default), ("chain",), or ("segid", "chain").
Run autoSegment() before
systemPrepare() so the parameterizer
receives a populated, consistent segid.
Note
The older autoSegment2 — which segmented by the covalent bond graph — is
deprecated. It now forwards to
autoSegment() and emits a
DeprecationWarning; call autoSegment directly instead.
Bonds: the connectivity layer#
Bonds live in two parallel arrays:
mol.bonds—uint32, shape(numBonds, 2). Each row[i, j]is an atom-index pair.mol.bondtype—object, shape(numBonds,). Bond-order or type string, parallel tomol.bonds. Common values:"1"(single),"2"(double),"ar"(aromatic),"un"(unknown),"mc"(metal coordination).
The bond count is mol.bonds.shape[0].
When are bonds present?#
Bonds are populated whenever the source file contains explicit connectivity:
Format |
Bonds present? |
Source |
|---|---|---|
PDB with |
Yes |
|
mmCIF / PDBx |
Yes |
|
PSF (CHARMM) |
Yes |
|
PRMTOP (AMBER) |
Yes |
|
MOL2 |
Yes |
|
SDF / MOL |
Yes |
bond table |
Plain PDB (no |
No |
— |
For plain PDB files, call guessBonds()
to infer bonds from inter-atomic distances and radii. It updates mol.bonds and
mol.bondtype together so the two parallel arrays stay consistent — do not
assign mol.bonds = guess_bonds(mol) directly. See
The Molecule data model: Bonds
for the full recipe.
mol = Molecule("structure.pdb")
mol.guessBonds()
print(mol.bonds.shape[0]) # now populated
Metal coordination bonds#
The PDB and mmCIF readers populate metal-coordination bonds in addition
to covalent ones. These appear as regular rows in
mol.bonds with bondtype == "mc". They are preserved through
systemPrepare() when hold_nonpeptidic_bonds=True (the default), so metal
chelation geometries survive the preparation pipeline.
# Select all metal-coordination bonds
mc_mask = mol.bondtype == "mc"
mc_bonds = mol.bonds[mc_mask]
print(mc_bonds) # index pairs involving metal atoms
Topology versus connectivity#
It helps to think of a Molecule as two orthogonal layers:
Topology — atoms and their per-atom fields (
name,resname,resid,chain,segid, …). This layer answers: “what atoms exist and what are their identities?”Connectivity — bonds. This layer answers: “which atoms are covalently (or metal-coordination) bonded to which?”
PDB files deliver topology but typically not connectivity (unless CONECT
records are present). Connectivity-rich formats like PSF and PRMTOP deliver
both. This distinction matters:
Selection strings that rely on
same fragment asor bond-graph traversal require non-emptymol.bonds.systemPrepare()can work without explicit bonds but callsguess_bonds()internally if needed.Any visualization in VMD or the built-in Molstar viewer will draw bonds only if they are present.
Further reading#
How-to: Guess bonds
How-to: Assign segments and chains