Segments, chains, and bonds#

Moleculekit inherits two parallel notions of molecular grouping — chains from the PDB world and segments from the MD world — and stores covalent connectivity in a separate bond array. Understanding why both exist, what populates them, and when each matters helps you avoid subtle bugs in downstream preparation and parameterization workflows.

Chains: PDB heritage#

The chain field (dtype object, shape (numAtoms,)) is read directly from column 22 of a PDB file, or from the equivalent field in mmCIF/PDBx. It is typically a single uppercase letter: "A", "B", "C", etc.

PDB convention: each polymer chain (protein, DNA, RNA) gets its own chain identifier. Small molecules, ions, and water are often all lumped into a single chain (commonly "A" or left blank). Chain identifiers reset on every new deposited structure; they are not globally unique.

After loading a PDB:

from moleculekit.molecule import Molecule

mol = Molecule("3ptb")
import numpy as np
print(np.unique(mol.chain))   # e.g. ['A']

Chain identifiers matter for:

  • Identifying polymer chains visually and in selection strings (chain A).

  • PDB output: molecules with no chain set ("") still write a valid file, but tools that expect a populated chain column may fail silently.

Segments: MD heritage#

The segid field (dtype object, shape (numAtoms,)) carries up to four characters and originates from the CHARMM/NAMD world. AMBER’s tleap, CHARMM’s PSF builder, and related parameterization tools use the segment identifier to group atoms into logical units that can receive independent force-field parameters.

A freshly loaded PDB has empty segment identifiers (""). A PSF or PRMTOP file — which is written by tleap or similar — almost always has segment identifiers populated.

print(np.unique(mol.segid))   # [''] for a plain PDB

Why both fields exist#

PDB files were designed for structure deposition and exchange. Column 22 (one character, chain) is the only built-in grouping. MD topology formats needed more granularity — multiple segments per chain for long proteins split across multiple tleap units, separate segments for lipid tails vs. head-groups, etc. — so they introduced the segment concept independently.

In practice:

  • Use chain when reading PDB output, working with visualization tools, or writing chain-aware selection strings.

  • Use segid when preparing a system for MD parameterization. Tools like tleap read the segment column and need it non-empty and consistent.

If you load a PDB and then hand it to a parameterizer without populating segid, you will often get an error. Running autoSegment() (see below) before systemPrepare() is the standard fix.

autoSegment: populating segments automatically#

autoSegment() walks the residue-ID sequence within a selection and assigns a new segment ID whenever it detects a gap in residue numbering. Each contiguous run of residues becomes one segment:

from moleculekit.tools.autosegment import autoSegment

mol_seg = autoSegment(mol)   # returns a modified copy
import numpy as np
print(np.unique(mol_seg.segid))   # e.g. ['P0', 'P1', 'W0']

The function accepts a basename argument to control naming: basename="P" produces P0, P1, P2, etc. The fields argument controls which field(s) are written: ("segid",) (default), ("chain",), or ("segid", "chain").

autoSegment() detects gaps by checking for breaks in resid sequence and is the supported entry point.

Bonds: the connectivity layer#

Bonds live in two parallel arrays:

  • mol.bondsuint32, shape (numBonds, 2). Each row [i, j] is an atom-index pair.

  • mol.bondtypeobject, shape (numBonds,). Bond-order or type string, parallel to mol.bonds. Common values: "1" (single), "2" (double), "ar" (aromatic), "un" (unknown), "mc" (metal coordination).

The bond count is mol.bonds.shape[0].

When are bonds present?#

Bonds are populated whenever the source file contains explicit connectivity:

Format

Bonds present?

Source

PDB with CONECT records

Yes

CONECT + LINK blocks

mmCIF / PDBx

Yes

_struct_conn entries

PSF (CHARMM)

Yes

!NBOND section

PRMTOP (AMBER)

Yes

BONDS_* section

MOL2

Yes

@<TRIPOS>BOND section

SDF / MOL

Yes

bond table

Plain PDB (no CONECT)

No

For plain PDB files, call guessBonds() to infer bonds from inter-atomic distances and radii — it updates mol.bonds and mol.bondtype together:

mol = Molecule("structure.pdb")
mol.guessBonds()
print(mol.bonds.shape[0])      # now populated
print(mol.bondtype.shape[0])   # same length, kept in lockstep

Metal coordination bonds#

The PDB and mmCIF readers populate metal-coordination bonds in addition to covalent ones. These appear as regular rows in mol.bonds with bondtype == "mc". They are preserved through systemPrepare() when hold_nonpeptidic_bonds=True (the default), so metal chelation geometries survive the preparation pipeline.

# Select all metal-coordination bonds
mc_mask = mol.bondtype == "mc"
mc_bonds = mol.bonds[mc_mask]
print(mc_bonds)   # index pairs involving metal atoms

Topology versus connectivity#

It helps to think of a Molecule as two orthogonal layers:

  • Topology — atoms and their per-atom fields (name, resname, resid, chain, segid, …). This layer answers: “what atoms exist and what are their identities?”

  • Connectivity — bonds. This layer answers: “which atoms are covalently (or metal-coordination) bonded to which?”

PDB files deliver topology but typically not connectivity (unless CONECT records are present). Connectivity-rich formats like PSF and PRMTOP deliver both. This distinction matters:

  • Selection strings that rely on same fragment as or bond-graph traversal require non-empty mol.bonds.

  • systemPrepare() can work without explicit bonds but calls guess_bonds() internally if needed.

  • Any visualization in VMD or the built-in Molstar viewer will draw bonds only if they are present.

Further reading#