The Molecule data model#
A Molecule in moleculekit is more than a container for atom coordinates. It
holds the full topology of a molecular system — every per-atom field, the
bond graph, periodic box information, and provenance metadata — together with
any number of trajectory frames. Understanding the data layout lets you
write efficient code, avoid pitfalls with in-place mutation, and trace data
back to its source files.
Per-atom arrays#
Every atom in a Molecule is described by a set of parallel NumPy arrays, all
of length mol.numAtoms. The most commonly used fields are:
Field |
dtype |
Typical content |
|---|---|---|
|
|
Atom name, e.g. |
|
|
Residue name, e.g. |
|
|
Residue sequence number (PDB column 23–26) |
|
|
Insertion code (PDB column 27), often |
|
|
Chain identifier (PDB column 22), e.g. |
|
|
Segment identifier (MD convention), e.g. |
|
|
Element symbol, e.g. |
|
|
PDB occupancy, default 1.0 |
|
|
PDB B-factor / temperature factor |
|
|
Partial charge (populated by force-field tools) |
|
|
Atomic mass in Da |
|
|
PDB record type: |
|
|
Integer formal charge (populated by |
|
|
Force-field atom type (populated downstream) |
Because these arrays are NumPy arrays you can inspect, compare, and manipulate them directly:
from moleculekit.molecule import Molecule
mol = Molecule("3ptb")
print(mol.name.dtype) # object
print(mol.resid.dtype) # int64
print(mol.occupancy.dtype) # float32
# Vectorised field access — no loop needed
ca_mask = mol.name == "CA"
print(mol.resid[ca_mask]) # resids of all alpha-carbons
There is no hidden indirection: mol.name[i] is always the atom-name string
for atom i, and mol.resid[i] is always its residue number.
Coordinates: coords#
Atomic positions are stored in a single array of shape
(numAtoms, 3, numFrames), dtype float32, in units of Ångström:
mol.coords.shape # (numAtoms, 3, numFrames)
mol.numAtoms # first axis
mol.numFrames # third axis
Indexing patterns you will use often:
# First frame, all atoms — shape (numAtoms, 3)
frame0 = mol.coords[:, :, 0]
# All frames for atom 42 — shape (3, numFrames)
traj_atom42 = mol.coords[42, :, :]
# x-coordinates of all atoms in the last frame
x_last = mol.coords[:, 0, -1]
For a single-frame structure mol.numFrames == 1 and mol.coords[:, :, 0]
gives the familiar (numAtoms, 3) coordinate matrix.
Bonds: bonds and bondtype#
Bonding information is stored separately from atom fields:
mol.bonds—uint32array of shape(numBonds, 2), where each row is a pair of atom indices[i, j].mol.bondtype—objectarray of shape(numBonds,), holding a bond-order or bond-type string (e.g."1","2","ar","un","mc"for metal coordination) parallel tomol.bonds.
The number of bonds is mol.bonds.shape[0]. When a file is read without
explicit bond information (e.g. a plain PDB with no CONECT records) the
array is empty; call guessBonds()
to populate it. The method updates mol.bonds and mol.bondtype together
so the two parallel arrays stay consistent.
mol = Molecule("plain.pdb") # a PDB with no CONECT records
print(mol.bonds.shape[0]) # 0 — no connectivity loaded
mol.guessBonds()
print(mol.bonds.shape[0]) # now populated
print(mol.bondtype.shape[0]) # same length, kept in lockstep
print(mol.bonds[:5]) # first 5 bonds as index pairs
Avoid the form mol.bonds = guess_bonds(mol): it updates only bonds
and leaves bondtype at its old length, so the two arrays drift out of
sync.
Bonds are atom-index pairs, not residue identifiers. If atoms are reordered or filtered out, index pairs must be regenerated.
Periodic box: box and boxangles#
For systems with periodic boundary conditions:
mol.box—float32array of shape(3, numFrames), holding the box lengths[a, b, c]in Ångström for each frame.mol.boxangles—float32array of shape(3, numFrames), holding the box angles[α, β, γ]in degrees.
Orthorhombic (rectangular) boxes have all angles equal to 90°. A non-periodic structure has all-zero box arrays.
mol.box[:, 0] # box lengths of the first frame
mol.boxangles[:, 0] # box angles of the first frame
# Orthorhombic check for frame 0:
all_ortho = (mol.boxangles[:, 0] == 90).all()
Units#
Moleculekit uses a single distance unit throughout: Ångström (Å). This applies to:
mol.coords— atomic positions.mol.box— periodic-box lengths.All readers and writers (regardless of the source format’s native units — GROMACS’
.gro/.xtcuse nanometres on disk; moleculekit converts to Å on load and converts back on write).All distance parameters in the library (
coldist,spatialgap,find_clashesthresholds,within X ofselections, etc.).
Angles — mol.boxangles, dihedrals returned by getDihedral(), and rotation angles passed to setDihedral() — are in radians for the function APIs, except mol.boxangles which is in degrees (matching the PDB convention).
Provenance: fileloc#
mol.fileloc is a Python list of [filename, frame_index] pairs, one per
trajectory frame. It records which file each frame came from and its position
within that file:
# After loading a trajectory
mol.fileloc[0] # e.g. ['/data/run1/traj.xtc', 0]
mol.fileloc[-1] # last frame's provenance
This is especially useful when you concatenate multiple trajectory files: you can trace any frame back to its exact source.
Mutation semantics#
Many Molecule methods mutate the object in place:
mol.filter(sel)— keeps only selected atoms; modifiesmoldirectly.mol.remove(sel)— removes selected atoms; modifiesmoldirectly.mol.set(field, value, sel=...)— assigns values to a field; modifiesmoldirectly.mol.wrap(...)— re-images coordinates into the periodic box; modifiesmoldirectly.mol.align(sel, refmol)— superposes frames; modifiesmoldirectly.
If you need to preserve the original molecule, call mol.copy() first:
mol_orig = Molecule("3ptb")
# Wrong — mol_orig is now the filtered result
mol_orig.filter("protein")
# Correct — keep the original, work on a copy
mol_orig = Molecule("3ptb")
mol_protein = mol_orig.copy()
mol_protein.filter("protein")
copy() with a sel= argument is a convenient shorthand that creates a new Molecule
containing only the selected atoms:
mol_lig = mol.copy(sel="resname BEN")
Topology versus trajectory layering#
A Molecule is built in two conceptual layers:
Topology layer — the per-atom arrays (
name,resname,resid,chain,segid,bonds, …) that describe what atoms exist and how they are connected.numAtomsis determined here. Typically read from a PDB, PSF, PRMTOP, or similar topology file.Trajectory layer — the
coords,box,boxangles,fileloc,step, andtimearrays that carry per-frame data.numFramesis determined here. Loaded from XTC, DCD, TRR, NetCDF, or similar trajectory files.
The standard MD workflow reads topology first, then appends one or more trajectory files:
mol = Molecule("system.prmtop") # topology only, numFrames == 0
mol.read("run1.xtc") # appends frames
mol.read("run2.xtc", append=True) # appends more frames
print(mol.numFrames) # total frames from both trajectories
Topology is stable across trajectory reads. Only frame-indexed arrays change when trajectories are appended. You cannot load a trajectory whose atom count differs from the topology — moleculekit will raise an error.
Further reading#
How-to: Read a structure
How-to: Set and get properties
How-to: Filter and remove atoms