Atom selection#

You will learn: how to select atoms using VMD-style strings, how to compose selections with boolean operators, and how to bypass the parser entirely with NumPy array masks for both clarity and speed.

Prerequisites:

The First molecule tutorial, or basic familiarity with Molecule.

Setup#

from moleculekit.molecule import Molecule

mol = Molecule("3PTB")

3PTB is bovine trypsin: 1701 atoms covering the protein, crystallographic waters, a calcium ion, and the benzamidine ligand (BEN) in the active site.

Step 1 — VMD-style selections#

atomselect() accepts a VMD-style selection string and returns a boolean NumPy array with one entry per atom.

Select all protein atoms:

mask = mol.atomselect("protein")
mask.sum()

np.int64(1629)

A boolean mask of length mol.numAtoms (1701); 1629 atoms belong to the protein chains in this structure.

Select only alpha carbons:

mol.atomselect("name CA").sum()

np.int64(224)

224 alpha carbons — one per residue in the protein.

Select a residue-id range:

mol.atomselect("resid 40 to 60").sum()

np.int64(161)

161 atoms spanning residues 40 through 60 (inclusive).

Select atoms within 5 Å of the benzamidine ligand:

mol.atomselect("within 5 of resname BEN").sum()

np.int64(65)

65 atoms are within 5 Å of any benzamidine atom.

Expand the proximity selection to whole residues:

mol.atomselect("same residue as within 5 of resname BEN").sum()

np.int64(131)

131 atoms — once partial residues touching the 5 Å shell are completed, more atoms are included.

Step 2 — Boolean composition#

Combine keywords with and, or, and not directly inside the selection string:

mol.atomselect("protein and name CA").sum()

np.int64(223)

223 alpha carbons — one less than the raw name CA count because a single CA atom in this structure belongs to the calcium ion record, not a protein residue.

mol.atomselect("(resid 40 to 60) and not water").sum()

np.int64(161)

161 atoms — in 3PTB none of the resid 40–60 atoms are water, so the count is unchanged here; parentheses are allowed for grouping.

Step 3 — Indices instead of masks#

Pass indexes=True to get an integer array of atom indices instead of a boolean mask:

mol.atomselect("resname BEN", indexes=True)

array([1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1638], dtype=uint32)

The 9 benzamidine atom indices. This is the right shape when you want to index directly into mol.coords or similar per-atom arrays without a boolean intermediate.

Recap#

VMD-style strings are convenient for ad-hoc selections: mol.atomselect("protein and name CA").
Boolean composition (and, or, not, parentheses) lets you build complex selections without leaving the string form.
Pass indexes=True to get an integer index array instead of a boolean mask.

For developers: bypass the parser with masks#

Everything below is an optimisation for tight loops and library code; casual scripting almost never needs it. If you are not sure whether you need this, stay on the string form — it is shorter, harder to break, and always re-parsed against whichever Molecule a call operates on.

Warning

Masks and index arrays are tied to a specific Molecule snapshot. They go stale the moment the underlying atom array changes (after filter, remove, append, insert, mutateResidue, …) and also cannot be reused across two different molecules (e.g. as refsel for align). There is no runtime check that flags either mistake. The string form is always safe — it is re-parsed against whichever Molecule the call operates on.

For the two failure modes in detail, see The atom-selection language: Mask and index substitution.

Build a mask without the parser#

Per-atom attributes on Molecule are plain NumPy arrays, so you can construct selection masks with standard array operations:

ben_mask = mol.resname == "BEN"
ben_mask.sum()

np.int64(9)

9 atoms — identical to mol.atomselect("resname BEN"), but the string parser is never called. Compose masks with NumPy bitwise operators:

composite = (mol.chain == "A") & (mol.resid > 100)
composite.sum()

np.int64(1076)

1076 atoms in chain A with resid above 100 — equivalent to mol.atomselect("chain A and resid > 100").

Why bother — the performance gap#

import timeit

t_str  = timeit.timeit(lambda: mol.atomselect("resname BEN"), number=1000)
t_mask = timeit.timeit(lambda: mol.resname == "BEN",          number=1000)

print(f"string path : {t_str:.3f} s for 1000 calls")
print(f"mask path   : {t_mask:.3f} s for 1000 calls")

string path : 10.337 s for 1000 calls
mask path   : 0.027 s for 1000 calls

The mask path is one vectorised NumPy comparison; the string path runs a full parser invocation per call. The gap matters in inner loops; it does not matter in interactive scripting.

Pass masks (and index arrays) anywhere a string is accepted#

Any selector argument that ultimately dispatches through atomselect() accepts a string, a boolean mask of length mol.numAtoms, or a numpy.ndarray of integer indices interchangeably. The non-string branch simply uses the array as the mask directly after normalisation.

ben_mol = mol.copy(sel=mol.resname == "BEN")
ben_mol.numAtoms

protein_mask = mol.atomselect("protein")
non_protein  = mol.copy(sel=~protein_mask)
non_protein.numAtoms

mol.name[mol.resname == "BEN"]

array(['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C', 'N1', 'N2'], dtype=object)

The trio (string / bool mask / index array) is accepted by:

remove(), filter(), get(), set(), copy()
templateResidueFromSmiles() (documented as str or numpy.ndarray)
the sel parameter on Metric* projection classes
each per-residue entry in systemPrepare()’s no_opt, no_prot, no_titr, and force_protonation lists — each entry is forwarded to atomselect(), so a bool mask or index array works in place of a string (each entry must resolve to exactly one residue)