Atom selection#

You will learn: how to select atoms using VMD-style strings, how to compose selections with boolean operators, and how to bypass the parser entirely with NumPy array masks for both clarity and speed.

Prerequisites:

Setup#

from moleculekit.molecule import Molecule

mol = Molecule("3PTB")

3PTB is bovine trypsin: 1701 atoms covering the protein, crystallographic waters, a calcium ion, and the benzamidine ligand (BEN) in the active site.

Step 1 — VMD-style selections#

atomselect() accepts a VMD-style selection string and returns a boolean NumPy array with one entry per atom.

Select all protein atoms:

mask = mol.atomselect("protein")
mask.sum()
np.int64(1629)

A boolean mask of length mol.numAtoms (1701); 1629 atoms belong to the protein chains in this structure.

Select only alpha carbons:

mol.atomselect("name CA").sum()
np.int64(224)

224 alpha carbons — one per residue in the protein.

Select a residue-id range:

mol.atomselect("resid 40 to 60").sum()
np.int64(161)

161 atoms spanning residues 40 through 60 (inclusive).

Select atoms within 5 Å of the benzamidine ligand:

mol.atomselect("within 5 of resname BEN").sum()
np.int64(65)

65 atoms are within 5 Å of any benzamidine atom.

Expand the proximity selection to whole residues:

mol.atomselect("same residue as within 5 of resname BEN").sum()
np.int64(131)

131 atoms — once partial residues touching the 5 Å shell are completed, more atoms are included.

Step 2 — Boolean composition#

Combine keywords with and, or, and not directly inside the selection string:

mol.atomselect("protein and name CA").sum()
np.int64(223)

223 alpha carbons — one less than the raw name CA count because a single CA atom in this structure belongs to the calcium ion record, not a protein residue.

mol.atomselect("(resid 40 to 60) and not water").sum()
np.int64(161)

161 atoms — in 3PTB none of the resid 40–60 atoms are water, so the count is unchanged here; parentheses are allowed for grouping.

Step 3 — Indices instead of masks#

Pass indexes=True to get an integer array of atom indices instead of a boolean mask:

mol.atomselect("resname BEN", indexes=True)
array([1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1638], dtype=uint32)

The 9 benzamidine atom indices. This is the right shape when you want to index directly into mol.coords or similar per-atom arrays without a boolean intermediate.

Recap#

  • VMD-style strings are convenient for ad-hoc selections: mol.atomselect("protein and name CA").

  • Boolean composition (and, or, not, parentheses) lets you build complex selections without leaving the string form.

  • Pass indexes=True to get an integer index array instead of a boolean mask.


For developers: bypass the parser with masks#

Everything below is an optimisation for tight loops and library code; casual scripting almost never needs it. If you are not sure whether you need this, stay on the string form — it is shorter, harder to break, and always re-parsed against whichever Molecule a call operates on.

Warning

Masks and index arrays are tied to a specific Molecule snapshot.

Both boolean masks and integer index arrays refer to specific atom indices in the Molecule they were computed against. They go stale the moment the underlying atom array changes — and there is no runtime check that flags this.

Two concrete failure modes:

  1. The structure changes between computing the mask and using it. Operations like filter(), remove(), append(), insert(), and mutateResidue() reshape the atom array. Any mask or index array you computed beforehand silently refers to the wrong atoms (or runs off the end). Recompute the selection after any such call.

  2. The mask was computed on a different Molecule. Functions that take two molecules — most importantly align(), which accepts sel for mol and refsel for refmol — require each selection to come from its own Molecule. Passing mol.atomselect("name CA") as refsel is wrong; the mask is sized for mol, not refmol. Use a string ("name CA") for cross-Molecule calls, or compute each mask on the right Molecule (refmol.atomselect("name CA")).

The string form is always safe: it is re-parsed against whichever Molecule the call is operating on.

Build a mask without the parser#

Per-atom attributes on Molecule are plain NumPy arrays, so you can construct selection masks with standard array operations:

ben_mask = mol.resname == "BEN"
ben_mask.sum()
np.int64(9)

9 atoms — identical to mol.atomselect("resname BEN"), but the string parser is never called. Compose masks with NumPy bitwise operators:

composite = (mol.chain == "A") & (mol.resid > 100)
composite.sum()
np.int64(1076)

1076 atoms in chain A with resid above 100 — equivalent to mol.atomselect("chain A and resid > 100").

Why bother — the performance gap#

import timeit

t_str  = timeit.timeit(lambda: mol.atomselect("resname BEN"), number=1000)
t_mask = timeit.timeit(lambda: mol.resname == "BEN",          number=1000)

print(f"string path : {t_str:.3f} s for 1000 calls")
print(f"mask path   : {t_mask:.3f} s for 1000 calls")
string path : 9.901 s for 1000 calls
mask path   : 0.019 s for 1000 calls

The mask path is one vectorised NumPy comparison; the string path runs a full parser invocation per call. The gap matters in inner loops; it does not matter in interactive scripting.

Pass masks (and index arrays) anywhere a string is accepted#

Any selector argument that ultimately dispatches through atomselect() accepts a string, a boolean mask of length mol.numAtoms, or a numpy.ndarray of integer indices interchangeably. The non-string branch simply uses the array as the mask directly after normalisation.

ben_mol = mol.copy(sel=mol.resname == "BEN")
ben_mol.numAtoms
9
protein_mask = mol.atomselect("protein")
non_protein  = mol.copy(sel=~protein_mask)
non_protein.numAtoms
72
mol.name[mol.resname == "BEN"]
array(['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C', 'N1', 'N2'], dtype=object)

The trio (string / bool mask / index array) is accepted by:

Next#