The atom-selection language#
Moleculekit ships a VMD-inspired atom-selection language that lets you describe
subsets of atoms in a Molecule using a concise, readable syntax. The same
selection string is accepted wherever an atom selection is expected — by
atomselect(), filter(), remove(), copy(), set(), wrap(), align(), and every
other method that takes a sel argument.
What a selection produces#
Every selection evaluates to a boolean mask — a NumPy array of bool with
length mol.numAtoms, where True marks selected atoms. You can also ask for
an array of integer indices instead:
from moleculekit.molecule import Molecule
mol = Molecule("3ptb")
# Boolean mask (default)
mask = mol.atomselect("protein and backbone")
print(mask.dtype, mask.shape) # bool (numAtoms,)
# Integer indices
idx = mol.atomselect("resname BEN", indexes=True)
print(idx) # array of uint32 atom indices
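The two return forms carry the same information, and plain NumPy converts between them. The sketch below uses a small hand-written mask as a stand-in for a real atomselect() result; the array values are made up for illustration, and np.flatnonzero returns a platform integer dtype rather than the uint32 that atomselect() reports.

```python
import numpy as np

# Stand-in for a boolean mask returned by atomselect() on a 6-atom molecule.
mask = np.array([False, True, True, False, False, True])

# Mask -> integer indices of the selected atoms.
idx = np.flatnonzero(mask)

# Indices -> boolean mask with one entry per atom.
rebuilt = np.zeros(mask.shape[0], dtype=bool)
rebuilt[idx] = True

print(idx)                             # positions of the True entries
print(np.array_equal(mask, rebuilt))   # the round trip preserves the selection
```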
The mask can be used everywhere a string is accepted — pass it directly to
filter, copy, etc. to skip re-parsing (faster when reusing the same
selection many times):
prot_mask = mol.atomselect("protein")
mol_prot = mol.copy(sel=prot_mask) # reuses precomputed mask
Keyword selections#
The following keywords select entire chemical classes based on residue-name lookups and element checks:
| Keyword | What it selects |
|---|---|
| protein | All protein residues (canonical amino acids) |
| | All nucleic acid residues (DNA and RNA) |
| water | Water molecules ( |
| | Common lipid residues |
| | Common monatomic ions |
| backbone | Protein backbone atoms ( |
| | Protein sidechain atoms (non-backbone, non-hydrogen heavy atoms) |
| hydrogen | All atoms with |
| | All non-hydrogen atoms |
| | Every atom in the molecule |
Per-atom field comparisons#
You can test any per-atom field against a value or list of values:
# Single value
mol.atomselect("resname ALA")
mol.atomselect("chain A")
mol.atomselect("element C")
# List of values (space-separated, no commas)
mol.atomselect("name CA N C O")
mol.atomselect("resname ALA GLY VAL")
mol.atomselect("chain A B")
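Conceptually, a multi-value comparison is a set-membership test on the field. A minimal NumPy sketch of that idea, using a made-up array of atom names in place of a real Molecule (this illustrates the semantics only, not how the parser is implemented):

```python
import numpy as np

# Made-up atom names standing in for a Molecule's per-atom name field.
name = np.array(["N", "CA", "CB", "C", "O", "CG"])

# Rough equivalent of the selection "name CA N C O":
# True wherever the atom name is any of the listed values.
mask = np.isin(name, ["CA", "N", "C", "O"])

print(mask)
```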
Fields available for selection strings:
| Field | Description |
|---|---|
| name | Atom name |
| resname | Residue name |
| resid | Residue sequence number |
| | Zero-based internal residue index (contiguous, ignores |
| index | Zero-based atom index |
| | One-based atom serial number (as stored in the file) |
| chain | Chain identifier |
| | Segment identifier |
| element | Element symbol |
| | Alternate location identifier |
| occupancy | Occupancy value |
| beta | B-factor |
| charge | Partial charge |
| mass | Atomic mass |
| | Insertion code |
| x, y, z | Cartesian coordinates (Å) at the current frame |
Numeric fields can also be wrapped in the functions abs, sqr, and
sqrt (e.g. abs(charge) > 0.5, sqrt(sqr(x) + sqr(y)) < 10).
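To make the two example expressions concrete, here is what they compute, written as a NumPy sketch over made-up per-atom arrays. The values are illustrative only; in a real session the selection language evaluates these against the Molecule's own fields.

```python
import numpy as np

# Made-up per-atom data for four atoms.
charge = np.array([-0.8, 0.2, 0.6, -0.1])
x = np.array([0.0, 3.0, 8.0, -12.0])   # Å
y = np.array([0.0, 4.0, 8.0, 1.0])     # Å

# Rough equivalent of "abs(charge) > 0.5".
strongly_charged = np.abs(charge) > 0.5

# Rough equivalent of "sqrt(sqr(x) + sqr(y)) < 10":
# atoms closer than 10 Å to the z axis.
near_z_axis = np.sqrt(np.square(x) + np.square(y)) < 10

print(strongly_charged)
print(near_z_axis)
```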
A dedicated backbonetype selector classifies atoms by backbone type:
backbonetype proteinback, backbonetype nucleicback, and
backbonetype normal (everything that is neither protein nor nucleic
backbone).
Note
Mass- and charge-based selections only work if those fields are
populated. A molecule freshly read from a PDB file has all masses (and
usually charges) set to zero, so mass > 0 would match nothing until
masses are assigned.
Comparison operators and ranges#
Numeric fields support comparison operators and range syntax:
# Comparisons
mol.atomselect("resid > 50")
mol.atomselect("occupancy >= 0.5")
mol.atomselect("beta < 20")
# Range (inclusive on both ends)
mol.atomselect("resid 40 to 60")
mol.atomselect("index 0 to 99")
# Negation: use `not`
mol.atomselect("not chain B")
The != operator only applies inside the modulo form (e.g.
resid % 2 != 0); for plain field comparisons use not.
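The range, modulo, and negation forms can be read as ordinary array comparisons. The following NumPy sketch mirrors them on made-up residue numbers and chain labels; it is meant to show the semantics described above, not the library's internals.

```python
import numpy as np

# Made-up residue numbers and chain labels for six atoms.
resid = np.array([38, 40, 41, 59, 60, 61])
chain = np.array(["A", "A", "B", "B", "A", "B"])

# "resid 40 to 60": both endpoints are included.
in_range = (resid >= 40) & (resid <= 60)

# "resid % 2 != 0": odd residue numbers.
odd = resid % 2 != 0

# "not chain B": invert the mask of the plain comparison.
not_chain_b = ~(chain == "B")

print(in_range, odd, not_chain_b)
```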
Boolean composition#
Combine selections with and, or, not, and parentheses:
mol.atomselect("protein and chain A")
mol.atomselect("resname ALA or resname GLY")
mol.atomselect("not water")
mol.atomselect("(protein and backbone) or (resname BEN and not hydrogen)")
not binds tighter than and/or. Crucially, and and or have
equal precedence (they share one non-associative level), so in a chain
of mixed and/or, and does not bind before or. For example:
# Parses as: protein and (name CA or name CB)
mol.atomselect("protein and name CA or name CB")
A C-like grammar, where and binds before or, would instead read this as
(protein and name CA) or name CB.
Because the grouping of mixed and/or is easy to misread, always use
explicit parentheses when combining the two operators:
mol.atomselect("protein and (name CA or name CB)") # clear intent
mol.atomselect("(protein and name CA) or name CB") # the other grouping
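The two groupings select different atoms whenever a non-protein atom matches the last term. A small pure-Python sketch over three made-up atoms shows the difference; the parentheses are written out explicitly, so the result does not depend on Python's own operator precedence.

```python
# Three made-up atoms: (is_protein, atom name).
atoms = [
    (True, "CA"),   # protein C-alpha
    (False, "CB"),  # non-protein atom that happens to be named CB
    (True, "N"),    # protein backbone nitrogen
]

# Grouping 1: protein and (name CA or name CB)
group_a = [prot and (name == "CA" or name == "CB") for prot, name in atoms]

# Grouping 2: (protein and name CA) or name CB
group_b = [(prot and name == "CA") or name == "CB" for prot, name in atoms]

print(group_a)  # the non-protein CB atom is excluded
print(group_b)  # the non-protein CB atom is included
```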
Distance operators#
Distance-based selections are evaluated at the current frame (mol.frame):
# All atoms within 5 Å of the ligand (including the ligand itself)
mol.atomselect("within 5 of resname BEN")
# All atoms within 5 Å of the ligand, excluding the ligand
mol.atomselect("exwithin 5 of resname BEN")
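The difference between the two operators is easiest to see as array arithmetic. The NumPy sketch below uses made-up coordinates and a hand-written ligand mask. The strict `<` comparison is an assumption: the text above does not say how atoms exactly at the cutoff are treated, so the example avoids boundary cases.

```python
import numpy as np

# Made-up coordinates (Å) for five atoms; the last two form the "ligand".
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0],
    [20.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
])
ligand = np.array([False, False, False, True, True])

# Distance from every atom (rows) to every ligand atom (columns).
diff = coords[:, None, :] - coords[ligand][None, :, :]
dist = np.linalg.norm(diff, axis=-1)

# "within 5 of <ligand>": at least one ligand atom inside the cutoff.
within = (dist < 5.0).any(axis=1)

# "exwithin 5 of <ligand>": the same atoms minus the ligand itself.
exwithin = within & ~ligand

print(within)
print(exwithin)
```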
same … as operators#
Expand a selection to cover complete residues, chains, or bond-graph fragments:
# All atoms in any residue that has at least one backbone atom within 5 Å
mol.atomselect("same residue as (backbone and within 5 of resname BEN)")
# All atoms in any chain that contains a titratable histidine
mol.atomselect("same chain as resname HID HIE HIP")
# All atoms in the same covalently bonded fragment as the ligand
mol.atomselect("same fragment as resname BEN")
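The expansion in the examples above amounts to: find the groups touched by the inner selection, then select every atom belonging to one of those groups. A NumPy sketch of that logic for residues, using a made-up per-atom residue index and seed mask:

```python
import numpy as np

# Made-up per-atom residue indices (three residues) and a seed selection.
residue = np.array([0, 0, 0, 1, 1, 2, 2, 2])
seed = np.array([False, True, False, False, False, False, False, True])

# "same residue as <seed>": every atom whose residue contains a seed atom.
expanded = np.isin(residue, residue[seed])

print(expanded)
```

The same pattern applies to chains by swapping the residue array for a per-atom chain label.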
fragment groups atoms by connected components of the bond graph. For this to
work correctly, mol.bonds must be populated (see
Guess bonds).
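To illustrate why fragment selection depends on bonds, here is a pure-Python sketch of connected-component labelling with a union-find structure. The fragment_ids helper and the bond list are hypothetical and are not part of Moleculekit; the bond list is assumed to be pairs of zero-based atom indices.

```python
def fragment_ids(num_atoms, bonds):
    """Label each atom with the id of its connected component."""
    parent = list(range(num_atoms))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    return [find(i) for i in range(num_atoms)]


# Made-up 6-atom system: atoms 0-1-2 are bonded, 3-4 are bonded, 5 is alone.
bonds = [(0, 1), (1, 2), (3, 4)]
frag = fragment_ids(6, bonds)

# "same fragment as <atom 4>": all atoms sharing atom 4's component.
same_fragment = [f == frag[4] for f in frag]
print(same_fragment)

# With no bonds at all, every atom ends up in its own fragment.
print(len(set(fragment_ids(6, []))))
```

The last line shows the failure mode: with an empty bond list, a fragment expansion cannot grow beyond the atoms already selected.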
Cheat-sheet#
| Expression | Example | Meaning |
|---|---|---|
| keyword | | Predefined chemical class |
| | | Field equals value |
| | | Field equals any of the values |
| | | Numeric range (inclusive) |
| | | Numeric comparison |
| | | Boolean logic |
| | | Distance from selection |
| | | Distance, excluding selection |
| | | Whole-residue/chain/fragment expansion |
Mask and index substitution#
Any method that accepts a selection string also accepts:
- A boolean NumPy array of length mol.numAtoms: passed through without parsing, ideal for reusing expensive selections.
- An integer NumPy array of atom indices: converted automatically.
import numpy as np
# Precompute once, reuse many times
prot_mask = mol.atomselect("protein")
mol.copy(sel=prot_mask)
mol.set("beta", 0, sel=prot_mask)
# filter() removes atoms in place, so the mask is stale afterwards: use it here last
mol.filter(prot_mask)
Precomputed masks and index arrays are tied to a specific Molecule snapshot, and there is no runtime check that flags a stale one. They go stale in two ways:
- The structure changes between computing the mask and using it. Any operation that changes the number or order of atoms (filter(), remove(), append(), insert(), mutateResidue(), or adding hydrogens) silently invalidates any mask computed beforehand: it refers to the wrong atoms or runs off the end. Recompute the selection after such operations.
- The mask was computed on a different Molecule. Functions that take two molecules, most importantly align(), which accepts sel for mol and refsel for refmol, require each selection to come from its own Molecule. A mask sized for mol is wrong as a refsel for refmol. Use a string for cross-Molecule calls, or compute each mask on the right Molecule.
The string form is always safe: it is re-parsed against whichever Molecule the call operates on.
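The first failure mode can be reproduced with plain NumPy arrays. The sketch below uses made-up residue names as a stand-in for a Molecule and simulates a structural edit by dropping the waters; no Moleculekit calls are involved.

```python
import numpy as np

# Made-up residue names standing in for a 5-atom Molecule.
resname = np.array(["ALA", "HOH", "GLY", "HOH", "BEN"])

# Selections computed on the original 5 atoms.
gly_idx = np.flatnonzero(resname == "GLY")   # index of the GLY atom
ben_mask = resname == "BEN"                  # boolean mask of length 5

# Simulate a structural edit: drop the waters, leaving 3 atoms.
resname_after = resname[resname != "HOH"]

# Failure mode 1: the old index is still in range but names another atom.
wrong_atom = resname_after[gly_idx]

# Failure mode 2: the old mask no longer matches the atom count.
length_mismatch = ben_mask.shape[0] != resname_after.shape[0]

# Recomputing on the edited data gives the right answer again.
fresh_gly_idx = np.flatnonzero(resname_after == "GLY")

print(wrong_atom, length_mismatch, fresh_gly_idx)
```

Comparing the mask length with the current atom count is a cheap sanity check, but it only catches edits that change the number of atoms. A stale index array or a reordering can pass that check and still point at the wrong atoms.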
What is not supported#
- VMD’s index from < N range variant for loading trajectory subsets is not exposed.
- Complex regex on atom names (VMD’s =~ regex operator) is not implemented.
- The pbwithin periodic-boundary-aware distance selection is not available; use wrap first if working with periodic systems.
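Because boolean masks are accepted wherever a selection string is, one possible workaround for the missing regex operator is to do the pattern matching in Python and pass the resulting mask. The sketch below uses the standard re module on a made-up list of atom names; how you obtain the per-atom names from your Molecule is left as an assumption.

```python
import re

# Made-up atom names standing in for a Molecule's per-atom name field.
names = ["CA", "CB", "CG1", "CG2", "HG11", "N", "O"]

# Match names that are exactly "CG" followed by a single digit.
pattern = re.compile(r"CG\d")
mask = [pattern.fullmatch(n) is not None for n in names]

print(mask)
```

Converting the list with np.asarray(mask, dtype=bool) gives a boolean array of the kind described under mask and index substitution.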
Further reading#
Tutorial: Atom selection
How-to: Select atoms