First molecule
You will learn: how to fetch a structure from the PDB, inspect its contents, filter atoms, and write it back to disk.
Prerequisites:
moleculekit installed.
Setup
Import Molecule — the central class for all structure manipulation in moleculekit.
from moleculekit.molecule import Molecule
Step 1 — Load a structure
mol = Molecule("3PTB")
The Molecule constructor accepts either a
local file path (PDB, mmCIF, MOL2, PRMTOP, PSF, … — see How to read a
structure for the full list of supported
formats) or a four-character RCSB PDB ID, which it downloads and parses
on the fly. Here we use the PDB ID 3PTB: bovine trypsin, 1701 atoms
covering one protein chain, a shell of crystallographic water molecules,
a calcium ion, and the benzamidine ligand in the active site.
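To make the two kinds of argument concrete, here is a minimal sketch of how a string could be told apart as a PDB ID or a file path. The helper looks_like_pdb_id is hypothetical and is not how moleculekit decides internally; it only encodes the classic ID shape of one digit (1 to 9) followed by three letters or digits.

```python
import re

# Hypothetical helper, not part of moleculekit. Classic PDB IDs are four
# characters long: a digit 1-9 followed by three letters or digits.
PDB_ID_PATTERN = re.compile(r"[1-9][A-Za-z0-9]{3}")


def looks_like_pdb_id(text: str) -> bool:
    """Return True if text has the shape of a classic four-character PDB ID."""
    return PDB_ID_PATTERN.fullmatch(text) is not None


for candidate in ["3PTB", "trypsin.pdb", "structures/3ptb.cif"]:
    kind = "PDB ID" if looks_like_pdb_id(candidate) else "file path"
    print(f"{candidate!r} looks like a {kind}")
```

Only the first candidate has the shape of an ID; the other two would be treated as paths under this assumption.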
Step 2 — Inspect basics
mol.numAtoms
1701
numAtoms is a single integer — the total number of atoms in the loaded structure.
mol.numFrames
1
numFrames is 1 for a static PDB; it grows when you load a trajectory.
sorted(set(mol.resname))
['ALA',
'ARG',
'ASN',
'ASP',
'BEN',
'CA',
'CYS',
'GLN',
'GLU',
'GLY',
'HIS',
'HOH',
'ILE',
'LEU',
'LYS',
'MET',
'PHE',
'PRO',
'SER',
'THR',
'TRP',
'TYR',
'VAL']
mol.resname is an array with one entry per atom; wrapping it in set gives
the unique residue names present. You should see standard amino-acid codes
alongside BEN (benzamidine), CA (calcium), and HOH (water).
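If you also want to know how many atoms carry each residue name, np.unique can count them in one call. The sketch below runs on a small stand-in array rather than on mol.resname, so the names and counts are illustrative only; the assumption is that the same call works on the real array because it is a NumPy array of strings.

```python
import numpy as np

# Stand-in for mol.resname: one residue name per atom (values are illustrative).
resname = np.array(["ILE", "ILE", "VAL", "HOH", "BEN", "HOH", "CA"], dtype=object)

# np.unique returns the sorted unique names and, with return_counts=True,
# the number of atoms that carry each name.
names, counts = np.unique(resname, return_counts=True)
for name, count in zip(names, counts):
    print(name, count)
```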
Step 3 — Inspect per-atom properties
Every per-atom field on a Molecule is a NumPy array of length mol.numAtoms.
The arrays are indexed in parallel — mol.name[i], mol.resname[i],
mol.resid[i], and mol.chain[i] all describe the same atom.
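The parallel layout can be sketched with four small stand-in arrays. The values below are made up for illustration and are not read from 3PTB; the point is only that one index selects the matching entry in every field.

```python
import numpy as np

# Stand-ins for four per-atom fields; the values are made up for illustration.
name = np.array(["N", "CA", "C", "O"], dtype=object)
resname = np.array(["ILE", "ILE", "ILE", "HOH"], dtype=object)
resid = np.array([16, 16, 16, 401])
chain = np.array(["A", "A", "A", "A"], dtype=object)

# The same index i picks the entry for one atom from every field.
i = 1
atom = (name[i], resname[i], int(resid[i]), chain[i])
print(atom)
```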
mol.name[:8]
array(['N', 'CA', 'C', 'O', 'CB', 'CG1', 'CG2', 'CD1'], dtype=object)
Atom names as they appear in the source file (N, CA, C, O, … for
the protein backbone).
mol.element[:8]
array(['N', 'C', 'C', 'O', 'C', 'C', 'C', 'C'], dtype=object)
Element symbols per atom.
sorted(set(mol.chain)), sorted(set(mol.segid))
(['A'], ['1', '2', '3', '4'])
chain (one character, PDB convention) and segid (up to four characters,
MD topology convention) are both per-atom arrays. The BCIF (BinaryCIF)
fetch of 3PTB populates segid with ['1', '2', '3', '4'] for the deposited
entities; loading a plain PDB file typically leaves it empty. See Assign
segments and chains if you need to populate segid for an MD
parameterization tool.
mol.coords.shape
(1701, 3, 1)
Coordinates are stored as a single (numAtoms, 3, numFrames) array. This
structure has a single frame, so the third dimension is 1.
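The axis order matters as soon as you slice. Here is a minimal sketch on a stand-in array with the same (numAtoms, 3, numFrames) layout; the sizes and values are invented, and num_atoms and num_frames are local names, not moleculekit attributes.

```python
import numpy as np

num_atoms, num_frames = 4, 2

# Stand-in for mol.coords, laid out as (numAtoms, 3, numFrames).
coords = np.zeros((num_atoms, 3, num_frames), dtype=np.float32)
coords[:, :, 1] = 1.0  # pretend every atom sits at (1, 1, 1) in the second frame

first_frame = coords[:, :, 0]    # all atoms in frame 0: shape (num_atoms, 3)
atom0_track = coords[0, :, :]    # atom 0 across frames: shape (3, num_frames)
centroids = coords.mean(axis=0)  # one centroid per frame: shape (3, num_frames)

print(first_frame.shape, atom0_track.shape, centroids.shape)
```

Averaging over axis 0 collapses the atom dimension and keeps one x, y, z triple per frame.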
Because every field is a NumPy array, the usual NumPy operations work
directly — masking, slicing, np.unique, comparisons, and so on. See
The Molecule data model for the
full per-atom field list and their dtypes.
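As a sketch of what this looks like in practice, the example below combines comparisons on two fields into one boolean mask and uses it to index a third. The arrays are small stand-ins with invented values; on a loaded structure you would use mol.resname, mol.element, and mol.resid in their place.

```python
import numpy as np

# Stand-ins for three per-atom fields (illustrative values only).
resname = np.array(["ILE", "ILE", "HOH", "BEN", "HOH"], dtype=object)
element = np.array(["N", "C", "O", "C", "O"], dtype=object)
resid = np.array([16, 16, 401, 1, 402])

# Comparisons produce one boolean per atom; & and ~ combine them.
is_water = resname == "HOH"
carbon_non_water = (element == "C") & ~is_water

num_waters = int(is_water.sum())                   # count of True entries
carbon_resids = resid[carbon_non_water].tolist()   # mask one field with another
print(num_waters, carbon_resids)
```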
Step 4 — Filter waters
mol.filter("not water")
mol.numAtoms
moleculekit.molecule - INFO - Removed 62 atoms. 1639 atoms remaining in the molecule.
1639
filter() mutates mol in place,
keeping only the atoms that match the selection string. remove() is its
counterpart: it deletes the atoms that match the selection and keeps
the rest. Dropping the 62 crystallographic water oxygens leaves
1639 atoms (1701 − 62).
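The keep-versus-delete relationship can be sketched with boolean masks. The two helpers below are hypothetical and return new arrays, whereas the real filter() and remove() modify the molecule in place; the comparison with "HOH" is only a stand-in for the "water" selection.

```python
import numpy as np

# Stand-in for mol.resname (illustrative values only).
resname = np.array(["ILE", "ILE", "HOH", "BEN", "HOH", "CA"], dtype=object)
water = resname == "HOH"  # stand-in for the "water" selection


def keep_matching(values, mask):
    """Hypothetical analogue of filter(): keep entries where mask is True."""
    return values[mask]


def drop_matching(values, mask):
    """Hypothetical analogue of remove(): delete entries where mask is True."""
    return values[~mask]


after_filter = keep_matching(resname, ~water)  # mirrors filter("not water")
after_remove = drop_matching(resname, water)   # mirrors remove("water")

# Keeping "not water" and deleting "water" leave the same atoms behind.
print(np.array_equal(after_filter, after_remove))
print(len(resname) - int(water.sum()), "atoms remain")
```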
Step 5 — Write the prepared structure
mol.write("trypsin_dry.cif")
write() infers the format from the
file extension. The output is written to the current working directory
unless you pass a full path.
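To illustrate the idea of picking a format from the file name, here is a small sketch using pathlib. The helper format_from_extension is hypothetical and does not reproduce moleculekit's own dispatch; it only shows how an extension can be read from a path and how to build a full output path explicitly.

```python
from pathlib import Path


def format_from_extension(filename: str) -> str:
    """Hypothetical helper: return the lowercase extension without the dot."""
    return Path(filename).suffix.lstrip(".").lower()


# A bare file name is relative to the current working directory;
# joining it to a directory gives a full path instead.
out_path = Path.cwd() / "trypsin_dry.cif"

print(format_from_extension("trypsin_dry.cif"))
print(out_path.is_absolute())
```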
Recap
Load a structure by PDB ID: Molecule("3PTB") fetches it from RCSB automatically.
Inspect counts and contents with numAtoms, numFrames, and array attributes such as resname, chain, and segid.
filter() mutates the molecule in place to keep only the atoms you need; then write() saves it in any supported format.