How to read a structure#

Goal#

Load a molecular structure into a Molecule object from a local file or directly from the RCSB PDB by accession ID.

Minimal example#

from moleculekit.molecule import Molecule

# Fetch directly from RCSB
mol = Molecule("3PTB")

# Load from a local file
mol = Molecule("structure.pdb")

Parameters that matter#

The Molecule constructor signature is Molecule(filename=None, name=None, **kwargs). When filename is given, the constructor forwards **kwargs to read(), so any read-time options can be passed alongside the path.

Parameter

Type

Default

What it does

filename

str

None

Path to the file (any supported format) or a 4-character RCSB PDB ID. If None, an empty Molecule is created and you can call mol.read(...) later.

name

str

None

Optional display label; does not affect atom data.

type

str

None

File-type override ("pdb", "mol2", "prmtop", …). Normally inferred from the extension.

validateElements

bool

True

Raises if element symbols are missing or unrecognised; set to False for custom atom types.

keepaltloc

str

"A"

Which alternate-location indicator to keep when reading PDB/mmCIF.

Supported formats#

The reader picks a backend from the file extension. Trajectories are covered in How to read a trajectory; the formats below carry topology (atoms, residues, optional bonds) and in some cases a single frame of coordinates.

Extension

Format

Notes

pdb, ent

Protein Data Bank

Most common; coords + optional CONECT bonds.

pdb.gz

gzipped PDB

Transparently decompressed.

cif

mmCIF / PDBx

Modern PDB replacement; full bond / LINK info.

bcif

Binary mmCIF

Compact binary mmCIF.

mmtf

Macromolecular Transmission Format

Compact binary; deprecated upstream but still readable.

mol2

Tripos MOL2

Bonds + bond orders + atom types.

sdf

Structure-Data File

Small-molecule format with bond orders.

mae

Schrödinger Maestro

Bond orders + force-field atom types.

prmtop, prm, parm7

AMBER topology

Topology only; pair with a coordinate file.

psf

CHARMM / NAMD topology

Topology only.

top

GROMACS topology

Falls back to PRMTOP reader if not GROMACS-style.

gro

GROMACS structure

Via mdtraj.

xyz

Plain XYZ

Coords + elements only; no bonds.

pdbqt

AutoDock PDB

PDB with partial charges.

gjf

Gaussian input

Coords + elements.

rtf

CHARMM residue topology

Residue templates.

prepi

AMBER prepi

Residue templates.

crd, coor, inpcrd

Coordinate files

Coordinates only — load a topology first.

xtc, dcd, trr, netcdf/nc/ncdf, binpos, xsc

Trajectory

See How to read a trajectory.

h5, lh5, hdf5, arc, hoomdxml

mdtraj-supported

Requires the mdtraj package.

json

moleculekit JSON

Lossless round-trip of a Molecule.

alphafold

AlphaFold output

AlphaFold-style topology.

Special case: a bare 4-character string with no extension (e.g. Molecule("3PTB")) is interpreted as an RCSB PDB ID and fetched over the network. Set the LOCAL_PDB_REPO environment variable to point at a local mirror to avoid repeated downloads.

Common variations#

# Load an AMBER PRMTOP topology
mol = Molecule("topology.prmtop")

# Load a PSF topology, then read trajectory frames separately
mol = Molecule("topology.psf")
mol.read("trajectory.xtc")
# Two-step read: construct empty, then load
mol = Molecule()
mol.read("structure.mol2")

Gotchas#

  • PDB format has fixed column widths: residue names longer than 4 characters and atom names longer than 4 characters are silently truncated.

  • For PDB IDs that are not in the RCSB, Molecule raises a download error — check the spelling and network connectivity.

  • When validateElements=True (default) the reader raises on unknown element symbols; set validateElements=False if working with custom atom types.

See also#