Getting Started with Projections#

Here, we are showing some basic functionalities for analysing MD trajectories using Projection classes. First, let’s import HTMD:

from htmd.ui import *
from moleculekit.config import config
from matplotlib import pylab as plt

config(viewer='webgl')
2024-06-11 18:08:42,913 - numexpr.utils - INFO - Note: NumExpr detected 20 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-06-11 18:08:42,914 - numexpr.utils - INFO - NumExpr defaulting to 8 threads.
2024-06-11 18:08:43,027 - rdkit - INFO - Enabling RDKit 2022.09.1 jupyter extensions
Please cite HTMD: Doerr et al.(2016)JCTC,12,1845. https://dx.doi.org/10.1021/acs.jctc.6b00049
HTMD Documentation at: https://software.acellera.com/htmd/

You are on the latest HTMD version (2.3.28+7.g46a4ca83e.dirty).

Projecting simulations#

Get the data for this tutorial here. Alternatively, you can download the data using wget:

import os
assert os.system('wget -rcN -np -nH -q --cut-dirs=2 -R index.html* http://pub.htmd.org/tutorials/projections/data/') == 0

We will be doing two types of data projections: * Projecting single Molecule objects * Projecting lists of simulations

Projecting single Molecule objects#

One can load a previously simulated single MD trajectory (in this case, NTL9 protein) into a Molecule object and visualize it using:

mol = Molecule('data/ntl9_structure.psf')
mol.read('data/ntl9_trajectory.xtc')
mol.filter('protein')
mol.view()
2024-06-11 18:09:06,033 - moleculekit.molecule - INFO - Removed 17 atoms. 631 atoms remaining in the molecule.
NGLWidget(max_frame=9)

Getting the RMSD along time#

HTMD has many projection classes. One of those classes is MetricRmsd.

Using the MetricRmsd class, one may calculate the RMSD of a Molecule object against a reference structure (refmol, e.g. the crystal structure) for a given atom selection (trajrmsdstr):

crystal = Molecule('data/ntl9_crystal.pdb')
metr = MetricRmsd(refmol=crystal, trajrmsdstr='protein and name CA')
proj = metr.project(mol)
print(proj)
[16.360521  15.793981  14.122061  14.7720175 13.908776  12.928344
 11.541591  10.691926  10.778459  12.866117 ]

The values are in Angstroms. They are quite high values, because the sample trajectory is of the unfolded protein (see the visualization above).

Plot the RMSD along time#

The previously obtained RMSD values can be easily plotted using matplotlib:

time_range = mol.fstep*np.arange(len(proj))
plt.plot(time_range, proj)
_ = plt.xlabel('Time (ns)')
_ = plt.ylabel('RMSD to crystal')
../../_images/projections_17_0.png

Projecting a list of MD simulations#

With the arrival of Markov state models, the paradigm of running one single long simulation has shifted to running hundreds or thousands of shorter simulations.

With the analysis of a large amount of MD trajectories in mind, in HTMD, the simlist function can be used to bundle all the trajectories together:

sims = simlist(glob('data/1/filtered/*/'), 'data/1/filtered/filtered.pdb')
Creating simlist: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 2150.56it/s]

Use Metric class to project a list of simulations#

Using the Metric class, one can calculate all the projections of an entire simlist at once. In this case, the MetricDistance projection class is used, which calculates the matrix of distances between two selections (sel1 and sel2):

metr = Metric(sims)
metr.set(MetricDistance(sel1='protein and name CA', sel2='resname MOL and noh', periodic='selections', metric='distances'))
data = metr.project()
data.fstep = 0.1
Projecting trajectories: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 12.45it/s]
2024-06-11 18:09:18,666 - htmd.projections.metric - INFO - Frame step 0.001ns was read from the trajectories. If it looks wrong, redefine it by manually setting the MetricData.fstep property.

Here, the projection is the matrix of distances between each carbon alpha of the protein and each heavy atom of the MOL molecule (which in the case of these trajectories, it’s a ligand binding to the protein)

The Metric.project method returns a MetricData object which contains the trajectory and projection information:

print(data)
MetricData object with 3 trajectories of 60.0ns aggregate simulation time
Centers: None
K: None
N: None
description: <class 'pandas.core.frame.DataFrame'> at 0x78decc2d44f0
fstep: 0.1
parent: <class 'NoneType'> at 0x55ab3b8027e0
trajectories shape: (3,)

Working with MetricData objects#

The MetricData object contains the information on each trajectory, including the corresponding projection. For example, for simulation 2:

print(data.trajectories[2])
sim: simid = 2
projection: np.array(shape=(200, 2493))
reference: np.array(shape=(200, 2))
cluster: None

The projection data can be easily accessed as a whole in the MetricData.dat property, which is an array of size equal to the number of trajectories. We can check the projection data of simulation 14 with:

data.dat[2]  # equivalent to data.trajectories[2].projection
array([[46.903584, 48.459248, 48.862976, ..., 52.245136, 51.477478,
        52.081757],
       [42.60737 , 42.6003  , 43.467403, ..., 46.8291  , 47.731323,
        45.771404],
       [45.674603, 44.244358, 44.22837 , ..., 53.92176 , 54.58503 ,
        53.017155],
       ...,
       [37.174904, 38.622143, 39.415646, ..., 37.552174, 37.359898,
        36.812588],
       [41.236233, 42.631603, 43.36949 , ..., 41.442356, 41.54479 ,
        41.618164],
       [45.716003, 45.90263 , 46.41945 , ..., 44.92147 , 44.872147,
        44.011333]], dtype=float32)
data.dat[2].shape
(200, 2493)
bundledprojs = np.vstack(data.dat) # np.concatenate can be used instead as well
print(bundledprojs.shape)
(600, 2493)

Operations and calculations#

One can use numpy functions like np.min, np.max, np.average to do calculations over the projected data: * get the global minimum over all data

global_minimum = np.min(bundledprojs)
print(global_minimum)
3.2580273
  • get the maximum distance on each frame:

max_eachframe = np.max(bundledprojs,axis=1)
print(max_eachframe.shape)
max_eachframe
(600,)
array([54.39825 , 53.998096, 55.228672, 51.56171 , 47.534378, 48.122726,
       54.234592, 52.287495, 56.084633, 59.717773, 58.638866, 59.020008,
       58.540077, 61.697826, 62.762337, 64.88375 , 57.164936, 61.46919 ,
       61.785923, 51.648315, 50.93795 , 49.167656, 51.03158 , 49.82826 ,
       50.78649 , 53.217636, 53.40074 , 54.59295 , 54.21768 , 49.040817,
       49.194637, 51.379505, 53.340816, 52.982063, 53.229916, 52.800793,
       51.121067, 50.52078 , 50.51849 , 53.35154 , 54.840214, 50.428257,
       46.97517 , 48.814857, 46.214386, 47.91875 , 46.25085 , 46.454384,
       48.536522, 47.89751 , 48.45051 , 47.181534, 50.3011  , 47.37468 ,
       44.809082, 44.76846 , 44.396427, 49.58187 , 49.452393, 44.699467,
       49.425236, 49.30479 , 48.831734, 48.04636 , 46.891056, 46.507465,
       53.23867 , 52.209763, 51.905357, 52.760147, 53.15323 , 52.926044,
       54.556217, 55.611897, 52.046055, 51.92997 , 50.818188, 51.01256 ,
       49.769115, 49.608753, 50.520557, 52.815094, 51.304066, 49.99397 ,
       50.81137 , 51.18169 , 49.10415 , 48.43612 , 52.17957 , 52.73203 ,
       54.345753, 55.31595 , 55.39871 , 54.839466, 53.79522 , 53.284203,
       47.44944 , 47.25443 , 46.913074, 48.135517, 46.80235 , 44.899788,
       44.704002, 45.82361 , 46.498   , 46.07886 , 46.025085, 44.451492,
       43.400375, 44.925617, 42.78121 , 42.454422, 44.038147, 44.447018,
       43.457703, 44.942665, 48.482765, 50.187347, 51.073917, 49.428154,
       48.50524 , 53.109505, 53.980877, 56.261894, 58.69202 , 58.06872 ,
       58.765503, 60.141937, 56.34336 , 58.222004, 59.827217, 57.84348 ,
       57.62419 , 59.43873 , 63.62348 , 64.050095, 62.193985, 59.517525,
       58.832714, 61.67116 , 60.90755 , 58.634247, 60.395714, 60.144405,
       49.775978, 48.653896, 49.283756, 56.077488, 49.26251 , 49.198658,
       46.681004, 45.703247, 44.092865, 46.880257, 47.840847, 48.930866,
       49.26276 , 48.424484, 51.754177, 53.38187 , 50.517876, 50.88899 ,
       48.957874, 50.86398 , 49.52347 , 51.52822 , 48.15141 , 49.787586,
       51.295555, 51.341866, 51.10505 , 49.70342 , 49.799862, 51.583054,
       50.491463, 52.18646 , 51.79193 , 52.027943, 48.697643, 45.54545 ,
       47.359547, 49.45975 , 48.112698, 52.51411 , 55.656136, 56.812218,
       57.211945, 61.67162 , 60.73994 , 55.838486, 60.59322 , 52.707558,
       59.200516, 55.77005 , 49.959007, 52.27606 , 53.4685  , 50.195663,
       51.83982 , 46.972984, 58.02898 , 55.06521 , 56.92774 , 58.603672,
       63.162193, 65.49683 , 55.11986 , 53.408955, 56.06107 , 51.011993,
       54.8826  , 56.91488 , 58.846996, 59.789986, 60.955803, 62.344776,
       61.62976 , 64.16956 , 58.623585, 61.526814, 58.12016 , 59.657608,
       54.12708 , 51.00011 , 55.123035, 65.394104, 58.940674, 58.358627,
       55.95053 , 56.449535, 52.173878, 55.778507, 57.703293, 57.1664  ,
       51.501755, 48.076046, 50.35523 , 49.17984 , 47.47704 , 49.505035,
       51.633377, 49.627056, 46.418175, 47.777798, 47.576702, 46.133038,
       45.152557, 45.07051 , 47.970413, 48.24823 , 45.96343 , 45.115925,
       45.666515, 46.490425, 45.549976, 46.01706 , 45.453964, 45.515446,
       45.9844  , 47.13223 , 48.836105, 47.906868, 48.961525, 48.519596,
       48.76764 , 51.081234, 46.477844, 46.278347, 44.705566, 44.669548,
       44.54288 , 45.592293, 46.39919 , 47.019867, 45.584335, 45.207325,
       46.80933 , 47.135063, 45.75573 , 45.835392, 47.149216, 46.10952 ,
       47.89244 , 45.813976, 47.050255, 48.913574, 44.89156 , 47.54856 ,
       45.633278, 48.73144 , 46.68292 , 46.73932 , 46.027905, 45.327396,
       44.535587, 45.38553 , 47.58102 , 45.6692  , 45.515438, 43.811703,
       44.08468 , 43.837803, 43.91667 , 44.465282, 44.301987, 44.174892,
       46.657036, 48.582195, 47.261246, 50.408527, 49.713585, 47.64656 ,
       51.807453, 48.009163, 54.441814, 52.480442, 59.65425 , 58.270264,
       55.087585, 49.165928, 48.586834, 45.99378 , 48.621094, 49.504105,
       50.48702 , 51.915627, 50.50157 , 51.82167 , 53.708004, 56.92613 ,
       57.959114, 55.430756, 55.34785 , 49.523216, 48.99247 , 53.743862,
       58.072803, 54.772854, 55.80709 , 55.54166 , 58.58327 , 60.215855,
       57.643593, 58.334824, 58.690002, 58.64841 , 59.40299 , 58.352924,
       55.96094 , 56.374454, 56.63217 , 62.827335, 63.720257, 62.8327  ,
       65.87671 , 66.41088 , 64.80266 , 64.03413 , 67.14985 , 66.259865,
       66.994804, 66.42933 , 65.72711 , 67.06555 , 65.66709 , 67.08758 ,
       66.65906 , 65.61816 , 66.19741 , 67.06138 , 64.4134  , 57.828926,
       55.1021  , 56.112644, 53.194065, 53.170574, 51.061966, 49.55039 ,
       53.800117, 48.612778, 45.44221 , 47.68182 , 45.53881 , 44.353558,
       52.942116, 54.64368 , 53.352097, 57.189117, 50.115234, 54.437584,
       53.27835 , 51.984848, 45.55794 , 48.964207, 44.674267, 46.13538 ,
       48.80334 , 45.493282, 52.49222 , 49.369164, 58.97841 , 56.06166 ,
       59.335297, 54.887905, 54.104248, 51.059814, 48.11115 , 54.588306,
       53.761787, 53.98229 , 52.87779 , 54.162224, 59.98114 , 61.48215 ,
       60.23783 , 58.839363, 54.243195, 56.94387 , 55.312115, 59.235455,
       55.783768, 51.72166 , 48.023804, 49.09402 , 55.95665 , 52.98509 ,
       53.49044 , 55.68321 , 56.54712 , 54.37112 , 57.356087, 53.08747 ,
       56.73367 , 57.498043, 53.299854, 53.063465, 49.98735 , 47.216633,
       46.92734 , 57.599716, 57.037083, 55.949203, 57.110626, 65.150894,
       63.827003, 64.43011 , 63.66299 , 66.6226  , 65.895065, 63.64519 ,
       64.096016, 66.76584 , 64.84866 , 62.557846, 64.06469 , 62.43578 ,
       64.79868 , 61.621136, 67.271355, 66.31896 , 67.36595 , 63.282326,
       59.729267, 62.05453 , 58.38107 , 51.13966 , 53.981583, 48.66814 ,
       54.832336, 54.26662 , 52.33797 , 51.29319 , 48.323696, 49.479324,
       48.924763, 48.606667, 51.652077, 50.38462 , 47.688026, 48.65353 ,
       48.816105, 46.393612, 46.657574, 44.796963, 45.534355, 48.16918 ,
       48.30007 , 46.139935, 46.76868 , 45.927097, 45.686913, 45.812035,
       48.05082 , 45.105473, 44.635887, 45.34747 , 46.082073, 44.79861 ,
       45.04778 , 46.79814 , 46.277466, 46.9441  , 45.258972, 45.755707,
       45.79995 , 45.970932, 44.646935, 48.003036, 46.15737 , 46.15521 ,
       47.198174, 46.622158, 45.844547, 47.209805, 48.396137, 48.344986,
       44.753998, 48.834183, 49.35003 , 51.633755, 49.904186, 49.072594,
       49.796757, 50.38063 , 48.495564, 48.504204, 50.45795 , 49.709827,
       49.515602, 48.371593, 50.092323, 51.048428, 50.520065, 49.9667  ,
       52.122448, 49.67022 , 50.168163, 48.1803  , 49.34701 , 47.397907,
       54.972626, 55.23825 , 54.018005, 51.48479 , 48.516758, 47.862297,
       47.40332 , 58.441353, 51.811367, 52.852455, 59.88181 , 59.017944,
       54.427086, 56.746006, 56.548817, 59.612827, 59.378536, 58.37478 ,
       56.771206, 53.756557, 55.679623, 56.572567, 54.465122, 47.28183 ,
       49.553474, 49.845093, 50.886894, 50.45243 , 53.55668 , 51.049065,
       53.551304, 58.681187, 56.133167, 55.18882 , 56.56295 , 55.322258,
       54.262825, 56.165165, 57.029556, 66.84226 , 60.031006, 60.488953,
       61.970337, 55.85079 , 54.538193, 51.219833, 54.06704 , 53.40585 ,
       53.578785, 47.85516 , 53.82843 , 53.730743, 54.80984 , 55.104183,
       49.593353, 53.362053, 46.985603, 49.279034, 52.47066 , 48.883205],
      dtype=float32)
  • or get the average of each pair across all frames:

avg_acrossframes = np.average(bundledprojs,axis=0)
print(avg_acrossframes.shape)
avg_acrossframes
(2493,)
array([34.71057 , 34.667583, 34.678036, ..., 39.51156 , 39.480606,
       39.56731 ], dtype=float32)

The axis property is important to make the operation over the desired dimension:

  • axis=1, in this case, operates over all pairs, to give a result for each frame

  • axis=0, in this case, operates over all frames, to give a result for each pair

But how can we know to what particular atoms of the system corresponds a given indexed pair?

Playing with the projection mapping#

The mapping (meaning of each index) of the projection can be obtained from the MetricData.map property:

  • the mapping is a pandas DataFrame object

The head method of DataFrame can be used to print the top entries of the mapping:

data.map.head()
type atomIndexes description
0 distance [4, 4480] distance between GLU 1 CA and MOL 278 C3
1 distance [4, 4483] distance between GLU 1 CA and MOL 278 C2
2 distance [4, 4486] distance between GLU 1 CA and MOL 278 C1
3 distance [4, 4489] distance between GLU 1 CA and MOL 278 C6
4 distance [4, 4492] distance between GLU 1 CA and MOL 278 C5

One can assess particular elements of the mapping through:

  • iloc - i(nteger)loc(ation):

data.map.iloc[20:22]
type atomIndexes description
20 distance [29, 4486] distance between ASP 3 CA and MOL 278 C1
21 distance [29, 4489] distance between ASP 3 CA and MOL 278 C6
  • loc (using the DataFrame index):

data.map.loc[20]
type                                           distance
atomIndexes                                  [29, 4486]
description    distance between ASP 3 CA and MOL 278 C1
Name: 20, dtype: object
  • or direct assess to the array elements:

data.map['description'][20]
'distance between ASP 3 CA and MOL 278 C1'

One can use the mapping to find out to what particular pair of elements the minimum distance found in the trajectories corresponds to.

The np.where functionality gives us the array coordinates where a given condition is matched (in this case, when the distance is equal to the minimum):

print(global_minimum)
frame, index = np.where(bundledprojs == global_minimum)
print(index)
print(bundledprojs[frame, index]) # confirm that indeed these coordinates correspond the inquired value
3.2580273
[26]
[3.2580273]

Then, using the index in the mapping shown before:

data.map.iloc[index]
type atomIndexes description
26 distance [29, 4500] distance between ASP 3 CA and MOL 278 N2

gives us the atom indexes and description of the pair.