Getting Started with Projections#
Here, we are showing some basic functionalities for analysing MD
trajectories using Projection
classes. First, let’s import HTMD:
from htmd.ui import *
from moleculekit.config import config
from matplotlib import pylab as plt
config(viewer='webgl')
2024-06-11 18:08:42,913 - numexpr.utils - INFO - Note: NumExpr detected 20 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-06-11 18:08:42,914 - numexpr.utils - INFO - NumExpr defaulting to 8 threads.
2024-06-11 18:08:43,027 - rdkit - INFO - Enabling RDKit 2022.09.1 jupyter extensions
Please cite HTMD: Doerr et al.(2016)JCTC,12,1845. https://dx.doi.org/10.1021/acs.jctc.6b00049
HTMD Documentation at: https://software.acellera.com/htmd/
You are on the latest HTMD version (2.3.28+7.g46a4ca83e.dirty).
Projecting simulations#
Get the data for this tutorial
here.
Alternatively, you can download the data using wget
:
import os
assert os.system('wget -rcN -np -nH -q --cut-dirs=2 -R index.html* http://pub.htmd.org/tutorials/projections/data/') == 0
We will be doing two types of data projections: * Projecting single Molecule objects * Projecting lists of simulations
Projecting single Molecule objects#
One can load a previously simulated single MD trajectory (in this case,
NTL9 protein) into a Molecule
object and visualize it using:
mol = Molecule('data/ntl9_structure.psf')
mol.read('data/ntl9_trajectory.xtc')
mol.filter('protein')
mol.view()
2024-06-11 18:09:06,033 - moleculekit.molecule - INFO - Removed 17 atoms. 631 atoms remaining in the molecule.
NGLWidget(max_frame=9)
Getting the RMSD along time#
HTMD has many projection classes. One of those classes is
MetricRmsd
.
Using the MetricRmsd
class, one may calculate the RMSD of a
Molecule
object against a reference structure (refmol
, e.g. the
crystal structure) for a given atom selection (trajrmsdstr
):
crystal = Molecule('data/ntl9_crystal.pdb')
metr = MetricRmsd(refmol=crystal, trajrmsdstr='protein and name CA')
proj = metr.project(mol)
print(proj)
[16.360521 15.793981 14.122061 14.7720175 13.908776 12.928344
11.541591 10.691926 10.778459 12.866117 ]
The values are in Angstroms. They are quite high values, because the sample trajectory is of the unfolded protein (see the visualization above).
Plot the RMSD along time#
The previously obtained RMSD values can be easily plotted using matplotlib:
time_range = mol.fstep*np.arange(len(proj))
plt.plot(time_range, proj)
_ = plt.xlabel('Time (ns)')
_ = plt.ylabel('RMSD to crystal')
Projecting a list of MD simulations#
With the arrival of Markov state models, the paradigm of running one single long simulation has shifted to running hundreds or thousands of shorter simulations.
With the analysis of a large amount of MD trajectories in mind, in HTMD,
the simlist
function can be used to bundle all the trajectories
together:
sims = simlist(glob('data/1/filtered/*/'), 'data/1/filtered/filtered.pdb')
Creating simlist: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 2150.56it/s]
Use Metric class to project a list of simulations#
Using the Metric
class, one can calculate all the projections of an
entire simlist at once. In this case, the MetricDistance
projection
class is used, which calculates the matrix of distances between two
selections (sel1
and sel2
):
metr = Metric(sims)
metr.set(MetricDistance(sel1='protein and name CA', sel2='resname MOL and noh', periodic='selections', metric='distances'))
data = metr.project()
data.fstep = 0.1
Projecting trajectories: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 12.45it/s]
2024-06-11 18:09:18,666 - htmd.projections.metric - INFO - Frame step 0.001ns was read from the trajectories. If it looks wrong, redefine it by manually setting the MetricData.fstep property.
Here, the projection is the matrix of distances between each carbon alpha of the protein and each heavy atom of the MOL molecule (which in the case of these trajectories, it’s a ligand binding to the protein)
The Metric.project
method returns a MetricData
object which
contains the trajectory and projection information:
print(data)
MetricData object with 3 trajectories of 60.0ns aggregate simulation time
Centers: None
K: None
N: None
description: <class 'pandas.core.frame.DataFrame'> at 0x78decc2d44f0
fstep: 0.1
parent: <class 'NoneType'> at 0x55ab3b8027e0
trajectories shape: (3,)
Working with MetricData objects#
The MetricData
object contains the information on each trajectory,
including the corresponding projection. For example, for simulation 2:
print(data.trajectories[2])
sim: simid = 2
projection: np.array(shape=(200, 2493))
reference: np.array(shape=(200, 2))
cluster: None
The projection data can be easily accessed as a whole in the
MetricData.dat
property, which is an array of size equal to the
number of trajectories. We can check the projection data of simulation
14 with:
data.dat[2] # equivalent to data.trajectories[2].projection
array([[46.903584, 48.459248, 48.862976, ..., 52.245136, 51.477478,
52.081757],
[42.60737 , 42.6003 , 43.467403, ..., 46.8291 , 47.731323,
45.771404],
[45.674603, 44.244358, 44.22837 , ..., 53.92176 , 54.58503 ,
53.017155],
...,
[37.174904, 38.622143, 39.415646, ..., 37.552174, 37.359898,
36.812588],
[41.236233, 42.631603, 43.36949 , ..., 41.442356, 41.54479 ,
41.618164],
[45.716003, 45.90263 , 46.41945 , ..., 44.92147 , 44.872147,
44.011333]], dtype=float32)
data.dat[2].shape
(200, 2493)
bundledprojs = np.vstack(data.dat) # np.concatenate can be used instead as well
print(bundledprojs.shape)
(600, 2493)
Operations and calculations#
One can use numpy functions like np.min
, np.max
, np.average
to do calculations over the projected data: * get the global minimum
over all data
global_minimum = np.min(bundledprojs)
print(global_minimum)
3.2580273
get the maximum distance on each frame:
max_eachframe = np.max(bundledprojs,axis=1)
print(max_eachframe.shape)
max_eachframe
(600,)
array([54.39825 , 53.998096, 55.228672, 51.56171 , 47.534378, 48.122726,
54.234592, 52.287495, 56.084633, 59.717773, 58.638866, 59.020008,
58.540077, 61.697826, 62.762337, 64.88375 , 57.164936, 61.46919 ,
61.785923, 51.648315, 50.93795 , 49.167656, 51.03158 , 49.82826 ,
50.78649 , 53.217636, 53.40074 , 54.59295 , 54.21768 , 49.040817,
49.194637, 51.379505, 53.340816, 52.982063, 53.229916, 52.800793,
51.121067, 50.52078 , 50.51849 , 53.35154 , 54.840214, 50.428257,
46.97517 , 48.814857, 46.214386, 47.91875 , 46.25085 , 46.454384,
48.536522, 47.89751 , 48.45051 , 47.181534, 50.3011 , 47.37468 ,
44.809082, 44.76846 , 44.396427, 49.58187 , 49.452393, 44.699467,
49.425236, 49.30479 , 48.831734, 48.04636 , 46.891056, 46.507465,
53.23867 , 52.209763, 51.905357, 52.760147, 53.15323 , 52.926044,
54.556217, 55.611897, 52.046055, 51.92997 , 50.818188, 51.01256 ,
49.769115, 49.608753, 50.520557, 52.815094, 51.304066, 49.99397 ,
50.81137 , 51.18169 , 49.10415 , 48.43612 , 52.17957 , 52.73203 ,
54.345753, 55.31595 , 55.39871 , 54.839466, 53.79522 , 53.284203,
47.44944 , 47.25443 , 46.913074, 48.135517, 46.80235 , 44.899788,
44.704002, 45.82361 , 46.498 , 46.07886 , 46.025085, 44.451492,
43.400375, 44.925617, 42.78121 , 42.454422, 44.038147, 44.447018,
43.457703, 44.942665, 48.482765, 50.187347, 51.073917, 49.428154,
48.50524 , 53.109505, 53.980877, 56.261894, 58.69202 , 58.06872 ,
58.765503, 60.141937, 56.34336 , 58.222004, 59.827217, 57.84348 ,
57.62419 , 59.43873 , 63.62348 , 64.050095, 62.193985, 59.517525,
58.832714, 61.67116 , 60.90755 , 58.634247, 60.395714, 60.144405,
49.775978, 48.653896, 49.283756, 56.077488, 49.26251 , 49.198658,
46.681004, 45.703247, 44.092865, 46.880257, 47.840847, 48.930866,
49.26276 , 48.424484, 51.754177, 53.38187 , 50.517876, 50.88899 ,
48.957874, 50.86398 , 49.52347 , 51.52822 , 48.15141 , 49.787586,
51.295555, 51.341866, 51.10505 , 49.70342 , 49.799862, 51.583054,
50.491463, 52.18646 , 51.79193 , 52.027943, 48.697643, 45.54545 ,
47.359547, 49.45975 , 48.112698, 52.51411 , 55.656136, 56.812218,
57.211945, 61.67162 , 60.73994 , 55.838486, 60.59322 , 52.707558,
59.200516, 55.77005 , 49.959007, 52.27606 , 53.4685 , 50.195663,
51.83982 , 46.972984, 58.02898 , 55.06521 , 56.92774 , 58.603672,
63.162193, 65.49683 , 55.11986 , 53.408955, 56.06107 , 51.011993,
54.8826 , 56.91488 , 58.846996, 59.789986, 60.955803, 62.344776,
61.62976 , 64.16956 , 58.623585, 61.526814, 58.12016 , 59.657608,
54.12708 , 51.00011 , 55.123035, 65.394104, 58.940674, 58.358627,
55.95053 , 56.449535, 52.173878, 55.778507, 57.703293, 57.1664 ,
51.501755, 48.076046, 50.35523 , 49.17984 , 47.47704 , 49.505035,
51.633377, 49.627056, 46.418175, 47.777798, 47.576702, 46.133038,
45.152557, 45.07051 , 47.970413, 48.24823 , 45.96343 , 45.115925,
45.666515, 46.490425, 45.549976, 46.01706 , 45.453964, 45.515446,
45.9844 , 47.13223 , 48.836105, 47.906868, 48.961525, 48.519596,
48.76764 , 51.081234, 46.477844, 46.278347, 44.705566, 44.669548,
44.54288 , 45.592293, 46.39919 , 47.019867, 45.584335, 45.207325,
46.80933 , 47.135063, 45.75573 , 45.835392, 47.149216, 46.10952 ,
47.89244 , 45.813976, 47.050255, 48.913574, 44.89156 , 47.54856 ,
45.633278, 48.73144 , 46.68292 , 46.73932 , 46.027905, 45.327396,
44.535587, 45.38553 , 47.58102 , 45.6692 , 45.515438, 43.811703,
44.08468 , 43.837803, 43.91667 , 44.465282, 44.301987, 44.174892,
46.657036, 48.582195, 47.261246, 50.408527, 49.713585, 47.64656 ,
51.807453, 48.009163, 54.441814, 52.480442, 59.65425 , 58.270264,
55.087585, 49.165928, 48.586834, 45.99378 , 48.621094, 49.504105,
50.48702 , 51.915627, 50.50157 , 51.82167 , 53.708004, 56.92613 ,
57.959114, 55.430756, 55.34785 , 49.523216, 48.99247 , 53.743862,
58.072803, 54.772854, 55.80709 , 55.54166 , 58.58327 , 60.215855,
57.643593, 58.334824, 58.690002, 58.64841 , 59.40299 , 58.352924,
55.96094 , 56.374454, 56.63217 , 62.827335, 63.720257, 62.8327 ,
65.87671 , 66.41088 , 64.80266 , 64.03413 , 67.14985 , 66.259865,
66.994804, 66.42933 , 65.72711 , 67.06555 , 65.66709 , 67.08758 ,
66.65906 , 65.61816 , 66.19741 , 67.06138 , 64.4134 , 57.828926,
55.1021 , 56.112644, 53.194065, 53.170574, 51.061966, 49.55039 ,
53.800117, 48.612778, 45.44221 , 47.68182 , 45.53881 , 44.353558,
52.942116, 54.64368 , 53.352097, 57.189117, 50.115234, 54.437584,
53.27835 , 51.984848, 45.55794 , 48.964207, 44.674267, 46.13538 ,
48.80334 , 45.493282, 52.49222 , 49.369164, 58.97841 , 56.06166 ,
59.335297, 54.887905, 54.104248, 51.059814, 48.11115 , 54.588306,
53.761787, 53.98229 , 52.87779 , 54.162224, 59.98114 , 61.48215 ,
60.23783 , 58.839363, 54.243195, 56.94387 , 55.312115, 59.235455,
55.783768, 51.72166 , 48.023804, 49.09402 , 55.95665 , 52.98509 ,
53.49044 , 55.68321 , 56.54712 , 54.37112 , 57.356087, 53.08747 ,
56.73367 , 57.498043, 53.299854, 53.063465, 49.98735 , 47.216633,
46.92734 , 57.599716, 57.037083, 55.949203, 57.110626, 65.150894,
63.827003, 64.43011 , 63.66299 , 66.6226 , 65.895065, 63.64519 ,
64.096016, 66.76584 , 64.84866 , 62.557846, 64.06469 , 62.43578 ,
64.79868 , 61.621136, 67.271355, 66.31896 , 67.36595 , 63.282326,
59.729267, 62.05453 , 58.38107 , 51.13966 , 53.981583, 48.66814 ,
54.832336, 54.26662 , 52.33797 , 51.29319 , 48.323696, 49.479324,
48.924763, 48.606667, 51.652077, 50.38462 , 47.688026, 48.65353 ,
48.816105, 46.393612, 46.657574, 44.796963, 45.534355, 48.16918 ,
48.30007 , 46.139935, 46.76868 , 45.927097, 45.686913, 45.812035,
48.05082 , 45.105473, 44.635887, 45.34747 , 46.082073, 44.79861 ,
45.04778 , 46.79814 , 46.277466, 46.9441 , 45.258972, 45.755707,
45.79995 , 45.970932, 44.646935, 48.003036, 46.15737 , 46.15521 ,
47.198174, 46.622158, 45.844547, 47.209805, 48.396137, 48.344986,
44.753998, 48.834183, 49.35003 , 51.633755, 49.904186, 49.072594,
49.796757, 50.38063 , 48.495564, 48.504204, 50.45795 , 49.709827,
49.515602, 48.371593, 50.092323, 51.048428, 50.520065, 49.9667 ,
52.122448, 49.67022 , 50.168163, 48.1803 , 49.34701 , 47.397907,
54.972626, 55.23825 , 54.018005, 51.48479 , 48.516758, 47.862297,
47.40332 , 58.441353, 51.811367, 52.852455, 59.88181 , 59.017944,
54.427086, 56.746006, 56.548817, 59.612827, 59.378536, 58.37478 ,
56.771206, 53.756557, 55.679623, 56.572567, 54.465122, 47.28183 ,
49.553474, 49.845093, 50.886894, 50.45243 , 53.55668 , 51.049065,
53.551304, 58.681187, 56.133167, 55.18882 , 56.56295 , 55.322258,
54.262825, 56.165165, 57.029556, 66.84226 , 60.031006, 60.488953,
61.970337, 55.85079 , 54.538193, 51.219833, 54.06704 , 53.40585 ,
53.578785, 47.85516 , 53.82843 , 53.730743, 54.80984 , 55.104183,
49.593353, 53.362053, 46.985603, 49.279034, 52.47066 , 48.883205],
dtype=float32)
or get the average of each pair across all frames:
avg_acrossframes = np.average(bundledprojs,axis=0)
print(avg_acrossframes.shape)
avg_acrossframes
(2493,)
array([34.71057 , 34.667583, 34.678036, ..., 39.51156 , 39.480606,
39.56731 ], dtype=float32)
The axis
property is important to make the operation over the
desired dimension:
axis=1
, in this case, operates over all pairs, to give a result for each frameaxis=0
, in this case, operates over all frames, to give a result for each pair
But how can we know to what particular atoms of the system corresponds a given indexed pair?
Playing with the projection mapping#
The mapping (meaning of each index) of the projection can be obtained
from the MetricData.map
property:
the mapping is a pandas
DataFrame
object
The head
method of DataFrame
can be used to print the top
entries of the mapping:
data.map.head()
type | atomIndexes | description | |
---|---|---|---|
0 | distance | [4, 4480] | distance between GLU 1 CA and MOL 278 C3 |
1 | distance | [4, 4483] | distance between GLU 1 CA and MOL 278 C2 |
2 | distance | [4, 4486] | distance between GLU 1 CA and MOL 278 C1 |
3 | distance | [4, 4489] | distance between GLU 1 CA and MOL 278 C6 |
4 | distance | [4, 4492] | distance between GLU 1 CA and MOL 278 C5 |
One can assess particular elements of the mapping through:
iloc
- i(nteger)loc(ation):
data.map.iloc[20:22]
type | atomIndexes | description | |
---|---|---|---|
20 | distance | [29, 4486] | distance between ASP 3 CA and MOL 278 C1 |
21 | distance | [29, 4489] | distance between ASP 3 CA and MOL 278 C6 |
loc
(using the DataFrame index):
data.map.loc[20]
type distance
atomIndexes [29, 4486]
description distance between ASP 3 CA and MOL 278 C1
Name: 20, dtype: object
or direct assess to the array elements:
data.map['description'][20]
'distance between ASP 3 CA and MOL 278 C1'
One can use the mapping to find out to what particular pair of elements the minimum distance found in the trajectories corresponds to.
The np.where
functionality gives us the array coordinates where a
given condition is matched (in this case, when the distance is equal to
the minimum):
print(global_minimum)
frame, index = np.where(bundledprojs == global_minimum)
print(index)
print(bundledprojs[frame, index]) # confirm that indeed these coordinates correspond the inquired value
3.2580273
[26]
[3.2580273]
Then, using the index in the mapping shown before:
data.map.iloc[index]
type | atomIndexes | description | |
---|---|---|---|
26 | distance | [29, 4500] | distance between ASP 3 CA and MOL 278 N2 |
gives us the atom indexes and description of the pair.