DataCenter tutorial#
by Stefan Doerr
Purpose#
The DataCenter is a central concept of PlayMolecule: it is where all PlayMolecule data is stored. It is implemented as a MinIO object store on the PlayMolecule server. It stores:
Data uploaded by the user manually
Data related to jobs (input, output, intermediate results)
Data related to apps (machine learning models, default training datasets, etc.)
The DataCenter also allows data to be passed between applications without transferring it to the user’s computer, which reduces network traffic, increases speed, avoids data duplication and lowers the storage requirements of client machines.
The DataCenter also provides the Dataset class, a Python object for retrieving information about a given dataset stored in the DataCenter.
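To give a feel for the information a Dataset carries, here is a hypothetical, simplified stand-in built only from the metadata fields shown later in this tutorial; the real class lives in playmolecule.datacenter and has more fields and behaviour:

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stand-in for playmolecule.datacenter.Dataset.
# Field names mirror the metadata printed later in this tutorial.
@dataclass
class DatasetSketch:
    id: int                                     # numeric DataCenter identifier
    filepath: str                               # remote path inside the DataCenter
    files: list = field(default_factory=list)   # file names contained in the dataset
    filesize: int = 0                           # total size in bytes
    public: bool = False                        # visible to all users or only the owner
    comments: str = ""                          # free-text description

ds = DatasetSketch(id=46, filepath="DeepSite/models/default",
                   files=["model_98acc.ckpt"], filesize=3712571, public=True)
print(f"dc://{ds.id}", ds.files)  # datasets are referenced as dc://<id>
```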
Using the DataCenter#
To use the DataCenter we need to connect it to a PlayMolecule Session object:
from playmolecule import *
sess = Session("liaihs9a8dfyasodbna0as8d")
dc = DataCenter(sess)
2022-09-29 10:14:32,910 - playmolecule.config - INFO - Reading PM API configuration from /home/sdoerr/Work/playmolecule/playmolecule-python-api/playmolecule/config.ini
The main function for searching for datasets in the DataCenter is get_datasets. Remember to narrow the search via the method’s filter arguments; otherwise, listing the whole DataCenter can take a long time due to the number of files stored in it.
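Conceptually, the startswith argument acts as a prefix filter over the remote paths, much like listing a directory subtree. A local sketch of that behaviour (an illustration only; the actual filtering happens server-side):

```python
# Hypothetical sketch of prefix filtering over remote paths,
# mimicking what get_datasets(startswith=...) does on the server.
remote_paths = [
    "DeepSite/models/default",
    "DeepSite/models/deepsite",
    "Kdeep/models/default",
]

def filter_startswith(paths, prefix):
    """Keep only the paths that begin with the given prefix."""
    return [p for p in paths if p.startswith(prefix)]

print(filter_startswith(remote_paths, "DeepSite/models/"))
# → ['DeepSite/models/default', 'DeepSite/models/deepsite']
```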
dc.get_datasets(startswith="DeepSite/models/")
ID RemotePath Comments Public FileSize ExecID DateCreated DateUpdated
46 DeepSite/models/default Deepsite final datas True 3.5MiB 2022-05-03 13:30:23 2022-05-03 13:30:23
47 DeepSite/models/deepsite Deepsite final datas True 3.5MiB 2022-05-03 13:30:23 2022-05-03 13:30:23
Here we have listed the pre-trained models available for the DeepSite application. To get the results of a given execution we can supply the execution ID.
execid = "bb4a899b-6942-4da0-b952-7bdc75957d5d"
dc.get_datasets(execid=execid)
ID RemotePath Comments Public FileSize ExecID DateCreated DateUpdated
1255 bb4a899b-6942-4da0-b952-7bdc75957d5d/config PDB: 3ptb Chain: A p False 528.0B bb4a899b-6942-4da0-b952-7bdc75957d5d 2022-09-12 16:45:42 2022-09-29 10:20:32
1256 bb4a899b-6942-4da0-b952-7bdc75957d5d/input PDB: 3ptb Chain: A p False 40.1KiB bb4a899b-6942-4da0-b952-7bdc75957d5d 2022-09-12 16:45:42 2022-09-12 16:46:03
1257 ProteinPrepare/output/BB4A899B PDB: 3ptb Chain: A p False 58.8KiB bb4a899b-6942-4da0-b952-7bdc75957d5d 2022-09-12 16:46:02 2022-09-12 16:46:02
1258 bb4a899b-6942-4da0-b952-7bdc75957d5d/output PDB: 3ptb Chain: A p False 503.2KiB bb4a899b-6942-4da0-b952-7bdc75957d5d 2022-09-12 16:46:03 2022-09-29 10:31:26
Using Datasets#
To use the Datasets we can ask get_datasets to return them as Dataset objects:
datasets = dc.get_datasets(startswith="DeepSite/models/default", returnobj=True)
default_ds = datasets[0]
default_ds
<playmolecule.datacenter.Dataset object at 0x7fc386d923e0>
Dataset: dc://46
bucket: admin-uploads
comments: Deepsite final dataset
created_at: 2022-05-03 10:30:23+00:00
execid:
filepath: DeepSite/models/default
files: ['model_98acc.ckpt']
filesize: 3712571
filetype: application/x-gzip
id: 46
num_downloads: 0
object: 55423c39-05fd-44d8-97d5-a074b39d5008
public: True
status: 0
updated_at: 2022-05-03 10:30:23+00:00
userid: admin
validated: False
Dataset objects have their own set of methods, which you can look up in the documentation. One of the most useful is list_files, which lists the files contained in the dataset.
default_ds.list_files() # This dataset only contains one file
['model_98acc.ckpt']
output_ds = dc.get_datasets(execid=execid, tags="type:output", returnobj=True)[0] # Take the first result
output_ds.list_files() # Here is an example with multiple files
['details.csv',
'inopt.yml',
'log.txt',
'output.pdb',
'pka_plot.png',
'pmwsjob.done',
'web_content.pickle']
If we want to download a dataset locally we can use the download method:
output_ds.download("/tmp/")
2022-09-29 10:31:28,153 - playmolecule.datacenter - INFO - Dataset 1258 was downloaded successfully to /tmp.
'/tmp'
But the magic of the DataCenter is that we don’t need to download Datasets to use them in apps. We can tell the PlayMolecule server to send the data directly from the DataCenter to another application. As an example, here we take the output.pdb file of a previous job and pass it to the ProteinPrepare application. We can either pass whole datasets to apps which support that, or specific files of a dataset. To select one or more files of a dataset we index the object with brackets [] and the name of the file we want to pass. The bracket format also accepts regular expressions for selecting files. For example:
import re
print(output_ds["output.pdb"].files)
print(output_ds[re.compile("outp*")].files)
print(output_ds[re.compile(".*.png")].files)
print(output_ds[re.compile("p.*")].files)
2022-09-29 10:36:39,484 - playmolecule.datacenter - INFO - Regular expression matched files: ['output.pdb']
2022-09-29 10:36:39,486 - playmolecule.datacenter - INFO - Regular expression matched files: ['pka_plot.png']
2022-09-29 10:36:39,487 - playmolecule.datacenter - INFO - Regular expression matched files: ['pka_plot.png', 'pmwsjob.done']
['output.pdb']
['output.pdb']
['pka_plot.png']
['pka_plot.png', 'pmwsjob.done']
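The matches logged above are consistent with each pattern being applied at the start of every filename, i.e. re.Pattern.match semantics. A local sketch of that selection logic (an assumption about the behaviour, not the library’s actual code):

```python
import re

# Files of the example output dataset shown earlier in this tutorial
files = ['details.csv', 'inopt.yml', 'log.txt', 'output.pdb',
         'pka_plot.png', 'pmwsjob.done', 'web_content.pickle']

def select(files, key):
    """Select files by exact name, or by a regex matched at the start of the name."""
    if isinstance(key, re.Pattern):
        return [f for f in files if key.match(f)]
    return [f for f in files if f == key]

print(select(files, "output.pdb"))          # → ['output.pdb']
print(select(files, re.compile("outp*")))   # → ['output.pdb']
print(select(files, re.compile(".*.png")))  # → ['pka_plot.png']
print(select(files, re.compile("p.*")))     # → ['pka_plot.png', 'pmwsjob.done']
```

Note that the bare dot in ".*.png" matches any character; the pattern still selects only pka_plot.png here because no other filename contains "png".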
job = sess.start_app("ProteinPrepare")
job.pdbfile = output_ds["output.pdb"] # Pass just the PDB file as input to ProteinPrepare
job.submit()
This allows us to efficiently pass data from one application to another in PlayMolecule.
DeepSite example#
The DeepSite application requires a model to run; it cannot be used without one. The default DeepSite models are stored inside PlayMolecule. describe shows that the first argument is the model:
job = sess.start_app("DeepSite")
job.describe()
Name Type Mandatory Value Description
model file True The path to a model checkpoint.
appname string False [DEV]: Allows overriding the application name. Use with caution.
no_parallelism bool False False Use it to disable parallelism inside the app (default: False)
outdir string False . The output folder where to write the results
pdbfile file False .pdb file
pdbid string False PDB id.
scratchdir string False scratch The scratch folder where to write the temporary data
Thus, to use DeepSite we first need to get the Dataset object of a model and then pass it as an argument to the app:
model_ds = dc.get_datasets(startswith="DeepSite/models/default", returnobj=True)[0]
print(model_ds)
dc://46
job.model = model_ds # Pass the model to the app
job.pdbid = "3ptb"
job.submit()
Kdeep example#
Kdeep can similarly accept either pre-trained models or models produced by KdeepTrainer. The KdeepTrainer application can also accept existing training sets located in the DataCenter. Let’s find a KdeepTrainer model to use in Kdeep.
dc.get_datasets(tags="app:kdeeptrainer:output")
ID RemotePath Comments Public FileSize ExecID DateCreated DateUpdated
394 19aeb528-a682-48c3-97ee-495fb5feeadc/output test1 False 2.6MiB 19aeb528-a682-48c3-97ee-495fb5feeadc 2022-05-23 11:22:12 2022-05-23 11:25:59
401 334f50f8-e9d9-4192-979e-066262fc179d/output test2 False 2.6MiB 334f50f8-e9d9-4192-979e-066262fc179d 2022-05-23 11:40:40 2022-05-23 11:51:36
408 289e041c-4e86-402f-a597-58c2e6f4f9ad/output test3 False 2.5MiB 289e041c-4e86-402f-a597-58c2e6f4f9ad 2022-05-23 11:52:01 2022-05-23 12:20:50
424 bddb0a98-d7cd-4f12-b702-00ec3253039f/output teststef1 False 2.5MiB bddb0a98-d7cd-4f12-b702-00ec3253039f 2022-05-23 15:00:11 2022-05-23 15:00:13
428 18a5c95a-77d6-4b15-a682-c9b77f0771d2/output teststef1 False 2.5MiB 18a5c95a-77d6-4b15-a682-c9b77f0771d2 2022-05-23 15:10:43 2022-05-23 15:10:46
922 0645e98f-48af-432c-8387-5329b80a0c1a/output False 2.5MiB 0645e98f-48af-432c-8387-5329b80a0c1a 2022-08-02 14:07:03 2022-08-02 14:15:05
931 3afdb586-96d8-4268-8e4a-896dd3e56140/output False 2.5MiB 3afdb586-96d8-4268-8e4a-896dd3e56140 2022-08-02 14:15:15 2022-08-02 14:15:15
936 28b70a5c-afa3-441d-a47a-d6c01c4a5656/output False 2.5MiB 28b70a5c-afa3-441d-a47a-d6c01c4a5656 2022-08-02 14:19:33 2022-08-02 14:20:10
943 3ade5f42-13b2-4492-bad5-5f061d1aff83/output False 2.5MiB 3ade5f42-13b2-4492-bad5-5f061d1aff83 2022-08-02 14:30:27 2022-08-02 14:30:27
947 20b6eeee-bc5a-499d-ad1a-794b8aebb474/output False 2.5MiB 20b6eeee-bc5a-499d-ad1a-794b8aebb474 2022-08-02 14:44:36 2022-08-02 14:44:54
1263 73be14f5-2077-4c42-b9a8-62a00ba89389/output False 1.2KiB 73be14f5-2077-4c42-b9a8-62a00ba89389 2022-09-23 13:40:40 2022-09-23 13:40:40
1264 b873d56f-e125-4d06-9b7b-26c8b9c17d41/output False 1.2KiB b873d56f-e125-4d06-9b7b-26c8b9c17d41 2022-09-23 13:40:42 2022-09-23 13:40:42
1268 285aacd5-d646-4763-9fb0-436e7b863d52/output False 2.5MiB 285aacd5-d646-4763-9fb0-436e7b863d52 2022-09-23 13:49:02 2022-09-23 14:08:56
1281 4df43a8f-48af-4537-8f75-6747f4d5df25/output False 2.5MiB 4df43a8f-48af-4537-8f75-6747f4d5df25 2022-09-23 14:25:33 2022-09-23 15:16:43
Let’s use the most recent model:
kdtrainer_model_ds = dc.get_datasets(tags="app:kdeeptrainer:output", returnobj=True)[-1]
kdtrainer_model_ds.list_files()
['best_model.ckpt',
'inopt.yml',
'log.txt',
'metrics/',
'metrics/metrics_by_epoch.npy',
'metrics/training_labels.pickle',
'pmwsjob.done',
'training_labels.pickle',
'web_content.pickle']
job = sess.start_app("Kdeep")
job.modelfile = kdtrainer_model_ds["best_model.ckpt"]
job.pdb = "./protein.pdb"
job.sdf = "./ligands.sdf"
job.submit()
Or, of course, we could use a model provided by Acellera:
dc.get_datasets(startswith="Kdeep/models") # List the default existing models (provided by Acellera)
ID RemotePath Comments Public FileSize ExecID DateCreated DateUpdated
513 Kdeep/models/default ACEBind 2020 model e True 20.4MiB 2022-06-02 16:28:07 2022-06-03 10:37:28
514 Kdeep/models/PDBBind2016 PDBBind 2016 model e True 20.3MiB 2022-06-02 16:28:08 2022-06-02 16:28:08
515 Kdeep/models/ACEBind2016_dl ACEBind 2016 model e True 20.3MiB 2022-06-02 16:28:10 2022-06-02 16:28:10
516 Kdeep/models/ACEBind2020_dl ACEBind 2020 model e True 20.4MiB 2022-06-02 16:28:11 2022-06-02 16:28:11
517 Kdeep/models/ACEBind2020_dl_reduced ACEBind 2020 model e True 14.5MiB 2022-06-02 16:28:13 2022-06-02 16:28:13
518 Kdeep/models/BindScope BindScope model ense True 12.2MiB 2022-06-02 16:28:14 2022-06-03 10:37:29
Similarly, for KdeepTrainer we can list all training sets provided by Acellera:
dc.get_datasets(startswith="KdeepTrainer/datasets")
ID RemotePath Comments Public FileSize ExecID DateCreated DateUpdated
1238 KdeepTrainer/datasets/ACEBind2022 ACEBind2022 training True 754.5MiB 2022-08-03 16:43:46 2022-08-03 16:43:46
1239 KdeepTrainer/datasets/ACEBind2020_dl ACEBind2020 training True 57.6MiB 2022-08-03 16:43:53 2022-08-03 16:43:53
1240 KdeepTrainer/datasets/ACEBind2016_dl ACEBind2016 training True 40.7MiB 2022-08-03 16:43:57 2022-08-03 16:43:57
1241 KdeepTrainer/datasets/PDBBind2016 PDBBind2016 training True 258.0B 2022-08-03 16:43:57 2022-08-03 16:43:57
Uploading your own files into the DataCenter#
The DataCenter is not only for storing application-generated files. Users can upload their own data, which they can then re-use in apps as often as they want without re-uploading it to the server every time.
ds_id = dc.upload_dataset(localpath="./3ptb.pdb", remotepath="/my-own-data/3ptb.pdb")
2022-09-29 10:56:31,899 - playmolecule.datacenter - INFO - File ./3ptb.pdb was uploaded successfully.
dataset = dc.get_datasets(datasetid=ds_id, returnobj=True)[0]
dataset # Now we can see that the file exists on the DataCenter
<playmolecule.datacenter.Dataset object at 0x7fc386d18130>
Dataset: dc://1303
bucket: admin-uploads
comments:
created_at: 2022-09-29 07:56:32+00:00
execid:
filepath: /my-own-data/3ptb.pdb
files: ['3ptb.pdb']
filesize: 31247
filetype: application/x-gzip
id: 1303
num_downloads: 0
object: 45e112a4-e06e-4722-b34c-4f4ce90e4f6b
public: False
status: 0
updated_at: 2022-09-29 07:56:32+00:00
userid: admin
validated: False
We can now pass this dataset as input to any application which accepts PDB files:
job = sess.start_app("ProteinPrepare")
job.pdbfile = dataset
job.submit()
With this we conclude the DataCenter tutorial. For any further questions feel free to contact info@acellera.com.