DataCenter tutorial#
by Stefan Doerr
Purpose#
The DataCenter is a central concept of PlayMolecule: it is where all PlayMolecule data is stored. It is implemented as a MinIO object store on the PlayMolecule server. It stores:
Data uploaded by the user manually
Data related to jobs (input, output, intermediate results)
Data related to apps (machine learning models, default training datasets, etc.)
The DataCenter also allows data to be passed between applications without transferring it to the user’s computer, which reduces network traffic, increases speed, avoids data duplication and lowers the storage requirements of client machines.
The DataCenter also provides the Dataset class, a Python object for retrieving information about a given dataset stored in the DataCenter.
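To give a feel for the information a Dataset carries, here is a hypothetical, simplified stand-in built only from the metadata fields shown later in this tutorial; the real class lives in playmolecule.datacenter and has more fields and behaviour:

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stand-in for playmolecule.datacenter.Dataset.
# Field names mirror the metadata printed later in this tutorial.
@dataclass
class DatasetSketch:
    id: int                                     # numeric DataCenter identifier
    filepath: str                               # remote path inside the DataCenter
    files: list = field(default_factory=list)   # file names contained in the dataset
    filesize: int = 0                           # total size in bytes
    public: bool = False                        # visible to all users or only the owner
    comments: str = ""                          # free-text description

ds = DatasetSketch(id=46, filepath="DeepSite/models/default",
                   files=["model_98acc.ckpt"], filesize=3712571, public=True)
print(f"dc://{ds.id}", ds.files)  # datasets are referenced as dc://<id>
```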
Using the DataCenter#
To use the DataCenter we need to connect it to a PlayMolecule Session object:
from playmolecule import *
sess = Session("liaihs9a8dfyasodbna0as8d")
dc = DataCenter(sess)
2022-09-29 10:14:32,910 - playmolecule.config - INFO - Reading PM API configuration from /home/sdoerr/Work/playmolecule/playmolecule-python-api/playmolecule/config.ini
The main function for searching for datasets in the DataCenter is get_datasets. Remember to narrow the search via the method’s filter arguments; otherwise, listing the whole DataCenter can take a long time due to the number of files stored in it.
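Conceptually, the startswith argument acts as a prefix filter over the remote paths, much like listing a directory subtree. A local sketch of that behaviour (an illustration only; the actual filtering happens server-side):

```python
# Hypothetical sketch of prefix filtering over remote paths,
# mimicking what get_datasets(startswith=...) does on the server.
remote_paths = [
    "DeepSite/models/default",
    "DeepSite/models/deepsite",
    "Kdeep/models/default",
]

def filter_startswith(paths, prefix):
    """Keep only the paths that begin with the given prefix."""
    return [p for p in paths if p.startswith(prefix)]

print(filter_startswith(remote_paths, "DeepSite/models/"))
# → ['DeepSite/models/default', 'DeepSite/models/deepsite']
```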
dc.get_datasets(startswith="DeepSite/models/")
ID RemotePath Comments Public FileSize ExecID DateCreated DateUpdated
46 DeepSite/models/default Deepsite final datas True 3.5MiB 2022-05-03 13:30:23 2022-05-03 13:30:23
47 DeepSite/models/deepsite Deepsite final datas True 3.5MiB 2022-05-03 13:30:23 2022-05-03 13:30:23
Here we have listed the pre-trained models available for the DeepSite application. To get the results of a given execution we can supply the execution ID.
execid = "bb4a899b-6942-4da0-b952-7bdc75957d5d"
dc.get_datasets(execid=execid)
ID RemotePath Comments Public FileSize ExecID DateCreated DateUpdated
1255 bb4a899b-6942-4da0-b952-7bdc75957d5d/config PDB: 3ptb Chain: A p False 528.0B bb4a899b-6942-4da0-b952-7bdc75957d5d 2022-09-12 16:45:42 2022-09-29 10:20:32
1256 bb4a899b-6942-4da0-b952-7bdc75957d5d/input PDB: 3ptb Chain: A p False 40.1KiB bb4a899b-6942-4da0-b952-7bdc75957d5d 2022-09-12 16:45:42 2022-09-12 16:46:03
1257 ProteinPrepare/output/BB4A899B PDB: 3ptb Chain: A p False 58.8KiB bb4a899b-6942-4da0-b952-7bdc75957d5d 2022-09-12 16:46:02 2022-09-12 16:46:02
1258 bb4a899b-6942-4da0-b952-7bdc75957d5d/output PDB: 3ptb Chain: A p False 503.2KiB bb4a899b-6942-4da0-b952-7bdc75957d5d 2022-09-12 16:46:03 2022-09-29 10:31:26
Using Datasets#
To use the Datasets we can ask get_datasets to return them as Dataset objects:
datasets = dc.get_datasets(startswith="DeepSite/models/default", returnobj=True)
default_ds = datasets[0]
default_ds
<playmolecule.datacenter.Dataset object at 0x7fc386d923e0>
Dataset: dc://46
bucket: admin-uploads
comments: Deepsite final dataset
created_at: 2022-05-03 10:30:23+00:00
execid:
filepath: DeepSite/models/default
files: ['model_98acc.ckpt']
filesize: 3712571
filetype: application/x-gzip
id: 46
num_downloads: 0
object: 55423c39-05fd-44d8-97d5-a074b39d5008
public: True
status: 0
updated_at: 2022-05-03 10:30:23+00:00
userid: admin
validated: False
Dataset objects have their own set of methods, which you can look up in the documentation. One of the most useful is list_files, which lists the files contained in the dataset.
default_ds.list_files() # This dataset only contains one file
['model_98acc.ckpt']
output_ds = dc.get_datasets(execid=execid, tags="type:output", returnobj=True)[0] # Take the first result
output_ds.list_files() # Here is an example with multiple files
['details.csv',
'inopt.yml',
'log.txt',
'output.pdb',
'pka_plot.png',
'pmwsjob.done',
'web_content.pickle']
If we want to download a dataset locally we can use the download method:
output_ds.download("/tmp/")
2022-09-29 10:31:28,153 - playmolecule.datacenter - INFO - Dataset 1258 was downloaded successfully to /tmp.
'/tmp'
But the magic of the DataCenter is that we don’t need to download Datasets to use them in apps. We can tell the PlayMolecule server to send the data directly from the DataCenter to another application. As an example, here we take the output.pdb file of a previous job and pass it to the ProteinPrepare application. We can either pass whole datasets to apps which support that, or specific files of a dataset. To select one or more files of a dataset we index the object with brackets [] and the name of the file we want to pass. The bracket format also accepts regular expressions for selecting files. For example:
import re
print(output_ds["output.pdb"].files)
print(output_ds[re.compile("outp*")].files)
print(output_ds[re.compile(".*.png")].files)
print(output_ds[re.compile("p.*")].files)
2022-09-29 10:36:39,484 - playmolecule.datacenter - INFO - Regular expression matched files: ['output.pdb']
2022-09-29 10:36:39,486 - playmolecule.datacenter - INFO - Regular expression matched files: ['pka_plot.png']
2022-09-29 10:36:39,487 - playmolecule.datacenter - INFO - Regular expression matched files: ['pka_plot.png', 'pmwsjob.done']
['output.pdb']
['output.pdb']
['pka_plot.png']
['pka_plot.png', 'pmwsjob.done']
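The matches logged above are consistent with each pattern being applied at the start of every filename, i.e. re.Pattern.match semantics. A local sketch of that selection logic (an assumption about the behaviour, not the library’s actual code):

```python
import re

# Files of the example output dataset shown earlier in this tutorial
files = ['details.csv', 'inopt.yml', 'log.txt', 'output.pdb',
         'pka_plot.png', 'pmwsjob.done', 'web_content.pickle']

def select(files, key):
    """Select files by exact name, or by a regex matched at the start of the name."""
    if isinstance(key, re.Pattern):
        return [f for f in files if key.match(f)]
    return [f for f in files if f == key]

print(select(files, "output.pdb"))          # → ['output.pdb']
print(select(files, re.compile("outp*")))   # → ['output.pdb']
print(select(files, re.compile(".*.png")))  # → ['pka_plot.png']
print(select(files, re.compile("p.*")))     # → ['pka_plot.png', 'pmwsjob.done']
```

Note that the bare dot in ".*.png" matches any character; the pattern still selects only pka_plot.png here because no other filename contains "png".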
job = sess.start_app("ProteinPrepare")
job.pdbfile = output_ds["output.pdb"] # Pass just the PDB file as input to ProteinPrepare
job.submit()
This allows us to efficiently pass data from one application to another in PlayMolecule.
DeepSite example#
The DeepSite application requires a model to run; it cannot be used without one. The default DeepSite models are stored inside PlayMolecule. describe shows that the first argument is the model:
job = sess.start_app("DeepSite")
job.describe()
Name Type Mandatory Value Description
model file True The path to a model checkpoint.
appname string False [DEV]: Allows overriding the application name. Use with caution.
no_parallelism bool False False Use it to disable parallelism inside the app (default: False)
outdir string False . The output folder where to write the results
pdbfile file False .pdb file
pdbid string False PDB id.
scratchdir string False scratch The scratch folder where to write the temporary data
Thus, to use DeepSite we first need to get the Dataset object of a model and then pass it as an argument to the app:
model_ds = dc.get_datasets(startswith="DeepSite/models/default", returnobj=True)[0]
print(model_ds)
dc://46
job.model = model_ds # Pass the model to the app
job.pdbid = "3ptb"
job.submit()
Kdeep example#
Kdeep can similarly accept either pre-trained models or models produced by KdeepTrainer. The KdeepTrainer application can also accept existing training sets located in the DataCenter. Let’s find a KdeepTrainer model to use in Kdeep.
dc.get_datasets(tags="app:kdeeptrainer:output")
ID RemotePath Comments Public FileSize ExecID DateCreated DateUpdated
394 19aeb528-a682-48c3-97ee-495fb5feeadc/output test1 False 2.6MiB 19aeb528-a682-48c3-97ee-495fb5feeadc 2022-05-23 11:22:12 2022-05-23 11:25:59
401 334f50f8-e9d9-4192-979e-066262fc179d/output test2 False 2.6MiB 334f50f8-e9d9-4192-979e-066262fc179d 2022-05-23 11:40:40 2022-05-23 11:51:36
408 289e041c-4e86-402f-a597-58c2e6f4f9ad/output test3 False 2.5MiB 289e041c-4e86-402f-a597-58c2e6f4f9ad 2022-05-23 11:52:01 2022-05-23 12:20:50
424 bddb0a98-d7cd-4f12-b702-00ec3253039f/output teststef1 False 2.5MiB bddb0a98-d7cd-4f12-b702-00ec3253039f 2022-05-23 15:00:11 2022-05-23 15:00:13
428 18a5c95a-77d6-4b15-a682-c9b77f0771d2/output teststef1 False 2.5MiB 18a5c95a-77d6-4b15-a682-c9b77f0771d2 2022-05-23 15:10:43 2022-05-23 15:10:46
922 0645e98f-48af-432c-8387-5329b80a0c1a/output False 2.5MiB 0645e98f-48af-432c-8387-5329b80a0c1a 2022-08-02 14:07:03 2022-08-02 14:15:05
931 3afdb586-96d8-4268-8e4a-896dd3e56140/output False 2.5MiB 3afdb586-96d8-4268-8e4a-896dd3e56140 2022-08-02 14:15:15 2022-08-02 14:15:15
936 28b70a5c-afa3-441d-a47a-d6c01c4a5656/output False 2.5MiB 28b70a5c-afa3-441d-a47a-d6c01c4a5656 2022-08-02 14:19:33 2022-08-02 14:20:10
943 3ade5f42-13b2-4492-bad5-5f061d1aff83/output False 2.5MiB 3ade5f42-13b2-4492-bad5-5f061d1aff83 2022-08-02 14:30:27 2022-08-02 14:30:27
947 20b6eeee-bc5a-499d-ad1a-794b8aebb474/output False 2.5MiB 20b6eeee-bc5a-499d-ad1a-794b8aebb474 2022-08-02 14:44:36 2022-08-02 14:44:54
1263 73be14f5-2077-4c42-b9a8-62a00ba89389/output False 1.2KiB 73be14f5-2077-4c42-b9a8-62a00ba89389 2022-09-23 13:40:40 2022-09-23 13:40:40
1264 b873d56f-e125-4d06-9b7b-26c8b9c17d41/output False 1.2KiB b873d56f-e125-4d06-9b7b-26c8b9c17d41 2022-09-23 13:40:42 2022-09-23 13:40:42
1268 285aacd5-d646-4763-9fb0-436e7b863d52/output False 2.5MiB 285aacd5-d646-4763-9fb0-436e7b863d52 2022-09-23 13:49:02 2022-09-23 14:08:56
1281 4df43a8f-48af-4537-8f75-6747f4d5df25/output False 2.5MiB 4df43a8f-48af-4537-8f75-6747f4d5df25 2022-09-23 14:25:33 2022-09-23 15:16:43
Let’s use the most recent model:
kdtrainer_model_ds = dc.get_datasets(tags="app:kdeeptrainer:output", returnobj=True)[-1]
kdtrainer_model_ds.list_files()
['best_model.ckpt',
'inopt.yml',
'log.txt',
'metrics/',
'metrics/metrics_by_epoch.npy',
'metrics/training_labels.pickle',
'pmwsjob.done',
'training_labels.pickle',
'web_content.pickle']
job = sess.start_app("Kdeep")
job.modelfile = kdtrainer_model_ds["best_model.ckpt"]
job.pdb = "./protein.pdb"
job.sdf = "./ligands.sdf"
job.submit()
Or, of course, we could use a model provided by Acellera:
dc.get_datasets(startswith="Kdeep/models") # List the default existing models (provided by Acellera)
ID RemotePath Comments Public FileSize ExecID DateCreated DateUpdated
513 Kdeep/models/default ACEBind 2020 model e True 20.4MiB 2022-06-02 16:28:07 2022-06-03 10:37:28
514 Kdeep/models/PDBBind2016 PDBBind 2016 model e True 20.3MiB 2022-06-02 16:28:08 2022-06-02 16:28:08
515 Kdeep/models/ACEBind2016_dl ACEBind 2016 model e True 20.3MiB 2022-06-02 16:28:10 2022-06-02 16:28:10
516 Kdeep/models/ACEBind2020_dl ACEBind 2020 model e True 20.4MiB 2022-06-02 16:28:11 2022-06-02 16:28:11
517 Kdeep/models/ACEBind2020_dl_reduced ACEBind 2020 model e True 14.5MiB 2022-06-02 16:28:13 2022-06-02 16:28:13
518 Kdeep/models/BindScope BindScope model ense True 12.2MiB 2022-06-02 16:28:14 2022-06-03 10:37:29
Similarly, for KdeepTrainer we can list all training sets provided by Acellera:
dc.get_datasets(startswith="KdeepTrainer/datasets")
ID RemotePath Comments Public FileSize ExecID DateCreated DateUpdated
1238 KdeepTrainer/datasets/ACEBind2022 ACEBind2022 training True 754.5MiB 2022-08-03 16:43:46 2022-08-03 16:43:46
1239 KdeepTrainer/datasets/ACEBind2020_dl ACEBind2020 training True 57.6MiB 2022-08-03 16:43:53 2022-08-03 16:43:53
1240 KdeepTrainer/datasets/ACEBind2016_dl ACEBind2016 training True 40.7MiB 2022-08-03 16:43:57 2022-08-03 16:43:57
1241 KdeepTrainer/datasets/PDBBind2016 PDBBind2016 training True 258.0B 2022-08-03 16:43:57 2022-08-03 16:43:57
Uploading your own files into the DataCenter#
The DataCenter is not only for storing application-generated files. Users can upload their own data, which they can then re-use in apps as often as they want without re-uploading it to the server every time.
ds_id = dc.upload_dataset(localpath="./3ptb.pdb", remotepath="/my-own-data/3ptb.pdb")
2022-09-29 10:56:31,899 - playmolecule.datacenter - INFO - File ./3ptb.pdb was uploaded successfully.
dataset = dc.get_datasets(datasetid=ds_id, returnobj=True)[0]
dataset # Now we can see that the file exists on the DataCenter
<playmolecule.datacenter.Dataset object at 0x7fc386d18130>
Dataset: dc://1303
bucket: admin-uploads
comments:
created_at: 2022-09-29 07:56:32+00:00
execid:
filepath: /my-own-data/3ptb.pdb
files: ['3ptb.pdb']
filesize: 31247
filetype: application/x-gzip
id: 1303
num_downloads: 0
object: 45e112a4-e06e-4722-b34c-4f4ce90e4f6b
public: False
status: 0
updated_at: 2022-09-29 07:56:32+00:00
userid: admin
validated: False
We can now pass this dataset as input to any application which accepts PDB files:
job = sess.start_app("ProteinPrepare")
job.pdbfile = dataset
job.submit()
With this we conclude the DataCenter tutorial. For any further questions feel free to contact info@acellera.com.