moleculekit.tools.hhblitsprofile module

moleculekit.tools.hhblitsprofile.getSequenceProfile(sequence, hhblits, hhblitsdb, ncpu=6, niter=4)

Calculates the sequence profile of a protein sequence using HHblits.

The file description and unit conversions are taken from section 6 of the HH-suite user guide: https://hpc.nih.gov/apps/hhsuite-userguide.pdf
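As a rough illustration of the unit conversion involved: the HH-suite user guide describes the integers stored in a profile HMM file as -1000 * log2(p), with `*` standing for a probability of zero. The sketch below inverts that encoding. The helper name `hhm_value_to_prob` is hypothetical and is not part of moleculekit; the function's own parsing may differ.

```python
def hhm_value_to_prob(token):
    """Convert one field of an HHM profile line to a probability.

    Assumes the encoding described in the HH-suite user guide:
    value = -1000 * log2(p), and '*' denotes p = 0.
    """
    if token == "*":
        return 0.0
    return 2.0 ** (-int(token) / 1000.0)


# 1000 encodes p = 2**-1, 0 encodes p = 1, and '*' encodes p = 0
print([hhm_value_to_prob(t) for t in ["1000", "0", "*"]])
```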

Parameters:
  • sequence (str) – The protein sequence as a string of one-letter amino acid codes

  • hhblits (str) – The path to the hhblits executable

  • hhblitsdb (str) – The path to the hhblits database that we want to search against. Should include the database name prefix, not just the folder.

  • ncpu (int) – Number of CPUs to use for the search

  • niter (int) – The number of hhblits iterations. The higher the value, the more remote homologues it will find.
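To make the role of each parameter concrete, here is a minimal sketch of how the arguments could map onto an hhblits command line. It assumes the standard hhblits options `-i`, `-d`, `-cpu`, `-n` and `-ohhm`. The helper name `build_hhblits_command`, the file names and the paths are hypothetical placeholders, and the command that moleculekit actually runs may differ.

```python
import os


def build_hhblits_command(hhblits, hhblitsdb, fasta_path, hhm_path, ncpu=6, niter=4):
    """Assemble an hhblits argument list (illustrative only)."""
    return [
        os.path.expanduser(hhblits),          # expand a leading '~' in the executable path
        "-i", fasta_path,                     # query sequence in FASTA format
        "-d", os.path.expanduser(hhblitsdb),  # database name prefix, not just the folder
        "-cpu", str(ncpu),                    # number of CPUs
        "-n", str(niter),                     # number of search iterations
        "-ohhm", hhm_path,                    # output file for the profile HMM
    ]


# Placeholder paths, used only to show the resulting command line
cmd = build_hhblits_command(
    "/opt/hhsuite/bin/hhblits",
    "/data/uniprot20/uniprot20",
    "query.fasta",
    "query.hhm",
)
print(" ".join(cmd))
```

A list built this way can be handed to `subprocess.run` without going through a shell.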

Returns:

  • df (pandas.DataFrame) – A pandas DataFrame containing all the information read from the HHblits output file

  • pssm (np.ndarray) – An Nx20 NumPy array, where N is the number of residues of the protein. Contains the transition probabilities to all 20 residues.

Examples

>>> from moleculekit.tools.hhblitsprofile import getSequenceProfile
>>> hhb = '~/hhsuite-2.0.16-linux-x86_64/bin/hhblits'
>>> hhbdb = '~/hhsuite-2.0.16-linux-x86_64/databases/uniprot20_2016_02/uniprot20_2016_02'
>>> seq = 'MKVIFLKDVKGMGKKGEIKNVADGYANNFLFKQGLAIEA'
>>> df, prof = getSequenceProfile(seq, hhb, hhbdb)