moleculekit.tools.hhblitsprofile module

moleculekit.tools.hhblitsprofile.getSequenceProfile(sequence, hhblits, hhblitsdb, ncpu=6, niter=4)

Calculates the sequence profile of a protein sequence using HHblits.

The file description and unit conversions are taken from section 6 of the HH-suite user guide: https://hpc.nih.gov/apps/hhsuite-userguide.pdf
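As a rough illustration of the unit conversion mentioned above, the sketch below assumes the convention described in the HH-suite user guide: profile fields store -1000 * log2(p) as integers, and `*` stands for a probability of zero. The helper name `hhm_value_to_probability` is hypothetical and is not part of moleculekit.

```python
def hhm_value_to_probability(value: str) -> float:
    """Convert one HHM profile field to a probability.

    Assumption: the field holds -1000 * log2(p) as an integer string,
    and '*' denotes p = 0 (convention from the HH-suite user guide).
    """
    if value == "*":
        return 0.0
    return 2.0 ** (-int(value) / 1000.0)


# A stored value of 1000 corresponds to p = 2**-1, and 0 to p = 1.
fields = ["0", "1000", "2000", "*"]
probabilities = [hhm_value_to_probability(f) for f in fields]
print(probabilities)
```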

Parameters:

sequence (str) – A string encoding the one-letter sequence of the protein
hhblits (str) – The path to the hhblits executable
hhblitsdb (str) – The path to the hhblits database that we want to search against. Should include the database name prefix, not just the folder.
ncpu (int) – Number of CPUs to use for the search
niter (int) – The number of hhblits iterations. Higher values find more remote homologues.
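To show how these parameters could map onto an hhblits invocation, here is a minimal sketch that only builds the argument list. It assumes the standard hhblits command-line options `-i`, `-d`, `-cpu`, `-n` and `-ohhm`. Whether getSequenceProfile builds its command exactly this way is an assumption, and the helper `build_hhblits_command` and the file names are hypothetical.

```python
import shlex


def build_hhblits_command(query_fasta, hhblits, hhblitsdb, out_hhm, ncpu=6, niter=4):
    """Build an hhblits argument list (hypothetical helper, not moleculekit API).

    hhblitsdb must include the database name prefix, not just the folder.
    """
    return [
        hhblits,
        "-i", query_fasta,    # input query sequence (FASTA file)
        "-d", hhblitsdb,      # database path including the name prefix
        "-cpu", str(ncpu),    # number of CPUs
        "-n", str(niter),     # number of search iterations
        "-ohhm", out_hhm,     # write the resulting profile as an HHM file
    ]


cmd = build_hhblits_command(
    "query.fasta",
    "hhblits",
    "databases/uniprot20_2016_02/uniprot20_2016_02",
    "query.hhm",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps paths with spaces intact if it is later passed to `subprocess.run`.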

Returns:

df (pandas.DataFrame) – A pandas dataframe containing all the information read from the HHblits output file
pssm (np.ndarray) – An Nx20 numpy array, where N is the number of residues of the protein. Contains the transition probabilities to all 20 residues.
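The following sketch shows one way the returned Nx20 array might be inspected. It assumes that each row holds probabilities and that the columns follow the alphabetical one-letter order used in HHM files (`ACDEFGHIKLMNPQRSTVWY`). That column order is an assumption and should be checked against the column names of `df`. Both helper functions are hypothetical, and the toy rows stand in for a real profile.

```python
import math

# Assumed column order: alphabetical one-letter codes, as in HHM files.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def most_probable_residues(pssm):
    """Return the highest-probability residue for each row of an Nx20 profile."""
    letters = []
    for row in pssm:
        if len(row) != 20:
            raise ValueError("expected 20 columns per row")
        best = max(range(20), key=lambda i: row[i])
        letters.append(AMINO_ACIDS[best])
    return "".join(letters)


def positional_entropy(pssm):
    """Shannon entropy (in bits) of each row; zero entries contribute nothing."""
    return [sum(-p * math.log2(p) for p in row if p > 0) for row in pssm]


# Toy profile: two fully conserved positions and one uniform position.
one_hot_a = [1.0] + [0.0] * 19
one_hot_m = [0.0] * 10 + [1.0] + [0.0] * 9
uniform = [0.05] * 20
toy = [one_hot_a, one_hot_m, uniform]

print(most_probable_residues(toy[:2]))
print(positional_entropy(toy))
```

Both functions iterate row by row, so they accept a plain list of lists as well as a numpy array.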

Examples

>>> from moleculekit.tools.hhblitsprofile import getSequenceProfile
>>> hhb = '~/hhsuite-2.0.16-linux-x86_64/bin/hhblits'
>>> hhbdb = '~/hhsuite-2.0.16-linux-x86_64/databases/uniprot20_2016_02/uniprot20_2016_02'
>>> seq = 'MKVIFLKDVKGMGKKGEIKNVADGYANNFLFKQGLAIEA'
>>> df, prof = getSequenceProfile(seq, hhb, hhbdb)