Calculates the sequence profile of a protein sequence using HHBlits
The file description and unit conversions are taken from section 6 of the HH-suite user guide: https://hpc.nih.gov/apps/hhsuite-userguide.pdf
sequence (str) – A string encoding the one letter sequence of the protein
hhblits (str) – The path to the hhblits executable
hhblitsdb (str) – The path to the hhblits database that we want to search against. Should include the database name prefix, not just the folder.
ncpu (int) – Number of CPUs to use for the search
niter (int) – The number of hhblits iterations. Higher values find more remote homologues.
df (pandas.DataFrame) – A pandas dataframe containing all the information read from the file
pssm (np.ndarray) – An Nx20 numpy array, where N is the number of residues of the protein. Contains the transition probabilities to all 20 residues.
>>> hhb = '~/hhsuite-2.0.16-linux-x86_64/bin/hhblits'
>>> hhbdb = '~/hhsuite-2.0.16-linux-x86_64/databases/uniprot20_2016_02/uniprot20_2016_02'
>>> seq = 'MKVIFLKDVKGMGKKGEIKNVADGYANNFLFKQGLAIEA'
>>> df, prof = getSequenceProfile(seq, hhb, hhbdb)
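
As a rough illustration of the unit conversion mentioned above: the HH-suite user guide describes the values in an HHM file as integers equal to -1000 * log2(p), with an asterisk standing for a probability of zero. The sketch below shows that conversion in isolation. The helper name `hhm_value_to_prob` is hypothetical and is not part of this function's API; how getSequenceProfile performs the conversion internally is not shown here.

```python
def hhm_value_to_prob(token):
    """Convert one HHM field to a probability (hypothetical helper).

    Assumes the HHM convention: value = -1000 * log2(p), '*' means p = 0.
    """
    if token == '*':
        return 0.0
    return 2.0 ** (-int(token) / 1000.0)


# A made-up row of HHM fields, used only to exercise the helper.
row = ['0', '1000', '2000', '*']
probs = [hhm_value_to_prob(t) for t in row]
# probs is [1.0, 0.5, 0.25, 0.0]
```

Applying this to each of the 20 amino-acid columns of a residue's row would give one row of an Nx20 probability array like the one returned as pssm.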