Python Bioinformatics Tools

I've decided to collect together various Python programs I've written to solve relatively simple bioinformatics tasks. I'm aware that Biopython exists and have used it occasionally, but it's quite large and I've often found it takes more time to learn how to use it than it does to write my own version. I've collected some of these tools here in case they are of use to other people looking for a quick solution.

Codon table

Here's a quick code snippet to generate a codon table in Python. The 'table' is actually a dictionary that takes a three-letter, lowercase codon as a key, and returns a single uppercase letter corresponding to the encoded amino acid (or '*' if it's a stop codon).

bases = ['t', 'c', 'a', 'g']
codons = [a+b+c for a in bases for b in bases for c in bases]
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codon_table = dict(zip(codons, amino_acids))

So if you type codon_table['atg'], you'll get M for methionine. If you prefer to use 'u' rather than 't', simply change the base in the first line.

It's now quite easy to make a function to translate a gene into an amino acid sequence.

def translate(seq):
    seq = seq.lower().replace('\n', '').replace(' ', '')
    peptide = ''
    
    for i in xrange(0, len(seq), 3):
        codon = seq[i: i+3]
        amino_acid = codon_table.get(codon, '*')
        if amino_acid != '*':
            peptide += amino_acid
        else:
            break
                
    return peptide

This function takes a DNA sequence, converts it to lowercase and removes any line breaks or spaces. Then it loops through it in chunks of 3, i.e. codons, translating them until it hits a stop codon or a codon not in the dictionary. It returns the amino acid sequence of the resulting peptide.

Convert distance matrix to phylip format

I wrote a function to create a Eucliean distance matrix of some amino acid substitution matrices and I wanted to find a built-in method find the Spearman's rank of two lists to create a distance matrix that way. I found that BioPython actually has a method that builds distance matrices using various different distance metric, including Euclidean and Spearman's rank:

import Bio.Cluster
dm = Bio.Cluster.distancematrix(data, dist="s")

If you change the dist to "e", then it will calculate the Euclidean distance.

I thought there might be a way to output this in phylip format so I could use quicktree, but if there is, I wasn't able to find it. So here's mine:

fout = open(filename, 'w')
fout.write('%d\n' % len(names))
    
for name, row in zip(names, dm):
    fout.write(name)
    for value in row:
       fout.write('\t%s' % value)
    fout.write('\n')

It assumes you have the distance matrix in the format created by the Bio.Cluster distancematrix function, and have a list of names for the sequences or matrices.

An example output would be:

3
A    
B    1.2    0.8
C    3.2    1.6    2.0

The first value is the number of sequences in the distance matrix and the following lines are the lower triangle of a distance matrix, not including the diagonal (for which all the values would be 0).

FASTA parser

I find that one of the common tasks in bioinformatics is reading a file of sequence data, often from a FASTA file, and getting it into a usable format.

Below is a function that tries to open a given filename and read it like as FASTA file. I assumes that the names of sequences are indicated by a '>' and that the sequence starts on the following line. In my function any underscores ('_') are replaced by spaces, which isn't necessary, but useful for data I tend to use.

The function returns dictionary of the sequences with the names as a key and a list of the names in a list so you can read them in the original order if you so wish.

def FASTA(filename):
  try:
    f = file(filename)
  except IOError:                     
    print "The file, %s, does not exist" % filename
    return

  order = []
  sequences = {}
    
  for line in f:
    if line.startswith('>'):
      name = line[1:].rstrip('\n')
      name = name.replace('_', ' ')
      order.append(name)
      sequences[name] = ''
    else:
      sequences[name] += line.rstrip('\n').rstrip('*')
            
  print "%d sequences found" % len(order)
  return order, sequences

Finding the reverse complement

I wrote this small function to get the reverse complement of a DNA sequence. It's probably not the most efficient way to do it, but I like it because it's compact. It will work for upper- or lower-case sequences, but it will always return a lowercase sequence. It removes any characters it doesn't recognise, which is useful if you have line numbers in the sequence.

def reverseComplement(sequence):
  complement = {'a':'t','c':'g','g':'c','t':'a','n':'n'}
  return "".join([complement.get(nt.lower(), '') for nt in sequence[::-1]])

This function works by going through a string (or list) backwards and replacing each letter using a dictionary. Any character not in the dictionary, such as spaces, line breaks or numbers, are ignored so a single continuous string is returned.

Another option is to use the string.translate() function:

import string
complement = string.maketrans('atcgn', 'tagcn')

def reverseComplement(sequence):
    return sequence.lower().translate(complement)[::-1]

This function works by creating a translation table with string.maketrans() and using it to translate the sequence.

Predicting pI

The pI (or isoelectric point) of a molecule is the pH at which it has neutral charge. The pI of a protein is determined by its amino acid composition so can be used to identify it (or at least narrow down its identity). Proteins can be separated on the basis of their pI on a polyacylamide gel by a technique called isoelectric focusing. I've written an introduction to pH, pI and pKa here.

The full code for my pI predecting function can be downloaded as a txt file from the bottom of this post.

pKa Values

First we need to know the pKa values of the individual amino acids and the N- and C-termini. I got the pKa values from Wikipedia, though there are various sources and they don't all agree with one another. It's not too important, since pI calculations are never that accurate. The pKa values only represent the association constants for the ions in an ideal solution and are highly dependent on the environment of the ion. In a protein, most amino acids will be surrounded by other amino acids, which will change the pKa values, though on average, the values should be roughly correct.

pKa = {'D': 3.9,
       'E': 4.3,
       'H': 6.1,
       'C': 8.3,
       'Y': 10.1,
       'K': 10.5,
       'R': 12,
       'N-terminus': 8,
       'C-terminus': 3.1}

We also need to know whether the ions are positively charged when protonated or negatively charged when deprotonated:

charges = {'D':-1, 'E':-1, 'H':1, 'C':-1, 'Y':-1, 'K':1, 'R':1, 'N-terminus':1, 'C-terminus':-1}

Calculating a protein's charge

Using the pKa and charge dictionaries we can create a function to find the charge of an amino acid at a given pH (derivation of the equation here).

def aminoAcidCharge(amino_acid, pH):
    ratio = 1 / (1 + 10**(pH - pKa[amino_acid]))
    
    if charges[amino_acid] == 1:
        return ratio
    else:
        return ratio - 1

We can then use this function to calculate the charge of an entire protein at a specific pH given its sequence (as a string):

def proteinCharge(sequence, pH):
    protein_charge = aminoAcidCharge('N-terminus', pH)
    protein_charge += aminoAcidCharge('C-terminus', pH)
    
    for aa in pKa.keys():
        protein_charge += sequence.count(aa) * aminoAcidCharge(aa, pH)

    return protein_charge

Calculating a protein's pI

We have a function that calculates the charge of a protein at a given pH, but there is no inverse function to calculate the pH at which a protein will have a given charge (such as 0). One way to calculate pI is to plot the charge of a protein over a range of pHs, connect the points, and read off the graph.

For example, below is a graph of the charge of one of the human keratin proteins over a range of pHs. You can see that its pI is around 5. The graph also illustrates how the charge of a protein is buffered over the range of pH5 - 9.

Plot of keratin's charge vs pH

To get a more exact value for pI we can use a binary search algorithm. This is an interative method that narrows in on the answer until you are as close to a pI of 0 as you wish.

We start by finding the charge of the protein at the midpoint of a range of pH values.  I use pH 3 - pH 13 since all amino acids will be >50% protonated below pH 3 and all will be >50% deprotonated above pH 13. If the charge at the midpoint (pH 8) is negative, then we know that the pI must be between pH 3 - pH 8, so we test the midpoint of that range. If the charge is positive, then we test the range pH 8 - pH 13. We keep testing the midpoint of the ranges in which the pI must lie until we are as close to a charge of 0 as we want.

def isoelectricPoint(sequence, threshold):   
    min_pH, max_pH = 3, 13 

    while True:
        mid_pH = 0.5 * (max_pH + min_pH)
        protein_charge = proteinCharge(sequence, mid_pH)
        
        if protein_charge > threshold:
            min_pH = mid_pH
        elif protein_charge < -threshold:
            max_pH = mid_pH
        else:
            return mid_pH

The function above the implements the binary search for a given protein sequence and threshold. It predicts that the pI of human keratin is 5.3 (with a threshold of 0.5). There is no point attempting to get a more precise pI than this as the pKa values are only given to one decimal place, and will not be so accurate for amino acids within a protein anyway.

AttachmentSize
predictPI.txt1.04 KB