FASTA parser

One of the most common tasks I come across in bioinformatics is reading sequence data, often from a FASTA file, and getting it into a usable format.

Below is a function that tries to open a given filename and read it as a FASTA file. It assumes that each sequence name is on a line starting with '>' and that the sequence starts on the following line. Any underscores ('_') in the names are replaced by spaces, which isn't necessary but is useful for the data I tend to use. A trailing '*' on a sequence line is also stripped.

The function returns a list of the sequence names and a dictionary of the sequences keyed by name, so you can read the sequences in their original order if you so wish (EDIT: Using an OrderedDict would be better).

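def FASTA(filename):
  # Try to open the file; report the problem and return None on failure.
  try:
    f = open(filename)
  except IOError:
    print("The file, %s, could not be opened" % filename)
    return

  order = []       # sequence names, in the order they appear in the file
  sequences = {}   # sequence name -> sequence string
  name = None      # no header seen yet

  with f:
    for line in f:
      if line.startswith('>'):
        # Header line: drop the '>' and replace underscores with spaces.
        name = line[1:].rstrip('\n').replace('_', ' ')
        order.append(name)
        sequences[name] = ''
      elif name is not None:
        # Sequence line: strip the newline and any trailing '*'.
        # Lines before the first header are skipped.
        sequences[name] += line.rstrip('\n').rstrip('*')

  print("%d sequences found" % len(order))
  return order, sequences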
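
As the EDIT above notes, an OrderedDict would remove the need for the separate list of names. Here is a minimal sketch of that variant, assuming Python 3. The function name read_fasta_ordered and the sample data are made up for illustration. Unlike the function above, it lets the IOError propagate if the file cannot be opened, rather than printing a message.

```python
import os
import tempfile
from collections import OrderedDict


def read_fasta_ordered(filename):
    """Parse a FASTA file into an OrderedDict of name -> sequence.

    Hypothetical variant of the function above: same parsing rules,
    but the dict's insertion order replaces the separate list of names.
    """
    sequences = OrderedDict()
    name = None
    with open(filename) as handle:
        for line in handle:
            line = line.rstrip('\n')
            if line.startswith('>'):
                # Header line: drop the '>' and replace underscores with spaces.
                name = line[1:].replace('_', ' ')
                sequences[name] = ''
            elif name is not None:
                # Sequence line: drop a trailing '*' if present.
                sequences[name] += line.rstrip('*')
    return sequences


# Example usage with a small, made-up file written to a temporary location.
with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
    tmp.write('>seq_one\nACGT\nTTGA*\n>seq_two\nMKV\n')

seqs = read_fasta_ordered(tmp.name)
os.remove(tmp.name)

for name, seq in seqs.items():
    print(name, len(seq))
# prints:
# seq one 8
# seq two 3
```

Iterating over the returned OrderedDict gives the sequences in file order, so one return value does the job of both the list and the dictionary.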

Comments 3

HNS 23 October 2013

Hi Sir,

Can you explain this code in detail?

Anonymous 13 August 2015

Wow, thank you! I was most of the way through building something similar, but stuck. Your solution is clear and useful!

Sewunet Abera 21 February 2017

Could you please add comments to the lines of code (alternative commands, why those commands are needed, and the examples you think work best)?
