Science · Code · Curiosity
FASTA parser
One of the most common tasks I run into in bioinformatics is reading sequence data from a file, often a FASTA file, and getting it into a usable format.
Below is a function that tries to open a given filename and read it as a FASTA file. It assumes that each sequence name is on a line starting with '>' and that the sequence starts on the following line. In my function any underscores ('_') in the names are replaced by spaces, which isn't necessary, but is useful for the data I tend to use. Any trailing '*' on a sequence line is also stripped.
The function returns a list of the sequence names in their original order, followed by a dictionary of the sequences keyed by name, so you can read them back in file order if you so wish (EDIT: Using an OrderedDict would be better). If the file cannot be opened, it prints a message and returns None.
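def FASTA(filename):
    # Try to open the file; give up early if it cannot be opened.
    try:
        f = open(filename)
    except IOError:
        print("The file, %s, does not exist" % filename)
        return

    order = []      # sequence names, in file order
    sequences = {}  # name -> sequence string

    for line in f:
        if line.startswith('>'):
            # Header line: drop the '>' and the newline, swap '_' for ' '.
            name = line[1:].rstrip('\n')
            name = name.replace('_', ' ')
            order.append(name)
            sequences[name] = ''
        else:
            # Sequence line: extend the current record, dropping a trailing '*'.
            sequences[name] += line.rstrip('\n').rstrip('*')
    f.close()

    print("%d sequences found" % len(order))
    return order, sequences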
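As the EDIT note above suggests, an OrderedDict can stand in for the separate list of names. The following is only a sketch of that idea in Python 3, not a drop-in replacement: the name read_fasta and the example data are made up for illustration, and it behaves slightly differently from FASTA() in that it lets a missing file raise an error, strips Windows line endings, and skips any lines that appear before the first '>' header.

```python
from collections import OrderedDict
import os
import tempfile


def read_fasta(filename):
    """Return an OrderedDict mapping sequence names to sequences, in file order."""
    sequences = OrderedDict()
    name = None
    with open(filename) as handle:
        for line in handle:
            line = line.rstrip('\r\n')
            if line.startswith('>'):
                # Header line: same clean-up as FASTA() above.
                name = line[1:].replace('_', ' ')
                sequences[name] = ''
            elif name is not None:
                # Sequence line: extend the current record, dropping a trailing '*'.
                sequences[name] += line.rstrip('*')
    return sequences


# Small made-up example file, written to a temporary location.
example = ">seq_one\nACGT\nTTGA\n>seq_two\nMKV*\n"
with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
    tmp.write(example)

records = read_fasta(tmp.name)
os.remove(tmp.name)

for name, seq in records.items():
    print(name, seq)
# prints:
# seq one ACGTTTGA
# seq two MKV
```

Iterating over records.items() gives the sequences back in the order they appeared in the file, so there is no second return value to keep in sync with the dictionary.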
Comments 3
Hi Sir,
Can you explain this code in details?
Wow- thank you! I was most of the way through building something similar, but stuck. Your solution is clear and useful!
Could you pls attach comments for the command lines (alternative commands, why those commands are needed and examples you think works best)