hey all,
I've got a FASTA file with about 1100 unique names (e.g. >B10988), and I need to generate a tab-delimited file with just those names in one continuous column (without the sequences). This is so I can then build a "group" file for mothur and run a multi-sample analysis (the different names refer to microbial population samples from six different sites). I can't seem to think of a way to do this in TextWrangler, Word, Excel, etc., so any advice is much appreciated.
In general, if you have recommendations for a program that is good at formatting and concatenating different sequence files (FASTA, NEXUS, PHYLIP, etc.), that would be super-awesome too.
Ah, Biopython seems to be the toolkit for these types of issues. Tutorial time.
Forget Biopython; get a copy of BioEdit. It lets you edit/cut/paste the sequence names independently of the sequences.
Perl can do this, but I'm not sure what you mean by a "tab-delimited column". If you want a column of FASTA names, one per line:
#!/usr/bin/perl -w
use strict;
open (IN, "filename.ext") or die "Couldn't find filename.ext: $!\n";
open (OUT, ">fasta_names.txt") or die "Couldn't open fasta_names.txt: $!\n";
while (<IN>) {
    # Header lines start with ">"; capture everything after it
    if (/^>(.+)/) {
        print OUT "$1\n";
    }
}
close IN;
close OUT;
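If you want a tab-delimited *row*, with all names on a single line:
#!/usr/bin/perl -w
use strict;
open (IN, "filename.ext") or die "Couldn't find filename.ext: $!\n";
open (OUT, ">fasta_names.txt") or die "Couldn't open fasta_names.txt: $!\n";
while (<IN>) {
    # Same match as above, but separate names with tabs instead of newlines
    if (/^>(.+)/) {
        print OUT "$1\t";
    }
}
# End the row with a single newline
print OUT "\n";
close IN;
close OUT;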
This assumes that there's nothing on the FASTA description line other than the gene name. If there's something else, like a description of the gene following its name, change the line:
if (/^>(.+)/) {
to:
if (/^>(.+?)\s/) {
so it captures all text following the > up to the first whitespace character it encounters.
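For anyone who'd rather not touch Perl, here is a rough Python sketch of the same idea using only the standard library (no Biopython). The function name fasta_names and the in-memory example records are made up for illustration; it keeps the text after > up to the first whitespace, like the second regex above.

```python
import io

def fasta_names(handle):
    """Yield the name from each FASTA header line:
    the text after '>' up to the first whitespace."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare ">" line with no name
                yield fields[0]

# Tiny in-memory example standing in for a real file (made-up records).
example = io.StringIO(">B10988 sample one\nACGT\n>B10989\nGGCC\n")
names = list(fasta_names(example))

column = "\n".join(names) + "\n"  # one name per line
row = "\t".join(names) + "\n"     # all names on one tab-delimited line
```

To use it on a real file, pass an open file object instead of the StringIO and write column (or row) to your output file.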
On a *nix machine:
grep '>' input_filename.txt > output_filename.txt
To make sure you only get lines starting with >, replace the above with:
grep '^>' input_filename.txt > output_filename.txt
Note that the output lines still include the leading >. You can then use awk to get at specific columns.
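If awk isn't your thing, the column step could also be sketched in Python. This takes header lines like the ones grep produces, drops the >, keeps the first whitespace-delimited field, and pairs it with a group name separated by a tab, which as far as I know is the two-column layout a mothur group file uses. How a name maps to a site is NOT given anywhere in this thread, so site_from_prefix below is a purely hypothetical rule (first letter of the name); swap in whatever actually identifies your six sites.

```python
def group_lines(header_lines, site_for_name):
    """Turn '>name description' lines into 'name<TAB>group' lines."""
    out = []
    for line in header_lines:
        if not line.startswith(">"):
            continue  # ignore anything that is not a header line
        fields = line[1:].split()
        if not fields:
            continue  # bare ">" with no name
        name = fields[0]
        out.append(f"{name}\t{site_for_name(name)}")
    return out

# Hypothetical rule: the first letter of the name identifies the site.
def site_from_prefix(name):
    return "site_" + name[0]

# Made-up header lines, plus one sequence line that should be skipped.
headers = [">B10988 some description", ">C20001", "ACGT"]
lines = group_lines(headers, site_from_prefix)
```

Writing "\n".join(lines) to a file would then give one name and group per line.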
Ahh - the difficult to read but incredibly useful world of Perl one-liners (these are written for Windows):
perl -nle "print for /^>(\S+)/g" your_fasta_file.ext > gene_names.txt
for a column of gene names, or:
perl -ne "while(/^>(.+?)\s/g){print \"$1\t\"}" your_fasta_file.ext > gene_names.txt
for a row of tab-delimited gene names.
ok, now I have to look at how to do this with assembly.
Thanks everyone!
I did figure out how to do it easily in BioPython too if anyone's interested:
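>>> from Bio import SeqIO
>>> seq_rec_list = [seq_rec.id for seq_rec in SeqIO.parse(input_handle, "fasta")]
>>> seq_rec_list
>>> seq_rec_string = '\n'.join(seq_rec_list)
>>> output_handle = open("fasta_names.txt", "w")
>>> output_handle.write(seq_rec_string)
>>> output_handle.close()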
input_handle is of course the path to your fasta file.
I'll have to check BioEdit out though.
List comprehensions, eh?
You can make that code more or less compact if you like (I prefer not to write compact code, as it can be nasty when I come back to it, or when someone else has to use it).
This takes your list comprehension and does it in one go, so that you don't have to perform the "\n".join(list):
from Bio import SeqIO
output = open("my_output.txt", "w")
[output.write(rec.id + "\n") for rec in SeqIO.parse(open("my_input.txt", 'r'), "fasta")]
output.close()
Typically I wouldn't do this either. I would open the file handle with various checks (size, does it exist etc), then loop over the file.
from Bio import SeqIO
output = open("my_output.txt", "w")
for rec in SeqIO.parse(open("my_input.txt", 'r'), "fasta"):
    # Write one record ID per line
    output.write(rec.id + "\n")
output.close()
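The "various checks" part might look something like the sketch below. To keep it dependency-free I've swapped SeqIO.parse for a plain line loop, so this is not the Biopython approach itself, just the same shape: check that the file exists and isn't empty, then loop over it. The function name write_fasta_names and the demo records are made up for illustration.

```python
import os
import tempfile

def write_fasta_names(in_path, out_path):
    """Check the input file, then write one FASTA name per line.
    Returns the number of names written."""
    if not os.path.isfile(in_path):
        raise FileNotFoundError(f"no such file: {in_path}")
    if os.path.getsize(in_path) == 0:
        raise ValueError(f"file is empty: {in_path}")
    count = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:  # skip a bare ">" line with no name
                    dst.write(fields[0] + "\n")
                    count += 1
    return count

# Demo on a throwaway file with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "my_input.txt")
    out_path = os.path.join(tmp, "my_output.txt")
    with open(in_path, "w") as fh:
        fh.write(">B10988\nACGT\n>B10989 extra text\nGGCC\n")
    n_written = write_fasta_names(in_path, out_path)
    with open(out_path) as fh:
        written = fh.read()
```

The with blocks also close both files for you, which the shorter versions above leave to chance.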
<I am trying *hard* to avoid my python code at the moment >