mothur - i'm stupid (Jul/23/2009 )
hey all,
i've got a fasta file with about 1100 unique names (e.g. >B10988), and I need to generate a tab-delimited file with just those names in a continuous column (without sequences). this is so i can then build a "group" file for mothur and run multi-sample analysis (the different names refer to microbial population samples from six different sites). i can't seem to think of a way to do this in text wrangler, word, excel, etc... so any advice is much appreciated.
in general if you have recommendations for a program that is good at formatting and concatenating different sequence files (fasta, nexus, phylip, etc.) that would be super-awesome too.
303microbialist on Jul 23 2009, 12:42 PM said:
i've got a fasta file with about 1100 unique names (e.g. >B10988), and I need to generate a tab-delimited file with just those names in a continuous column (without sequences). this is so i can then build a "group" file for mothur and run multi-sample analysis (the different names refer to microbial population samples from six different sites). i can't seem to think of a way to do this in text wrangler, word, excel, etc... so any advice is much appreciated.
in general if you have recommendations for a program that is good at formatting and concatenating different sequence files (fasta, nexus, phylip, etc.) that would be super-awesome too.
Ah, Biopython seems to be the toolkit for these types of issues. Tutorial time.
Forget Biopython; get a copy of BioEdit. It lets you edit/cut/paste the sequence names independently of the sequences.
Perl can do this, but I'm not sure what you mean by a "tab-delimited column". If you want a column of FASTA names like:
B10988
B10989
B10990
B10991
use:
#!/usr/bin/perl -w
use strict;
# Write each FASTA header, minus the leading ">", on its own line.
open (my $in, '<', "filename.ext") or die "Couldn't open filename.ext: $!\n";
open (my $out, '>', "fasta_names.txt") or die "Couldn't open fasta_names.txt: $!\n";
while (<$in>) {
    if (/^>(.+)/) {
        print $out "$1\n";
    }
}
close $in;
close $out;
If you want a tab-delimited *row*, like:
B10988<tab>B10989<tab>B10990<tab>B10991
use:
#!/usr/bin/perl -w
use strict;
# Write every FASTA header, minus the leading ">", on one tab-separated line.
open (my $in, '<', "filename.ext") or die "Couldn't open filename.ext: $!\n";
open (my $out, '>', "fasta_names.txt") or die "Couldn't open fasta_names.txt: $!\n";
while (<$in>) {
    if (/^>(.+)/) {
        print $out "$1\t";
    }
}
print $out "\n";
close $in;
close $out;
This assumes that there's nothing on the FASTA description line other than the gene name. If there's something else, like a description of the gene following its name, change the line:
if (/^>(.+)/) {
to:
if (/^>(.+?)\s/) {
so it will capture all text following the > up until the first space it encounters.
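For anyone without Perl handy, the same idea can be sketched in plain Python with no extra modules. This is only an illustration under a couple of assumptions: `fasta_ids` and the sample text are made-up names, and the sequence name is taken to be whatever follows the > up to the first whitespace.

```python
import io

def fasta_ids(handle):
    """Yield the first whitespace-delimited token after '>' on each header line."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare ">" that has no name after it
                yield fields[0]

# Hypothetical in-memory example; swap in open("filename.ext") for a real file.
sample = io.StringIO(">B10988 some description\nACGT\n>B10989\nGGCC\n")
ids = list(fasta_ids(sample))
print("\n".join(ids))  # one name per line (a column)
print("\t".join(ids))  # all names on one tab-delimited row
```

Splitting on whitespace rather than matching up to the first space means a header with nothing after the name is handled the same way as one with a description.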
on a *nix machine
grep '>' input_filename.txt > output_filename.txt
to make sure you get only those starting with > you can replace the above:
grep '^>' input_filename.txt > output_filename.txt
you can then use awk to get at specific columns, and sed 's/^>//' to drop the leading > if you want bare names
Ahh - the difficult to read but incredibly useful world of Perl one-liners (these are written for Windows):
perl -nle "print $1 if /^>(\S+)/" your_fasta_file.ext > gene_names.txt
for a column of gene names, or:
perl -ne "while(/^>(.+?)\s/g){print \"$1\t\"}" your_fasta_file.ext > gene_names.txt
for a row of tab-delimited gene names.
HomeBrew on Aug 18 2009, 02:14 PM said:
perl -nle "print $1 if /^>(\S+)/" your_fasta_file.ext > gene_names.txt
for a column of gene names, or:
perl -ne "while(/^>(.+?)\s/g){print \"$1\t\"}" your_fasta_file.ext > gene_names.txt
for a row of tab-delimited gene names.
ok, now I have to look at how to do this with assembly.
perlmunky on Aug 20 2009, 04:32 AM said:
dosseg
.model small
.stack 100h
.data
hello_message db 'Hello, World!',0dh,0ah,'$'
.code
main proc
mov ax,@data
mov ds,ax
mov ah,9
mov dx,offset hello_message
int 21h
mov ax,4C00h
int 21h
main endp
end main
No, wait -- that's not right...
Thanks everyone!
I did figure out how to do it easily in BioPython too if anyone's interested:
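>>> from Bio import SeqIO
>>> # the right-hand side below is a best guess; the original comprehension did not survive in the post
>>> seq_rec_list = [rec.id for rec in SeqIO.parse(input_handle, "fasta")]
>>> seq_rec_string = '\n'.join(seq_rec_list)
>>> output_handle.write(seq_rec_string)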
input_handle is of course the path to your fasta file.
I'll have to check BioEdit out though.
List comprehensions, eh?
You can make that code more or less compact if you desire (I prefer not to use compact code, as it can be nasty when I come back to it, or when someone else has to use it).
Short:
This takes your list comprehension and does it in one sitting, so that you don't have to perform the "\n".join(list)
#!/usr/bin/python
from Bio import SeqIO
output = open("my_output.txt", "w")
# one pass: write each record ID straight to the file, no join needed
[output.write(rec.id + "\n") for rec in SeqIO.parse(open("my_input.txt", 'r'), "fasta")]
output.close()
Longer:
Typically I wouldn't do this either. I would open the file handle with various checks (size, does it exist etc), then loop over the file.
#!/usr/bin/python
from Bio import SeqIO
output = open("my_output.txt", "w")
for rec in SeqIO.parse(open("my_input.txt", 'r'), "fasta"):
    output.write(rec.id + "\n")  # rec.id is the header text up to the first whitespace
output.close()
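The "various checks (size, does it exist etc)" mentioned above could look something like the sketch below. It is only a rough guess at what such checks might be, not anyone's actual code: `checked_fasta_path` is a made-up helper name, and which checks are worth doing is an assumption.

```python
import os

def checked_fasta_path(path):
    """Return path unchanged if it names an existing, non-empty file; otherwise raise."""
    if not os.path.isfile(path):
        raise FileNotFoundError("no such file: %s" % path)
    if os.path.getsize(path) == 0:
        raise ValueError("file is empty: %s" % path)
    return path
```

You would then hand the checked path to open() before looping, e.g. open(checked_fasta_path("my_input.txt"), 'r'), so a missing or empty input fails early with a clear message instead of silently producing an empty output file.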
<I am trying *hard* to avoid my python code at the moment >
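Coming back to the original goal: once you have the names, a mothur group file is just two tab-separated columns, sequence name then group name. Here is a minimal sketch of that last step. It assumes the site can be worked out from the start of each name, which may not be true for your data; `group_lines`, the prefixes, and the site labels are all invented for illustration.

```python
def group_lines(names, site_for_prefix):
    """Build 'name<TAB>group' lines; names matching no prefix are returned separately."""
    lines, unmatched = [], []
    for name in names:
        for prefix, site in site_for_prefix.items():
            if name.startswith(prefix):
                lines.append(name + "\t" + site)
                break
        else:
            # no prefix matched, so keep the name aside rather than guess a group
            unmatched.append(name)
    return lines, unmatched

# Hypothetical names and prefix-to-site mapping, purely for illustration.
names = ["B10988", "B10989", "C20001", "X1"]
lines, unmatched = group_lines(names, {"B": "siteB", "C": "siteC"})
print("\n".join(lines))
print("unmatched:", unmatched)
```

Collecting the unmatched names instead of dropping them makes it easy to spot sequences that would otherwise be missing from the group file.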