mothur - i'm stupid (Jul/23/2009 )

hey all,
i've got a fasta file with about 1100 unique names (e.g. >B10988), and I need to generate a tab delimited file with just those names in a continuous column (without sequences). this is so i can then build a "group" file for mothur and run multi sample analysis (the different names refer to microbial population samples from six different sites). i can't seem to think of a way to do this in text wrangler, word, excel, etc., so any advice is much appreciated.

in general if you have recommendations for a program that is good at formatting and concatenating different sequence files (fasta, nexus, phylip, etc.) that would be super-awesome too.

-303microbialist-

303microbialist on Jul 23 2009, 12:42 PM said:

Ah, biopython seems to be the tool kit for these types of issues. Tutorial time.

-303microbialist-

303microbialist on Jul 23 2009, 11:42 AM said:

Forget Biopython; get a copy of BioEdit. This allows you to edit/cut/paste the sequence names independent of the sequence.

J

-Jugsy-

Perl can do this, but I'm not sure what you mean by a "tab-delimited column". If you want a column of FASTA names like:

B10988
B10989
B10990
B10991

use:

#!/usr/bin/perl -w
use strict;

open (IN, "filename.ext") or die "Couldn't find filename.ext: $!\n";
open (OUT, ">fasta_names.txt") or die "Couldn't open fasta_names.txt: $!\n";

# print everything after the > on each header line, one name per line
while (<IN>) {
    if (/^>(.+)/) {
        print OUT "$1\n";
    }
}

If you want a tab-delimited *row*, like:

B10988<tab>B10989<tab>B10990<tab>B10991

use:

#!/usr/bin/perl -w
use strict;

open (IN, "filename.ext") or die "Couldn't find filename.ext: $!\n";
open (OUT, ">fasta_names.txt") or die "Couldn't open fasta_names.txt: $!\n";

# print each name followed by a tab, all on a single line
while (<IN>) {
    if (/^>(.+)/) {
        print OUT "$1\t";
    }
}
print OUT "\n";

This assumes that there's nothing on the FASTA description line other than the gene name. If there's something else, like a description of the gene following its name, change the line:

if (/^>(.+)/) {

to:

if (/^>(.+?)\s/) {

so it will capture all text following the > up until the first space it encounters.
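For comparison, here is a rough Python sketch of the same two capture rules (whole description line versus text up to the first whitespace). The function name fasta_names and its first_token_only switch are made up for illustration and are not from the scripts above.

```python
import re

# Hypothetical helper (not from the thread): collect names from FASTA header lines.
def fasta_names(lines, first_token_only=True):
    # ^>(\S+) keeps text up to the first whitespace;
    # ^>(.+) keeps the whole description line.
    pattern = re.compile(r"^>(\S+)" if first_token_only else r"^>(.+)")
    names = []
    for line in lines:
        match = pattern.match(line.rstrip("\r\n"))
        if match:
            names.append(match.group(1))
    return names

# Made-up example records, only to show the two output layouts.
example = [">B10988 16S rRNA gene", "ACGT", ">B10989", "TTGA"]
print("\n".join(fasta_names(example)))   # names as a column
print("\t".join(fasta_names(example)))   # names as a tab-delimited row
```

Joining with "\n" gives the column layout and joining with "\t" gives the row layout, so the extraction step stays the same either way.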

-HomeBrew-

on a *nix machine

grep '>' input_filename.txt > output_filename.txt

to make sure you get only those starting with > you can replace the above:
grep '^>' input_filename.txt > output_filename.txt

you can then use awk to get at specific columns
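Note that grep leaves the leading > on every line it prints. As a sketch of the follow-up column step, here is roughly what it could look like in Python rather than awk; header_column is a hypothetical name, and columns are assumed to be whitespace-separated.

```python
# Hypothetical helper: pick one whitespace-separated column from each header line.
def header_column(lines, column=0):
    picked = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()   # drop the > and split on whitespace
            if column < len(fields):    # skip headers that lack this column
                picked.append(fields[column])
    return picked

# Made-up header lines, as grep '^>' would print them.
grep_output = [">B10988 siteA", ">B10989 siteB", ">B10990"]
print("\n".join(header_column(grep_output, 0)))
```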

-perlmunky-

Ahh - the difficult to read but incredibly useful world of Perl one-liners (these are written for Windows):

perl -nle "print for /^>(\S+)/" your_fasta_file.ext > gene_names.txt

for a column of gene names, or:

perl -ne "while(/^>(.+?)\s/g){print \"$1\t\"}" your_fasta_file.ext > gene_names.txt

for a row of tab-delimited gene names.

:lol:

-HomeBrew-

HomeBrew on Aug 18 2009, 02:14 PM said:

ok, now I have to look at how to do this with assembly.

-perlmunky-

perlmunky on Aug 20 2009, 04:32 AM said:

ok, now I have to look at how to do this with assembly.

dosseg
.model small
.stack 100h

.data
hello_message db 'Hello, World!',0dh,0ah,'$'

.code
main  proc
	  mov	ax,@data
	  mov	ds,ax

	  mov	ah,9
	  mov	dx,offset hello_message
	  int	21h

	  mov	ax,4C00h
	  int	21h
main  endp
end   main

No, wait -- that's not right... :lol:

-HomeBrew-

Thanks everyone!

I did figure out how to do it easily in BioPython too if anyone's interested:

>>> from Bio import SeqIO
>>> seq_rec_list = [seq_rec.id for seq_rec in SeqIO.parse(input_handle, "fasta")]
>>> seq_rec_list
>>> seq_rec_string = '\n'.join(seq_rec_list)
>>> output_handle.write(seq_rec_string)

input_handle is of course the path to your fasta file.

I'll have to check BioEdit out though.
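For the next step, a mothur group file is a two-column, tab-delimited file: sequence name, then group name. A minimal sketch of building those lines from the name list follows. How a name maps to one of the six sites isn't stated anywhere in this thread, so site_from_prefix below is a purely hypothetical rule; swap in whatever actually identifies your samples.

```python
# Build "name<TAB>group" lines for a mothur group file.
# site_of is any function that maps a sequence name to its group.
def group_lines(names, site_of):
    return [name + "\t" + site_of(name) for name in names]

# Hypothetical rule for illustration only: pretend the first letter
# of the sequence name identifies the sampling site.
def site_from_prefix(name):
    return "site" + name[0]

# Made-up names in the same style as the question.
names = ["B10988", "B10989", "C20001"]
print("\n".join(group_lines(names, site_from_prefix)))
```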

-303microbialist-

List comprehensions, eh?

You can make that code more compact or less if you desire (I prefer not to use compact code as it can be nasty when I come back to it - or someone else has to use it)

Short:
This takes your list comprehension and does it in one sitting so that you don't have to perform the "\n".join(list)
#!/usr/bin/python
from Bio import SeqIO
output = open("my_output.txt", "w")
[output.write(rec.id + "\n") for rec in SeqIO.parse(open("my_input.txt", 'r'), "fasta")]

Longer:
Typically I wouldn't do this either. I would open the file handle with various checks (size, does it exist etc), then loop over the file.
#!/usr/bin/python
from Bio import SeqIO
output = open("my_output.txt", "w")
for rec in SeqIO.parse( open("my_input.txt", 'r'), "fasta" ):
    output.write( rec.id + "\n")
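A sketch of what the "various checks" version might look like. To keep it runnable without Biopython it reads the header lines directly with the standard library instead of SeqIO, and it takes the first whitespace-separated word of each header, which is what rec.id gives for FASTA input. The function name write_fasta_names and the particular checks are assumptions, not anyone's actual script.

```python
import os

# Hypothetical helper: check the input file, then write one name per line.
def write_fasta_names(in_path, out_path):
    # Checks before parsing: the file must exist and must not be empty.
    if not os.path.isfile(in_path):
        raise FileNotFoundError(in_path + " does not exist")
    if os.path.getsize(in_path) == 0:
        raise ValueError(in_path + " is empty")

    count = 0
    with open(in_path) as fasta, open(out_path, "w") as out:
        for line in fasta:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:                      # ignore a bare ">" line
                    out.write(fields[0] + "\n")
                    count += 1
    return count                                # number of names written
```

Calling write_fasta_names("my_input.txt", "my_output.txt") would then return how many names were written, which is a cheap sanity check against the expected ~1100.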

<I am trying *hard* to avoid my python code at the moment :lol: >

-perlmunky-