quicker, better, faster - perl string to sequence format (Oct/22/2008 )
To perl people,
Is there a simpler, faster way of converting a string to a fasta block ( 60 chars newline 60 chars ... ) than this:
use warnings;
use strict;
my $string = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
my $fstring = formatString( $string );
print $fstring;
sub formatString {
my ( $seq ) = @_;
my $i = 0;
my $tmp = "";
while ( $i < length $seq ) {
$tmp .= ( substr $seq, $i, 60 );
$tmp .= "\n";
$i += 60;
}
return($tmp);
}
oh, and you can't use bioperl - I think there is a way to do it using sprintf but I am not sure. This was my "I need it done 10 minutes ago" solution.
my output:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAA
real 0m0.257s
user 0m0.011s
sys 0m0.014s
I don't know if it's faster, but I always use the following two sub functions to get strings of nucleotides or amino acids into the correct format. They are part of a biofunctions module I wrote containing routine sequence manipulation and calculation functions (MW, %GC, reverse complement, translate, etc.). The GenBank function was written first, then I stripped it to output in Fasta format, so they are related, but I never bothered to time the Fasta one -- your way may indeed be faster.
I don't know if you need the one that outputs sequences in GenBank format, but I included it just in case:
use strict;
my $seq = 'agagatcaacgcaaaaataggagatatcgtactggtggaaggaaccggactgaacgaggagatgatgcaaaccttg
atggctctcgacgaattcagagggcgggactttaccgggaagttacgcatgaactag';
my $fasta_seq = fasta($seq);
my $gb_seq = gb($seq);
print $fasta_seq;
print "\n";
print $gb_seq;
sub fasta {
my $seq = shift;
my $len = shift || 60;
my $length = length ($seq);
my ($i, $fasta);
my $out_pat = "A$len";
my $whole = int($length/$len) * $len;
for ($i = 0; $i < $whole; $i += $len) {
my $line = pack ($out_pat, substr($seq, $i, $len)) . "\n";
$fasta .= $line;
}
if (my $last = substr($seq, $i)) {
$fasta .= $last . "\n";
}
return $fasta;
}
sub gb {
my $seq = shift;
my $len = 60;
my $whole_pat = 'a10' x 6;
my $out_pat = 'A11' x 6;
my $length = length($seq);
my ($i, $gb);
my $whole = int($length / $len) * $len;
for ($i = 0; $i < $whole; $i += $len) {
my $blocks = pack $out_pat,
unpack $whole_pat,
substr($seq, $i, $len);
$gb .= sprintf ("%9d", $i + 1) . " $blocks\n";
}
if (my $last = substr($seq, $i)) {
my $last_len = length($last);
my $last_pat = 'a10' x int($last_len / 10) .'a'. $last_len % 10;
my $blocks = pack $out_pat,
unpack($last_pat, $last);
$gb .= sprintf ("%9d", $i + 1) . " $blocks\n";
}
return $gb;
}
This script ouputs:
gatgatgcaaaccttgatggctctcgacgaattcagagggcgggactttaccgggaagtt
acgcatgaactag
1 agagatcaac gcaaaaatag gagatatcgt actggtggaa ggaaccggac tgaacgagga
61 gatgatgcaa accttgatgg ctctcgacga attcagaggg cgggacttta ccgggaagtt
121 acgcatgaac tag
Thanks for that.
The fasta function works at the same speed - which I supposed makes sense given that you are doing essentially what I am doing - with a for loop. I imagine this would change if we had a huge sequence - in which case, yours would be faster (while loops have extra overhead ).
Thanks for the genebank format, I don't need it now but it may well be handy in the future - it has been assimilated
I would bet that yours would be faster -- as you point out, I'm doing essentially the same thing as your original code save for an "if" versus "while" loop, but I have the added overhead of pack...
I might replace mine with yours -- assimilate it, so to speak...
I do a lot of stuff on bacterial genomes, like find all the orfs in this 6 million basepair genome and return them as DNA and amino acid sequences in Fasta files, along with info about the orf (start & end position, orientation, MW, IEP, %GC, etc.), and for such a task these functions work reasonable quickly.
This is what I tried:
The pack (and possibly the other things you do ) add seconds to the operation time on my MacBook ( 1.83 GHz core 2 Duo 2 GB 667 MHz DDR2 SDRAM ). I combined my version with a for loop to see if there was a change - there isn't.
use warnings;
use strict;
my $string = "A" x 10000000;
my $fstring;
#$fstring = formatString( $string );
#real 0m0.272s
#user 0m0.185s
#sys 0m0.083s
#$fstring = fasta($string);
#real 0m0.339s
#user 0m0.251s
#sys 0m0.087s
$fstring = fasta2($string);
#real 0m0.275s
#user 0m0.187s
#sys 0m0.086s
print $fstring;
sub formatString {
my ( $seq ) = @_;
my $i = 0;
my $tmp = "";
while ( $i < length $seq ) {
$tmp .= ( substr $seq, $i, 60 );
$tmp .= "\n";
$i += 60;
}
return($tmp);
}
sub fasta {
my $seq = shift;
my $len = shift || 60;
my $length = length ($seq);
my ($i, $fasta);
my $out_pat = "A$len";
my $whole = int($length/$len) * $len;
for ($i = 0; $i < $whole; $i += $len) {
my $line = pack ($out_pat, substr($seq, $i, $len)) . "\n";
$fasta .= $line;
}
if (my $last = substr($seq, $i)) {
$fasta .= $last . "\n";
}
return $fasta;
}
sub fasta2 {
my $seq = shift;
my ($i, $fasta);
for ( $i = 0; $i < length $seq; $i+=60 ){
$fasta .= ( substr $seq, $i, 60 );
$fasta .= "\n";
}
return $fasta;
}
#$fstring = formatString( $string );
#real 0m0.272s
#user 0m0.185s
#sys 0m0.083s
#$fstring = fasta($string);
#real 0m0.339s
#user 0m0.251s
#sys 0m0.087s
$fstring = fasta2($string);
#real 0m0.275s
#user 0m0.187s
#sys 0m0.086s
Well, not *seconds*, but it does add an additional 0.067th of a second -- and we'd need to compare them over many runs to see if there's a real difference. But, in any event, I thought yours would be faster...
It was late - that's my excuse for the 'seconds' mistake.