quicker, better, faster - perl string to sequence format (Oct/22/2008 )

To perl people,

Is there a simpler, faster way of converting a string to a fasta block ( 60 chars newline 60 chars ... ) than this:

CODE

#!/usr/bin/perl
use warnings;
use strict;

my $string = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";

my $fstring = formatString( $string );

print $fstring;

sub formatString {
    my ( $seq ) = @_;
    my $i = 0;
    my $tmp = "";
    while ( $i < length $seq ) {
        $tmp .=  ( substr $seq, $i, 60 );
        $tmp .= "\n";
        $i += 60;
    }
    return($tmp);
}

oh, and you can't use bioperl - I think there is a way to do it using sprintf but I am not sure. This was my "I need it done 10 minutes ago" solution.

my output:

CODE

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAA

real    0m0.257s
user    0m0.011s
sys     0m0.014s

-perlmunky-

I don't know if it's faster, but I always use the following two sub functions to get strings of nucleotides or amino acids into the correct format. They are part of a biofunctions module I wrote containing routine sequence manipulation and calculation functions (MW, %GC, reverse complement, translate, etc.). The GenBank function was written first, then I stripped it to output in Fasta format, so they are related, but I never bothered to time the Fasta one -- your way may indeed be faster.

I don't know if you need the one that outputs sequences in GenBank format, but I included it just in case:

CODE

#!/usr/bin/perl -w
use strict;

my $seq = 'agagatcaacgcaaaaataggagatatcgtactggtggaaggaaccggactgaacgaggagatgatgcaaaccttg
atggctctcgacgaattcagagggcgggactttaccgggaagttacgcatgaactag';

my $fasta_seq = fasta($seq);

my $gb_seq = gb($seq);

print $fasta_seq;
print "\n";
print $gb_seq;

sub fasta {
    my $seq = shift;
    my $len = shift || 60;
    my $length = length ($seq);
    my ($i, $fasta);
    my $out_pat = "A$len";
    my $whole = int($length/$len) * $len;

    for  ($i = 0; $i < $whole; $i += $len) {
        my $line = pack ($out_pat, substr($seq, $i, $len)) . "\n";
        $fasta .= $line;
    }

    if (my $last = substr($seq, $i)) {
        $fasta .= $last . "\n";
    }
    return $fasta;
}

sub gb {
    my $seq = shift;
    my $len = 60;
    my $whole_pat = 'a10' x 6;
    my $out_pat = 'A11' x 6;
    my $length = length($seq);
    my ($i, $gb);
    my $whole = int($length / $len) * $len;
    for ($i = 0; $i < $whole; $i += $len) {
        my $blocks = pack $out_pat,
        unpack $whole_pat,
        substr($seq, $i, $len);
        $gb .= sprintf ("%9d", $i + 1) . " $blocks\n";
    }
    if (my $last = substr($seq, $i)) {
        my $last_len = length($last);
        my $last_pat = 'a10' x int($last_len / 10) .'a'. $last_len % 10;
        my $blocks = pack $out_pat,
        unpack($last_pat, $last);
        $gb .= sprintf ("%9d", $i + 1) . " $blocks\n";
    }
    return $gb;
}

This script ouputs:

CODE

agagatcaacgcaaaaataggagatatcgtactggtggaaggaaccggactgaacgagga
gatgatgcaaaccttgatggctctcgacgaattcagagggcgggactttaccgggaagtt
acgcatgaactag

        1 agagatcaac gcaaaaatag gagatatcgt actggtggaa ggaaccggac tgaacgagga
       61 gatgatgcaa accttgatgg ctctcgacga attcagaggg cgggacttta ccgggaagtt
      121 acgcatgaac tag

-HomeBrew-

Thanks for that.

The fasta function works at the same speed - which I supposed makes sense given that you are doing essentially what I am doing - with a for loop. I imagine this would change if we had a huge sequence - in which case, yours would be faster (while loops have extra overhead ).

Thanks for the genebank format, I don't need it now but it may well be handy in the future - it has been assimilated

-perlmunky-

I would bet that yours would be faster -- as you point out, I'm doing essentially the same thing as your original code save for an "if" versus "while" loop, but I have the added overhead of pack...

I might replace mine with yours -- assimilate it, so to speak...

I do a lot of stuff on bacterial genomes, like find all the orfs in this 6 million basepair genome and return them as DNA and amino acid sequences in Fasta files, along with info about the orf (start & end position, orientation, MW, IEP, %GC, etc.), and for such a task these functions work reasonable quickly.

-HomeBrew-

This is what I tried:

The pack (and possibly the other things you do ) add seconds to the operation time on my MacBook ( 1.83 GHz core 2 Duo 2 GB 667 MHz DDR2 SDRAM ). I combined my version with a for loop to see if there was a change - there isn't.

CODE

#!/usr/bin/perl
use warnings;
use strict;
my $string = "A" x 10000000;

my $fstring;

#$fstring = formatString( $string );
#real    0m0.272s
#user    0m0.185s
#sys     0m0.083s

#$fstring = fasta($string);
#real    0m0.339s
#user    0m0.251s
#sys     0m0.087s

$fstring = fasta2($string);
#real    0m0.275s
#user    0m0.187s
#sys     0m0.086s

print $fstring;

sub formatString {
    my ( $seq ) = @_;
    my $i = 0;
    my $tmp = "";
    while ( $i < length $seq ) {
        $tmp .=  ( substr $seq, $i, 60 );
        $tmp .= "\n";
        $i += 60;
    }
    return($tmp);
}

sub fasta {
    my $seq = shift;
    my $len = shift || 60;
    my $length = length ($seq);
    my ($i, $fasta);
    my $out_pat = "A$len";
    my $whole = int($length/$len) * $len;

    for  ($i = 0; $i < $whole; $i += $len) {
        my $line = pack ($out_pat, substr($seq, $i, $len)) . "\n";
        $fasta .= $line;
    }

    if (my $last = substr($seq, $i)) {
        $fasta .= $last . "\n";
    }
    return $fasta;
}

sub fasta2 {
    my $seq = shift;
    my ($i, $fasta);

    for ( $i = 0; $i < length $seq; $i+=60 ){
        $fasta .=  ( substr $seq, $i, 60 );
        $fasta .= "\n";
    }
    return $fasta;
}

#$fstring = formatString( $string );
#real 0m0.272s
#user 0m0.185s
#sys 0m0.083s

#$fstring = fasta($string);
#real 0m0.339s
#user 0m0.251s
#sys 0m0.087s

$fstring = fasta2($string);
#real 0m0.275s
#user 0m0.187s
#sys 0m0.086s

-perlmunky-

Well, not *seconds*, but it does add an additional 0.067th of a second -- and we'd need to compare them over many runs to see if there's a real difference. But, in any event, I thought yours would be faster...

-HomeBrew-

It was late - that's my excuse for the 'seconds' mistake.

-perlmunky-