Filtering & removing non-sequence data from a FASTA FILE using Perl scripts. - (Jun/13/2006 )

Hi there,

I have been trying to remove non-sequence data from a FASTA file (i.e. parsing protein sequences) and have written the following script:

CODE
#!usr/bin/perl
$infile = 'input.txt';
unless(open (INFILE, $infile)){
print STDERR "Cannot open file \"$infile\n\n";
exit;
}
@infile = <INFILE>;
close INFILE;
open (OUTPUTFILE, ">C:\skara.txt");
my $line = "";
while ($line = <INFILE>){
if ($line !~ /^>/) {
chomp ($line);
$line .= $line;
}
}
close (INFILE);

foreach $line (@infile)  
{

if(($line =~ /\d/) && (not $line =~ /^>/))
{
$line =~ s/\d/ /g;
}
elsif ($line =~ /\s./)
{
$line =~ s/\s./ /g;
}
print OUTPUTFILE "$line"; }      
close OUTPUTFILE;
exit;


BUT THERE SEEMS TO BE A PROBLEM. The script doesn't filter out all numeric data.



This is my INPUT file. I am trying to remove the digits and whitespace, as I only need the accession number and the sequence:

>P67870
MSSSEEVSWISWFCGLRGNEFFCEVDEDYIQDKFNLTGLNEQVPHYRQALDMILDLEPDEELEDNPNQSDLIEQAAEMLY
GLIHARYILTNRGIAQMLEKYQQGDFGYCPRVYCENQPMLPIGLSDIPGEAMVKLYCPKCMDVYTPKSSRHHHTDGAYFGT
GFPHMLFMVHPEYRPKRPANQFVPRLYGFKIHPMAYQLQLQAASNFKSPVKTIR 209 S 7578274 CDK LTP 2004-12-31 00:00:00+01

>Q99640
MLERPPALAMPMPTEGTPPPLSGTPIPVPAYFRHAEPGFSLKRPRGLSRSLPPPPPAKGSIPISRLFPPRTPGWHQLQPR
RVSFRGEASETLQSPGYDPSRPESFFQQSFQRLSRLGHGSYGEVFKVRSKEDGRLYAVKRSMSPFRGPKDRARKLAEVGSH
EKVGQHPCCVRLEQAWEEGGILYLQTELCGPSLQQHCEAWGASLPEAQVWGYLRDTLLALAHLHSQGLVHLDVKPANIFLG
PRGRCKLGDFGLLVELGTAGAGEVQEGDPRYMAPELLQGSYGTAADVFSLGLTILEVACNMELPHGGEGWQQLRQGYLPPE
FTAGLSSELRSVLVMMLEPDPKLRATAEALLALPVLRQPRAWGVLWCMAAEALSRGWALWQALLALLCWLWHGLAHPASWL
QPLGPPATPPGSPPCSLLLDSSLSSNWDDDSLGPSLSPEAVLARTVGSTSTPRSRCTPRDALDLSDINSEPPRGSFPSFEP
RNLLSLFEDTLDPT 426 S 12738781 PLK1 LTP 2004-12-31 00:00:00+01

>P10747
MLRLLLALNLFPSIQVTGNKILVKQSPMLVAYDNAVNLSCKYSYNLFSREFRASLHKGLDSAVEVCVVYGNYSQQLQVYS
KTGFNCDGKLGNESVTFYLQNLYVNQTDIYFCKIEVMYPPPYLDNEKSNGTIIHVKGKHLCPSPLFPGPSKPFWVLVVVGG
VLACYSLLVTVAFIIFWVRSKRSRLLHSDYMNMTPRRPGPTRKHYQPYAPPRDFAAYRS 191 Y 8992971 Lck;ITK LTP 2004-12-31 00:00:00+01

>P04083
AMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIK
AAYLQETGKPLDETLKKALTGHLEEVVLALLKTPAQFDADELRAAMKGLGTDEDTLIEILASRTNKEIRDINRVYREELKR
DLAKDITSDTSGDFRNALLSLAKGDRSEDFGVNEDLADSDARALYEAGERRKGTDVNVFNTILTTRSYPQLRRVFQKYTKY
SKHDMNKVLDLELKGDIEKCLTAIVKCATSKPAFFAEKLHQAMKGVGTRHKALIRIMVSRSEIDMNDIKAFYQKMYGISLC
QAILDETKGDYEKILVALCGGN 20 Y 2457390 Abl;Src;EGFR LTP 2004-12-31 00:00:00+01

>P19838
MAEDDPYLGRPEQMFHLDPSLTHTIFNPEVFQPQMALPTDGPYLQILEQPKQRGFRFRYVCEGPSHGGLPGASSEKNKKS
YPQVKICNYVGPAKVIVQLVTNGKNIHLHAHSLVGKHCEDGICTVTAGPKDMVVGFANLGILHVTKKKVFETLEARMTEAC
IRGYNPGLLVHPDLAYLQAEGGGDRQLGDREKELIRQAALQQTKEMDLSVVRLMFTAFLPDSTGSFTRRLEPVVSDAIYDS
KAPNASNLKIVRMDRTAGCVTGGEEIYLLCDKVQKDDIQIRFYEEEENGGVWEGFGDFSPTDVHRQFAIVFKTPKYKDINI
TKPASVFVQLRRKSDLETSEPKPFLYYPEIKDKEEVQRKRQKLMPNFSDSFGGGSGAGAGGGGMFGSGGGGGGTGSTGPGY
SFPHYGFPTYGGITFHPGTTKSNAGMKHGTMDTESKKDPEGCDKSDDKNTVNLFGKVIETTEQDQEPSEATVGNGEVTLTY
ATGTKEESAGVQDNLFLEKAMQLAKRHANALFDYAVTGDVKMLLAVQRHLTAVQDENGDSVLHLAIIHLHSQLVRDLLEVT
SGLISDDIINMRNDLYQTPLHLAVITKQEDVVEDLLRAGADLSLLDRLGNSVLHLAAKEGHDKVLSILLKHKKAALLLDHP
NGDGLNAIHLAMMSNSLPCLLLLVAAGADVNAQEQKSGRTALHLAVEHDNISLAGCLLLEGDAHVDSTTYDGTTPLHIAAG
RGSTRLAALLKAAGADPLVENFEPLYDLDDSWENAGEDEGVVPGTTPLDMATSWQVFDILNGKPYEPEFTSDDLLAQGDMK
QLAEDVKLQLYKLLEIPDPDKNWATLAQKLGLGILNNAFRLSPAPSKTLMDNYEVSGGTVRELVEALRQMGYTEAIEVIQA
ASSPVKTTSQAHSLPLSPASTRQQIDELRDSDSVCDSGVETSFRKLSFTESLTSGASLLTLNKMPHDYGQEGPLEGKI 923 S 11158290

IKK LTP 2004-12-31 00:00:00+01

BUT when I run the perl script, not all the digits are removed as I think it skips lines.

This is my OUTPUT after running the program:
>P67870 SSSEEVSWISWFCGLRGNEFFCEVDEDYIQDKFNLTGLNEQVPHYRQALDMILDLEPDEELEDNPNQSDLIEQAAEMLYG
LIHARYILTNRGIAQMLEKYQQGDFGYCPRVYCENQPMLPIGLSDIPGEAMVKLYCPKCMDVYTPKSSRHHHTDGAYFGTG
FPHMLFMVHPEYRPKRPANQFVPRLYGFKIHPMAYQLQLQAASNFKSPVKTIR 09 578274 DK TP 004-12-31 0:00:00+01

>Q99640 LERPPALAMPMPTEGTPPPLSGTPIPVPAYFRHAEPGFSLKRPRGLSRSLPPPPPAKGSIPISRLFPPRTPGWHQLQPRR
VSFRGEASETLQSPGYDPSRPESFFQQSFQRLSRLGHGSYGEVFKVRSKEDGRLYAVKRSMSPFRGPKDRARKLAEVGSHE
KVGQHPCCVRLEQAWEEGGILYLQTELCGPSLQQHCEAWGASLPEAQVWGYLRDTLLALAHLHSQGLVHLDVKPANIFLGP
RGRCKLGDFGLLVELGTAGAGEVQEGDPRYMAPELLQGSYGTAADVFSLGLTILEVACNMELPHGGEGWQQLRQGYLPPEF
TAGLSSELRSVLVMMLEPDPKLRATAEALLALPVLRQPRAWGVLWCMAAEALSRGWALWQALLALLCWLWHGLAHPASWLQ
PLGPPATPPGSPPCSLLLDSSLSSNWDDDSLGPSLSPEAVLARTVGSTSTPRSRCTPRDALDLSDINSEPPRGSFPSFEPR
NLLSLFEDTLDPT 26 2738781 LK1 TP 004-12-31 0:00:00+01

>P10747 LRLLLALNLFPSIQVTGNKILVKQSPMLVAYDNAVNLSCKYSYNLFSREFRASLHKGLDSAVEVCVVYGNYSQQLQVYSK
TGFNCDGKLGNESVTFYLQNLYVNQTDIYFCKIEVMYPPPYLDNEKSNGTIIHVKGKHLCPSPLFPGPSKPFWVLVVVGGV
LACYSLLVTVAFIIFWVRSKRSRLLHSDYMNMTPRRPGPTRKHYQPYAPPRDFAAYRS 91 992971 ck;ITK TP 004-12-31 0:00:00+01

>P04083 MVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKA
AYLQETGKPLDETLKKALTGHLEEVVLALLKTPAQFDADELRAAMKGLGTDEDTLIEILASRTNKEIRDINRVYREELKRD
LAKDITSDTSGDFRNALLSLAKGDRSEDFGVNEDLADSDARALYEAGERRKGTDVNVFNTILTTRSYPQLRRVFQKYTKYS
KHDMNKVLDLELKGDIEKCLTAIVKCATSKPAFFAEKLHQAMKGVGTRHKALIRIMVSRSEIDMNDIKAFYQKMYGISLCQ
AILDETKGDYEKILVALCGGN 0 457390 bl;Src;EGFR TP 004-12-31 0:00:00+01

>P19838 AEDDPYLGRPEQMFHLDPSLTHTIFNPEVFQPQMALPTDGPYLQILEQPKQRGFRFRYVCEGPSHGGLPGASSEKNKKSY
PQVKICNYVGPAKVIVQLVTNGKNIHLHAHSLVGKHCEDGICTVTAGPKDMVVGFANLGILHVTKKKVFETLEARMTEACI
RGYNPGLLVHPDLAYLQAEGGGDRQLGDREKELIRQAALQQTKEMDLSVVRLMFTAFLPDSTGSFTRRLEPVVSDAIYDSK
APNASNLKIVRMDRTAGCVTGGEEIYLLCDKVQKDDIQIRFYEEEENGGVWEGFGDFSPTDVHRQFAIVFKTPKYKDINIT
KPASVFVQLRRKSDLETSEPKPFLYYPEIKDKEEVQRKRQKLMPNFSDSFGGGSGAGAGGGGMFGSGGGGGGTGSTGPGYS
FPHYGFPTYGGITFHPGTTKSNAGMKHGTMDTESKKDPEGCDKSDDKNTVNLFGKVIETTEQDQEPSEATVGNGEVTLTYA
TGTKEESAGVQDNLFLEKAMQLAKRHANALFDYAVTGDVKMLLAVQRHLTAVQDENGDSVLHLAIIHLHSQLVRDLLEVTS
GLISDDIINMRNDLYQTPLHLAVITKQEDVVEDLLRAGADLSLLDRLGNSVLHLAAKEGHDKVLSILLKHKKAALLLDHPN
GDGLNAIHLAMMSNSLPCLLLLVAAGADVNAQEQKSGRTALHLAVEHDNISLAGCLLLEGDAHVDSTTYDGTTPLHIAAGR
GSTRLAALLKAAGADPLVENFEPLYDLDDSWENAGEDEGVVPGTTPLDMATSWQVFDILNGKPYEPEFTSDDLLAQGDMKQ
LAEDVKLQLYKLLEIPDPDKNWATLAQKLGLGILNNAFRLSPAPSKTLMDNYEVSGGTVRELVEALRQMGYTEAIEVIQAA
SSPVKTTSQAHSLPLSPASTRQQIDELRDSDSVCDSGVETSFRKLSFTESLTSGASLLTLNKMPHDYGQEGPLEGKI 23 1158290

IKK LTP - - : : +

NOTHING SEEMS TO HAVE HAPPENED :-(

Does anyone know what the problem could be as I think there may be something wrong with my script? I have to filter 2500 sequences but this doesn't seem to be working.

Are there any ready-made programs which will filter my FASTA file and remove numbers, whitespace, lowercase characters, +, -, etc?
Any suggestions would be very much appreciated.
Thank you in advance.

Sara

-sara.pl-

Try it this way:

CODE
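#!/usr/bin/perl -w
use strict;

open (IN, "input.txt") or die "Can't open input.txt: $!\n";
open (OUT, ">skara.txt") or die "Can't open skara.txt: $!\n";

while (<IN>) {
   # Strip the trailing annotation: everything from the first run of
   # whitespace followed by digits (e.g. " 209 S 7578274 CDK LTP ...")
   # through the end of the line, keeping the newline.
   s/\s+\d+.+\n/\n/;
   print OUT;
}

close IN;
close OUT or die "Can't close skara.txt: $!\n";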


What's up with the final "IKK LTP 2004-12-31 00:00:00+01" line?
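
As a quick check of what the substitution in the script above does, here is a minimal sketch that applies it to three lines taken from the sample input. The helper name `strip_annotation` is made up for this illustration and is not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply the thread's substitution to one line
# and return the result.
sub strip_annotation {
    my ($line) = @_;
    $line =~ s/\s+\d+.+\n/\n/;
    return $line;
}

# Three lines copied from the sample input in the question.
my @samples = (
    ">P67870\n",
    "GFPHMLFMVHPEYRPKRPANQFVPRLYGFKIHPMAYQLQLQAASNFKSPVKTIR 209 S 7578274 CDK LTP 2004-12-31 00:00:00+01\n",
    "IKK LTP 2004-12-31 00:00:00+01\n",
);

print strip_annotation($_) for @samples;
```

The header line is left alone because no whitespace precedes its digits, and the sequence line is cut back to `...SPVKTIR`. The stand-alone line is only shortened to `IKK LTP`, which is why that line needs separate handling.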

-HomeBrew-

Oops -- I didn't notice you wanted to remove the "whitespace". Does that mean blank lines between the sequences, too? I can easily fix it if the answer's yes...

-HomeBrew-

Homebrew's code is elegant indeed. Only five lines.

Perl is powerful!

-pcrman-

QUOTE (HomeBrew @ Jun 14 2006, 04:47 AM)
Oops -- I didn't notice you wanted to remove the "whitespace". Does that mean blank lines between the sequences, too? I can easily fix it if the answer's yes...



Hi HomeBrew,

Yes, I am trying to remove whitespace and all "junk" data (e.g. numbers, lowercase characters, *, +, - etc). I only need the sequences and the lines beginning with >.

I shall try out your Perl script tonight. If you have any other scripts or ways of filtering then please let me know. I am in desperate need of some professional help. Thank you very much.

Sara

P.S. I don't know why the "IKK LTP ..." line was printed to the output file. Is something wrong, in that the script doesn't remove the numbers? Thanks again.

-sara.pl-

Hi sara.pl (nice nick, BTW :D) --

I see two possibilities here -- if you want to remove the blank lines, but retain the end-of-line characters, try this:

CODE
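#!/usr/bin/perl -w
use strict;

open (IN, "input.txt") or die "Can't open input.txt: $!\n";
open (OUT, ">skara.txt") or die "Can't open skara.txt: $!\n";

while (<IN>) {
   next if /^\s*$/;                      # skip blank lines
   # Skip stand-alone annotation lines such as "IKK LTP 2004-12-31 00:00:00+01".
   # This test runs before the substitution, which would otherwise strip the
   # date and leave "IKK LTP" behind.
   next if /^(\w{3}\s){2}([\d:+-])+/;
   s/\s+\d+.+\n/\n/;                     # strip the trailing " 209 S 7578274 ..." annotation
   print OUT;
}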


Notice I added a new regular expression to filter out "IKK LTP 2004-12-31 00:00:00+01" and such when they appear on lines by themselves. I only have one example of this type of line to work from, so if these are spread throughout your file and the regexp is either too strict or too loose to catch them all, let me know and we'll tune it up.
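
To see how strict or loose that filter is, here is a small sketch that runs the same regular expression against a few lines. The helper name `is_annotation_line` is made up for this illustration, and the last sample is a hypothetical stand-alone variant built from the "Lck;ITK" annotation in the sample input, not a line that appears in it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the line matches the thread's filter for
# stand-alone annotation lines (two three-character words, then a date).
sub is_annotation_line {
    my ($line) = @_;
    return $line =~ /^(\w{3}\s){2}([\d:+-])+/ ? 1 : 0;
}

my @samples = (
    "IKK LTP 2004-12-31 00:00:00+01\n",   # stand-alone annotation line
    "IKK LTP\n",                          # same line after s/\s+\d+.+\n/\n/
    ">P19838\n",                          # header
    "QAILDETKGDYEKILVALCGGN 20 Y 2457390 Abl;Src;EGFR LTP 2004-12-31 00:00:00+01\n",
    "Lck;ITK LTP 2004-12-31 00:00:00+01\n",   # hypothetical variant
);

for my $line (@samples) {
    my $verdict = is_annotation_line($line) ? "skip" : "keep";
    print "$verdict: $line";
}
```

Only the first sample is reported as "skip". The second shows that once the date has been stripped the filter no longer matches, so the test has to see the line before the substitution runs. The last shows the "too strict" case: a kinase list containing `;` or a name that is not exactly three characters slips through.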

The second possibility is if you want to remove the blank lines *and* the end-of-line characters. If that's the case, try this:

CODE
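#!/usr/bin/perl -w
use strict;

open (IN, "input.txt") or die "Can't open input.txt: $!\n";
open (OUT, ">skara.txt") or die "Can't open skara.txt: $!\n";

while (<IN>) {
   # Skip stand-alone annotation lines such as "IKK LTP 2004-12-31 00:00:00+01".
   # This test runs before the substitution below, which would otherwise
   # strip the date and leave "IKK LTP" glued onto the sequence.
   next if /^(\w{3}\s){2}([\d:+-])+/;
   s/^>(.*)$/\n>$1\n/;    # surround each header with newlines
   s/\s+\d+.+\n/\n/;      # strip the trailing " 209 S 7578274 ..." annotation
   chomp;                 # drop one line ending so sequence lines are joined
   print OUT;
}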


Let me know if the scripts need more work...
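
If the goal is to drop everything that is not an uppercase residue (numbers, whitespace, lowercase characters, `*`, `+`, `-`), one possible variation is sketched below. It is not one of the scripts above: the sub name `clean_fasta` is made up, the annotation is stripped with a slightly different pattern (`\s+\d.*`), and the records in the usage lines are shortened, hypothetical examples in the style of the sample input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes raw input lines and returns cleaned FASTA
# lines, one header line followed by one joined sequence line per record.
sub clean_fasta {
    my @lines = @_;
    my (@out, $seq);
    $seq = '';
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;                        # drop the line ending
        next if $line =~ /^\s*$/;                    # skip blank lines
        next if $line =~ /^(\w{3}\s){2}[\d:+-]+/;    # skip stand-alone annotation lines
        if ($line =~ /^>/) {
            push @out, $seq if length $seq;          # flush the previous sequence
            $seq = '';
            push @out, $line;                        # keep the header as it is
            next;
        }
        $line =~ s/\s+\d.*\z//;                      # strip the trailing annotation
        $line =~ tr/A-Z//cd;                         # delete everything except A-Z
        $seq .= $line;
    }
    push @out, $seq if length $seq;
    return @out;
}

# Shortened, hypothetical records for illustration only.
my @raw = (
    ">P67870\n",
    "MSSSEEVSW\n",
    "SPVKTIR 209 S 7578274 CDK LTP 2004-12-31 00:00:00+01\n",
    "\n",
    ">P19838\n",
    "MAEDDPYLG 923 S 11158290\n",
    "\n",
    "IKK LTP 2004-12-31 00:00:00+01\n",
);

print "$_\n" for clean_fasta(@raw);
```

The annotation has to be stripped before the `tr` step, because it contains uppercase letters of its own (`S`, `CDK`, `LTP`) that would otherwise be kept as if they were residues.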

-HomeBrew-

Hi HomeBrew,

Yes, things like "IKK LTP 2004-12-31 00:00:00+01" are spread out throughout the file. I shall try out the perl scripts tonight and let you know what happens. Thank you very much for your help. It is very much appreciated.

Sara :-)

-sara.pl-

Were the scripts what you needed, sara.pl?

-HomeBrew-

QUOTE (HomeBrew @ Jun 20 2006, 04:21 PM)
Were the scripts what you needed, sara.pl?



Hi HomeBrew,

Sorry for the delay in replying. Been busy watching the World Cup! :)

Yes, thank you very much. The scripts worked very well: they removed all the rubbish from my input file and I got a perfect FASTA file as output. I preferred script 2 as it removed the end-of-line characters too.

I'd like to thank you very much as you've been a great help to my project. Will have to put you down in my dissertation reference section LOL :D

Thank you once again.

Sara

P.S. I hope it's ok with you, if I can ask for any further help with this project. Cheers!

-sara.pl-

QUOTE (sara.pl @ Jun 20 2006, 03:26 PM)
P.S. I hope it's ok with you, if I can ask for any further help with this project. Cheers!


Of course -- we're all glad to help!

-HomeBrew-