Perl script - parsing non-phosphorylated protein sequences from a Uniprot flat f - (Jul/21/2006 )
Hi there
I have a set of data that contains human sequences like these (obtained from Uniprot):
ID 1433B_HUMAN STANDARD; PRT; 245 AA.
AC P31946;
OS Homo sapiens (Human).
KW 3D-structure; Acetylation; Alternative initiation;
KW Direct protein sequencing; Phosphorylation.
>1433B_HUMAN
TMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSW
RVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYL
KMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYY
EILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDA
GEGEN
ID 1433E_HUMAN STANDARD; PRT; 255 AA.
AC P62258; P29360; P42655; Q63631;
OS Homo sapiens (Human).
KW 3D-structure; Acetylation; Direct protein sequencing;
KW Host-virus interaction.
>1433E_HUMAN
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
ID 1433F_HUMAN STANDARD; PRT; 245 AA.
AC Q04917;
OS Homo sapiens (Human).
KW 3D-structure; Acetylation; Direct protein sequencing; Phosphorylation.
>1433F_HUMAN
GDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSWR
VISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESKV
FYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFSV
FYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDEE
AGEGN
I only want to retrieve those sequences and accession numbers that are non-phosphorylated (i.e. those that don't contain the keywords "Phosphorylation".
I have created a perl script like this:
!usr/bin/perl
$infile = 'human.txt';
unless(open (INFILE, $infile))
{
print STDERR "Cannot open file \"$infile\"\n\n";
exit;
}
@infile = <INFILE>;
close INFILE;
open (OUTPUTFILE, ">C:\temp.txt");
my $line="";
while (<INFILE>){
if($line=~ /^ID/){
}elsif($line=~ /^AC/){
}elsif($line=~/^OS/){
}elsif(($line =~ /\w/) && (not $line =~ /^Phosphorylation/)){
print OUTPUTFILE "$line";}}
close OUTPUTFILE;
exit;
BUT nothing seems to happen
Is there a problem in my script? I think there is but can't seem to figure it out. I would really appreciate any help or suggestions.
Thank you in advance
Sara
Hi Sara --
Try it this way:
use strict;
$/ = "\nID ";
open (IN, "human.txt") or die "Cannot open human.txt for input: $!\n";
open (OUT, ">parsed.txt") or die "Cannot open parsed.txt for output: $!\n";
while (<IN>){
next if (/Phosphorylation/);
chomp;
print OUT "ID $_\n";
}
print "Done.\n";
This does nothing to alter the format of the records contained in the output file (parsed.txt), except insure they don't have the word "Phosphorylation" in them.
Do you need to reformat the output (perhaps strip out uneeded headrs or something)?
Try it this way:
#!/usr/bin/perl -w
use strict;
$/ = "\nID ";
open (IN, "human.txt") or die "Cannot open human.txt for input: $!\n";
open (OUT, ">parsed.txt") or die "Cannot open parsed.txt for output: $!\n";
while (<IN>){
next if (/Phosphorylation/);
chomp;
print OUT "ID $_\n";
}
print "Done.\n";
This does nothing to alter the format of the records contained in the output file (parsed.txt), except insure they don't have the word "Phosphorylation" in them.
Do you need to reformat the output (perhaps strip out uneeded headrs or something)?
Hi Homebrew,
The script worked very well. Thank you!

This is an output example where non-phosphorylated sequences have been found (i.e all those containing the Keyword "phosphorylation" have been removed leaving me with non-phosphorylated sets:
ID 1433E_HUMAN STANDARD; PRT; 255 AA.
AC P62258; P29360; P42655; Q63631;
OS Homo sapiens (Human).
KW 3D-structure; Acetylation; Direct protein sequencing;
KW Host-virus interaction.
>1433E_HUMAN
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
ID 1433T_HUMAN STANDARD; PRT; 245 AA.
AC P27348; Q567U5; Q5TZU8; Q9UP48;
OS Homo sapiens (Human).
KW 3D-structure; Direct protein sequencing.
>1433T_HUMAN
MEKTELIQKAKLAEQAERYDDMATCMKAVTEQGAELSNEERNLLSVAYKNVVGGRRSAWR
VISSIEQKTDTSDKKLQLIKDYREKVESELRSICTTVLELLDKYLIANATNPESKVFYLK
MKGDYFRYLAEVACGDDRKQTIDNSQGAYQEAFDISKKEMQPTHPIRLGLALNFSVFYYE
ILNNPELACTLAKTAFDEAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDSAGEECDAA
EGAEN
ID 15E2_HUMAN STANDARD; PRT; 136 AA.
AC O43716;
OS Homo sapiens (Human).
KW Hypothetical protein.
>15E2_HUMAN
MWSRLVWLGLRAPLGGRQGFTSKADPQGSGRITAAVIEHLERLALVDFGSREAVARLEKA
IAFADRLRAVDTDGVEPMESVLEDRCLYLRSDNVVEGNCADELLQNSHRVVEEYFVAPPG
NISLPKLDEQEPFPHS
Would you know how I can put the accession numbers in place of ID's in the line containing ">"?
Take for example the last entry 15E2_HUMAN - could this be formatted like this since I only want sequence and accession number in my output file:-
>O43716
MWSRLVWLGLRAPLGGRQGFTSKADPQGSGRITAAVIEHLERLALVDFGSREAVARLEKA
IAFADRLRAVDTDGVEPMESVLEDRCLYLRSDNVVEGNCADELLQNSHRVVEEYFVAPPG
NISLPKLDEQEPFPHS
where O43716 is the accession number.
Thank you very much once again.....You've been a big help to my project!

Sara
Sure, that should be an easy tweak. How do you want to handle records with more than one accession number (see ID 1433T_HUMAN STANDARD, for example)? We could retain just the first one, thus:
>P27348
MEKTELIQKAKLAEQAERYDDMATCMKAVTEQGAELSNEERNLLSVAYKNVVGGRRSAWR
VISSIEQKTDTSDKKLQLIKDYREKVESELRSICTTVLELLDKYLIANATNPESKVFYLK
MKGDYFRYLAEVACGDDRKQTIDNSQGAYQEAFDISKKEMQPTHPIRLGLALNFSVFYYE
ILNNPELACTLAKTAFDEAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDSAGEECDAA
or retain them all, thus:
>P27348 Q567U5 Q5TZU8 Q9UP48
MEKTELIQKAKLAEQAERYDDMATCMKAVTEQGAELSNEERNLLSVAYKNVVGGRRSAWR
VISSIEQKTDTSDKKLQLIKDYREKVESELRSICTTVLELLDKYLIANATNPESKVFYLK
MKGDYFRYLAEVACGDDRKQTIDNSQGAYQEAFDISKKEMQPTHPIRLGLALNFSVFYYE
ILNNPELACTLAKTAFDEAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDSAGEECDAA
Also, is there any other info you'd like to include on the description line (AA count, perhaps)?
Here's an example:
use strict;
$/ = "\nID ";
open (IN, "human.txt") or die "Cannot open human.txt for input: $!\n";
open (OUT, ">parsed.txt") or die "Cannot open parsed.txt for output: $!\n";
while (<IN>) {
next if (/Phosphorylation/);
chomp;
my ($aa) = (/PRT;\s+(\d+)\s+AA\./);
my ($seq) = (/>.*?\n(.*)/ms);
my ($accline) = (/AC(?:\s+(.*);)+/);
my @acc = split /;/, $accline;
print OUT ">" . join (",", @acc) . " ($aa AA)\n$seq\n"
}
print "Done.\n";
An example of the output is:
>P62258, P29360, P42655, Q63631 (255 AA)
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
An example of the output is:
>P62258, P29360, P42655, Q63631 (255 AA)
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
[/quote]
Hi Homebrew,
Thanks for getting back to me. Is it possible to just retain first the accession number as I'm going to be using a clustering algorithm (blastclust) to cluster the sequences at 25% identity, hence so will just need one accession number.
But, I'd also like to add a random decimal number after the accessions as blastclust doesn't recognise accessions as unique identifiers so it requires a number after the accessions.
For example:
>P62258-::-0.001
sequence
>O43716-::-0.002
etc
The -::- separates the accession from the random number for blastclust to work properly. I will then write a perl script to only access the accession and remove the random number.
Thank you very much.
Sara
Hey, I recognise that separator!!!!
I can send you the script I used to make it in the first place... all you need to do is implement the perl rand function and add that to the id line.
When you said "random number", did you really mean "unique number"? The former requires generating a random number between a particular range (not hard, but different), the latter is more straightforward. I suspect you mean "unique number". If that's so, here's one way of doing it:
use strict;
$/ = "\nID ";
my $count;
open (IN, "human.txt") or die "Cannot open human.txt for input: $!\n";
open (OUT, ">parsed.txt") or die "Cannot open parsed.txt for output: $!\n";
while (<IN>) {
next if (/Phosphorylation/);
chomp;
$count++;
my ($seq) = (/>.*?\n(.*)/ms);
my ($acc) = (/AC\s+(.*?);/);
print OUT ">$acc-::-" . $count/1000 . "\n$seq\n"
}
print "Done.\n";
Example output:
>P62258-::-0.001
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
So, then you'll have your blastclust-compatible fasta formatted file of non-phosphorylated proteins. Next, you want to pass this to blastclust? So we're looking at something like:
Hi Homebrew,
Yes, what I meant was unique identifier. I will try out your perl script over the weekend and then blastclust the sequences on a Unix platform on Monday. (Windows is a bit problematic when it comes to blastclust )
The command used to execute the algorithm is:
blastclust -i parsed.txt -o clusters.txt -p T -L .9 -b T -S 25
where:
-i = the input fasta file called "pasred.txt"
-o = the output file called "clusters.txt"
-p T = indicates input file contains protein sequences and not nucleotide sequnces
-L = area covering 90% of the length of sequence
-S = I want sequences which have 25% similarity
This will cluster the sequences accordingly and print the clusters with accession numbers and the unique number to the output file.
To obtain the accession no's only (leaving behind the unique number), I created the following script:
[code]
#!/usr/bin/perl
$clustfile = 'clusters.txt';
unless(open (CLUSTFILE, $clustfile))
{
print STDERR "Cannot open file \"$clustfile\"\n\n";
exit;
}
@clustfile = <CLUSTFILE>;
close CLUSTFILE;
open (OUTPUTFILE, ">C:\accession.txt");
my $line ="";
foreach $line (@clustfile)
{
if ($line =~ /(>[OPQ].{5})/ig)
{
print OUTPUTFILE "$1\n";}}
close OUTPUTFILE;
exit;
It prints all the appropriate accession numbers at 25% identity on a new line.
Thanks once again for your help. I shall let you know how it goes.
Have a nice weekend
Sara
My understanding is that -p, -L, and -b default to T, 0.9, and T, respectively, therefore there's no reason to explicitly set them, since you're using the default values, and thus all you need set is -S 25.
Your script will fail as written, generating a bunch of "print() on closed filehandle OUTPUTFILE at script_name.pl line 28." errors.
This is because you have the pathname wrong in the line:
and you've not checked for the error (contained in $!). This file is (silently) never opened, thus each time you call:
Perl will complain about trying to print to a closed (i.e. never opened) filehandle.
Here's the deal on Windows pathnames in Perl:
">C:\accession.txt" # fail
'>C:\accession.txt' # okay
">C:/accession.txt" # okay
'>C:/accession.txt' # okay
">C:\\accession.txt" # okay
'>C:\\accession.txt' # okay
I often begin my windows-destined scripts with something like:
my $desktop = "$ENV{USERPROFILE}\\Desktop";
my $temp = "$ENV{TEMP}";
so my three most frequently written to directories are predefined, then I use them like:
Using the predefined environmental variables also makes the script machine-independent -- I haven't hard-coded the username into the script.
Your script looks okay otherwise, except that it will fail if the accession number begins with anything other than O, P, or Q, and you haven't taken advantage of the neat '-::-' we used. I'm also not a big fan of assigning constants to variables (as in $clustfile = 'clusters.txt';).
It's unlikely to be a problem, but you might want to do a line-by-line read rather than slurp the whole file into an array, which might be trouble if we're doing genome-sized stuff with huge files.
I notice you're not using strict or enabling warnings, both of which you should always, always, always do. Did I mention you should always enable strict and warnings?

But, save for the pathname problem, your script will function correctly, as far as I can tell.
I would do it this way:
use strict;
open (IN, "clusters.txt") or die "Cannot open file clusters.txt: $!\n";
open (OUT, ">C:/accession.txt") or die "Cannot open file accession.txt: $!\n";
while (<IN>) {
print OUT "$1\n" if (/^>(.*)-::-/);
}
Is the problem with blastclust one of blastclust on Windows, or Perl on Windows?