Searching for gene ontology terms automatically and adding to a text file - (Aug/14/2006 )

Hi there,

I was wondering if anyone can help me out here.

I have a text file of UniProt accession numbers (~1000 or so) and would like to find their top-level gene ontology (GO) terms and add these GO terms to the file.

I've tried doing this manually using Swiss-Prot and the gene ontology database, but it's not practical (copy and paste takes forever!) and I'd like an automatic method of doing this.

Any ideas or suggestions would be greatly appreciated.

Thank you in advance!

Sara

-sara.pl-

Have a look at the URL you use to access the data you have copied and pasted. If it looks like you can automagically build the URLs, you could use Perl's LWP::Simple to pull down particular pages and parse those for the data you want.
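
A minimal sketch of that idea, assuming the site accepts an accession number appended to a fixed base URL. The base URL below is a placeholder, not a real endpoint, and build_url and fetch_page are hypothetical helper names. The get function is the one exported by LWP::Simple; it returns undef when the request fails.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder base URL (an assumption): replace it with the prefix you see
# in your browser when you look up an entry by hand.
my $BASE_URL = 'http://example.org/entry/';

# Build the lookup URL for one accession number.
sub build_url {
    my ($accession) = @_;
    $accession =~ s/\s+//g;    # drop stray whitespace and line endings
    return $BASE_URL . $accession;
}

# Fetch one page. LWP::Simple is loaded only when a fetch is requested;
# its get() returns undef if the request fails.
sub fetch_page {
    my ($url) = @_;
    require LWP::Simple;
    return LWP::Simple::get($url);
}

print build_url("Q9Y264\n"), "\n";
```

The page text returned by fetch_page can then be searched with a regular expression for the lines you are after.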

-perlmunky-

I agree with perlmunky -- Perl can do this for you. Give us an example of an accession number you have and the corresponding gene ontology database URL, and we'll whip something up...

-HomeBrew-

Hi,

Thanks for getting back to me!

This is the URL I'm supposed to be using but since I only have accession numbers, the database doesn't recognise them and requires GO Terms.

http://www.godatabase.org/cgi-bin/amigo/go...on=replace_tree

I've been using Swiss-Prot to get the GO terms. (WAY TOO TIME CONSUMING!)

So for accession number Q9Y264: the corresponding GO term is as follows:

Molecular function: transmembrane receptor protein tyrosine kinase activator activity

Now, since this is the lowest level GO term, I paste this GO term in the above URL to get the top level ones.

The whole process takes days doing it manually and an automatic method would be far more convenient.

Thank you for your help guys!

Sara

-sara.pl-

So, if I use the URL http://ca.expasy.org/uniprot/Q9Y264 (which is just the http://ca.expasy.org/uniprot/ URL with the accession number Q9Y264 tacked on the end), it returns a page that has (under "Ontologies") a line like:

GO:0030297; Molecular function: transmembrane receptor protein tyrosine kinase activator activity (non-traceable author statement).

Is the number 0030297 the info you're looking for? Are all GO numbers 7 digits, with no letters?

-HomeBrew-

Yes, that is the number I'm searching for. All GO numbers start with the letters GO, followed by a colon and 7 digits. There are no letters within or after the digits.
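
Given that format, a regular expression can pick the GO number, category and term out of a line like the one quoted from the ExPASy page. This is only a sketch: parse_go_line is a made-up helper name, and it assumes every line follows the "GO:nnnnnnn; Category: term (evidence)." layout of that one example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one "GO:nnnnnnn; Category: term (evidence)." line.
# Returns (id, category, term, evidence), or an empty list if no match.
sub parse_go_line {
    my ($line) = @_;
    if ($line =~ /(GO:\d{7});\s+([^:]+):\s+(.+?)\s+\(([^()]*)\)\.?\s*$/) {
        return ($1, $2, $3, $4);
    }
    return;
}

my @fields = parse_go_line(
    'GO:0030297; Molecular function: transmembrane receptor protein '
  . 'tyrosine kinase activator activity (non-traceable author statement).'
);
print join("\t", @fields), "\n" if @fields;
```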

-sara.pl-

Two more quick questions:

  1. What format is the list of accession numbers in (the input file)?
  2. What do you want the output file to look like?

-HomeBrew-

The format of the input file is just a list of accession numbers like this:

Q95604
Q29960
P16189
P30460
P30484
P30498
P49703
P40616
Q9H013
Q99965
O96019
P63267
P07510
Q9GZZ6
P13765
P11826
P51857
O94788
P08319
Q9Y264



I'd like the output file to be like this if possible:


Q95604(TAB) GO NOT FOUND
(SPACE HERE PLEASE)
Q29960 (TAB) GO NOT FOUND
(SPACE HERE PLEASE)
P16189(TAB) GO NOT FOUND
(SPACE HERE PLEASE)
P30460 (TAB)Cellular component: integral to plasma membrane
(A TAB HERE) Molecular function: MHC class I receptor activity
(A TAB HERE) Biological process: immune response
(SPACE HERE PLEASE)
P30484 (TAB) GO NOT FOUND
(SPACE HERE PLEASE)
P30498 (TAB) GO NOT FOUND
(SPACE HERE PLEASE)
P49703(TAB) Molecular function: GTPase activity
(A TAB HERE) Biological process: protein secretion
(SPACE HERE PLEASE)
P40616 (TAB) Molecular function: enzyme activator activity
(A TAB HERE) Molecular function: GTPase activity
(SPACE HERE PLEASE)
Q9H013 (TAB) GO NOT FOUND
(SPACE HERE PLEASE)
Q99965 (TAB) Cellular component: integral to plasma membrane
(A TAB HERE) Molecular function: integrin binding
(A TAB HERE) Molecular function: metallopeptidase activity
(A TAB HERE) Biological process: fusion of sperm to egg plasma membrane


etc..................


I'd like a tab after the accessions and tabs before the Gene ontology. It doesn't seem to show on here so I have marked them as (TAB) and (A TAB HERE). Also, is it possible to put a space between accessions?

Some accession numbers don't have a gene ontology term so could you just put GO NOT FOUND after the accessions?

Or it would actually be better not to include the accessions that don't have GO terms.
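
The layout described here could be produced along these lines. A sketch only: format_entry is a hypothetical helper, the blank line after each record stands in for the requested "space", and records without GO terms are either marked GO NOT FOUND or dropped, depending on a flag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Format one record: the accession, a tab, then the first term; every
# further term goes on its own line behind a tab. A blank line ends the
# record. With no terms, return "GO NOT FOUND", or '' if $skip_empty is set.
sub format_entry {
    my ($accession, $terms, $skip_empty) = @_;
    if (!@$terms) {
        return $skip_empty ? '' : "$accession\tGO NOT FOUND\n\n";
    }
    return $accession . join('', map { "\t$_\n" } @$terms) . "\n";
}

print format_entry('P49703', [
    'Molecular function: GTPase activity',
    'Biological process: protein secretion',
]);
print format_entry('Q95604', []);
```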

Whichever one's ok with you. Is it ok with you if I email you my input file so you can have a look at it?

Thanks a lot!!

Sara

-sara.pl-

Try this:

CODE
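#!/usr/bin/perl -w
use strict;
use LWP::Simple;

# Input: one accession number per line. Output: one file for accessions
# with GO terms (go.txt) and one for those without (no_go.txt).
open (ACC, "acc_numbers.txt") or die "Can't open acc_numbers.txt: $!\n";
open (GO, ">go.txt") or die "Can't open go.txt: $!\n";
open (NOGO, ">no_go.txt") or die "Can't open no_go.txt: $!\n";

while (<ACC>) {
    chomp($_);
    $_ =~ s/\s+//g;
    next unless length $_;    # skip blank lines in the input file
    # Build the ExPASy URL by appending the accession number
    my $link = 'http://ca.expasy.org/cgi-bin/niceprot.pl/printable?ac=' . $_;
    my $page = get $link;
    # get returns undef on failure: retry twice, pausing before each retry
    unless (defined $page) {
        warn "Having trouble contacting ExPASy server. Sleeping once...\n";
        sleep 3;
        $page = get $link;
        unless (defined $page) {
            warn "Still having trouble contacting ExPASy server. Sleeping twice...\n";
            sleep 5;
            $page = get $link;
        }
    }
    if (!(defined $page)) {
        warn "Three attempts to retrieve data from ExPASy server were unsuccessful...\n";
        sleep 3;
        die;
    } else {
        # Grab the part of the page that follows the "GO" table cell
        my ($seg) = ($page =~ /<td>GO<\/td>(.*)<\/i>/ms);
        if ($seg) {
            $seg =~ s/<.*?>//g;    # strip the HTML tags
            # Turn each "GO:nnnnnnn; term (evidence)." into "<TAB>term<NEWLINE>"
            $seg =~ s/\s*GO:\d{7};\s+(.*)\s+\(.*?\)\./\t$1\n/g;
            print GO "$_", $seg, "\n";
        } else {
            print NOGO "$_\tGO NOT FOUND\n";
        }
    }
}
close ACC;
close GO;
close NOGO;
print "Done.\n";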

The input file (acc_numbers.txt) is a text file of accession numbers, one per line. The output is split between two files: accession numbers that have GO terms associated with them go to go.txt, and those that don't go to no_go.txt. The entries in go.txt have the info you requested, formatted as requested.

The bulk of the code is concerned with making sure that we've received a response from the ExPASy server (it retries twice, then gives up and exits); much of it is likely unnecessary as the Internet and that server are pretty reliable. Still...

If you want to trim it down, we can leave out some of the response checks...
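
One way to trim it would be to fold the retry logic into a single helper. This is a sketch, not the script above: fetch_with_retry is a made-up name, and it takes the fetching code as a reference so the same loop works with LWP::Simple's get or with anything else that returns undef on failure.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Call $fetch->($url) until it returns a defined value, sleeping for the
# next delay (in seconds) after each failure. One attempt is made per
# delay, plus a final one; nothing is returned if every attempt fails.
sub fetch_with_retry {
    my ($fetch, $url, @delays) = @_;
    for my $delay (@delays, undef) {
        my $page = $fetch->($url);
        return $page if defined $page;
        last unless defined $delay;    # no attempts left
        warn "Having trouble fetching $url. Sleeping $delay s...\n";
        sleep $delay;
    }
    return;
}

# Stand-in fetcher for demonstration: fails twice, then succeeds.
my $calls = 0;
my $flaky = sub { return ++$calls < 3 ? undef : "page for $_[0]" };
print fetch_with_retry($flaky, 'Q9Y264', 0, 0), "\n";
```

With LWP::Simple loaded, a call such as fetch_with_retry(\&get, $link, 3, 5) would reproduce the 3 and 5 second pauses used in the script.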

Hope this helps!

-HomeBrew-

Thank you very much HomeBrew!!

I just tried it now and it works perfectly. I didn't know that you could use a Perl module to access servers. It's a brilliant way of searching and retrieving information from ExPASy. Here's what the output looks like for accessions that do have GO terms:

P30460 Cellular component: integral to plasma membrane
Molecular function: MHC class I receptor activity
Biological process: immune response

P49703 Molecular function: GTPase activity
Biological process: protein secretion

P40616 Molecular function: enzyme activator activity
Molecular function: GTPase activity

Q99965 Cellular component: integral to plasma membrane
Molecular function: integrin binding
Molecular function: metallopeptidase activity
Biological process: fusion of sperm to egg plasma membrane

And this is the output for accessions without GO terms:

Q95604 GO NOT FOUND
Q29960 GO NOT FOUND
P16189 GO NOT FOUND
P30459 GO NOT FOUND


Now I can use DAG-Edit to get the top level terms by just loading these terms (through a .obo file) into the interface and looking at a hierarchy of parent terms.
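
The same lookup of parent terms could be scripted if clicking through DAG-Edit gets tedious. The sketch below assumes the .obo file uses the usual [Term] stanzas with id: and is_a: lines, follows only is_a links (no other relationship types), and treats any term without an is_a parent as top level. The three-term fragment and its EX: identifiers are invented for illustration, and parse_obo and top_level are hypothetical helper names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read OBO text and return a hash ref: term id => array ref of is_a parents.
sub parse_obo {
    my ($text) = @_;
    my (%parents, $id);
    for my $line (split /\n/, $text) {
        if ($line =~ /^\[/) {
            undef $id;                      # a new stanza starts
        } elsif ($line =~ /^id:\s*(\S+)/) {
            $id = $1;
            $parents{$id} ||= [];
        } elsif (defined $id && $line =~ /^is_a:\s*(\S+)/) {
            push @{ $parents{$id} }, $1;
        }
    }
    return \%parents;
}

# Walk up the is_a links and return the sorted ids of all root ancestors.
sub top_level {
    my ($parents, $id, $seen) = @_;
    $seen ||= {};
    return () if $seen->{$id}++;            # already visited on another path
    my @up = @{ $parents->{$id} || [] };
    return ($id) unless @up;                # no parent: this is a root
    my %roots = map { $_ => 1 } map { top_level($parents, $_, $seen) } @up;
    return sort keys %roots;
}

# Invented three-term fragment, for illustration only.
my $obo = <<'END';
[Term]
id: EX:0000001
name: root term

[Term]
id: EX:0000002
name: middle term
is_a: EX:0000001 ! root term

[Term]
id: EX:0000003
name: leaf term
is_a: EX:0000002 ! middle term
END

print join(', ', top_level(parse_obo($obo), 'EX:0000003')), "\n";
```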

Thank you once again. You've been a big help (as always!)

Sara

-sara.pl-
