Interacting with NCBI - which tool could I use to interact with NCBI resources without a browser? (Mar/27/2008)

Hi all!

I have many accession numbers of EST sequences (e.g. W46985), and for each one I have to:

- Find the nucleotide sequence
- BLAST it against the human genome to find the corresponding gene
- Find the accession number of the matching gene or genes

I can do it by hand, but since I have 50 or so ESTs I'd like the computer to do it for me.
The problem is, I don't know how to interact with the NCBI resources without using the website.
I'm currently on Ubuntu Linux, quite fond of bash scripting, and I can program in Perl.
Can someone help me? Thank you!

-Uruclef-

Hi Uruclef,

If you are familiar with Perl, you can use the Perl modules 'WWW::Mechanize' and 'LWP' to connect to the NCBI pages.

It should look like:

CODE

#!/usr/bin/perl -w
use strict;
use warnings;
use WWW::Mechanize;
use LWP;
# ...
my $EST_ID = "W46985";
my $mech = WWW::Mechanize->new(timeout => 300);
$mech->proxy('http', 'http://PROXY:PROXY_PORT');   # proxy if available
$mech->get('http://www.ncbi.nlm.nih.gov/');
$mech->submit_form(
    form_number => 1,            # identify the form of interest
    fields      => {             # identify the fields of interest
        "db"   => "EST",
        "text" => $EST_ID,
    },
);
my $content = $mech->response()->content;

Have a look at the source code of the http://www.ncbi.nlm.nih.gov/ webpage and focus on the <form> section for a better understanding.
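If reading the raw page source is tedious, the form fields can also be listed programmatically. The following is only a minimal sketch: it uses HTML::Form (the module WWW::Mechanize relies on for forms) on a made-up HTML snippet, so the field names "db" and "text" and the helper describe_forms() are assumptions for illustration, not the real NCBI form.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical HTML standing in for a downloaded page; the real NCBI
# form will very likely use different field names and structure.
my $html = <<'HTML';
<html><body>
<form action="/search" method="get">
  <select name="db"><option value="EST">EST</option></select>
  <input type="text" name="text" value="">
  <input type="submit" value="Go">
</form>
</body></html>
HTML

# Return one line per named field, in the form "form N: name (type)".
sub describe_forms {
    my ($page, $base_url) = @_;
    my @lines;
    my $n = 0;
    for my $form (HTML::Form->parse($page, $base_url)) {
        $n++;    # form_number in submit_form() counts from 1
        for my $input ($form->inputs) {
            next unless defined $input->name;    # skip unnamed buttons
            push @lines, sprintf 'form %d: %s (%s)',
                $n, $input->name, $input->type;
        }
    }
    return @lines;
}

print "$_\n" for describe_forms($html, 'http://example.org/');
```

The form number and field names printed this way are the values to plug into form_number and fields in the submit_form() call. With a live WWW::Mechanize object, $mech->forms returns the same kind of HTML::Form objects for the current page.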

At this point $content contains the direct link to the record in the sequence database. Use a regular expression to filter for it; the link should look something like this:
<a href="http://foobar">W46985</a>
Use LWP to open the link:

CODE

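# ...
$content =~ m{<a href="(http://foobar)">\Q$EST_ID\E</a>}
    or die "No link found for $EST_ID";
my $url = $1;
# ...
my $request = LWP::UserAgent->new();
$request->proxy('http', 'http://PROXY:PROXY_PORT');   # proxy if available
my $response = $request->get($url);
die "Request failed: ", $response->status_line unless $response->is_success;
my $results = $response->content;
$results =~ /SEQUENCE(.*?)COMMENTS/s   # /s lets . match across newlines
    or die "No sequence block found";
my $EST_SEQ = $1;
# ...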

Now $EST_SEQ should contain the page source for the submitted EST sequence plus some additional data. Adjust the last regular expression for better results!
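To try out and tune the two regular expressions without hitting the server each time, you can run them on saved text first. This is a rough sketch only: both sample strings are invented for illustration (the real pages will look different), and the helper names find_record_url() and extract_sequence() are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the href of the first link whose text is exactly the accession.
sub find_record_url {
    my ($html, $id) = @_;
    return $html =~ m{<a\s+href="([^"]+)">\Q$id\E</a>} ? $1 : undef;
}

# Take the text between the SEQUENCE and COMMENTS markers and reduce it
# to bare letters: strip tags first, then position numbers and whitespace.
sub extract_sequence {
    my ($page) = @_;
    return undef unless $page =~ /SEQUENCE(.*?)COMMENTS/s;
    my $seq = $1;
    $seq =~ s/<[^>]*>//g;      # remove any HTML tags
    $seq =~ s/[^A-Za-z]//g;    # remove digits, spaces, newlines
    return uc $seq;
}

# Made-up sample data, NOT real NCBI output.
my $search_page = '<p>Hit: <a href="http://foobar">W46985</a></p>';
my $record_page = "SEQUENCE\n   1 acgtacgtaa ccggttaacc\n  21 ttgacc\nCOMMENTS\n";

my $url = find_record_url($search_page, 'W46985');
my $seq = extract_sequence($record_page);
print "url: $url\n";
print "seq: $seq\n";
```

Once the patterns work on saved copies of the real pages, the same two subs can be called from a loop over all accession numbers.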

Ciao,
Markus

-xeroxed_yeti-