Interacting with NCBI - which tool could I use to interact with NCBI resources without a browser (Mar/27/2008)
Hi all!
I have many accession numbers of EST sequences (e.g. W46985), and for each one I have to:
- Find the nucleotide sequence
- BLAST it against the human genome to find the corresponding gene
- Find the accession number of the gene or genes found
I can do it by hand, but since I have 50 or so ESTs I'd like the computer to do it for me.
The problem is, I don't know how to interact with the NCBI resources without using the website.
I'm currently on Ubuntu Linux, quite fond of bash scripting, and I can program in Perl.
Can someone help me? Thank you!
-Uruclef-
Hi Uruclef,
If you are familiar with Perl, you can use the Perl modules 'WWW::Mechanize' and 'LWP' to connect to the NCBI pages.
It should look like:
CODE
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use LWP::UserAgent;
...
my $EST_ID = "W46985";
my $mech = WWW::Mechanize->new(timeout => 300);
$mech->proxy('http', 'http://PROXY:PROXY_PORT'); # only if you are behind a proxy
$mech->get('http://www.ncbi.nlm.nih.gov/');
$mech->submit_form(
    form_number => 1,          # identify the form of interest
    fields      => {           # identify the fields of interest
        "db"   => "EST",       # quoted: a bareword fails under 'use strict'
        "text" => $EST_ID,
    },
);
my $content = $mech->response()->content;
Have a look at the source code of the http://www.ncbi.nlm.nih.gov/ page and focus on the <form> section to see which form number and field names to use.
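To make that a bit more concrete, here is a rough sketch of how you could list the field names of the first form in a page you have saved or fetched. The helper name form_field_names and the sample HTML are made up for illustration and are not the real NCBI page. A regular expression is only a quick way to peek at a form, so expect to adjust it for real markup.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the name attribute of every <input>, <select> and <textarea>
# found inside the first <form> ... </form> block of an HTML string.
# (Hypothetical helper, not part of WWW::Mechanize or LWP.)
sub form_field_names {
    my ($html) = @_;
    my ($form) = $html =~ m{<form\b[^>]*>(.*?)</form>}is
        or return;
    my @names;
    while ($form =~ m{<(?:input|select|textarea)\b[^>]*?\bname\s*=\s*["']([^"']+)["']}gi) {
        push @names, $1;
    }
    return @names;
}

# Made-up page source, for illustration only.
my $sample = <<'HTML';
<html><body>
<form action="/search" method="get">
  <select name="db"><option value="EST">EST</option></select>
  <input type="text" name="text" value="">
  <input type="submit" value="Go">
</form>
</body></html>
HTML

print join(", ", form_field_names($sample)), "\n";
```

The names it prints are the keys you would put into the fields hash of submit_form.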
At this point $content contains the result page, including the direct link to the entry in the sequence database. Use a regular expression to pick it out; it should look something like this:
<a href="http://foobar">W46985</a>
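As a small offline sketch of that filtering step: the function below returns the href of the first link whose text is exactly the accession number. The name link_for_id and the example.org URLs are placeholders I made up, and the real page markup may well differ, so treat the pattern as a starting point.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the href of the first <a> element whose link text is exactly $id,
# or undef if there is none. (Hypothetical helper for illustration.)
sub link_for_id {
    my ($html, $id) = @_;
    # m{...} keeps the slashes in URLs from ending the pattern;
    # \Q...\E quotes any regex metacharacters in the ID.
    my ($url) = $html =~ m{<a\s[^>]*?href="([^"]+)"[^>]*>\s*\Q$id\E\s*</a>}i;
    return $url;
}

# Made-up result page fragment, for illustration only.
my $sample = '<p><a href="http://example.org/other">X00001</a> '
           . '<a href="http://example.org/entry">W46985</a></p>';

print link_for_id($sample, 'W46985'), "\n";
```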
Use LWP to open the link:
CODE
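...
# m{...} avoids clashing with the slashes in the URL; \Q...\E quotes the ID
$content =~ m{<a href="(http://foobar)">\Q$EST_ID\E</a>}
    or die "No link found for $EST_ID";
my $url = $1;
...
my $request = LWP::UserAgent->new();
$request->proxy('http', 'http://PROXY:PROXY_PORT'); # only if you are behind a proxy
my $response = $request->get($url);
die $response->status_line unless $response->is_success;
my $results = $response->content;
# /s lets '.' match newlines, since the sequence spans several lines
$results =~ /SEQUENCE(.*?)COMMENTS/s
    or die "Sequence section not found";
my $EST_SEQ = $1;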
...
Now $EST_SEQ should contain the page source for the submitted EST sequence plus some additional data. Adjust the last regular expression for better results!
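One possible way to tidy up what ends up in $EST_SEQ, as a sketch only: strip the HTML tags, then keep just the lines that consist of an optional position number followed by bases. The helper name clean_sequence and the sample text are made up, and I am assuming the sequence is shown in numbered blocks of lowercase bases; if the page lays it out differently, the line pattern needs changing. A short plain word made only of those letters (such as "tag") would also slip through.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reduce a captured chunk of page source to a bare, uppercase nucleotide string.
# (Hypothetical helper for illustration.)
sub clean_sequence {
    my ($raw) = @_;
    $raw =~ s/<[^>]+>//g;                  # drop HTML tags
    my $seq = '';
    for my $line (split /\n/, $raw) {
        # keep only lines made of an optional position number plus bases
        next unless $line =~ /^\s*\d*\s*([ACGTUN\s]+)$/i;
        (my $bases = $1) =~ s/\s+//g;      # remove spaces between blocks
        $seq .= uc $bases;
    }
    return $seq;
}

# Made-up capture, for illustration only.
my $sample = <<'TEXT';
<pre>
        1 acgtacgtac ggttaaccgg
       21 ttnn
</pre>
TEXT

print clean_sequence($sample), "\n";
```

The resulting string can then be written out as FASTA or passed on to the BLAST step.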
Ciao,
Markus
-xeroxed_yeti-