Search all of GenBank for a given sequence - Google for raw subsequences? (May/29/2009)
OK, this is a naive question, but is it possible to search GenBank (ALL of GenBank) for a given nucleotide (or protein) sequence?
So I have some sequence "ATTCGTAGCTGATGACGATGACATGGGATTTTGAGGGAAC" and I am curious what known sequences happen to have that very same substring somewhere in them.
This is different from a BLAST-like search where I am aligning a sequence to a known genome I provide. I mean a more general search where I have some sequence I am trying to classify, or even just determine whether it's related to ANYTHING.
I do realize there are alignment tools like Maq, SOAP and MUMmer for aligning a given sequence against a reference. I'm asking more about searching "all known GenBank DNA sequences" in a database.
Is there such a tool? It would be interesting if I've found some sequence and I'm looking for what it may be related to, and boom, GenBank can say "hey, that's found here in the human genome, in 44 places, and it's in the mouse genome here, and there's a weird cancer variant in dogs that has it here" and so on.
Or is such a tool a bizarre fantasy? GenBank has about 100M sequences (roughly 100G bases!) so maybe searching like that is impractical.
But if it wasn't, would it be useful?
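To make the idea concrete, here is a minimal sketch of the exact-substring lookup described above, run on a tiny in-memory collection instead of GenBank. The record names and sequences are made up for illustration, and the function names are hypothetical. It checks both strands, since a match may sit on the reverse complement. A linear scan like this would not scale to a database of GenBank's size, which is where indexed tools come in.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def find_exact_hits(query, records):
    """Find exact occurrences of `query` in a dict of {name: sequence}.

    Returns {name: [(strand, position), ...]} with 0-based positions.
    A query equal to its own reverse complement is reported on both strands.
    """
    query = query.upper()
    rc = reverse_complement(query)
    hits = {}
    for name, seq in records.items():
        seq = seq.upper()
        found = []
        for strand, pattern in (("+", query), ("-", rc)):
            start = seq.find(pattern)
            while start != -1:
                found.append((strand, start))
                # Step by one so overlapping matches are also reported.
                start = seq.find(pattern, start + 1)
        if found:
            hits[name] = found
    return hits


# Toy records, invented for this example (not real GenBank entries).
toy_records = {
    "toy_record_1": "GGATTCGTT",
    "toy_record_2": "AACGAATCC",
    "toy_record_3": "TTTT",
}
print(find_exact_hits("ATTCG", toy_records))
```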
Why is that different than just BLASTing your sequence against the non-redundant nucleotide database?
HomeBrew on May 29 2009, 08:48 PM said:
Don't you have to specify what you want to compare against?
Or does the online BLAST server really search ALL of GenBank (the non-redundant parts at least, I assume)?
You do have to select what you want to search against, but you don't have to select a particular organism. Go to the NCBI BLAST home page and click on the "nucleotide blast" link in the "Basic BLAST" section. Paste your sequence into the box in the "Enter Query Sequence" section, select "Nucleotide collection (nr/nt)" from the database drop-down list in the "Choose Search Set" section, and click the BLAST button.
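The same search can in principle be submitted from a script. Below is a rough sketch that only builds the submission request and does not send it. It assumes NCBI's BLAST URL API at `https://blast.ncbi.nlm.nih.gov/Blast.cgi` with the `CMD`, `PROGRAM`, `DATABASE` and `QUERY` parameters, and that `nt` is the database name matching "Nucleotide collection (nr/nt)"; check the current NCBI documentation before relying on any of that. The helper name `build_blast_request` is made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint of the NCBI BLAST URL API; verify against NCBI's docs.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_request(sequence, program="blastn", database="nt"):
    """Build (but do not send) a POST request submitting a BLAST search.

    Parameter names follow the BLAST URL API as assumed above:
    CMD=Put submits a new search, DATABASE=nt selects the nucleotide
    collection rather than a single organism's genome.
    """
    params = {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": sequence,
    }
    body = urlencode(params).encode("ascii")
    # Supplying `data` makes urllib issue a POST instead of a GET.
    return Request(BLAST_URL, data=body)


request = build_blast_request("ATTCGTAGCTGATGACGATGACATGGGATTTTGAGGGAAC")
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

Actually sending it (for example with `urllib.request.urlopen(request)`) returns a page containing a request ID, and the results then have to be polled for separately. That part is left out here, and NCBI asks scripted users to keep request rates low.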
HomeBrew on May 30 2009, 08:44 AM said:
Ha! That shows where I misunderstood. I really thought you had to specify the genomes!
So that's pretty impressive that it can search the whole database!
Thanks for fixing my broken understanding.