Local BLAST returns too many HSPs - (Dec/06/2010)
Hello. I am an undergrad in a computational biology lab. This is my first attempt at running BLAST locally. I am using the older (LEGACY) blast executables because I was getting registry errors with the newer ones. I have been struggling with this issue for a while now. Here's the gist of what I've done:
1) Used "formatdb.exe" to convert 25 entire chromosome sequences (25 ".fa" files) into one large genomic database (a ~760 MB ".nsq" file, along with the other database files)
The command I used:
formatdb -i "ref_chr1.fa ref_chr2A.fa ref_chr2B.fa (etc...)" -t legacypantrog -p F -n legacypantrog.db
2) Used "blastall.exe" to search for a ~1000 bp query sequence (in FASTA format) against the database made in step 1
The command I used:
blastall -p blastn -d legacypantrog.db -i testfile.fasta -o testblast.xml -m 7 -v 1 -b 1 -K 1 -e 1e-30
An explanation of the blastall parameters is available here: http://www.plexdb.org/modules/documentation/NCBIblastall.htm
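For reference, here is a small Python sketch that rebuilds the command above flag by flag. The meanings in the third column are my reading of the legacy blastall help text, so treat them as assumptions; `BLASTALL_ARGS` and `build_command` are names made up for this illustration.

```python
# Each entry: (flag, value, meaning as I understand it from the legacy blastall help).
BLASTALL_ARGS = [
    ("-p", "blastn", "program name"),
    ("-d", "legacypantrog.db", "database base name (the name given to formatdb -n)"),
    ("-i", "testfile.fasta", "query file"),
    ("-o", "testblast.xml", "output file"),
    ("-m", "7", "alignment view; 7 = XML output"),
    ("-v", "1", "number of database sequences to show one-line descriptions for"),
    ("-b", "1", "number of database sequences to show alignments for"),
    ("-K", "1", "number of best hits from a region to keep"),
    ("-e", "1e-30", "expectation value threshold"),
]


def build_command(args=BLASTALL_ARGS):
    """Assemble the argument list, e.g. for subprocess.run(cmd, check=True)."""
    cmd = ["blastall"]
    for flag, value, _meaning in args:
        cmd += [flag, value]
    return cmd


print(" ".join(build_command()))
```

If that reading is right, -v and -b cap the number of database sequences (hits) reported, not the number of HSPs inside a hit, which would be consistent with getting a single hit that still contains thousands of HSPs.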
3) The problem:
The output file returns ONE hit, but with ~3000 HSPs! The first alignment in the output is exactly what I want: an alignment of the whole query sequence against a sequence in the database. The rest of the output consists of roughly 3000 short alignments like the one below, differing in length and position:
Score = 141 bits (71), Expect = 9e-031
Identities = 89/95 (93%)
Strand = Plus / Plus
Query: 273       tctactaaaactacaaaaattagctgggcacggtggcaggcgcctgtaatcccagctact 332
                 |||||||||| ||||||||| ||||||||| ||||||||||||||||| |||||||||||
Sbjct: 193250233 tctactaaaaatacaaaaatgagctgggcatggtggcaggcgcctgtagtcccagctact 193250292
Query: 333       caggaggctgaggcaggagaatcacttgaacctgg 367
                 | |||||| ||||||||||||||||||||||||||
Sbjct: 193250293 cgggaggcggaggcaggagaatcacttgaacctgg 193250327
Is there a way to get rid of all these extra HSPs in my output?
The following is a summary from an HTML format output, maybe it can help:
Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
Number of Sequences: 25
Number of Hits to DB: 43,805,126
Number of extensions: 1120422
Number of successful extensions: 1120422
Number of sequences better than 1.0e-030: 25
Number of HSP's gapped: 1091227
Number of HSP's successfully gapped: 75943
Length of query: 971
Length of database: 3,175,582,169
Length adjustment: 21
Effective length of query: 950
Effective length of database: 3,175,581,644
Effective search space: 3016802561800
Effective search space used: 3016802561800
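As a sanity check on the summary above, the effective lengths and search space can be reproduced from the raw lengths. This is a minimal sketch that assumes the length adjustment is subtracted once from the query and once per database sequence, and that the expected number of chance alignments for a normalized bit score S' is (search space) x 2^(-S'); the variable names are mine.

```python
# Values copied from the summary above.
query_len = 971
db_len = 3_175_582_169
num_seqs = 25
length_adjustment = 21

# Assumption: the adjustment is removed once from the query
# and once from each of the 25 database sequences.
eff_query = query_len - length_adjustment
eff_db = db_len - num_seqs * length_adjustment
search_space = eff_query * eff_db

print(eff_query)     # 950
print(eff_db)        # 3175581644
print(search_space)  # 3016802561800


def expected_hits(bit_score, space=search_space):
    """Expected number of chance alignments scoring at least bit_score bits."""
    return space * 2.0 ** (-bit_score)


# The printed score of 141 bits is rounded, so this only approximates
# the Expect value shown for the short alignment.
print(expected_hits(141))
```

The three printed integers match the reported effective query length, effective database length and effective search space, and the estimate for 141 bits lands close to the 9e-031 shown for the short alignment.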
Any help would be greatly appreciated.
So the BLAST FAQ says this happens when the query contains repeat elements and the database is large. One of their suggestions is to filter out species-specific repeats.
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#HSPs
Unfortunately it is the repeats that I am interested in. All of my queries have long simple tandem repeats. I don't think there is a simple solution to this issue, but if anyone has any ideas for a workaround, I'd still be interested.
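One possible workaround is to leave the search alone and filter the XML output (-m 7) afterwards, keeping only HSPs that span most of the query. Below is a minimal sketch using Python's standard xml.etree.ElementTree. The function name `filter_hsps`, the 0.9 coverage threshold and the tiny `SAMPLE` document (including its bit scores) are all made up for illustration; the tag names are the ones I believe the NCBI BLAST XML format uses.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of a BLAST XML report: one hit with two HSPs.
SAMPLE = """<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>
<Hit><Hit_def>example_chromosome</Hit_def><Hit_hsps>
  <Hsp><Hsp_bit-score>1800.0</Hsp_bit-score>
       <Hsp_query-from>1</Hsp_query-from><Hsp_query-to>971</Hsp_query-to></Hsp>
  <Hsp><Hsp_bit-score>141.0</Hsp_bit-score>
       <Hsp_query-from>273</Hsp_query-from><Hsp_query-to>367</Hsp_query-to></Hsp>
</Hit_hsps></Hit>
</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>"""


def filter_hsps(root, query_len, min_coverage=0.9):
    """Return (hit_def, query_from, query_to, bit_score) for each HSP
    that covers at least min_coverage of the query."""
    kept = []
    for hit in root.iter("Hit"):
        hit_def = hit.findtext("Hit_def")
        for hsp in hit.iter("Hsp"):
            q_from = int(hsp.findtext("Hsp_query-from"))
            q_to = int(hsp.findtext("Hsp_query-to"))
            covered = abs(q_to - q_from) + 1
            if covered / query_len >= min_coverage:
                bits = float(hsp.findtext("Hsp_bit-score"))
                kept.append((hit_def, q_from, q_to, bits))
    return kept


for row in filter_hsps(ET.fromstring(SAMPLE), query_len=971):
    print(row)
```

For a real report, `ET.parse("testblast.xml").getroot()` would take the place of `ET.fromstring(SAMPLE)`. This only hides the short repeat-driven HSPs after the fact; it does not stop BLAST from computing them, and the right coverage threshold will depend on the queries.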