.nsq files (Oct/23/2008)
Hi everybody.
I'm trying to extract est_others.06.tar.gz from ftp://ftp.ncbi.nih.gov/blast/db/. Decompressing it gave me a .tar file, and unpacking that gave files with the extensions nni, nsq, nhr, nin, nnd, nsd and nsi.
I guess the sequences are in the nsq file.
Now, I need to use those files for Mascot, so I need the sequences in FASTA format. How can I get FASTA from the files above?
I also tried to get the right format from ftp://ftp.ncbi.nih.gov/blast/db/FASTA, where the whole est_others database is in one file, est_others.gz. This is a huge file that takes hours to download, and when I try to extract it, the extraction crashes.
How can I get the FASTA files?
Thanks a lot
.nsq, .nhr, and .nin files are blastn database files (which is why they're in the /blast/db/ directory).
Your second attempt was the correct one: the FASTA files are in the FASTA directory. Unfortunately, the est_others.gz file is 4 GB in size, and that's compressed...
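If a desktop archive tool crashes on a file this large, one alternative is to decompress it as a stream so that only a small buffer is held in memory at a time. The following is a minimal Python sketch under that assumption, not a tested recipe for this particular file; the function name `gunzip_streaming` and the file names in the comment are hypothetical.

```python
import gzip
import shutil


def gunzip_streaming(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress a .gz file to dst_path, copying one chunk at a time."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copyfileobj reads and writes chunk_size bytes per step, so memory
        # use stays small no matter how large the decompressed data is.
        shutil.copyfileobj(src, dst, chunk_size)


# Example call (hypothetical local file names):
# gunzip_streaming("est_others.gz", "est_others.fasta")
```

If the download was cut short, Python's gzip module raises an EOFError partway through, which at least confirms that the file is incomplete.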
Now it starts to get strange. I downloaded the file (~ 15 minutes), but when I opened it (with WinRAR), it said "est_others.gz - GZIP archive, unpacked size 184,215,389 bytes".
Why would a 176 MB file compress to a file that's 3.71 GB (3,993,206,700 bytes) -- a 2,068% increase?
I checked the file, again with WinRAR, and it came back with "Unexpected end of archive. CRC failed in est_others. The file is corrupt."
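One possible reason for an implausibly small "unpacked size" is the gzip format itself: the trailer stores the uncompressed length in a 32-bit field (ISIZE), so the value is the true size modulo 2^32 and wraps around for data of 4 GiB or more. Whether that explains this particular file is not confirmed anywhere in this thread. The sketch below only shows how that field can be read and how the wraparound arithmetic works; the function names are hypothetical.

```python
import os
import struct


def gzip_reported_size(path):
    """Return the ISIZE field: the last 4 bytes of a gzip file, little-endian."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)  # ISIZE is the final 4 bytes of the file
        return struct.unpack("<I", f.read(4))[0]


def wrapped_size(true_size):
    """Size a gzip trailer would report for true_size bytes of input."""
    return true_size % 2**32  # ISIZE holds only the low 32 bits
```

For a file holding a single gzip member, `gzip_reported_size` matches the real uncompressed size only when that size is below 4 GiB.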
My advice is to contact the NCBI staff about it -- and while you've got them, you can ask them to send you the file on CD.
Thanks a lot!! I'll ask NCBI.
This is very easy with a tool called fastacmd, which comes with the NCBI BLAST software binaries. You can dump a BLAST-formatted database in FASTA format, and you can even use an input file of accessions or GIs to restrict the output to specific sequences.
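As a rough illustration, a full dump could be driven from Python as sketched below. The flags (-d, -p, -D, -o) are taken from the fastacmd 2.2.18 usage text quoted in this post. The database name `est_others.06` (the base name of the extracted files, without extension), the output file name and the function names are assumptions made for the example.

```python
import subprocess


def build_dump_command(db_name, out_path):
    """Build a fastacmd call that dumps a nucleotide database as FASTA."""
    return [
        "fastacmd",
        "-d", db_name,   # database base name (assumed: no .nsq/.nhr/.nin suffix)
        "-p", "F",       # F = nucleotide
        "-D", "1",       # 1 = dump the entire database as FASTA
        "-o", out_path,  # write to a file instead of stdout
    ]


def dump_to_fasta(db_name, out_path):
    """Run the dump; requires fastacmd to be installed and on PATH."""
    subprocess.run(build_dump_command(db_name, out_path), check=True)


# Example call (assumed names; run it in the directory holding the db files):
# dump_to_fasta("est_others.06", "est_others.06.fasta")
```

The equivalent command line would be `fastacmd -d est_others.06 -p F -D 1 -o est_others.06.fasta`, under the same assumptions.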
fastacmd 2.2.18 arguments:

  -d  Database [String] Optional
      default = nr
  -p  Type of file
        G - guess mode (look for protein, then nucleotide)
        T - protein
        F - nucleotide [String] Optional
      default = G
  -s  Comma-delimited search string(s).
      GIs, accessions, loci, or fullSeq-id strings may be used,
      e.g. 555, AC147927, 'gnl|dbname|tag' [String] Optional
  -i  Input file with GIs/accessions/loci for batch
      retrieval [String] Optional
  -a  Retrieve duplicate accessions [T/F] Optional
      default = F
  -l  Line length for sequence [Integer] Optional
      default = 80
  -t  Definition line should contain target gi only [T/F] Optional
      default = F
  -o  Output file [File Out] Optional
      default = stdout
  -c  Use Ctrl-A's as non-redundant defline separator [T/F] Optional
      default = F
  -D  Dump the entire database as (default is not to dump anything):
        1 FASTA
        2 Gi list
        3 Accession.version list
      [Integer] Optional
      default = 0
      range from 0 to 3
  -L  Range of sequence to extract (Format: start,stop)
      0 in 'start' refers to the beginning of the sequence
      0 in 'stop' refers to the end of the sequence [String] Optional
      default = 0,0
  -S  Strand on subsequence (nucleotide only): 1 is top, 2 is bottom [Integer]
      default = 1
  -T  Print taxonomic information for requested sequence(s) [T/F]
      default = F
  -I  Print database information only (overrides all other options) [T/F]
      default = F
  -P  Retrieve sequences with this PIG [Integer] Optional
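If only some sequences are needed, the -s and -i options listed above can be used instead of a full dump. The sketch below builds such a call in Python. The flags come from the usage text above, while the function name, the database name `est_others.06` and the file names are assumptions for illustration.

```python
def build_fetch_command(db_name, out_path, ids=None, id_file=None):
    """Build a fastacmd call that retrieves selected sequences as FASTA.

    Pass either ids (a list of GIs/accessions) or id_file (a path to a
    file listing them), but not both.
    """
    if (ids is None) == (id_file is None):
        raise ValueError("give exactly one of ids or id_file")
    cmd = ["fastacmd", "-d", db_name, "-p", "F", "-o", out_path]
    if ids is not None:
        cmd += ["-s", ",".join(ids)]  # comma-delimited search strings
    else:
        cmd += ["-i", id_file]        # input file for batch retrieval
    return cmd


# Example (assumed names), using the identifiers shown in the usage text:
# build_fetch_command("est_others.06", "hits.fasta", ids=["555", "AC147927"])
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where fastacmd is installed.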