What are today's bioinformatic bottlenecks? - (May/02/2009)
I have some interesting questions about practical bioinformatics bottlenecks: what common or important bioinformatic tasks are the slowest or most limiting?
My background is in text compression, databases, and parallel processing, but I know basic bioinformatic algorithms as well, since they're so related.
But what I don't know is what, in practice, the bottleneck is for most bioinformatics users. I'm hoping to focus some new research (using cheap PC graphics cards) on bioinformatic algorithms, and I want to work on speeding up the tasks that are truly a problem. Many kinds of applications can be sped up 10-100 times, so it's worthwhile to try to get some of these apps working in practice. But which ones should be sped up first? Probably the tasks that are both common AND slow.
I'd love it if you'd share your opinions, experiences, even HOPES for the kinds of tools, speedups, or new abilities you'd like. Again, I have a good technical background, but not a clear idea of which tasks are still making actual biologists grind their teeth in frustration.
Some examples:
- Is BLAST alignment speed an issue? Would you yell for joy if there were a new tool that gave identical results and was twice as fast?
- Are you happy with BLAST but just want to do much bigger alignments? Something like "here's my 1M-nucleotide sequence, give me the top 1000 local alignments", with an answer in 10 seconds?
- Or maybe database searching? You want to say "here are 10,000 nucleotides, please search every genome in GenBank and give me the best hits from everything, in 2 seconds, like a Google search does!"
- Or de novo assembly? Is it a huge problem to get shotgun fragments and have to burn a zillion CPU hours to assemble a genome?
- Or maybe you always run a high-quality Smith-Waterman alignment as a double check, and it takes a week to cook and really becomes a big issue?
Those are just examples off the top of my head, and I don't know whether those abilities are actually desperately desired. And of course there are likely tasks I haven't even heard of that are a limitation... please teach me.
Again, what I'm really trying to understand is which computational tasks are common but TOO SLOW. Or too size-limited (maybe BLAST is fast for you, but only because you stick to short sequences, since the long ones you'd rather use would be too slow).
I appreciate any suggestions or stories or pleas... links to other forums that might help me learn as well, any feedback.
I'll be happy to discuss what algorithms modern hardware can help with as well. You may be surprised.
Thanks!
Simple task:
Here are my 300 million short sequences. Map them to a mammalian genome. (If you can do it on a single core in a few days...)
If you want some training data let me know.
As I was reading your post, I was thinking the same thing Asashoryu was -- next-gen sequencing is the hottest thing going in genomics right now, and you get back hundreds of millions of short sequence reads that need to be mapped to a backbone. Quite CPU intensive.
I like your other ideas, too -- I just did ~18,000 BLASTs locally, keeping only the best hit, and it took quite a while (a couple of hours). I also did about 5,000 RPS-BLASTs, keeping all hits < 1x10^-3, and it took even longer. Some of that overhead was from Perl writing the protein file and reading and parsing the BLAST reports, but still -- I had to put a 0.25-second wait state in after each system call to BLAST, or the script would get ahead of itself...
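A minimal sketch of one way around that wait-state hack, in Python rather than Perl, assuming the legacy blastall binary and hypothetical file and database names (a blocking call returns only when BLAST has finished, so no sleep is needed):

```python
import subprocess
from pathlib import Path

def run_blast(query_fasta, out_report, db="nr", evalue="1e-3"):
    """Run one legacy blastall search and block until it finishes."""
    cmd = [
        "blastall", "-p", "blastp",   # protein-protein BLAST
        "-d", db,                     # database name (assumed local copy of nr)
        "-i", str(query_fasta),       # query file written beforehand
        "-o", str(out_report),        # report parsed afterwards
        "-e", evalue,                 # E-value cutoff, as in the RPS-BLAST runs above
    ]
    subprocess.run(cmd, check=True)   # blocks; raises if BLAST exits non-zero

# Hypothetical layout: one FASTA per query in queries/, one report per query in reports/.
for query in sorted(Path("queries").glob("*.fa")):
    run_blast(query, Path("reports") / (query.stem + ".blast"))
```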
I know much of this stuff could be handled much faster by the GPU (look at folding@home with ATI cards), but I don't think there's anything out there to offload the crunching to the GPU. A very good idea, GerryB!
There have already been attempts to use GPUs instead of CPUs for mapping next-gen sequencing tags, but I doubt anybody is using this approach routinely.
My colleagues' experience is that CPU-based mapping programs are either extremely slow or extremely inaccurate. There is some middle ground... but performance with respect to both speed and accuracy could still be better.
For us, the best compromise for now is Maq.
Asashoryu on May 20 2009, 02:31 PM said:
Here are my 300 million short sequences. Map them to a mammalian genome. (If you can do it on a single core in a few days...)
If you want some training data let me know.
How about using Bowtie? It can map around 25M reads per hour, so it should be able to do it in about 12 hours on one core.
Is such mapping a common bottleneck? What would you be able to do if you could do it as fast as your machine could read the data from disk? That's a very achievable goal... say a disk reads 150 MB/sec and a short read takes a rough 100 bytes including overhead; that means about 1.5 million matches per second, so 300 million reads stream through in roughly 200 seconds. Include writing the answers back to disk and some slack, call it 600 seconds, meaning 10 minutes.
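For what it's worth, here is that back-of-envelope arithmetic spelled out; every constant below is a rough assumption, not a benchmark:

```python
# Rough throughput estimate for streaming reads off disk; all numbers are guesses.
disk_bytes_per_s = 150e6        # ~150 MB/s sequential read
bytes_per_read   = 100          # short read plus record overhead
n_reads          = 300e6        # the 300 million reads mentioned above

reads_per_s = disk_bytes_per_s / bytes_per_read     # ~1.5 million reads/s
stream_s    = n_reads / reads_per_s                 # ~200 s just to read the input
total_s     = 3 * stream_s                          # pad for writing hits back: ~600 s
print(reads_per_s, stream_s, total_s / 60)          # 1500000.0, 200.0, 10.0 minutes
```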
How would you use such a 1M/sec align-to-known-genome tool?
What if it went one step further and were a de novo assembler of similar speed?
Yes, both of these suggestions are pure speculation, but what I'm getting at is: how would it change your work? Is such speed just "nice", or would it change your scientific methodology? If so, how?
For us this is not an option yet, since it does not support SOLiD colorspace. We have been testing several tools up to now, but the extremely fast ones were inaccurate: they mapped only ~1/2 of the reads that slower software did, and they frequently placed a tag at a different location.
We have a dedicated cluster for NGS (roughly 100 cores, 4 GB RAM per core), and our experience is that mapping takes most of its time. In addition, it would be convenient to run the mapping simultaneously with different parameters (more mismatches, allowing gaps...), but allowing this slows down the whole process dramatically, and even with relatively good hardware we are approaching the situation where mapping takes longer than generating the data.
Are you sure those numbers are possible? Because I know there is a relatively big effort going into developing such a tool, and the results are still not there...
The problem is that the data produced by the sequencer are noisy, with a lot of errors. A mapping program has to take this into account, allow for mismatches in the alignment, and report the best possible hit. It should also take into account the quality scores assigned to each sequenced base (a mismatch at a high-quality base should carry a bigger penalty than a mismatch at a low-quality base when deciding which imperfect hit is better).
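To make that concrete, here is a minimal sketch of one quality-aware scheme, roughly in the spirit of what Maq does: penalise each mismatch by the Phred quality of the sequenced base and pick the candidate placement with the lowest total penalty. The function names and toy sequences are illustrative only:

```python
def mismatch_penalty(read, quals, reference_window):
    """Sum Phred qualities over mismatched positions (ungapped comparison)."""
    return sum(q for base, q, ref in zip(read, quals, reference_window)
               if base != ref)

def best_hit(read, quals, candidate_windows):
    """Pick the candidate placement with the smallest quality-weighted penalty."""
    return min(candidate_windows, key=lambda w: mismatch_penalty(read, quals, w))

# The same single mismatch costs 30 at a high-quality base but only 5 at a Q5 base.
print(mismatch_penalty("ACGT", [30, 30, 30, 30], "ACTT"))          # -> 30
print(mismatch_penalty("ACGT", [30, 30, 5, 30], "ACTT"))           # -> 5
print(best_hit("ACGT", [30, 30, 5, 30], ["TCGT", "ACTT"]))         # -> "ACTT"
```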
It should also be versatile, able to deal with both base calls (Solexa) and colorspace di-base encoding (SOLiD).
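For reference, the di-base idea can be written very compactly: with A=0, C=1, G=2, T=3, the colour emitted for each pair of adjacent bases is the XOR of their two-bit codes, so every colour call depends on two bases rather than one. A minimal sketch:

```python
# SOLiD-style colorspace encoding: one colour (0-3) per adjacent pair of bases.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colorspace(seq):
    """Encode a base sequence as its list of colour calls."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

print(to_colorspace("ATGGC"))   # A->T=3, T->G=1, G->G=0, G->C=3  =>  [3, 1, 0, 3]
```

Part of what makes mapping colorspace data harder is that a single colour error, if naively decoded back to bases, corrupts every base downstream of it, so the aligner really has to work in colorspace directly.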
Complete de novo assembly is probably not possible from short reads.
Is such speed just "nice" or would it change your science methodology? If so, how?
It will make possible things that are currently impossible because of time, and it can save a lot of the money invested in large clusters dedicated purely to mapping.
Most of these tasks should be simple to parallelize.
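To illustrate the point: when each query or read chunk is independent, a process pool spreads the work over cores with almost no extra code. A minimal, self-contained sketch, with a toy exact-match counter standing in for a real mapper:

```python
from multiprocessing import Pool

REFERENCE = "ACGT" * 250_000    # toy 1 Mb "genome"; a real one would be loaded from disk

def count_hits(read):
    """Stand-in for real mapping: count exact occurrences of the read in the reference."""
    return read, REFERENCE.count(read)

if __name__ == "__main__":
    reads = ["ACGTACGT", "GTACGTAC", "TTTTTTTT"] * 1000        # toy read set
    with Pool(processes=8) as pool:                            # roughly one worker per core
        results = pool.map(count_hits, reads)                  # each read handled independently
    print(results[:3])
```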
That would be cool for BLAST, since some searches last several days.
CU