Login at Biotechnology Forums

coatespt · 04-18-2012, 08:29 AM

I would like to get some algorithmic wisdom on alignment/matching techniques (my field is computer science, not biology, so please allow for a degree of ignorance!) I'm trying to assess whether an algorithm developed for another purpose might be useful in biotech. Here is my complicated multi-part question, and thanks in advance for your patience!

Background: The basic task is finding out where, in a very large search string, reasonably good alignments for a target string occur. A bio-example might be, you're looking for 2kbp target in in a multi-gig genome, and you want to find any places where that target string occurs, even if it's pretty badly corrupted. (I'm assuming that actually aligning the target and the match will be a separate task executed by one of the standard heavyweight algorithms.) The result is the location of substrings that are less than some given edit-distance from the target string.

My first question is, how fast does FINDING the location of such matches (as opposed to actually aligning them) have to be, in order to be useful to biology problems? Answers either in big-Oh notation or clock time would be helpful. For example, if you had a 5Kbp target, and you wanted to find out everywhere something similar appears in a set of genomes, how fast would an algorithm need to be to be useful?

My second question is, in practical biological applications, over what kind of ranges of goodness/badness are matches interesting, and over what size ranges? E.g., what might be the biggest edit differences that are still of interest in real-world biology problems? Do biologists want only very close matches, or are highly corrupted versions also of interest?

The reason for asking is that I'm curious whether an algorithm originally developed for a different purpose might have biology applications. The algorithm has a one-time linear time pre-processing step, but thereafter is super-linear for actually finding matches. It's only applicable to larger targets, say, minimum of half-kilobyte up to megabytes. Settings for sensitivity, target-size, and the maximum degree of corruption matter somewhat, but as a vanilla example, on a single processing thread it takes about 3 seconds to find matches for a 5kbp target in the c. elegans genome. The same search against the human genome is about 10 minutes. It's parallelizes well though, so within reason, you can just divide by the number of cores you apply.

If you got this far, thanks for reading, and any answers or comments would be greatly appreciated!

karan_psy · 09-02-2015, 05:44 AM

a cell algorithm for bioinformatics to predict not only the gene expression , but also the biochemical changes or everything that can be logically connected to define a functionality or list of changes that would occur for any response in change of cellular system , like protein structure prediction and receptor interactions. like how would a gene transformation will behave .and to visualize the changes in 3d gene modelling software.how genetic changes would change the phenotype or expression. including punnet square of crossing over to expression of dominant traits if put simply. w.r.t biochemistry and biophysics of a cell. "the butterfly effect" within the cell. defining how everything connects with everythin else within the cell.

hope this brain process and language put up helps ur insight . i wish that could be done. time to start logically defining every cell process wrt to its chemistry may be.