Background: The basic task is finding out where, in a very large search string, reasonably good alignments for a target string occur. A bio-example might be, you're looking for 2kbp target in in a multi-gig genome, and you want to find any places where that target string occurs, even if it's pretty badly corrupted. (I'm assuming that actually aligning the target and the match will be a separate task executed by one of the standard heavyweight algorithms.) The result is the location of substrings that are less than some given edit-distance from the target string.
My first question is, how fast does FINDING the location of such matches (as opposed to actually aligning them) have to be, in order to be useful to biology problems? Answers either in big-Oh notation or clock time would be helpful. For example, if you had a 5Kbp target, and you wanted to find out everywhere something similar appears in a set of genomes, how fast would an algorithm need to be to be useful?
My second question is, in practical biological applications, over what kind of ranges of goodness/badness are matches interesting, and over what size ranges? E.g., what might be the biggest edit differences that are still of interest in real-world biology problems? Do biologists want only very close matches, or are highly corrupted versions also of interest?
The reason for asking is that I'm curious whether an algorithm originally developed for a different purpose might have biology applications. The algorithm has a one-time linear time pre-processing step, but thereafter is super-linear for actually finding matches. It's only applicable to larger targets, say, minimum of half-kilobyte up to megabytes. Settings for sensitivity, target-size, and the maximum degree of corruption matter somewhat, but as a vanilla example, on a single processing thread it takes about 3 seconds to find matches for a 5kbp target in the c. elegans genome. The same search against the human genome is about 10 minutes. It's parallelizes well though, so within reason, you can just divide by the number of cores you apply.
If you got this far, thanks for reading, and any answers or comments would be greatly appreciated!