Distortion-Resistant Hashing for rapid search of similar DNA subsequence

Abstract

One of the basic tasks in bioinformatics is localizing a short subsequence S, read while sequencing, in a long reference sequence R, like the human geneome. A natural rapid approach would be finding a hash value for S and compare it with a prepared database of hash values for each of length |S| subsequences of R. The problem with such approach is that it would only spot a perfect match, while in reality there are lots of small changes: substitutions, deletions and insertions. This issue could be repaired if having a hash function designed to tolerate some small distortion accordingly to an alignment metric (like Needleman-Wunch): designed to make that two similar sequences should most likely give the same hash value. This paper discusses construction of Distortion-Resistant Hashing (DRH) to generate such fingerprints for rapid search of similar subsequences. The proposed approach is based on the rate distortion theory: in a nearly uniform subset of length |S| sequences, the hash value represents the closest sequence to S. This gives some control of the distance of collisions: sequences having the same hash value.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…