Designing a Hashtable for Easy Comparison of Genomic DNA Sequences
This hashtable is designed to make it easy to search corpus and pattern sequence strings for common substrings of length k. Building a table of every k-mer in the pattern string, each recorded with its location, allows a simple cross-check against each k-mer from the corpus string. Additionally, any k-mers that appear in a mask string are removed from the table before searching for matches in the corpus. The table resolves collisions using double hashing, computes slot sequences in a way that avoids multiplication, and supports a doubling procedure: when the load factor grows too large, the table rehashes into a new table twice the size.
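The description above can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the names `KmerTable`, `kmers`, and `common_kmers`, the power-of-two table size, the 0.5 load-factor threshold, and the use of Python's built-in `hash` are all assumptions made for illustration. The probe sequence advances by repeated addition and a conditional subtraction, which is one way to avoid multiplication when computing slots.

```python
class KmerTable:
    """Open-addressing table mapping each k-mer to its list of positions."""

    def __init__(self, size=8, max_load=0.5):
        self.size = size              # assumed to stay a power of two
        self.max_load = max_load      # assumed threshold, not from the source
        self.slots = [None] * size    # each slot is None or [kmer, positions]
        self.count = 0

    def _probe(self, kmer):
        """Yield the double-hashing slot sequence using only add/subtract."""
        h = hash(kmer)
        mask = self.size - 1
        slot = h & mask
        step = ((h >> 16) & mask) | 1  # odd step visits every slot of a 2^n table
        for _ in range(self.size):
            yield slot
            slot += step               # next slot by addition, not i * step
            if slot >= self.size:
                slot -= self.size

    def _find(self, kmer):
        """Return the slot holding kmer, or the first empty slot on its path."""
        for slot in self._probe(kmer):
            entry = self.slots[slot]
            if entry is None or entry[0] == kmer:
                return slot
        raise RuntimeError("table full")

    def _grow(self):
        """Rehash every entry into a new table twice the size."""
        old = [e for e in self.slots if e is not None]
        self.size += self.size
        self.slots = [None] * self.size
        for entry in old:
            self.slots[self._find(entry[0])] = entry

    def add(self, kmer, pos):
        slot = self._find(kmer)
        if self.slots[slot] is None:
            if self.count + 1 > self.max_load * self.size:
                self._grow()
                slot = self._find(kmer)
            self.slots[slot] = [kmer, []]
            self.count += 1
        self.slots[slot][1].append(pos)

    def remove(self, kmer):
        """Soft delete: keep the key so probe paths stay intact, drop positions."""
        entry = self.slots[self._find(kmer)]
        if entry is not None:
            entry[1].clear()

    def positions(self, kmer):
        entry = self.slots[self._find(kmer)]
        return entry[1] if entry is not None else []


def kmers(seq, k):
    """Yield (position, k-mer) for every substring of length k."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def common_kmers(pattern, corpus, mask, k):
    """Return (corpus_pos, pattern_pos) pairs for shared, unmasked k-mers."""
    table = KmerTable()
    for i, kmer in kmers(pattern, k):
        table.add(kmer, i)
    for _, kmer in kmers(mask, k):
        table.remove(kmer)
    return [(j, i) for j, kmer in kmers(corpus, k) for i in table.positions(kmer)]
```

For example, `common_kmers("ACGTACGT", "TTACGTT", "CGTA", 4)` returns `[(1, 3), (2, 0), (2, 4)]`: the corpus k-mer `TACG` at position 1 matches pattern position 3, and `ACGT` at corpus position 2 matches pattern positions 0 and 4, while the masked `CGTA` is ignored. The soft delete in `remove` is a simplification chosen here so that removing a masked k-mer does not break the probe paths of other keys.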