Entropy Measurements and Ball Cover Construction for Biological Sequences
Robertson, Jeffrey Alan
MetadataShow full item record
As improving technology is making it easier to select or engineer DNA sequences that produce dangerous proteins, it is important to be able to predict whether a novel DNA sequence is potentially dangerous by determining its taxonomic identity and functional characteristics. These tasks can be facilitated by the ever increasing amounts of available biological data. Unfortunately, though, these growing databases can be difficult to take full advantage of due to the corresponding increase in computational and storage costs. Entropy scaling algorithms and data structures present an approach that can expedite this type of analysis by scaling with the amount of entropy contained in the database instead of scaling with the size of the database. Because sets of DNA and protein sequences are biologically meaningful instead of being random, they demonstrate some amount of structure instead of being purely random. As biological databases grow, taking advantage of this structure can be extremely beneficial. The entropy scaling sequence similarity search algorithm introduced here demonstrates this by accelerating the biological sequence search tools BLAST and DIAMOND. Tests of the implementation of this algorithm shows that while this approach can lead to improved query times, constructing the required entropy scaling indices is difficult and expensive. To improve performance and remove this bottleneck, I investigate several ideas for accelerating building indices that support entropy scaling searches. The results of these tests identify key tradeoffs and demonstrate that there is potential in using these techniques for sequence similarity searches.
General Audience Abstract
As biological organisms are created and discovered, it is important to compare their genetic information to known organisms in order to detect possible harmful or dangerous properties. However, the collection of published genetic information from known organisms is huge and growing rapidly, making it difficult to search. This thesis shows that it might be possible to use the non-random properties of biological information to increase the speed and efficiency of searches; that is, because genetic sequences are not random but have common structures, the increase of known data does not mean a proportional increase in complexity, known as entropy. Specifically, when comparing a new sequence to a set of previously known sequences, it is important to choose the correct algorithms for comparing the similarity of two sequences, also known as the distance between them. This thesis explores the performance of entropy scaling algorithm compared to several conventional tools.
- Masters Theses