Entropy Measurements and Ball Cover Construction for Biological Sequences

Robertson, Jeffrey Alan

dc.contributor.author	Robertson, Jeffrey Alan	en
dc.contributor.committeechair	Heath, Lenwood S.	en
dc.contributor.committeemember	Marathe, Madhav Vishnu	en
dc.contributor.committeemember	Eubank, Stephen G.	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2018-08-02T08:00:32Z	en
dc.date.available	2018-08-02T08:00:32Z	en
dc.date.issued	2018-08-01	en
dc.description.abstract	As improving technology makes it easier to select or engineer DNA sequences that produce dangerous proteins, it is important to be able to predict whether a novel DNA sequence is potentially dangerous by determining its taxonomic identity and functional characteristics. These tasks can be facilitated by the ever-increasing amounts of available biological data. Unfortunately, these growing databases can be difficult to take full advantage of because of the corresponding increase in computational and storage costs. Entropy scaling algorithms and data structures present an approach that can expedite this type of analysis by scaling with the amount of entropy contained in the database instead of with the size of the database. Because sets of DNA and protein sequences are biologically meaningful rather than purely random, they demonstrate some amount of structure. As biological databases grow, taking advantage of this structure can be extremely beneficial. The entropy scaling sequence similarity search algorithm introduced here demonstrates this by accelerating the biological sequence search tools BLAST and DIAMOND. Tests of the implementation of this algorithm show that while this approach can lead to improved query times, constructing the required entropy scaling indices is difficult and expensive. To improve performance and remove this bottleneck, I investigate several ideas for accelerating the construction of indices that support entropy scaling searches. The results of these tests identify key tradeoffs and demonstrate that there is potential in using these techniques for sequence similarity searches.	en
dc.description.abstractgeneral	As biological organisms are created and discovered, it is important to compare their genetic information to that of known organisms in order to detect possible harmful or dangerous properties. However, the collection of published genetic information from known organisms is huge and growing rapidly, making it difficult to search. This thesis shows that it might be possible to use the non-random properties of biological information to increase the speed and efficiency of searches; that is, because genetic sequences are not random but have common structures, an increase in known data does not mean a proportional increase in complexity, known as entropy. Specifically, when comparing a new sequence to a set of previously known sequences, it is important to choose the correct algorithms for comparing the similarity of two sequences, also known as the distance between them. This thesis explores the performance of an entropy scaling algorithm compared to several conventional tools.	en
dc.description.degree	Master of Science	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:16720	en
dc.identifier.uri	http://hdl.handle.net/10919/84470	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Bioinformatics	en
dc.subject	Entropy Scaling	en
dc.subject	Sequence Search	en
dc.subject	BLAST	en
dc.title	Entropy Measurements and Ball Cover Construction for Biological Sequences	en
dc.type	Thesis	en
thesis.degree.discipline	Computer Science and Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	masters	en
thesis.degree.name	Master of Science	en

Files

Original bundle

Name: Robertson_JA_T_2018.pdf
Size: 1.21 MB
Format: Adobe Portable Document Format

Collections

Masters Theses