Optimizing analysis pipelines for improved variant discovery

Date

2014-04-17

Publisher

Virginia Tech

Abstract

In modern genomics, every experiment begins with sequencing, followed by downstream alignment or assembly. The development of reliable sequencing pipelines is therefore a critical foundation for any later analysis of that data. While much existing work has focused on improving the throughput and computational performance of such pipelines, the question of accuracy remains open. This gap in knowledge between speed and accuracy can be attributed to the conceptually more complex task of measuring accuracy. Unlike speed, which can be assessed simply by parsing logs of memory usage and CPU hours, accuracy requires experimental validation. Accuracy also divides into subsets when alignment or variant calls are assessed around particular genomic features such as indels, copy number variants (CNVs), or microsatellite repeats. This work develops accuracy measurements for read alignment and variant calls, allowing sequencing pipelines to be optimized at every stage. The underlying hypothesis is that different sequencing platforms and analysis software can be distinguished from one another in accuracy, both by sample and by the genomic variation of interest. As the term accuracy suggests, measuring alignment and variant recall requires comparison against a truth set; here, read library simulations and high-quality data from the Genome in a Bottle Consortium or the Illumina Omni array serve that purpose. In exploring the hypothesis, the measurements are built into a community resource that crowdsources the creation of a benchmarking repository for pipeline comparison. Results from pipelines promoted by this computational model are then validated in the wet lab, supporting a hierarchy of pipeline performance. In particular, the construction of an accurate pipeline for genotyping microsatellite repeats is investigated, and that pipeline is then used to create a database of human microsatellites.

Progress in this area is vital for the growth of sequencing in both clinical and research settings. For genomics research to fully translate to the bedside, the boom of new technology must be controlled by rational metrics and industry standardization. This project will address both of these issues, as well as contribute to the understanding of human microsatellite variation.
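
The truth-set comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names (`variant_type`, `score_calls`), the tuple representation of a variant, and the example data are all assumptions made here for illustration. It computes recall and precision of a call set against a truth set, stratified by a simple variant class, to mirror the idea that accuracy differs by the genomic variation of interest.

```python
from collections import Counter

# Hypothetical representation: a variant is (chrom, pos, ref, alt).
def variant_type(ref, alt):
    """Classify a variant by allele lengths (a deliberately crude scheme)."""
    if len(ref) == 1 and len(alt) == 1:
        return "snv"
    if len(ref) != len(alt):
        return "indel"
    return "other"  # e.g. multi-nucleotide substitutions

def score_calls(truth, calls):
    """Return recall and precision per variant type, using exact matching."""
    truth, calls = set(truth), set(calls)
    counts = {}
    for variant in truth | calls:
        kind = variant_type(variant[2], variant[3])
        tally = counts.setdefault(kind, Counter())
        if variant in truth and variant in calls:
            tally["tp"] += 1  # called and present in the truth set
        elif variant in calls:
            tally["fp"] += 1  # called but absent from the truth set
        else:
            tally["fn"] += 1  # in the truth set but missed by the caller
    scores = {}
    for kind, tally in counts.items():
        tp, fp, fn = tally["tp"], tally["fp"], tally["fn"]
        scores[kind] = {
            "recall": tp / (tp + fn) if tp + fn else None,
            "precision": tp / (tp + fp) if tp + fp else None,
        }
    return scores

# Made-up example data, not results from the thesis.
truth = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "AT", "A")]
calls = [("chr1", 100, "A", "G"), ("chr1", 300, "G", "A"), ("chr2", 50, "AT", "A")]
scores = score_calls(truth, calls)
print(scores)
```

Exact tuple matching is the simplest possible comparison; real benchmarking typically also has to normalize variant representation before matching, which this sketch does not attempt.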

Keywords

Genomics, sequencing, genetics, optimization, bioinformatics
