Browsing by Author "Hasan, Mohammad Shabbir"
- Identifying and Analyzing Indel Variants in the Human Genome Using Computational Approaches
  Hasan, Mohammad Shabbir (Virginia Tech, 2019-07-01)
  Insertion and deletion (indel), a common form of genetic variation, has been shown to cause or contribute to human genetic diseases and cancer. Despite this importance, and despite being the second most abundant variant type in the human genome, indels have not been studied as much as single nucleotide polymorphisms (SNPs). With the advance of next-generation sequencing technology, many indel calling tools have been developed. However, performance comparison of commonly used tools has shown that (1) the tools have limited power in identifying indels and a significant number of indels remain undetected, and (2) there is significant disagreement among the indel sets produced by the tools. These findings indicate the necessity of improving the existing tools or developing new algorithms to achieve reliable and consistent indel calling results. Two indels are biologically equivalent if the resulting sequences are the same. Storing biologically equivalent indels as distinct entries in databases causes data redundancy and misleads downstream analysis. It is thus desirable to have a unified system for identifying and representing equivalent indels. This dissertation describes UPS-indel, a utility tool that creates a universal positioning system for indels so that equivalent indels can be uniquely determined by their coordinates in the new system. Results show that UPS-indel identifies more redundant indels than existing algorithms. While mapping short reads to the reference genome, a significant number of short reads are unmapped and excluded from downstream analyses, thereby causing information loss in the subsequent variant calling. This dissertation describes Genesis-indel, a computational pipeline that explores the unmapped reads to identify missing novel indels. Results from analyzing the sequence alignments of 30 breast cancer patients show that Genesis-indel identifies many novel indels that also show significant enrichment in oncogenes and tumor suppressor genes, demonstrating the importance of rescuing indels hidden in the unmapped reads in cancer and disease studies. Somatic mutations play a vital role in transforming healthy cells into cancer cells. Therefore, accurate identification of somatic mutations is essential. Many somatic mutation callers are available, with different strengths and weaknesses. An ensemble approach integrating the power of the callers is warranted. This dissertation describes SomaticHunter, an ensemble of two callers, namely Platypus and VarDict. Results on synthetic tumor data show that for both SNPs and indels, SomaticHunter achieves recall comparable to the state-of-the-art somatic mutation callers and the highest precision, resulting in the highest F1 score.
- Identifying Pathogenicity Islands in Bacterial Pathogenomics Using Computational Approaches
  Che, Dongsheng; Hasan, Mohammad Shabbir; Chen, Bernard (MDPI, 2014-01-13)
  High-throughput sequencing technologies have made it possible to study bacteria by analyzing their genome sequences. For instance, comparative genome sequence analyses can reveal phenomena such as gene loss, gene gain, or gene exchange in a genome. By analyzing pathogenic bacterial genomes, we can discover that pathogenic genomic regions in many pathogenic bacteria are horizontally transferred from other bacteria; these regions are known as pathogenicity islands (PAIs). PAIs have some detectable properties, such as having genomic signatures different from those of the rest of the host genome, and containing mobility genes so that they can be integrated into the host genome. In this review, we discuss various pathogenicity island-associated features and current computational approaches for the identification of PAIs. Existing pathogenicity island databases and related computational resources are also discussed, which researchers may find useful for studies of bacterial evolution and pathogenicity mechanisms.
- Performance evaluation of indel calling tools using real short-read data
  Hasan, Mohammad Shabbir; Wu, Xiaowei; Zhang, Liqing (BioMed Central, 2015-08-19)
  Background: Insertion and deletion (indel), a common form of genetic variation, has been shown to cause or contribute to human genetic diseases and cancer. With the advance of next-generation sequencing technology, many indel calling tools have been developed; however, evaluation and comparison of these tools using large-scale real data are still scant. Here we evaluated seven popular and publicly available indel calling tools, GATK Unified Genotyper, VarScan, Pindel, SAMtools, Dindel, GATK HaplotypeCaller, and Platypus, using low-coverage data from 78 human genomes from the 1000 Genomes Project. Results: Comparing indels called by these tools with a known set of indels, we found that Platypus outperforms the other tools. In addition, a high percentage of known indels remain undetected, and the number of common indels called by all seven tools is very low. Conclusion: All these findings indicate the necessity of improving the existing tools or developing new algorithms to achieve reliable and consistent indel calling results.
- SPAI: an interactive platform for indel analysis
  Hasan, Mohammad Shabbir; Zhang, Liqing (BMC, 2016-08-31)
  Background: Insertions and deletions (indels) are the most common form of structural variation in the human genome. Indels not only contribute to genetic diversity but also cause diseases. Assessing indels in the human genome has therefore become a topic of interest to the research community. This increasing interest in indel calling research has resulted in the development of a good number of indel calling tools. However, all of these tools are command-line based and require computer science (CS) expertise to run, which makes them challenging for researchers from a non-CS background. Methods: In this paper, we describe an interactive platform named SPAI, which stands for Single Platform for Analyzing Indels. Results: Being a graphical user interface (GUI) tool, SPAI lets users run several popular indel calling tools and perform several analyses on the indel calling results without any knowledge of command-line programming. Conclusions: SPAI is written in Java and has been tested on the Linux operating system.
- Uncovering missed indels by leveraging unmapped reads
  Hasan, Mohammad Shabbir; Wu, Xiaowei; Zhang, Liqing (Springer Nature, 2019-07-31)
  In current practice, Next Generation Sequencing (NGS) applications start with mapping/aligning short reads to the reference genome, with the aim of identifying genetic variants. Although existing alignment tools have shown great accuracy in mapping short reads to the reference genome, a significant number of short reads still remain unmapped and are often excluded from downstream analyses, thereby causing nonnegligible information loss in the subsequent variant calling procedure. This paper describes Genesis-indel, a computational pipeline that explores the unmapped reads to identify novel indels that are initially missed in the original procedure. Genesis-indel is applied to the unmapped reads of 30 breast cancer patients from TCGA. Results show that the unmapped reads are conserved between the two subtypes of breast cancer investigated in this study and might contribute to the divergence between the subtypes. Genesis-indel identifies 72,997 novel high-quality indels previously not found, among which 16,141 have not been annotated in the widely used mutation database. Statistical analysis of these indels shows significant enrichment of indels residing in oncogenes and tumour suppressor genes. Functional annotation further reveals that these indels are strongly correlated with pathways of cancer and can have high to moderate impact on protein functions. Additionally, some of the indels overlap with genes that do not have any indel mutations called from the originally mapped reads but have been shown to contribute to tumorigenesis in multiple carcinomas, further emphasizing the importance of rescuing indels hidden in the unmapped reads in cancer and disease studies.
- UPS-indel: a Universal Positioning System for Indels
  Hasan, Mohammad Shabbir; Wu, Xiaowei; Watson, Layne T.; Li, Zhiyi; Zhang, Liqing (Nature, 2017-10-26)
  Storing biologically equivalent indels as distinct entries in databases causes data redundancy and misleads downstream analysis. It is thus desirable to have a unified system for identifying and representing equivalent indels. Moreover, a unified system is also desirable for comparing the indel calling results produced by different tools. This paper describes UPS-indel, a utility tool that creates a universal positioning system for indels so that equivalent indels can be uniquely determined by their coordinates in the new system, which can also be used to compare different indel calling results. UPS-indel identifies 15% redundant indels in dbSNP, 29% in COSMIC coding, and 13% in COSMIC noncoding datasets across all human chromosomes, higher than previously reported. Comparing the performance of UPS-indel with the existing variant normalization tools vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants shows that UPS-indel is able to identify 456,352 more redundant indels in dbSNP, 2,118 more in COSMIC coding, and 553 more in the COSMIC noncoding indel dataset, in addition to the ones reported jointly by these tools. Moreover, comparing UPS-indel to state-of-the-art approaches for indel call set comparison demonstrates its clear superiority in finding common indels among call sets. UPS-indel is theoretically proven to find all equivalent indels and is thus exhaustive.
- vi-HMM: a novel HMM-based method for sequence variant identification in short-read data
  Tang, Man; Hasan, Mohammad Shabbir; Zhu, Hongxiao; Zhang, Liqing; Wu, Xiaowei (2019-02-13)
  Background: Accurate and reliable identification of sequence variants, including single nucleotide polymorphisms (SNPs) and insertion-deletion polymorphisms (INDELs), plays a fundamental role in next-generation sequencing (NGS) applications. Existing methods for calling these variants often make simplifying assumptions of positional independence and fail to leverage the dependence between genotypes at nearby loci that is caused by linkage disequilibrium (LD). Results and conclusion: We propose vi-HMM, a hidden Markov model (HMM)-based method for calling SNPs and INDELs in mapped short-read data. This method allows transitions between hidden states (defined as “SNP,” “Ins,” “Del,” and “Match”) of adjacent genomic bases and determines an optimal hidden state path by using the Viterbi algorithm. The inferred hidden state path provides a direct solution to the identification of SNPs and INDELs. Simulation studies show that, under various sequencing depths, vi-HMM outperforms commonly used variant calling methods in terms of sensitivity and F1 score. When applied to real data, vi-HMM demonstrates higher accuracy in calling SNPs and INDELs.
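
Note on equivalent indels: the UPS-indel abstracts listed above define two indels as biologically equivalent if the resulting sequences are the same. The minimal Python sketch below illustrates only that definition by applying each indel to a reference and comparing the results. It is not the UPS-indel algorithm; the function names, the `(position, deleted, inserted)` tuple layout, the 0-based coordinates, and the toy reference sequence are all assumptions made for illustration.

```python
def apply_indel(ref: str, pos: int, deleted: str = "", inserted: str = "") -> str:
    """Return ref after removing `deleted` at 0-based `pos` and adding `inserted` there.

    Hypothetical helper for illustration; not part of any published tool.
    """
    if ref[pos:pos + len(deleted)] != deleted:
        raise ValueError("deleted bases do not match the reference at this position")
    return ref[:pos] + inserted + ref[pos + len(deleted):]


def are_equivalent(ref: str, indel_a: tuple, indel_b: tuple) -> bool:
    """Two indels are equivalent if they turn the reference into the same sequence."""
    return apply_indel(ref, *indel_a) == apply_indel(ref, *indel_b)


# Toy reference containing a short CA repeat (an assumption, not real data).
REF = "GCACACAT"

# Three differently positioned 2-bp deletions inside the repeat.
del_ca_at_1 = (1, "CA", "")
del_ac_at_2 = (2, "AC", "")
del_ca_at_5 = (5, "CA", "")

# A deletion that removes different content near the end of the reference.
del_at_at_6 = (6, "AT", "")

print(are_equivalent(REF, del_ca_at_1, del_ac_at_2))  # same resulting sequence
print(are_equivalent(REF, del_ca_at_1, del_ca_at_5))  # same resulting sequence
print(are_equivalent(REF, del_ca_at_1, del_at_at_6))  # different resulting sequence
```

Direct comparison like this needs one pass over the reference per pair of indels, so it is useful only as a definition check on small inputs. A database-scale tool needs a canonical representation of each indel instead, which is the role the abstracts above assign to UPS-indel's coordinate system.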