Browsing by Author "Ressom, Habtom W."
- Automated Analysis of Astrocyte Activities from Large-scale Time-lapse Microscopic Imaging Data
  Wang, Yizhi (Virginia Tech, 2019-12-13)
  The advent of multi-photon microscopes and highly sensitive protein sensors enables the recording of astrocyte activities across large populations of cells over long periods in vivo. Existing tools cannot fully characterize these activities, either within single cells or at the population level, because current region-of-interest (ROI)-based approaches cannot describe activity that is often spatially unfixed, size-varying, and propagative. Here, we present Astrocyte Quantitative Analysis (AQuA), an analytical framework that releases astrocyte biologists from the ROI-based paradigm. The framework takes an event-based perspective to model and accurately quantify the complex activity in astrocyte imaging datasets, with an event defined jointly by its spatial occupancy and temporal dynamics. To model signal propagation in astrocytes, we developed graphical time warping (GTW) to align curves under graph-structured constraints and integrated it into AQuA. To make AQuA easy to use, we designed a comprehensive software package that implements the detection pipeline in an intuitive step-by-step GUI with visual feedback; the software also supports proofreading and the incorporation of morphology information. On synthetic data, AQuA was substantially more accurate than existing methods developed for astrocytic and neuronal data. We applied AQuA to a range of ex vivo and in vivo imaging datasets. Because AQuA is data-driven and based on machine learning principles, it can be applied across model organisms, fluorescent indicators, experimental modes, and imaging resolutions and speeds, enabling researchers to elucidate fundamental astrocyte physiology.
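The graphical time warping mentioned above generalizes classic dynamic time warping (DTW) by adding graph-structured constraints across many curves. As a hedged illustration of the underlying alignment idea only, here is a minimal plain DTW in Python; the function name and the simplification to a single pair of 1-D curves are mine, not AQuA's:

```python
def dtw_distance(x, y):
    """Classic dynamic time warping between two 1-D sequences.

    A toy illustration of the curve-alignment idea that graphical time
    warping (GTW) builds on; GTW's graph-structured constraints across
    many curves are not shown here.
    """
    n, m = len(x), len(y)
    INF = float("inf")
    # cost[i][j] = best cumulative cost aligning x[:i] with y[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch x
                                 cost[i][j - 1],      # stretch y
                                 cost[i - 1][j - 1])  # match both
    return cost[n][m]
```

Because DTW may repeat samples at no cost, a signal aligns perfectly with a time-stretched copy of itself, which is the property that makes warping useful for propagating signals recorded at different delays.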
- Automated Identification and Tracking of Motile Oligodendrocyte Precursor Cells (OPCs) from Time-lapse 3D Microscopic Imaging Data of Cell Clusters in vivo
  Wang, Yinxue (Virginia Tech, 2021-06-02)
  Advances in time-lapse 3D in vivo fluorescence microscopy enable the observation and investigation of the migration of oligodendrocyte precursor cells (OPCs) and its role in the central nervous system. However, current practice in image-based OPC motility analysis relies heavily on manual labeling and tracking on 2D maximum projections of the 3D data, which entails substantial human labor, subjective bias, and weak reproducibility, and above all suffers from information loss and distortion. Moreover, because no OPC-specific genetically encoded indicator exists, OPCs can be distinguished from other oligodendrocyte-lineage cells only by their observed motion patterns. Automated analytical tools are therefore needed to identify and track OPCs. In this dissertation, we propose an analytical framework, MicTracker (Migrating Cell Tracker), for the integrated task of identifying, segmenting, and tracking migrating cells (OPCs) in in vivo time-lapse fluorescence imaging data of high-density cell clusters composed of cells with different modes of motion. As one component of the framework, we present a novel cell segmentation strategy that enforces global temporal consistency, tackling the challenges posed by the highly clustered cell population and the temporally inconsistent blurring of boundaries between touching cells. We also designed a data association algorithm that addresses violations of the usual small-displacement assumption. Recognizing that the violation arises in the mixed population of two cell groups while the assumption holds within each group, we propose to de-mix the two groups of cell motion modes without known labels. We demonstrate the effectiveness of MicTracker on real in vivo data.
- Bayesian Alignment Model for Analysis of LC-MS-based Omic Data
  Tsai, Tsung-Heng (Virginia Tech, 2014-05-22)
  Liquid chromatography coupled with mass spectrometry (LC-MS) has been widely used in omic studies for biomarker discovery. Appropriate preprocessing of LC-MS data is needed to detect true differences between biological groups. Retention time alignment is one of the most important yet challenging preprocessing steps, ensuring that ion intensity measurements are comparable across multiple LC-MS runs. In this dissertation, we propose a Bayesian alignment model (BAM) for the analysis of LC-MS data. BAM uses Markov chain Monte Carlo (MCMC) methods to draw inference on the model parameters and provides estimates of retention time variability along with uncertainty measures, offering a natural framework for integrating information from various sources. From methodology development to practical application, we investigate the alignment problem through three research topics: (1) development of a single-profile Bayesian alignment model, (2) development of a multi-profile Bayesian alignment model, and (3) application to biomarker discovery research. Chapter 2 introduces profile-based Bayesian alignment using a single chromatogram, e.g., the base peak chromatogram, from each LC-MS run. The single-profile alignment model improves on existing MCMC-based alignment methods through (1) an efficient MCMC sampler based on a block Metropolis-Hastings algorithm, and (2) an adaptive mechanism for knot specification using stochastic search variable selection (SSVS). Chapter 3 extends the model to integrate complementary information that better captures the variability in chromatographic separation. We use Gaussian process regression on internal standards to derive a prior distribution for the mapping functions. In addition, a clustering approach is proposed to identify multiple representative chromatograms for each LC-MS run. With the Gaussian process prior, these chromatograms are considered simultaneously in the profile-based alignment, which greatly improves model estimation and facilitates the subsequent peak matching process. Chapter 4 demonstrates the applicability of the proposed Bayesian alignment model to biomarker discovery research: we integrate the model into a rigorous preprocessing pipeline for LC-MS data analysis, through which candidate biomarkers for hepatocellular carcinoma (HCC) are identified and confirmed on a complementary platform.
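The MCMC machinery the abstract refers to can be illustrated with the simplest member of the family. Below is a one-dimensional random-walk Metropolis-Hastings sampler, my minimal sketch of the basic scheme; BAM's block updates, spline knots, and SSVS are far richer and are not shown:

```python
import math
import random

def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings: the elementary MCMC scheme
    that block samplers like BAM's build on (one-dimensional toy)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        prop = x + rng.gauss(0.0, step)            # symmetric proposal
        log_alpha = log_target(prop) - log_target(x)
        # accept with probability min(1, exp(log_alpha))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = prop
        samples.append(x)
    return samples

# example target: standard normal log-density (up to a constant)
samples = metropolis_hastings(lambda x: -0.5 * x * x, 0.0, 5000)
```

In an alignment model the target would be the posterior over warping-function parameters rather than a normal density, but the accept/reject mechanics are the same.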
- Differential Network Analysis based on Omic Data for Cancer Biomarker Discovery
  Zuo, Yiming (Virginia Tech, 2017-06-16)
  Recent advances in high-throughput techniques enable the generation of large amounts of omic data, such as genomics, transcriptomics, proteomics, metabolomics, and glycomics. Typically, differential expression analysis (e.g., Student's t-test, ANOVA) is performed to identify biomolecules (e.g., genes, proteins, metabolites, glycans) with significant changes at the individual level between biologically disparate groups (disease cases vs. healthy controls) for cancer biomarker discovery. However, differential expression analyses on independent studies of the same clinical types of patients often yield different sets of significant biomolecules with few in common. This may be because biomolecules are members of strongly intertwined biological pathways and interact extensively with one another; without considering these interactions, differential expression analysis can produce biased results. Network-based methods provide a natural framework to study the interactions between biomolecules. Commonly used data-driven network models include relevance networks, Bayesian networks, and Gaussian graphical models. In addition, many publicly available databases, such as STRING, KEGG, Reactome, and ConsensusPathDB, provide various types of interactions from which knowledge-driven networks can be built. Since both data- and knowledge-driven networks have their pros and cons, an approach that incorporates prior biological knowledge from public databases into a data-driven network model is desirable for more robust and biologically relevant network reconstruction.
  Recently, there has been growing interest in differential network analysis, in which a connection in the network represents a statistically significant change in the pairwise interaction between two biomolecules across groups. From the rewired interactions shown in differential networks, biomolecules with strongly altered connectivity between distinct biological groups can be identified; these biomolecules may play an important role in the disease under study. In fact, differential expression and differential network analyses investigate omic data from two complementary perspectives: the former focuses on changes in individual biomolecule levels between groups, while the latter concentrates on changes at the level of pairwise interactions. Therefore, an approach that integrates the two is likely to discover more reliable and powerful biomarkers. To achieve these goals, we start by proposing a novel data-driven network model, LOPC, to reconstruct sparse biological networks. The sparse networks contain only direct interactions between biomolecules, helping researchers focus on the more informative connections. We then propose a novel method, dwgLASSO, that incorporates prior biological knowledge into a data-driven network model to build biologically relevant networks. Differential network analysis is applied to the networks constructed for biologically disparate groups to identify cancer biomarker candidates. Finally, we propose a novel network-based approach, INDEED, that integrates differential expression and differential network analyses to identify more reliable and powerful cancer biomarker candidates. INDEED is further extended as INDEED-M to utilize omic data at different levels of the human biological system (e.g., transcriptomics, proteomics, metabolomics), which we believe is promising for increasing our understanding of cancer. Matlab and R packages for the proposed methods are available on GitHub (https://github.com/Hurricaner1989).
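The distinction between direct and indirect interactions that sparse models such as LOPC exploit can be shown with a first-order partial correlation: a strong pairwise correlation that vanishes after conditioning on a third variable suggests an indirect link. The sketch below uses the textbook recursive formula; the helper names are mine, and LOPC's full zero-, first-, and second-order procedure is not reproduced:

```python
import math

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y controlling for z,
    via the standard recursion on pairwise correlations."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz**2) * (1 - ryz**2))
```

For two variables that are both driven by a common factor `z` but carry independent residuals, `pearson(x, y)` is large while `partial_corr(x, y, z)` is near zero, which is the signature of an indirect interaction that a sparse network would prune.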
- Incorporating prior biological knowledge for network-based differential gene expression analysis using differentially weighted graphical LASSO
  Zuo, Yiming; Cui, Yi; Yu, Guoqiang; Li, Ruijiang; Ressom, Habtom W. (2017-02-10)
  Background: Conventional differential gene expression analysis by methods such as Student's t-test, SAM, and Empirical Bayes often searches for statistically significant genes without considering the interactions among them. Network-based approaches provide a natural way to study these interactions and to investigate rewired interactions in disease versus control groups. In this paper, we apply the weighted graphical LASSO (wgLASSO) algorithm to integrate a data-driven network model with prior biological knowledge (i.e., protein-protein interactions) for biological network inference. We propose a novel differentially weighted graphical LASSO (dwgLASSO) algorithm that builds group-specific networks and performs network-based differential gene expression analysis to select biomarker candidates based on their topological differences between the groups.
  Results: Through simulation, we show that wgLASSO builds more biologically relevant networks than purely data-driven models (e.g., neighbor selection, graphical LASSO), even when only a moderate level of prior biological knowledge is available. We evaluated the performance of dwgLASSO for survival time prediction using two microarray breast cancer datasets previously reported by Bild et al. and van de Vijver et al. Compared with the top 10 significant genes selected by conventional differential gene expression analysis, the top 10 significant genes selected by dwgLASSO in the Bild et al. dataset led to significantly improved survival time prediction in the independent van de Vijver et al. dataset. Among the 10 genes selected by dwgLASSO, UBE2S, SALL2, XBP1, and KIAA0922 have been confirmed by literature survey to be highly relevant to breast cancer biomarker discovery. Additionally, we tested dwgLASSO on TCGA RNA-seq data acquired from patients with hepatocellular carcinoma (HCC), comparing tumor samples with their corresponding non-tumorous liver tissues. Improved sensitivity, specificity, and area under the curve (AUC) were observed when comparing dwgLASSO with conventional differential gene expression analysis.
  Conclusions: The proposed network-based differential gene expression analysis algorithm, dwgLASSO, outperforms conventional differential gene expression analysis methods by integrating information at both the gene expression and network topology levels. Incorporating prior biological knowledge can lead to the identification of biologically meaningful genes in cancer biomarker studies.
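The topology-based gene ranking the abstract describes can be caricatured as follows: build one network per group, then score each gene by how much its weighted connectivity changes between the two. This is my simplified stand-in for dwgLASSO's differential scoring, not the published formula:

```python
def degree_difference(adj_a, adj_b):
    """Score each node by its change in weighted connectivity between
    two group-specific networks.

    adj_a, adj_b: symmetric edge-weight matrices (lists of lists) for,
    e.g., tumor and control networks over the same genes. A toy
    stand-in for dwgLASSO's topology-based biomarker ranking.
    """
    scores = []
    for i in range(len(adj_a)):
        # weighted degree, excluding any self-loop on the diagonal
        deg_a = sum(abs(w) for w in adj_a[i]) - abs(adj_a[i][i])
        deg_b = sum(abs(w) for w in adj_b[i]) - abs(adj_b[i][i])
        scores.append(abs(deg_a - deg_b))
    return scores
```

Genes with the largest scores are the "rewired" candidates; in the real method the group-specific matrices would come from the weighted graphical LASSO fits rather than being given.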
- Module-based Analysis of Biological Data for Network Inference and Biomarker Discovery
  Zhang, Yuji (Virginia Tech, 2010-07-20)
  Systems biology comprises the global, integrated analysis of large-scale data encoding different levels of biological information, with the aim of obtaining global insight into cellular networks. Several studies have unveiled the modular and hierarchical organization inherent in these networks. In this dissertation, we propose and develop innovative systems approaches to integrate multi-source biological data in a modular manner for network inference and biomarker discovery in complex diseases such as breast cancer.
  The first part of the dissertation focuses on identifying gene modules in gene expression data. As the most popular way to identify gene modules, many clustering algorithms have been applied to gene expression data. To evaluate clustering algorithms from a biological point of view, we propose a figure of merit based on the Kullback-Leibler divergence between cluster membership and known gene ontology attributes. Several benchmark expression-based gene clustering algorithms are compared using the proposed method with different parameter settings. Applications to diverse public time-course gene expression data demonstrate that fuzzy c-means clustering is superior to the other clustering methods with regard to the enrichment of clusters for biological functions. These results contribute to the evaluation of clustering outcomes and the estimation of optimal clustering partitions.
  The second part of the dissertation presents a hybrid computational intelligence method to infer gene regulatory modules. We exploit the combined advantages of the nonlinear and dynamic properties of neural networks and the global search capabilities of a hybrid genetic algorithm and particle swarm optimization method to infer network interactions at the modular level. The proposed computational framework is tested on two biological processes: the yeast cell cycle and the human HeLa cancer cell cycle. The identified gene regulatory modules were evaluated using several validation strategies: (1) gene set enrichment analysis to evaluate the gene modules derived from clustering results; (2) binding site enrichment analysis to determine enrichment of the gene modules for the cognate binding sites of their predicted transcription factors; and (3) comparison with previously reported results in the literature to confirm the inferred regulations. The proposed framework could help biologists predict the components of the gene regulatory modules in which any candidate gene is involved; such predictions can then be used to design a more streamlined experimental approach for biological validation. Understanding the dynamics of these gene regulatory modules will shed light on the related regulatory processes.
  Driven by the fact that complex diseases such as cancer are "diseases of pathways", we extend the module concept to biomarker discovery in cancer research. In the third part of the dissertation, we exploit the combined advantages of molecular interaction networks and gene expression profiles to identify biomarkers. The reliability of conventional gene biomarkers has been challenged because of biological heterogeneity and noise within and across patients. We present a module-based biomarker discovery approach that integrates interaction network topology and high-throughput gene expression data to identify markers not as individual genes but as modules. To select reliable biomarker sets across different studies, a hybrid method combining group feature selection with ensemble feature selection is proposed. First, a group feature selection method extracts the modules (subnetworks) with discriminative power between disease groups. Then, an ensemble feature selection method selects the optimal biomarker sets, applying a double-validation strategy: the ensemble method combines features selected from multiple classifications with various data subsamplings to increase the reliability and classification accuracy of the final selected biomarker set. Results from four breast cancer studies demonstrate the superiority of the module biomarkers identified by the proposed approach: they achieve higher accuracy and are more reliable across datasets with the same clinical design. Based on these results, we believe the proposed systems approaches provide meaningful solutions for discovering cellular regulatory processes and improving our understanding of disease mechanisms. These computational approaches were developed primarily for the analysis of high-throughput genomic data, but they can also be extended to high-throughput proteomic and metabolomic data.
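The Kullback-Leibler-based figure of merit described in the first part rests on the standard discrete KL divergence. A minimal sketch of that quantity, with a small smoothing constant of my own choosing to avoid log-of-zero; the dissertation's exact figure of merit over cluster memberships is not reproduced:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions.

    In the clustering-evaluation setting above, p could be a cluster's
    gene ontology term frequencies and q the background frequencies;
    a larger value means the cluster is more enriched, i.e., more
    biologically coherent. eps guards against zero probabilities.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))
```

Identical distributions score zero, and a cluster concentrated on a few terms scores well above the background, which is the behavior a clustering figure of merit needs.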
- Multi-Platform Molecular Data Integration and Disease Outcome Analysis
  Youssef, Ibrahim Mohamed (Virginia Tech, 2016-12-06)
  One of the most common measures of clinical outcome is survival time. Accurately linking cancer molecular profiling with survival outcomes advances the clinical management of cancer. However, existing survival analysis relies intensively on statistical evidence from a single level of data, without paying much attention to the integration of interacting multi-level data and the underlying biology. Advances in genomic techniques provide unprecedented power to characterize cancer tissue more completely than before, opening the opportunity to design biologically informed, integrative approaches to survival analysis. Many cancer tissues have been profiled for gene expression levels and genomic variants (such as copy number alterations, sequence mutations, DNA methylation, and histone modification), yet it is not clear how to integrate gene expression with genetic variants to achieve better prediction and understanding of cancer survival. To address this challenge, we propose two approaches to data integration that both biologically and statistically boost feature selection for proper detection of the true predictors of survival.
  The first approach is data-driven yet biologically informed. Consistent with the biological hierarchy from DNA to RNA, we prioritize each survival-relevant feature with two separate scores, predictive and mechanistic. Focusing on mRNA expression levels, predictive features are those mRNAs whose variation in expression is associated with the survival outcome, and mechanistic features are those mRNAs whose variation in expression is associated with genomic variants (copy number alterations (CNAs) in this study). Further, we propose simultaneously integrating information from both the predictive and mechanistic models through our new approach, GEMPS (Gene Expression as a Mediator for Predicting Survival). Applied to two cancer types (ovarian cancer and glioblastoma multiforme), our method achieved better prediction power than peer methods. Gene set enrichment analysis confirms that the genes utilized for the final survival analysis are biologically important and relevant.
  The second approach is a generic mathematical framework for biologically regularizing Cox's proportional hazards model, which is widely used in survival analysis. We propose a penalty function that both links the mechanistic model to the clinical model and reflects the biological downstream regulatory effect of the genomic variants on the mRNA expression levels of the target genes. Fast and efficient optimization principles, such as coordinate descent and majorization-minimization, are adopted to infer the coefficients of the Cox model predictors. Through this model, we extend the regulator-target gene relationship to a regulator-target-outcome relationship of a disease. Assessed via a simulation study and analysis of two real cancer datasets, the proposed method showed better performance in selecting the true predictors and achieving better survival prediction; it also gives insightful and meaningful interpretability to the selected model because it biologically links the mechanistic and clinical models.
  Other important forms of clinical outcome are monitoring angiogenesis (the formation of new blood vessels necessary for a tumor to nourish itself and sustain its existence) and assessing therapeutic response. This can be done through dynamic imaging, in which a series of images at different time instances is acquired for a specific tumor site after injection of a contrast agent. Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is a noninvasive tool for examining tumor vasculature patterns based on the accumulation and washout of the contrast agent. DCE-MRI indicates tumor vasculature permeability, which in turn reflects tumor angiogenic activity; observing this activity over time can reveal drug responsiveness and the efficacy of the treatment plan. However, due to the limited resolution of imaging scanners, a partial-volume effect (PVE) arises: signals from two or more tissues combine to produce a single concentration value within a pixel, leading to inaccurate estimates of the pharmacokinetic parameters. A multi-tissue compartmental modeling technique supported by convex analysis of mixtures (CAM-CM) mitigates the PVE by clustering pixels and constructing a simplex whose vertices correspond to a single compartment type; CAM uses the identified pure-volume pixels to estimate the kinetics of the tissues under investigation. We propose an enhanced version of CAM-CM that identifies pure-volume pixels more accurately by considering the neighborhood effect on each pixel and by using a barycentric coordinate system to identify more pure-volume pixels and to test those identified by CAM-CM. Tested on simulated DCE-MRI data, the enhanced CAM-CM achieved better performance in terms of accuracy and reproducibility.
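The Cox model at the heart of the second approach can be made concrete through its partial likelihood. The sketch below evaluates the negative Cox partial log-likelihood with a generic L1 penalty as a placeholder; the dissertation's actual penalty, which couples the mechanistic and clinical models, is not reproduced, tied event times are ignored, and all names are mine:

```python
import math

def cox_neg_log_partial_likelihood(times, events, X, beta, lam=0.0):
    """Negative Cox partial log-likelihood plus an optional L1 penalty.

    times: observed times; events: 1 = event, 0 = censored;
    X: covariate rows; beta: coefficients. A simplified sketch of the
    objective that biologically regularized Cox models minimize.
    """
    def risk(x):
        # linear predictor beta' x
        return sum(b * xi for b, xi in zip(beta, x))

    nll = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:   # censored subjects only contribute to risk sets
            continue
        # risk set: everyone still under observation at time t_i
        log_denom = math.log(sum(
            math.exp(risk(X[j]))
            for j, t_j in enumerate(times) if t_j >= t_i
        ))
        nll -= risk(X[i]) - log_denom
    return nll + lam * sum(abs(b) for b in beta)
```

In the regularized framework described above, the plain `lam * |beta|` term would be replaced by a penalty that also measures disagreement between the Cox coefficients and the genomic-variant (mechanistic) model, and the sum would be minimized by coordinate descent.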
- Network motif-based identification of transcription factor-target gene relationships by integrating multi-source biological data
  Zhang, Yuji; Xuan, Jianhua; de los Reyes, Benildo G.; Clarke, Robert; Ressom, Habtom W. (2008-04-21)
  Background: Integrating data from multiple global assays and curated databases is essential to understanding the spatio-temporal interactions within cells. Different experiments measure cellular processes at various widths and depths, while databases contain biological information based on established facts or published data. Integrating these complementary datasets helps infer a mutually consistent transcriptional regulatory network (TRN) with strong similarity to the structure of the underlying genetic regulatory modules. Decomposing the TRN into a small set of recurring regulatory patterns, called network motifs (NMs), facilitates the inference. Identifying the NMs defined by specific transcription factors (TFs) establishes the framework structure of a TRN and allows the inference of TF-target gene relationships. This paper introduces a computational framework for utilizing data from multiple sources to infer TF-target gene relationships on the basis of NMs. The data include time-course gene expression profiles, genome-wide location analysis data, binding sequence data, and gene ontology (GO) information.
  Results: The proposed computational framework was tested using gene expression data associated with cell cycle progression in yeast. Among 800 cell cycle-related genes, 85 were identified as candidate TFs and classified into four previously defined NMs. The NMs for a subset of TFs were obtained from the literature; support vector machine (SVM) classifiers were used to estimate the NMs for the remaining TFs. The potential downstream target genes of the TFs were clustered into 34 biologically significant groups. The relationships between TFs and potential target gene clusters were examined by training recurrent neural networks whose topologies mimic the NMs to which the TFs were classified. The identified relationships between TFs and gene clusters were evaluated using the following biological validation and statistical analyses: (1) gene set enrichment analysis (GSEA) to evaluate the clustering results; (2) leave-one-out cross-validation (LOOCV) to ensure that the SVM classifiers assign TFs to NM categories with high confidence; (3) binding site enrichment analysis (BSEA) to determine enrichment of the gene clusters for the cognate binding sites of their predicted TFs; and (4) comparison with previously reported results in the literature to confirm the inferred regulations.
  Conclusion: The major contribution of this study is a computational framework that assists the inference of TRNs by integrating heterogeneous data from multiple sources and by decomposing a TRN into NM-based modules. The inference capability of the proposed framework is verified statistically (e.g., LOOCV) and biologically (e.g., GSEA, BSEA, and literature validation). The framework is useful for inferring small NM-based modules of TF-target gene relationships that can serve as a basis for generating new testable hypotheses.
- Novel Preprocessing and Normalization Methods for Analysis of GC/LC-MS Data
  Nezami Ranjbar, Mohammad Rasoul (Virginia Tech, 2015-06-02)
  We introduce new methods for preprocessing and normalization of data acquired by gas/liquid chromatography coupled with mass spectrometry (GC/LC-MS). Normalization is desired prior to statistical analysis to adjust for variability in ion intensities that is not caused by biological differences. There are several sources of experimental bias, including variability in sample collection and storage, poor experimental design, and noise; in addition, instrument variability in experiments involving a large number of runs leads to significant drift in intensity measurements. We propose new normalization methods based on bootstrapping, Gaussian process regression, non-negative matrix factorization (NMF), and Bayesian hierarchical models. These methods model the bias by borrowing information across runs and features. Another novel aspect is the use of scan-level data to improve the accuracy of quantification. We evaluated the performance of our methods using simulated and experimental data; in comparison with several existing methods, the proposed methods yielded significant improvement.
  Gas chromatography coupled with mass spectrometry (GC-MS) is one of the technologies widely used for qualitative and quantitative analysis of small molecules. In particular, GC coupled to single-quadrupole MS can be utilized for targeted analysis by selected ion monitoring (SIM). However, to our knowledge, no software tools have been specifically designed for analysis of GC-SIM-MS data. We introduce SIMAT, a new R package for quantitative analysis of the levels of targeted analytes. SIMAT provides guidance in choosing fragments for a list of targets through an optimization algorithm that can select the most appropriate fragments from overlapping peaks based on a pre-specified library of background analytes. The tool also allows visualization of the total ion chromatogram (TIC) of runs and the extracted ion chromatograms (EICs) of analytes of interest. Moreover, retention index (RI) calibration can be performed, and raw GC-SIM-MS data can be imported in netCDF or NIST mass spectral library (MSL) formats. We evaluated the performance of SIMAT using several experimental datasets. Our results demonstrate that SIMAT performs better than AMDIS and MetaboliteDetector in finding the correct targets in acquired GC-SIM-MS data and estimating their relative levels.
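The fragment-selection problem SIMAT's optimization addresses can be caricatured with a simple rule: prefer fragments whose m/z values do not appear in the background library, breaking ties by intensity. This is my toy illustration, not SIMAT's algorithm, which scores overlapping chromatographic peaks rather than bare m/z membership:

```python
def select_fragments(target_fragments, background_fragments, k=2):
    """Pick up to k quantifier fragments for one target analyte.

    target_fragments: list of (mz, intensity) pairs for the target.
    background_fragments: set of m/z values produced by known
    background analytes. A toy stand-in for SIMAT-style fragment
    selection: interference-free fragments first, then by intensity.
    """
    ranked = sorted(
        target_fragments,
        key=lambda f: (f[0] in background_fragments, -f[1]),
    )
    return [mz for mz, _ in ranked[:k]]
```

With no background interference the most intense fragment wins; when the base peak collides with a background m/z, a weaker but cleaner fragment is chosen instead, which is the trade-off the abstract alludes to.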
- Reconstruction of Gene Regulatory Modules in Cancer Cell Cycle by Multi-Source Data Integration
  Zhang, Yuji; Xuan, Jianhua; de los Reyes, Benildo G.; Clarke, Robert; Ressom, Habtom W. (PLOS, 2010-04-21)
  Background: Precise regulation of the cell cycle is crucial to the growth and development of all organisms, and understanding this regulatory mechanism is crucial to unraveling many complicated diseases, most notably cancer. Multiple sources of biological data are available for studying the dynamic interactions among the many genes related to the cancer cell cycle. Integrating these informative and complementary data sources can help infer a mutually consistent gene transcriptional regulatory network with strong similarity to the underlying gene regulatory relationships in cancer cells.
  Results and Principal Findings: We propose an integrative framework that infers gene regulatory modules from the cell cycle of cancer cells by incorporating multiple sources of biological data, including gene expression profiles, gene ontology, and molecular interactions. Among 846 human genes with putative roles in cell cycle regulation, we identified 46 transcription factors and 39 gene ontology groups. We reconstructed regulatory modules to infer the underlying regulatory relationships. Four regulatory network motifs were identified from the interaction network. The relationship between each transcription factor and its predicted target gene groups was examined by training a recurrent neural network whose topology mimics the network motif(s) to which the transcription factor was assigned. Inferred network motifs related to eight well-known cell cycle genes were confirmed by gene set enrichment analysis, binding site enrichment analysis, and comparison with previously published experimental results.
  Conclusions: We established a robust method that can accurately infer the underlying relationships between a given transcription factor and its downstream target genes by integrating different layers of biological data. Our method could also help biologists predict the components of the regulatory modules in which any candidate gene is involved; such predictions can then be used to design a more streamlined experimental approach for biological validation. Understanding the dynamics of these modules will shed light on the processes that occur in cancer cells as a result of errors in cell cycle regulation.
- Reverse engineering module networks by PSO-RNN hybrid modelingZhang, Yuji; Xuan, Jianhua; de los Reyes, Benildo G.; Clarke, Robert; Ressom, Habtom W. (2009-07-07)Background Inferring a gene regulatory network (GRN) from high throughput biological data is often an under-determined problem and is a challenging task due to the following reasons: (1) thousands of genes are involved in one living cell; (2) complex dynamic and nonlinear relationships exist among genes; (3) a substantial amount of noise is involved in the data, and (4) the typical small sample size is very small compared to the number of genes. We hypothesize we can enhance our understanding of gene interactions in important biological processes (differentiation, cell cycle, and development, etc) and improve the inference accuracy of a GRN by (1) incorporating prior biological knowledge into the inference scheme, (2) integrating multiple biological data sources, and (3) decomposing the inference problem into smaller network modules. Results This study presents a novel GRN inference method by integrating gene expression data and gene functional category information. The inference is based on module network model that consists of two parts: the module selection part and the network inference part. The former determines the optimal modules through fuzzy c-mean (FCM) clustering and by incorporating gene functional category information, while the latter uses a hybrid of particle swarm optimization and recurrent neural network (PSO-RNN) methods to infer the underlying network between modules. Our method is tested on real data from two studies: the development of rat central nervous system (CNS) and the yeast cell cycle process. The results are evaluated by comparing them to previously published results and gene ontology annotation information. Conclusion The reverse engineering of GRNs in time course gene expression data is a major obstacle in system biology due to the limited number of time points. 
Our experiments demonstrate that the proposed method can address this challenge by: (1) preprocessing gene expression data (e.g., normalization and missing value imputation) to reduce the data noise; (2) clustering genes based on gene expression data and gene functional category information to identify biologically meaningful modules, thereby reducing the dimensionality of the data; and (3) modeling GRNs among the modules with the PSO-RNN method to capture their nonlinear and dynamic relationships. The method is shown to lead to biologically meaningful modules and networks among the modules.
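The PSO half of the hybrid can be illustrated with a minimal sketch: a plain particle swarm searching the weight space of a one-unit recurrent model to fit a short time course. The toy trajectory, the single-unit model form, and the swarm parameters below are illustrative assumptions, not the paper's actual PSO-RNN configuration.

```python
import math
import random

random.seed(0)

# Hypothetical one-gene "module" trajectory to fit (made-up values).
target = [0.10, 0.30, 0.50, 0.65, 0.75, 0.80]

def sigmoid(z):
    # Clamped to avoid overflow when the swarm explores extreme weights.
    if z < -60: return 0.0
    if z > 60: return 1.0
    return 1.0 / (1.0 + math.exp(-z))

def simulate(params, steps):
    """One-unit recurrent model: x(t+1) = sigmoid(w*x(t) + b), x(0) = x0."""
    w, b, x0 = params
    xs = [x0]
    for _ in range(steps - 1):
        xs.append(sigmoid(w * xs[-1] + b))
    return xs

def cost(params):
    sim = simulate(params, len(target))
    return sum((s - t) ** 2 for s, t in zip(sim, target)) / len(target)

def pso(cost, dim, n_particles=30, iters=200, inertia=0.7, c1=1.5, c2=1.5):
    # Standard global-best PSO: each particle tracks its personal best,
    # and velocities blend inertia, personal memory, and the swarm's best.
    pos = [[random.uniform(-2, 2) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost

best, err = pso(cost, dim=3)
```

Because PSO needs only cost evaluations, not gradients, it pairs naturally with recurrent models whose error surfaces are nonconvex; the actual method applies this idea to multi-module RNNs rather than a single unit.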
- SIMAT: GC-SIM-MS data analysis toolNezami Ranjbar, Mohammad R.; Poto, Cristina D.; Wang, Yue; Ressom, Habtom W. (2015-08-19)Background Gas chromatography coupled with mass spectrometry (GC-MS) is one of the technologies widely used for qualitative and quantitative analysis of small molecules. In particular, GC coupled to single quadrupole MS can be utilized for targeted analysis by selected ion monitoring (SIM). However, to our knowledge, there are no software tools specifically designed for analysis of GC-SIM-MS data. In this paper, we introduce a new R/Bioconductor package called SIMAT for quantitative analysis of the levels of targeted analytes. SIMAT provides guidance in choosing fragments for a list of targets. This is accomplished through an optimization algorithm that has the capability to select the most appropriate fragments from overlapping chromatographic peaks based on a pre-specified library of background analytes. The tool also allows visualization of the total ion chromatograms (TIC) of runs and extracted ion chromatograms (EIC) of analytes of interest. Moreover, retention index (RI) calibration can be performed and raw GC-SIM-MS data can be imported in netCDF or NIST mass spectral library (MSL) formats. Results We evaluated the performance of SIMAT using two GC-SIM-MS datasets obtained by targeted analysis of: (1) plasma samples from 86 patients in a targeted metabolomic experiment; and (2) mixtures of internal standards spiked in plasma samples at varying concentrations in a method development study. Our results demonstrate that SIMAT offers alternative solutions to AMDIS and MetaboliteDetector to achieve accurate detection of targets and estimation of their relative intensities by analysis of GC-SIM-MS data. Conclusions We introduce a new R package called SIMAT that allows the selection of the optimal set of fragments and retention time windows for target analytes in GC-SIM-MS based analysis. 
Also, various functions and algorithms are implemented in the tool to: (1) read and import raw data and spectral libraries; (2) perform GC-SIM-MS data preprocessing; and (3) plot and visualize EICs and TICs.
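One of the SIMAT features listed above, retention index (RI) calibration, can be sketched with the standard linear-interpolation formula against an n-alkane ladder. This is a generic Python illustration under that assumption, not SIMAT's actual implementation (SIMAT is an R/Bioconductor package), and the alkane ladder values are hypothetical.

```python
def retention_index(rt, standards):
    """Interpolate a retention index from (carbon_number, retention_time)
    alkane standards: RI = 100 * (n_lo + (n_hi - n_lo) *
    (rt - rt_lo) / (rt_hi - rt_lo)) within the bracketing pair."""
    standards = sorted(standards, key=lambda s: s[1])
    for (n_lo, rt_lo), (n_hi, rt_hi) in zip(standards, standards[1:]):
        if rt_lo <= rt <= rt_hi:
            return 100.0 * (n_lo + (n_hi - n_lo) * (rt - rt_lo) / (rt_hi - rt_lo))
    raise ValueError("retention time outside calibration range")

# Hypothetical alkane ladder: (carbon number, retention time in minutes).
ladder = [(10, 5.2), (12, 7.8), (14, 10.1), (16, 12.0)]
ri = retention_index(8.95, ladder)  # halfway between C12 and C14
```

Calibrating analyte retention times onto an RI scale in this way makes retention information comparable across runs and instruments, which is what lets a pre-specified target library be matched reliably.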
- Topic model-based mass spectrometric data analysis in cancer biomarker discovery studiesWang, Minkun; Tsai, Tsung-Heng; Di Poto, Cristina; Ferrarini, Alessia; Yu, Guoqiang; Ressom, Habtom W. (BMC, 2016)Background: A fundamental challenge in quantitation of biomolecules for cancer biomarker discovery arises from the heterogeneous nature of human biospecimens. Although this issue has been a subject of discussion in cancer genomic studies, it has not yet been rigorously investigated in mass spectrometry based proteomic and metabolomic studies. Purification of mass spectrometric data is highly desired prior to subsequent analysis, e.g., quantitative comparison of the abundance of biomolecules in biological samples. Methods: We investigated topic models to computationally analyze mass spectrometric data considering both integrated peak intensities and scan-level features, i.e., extracted ion chromatograms (EICs). Probabilistic generative models enable flexible representation of the data structure and infer sample-specific pure sources. Scan-level modeling helps alleviate information loss during data preprocessing. We evaluated the capability of the proposed models to capture mixture proportions of contaminants and cancer profiles on LC-MS based serum proteomic and GC-MS based tissue metabolomic datasets acquired from patients with hepatocellular carcinoma (HCC) and liver cirrhosis, as well as synthetic data we generated based on the serum proteomic data. Results: The results we obtained by analysis of the synthetic data demonstrated that both intensity-level and scan-level purification models can accurately infer the mixture proportions and the underlying true cancerous sources with small average error ratios (< 7 %) between estimation and ground truth. By applying the topic model-based purification to mass spectrometric data, we found more proteins and metabolites with significant changes between HCC cases and cirrhotic controls.
Candidate biomarkers selected after purification yielded biologically meaningful pathway analysis results and improved disease discrimination power, in terms of the area under the ROC curve, compared to the results found prior to purification. Conclusions: We investigated topic model-based inference methods to computationally address the heterogeneity issue in samples analyzed by LC/GC-MS. We observed that incorporation of scan-level features has the potential to lead to more accurate purification results by alleviating the information loss that results from integrating peaks. We believe cancer biomarker discovery studies that use mass spectrometric analysis of human biospecimens can greatly benefit from topic model-based purification of the data prior to statistical and pathway analyses.
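The core estimation problem, recovering the mixture proportion of a cancerous source versus a contaminant from an observed profile, can be illustrated with a tiny expectation-maximization sketch for a two-source multinomial mixture. The profiles and counts below are made-up, noise-free toy values; the paper's models are full topic models with sample-specific latent sources, not this two-known-source simplification.

```python
def em_mixture_proportion(counts, src_a, src_b, iters=200):
    """Estimate theta in: observed profile = theta*src_a + (1-theta)*src_b.
    E-step assigns each feature's counts to the two sources in proportion
    to their likelihood; M-step sets theta to source A's share of the mass."""
    theta = 0.5
    for _ in range(iters):
        resp_a = []
        for i, c in enumerate(counts):
            pa, pb = theta * src_a[i], (1.0 - theta) * src_b[i]
            resp_a.append(c * (pa / (pa + pb)) if (pa + pb) > 0 else 0.0)
        theta = sum(resp_a) / sum(counts)
    return theta

cancer      = [0.5, 0.3, 0.1, 0.1]   # hypothetical pure-source profiles
contaminant = [0.1, 0.1, 0.3, 0.5]
# Counts drawn (idealized, noise-free) from a 70/30 cancer/contaminant mix.
mix = [0.7 * a + 0.3 * b for a, b in zip(cancer, contaminant)]
counts = [round(1000 * p) for p in mix]
theta = em_mixture_proportion(counts, cancer, contaminant)
```

In the actual studies, neither source profile is known in advance; the topic model infers both the sources and the proportions jointly, which is what makes the generative formulation necessary.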
- Topic Model-based Mass Spectrometric Data Analysis in Cancer Biomarker Discovery StudiesWang, Minkun (Virginia Tech, 2017-06-14)Identification of disease-related alterations in molecular and cellular mechanisms may reveal useful biomarkers for human diseases including cancers. High-throughput omic technologies for identifying and quantifying multi-level biological molecules (e.g., proteins, glycans, and metabolites) have facilitated advances in biological research in recent years. Liquid (or gas) chromatography coupled with mass spectrometry (LC/GC-MS) has become an essential tool in such large-scale omic studies. Appropriate LC/GC-MS data preprocessing pipelines are needed to detect true differences between biological groups. Challenges exist in several aspects of MS data analysis. Specifically for biomarker discovery, one fundamental challenge in the quantitation of biomolecules arises from the heterogeneous nature of human biospecimens. Although this issue has been a subject of discussion in cancer genomic studies, it has not yet been rigorously investigated in mass spectrometry based omic studies. Purification of mass spectrometric data is highly desired prior to subsequent differential analysis. In this dissertation, we primarily address the purification problem through probabilistic modeling. We propose an intensity-level purification model (IPM) to computationally purify LC/GC-MS based cancerous data in biomarker discovery studies. We further extend IPM to a scan-level purification model (SPM) by considering information from extracted ion chromatograms (EICs, scan-level features). Both IPM and SPM belong to the category of topic modeling approaches, which aim to identify the underlying "topics" (sources) and their mixture proportions in composing the heterogeneous data. Additionally, a denoise deconvolution model (DDM) is proposed to capture the noise signals in samples based on purified profiles.
Variational expectation-maximization (VEM) and Markov chain Monte Carlo (MCMC) methods are used to draw inference on the latent variables and estimate the model parameters. Before turning to purification, other research topics related to mass spectrometric data analysis for cancer biomarker discovery are also investigated in this dissertation. Chapter 3 discusses the methods developed for the differential analysis of LC/GC-MS based omic data, specifically the preprocessing of LC-MS profiled glycan data. Chapter 4 presents the assumptions and inference details of IPM, SPM, and DDM. A latent Dirichlet allocation (LDA) core is used to model the heterogeneous cancerous data as mixtures of topics consisting of a sample-specific pure cancerous source and non-cancerous contaminants. We evaluated the capability of the proposed models to capture mixture proportions of contaminants and cancer profiles on LC-MS based serum and tissue proteomic and GC-MS based tissue metabolomic datasets acquired from patients with hepatocellular carcinoma (HCC) and liver cirrhosis. Chapter 5 elaborates these applications in cancer biomarker discovery, including typical single omic studies and integrative analysis of multi-omic studies.
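The MCMC side of the inference can be illustrated with a generic random-walk Metropolis sampler over a single mixing proportion with a Beta prior. Everything here is a toy assumption (the data, the prior, the proposal width); it shows only the accept/reject mechanics, not the dissertation's actual VEM/MCMC schemes for the LDA-based models.

```python
import math
import random

random.seed(1)

# Toy data: 70 of 100 latent assignments landed on the "signal" source.
k, n = 70, 100
a, b = 1.0, 1.0  # uniform Beta(1, 1) prior on the mixing proportion

def log_post(theta):
    """Unnormalized log posterior: Beta(k + a, n - k + b) up to a constant."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return (k + a - 1) * math.log(theta) + (n - k + b - 1) * math.log(1 - theta)

samples, theta = [], 0.5
for step in range(20000):
    prop = theta + random.gauss(0.0, 0.05)       # random-walk proposal
    if math.log(random.random()) < log_post(prop) - log_post(theta):
        theta = prop                              # Metropolis accept
    if step >= 5000:                              # discard burn-in
        samples.append(theta)

post_mean = sum(samples) / len(samples)           # ~ (k + a) / (n + a + b)
```

For this conjugate toy case the posterior mean is available in closed form, which makes it a convenient correctness check; the models in the dissertation need sampling (or variational approximation) precisely because their posteriors have no such closed form.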