Browsing by Author "Xuan, Jianhua Jason"
Now showing 1 - 20 of 20
- Advanced Projection Ultrasound Imaging with CMOS-based Sensor Array: Development, Characterization, and Potential Medical Applications
Liu, Chu Chuan (Virginia Tech, 2009-12-17)
Since the early 1960s, ultrasound has become one of the most widely used medical imaging modalities, serving as a diagnostic tool or an image guide for surgical intervention because of its high portability, non-ionizing operation, non-invasiveness and low cost. Although commercial equipment has improved continuously for many years, almost all systems are developed with pulse-echo geometry. In this research, a newly invented ultrasound sensor array was incorporated into the development of a projection imaging system. Three C-scan prototypes were constructed: prototypes #1 and #2 and an ultrasound mammography system. Systematic evaluation studies, including ultrasound CT, 3-D ultrasound, and multi-modality investigations, were also performed. Furthermore, a new analytical method to model the ultrasound forward scattering distribution (FSD) was developed by employing a specific annular apparatus. After applying this method, the scattering-corrected C-scan images revealed more detailed structures than the unprocessed images. This new analytical modeling approach is believed to be effective for most imaging systems operating in projection geometry. In summary, while awaiting additional clinical validation, the C-scan ultrasound prototypes with state-of-the-art PE-CMOS sensor arrays provide veritable value and hold real promise in medical diagnostic imaging. Potential future uses of C-scan ultrasound include, but are not limited to, computerized tomography, biopsy guidance, therapeutic device placement, foreign object detection, pediatric imaging, breast imaging, prostate imaging, human extremity imaging and live animal imaging.
With continuous research and development, we believe that C-scan ultrasound has the potential to make a significant impact in the field of medical ultrasound imaging.
- The Application of the Expectation-Maximization Algorithm to the Identification of Biological Models
Chen, Shuo (Virginia Tech, 2006-12-11)
With the onset of large-scale gene expression profiling, many researchers have turned their attention toward biological process modeling and system identification. The abundance of data available, while inspiring, is also daunting to interpret. Following the initial work of Rangel et al., we propose a linear model for identifying the biological model behind the data and utilize a modification of the Expectation-Maximization algorithm for training it. With our model, we explore some commonly accepted assumptions concerning sampling, discretization, and state transformations. We also illuminate the model complexities and interpretation difficulties caused by unknown state transformations and propose solutions for resolving these problems. Finally, we elucidate the advantages and limitations of our linear state-space model with simulated data from several nonlinear networks.
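The EM iteration mentioned above alternates an E-step (posterior estimates of hidden quantities) with an M-step (closed-form parameter updates). As a hedged illustration of that cycle, here is a minimal EM fit of a two-component 1-D Gaussian mixture; the model, data and settings are illustrative stand-ins, not the thesis's linear state-space model.

```python
import math
import random

def em_gmm_1d(x, iters=50):
    """Minimal EM for a two-component 1-D Gaussian mixture."""
    mu = [min(x), max(x)]          # crude initialization from the data range
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for xi in x:
            p = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(xi - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: closed-form updates of mixing weights, means, variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(x)
            mu[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / nk
            var[k] = max(sum(r[k] * (xi - mu[k]) ** 2
                             for r, xi in zip(resp, x)) / nk, 1e-6)
    return mu, var, pi

random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)] +
        [random.gauss(5.0, 1.0) for _ in range(200)])
mu, var, pi = em_gmm_1d(data)
print(sorted(round(m, 1) for m in mu))
```

The recovered component means should land near the true cluster centers (0 and 5); the same E/M alternation underlies EM training of richer models.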
- An Approach to Demand Response for Alleviating Power System Stress Conditions due to Electric Vehicle Penetration
Shao, Shengnan (Virginia Tech, 2011-10-17)
Along with the growth of electricity demand and the penetration of intermittent renewable energy sources, electric power distribution networks will face more and more stress conditions, especially as electric vehicles (EVs) take a greater share of the personal automobile market. This may cause potential transformer overloads, feeder congestion, and undue circuit failures. Demand response (DR) is gaining attention as it can potentially relieve system stress conditions through load management. DR can possibly defer or avoid the construction of large-scale power generation and transmission infrastructure by improving the electric utility load factor. This dissertation proposes to develop a planning tool for electric utilities that can provide insight into the implementation of demand response at the end-user level. The proposed planning tool comprises control algorithms and a simulation platform designed to intelligently manage end-use loads to make the EV penetration transparent to an electric power distribution network. The planning tool computes the demand response amount necessary at the circuit/substation level to alleviate the stress condition due to the penetration of EVs. The demand response amount is then allocated to end users as a basis for appliance scheduling and control. To accomplish the dissertation objective, electrical loads of both residential and commercial customers, as well as EV fleets, are modeled, validated, and aggregated, with control algorithms proposed at the appliance level. A multi-layer demand response model is developed that takes into account both utilities' concerns about load reduction and consumers' concerns about convenience and privacy.
An analytic hierarchy process (AHP)-based approach, which takes into consideration opinions from all stakeholders, is put forward to determine the priority and importance of various consumer groups. The proposed demand response strategy accounts for dynamic load priorities based on consumers' real-time needs. Consumer comfort indices are introduced to measure the impact of demand response on consumers' lifestyle. The proposed indices can give electric utilities a better estimate of customer acceptance of a DR program and of the capability of a distribution circuit to accommodate EV penetration. Research findings from this work indicate that the proposed demand response strategy can fulfill the task of peak demand reduction at different EV penetration levels while maintaining consumer comfort. The study shows that a higher number of EVs in the distribution circuit results in a greater DR impact on consumer comfort. This indicates that when EV numbers exceed a certain threshold in an area, measures besides demand response will have to be taken to tackle the peak demand growth. The proposed planning tool is expected to provide insight into the implementation of demand response at the end-user level. It can be used to estimate demand response potentials and the benefit of implementing demand response at different DR penetration levels within a distribution circuit. The planning tool can be used by a utility to design proper incentives and encourage consumers to participate in DR programs. At the same time, the simulation results give a better understanding of the DR impact on the scheduling of electric appliances.
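The AHP step above ranks consumer groups from pairwise stakeholder judgments. The standard mechanic is to take the principal eigenvector of the pairwise comparison matrix as the priority vector; a hedged sketch with made-up judgment values (not the dissertation's data) follows.

```python
def ahp_priorities(M, iters=100):
    """Priority vector of a pairwise comparison matrix M, approximated as
    the principal eigenvector via power iteration (renormalized each step)."""
    n = len(M)
    w = [1.0 / n] * n
    for _ in range(iters):
        w = [sum(M[i][j] * w[j] for j in range(n)) for i in range(n)]
        s = sum(w)
        w = [wi / s for wi in w]
    return w

# Hypothetical judgments for three consumer groups (illustrative only):
# group A is 3x as important as B and 5x as important as C; B is 2x C.
M = [[1.0, 3.0, 5.0],
     [1.0 / 3.0, 1.0, 2.0],
     [1.0 / 5.0, 1.0 / 2.0, 1.0]]
w = ahp_priorities(M)
print([round(x, 3) for x in w])
```

For a near-consistent matrix like this one, the weights come out roughly 0.65 / 0.23 / 0.12, i.e. group A dominates the allocation.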
- Bayesian Integration and Modeling for Next-generation Sequencing Data Analysis
Chen, Xi (Virginia Tech, 2016-07-01)
Computational biology currently faces challenges in a big-data world with thousands of data samples across multiple disease types, including cancer. The challenging problem is how to extract biologically meaningful information from large-scale genomic data. Next-generation sequencing (NGS) can now produce high-quality data at the DNA and RNA levels. However, cells contain many non-specific (background) signals that affect the detection accuracy of true (foreground) signals. In this dissertation work, under a Bayesian framework, we aim to develop and apply approaches to learn the distribution of genomic signals in each type of NGS data for reliable identification of specific foreground signals. We propose a novel Bayesian approach (ChIP-BIT) to reliably detect transcription factor (TF) binding sites (TFBSs) within promoter or enhancer regions by jointly analyzing the sample and input ChIP-seq data for one specific TF. Specifically, a Gaussian mixture model is used to capture both binding and background signals in the sample data, and background signals are modeled by a local Gaussian distribution that is accurately estimated from the input data. An Expectation-Maximization algorithm is used to learn the model parameters according to the distributions of binding signal intensity and binding locations. Extensive simulation studies and experimental validation both demonstrate that ChIP-BIT has significantly improved performance in TFBS detection over conventional methods, particularly for weak binding signal detection. To infer cis-regulatory modules (CRMs) of multiple TFs, we develop a Bayesian integration approach, namely BICORN, to integrate ChIP-seq and RNA-seq data from the same tissue. Each TFBS identified from ChIP-seq data can be either a functional binding event mediating target gene transcription or a non-functional binding.
The functional bindings of a set of TFs usually work together as a CRM to regulate the transcription of a group of genes. We develop a Gibbs sampling approach to learn the distribution of CRMs (a joint distribution of multiple TFs) based on their functional bindings and target gene expression. The robustness of BICORN has been validated on simulated regulatory network and gene expression data with respect to different noise settings. BICORN is further applied to breast cancer MCF-7 ChIP-seq and RNA-seq data to identify CRMs functional in promoter or enhancer regions. In tumor cells, the normal regulatory mechanism may be interrupted by genome mutations, especially somatic mutations that uniquely occur in tumor cells. Focusing on a specific type of genome mutation, structural variation (SV), we develop a novel pattern-based probabilistic approach, namely PSSV, to identify somatic SVs from whole genome sequencing (WGS) data. PSSV features a mixture model with hidden states representing different mutation patterns; PSSV can thus differentiate heterozygous and homozygous SVs in each sample, enabling the identification of somatic SVs with a heterozygous status in the normal sample and a homozygous status in the tumor sample. Simulation studies demonstrate that PSSV outperforms existing tools. PSSV has been successfully applied to breast cancer patient WGS data to identify somatic SVs of key factors associated with breast cancer development. In this dissertation research, we demonstrate the advantage of the proposed distributional-learning-based approaches over conventional methods for NGS data analysis. Distributional learning is a powerful approach for gaining biological insights from high-quality NGS data.
Successful applications of the proposed Bayesian methods to breast cancer NGS data shed light on underlying molecular mechanisms of breast cancer, enabling biologists or clinicians to identify major cancer drivers and develop new therapeutics for cancer treatment.
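BICORN's Gibbs sampler is far richer than can be shown here, but its underlying mechanic, drawing each variable in turn from its full conditional distribution, can be sketched with the textbook bivariate-normal example (all settings illustrative, not the dissertation's model):

```python
import random

def gibbs_bivariate_normal(rho, n_samples=20000, burn_in=1000, seed=1):
    """Gibbs sampling for a standard bivariate normal with correlation rho:
    each full conditional is N(rho * other, 1 - rho**2)."""
    rng = random.Random(seed)
    sd = (1.0 - rho ** 2) ** 0.5
    x, y = 0.0, 0.0
    xs, ys = [], []
    for t in range(n_samples + burn_in):
        x = rng.gauss(rho * y, sd)  # draw x given y
        y = rng.gauss(rho * x, sd)  # draw y given x
        if t >= burn_in:            # discard warm-up draws
            xs.append(x)
            ys.append(y)
    return xs, ys

xs, ys = gibbs_bivariate_normal(rho=0.8)
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
print(round(cov, 2))  # the sample covariance should approach rho
```

Alternating conditional draws like these converge to the joint target distribution, which is what lets a Gibbs sampler recover a joint distribution over many TFs from tractable one-at-a-time updates.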
- Bayesian Modeling for Isoform Identification and Phenotype-specific Transcript Assembly
Shi, Xu (Virginia Tech, 2017-10-24)
The rapid development of biotechnology has enabled researchers to collect high-throughput data for studying various biological processes at the genomic, transcriptomic, and proteomic levels. Due to the large noise in the data and the high complexity of diseases (such as cancer), it is a challenging task for researchers to extract biologically meaningful information that can help reveal the underlying molecular mechanisms. These challenges call for more effort in developing efficient and effective computational methods to analyze the data at different levels, so as to understand the biological systems in different aspects. In this dissertation research, we have developed novel Bayesian approaches to infer alternative splicing mechanisms in biological systems using RNA sequencing data. Specifically, we focus on two research topics in this dissertation: isoform identification and phenotype-specific transcript assembly. For isoform identification, we develop a computational approach, SparseIso, to jointly model the existence and abundance of isoforms in a Bayesian framework. A spike-and-slab prior is incorporated into the model to enforce the sparsity of expressed isoforms. A Gibbs sampler is developed to sample the existence and abundance of isoforms iteratively. For transcript assembly, we develop a Bayesian approach, IntAPT, to assemble phenotype-specific transcripts from multiple RNA sequencing profiles. A two-layer Bayesian framework is used to model the existence of phenotype-specific transcripts and the transcript abundance in individual samples. Based on this hierarchical Bayesian model, a Gibbs sampling algorithm is developed to estimate the joint posterior distribution for phenotype-specific transcript assembly.
The performance of our proposed methods is evaluated on simulation data, compared with existing methods, and benchmarked on real cell line data. We then apply our methods to breast cancer data to identify biologically meaningful splicing mechanisms associated with breast cancer. In future work, we will extend our methods to de novo transcript assembly to identify novel isoforms in biological systems, and we will incorporate isoform-specific networks into our methods to better understand splicing mechanisms in biological systems.
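The spike-and-slab prior used by SparseIso mixes a point mass at zero (isoform absent) with a broad Gaussian "slab" (isoform expressed), which is what enforces sparsity. A hedged sketch of draws from such a prior, with illustrative settings rather than SparseIso's:

```python
import random

def spike_slab_draws(n, p_active=0.1, slab_sd=2.0, seed=7):
    """Draws from a spike-and-slab prior: with probability p_active a
    coefficient comes from the Gaussian 'slab'; otherwise it is exactly
    zero (the 'spike'). Parameters here are illustrative only."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        if rng.random() < p_active:
            draws.append(rng.gauss(0.0, slab_sd))  # slab: isoform expressed
        else:
            draws.append(0.0)                      # spike: isoform absent
    return draws

abund = spike_slab_draws(10000)
frac_nonzero = sum(1 for a in abund if a != 0.0) / len(abund)
print(round(frac_nonzero, 2))
```

In a full sampler, the Bernoulli inclusion indicator is itself resampled from its posterior, so the data decide which isoforms escape the spike.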
- Building Matlab Standalone Package from Java for Differential Dependence Network Analysis Bioinformatics Toolkit
Jin, Lu (Virginia Tech, 2010-05-26)
This thesis reports a software development effort to transplant a Matlab algorithm into Matlab-license-free, platform-dependent, Java-based software. The result is almost equivalent to a direct translation of the Matlab source code into Java or any other programming language. Since the compiled library is platform dependent, an MCR (Matlab Compiler Runtime) environment is required and has been developed to deploy the transplanted algorithm to end users. As a result, the implemented MCR is free to distribute, and the streamlined transplantation process is much simpler and more reliable than manual translation. In addition, the implementation methodology reported here can be reused for other similar software engineering tasks. There are four main construction steps in our software package development. First, all Matlab *.m or *.mex files associated with the algorithms of interest (to be transplanted) are gathered, and the corresponding shared library is created by the Matlab Compiler. Second, a Java driver is created to serve as the final user interface. This Java-based user interface handles all input and output of the original Matlab algorithm and prepares all native methods. Third, assisted by JNI, a C driver is implemented to manage the variable transfer between Matlab and Java. Lastly, the Matlab mbuild function is used to compile the C driver and the aforementioned shared library into a dependent library, ready to be called from the standalone Java interface. We use a caBIG™ (Cancer Biomedical Informatics Grid) data analytic toolkit, namely the DDN (differential dependence network) algorithm, as the testbed in the software development. The developed DDN standalone package can be used on any Matlab-supported platform with a Java GUI (Graphic User Interface) or command-line parameters.
As a caBIG™ toolkit, the DDN package can be integrated into other information systems such as Taverna or G-DOC. The major benefits provided by the proposed methodology can be summarized as follows. First, the proposed software development framework offers a simple and effective way for algorithm developers to provide novel bioinformatics tools to biomedical end users, where the frequent obstacle is the lack of a language-specific software runtime environment and incompatibility between the compiled software and the computer platforms available at users' sites. Second, the proposed framework offers software developers a significant time- and effort-saving method for translating code between programming languages, where the majority of a developer's time and effort is spent on understanding the specific analytic algorithm and its language-specific code rather than on developing efficient and platform/user-friendly software. Third, the proposed methodology allows software engineers to focus their effort on the quality of the software rather than the details of the original source code, where the only required information is the inputs and outputs of the algorithm. Specifically, all variables and functions used are mapped between Matlab, C and Java, handled solely by our designated C driver.
- Computational Dissection of Composite Molecular Signatures and Transcriptional Modules
Gong, Ting (Virginia Tech, 2009-12-14)
This dissertation aims to develop a latent variable modeling framework with which to analyze gene expression profiling data for computational dissection of molecular signatures and transcriptional modules. The first part of the dissertation is focused on extracting pure gene expression signals from tissue or cell mixtures. The main goal of gene expression profiling is to identify the pure signatures of different cell types (such as cancer cells, stromal cells and inflammatory cells) and estimate the concentration of each cell type. To accomplish this, a new blind source separation method is developed, namely nonnegative partially independent component analysis (nPICA), for tissue heterogeneity correction (THC). The THC problem is formulated as a constrained optimization problem and solved with a learning algorithm based on geometrical and statistical principles. The second part of the dissertation seeks to identify gene modules from gene expression data to uncover important biological processes in different types of cells. A new gene clustering approach, nonnegative independent component analysis (nICA), is developed for gene module identification. The nICA approach is complemented with an information-theoretic procedure for input sample selection and a novel stability analysis approach for proper dimension estimation. Experimental results showed that the gene modules identified by the nICA approach appear to be significantly enriched in functional annotations in terms of gene ontology (GO) categories. The third part of the dissertation moves from the gene module level down to the DNA sequence level to identify gene regulatory programs by integrating gene expression data and protein-DNA binding data.
A sparse hidden component model is first developed for this problem, taking into account a well-known biological principle that a gene is most likely regulated by only a few regulators. This is followed by the development of a novel computational approach, motif-guided sparse decomposition (mSD), to integrate the binding information and gene expression data. These computational approaches are primarily developed for analyzing high-throughput gene expression profiling data. Nevertheless, the proposed methods can be extended to analyze other types of high-throughput data for biomedical research.
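mSD itself is motif-guided and more elaborate, but the sparsity principle it rests on, that each gene is driven by only a few regulators, is commonly enforced with an L1 penalty. A generic iterative soft-thresholding (ISTA) sketch on hypothetical toy data, shown only to illustrate how the proximal step zeroes out inactive regulators:

```python
def ista_lasso(A, y, lam, step, iters=1000):
    """Iterative soft-thresholding for min ||y - A x||^2 + lam * ||x||_1.
    The proximal step pushes small coefficients to exactly zero."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # gradient of the quadratic term: 2 A^T (A x - y)
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        g = [2.0 * sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        x = [x[j] - step * g[j] for j in range(n)]
        # proximal (soft-threshold) step enforcing sparsity
        x = [max(abs(v) - step * lam, 0.0) * (1.0 if v >= 0 else -1.0)
             for v in x]
    return x

# Toy problem: 3 candidate regulators, only the first truly active.
A = [[1.0, 0.2, 0.1],
     [0.1, 1.0, 0.3],
     [0.2, 0.1, 1.0]]
x_true = [2.0, 0.0, 0.0]
y = [sum(A[i][j] * x_true[j] for j in range(3)) for i in range(3)]
x_hat = ista_lasso(A, y, lam=0.1, step=0.2)
print([round(v, 2) for v in x_hat])
```

The two inactive regulators are thresholded to exactly zero, while the active one is recovered with a small L1-induced shrinkage.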
- Computational Modeling for Differential Analysis of RNA-seq and Methylation data
Wang, Xiao (Virginia Tech, 2016-08-16)
Computational systems biology is an interdisciplinary field that aims to develop computational approaches for a system-level understanding of biological systems. Advances in high-throughput biotechnology offer broad scope and high resolution in multiple disciplines. However, it is still a major challenge to extract biologically meaningful information from the overwhelming amount of data generated from biological systems. Effective computational approaches are urgently needed to reveal the functional components. Thus, in this dissertation work, we aim to develop computational approaches for differential analysis of RNA-seq and methylation data to detect aberrant events associated with cancers. We develop a novel Bayesian approach, BayesIso, to identify differentially expressed isoforms from RNA-seq data. BayesIso features a joint model of the variability of RNA-seq data and the differential state of isoforms. BayesIso can not only account for the variability of RNA-seq data but also treats the differential states of isoforms as hidden variables for differential analysis. The differential states of isoforms are estimated jointly with other model parameters through a sampling process, providing improved performance in detecting less differentially expressed isoforms. We also develop a novel probabilistic approach, DM-BLD, in a Bayesian framework to identify differentially methylated genes. The DM-BLD approach features a hierarchical model, built upon Markov random field models, to capture both the local dependency of measured loci and the dependency of methylation change. A Gibbs sampling procedure is designed to estimate the posterior distribution of the methylation change of CpG sites.
Then, the differential methylation score of a gene is calculated from the estimated methylation changes of the involved CpG sites, and the significance of genes is assessed by permutation-based statistical tests. We have demonstrated the advantage of the proposed Bayesian approaches over conventional methods for differential analysis of RNA-seq and methylation data. The joint estimation of the posterior distributions of the variables and model parameters through a sampling procedure has demonstrated an advantage in detecting weakly differential isoforms and methylated genes. The applications to breast cancer data shed light on the molecular mechanisms underlying breast cancer recurrence, aiming to identify new molecular targets for breast cancer treatment.
- Dimensionality Reduction, Feature Selection and Visualization of Biological Data
Ha, Sook Shin (Virginia Tech, 2012-08-08)
Due to the high dimensionality of most biological data, it is a difficult task to directly analyze, model and visualize the data to gain biological insight. Thus, dimensionality reduction becomes an imperative pre-processing step in analyzing and visualizing high-dimensional biological data. Two major approaches to dimensionality reduction in genomic analysis and biomarker identification studies are: feature extraction, creating new features by combining existing ones based on a mapping technique; and feature selection, choosing an optimal subset of all features based on an objective function. In this dissertation, we show how our innovative reduction schemes effectively reduce the dimensionality of DNA gene expression data to extract biologically interpretable and relevant features, which enhances the biomarker identification process. To construct biologically interpretable features and facilitate Muscular Dystrophy (MD) subtype classification, we extract molecular features from MD microarray data by constructing sub-networks using a novel integrative scheme that utilizes protein-protein interaction (PPI) networks, functional gene set information and mRNA profiling data. The workflow includes three major steps: first, by combining PPI network structure and gene-gene co-expression relationships into a new distance metric, we apply affinity propagation clustering (APC) to build gene sub-networks; second, we further incorporate functional gene set knowledge to complement the physical interaction information; finally, based on the constructed sub-network and gene set features, we apply a multi-class support vector machine (MSVM) for MD subtype classification and highlight the biomarkers contributing to the subtype prediction.
The experimental results show that our scheme can construct sub-networks that are more relevant to MD than those constructed by the conventional approach. Furthermore, our integrative strategy substantially improves the prediction accuracy, especially for the 'hard-to-classify' subtypes. Conventionally, pathway-based analysis assumes that genes in a pathway contribute equally to a biological function, thus assigning uniform weight to genes. However, this assumption has been proven incorrect, and applying uniform weight in pathway analysis may not be adequate for tasks like molecular classification of diseases, as genes in a functional group may have different differential power. Hence, we propose to use different weights in pathway analysis, which has resulted in the development of four weighting schemes. We applied them in two existing pathway analysis methods using both real and simulated gene expression data for pathways. Weighting changes pathway scoring and brings up new significant pathways, leading to the detection of disease-related genes that are missed under uniform weight. To help us understand our MD expression data better and derive scientific insight from it, we have explored a suite of visualization tools. In particular, for selected top-performing MD sub-networks, we displayed the network view using Cytoscape; functional annotations using the IPA and DAVID functional analysis tools; expression patterns using heat maps and parallel coordinate plots; and MD-associated pathways using KEGG pathway diagrams. We also performed weighted MD pathway analysis and identified overlapping sub-networks across different weighting schemes and different MD subtypes using Venn diagrams, which resulted in the identification of a new sub-network significantly associated with MD.
All those graphically displayed data and information helped us understand our MD data and the MD subtypes better, resulting in the identification of several potentially MD associated biomarker pathways and genes.
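The non-uniform weighting argument above can be made concrete with a small sketch: a weighted pathway score in which a strongly differential driver gene counts more than its neighbors. The scores and weights below are illustrative, not the dissertation's schemes.

```python
def pathway_score(gene_scores, weights=None):
    """Weighted pathway score: a weighted average of per-gene differential
    scores. With weights=None every gene counts equally, i.e. the
    conventional uniform-weight assumption."""
    if weights is None:
        weights = [1.0] * len(gene_scores)
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, gene_scores)) / total

# Hypothetical pathway of four genes: one strongly differential driver
# and three weakly differential members (illustrative t-like scores).
scores = [4.0, 0.5, 0.4, 0.3]
uniform = pathway_score(scores)                         # every gene equal
weighted = pathway_score(scores, weights=[3, 1, 1, 1])  # up-weight the driver
print(round(uniform, 2), round(weighted, 2))
```

The driver gene's signal is diluted under uniform weights (score 1.3) but preserved under non-uniform weights (score 2.2), which is how weighting can surface pathways that uniform scoring misses.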
- From network to pathway: integrative network analysis of genomic data
Wang, Chen (Virginia Tech, 2011-06-14)
The advent of various types of high-throughput genomic data has enabled researchers to investigate complex biological systems in a systemic way and has started to shed light on the underlying molecular mechanisms in cancers. To analyze huge amounts of genomic data, effective statistical and machine learning tools are clearly needed; more importantly, integrative approaches are especially needed to combine different types of genomic data for a network or pathway view of biological systems. Motivated by such needs, this dissertation develops an integrative framework for pathway analysis. Specifically, we dissect the molecular pathway into two parts: the protein-DNA interaction network and the protein-protein interaction network. Several novel approaches are proposed to integrate gene expression data with various forms of biological knowledge, such as protein-DNA interactions and protein-protein interactions, for reliable molecular network identification. The first part of this dissertation seeks to infer condition-specific transcriptional regulatory networks by integrating gene expression data and protein-DNA binding information. Protein-DNA binding information provides initial relationships between transcription factors (TFs) and their target genes, and this information is essential to derive biologically meaningful integrative algorithms. Based on the availability of this information, we discuss the inference task in two different situations: (a) if protein-DNA binding information for multiple TFs is available: based on the protein-DNA data of multiple TFs, derived from sequence analysis between DNA motifs and gene promoter regions, we can construct an initial connection matrix and solve the network inference using a constrained least-squares approach named motif-guided network component analysis (mNCA).
However, the connection matrix usually contains a considerable number of false positives and false negatives that make inference results questionable. To circumvent this problem, we propose a knowledge-based stability analysis (kSA) approach to test the conditional relevance of individual TFs, by checking the discrepancy of multiple estimations of transcription factor activity with respect to different perturbations of the connections. The rationale behind stability analysis is that the consistency of observed gene expression and true network connections shall remain stable after small perturbations are applied to the initial connection matrix. With condition-specific TFs prioritized by kSA, we further propose to use multivariate regression to highlight condition-specific target genes. Through simulation studies comparing with several competing methods, we show that the proposed schemes are more sensitive in detecting relevant TFs and target genes for network inference purposes. Experimentally, we have applied stability analysis to a yeast cell cycle experiment and further to a series of anti-estrogen breast cancer studies. In both experiments, not only are biologically relevant regulators highlighted, but condition-specific transcriptional regulatory networks are also constructed, which could provide further insights into the corresponding cellular mechanisms. (b) if only a single TF's protein-DNA information is available: this happens when the protein-DNA binding relationships of an individual TF are measured through experiments. Since the original mNCA requires a complete connection matrix to perform estimation, the incomplete knowledge of a single TF is not applicable to such an approach. Moreover, binding information derived from experiments can still be inconsistent with gene expression levels.
To overcome these limitations, we propose a linear extraction scheme called regulatory component analysis (RCA), which can infer underlying regulatory relationships even with partial biological knowledge. Numerical simulations show a significant improvement of RCA over traditional methods in identifying target genes, not only in low signal-to-noise-ratio situations but also when the given biological knowledge is incomplete and inconsistent with the data. Furthermore, biological experiments on Escherichia coli regulatory network inference are performed to fairly compare traditional methods, where the effectiveness and superior performance of RCA are confirmed. The second part of the dissertation moves from the protein-DNA interaction network up to the protein-protein interaction network, to identify dysregulated protein sub-networks by integrating gene expression data and protein-protein interaction information. Specifically, we propose a statistically principled method, namely Metropolis random walk on graph (MRWOG), to highlight condition-specific PPI sub-networks in a probabilistic way. The method is based on Markov chain Monte Carlo (MCMC) theory to generate a series of samples that eventually converge to a desired equilibrium distribution, with each sample indicating the selection of one particular sub-network during the Metropolis random walk. The central idea of MRWOG is that the essentiality of a gene to be included in a sub-network depends not only on its expression but also on its topological importance. In contrast to most existing methods, which construct sub-networks in a deterministic way and therefore lack a relevance score for each protein, MRWOG is capable of assessing the importance of each individual protein node in a global way, not only reflecting its individual association with clinical outcome but also indicating its topological role (hub, bridge) in connecting other important proteins.
Moreover, each protein node is associated with a sampling frequency score, which enables the statistical justification of each individual node and flexible scaling of sub-network results. Based on the MRWOG approach, we further propose two strategies: one is bootstrapping, used for assessing the statistical confidence of detected sub-networks; the other is graph division, to separate a large sub-network into several smaller sub-networks to facilitate interpretation. MRWOG is easy to use, with only two parameters to adjust: a beta value for performing the random walk and a quantile level for calculating the truncated posterior mean. Through extensive simulations, we show that the proposed scheme is not sensitive to these two parameters over a relatively wide range. We also compare MRWOG with deterministic approaches for identifying sub-networks and prioritizing topologically important proteins; in both cases MRWOG outperforms existing methods in terms of both precision and recall. By utilizing the MRWOG-generated node/edge sampling frequency, which is in fact the posterior mean of the corresponding protein node/interaction edge, we illustrate that condition-specific nodes/interactions can be better prioritized than by schemes based on scores of individual nodes/interactions. Experimentally, we have applied MRWOG to a yeast knockout experiment on galactose utilization pathways to reveal important components of the corresponding biological functions; we have also applied MRWOG to breast cancer patient prognosis problems, where the sub-network analysis could lead to an understanding of the molecular mechanisms of antiestrogen resistance in breast cancer. Finally, we conclude this dissertation with a summary of the original contributions and future work on deepening the theoretical justification of the proposed methods and broadening their potential biological applications, such as cancer studies.
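MRWOG walks over sub-networks, which is beyond a short sketch, but the Metropolis mechanics it relies on can be shown at the node level: a degree-corrected random walk whose visit frequencies concentrate on high-scoring nodes. The graph and scores below are invented for illustration.

```python
import math
import random

def metropolis_walk(adj, score, beta=1.0, steps=50000, seed=3):
    """Metropolis random walk over a graph: propose a uniform neighbor and
    accept with a Metropolis ratio (degree-corrected for the asymmetric
    proposal), so visit frequencies approximate pi(v) ~ exp(beta * score[v])."""
    rng = random.Random(seed)
    cur = next(iter(adj))
    visits = {v: 0 for v in adj}
    for _ in range(steps):
        prop = rng.choice(adj[cur])
        ratio = (math.exp(beta * (score[prop] - score[cur]))
                 * len(adj[cur]) / len(adj[prop]))
        if rng.random() < min(1.0, ratio):
            cur = prop
        visits[cur] += 1
    return {v: visits[v] / steps for v in adj}

# Hypothetical 4-protein interaction graph; 'a' is a high-scoring hub.
adj = {'a': ['b', 'c', 'd'], 'b': ['a', 'c'], 'c': ['a', 'b'], 'd': ['a']}
score = {'a': 2.0, 'b': 0.5, 'c': 0.5, 'd': 0.5}
freq = metropolis_walk(adj, score)
print(max(freq, key=freq.get))
```

The visit frequency plays the role of the sampling-frequency score described above: it is a posterior-like relevance measure that reflects both a node's own score and how reachable it is in the network.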
- Integrative Modeling and Analysis of High-throughput Biological DataChen, Li (Virginia Tech, 2010-12-15)Computational biology is an interdisciplinary field that focuses on developing mathematical models and algorithms to interpret biological data so as to understand biological problems. With current high-throughput technology development, different types of biological data can be measured on a large scale, which calls for more sophisticated computational methods to analyze and interpret the data. In this dissertation research, we propose novel methods to integrate, model and analyze multiple types of biological data, including microarray gene expression data, protein-DNA interaction data and protein-protein interaction data. These methods will help improve our understanding of biological systems. First, we propose a knowledge-guided multi-scale independent component analysis (ICA) method for biomarker identification on time course microarray data. Guided by a knowledge gene pool related to a specific disease under study, the method can determine disease-relevant biological components from ICA modes and then identify biologically meaningful markers related to the specific disease. We have applied the proposed method to yeast cell cycle microarray data and Rsf-1-induced ovarian cancer microarray data. The results show that our knowledge-guided ICA approach can extract biologically meaningful regulatory modes and outperform several baseline methods for biomarker identification. Second, we propose a novel method for transcriptional regulatory network identification by integrating gene expression data and protein-DNA binding data. The approach is built upon a multi-level analysis strategy designed to suppress false positive predictions. With this strategy, a regulatory module becomes increasingly significant as more relevant gene sets are formed at finer levels.
At each level, a two-stage support vector regression (SVR) method is utilized to reduce false positive predictions by integrating binding motif information and gene expression data; a significance analysis procedure then follows to assess the significance of each regulatory module. The resulting performance on simulation data and yeast cell cycle data shows that the multi-level SVR approach outperforms other existing methods in the identification of both regulators and their target genes. We have further applied the proposed method to breast cancer cell line data to identify condition-specific regulatory modules associated with estrogen treatment. Experimental results show that our method can identify biologically meaningful regulatory modules related to estrogen signaling and action in breast cancer. Third, we propose a bootstrapping Markov Random Field (MRF)-based method for subnetwork identification on microarray data by incorporating protein-protein interaction data. Methodologically, an MRF-based network score is first derived by considering the dependency among genes to increase the chance of selecting hub genes. A modified simulated annealing search algorithm is then utilized to find the optimal/suboptimal subnetworks with maximal network score. A bootstrapping scheme is finally implemented to generate confident subnetworks. Experimentally, we have compared the proposed method with other existing methods, and the resulting performance on simulation data shows that the bootstrapping MRF-based method outperforms other methods in identifying ground-truth subnetworks and hub genes. We have then applied our method to breast cancer data to identify significant subnetworks associated with drug resistance. The identified subnetworks not only show good reproducibility across different data sets, but also indicate several pathways and biological functions potentially associated with the development of breast cancer and drug resistance.
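The "maximize a network score by simulated annealing" step can be illustrated with a toy example. The graph, z-score evidence, and the particular score function below are illustrative assumptions, not the dissertation's derived MRF score; the sketch only shows how annealed membership toggles recover a high-scoring subnetwork:

```python
import numpy as np

# Toy gene network with differential-expression evidence -- illustrative only;
# genes 0-2 form the true high-scoring module in this example.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5)]
z = np.array([2.2, 2.0, 2.4, -0.5, -1.0, -0.8])
adj = np.zeros((6, 6), dtype=bool)
for u, v in edges:
    adj[u, v] = adj[v, u] = True

def network_score(sel, lam=0.2):
    """MRF-flavoured score: node evidence plus a small bonus per internal edge."""
    idx = np.flatnonzero(sel)
    return z[idx].sum() + lam * adj[np.ix_(idx, idx)].sum() / 2

rng = np.random.default_rng(1)
sel = rng.random(6) < 0.5
if not sel.any():
    sel[0] = True
cur = network_score(sel)
best_sel, best = sel.copy(), cur
T = 2.0
for step in range(3000):
    prop = sel.copy()
    prop[rng.integers(6)] ^= True              # toggle one gene in or out
    if prop.any():
        new = network_score(prop)
        if new > cur or rng.random() < np.exp((new - cur) / T):
            sel, cur = prop, new
            if cur > best:
                best_sel, best = sel.copy(), cur
    T = max(0.01, T * 0.999)                   # geometric cooling schedule
```

The edge bonus is what favours connected, hub-containing modules over the same genes scored individually; the bootstrapping scheme in the abstract would rerun this search on resampled data to keep only stable subnetworks.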
In addition, we propose to develop network-constrained support vector machines (SVMs) for cancer classification and prediction, taking the network structure into account to construct classification hyperplanes. The simulation study demonstrates the effectiveness of our proposed method. The study on the real microarray data sets shows that our network-constrained SVM, together with the bootstrapping MRF-based subnetwork identification approach, can achieve better classification performance compared with conventional biomarker selection approaches and SVMs. We believe that the research presented in this dissertation not only provides novel and effective methods to model and analyze different types of biological data, but the extensive experiments on several real microarray data sets also show the potential to improve the understanding of biological mechanisms related to cancers by generating novel hypotheses for further study.
- Learning Statistical and Geometric Models from Microarray Gene Expression DataZhu, Yitan (Virginia Tech, 2009-09-02)In this dissertation, we propose and develop innovative data modeling and analysis methods for extracting meaningful and specific information about disease mechanisms from microarray gene expression data. To provide a high-level overview of gene expression data for easy and insightful understanding of data structure, we propose a novel statistical data clustering and visualization algorithm that is comprehensively effective for multiple clustering tasks and that overcomes some major limitations of existing clustering methods. The proposed clustering and visualization algorithm performs progressive, divisive hierarchical clustering and visualization, supported by hierarchical statistical modeling, supervised/unsupervised informative gene/feature selection, supervised/unsupervised data visualization, and user/prior knowledge guidance through human-data interactions, to discover cluster structure within complex, high-dimensional gene expression data. For the purpose of selecting suitable clustering algorithm(s) for gene expression data analysis, we design an objective and reliable clustering evaluation scheme to assess the performance of clustering algorithms by comparing their sample clustering outcome to phenotype categories. Using the proposed evaluation scheme, we compared the performance of our newly developed clustering algorithm with those of several benchmark clustering methods, and demonstrated the superior and stable performance of the proposed clustering algorithm. To identify the underlying active biological processes that jointly form the observed biological event, we propose a latent linear mixture model that quantitatively describes how the observed gene expressions are generated by a process of mixing the latent active biological processes. We prove a series of theorems to show the identifiability of the noise-free model. 
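The clustering evaluation scheme described above compares a sample clustering outcome to phenotype categories. One standard pair-counting instantiation of such a comparison is the adjusted Rand index; the minimal implementation and toy labels below are illustrative assumptions, not necessarily the dissertation's exact criterion:

```python
from itertools import combinations

def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting agreement between two partitions, corrected for chance:
    1.0 for identical partitions, near 0 (or below) for random assignments."""
    n = len(labels_a)
    same_a = same_b = same_both = 0
    for i, j in combinations(range(n), 2):
        a = labels_a[i] == labels_a[j]
        b = labels_b[i] == labels_b[j]
        same_a += a
        same_b += b
        same_both += a and b
    total = n * (n - 1) // 2
    expected = same_a * same_b / total          # chance-level pair agreement
    max_index = (same_a + same_b) / 2
    if max_index == expected:
        return 1.0
    return (same_both - expected) / (max_index - expected)

phenotype = [0, 0, 0, 1, 1, 1, 2, 2, 2]
clusters  = [1, 1, 1, 0, 0, 0, 2, 2, 2]        # same partition, renamed labels
ari_perfect = adjusted_rand_index(phenotype, clusters)
ari_mixed = adjusted_rand_index(phenotype, [0, 1, 2, 0, 1, 2, 0, 1, 2])
```

Because the index is invariant to cluster label names and corrected for chance, it can rank clustering algorithms by how well their sample partitions agree with known phenotype categories.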
Based on relevant geometric concepts, convex analysis and optimization, gene clustering, and model stability analysis, we develop a robust blind source separation method that fits the model to the gene expression data and subsequently identifies the underlying biological processes and their activity levels under different biological conditions. Based on the experimental results obtained on cancer, muscle regeneration, and muscular dystrophy gene expression data, we believe that the research work presented in this dissertation not only contributes to the engineering research areas of machine learning and pattern recognition, but also provides novel and effective solutions that can potentially address many biomedical research problems, improving our understanding of disease mechanisms.
- Modeling and Characterization of Dynamic Changes in Biological Systems from Multi-platform Genomic DataZhang, Bai (Virginia Tech, 2011-09-13)Biological systems constantly evolve and adapt in response to changed environments and external stimuli at the molecular and genomic levels. Building statistical models that characterize such dynamic changes in biological systems is one of the key objectives in bioinformatics and computational biology. Recent advances in high-throughput genomic and molecular profiling technologies such as gene expression and copy number microarrays provide ample opportunities to study cellular activities at the individual gene and network levels. The aim of this dissertation is to formulate mathematically the dynamic changes in biological networks and DNA copy numbers, to develop machine learning algorithms to learn these statistical models from high-throughput biological data, and to demonstrate their applications in systems biology studies. The first part (Chapters 2-4) of the dissertation focuses on the dynamic changes taking place at the biological network level. Biological networks are context-specific and dynamic in nature. Under different conditions, different regulatory components and mechanisms are activated and the topology of the underlying gene regulatory network changes. We report a differential dependency network (DDN) analysis to detect statistically significant topological changes in the transcriptional networks between two biological conditions. Further, we formalize and extend the DDN approach into an effective learning strategy to extract structural changes in graphical models using l1-regularization based convex optimization. We discuss the key properties of this formulation and introduce an efficient implementation by the block coordinate descent algorithm.
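The core idea of a differential dependency analysis — flagging the edge whose conditional dependency changes between two conditions — can be illustrated without the l1-regularized machinery. The sketch below uses plain partial correlations from the inverse covariance matrix on simulated data; the two-condition generative model is an illustrative assumption, not the DDN algorithm itself:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 4

# Condition A: gene 0 directly drives gene 1; condition B: that link is removed.
def simulate(coupled):
    X = rng.normal(size=(n, p))
    if coupled:
        X[:, 1] = 0.9 * X[:, 0] + 0.3 * rng.normal(size=n)
    return X

def partial_corr(X):
    """Partial correlations from the inverse covariance (precision) matrix."""
    P = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(P))
    return -P / np.outer(d, d)

diff = np.abs(partial_corr(simulate(True)) - partial_corr(simulate(False)))
np.fill_diagonal(diff, 0)
i, j = np.unravel_index(diff.argmax(), diff.shape)
changed_edge = tuple(sorted((int(i), int(j))))   # edge with the largest change
```

The l1-regularized formulation in the dissertation plays the role of this precision-matrix estimate when the number of genes far exceeds the number of samples and a sparse, statistically controlled answer is required.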
Another type of dynamic change in biological networks is the observation that a group of genes involved in certain biological functions or processes coordinate to respond to outside stimuli, producing distinct time course patterns. We apply the echo state network, a new architecture of recurrent neural networks, to model temporal gene expression patterns and analyze the theoretical properties of echo state networks with random matrix theory. The second part (Chapter 5) of the dissertation focuses on the changes at the DNA copy number level, especially in cancer cells. Somatic DNA copy number alterations (CNAs) are key genetic events in the development and progression of human cancers, and frequently contribute to tumorigenesis. We propose a statistically-principled in silico approach, Bayesian Analysis of COpy number Mixtures (BACOM), to accurately detect genomic deletion type, estimate normal tissue contamination, and accordingly recover the true copy number profile in cancer cells.
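An echo state network keeps a large fixed random recurrent reservoir and trains only a linear readout, which is what makes it attractive for modeling temporal patterns. The minimal sketch below — a sine wave standing in for a temporal expression signal, a sparse reservoir rescaled to spectral radius 0.9, and a ridge-regression readout — is an illustrative assumption, not the dissertation's model:

```python
import numpy as np

rng = np.random.default_rng(3)
n_res = 100

# Random sparse reservoir, rescaled to spectral radius 0.9 (echo state property).
W = rng.normal(size=(n_res, n_res)) * (rng.random((n_res, n_res)) < 0.1)
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-0.5, 0.5, size=n_res)

u = np.sin(np.linspace(0, 20 * np.pi, 1000))   # toy temporal signal
states = np.zeros((len(u), n_res))
x = np.zeros(n_res)
for t in range(len(u) - 1):
    x = np.tanh(W @ x + W_in * u[t])           # reservoir state update
    states[t + 1] = x

# Train only the linear readout (ridge regression) to predict the next value.
washout = 100
S, y = states[washout:], u[washout:]
W_out = np.linalg.solve(S.T @ S + 1e-6 * np.eye(n_res), S.T @ y)
pred = S @ W_out
corr = np.corrcoef(pred, y)[0, 1]
```

Only `W_out` is learned; the spectral-radius rescaling of the frozen reservoir is exactly where random matrix theory enters the analysis the abstract mentions.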
- Novel Monte Carlo Approaches to Identify Aberrant Pathways in CancerGu, Jinghua (Virginia Tech, 2013-08-27)Recent breakthroughs in high-throughput biotechnology have promoted the integration of multi-platform data to investigate signal transduction pathways within a cell. In order to model the complicated dynamics and heterogeneity of biological pathways, sophisticated computational models are needed to address the unique properties of both the biological hypothesis and the data. In this dissertation work, we have proposed and developed methods using Markov chain Monte Carlo (MCMC) techniques to solve complex modeling problems in human cancer research by integrating multi-platform data. We focus on two research topics: 1) identification of transcriptional regulatory networks and 2) uncovering of aberrant intracellular signal transduction pathways. We propose a robust method, called GibbsOS, to identify condition-specific gene regulatory patterns between transcription factors and their target genes. A Gibbs sampler is employed to sample target genes from the marginal distribution of the outlier sum of the regression t-statistic. Numerical simulation has demonstrated significant performance improvement of GibbsOS over existing methods against noise and false positive connections in binding data. We have applied GibbsOS to breast cancer cell line datasets and identified condition-specific regulatory rewiring in human breast cancer. We also propose a novel method, namely Gibbs sampler to Infer Signal Transduction (GIST), to detect aberrant pathways that are highly associated with biological phenotypes or clinical information. By converting predefined potential functions into a Gibbs distribution, GIST estimates edge directions by learning the distribution of linear signaling pathway structures. Through the sampling process, the algorithm is able to infer signal transduction directions that are jointly determined by both gene expression and network topology.
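The outlier-sum idea behind GibbsOS — scoring a gene by the summed standardized expression of its outlying case samples rather than a mean shift — can be sketched directly. The data, the MAD standardization, and the IQR threshold below follow the general Tibshirani-Hastie outlier-sum flavour and are illustrative assumptions, not the dissertation's exact regression statistic:

```python
import numpy as np

rng = np.random.default_rng(4)
n_ctrl, n_case, n_genes = 30, 30, 200
X = rng.normal(size=(n_genes, n_ctrl + n_case))
# Gene 0 is dysregulated in only a subset of cases (an "outlier" pattern that a
# mean-difference t-test would dilute).
X[0, n_ctrl + 20:] = X[0, n_ctrl + 20:] + 4.0

def outlier_sum(x, n_ctrl):
    """Sum of case-sample values exceeding an IQR-based outlier threshold,
    after median/MAD standardization."""
    med = np.median(x)
    mad = np.median(np.abs(x - med)) + 1e-12
    s = (x - med) / mad
    q1, q3 = np.percentile(s, [25, 75])
    thresh = q3 + 1.0 * (q3 - q1)
    cases = s[n_ctrl:]
    return cases[cases > thresh].sum()

os_stats = np.array([outlier_sum(X[g], n_ctrl) for g in range(n_genes)])
top_gene = int(os_stats.argmax())
```

GibbsOS embeds a statistic of this kind inside a Gibbs sampler over candidate target genes, so that condition-specific targets are sampled in proportion to their outlier evidence instead of being thresholded once.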
We demonstrate the advantage of the proposed algorithms on simulation data with respect to different settings of noise level in gene expression and false-positive connections in the protein-protein interaction (PPI) network. Another major contribution of the dissertation work is that we have improved the traditional perspective on understanding aberrant signal transduction by further investigating the structural linkage of signaling pathways. We develop a method called Structural Organization to Uncover pathway Landscape (SOUL), which emphasizes modularized pathway structures within the reconstructed pathway landscape. GIST and SOUL provide a unique angle to computationally model alternative pathways and pathway crosstalk. The proposed new methods can bring insight to drug discovery research by targeting nodal proteins that oversee multiple signaling pathways, rather than treating individual pathways separately. A complete pathway identification protocol, namely Infer Modularization of PAthway CrossTalk (IMPACT), is developed to bridge downstream regulatory networks with upstream signaling cascades. We have applied IMPACT to datasets of treated breast cancer patients to investigate how estrogen receptor (ER) signaling pathways are related to drug resistance. The identified pathway proteins from patient datasets are well supported by breast cancer cell line models. We hypothesize from the computational results that the HSP90AA1 protein is an important nodal protein that oversees multiple signaling pathways to drive drug resistance. Cell viability analysis has supported our hypothesis by showing a significant decrease in the viability of endocrine-resistant cells compared with non-resistant cells when 17-AAG (a drug that inhibits HSP90AA1) is applied.
We believe that this dissertation work not only offers novel computational tools for understanding complicated biological problems but, more importantly, provides a valuable paradigm in which systems biology connects data with hypotheses through computational modeling. Initial success in using microarray datasets to study endocrine resistance in breast cancer has shed light on translating results from high-throughput datasets into biological discoveries in complicated human disease studies. As next-generation biotechnology becomes more cost-effective, the power of the proposed methods to untangle complicated aberrant signaling rewiring and pathway crosstalk will finally be unleashed.
- Registration of Images with Varying Topology using Embedded MapsLi, Xiaoxing (Virginia Tech, 2010-11-16)In medical images, intensity changes caused by certain pathologies can change the topology of image level-sets and are thus commonly referred to as topological changes. Topological changes cause false deformation in existing deformable registration algorithms, which in turn leads to unreliable observations in clinical studies that rely on the deformation fields, such as deformation-based morphometry (DBM). In this work, we develop a new deformable registration algorithm for images with topological changes. In our proposed algorithm, 3D images are embedded as 4D surfaces in a Riemannian space. The registration is therefore conducted as a surface evolution, which is modeled by a diffusion process. Our algorithm differs from existing methods in that it takes an a priori estimate of areas with topological changes as an additional input and generates dense deformation vector fields that are free of false deformation. In particular, the output of our algorithm is composed of a diffeomorphic deformation field and an intensity displacement that corrects intensity differences caused by topological changes. By conducting multiple sets of experiments, we demonstrate that our proposed algorithm is capable of accurately registering images with considerable topological changes. More importantly, the resulting deformation field is not impacted by topological changes, i.e., there is no false deformation.
- Remote Operator Blended Intelligence System for Environmental Navigation and Discernment (RobiSEND)Gaines, Jonathan Elliot (Virginia Tech, 2011-09-06)Mini Rotorcraft Unmanned Aerial Vehicles (MRUAVs) flown at low altitude as part of a human-robot team are potential sources of tactical information for local search missions. Traditionally, their effectiveness in this role has been limited by an inability to intelligently perceive unknown environments or integrate with human team members. Human-robot collaboration provides the theory for building cooperative relationships in this context. This theory, however, only addresses human-robot teams that are either robot-centered or human-centered in their decision-making processes or relationships. This work establishes a new branch of human-robot collaborative theory, Operator Blending, which creates codependent and cooperative relationships between a single robot and a human team member for tactical missions. Joint Intention Theory is the basis of this approach, which allows both the human and the robot to contribute what each does well in accomplishing the mission objectives. Information processing methods for shared visual information and object tracking take advantage of the human role in the perception process. In addition, coupling of translational commands and the search process establishes navigation as the shared basis of communication between the MRUAV and the human, for system integration purposes. Observation models relevant to both human and robotic collaborators are tracked through a boundary-based approach termed AIM-SHIFT. A system is developed to classify the semantic and functional relevance of an observation model to local search, called the Code of Observational Genetics (COG). These COGs are used to qualitatively map the environment through Qualitative Unsupervised Intelligent Collaborative Keypoint (QUICK) mapping, created to support these methods.
- Robust Feature Extraction and Temporal Analysis for Partial Fingerprint IdentificationShort, Nathaniel Jackson (Virginia Tech, 2012-09-05)Identification of an individual from discriminating features of the friction ridge surface is one of the oldest and most commonly used biometric techniques. Methods for identification span from tedious, although highly accurate, manual examination to much faster Automated Fingerprint Identification Systems (AFIS). While automatic fingerprint recognition has grown in popularity due to the speed and accuracy of matching minutiae features of good-quality plain-to-rolled prints, the performance is less than impressive when matching partial fingerprints. For some applications, including forensic analysis where partial prints come in the form of latent prints, it is not always possible to obtain high-quality image samples. Latent prints, which are lifted from a surface, are typically of low quality and small fingerprint surface area. As a result, the overlapping region in which to find corresponding features in the genuine matching ten-print is reduced; this in turn reduces identification performance. Image quality can also vary substantially during image capture in applications with a high throughput of subjects having limited training, such as border control. Rushed image capture can yield a sample that is acceptable overall but contains local regions of low quality. We propose an improvement to the reliability of features detected in exemplar prints in order to reduce the likelihood of an unreliable overlapping region corresponding with a genuine partial print. A novel approach is proposed for detecting minutiae in low-quality image regions. The approach has demonstrated an increase in match performance for a set of fingerprints from a well-known database.
While the method is effective at improving match performance for all of the fingerprint images in the database, a more significant improvement is observed for a subset of low-quality images. In addition, a novel method for fingerprint analysis using a sequence of fingerprint images is proposed. The approach uses the sequence of images to extract and track minutiae for temporal analysis during a single impression, reducing the variation in image quality during image capture. Instead of choosing a single acceptable image from the sequence based on a global measure, we examine the change in quality at a local level and stitch together blocks from multiple images based on the optimal local quality measures.
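The block-stitching idea — pick each local block from whichever frame has the best local quality rather than keeping one globally best frame — can be sketched on synthetic arrays. The random "print", the noise model, and the toy high-frequency-energy quality measure below are illustrative assumptions, not the dissertation's quality metric:

```python
import numpy as np

rng = np.random.default_rng(5)
H = W = 64
B = 16  # block size

# Two impressions of the same print, each degraded in a different half.
truth = rng.random((H, W))
img1, img2 = truth.copy(), truth.copy()
img1[:, W // 2:] += rng.normal(0, 0.8, size=(H, W // 2))   # noisy right half
img2[:, :W // 2] += rng.normal(0, 0.8, size=(H, W // 2))   # noisy left half

def block_quality(img):
    """Toy local quality map: negative high-frequency energy per block."""
    q = np.empty((H // B, W // B))
    for bi in range(H // B):
        for bj in range(W // B):
            blk = img[bi * B:(bi + 1) * B, bj * B:(bj + 1) * B]
            q[bi, bj] = -np.abs(np.diff(blk, axis=1)).mean()
    return q

q1, q2 = block_quality(img1), block_quality(img2)
stitched = np.empty_like(truth)
for bi in range(H // B):
    for bj in range(W // B):
        src = img1 if q1[bi, bj] >= q2[bi, bj] else img2   # best local source
        block = src[bi * B:(bi + 1) * B, bj * B:(bj + 1) * B]
        stitched[bi * B:(bi + 1) * B, bj * B:(bj + 1) * B] = block

err_stitched = np.abs(stitched - truth).mean()
err_single = min(np.abs(img1 - truth).mean(), np.abs(img2 - truth).mean())
```

Because each frame is good somewhere, the stitched composite beats the best single frame, which is the advantage the temporal-analysis method exploits.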
- Separating a Gas Mixture Into Its Constituent Analytes Using FicaMahadevan, Aparna (Virginia Tech, 2009-05-05)Unlike the conventional "lock-and-key" sensor design in which one sensor is finely tuned to respond to one analyte, the sensor array approach employs multiple sensors in which each sensor responds to many analytes. Consequently, signal processing algorithms must be used to identify the analyte present from the array's response. The analyte identification process becomes significantly more complicated when a mixture of analytes is presented to the sensor array. Conventional methods employed in gas mixture identification are plagued by several design issues, such as complexity, scalability, and flexibility. This thesis derives and develops a novel method, fingerprint-based ICA (FICA), to extract and identify individual analytes from a sensor array's response to a gas mixture of the analytes. FICA is a simple, flexible, and scalable signal processing system that employs independent components analysis (ICA) to extract and identify the individual analytes present in a gas mixture; separation and identification of gas mixtures using ICA has not been investigated previously. FICA takes a fundamentally different approach that reflects the underlying property of gas mixtures: gas mixtures are composed of individual analyte responses. Conventional signal processing methods that identify gas mixtures have also been developed and implemented in this work, which helps us understand the drawbacks of the conventional approach. FICA's performance is compared to that of conventional methods using metrics such as error rate and false-positive rate. Properties such as flexibility, scalability, and the data requirements of both conventional methods and FICA are examined. Results obtained in this work indicate that FICA yields lower error rates and performs better than conventional methods such as multi-stage support vector machines and PCR.
Furthermore, FICA provides the simplest, most scalable, and most flexible signal processing system of those examined.
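The fingerprint-based ICA pipeline — unmix the array response into independent components, then identify each component by correlating it against a library of known single-analyte fingerprints — can be sketched end to end. The waveforms, mixing matrix, and the minimal FastICA (deflation, tanh nonlinearity) below are illustrative assumptions, not the thesis's sensors or code:

```python
import numpy as np

rng = np.random.default_rng(6)
t = np.linspace(0, 8 * np.pi, 2000)

# Library of known single-analyte "fingerprints" (illustrative waveforms).
library = {"analyte_A": np.sin(t), "analyte_B": np.sign(np.sin(1.7 * t + 0.5))}
S = np.vstack(list(library.values()))
A = np.array([[0.6, 0.4], [0.35, 0.65]])      # unknown mixing of the two analytes
X = A @ S                                      # sensor-array response to the mixture

# Centre and whiten the observations.
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
Z = E @ np.diag(d ** -0.5) @ E.T @ X

# Minimal FastICA: deflation with tanh nonlinearity.
W = np.zeros((2, 2))
for i in range(2):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    for _ in range(200):
        g = np.tanh(w @ Z)
        w_new = (Z * g).mean(axis=1) - (1 - g ** 2).mean() * w
        for j in range(i):                     # decorrelate from earlier components
            w_new -= (w_new @ W[j]) * W[j]
        w_new /= np.linalg.norm(w_new)
        converged = abs(abs(w_new @ w) - 1) < 1e-9
        w = w_new
        if converged:
            break
    W[i] = w
S_hat = W @ Z

# Identify each recovered component against the fingerprint library.
matches = {}
for name, fp in library.items():
    corrs = [abs(np.corrcoef(fp, s)[0, 1]) for s in S_hat]
    matches[name] = (int(np.argmax(corrs)), max(corrs))
```

The correlation-against-library step is what makes this "fingerprint-based": ICA alone recovers components up to order and sign, and the fingerprint match resolves which analyte each component is.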
- Statistical Machine Learning for Multi-platform Biomedical Data AnalysisChen, Li (Virginia Tech, 2011-08-24)Recent advances in biotechnologies have enabled multi-platform and large-scale quantitative measurements of biomedical events. The need to analyze the vast amount of imaging and genomic data produced stimulates various novel applications of statistical machine learning methods in many areas of biomedical research. The main objective is to assist biomedical investigators in better interpreting, analyzing, and understanding biomedical questions based on the acquired data. Given the computational challenges imposed by these high-dimensional and complex data, machine learning research finds new opportunities and roles. In this dissertation, we propose to develop, test and apply novel statistical machine learning methods to analyze data mainly acquired by dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) and single nucleotide polymorphism (SNP) microarrays. The research work focuses on: (1) tissue-specific compartmental analysis for dynamic contrast-enhanced MR imaging of complex tumors; (2) computational analysis for detecting DNA SNP interactions in genome-wide association studies. DCE-MRI provides a noninvasive method for evaluating tumor vasculature patterns based on contrast accumulation and washout. Compartmental analysis is a widely used mathematical tool for modeling dynamic imaging data and can provide accurate pharmacokinetic parameter estimates. However, the partial volume effect (PVE) present in imaging data can profoundly affect the accuracy of pharmacokinetic studies. We therefore propose a convex analysis of mixtures (CAM) algorithm to explicitly eliminate PVE by expressing the kinetics in each pixel as a nonnegative combination of underlying compartments and subsequently identifying pure-volume pixels at the corners of the clustered pixel time series scatter plot.
The algorithm is supported by a series of newly proved theorems and additional noise filtering and normalization preprocessing. We demonstrate the principle and feasibility of the CAM approach together with compartmental modeling on realistic synthetic data, and compare the accuracy of parameter estimates obtained using CAM with that of other relevant techniques. Experimental results show a significant improvement in the accuracy of kinetic parameter estimation. We then apply the algorithm to real DCE-MRI data of breast cancer and observe improved pharmacokinetic parameter estimation that separates tumor tissue into sub-regions with differential tracer kinetics on a pixel-by-pixel basis and reveals biologically plausible tumor tissue heterogeneity patterns. This method combines the advantages of multivariate clustering, convex optimization and compartmental modeling approaches. Interactions among genetic loci are believed to play an important role in disease risk. Due to the huge dimension of SNP data (normally several million in genome-wide association studies), the combinatorial search and statistical evaluation required to detect multi-locus interactions constitute a significantly challenging computational task. While many approaches have been proposed for detecting such interactions, their relative performance remains largely unclear, due to the fact that performance has been evaluated on different data sources, using different performance measures, and under different experimental protocols. Given the importance of detecting gene-gene interactions, a thorough evaluation of the performance and limitations of available methods, a theoretical analysis of the interaction effect and the genetic factors it depends on, and the development of more efficient methods are warranted. Therefore, we perform a computational analysis for detecting interactions among SNPs.
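CAM's geometric insight — after normalization, mixed pixel time courses lie inside a simplex whose corners are the pure compartments — can be illustrated on synthetic two-compartment data. The kinetic curves, noise level, and the simple principal-direction corner search below are illustrative assumptions, not the published algorithm or its theorems:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 20  # time points

# Two "pure" compartment time-activity curves (illustrative kinetics).
t = np.arange(T, dtype=float)
fast = np.exp(-t / 3.0)
slow = 1 - np.exp(-t / 8.0)

# Each pixel mixes the two compartments with nonnegative proportions.
n_pix = 300
alpha = rng.random(n_pix)
alpha[:5], alpha[5:10] = 1.0, 0.0        # guarantee some pure pixels of each type
pixels = np.outer(alpha, fast) + np.outer(1 - alpha, slow)
pixels += rng.normal(0, 0.005, size=pixels.shape)   # mild noise

# After normalising each time course to unit sum, two-compartment mixtures lie
# on a line segment; its end-points (convex-hull corners) are the pure pixels.
P = pixels / pixels.sum(axis=1, keepdims=True)
centered = P - P.mean(axis=0)
u = np.linalg.svd(centered, full_matrices=False)[2][0]   # principal direction
proj = centered @ u
corner1, corner2 = pixels[proj.argmin()], pixels[proj.argmax()]

# Each recovered corner should match one of the true compartment curves.
cf = max(abs(np.corrcoef(corner1, fast)[0, 1]), abs(np.corrcoef(corner2, fast)[0, 1]))
cs = max(abs(np.corrcoef(corner1, slow)[0, 1]), abs(np.corrcoef(corner2, slow)[0, 1]))
```

With the pure-volume corners identified, every other pixel's kinetics can be decomposed as a nonnegative combination of the corner curves, which is how the PVE is explicitly removed before compartmental modeling.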
The contributions are four-fold: (1) developed simulation tools for evaluating the performance of any technique designed to detect interactions among genetic variants in case-control studies; (2) used these tools to compare the performance of five popular SNP detection methods; (3) derived analytic relationships between power and the genetic factors, which not only support the experimental results but also give a quantitative linkage between interaction effect and these factors; and (4) based on the novel insights gained from comparative and theoretical analysis, developed an efficient statistically-principled method, namely the hybrid correlation-based association (HCA), to detect interacting SNPs. The HCA algorithm is based on three correlation-based statistics, which are designed to measure the strength of multi-locus interaction for three different interaction types, covering a large portion of possible interactions. Moreover, to maximize the detection power (sensitivity) while suppressing the false positive rate (or retaining moderate specificity), we also devised a strategy to hybridize these three statistics in a case-by-case way. A heuristic search strategy is also proposed to largely decrease the computational complexity, especially for high-order interaction detection. We have tested HCA in both a simulation study and a real disease study. HCA and the selected peer methods were compared on a large number of simulated datasets, each including multiple sets of interaction models. The assessment criteria included several power measures, family-wise type I error rate, and computational complexity. The experimental results of HCA on the simulation data indicate its promising performance in terms of a good balance between detection accuracy and computational complexity. By running on multiple real datasets, HCA also replicates plausible biomarkers reported in the previous literature.
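One correlation-based interaction statistic of the kind HCA hybridizes is the case-control correlation contrast: a SNP pair whose genotype correlation differs sharply between cases and controls is a candidate interaction. The genotype model and the single statistic below are illustrative assumptions, not HCA's three statistics or its hybridization rule:

```python
import numpy as np

rng = np.random.default_rng(8)
n, n_snps = 600, 10

# Genotypes coded 0/1/2; SNPs 0 and 1 co-vary in cases only (illustrative model).
controls = rng.integers(0, 3, size=(n, n_snps))
cases = rng.integers(0, 3, size=(n, n_snps))
cases[:, 1] = np.clip(cases[:, 0] + rng.integers(-1, 2, size=n), 0, 2)

def corr_contrast(cases, controls, i, j):
    """|r_case - r_control|: a simple correlation-based interaction statistic."""
    rc = np.corrcoef(cases[:, i], cases[:, j])[0, 1]
    ru = np.corrcoef(controls[:, i], controls[:, j])[0, 1]
    return abs(rc - ru)

scores = {(i, j): corr_contrast(cases, controls, i, j)
          for i in range(n_snps) for j in range(i + 1, n_snps)}
top_pair = max(scores, key=scores.get)
```

Even this exhaustive pairwise scan is quadratic in the number of SNPs, which is why the abstract's heuristic search strategy matters for genome-wide and high-order scans.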
- Using Artificial Life to Design Machine Learning Algorithms for Decoding Gene Expression Patterns from ImagesZaghlool, Shaza Basyouni (Virginia Tech, 2008-04-30)Understanding the relationship between gene expression and phenotype is important in many areas of biology and medicine. Current methods for measuring gene expression, such as microarrays, are however invasive, require biopsy, and are expensive. These factors limit experiments to low-rate temporal sampling of gene expression and prevent longitudinal studies within a single subject, reducing their statistical power. Thus, methods for non-invasive measurement of gene expression are an important and current topic of research. An interesting approach (Segal et al., Nature Biotechnology 25(6), 2007) to indirect measurement of gene expression has recently been reported that uses existing imaging techniques and machine learning to estimate a function mapping image features to gene expression patterns, providing an image-derived surrogate for gene expression. However, the design of machine learning methods for this purpose is hampered by the cost of training and validation. My thesis shows that populations of artificial organisms simulating genetic variation can be used to design machine learning approaches for decoding gene expression patterns from images. If analysis of these images proves successful, the approach can be applied to real biomedical images, reducing the limitations of invasive imaging. The results showed that the box-counting dimension was a suitable feature extraction method, yielding a classification rate of at least 90% for mutation rates up to 40%. The box-counting dimension was also robust in dealing with distorted images. The performance of classifiers using the fractal dimension as a feature was, however, more sensitive to the mutation rate than to the applied distortion level.
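The box-counting dimension used as the image feature above has a compact definition: count the boxes occupied by the pattern at a series of scales and fit the slope of log N(s) against log(1/s). The implementation and the two sanity-check patterns below are a minimal sketch, not the thesis's feature-extraction code:

```python
import numpy as np

def box_counting_dimension(img):
    """Estimate fractal dimension of a binary image by counting occupied boxes
    at dyadic scales and fitting the slope of log N(s) versus log(1/s)."""
    assert img.shape[0] == img.shape[1]
    size = img.shape[0]
    scales, counts = [], []
    s = size
    while s >= 2:
        n_boxes = 0
        for i in range(0, size, s):
            for j in range(0, size, s):
                if img[i:i + s, j:j + s].any():   # box contains part of the pattern
                    n_boxes += 1
        scales.append(s)
        counts.append(n_boxes)
        s //= 2
    # Slope of the log-log fit is the dimension estimate.
    return np.polyfit(np.log(1.0 / np.array(scales)), np.log(counts), 1)[0]

filled = np.ones((64, 64), dtype=bool)       # a filled square has dimension 2
line = np.zeros((64, 64), dtype=bool)
line[32, :] = True                           # a straight line has dimension 1
d_filled = box_counting_dimension(filled)
d_line = box_counting_dimension(line)
```

Fractal-like ridge or texture patterns fall between these two extremes, which is what makes the single slope value a compact, distortion-tolerant feature for the classifiers described above.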