Browsing by Author "Wang, Yue J."
Now showing 1 - 20 of 65
- Accurate Identification of Significant Aberrations in Cancer Genome: Implementation and Applications. Hou, Xuchu (Virginia Tech, 2013-01-07). Somatic Copy Number Alterations (CNAs) are common events in human cancers. Identifying CNAs and Significant Copy number Aberrations (SCAs) in cancer genomes is a critical task in searching for cancer-associated genes. Advanced genome profiling technologies, such as SNP array technology, facilitate copy number study at a genome-wide scale with high resolution. However, due to normal tissue contamination, the observed intensity signals are actually a mixture of copy number signals contributed by both tumor and normal cells. This genetic confounding factor can significantly affect the subsequent copy number analyses. In order to accurately identify significant aberrations in contaminated cancer genomes, we develop a Java package, AISAIC (Accurate Identification of Significant Aberrations in Cancer), that incorporates two recent algorithms from the literature, BACOM (Bayesian Analysis of Copy number Mixtures) and SAIC (Significant Aberrations in Cancer). Specifically, BACOM is used to estimate the normal tissue contamination fraction and recover the "true" copy number profiles, and SAIC is used to detect SCAs from the recovered tumor samples. Considering the popularity of modern multi-core computers and clusters, we adopt concurrent computing with the Java Fork/Join API to speed up the analysis. We evaluate the performance of the AISAIC package in both empirical family-wise type I error rate and detection power on a large amount of simulated data, with promising results. Finally, we use AISAIC to analyze real cancer data from the TCGA portal and detect many SCAs that not only cover the majority of reported cancer-associated genes but also include some novel genomic regions that may be worth further study.
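As a rough illustration of the tumor-normal mixture correction described above, the sketch below applies the standard linear mixing relation (observed = alpha * 2 + (1 - alpha) * tumor, with alpha the normal-tissue fraction) to recover tumor copy numbers. It is a simplified stand-in for BACOM, which estimates alpha itself within a Bayesian model; the function name and values are illustrative assumptions.

```python
import numpy as np

def correct_copy_number(observed_cn, normal_fraction):
    """Recover tumor copy number from a tumor/normal mixture.

    Assumes the observed signal is a linear mix of normal cells
    (copy number 2) and tumor cells:
        observed = alpha * 2 + (1 - alpha) * tumor
    where alpha is the normal-tissue contamination fraction.
    """
    observed_cn = np.asarray(observed_cn, dtype=float)
    return (observed_cn - 2.0 * normal_fraction) / (1.0 - normal_fraction)

# Example: probes observed at copy numbers 2.6, 1.4, 2.0 with 40% normal contamination
print(correct_copy_number([2.6, 1.4, 2.0], normal_fraction=0.4))  # -> [3.0, 1.0, 2.0]
```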
- Advanced Projection Ultrasound Imaging with CMOS-based Sensor Array: Development, Characterization, and Potential Medical Applications. Liu, Chu Chuan (Virginia Tech, 2009-12-17). Since the early 1960s, ultrasound has become one of the most widely used medical imaging devices, as a diagnostic tool or an image guide for surgical intervention, because of its high portability, non-ionizing nature, non-invasiveness and low cost. Although continuous improvements in commercial equipment have been underway for many years, almost all systems are developed with pulse-echo geometry. In this research, a newly invented ultrasound sensor array was incorporated into the development of a projection imaging system. Three C-scan prototypes, which included prototypes #1, #2 and an ultrasound mammography system, were constructed. Systematic and evaluative studies, including ultrasound CT, 3-D ultrasound, and multi-modality investigations, were also performed. Furthermore, a new analytical method to model the ultrasound forward scattering distribution (FSD) was developed by employing a specific annular apparatus. After applying this method, the scattering-corrected C-scan images revealed more detailed structures than the unprocessed images. This new analytical modeling approach is believed to be effective for most imaging systems operating in projection geometry. In summary, while awaiting additional clinical validation, the C-scan ultrasound prototypes with state-of-the-art PE-CMOS sensor arrays provide real value and hold promise for medical diagnostic imaging. Potential future uses of C-scan ultrasound include, but are not limited to, computerized tomography, biopsy guidance, therapeutic device placement, foreign object detection, pediatric imaging, breast imaging, prostate imaging, human extremities imaging and live animal imaging. With continuous research and development, we believe that C-scan ultrasound has the potential to make a significant impact in the field of medical ultrasound imaging.
- Analysis of the Impact of Solar Thermal Water Heaters on the Electrical Distribution Load. Jesudhason Maria Therasammal, Terry Bruno (Virginia Tech, 2011-09-23). In this research, the impact of solar thermal water heaters on the electric water heating load curve in a residential distribution circuit is analyzed with realistic hot water draw profiles. For this purpose, electric and solar thermal water heater models are developed in MATLAB and validated against results from GridLAB-D and TRNSYS, respectively. The solar thermal water heater model is developed for two types of collectors, namely the flat plate and the evacuated glass tube collector. Simulations are performed with climate data from two cities - Madison, WI and Tampa, FL - which belong to two very different climate zones in the United States. Minute-by-minute electric energy consumption in all three configurations of water heaters is modeled for a single water heater as well as for a residential distribution circuit with 100 water heaters, over daily as well as monthly time frames. The research findings include: The electric energy saving potential of a solar thermal water heater backed by an auxiliary electric element is in the range of 40-80% compared to an all-electric water heater, depending on site conditions such as ambient temperature, sunshine and wind speed. The simulation results indicate that the energy saving potential of a solar thermal water heater is in the range of 40-70% during winter and 60-80% during summer. Solar thermal water heaters aid in reducing the peak demand for electric water heating in a distribution feeder during sunshine hours when ambient temperatures are higher. The simulation results indicate that the peak reduction potential of solar thermal water heaters in a residential distribution feeder is in the range of 25-40% during winter and 40-60% during summer. The evacuated glass tube collectors save an additional 7-10% electric energy compared to the flat plate collectors with one glass pane during winter and around 10-15% during summer. The additional savings result from the capability of glass tube collectors to absorb ground-reflected radiation and diffuse as well as direct beam radiation over a wider range of incidence angles. Also, the evacuated glass tube structure helps in reducing wind convective losses. From the simulations performed for Madison, WI and Tampa, FL, it is observed that Tampa, FL experiences more energy savings in winter than Madison, WI, while the energy savings are almost the same in summer. This is because Tampa, FL has warmer winters with higher ambient temperatures and longer sunshine hours during the day compared to Madison, WI, while the summer temperatures and sunshine hours are almost the same for the two cities. As expected, the simulation results confirm that lowering the hot water temperature set point reduces electricity consumption. For a temperature reduction from 120 deg. F to 110 deg. F, electric water heaters save about 25-35% electric energy, whereas solar thermal water heaters save about 30-40% auxiliary electric energy for the same temperature reduction. For the flat plate collectors, glass panes play an important role in auxiliary electric energy consumption. Flat plate collectors with two glass panes save about 10-15% auxiliary electric energy compared to those with no glass panes, and about 3-5% compared to collectors with one glass pane.
This is because glass panes reduce wind convective losses. However, glass panes also introduce transmittance losses, and there is an upper limit on how many glass panes can be used. Results and findings from this research provide valuable insight into the benefits of solar thermal water heaters in a residential distribution feeder, which include energy savings and peak demand reduction.
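To make the thermal modeling concrete, here is a minimal single-node tank energy balance of the kind such simulations build on: solar gain and the auxiliary element add heat, while standby losses and hot-water draws remove it. This is a simplified sketch, not the thesis's MATLAB/GridLAB-D/TRNSYS models; all parameter names and default values are illustrative assumptions.

```python
def tank_temperature_step(T_tank, T_ambient, T_inlet, draw_liters, solar_gain_w,
                          aux_on, dt_s=60.0, volume_l=200.0, ua_w_per_k=3.0,
                          aux_power_w=4500.0):
    """Advance a lumped (single-node) tank model by one time step.

    Energy balance: solar gain + auxiliary element - standby loss
    - energy needed to heat the replacement water for the hot-water draw.
    Units: temperatures in deg C, volume in liters, power in watts.
    """
    cp = 4186.0          # J/(kg*K), water
    mass = volume_l      # ~1 kg per liter
    q_loss = ua_w_per_k * (T_tank - T_ambient)             # standby loss, W
    q_aux = aux_power_w if aux_on else 0.0                 # electric element, W
    q_draw = draw_liters * cp * (T_tank - T_inlet) / dt_s  # draw load, W
    dT = (solar_gain_w + q_aux - q_loss - q_draw) * dt_s / (mass * cp)
    return T_tank + dT

# One minute with a 2-liter draw, 500 W of solar gain, auxiliary element off
print(tank_temperature_step(T_tank=48.9, T_ambient=20.0, T_inlet=12.0,
                            draw_liters=2.0, solar_gain_w=500.0, aux_on=False))
```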
- Automated Analysis of Astrocyte Activities from Large-scale Time-lapse Microscopic Imaging Data. Wang, Yizhi (Virginia Tech, 2019-12-13). The advent of multi-photon microscopes and highly sensitive protein sensors enables the recording of astrocyte activities in a large population of cells over a long time period in vivo. Existing tools cannot fully characterize these activities, both within single cells and at the population level, because region-of-interest-based approaches are insufficient to describe activity that is often spatially unfixed, size-varying, and propagative. Here, we present Astrocyte Quantitative Analysis (AQuA), an analytical framework that releases astrocyte biologists from the ROI-based paradigm. The framework takes an event-based perspective to model and accurately quantify the complex activity in astrocyte imaging datasets, with an event defined jointly by its spatial occupancy and temporal dynamics. To model signal propagation in astrocytes, we developed graphical time warping (GTW) to align curves with graph-structured constraints and integrated it into AQuA. To make AQuA easy to use, we designed a comprehensive software package. The software implements the detection pipeline in an intuitive step-by-step GUI with visual feedback. The software also supports proof-reading and the incorporation of morphology information. With synthetic data, we showed that AQuA is much more accurate than existing methods developed for astrocytic data and neuronal data. We applied AQuA to a range of ex vivo and in vivo imaging datasets. Since AQuA is data-driven and based on machine learning principles, it can be applied across model organisms, fluorescent indicators, experimental modes, and imaging resolutions and speeds, enabling researchers to elucidate fundamental astrocyte physiology.
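A minimal sketch of the event-based view described above: threshold the dF/F movie and group active voxels that touch in space-time into candidate events. AQuA's actual pipeline is considerably richer (it models propagation, splits merged events, and estimates noise locally); the thresholds and function name here are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def detect_events(movie, baseline, z_thresh=3.0, min_voxels=20):
    """Label candidate activity events in a (time, y, x) fluorescence movie.

    A voxel is active when dF/F exceeds z_thresh times a robust noise estimate;
    active voxels that touch in space-time form one candidate event.
    """
    dff = (movie - baseline) / np.maximum(baseline, 1e-6)
    noise = np.median(np.abs(dff - np.median(dff))) * 1.4826   # robust sigma
    active = dff > z_thresh * noise
    labels, n = ndimage.label(active)                          # 3-D connectivity in (t, y, x)
    sizes = ndimage.sum(active, labels, range(1, n + 1))
    keep = [i + 1 for i, s in enumerate(sizes) if s >= min_voxels]
    return np.where(np.isin(labels, keep), labels, 0)
```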
- Automated Identification and Tracking of Motile Oligodendrocyte Precursor Cells (OPCs) from Time-lapse 3D Microscopic Imaging Data of Cell Clusters in vivo. Wang, Yinxue (Virginia Tech, 2021-06-02). Advances in time-lapse 3D in vivo fluorescence microscopic imaging techniques enable the observation and investigation of the migration of oligodendrocyte precursor cells (OPCs) and its role in the central nervous system. However, current practice in image-based OPC motility analysis relies heavily on manual labeling and tracking on 2D max projections of the 3D data, which requires massive human labor and suffers from subjective bias, weak reproducibility and, especially, information loss and distortion. In addition, due to the lack of an OPC-specific genetically encoded indicator, OPCs can only be distinguished from other oligodendrocyte lineage cells by their observed motion patterns. Automated analytical tools are needed for the identification and tracking of OPCs. In this dissertation work, we proposed an analytical framework, MicTracker (Migrating Cell Tracker), for the integrated task of identifying, segmenting and tracking migrating cells (OPCs) from in vivo time-lapse fluorescence imaging data of high-density cell clusters composed of cells with different modes of motion. As a component of the framework, we presented a novel strategy for cell segmentation with global temporal consistency enforced, tackling the challenges caused by the highly clustered cell population and the boundaries between touching cells that are blurred inconsistently over time. We also designed a data association algorithm to address the violation of the usual assumption of small displacements. Recognizing that the violation arose in a mixed cell population composed of two cell groups while the assumption held within each group, we proposed to solve this seemingly impossible mission by de-mixing the two groups of cell motion modes without known labels. We demonstrated the effectiveness of MicTracker in solving our problem on real in vivo data.
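The data association step above builds on the classic one-to-one matching of detections between consecutive frames. The sketch below shows that building block with the Hungarian algorithm; MicTracker's contribution is handling the mixture of motion modes on top of such matching, and the distance gate used here is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(prev_centroids, curr_centroids, max_dist=15.0):
    """Match detections in consecutive frames by minimizing total distance.

    Returns a list of (prev_index, curr_index) pairs; matches farther apart
    than max_dist are discarded (treated as disappearance/appearance).
    """
    prev_c = np.asarray(prev_centroids, dtype=float)
    curr_c = np.asarray(curr_centroids, dtype=float)
    cost = np.linalg.norm(prev_c[:, None, :] - curr_c[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]

# Example: three cells in frame t, three candidate detections in frame t+1
print(associate([(10, 10, 5), (40, 42, 7), (80, 15, 9)],
                [(12, 11, 5), (79, 17, 9), (41, 44, 7)]))
```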
- Automated Tracking of Mouse Embryogenesis from Large-scale Fluorescence Microscopy Data. Wang, Congchao (Virginia Tech, 2021-06-03). Recent breakthroughs in microscopy techniques and fluorescence probes enable the recording of mouse embryogenesis at the cellular level for days, easily generating terabyte-level 3D time-lapse data. Since millions of cells are involved, this information-rich data brings a natural demand for an automated tool for its comprehensive analysis. This tool should automatically (1) detect and segment cells at each time point and (2) track cell migration across time. Most existing cell tracking methods cannot scale to data of such large size and high complexity. For those purposely designed for embryo data analysis, accuracy is heavily sacrificed. Here, we present a new computational framework for mouse embryo data analysis with high accuracy and efficiency. Our framework detects and segments cells with a fully probability-principled method, which not only has high statistical power but also helps determine the desired cell territories and increase the segmentation accuracy. With the cells detected at each time point, our framework reconstructs cell traces with a new minimum-cost circulation-based paradigm, CINDA (CIrculation Network-based Data Association). Compared with the widely used minimum-cost flow-based methods, CINDA guarantees the globally optimal solution with the best known theoretical worst-case complexity and a practical efficiency improvement of hundreds to thousands of times. Since the information extracted from a single time point is limited, our framework iteratively refines cell detection and segmentation results based on the cell traces, which contain more information from other time points. Results show that this dramatically improves the accuracy of cell detection, segmentation, and tracking. To make our work easy to use, we designed standalone software, MIVAQ (Microscopic Image Visualization, Annotation, and Quantification), with our framework as the backbone and a user-friendly interface. With MIVAQ, users can easily analyze their data and visually check the results.
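For context on the network formulation mentioned above, the sketch below builds the conventional minimum-cost-flow tracking graph that CINDA's minimum-cost-circulation paradigm improves upon: each detection is split into an in/out node pair so it can join at most one trajectory, and one unit of source-to-sink flow is one track. Detection ids are assumed globally unique, and the appearance cost and solver call are illustrative choices, not CINDA itself.

```python
import networkx as nx

def build_tracking_graph(frames, n_tracks, link_cost, appear_cost=100):
    """Minimum-cost-flow formulation of frame-to-frame cell tracking.

    frames: list of frames, each a list of globally unique detection ids.
    link_cost(a, b): integer cost of linking detection a to detection b
    in the next frame. One unit of src->snk flow corresponds to one trajectory.
    """
    g = nx.DiGraph()
    g.add_node("src", demand=-n_tracks)
    g.add_node("snk", demand=n_tracks)
    for t, frame in enumerate(frames):
        for d in frame:
            # split node so each detection is used by at most one track
            g.add_edge((d, "in"), (d, "out"), capacity=1, weight=0)
            g.add_edge("src", (d, "in"), capacity=1, weight=appear_cost)   # track birth
            g.add_edge((d, "out"), "snk", capacity=1, weight=appear_cost)  # track death
            if t + 1 < len(frames):
                for e in frames[t + 1]:
                    g.add_edge((d, "out"), (e, "in"), capacity=1,
                               weight=link_cost(d, e))
    return g

# Trajectories are read off the unit flows, e.g.:
# flow = nx.min_cost_flow(build_tracking_graph(frames, n_tracks=5, link_cost=cost))
```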
- Bayesian Alignment Model for Analysis of LC-MS-based Omic Data. Tsai, Tsung-Heng (Virginia Tech, 2014-05-22). Liquid chromatography coupled with mass spectrometry (LC-MS) has been widely used in various omic studies for biomarker discovery. Appropriate LC-MS data preprocessing steps are needed to detect true differences between biological groups. Retention time alignment is one of the most important yet challenging preprocessing steps, needed to ensure that ion intensity measurements among multiple LC-MS runs are comparable. In this dissertation, we propose a Bayesian alignment model (BAM) for analysis of LC-MS data. BAM uses Markov chain Monte Carlo (MCMC) methods to draw inference on the model parameters and provides estimates of the retention time variability along with uncertainty measures, enabling a natural framework to integrate information from various sources. From methodology development to practical application, we investigate the alignment problem through three research topics: 1) development of a single-profile Bayesian alignment model, 2) development of a multi-profile Bayesian alignment model, and 3) application to biomarker discovery research. Chapter 2 introduces the profile-based Bayesian alignment using a single chromatogram, e.g., the base peak chromatogram, from each LC-MS run. The single-profile alignment model improves on existing MCMC-based alignment methods through 1) the implementation of an efficient MCMC sampler using a block Metropolis-Hastings algorithm, and 2) an adaptive mechanism for knot specification using stochastic search variable selection (SSVS). Chapter 3 extends the model to integrate complementary information that better captures the variability in chromatographic separation. We use Gaussian process regression on the internal standards to derive a prior distribution for the mapping functions. In addition, a clustering approach is proposed to identify multiple representative chromatograms for each LC-MS run. With the Gaussian process prior, these chromatograms are simultaneously considered in the profile-based alignment, which greatly improves the model estimation and facilitates the subsequent peak matching process. Chapter 4 demonstrates the applicability of the proposed Bayesian alignment model to biomarker discovery research. We integrate the proposed Bayesian alignment model into a rigorous preprocessing pipeline for LC-MS data analysis. Through the developed analysis pipeline, candidate biomarkers for hepatocellular carcinoma (HCC) are identified and confirmed on a complementary platform.
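As a bare-bones illustration of the MCMC machinery referenced above, the sketch below runs a random-walk Metropolis-Hastings sampler for a single global retention-time shift between two chromatograms. BAM itself samples flexible spline mapping functions with a block Metropolis-Hastings scheme and SSVS knot selection; the Gaussian likelihood, prior width, and proposal scale here are illustrative assumptions.

```python
import numpy as np

def mh_shift_sampler(reference, target, n_iter=5000, sigma_prop=0.5, rng=None):
    """Random-walk Metropolis-Hastings for a single retention-time shift.

    reference, target: chromatogram intensities on a common time grid.
    Likelihood: Gaussian error between reference and the shifted target;
    prior: shift ~ N(0, 10^2). Returns posterior samples of the shift.
    """
    rng = np.random.default_rng(rng)
    t = np.arange(len(reference), dtype=float)

    def log_post(shift):
        shifted = np.interp(t, t + shift, target)
        return -0.5 * np.sum((reference - shifted) ** 2) - 0.5 * (shift / 10.0) ** 2

    shift, lp = 0.0, log_post(0.0)
    samples = []
    for _ in range(n_iter):
        cand = shift + rng.normal(0.0, sigma_prop)
        lp_cand = log_post(cand)
        if np.log(rng.uniform()) < lp_cand - lp:   # accept/reject step
            shift, lp = cand, lp_cand
        samples.append(shift)
    return np.asarray(samples)
```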
- Bayesian Integration and Modeling for Next-generation Sequencing Data Analysis. Chen, Xi (Virginia Tech, 2016-07-01). Computational biology currently faces challenges in a big data world with thousands of data samples across multiple disease types, including cancer. The challenging problem is how to extract biologically meaningful information from large-scale genomic data. Next-generation Sequencing (NGS) can now produce high quality data at the DNA and RNA levels. However, cells contain many non-specific (background) signals that affect the detection accuracy of true (foreground) signals. In this dissertation work, under a Bayesian framework, we aim to develop and apply approaches to learn the distribution of genomic signals in each type of NGS data for reliable identification of specific foreground signals. We propose a novel Bayesian approach (ChIP-BIT) to reliably detect transcription factor (TF) binding sites (TFBSs) within promoter or enhancer regions by jointly analyzing the sample and input ChIP-seq data for one specific TF. Specifically, a Gaussian mixture model is used to capture both binding and background signals in the sample data, and background signals are modeled by a local Gaussian distribution that is accurately estimated from the input data. An Expectation-Maximization algorithm is used to learn the model parameters according to the distributions of binding signal intensity and binding locations. Extensive simulation studies and experimental validation both demonstrate that ChIP-BIT has significantly improved performance on TFBS detection over conventional methods, particularly on weak binding signal detection. To infer cis-regulatory modules (CRMs) of multiple TFs, we propose a Bayesian integration approach, namely BICORN, to integrate ChIP-seq and RNA-seq data from the same tissue. Each TFBS identified from ChIP-seq data can be either a functional binding event mediating target gene transcription or a non-functional binding event. The functional bindings of a set of TFs usually work together as a CRM to regulate the transcription processes of a group of genes. We develop a Gibbs sampling approach to learn the distribution of CRMs (a joint distribution of multiple TFs) based on their functional bindings and target gene expression. The robustness of BICORN has been validated on simulated regulatory network and gene expression data with respect to different noise settings. BICORN is further applied to breast cancer MCF-7 ChIP-seq and RNA-seq data to identify CRMs functional in promoter or enhancer regions. In tumor cells, the normal regulatory mechanism may be interrupted by genome mutations, especially somatic mutations that uniquely occur in tumor cells. Focusing on a specific type of genome mutation, structural variation (SV), we develop a novel pattern-based probabilistic approach, namely PSSV, to identify somatic SVs from whole genome sequencing (WGS) data. PSSV features a mixture model with hidden states representing different mutation patterns; PSSV can thus differentiate heterozygous and homozygous SVs in each sample, enabling the identification of those somatic SVs with a heterozygous status in the normal sample and a homozygous status in the tumor sample. Simulation studies demonstrate that PSSV outperforms existing tools. PSSV has been successfully applied to breast cancer patient WGS data for identifying somatic SVs of key factors associated with breast cancer development.
In this dissertation research, we demonstrate the advantage of the proposed distributional learning-based approaches over conventional methods for NGS data analysis. Distributional learning is a very powerful approach to gain biological insights from high quality NGS data. Successful applications of the proposed Bayesian methods to breast cancer NGS data shed light on underlying molecular mechanisms of breast cancer, enabling biologists or clinicians to identify major cancer drivers and develop new therapeutics for cancer treatment.
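The Gaussian mixture idea behind ChIP-BIT can be illustrated with a plain two-component EM fit on one-dimensional read-intensity values: one component captures background, the other binding. ChIP-BIT additionally anchors the background component with the input ChIP-seq data and models binding locations; the initialization and iteration count below are illustrative.

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Returns (weights, means, stds, responsibilities); the component with the
    larger mean can be read as "binding" and the other as "background".
    """
    x = np.asarray(x, dtype=float)
    mu = np.percentile(x, [25, 75]).astype(float)
    sd = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and standard deviations
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sd, resp
```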
- Bayesian Modeling for Isoform Identification and Phenotype-specific Transcript Assembly. Shi, Xu (Virginia Tech, 2017-10-24). The rapid development of biotechnology has enabled researchers to collect high-throughput data for studying various biological processes at the genomic, transcriptomic, and proteomic levels. Due to the high noise in the data and the high complexity of diseases (such as cancer), it is a challenging task for researchers to extract biologically meaningful information that can help reveal the underlying molecular mechanisms. The challenges call for more effort in developing efficient and effective computational methods to analyze the data at different levels so as to understand the biological systems in different aspects. In this dissertation research, we have developed novel Bayesian approaches to infer alternative splicing mechanisms in biological systems using RNA sequencing data. Specifically, we focus on two research topics in this dissertation: isoform identification and phenotype-specific transcript assembly. For isoform identification, we develop a computational approach, SparseIso, to jointly model the existence and abundance of isoforms in a Bayesian framework. A spike-and-slab prior is incorporated into the model to enforce the sparsity of expressed isoforms. A Gibbs sampler is developed to sample the existence and abundance of isoforms iteratively. For transcript assembly, we develop a Bayesian approach, IntAPT, to assemble phenotype-specific transcripts from multiple RNA sequencing profiles. A two-layer Bayesian framework is used to model the existence of phenotype-specific transcripts and the transcript abundance in individual samples. Based on the hierarchical Bayesian model, a Gibbs sampling algorithm is developed to estimate the joint posterior distribution for phenotype-specific transcript assembly. The performance of our proposed methods is evaluated with simulation data, compared with existing methods, and benchmarked with real cell line data. We then apply our methods to breast cancer data to identify biologically meaningful splicing mechanisms associated with breast cancer. In future work, we will extend our methods to de novo transcript assembly to identify novel isoforms in biological systems, and we will incorporate isoform-specific networks into our methods to better understand splicing mechanisms in biological systems.
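Underlying both SparseIso and IntAPT is the linear relation between observed read-count features and unknown isoform abundances. The sketch below shows that relation with a simple non-negative least-squares fit as a non-Bayesian stand-in; the compatibility matrix, counts, and sparsity cut-off are made-up illustrations, whereas SparseIso enforces sparsity through a spike-and-slab prior and Gibbs sampling.

```python
import numpy as np
from scipy.optimize import nnls

# Rows: read-count features (e.g., exons/junctions); columns: candidate isoforms.
# Entry (i, j) ~ expected contribution of one copy of isoform j to feature i.
compat = np.array([[1.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0],
                   [1.0, 1.0, 1.0]])
observed_counts = np.array([120.0, 80.0, 60.0, 150.0])

abundance, residual = nnls(compat, observed_counts)
expressed = abundance > 1e-6   # crude existence call; SparseIso does this via a prior
print(abundance, expressed)
```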
- Biclustering and Visualization of High Dimensional Data using VIsual Statistical Data Analyzer. Blake, Patrick Michael (Virginia Tech, 2019-01-31). Many data sets have too many features for conventional pattern recognition techniques to work properly. This thesis investigates techniques that alleviate these difficulties. One such technique, biclustering, clusters data in both dimensions and is inherently resistant to the challenges posed by having too many features. However, the algorithms that implement biclustering have limitations in that the user must know at least the structure of the data and how many biclusters to expect. This is where the VIsual Statistical Data Analyzer, or VISDA, can help. It is a visualization tool that successively and progressively explores the structure of the data, identifying clusters along the way. This thesis proposes coupling VISDA with biclustering to overcome some of the challenges of data sets with too many features. Further, to increase the performance, usability, and maintainability as well as reduce costs, VISDA was translated from Matlab to a Python version called VISDApy. Both VISDApy and the overall process were demonstrated with real and synthetic data sets. The results of this work have the potential to improve analysts' understanding of the relationships within complex data sets and their ability to make informed decisions from such data.
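For readers unfamiliar with biclustering, the sketch below recovers planted sample-feature blocks with scikit-learn's spectral co-clustering, which groups rows and columns simultaneously. It is a generic illustration of the idea, not the particular biclustering algorithms or the VISDA/VISDApy workflow studied in the thesis; the synthetic data are made up.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
# Nonnegative synthetic data with 3 hidden sample-by-feature blocks
data = rng.random((90, 60))
for k in range(3):
    data[k * 30:(k + 1) * 30, k * 20:(k + 1) * 20] += 3.0

model = SpectralCoclustering(n_clusters=3, random_state=0)
model.fit(data)
# Each bicluster is a subset of rows (samples) and columns (features)
print(model.row_labels_[:10], model.column_labels_[:10])
```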
- Blockchain-enabled Secure and Trusted Personalized Health Record. Dong, Yibin (Virginia Tech, 2022-12-20). A longitudinal personalized electronic health record (LPHR) provides a holistic view of an individual's health records and offers a consistent, patient-controlled information system for managing the health care of patients. Except for patients in the Veterans Affairs health care service, however, no LPHR is available for the general population in the U.S. that can integrate patients' existing electronic health records throughout the life of care. Such a gap is mainly attributable to the fact that existing patients' electronic health records are scattered across multiple health care facilities and often not shared, due to privacy and security concerns from both patients and health care organizations. The main objective of this dissertation is to address these roadblocks by designing a scalable and interoperable LPHR with patient-controlled and mutually trusted security and privacy. Privacy and security are complex problems. Specifically, without a set of access control policies, encryption alone cannot secure patient data because of insider threats. Moreover, in a distributed system like an LPHR, a so-called race condition occurs when access control policies are centralized while decision-making processes are localized. We propose a formal definition of a secure LPHR and develop a blockchain-enabled next generation access control (BeNGAC) model. The BeNGAC solution focuses on patient-managed secure authorization for access, and NGAC operates in open-access surroundings where users can be centrally known or unknown. We also propose permissioned blockchain technology - Hyperledger Fabric (HF) - to ease the race-condition shortcoming in NGAC, which in turn enhances the weak confidentiality protection in HF. Built upon BeNGAC, we further design a blockchain-enabled secure and trusted (BEST) LPHR prototype in which data are stored in a distributed yet decentralized database. The unique feature of the proposed BEST-LPHR is the use of blockchain smart contracts allowing BeNGAC policies to govern security, privacy, confidentiality, data integrity, scalability, sharing, and auditability. Interoperability is achieved by using a health care data exchange standard called Fast Healthcare Interoperability Resources. We demonstrated the feasibility of the BEST-LPHR design through use case studies. Specifically, a small-scale BEST-LPHR was built as a sharing platform among a patient and health care organizations. In the study setting, patients have raised additional ethical concerns related to consent and granular control of the LPHR. We engineered a Web-delivered BEST-LPHR sharing platform with patient-controlled consent granularity, security, and privacy realized by BeNGAC. Health organizations holding the patient's electronic health record (EHR) can join the platform based on validation by the patient. Mutual trust is established through a rigorous validation process involving both the patient and the built-in HF consensus mechanism. We measured system scalability and showed millisecond-range performance for LPHR permission changes. In this dissertation, we report the BEST-LPHR solution for electronically sharing and managing patients' electronic health records from multiple organizations, focusing on privacy and security concerns.
While the proposed BEST-LPHR solution cannot, as expected, address all problems in LPHR, this prototype aims to increase the EHR adoption rate and reduce LPHR implementation roadblocks. In the long run, BEST-LPHR will contribute to improving health care efficiency and the quality of life for many patients.
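To ground the patient-managed authorization idea, here is a deliberately minimal consent-check sketch in which the patient grants organizations read access to specific record categories. Real NGAC/BeNGAC policies are graph-based and far more expressive, and in BEST-LPHR they would be enforced by smart contracts on Hyperledger Fabric rather than in application code; the class, method names, and rule below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ConsentPolicy:
    """Patient-managed grants: organization id -> set of record categories."""
    grants: dict = field(default_factory=dict)

    def grant(self, org_id, categories):
        self.grants.setdefault(org_id, set()).update(categories)

    def revoke(self, org_id):
        self.grants.pop(org_id, None)

    def is_permitted(self, org_id, category, action):
        # Illustrative rule: read access only, and only to granted categories.
        return action == "read" and category in self.grants.get(org_id, set())

policy = ConsentPolicy()
policy.grant("clinic-A", {"lab-results", "immunizations"})
print(policy.is_permitted("clinic-A", "lab-results", "read"))   # True
print(policy.is_permitted("clinic-B", "lab-results", "read"))   # False
```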
- Brain Signal Quantification and Functional Unit Analysis in Fluorescent Imaging Data by Unsupervised Learning. Mi, Xuelong (Virginia Tech, 2024-06-04). Optical recording of various brain signals is becoming an indispensable technique for biological studies, accelerated by the development of new or improved biosensors and microscopy technology. A major challenge in leveraging the technique is to identify and quantify the rich patterns embedded in the data. However, existing methods often struggle, either due to their limited signal analysis capabilities or their poor performance. Here we present Activity Quantification and Analysis (AQuA2), an innovative analysis platform built upon machine learning theory. AQuA2 features a novel event detection pipeline for precise quantification of intricate brain signals and incorporates a Consensus Functional Unit (CFU) module to explore interactions among potential functional units driving repetitive signals. To enhance efficiency, we developed the BIdirectional pushing with Linear Component Operations (BILCO) algorithm to handle propagation analysis, a step that is time-consuming with traditional algorithms. Furthermore, considering user-friendliness, AQuA2 is implemented as both a MATLAB package and a Fiji plugin, complete with a graphical interface for enhanced usability. AQuA2's validation through both simulation and real-world applications demonstrates its superior performance compared to its peers. Applied across various sensors (calcium, NE, and ATP), cell types (astrocytes, oligodendrocytes, and neurons), animal models (zebrafish and mouse), and imaging modalities (two-photon, light sheet, and confocal), AQuA2 consistently delivers promising results and novel insights, showcasing its versatility in fluorescent imaging data analysis.
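The propagation analysis handled by BILCO is, at its core, a joint alignment problem. The sketch below shows the underlying primitive, plain pairwise dynamic time warping between two curves; GTW/BILCO couple many such alignments under graph-structured constraints and solve them far more efficiently than this basic quadratic dynamic program.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1-D signals.

    Returns the cumulative alignment cost; GTW/BILCO couple many such
    pairwise alignments under graph-structured constraints.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]))  # small: same shape, shifted
```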
- Building Matlab Standalone Package from Java for Differential Dependence Network Analysis Bioinformatics Toolkit. Jin, Lu (Virginia Tech, 2010-05-26). This thesis reports a software development effort to transplant a Matlab algorithm into Matlab-license-free, platform-dependent, Java-based software. The result is almost equivalent to a direct translation of the Matlab source code into Java or any other programming language. Since the compiled library is platform dependent, an MCR (Matlab Compiler Runtime environment) is required and has been set up to deploy the transplanted algorithm to end users. As a result, the deployed MCR is free to distribute, and the streamlined transplantation process is much simpler and more reliable than manual translation work. In addition, the implementation methodology reported here can be reused for other similar software engineering tasks. There are four main construction steps in our software package development. First, all Matlab *.m files or *.mex files associated with the algorithms of interest (to be transplanted) are gathered, and the corresponding shared library is created by the Matlab Compiler. Second, a Java driver is created that will serve as the final user interface. This Java-based user interface takes care of all the input and output of the original Matlab algorithm and prepares all native methods. Third, assisted by JNI, a C driver is implemented to manage the variable transfer between Matlab and Java. Lastly, the Matlab mbuild function is used to compile the C driver and the aforementioned shared library into a dependent library, ready to be called from the standalone Java interface. We use a caBIG™ (Cancer Biomedical Informatics Grid) data analytic toolkit, namely the DDN (differential dependence network) algorithm, as the testbed in the software development. The developed DDN standalone package can be used on any Matlab-supported platform with a Java GUI (graphical user interface) or command-line parameters. As a caBIG™ toolkit, the DDN package can be integrated into other information systems such as Taverna or G-DOC. The major benefits provided by the proposed methodology can be summarized as follows. First, the proposed software development framework offers a simple and effective way for algorithm developers to provide novel bioinformatics tools to biomedical end users, where the frequent obstacle is the lack of a language-specific software runtime environment and incompatibility between the compiled software and the computer platforms available at users' sites. Second, the proposed software development framework offers software developers a significant time- and effort-saving method for translating code between different programming languages, where the majority of a software developer's time and effort is spent on understanding the specific analytic algorithm and its language-specific code rather than on developing efficient and platform/user-friendly software. Third, the proposed methodology allows software engineers to focus their effort on the quality of the software rather than the details of the original source code, where the only required information is the inputs and outputs of the algorithm. Specifically, all used variables and functions are mapped between Matlab, C and Java, handled solely by our designated C driver.
- Building trustworthy machine learning systems in adversarial environments. Wang, Ning (Virginia Tech, 2023-05-26). Modern AI systems, particularly with the rise of big data and deep learning in the last decade, have greatly improved our daily life and at the same time created a long list of controversies. AI systems are often subject to malicious and stealthy subversion that jeopardizes their efficacy. Many of these issues stem from the data-driven nature of machine learning. While big data and deep models significantly boost the accuracy of machine learning models, they also create opportunities for adversaries to tamper with models or extract sensitive data. Malicious data providers can compromise machine learning systems by supplying false data and intermediate computation results. Even a well-trained model can be deceived into misbehaving by an adversary who provides carefully designed inputs. Furthermore, curious parties can derive sensitive information about the training data by interacting with a machine-learning model. These adversarial scenarios, known as poisoning attacks, adversarial example attacks, and inference attacks, have demonstrated that security, privacy, and robustness have become more important than ever for AI to gain wider adoption and societal trust. To address these problems, we proposed the following solutions: (1) FLARE, which detects and mitigates stealthy poisoning attacks by leveraging latent space representations; (2) MANDA, which detects adversarial examples by utilizing evaluations from diverse sources, i.e., model-based prediction and data-based evaluation; (3) FeCo, which enhances the robustness of machine learning-based network intrusion detection systems by introducing a novel representation learning method; and (4) DP-FedMeta, which preserves data privacy and improves the privacy-accuracy trade-off in machine learning systems through a novel adaptive clipping mechanism.
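The abstract closes with an adaptive clipping mechanism for privacy. As a loose illustration only (DP-FedMeta's actual rule is not described here), the sketch below clips per-example gradients to a quantile of their norms and adds Gaussian noise before aggregation, the generic differential-privacy-style recipe that such mechanisms adapt; every parameter name and value is an assumption.

```python
import numpy as np

def clip_and_noise(per_example_grads, clip_quantile=0.5, noise_multiplier=1.0, rng=None):
    """Clip per-example gradients to an adaptively chosen norm and add noise.

    The clipping threshold is set to a quantile of the current gradient norms
    (an illustrative adaptive rule, not the DP-FedMeta mechanism). The noisy
    mean is what a DP-style aggregator would release.
    """
    rng = np.random.default_rng(rng)
    grads = np.asarray(per_example_grads, dtype=float)         # shape: (n, d)
    norms = np.linalg.norm(grads, axis=1)
    clip = np.quantile(norms, clip_quantile)                    # adaptive threshold
    scale = np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    clipped = grads * scale[:, None]
    noise = rng.normal(0.0, noise_multiplier * clip, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(grads)

print(clip_and_noise(np.random.default_rng(0).normal(size=(8, 4)), rng=1))
```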
- Computational Analysis of Genome-Wide DNA Copy Number Changes. Song, Lei (Virginia Tech, 2011-05-03). DNA copy number change is an important form of structural variation in the human genome. Somatic copy number alterations (CNAs) can cause overexpression of oncogenes and loss of tumor suppressor genes in tumorigenesis. Recent development of SNP array technology has facilitated studies of copy number changes at a genome-wide scale, with high resolution. Quantitative analysis of somatic CNAs on genes has found broad applications in cancer research. Most tumors exhibit genomic instability at the chromosome scale as a result of genomic mutations dynamically accumulated during the course of tumor progression. Such higher-level cancer genomic characteristics cannot be effectively captured by the analysis of individual genes. We introduced two definitions of a chromosome instability (CIN) index to mathematically and quantitatively characterize genome-wide genomic instability. The proposed CIN indices are derived from CNAs detected using circular binary segmentation and the wavelet transform, and each calculates a score based on both the amplitude and the frequency of the copy number changes. We generated CIN indices on ovarian cancer subtypes' copy number data and used them as features to train an SVM classifier. The experimental results show promisingly high classification accuracy estimated through cross-validation. Additional survival analysis constructed on the CIN scores extracted from the TCGA ovarian cancer dataset showed considerable correlation between CIN scores and various events and severity in ovarian cancer development. Our methods have been integrated into G-DOC. We expect these newly defined CINs to serve as predictors in tumor subtype diagnosis and as a useful tool in cancer research.
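To make the amplitude-and-frequency idea concrete, here is a simplified chromosome-instability score computed from copy-number segments such as those produced by circular binary segmentation. The threshold, weighting, and the way amplitude and frequency are combined are illustrative assumptions, not the two CIN index definitions introduced in the thesis.

```python
def cin_score(segments, chrom_length):
    """Simplified chromosome-instability score from copy-number segments.

    segments: list of (start, end, log2_ratio) tuples from a segmentation such
    as CBS. Combines amplitude (|log2 ratio| weighted by the fraction of the
    chromosome each altered segment covers) with a frequency term counting
    altered segments. A simplified illustration, not the thesis's definitions.
    """
    amp_threshold = 0.2   # |log2 ratio| above which a segment counts as altered
    amplitude = sum(abs(r) * (end - start) / chrom_length
                    for start, end, r in segments if abs(r) > amp_threshold)
    frequency = sum(1 for _, _, r in segments if abs(r) > amp_threshold)
    return amplitude * frequency

print(cin_score([(0, 2e7, 0.8), (2e7, 5e7, 0.0), (5e7, 9e7, -0.5)], chrom_length=9e7))
```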
- Computational Analysis of LC-MS/MS Data for Metabolite Identification. Zhou, Bin (Virginia Tech, 2011-11-30). Metabolomics aims at the detection and quantitation of metabolites within a biological system. As the most direct representation of phenotypic changes, metabolomics is an important component of systems biology research. Recent developments in high-resolution, high-accuracy mass spectrometers enable the simultaneous study of hundreds or even thousands of metabolites in one experiment. Liquid chromatography-mass spectrometry (LC-MS) is a commonly used instrument for metabolomic studies due to its high sensitivity and broad coverage of the metabolome. However, the identification of metabolites remains a bottleneck for current metabolomic studies. This thesis focuses on utilizing computational approaches to improve the accuracy and efficiency of metabolite identification in LC-MS/MS-based metabolomic studies. First, an outlier screening approach is developed to identify LC-MS runs with low analytical quality, so that they will not adversely affect the identification of metabolites. The approach is computationally simple but effective, and does not depend on any preprocessing approach. Second, an integrated computational framework is proposed and implemented to improve the accuracy of metabolite identification and to prioritize the multiple putative identifications of one peak in LC-MS data. Through the framework, peaks are more likely to have m/z values that give appropriate putative identifications, and important guidance for metabolite verification is provided by prioritizing the putative identifications. Third, an MS/MS spectral matching algorithm is proposed based on support vector machine classification. The approach provides improved retrieval performance in spectral matching, especially in the presence of data heterogeneity due to different instruments or experimental settings used during MS/MS spectra acquisition.
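For reference, the conventional baseline that an SVM-based spectral matcher would be compared against is a similarity score between binned MS/MS spectra, most commonly the cosine (dot-product) score sketched below; the bin width, mass range, and example peak lists are illustrative.

```python
import numpy as np

def binned_cosine(spec_a, spec_b, bin_width=0.5, max_mz=1000.0):
    """Cosine similarity between two MS/MS spectra on a fixed m/z grid.

    spec_*: lists of (m/z, intensity) peaks. This is the conventional
    spectral-matching baseline; the thesis replaces it with an SVM-based matcher.
    """
    n_bins = int(max_mz / bin_width)

    def to_vector(spec):
        v = np.zeros(n_bins)
        for mz, inten in spec:
            if mz < max_mz:
                v[int(mz / bin_width)] += inten
        return v

    a, b = to_vector(spec_a), to_vector(spec_b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

query = [(89.0, 40.0), (144.1, 100.0), (290.2, 25.0)]
library_hit = [(89.1, 35.0), (144.1, 90.0), (290.1, 30.0)]
print(binned_cosine(query, library_hit))
```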
- Computational Dissection of Composite Molecular Signatures and Transcriptional Modules. Gong, Ting (Virginia Tech, 2009-12-14). This dissertation aims to develop a latent variable modeling framework with which to analyze gene expression profiling data for computational dissection of molecular signatures and transcriptional modules. The first part of the dissertation is focused on extracting pure gene expression signals from tissue or cell mixtures. The main goal of gene expression profiling is to identify the pure signatures of different cell types (such as cancer cells, stromal cells and inflammatory cells) and estimate the concentration of each cell type. In order to accomplish this, a new blind source separation method is developed, namely nonnegative partially independent component analysis (nPICA), for tissue heterogeneity correction (THC). The THC problem is formulated as a constrained optimization problem and solved with a learning algorithm based on geometrical and statistical principles. The second part of the dissertation seeks to identify gene modules from gene expression data to uncover important biological processes in different types of cells. A new gene clustering approach, nonnegative independent component analysis (nICA), is developed for gene module identification. The nICA approach is complemented with an information-theoretic procedure for input sample selection and a novel stability analysis approach for proper dimension estimation. Experimental results showed that the gene modules identified by the nICA approach appear to be significantly enriched in functional annotations in terms of gene ontology (GO) categories. The third part of the dissertation moves from the gene module level down to the DNA sequence level to identify gene regulatory programs by integrating gene expression data and protein-DNA binding data. A sparse hidden component model is first developed for this problem, taking into account a well-known biological principle, i.e., that a gene is most likely regulated by a few regulators. This is followed by the development of a novel computational approach, motif-guided sparse decomposition (mSD), in order to integrate the binding information and gene expression data. These computational approaches are primarily developed for analyzing high-throughput gene expression profiling data. Nevertheless, the proposed methods should be able to be extended to analyze other types of high-throughput data for biomedical research.
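The tissue-heterogeneity problem above is a nonnegative source-separation problem: observed mixtures are (approximately) pure cell-type profiles multiplied by mixing fractions. The sketch below illustrates that structure with scikit-learn's NMF as a generic nonnegative stand-in; it is not nPICA or nICA, and the synthetic profiles, mixing fractions, and initialization are assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
pure_profiles = rng.gamma(2.0, 1.0, size=(500, 3))   # genes x cell types
mixing = rng.dirichlet(alpha=[1, 1, 1], size=20).T    # cell types x mixed samples
mixed = pure_profiles @ mixing                         # observed mixture matrix

# NMF as a generic nonnegative source-separation stand-in (not nPICA/nICA):
# mixed ~ W @ H, with W ~ estimated pure profiles and H ~ per-sample fractions.
model = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(mixed)
H = model.components_
fractions = H / H.sum(axis=0, keepdims=True)           # normalize per sample
print(W.shape, fractions.shape)
```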
- Computational Modeling for Differential Analysis of RNA-seq and Methylation data. Wang, Xiao (Virginia Tech, 2016-08-16). Computational systems biology is an interdisciplinary field that aims to develop computational approaches for a system-level understanding of biological systems. Advances in high-throughput biotechnology offer broad scope and high resolution in multiple disciplines. However, it is still a major challenge to extract biologically meaningful information from the overwhelming amount of data generated from biological systems. Effective computational approaches are of pressing need to reveal the functional components. Thus, in this dissertation work, we aim to develop computational approaches for differential analysis of RNA-seq and methylation data to detect aberrant events associated with cancers. We develop a novel Bayesian approach, BayesIso, to identify differentially expressed isoforms from RNA-seq data. BayesIso features a joint model of the variability of RNA-seq data and the differential state of isoforms. BayesIso can not only account for the variability of RNA-seq data but also combine the differential states of isoforms as hidden variables for differential analysis. The differential states of isoforms are estimated jointly with other model parameters through a sampling process, providing improved performance in detecting isoforms that are less differentially expressed. We also develop a novel probabilistic approach, DM-BLD, in a Bayesian framework to identify differentially methylated genes. The DM-BLD approach features a hierarchical model, built upon Markov random field models, to capture both the local dependency of measured loci and the dependency of methylation change. A Gibbs sampling procedure is designed to estimate the posterior distribution of the methylation change of CpG sites. Then, the differential methylation score of a gene is calculated from the estimated methylation changes of the involved CpG sites, and the significance of genes is assessed by permutation-based statistical tests. We have demonstrated the advantage of the proposed Bayesian approaches over conventional methods for differential analysis of RNA-seq data and methylation data. The joint estimation of the posterior distributions of the variables and model parameters using a sampling procedure has demonstrated an advantage in detecting isoforms or methylated genes with weaker differential signals. The applications to breast cancer data shed light on the molecular mechanisms underlying breast cancer recurrence, aiming to identify new molecular targets for breast cancer treatment.
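The significance assessment mentioned above relies on a standard permutation test. The sketch below shows the generic version for a difference of group means; in DM-BLD the statistic being permuted is the gene-level differential methylation score derived from the Gibbs-sampled CpG changes, and the sample values here are made up.

```python
import numpy as np

def permutation_pvalue(group_a, group_b, n_perm=10000, rng=None):
    """Two-sample permutation test on the difference of group means.

    Returns a two-sided p-value: the fraction of label permutations whose
    absolute mean difference is at least as extreme as the observed one.
    """
    rng = np.random.default_rng(rng)
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = abs(perm[:len(a)].mean() - perm[len(a):].mean())
        count += diff >= observed
    return (count + 1) / (n_perm + 1)

print(permutation_pvalue([0.6, 0.7, 0.8, 0.75], [0.2, 0.3, 0.25, 0.35], n_perm=2000, rng=0))
```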
- Computer Modeling and Simulation of Morphotropic Phase Boundary Ferroelectrics. Rao, Weifeng (Virginia Tech, 2009-07-31). Phase field modeling and simulation is employed to study the underlying mechanism of enhanced electromechanical properties in single crystals and polycrystals of perovskite-type ferroelectrics around the morphotropic phase boundary (MPB). The findings include: (I) Coherent phase decomposition near the MPB in PZT is investigated. It reveals characteristic multidomain microstructures, where nanoscale lamellar domains of tetragonal and rhombohedral phases coexist with well-defined crystallographic orientation relationships and produce coherent diffraction effects. (II) A bridging domain mechanism for explaining the phase coexistence observed around MPBs is presented. It shows that minor domains of the metastable phase spontaneously coexist with and bridge major domains of the stable phase to reduce the total system free energy, which explains the enhanced piezoelectric response around MPBs. (III) We demonstrate a grain size- and composition-dependent behavior of phase coexistence around the MPBs in polycrystals of ferroelectric solid solutions. It shows that grain boundaries impose internal mechanical and electric boundary conditions, which give rise to the grain size effect of phase coexistence, that is, the width of the phase coexistence composition range increases with decreasing grain size. (IV) The domain size effect is explained by the domain wall broadening mechanism. It shows that, under an electric field applied along the nonpolar axis, without domain wall motion, the domain wall broadens and serves as an embryo of the field-induced new phase, producing large reversible strain free from hysteresis. (V) The control mechanisms of domain configurations and sizes in crystallographically engineered ferroelectric single crystals are investigated. It reveals that the highest domain wall densities are obtained with an intermediate magnitude of electric field applied along the non-polar axis of ferroelectric crystals. (VI) The domain-dependent internal electric field associated with the short-range ordering of charged point defects is demonstrated to stabilize engineered domain microstructures. The internal electric field strength is estimated and is in agreement with the magnitude evaluated from available experimental data. (VII) The poling-induced piezoelectric anisotropy in untextured ferroelectric ceramics is investigated. It is found that the maximum piezoelectric response in the poled ceramics is obtained along a macroscopic nonpolar direction, and extrinsic contributions from preferred domain wall motions play a dominant role in the piezoelectric anisotropy and enhancement in the macroscopic nonpolar direction. (VIII) Stress effects on domain microstructure are investigated for the MPB-based ferroelectric polycrystals. It shows that stress alone cannot pole the sample, but can be utilized to reduce the strength of the poling electric field. (IX) The effects of compression on hysteresis loops and domain microstructures of MPB-based ferroelectric polycrystals are investigated. It shows that the longitudinal piezoelectric coefficient can be enhanced by compression, with the best value found when compression is about to initiate the depolarization process.
- Concurrency Optimization for Integrative Network Analysis. Barnes, Robert Otto II (Virginia Tech, 2013-06-12). Virginia Tech's Computational Bioinformatics and Bio-imaging Laboratory (CBIL) is exploring integrative network analysis techniques to identify subnetworks or genetic pathways that contribute to various cancers. Chen et al. developed a bagging Markov random field (BMRF)-based approach which examines gene expression data with prior biological information to reliably identify significant genes and proteins. Random resampling with replacement (bootstrapping, or bagging) is essential for confident results but is computationally demanding, as multiple iterations of the network identification (by simulated annealing) are required. The MATLAB implementation is computationally demanding, employs limited concurrency, and is thus time prohibitive. Using strong software development discipline, we optimize BMRF using algorithmic, compiler, and concurrency techniques (including Nvidia GPUs) to reduce the wall-clock time needed for analysis of large-scale genomic data. In particular, we decompose the BMRF algorithm into functional blocks, implement the algorithm in C/C++, and further explore the C/C++ implementation with concurrency optimization. Experiments are conducted with simulated and real data to demonstrate that a significant speedup of BMRF can be achieved by exploiting concurrency opportunities. We believe that the experience gained from this research will help pave the way for us to develop computationally efficient algorithms leveraging concurrency, enabling researchers to efficiently analyze the larger-scale data sets essential for furthering cancer research.
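Bagging is embarrassingly parallel: each bootstrap resample can be scored independently and the results aggregated. The sketch below shows that structure with Python's process pool; the per-resample scoring is a placeholder, whereas BMRF would run the MRF-based subnetwork identification by simulated annealing, in optimized C/C++ or on GPUs, for each resample.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def one_bootstrap(args):
    """One bagging iteration: resample samples with replacement, then score genes.

    The per-resample scoring here is a placeholder (column means); in BMRF each
    resample would run the MRF-based network identification by simulated annealing.
    """
    data, seed = args
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, data.shape[0], size=data.shape[0])   # resample rows
    return data[idx].mean(axis=0)

def bagged_scores(data, n_boot=100, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one_bootstrap, [(data, s) for s in range(n_boot)]))
    return np.mean(results, axis=0)   # aggregate over bootstrap replicates

if __name__ == "__main__":
    expr = np.random.default_rng(0).normal(size=(200, 50))     # samples x genes
    print(bagged_scores(expr, n_boot=20).shape)
```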