Browsing by Author "Xuan, Jianhua"
Now showing 1 - 20 of 52
- Algorithmic Distribution of Applied Learning on Big Data. Shukla, Manu (Virginia Tech, 2020-10-16). Machine learning and graph techniques are complex and challenging to distribute. Generally, they are distributed by modeling the problem in much the same way as single-node sequential techniques, except applied on smaller chunks of data and compute, with the results then combined. These techniques focus on stitching the results from smaller chunks as the best possible way to have the outcome as close as possible to the sequential results on the entire data. This approach is not feasible in numerous kernel, matrix, optimization, graph, and other techniques where the algorithm needs access to all the data during execution. In this work, we propose key-value pair based distribution techniques that are widely applicable to statistical machine learning techniques along with matrix, graph, and time series based algorithms. The crucial difference from previously proposed techniques is that all operations are modeled as key-value pair based fine- or coarse-grained steps. This allows flexibility in distribution with no compounding error in each step. The distribution is applicable not only in robust disk-based frameworks but also in in-memory systems without significant changes. Key-value pair based techniques also provide the ability to generate the same result as sequential techniques, with no edge or overlap effects to resolve in structures such as graphs or matrices. This thesis focuses on key-value pair based distribution of applied machine learning techniques on a variety of problems. In the first method, key-value pair distribution is used for storytelling at scale. Storytelling connects entities (people, organizations) using their observed relationships to establish meaningful storylines. When performed sequentially, these computations become a bottleneck because the massive number of entities makes space and time complexity untenable. We present DISCRN, or DIstributed Spatio-temporal ConceptseaRch based StorytelliNg, a distributed framework for performing spatio-temporal storytelling. The framework extracts entities from microblogs and event data, and links these entities using a novel ConceptSearch to derive storylines in a distributed fashion utilizing the key-value pair paradigm. Performing these operations at scale allows deeper and broader analysis of storylines. The novel parallelization techniques speed up the generation and filtering of storylines on massive datasets. Experiments with microblog posts such as Twitter data and GDELT (Global Database of Events, Language and Tone) events show the efficiency of the techniques in DISCRN. The second work determines brand perception directly from people's comments in social media. Current techniques for determining brand perception, such as surveys of handpicked users by mail, in person, by phone, or online, are time consuming and increasingly inadequate. The proposed DERIV system distills storylines from open data representing direct consumer voice into a brand perception. The framework summarizes the perception of a brand in comparison to peer brands with in-memory key-value pair based distributed algorithms utilizing supervised machine learning techniques. Experiments performed with open data and models built with storylines of known peer brands show the technique to be highly scalable and accurate in capturing brand perception from vast amounts of social data compared to sentiment analysis.
The third work performs event categorization and prospect identification in social media. The problem is challenging due to the endless amount of information generated daily. In our work, we present DISTL, an event processing and prospect identifying platform. It accepts as input a set of storylines (a sequence of entities and their relationships) and processes them as follows: (1) uses different algorithms (LDA, SVM, information gain, rule sets) to identify themes from storylines; (2) identifies top locations and times in storylines and combines them with themes to generate events that are meaningful in a specific scenario for categorizing storylines; and (3) extracts top prospects as people and organizations from data elements contained in storylines. The output comprises sets of events in different categories, the storylines under them, and the top prospects identified. DISTL utilizes in-memory key-value pair based distributed processing that scales to high data volumes and categorizes generated storylines in near real-time. The fourth work builds flight paths of drones in a distributed manner to survey a large area, taking images to determine growth of vegetation over power lines, while adjusting to terrain and to the number of drones and their capabilities. Drones are increasingly being used to perform risky and labor intensive aerial tasks cheaply and safely. To ensure operating costs are low and flights autonomous, their flight plans must be pre-built. In existing techniques, drone flight paths are not automatically pre-calculated based on drone capabilities and terrain information. We present details of an automated flight plan builder, DIMPL, that pre-builds flight plans for drones tasked with surveying a large area to take photographs of electric poles and identify ones with hazardous vegetation overgrowth. DIMPL employs a distributed in-memory key-value pair based paradigm to process subregions in parallel and build flight paths in a highly efficient manner. The fifth work highlights scaling graph operations, particularly pruning and joins. Linking topics to specific experts in technical documents and finding connections between experts are crucial for detecting the evolution of emerging topics and the relationships between their influencers in state-of-the-art research. Current techniques that make such connections are limited to similarity measures. Methods based on weights such as TF-IDF and frequency are generally utilized to identify important topics, and self-joins between topics and experts to identify connections between experts. However, such approaches are inadequate for identifying emerging keywords and experts, since the most useful terms in technical documents tend to be infrequent and concentrated in just a few documents. This makes connecting experts through joins on large dense graphs challenging. We present DIGDUG, a framework that identifies emerging topics by applying graph operations to technical terms. The framework identifies connections between authors of patents and journal papers by performing joins on connected topics and topics associated with the authors at scale. The problem of scaling the graph operations for topics and experts is solved through dense graph pruning and graph joins categorized under their own scalable separable dense graph class based on key-value pair distribution. Comparing our graph join and pruning technique against multiple graph and join methods in MapReduce revealed a significant improvement in performance using our approach.
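A minimal sketch of the key-value pair paradigm the abstract describes: every operation is expressed as a map step (emit key-value pairs) and a reduce step (combine values per key), so the same logic can run on disk-based or in-memory distributed frameworks. The entity names and event records below are invented for illustration.

```python
from collections import defaultdict

def map_phase(records):
    """Emit (entity, event) key-value pairs from raw records."""
    for event_id, entities in records:
        for entity in entities:
            yield entity, event_id

def reduce_phase(pairs):
    """Group events per entity key; a join on shared events links entities."""
    grouped = defaultdict(set)
    for key, value in pairs:
        grouped[key].add(value)
    return grouped

records = [("e1", ["acme", "bob"]), ("e2", ["bob", "carol"])]
entity_events = reduce_phase(map_phase(records))
# Two entities are storyline candidates if they co-occur in some event.
links = {(a, b) for a in entity_events for b in entity_events
         if a < b and entity_events[a] & entity_events[b]}
print(links)  # {('acme', 'bob'), ('bob', 'carol')}
```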
- Applications of Different Weighting Schemes to Improve Pathway-Based Analysis. Ha, Sook S.; Kim, Inyoung; Wang, Yue; Xuan, Jianhua (Hindawi, 2011-05-22). Conventionally, pathway-based analysis assumes that genes in a pathway contribute equally to a biological function, thus assigning uniform weight to genes. However, this assumption has been shown to be incorrect, and applying uniform weight in pathway analysis may not be appropriate for tasks like molecular classification of diseases, as genes in a functional group may have different predictive power. Hence, we propose to assign different weights to genes in pathway-based analysis and devise four weighting schemes. We applied them in two existing pathway analysis methods using both real and simulated gene expression data for pathways. Among all schemes, the random weighting scheme, which generates random weights and selects the optimal weights minimizing an objective function, performs best in terms of P value or error rate reduction. Weighting changes pathway scoring and brings up some new significant pathways, leading to the detection of disease-related genes that are missed under uniform weight.
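A hedged sketch of the random weighting idea described above: draw random gene weights, score the pathway as a weighted combination of per-gene statistics, and keep the weights minimizing an objective. The gene statistics and the toy objective are invented; the paper minimizes a P value or error rate instead.

```python
import numpy as np

rng = np.random.default_rng(0)
gene_stats = np.array([2.1, 0.3, 1.7, 0.9])  # per-gene test statistics (toy)

def objective(weights):
    # Placeholder objective: reward a large weighted pathway score.
    return -(weights @ gene_stats)

best_w, best_obj = None, np.inf
for _ in range(1000):
    w = rng.dirichlet(np.ones(len(gene_stats)))  # random weights summing to 1
    if objective(w) < best_obj:
        best_w, best_obj = w, objective(w)
print(best_w.round(3), round(best_obj, 3))  # weight mass shifts to strong genes
```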
- BADGE: A novel Bayesian model for accurate abundance quantification and differential analysis of RNA-Seq data. Gu, Jinghua; Wang, Xiao; Hilakivi-Clarke, Leena; Clarke, Robert; Xuan, Jianhua (2014-09-10). Background: Recent advances in RNA sequencing (RNA-Seq) technology have offered unprecedented scope and resolution for transcriptome analysis. However, precise quantification of mRNA abundance and identification of differentially expressed genes are complicated by biological and technical variations in RNA-Seq data. Results: We systematically study the variation in count data and dissect the sources of variation into between-sample variation and within-sample variation. A novel Bayesian framework is developed for joint estimation of gene-level mRNA abundance and differential state, which models the intrinsic variability in RNA-Seq to improve the estimation. Specifically, a Poisson-Lognormal model is incorporated into the Bayesian framework to model within-sample variation; a Gamma-Gamma model is then used to model between-sample variation, which accounts for over-dispersion of read counts among multiple samples. Simulation studies, where sequencing counts are synthesized based on parameters learned from real datasets, have demonstrated the advantage of the proposed method in both quantification of mRNA abundance and identification of differentially expressed genes. Moreover, performance comparison on data from the Sequencing Quality Control (SEQC) Project with ERCC spike-in controls has shown that the proposed method outperforms existing RNA-Seq methods in differential analysis. Application to a breast cancer dataset has further illustrated that the proposed Bayesian model can 'blindly' estimate sources of variation caused by sequencing biases. Conclusions: We have developed a novel Bayesian hierarchical approach to investigate within-sample and between-sample variations in RNA-Seq data. Simulation and real data applications have validated the desirable performance of the proposed method. The software package is available at http://www.cbil.ece.vt.edu/software.htm.
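A toy generative sketch of the two variance layers the abstract separates: a Gamma draw across samples stands in for between-sample over-dispersion, and a lognormal jitter on the Poisson rate stands in for within-sample variation. All parameters are invented; this is not the BADGE model itself, only the count-generating idea.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, base_abundance = 6, 100.0
# Between-sample variation: gene abundance varies from sample to sample.
abundances = rng.gamma(shape=10.0, scale=base_abundance / 10.0, size=n_samples)
# Within-sample variation: lognormal jitter on the Poisson rate, then counts.
rates = abundances * rng.lognormal(mean=0.0, sigma=0.3, size=n_samples)
counts = rng.poisson(rates)
print(counts)  # over-dispersed relative to a plain Poisson(100)
```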
- Bayesian Alignment Model for Analysis of LC-MS-based Omic Data. Tsai, Tsung-Heng (Virginia Tech, 2014-05-22). Liquid chromatography coupled with mass spectrometry (LC-MS) has been widely used in various omic studies for biomarker discovery. Appropriate LC-MS data preprocessing steps are needed to detect true differences between biological groups. Retention time alignment is one of the most important yet challenging preprocessing steps, in order to ensure that ion intensity measurements among multiple LC-MS runs are comparable. In this dissertation, we propose a Bayesian alignment model (BAM) for analysis of LC-MS data. BAM uses Markov chain Monte Carlo (MCMC) methods to draw inference on the model parameters and provides estimates of the retention time variability along with uncertainty measures, enabling a natural framework to integrate information of various sources. From methodology development to practical application, we investigate the alignment problem through three research topics: 1) development of single-profile Bayesian alignment model, 2) development of multi-profile Bayesian alignment model, and 3) application to biomarker discovery research. Chapter 2 introduces the profile-based Bayesian alignment using a single chromatogram, e.g., base peak chromatogram from each LC-MS run. The single-profile alignment model improves on existing MCMC-based alignment methods through 1) the implementation of an efficient MCMC sampler using a block Metropolis-Hastings algorithm, and 2) an adaptive mechanism for knot specification using stochastic search variable selection (SSVS). Chapter 3 extends the model to integrate complementary information that better captures the variability in chromatographic separation. We use Gaussian process regression on the internal standards to derive a prior distribution for the mapping functions. In addition, a clustering approach is proposed to identify multiple representative chromatograms for each LC-MS run. With the Gaussian process prior, these chromatograms are simultaneously considered in the profile-based alignment, which greatly improves the model estimation and facilitates the subsequent peak matching process. Chapter 4 demonstrates the applicability of the proposed Bayesian alignment model to biomarker discovery research. We integrate the proposed Bayesian alignment model into a rigorous preprocessing pipeline for LC-MS data analysis. Through the developed analysis pipeline, candidate biomarkers for hepatocellular carcinoma (HCC) are identified and confirmed on a complementary platform.
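BAM's block Metropolis-Hastings sampler is beyond the scope of a listing, but a toy random-walk Metropolis-Hastings run conveys the inference style: sample the posterior of a retention-time shift between two noisy chromatograms. The Gaussian peak shapes, noise level, prior, and step size are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0, 10, 200)
obs = np.exp(-0.5 * ((t - 4.7) / 0.5) ** 2)          # run shifted by 0.7
obs += rng.normal(0, 0.02, t.size)

def log_post(shift):
    pred = np.exp(-0.5 * ((t - 4.0 - shift) / 0.5) ** 2)  # shifted reference
    return -np.sum((obs - pred) ** 2) / (2 * 0.02 ** 2) - shift ** 2 / 8.0

shift, samples = 0.0, []
for _ in range(5000):
    prop = shift + rng.normal(0, 0.1)                # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(shift):
        shift = prop                                 # accept
    samples.append(shift)
print(np.mean(samples[1000:]))  # posterior mean near the true 0.7 shift
```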
- A Bayesian approach for accurate de novo transcriptome assembly. Shi, Xu; Wang, Xiao; Neuwald, Andrew F.; Hilakivi-Clarke, Leena; Clarke, Robert; Xuan, Jianhua (2021-09-03). De novo transcriptome assembly from billions of RNA-seq reads is very challenging due to alternative splicing and various levels of expression, which often leads to incorrect, mis-assembled transcripts. BayesDenovo addresses this problem by using both a read-guided strategy to accurately reconstruct splicing graphs from the RNA-seq data and a Bayesian strategy to estimate, from these graphs, the probability of transcript expression without penalizing poorly expressed transcripts. Simulation and cell line benchmark studies demonstrate that BayesDenovo is very effective in reducing false positives and achieves much higher accuracy than other assemblers, especially for alternatively spliced genes and for highly or poorly expressed transcripts. Moreover, BayesDenovo is more robust on multiple replicates by assembling a larger portion of common transcripts. When applied to breast cancer data, BayesDenovo identifies phenotype-specific transcripts associated with breast cancer recurrence.
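A minimal sketch of the splicing-graph idea the abstract builds on: exons are nodes, read-supported junctions are weighted edges, and candidate transcripts are source-to-sink paths. The exons and read counts below are invented; BayesDenovo's actual graph construction and Bayesian scoring are far richer.

```python
graph = {  # exon -> {next_exon: junction read support}
    "E1": {"E2": 30, "E3": 12},
    "E2": {"E4": 28},
    "E3": {"E4": 10},
    "E4": {},
}

def transcripts(node, path=None):
    """Enumerate all paths from `node` to a terminal exon."""
    path = (path or []) + [node]
    if not graph[node]:
        yield path
    for nxt in graph[node]:
        yield from transcripts(nxt, path)

for t in transcripts("E1"):
    print("->".join(t))  # E1->E2->E4 and E1->E3->E4: alternative splicing
```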
- Biclustering and Visualization of High Dimensional Data using VIsual Statistical Data Analyzer. Blake, Patrick Michael (Virginia Tech, 2019-01-31). Many data sets have too many features for conventional pattern recognition techniques to work properly. This thesis investigates techniques that alleviate these difficulties. One such technique, biclustering, clusters data in both dimensions and is inherently resistant to the challenges posed by having too many features. However, the algorithms that implement biclustering have limitations in that the user must know at least the structure of the data and how many biclusters to expect. This is where the VIsual Statistical Data Analyzer, or VISDA, can help. It is a visualization tool that successively and progressively explores the structure of the data, identifying clusters along the way. This thesis proposes coupling VISDA with biclustering to overcome some of the challenges of data sets with too many features. Further, to increase the performance, usability, and maintainability as well as reduce costs, VISDA was translated from Matlab to a Python version called VISDApy. Both VISDApy and the overall process were demonstrated with real and synthetic data sets. The results of this work have the potential to improve analysts' understanding of the relationships within complex data sets and their ability to make informed decisions from such data.
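For readers unfamiliar with biclustering (clustering rows and columns jointly), a small sketch using scikit-learn's SpectralBiclustering on synthetic checkerboard data; this is a generic illustration, not the thesis's VISDApy pipeline.

```python
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_checkerboard

# Synthetic matrix with a hidden 3x3 checkerboard of biclusters.
data, rows, cols = make_checkerboard(shape=(30, 30), n_clusters=(3, 3),
                                     noise=1.0, random_state=0)
model = SpectralBiclustering(n_clusters=(3, 3), random_state=0)
model.fit(data)
print(model.row_labels_[:10])     # bicluster assignment per row (sample)
print(model.column_labels_[:10])  # bicluster assignment per column (feature)
```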
- BICORN: An R package for integrative inference of de novo cis-regulatory modules. Chen, Xi; Gu, Jinghua; Neuwald, Andrew F.; Hilakivi-Clarke, Leena; Clarke, Robert; Xuan, Jianhua (Springer Nature, 2020-05-14). Genome-wide transcription factor (TF) binding signal analyses reveal co-localization of TF binding sites based on inferred cis-regulatory modules (CRMs). CRMs play a key role in understanding the cooperation of multiple TFs under specific conditions. However, the functions of CRMs and their effects on nearby gene transcription are highly dynamic and context-specific and therefore are challenging to characterize. BICORN (Bayesian Inference of COoperative Regulatory Network) builds a hierarchical Bayesian model and infers context-specific CRMs based on TF-gene binding events and gene expression data for a particular cell type. BICORN automatically searches for a list of candidate CRMs based on the input TF bindings at regulatory regions associated with genes of interest. Applying Gibbs sampling, BICORN iteratively estimates model parameters of CRMs, TF activities, and corresponding regulation on gene transcription, which it models as a sparse network of functional CRMs regulating target genes. The BICORN package is implemented in R (version 3.4 or later) and is publicly available on the CRAN server at https://cran.r-project.org/web/packages/BICORN/index.html.
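BICORN itself is an R package; to keep the code examples in this listing in one language, here is a generic Gibbs-sampling sketch in Python: draw each unknown in turn from its conditional given the others, the same iterative scheme BICORN applies to CRM membership, TF activity, and regulatory strength. The toy target is a bivariate normal with correlation rho.

```python
import numpy as np

rng = np.random.default_rng(3)
rho, x, y = 0.8, 0.0, 0.0
samples = []
for _ in range(10000):
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))  # draw x | y
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))  # draw y | x
    samples.append((x, y))
# After burn-in, samples approximate the joint posterior.
print(np.corrcoef(np.array(samples[1000:]).T)[0, 1])  # approx 0.8
```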
- BMRF-MI: integrative identification of protein interaction network by modeling the gene dependency. Shi, Xu; Wang, Xiao; Shajahan, Ayesha; Hilakivi-Clarke, Leena; Clarke, Robert; Xuan, Jianhua (2015-06-11). Background: Identification of protein interaction networks is a very important step for understanding the molecular mechanisms in cancer. Several methods have been developed to integrate protein-protein interaction (PPI) data with gene expression data for network identification. However, they often fail to model the dependency between genes in the network, which leaves many important genes, especially the upstream genes, unidentified. It is necessary to develop a method that improves network identification performance by incorporating the dependency between genes. Results: We propose an approach for identifying protein interaction networks by incorporating mutual information (MI) into a Markov random field (MRF) based framework to model the dependency between genes. MI is widely used in information theory to measure the uncertainty between random variables. Unlike the traditional Pearson correlation test, MI is capable of capturing both linear and non-linear relationships between random variables. Among the existing MI estimators, we choose the k-nearest neighbor MI (kNN-MI) estimator, which has been shown to have minimal bias. The estimated MI is integrated with an MRF framework to model the gene dependency in the context of the network. Maximum a posteriori (MAP) estimation is applied to the MRF-based model to estimate the network score. In order to reduce the computational complexity of finding the optimal network, a probabilistic searching algorithm is implemented. We further increase the robustness and reproducibility of the results by applying a non-parametric bootstrapping method to measure the confidence level of the identified genes. To evaluate the performance of the proposed method, we test it on simulation data under different conditions. The experimental results show improved accuracy in terms of subnetwork identification compared to existing methods. Furthermore, we applied our method to real breast cancer patient data; the identified protein interaction network shows a close association with the recurrence of breast cancer, which is supported by functional annotation. We also show that the identified subnetworks can be used to predict the recurrence status of cancer patients by survival analysis. Conclusions: We have developed an integrated approach for protein interaction network identification, which combines the Markov random field framework and mutual information to model the gene dependency in PPI networks. Improvements in subnetwork identification have been demonstrated on simulation datasets compared to existing methods. We then applied our method to breast cancer patient data to identify recurrence-related subnetworks. The experimental results show that the identified genes are enriched in pathways and functional categories relevant to progression and recurrence of breast cancer. Finally, survival analysis based on the identified subnetworks achieves good results in classifying the recurrence status of cancer patients.
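A quick illustration of why a kNN mutual information estimate captures dependencies that Pearson correlation misses, the property the paper exploits; scikit-learn's mutual_info_regression implements a kNN-based (Kraskov-style) MI estimator. The quadratic relationship below is synthetic.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, 500)
y = x ** 2 + rng.normal(0, 0.1, 500)      # non-linear dependence on x
print(np.corrcoef(x, y)[0, 1])            # Pearson: near 0, misses it
print(mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=3))  # MI: large
```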
- caBIG VISDA: modeling, visualization, and discovery for cluster analysis of genomic data. Zhu, Yitan; Li, Huai; Miller, David J.; Wang, Zuyi; Xuan, Jianhua; Clarke, Robert; Hoffman, Eric P.; Wang, Yue (2008-09-18). Background: The main limitations of most existing clustering methods used in genomic data analysis include heuristic or random algorithm initialization, the potential of finding poor local optima, the lack of cluster number detection, an inability to incorporate prior/expert knowledge, black-box and non-adaptive designs, in addition to the curse of dimensionality and the discernment of uninformative, uninteresting cluster structure associated with confounding variables. Results: In an effort to partially address these limitations, we develop the VIsual Statistical Data Analyzer (VISDA) for cluster modeling, visualization, and discovery in genomic data. VISDA performs progressive, coarse-to-fine (divisive) hierarchical clustering and visualization, supported by hierarchical mixture modeling, supervised/unsupervised informative gene selection, supervised/unsupervised data visualization, and user/prior knowledge guidance, to discover hidden clusters within complex, high-dimensional genomic data. The hierarchical visualization and clustering scheme of VISDA uses multiple local visualization subspaces (one at each node of the hierarchy) and consequent subspace data modeling to reveal both global and local cluster structures in a "divide and conquer" scenario. Multiple projection methods, each sensitive to a distinct type of clustering tendency, are used for data visualization, which increases the likelihood that cluster structures of interest are revealed. Initialization of the full dimensional model is based on first learning models with user/prior knowledge guidance on data projected into the low-dimensional visualization spaces. Model order selection for the high dimensional data is accomplished by Bayesian theoretic criteria and user justification applied via the hierarchy of low-dimensional visualization subspaces. Based on its complementary building blocks and flexible functionality, VISDA is generally applicable for gene clustering, sample clustering, and phenotype clustering (wherein phenotype labels for samples are known), albeit with minor algorithm modifications customized to each of these tasks. Conclusion: VISDA achieved robust and superior clustering accuracy, compared with several benchmark clustering schemes. The model order selection scheme in VISDA was shown to be effective for high dimensional genomic data clustering. On muscular dystrophy data and muscle regeneration data, VISDA identified biologically relevant co-expressed gene clusters. VISDA also captured the pathological relationships among different phenotypes revealed at the molecular level, through phenotype clustering on muscular dystrophy data and multi-category cancer data.
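A minimal sketch of the Bayesian model-order-selection idea mentioned above: fit Gaussian mixtures of increasing order and keep the order minimizing a Bayesian criterion (BIC here). The data are synthetic; VISDA applies such criteria within its hierarchy of visualization subspaces rather than in this flat form.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
data = np.vstack([rng.normal(0, 1, (100, 2)),   # cluster 1
                  rng.normal(5, 1, (120, 2))])  # cluster 2
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(data).bic(data)
        for k in range(1, 6)}
print(min(bics, key=bics.get))  # selects 2, the true number of clusters
```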
- Cardiac Signals: Remote Measurement and Applications. Sarkar, Abhijit (Virginia Tech, 2017-08-25). The dissertation investigates the promises and challenges of applying cardiac signals in biometrics and affective computing, and of noninvasive measurement of cardiac signals. We mainly discuss two major cardiac signals: the electrocardiogram (ECG) and the photoplethysmogram (PPG). ECG and PPG signals hold strong potential for biometric authentication and identification. We have shown that by mapping each cardiac beat from the time domain to an angular domain using a limit cycle, intra-class variability can be significantly minimized. This is in contrast to conventional time-domain analysis. Our experiments with both ECG and PPG signals show that the proposed method eliminates the effect of instantaneous heart rate on the shape morphology and improves authentication accuracy. For noninvasive measurement of PPG beats, we have developed a systematic algorithm to extract pulse rate from face video in diverse situations using video magnification. We extract signals from skin patches and then use frequency-domain correlation to filter out non-cardiac signals. We have developed a novel entropy-based method to automatically select skin patches from the face. We report the beat-to-beat accuracy of remote PPG (rPPG) in comparison to conventional average heart rate. Beat-to-beat accuracy is required for applications related to heart rate variability (HRV) and affective computing. The algorithm has been tested on two datasets, one with a static illumination condition and the other with an unrestricted ambient illumination condition. Automatic skin detection is an intermediate step for rPPG. Existing methods always depend on color information to detect human skin. We have developed a novel standalone skin detection method to show that color cues are not necessary for skin detection. We use LBP lacunarity based micro-texture features and a region growing algorithm to find skin pixels in an image. Our experiments show that the proposed method is applicable universally to any image, including near-infrared images. This finding helps to extend the domain of many applications, including rPPG. To the best of our knowledge, this is the first such method that is independent of color cues.
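A hedged sketch of the time-to-angular mapping idea: resample every beat onto a fixed 0..2*pi phase grid so beat morphology becomes independent of instantaneous heart rate. The toy beats and the simple linear resampling are assumptions; the dissertation's limit-cycle construction is more involved.

```python
import numpy as np

def to_angular(beat, n_phase=64):
    """Resample one beat (arbitrary length) onto a fixed phase grid."""
    phase = np.linspace(0.0, 2 * np.pi, num=len(beat))
    grid = np.linspace(0.0, 2 * np.pi, num=n_phase)
    return np.interp(grid, phase, beat)

fast_beat = np.sin(np.linspace(0, np.pi, 40)) ** 2   # short beat (high HR)
slow_beat = np.sin(np.linspace(0, np.pi, 70)) ** 2   # long beat (low HR)
# Same morphology at different heart rates maps to nearly the same template.
print(np.allclose(to_angular(fast_beat), to_angular(slow_beat), atol=1e-2))
```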
- ChIP-BIT2: a software tool to detect weak binding events using a Bayesian integration approach. Chen, Xi; Shi, Xu; Neuwald, Andrew F.; Hilakivi-Clarke, Leena; Clarke, Robert; Xuan, Jianhua (2021-04-15). Background: ChIP-seq combines chromatin immunoprecipitation assays with sequencing and identifies genome-wide binding sites for DNA binding proteins. While many binding sites have strong ChIP-seq ‘peak’ observations and are well captured, there are still regions bound by proteins weakly, with relatively low ChIP-seq signal enrichment. These weak binding sites, especially those at promoters and enhancers, are functionally important because they also regulate nearby gene expression. Yet, it remains a challenge to accurately identify weak binding sites in ChIP-seq data due to the ambiguity in differentiating them from amplified background DNA. Results: ChIP-BIT2 (http://sourceforge.net/projects/chipbitc/) is a software package for ChIP-seq peak detection. ChIP-BIT2 employs a mixture model integrating protein and control ChIP-seq data and predicts strong or weak protein binding sites at promoters, enhancers, or other genomic locations. For binding sites at gene promoters, ChIP-BIT2 simultaneously predicts their target genes. ChIP-BIT2 has been validated on benchmark regions and tested using large-scale ENCODE ChIP-seq data, demonstrating its high accuracy and wide applicability. Conclusion: ChIP-BIT2 is an efficient ChIP-seq peak caller. It provides a better lens to examine weak binding sites and can refine or extend the existing binding site collection, providing additional regulatory regions for decoding the mechanism of gene expression regulation.
- ChIP-BIT: Bayesian inference of target genes using a novel joint probabilistic model of ChIP-seq profiles. Chen, Xi; Jung, Jin-Gyoung; Shajahan-Haq, Ayesha N.; Clarke, Robert; Shih, Ie-Ming; Wang, Yue; Magnani, Luca; Wang, Tian-Li; Xuan, Jianhua (Oxford, 2015-12-23). Chromatin immunoprecipitation with massively parallel DNA sequencing (ChIP-seq) has greatly improved the reliability with which transcription factor binding sites (TFBSs) can be identified from genome-wide profiling studies. Many computational tools have been developed to detect binding events or peaks; however, robust detection of weak binding events remains a challenge for current peak calling tools. We have developed a novel Bayesian approach (ChIP-BIT) to reliably detect TFBSs and their target genes by jointly modeling binding signal intensities and binding locations of TFBSs. Specifically, a Gaussian mixture model is used to capture both binding and background signals in sample data. As a unique feature of ChIP-BIT, background signals are modeled by a local Gaussian distribution that is accurately estimated from the input data. Extensive simulation studies showed a significantly improved performance of ChIP-BIT in target gene prediction, particularly for detecting weak binding signals at gene promoter regions. We applied ChIP-BIT to find target genes from NOTCH3 and PBX1 ChIP-seq data acquired from MCF-7 breast cancer cells. TF knockdown experiments have initially validated about 30% of the co-regulated target genes identified by ChIP-BIT as being differentially expressed in MCF-7 cells. Functional analysis of these genes further revealed the existence of crosstalk between the Notch and Wnt signaling pathways.
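A sketch of the two-component mixture idea shared by the two ChIP-BIT entries above: binding and background signals are modeled as separate Gaussian components, and each region gets a posterior probability of belonging to the binding component. The synthetic log-intensity values below are invented; ChIP-BIT additionally models binding locations and input data, which this sketch omits.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
background = rng.normal(1.0, 0.5, 900)   # most regions: background noise
binding = rng.normal(3.0, 0.6, 100)      # a few regions: true binding
signal = np.concatenate([background, binding]).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(signal)
posterior = gmm.predict_proba(signal)    # per-region posterior over components
print(gmm.means_.ravel())                # roughly [1.0, 3.0] (order may vary)
```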
- ChIP-GSM: Inferring active transcription factor modules to predict functional regulatory elements. Chen, Xi; Neuwald, Andrew F.; Hilakivi-Clarke, Leena; Clarke, Robert; Xuan, Jianhua (PLoS, 2021-07-01). Transcription factors (TFs) often function as a module, including both master factors and mediators, binding at cis-regulatory regions to modulate nearby gene transcription. ChIP-seq profiling of multiple TFs makes it feasible to infer functional TF modules. However, when inferring TF modules based on co-localization of ChIP-seq peaks, many weak binding events are missed, especially for mediators, resulting in incomplete identification of modules. To address this problem, we develop a ChIP-seq data-driven Gibbs Sampler to infer Modules (ChIP-GSM) using a Bayesian framework that integrates ChIP-seq profiles of multiple TFs. ChIP-GSM samples read counts of module TFs iteratively to estimate the binding potential of a module to each region and, across all regions, estimates the module abundance. Using inferred module-region probabilistic bindings as feature units, ChIP-GSM then employs logistic regression to predict active regulatory elements. Validation of ChIP-GSM-predicted regulatory regions on multiple independent datasets sharing the same context confirms the advantage of using TF modules for predicting regulatory activity. In a case study of K562 cells, we demonstrate that the modules inferred by ChIP-GSM form groups, activate gene expression at different time points, and mediate diverse functional cellular processes. Hence, ChIP-GSM infers biologically meaningful TF modules and improves the prediction accuracy of regulatory region activities.
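A sketch of ChIP-GSM's final step as described: use module-region binding probabilities as features in a logistic regression that predicts whether a region is an active regulatory element. The features, toy ground truth, and weights below are all synthetic assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
module_probs = rng.uniform(0, 1, (500, 4))   # P(module m binds region r), toy
active = (module_probs @ np.array([2.0, 1.5, 0.2, 0.1])
          + rng.normal(0, 0.5, 500)) > 2.0   # invented ground-truth labels
clf = LogisticRegression().fit(module_probs, active)
print(clf.coef_.round(2))  # modules 1-2 drive activity in this toy setup
```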
- Collimator Width Optimization in X-ray Luminescent Computed Tomography. Mishra, Sourav (Virginia Tech, 2013-06-17). X-ray Luminescent Computed Tomography (XLCT) is a new imaging modality which is under extensive trials at present. The modality works by selectively exciting X-ray-sensitive nanophosphors and detecting the optical signal thus generated. This system can be used to recreate high quality tomographic slices even with low X-ray dose. Many studies have reported successful validation of the underlying philosophy. However, there is still a lack of information about the optimal settings or combinations of imaging parameters that yield the best outputs. Research groups participating in this area have reported results on the basis of dose, signal to noise ratio, or resolution only. In this thesis, the candidate evaluates XLCT taking noise and resolution into consideration in terms of composite indices. Simulations have been performed for various beam widths, and noise and resolution metrics deduced. This information has been used to evaluate image quality on the basis of the CT Figure of Merit and a modified Wang-Bovik image quality index. Simulations indicate the presence of an optimal setting which can be fixed prior to extensive scans. The conducted study, although focusing on a particular implementation, aims to establish a paradigm for finding the best settings for any XLCT system. Scanning with an optimal setting preconfigured can help vastly reduce the cost and risks involved with this imaging modality.
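For reference, the (unmodified) Wang-Bovik universal image quality index underlying the composite metric is Q = 4*cov(x,y)*mean(x)*mean(y) / ((var(x)+var(y)) * (mean(x)^2+mean(y)^2)), which equals 1 for identical images. A direct implementation, with synthetic images standing in for reconstructed slices; the thesis's modified index may differ:

```python
import numpy as np

def wang_bovik(x, y):
    """Universal image quality index Q in [-1, 1]; Q = 1 iff x == y."""
    x, y = x.ravel().astype(float), y.ravel().astype(float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return 4 * cxy * mx * my / ((vx + vy) * (mx ** 2 + my ** 2))

rng = np.random.default_rng(8)
truth = rng.uniform(0, 1, (32, 32))
noisy = truth + rng.normal(0, 0.05, truth.shape)
print(wang_bovik(truth, truth), wang_bovik(truth, noisy))  # 1.0 vs < 1.0
```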
- A Comparative Study of Machine Learning Models for Multivariate NextG Network Traffic Prediction with SLA-based Loss Function. Baykal, Asude (Virginia Tech, 2023-10-20). As Next Generation (NextG) networks become more complex, the need to develop a robust, reliable network traffic prediction framework for intelligent network management increases. This study compares the performance of machine learning models in network traffic prediction using a custom Service-Level Agreement (SLA)-based loss function to satisfy SLA violation constraints while minimizing overprovisioning. The proposed SLA-based parametric custom loss functions are used to maintain the SLA violation rate percentages the network operators require. Our approach is multivariate, spatiotemporal, and SLA-driven, incorporating 20 Radio Access Network (RAN) features, custom peak traffic time features, and custom mobility-based clustering to leverage spatiotemporal relationships. In this study, five machine learning models are considered: one recurrent neural network (LSTM) model, two encoder-decoder architectures (Transformer and Autoformer), and two gradient-boosted tree models (XGBoost and LightGBM). The prediction performance of the models is evaluated based on metrics such as SLA violation rate constraints, overprovisioning, and the custom SLA-based loss function parameter. According to our evaluations, Transformer models with custom peak time features achieve the minimum overprovisioning volume at a 3% SLA violation constraint. Gradient-boosted tree models have lower overprovisioning volumes at higher SLA violation rates.
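A hedged sketch of an SLA-driven asymmetric loss in the spirit described: under-prediction (traffic exceeding provisioned capacity, i.e. an SLA violation) is penalized `alpha` times harder than over-provisioning. The pinball-style form and the parameter name `alpha` are assumptions; the thesis's exact parametric loss may differ.

```python
import numpy as np

def sla_loss(y_true, y_pred, alpha=10.0):
    """Asymmetric loss: SLA violations cost alpha x more than overprovisioning."""
    err = y_true - y_pred
    return float(np.mean(np.where(err > 0, alpha * err, -err)))

y_true = np.array([100.0, 120.0, 90.0])
print(sla_loss(y_true, np.array([110.0, 130.0, 100.0])))  # all over-provisioned
print(sla_loss(y_true, np.array([90.0, 110.0, 80.0])))    # violations cost more
```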
- Computational Analysis of Genome-Wide DNA Copy Number Changes. Song, Lei (Virginia Tech, 2011-05-03). DNA copy number change is an important form of structural variation in the human genome. Somatic copy number alterations (CNAs) can cause overexpression of oncogenes and loss of tumor suppressor genes in tumorigenesis. Recent development of SNP array technology has facilitated studies of copy number changes at a genome-wide scale with high resolution. Quantitative analysis of somatic CNAs on genes has found broad applications in cancer research. Most tumors exhibit genomic instability at the chromosome scale as a result of genomic mutations dynamically accumulated during tumor progression. Such higher-level cancer genomic characteristics cannot be effectively captured by the analysis of individual genes. We introduce two definitions of a chromosome instability (CIN) index to mathematically and quantitatively characterize genome-wide genomic instability. The proposed CIN indices are derived from CNAs detected using circular binary segmentation and the wavelet transform, and calculate a score based on both the amplitude and frequency of the copy number changes. We generated CIN indices on ovarian cancer subtypes' copy number data and used them as features to train an SVM classifier. The experimental results show promising, high classification accuracy estimated through cross-validation. An additional survival analysis was constructed on CIN scores extracted from the TCGA ovarian cancer dataset and showed considerable correlation between CIN scores and various events and severity in ovarian cancer development. Our methods have been integrated into G-DOC. We expect these newly defined CIN indices to serve as predictors in tumor subtype diagnosis and as a useful tool in cancer research.
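A toy score in the spirit of the CIN indices described: after segmenting copy-number data, combine the amplitude of each altered segment with the fraction of the chromosome it covers. The segment values, amplitude threshold, and weighting below are invented; the thesis derives its indices from circular binary segmentation and wavelet transforms.

```python
segments = [  # (start, end, mean log2 copy-number ratio) from segmentation
    (0, 400, 0.02), (400, 650, 0.85), (650, 1000, -0.60),
]
chrom_len = 1000
# Sum |amplitude| x coverage over segments exceeding an alteration threshold.
cin = sum(abs(m) * (e - s) / chrom_len
          for s, e, m in segments if abs(m) > 0.3)
print(round(cin, 3))  # larger score = more unstable chromosome
```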
- Concurrency Optimization for Integrative Network Analysis. Barnes, Robert Otto II (Virginia Tech, 2013-06-12). Virginia Tech's Computational Bioinformatics and Bio-imaging Laboratory (CBIL) is exploring integrative network analysis techniques to identify subnetworks or genetic pathways that contribute to various cancers. Chen et al. developed a bagging Markov random field (BMRF)-based approach which examines gene expression data with prior biological information to reliably identify significant genes and proteins. Random resampling with replacement (bootstrapping, or bagging) is essential to confident results but is computationally demanding, as multiple iterations of the network identification (by simulated annealing) are required. The MATLAB implementation is computationally demanding, employs limited concurrency, and is thus time prohibitive. Using strong software development discipline, we optimize BMRF with algorithmic, compiler, and concurrency techniques (including Nvidia GPUs) to reduce the wall clock time needed for analysis of large-scale genomic data. In particular, we decompose the BMRF algorithm into functional blocks, implement the algorithm in C/C++, and further explore the C/C++ implementation with concurrency optimization. Experiments are conducted with simulation and real data to demonstrate that a significant speedup of BMRF can be achieved by exploiting concurrency opportunities. We believe the experience gained in this research shall help pave the way for computationally efficient algorithms leveraging concurrency, enabling researchers to efficiently analyze the larger-scale data sets essential for furthering cancer research.
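A sketch of the core concurrency opportunity in bagging: bootstrap replicates are independent, so each network-identification run can execute in parallel. The stand-in function, gene universe, and confidence rule below are assumptions; the thesis works in C/C++ and CUDA rather than Python.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def identify_network(seed):
    """Stand-in for one simulated-annealing network search on a resample."""
    rng = np.random.default_rng(seed)
    sample = rng.choice(100, size=100, replace=True)   # bootstrap resample
    return set(sample[:10].tolist())                   # toy "selected genes"

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:                # replicates in parallel
        results = list(pool.map(identify_network, range(32)))
    # Confidence = fraction of bootstrap runs selecting each gene.
    counts = {g: sum(g in r for r in results) / len(results)
              for g in set().union(*results)}
    print(sorted(counts.items(), key=lambda kv: -kv[1])[:5])
```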
- CyNetSVM: A Cytoscape App for Cancer Biomarker Identification Using Network Constrained Support Vector Machines. Shi, Xu; Banerjee, Sharmi; Chen, Li; Hilakivi-Clarke, Leena; Clarke, Robert; Xuan, Jianhua (PLOS, 2017-01-25). One of the important tasks in cancer research is to identify biomarkers and build classification models for clinical outcome prediction. In this paper, we develop the CyNetSVM software package, implemented in Java and integrated with Cytoscape as an app, to identify network biomarkers using network-constrained support vector machines (NetSVM). The Cytoscape app of NetSVM is specifically designed to improve the usability of NetSVM with the following enhancements: (1) a user-friendly graphical user interface (GUI), (2) a computationally efficient core program and (3) convenient network visualization capability. The CyNetSVM app has been used to analyze breast cancer data to identify network genes associated with breast cancer recurrence. The biological function of these network genes is enriched in signaling pathways associated with breast cancer progression, showing the effectiveness of CyNetSVM for cancer biomarker identification. The CyNetSVM package is available at the Cytoscape App Store and http://sourceforge.net/projects/netsvmjava; a sample data set is also provided at SourceForge.
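A hedged sketch of the network-constrained idea behind NetSVM: add a graph Laplacian penalty w^T L w to a hinge-loss classifier so that connected genes receive similar weights. The tiny network, data, and hyperparameters are invented, and the actual NetSVM formulation may differ in detail; this only illustrates the penalty's effect.

```python
import numpy as np

rng = np.random.default_rng(9)
A = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])  # genes 0 and 1 are connected
L = np.diag(A.sum(1)) - A                        # graph Laplacian
X = rng.normal(0, 1, (200, 3))
y = np.sign(X[:, 0] + X[:, 1] + 0.3 * rng.normal(0, 1, 200))

w, lam, eta = np.zeros(3), 0.5, 0.01
for _ in range(500):                             # subgradient descent
    margins = y * (X @ w)
    grad = -(X[margins < 1] * y[margins < 1, None]).sum(0) / len(y)  # hinge
    grad += 2 * lam * (L @ w)                    # network smoothness penalty
    w -= eta * grad
print(w.round(2))  # weights of connected genes 0 and 1 are pulled together
```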
- Deep Learning Based Proteomic Language Modelling for in-silico Protein Generation. Kesavan Nair, Nitin (Virginia Tech, 2020-09-29). A protein is a biopolymer of amino acids that encodes a particular function. Given that 20 amino acids are possible at each site, even a short protein of 100 amino acids has 20^100 possible variants, making it unrealistic to evaluate all possible sequences in sequence-level space. This search space can be reduced by considering that billions of years of evolution, exerting constant pressure, have left us with only a small subset of protein sequences that carry out particular cellular functions. The portion of amino acid space occupied by actual proteins found in nature is therefore much smaller than that which is possible (Kauffman, 1993). By examining related proteins that share a conserved function and common evolutionary history (hereafter referred to as protein families), it is possible to identify common motifs that are shared. Examination of these motifs allows us to characterize protein families in greater depth and even generate new "in silico" proteins that are not found in nature but exhibit properties of a particular protein family. Using novel deep learning approaches and leveraging the large volume of genomic data now available due to high-throughput DNA sequencing, it is possible to examine protein families at a scale and resolution never before possible. By using this abundance of data to learn high-dimensional representations of amino acid sequences, we show in this work that it is possible to generate novel sequences from a particular protein family. Such a deep sequential model-based approach has great value for bioinformatics and biotechnological applications due to its rapid sampling abilities.
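A toy sketch of the generative idea: learn transition statistics from a protein family's sequences and sample new "in silico" sequences from them. A first-order Markov chain stands in for the thesis's deep sequential model, and the four-sequence "family" is invented.

```python
import numpy as np
from collections import defaultdict

family = ["MKVLAG", "MKVLSG", "MRVLAG", "MKVMAG"]   # toy aligned family
counts = defaultdict(lambda: defaultdict(int))
for seq in family:
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1                            # transition statistics

rng = np.random.default_rng(10)
def sample(length=6, start="M"):
    """Sample a family-like sequence from the learned transitions."""
    seq = start
    while len(seq) < length:
        aas, w = zip(*counts[seq[-1]].items())
        seq += str(rng.choice(aas, p=np.array(w) / sum(w)))
    return seq

print(sample())  # e.g. "MKVLAG": family-like, possibly novel
```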
- A Deep-learning based Approach for Foot Placement Prediction. Lee, Sung-Wook (Virginia Tech, 2023-05-24). Foot placement prediction can be important for exoskeleton and prosthesis controllers, human-robot interaction, and body-worn systems to prevent slips or trips. Previous studies investigating foot placement prediction have been limited to predicting foot placement during the swing phase, and do not fully consider contextual information such as the preceding step or the stance phase before push-off. In this study, a deep learning-based foot placement prediction approach is proposed, in which deep learning models sequentially process data from three IMU sensors mounted on the pelvis and feet. The raw sensor data are pre-processed to generate multi-variable time-series data for training two deep learning models: the first model estimates the gait progression, and the second model subsequently predicts the next foot placement. The ground truth gait phase data and foot placement data were acquired with a motion capture system. Ten healthy subjects were invited to walk naturally at different speeds on a treadmill. In cross-subject learning, the trained models had a mean distance error of 5.93 cm for foot placement prediction. In single-subject learning, the prediction accuracy improved with additional training data, and a mean distance error of 2.60 cm was achieved by fine-tuning the cross-subject validated models with the target subject's data. Even at 25-81% of the gait cycle, mean distance errors were only 6.99 cm and 3.22 cm for cross-subject learning and single-subject learning, respectively.
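A minimal sketch of the two-stage pipeline described: model 1 estimates gait progression (phase) from IMU feature windows, and model 2 takes the window plus the estimated phase and predicts the next foot placement. Simple random-forest regressors stand in for the deep networks, and all data shapes and values are synthetic assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(11)
windows = rng.normal(0, 1, (300, 18))     # 3 IMUs x 6 channels per window (toy)
phase = rng.uniform(0, 100, 300)          # % gait cycle, from motion capture
placement = 0.05 * phase + windows[:, 0] + rng.normal(0, 0.2, 300)  # toy target

m1 = RandomForestRegressor(random_state=0).fit(windows, phase)       # stage 1
feats = np.column_stack([windows, m1.predict(windows)])              # add phase
m2 = RandomForestRegressor(random_state=0).fit(feats, placement)     # stage 2
print(float(np.mean(np.abs(m2.predict(feats) - placement))))  # toy error (cm)
```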