Browsing by Author "Wang, Yue"
Now showing 1 - 20 of 47
- Applications of Different Weighting Schemes to Improve Pathway-Based Analysis. Ha, Sook S.; Kim, Inyoung; Wang, Yue; Xuan, Jianhua (Hindawi, 2011-05-22). Conventionally, pathway-based analysis assumes that genes in a pathway contribute equally to a biological function, thus assigning uniform weight to genes. However, this assumption has been shown to be incorrect, and applying uniform weights in pathway analysis may not be appropriate for tasks like molecular classification of diseases, since genes in a functional group may have different predictive power. Hence, we propose assigning different weights to genes in pathway-based analysis and devise four weighting schemes. We applied them in two existing pathway analysis methods using both real and simulated gene expression data for pathways. Among all schemes, the random weighting scheme, which generates random weights and selects the optimal weights minimizing an objective function, performs best in terms of P value or error rate reduction. Weighting changes pathway scoring and brings up new significant pathways, leading to the detection of disease-related genes that are missed under uniform weighting.
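The random weighting scheme described in this abstract can be sketched in a few lines; a minimal illustration assuming a user-supplied objective function (the paper's actual objectives and pathway analysis methods are not reproduced here), with uniform weights kept as the baseline so the selected weights never do worse:

```python
import random

def random_weighting(n_genes, objective, n_trials=200, seed=0):
    """Sample random non-negative weight vectors (summing to 1) and keep
    the one minimizing the objective; uniform weights serve as the baseline."""
    rng = random.Random(seed)
    best_w = [1.0 / n_genes] * n_genes          # uniform-weight baseline
    best_score = objective(best_w)
    for _ in range(n_trials):
        raw = [rng.random() for _ in range(n_genes)]
        total = sum(raw)
        w = [v / total for v in raw]            # normalize to the simplex
        score = objective(w)
        if score < best_score:
            best_w, best_score = w, score
    return best_w, best_score
```

Because the uniform-weight baseline is evaluated first, the returned score is never worse than uniform weighting on the chosen objective.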
- Asymmetric independence modeling identifies novel gene-environment interactions. Yu, Guoqiang; Miller, David J.; Wu, Chiung-Ting; Hoffman, Eric P.; Liu, Chunyu; Herrington, David M.; Wang, Yue (Springer Nature, 2019-02-21). Most genetic or environmental factors work together in determining complex disease risk. Detecting gene-environment interactions may allow us to elucidate novel and targetable molecular mechanisms of how environmental exposures modify genetic effects. Unfortunately, standard logistic regression (LR) assumes a convenient mathematical structure for the null hypothesis that results in both poor detection power and poor type 1 error control, and is also susceptible to confounding from missing factors, imperfect surrogates, and disease heterogeneity. Here we describe a new baseline framework, the asymmetric independence model (AIM) for case-control studies, and provide mathematical proofs and simulation studies verifying its validity across a wide range of conditions. We show that AIM, unlike LR, mathematically preserves the asymmetric nature of maintaining health versus acquiring a disease, and is thus more powerful and robust for detecting synergistic interactions. We present examples from four clinically discrete domains where AIM identified interactions that were previously either inconsistent or recognized with less statistical certainty.
- Automated Functional Analysis of Astrocytes from Chronic Time-Lapse Calcium Imaging Data. Wang, Yinxue; Shi, Guilai; Miller, David J.; Wang, Yizhi; Wang, Congchao; Broussard, Gerard J.; Wang, Yue; Tian, Lin; Yu, Guoqiang (Frontiers, 2017-07-14). Recent discoveries that astrocytes exert proactive regulatory effects on neural information processing, and that they are deeply involved in normal brain development and disease pathology, have stimulated broad interest in understanding astrocyte functional roles in brain circuits. Measuring astrocyte functional status is now technically feasible, thanks to recent advances in modern microscopy and ultrasensitive, cell-type-specific genetically encoded Ca²⁺ indicators for chronic imaging. However, there is a large gap between the capability of generating large datasets via calcium imaging and the availability of sophisticated analytical tools for decoding astrocyte function. Current practice is essentially manual, which not only limits analysis throughput but also risks introducing bias and missing important information latent in complex, dynamic big data. Here, we report a suite of computational tools, called Functional AStrocyte Phenotyping (FASP), for automatically quantifying the functional status of astrocytes. Considering the complex nature of Ca²⁺ signaling in astrocytes and the low signal-to-noise ratio, FASP is designed on data-driven and probabilistic principles, to flexibly account for various patterns and to perform robustly with noisy data. In particular, FASP explicitly models signal propagation, which rules out the applicability of tools designed for other types of data. We demonstrate the effectiveness of FASP using extensive synthetic and real data sets. The findings by FASP were verified by manual inspection. FASP also detected signals that were missed by purely manual analysis but could be confirmed by more careful manual examination under the guidance of automatic analysis.
All algorithms and the analysis pipeline are packaged into a plugin for Fiji (ImageJ), with the source code freely available online at https://github.com/VTcbil/FASP.
- BACOM2.0 facilitates absolute normalization and quantification of somatic copy number alterations in heterogeneous tumor. Fu, Yi; Yu, Guoqiang; Levine, Douglas A.; Wang, Niya; Shih, Ie-Ming; Zhang, Zhen; Clarke, Robert; Wang, Yue (Springer Nature, 2015-09-09). Most published copy number datasets on solid tumors were obtained from specimens composed of mixed cell populations, for which the varying tumor-stroma proportions are unknown or unreported. The inability to correct for signal mixing represents a major limitation on the use of these datasets for subsequent analyses, such as discerning deletion types or detecting driver aberrations. We describe the BACOM2.0 method with enhanced accuracy and functionality to normalize copy number signals, detect deletion types, estimate tumor purity, quantify true copy numbers, and calculate average-ploidy value. While BACOM has been validated and used with promising results, subsequent BACOM analysis of the TCGA ovarian cancer dataset found that the estimated average tumor purity was lower than expected. In this report, we first show that this lowered estimate of tumor purity is the combined result of imprecise signal normalization and parameter estimation. Then, we describe effective allele-specific absolute normalization and quantification methods that can enhance BACOM applications in many biological contexts even in the presence of various confounders. Finally, we discuss the advantages of BACOM in relation to alternative approaches. Here we detail this revised computational approach, BACOM2.0, and validate its performance in real and simulated datasets.
- Bioinformatic Analysis of Coronary Disease Associated SNPs and Genes to Identify Proteins Potentially Involved in the Pathogenesis of Atherosclerosis. Mao, Chunhong; Howard, Timothy D.; Sullivan, Dan; Fu, Zongming; Yu, Guoqiang; Parker, Sarah J.; Will, Rebecca; Vander Heide, Richard S.; Wang, Yue; Hixson, James; Van Eyk, Jennifer; Herrington, David M. (Open Access Pub, 2017-03-04). Factors that contribute to the onset of atherosclerosis may be elucidated by bioinformatic techniques applied to multiple sources of genomic and proteomic data. The results of genome-wide association studies, such as the CardioGramPlusC4D study, expression data, such as that available from expression quantitative trait loci (eQTL) databases, along with protein interaction and pathway data available in Ingenuity Pathway Analysis (IPA), constitute a substantial set of data amenable to bioinformatics analysis. This study used bioinformatic analyses of recent genome-wide association data to identify a seed set of genes likely associated with atherosclerosis. The set was expanded to include protein interaction candidates to create a network of proteins possibly influencing the onset and progression of atherosclerosis. Local average connectivity (LAC), eigenvector centrality, and betweenness metrics were calculated for the interaction network to identify top gene and protein candidates for a better understanding of the atherosclerotic disease process. The top ranking genes included some known to be involved with cardiovascular disease (APOA1, APOA5, APOB, APOC1, APOC2, APOE, CDKN1A, CXCL12, SCARB1, SMARCA4 and TERT), and others that are less obvious and require further investigation (TP53, MYC, PPARG, YWHAQ, RB1, AR, ESR1, EGFR, UBC and YWHAZ). Collectively these data help define a more focused set of genes that likely play a pivotal role in the pathogenesis of atherosclerosis and are therefore natural targets for novel therapeutic interventions.
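The network metrics named in this abstract are standard graph quantities; a minimal sketch over a plain adjacency matrix (note the paper's exact LAC definition may differ: here it is approximated simply as the average degree of a node's neighbours):

```python
def eigenvector_centrality(adj, n_iter=200):
    """Eigenvector centrality by power iteration on (A + I); the diagonal
    shift keeps the iteration stable on bipartite graphs. Max-normalized."""
    n = len(adj)
    x = [1.0] * n
    for _ in range(n_iter):
        nxt = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = max(nxt)
        x = [v / norm for v in nxt]
    return x

def local_average_connectivity(adj):
    """Simple LAC proxy: average degree of each node's neighbours."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    lac = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j]]
        lac.append(sum(deg[j] for j in nbrs) / len(nbrs) if nbrs else 0.0)
    return lac
```

On a star graph, for example, the hub receives the maximal eigenvector centrality while each leaf's LAC equals the hub's degree.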
- Blockchain-Enabled Next Generation Access Control. Dong, Yibin; Mun, Seong K.; Wang, Yue (Springer, 2022-01-01). In the past two decades, the adoption rate of longitudinal personal health records (LPHRs) has been low in the United States. Patients' privacy and security concerns have been the primary negative factor impacting LPHR adoption. Patients want to control the privacy of their own LPHR across multiple information systems at various facilities. However, little is known about how to model and construct a scalable and interoperable LPHR with patient-controlled privacy and confidentiality that preserves the integrity and availability of patients' health information. Understanding this problem and proposing a practical solution are important for increasing the LPHR adoption rate and improving the efficiency as well as the quality of care. Even with state-of-the-art encryption methodologies applied to patients' data, LPHR patient data privacy is not guaranteed against insider threats unless a set of secure access control policies is implemented. We proposed a definition of a "secure LPHR" and argued that an LPHR is secure when its security and privacy requirements are fulfilled through adopting an access control security model. In searching for an access control model, we enhanced the National Institute of Standards and Technology (NIST) next generation access control (NGAC) model by replacing the centralized access control policy database with a permissioned blockchain peer-to-peer database, which not only eases the race condition in NGAC but also provides patient-managed access control policy update capability. We proposed a novel blockchain-enabled next generation access control (BeNGAC) model to protect the security and privacy of LPHRs. We sketched the BeNGAC and LPHR architectures and identified limitations of the design.
- caBIG VISDA: modeling, visualization, and discovery for cluster analysis of genomic data. Zhu, Yitan; Li, Huai; Miller, David J.; Wang, Zuyi; Xuan, Jianhua; Clarke, Robert; Hoffman, Eric P.; Wang, Yue (2008-09-18). Background: The main limitations of most existing clustering methods used in genomic data analysis include heuristic or random algorithm initialization, the potential of finding poor local optima, the lack of cluster number detection, an inability to incorporate prior/expert knowledge, black-box and non-adaptive designs, in addition to the curse of dimensionality and the discernment of uninformative, uninteresting cluster structure associated with confounding variables. Results: In an effort to partially address these limitations, we develop the VIsual Statistical Data Analyzer (VISDA) for cluster modeling, visualization, and discovery in genomic data. VISDA performs progressive, coarse-to-fine (divisive) hierarchical clustering and visualization, supported by hierarchical mixture modeling, supervised/unsupervised informative gene selection, supervised/unsupervised data visualization, and user/prior knowledge guidance, to discover hidden clusters within complex, high-dimensional genomic data. The hierarchical visualization and clustering scheme of VISDA uses multiple local visualization subspaces (one at each node of the hierarchy) and consequent subspace data modeling to reveal both global and local cluster structures in a "divide and conquer" scenario. Multiple projection methods, each sensitive to a distinct type of clustering tendency, are used for data visualization, which increases the likelihood that cluster structures of interest are revealed. Initialization of the full-dimensional model is based on first learning models with user/prior knowledge guidance on data projected into the low-dimensional visualization spaces.
Model order selection for the high-dimensional data is accomplished by Bayesian theoretic criteria and user justification applied via the hierarchy of low-dimensional visualization subspaces. Based on its complementary building blocks and flexible functionality, VISDA is generally applicable for gene clustering, sample clustering, and phenotype clustering (wherein phenotype labels for samples are known), albeit with minor algorithm modifications customized to each of these tasks. Conclusion: VISDA achieved robust and superior clustering accuracy, compared with several benchmark clustering schemes. The model order selection scheme in VISDA was shown to be effective for high-dimensional genomic data clustering. On muscular dystrophy data and muscle regeneration data, VISDA identified biologically relevant co-expressed gene clusters. VISDA also captured the pathological relationships among different phenotypes revealed at the molecular level, through phenotype clustering on muscular dystrophy data and multi-category cancer data.
- ChIP-BIT: Bayesian inference of target genes using a novel joint probabilistic model of ChIP-seq profiles. Chen, Xi; Jung, Jin-Gyoung; Shajahan-Haq, Ayesha N.; Clarke, Robert; Shih, Ie-Ming; Wang, Yue; Magnani, Luca; Wang, Tian-Li; Xuan, Jianhua (Oxford, 2015-12-23). Chromatin immunoprecipitation with massively parallel DNA sequencing (ChIP-seq) has greatly improved the reliability with which transcription factor binding sites (TFBSs) can be identified from genome-wide profiling studies. Many computational tools have been developed to detect binding events or peaks; however, the robust detection of weak binding events remains a challenge for current peak calling tools. We have developed a novel Bayesian approach (ChIP-BIT) to reliably detect TFBSs and their target genes by jointly modeling binding signal intensities and binding locations of TFBSs. Specifically, a Gaussian mixture model is used to capture both binding and background signals in sample data. As a unique feature of ChIP-BIT, background signals are modeled by a local Gaussian distribution that is accurately estimated from the input data. Extensive simulation studies showed a significantly improved performance of ChIP-BIT in target gene prediction, particularly for detecting weak binding signals at gene promoter regions. We applied ChIP-BIT to find target genes from NOTCH3 and PBX1 ChIP-seq data acquired from MCF-7 breast cancer cells. TF knockdown experiments have initially validated about 30% of co-regulated target genes identified by ChIP-BIT as being differentially expressed in MCF-7 cells. Functional analysis on these genes further revealed the existence of crosstalk between Notch and Wnt signaling pathways.
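The two-component Gaussian mixture idea behind separating binding from background signal can be illustrated with a bare-bones one-dimensional EM fit; this is a sketch only: ChIP-BIT's actual model jointly covers signal intensities and binding locations, with a locally estimated background, none of which is reproduced here.

```python
import math

def em_two_gaussians(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture by EM.
    Returns (weights, means, variances), sorted by mean."""
    mu = [min(data), max(data)]          # spread-out initialization
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component per point
        resp = []
        for x in data:
            p = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
    order = sorted(range(2), key=lambda k: mu[k])
    return ([pi[k] for k in order], [mu[k] for k in order],
            [var[k] for k in order])
```

On well-separated data the low-mean component plays the role of background and the high-mean component the binding signal.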
- Comparative analysis of methods for detecting interacting loci. Chen, Li; Yu, Guoqiang; Langefeld, Carl D.; Miller, David J.; Guy, Richard T.; Raghuram, Jayaram; Yuan, Xiguo; Herrington, David M.; Wang, Yue (Biomed Central, 2011-07-05). Background: Interactions among genetic loci are believed to play an important role in disease risk. While many methods have been proposed for detecting such interactions, their relative performance remains largely unclear, mainly because different data sources, detection performance criteria, and experimental protocols were used in the papers introducing these methods and in subsequent studies. Moreover, there have been very few studies strictly focused on comparison of existing methods. Given the importance of detecting gene-gene and gene-environment interactions, a rigorous, comprehensive comparison of performance and limitations of available interaction detection methods is warranted. Results: We report a comparison of eight representative methods, of which seven were specifically designed to detect interactions among single nucleotide polymorphisms (SNPs), with the last a popular main-effect testing method used as a baseline for performance evaluation. The selected methods, multifactor dimensionality reduction (MDR), full interaction model (FIM), information gain (IG), Bayesian epistasis association mapping (BEAM), SNP harvester (SH), maximum entropy conditional probability modeling (MECPM), logistic regression with an interaction term (LRIT), and logistic regression (LR), were compared on a large number of simulated data sets, each embedding multiple sets of interacting SNPs consistent with complex disease models, under different interaction models. The assessment criteria included several relevant detection power measures, family-wise type I error rate, and computational complexity. There are several important results from this study.
First, while some SNPs in interactions with strong effects are successfully detected, most of the methods miss many interacting SNPs at an acceptable rate of false positives. In this study, the best-performing method was MECPM. Second, the statistical significance assessment criteria, used by some of the methods to control the type I error rate, are quite conservative, thereby limiting their power and making it difficult to fairly compare them. Third, as expected, power varies for different models and as a function of penetrance, minor allele frequency, linkage disequilibrium and marginal effects. Fourth, the analytical relationships between power and these factors are derived, aiding in the interpretation of the study results. Fifth, for these methods the magnitude of the main effect influences the power of the tests. Sixth, most methods can detect some ground-truth SNPs but have modest power to detect the whole set of interacting SNPs. Conclusion: This comparison study provides new insights into the strengths and limitations of current methods for detecting interacting loci. This study, along with freely available simulation tools we provide, should help support development of improved methods. The simulation tools are available at: http://code.google.com/p/simulationtool-bmc-ms9169818735220977/downloads/list.
- Comparative Analysis of Methods for Identifying Recurrent Copy Number Alterations in Cancer. Yuan, Xiguo; Zhang, Junying; Zhang, Shengli; Yu, Guoqiang; Wang, Yue (PLOS, 2012-12-20). Recurrent copy number alterations (CNAs) play an important role in cancer genesis. While a number of computational methods have been proposed for identifying such CNAs, their relative merits remain largely unknown in practice since very few efforts have been focused on comparative analysis of the methods. To facilitate studies of recurrent CNA identification in the cancer genome, it is imperative to conduct a comprehensive comparison of performance and limitations among existing methods. In this paper, six representative methods proposed in the latest six years are compared. These include one-stage and two-stage approaches, working with raw intensity ratio data and discretized data respectively. They are based on various techniques such as kernel regression, correlation matrix diagonal segmentation, semi-parametric permutation and cyclic permutation schemes. We explore multiple criteria, including type I error rate, detection power, Receiver Operating Characteristics (ROC) curve and the area under curve (AUC), and computational complexity, to evaluate performance of the methods under multiple simulation scenarios. We also characterize their abilities on applications to two real datasets obtained from cancers with lung adenocarcinoma and glioblastoma. This comparison study reveals general characteristics of the existing methods for identifying recurrent CNAs, and further provides new insights into their strengths and weaknesses. These insights should help accelerate the development of novel and improved methods.
- Comparative assessment and novel strategy on methods for imputing proteomics data. Shen, Minjie; Chang, Yi-Tan; Wu, Chiung-Ting; Parker, Sarah J.; Saylor, Georgia; Wang, Yizhi; Yu, Guoqiang; Van Eyk, Jennifer E.; Clarke, Robert; Herrington, David M.; Wang, Yue (2022-01-20). Missing values are a major issue in quantitative proteomics analysis. While many methods have been developed for imputing missing values in high-throughput proteomics data, a comparative assessment of imputation accuracy remains inconclusive, mainly because mechanisms contributing to true missing values are complex and existing evaluation methodologies are imperfect. Moreover, few studies have provided an outlook of future methodological development. We first re-evaluate the performance of eight representative methods targeting three typical missing mechanisms. These methods are compared on both simulated and masked missing values embedded within real proteomics datasets, and performance is evaluated using three quantitative measures. We then introduce fused regularization matrix factorization, a low-rank global matrix factorization framework capable of integrating local similarity derived from additional data types. We also explore a biologically inspired latent variable modeling strategy, convex analysis of mixtures, for missing value imputation and present preliminary experimental results. While some winners emerged from our comparative assessment, the evaluation is intrinsically imperfect because performance is evaluated indirectly on artificial missing or masked values rather than on authentic missing values. Nevertheless, we show that our fused regularization matrix factorization provides a novel incorporation of external and local information, and the exploratory implementation of convex analysis of mixtures presents a biologically plausible new approach.
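The low-rank matrix factorization idea behind such imputation can be sketched in a few lines; a toy rank-1 illustration fitting X ≈ UVᵀ by stochastic gradient descent on observed entries only (the paper's fused regularization and local-similarity terms are omitted, and the function name is ours):

```python
import random

def mf_impute(X, rank=1, n_iter=3000, lr=0.05, seed=0):
    """Impute missing entries (None) of a small matrix via low-rank
    factorization X ~ U V^T, fit by SGD on the observed entries."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(m)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n)]
    obs = [(i, j) for i in range(m) for j in range(n) if X[i][j] is not None]
    for _ in range(n_iter):
        for i, j in obs:
            pred = sum(U[i][r] * V[j][r] for r in range(rank))
            err = X[i][j] - pred
            for r in range(rank):
                u, v = U[i][r], V[j][r]
                U[i][r] += lr * err * v   # gradient step on U
                V[j][r] += lr * err * u   # gradient step on V
    # Keep observed values; fill missing ones with the low-rank prediction
    return [[X[i][j] if X[i][j] is not None else
             sum(U[i][r] * V[j][r] for r in range(rank))
             for j in range(n)] for i in range(m)]
```

For exactly rank-1 data the missing entry is determined by the observed ratios, so the fit recovers it up to optimization tolerance.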
- Convex Analysis of Mixtures for Separating Non-negative Well-grounded Sources. Zhu, Yitan; Wang, Niya; Miller, David J.; Wang, Yue (Springer Nature, 2016-12-06). Blind Source Separation (BSS) is a powerful tool for analyzing composite data patterns in many areas, such as computational biology. We introduce a novel BSS method, Convex Analysis of Mixtures (CAM), for separating non-negative well-grounded sources, which learns the mixing matrix by identifying the lateral edges of the convex data scatter plot. We propose and prove a necessary and sufficient condition for identifying the mixing matrix through edge detection in the noise-free case, which enables CAM to identify the mixing matrix not only in the exact-determined and over-determined scenarios, but also in the under-determined scenario. We show the optimality of the edge detection strategy, even for cases where source well-groundedness is not strictly satisfied. The CAM algorithm integrates plug-in noise filtering using sector-based clustering, an efficient geometric convex analysis scheme, and stability-based model order selection. The superior performance of CAM against a panel of benchmark BSS techniques is demonstrated on numerically mixed gene expression data of ovarian cancer subtypes. We apply CAM to dissect dynamic contrast-enhanced magnetic resonance imaging data taken from breast tumors and time-course microarray gene expression data derived from in-vivo muscle regeneration in mice, both producing biologically plausible decomposition results.
- Cosbin: cosine score-based iterative normalization of biologically diverse samples. Wu, Chiung-Ting; Shen, Minjie; Du, Dongping; Cheng, Zuolin; Parker, Sarah J.; Lu, Yingzhou; Van Eyk, Jennifer E.; Yu, Guoqiang; Clarke, Robert; Herrington, David M.; Wang, Yue (Oxford University Press, 2022). Motivation: Data normalization is essential to ensure accurate inference and comparability of gene expression measures across samples or conditions. Ideally, gene expression data should be rescaled based on consistently expressed reference genes. However, to normalize biologically diverse samples, the most commonly used reference genes exhibit striking expression variability, and size-factor or distribution-based normalization methods can be problematic when the amount of asymmetry in differential expression is significant. Results: We report an efficient and accurate data-driven method, Cosine score-based iterative normalization (Cosbin), to normalize biologically diverse samples. Based on the Cosine scores of cross-condition expression patterns, the Cosbin pipeline iteratively eliminates asymmetric differentially expressed genes, identifies consistently expressed genes, and calculates sample-wise normalization factors. We demonstrate the superior performance and enhanced utility of Cosbin compared with six representative peer methods using both simulation and real multi-omics expression datasets. Implemented in open-source R scripts and specifically designed to address normalization bias due to significant asymmetry in differential expression across multiple conditions, the Cosbin tool complements rather than replaces the existing methods and will allow biologists to more accurately detect true molecular signals among diverse phenotypic groups. Availability and implementation: The R scripts of the Cosbin pipeline are freely available at https://github.com/MinjieSh/Cosbin. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
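The core Cosine score can be sketched directly; a minimal illustration scoring one gene's cross-condition expression pattern against the ideal consistently-expressed (constant) direction, with the iterative elimination and normalization-factor steps of the full Cosbin pipeline omitted:

```python
import math

def cosine_score(pattern):
    """Cosine between a gene's cross-condition expression pattern and
    the constant (all-ones) direction; 1.0 means perfectly consistent
    expression across conditions, lower values mean more asymmetry."""
    n = len(pattern)
    dot = sum(pattern)                                   # <pattern, 1>
    norm = math.sqrt(sum(p * p for p in pattern)) * math.sqrt(n)
    return dot / norm if norm else 0.0
```

A flat pattern scores exactly 1.0; a condition-skewed pattern scores lower, flagging it as a differentially expressed gene rather than a reference candidate.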
- COT: an efficient and accurate method for detecting marker genes among many subtypes. Lu, Yingzhou; Wu, Chiung-Ting; Parker, Sarah J.; Cheng, Zuolin; Saylor, Georgia; Van Eyk, Jennifer E.; Yu, Guoqiang; Clarke, Robert; Herrington, David M.; Wang, Yue (Oxford University Press, 2022). Motivation: Ideally, a molecularly distinct subtype would be composed of molecular features that are expressed uniquely in the subtype of interest but in no others, so-called marker genes (MGs). MGs play a critical role in the characterization, classification or deconvolution of tissue or cell subtypes. We and others have recognized that the test statistics used by most methods do not exactly satisfy the MG definition and often identify inaccurate MGs. Results: We report an efficient and accurate data-driven method, formulated as a Cosine-based One-sample Test (COT) in scatter space, to detect MGs among many subtypes using subtype expression profiles. Fundamentally different from existing approaches, the test statistic in COT precisely matches the mathematical definition of an ideal MG. We demonstrate the performance and utility of COT on both simulated and real gene expression and proteomics data. The open-source Python/R tool will allow biologists to efficiently detect MGs and perform a more comprehensive and unbiased molecular characterization of tissue or cell subtypes in many biomedical contexts. Nevertheless, COT complements, rather than replaces, existing methods. Availability and implementation: The Python COT software with a detailed user's manual and a vignette are freely available at https://github.com/MintaYLu/COT. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
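The definition of an ideal marker gene, expressed in exactly one subtype, can be illustrated as a cosine against a one-hot vector; a sketch under the simplification that the strongest subtype is the candidate (see the paper and repository for COT's exact formulation and significance test):

```python
import math

def cot_score(profile):
    """Cosine between a gene's subtype expression profile and the ideal
    one-hot marker vector of its strongest subtype. Returns (score,
    subtype index); a score of 1.0 means an ideal marker gene."""
    norm = math.sqrt(sum(v * v for v in profile))
    if norm == 0:
        return 0.0, None
    best = max(range(len(profile)), key=lambda k: profile[k])
    return profile[best] / norm, best
```

A gene expressed in only one subtype scores exactly 1.0 for that subtype; a gene expressed evenly everywhere scores 1/sqrt(K) for K subtypes.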
- Data-driven detection of subtype-specific differentially expressed genes. Chen, Lulu; Lu, Yingzhou; Wu, Chiung-Ting; Clarke, Robert; Yu, Guoqiang; Van Eyk, Jennifer E.; Herrington, David M.; Wang, Yue (2021-01-11). Among multiple tissue or cell subtypes, subtype-specific differentially expressed genes (SDEGs) are defined as being most upregulated in only one subtype but not in any other. Detecting SDEGs plays a critical role in the molecular characterization and deconvolution of multicellular complex tissues. Classic differential analysis assumes a null hypothesis whose test statistic is not subtype-specific, and thus can produce a high false positive rate and/or low detection power. Here we first introduce a One-Versus-Everyone Fold Change (OVE-FC) test for detecting SDEGs. We then propose a scaled test statistic (OVE-sFC) for assessing the statistical significance of SDEGs that applies a mixture null distribution model and a tailored permutation test. The OVE-FC/sFC test was validated on both type 1 error rate and detection power using extensive simulation data sets generated from real gene expression profiles of purified subtype samples. The OVE-FC/sFC test was then applied to two benchmark gene expression data sets of purified subtype samples and detected many known or previously unknown SDEGs. Subsequent supervised deconvolution results on synthesized bulk expression data, obtained using the SDEGs detected from the independent purified expression data by the OVE-FC/sFC test, showed superior performance in deconvolution accuracy when compared with popular peer methods.
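The OVE-FC statistic is straightforward to sketch: the expression in the candidate subtype divided by the maximum expression across all other subtypes (the scaled OVE-sFC statistic, its mixture null model, and the permutation machinery are not shown):

```python
def ove_fc(profile, eps=1e-12):
    """One-Versus-Everyone Fold Change: expression in the top subtype
    divided by the maximum expression among all other subtypes.
    Returns (fold change, index of the top subtype)."""
    best = max(range(len(profile)), key=lambda k: profile[k])
    others = max(v for k, v in enumerate(profile) if k != best)
    return profile[best] / (others + eps), best
```

A fold change well above 1 for exactly one subtype is the SDEG signature; a gene expressed comparably in two subtypes yields a fold change near 1.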
- DBS: a fast and informative segmentation algorithm for DNA copy number analysis. Ruan, Jun; Liu, Zhen; Sun, Ming; Wang, Yue; Yue, Junqiu; Yu, Guoqiang (2019-01-03). Background: Genome-wide DNA copy number changes are the hallmark events in the initiation and progression of cancers. Quantitative analysis of somatic copy number alterations (CNAs) has broad applications in cancer research. With the increasing capacity of high-throughput sequencing technologies, fast and efficient segmentation algorithms are required when characterizing high-density CNA data. Results: A fast and informative segmentation algorithm, DBS (Deviation Binary Segmentation), is developed and discussed. The DBS method is based on the least absolute error principle and is inspired by the segmentation method rooted in the circular binary segmentation procedure. DBS uses point-by-point model calculation to ensure the accuracy of segmentation and combines a binary search algorithm with heuristics derived from the Central Limit Theorem. The DBS algorithm is very efficient, with a computational complexity of O(n log n), and is faster than its predecessors. Moreover, DBS measures the change-point amplitude of the mean values of two adjacent segments at a breakpoint, where the significance of the change-point amplitude is determined by the weighted average deviation at breakpoints. Accordingly, using the constructed binary tree of significance, DBS informs whether the results of segmentation are over- or under-segmented. Conclusion: DBS is implemented in a platform-independent and open-source Java application (ToolSeg), including a graphical user interface and simulation data generation, as well as various segmentation methods in the native Java language.
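Generic binary segmentation, the family DBS belongs to, can be sketched as a recursive split at the point of largest mean-shift amplitude; a simplified illustration only (DBS itself uses least-absolute-error fitting and Central Limit Theorem heuristics for significance, which are not reproduced here):

```python
def binary_segment(signal, min_size=3, threshold=1.0):
    """Recursive binary segmentation: split where the mean-shift
    amplitude between the two resulting segments is largest, and
    recurse while that amplitude exceeds the threshold.
    Returns a list of (start, end) half-open segment bounds."""
    def mean(xs):
        return sum(xs) / len(xs)

    def split(lo, hi, out):
        best_k, best_amp = None, threshold
        for k in range(lo + min_size, hi - min_size + 1):
            amp = abs(mean(signal[lo:k]) - mean(signal[k:hi]))
            if amp > best_amp:
                best_k, best_amp = k, amp
        if best_k is None:
            out.append((lo, hi))        # no significant breakpoint left
        else:
            split(lo, best_k, out)
            split(best_k, hi, out)

    out = []
    split(0, len(signal), out)
    return out
```

On a signal with a single copy number step, the procedure recovers the breakpoint and returns two segments.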
- Editorial. Bourlard, Hervé; Pitas, Ioannis; Lam, Kenneth Kin-Man; Wang, Yue (2004-04-21)
- An Efficient Architecture For Networking Event-Based Fluorescent Imaging Analysis Processes. Bright, Mark D. (Virginia Tech, 2023-01). Complex user-end procedures for the execution of computationally expensive processes and tools on high performance computing platforms can hinder the scientific progress of researchers across many domains. In addition, such processes occasionally cannot be executed on user-end platforms, either due to insufficient hardware resources or unacceptably long computing times. Such circumstances currently extend to highly sophisticated algorithms and tools utilized for analysis of fluorescent imaging data. Although an extensive collection of cloud-computing solutions exist allowing software developers to resolve these issues, such solutions often abstract both developers and integrators from the executing hardware particulars and can inadvertently incentivize non-ideal software design practices. The discussion herein consists of the theoretical design and real-world realization of an efficient architecture to enable direct multi-user parallel remote utilization of such research tools. Said networked, scalable, real-time architecture is multi-tier, extensible by design to a vast collection of application archetypes, and is not strictly limited to imaging analysis applications. Transport layer interfaces for packetized binary data transmission, asynchronous command issuance mechanisms, compression and decompression algorithm aggregation, and relational database management systems for inter-tier communication intermediation enable a robust, lightweight, and efficient architecture for networking and remotely interfacing with fluorescent imaging analysis processes.
- Fused feature signatures to probe tumour radiogenomics relationships. Xia, Tian; Kumar, Ashnil; Fulham, Michael; Feng, Dagan; Wang, Yue; Kim, Eun Young; Jung, Younhyun; Kim, Jinman (2022-02-09). The study of radiogenomics relationships (RRs) aims to identify statistically significant correlations between medical image features and molecular characteristics from analysing tissue samples. Previous radiogenomics studies mainly relied on a single category of image feature extraction techniques (ETs); these are (i) handcrafted ETs that encompass visual imaging characteristics, curated from the knowledge of human experts, and (ii) deep ETs that quantify abstract-level imaging characteristics from large data. Prior studies therefore failed to leverage the complementary information that is accessible by fusing the ETs. In this study, we propose a fused feature signature (FFSig): a selection of image features from handcrafted and deep ETs (e.g., transfer learning and fine-tuning of deep learning models). We evaluated the FFSig's ability to better represent RRs compared to individual ET approaches with two public datasets: the first dataset, used to build the FFSig, comprised 89 patients with non-small cell lung cancer (NSCLC) with gene expression data and CT images of the thorax and the upper abdomen for each patient; the second NSCLC dataset, comprising 117 patients with CT images and RNA-Seq data, was used as the validation set. Our results show that our FFSig encoded complementary imaging characteristics of tumours and identified more RRs with a broader range of genes that are related to important biological functions such as tumourigenesis. We suggest that the FFSig has the potential to identify important RRs that may assist cancer diagnosis and treatment in the future.
- Gene Selection for Multiclass Prediction by Weighted Fisher Criterion. Xuan, Jianhua; Wang, Yue; Dong, Yibin; Feng, Yuanjian; Wang, Bin; Khan, Javed; Bakay, Maria; Wang, Zuyi; Pachman, Lauren; Winokur, Sara; Chen, Yi-Wen; Clarke, Robert; Hoffman, Eric P. (2007-07-10). Gene expression profiling has been widely used to study molecular signatures of many diseases and to develop molecular diagnostics for disease prediction. Gene selection, as an important step for improved diagnostics, screens tens of thousands of genes and identifies a small subset that discriminates between disease types. A two-step gene selection method is proposed to identify informative gene subsets for accurate classification of multiclass phenotypes. In the first step, individually discriminatory genes (IDGs) are identified by using a one-dimensional weighted Fisher criterion (wFC). In the second step, jointly discriminatory genes (JDGs) are selected by sequential search methods, based on their joint class separability measured by a multidimensional weighted Fisher criterion (wFC). The performance of the selected gene subsets for multiclass prediction is evaluated by artificial neural networks (ANNs) and/or support vector machines (SVMs). By applying the proposed IDG/JDG approach to two microarray studies, that is, small round blue cell tumors (SRBCTs) and muscular dystrophies (MDs), we successfully identified a much smaller yet efficient set of JDGs for diagnosing SRBCTs and MDs with high prediction accuracies (96.9% for SRBCTs and 92.3% for MDs, resp.). These experimental results demonstrated that the two-step gene selection method is able to identify a subset of highly discriminative genes for improved multiclass prediction.
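The one-dimensional Fisher criterion used to rank individually discriminatory genes can be sketched per gene; an unweighted version for illustration (the paper's wFC adds class-pair weights, which are omitted here):

```python
def fisher_score(values, labels):
    """One-dimensional Fisher criterion for a single gene:
    between-class scatter divided by within-class scatter."""
    classes = sorted(set(labels))
    overall = sum(values) / len(values)
    between = within = 0.0
    for c in classes:
        xs = [v for v, l in zip(values, labels) if l == c]
        mu = sum(xs) / len(xs)
        between += len(xs) * (mu - overall) ** 2      # class-mean spread
        within += sum((x - mu) ** 2 for x in xs)      # in-class spread
    return between / within if within else float("inf")
```

A gene whose class means are well separated relative to its in-class variance gets a high score; a gene whose class means coincide scores zero.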