Browsing by Author "Chen, Li"
- Comparative analysis of methods for detecting interacting loci
  Chen, Li; Yu, Guoqiang; Langefeld, Carl D.; Miller, David J.; Guy, Richard T.; Raghuram, Jayaram; Yuan, Xiguo; Herrington, David M.; Wang, Yue (Biomed Central, 2011-07-05)
  Background: Interactions among genetic loci are believed to play an important role in disease risk. While many methods have been proposed for detecting such interactions, their relative performance remains largely unclear, mainly because different data sources, detection performance criteria, and experimental protocols were used in the papers introducing these methods and in subsequent studies. Moreover, there have been very few studies strictly focused on comparison of existing methods. Given the importance of detecting gene-gene and gene-environment interactions, a rigorous, comprehensive comparison of performance and limitations of available interaction detection methods is warranted. Results: We report a comparison of eight representative methods, of which seven were specifically designed to detect interactions among single nucleotide polymorphisms (SNPs), with the eighth, a popular main-effect testing method, used as a baseline for performance evaluation. The selected methods, multifactor dimensionality reduction (MDR), full interaction model (FIM), information gain (IG), Bayesian epistasis association mapping (BEAM), SNP harvester (SH), maximum entropy conditional probability modeling (MECPM), logistic regression with an interaction term (LRIT), and logistic regression (LR), were compared on a large number of simulated data sets, each of which, consistent with complex disease models, embeds multiple sets of interacting SNPs under different interaction models. The assessment criteria included several relevant detection power measures, family-wise type I error rate, and computational complexity. There are several important results from this study.
First, while some SNPs in interactions with strong effects are successfully detected, most of the methods miss many interacting SNPs at an acceptable rate of false positives. In this study, the best-performing method was MECPM. Second, the statistical significance assessment criteria, used by some of the methods to control the type I error rate, are quite conservative, thereby limiting their power and making it difficult to fairly compare them. Third, as expected, power varies for different models and as a function of penetrance, minor allele frequency, linkage disequilibrium and marginal effects. Fourth, the analytical relationships between power and these factors are derived, aiding in the interpretation of the study results. Fifth, for these methods the magnitude of the main effect influences the power of the tests. Sixth, most methods can detect some ground-truth SNPs but have modest power to detect the whole set of interacting SNPs. Conclusion: This comparison study provides new insights into the strengths and limitations of current methods for detecting interacting loci. This study, along with freely available simulation tools we provide, should help support development of improved methods. The simulation tools are available at: http://code.google.com/p/simulationtool-bmc-ms9169818735220977/downloads/list.
- CyNetSVM: A Cytoscape App for Cancer Biomarker Identification Using Network Constrained Support Vector Machines
  Shi, Xu; Banerjee, Sharmi; Chen, Li; Hilakivi-Clarke, Leena; Clarke, Robert; Xuan, Jianhua (PLOS, 2017-01-25)
  One of the important tasks in cancer research is to identify biomarkers and build classification models for clinical outcome prediction. In this paper, we develop a CyNetSVM software package, implemented in Java and integrated with Cytoscape as an app, to identify network biomarkers using network-constrained support vector machines (NetSVM). The Cytoscape app of NetSVM is specifically designed to improve the usability of NetSVM with the following enhancements: (1) a user-friendly graphical user interface (GUI), (2) a computationally efficient core program, and (3) a convenient network visualization capability. The CyNetSVM app has been used to analyze breast cancer data to identify network genes associated with breast cancer recurrence. The biological function of these network genes is enriched in signaling pathways associated with breast cancer progression, showing the effectiveness of CyNetSVM for cancer biomarker identification. The CyNetSVM package is available at the Cytoscape App Store and http://sourceforge.net/projects/netsvmjava; a sample data set is also provided at sourceforge.net.
- The effects of abrasion on liquid-fabric interaction of selected nonwoven fabrics
  Chen, Li (Virginia Tech, 1996-05-10)
  The purpose of this research was to investigate and compare the effects of different abrasion treatments on the liquid-fabric interaction of selected nonwoven barrier fabrics. The abrasion treatments included moderate and severe abrasion, flat and flat/flex abrasion, and dry and wet abrasion. The liquid-fabric interactions included wetting/wicking, retention, and penetration through nonwoven fabrics using water/surfactant solution. Results of this study indicated that abrasion treatments increased the wetting/wicking rate of fabrics. The flat/flex abrasion caused a greater increase in the wetting/wicking rate of fabrics than the flat abrasion. Abrasion treatments also increased liquid penetration. The flat abrasion increased liquid penetration more than flat/flex abrasion. On increasing abrasion severity, there was a significant increase in liquid penetration. There was no consistent effect on liquid retention. It was highly influenced by fabric types. Wet abrasion did not differ significantly from dry abrasion in its effects on liquid/fabric interaction. Six nonwoven fabrics used in this study included a hydroentangled cotton fabric with a fluorochemical finish (HCF), a hydroentangled cotton fabric laminated with a microporous film (HCE), a spunbonded polypropylene with microporous film (PSM), a four layer laminated nonwoven including spunbonded polypropylene, microporous film, hydroentangled cotton layer, and spunbonded polypropylene (PECP), a spun-bonded, melt-blown, spun-bonded polypropylene (SMS), and standard Tyvek®. Among the six fabrics, the cotton fabrics with a fluorochemical finish (HCF) and the cotton fabric with a microporous film (HCE) showed an excellent potential as protective material, since they provided high liquid resistance before and after abrasion.
However, there was no consistent trend for microporous film fabrics or for cotton containing fabrics to provide a good liquid protection. In general, it was concluded that abrasion significantly decreased liquid protection of protective fabrics.
- Identification of PBX1 Target Genes in Cancer Cells by Global Mapping of PBX1 Binding Sites
  Thiaville, Michelle M.; Stoeck, Alexander; Chen, Li; Wu, Ren-Chin; Magnani, Luca; Oidtman, Jessica; Shih, Ie-Ming; Lupien, Mathieu; Wang, Tian-Li (PLOS, 2012-05-02)
  PBX1 is a TALE homeodomain transcription factor involved in organogenesis and tumorigenesis. Although it has been shown that ovarian, breast, and melanoma cancer cells depend on PBX1 for cell growth and survival, the molecular mechanism of how PBX1 promotes tumorigenesis remains unclear. Here, we applied an integrated approach by overlapping PBX1 ChIP-chip targets with the PBX1-regulated transcriptome in ovarian cancer cells to identify genes whose transcription was directly regulated by PBX1. We further determined if PBX1 target genes identified in ovarian cancer cells were co-overexpressed with PBX1 in carcinoma tissues. By analyzing TCGA gene expression microarray datasets from ovarian serous carcinomas, we found co-upregulation of PBX1 and a significant number of its direct target genes. Among the PBX1 target genes, a homeodomain protein MEOX1 whose DNA binding motif was enriched in PBX1-immunoprecipitated DNA sequences was selected for functional analysis. We demonstrated that MEOX1 protein interacts with PBX1 protein and inhibition of MEOX1 yields a similar growth inhibitory phenotype as PBX1 suppression. Furthermore, ectopically expressed MEOX1 functionally rescued the PBX1-withdrawn effect, suggesting MEOX1 mediates the cellular growth signal of PBX1. These results demonstrate that MEOX1 is a critical target gene and cofactor of PBX1 in ovarian cancers.
- Identifying cancer biomarkers by network-constrained support vector machines
  Chen, Li; Xuan, Jianhua; Riggins, Rebecca B.; Clarke, Robert; Wang, Yue (2011-10-12)
  Background: One of the major goals in gene and protein expression profiling of cancer is to identify biomarkers and build classification models for prediction of disease prognosis or treatment response. Many traditional statistical methods, based on microarray gene expression data alone and individual genes' discriminatory power, often fail to identify biologically meaningful biomarkers thus resulting in poor prediction performance across data sets. Nonetheless, the variables in multivariable classifiers should synergistically interact to produce more effective classifiers than individual biomarkers. Results: We developed an integrated approach, namely network-constrained support vector machine (netSVM), for cancer biomarker identification with an improved prediction performance. The netSVM approach is specifically designed for network biomarker identification by integrating gene expression data and protein-protein interaction data. We first evaluated the effectiveness of netSVM using simulation studies, demonstrating its improved performance over state-of-the-art network-based methods and gene-based methods for network biomarker identification. We then applied the netSVM approach to two breast cancer data sets to identify prognostic signatures for prediction of breast cancer metastasis. The experimental results show that: (1) network biomarkers identified by netSVM are highly enriched in biological pathways associated with cancer progression; (2) prediction performance is much improved when tested across different data sets. Specifically, many genes related to apoptosis, cell cycle, and cell proliferation, which are hallmark signatures of breast cancer metastasis, were identified by the netSVM approach.
More importantly, several novel hub genes, biologically important with many interactions in PPI network but often showing little change in expression as compared with their downstream genes, were also identified as network biomarkers; the genes were enriched in signaling pathways such as TGF-beta signaling pathway, MAPK signaling pathway, and JAK-STAT signaling pathway. These signaling pathways may provide new insight to the underlying mechanism of breast cancer metastasis. Conclusions: We have developed a network-based approach for cancer biomarker identification, netSVM, resulting in an improved prediction performance with network biomarkers. We have applied the netSVM approach to breast cancer gene expression data to predict metastasis in patients. Network biomarkers identified by netSVM reveal potential signaling pathways associated with breast cancer metastasis, and help improve the prediction performance across independent data sets.
- Identifying protein interaction subnetworks by a bagging Markov random field-based method
  Chen, Li; Xuan, Jianhua; Riggins, Rebecca B.; Wang, Yue; Clark, Robert L. (Nucleic Acids Research, 2013)
  Identification of differentially expressed subnetworks from protein-protein interaction (PPI) networks has become increasingly important to our global understanding of the molecular mechanisms that drive cancer. Several methods have been proposed for PPI subnetwork identification, but the dependency among network member genes is not explicitly considered, leaving many important hub genes largely unidentified. We present a new method, based on a bagging Markov random field (BMRF) framework, to improve subnetwork identification for mechanistic studies of breast cancer. The method follows a maximum a posteriori principle to form a novel network score that explicitly considers pairwise gene interactions in PPI networks, and it searches for subnetworks with maximal network scores. To improve their robustness across data sets, a bagging scheme based on bootstrapping samples is implemented to statistically select high confidence subnetworks. We first compared the BMRF-based method with existing methods on simulation data to demonstrate its improved performance. We then applied our method to breast cancer data to identify PPI subnetworks associated with breast cancer progression and/or tamoxifen resistance. The experimental results show that the BMRF approach not only achieves improved prediction performance when tested on independent data sets, but also reveals biologically meaningful subnetworks that are relevant to breast cancer and tamoxifen resistance.
- Integrative Modeling and Analysis of High-throughput Biological Data
  Chen, Li (Virginia Tech, 2010-12-15)
  Computational biology is an interdisciplinary field that focuses on developing mathematical models and algorithms to interpret biological data so as to understand biological problems. With current high-throughput technology development, different types of biological data can be measured on a large scale, which calls for more sophisticated computational methods to analyze and interpret the data. In this dissertation research work, we propose novel methods to integrate, model and analyze multiple biological data, including microarray gene expression data, protein-DNA interaction data and protein-protein interaction data. These methods will help improve our understanding of biological systems. First, we propose a knowledge-guided multi-scale independent component analysis (ICA) method for biomarker identification on time course microarray data. Guided by a knowledge gene pool related to a specific disease under study, the method can determine disease relevant biological components from ICA modes and then identify biologically meaningful markers related to the specific disease. We have applied the proposed method to yeast cell cycle microarray data and Rsf-1-induced ovarian cancer microarray data. The results show that our knowledge-guided ICA approach can extract biologically meaningful regulatory modes and outperform several baseline methods for biomarker identification. Second, we propose a novel method for transcriptional regulatory network identification by integrating gene expression data and protein-DNA binding data. The approach is built upon a multi-level analysis strategy designed for suppressing false positive predictions. With this strategy, a regulatory module becomes increasingly significant as more relevant gene sets are formed at finer levels.
At each level, a two-stage support vector regression (SVR) method is utilized to reduce false positive predictions by integrating binding motif information and gene expression data; a significance analysis procedure is followed to assess the significance of each regulatory module. The resulting performance on simulation data and yeast cell cycle data shows that the multi-level SVR approach outperforms other existing methods in the identification of both regulators and their target genes. We have further applied the proposed method to breast cancer cell line data to identify condition-specific regulatory modules associated with estrogen treatment. Experimental results show that our method can identify biologically meaningful regulatory modules related to estrogen signaling and action in breast cancer. Third, we propose a bootstrapping Markov random field (MRF)-based method for subnetwork identification on microarray data by incorporating protein-protein interaction data. Methodologically, an MRF-based network score is first derived by considering the dependency among genes to increase the chance of selecting hub genes. A modified simulated annealing search algorithm is then utilized to find the optimal/suboptimal subnetworks with maximal network score. A bootstrapping scheme is finally implemented to generate confident subnetworks. Experimentally, we have compared the proposed method with other existing methods, and the resulting performance on simulation data shows that the bootstrapping MRF-based method outperforms other methods in identifying ground truth subnetwork and hub genes. We have then applied our method to breast cancer data to identify significant subnetworks associated with drug resistance. The identified subnetworks not only show good reproducibility across different data sets, but also indicate several pathways and biological functions potentially associated with the development of breast cancer and drug resistance.
In addition, we propose to develop network-constrained support vector machines (SVM) for cancer classification and prediction, by taking into account the network structure to construct classification hyperplanes. The simulation study demonstrates the effectiveness of our proposed method. The study on the real microarray data sets shows that our network-constrained SVM, together with the bootstrapping MRF-based subnetwork identification approach, can achieve better classification performance compared with conventional biomarker selection approaches and SVMs. We believe that the research presented in this dissertation provides novel and effective methods to model and analyze different types of biological data. The extensive experiments on several real microarray data sets and their results also show the potential to improve the understanding of biological mechanisms related to cancers by generating novel hypotheses for further study.
- Knowledge-guided multi-scale independent component analysis for biomarker identification
  Chen, Li; Xuan, Jianhua; Wang, Chen; Shih, Ie-Ming; Wang, Yue; Zhang, Zhen; Hoffman, Eric P.; Clarke, Robert (2008-10-06)
  Background: Many statistical methods have been proposed to identify disease biomarkers from gene expression profiles. However, from gene expression profile data alone, statistical methods often fail to identify biologically meaningful biomarkers related to a specific disease under study. In this paper, we develop a novel strategy, namely knowledge-guided multi-scale independent component analysis (ICA), to first infer regulatory signals and then identify biologically relevant biomarkers from microarray data. Results: Since gene expression levels reflect the joint effect of several underlying biological functions, disease-specific biomarkers may be involved in several distinct biological functions. To identify disease-specific biomarkers that provide unique mechanistic insights, a meta-data "knowledge gene pool" (KGP) is first constructed from multiple data sources to provide important information on the likely functions (such as gene ontology information) and regulatory events (such as promoter responsive elements) associated with potential genes of interest. The gene expression and biological meta data associated with the members of the KGP can then be used to guide subsequent analysis. ICA is then applied to multi-scale gene clusters to reveal regulatory modes reflecting the underlying biological mechanisms. Finally, disease-specific biomarkers are extracted by their weighted connectivity scores associated with the extracted regulatory modes. A statistical significance test is used to evaluate the significance of transcription factor enrichment for the extracted gene set based on motif information. We applied the proposed method to yeast cell cycle microarray data and Rsf-1-induced ovarian cancer microarray data.
The results show that our knowledge-guided ICA approach can extract biologically meaningful regulatory modes and outperform several baseline methods for biomarker identification. Conclusion: We have proposed a novel method, namely knowledge-guided multi-scale ICA, to identify disease-specific biomarkers. The goal is to infer knowledge-relevant regulatory signals and then identify corresponding biomarkers through a multi-scale strategy. The approach has been successfully applied to two expression profiling experiments to demonstrate its improved performance in extracting biologically meaningful and disease-related biomarkers. More importantly, the proposed approach shows promising results to infer novel biomarkers for ovarian cancer and extend current knowledge.
- Mathematical modelling of transcriptional heterogeneity identifies novel markers and subpopulations in complex tissues
  Wang, Niya; Hoffman, Eric P.; Chen, Lulu; Chen, Li; Zhang, Zhen; Liu, Chunyu; Yu, Guoqiang; Herrington, David M.; Clarke, Robert; Wang, Yue (Springer Nature, 2016-01-07)
  Tissue heterogeneity is both a major confounding factor and an underexploited information source. While a handful of reports have demonstrated the potential of supervised computational methods to deconvolute tissue heterogeneity, these approaches require a priori information on the marker genes or composition of known subpopulations. To address the critical problem of the absence of validated marker genes for many (including novel) subpopulations, we describe convex analysis of mixtures (CAM), a fully unsupervised in silico method for identifying subpopulation marker genes directly from the original mixed gene expressions in scatter space, which can improve molecular analyses in many biological contexts. Validated with predesigned mixtures, CAM on the gene expression data from peripheral leukocytes, brain tissue, and yeast cell cycle revealed novel marker genes that were otherwise undetectable using existing methods. Importantly, CAM requires no a priori information on the number, identity, or composition of the subpopulations present in mixed samples, and does not require the presence of pure subpopulations in sample space. This advantage is significant in that CAM can achieve all of its goals using only a small number of heterogeneous samples, and is more powerful in distinguishing between phenotypically similar subpopulations.
- Motif-directed network component analysis for regulatory network inference
  Wang, Chen; Xuan, Jianhua; Chen, Li; Zhao, Po; Wang, Yue; Clarke, Robert; Hoffman, Eric P. (2008-02-13)
  Background: Network Component Analysis (NCA) has shown its effectiveness in discovering regulators and inferring transcription factor activities (TFAs) when both microarray data and ChIP-on-chip data are available. However, an NCA scheme is not applicable to many biological studies due to limited topology information available, such as lack of ChIP-on-chip data. We propose a new approach, motif-directed NCA (mNCA), to integrate motif information and gene expression data to infer regulatory networks. Results: We develop motif-directed NCA (mNCA) to incorporate motif information into NCA for regulatory network inference. While motif information is readily available from knowledge databases, it is a "noisy" source of network topology information consisting of many false positives. To overcome this problem, we develop a stability analysis procedure embedded in mNCA to resolve the inconsistency between motif information and gene expression data, and to enable the identification of stable TFAs. The mNCA approach has been applied to a time course microarray data set of muscle regeneration. The experimental results show that the inferred TFAs are not only numerically stable but also biologically relevant to the muscle differentiation process. In particular, several inferred TFAs like those of MyoD, myogenin and YY1 are well supported by biological experiments. Conclusion: A novel computational approach, mNCA, has been developed to integrate motif information and gene expression data for regulatory network reconstruction. Specifically, motif analysis is used to obtain initial network topology, and stability analysis is developed and applied with mNCA to extract stable TFAs.
Experimental results on muscle regeneration microarray data have demonstrated that mNCA is a practical and reliable computational method for regulatory network inference and pathway discovery.
- Motif-guided sparse decomposition of gene expression data for regulatory module identification
  Gong, Ting; Xuan, Jianhua; Chen, Li; Riggins, Rebecca B.; Li, Huai; Hoffman, Eric P.; Clarke, Robert; Wang, Yue (2011-03-22)
  Background: Genes work coordinately as gene modules or gene networks. Various computational approaches have been proposed to find gene modules based on gene expression data; for example, gene clustering is a popular method for grouping genes with similar gene expression patterns. However, traditional gene clustering often yields unsatisfactory results for regulatory module identification because the resulting gene clusters are co-expressed but not necessarily co-regulated. Results: We propose a novel approach, motif-guided sparse decomposition (mSD), to identify gene regulatory modules by integrating gene expression data and DNA sequence motif information. The mSD approach is implemented as a two-step algorithm comprising estimates of (1) transcription factor activity and (2) the strength of the predicted gene regulation event(s). Specifically, a motif-guided clustering method is first developed to estimate the transcription factor activity of a gene module; sparse component analysis is then applied to estimate the regulation strength, and so predict the target genes of the transcription factors. The mSD approach was first tested for its improved performance in finding regulatory modules using simulated and real yeast data, revealing functionally distinct gene modules enriched with biologically validated transcription factors. We then demonstrated the efficacy of the mSD approach on breast cancer cell line data and uncovered several important gene regulatory modules related to endocrine therapy of breast cancer. Conclusion: We have developed a new integrated strategy, namely motif-guided sparse decomposition (mSD) of gene expression data, for regulatory module identification.
The mSD method features a novel motif-guided clustering method for transcription factor activity estimation by finding a balance between co-regulation and co-expression. The mSD method further utilizes a sparse decomposition method for regulation strength estimation. The experimental results show that such a motif-guided strategy can provide context-specific regulatory modules in both yeast and breast cancer studies.
- One thousand plant transcriptomes and the phylogenomics of green plants
  Leebens-Mack, James H.; Barker, Michael S.; Carpenter, Eric J.; Deyholos, Michael K.; Gitzendanner, Matthew A.; Graham, Sean W.; Grosse, Ivo; Li, Zheng; Melkonian, Michael; Mirarab, Siavash; Porsch, Martin; Quint, Marcel; Rensing, Stefan A.; Soltis, Douglas E.; Soltis, Pamela S.; Stevenson, Dennis W.; Ullrich, Kristian K.; Wickett, Norman J.; DeGironimo, Lisa; Edger, Patrick P.; Jordon-Thaden, Ingrid E.; Joya, Steve; Liu, Tao; Melkonian, Barbara; Miles, Nicholas W.; Pokorny, Lisa; Quigley, Charlotte; Thomas, Philip; Villarreal, Juan Carlos; Augustin, Megan M.; Barrett, Matthew D.; Baucom, Regina S.; Beerling, David J.; Benstein, Ruben Maximilian; Biffin, Ed; Brockington, Samuel F.; Burge, Dylan O.; Burris, Jason N.; Burris, Kellie P.; Burtet-Sarramegna, Valerie; Caicedo, Ana L.; Cannon, Steven B.; Cebi, Zehra; Chang, Ying; Chater, Caspar; Cheeseman, John M.; Chen, Tao; Clarke, Neil D.; Clayton, Harmony; Covshoff, Sarah; Crandall-Stotler, Barbara J.; Cross, Hugh; dePamphilis, Claude W.; Der, Joshua P.; Determann, Ron; Dickson, Rowan C.; Di Stilio, Veronica S.; Ellis, Shona; Fast, Eva; Feja, Nicole; Field, Katie J.; Filatov, Dmitry A.; Finnegan, Patrick M.; Floyd, Sandra K.; Fogliani, Bruno; Garcia, Nicolas; Gateble, Gildas; Godden, Grant T.; Goh, Falicia (Qi Yun); Greiner, Stephan; Harkess, Alex; Heaney, James Mike; Helliwell, Katherine E.; Heyduk, Karolina; Hibberd, Julian M.; Hodel, Richard G. J.; Hollingsworth, Peter M.; Johnson, Marc T. J.; Jost, Ricarda; Joyce, Blake; Kapralov, Maxim V.; Kazamia, Elena; Kellogg, Elizabeth A.; Koch, Marcus A.; Von Konrat, Matt; Konyves, Kalman; Kutchan, Toni M.; Lam, Vivienne; Larsson, Anders; Leitch, Andrew R.; Lentz, Roswitha; Li, Fay-Wei; Lowe, Andrew J.; Ludwig, Martha; Manos, Paul S.; Mavrodiev, Evgeny; McCormick, Melissa K.; McKain, Michael; McLellan, Tracy; McNeal, Joel R.; Miller, Richard E.; Nelson, Matthew N.; Peng, Yanhui; Ralph, Paula E.; Real, Daniel; Riggins, Chance W.; Ruhsam, Markus; Sage, Rowan F.; Sakai, Ann K.; Scascitella, Moira; Schilling, Edward E.; Schlosser, Eva-Marie; Sederoff, Heike; Servick, Stein; Sessa, Emily B.; Shaw, A. Jonathan; Shaw, Shane W.; Sigel, Erin M.; Skema, Cynthia; Smith, Alison G.; Smithson, Ann; Stewart, C. Neal, Jr.; Stinchcombe, John R.; Szovenyi, Peter; Tate, Jennifer A.; Tiebel, Helga; Trapnell, Dorset; Villegente, Matthieu; Wang, Chun-Neng; Weller, Stephen G.; Wenzel, Michael; Weststrand, Stina; Westwood, James H.; Whigham, Dennis F.; Wu, Shuangxiu; Wulff, Adrien S.; Yang, Yu; Zhu, Dan; Zhuang, Cuili; Zuidof, Jennifer; Chase, Mark W.; Pires, J. Chris; Rothfels, Carl J.; Yu, Jun; Chen, Cui; Chen, Li; Cheng, Shifeng; Li, Juanjuan; Li, Ran; Li, Xia; Lu, Haorong; Ou, Yanxiang; Sun, Xiao; Tan, Xuemei; Tang, Jingbo; Tian, Zhijian; Wang, Feng; Wang, Jun; Wei, Xiaofeng; Xu, Xun; Yan, Zhixiang; Yang, Fan; Zhong, Xiaoni; Zhou, Feiyu; Zhu, Ying; Zhang, Yong; Ayyampalayam, Saravanaraj; Barkman, Todd J.; Nam-Phuong Nguyen; Matasci, Naim; Nelson, David R.; Sayyari, Erfan; Wafula, Eric K.; Walls, Ramona L.; Warnow, Tandy; An, Hong; Arrigo, Nils; Baniaga, Anthony E.; Galuska, Sally; Jorgensen, Stacy A.; Kidder, Thomas I.; Kong, Hanghui; Lu-Irving, Patricia; Marx, Hannah E.; Qi, Xinshuai; Reardon, Chris R.; Sutherland, Brittany L.; Tiley, George P.; Welles, Shana R.; Yu, Rongpei; Zhan, Shing; Gramzow, Lydia; Theissen, Gunter; Wong, Gane Ka-Shu (2019-10-31)
  Green plants (Viridiplantae) include around 450,000-500,000 species(1,2) of great diversity and have important roles in terrestrial and aquatic ecosystems. Here, as part of the One Thousand Plant Transcriptomes Initiative, we sequenced the vegetative transcriptomes of 1,124 species that span the diversity of plants in a broad sense (Archaeplastida), including green plants (Viridiplantae), glaucophytes (Glaucophyta) and red algae (Rhodophyta). Our analysis provides a robust phylogenomic framework for examining the evolution of green plants. Most inferred species relationships are well supported across multiple species tree and supermatrix analyses, but discordance among plastid and nuclear gene trees at a few important nodes highlights the complexity of plant genome evolution, including polyploidy, periods of rapid speciation, and extinction. Incomplete sorting of ancestral variation, polyploidization and massive expansions of gene families punctuate the evolutionary history of green plants.
Notably, we find that large expansions of gene families preceded the origins of green plants, land plants and vascular plants, whereas whole-genome duplications are inferred to have occurred repeatedly throughout the evolution of flowering plants and ferns. The increasing availability of high-quality plant genome sequences and advances in functional genomics are enabling research on genome evolution across the green tree of life.
- Statistical Machine Learning for Multi-platform Biomedical Data Analysis
  Chen, Li (Virginia Tech, 2011-08-24)
  Recent advances in biotechnologies have enabled multiplatform and large-scale quantitative measurements of biomedical events. The need to analyze the vast amount of imaging and genomic data produced stimulates various novel applications of statistical machine learning methods in many areas of biomedical research. The main objective is to assist biomedical investigators to better interpret, analyze, and understand the biomedical questions based on the acquired data. Given the computational challenges imposed by these high-dimensional and complex data, machine learning research finds new opportunities and roles. In this dissertation, we propose to develop, test and apply novel statistical machine learning methods to analyze the data mainly acquired by dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) and single nucleotide polymorphism (SNP) microarrays. The research work focuses on: (1) tissue-specific compartmental analysis for dynamic contrast-enhanced MR imaging of complex tumors; (2) computational analysis for detecting DNA SNP interactions in genome-wide association studies. DCE-MRI provides a noninvasive method for evaluating tumor vasculature patterns based on contrast accumulation and washout. Compartmental analysis is a widely used mathematical tool to model dynamic imaging data and can provide accurate pharmacokinetics parameter estimates. However, the partial volume effect (PVE) present in imaging data can have a profound effect on the accuracy of pharmacokinetics studies. We therefore propose a convex analysis of mixtures (CAM) algorithm to explicitly eliminate PVE by expressing the kinetics in each pixel as a nonnegative combination of underlying compartments and subsequently identifying pure volume pixels at the corners of the clustered pixel time series scatter plot.
The algorithm is supported by a series of newly proved theorems and additional noise filtering and normalization preprocessing. We demonstrate the principle and feasibility of the CAM approach together with compartmental modeling on realistic synthetic data, and compare the accuracy of parameter estimates obtained using CAM or other relevant techniques. Experimental results show a significant improvement in the accuracy of kinetic parameter estimation. We then apply the algorithm to real DCE-MRI data of breast cancer and observe improved pharmacokinetic parameter estimation that separates tumor tissue into sub-regions with differential tracer kinetics on a pixel-by-pixel basis and reveals biologically plausible tumor tissue heterogeneity patterns. This method combines the advantages of multivariate clustering, convex optimization and compartmental modeling approaches. Interactions among genetic loci are believed to play an important role in disease risk. Because of the high dimensionality of SNP data (normally several million SNPs in genome-wide association studies), the combinatorial search and statistical evaluation required to detect multi-locus interactions constitute a challenging computational task. While many approaches have been proposed for detecting such interactions, their relative performance remains largely unclear, because performance was evaluated on different data sources, using different performance measures, and under different experimental protocols. Given the importance of detecting gene-gene interactions, a thorough evaluation of the performance and limitations of available methods, a theoretical analysis of the interaction effect and the genetic factors it depends on, and the development of more efficient methods are warranted. Therefore, we perform a computational analysis to detect interactions among SNPs.
The contributions are four-fold: (1) developed simulation tools for evaluating the performance of any technique designed to detect interactions among genetic variants in case-control studies; (2) used these tools to compare the performance of five popular SNP detection methods; (3) derived analytic relationships between power and the genetic factors, which not only support the experimental results but also give a quantitative link between the interaction effect and these factors; and (4) based on the novel insights gained from the comparative and theoretical analysis, developed an efficient, statistically principled method, hybrid correlation-based association (HCA), to detect interacting SNPs. The HCA algorithm is based on three correlation-based statistics, which are designed to measure the strength of multi-locus interaction for three different interaction types, covering a large portion of possible interactions. Moreover, to maximize the detection power (sensitivity) while suppressing the false positive rate (or retaining moderate specificity), we also devised a strategy to hybridize these three statistics on a case-by-case basis. A heuristic search strategy is also proposed to substantially reduce the computational complexity, especially for high-order interaction detection. We have tested HCA in both a simulation study and a real disease study. HCA and the selected peer methods were compared on a large number of simulated datasets, each including multiple sets of interaction models. The assessment criteria included several power measures, family-wise type I error rate, and computational complexity. The experimental results of HCA on the simulation data indicate promising performance, with a good balance between detection accuracy and computational complexity. When run on multiple real datasets, HCA also replicates plausible biomarkers reported in the literature.
- Unsupervised Deconvolution of Dynamic Imaging Reveals Intratumor Vascular Heterogeneity and Repopulation Dynamics Chen, Li; Choyke, Peter L.; Wang, Niya; Clarke, Robert; Bhujwalla, Zaver M.; Hillman, Elizabeth M. C.; Wang, Ge; Wang, Yue (PLOS, 2014-11-07) With the existence of biologically distinctive malignant cells originating within the same tumor, intratumor functional heterogeneity is present in many cancers and is often manifested by intermingled vascular compartments with distinct pharmacokinetics. However, intratumor vascular heterogeneity cannot be resolved directly by most in vivo dynamic imaging. We developed multi-tissue compartment modeling (MTCM), a completely unsupervised method of deconvoluting dynamic imaging series from heterogeneous tumors that can improve vascular characterization in many biological contexts. Applying MTCM to dynamic contrast-enhanced magnetic resonance imaging of breast cancers revealed characteristic intratumor vascular heterogeneity and therapeutic responses that were otherwise undetectable. MTCM is readily applicable to other dynamic imaging modalities for studying intratumor functional and phenotypic heterogeneity, together with a variety of foreseeable applications in the clinic.