Browsing by Author "Herrington, David M."
Now showing 1 - 13 of 13
- Age-related variations in the methylome associated with gene expression in human monocytes and T cells
  Reynolds, Lindsay M.; Taylor, Jackson R.; Ding, Jingzhong; Lohman, Kurt; Johnson, Craig; Siscovick, David; Burke, Gregory L.; Post, Wendy; Shea, Steven; Jacobs, David R. Jr.; Stunnenberg, Hendrik G.; Kritchevsky, Stephen B.; Hoeschele, Ina; McCall, Charles E.; Herrington, David M.; Tracy, Russell P.; Liu, Yongmei (Springer Nature, 2014-11)
  Age-related variations in DNA methylation have been reported; however, the functional relevance of these differentially methylated sites (age-dMS) is unclear. Here we report potentially functional age-dMS, defined as age- and cis-gene expression-associated methylation sites (age-eMS), identified by integrating genome-wide CpG methylation and gene expression profiles collected ex vivo from circulating T cells (227 CD4+ samples) and monocytes (1,264 CD14+ samples, age range: 55-94 years). None of the age-eMS detected in 227 T-cell samples are detectable in 1,264 monocyte samples, in contrast to the majority of age-dMS detected in T cells that replicated in monocytes. Age-eMS tend to be hypomethylated with older age, located in predicted enhancers and preferentially linked to expression of antigen processing and presentation genes. These results identify and characterize potentially functional age-related methylation in human T cells and monocytes, and provide novel insights into the role age-dMS may have in the aging process.
- Asymmetric independence modeling identifies novel gene-environment interactions
  Yu, Guoqiang; Miller, David J.; Wu, Chiung-Ting; Hoffman, Eric P.; Liu, Chunyu; Herrington, David M.; Wang, Yue (Springer Nature, 2019-02-21)
  Most genetic or environmental factors work together in determining complex disease risk. Detecting gene-environment interactions may allow us to elucidate novel and targetable molecular mechanisms by which environmental exposures modify genetic effects. Unfortunately, standard logistic regression (LR) assumes a convenient mathematical structure for the null hypothesis which, however, results in both poor detection power and type 1 error, and is also susceptible to missing factor, imperfect surrogate, and disease heterogeneity confounding effects. Here we describe a new baseline framework, the asymmetric independence model (AIM) in case-control studies, and provide mathematical proofs and simulation studies verifying its validity across a wide range of conditions. We show that AIM mathematically preserves the asymmetric nature of maintaining health versus acquiring a disease, unlike LR, and thus is more powerful and robust to detect synergistic interactions. We present examples from four clinically discrete domains where AIM identified interactions that were previously either inconsistent or recognized with less statistical certainty.
- Bioinformatic Analysis of Coronary Disease Associated SNPs and Genes to Identify Proteins Potentially Involved in the Pathogenesis of Atherosclerosis
  Mao, Chunhong; Howard, Timothy D.; Sullivan, Dan; Fu, Zongming; Yu, Guoqiang; Parker, Sarah J.; Will, Rebecca; Vander Heide, Richard S.; Wang, Yue; Hixson, James; Van Eyk, Jennifer; Herrington, David M. (Open Access Pub, 2017-03-04)
  Factors that contribute to the onset of atherosclerosis may be elucidated by bioinformatic techniques applied to multiple sources of genomic and proteomic data. The results of genome wide association studies, such as the CardioGramPlusC4D study, expression data, such as that available from expression quantitative trait loci (eQTL) databases, along with protein interaction and pathway data available in Ingenuity Pathway Analysis (IPA), constitute a substantial set of data amenable to bioinformatics analysis. This study used bioinformatic analyses of recent genome wide association data to identify a seed set of genes likely associated with atherosclerosis. The set was expanded to include protein interaction candidates to create a network of proteins possibly influencing the onset and progression of atherosclerosis. Local average connectivity (LAC), eigenvector centrality, and betweenness metrics were calculated for the interaction network to identify top gene and protein candidates for a better understanding of the atherosclerotic disease process. The top ranking genes included some known to be involved with cardiovascular disease (APOA1, APOA5, APOB, APOC1, APOC2, APOE, CDKN1A, CXCL12, SCARB1, SMARCA4 and TERT), and others that are less obvious and require further investigation (TP53, MYC, PPARG, YWHAQ, RB1, AR, ESR1, EGFR, UBC and YWHAZ). Collectively these data help define a more focused set of genes that likely play a pivotal role in the pathogenesis of atherosclerosis and are therefore natural targets for novel therapeutic interventions.
- Blood monocyte transcriptome and epigenome analyses reveal loci associated with human atherosclerosis
  Liu, Yongmei; Reynolds, Lindsay M.; Ding, Jingzhong; Hou, Li; Lohman, Kurt; Young, Tracey; Cui, Wei; Huang, Zhiqing; Grenier, Carole; Wan, Ma; Stunnenberg, Hendrik G.; Siscovick, David; Hou, Lifang; Psaty, Bruce M.; Rich, Stephen S.; Rotter, Jerome I.; Kaufman, Joel D.; Burke, Gregory L.; Murphy, Susan F.; Jacobs, David R. Jr.; Post, Wendy; Hoeschele, Ina; Bell, Douglas A.; Herrington, David M.; Parks, John S.; Tracy, Russell P.; McCall, Charles E.; Stein, James H. (Springer Nature, 2017-08-30)
  Little is known regarding the epigenetic basis of atherosclerosis. Here we present the CD14+ blood monocyte transcriptome and epigenome signatures associated with human atherosclerosis. The transcriptome signature includes transcription coactivator, ARID5B, which is known to form a chromatin derepressor complex with a histone H3K9Me2-specific demethylase and promote adipogenesis and smooth muscle development. ARID5B CpG (cg25953130) methylation is inversely associated with both ARID5B expression and atherosclerosis, consistent with this CpG residing in an ARID5B enhancer region, based on chromatin capture and histone marks data. Mediation analysis supports assumptions that ARID5B expression mediates effects of cg25953130 methylation and several cardiovascular disease risk factors on atherosclerotic burden. In lipopolysaccharide-stimulated human THP1 monocytes, ARID5B knockdown reduced expression of genes involved in atherosclerosis-related inflammatory and lipid metabolism pathways, and inhibited cell migration and phagocytosis. These data suggest that ARID5B expression, possibly regulated by an epigenetically controlled enhancer, promotes atherosclerosis by dysregulating immunometabolism towards a chronic inflammatory phenotype.
- Comparative analysis of methods for detecting interacting loci
  Chen, Li; Yu, Guoqiang; Langefeld, Carl D.; Miller, David J.; Guy, Richard T.; Raghuram, Jayaram; Yuan, Xiguo; Herrington, David M.; Wang, Yue (Biomed Central, 2011-07-05)
  Background: Interactions among genetic loci are believed to play an important role in disease risk. While many methods have been proposed for detecting such interactions, their relative performance remains largely unclear, mainly because different data sources, detection performance criteria, and experimental protocols were used in the papers introducing these methods and in subsequent studies. Moreover, there have been very few studies strictly focused on comparison of existing methods. Given the importance of detecting gene-gene and gene-environment interactions, a rigorous, comprehensive comparison of performance and limitations of available interaction detection methods is warranted.
  Results: We report a comparison of eight representative methods, of which seven were specifically designed to detect interactions among single nucleotide polymorphisms (SNPs), with the last a popular main-effect testing method used as a baseline for performance evaluation. The selected methods, multifactor dimensionality reduction (MDR), full interaction model (FIM), information gain (IG), Bayesian epistasis association mapping (BEAM), SNP harvester (SH), maximum entropy conditional probability modeling (MECPM), logistic regression with an interaction term (LRIT), and logistic regression (LR) were compared on a large number of simulated data sets, each, consistent with complex disease models, embedding multiple sets of interacting SNPs, under different interaction models. The assessment criteria included several relevant detection power measures, family-wise type I error rate, and computational complexity. There are several important results from this study. First, while some SNPs in interactions with strong effects are successfully detected, most of the methods miss many interacting SNPs at an acceptable rate of false positives. In this study, the best-performing method was MECPM. Second, the statistical significance assessment criteria, used by some of the methods to control the type I error rate, are quite conservative, thereby limiting their power and making it difficult to fairly compare them. Third, as expected, power varies for different models and as a function of penetrance, minor allele frequency, linkage disequilibrium and marginal effects. Fourth, the analytical relationships between power and these factors are derived, aiding in the interpretation of the study results. Fifth, for these methods the magnitude of the main effect influences the power of the tests. Sixth, most methods can detect some ground-truth SNPs but have modest power to detect the whole set of interacting SNPs.
  Conclusion: This comparison study provides new insights into the strengths and limitations of current methods for detecting interacting loci. This study, along with freely available simulation tools we provide, should help support development of improved methods. The simulation tools are available at: http://code.google.com/p/simulationtool-bmc-ms9169818735220977/downloads/list.
- Comparative assessment and novel strategy on methods for imputing proteomics data
  Shen, Minjie; Chang, Yi-Tan; Wu, Chiung-Ting; Parker, Sarah J.; Saylor, Georgia; Wang, Yizhi; Yu, Guoqiang; Van Eyk, Jennifer E.; Clarke, Robert; Herrington, David M.; Wang, Yue (2022-01-20)
  Missing values are a major issue in quantitative proteomics analysis. While many methods have been developed for imputing missing values in high-throughput proteomics data, a comparative assessment of imputation accuracy remains inconclusive, mainly because mechanisms contributing to true missing values are complex and existing evaluation methodologies are imperfect. Moreover, few studies have provided an outlook of future methodological development. We first re-evaluate the performance of eight representative methods targeting three typical missing mechanisms. These methods are compared on both simulated and masked missing values embedded within real proteomics datasets, and performance is evaluated using three quantitative measures. We then introduce fused regularization matrix factorization, a low-rank global matrix factorization framework, capable of integrating local similarity derived from additional data types. We also explore a biologically-inspired latent variable modeling strategy, convex analysis of mixtures, for missing value imputation and present preliminary experimental results. While some winners emerged from our comparative assessment, the evaluation is intrinsically imperfect because performance is evaluated indirectly on artificial missing or masked values rather than authentic missing values. Nevertheless, we show that our fused regularization matrix factorization provides a novel incorporation of external and local information, and the exploratory implementation of convex analysis of mixtures presents a biologically plausible new approach.
- Cosbin: cosine score-based iterative normalization of biologically diverse samples
  Wu, Chiung-Ting; Shen, Minjie; Du, Dongping; Cheng, Zuolin; Parker, Sarah J.; Lu, Yingzhou; Van Eyk, Jennifer E.; Yu, Guoqiang; Clarke, Robert; Herrington, David M.; Wang, Yue (Oxford University Press, 2022)
  Motivation: Data normalization is essential to ensure accurate inference and comparability of gene expression measures across samples or conditions. Ideally, gene expression data should be rescaled based on consistently expressed reference genes. However, to normalize biologically diverse samples, the most commonly used reference genes exhibit striking expression variability and size-factor or distribution-based normalization methods can be problematic when the amount of asymmetry in differential expression is significant.
  Results: We report an efficient and accurate data-driven method, Cosine score-based iterative normalization (Cosbin), to normalize biologically diverse samples. Based on the Cosine scores of cross-condition expression patterns, the Cosbin pipeline iteratively eliminates asymmetric differentially expressed genes, identifies consistently expressed genes, and calculates sample-wise normalization factors. We demonstrate the superior performance and enhanced utility of Cosbin compared with six representative peer methods using both simulation and real multi-omics expression datasets. Implemented in open-source R scripts and specifically designed to address normalization bias due to significant asymmetry in differential expression across multiple conditions, the Cosbin tool complements rather than replaces the existing methods and will allow biologists to more accurately detect true molecular signals among diverse phenotypic groups.
  Availability and implementation: The R scripts of the Cosbin pipeline are freely available at https://github.com/MinjieSh/Cosbin.
  Supplementary information: Supplementary data are available at Bioinformatics Advances online.
- COT: an efficient and accurate method for detecting marker genes among many subtypes
  Lu, Yingzhou; Wu, Chiung-Ting; Parker, Sarah J.; Cheng, Zuolin; Saylor, Georgia; Van Eyk, Jennifer E.; Yu, Guoqiang; Clarke, Robert; Herrington, David M.; Wang, Yue (Oxford University Press, 2022)
  Motivation: Ideally, a molecularly distinct subtype would be composed of molecular features that are expressed uniquely in the subtype of interest but in no others, so-called marker genes (MGs). MGs play a critical role in the characterization, classification or deconvolution of tissue or cell subtypes. We and others have recognized that the test statistics used by most methods do not exactly satisfy the MG definition and often identify inaccurate MGs.
  Results: We report an efficient and accurate data-driven method, formulated as a Cosine-based One-sample Test (COT) in scatter space, to detect MGs among many subtypes using subtype expression profiles. Fundamentally different from existing approaches, the test statistic in COT precisely matches the mathematical definition of an ideal MG. We demonstrate the performance and utility of COT on both simulated and real gene expression and proteomics data. The open source Python/R tool will allow biologists to efficiently detect MGs and perform a more comprehensive and unbiased molecular characterization of tissue or cell subtypes in many biomedical contexts. Nevertheless, COT complements rather than replaces existing methods.
  Availability and implementation: The Python COT software, with a detailed user's manual and a vignette, is freely available at https://github.com/MintaYLu/COT.
  Supplementary information: Supplementary data are available at Bioinformatics Advances online.
- Data-driven detection of subtype-specific differentially expressed genes
  Chen, Lulu; Lu, Yingzhou; Wu, Chiung-Ting; Clarke, Robert; Yu, Guoqiang; Van Eyk, Jennifer E.; Herrington, David M.; Wang, Yue (2021-01-11)
  Among multiple subtypes of tissue or cell, subtype-specific differentially-expressed genes (SDEGs) are defined as being most-upregulated in only one subtype but not in any other. Detecting SDEGs plays a critical role in the molecular characterization and deconvolution of multicellular complex tissues. Classic differential analysis assumes a null hypothesis whose test statistic is not subtype-specific, and thus can produce a high false positive rate and/or lower detection power. Here we first introduce a One-Versus-Everyone Fold Change (OVE-FC) test for detecting SDEGs. We then propose a scaled test statistic (OVE-sFC) for assessing the statistical significance of SDEGs that applies a mixture null distribution model and a tailored permutation test. The OVE-FC/sFC test was validated on both type 1 error rate and detection power using extensive simulation data sets generated from real gene expression profiles of purified subtype samples. The OVE-FC/sFC test was then applied to two benchmark gene expression data sets of purified subtype samples and detected many known or previously unknown SDEGs. Subsequent supervised deconvolution results on synthesized bulk expression data, obtained using the SDEGs detected from the independent purified expression data by the OVE-FC/sFC test, showed superior performance in deconvolution accuracy when compared with popular peer methods.
- Knowledge-fused differential dependency network models for detecting significant rewiring in biological networks
  Tian, Ye; Zhang, Bai; Hoffman, Eric P.; Clarke, Robert; Zhang, Zhen; Shih, Ie-Ming; Xuan, Jianhua; Herrington, David M.; Wang, Yue (2014-07-24)
  Modeling biological networks serves as both a major goal and an effective tool of systems biology in studying mechanisms that orchestrate the activities of gene products in cells. Biological networks are context-specific and dynamic in nature. To systematically characterize the selectively activated regulatory components and mechanisms, modeling tools must be able to effectively distinguish significant rewiring from random background fluctuations. While differential networks cannot be constructed by existing knowledge alone, novel incorporation of prior knowledge into data-driven approaches can improve the robustness and biological relevance of network inference. However, the major unresolved roadblocks include: big solution space but a small sample size; highly complex networks; imperfect prior knowledge; missing significance assessment; and heuristic structural parameter learning. To address these challenges, we formulated the inference of differential dependency networks that incorporate both conditional data and prior knowledge as a convex optimization problem, and developed an efficient learning algorithm to jointly infer the conserved biological network and the significant rewiring across different conditions. We used a novel sampling scheme to estimate the expected error rate due to "random" knowledge. Based on that scheme, we developed a strategy that fully exploits the benefit of this data-knowledge integrated approach. We demonstrated and validated the principle and performance of our method using synthetic datasets. We then applied our method to yeast cell line and breast cancer microarray data and obtained biologically plausible results.
  The open-source R software package and the experimental data are freely available at http://www.cbil.ece.vt.edu/software.htm. Experiments on both synthetic and real data demonstrate the effectiveness of the knowledge-fused differential dependency network in revealing the statistically significant rewiring in biological networks. The method efficiently leverages data-driven evidence and existing biological knowledge while remaining robust to the false positive edges in the prior knowledge. The identified network rewiring events are supported by previous studies in the literature and also provide new mechanistic insight into the biological systems. We expect the knowledge-fused differential dependency network analysis, together with the open-source R package, to be an important and useful bioinformatics tool in biological network analyses.
- Mathematical modelling of transcriptional heterogeneity identifies novel markers and subpopulations in complex tissues
  Wang, Niya; Hoffman, Eric P.; Chen, Lulu; Chen, Li; Zhang, Zhen; Liu, Chunyu; Yu, Guoqiang; Herrington, David M.; Clarke, Robert; Wang, Yue (Springer Nature, 2016-01-07)
  Tissue heterogeneity is both a major confounding factor and an underexploited information source. While a handful of reports have demonstrated the potential of supervised computational methods to deconvolute tissue heterogeneity, these approaches require a priori information on the marker genes or composition of known subpopulations. To address the critical problem of the absence of validated marker genes for many (including novel) subpopulations, we describe convex analysis of mixtures (CAM), a fully unsupervised in silico method, for identifying subpopulation marker genes directly from the original mixed gene expressions in scatter space that can improve molecular analyses in many biological contexts. Validated with predesigned mixtures, CAM on the gene expression data from peripheral leukocytes, brain tissue, and yeast cell cycle, revealed novel marker genes that were otherwise undetectable using existing methods. Importantly, CAM requires no a priori information on the number, identity, or composition of the subpopulations present in mixed samples, and does not require the presence of pure subpopulations in sample space. This advantage is significant in that CAM can achieve all of its goals using only a small number of heterogeneous samples, and is more powerful to distinguish between phenotypically similar subpopulations.
- Proteomic analysis of descending thoracic aorta identifies unique and universal signatures of aneurysm and dissection
  Saddic, Louis; Orosco, Amanda; Guo, Dongchuan; Milewicz, Dianna M.; Troxlair, Dana; Vander Heide, Richard; Herrington, David M.; Wang, Yue; Azizzadeh, Ali; Parker, Sarah J. (Elsevier, 2022)
  Objective: Very few clinical predictors of descending thoracic aorta dissection have been determined. Although aneurysms can dissect in a size-dependent process, most descending dissections will occur without prior enlargement. We compared the proteomic profiles of normal, dissected, aneurysm, and both aneurysm and dissected descending thoracic aortas to identify novel biomarkers and further understand the molecular pathways that lead to tissue at risk of dissection.
  Methods: We performed proteomic profiling of descending thoracic aortas with four phenotypes: normal (n = 46), aneurysm (n = 22), dissected (n = 12), and combined aneurysm and dissection (n = 8). Pairwise differential protein expression analyses using a Bayesian approach were then performed to identify common proteins that were dysregulated between each diseased tissue type and control aorta and to uncover unique proteins between aneurysmal and dissected aortas. Network and Markov cluster algorithms of differentially expressed proteins were used to find enriched ontology processes. A convex analysis of mixtures was also performed to identify the molecular subtypes within the different tissue types.
  Results: The diseased aortas had 71 common differentially expressed proteins compared with the control, including higher amounts of the protein thrombospondin 1. We found 42 differentially expressed proteins between the aneurysm and dissected tissue, with an abundance of apolipoproteins in the former and higher quantities of extracellular matrix proteins in the latter. The convex analysis of mixtures showed enhancement of a molecular subtype enriched in contractile proteins within the control tissue compared with the diseased tissue, in addition to increased proportions of molecular subtypes enriched in inflammation and red blood cell expression in the aneurysmal compared with the dissected tissue.
  Conclusions: We found some overlapping differentially expressed proteins in aneurysmal and nonaneurysmal descending thoracic aortas at risk of dissection compared with normal aortas. However, we also found uniquely altered molecular pathways that might uncover mechanisms for dissection.
- Whole Exome Sequencing to Identify Genetic Variants Associated with Raised Atherosclerotic Lesions in Young Persons
  Hixson, James E.; Jun, Goo; Shimmin, Lawrence C.; Wang, Yizhi; Yu, Guoqiang; Mao, Chunhong; Warren, Andrew S.; Howard, Timothy D.; Vander Heide, Richard S.; Van Eyk, Jennifer E.; Wang, Yue; Herrington, David M. (Springer Nature, 2017-06-22)
  We investigated the influence of genetic variants on atherosclerosis using whole exome sequencing in cases and controls from the autopsy study "Pathobiological Determinants of Atherosclerosis in Youth (PDAY)". We identified a PDAY case group with the highest total amounts of raised lesions (n = 359) for comparisons with a control group with no detectable raised lesions (n = 626). In addition to the standard exome capture, we included genome-wide proximal promoter regions that contain sequences that regulate gene expression. Our statistical analyses included single variant analysis for common variants (MAF > 0.01) and rare variant analysis for low frequency and rare variants (MAF < 0.05). In addition, we investigated known CAD genes previously identified by meta-analysis of GWAS studies. We did not identify individual common variants that reached exome-wide significance using single variant analysis. In analysis limited to 60 CAD genes, we detected strong associations with COL4A2/COL4A1 that also previously showed associations with myocardial infarction and arterial stiffness, as well as coronary artery calcification. Likewise, rare variant analysis did not identify genes that reached exome-wide significance. Among the 60 CAD genes, the strongest association was with NBEAL1 that was also identified in gene-based analysis of whole exome sequencing for early onset myocardial infarction.