Browsing by Author "Warren, Andrew S."
- Deep Learning for Taxonomy Prediction
  Ramesh, Shreyas (Virginia Tech, 2019-06-04)
  The last decade has seen great advances in Next-Generation Sequencing technologies, and, as a result, there has been a rise in the number of genomes sequenced each year. In 2017, as many as 10,000 new organisms were sequenced and added to the RefSeq database. Taxonomy prediction is a science involving the hierarchical classification of DNA fragments up to the rank of species. In this research, we introduce Predicting Linked Organisms, or Plinko for short. Plinko is a fully functioning, state-of-the-art predictive system that accurately captures DNA-taxonomy relationships where other state-of-the-art algorithms falter. Plinko leverages multi-view convolutional neural networks and the pre-defined taxonomy tree structure to improve multi-level taxonomy prediction. In the Plinko strategy, each network takes advantage of different word usage patterns corresponding to different levels of evolutionary divergence. Plinko has the advantages of relatively low storage and GPGPU parallel training and inference, making the solution portable and scalable with anticipated genome database growth. To the best of our knowledge, Plinko is the first to use multi-view convolutional neural networks as the core algorithm in a compositional, alignment-free approach to taxonomy prediction.
- Functional bias in molecular evolution rate of Arabidopsis thaliana
  Warren, Andrew S.; Anandakrishnan, Ramu; Zhang, Liqing (2010-05-01)
  Background: Characteristics derived from mutation and other mechanisms that are advantageous for survival are often preserved during evolution by natural selection. Some genes are conserved in many organisms because they are responsible for fundamental biological function; others are conserved for their unique functional characteristics. Therefore one would expect the rate of molecular evolution for individual genes to be dependent on their biological function. Whether this expectation holds for genes duplicated by whole genome duplication is not known. Results: We empirically demonstrate here, using duplicated genes generated from the Arabidopsis thaliana α-duplication event, that the rate of molecular evolution of genes duplicated in this event depends on biological function. Using functional clustering based on gene ontology annotation of gene pairs, we show that some duplicated genes, such as defense response genes, are under weaker purifying selection or under stronger diversifying selection than other duplicated genes, such as protein translation genes, as measured by the ratio of nonsynonymous to synonymous divergence (dN/dS). Conclusions: These results provide empirical evidence indicating that molecular evolution rate for genes duplicated in whole genome duplication, as measured by dN/dS, may depend on biological function, which we characterize using gene ontology annotation. Furthermore, the general approach used here provides a framework for comparative analysis of molecular evolution rate for genes based on their biological function.
- The Genome Reverse Compiler: an explorative annotation tool
  Warren, Andrew S.; Setubal, João C. (2009-01-27)
  Background: As sequencing costs have decreased, whole genome sequencing has become a viable and integral part of biological laboratory research. However, the tools with which genes can be found and functionally characterized have not been readily adapted to be part of the everyday biological sciences toolkit. Most annotation pipelines remain as a service provided by large institutions or come as an unwieldy conglomerate of independent components, each requiring its own setup and maintenance. Results: To address this issue we have created the Genome Reverse Compiler, an easy-to-use, open-source, automated annotation tool. The GRC is independent of third-party software installs and only requires a Linux operating system. This stands in contrast to most annotation packages, which typically require installation of relational databases, sequence similarity software, and a number of other programming language modules. We provide details on the methodology used by GRC and evaluate its performance on several groups of prokaryotes using GRC's built-in comparison module. Conclusion: Traditionally, to perform whole genome annotation a user would either set up a pipeline or take advantage of an online service. With GRC the user need only provide the genome he or she wants to annotate and the function resource files to use. The result is high usability and a very minimal learning curve for the intended audience of life science researchers and bioinformaticians. We believe that the GRC fills a valuable niche in allowing users to perform explorative, whole-genome annotation.
- Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center
  Wattam, Alice R.; Davis, James J.; Assaf, Rida; Boisvert, Sebastien; Brettin, Thomas; Bun, Christopher; Conrad, Neal; Dietrich, Emily M.; Disz, Terry L.; Gabbard, Joseph L.; Gerdes, Svetlana; Henry, Christopher S.; Kenyon, Ronald W.; Machi, Dustin; Mao, Chunhong; Nordberg, Eric K.; Olsen, Gary J.; Murphy-Olson, Daniel E.; Olson, Robert D.; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Vonstein, Veronika; Warren, Andrew S.; Xia, Fangfang; Yoo, Hyunseung; Stevens, Rick L. (2017-01-04)
  The Pathosystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center (https://www.patricbrc.org). Recent changes to PATRIC include a redesign of the web interface and some new services that provide users with a platform that takes them from raw reads to an integrated analysis experience. The redesigned interface allows researchers direct access to tools and data, and the emphasis has changed to user-created genome-groups, with detailed summaries and views of the data that researchers have selected. Perhaps the biggest change has been the enhanced capability for researchers to analyze their private data and compare it to the available public data. Researchers can assemble their raw sequence reads and annotate the contigs using RASTtk. PATRIC also provides services for RNA-Seq, variation, model reconstruction and differential expression analysis, all delivered through an updated private workspace. Private data can be compared by 'virtual integration' to any of PATRIC's public data. The number of genomes available for comparison in PATRIC has expanded to over 80 000, with a special emphasis on genomes with antimicrobial resistance data. PATRIC uses this data to improve both subsystem annotation and k-mer classification, and tags new genomes as having signatures that indicate susceptibility or resistance to specific antibiotics.
- Methods for Analysis of Prokaryotic Genome Architecture
  Warren, Andrew S. (Virginia Tech, 2017-07-19)
  Research in comparative microbial genomics has largely been organized around the concept of reference genomes. Reference genomes provide a useful comparative touchstone for closely related organisms. However, they do not necessarily represent the biological diversity in a group of genomes. Currently there are more than 96,000 bacterial genomes sequenced and this number is rapidly increasing. Some closely related groups have large numbers of genomes sequenced, creating interesting comparative challenges: E. coli more than 5,400 isolates, S. aureus almost 9,000. As this sampling through sequencing becomes both deeper and broader, reference genome based methods become less effective at characterizing groups of organisms. Functional motifs can help explain the organizing principles behind cellular systems in bacteria which have yet to be well understood. Currently there are relatively few bioinformatic tools for analyzing potential patterns at the level of genome organization that do not depend directly on sequence similarity. We present a framework for conducting genomic data mining to look for patterns that currently require human expert designation. We establish new computational methods for identifying patterns in prokaryotic genome construction through a mapping of genomic features, using semantic similarity, independent of a particular corpus to better approximate functional similarity. We also present an algorithm for creating whole genome multiple sequence comparisons and a model for representing the similarities and differences among sequences as a graph of syntenic gene families. This effort touches on several different research fronts: graph representation of genomes and their alignments, synteny block analysis, whole genome sequence alignment, pan-genome analysis, multiple sequence alignment, and genome rearrangement analysis.
  Though our approach was originally developed from a pan-genome perspective for prokaryotes, the methods involved have the potential to speed up more expensive computation such as phylogenetic tree construction and SNP analysis. Novel elements include the contextualization of synteny analysis both between and within multi-contig genomes and an analytical framework for detecting genome level evolutionary events such as insertions, inversions, translocations, and fusions.
- Missing genes in the annotation of prokaryotic genomes
  Warren, Andrew S.; Archuleta, Jeremy; Feng, Wu-chun; Setubal, João C. (BioMed Central, 2010-03-15)
  Background: Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting). Therefore the question arises as to whether current genome annotations have systematically missing, small genes. Results: We have developed a high-performance computing methodology to investigate this problem. In this methodology we compare all ORFs larger than or equal to 33 aa from all fully-sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria requiring a minimum taxonomic diversity between conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and do not have any strong similarity to gene sequences in public databases, with the implication that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs, readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (less than 100 aa). A comparison of select examples with GeneMark, EasyGene and Glimmer predictions yields evidence that some of these genes are escaping detection by these programs. Conclusions: Prokaryotic gene finders and prokaryotic genome annotations require improvement for accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene.
- Missing genes in the annotation of prokaryotic genomes
  Warren, Andrew S.; Archuleta, Jeremy; Feng, Wu-chun; Setubal, João Carlos (BioMed Central, 2010)
  Background: Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting). Therefore the question arises as to whether current genome annotations have systematically missing, small genes. Results: We have developed a high-performance computing methodology to investigate this problem. In this methodology we compare all ORFs larger than or equal to 33 aa from all fully-sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria requiring a minimum taxonomic diversity between conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and do not have any strong similarity to gene sequences in public databases, with the implication that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs, readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (less than 100 aa). A comparison of select examples with GeneMark, EasyGene and Glimmer predictions yields evidence that some of these genes are escaping detection by these programs. Conclusions: Prokaryotic gene finders and prokaryotic genome annotations require improvement for accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene.
- The PATRIC Bioinformatics Resource Center: expanding data and analysis capabilities
  Davis, James J.; Wattam, Alice R.; Aziz, Ramy K.; Brettin, Thomas; Butler, Ralph; Butler, Rory M.; Chlenski, Philippe; Conrad, Neal; Dickerman, Allan W.; Dietrich, Emily M.; Gabbard, Joseph L.; Gerdes, Svetlana; Guard, Andrew; Kenyon, Ronald W.; Machi, Dustin; Mao, Chunhong; Murphy-Olson, Daniel E.; Nguyen, Marcus; Nordberg, Eric K.; Olsen, Gary J.; Olson, Robert D.; Overbeek, Jamie C.; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Thomas, Chris; VanOeffelen, Margo; Vonstein, Veronika; Warren, Andrew S.; Xia, Fangfang; Xie, Dawen; Yoo, Hyunseung; Stevens, Rick L. (2020-01-08)
  The PathoSystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center funded by the National Institute of Allergy and Infectious Diseases (https://www.patricbrc.org). PATRIC supports bioinformatic analyses of all bacteria with a special emphasis on pathogens, offering a rich comparative analysis environment that provides users with access to over 250 000 uniformly annotated and publicly available genomes with curated metadata. PATRIC offers web-based visualization and comparative analysis tools, a private workspace in which users can analyze their own data in the context of the public collections, services that streamline complex bioinformatic workflows and command-line tools for bulk data analysis. Over the past several years, as genomic and other omics-related experiments have become more cost-effective and widespread, we have observed considerable growth in the usage of and demand for easy-to-use, publicly available bioinformatic tools and services. Here we report the recent updates to the PATRIC resource, including new web-based comparative analysis tools, eight new services and the release of a command-line interface to access, query and analyze data.
- PATRIC, the bacterial bioinformatics database and analysis resource
  Wattam, Alice R.; Abraham, David; Dalay, Oral; Disz, Terry L.; Driscoll, Timothy; Gabbard, Joseph L.; Gillespie, Joseph J.; Gough, Roger; Hix, Deborah; Kenyon, Ronald W.; Machi, Dustin; Mao, Chunhong; Nordberg, Eric K.; Olson, Robert; Overbeek, Ross; Pusch, Gordon D.; Shukla, Maulik; Schulman, Julie; Stevens, Rick L.; Sullivan, Daniel E.; Vonstein, Veronika; Warren, Andrew S.; Will, Rebecca; Wilson, Meredith J. C.; Yoo, Hyunseung; Zhang, Chengdong; Zhang, Yan; Sobral, Bruno (2014-01)
  The Pathosystems Resource Integration Center (PATRIC) is the all-bacterial Bioinformatics Resource Center (BRC) (http://www.patricbrc.org). A joint effort by two of the original National Institute of Allergy and Infectious Diseases-funded BRCs, PATRIC provides researchers with an online resource that stores and integrates a variety of data types [e.g. genomics, transcriptomics, protein-protein interactions (PPIs), three-dimensional protein structures and sequence typing data] and associated metadata. Data types are summarized for individual genomes and across taxonomic levels. All genomes in PATRIC, currently more than 10 000, are consistently annotated using RAST, the Rapid Annotations using Subsystems Technology. Summaries of different data types are also provided for individual genes, where comparisons of different annotations are available, and also include available transcriptomic data. PATRIC provides a variety of ways for researchers to find data of interest and a private workspace where they can store both genomic and gene associations, and their own private data. Both private and public data can be analyzed together using a suite of tools to perform comparative genomic or transcriptomic analysis. PATRIC also includes integrated information related to disease and PPIs. All the data and integrated analysis and visualization tools are freely available.
  This manuscript describes updates to PATRIC since its initial report in the 2007 NAR Database Issue.
- Towards a Genome Reverse Compiler
  Warren, Andrew S. (Virginia Tech, 2007-11-05)
  The Genome Reverse Compiler (GRC) is an annotation tool for prokaryotic genomes. Its name and philosophy are based on analogy with a high-level programming language compiler. In this analogy, the genome is a program in a certain low-level language that humans cannot understand. Given the sequence of any prokaryotic genome, GRC produces its corresponding "high-level program"--its annotation. GRC works in a completely automatic manner, using standard input and output formats. The goal is to provide an open-source, easy-to-run, very efficient annotation program.
- Whole Exome Sequencing to Identify Genetic Variants Associated with Raised Atherosclerotic Lesions in Young Persons
  Hixson, James E.; Jun, Goo; Shimmin, Lawrence C.; Wang, Yizhi; Yu, Guoqiang; Mao, Chunhong; Warren, Andrew S.; Howard, Timothy D.; Vander Heide, Richard S.; Van Eyk, Jennifer E.; Wang, Yue; Herrington, David M. (Springer Nature, 2017-06-22)
  We investigated the influence of genetic variants on atherosclerosis using whole exome sequencing in cases and controls from the autopsy study "Pathobiological Determinants of Atherosclerosis in Youth (PDAY)". We identified a PDAY case group with the highest total amounts of raised lesions (n = 359) for comparisons with a control group with no detectable raised lesions (n = 626). In addition to the standard exome capture, we included genome-wide proximal promoter regions that contain sequences that regulate gene expression. Our statistical analyses included single variant analysis for common variants (MAF > 0.01) and rare variant analysis for low frequency and rare variants (MAF < 0.05). In addition, we investigated known CAD genes previously identified by meta-analysis of GWAS studies. We did not identify individual common variants that reached exome-wide significance using single variant analysis. In analysis limited to 60 CAD genes, we detected strong associations with COL4A2/COL4A1 that also previously showed associations with myocardial infarction and arterial stiffness, as well as coronary artery calcification. Likewise, rare variant analysis did not identify genes that reached exome-wide significance. Among the 60 CAD genes, the strongest association was with NBEAL1 that was also identified in gene-based analysis of whole exome sequencing for early onset myocardial infarction.