Browsing by Author "Wattam, Alice Rebecca"
- Creating Scientific Software, with Application to Phylogenetics and Oligonucleotide Probe Design
  Nordberg, Eric Kinsley (Virginia Tech, 2015-12-09)
  The demands placed on scientific software are different from those placed on general purpose software, and as a result, creating software for science and for scientists requires a specialized approach. Many software engineering practices were developed in situations in which a tool is desired to perform some definable task, with measurable and verifiable outcomes. The users and the developers know what the tool "should" do. Scientific software often uses unproven or experimental techniques to address unsolved problems. The software is often run on "experimental" High Performance Computing hardware, adding another layer of complexity. It may not be possible to say what the software should do, or what the results should be, as these may be connected to the very scientific questions for which the software is being developed. Software development in this realm requires a deep understanding of the relevant scientific domain area. The present work describes applications resulting from a scientific software development process that builds upon detailed understanding of the scientific domain area. YODA is an application primarily for selecting microarray probe sequences for measuring gene expression. At the time of its development, none of the existing programs for this task satisfied the best-known requirements for microarray probe selection. The question of what makes a good microarray probe was a research area at the time, and YODA was developed to incorporate the latest understanding of these requirements, drawn from the research literature, into a tool that can be used by a research biologist. An appendix examines the response and use in the years since YODA was released. PEPR is a software system for inferring highly resolved whole-genome phylogenies for hundreds of genomes. It encodes a process developed through years of research and collaboration to produce some of the highest quality phylogenies available for large sets of bacterial genomes, with no manual intervention required. This process is described in detail, and results are compared with high quality results from the literature to show that the process is at least as successful as more labor-intensive manual efforts. An appendix presents additional results, including high quality phylogenies for many bacterial Orders.
- Evaluation of Word and Paragraph Embeddings and Analogical Reasoning as an Alternative to Term Frequency-Inverse Document Frequency-based Classification in Support of Biocuration
  Sullivan, Daniel Edward (Virginia Tech, 2016-06-07)
  This research addresses the question: can unsupervised learning generate a representation that improves on the commonly used term frequency-inverse document frequency (TF-IDF) representation by capturing semantic relations? The analysis measures the quality of sentence classification using TF-IDF representations, and finds a practical upper limit to precision and recall in a biomedical text classification task (F1-score of 0.85). Arguably, one could use ontologies to supplement TF-IDF, but ontologies are sparse in coverage and costly to create. This prompts a correlated question: can unsupervised learning capture semantic relations at least as well as existing ontologies, and thus supplement existing sparse ontologies? A shallow neural network implementing the Skip-Gram algorithm is used to generate semantic vectors using a corpus of approximately 2.4 billion words. The ability to capture meaning is assessed by comparing the generated semantic vectors with MeSH. Results indicate that semantic vectors trained by unsupervised methods capture comparable levels of semantic features in some cases, such as amino acid (92% of similarity represented in MeSH), but perform substantially worse in more expansive topics, such as pathogenic bacteria (37.8% similarity represented in MeSH). Possible explanations for this difference in performance are proposed along with a method to combine manually curated ontologies with semantic vector spaces to produce a more comprehensive representation than either alone. Semantic vectors are also used as representations for paragraphs, which, when used for classification, achieve an F1-score of 0.92. The results of classification and analogical reasoning tasks are promising, but a formal model of semantic vectors, subject to the constraints of known linguistic phenomena, is needed. This research includes initial steps for developing a formal model of semantic vectors based on a combination of linear algebra and fuzzy set theory subject to the semantic molecularism linguistic model. This research is novel in its analysis of semantic vectors applied to the biomedical domain, analysis of different performance characteristics in biomedical analogical reasoning tasks, comparison of the semantic relations captured by vectors and by MeSH, and the initial development of a formal model of semantic vectors.
- Genome Sequences of Three Brucella canis Strains Isolated from Humans and a Dog
  Canario Viana, Marcus Vinicius; Wattam, Alice Rebecca; Batra, Dhwani Govil; Boisvert, Sebastien; Brettin, Thomas Scott; Frace, Michael; Xia, Fangfang; Azevedo, Vasco; Tiller, Rebekah; Hoffmaster, Alex R. (2017-02)
  Brucella canis is a facultative intracellular pathogen that preferentially infects members of the Canidae family. Here, we report the genome sequencing of two Brucella canis strains isolated from humans and one isolated from a dog host.
- Rapidly evolving changes and gene loss associated with host switching in Corynebacterium pseudotuberculosis
  Canario Viana, Marcus Vinicius; Sahm, Arne; Goes Neto, Aristoteles; Pereira Figueiredo, Henrique Cesar; Wattam, Alice Rebecca; Azevedo, Vasco (PLOS, 2018-11-12)
  Phylogenomics and genome-scale positive selection analyses were performed on 29 Corynebacterium pseudotuberculosis genomes that were isolated from different hosts, including representatives of the Ovis and Equi biovars. A total of 27 genes were identified as undergoing adaptive changes. An analysis of the clades within this species and these biovars, the genes specific to each branch, and the genes responding to selective pressure shows clear differences, indicating that adaptation and specialization are occurring in different clades. These changes are often correlated with the isolation host but could indicate responses to some undetermined factor in the respective niches. The fact that some of these more rapidly evolving genes have homology to known virulence factors, antimicrobial resistance genes and drug targets shows that this type of analysis could be used to identify novel targets, and that these could be used as a way to control this pathogen.