Creating Scientific Software, with Application to Phylogenetics and Oligonucleotide Probe Design


Virginia Tech


The demands placed on scientific software differ from those placed on general-purpose software, and as a result, creating software for science and for scientists requires a specialized approach. Many software engineering practices developed in situations in which a tool is desired to perform some definable task, with measurable and verifiable outcomes. The users and the developers know what the tool "should" do. Scientific software, by contrast, often uses unproven or experimental techniques to address unsolved problems. The software is often run on "experimental" High Performance Computing hardware, adding another layer of complexity. It may not be possible to say what the software should do, or what the results should be, as these may be connected to the very scientific questions for which the software is being developed. Software development in this realm requires a deep understanding of the relevant scientific domain. The present work describes applications resulting from a scientific software development process that builds upon a detailed understanding of the scientific domain.

YODA is an application primarily for selecting microarray probe sequences for measuring gene expression. At the time of its development, none of the existing programs for this task satisfied the best-known requirements for microarray probe selection. The question of what makes a good microarray probe was an active research area at the time, and YODA was developed to incorporate the latest understanding of these requirements, drawn from the research literature, into a tool that can be used by a research biologist. An appendix examines the response to YODA and its use in the years since its release.

PEPR is a software system for inferring highly resolved whole-genome phylogenies for hundreds of genomes. It encodes a process developed through years of research and collaboration to produce some of the highest-quality phylogenies available for large sets of bacterial genomes, with no manual intervention required. This process is described in detail, and its results are compared with high-quality results from the literature to show that the process is at least as successful as more labor-intensive manual efforts. An appendix presents additional results, including high-quality phylogenies for many bacterial orders.



Phylogenetics, Microarray Probes, Oligonucleotide Design, Scientific Software, Automation