Computer Science Technical Reports
The Department of Computer Science collection of technical
reports began in 1973. Please use the subject headings listed below for all submissions.
Subject Headings:
- Algorithms
- Big Data
- Bioinformatics
- Computational Biology
- Computational Science and Engineering
- Computer Graphics/Animation
- Computer Science Education
- Computer Systems
- Cyberarts
- Cybersecurity
- Data and Text Mining
- Digital Education
- Digital Libraries
- Discrete Event Simulation
- High Performance Computing
- Human Computer Interaction
- Information Retrieval
- Machine Learning
- Mathematical Programming
- Mathematical Software
- Modeling and Simulation
- Networking
- Numerical Analysis
- Parallel and Distributed Computing
- Problem Solving Environments
- Software Engineering
- Theoretical Computer Science
- Virtual/Augmented Reality
- Visualization
Browse
Browsing Computer Science Technical Reports by Subject "Bioinformatics"
Now showing 1 - 15 of 15
- Accelerating Electrostatic Surface Potential Calculation with Multiscale Approximation on Graphics Processing Units
  Anandakrishnan, Ramu; Scogland, Thomas R. W.; Fenley, Andrew T.; Gordon, John; Feng, Wu-chun; Onufriev, Alexey V. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2009)
  Tools that compute and visualize biomolecular electrostatic surface potential have been used extensively for studying biomolecular function. However, determining the surface potential for large biomolecules on a typical desktop computer can take days or longer using currently available tools and methods. This paper demonstrates how one can take advantage of graphics processing units (GPUs) available in today's typical desktop computer, together with a multiscale approximation method, to significantly speed up such computations. Specifically, the electrostatic potential computation, using an analytical linearized Poisson-Boltzmann (ALPB) method, is implemented on an ATI Radeon 4870 GPU in combination with the hierarchical charge partitioning (HCP) multiscale approximation. This implementation delivers a combined 1800-fold speedup for a 476,040-atom viral capsid.
- Algorithms for Feature Selection in Rank-Order Spaces
  Slotta, Douglas J.; Vergara, John Paul C.; Ramakrishnan, Naren; Heath, Lenwood S. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2005)
  The problem of feature selection in supervised learning situations is considered, where all features are drawn from a common domain and are best interpreted via ordinal comparisons with other features, rather than as numerical values. In particular, each instance is a member of a space of ranked features. This problem is pertinent in electoral, financial, and bioinformatics contexts, where features denote assessments in terms of counts, ratings, or rankings. Four algorithms for feature selection in such rank-order spaces are presented; two are information-theoretic, and two are order-theoretic. These algorithms are empirically evaluated against both synthetic and real-world datasets. The main results of this paper are (i) characterization of relationships and equivalences between different feature selection strategies with respect to the spaces in which they operate and the distributions they seek to approximate; (ii) identification of computationally simple and efficient strategies that perform surprisingly well; and (iii) a feasibility study of order-theoretic feature selection for large-scale datasets.
- Algorithms for Storytelling
  Kumar, Deept; Ramakrishnan, Naren; Helm, Richard F.; Potts, Malcolm (Department of Computer Science, Virginia Polytechnic Institute & State University, 2006)
  We formulate a new data mining problem called "storytelling" as a generalization of redescription mining. In traditional redescription mining, we are given a set of objects and a collection of subsets defined over these objects. The goal is to view the set system as a vocabulary and identify two expressions in this vocabulary that induce the same set of objects. Storytelling, on the other hand, aims to explicitly relate object sets that are disjoint (and hence, maximally dissimilar) by finding a chain of (approximate) redescriptions between the sets. This problem finds applications in bioinformatics, for instance, where the biologist is trying to relate a set of genes expressed in one experiment to another set, implicated in a different pathway. We outline an efficient storytelling implementation that embeds the CARTwheels redescription mining algorithm in an A* search procedure, using the former to supply next move operators on search branches to the latter. This approach is practical and effective for mining large datasets and, at the same time, exploits the structure of partitions imposed by the given vocabulary. Three application case studies are presented: a study of word overlaps in large English dictionaries, exploring connections between genesets in a bioinformatics dataset, and relating publications in the PubMed index of abstracts.
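  The chaining idea in this abstract can be illustrated with a minimal sketch. Note the hedges: the data below are hypothetical, and a plain breadth-first search stands in for the CARTwheels-in-A* procedure the report actually describes; only the notion of connecting disjoint sets through a chain of overlapping intermediates is taken from the abstract.

  ```python
  from collections import deque

  def jaccard(a, b):
      return len(a & b) / len(a | b)

  def story_chain(start, end, universe, threshold):
      """Breadth-first search for a chain of sets from start to end in
      which consecutive sets overlap by at least `threshold` (Jaccard).
      The report instead embeds redescription mining in an A* search."""
      start, end = frozenset(start), frozenset(end)
      frontier = deque([[start]])
      seen = {start}
      while frontier:
          path = frontier.popleft()
          if jaccard(path[-1], end) >= threshold:
              return [set(s) for s in path] + [set(end)]
          for s in universe:
              fs = frozenset(s)
              if fs not in seen and jaccard(path[-1], fs) >= threshold:
                  seen.add(fs)
                  frontier.append(path + [fs])
      return None  # no chain under this threshold

  # Hypothetical gene sets: start and end are disjoint, but a chain of
  # intermediates with pairwise overlap connects them.
  universe = [{"g2", "g3", "g4"}, {"g3", "g4", "g5"}, {"g4", "g5", "g6"}]
  chain = story_chain({"g1", "g2", "g3"}, {"g5", "g6", "g7"}, universe, 0.5)
  ```

  Each consecutive pair in the returned chain is an (approximate) redescription step; lowering the threshold admits shorter but looser stories.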
- Analysis of the Fitness Effect of Compensatory Mutations
  Zhang, Liqing; Watson, Layne T. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2008)
  We extend our previous work on the fitness effect of the fixation of deleterious mutations on a population by incorporating the effect of compensatory mutations. Compensatory mutations are important in the sense that they make the deleterious mutations less deleterious, thus reducing the genetic load of the population. The essential phenomenon underlying compensatory mutations is the nonindependence of mutations in biological systems. Therefore, it is an important phenomenon that cannot be ignored when considering the fixation and fitness effect of deleterious mutations. Since having compensatory mutations essentially changes the distributional shapes of deleterious mutations, we can consider the effect of compensatory mutations by comparing two distributions where one distribution reflects the reduced fitness effects of deleterious mutations with the influence of compensatory mutations. We compare different distributions of deleterious mutations without compensatory mutations to those with compensatory mutations, and study the effect of population sizes, the shape of the distribution, and the mutation rates of the population on the total fitness reduction of the population.
- Architectural Refactoring for Fast and Modular Bioinformatics Sequence Search
  Archuleta, Jeremy; Tilevich, Eli; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2006-09-01)
  Bioinformaticists use the Basic Local Alignment Search Tool (BLAST) to characterize an unknown sequence by comparing it against a database of known sequences, thus detecting evolutionary relationships and biological properties. mpiBLAST is a widely-used, high-performance, open-source parallelization of BLAST that runs on a computer cluster delivering super-linear speedups. However, the Achilles heel of mpiBLAST is its lack of modularity, adversely affecting maintainability and extensibility; an effective architectural refactoring will benefit both users and developers. This paper describes our experiences in the architectural refactoring of mpiBLAST into a modular, high-performance software package. Our evaluation of five component-oriented designs culminated in a design that enables modularity while retaining high performance. Furthermore, we achieved this refactoring effectively and efficiently using eXtreme Programming techniques. These experiences will be of value to software engineers faced with the challenge of creating maintainable and extensible, high-performance bioinformatics software.
- CAN-zip – Centroid Based Delta Compression of Next Generation Sequencing Data
  Steere, Edward; An, Lin; Zhang, Liqing (Department of Computer Science, Virginia Polytechnic Institute & State University, 2015-11-09)
  We present CANzip, a novel algorithm for compressing short read DNA sequencing data in FastQ format. CANzip is based on delta compression, a process in which only the differences of a specific data stream relative to a given reference stream are stored. However, CANzip uniquely assumes no given reference stream. Instead it creates artificial references for different clusters of reads, by constructing an artificial representative sequence for each given cluster. Each read in a cluster is then recoded to indicate only how it differs relative to this artificially created reference sequence. Remodeling the data in this way greatly improves the compression ratio achieved when used in conjunction with commodity tools such as bzip2. Our results indicate that CANzip outperforms gzip on average and that it can outperform bzip2.
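  The core recoding step can be sketched as follows. This is a deliberately simplified illustration with made-up data: it assumes reads are already aligned to an equal-length cluster reference, whereas the actual CANzip pipeline first clusters reads and constructs the centroid reference itself.

  ```python
  def delta_encode(read, reference):
      """Record only the positions where the read differs from the
      cluster reference (assumes an equal-length, gap-free alignment)."""
      return [(i, c) for i, (c, r) in enumerate(zip(read, reference)) if c != r]

  def delta_decode(deltas, reference):
      """Reconstruct the read by patching the reference at recorded positions."""
      out = list(reference)
      for i, c in deltas:
          out[i] = c
      return "".join(out)

  reference = "ACGTACGTAC"   # artificial centroid sequence for a cluster
  read      = "ACGAACGTTC"
  deltas = delta_encode(read, reference)
  ```

  Reads close to their centroid shrink to a handful of (position, base) pairs, and the resulting highly repetitive stream is what a downstream tool like bzip2 then compresses well.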
- Compositional Mining of Multi-Relational Biological Datasets
  Jin, Ying; Murali, T. M.; Ramakrishnan, Naren (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007-08-01)
  High-throughput biological screens are yielding ever-growing streams of information about multiple aspects of cellular activity. As more and more categories of datasets come online, there is a corresponding multitude of ways in which inferences can be chained across them, motivating the need for compositional data mining algorithms. In this paper, we argue that such compositional data mining can be effectively realized by functionally cascading redescription mining and biclustering algorithms as primitives. Both these primitives mirror shifts of vocabulary that can be composed in arbitrary ways to create rich chains of inferences. Given a relational database and its schema, we show how the schema can be automatically compiled into a compositional data mining program, and how different domains in the schema can be related through logical sequences of biclustering and redescription invocations. This feature allows us to rapidly prototype new data mining applications, yielding greater understanding of scientific datasets. We describe two applications of compositional data mining: (i) matching terms across categories of the Gene Ontology and (ii) understanding the molecular mechanisms underlying stress response in human cells.
- Dynamic Multigrain Parallelization on the Cell Broadband Engine
  Blagojevic, Filip; Nikolopoulos, Dimitrios S.; Stamatakis, Alexandros; Antonopoulos, Christos D. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2006)
  This paper addresses the problem of orchestrating and scheduling parallelism at multiple levels of granularity on heterogeneous multicore processors. We present policies and mechanisms for adaptive exploitation and scheduling of multiple layers of parallelism on the Cell Broadband Engine. Our policies combine event-driven task scheduling with malleable loop-level parallelism, which is exposed from the runtime system whenever task-level parallelism leaves cores idle. We present a runtime system for scheduling applications with layered parallelism on Cell and investigate its potential with RAxML, a computational biology application which infers large phylogenetic trees using the Maximum Likelihood (ML) method. Our experiments show that the Cell benefits significantly from dynamic parallelization methods that selectively exploit the layers of parallelism in the system in response to workload characteristics. Our runtime environment outperforms naive parallelization and scheduling based on MPI and Linux by up to a factor of 2.6. We are able to execute RAxML on one Cell four times faster than on a dual-processor system with Hyperthreaded Xeon processors, and 5-10% faster than on a single-processor system with a dual-core, quad-thread IBM Power5 processor.
- A General Probabilistic Model of the PCR Process
  Saha, Nilanjan; Watson, Layne T.; Kafadar, Karen; Onufriev, Alexey V.; Ramakrishnan, Naren; Vasquez-Robinet, Cecilia; Watkinson, Jonathan (Department of Computer Science, Virginia Polytechnic Institute & State University, 2004)
  This paper rigorously derives a general probabilistic model for the PCR process; this model includes as a special case the Velikanov-Kapral model where all nucleotide reaction rates are the same. In this model the probability of binding of deoxy-nucleoside triphosphate (dNTP) molecules with template strands is derived from the microscopic chemical kinetics. A recursive solution for the probability distribution of binding of dNTPs is developed for a single cycle and is used to calculate expected yield for a multicycle PCR. The model is able to reproduce important features of the PCR amplification process quantitatively. With a set of favorable reaction conditions, the amplification of the target sequence is fast enough to rapidly outnumber all side products. Furthermore, the final yield of the target sequence in a multicycle PCR run always approaches an asymptotic limit that is less than one. The amplification process itself is highly sensitive to initial concentrations and the reaction rates of addition to the template strand of each type of dNTP in the solution.
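  A toy version of the qualitative behavior this abstract describes (exponential amplification early, an asymptotic plateau as reagents deplete) can be sketched as follows. All constants here are invented for illustration, and the saturation-style efficiency term is a stand-in, not the report's model, which derives binding probabilities from microscopic kinetics.

  ```python
  def pcr_yield(cycles, n0=1e3, dntp=1e9, length=500, k=1e8):
      """Toy multicycle PCR: per-cycle copying efficiency falls as the
      dNTP pool depletes, so strand count grows exponentially at first
      and then plateaus. n0, dntp, length, and k are hypothetical."""
      n = n0
      for _ in range(cycles):
          eff = dntp / (dntp + k)            # hypothetical efficiency term
          new = n * eff                      # strands copied this cycle
          dntp = max(dntp - new * length, 0.0)  # consume nucleotides
          n += new
      return n

  early, late = pcr_yield(10), pcr_yield(40)  # exponential phase vs. plateau
  ```

  Even this crude sketch reproduces the plateau: once the dNTP pool is exhausted, additional cycles add nothing, so yield as a fraction of the theoretical maximum stays below one.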
- GridWeaver: A Fully-Automatic System for Microarray Image Analysis Using Fast Fourier Transforms
  Vergara, John Paul C.; Heath, Lenwood S.; Grene, Ruth; Ramakrishnan, Naren; Watson, Layne T. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2008)
  Experiments using microarray technology generate large amounts of image data that are used in the analysis of genetic function. An important stage in the analysis is the determination of relative intensities of spots on the images generated. This paper presents GridWeaver, a program that reads in images from a microarray experiment, automatically locates subgrids and spots in the images, and then determines the spot intensities needed in the analysis of gene function. Automatic gridding is performed by running Fast Fourier Transforms on pixel intensity sums. Tests on several data sets show that the program responds well even on images that have significant noise, both random and systemic.
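  The FFT-on-intensity-sums idea can be sketched in a few lines on synthetic data: a regularly spaced row of spots produces a periodic 1D profile of row (or column) pixel-intensity sums, and the dominant FFT frequency recovers the spot spacing. The profile below is artificial; the actual GridWeaver system also locates subgrids and individual spots.

  ```python
  import numpy as np

  def grid_period(intensity_sums):
      """Estimate spot spacing (in pixels) from a 1D profile of pixel
      intensity sums via the dominant non-DC FFT frequency."""
      x = np.asarray(intensity_sums, dtype=float)
      x = x - x.mean()                    # remove the DC component
      spectrum = np.abs(np.fft.rfft(x))
      k = 1 + np.argmax(spectrum[1:])     # strongest bin, skipping bin 0
      return len(x) / k                   # period in pixels

  # Synthetic profile: spots every 8 pixels over 128 rows, plus noise.
  rng = np.random.default_rng(0)
  signal = np.cos(2 * np.pi * np.arange(128) / 8) \
      + 0.1 * rng.standard_normal(128)
  period = grid_period(signal)
  ```

  Because the estimate comes from the whole profile rather than individual peaks, it degrades gracefully under the random noise the abstract mentions.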
- A Mathematical Programming Formulation for the Budding Yeast Cell Cycle
  Panning, Thomas D.; Watson, Layne T.; Shaffer, Clifford A.; Tyson, John J. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007)
  The budding yeast cell cycle can be modeled by a set of ordinary differential equations with 143 rate constant parameters. The quality of the model (and an associated vector of parameter settings) is measured by comparing simulation results to the experimental data derived from observing the cell cycles of over 100 selected mutated forms. Unfortunately, determining whether the simulated phenotype matches experimental data is difficult since the experimental data tend to be qualitative in nature (i.e., whether the mutant is viable, or in which development phase it died). Because of this, previous methods for automatically comparing simulation results to experimental data used a discontinuous penalty function, which limits the range of techniques available for automated estimation of the differential equation parameters. This paper presents a system of smooth inequality constraints that will be satisfied if and only if the model matches the experimental data. Results are presented for evaluating the mutants with the two most frequent phenotypes. This nonlinear inequality formulation is the first step toward solving a large-scale feasibility problem to determine the ordinary differential equation model parameters.
- Mining Novellas from PubMed Abstracts using a Storytelling Algorithm
  Gresock, Joseph; Kumar, Deept; Helm, Richard F.; Potts, Malcolm; Ramakrishnan, Naren (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007)
  Motivation: There are now a multitude of articles published in a diversity of journals providing information about genes, proteins, pathways, and entire processes. Each article investigates particular subsets of a biological process, but to gain insight into the functioning of a system as a whole, we must computationally integrate information across multiple publications. This is especially important in problems such as modeling cross-talk in signaling networks, designing drug therapies for combinatorial selectivity, and unraveling the role of gene interactions in deleterious phenotypes, where the cost of performing combinatorial screens is exorbitant.
  Results: We present an automated approach to biological knowledge discovery from PubMed abstracts, suitable for unraveling combinatorial relationships. It involves the systematic application of a "storytelling" algorithm followed by compression of the stories into "novellas." Given a start and end publication, typically with little or no overlap in content, storytelling identifies a chain of intermediate publications from one to the other, such that neighboring publications have significant content similarity. Stories discovered thus provide an argued approach to relate distant concepts through compositions of related concepts. The chains of links employed by stories are then mined to find frequently reused sub-stories, which can be compressed to yield novellas, or compact templates of connections. We demonstrate a successful application of storytelling and novella finding to modeling combinatorial relationships between the introduction of extracellular factors and downstream cellular events.
  Availability: A story visualizer, suitable for interactive exploration of the stories and novellas described in this paper, is available for demo/download at https://bioinformatics.cs.vt.edu/storytelling.
- A Modified Uniformization Method for the Solution of the Chemical Master Equation
  Zhang, Jingwei; Watson, Layne T.; Cao, Yang (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007)
  The chemical master equation is considered an accurate description of general chemical systems, and especially so for modeling cell cycle and gene regulatory networks. This paper proposes an efficient way of solving the chemical master equation for some prototypical problems in systems biology. A comparison between this new approach and some traditional approaches is also given.
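  For context, the classical uniformization method that this report builds on can be sketched for a generic continuous-time Markov chain: p(t) is a Poisson-weighted sum of powers of the uniformized transition matrix P = I + Q/L. The two-state generator below is a toy example; the report targets the far larger state spaces of the chemical master equation, which is what motivates modifying the method.

  ```python
  import math
  import numpy as np

  def uniformized_solve(Q, p0, t, tol=1e-10):
      """Classical uniformization for a CTMC with generator Q:
      p(t) = sum_k e^{-Lt} (Lt)^k / k! * p0 @ P^k, with P = I + Q/L.
      Terms are accumulated until the Poisson weights sum to 1 - tol."""
      L = max(-Q.diagonal())              # uniformization rate
      P = np.eye(Q.shape[0]) + Q / L      # discrete-time stochastic matrix
      term = math.exp(-L * t)             # Poisson weight for k = 0
      v = np.array(p0, dtype=float)
      p = term * v
      k, acc = 0, term
      while 1.0 - acc > tol:
          k += 1
          term *= L * t / k               # next Poisson weight
          v = v @ P                       # propagate one jump
          p += term * v
          acc += term
      return p

  # Toy two-state chain: 0 -> 1 at rate 2, 1 -> 0 at rate 1;
  # stationary distribution is (1/3, 2/3).
  Q = np.array([[-2.0, 2.0], [1.0, -1.0]])
  p = uniformized_solve(Q, [1.0, 0.0], t=5.0)
  ```

  Every term in the sum is nonnegative, so the truncation error is controlled directly by the leftover Poisson mass, which is one reason uniformization is attractive for probability computations.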
- Partially assembled nucleosome structures at atomic detail
  Rychkov, Georgy; Nazarov, Igor; Ilatovskiy, Andrey; Shvetsov, Alexey; Konev, Alexander; Isaev-Ivanov, Vladimir; Onufriev, Alexey V. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2015-05-21)
  Evidence is now overwhelming that partially assembled states of the nucleosome are as important as the canonical structure for the understanding of how DNA accessibility is regulated in cells. We use a combination of atomistic simulation and Atomic Force Microscopy (AFM) to propose models of key partially assembled structures of the nucleosome at atomic detail: the hexasome ((H3·H4)2·(H2A·H2B)), the tetrasome ((H3·H4)2), and the disome (H3·H4). Despite large-scale fluctuations of the "free" DNA not in direct contact with the histones in these structures, the histone-DNA contacts are stable, establishing the basis for interpretation of the role of these structures in providing a variable degree of DNA accessibility within the chromatin. We show how the atomistically detailed partially assembled nucleosome structures can be used to interpret experimental observations, both in vitro and in vivo. Specifically, interpretation of FRET, AFM, and SAXS experiments under conditions where partially assembled nucleosome states are expected is greatly facilitated by the availability of atomic-resolution structural ensembles of these states. We also suggest an alternative interpretation of a recent genome-wide study that investigated types of DNA protection in classical "active" chromatin, lending support for the transcription speed-up scenario that involves partially assembled sub-nucleosome structures.
- RAxML-Cell: Parallel Phylogenetic Tree Inference on the Cell Broadband Engine
  Blagojevic, Filip; Stamatakis, Alexandros; Antonopoulos, Christos D.; Nikolopoulos, Dimitrios S. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2006)
  Phylogenetic tree reconstruction is one of the grand challenge problems in Bioinformatics. The search for a best-scoring tree with 50 organisms, under a reasonable optimality criterion, creates a topological search space which is as large as the number of atoms in the universe. Computational phylogeny is challenging even for the most powerful supercomputers. It is also an ideal candidate for benchmarking emerging multiprocessor architectures, because it exhibits various levels of fine and coarse-grain parallelism. In this paper, we present the porting, optimization, and evaluation of RAxML on the Cell Broadband Engine. RAxML is a provably efficient, hill-climbing algorithm for computing phylogenetic trees based on the Maximum Likelihood (ML) method. The algorithm uses an embarrassingly parallel search method, which also exhibits data-level parallelism and control parallelism in the computation of the likelihood functions. We present the optimization of one of the currently fastest tree search algorithms, on a real Cell blade prototype. We also investigate problems and present solutions pertaining to the optimization of floating point code, control flow, communication, scheduling, and multi-level parallelization on the Cell.