Browsing by Author "Murali, T. M."
Now showing 1 - 20 of 98
Results Per Page
Sort Options
- Accurate and Efficient Gene Function Prediction using a Multi-Bacterial NetworkLaw, Jeffrey N.; Kale, Shiv D.; Murali, T. M. (2019-05-24)The rapid rise in newly sequenced genomes requires the development of computational methods to supplement experimental functional annotations. The challenge that arises is to develop methods for gene function prediction that integrate information for multiple species while also operating on a genomewide scale. We develop a label propagation algorithm called FastSinkSource and apply it to a sequence similarity network integrated with species-specific heterogeneous data for 19 pathogenic bacterial species. By using mathematically-provable bounds on the rate of progress of FastSinkSource during power iteration, we decrease the running time by a factor of 100 or more without sacrificing prediction accuracy. To demonstrate scalability, we expand to a 73-million edge network across 200 bacterial species while maintaining accuracy and efficiency improvements. Our results point to the feasibility and promise of multi-species, genomewide gene function prediction, especially as more experimental data and annotations become available for a diverse variety of organisms.
- Algorithms for Reconstructing and Reasoning about Chemical Reaction NetworksCho, Yong Ju (Virginia Tech, 2013-01-24)Recent advances in systems biology have uncovered detailed mechanisms of biological processes such as the cell cycle, circadian rhythms, and signaling pathways. These mechanisms are modeled by chemical reaction networks (CRNs) which are typically simulated by converting to ordinary differential equations (ODEs), so that the goal is to closely reproduce the observed quantitative and qualitative behaviors of the modeled process. This thesis proposes two algorithmic problems related to the construction and comprehension of CRN models. The first problem focuses on reconstructing CRNs from given time series. Given multivariate time course data obtained by perturbing a given CRN, how can we systematically deduce the interconnections between the species of the network? We demonstrate how this problem can be modeled as, first, one of uncovering conditional independence relationships using buffering experiments and, second, of determining the properties of the individual chemical reactions. Experimental results demonstrate the effectiveness of our approach on both synthetic and real CRNs. The second problem this work focuses on is to aid in network comprehension, i.e., to understand the motifs underlying complex dynamical behaviors of CRNs. Specifically, we focus on bistability---an important dynamical property of a CRN---and propose algorithms to identify the core structures responsible for conferring bistability. The approach we take is to systematically infer the instability causing structures (ICSs) of a CRN and use machine learning techniques to relate properties of the CRN to the presence of such ICSs. This work has the potential to aid in not just network comprehension but also model simplification, by helping reduce the complexity of known bistable systems.
- Algorithms for regulatory network inference and experiment planning in systems biologyPratapa, Aditya (Virginia Tech, 2020-07-17)I present novel solutions to two different classes of computational problems that arise in the study of complex cellular processes. The first problem arises in the context of planning large-scale genetic cross experiments that can be used to validate predictions of multigenic perturbations made by mathematical models. (i) I present CrossPlan, a novel methodology for systematically planning genetic crosses to make a set of target mutants from a set of source mutants. CrossPlan is based on a generic experimental workflow used in performing genetic crosses in budding yeast. CrossPlan uses an integer-linear-program (ILP) to maximize the number of target mutants that we can make under certain experimental constraints. I apply it to a comprehensive mathematical model of the protein regulatory network controlling cell division in budding yeast. (ii) I formulate several natural problems related to efficient synthesis of a target mutant from source mutants. These formulations capture experimentally-useful notions of verifiability (e.g., the need to confirm that a mutant contains mutations in the desired genes) and permissibility (e.g., the requirement that no intermediate mutants in the synthesis be inviable). I present several polynomial time or fixed-parameter tractable algorithms for optimal synthesis of a target mutant for special cases of the problem that arise in practice. The second problem I address is inferring gene regulatory networks (GRNs) from single cell transcriptomic (scRNA-seq) data. These GRNs can serve as starting points to build mathematical models. (iii) I present BEELINE, a comprehensive evaluation of state-of-the-art algorithms for inferring gene regulatory networks (GRNs) from single-cell gene expression data. The evaluations from BEELINE suggest that the area under the precision-recall curve and early precision of these algorithms are moderate. Techniques that do not require pseudotime-ordered cells are generally more accurate. Based on these results, I present recommendations to end users of GRN inference methods. BEELINE will aid the development of gene regulatory network inference algorithms. (iv) Based on the insights gained from BEELINE, I propose a novel graph convolutional neural network (GCN) based supervised algorithm for GRN inference form single-cell gene expression data. This GCN-based model has a considerably better accuracy than existing supervised learning algorithms for GRN inference from scRNA-seq data and can infer cell-type specific regulatory networks.
- Assessing the Role of Clusters Derived from Large Sequence Similarity Networks for Gene Function PredictionsVora, Parth Harish (Virginia Tech, 2020-05-29)Large scale genomic sequencing efforts have resulted in a massive inflow of raw sequence data. This raw data, when appropriately processed and analyzed, can provide insight to a trained biologist and aid in hypothesis-driven research. Given the time and resource requirements necessary for biological experiments, computational predictions of gene functions can aid in reducing a large list of candidate genes to a few promising targets. Various computational solutions have been proposed and developed for gene function prediction. These solutions utilize various forms of data, such as DNA/RNA/protein sequences, protein structures, interaction networks, literature mining, and a combination of these data sources. However, these methods do not always produce precise results as the underlying data sets used for training or modeling are quite sparse. We developed and used a massive sequence similarity network build over 108 million known protein sequences to aid in protein function prediction. Predictions are made through the alignment of query sequences to representative sequences for a given cluster derived from the massive sequence similarity network. Derived clusters aggregate information (particularly that from the Gene Ontology) from respective members, which we then consolidate through a novel weighted path method. We evaluate our method on four holdout datasets using CAFA evaluation metrics. Our results suggest that clustering significantly reduces the time and memory requirements, with a marginal impact on predictive power. At lower sequence similarity thresholds, our method outperforms other gold standard methods.
- Automatic layout and visualization of biclustersGrothaus, Gregory A.; Mufti, Adeel; Murali, T. M. (2006-09-04)Background Biclustering has emerged as a powerful algorithmic tool for analyzing measurements of gene expression. A number of different methods have emerged for computing biclusters in gene expression data. Many of these algorithms may output a very large number of biclusters with varying degrees of overlap. There are no systematic methods that create a two-dimensional layout of the computed biclusters and display overlaps between them. Results We develop a novel algorithm for laying out biclusters in a two-dimensional matrix whose rows (respectively, columns) are rows (respectively, columns) of the original dataset. We display each bicluster as a contiguous submatrix in the layout. We allow the layout to have repeated rows and/or columns from the original matrix as required, but we seek a layout of the smallest size. We also develop a web-based search interface for the user to query the genes and samples of interest and visualise the layout of biclusters matching the queries. Conclusion We demonstrate the usefulness of our approach on gene expression data for two types of leukaemia and on protein-DNA binding data for two growth conditions in Saccharomyces cerevisiae. The software implementing the layout algorithm is available at http://bioinformatics.cs.vt.edu/~murali/papers/bivoc.
- Automatic Reconstruction of the Building Blocks of Molecular Interaction NetworksRivera, Corban G. (Virginia Tech, 2008-08-11)High-throughput whole-genome biological assays are highly intricate and difficult to interpret. The molecular interaction networks generated from evaluation of those experiments suggest that cellular functions are carried out by modules of interacting molecules. Reverse-engineering the modular structure of cellular interaction networks has the promise of significantly easing their analysis. We hypothesize that: • cellular wiring diagrams can be decomposed into overlapping modules, where each module is a set of coherently-interacting molecules and • a cell responds to a stress or a stimulus by appropriately modulating the activities of a subset of these modules. Motivated by these hypotheses, we develop models and algorithms that can reverse-engineer molecular modules from large-scale functional genomic data. We address two major problems: 1. Given a wiring diagram and genome-wide gene expression data measured after the application of a stress or in a disease state, compute the active network of molecular interactions perturbed by the stress or the disease. 2. Given the active networks for multiple stresses, stimuli, or diseases, compute a set of network legos, which are molecular modules with the property that each active network can be expressed as an appropriate combination of a subset of modules. To address the first problem, we propose an approach that computes the most-perturbed subgraph of a curated pathway of molecular interactions in a disease state. Our method is based on a novel score for pathway perturbation that incorporates both differential gene expression and the interaction structure of the pathway. We apply our method to a compendium of cancer types. We show that the significance of the most perturbed sub-pathway is frequently larger than that of the entire pathway. We identify an association that suggests that IL-2 infusion may have a similar therapeutic effect in bladder cancer as it does in melanoma. We propose two models to address the second problem. First, we formulate a Boolean model for constructing network legos from a set of active networks. We reduce the problem of computing network legos to that of constructing closed biclusters in a binary matrix. Applying this method to a compendium of 13 stresses on human cells, we automatically detect that about four to six hours after treatment with chemicals cause endoplasmic reticulum stress, fibroblasts shut down the cell cycle far more aggressively than fibroblasts or HeLa cells do in response to other treatments. Our second model represents each active network as an additive combination of network legos. We formulate the problem as one of computing network legos that can be used to recover active networks in an optimal manner. We use existing methods for non-negative matrix approximation to solve this problem. We apply our method to a human cancer dataset including 190 samples from 18 cancers. We identify a network lego that associates integrins and matrix metalloproteinases in ovarian adenoma and other cancers and a network lego including the retinoblastoma pathway associated with multiple leukemias.
- Automating the PathLinker app for CytoscapeHuang, Li Jun; Law, Jeffrey N.; Murali, T. M. (F1000Research, 2018-06-12)PathLinker is a graph-theoretic algorithm originally developed to reconstruct the interactions in a signaling pathway of interest. It efficiently computes multiple short paths within a background protein interaction network from the receptors to transcription factors (TFs) in a pathway. Since December 2015, PathLinker has been available as an app for Cytoscape. This paper describes how we automated the app to use the CyRest infrastructure and how users can incorporate PathLinker into their software pipelines.
- Benchmarking Methods For Predicting Phenotype Gene AssociationsTyagi, Tanya (Virginia Tech, 2020-09-16)Assigning human genes to diseases and related phenotypes is an important topic in modern genomics. Human Phenotype Ontology (HPO) is a standardized vocabulary of phenotypic abnormalities that occur in human diseases. Computational methods such as label-propagation and supervised-learning address challenges posed by traditional approaches such as manual curation to link genes to phenotypes in the HPO. It is only in recent years that computational methods have been applied in a network-based approach for predicting genes to disease-related phenotypes. In this thesis, we present an extensive benchmarking of various computational methods for the task of network-based gene classification. These methods are evaluated on multiple protein interaction networks and feature representations. We empirically evaluate the performance of multiple prediction tasks using two evaluation experiments: cross-fold validation and the more stringent temporal holdout. We demonstrate that all of the prediction methods considered in our benchmarking analysis have similar performance, with each of the methods outperforming a random predictor.
- The Betweenness Centrality Of Biological NetworksNarayanan, Shivaram (Virginia Tech, 2005-09-16)In the last few years, large-scale experiments have generated genome-wide protein interaction networks for many organisms including Saccharomyces cerevisiae (baker's yeast), Caenorhabditis elegans (worm) and Drosophila melanogaster (fruit fly). In this thesis, we examine the vertex and edge betweenness centrality measures of these graphs. These measures capture how "central" a vertex or an edge is in the graph by considering the fraction of shortest paths that pass through that vertex or edge. Our primary observation is that the distribution of the vertex betweenness centrality follows a power law, but the distribution of the edge betweenness centrality has a Poisson-like distribution with a very sharp spike. To investigate this phenomenon, we generated random networks with degree distribution identical to those of the protein interaction networks. To our surprise, we found out that the random networks and the protein interaction networks had almost identical distribution of edge betweenness. We conjecture that the "Poisson-like" distribution of the edge betweenness centrality is the property of any graph whose degree distribution satisfies power law.
- A Biclustering Approach to Combinatorial Transcription ControlSrinivasan, Venkataraghavan (Virginia Tech, 2005-07-06)Combinatorial control of transcription is a well established phenomenon in the cell. Multiple transcription factors often bind to the same transcriptional control region of a gene and interact with each other to control the expression of the gene. It is thus necessary to consider the joint conservation of sequence pairs in order to identify combinations of binding sites to which the transcription factors bind. Conventional motif finding algorithms fail to address this issue. We propose a novel biclustering algorithm based on random sampling to identify candidate binding site combinations. We establish bounds on the various parameters to the algorithm and study the conditions under which the algorithm is guaranteed to identify candidate binding sites. We analyzed a yeast cell cycle gene expression data set using our algorithm and recovered certain novel combinations of binding sites, besides those already reported in the literature.
- Biologically-Interpretable Disease Classification Based on Gene Expression DataGrothaus, Gregory (Virginia Tech, 2005-05-13)Classification of tissues and diseases based on gene expression data is a powerful application of DNA microarrays. Many popular classifiers like support vector machines, nearest-neighbour methods, and boosting have been applied successfully to this problem. However, it is difficult to determine from these classifiers which genes are responsible for the distinctions between the diseases. We propose a novel framework for classification of gene expression data based on notion of condition-specific clusters of co-expressed genes called xMotifs. Our xMotif-based classifier is biologically interpretable: we show how we can detect relationships between xMotifs and gene functional annotations. Our classifier achieves high-accuracy on leave-one-out cross-validation on both two-class and multi-class data. Our technique has the potential to be the method of choice for researchers interested in disease and tissue classification.
- Bridging Methodological Gaps in Network-Based Systems BiologyPoirel, Christopher L. (Virginia Tech, 2013-10-16)Functioning of the living cell is controlled by a complex network of interactions among genes, proteins, and other molecules. A major goal of systems biology is to understand and explain the mechanisms by which these interactions govern the cell's response to various conditions. Molecular interaction networks have proven to be a powerful representation for studying cellular behavior. Numerous algorithms have been developed to unravel the complexity of these networks. Our work addresses the drawbacks of existing techniques. This thesis includes three related research efforts that introduce network-based approaches to bridge current methodological gaps in systems biology. i. Functional enrichment methods provide a summary of biological functions that are overrepresented in an interesting collection of genes (e.g., highly differentially expressed genes between a diseased cell and a healthy cell). Standard functional enrichment algorithms ignore the known interactions among proteins. We propose a novel network-based approach to functional enrichment that explicitly accounts for these underlying molecular interactions. Through this work, we close the gap between set-based functional enrichment and topological analysis of molecular interaction networks. ii. Many techniques have been developed to compute the response network of a cell. A recent trend in this area is to compute response networks of small size, with the rationale that only part of a pathway is often changed by disease and that interpreting small subnetworks is easier than interpreting larger ones. However, these methods may not uncover the spectrum of pathways perturbed in a particular experiment or disease. To avoid these difficulties, we propose to use algorithms that reconcile case-control DNA microarray data with a molecular interaction network by modifying per-gene differential expression p-values such that two genes connected by an interaction show similar changes in their gene expression values. iii. Top-down analyses in systems biology can automatically find correlations among genes and proteins in large-scale datasets. However, it is often difficult to design experiments from these results. In contrast, bottom-up approaches painstakingly craft detailed models of cellular processes. However, developing the models is a manual process that can take many years. These approaches have largely been developed independently. We present Linker, an efficient and automated data-driven method that analyzes molecular interactomes. Linker combines teleporting random walks and k-shortest path computations to discover connections from a set of source proteins to a set of target proteins. We demonstrate the efficacy of Linker through two applications: proposing extensions to an existing model of cell cycle regulation in budding yeast and automated reconstruction of human signaling pathways. Linker achieves superior precision and recall compared to state-of-the-art algorithms from the literature.
- Capturing Truthiness: Mining Truth Tables in Binary DatasetsOwens, Clifford Conley; Murali, T. M.; Ramakrishnan, Naren (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007-02-01)We introduce a new data mining problem: mining truth tables in binary datasets. Given a matrix of objects and the properties they satisfy, a truth table identifies a subset of properties that exhibit maximal variability (and hence, complete independence) in occurrence patterns over the underlying objects. This problem is relevant in many domains, e.g., bioinformatics where we seek to identify and model independent components of combinatorial regulatory pathways, and in social/economic demographics where we desire to determine independent behavioral attributes of populations. Besides intrinsic interest in such patterns, we show how the problem of mining truth tables is dual to the problem of mining redescriptions, in that a set of properties involved in a truth table cannot participate in any possible redescription. This allows us to adapt our algorithm to the problem of mining redescriptions as well, by first identifying regions where redescriptions cannot happen, and then pursuing a divide and conquer strategy around these regions. Furthermore, our work suggests dual mining strategies where both classes of algorithms can be brought to bear upon either data mining task. We outline a family of levelwise approaches adapted to mining truth tables, algorithmic optimizations, and applications to bioinformatics and political datasets.
- Combinatorial Algorithms for Server Allocation ProblemSowle, Rachita (Virginia Tech, 2024-09-05)Motivated by problems in logistics, image recognition, and statistics, we consider the server allocation problem. In this problem, we are given $k$ servers (with capacities) and $n$ requests, which are points in a metric space. A server serves a request by moving to the request location, and the goal is to serve all requests while minimizing the total movement of servers, subject to the constraint that the number of requests served by a server cannot exceed its capacity. When the server capacity is $1$, and for the Euclidean metric, the problem reduces to the Euclidean bipartite matching problem. When the capacity is $infty$, suppose we are also provided with the order in which requests are to be served, the problem is the $k$-first come first served routing problem. We also consider a generalization of the $k$-first come first served routing problem to the taxi allocation problem, where each request is associated with a pickup location, dropoff location, and pickup time, and the server's velocity is also given as input. We present new algorithms for the Euclidean bipartite matching problem, showing improvements over existing algorithms. In particular, for two point sets $A, B subset mathbb{R}^d$ with $|A| = |B| = n$ and dimension $d > 1$ being constant, we developed: begin{itemize} item A faster algorithm that computes an $varepsilon$-approximate minimum-cost perfect matching in $O(n(varepsilon^{-O(d^3)}loglog n + varepsilon^{-O(d)}log^4 nlog^5log n))$ time. This is an improvement over previous algorithms, which took $n(varepsilon^{-1}log n)^{Omega(d)}$ time. item An algorithm that boosts the accuracy of any $varepsilon$-additive approximation algorithm, achieving an expected additive error of $min{varepsilon, (dloglog n)w^*}$ from the optimal matching cost $w^*$ in $O(T(n, varepsilon/d)loglog n)$ time, where $T(n, varepsilon)$ is the time complexity of any given $eps$-additive approximation algorithm. end{itemize} For the $k$-first come first served routing problem, we present the following results. begin{itemize} item The online version of the $k$-first come first served routing problem is the celebrated $k$-server problem. The best-known online algorithm for this problem is the Work Function algorithm. We present a new implementation of the work function algorithm, where processing the $i$th request takes $O((i+k)^2)$ time, improving on the previous methods that take $Omega(k(i+k)^2)$ time. item For the offline setting, we show that the $k$-first come first served routing problem and the taxi allocation problem can be reduced to the minimum-cost bipartite matching problem. Using this reduction, begin{itemize} item we develop a time-based divide-and-conquer algorithm to obtain an optimal solution in $tilde{O}(kn^2)$ time, which can be further improved to $tilde{O}(kn)$ when the requests and servers are in two-dimensional Euclidean space, and, item we apply a recently presented geometric divide-and-conquer algorithm to obtain an optimal solution for the taxi routing problem in a two-dimensional Euclidean space. As a result, we obtain significant empirical performance improvements for the taxi allocation problem in a two-dimensional space where the cost of moving from one location to another is lower bounded by the Euclidean cost. end{itemize} end{itemize}
- Compositional Mining of Multi-Relational Biological DatasetsJin, Ying; Murali, T. M.; Ramakrishnan, Naren (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007-08-01)High-throughput biological screens are yielding ever-growing streams of information about multiple aspects of cellular activity. As more and more categories of datasets come online, there is a corresponding multitude of ways in which inferences can be chained across them, motivating the need for compositional data mining algorithms. In this paper, we argue that such compositional data mining can be effectively realized by functionally cascading redescription mining and biclustering algorithms as primitives. Both these primitives mirror shifts of vocabulary that can be composed in arbitrary ways to create rich chains of inferences. Given a relational database and its schema, we show how the schema can be automatically compiled into a compositional data mining program, and how different domains in the schema can be related through logical sequences of biclustering and redescription invocations. This feature allows us to rapidly prototype new data mining applications, yielding greater understanding of scientific datasets. We describe two applications of compositional data mining: (i) matching terms across categories of the Gene Ontology and (ii) understanding the molecular mechanisms underlying stress response in human cells.
- Computational approaches for discovery of common immunomodulators in fungal infections: towards broad-spectrum immunotherapeutic interventionsKidane, Yared H.; Lawrence, Christopher B.; Murali, T. M. (2013-10-07)Background Fungi are the second most abundant type of human pathogens. Invasive fungal pathogens are leading causes of life-threatening infections in clinical settings. Toxicity to the host and drug-resistance are two major deleterious issues associated with existing antifungal agents. Increasing a host’s tolerance and/or immunity to fungal pathogens has potential to alleviate these problems. A host’s tolerance may be improved by modulating the immune system such that it responds more rapidly and robustly in all facets, ranging from the recognition of pathogens to their clearance from the host. An understanding of biological processes and genes that are perturbed during attempted fungal exposure, colonization, and/or invasion will help guide the identification of endogenous immunomodulators and/or small molecules that activate host-immune responses such as specialized adjuvants. Results In this study, we present computational techniques and approaches using publicly available transcriptional data sets, to predict immunomodulators that may act against multiple fungal pathogens. Our study analyzed data sets derived from host cells exposed to five fungal pathogens, namely, Alternaria alternata, Aspergillus fumigatus, Candida albicans, Pneumocystis jirovecii, and Stachybotrys chartarum. We observed statistically significant associations between host responses to A. fumigatus and C. albicans. Our analysis identified biological processes that were consistently perturbed by these two pathogens. These processes contained both immune response-inducing genes such as MALT1, SERPINE1, ICAM1, and IL8, and immune response-repressing genes such as DUSP8, DUSP6, and SPRED2. We hypothesize that these genes belong to a pool of common immunomodulators that can potentially be activated or suppressed (agonized or antagonized) in order to render the host more tolerant to infections caused by A. fumigatus and C. albicans. Conclusions Our computational approaches and methodologies described here can now be applied to newly generated or expanded data sets for further elucidation of additional drug targets. Moreover, identified immunomodulators may be used to generate experimentally testable hypotheses that could help in the discovery of broad-spectrum immunotherapeutic interventions. All of our results are available at the following supplementary website: http://bioinformatics.cs.vt.edu/~murali/supplements/2013-kidane-bmc
- Computational prediction of host-pathogen protein–protein interactionsDyer, Matthew D.; Murali, T. M.; Sobral, Bruno (Oxford University Press, 2007)Motivation: Infectious diseases such as malaria result in millions of deaths each year. An important aspect of any host-pathogen system is the mechanism by which a pathogen can infect its host. One method of infection is via protein–protein interactions (PPIs) where pathogen proteins target host proteins. Developing computational methods that identify which PPIs enable a pathogen to infect a host has great implications in identifying potential targets for therapeutics. Results: We present a method that integrates known intra-species PPIs with protein-domain profiles to predict PPIs between host and pathogen proteins. Given a set of intra-species PPIs, we identify the functional domains in each of the interacting proteins. For every pair of functional domains, we use Bayesian statistics to assess the probability that two proteins with that pair of domains will interact. We apply our method to the Homo sapiens – Plasmodium falciparum host-pathogen system. Our system predicts 516 PPIs between proteins from these two organisms. We show that pairs of human proteins we predict to interact with the same Plasmodium protein are close to each other in the human PPI network and that Plasmodium pairs predicted to interact with same human protein are co-expressed in DNA microarray datasets measured during various stages of the Plasmodium life cycle. Finally, we identify functionally enriched sub-networks spanned by the predicted interactions and discuss the plausibility of our predictions.
- Connectivity Measures for Signaling Pathway TopologiesFranzese, Nicholas; Groce, Adam; Murali, T. M.; Ritz, Anna (Virginia Tech, 2019-03-30)Characterizing cellular responses to different extrinsic signals is an active area of research, and curated pathway databases describe these complex signaling reactions. Here, we revisit a fundamental question in signaling pathway analysis: are two molecules “connected” in a network? This question is the first step towards understanding the potential influence of molecules in a pathway, and the answer depends on the choice of modeling framework. We examined the connectivity of Reactome signaling pathways using four different pathway representations. We find that Reactome is very well connected as a graph, moderately well connected as a compound graph or bipartite graph, and poorly connected as a hypergraph (which captures many-to-many relationships in reaction networks). We present a novel relaxation of hypergraph connectivity that iteratively increases connectivity from a node while preserving the hypergraph topology. This measure, B-relaxation distance, provides a parameterized transition between hypergraph connectivity and graph connectivity. B-relaxation distance is sensitive to the presence of small molecules that participate in many functionally unrelated reactions in the network. We also define a score that quantifies one pathway’s downstream influence on another, which can be calculated as B-relaxation distance gradually relaxes the connectivity constraint in hypergraphs. Computing this score across all pairs of 34 Reactome pathways reveals two case studies of pathway influence, and we describe the specific reactions that contribute to the large influence score. Our method lays the groundwork for other generalizations of graph-theoretic concepts to hypergraphs in order to facilitate signaling pathway analysis.
- Creating Scientific Software, with Application to Phylogenetics and Oligonucleotide Probe DesignNordberg, Eric Kinsley (Virginia Tech, 2015-12-09)The demands placed on scientific software are different from those placed on general purpose software, and as a result, creating software for science and for scientists requires a specialized approach. Much of software engineering practices have developed in situations in which a tool is desired to perform some definable task, with measurable and verifiable outcomes. The users and the developers know what the tool "should" do. Scientific software often uses unproven or experimental techniques to address unsolved problems. The software is often run on "experimental" High Performance Computing hardware, adding another layer of complexity. It may not be possible to say what the software should do, or what the results should be, as these may be connected to very scientific questions for which the software is being developed. Software development in this realm requires a deep understanding of the relevent scientific domain area. The present work describes applications resulting from a scientific software development process that builds upon detailed understanding of the scientific domain area. YODA is an application primarily for selecting microarray probe sequences for measuring gene expression. At the time of its development, none of the existing programs for this task satisfied the best-known requirements for microarray probe selection. The question of what makes a good microarray probe was a research area at the time, and YODA was developed to incorporate the latest understanding of these requirements, drawn from the research literature, into a tool that can be used by a research biologist. An appendix examines the response and use in the years since YODA was released. PEPR is a software system for inferring highly resolved whole-genome phylogenies for hundreds of genomes. It encodes a process developed through years of research and collaboration to produce some of the highest quality phylogenies available for large sets of bacterial genomes, with no manual intervention required. This process is described in detail, and results are compared with high quality results from the literature to show that the process is at least as successful as more labor-intensive manual efforts. An appendix presents additional results, including high quality phylogenies for many bacterial Orders.
- CrowdLayout: Crowdsourced Design and Evaluation of Biological Network VisualizationsSingh, Divit P.; Lisle, Lee; Murali, T. M.; Luther, Kurt (ACM, 2018-04)Biologists often perform experiments whose results generate large quantities of data, such as interactions between molecules in a cell, that are best represented as networks (graphs). To visualize these networks and communicate them in publications, biologists must manually position the nodes and edges of each network to reflect their real-world physical structure. This process does not scale well, and graph layout algorithms lack the biological underpinnings to offer a viable alternative. In this paper, we present CrowdLayout, a crowdsourcing system that leverages human intelligence and creativity to design layouts of biological network visualizations. CrowdLayout provides design guidelines, abstractions, and editing tools to help novice workers perform like experts. We evaluated CrowdLayout in two experiments with paid crowd workers and real biological network data, finding that crowds could both create and evaluate meaningful, high-quality layouts. We also discuss implications for crowdsourced design and network visualizations in other domains.