Browsing by Author "Kumar, Deept"
- Algorithms for Storytelling
  Kumar, Deept; Ramakrishnan, Naren; Helm, Richard F.; Potts, Malcolm (Department of Computer Science, Virginia Polytechnic Institute & State University, 2006)
  We formulate a new data mining problem called "storytelling" as a generalization of redescription mining. In traditional redescription mining, we are given a set of objects and a collection of subsets defined over these objects. The goal is to view the set system as a vocabulary and identify two expressions in this vocabulary that induce the same set of objects. Storytelling, on the other hand, aims to explicitly relate object sets that are disjoint (and hence, maximally dissimilar) by finding a chain of (approximate) redescriptions between the sets. This problem finds applications in bioinformatics, for instance, where the biologist is trying to relate a set of genes expressed in one experiment to another set, implicated in a different pathway. We outline an efficient storytelling implementation that embeds the CARTwheels redescription mining algorithm in an A* search procedure, using the former to supply next move operators on search branches to the latter. This approach is practical and effective for mining large datasets and, at the same time, exploits the structure of partitions imposed by the given vocabulary. Three application case studies are presented: a study of word overlaps in large English dictionaries, exploring connections between genesets in a bioinformatics dataset, and relating publications in the PubMed index of abstracts.
- Mining Novellas from PubMed Abstracts using a Storytelling Algorithm
  Gresock, Joseph; Kumar, Deept; Helm, Richard F.; Potts, Malcolm; Ramakrishnan, Naren (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007)
  Motivation: There are now a multitude of articles published in a diversity of journals providing information about genes, proteins, pathways, and entire processes. Each article investigates particular subsets of a biological process, but to gain insight into the functioning of a system as a whole, we must computationally integrate information across multiple publications. This is especially important in problems such as modeling cross-talk in signaling networks, designing drug therapies for combinatorial selectivity, and unraveling the role of gene interactions in deleterious phenotypes, where the cost of performing combinatorial screens is exorbitant.
  Results: We present an automated approach to biological knowledge discovery from PubMed abstracts, suitable for unraveling combinatorial relationships. It involves the systematic application of a "storytelling" algorithm followed by compression of the stories into "novellas." Given a start and end publication, typically with little or no overlap in content, storytelling identifies a chain of intermediate publications from one to the other, such that neighboring publications have significant content similarity. Stories discovered thus provide an argued approach to relate distant concepts through compositions of related concepts. The chains of links employed by stories are then mined to find frequently reused sub-stories, which can be compressed to yield novellas, or compact templates of connections. We demonstrate a successful application of storytelling and novella finding to modeling combinatorial relationships between introduction of extracellular factors and downstream cellular events.
  Availability: A story visualizer, suitable for interactive exploration of stories and novellas described in this paper, is available for demo/download at https://bioinformatics.cs.vt.edu/storytelling.
- Modeling Diffusion-Controlled Emissions of Volatile Organic Compounds From Layered Building Materials
  Kumar, Deept (Virginia Tech, 2002-06-19)
  Building materials are a major source of indoor air contaminants. Volatile organic compounds (VOCs) are an important class of contaminants prevalent in indoor air. Attempts have been made to model the emission of VOCs from building materials. Diffusion has been shown to control the rate of mass transfer within certain types of building materials. The primary objective of this research is to develop a fundamental diffusion-based model for single-layer and double-layer building materials. The single-layer model considers a slab of material located on the floor of a chamber or room with the material acting either as a source or a sink for VOCs. The behavior of the model is governed by the material-phase diffusion coefficient (D), the material/air partition coefficient (K), the concentration of VOC in the influent air stream, and the initial concentration within the material phase. The single-layer model extends a previously developed version, incorporating the non-uniform initial concentration inside the building material and a transient influent concentration. Experimental work is performed to check the validity of the model. A steel chamber housing a piece of vinyl flooring is used to simulate building material within a room. D and K values for two representative VOCs, n-dodecane and phenol, are available from earlier experiments. These parameters are used in the model to predict the VOC concentration inside the chamber. The predicted values compare very well to the observed experimental data. A double-layer version of the model is developed and studied from a theoretical perspective. The model also permits a time-dependent influent concentration and a non-uniform initial concentration profile within each of the two layers.
  A parametric analysis is performed varying the ratio of the diffusion coefficients, the partition coefficients and the thickness of the two layers. Three cases of practical interest are studied using the double-layer model. The use of a thin low-permeability barrier layer placed on top of a building material is shown to hold considerable promise for reducing the emission rate of VOCs into indoor air.
- Redescription Mining: Algorithms and Applications in Bioinformatics
  Kumar, Deept (Virginia Tech, 2007-04-19)
  Scientific data mining purports to extract useful knowledge from massive datasets curated through computational science efforts, e.g., in bioinformatics, cosmology, geographic sciences, and computational chemistry. In the recent past, we have witnessed major transformations of these applied sciences into data-driven endeavors. In particular, scientists are now faced with an overload of vocabularies for describing domain entities. All of these vocabularies offer alternative and mostly complementary (sometimes, even contradictory) ways to organize information, and each vocabulary provides a different perspective into the problem being studied. To further knowledge discovery, computational scientists need tools to help uniformly reason across vocabularies, integrate multiple forms of characterizing datasets, and situate knowledge gained from one study in terms of others. This dissertation defines a new pattern class called redescriptions that provides high-level capabilities for reasoning across domain vocabularies. A redescription is a shift of vocabulary, or a different way of communicating the same information; redescription mining finds concerted sets of objects that can be defined in (at least) two ways using given descriptors. We present the CARTwheels algorithm for mining redescriptions by exploiting equivalences of partitions induced by distinct descriptor classes, as well as applications of CARTwheels to several bioinformatics datasets. We then outline how we can build more complex data mining operations by cascading redescriptions to realize a story, leading to a new data mining capability called storytelling. Besides applications to characterizing gene sets, we showcase its uses in other datasets as well.
  Finally, we extend the core CARTwheels algorithm by introducing a theoretical framework, based on partitions, to systematically explore redescription space; generalizing from mining redescriptions (and stories) within a single domain to relating descriptors across different domains, to support complex relational data mining scenarios; and exploiting structure of the underlying descriptor space to yield more effective algorithms for specific classes of datasets.
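
The storytelling idea that recurs in the abstracts above, relating two disjoint object sets through a chain of sets in which each neighboring pair overlaps substantially, can be illustrated with a small sketch. This is not the authors' implementation: it does not use CARTwheels or A* search, and substitutes a plain breadth-first search over named sets. The use of Jaccard similarity as the measure of an "approximate redescription", the threshold `theta`, and the names `jaccard`, `find_story`, and the toy `descriptors` are all assumptions made for illustration.

```python
from collections import deque


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (assumed overlap measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def find_story(descriptors, start, end, theta):
    """Return a shortest chain of descriptor names from start to end such that
    consecutive sets have Jaccard similarity >= theta, or None if no chain exists.

    Plain breadth-first search stands in for the A* procedure named in the abstract.
    """
    parent = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == end:
            # Walk parent links back to the start to recover the chain.
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for name, members in descriptors.items():
            if name not in parent and jaccard(descriptors[current], members) >= theta:
                parent[name] = current
                queue.append(name)
    return None


# Hypothetical toy vocabulary: "A" and "D" are disjoint, but are linked via "B" and "C".
descriptors = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5, 6},
    "C": {5, 6, 7, 8},
    "D": {7, 8, 9, 10},
}

story = find_story(descriptors, "A", "D", theta=0.3)
print(story)
```

In this toy vocabulary each neighboring pair shares two of six objects (similarity of about 0.33), so a threshold of 0.3 admits the chain while a stricter threshold such as 0.5 leaves the two end sets unconnected.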