Browsing by Author "Kim, Yoonjin"
- ARGem: a new metagenomics pipeline for antibiotic resistance genes: metadata, analysis, and visualization
  Liang, Xiao; Zhang, Jingyi; Kim, Yoonjin; Ho, Josh; Liu, Kevin; Keenum, Ishi M.; Gupta, Suraj; Davis, Benjamin; Hepp, Shannon L.; Zhang, Liqing; Xia, Kang; Knowlton, Katharine F.; Liao, Jingqiu; Vikesland, Peter J.; Pruden, Amy; Heath, Lenwood S. (Frontiers, 2023-09-15)
  Antibiotic resistance is of crucial interest to both human and animal medicine. It has been recognized that increased environmental monitoring of antibiotic resistance is needed. Metagenomic DNA sequencing is becoming an attractive method to profile antibiotic resistance genes (ARGs), including a special focus on pathogens. A number of computational pipelines are available and under development to support environmental ARG monitoring; the pipeline we present here is promising for general adoption for the purpose of harmonized global monitoring. Specifically, ARGem is a user-friendly pipeline that provides full-service analysis, from the initial DNA short reads to the final visualization of results. The capture of extensive metadata is also facilitated to support comparability across projects and broader monitoring goals. The ARGem pipeline offers efficient analysis of a modest number of samples along with affordable computational components, though the throughput could be increased through cloud resources, based on the user’s configuration. The pipeline components were carefully assessed and selected to satisfy tradeoffs, balancing efficiency and flexibility. It was essential to provide a step to perform short read assembly in a reasonable time frame to ensure accurate annotation of identified ARGs. Comprehensive ARG and mobile genetic element databases are included in ARGem for annotation support. ARGem further includes an expandable set of analysis tools that includes statistical and network analysis and supports various useful visualization techniques, including Cytoscape visualization of co-occurrence and correlation networks. The performance and flexibility of the ARGem pipeline are demonstrated with analysis of aquatic metagenomes. The pipeline is freely available at https://github.com/xlxlxlx/ARGem.
- Flud: A Hybrid Crowd–Algorithm Approach for Visualizing Biological Networks
  Bharadwaj, Aditya; Gwizdala, David; Kim, Yoonjin; Luther, Kurt; Murali, T. M. (2022-01)
  Modern experiments in many disciplines generate large quantities of network (graph) data. Researchers require aesthetic layouts of these networks that clearly convey the domain knowledge and meaning. However, the problem remains challenging due to multiple conflicting aesthetic criteria and complex domain-specific constraints. In this article, we present a strategy for generating visualizations that can help network biologists understand the protein interactions that underlie processes that take place in the cell. Specifically, we have developed Flud, a crowd-powered system that allows humans with no expertise to design biologically meaningful graph layouts with the help of algorithmically generated suggestions. Furthermore, we propose a novel hybrid approach for graph layout wherein crowd workers and a simulated annealing algorithm build on each other’s progress. A study of about 2,000 crowd workers on Amazon Mechanical Turk showed that the hybrid crowd–algorithm approach outperforms the crowd-only approach and state-of-the-art techniques when workers were asked to lay out complex networks that represent signaling pathways. Another study of seven participants with biological training showed that Flud layouts are more effective than those created by state-of-the-art techniques. We also found that the algorithmically generated suggestions guided the workers when they were stuck and helped them improve their score. Finally, we discuss broader implications for mixed-initiative interactions in layout design tasks beyond biology.
- The probability of chromatin to be at the nuclear lamina has no systematic effect on its transcription level in fruit flies
  Afanasyev, Alexander Y.; Kim, Yoonjin; Tolokh, Igor S.; Sharakhov, Igor V.; Onufriev, Alexey V. (2024-05-06)
  Background: Multiple studies have demonstrated a negative correlation between gene expression and positioning of genes at the nuclear envelope (NE) lined by nuclear lamina, but the exact relationship remains unclear, especially in light of the highly stochastic, transient nature of the gene association with the NE.
  Results: In this paper, we ask whether there is a causal, systematic, genome-wide relationship between the expression levels of the groups of genes in topologically associating domains (TADs) of Drosophila nuclei and the probabilities of TADs to be found at the NE. To investigate the nature of this possible relationship, we combine a coarse-grained dynamic model of the entire Drosophila nucleus with genome-wide gene expression data; we analyze the TAD-averaged transcription levels of genes against the probabilities of individual TADs to be in contact with the NE in the control and lamins-depleted nuclei. Our findings demonstrate that, within the statistical error margin, the stochastic positioning of Drosophila melanogaster TADs at the NE does not, by itself, systematically affect the mean level of gene expression in these TADs, while the expected negative correlation is confirmed. The correlation is weak and disappears completely for TADs not containing lamina-associated domains (LADs) or TADs containing LADs, considered separately. Verifiable hypotheses regarding the underlying mechanism for the presence of the correlation without causality are discussed. These include the possibility that the epigenetic marks and affinity to the NE of a TAD are determined by various non-mutually exclusive mechanisms and remain relatively stable during interphase.
  Conclusions: At the level of TADs, the probability of chromatin being in contact with the nuclear envelope has no systematic, causal effect on the transcription level in Drosophila. The conclusion is reached by combining model-derived time-evolution of TAD locations within the nucleus with their experimental gene expression levels.
- Summarization of Maryland Shooting Collection
  Khawas, Prapti; Banerjee, Bipasha; Zhao, Shuqi; Fan, Yiyang; Kim, Yoonjin (Virginia Tech, 2018-12-12)
  The goal of this work is to generate summaries of two Maryland shooting events from a large collection of web pages related to a shooting at Great Mills High School and another at the Capital Gazette newsroom. Since our team did not have prior experience with Computational Linguistics / Natural Language Processing (NLP), we followed an approach where we built summaries using 10 different methods, as suggested by course instructor Dr. Edward Fox, with each method being more sophisticated than the previous ones, to enable learning of key concepts in NLP. First, we started with finding a set of the most frequent important words. Then, we found other words occurring in the articles that mean the same as the frequent words found. Along with the synonyms, we found sets of hypernyms and hyponyms. We identified a set of words constrained by POS, e.g., nouns and verbs. We then tried out various classification techniques in Apache Mahout to classify the documents into the two different events and eliminate irrelevant documents. Next, we identified a set of frequent and important named entities using NLTK and SpaCy Named Entity Recognition (NER) modules. We identified a set of important topics using Latent Dirichlet Allocation (LDA). We then generated clusters of documents using K-means. Next, we extracted a set of values for each slot matching collection semantics using regular expressions and generated a readable summary explaining the slots and values using a Context Free Grammar we developed. Finally, we used the Pointer Generator deep learning approach to generate a readable abstractive summary. Using the above approach, we generated two extractive summaries, for the newsroom shooting event and the school shooting event, with ROUGE-1 scores around 0.33 and 0.26 respectively. For the abstractive summaries that we generated, the ROUGE-1 score was 0.36 for the newsroom shooting event and 0.20 for the school shooting event. We also evaluated the summaries at sentence level and found that the abstractive school shooting summary had a higher ROUGE-1 score (0.88) than the abstractive newsroom shooting summary (0.73). We employed the Hadoop MapReduce framework to speed up the processing time for our large collection. We used various other tools, like the NLTK language processing library and Apache Mahout, a distributed linear algebra framework, to simplify our development. We learned that a variety of different methods and techniques that suit the collection are necessary in order to provide an accurate summary. We also learned the importance of cleaning the collection and the challenges in the task.
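The ROUGE-1 scores quoted in the last abstract measure unigram overlap between a generated summary and a reference summary. The abstract does not say which variant (precision, recall, or F1) or which tokenization was used, so the following is only a minimal sketch under assumed choices: lowercase whitespace tokenization and clipped unigram counts. The function name `rouge_1` and the example sentences are hypothetical and not taken from the report.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap scores between a candidate and a reference summary.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it
    # appears in both texts (multiset intersection).
    overlap = sum((cand_counts & ref_counts).values())

    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
    print(scores)
```

In this toy pair, five of the six unigrams match in each direction, so precision, recall, and F1 all come out to 5/6. Published ROUGE toolkits typically add stemming and other normalization, so their scores can differ from this sketch.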