Browsing by Author "Mazloom, Reza"
- CS5604 (Information Retrieval) Fall 2020 Front-end (FE) Team Project
  Cao, Yusheng; Mazloom, Reza; Ogunleye, Makanjuola (Virginia Tech, 2020-12-16)
  With the demand and abundance of information increasing over the last two decades, generations of computer scientists are trying to improve the whole process of information searching, retrieval, and storage. With the diversification of the information sources, users' demand for various requirements of the data has also changed drastically both in terms of usability and performance. Due to the growth of the source material and requirements, correctly sorting, filtering, and storing has given rise to many new challenges in the field. With the help of all four other teams on this project, we are developing an information retrieval, analysis, and storage system to retrieve data from Virginia Tech's Electronic Thesis and Dissertation (ETD), Twitter, and Web Page archives. We seek to provide an appropriate data research and management tool to the users to access specific data. The system will also give certain users the authority to manage and add more data to the system. This project's deliverable will be combined with four others to produce a system usable by Virginia Tech's library system to manage, maintain, and analyze these archives. This report attempts to introduce the system components and design decisions regarding how it has been planned and implemented. Our team has developed a front end web interface that is able to search, retrieve, and manage three important content collection types: ETDs, tweets, and web pages. The interface incorporates a simple hierarchical user permission system, providing different levels of access to its users. In order to facilitate the workflow with other teams, we have containerized this system and made it available on the Virginia Tech cloud server.
  The system also makes use of a dynamic workflow system using a KnowledgeGraph and Apache Airflow, providing high levels of functional extensibility to the system. This allows curators and researchers to use containerised services for crawling, pre-processing, parsing, and indexing their custom corpora and collections that are available to them in the system.
- Genomic delineation and description of species and within-species lineages in the genus Pantoea
  Crosby, Katherine C.; Rojas, Mariah; Sharma, Parul; Johnson, Marcela A.; Mazloom, Reza; Kvitko, Brian H.; Smits, Theo HM M.; Venter, Stephanus N.; Coutinho, Teresa A.; Heath, Lenwood S.; Palmer, Marike; Vinatzer, Boris A. (Frontiers, 2023-11-09)
  As the name of the genus Pantoea (“of all sorts and sources”) suggests, this genus includes bacteria with a wide range of provenances, including plants, animals, soils, components of the water cycle, and humans. Some members of the genus are pathogenic to plants, and some are suspected to be opportunistic human pathogens; while others are used as microbial pesticides or show promise in biotechnological applications. During its taxonomic history, the genus and its species have seen many revisions. However, evolutionary and comparative genomics studies have started to provide a solid foundation for a more stable taxonomy. To move further toward this goal, we have built a 2,509-gene core genome tree of 437 public genome sequences representing the currently known diversity of the genus Pantoea. Clades were evaluated for being evolutionarily and ecologically significant by determining bootstrap support, gene content differences, and recent recombination events. These results were then integrated with genome metadata, published literature, descriptions of named species with standing in nomenclature, and circumscriptions of yet-unnamed species clusters, 15 of which we assigned names under the nascent SeqCode. Finally, genome-based circumscriptions and descriptions of each species and each significant genetic lineage within species were uploaded to the LINbase Web server so that newly sequenced genomes of isolates belonging to any of these groups could be precisely and accurately identified.
- LINbase: a web server for genome-based identification of prokaryotes as members of crowdsourced taxa
  Tian, Long; Huang, Chengjie; Mazloom, Reza; Heath, Lenwood S.; Vinatzer, Boris A. (Oxford University Press, 2020-03-30)
  High throughput DNA sequencing in combination with efficient algorithms could provide the basis for a highly resolved, genome phylogeny-based and digital prokaryotic taxonomy. However, current taxonomic practice continues to rely on cumbersome journal publications for the description of new species, which still constitute the smallest taxonomic units. In response, we introduce LINbase, a web server that allows users to genomically circumscribe any group of prokaryotes with measurable DNA similarity and that uses the individual isolate as smallest unit. Since LINbase leverages the concept of Life Identification Numbers (LINs), which are codes assigned to individual genomes based on reciprocal average nucleotide identity, we refer to groups circumscribed in LINbase as LINgroups. Users can associate with each LINgroup a name, a short description, and a URL to a peer-reviewed publication. As soon as a LINgroup is circumscribed, any user can immediately identify query genomes as members and submit comments about the LINgroup. Most genomes currently in LINbase were imported from GenBank, but users can upload their own genome sequences as well. In conclusion, LINbase combines the resolution of LINs with the power of crowdsourcing in support of a highly resolved, genome phylogeny-based digital taxonomy. LINbase is available at http://www.LINbase.org.
- LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes
  Tian, Long; Mazloom, Reza; Heath, Lenwood S.; Vinatzer, Boris A. (2021-03-24)
  Background: Computing genomic similarity between strains is a prerequisite for genome-based prokaryotic classification and identification. Genomic similarity was first computed as Average Nucleotide Identity (ANI) values based on the alignment of genomic fragments. Since this is computationally expensive, faster and computationally cheaper alignment-free methods have been developed to estimate ANI. However, these methods do not reach the level of accuracy of alignment-based methods.
  Methods: Here we introduce LINflow, a computational pipeline that infers pairwise genomic similarity in a set of genomes. LINflow takes advantage of the speed of the alignment-free sourmash tool to identify the genome in a dataset that is most similar to a query genome and the precision of the alignment-based pyani software to precisely compute ANI between the query genome and the most similar genome identified by sourmash. This is repeated for each new genome that is added to a dataset. The sequentially computed ANI values are stored as Life Identification Numbers (LINs), which are then used to infer all other pairwise ANI values in the set. We tested LINflow on four sets, 484 genomes in total, and compared the needed time and the generated similarity matrices with other tools.
  Results: LINflow is up to 150 times faster than pyani and pairwise ANI values generated by LINflow are highly correlated with those computed by pyani. However, because LINflow infers most pairwise ANI values instead of computing them directly, ANI values occasionally depart from the ANI values computed by pyani. In conclusion, LINflow is a fast and memory-efficient pipeline to infer similarity among a large set of prokaryotic genomes.
  Its ability to quickly add new genome sequences to an already computed similarity matrix makes LINflow particularly useful for projects when new genome sequences need to be regularly added to an existing dataset.
- Meta-analysis of the Ralstonia solanacearum species complex (RSSC) based on comparative evolutionary genomics and reverse ecology
  Sharma, Parul; Johnson, Marcela A.; Mazloom, Reza; Allen, Caitilyn; Heath, Lenwood S.; Lowe-Power, Tiffany M.; Vinatzer, Boris A. (Microbiology Society, 2022-03)
  Ralstonia solanacearum species complex (RSSC) strains are bacteria that colonize plant xylem tissue and cause vascular wilt diseases. However, individual strains vary in host range, optimal disease temperatures and physiological traits. To increase our understanding of the evolution, diversity and biology of the RSSC, we performed a meta-analysis of 100 representative RSSC genomes. These 100 RSSC genomes contain 4940 genes on average, and a pangenome analysis found that there are 3262 genes in the core genome (~60 % of the mean RSSC genome) with 13 128 genes in the extensive flexible genome. A core genome phylogenetic tree and a whole-genome similarity matrix aligned with the previously named species (R. solanacearum, R. pseudosolanacearum, R. syzygii) and phylotypes (I-IV). These analyses also highlighted a third unrecognized sub-clade of phylotype II. Additionally, we identified differences between phylotypes with respect to gene content and recombination rate, and we delineated population clusters based on the extent of horizontal gene transfer. Multiple analyses indicate that phylotype II is the most diverse phylotype, and it may thus represent the ancestral group of the RSSC. We also used our genome-based framework to test whether the RSSC sequence variant (sequevar) taxonomy is a robust method to define within-species relationships of strains. The sequevar taxonomy is based on alignments of a single conserved gene (egl). Although sequevars in phylotype II describe monophyletic groups, the sequevar system breaks down in the highly recombinogenic phylotype I, which highlights the need for an improved, cost-effective method for genotyping strains in phylotype I.
  Finally, we enabled quick and precise genome-based identification of newly sequenced RSSC strains by assigning Life Identification Numbers (LINs) to the 100 strains and by circumscribing the RSSC and its sub-groups in the LINbase Web service.
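
The LINflow abstract above describes storing sequentially computed ANI values as LINs and then inferring the remaining pairwise ANI values from them. The following is a minimal sketch of that idea, not the authors' implementation. The threshold values, the function names `assign_lin` and `infer_ani_range`, and the tuple representation of a LIN are all assumptions made for illustration. The sourmash search and the pyani computation are left out: the sketch takes the nearest genome and its ANI as given inputs.

```python
from typing import List, Optional, Tuple

# Hypothetical ANI thresholds (percent), one per LIN position, ascending.
# The thresholds used by LINflow/LINbase may differ.
THRESHOLDS = (70.0, 80.0, 90.0, 95.0, 99.0)

LIN = Tuple[int, ...]


def assign_lin(ani_to_nearest: Optional[float],
               nearest: Optional[LIN],
               existing: List[LIN]) -> LIN:
    """Assign a LIN to a new genome from its ANI to the most similar genome.

    Assumes `nearest` is one of the LINs in `existing`. For the first
    genome in a dataset, pass None for both `ani_to_nearest` and `nearest`.
    """
    n = len(THRESHOLDS)
    if nearest is None:
        return (0,) * n
    # Number of leading positions shared with the nearest genome:
    # one position per threshold that the measured ANI reaches.
    shared = sum(1 for t in THRESHOLDS if ani_to_nearest >= t)
    if shared == n:
        # Toy simplification: indistinguishable at this resolution.
        return nearest
    prefix = nearest[:shared]
    # Take the next unused value at the first differing position.
    used = [lin[shared] for lin in existing if lin[:shared] == prefix]
    return prefix + (max(used) + 1,) + (0,) * (n - shared - 1)


def infer_ani_range(a: LIN, b: LIN) -> Tuple[float, float]:
    """Infer an ANI interval (lower, upper) from the shared LIN prefix."""
    p = 0
    for x, y in zip(a, b):
        if x != y:
            break
        p += 1
    lower = THRESHOLDS[p - 1] if p > 0 else 0.0
    upper = THRESHOLDS[p] if p < len(THRESHOLDS) else 100.0
    return lower, upper


# Three genomes added one after another (ANI values are made up).
g1 = assign_lin(None, None, [])
g2 = assign_lin(96.0, g1, [g1])        # 96 % ANI to g1
g3 = assign_lin(85.0, g2, [g1, g2])    # 85 % ANI to g2
print(g1, g2, g3)
print(infer_ani_range(g1, g3))         # g1 vs g3 was never measured directly
```

Only one precise ANI value is needed per added genome in this sketch, and every other pair is read off the shared LIN prefix as an interval. Because those values are inferred rather than measured, they can deviate from directly computed ANI, which matches the caveat stated in the abstract.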