Browsing by Author "Nerella, Chandra Sekhar"
- Comparative Analysis of Genomic Similarity Tools in Species Identification
  Nerella, Chandra Sekhar (Virginia Tech, 2025-01-14)

  This study presents the development and evaluation of an automated pipeline for genome comparison, leveraging four bioinformatics tools: alignment-based methods (pyANI, FastANI) and k-mer-based methods (Sourmash, BinDash 2.0). The analysis focuses on high-quality genomic datasets characterized by 100% completeness, ensuring consistency and accuracy in the comparison process. The pipeline processes genomes under uniform conditions, recording key performance metrics such as execution time and rank correlations.

  Initial comparisons were conducted on a subset of five genomes, generating 10 unique pairwise comparisons to establish baseline performance. This preliminary analysis identified k = 10 as the optimal k-mer size for Sourmash and BinDash, significantly improving their comparability with alignment-based methods.

  For the expanded dataset of 175 genomes, encompassing C(175, 2) = 15,225 unique comparisons, pyANI and FastANI demonstrated high similarity values, often exceeding 90% for closely related genomes. Rank correlations, calculated using Spearman's ρ and Kendall's τ, highlighted strong agreement between pyANI and FastANI (ρ = 0.9630, τ = 0.8625) due to their shared alignment-based methodology. Similarly, Sourmash and BinDash, both employing k-mer-based approaches, exhibited moderate-to-strong rank correlations (ρ = 0.6967, τ = 0.5290). In contrast, the rank correlations between alignment-based and k-mer-based tools were lower, underscoring methodological differences in genome similarity calculations.

  Execution times revealed significant contrasts between the tools. Alignment-based methods required substantial computation time, with pyANI taking an average of 1.97 seconds per comparison and FastANI averaging 0.81 seconds per comparison. Conversely, k-mer-based methods demonstrated exceptional computational efficiency, with Sourmash completing comparisons in 2.1 milliseconds and BinDash in just 0.25 milliseconds per comparison, reflecting a difference of nearly three orders of magnitude between the two categories. These results underscore the trade-offs between computational cost and methodological approaches in genome similarity estimation.

  This study provides valuable insights into the relative strengths and weaknesses of genome comparison tools, offering a comprehensive framework for selecting appropriate methods for diverse genomic research applications. The findings emphasize the importance of parameter optimization for k-mer-based tools and highlight the scalability of these methods for large-scale genomic analyses.
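The pairwise-comparison counts and the two rank-correlation measures named in the abstract can be illustrated with a small sketch. This is not the thesis pipeline: the function names and the five similarity scores are hypothetical, chosen only to show how Spearman's ρ and Kendall's τ (here the tie-aware τ-b variant, which is an assumption about the variant used) would be computed for two tools scoring the same genome pairs.

```python
from itertools import combinations
from math import comb, sqrt


def average_ranks(values):
    """Return 1-based ascending ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: concordant minus discordant pairs, corrected for ties."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt(
        (untied + ties_y_only) * (untied + ties_x_only)
    )


# Number of unique unordered genome pairs for the two dataset sizes in the abstract.
pairs_subset = comb(5, 2)    # 10
pairs_full = comb(175, 2)    # 15225

# Hypothetical similarity scores from two tools for the same five genome pairs.
tool_a = [99.1, 95.2, 88.7, 80.3, 76.5]
tool_b = [0.98, 0.91, 0.93, 0.60, 0.55]

rho = spearman_rho(tool_a, tool_b)
tau = kendall_tau_b(tool_a, tool_b)
print(pairs_subset, pairs_full, round(rho, 4), round(tau, 4))
```

With these made-up scores, the two tools disagree on the order of only one pair, so both coefficients stay high while remaining below 1. Rank correlations compare orderings rather than raw values, which is why tools reporting on different scales (percent identity versus a 0 to 1 similarity) can still be compared this way.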