COVID-19 Variant Analyzer through Genomic Sequences and Jaccard Similarities
Abstract
The COVID-19 pandemic has underscored the urgent need for efficient genomic surveillance to track the emergence and spread of SARS-CoV-2 variants. This study developed a computational framework that enhances variant detection by combining a database-driven approach with genomic sequence analysis. The framework uses a MySQL database in which each variant is stored in its own table, enabling rapid comparison and classification of new variants through Jaccard similarity calculations. The novelty of this research lies in this database structure and classification method. Unlike traditional clustering approaches, the system creates an individual table for each variant, allowing dynamic updates and efficient comparisons. When a new variant is introduced, the framework calculates Jaccard similarity scores between the new variant and the existing variant tables, and automatically creates a new table for a potentially novel variant whose scores fall below the established similarity thresholds. This approach enables real-time variant tracking and classification that adapts to the evolving nature of the virus. The system uses sourmash for signature generation and NumPy for numerical analysis, alongside Python-MySQL connectors for database interaction. It applies similarity thresholds of 0.817 for primary classification and 0.867 for secondary validation to determine variant group membership. Whole-genome data was analyzed to compare its effectiveness in identifying variants of concern, with the database structure accommodating genomic data. The results demonstrated the framework's ability to detect and classify SARS-CoV-2 variants with high sensitivity and specificity. The study highlighted the potential of whole-genome sequences as a cost-effective alternative for variant detection in resource-limited settings, while also revealing their limitations compared to whole-genome analysis.
This research contributes to global genomic surveillance efforts by providing scalable database tools for rapid variant identification, aiding public health strategies, vaccine development, and therapeutic interventions.
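The threshold-based classification described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: exact k-mer sets stand in for sourmash signatures, an in-memory dictionary stands in for the per-variant MySQL tables, and the names kmers, jaccard, classify and the "variant_N" naming scheme are hypothetical. The abstract does not say how the two thresholds interact, so the sketch assumes that 0.817 decides group membership and 0.867 only marks a match as validated.

```python
from __future__ import annotations

PRIMARY_THRESHOLD = 0.817    # from the abstract: primary classification
SECONDARY_THRESHOLD = 0.867  # from the abstract: secondary validation


def kmers(seq: str, k: int = 31) -> set[str]:
    """Return the set of k-mers in a sequence (stand-in for a sourmash signature)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(new_kmers: set[str], variant_tables: dict[str, list[set[str]]]):
    """Assign a sequence to the best-matching variant group, or open a new one.

    variant_tables maps a variant name to its stored k-mer sets; it is an
    assumed in-memory substitute for one MySQL table per variant.
    Returns (variant_name, best_score, validated).
    """
    best_name, best_score = None, 0.0
    for name, members in variant_tables.items():
        for member in members:
            score = jaccard(new_kmers, member)
            if score > best_score:
                best_name, best_score = name, score

    if best_name is not None and best_score >= PRIMARY_THRESHOLD:
        variant_tables[best_name].append(new_kmers)
        # Assumption: the secondary threshold only flags a stronger match.
        return best_name, best_score, best_score >= SECONDARY_THRESHOLD

    # Below the primary threshold: treat as a potentially novel variant.
    new_name = f"variant_{len(variant_tables) + 1}"
    variant_tables[new_name] = [new_kmers]
    return new_name, best_score, False


if __name__ == "__main__":
    # Toy sequences and a small k, chosen only to keep the example readable.
    tables = {"alpha": [kmers("ACGTACGTAC", k=3)]}
    print(classify(kmers("ACGTACGTAC", k=3), tables))
    print(classify(kmers("TTTTTTTTTT", k=3), tables))
```

On real genomes, a MinHash sketch such as those produced by sourmash would replace the exact k-mer sets, trading a small estimation error for much lower memory use and faster comparison.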